
CI Integration

Calibrax provides two tools for continuous integration: CIGuard for gating pipelines on regression detection, and BisectionEngine for automatically finding the commit that introduced a regression.

CI Guard

CIGuard compares the latest (or specified) run against the stored baseline and returns a GuardResult indicating pass/fail:

from calibrax.storage.store import Store
from calibrax.core.models import MetricDef, MetricDirection, Metric, Point, Run
from calibrax.ci.guard import CIGuard

store = Store("/tmp/calibrax-ci-demo")
run = Run(
    points=(Point(name="fwd", scenario="train",
                  metrics={"throughput": Metric(value=1200.0)}),),
    metric_defs={"throughput": MetricDef(
        name="throughput", unit="samples/sec", direction=MetricDirection.HIGHER)},
)
store.save(run)
store.set_baseline(run.id)

guard = CIGuard(store, threshold=0.05)  # 5% regression threshold

result = guard.check()  # checks latest run against baseline
print(f"Passed: {result.passed}")
print(f"Regressions: {len(result.regressions)}")
print(f"Threshold: {result.threshold}")
print(f"Baseline: {result.baseline_id}")
print(f"Current: {result.current_id}")

if not result.passed:
    for r in result.regressions:
        print(f"  {r.metric}: {r.delta_pct:+.1f}%")

To check a specific run instead of the latest:

result = guard.check(run_id=run.id)

Threshold Guidance

  • 5% (default) — suitable for most metrics, catches meaningful regressions while tolerating measurement noise
  • 2-3% — use for critical metrics like latency in hot paths
  • 10% — use for noisy metrics like energy or GPU utilization
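
The guidance above can be illustrated with a small, library-independent sketch. It assumes a regression is measured as the relative change against the baseline in the "worse" direction; how Calibrax computes this internally is not stated here, and the metric names and the THRESHOLDS table are hypothetical.

```python
# Hypothetical per-metric thresholds following the guidance above.
# These names are illustrative and are not part of the Calibrax API.
THRESHOLDS = {"latency_ms": 0.02, "throughput": 0.05, "energy_j": 0.10}
DEFAULT_THRESHOLD = 0.05


def relative_regression(baseline: float, current: float, higher_is_better: bool) -> float:
    """Return the fractional change in the worse direction (positive means worse)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (current - baseline) / abs(baseline)
    # For higher-is-better metrics a drop is a regression, so flip the sign.
    return -change if higher_is_better else change


def is_regression(metric: str, baseline: float, current: float, higher_is_better: bool) -> bool:
    """Compare the worsening of one metric against its own threshold."""
    threshold = THRESHOLDS.get(metric, DEFAULT_THRESHOLD)
    return relative_regression(baseline, current, higher_is_better) > threshold


# Throughput falling from 1200 to 1150 is about a 4.2% drop: within the 5% threshold.
print(is_regression("throughput", 1200.0, 1150.0, higher_is_better=True))
# Latency rising from 10.0 to 10.3 is a 3% increase: over the tighter 2% threshold.
print(is_regression("latency_ms", 10.0, 10.3, higher_is_better=False))
```

Under these assumptions, the same absolute noise passes for a metric with a loose threshold and fails for one with a tight threshold, which is why noisy metrics get the looser setting.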

CI Failure Signaling

CIGuard does not call sys.exit() directly — it returns a GuardResult so your script can decide how to handle failures. The CLI calibrax check command handles exit codes automatically (see below).
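
A minimal sketch of how a script might turn the returned result into an exit code. It relies only on the passed, regressions, metric, and delta_pct attributes shown earlier; the helper name exit_code_for is hypothetical, and a SimpleNamespace stands in for a real GuardResult so the snippet runs without Calibrax installed.

```python
import sys
from types import SimpleNamespace


def exit_code_for(result) -> int:
    """Map a GuardResult-like object to an exit code: 0 on pass, 1 on regression."""
    if result.passed:
        return 0
    # Report each regression on stderr so it shows up in CI logs.
    for r in result.regressions:
        print(f"regression: {r.metric} {r.delta_pct:+.1f}%", file=sys.stderr)
    return 1


# Stand-in for the object returned by guard.check(); not a real GuardResult.
demo = SimpleNamespace(
    passed=False,
    regressions=[SimpleNamespace(metric="throughput", delta_pct=-8.3)],
)
print(exit_code_for(demo))

# In a real CI script, the last line would be something like:
#     sys.exit(exit_code_for(guard.check()))
```

Keeping the decision in your own script leaves room for softer policies, such as warning on pull requests and failing only on main.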

CLI-Based CI

The simplest CI integration uses the calibrax check command, which exits with code 1 on regression:

# Run the regression check
calibrax check --data ./benchmark-data --threshold 0.05

# Exit code 0 = pass, 1 = regression detected
echo $?

GitHub Actions Example

name: Benchmark Regression Check

on:
  push:
    branches: [main]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv
        uses: astral-sh/setup-uv@v4

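      - name: Install dependencies
        run: |
          uv venv
          uv pip install -e ".[stats]"
          # Put the venv on PATH so later steps find its python and calibrax
          echo "$PWD/.venv/bin" >> "$GITHUB_PATH"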

      - name: Run benchmarks
        run: python run_benchmarks.py --store ./benchmark-data

      - name: Check for regressions
        run: calibrax check --data ./benchmark-data --threshold 0.05

      - name: Update baseline on main
        if: github.ref == 'refs/heads/main' && success()
        run: calibrax baseline --data ./benchmark-data

CodSpeed Benchmarks

Calibrax also includes a focused CodSpeed workflow for PR-level performance checks. The workflow installs calibrax[test,performance], runs tests/performance/, and uploads CodSpeed results only when repository-side authentication is configured. Set a repository secret named CODSPEED_TOKEN, or set the repository variable CODSPEED_OIDC_ENABLED=true after enabling the repository in CodSpeed for OpenID Connect uploads. Without either setting, the workflow still runs the performance smoke tests and skips only the external upload step.

Local smoke check:

source activate.sh
uv run --extra performance pytest tests/performance/ --codspeed --no-cov

The performance extra installs pytest-codspeed; normal unit-test installs do not need it.

flowchart LR
    A[Run Benchmarks] --> B[Save to Store]
    B --> C{calibrax check}
    C -->|Pass| D[Update Baseline]
    C -->|Fail| E[Block PR]

    style A fill:#e3f2fd
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#c8e6c9
    style E fill:#ffcdd2

Bisection Engine

When a regression is detected, BisectionEngine binary-searches the git history to find the commit that introduced it:

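from pathlib import Path
from calibrax.ci.bisection import BisectionEngine, BisectionResult
from calibrax.core.models import Metric, Point, Run
from calibrax.analysis.regression import detect_regressions
from calibrax.storage.store import Store

# Same store as in the CI Guard example; has_regression() reads its baseline
store = Store("/tmp/calibrax-ci-demo")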

def run_benchmark(commit: str) -> Run:
    """Check out the commit and run the benchmark suite."""
    # Your benchmark logic here — returns a Run object
    return Run(
        points=(Point(name="fwd", scenario="train",
                      metrics={"throughput": Metric(value=1200.0)}),),
    )

def has_regression(run: Run) -> bool:
    """Return True if the run shows a regression."""
    baseline = store.get_baseline()
    if baseline is None:
        return False
    regressions = detect_regressions(run, baseline, threshold=0.05)
    return len(regressions) > 0

engine = BisectionEngine(
    repo_path=Path("."),
    benchmark_fn=run_benchmark,
    regression_fn=has_regression,
)

# Call engine.bisect() with known good and bad commit hashes:
# result = engine.bisect(good_commit="abc123", bad_commit="def456")
# print(f"Culprit: {result.culprit_commit}")
# print(f"Steps: {result.total_steps}")
# print(f"Commits tested: {len(result.tested_commits)}")
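
The search itself can be pictured with a small standalone sketch. This is an illustration of binary search over a commit list, not Calibrax's implementation: find_first_bad and its arguments are hypothetical. It assumes the commits are ordered oldest to newest, the first is known good, the last is known bad, and every commit after the culprit is also bad.

```python
from typing import Callable, List, Sequence, Tuple


def find_first_bad(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Return the first bad commit and the list of commits that were tested."""
    if len(commits) < 2:
        raise ValueError("need at least a good and a bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    tested: List[str] = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(commits[mid])
        if is_bad(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return commits[hi], tested


# Eight fake commits; the regression first appears at "c5".
history = [f"c{i}" for i in range(8)]
culprit, tested = find_first_bad(history, lambda c: int(c[1:]) >= 5)
print(culprit, tested)
```

Each step halves the remaining range, so the number of benchmark runs grows with the logarithm of the number of commits between the good and bad endpoints.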

HEAD Restoration

BisectionEngine restores the original HEAD in a finally block after bisection completes, regardless of success or failure.
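
The restore-in-finally pattern described above can be sketched generically. The helper below is hypothetical and takes the repository operations as callables, so it makes no claim about how BisectionEngine talks to git; a small in-memory fake stands in for a repository.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_head_restored(
    get_head: Callable[[], str],
    checkout: Callable[[str], None],
    work: Callable[[], T],
) -> T:
    """Run work(), then check out the original HEAD even if work() raises."""
    original = get_head()
    try:
        return work()
    finally:
        checkout(original)


# In-memory stand-in for a repository, used only for this demonstration.
repo = {"head": "main"}


def failing_bisect() -> None:
    repo["head"] = "abc123"  # simulate checking out a commit mid-bisection
    raise RuntimeError("benchmark crashed")


try:
    with_head_restored(lambda: repo["head"], lambda ref: repo.update(head=ref), failing_bisect)
except RuntimeError:
    pass
print(repo["head"])
```

With a real repository, get_head and checkout could wrap `git rev-parse HEAD` and `git checkout <ref>`. Note that a detached HEAD is restored by commit hash rather than branch name unless the branch is recorded separately.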

Best Practices

  • Run benchmarks on dedicated hardware with minimal background load for consistent results
  • Use separate thresholds for different metric categories — tighter for latency, looser for energy
  • Update baselines only on the main branch after a successful check
  • Store benchmark data in a persistent location (not a temporary CI directory) so trends accumulate across runs
  • Use BisectionEngine sparingly — it checks out commits and runs full benchmarks, which can be time-consuming

Next Steps

  • Production Monitoring: monitor deployed models with alerts and health reports (Monitoring)
  • Storage & Baselines: manage the store and baseline strategy that CI depends on (Storage)