Storage & Baselines

The Store class provides a JSON-per-run file backend for persisting benchmark runs, managing baselines, extracting trends, and ingesting external data.

Directory Layout

A store keeps one JSON file per run in a shallow directory tree:

benchmark-data/
├── config.json          # Metric definitions
├── runs/
│   ├── a1b2c3d4e5f6.json
│   ├── f6e5d4c3b2a1.json
│   └── ...
└── baselines/
    └── main.json        # Symlink or copy of the baseline run

Creating a Store

from pathlib import Path
from calibrax.storage.store import Store
from calibrax.core.models import MetricDef, MetricDirection, Metric, Point, Run

store = Store(Path("/tmp/calibrax-storage-demo"))

The store creates its directory structure automatically on first use.

Saving and Loading Runs

run = Run(
    points=(Point(name="fwd", scenario="train",
                  metrics={"throughput": Metric(value=1200.0)}),),
    metric_defs={"throughput": MetricDef(
        name="throughput", unit="samples/sec", direction=MetricDirection.HIGHER)},
)

# Save a run — stored as {root}/runs/{run.id}.json
path = store.save(run)
print(f"Saved to {path}")

# Load by ID
loaded = store.load(run.id)

# Get the most recent run
latest = store.latest()

# List all runs (newest first)
all_runs = store.list_runs()

# Filter by branch
main_runs = store.list_runs(branch="main")

Querying by Tags

Find runs whose points match specific tag values:

# Find all runs with a "framework=flax" tag
flax_runs = store.query(framework="flax")

# Multiple tags are ANDed
filtered = store.query(framework="flax", scenario="training")
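
To make the AND semantics concrete, here is a minimal standalone sketch that does not use calibrax. It assumes each point carries a plain dict of tags and that a run matches when at least one of its points satisfies every requested tag. The `matches_all` and `query_runs` helpers and the dict layout are hypothetical illustrations, not the library's internals.

```python
def matches_all(tags: dict[str, str], wanted: dict[str, str]) -> bool:
    """Return True only if every wanted key/value pair is present in tags."""
    return all(tags.get(key) == value for key, value in wanted.items())


def query_runs(runs: list[dict], **wanted: str) -> list[dict]:
    """Keep runs in which at least one point satisfies all wanted tags (assumed rule)."""
    return [
        run for run in runs
        if any(matches_all(point["tags"], wanted) for point in run["points"])
    ]


# Hypothetical in-memory runs, each with a single tagged point.
runs = [
    {"id": "a1", "points": [{"tags": {"framework": "flax", "scenario": "training"}}]},
    {"id": "b2", "points": [{"tags": {"framework": "flax", "scenario": "inference"}}]},
    {"id": "c3", "points": [{"tags": {"framework": "torch", "scenario": "training"}}]},
]

flax_ids = [run["id"] for run in query_runs(runs, framework="flax")]
both_ids = [run["id"] for run in query_runs(runs, framework="flax", scenario="training")]
print(flax_ids)  # ['a1', 'b2']
print(both_ids)  # ['a1']
```

Adding a second tag can only narrow the result, which is why combining tags is a convenient way to isolate one configuration.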

Baseline Management

A baseline is the reference run against which regressions are detected. A store supports one active baseline at a time:

# Set a run as the baseline
store.set_baseline(run.id)

# Retrieve the current baseline (None if not set)
baseline = store.get_baseline()

# Update the baseline after a successful CI run
# Note: latest() raises FileNotFoundError if the store is empty
latest_passing_run = store.latest()
store.set_baseline(latest_passing_run.id)

Baseline Strategy

Set the baseline only after a CI check passes. This ensures regressions are always measured against the last known-good state.
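
The strategy above can be sketched without calibrax as a simple promotion gate. This is a hedged illustration only: the `is_regression` and `next_baseline` helpers, the 5% tolerance, and the use of bare floats in place of runs are all assumptions, and positive metric values are assumed.

```python
def is_regression(baseline: float, candidate: float,
                  higher_is_better: bool = True, tolerance: float = 0.05) -> bool:
    """Flag a candidate that is worse than the baseline by more than `tolerance`."""
    if higher_is_better:
        return candidate < baseline * (1.0 - tolerance)
    return candidate > baseline * (1.0 + tolerance)


def next_baseline(baseline: float, candidate: float, **kwargs) -> float:
    """Promote the candidate only when the check passes; otherwise keep the old baseline."""
    return baseline if is_regression(baseline, candidate, **kwargs) else candidate


baseline = 1200.0                           # throughput in samples/sec
baseline = next_baseline(baseline, 1250.0)  # improvement: promoted
baseline = next_baseline(baseline, 1000.0)  # 20% drop: rejected, baseline kept
print(baseline)  # 1250.0
```

Because a failing run never replaces the reference value, a regression cannot silently become the new baseline.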

Trend Extraction

Extract a time series of metric values across runs for a specific point and tags:

trend = store.extract_trend(
    metric="throughput",
    point_name="forward_pass",
    tags={"framework": "flax"},
    n_runs=20,  # limit to last 20 runs
)

print(f"Metric: {trend.metric}, Point: {trend.point_name}")
for tp in trend.points:
    print(f"  {tp.run_id}: {tp.value:.1f} "
          f"[{tp.lower or 0:.1f}, {tp.upper or 0:.1f}]")

The returned TrendSeries contains TrendPoint objects, each with a run_id, timestamp, and value, plus optional lower/upper bounds and commit hash.
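
Because the bounds are optional, the `tp.lower or 0` pattern in the loop above prints 0.0 for a missing bound, which can be mistaken for a real interval. The sketch below shows one way to omit the interval instead. `TrendPointSketch` is a hypothetical stand-in that mirrors only some of the fields named above; it is not the calibrax class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrendPointSketch:
    """Hypothetical stand-in with a subset of the fields described above."""
    run_id: str
    value: float
    lower: Optional[float] = None
    upper: Optional[float] = None


def format_point(tp: TrendPointSketch) -> str:
    """Render the value, adding the interval only when both bounds are present."""
    text = f"{tp.run_id}: {tp.value:.1f}"
    if tp.lower is not None and tp.upper is not None:
        text += f" [{tp.lower:.1f}, {tp.upper:.1f}]"
    return text


points = [
    TrendPointSketch("a1b2c3d4e5f6", 1200.0, 1180.0, 1220.0),
    TrendPointSketch("f6e5d4c3b2a1", 1234.5),  # no bounds recorded
]
lines = [format_point(tp) for tp in points]
print("\n".join(lines))
```

Checking `is not None` rather than truthiness also keeps a legitimate bound of 0.0 visible.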

Ingesting External Data

Import benchmark results from external JSON files:

import json
from pathlib import Path
from calibrax.core.models import Metric, Point, Run

# Create a sample external results file
external_run = Run(
    points=(Point(name="imported", scenario="test",
                  metrics={"latency": Metric(value=0.42)}),),
)
ext_path = Path("temp/doc-examples/external-results.json")
ext_path.parent.mkdir(parents=True, exist_ok=True)
ext_path.write_text(json.dumps(external_run.to_dict()))

# Ingest the file into the store
run = store.ingest(ext_path)
print(f"Ingested as run {run.id}")

The ingest() method reads the file, creates a Run from the data, saves it to the store, and returns the resulting Run object.
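
The read, build, save, return sequence described above can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not the calibrax implementation: `ingest_sketch`, the plain-dict run, and the 12-character hexadecimal ID scheme are all hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


def ingest_sketch(root: Path, source: Path) -> dict:
    """Read an external JSON file, assign an ID if missing, and store it under runs/."""
    data = json.loads(source.read_text())
    data.setdefault("id", uuid.uuid4().hex[:12])  # hypothetical ID scheme
    runs_dir = root / "runs"
    runs_dir.mkdir(parents=True, exist_ok=True)
    (runs_dir / f"{data['id']}.json").write_text(json.dumps(data, indent=2))
    return data


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "benchmark-data"
    source = Path(tmp) / "external-results.json"
    source.write_text(json.dumps(
        {"points": [{"name": "imported", "metrics": {"latency": 0.42}}]}
    ))

    run = ingest_sketch(root, source)
    stored = root / "runs" / f"{run['id']}.json"
    stored_ok = stored.exists()                    # file landed in runs/
    round_trip = json.loads(stored.read_text())    # what a later load would see

print(stored_ok, round_trip["points"][0]["metrics"]["latency"])  # True 0.42
```

Returning the stored object lets the caller use the new ID immediately, for example to set it as the baseline.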

Best Practices

Keep one store per project or benchmark suite
Set baselines explicitly rather than relying on "latest" — this prevents a regression from becoming the new baseline
Use extract_trend() to visualize long-term performance changes before investigating specific regressions
Use query() to compare runs across branches or configurations

Next Steps

Regression Detection: compare new runs against stored baselines (see Regressions).

Exporting Results: export stored runs to W&B or generate publication tables (see Exporters).