calibrax.analysis

Analysis tools for benchmark data: direction-aware regression detection, single- and multi-metric ranking, cross-configuration comparison reports, power-law scaling fits, and Pareto front computation.

Regression Detection

calibrax.analysis.regression

Regression detection for benchmark runs.

Compares a current run against a baseline to flag metrics that degraded beyond a specified threshold.

detect_regressions(run, baseline, threshold=0.05)

Flag metrics that degraded beyond threshold.

Uses MetricDef.direction: 'higher' metrics regress when they decrease, 'lower' metrics regress when they increase. 'info' metrics are skipped.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Current benchmark run. | required |
| baseline | Run | Baseline run to compare against. | required |
| threshold | float | Relative change threshold (e.g. 0.05 = 5%). | 0.05 |

Returns:

| Type | Description |
|---|---|
| list[Regression] | List of detected regressions. |
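
The direction rule described above can be illustrated with a small standalone sketch. It does not use calibrax types: `RegressionSketch`, `detect_regressions_sketch`, and the plain-dict inputs are hypothetical stand-ins, and the strict comparison against the threshold and the skipping of zero baselines are assumptions, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionSketch:
    """Hypothetical stand-in for a regression record (not the calibrax Regression type)."""
    metric: str
    baseline: float
    current: float
    relative_change: float


def detect_regressions_sketch(current, baseline, directions, threshold=0.05):
    """Flag metrics whose relative change moved in the 'worse' direction.

    current / baseline: {metric_name: value}
    directions: {metric_name: 'higher' | 'lower' | 'info'}
    """
    flagged = []
    for name, base in baseline.items():
        direction = directions.get(name, "info")
        # 'info' metrics are skipped; missing or zero baselines are skipped
        # here too (an assumption made to avoid division by zero).
        if direction == "info" or name not in current or base == 0:
            continue
        change = (current[name] - base) / abs(base)
        if direction == "higher":
            degraded = change < -threshold  # higher-is-better metric dropped
        else:
            degraded = change > threshold   # lower-is-better metric rose
        if degraded:
            flagged.append(RegressionSketch(name, base, current[name], change))
    return flagged


flagged = detect_regressions_sketch(
    current={"throughput": 90.0, "latency_ms": 10.2, "n_params": 5.0},
    baseline={"throughput": 100.0, "latency_ms": 10.0, "n_params": 1.0},
    directions={"throughput": "higher", "latency_ms": "lower", "n_params": "info"},
)
print([r.metric for r in flagged])  # ['throughput']
```

Here throughput fell by 10%, which exceeds the 5% threshold, while latency rose by only 2% and `n_params` is informational.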

Ranking

calibrax.analysis.ranking

Ranking and aggregate scoring for benchmark runs.

Ranks entries by metric value and computes weighted aggregate scores across multiple metrics.

rank_table(run, metric, group_by_tag='framework')

Rank entries by metric value, grouped by a tag.

Uses MetricDef.direction for determining best-is-highest vs best-is-lowest.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Benchmark run with points and metric_defs. | required |
| metric | str | Metric name to rank by. | required |
| group_by_tag | str | Tag key used to group points. | 'framework' |

Returns:

| Type | Description |
|---|---|
| list[RankEntry] | Sorted list of RankEntry, rank 1 = best. |
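
As a rough illustration of direction-aware ranking, the sketch below sorts one value per group label and assigns rank 1 to the best. `rank_sketch` and its tuple output are hypothetical; how calibrax reduces several points in a group to one value, and how it breaks ties, is not stated here and is not modeled.

```python
def rank_sketch(values, direction="higher"):
    """Rank {label: value} pairs; rank 1 is the best for the given direction.

    Returns a list of (rank, label, value) tuples.
    """
    # For 'higher' metrics the largest value comes first; for 'lower'
    # metrics the smallest value comes first.
    ordered = sorted(
        values.items(),
        key=lambda item: item[1],
        reverse=(direction == "higher"),
    )
    return [(rank, label, value) for rank, (label, value) in enumerate(ordered, start=1)]


throughput = {"jax": 120.0, "torch": 150.0, "tf": 90.0}
print(rank_sketch(throughput, "higher"))
# [(1, 'torch', 150.0), (2, 'jax', 120.0), (3, 'tf', 90.0)]
```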

aggregate_score(run, weights)

Weighted aggregate score across metrics.

Normalizes each metric to [0, 1] range (best = 1.0, worst = 0.0), then computes a weighted sum. Uses MetricDef.direction for normalization.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Benchmark run with points and metric_defs. | required |
| weights | dict[str, float] | Mapping {metric_name: weight}; weights are normalized to sum to 1.0. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | {framework_label: aggregate_score} where score is in [0, 1]. |
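
The normalize-then-weight procedure can be sketched as follows. This is a minimal illustration on nested dicts, not the calibrax implementation: `aggregate_score_sketch` is a hypothetical name, and scoring every label 1.0 when all values for a metric are equal is an assumption.

```python
def aggregate_score_sketch(values, directions, weights):
    """Weighted sum of min-max normalized metrics.

    values: {label: {metric_name: value}}
    directions: {metric_name: 'higher' | 'lower'}
    weights: {metric_name: weight}, rescaled here to sum to 1.0
    """
    total_weight = sum(weights.values())
    normalized_weights = {m: w / total_weight for m, w in weights.items()}

    scores = {label: 0.0 for label in values}
    for metric, weight in normalized_weights.items():
        column = {label: metrics[metric] for label, metrics in values.items()}
        lo, hi = min(column.values()), max(column.values())
        for label, value in column.items():
            if hi == lo:
                unit = 1.0  # assumption: identical values all count as best
            else:
                unit = (value - lo) / (hi - lo)
                if directions[metric] == "lower":
                    unit = 1.0 - unit  # flip so that best = 1.0, worst = 0.0
            scores[label] += weight * unit
    return scores


scores = aggregate_score_sketch(
    values={
        "a": {"throughput": 100.0, "latency": 10.0},
        "b": {"throughput": 200.0, "latency": 30.0},
    },
    directions={"throughput": "higher", "latency": "lower"},
    weights={"throughput": 3.0, "latency": 1.0},
)
print(scores)  # {'a': 0.25, 'b': 0.75}
```

Configuration `b` wins on throughput (weight 0.75 after normalization) and `a` wins on latency (weight 0.25), so the scores are 0.75 and 0.25.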

Comparison

calibrax.analysis.comparison

Multi-configuration benchmark comparison.

Compares benchmark runs across different configurations (frameworks, hardware, etc.) using MetricDef-aware direction logic and aggregate scoring.

MetricComparison(*, metric_name, values, rankings, best_label, improvement_factors) dataclass

Comparison results for a single metric across configurations.

Attributes:

| Name | Type | Description |
|---|---|---|
| metric_name | str | Name of the compared metric. |
| values | dict[str, float] | Mapping of configuration label to metric value. |
| rankings | tuple[RankEntry, ...] | Ranked entries for this metric. |
| best_label | str | Label of the best-performing configuration. |
| improvement_factors | dict[str, float] | Mapping of configuration label to the factor by which the best configuration outperforms it. |

to_dict()

Serialize to a JSON-compatible dictionary.

ComparisonReport(*, name, labels_compared, metric_comparisons, winner_by_metric, overall_winner) dataclass

Full comparison across multiple metrics and configurations.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | Name of this comparison. |
| labels_compared | tuple[str, ...] | Configuration labels included. |
| metric_comparisons | tuple[MetricComparison, ...] | Per-metric comparison results. |
| winner_by_metric | dict[str, str] | Best label for each metric. |
| overall_winner | str | Best label by aggregate score. |

to_dict()

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod

Deserialize from a dictionary.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with comparison report fields. | required |

Returns:

| Type | Description |
|---|---|
| ComparisonReport | Reconstructed ComparisonReport instance. |

compare_configurations(runs, metrics=None, *, group_by_tag='framework')

Compare benchmark runs across different configurations.

Builds a merged Run from all provided runs, using configuration labels as framework tags, then leverages rank_table and aggregate_score.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| runs | dict[str, Run] | Mapping of configuration label to benchmark Run. | required |
| metrics | Sequence[str] \| None | Subset of metric names to compare. Defaults to all metrics found across all runs. | None |
| group_by_tag | str | Tag key used for grouping. | 'framework' |

Returns:

| Type | Description |
|---|---|
| ComparisonReport | ComparisonReport with per-metric comparisons and overall winner. |

Raises:

| Type | Description |
|---|---|
| ValueError | If fewer than 2 configurations are provided. |

Scaling Laws

calibrax.analysis.scaling

Scaling law fitting via log-linear regression.

Fits power-law relationships (value = a * size^b) using pure Python log-linear regression. No external dependencies required.

scaling_fit(sizes, values)

Fit power-law: value = a * size^b using log-linear regression.

Takes log of both sides: log(value) = log(a) + b * log(size), then fits a linear regression. Pure Python (no scipy/numpy needed).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sizes | list[float] | Input sizes (e.g., batch sizes, dataset sizes). | required |
| values | list[float] | Measured values (e.g., throughput, latency). | required |

Returns:

| Type | Description |
|---|---|
| ScalingLaw | ScalingLaw with coefficient (a), exponent (b), r_squared, and complexity classification string. |

Raises:

| Type | Description |
|---|---|
| ValueError | If inputs are empty or have different lengths. |
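
The log-linear fit described above amounts to ordinary least squares on (log size, log value) pairs. The sketch below shows that computation with the standard library only. `power_law_fit_sketch` is a hypothetical name returning a plain tuple rather than a ScalingLaw, it assumes all inputs are strictly positive, and it omits the complexity classification because the rules for that string are not given here.

```python
import math


def power_law_fit_sketch(sizes, values):
    """Fit value = a * size**b by least squares in log-log space.

    Returns (a, b, r_squared). Inputs are assumed to be strictly positive.
    """
    if not sizes or len(sizes) != len(values):
        raise ValueError("sizes and values must be non-empty and of equal length")

    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("at least two distinct sizes are needed")

    b = sxy / sxx                 # slope = exponent
    log_a = mean_y - b * mean_x   # intercept = log(coefficient)

    # Coefficient of determination, computed in log space.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0

    return math.exp(log_a), b, r_squared


sizes = [1.0, 2.0, 4.0, 8.0]
values = [3.0 * s ** 2 for s in sizes]   # exact quadratic data
a, b, r2 = power_law_fit_sketch(sizes, values)
print(round(a, 6), round(b, 6), round(r2, 6))  # 3.0 2.0 1.0
```

Note that R² computed in log space measures fit quality on the logarithms, which is not the same as R² on the raw values.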

Pareto Front

calibrax.analysis.pareto

Pareto front identification for multi-objective benchmark analysis.

Identifies Pareto-optimal points for two metrics, respecting MetricDef.direction for dominance checks.

pareto_front(points, x_metric, y_metric, *, metric_defs=None)

Identify Pareto-optimal points for two metrics.

A point is Pareto-optimal if no other point is strictly better on both metrics. Uses MetricDef.direction to determine "better".

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| points | list[Point] | List of benchmark points to analyze. | required |
| x_metric | str | First metric name. | required |
| y_metric | str | Second metric name. | required |
| metric_defs | dict[str, MetricDef] \| None | Optional metric definitions for direction. If not provided, defaults to higher-is-better for both metrics. | None |

Returns:

| Type | Description |
|---|---|
| list[Point] | List of Pareto-optimal points (subset of input, same order). |
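
The dominance rule as stated (a point is dropped only when another point is strictly better on both metrics) can be sketched on plain `(x, y)` tuples. `pareto_front_sketch` and its `x_dir` / `y_dir` arguments are hypothetical stand-ins for Point objects and MetricDef.direction.

```python
def pareto_front_sketch(points, x_dir="higher", y_dir="higher"):
    """Return the points not strictly beaten on both metrics, in input order.

    points: list of (x, y) tuples
    x_dir / y_dir: 'higher' or 'lower', meaning which direction is better
    """
    def better(a, b, direction):
        # True if value a is strictly better than value b.
        return a > b if direction == "higher" else a < b

    front = []
    for i, p in enumerate(points):
        dominated = any(
            better(q[0], p[0], x_dir) and better(q[1], p[1], y_dir)
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front


# (accuracy, latency_ms): accuracy is higher-is-better, latency lower-is-better.
candidates = [(0.90, 10.0), (0.95, 20.0), (0.85, 15.0)]
print(pareto_front_sketch(candidates, x_dir="higher", y_dir="lower"))
# [(0.9, 10.0), (0.95, 20.0)]
```

The third candidate is removed because the first is both more accurate and faster. The pairwise scan is quadratic in the number of points, which is usually acceptable for benchmark-sized inputs.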

Change Point Detection

Optional Dependency

Requires ruptures: uv pip install "calibrax[changepoint]"

calibrax.analysis.changepoint

Change point detection for benchmark time series.

Uses the ruptures library to detect significant changes in metric trends, enabling automated identification of performance regressions or improvements over time. Requires the optional ruptures dependency (uv pip install "calibrax[changepoint]").

ChangePoint(*, index, timestamp=None, run_id=None, magnitude=0.0) dataclass

A detected change point in a benchmark trend series.

Attributes:

| Name | Type | Description |
|---|---|---|
| index | int | Index in the trend series where the change was detected. |
| timestamp | datetime \| None | Timestamp of the change point, if available. |
| run_id | str \| None | Run ID at the change point, if available. |
| magnitude | float | Absolute difference in mean values before/after the change. |

to_dict()

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod

Deserialize from a dictionary.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with change point fields. | required |

Returns:

| Type | Description |
|---|---|
| ChangePoint | Reconstructed ChangePoint instance. |

detect_change_points(trend, *, method='pelt', min_size=3, penalty=None)

Detect change points in a benchmark trend series.

Uses the ruptures library for change point detection with configurable algorithms.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trend | TrendSeries | TrendSeries containing the metric values over time. | required |
| method | str | Detection method ("pelt", "binseg", or "window"). | 'pelt' |
| min_size | int | Minimum segment size between change points. | 3 |
| penalty | float \| None | Penalty value for PELT/BinSeg. Auto-calibrated if None. | None |

Returns:

| Type | Description |
|---|---|
| list[ChangePoint] | List of detected ChangePoint instances, ordered by index. |

Raises:

| Type | Description |
|---|---|
| ImportError | If ruptures is not installed. |
| ValueError | If the trend has fewer points than min_size. |
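
To make the `magnitude` attribute concrete, the sketch below computes the absolute difference in segment means on either side of each breakpoint. It does not call ruptures or calibrax: `change_magnitudes_sketch` is a hypothetical helper, the breakpoint indices are supplied by hand, and comparing the full adjacent segments (rather than, say, fixed windows) is an assumption about how "before/after" is defined.

```python
from statistics import fmean


def change_magnitudes_sketch(values, breakpoints):
    """Return (index, magnitude) for each breakpoint.

    values: metric values in time order
    breakpoints: sorted indices at which a new segment starts
    magnitude: |mean(segment after) - mean(segment before)|
    """
    bounds = [0, *breakpoints, len(values)]
    results = []
    # Walk over consecutive (start, breakpoint, end) triples.
    for left, mid, right in zip(bounds, bounds[1:], bounds[2:]):
        before = fmean(values[left:mid])
        after = fmean(values[mid:right])
        results.append((mid, abs(after - before)))
    return results


series = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0, 5.0, 5.0, 5.0]
print(change_magnitudes_sketch(series, breakpoints=[4, 8]))
# [(4, 10.0), (8, 15.0)]
```

Every segment here has at least three points, consistent with the default `min_size=3`.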