calibrax.analysis
Analysis tools for benchmark data: direction-aware regression detection, single- and multi-metric ranking, cross-configuration comparison reports, power-law scaling fits, and Pareto front computation.
Regression Detection
calibrax.analysis.regression
Regression detection for benchmark runs.
Compares a current run against a baseline to flag metrics that degraded beyond a specified threshold.
detect_regressions(run, baseline, threshold=0.05)
Flag metrics that degraded beyond threshold.
Uses MetricDef.direction: 'higher' metrics regress when they decrease, 'lower' metrics regress when they increase. 'info' metrics are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Current benchmark run. | required |
| baseline | Run | Baseline run to compare against. | required |
| threshold | float | Relative change threshold (e.g. 0.05 = 5%). | 0.05 |
Returns:
| Type | Description |
|---|---|
| list[Regression] | List of detected regressions. |
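The direction rule described above can be illustrated with a minimal, self-contained sketch. It does not call calibrax: plain dictionaries stand in for Run and MetricDef, and `SketchRegression` and `sketch_detect_regressions` are hypothetical names. The strict `>` comparison and the handling of missing metrics or a zero baseline are assumptions, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRegression:
    """Hypothetical record for this sketch; not calibrax's Regression class."""

    metric: str
    baseline_value: float
    current_value: float
    relative_change: float


def sketch_detect_regressions(current, baseline, directions, threshold=0.05):
    """Flag metrics whose relative change in the 'worse' direction exceeds threshold."""
    flagged = []
    for metric, direction in directions.items():
        if direction == "info":
            continue  # informational metrics are never flagged
        if metric not in current or metric not in baseline:
            continue  # assumption: metrics missing from either run are ignored
        base, cur = baseline[metric], current[metric]
        if base == 0:
            continue  # assumption: relative change is undefined for a zero baseline
        change = (cur - base) / abs(base)
        # 'higher' metrics regress on a drop, 'lower' metrics regress on a rise.
        degradation = -change if direction == "higher" else change
        if degradation > threshold:
            flagged.append(SketchRegression(metric, base, cur, change))
    return flagged


directions = {"throughput": "higher", "latency_ms": "lower", "num_params": "info"}
baseline = {"throughput": 1000.0, "latency_ms": 20.0, "num_params": 1e6}
current = {"throughput": 900.0, "latency_ms": 20.5, "num_params": 2e6}

regressions = sketch_detect_regressions(current, baseline, directions)
```

Here throughput dropped by 10% and is flagged, latency rose by 2.5% and stays under the 5% threshold, and the 'info' metric is skipped regardless of how much it changed.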
Ranking
calibrax.analysis.ranking
Ranking and aggregate scoring for benchmark runs.
Ranks entries by metric value and computes weighted aggregate scores across multiple metrics.
rank_table(run, metric, group_by_tag='framework')
Rank entries by metric value, grouped by a tag.
Uses MetricDef.direction to decide whether the highest or the lowest value ranks best.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Benchmark run with points and metric_defs. | required |
| metric | str | Metric name to rank by. | required |
| group_by_tag | str | Tag key used to group points (default "framework"). | 'framework' |
Returns:
| Type | Description |
|---|---|
| list[RankEntry] | Sorted list of RankEntry, rank 1 = best. |
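As a rough illustration of direction-aware ranking, the sketch below sorts one value per group and assigns rank 1 to the best. It is not the calibrax implementation: `sketch_rank` is a hypothetical helper, the input is a plain dict of tag value to metric value, and how calibrax reduces several points in the same group to one value is not covered here.

```python
def sketch_rank(values, direction="higher"):
    """Return (label, value, rank) tuples with rank 1 = best."""
    # Best-is-highest sorts descending; best-is-lowest sorts ascending.
    descending = direction == "higher"
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=descending)
    return [(label, value, rank) for rank, (label, value) in enumerate(ordered, start=1)]


# Hypothetical per-framework latencies; lower is better.
latency_ms = {"jax": 12.0, "torch": 15.5, "tf": 14.1}
ranked = sketch_rank(latency_ms, direction="lower")
```

With a 'lower' direction the smallest latency takes rank 1; passing `direction="higher"` reverses the order.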
aggregate_score(run, weights)
Weighted aggregate score across metrics.
Normalizes each metric to [0, 1] range (best = 1.0, worst = 0.0), then computes a weighted sum. Uses MetricDef.direction for normalization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Benchmark run with points and metric_defs. | required |
| weights | dict[str, float] | {metric_name: weight}. Weights are normalized to sum to 1.0. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | {framework_label: aggregate_score} where score is in [0, 1]. |
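The normalize-then-weight scheme can be sketched as follows. This is an illustration under stated assumptions, not calibrax code: `sketch_aggregate` is a hypothetical helper operating on plain dicts, min-max scaling is assumed as the normalization, and treating a metric whose values are all equal as 1.0 for every label is a choice made here.

```python
def sketch_aggregate(values_by_metric, directions, weights):
    """Min-max normalize each metric to [0, 1] (best = 1.0), then take a weighted sum."""
    total = sum(weights.values())
    scores = {}
    for metric, weight in weights.items():
        share = weight / total  # weights are rescaled so they sum to 1.0
        values = values_by_metric[metric]
        lo, hi = min(values.values()), max(values.values())
        for label, value in values.items():
            if hi == lo:
                normalized = 1.0  # assumption: identical values all count as best
            elif directions[metric] == "higher":
                normalized = (value - lo) / (hi - lo)
            else:
                normalized = (hi - value) / (hi - lo)
            scores[label] = scores.get(label, 0.0) + share * normalized
    return scores


values_by_metric = {
    "throughput": {"a": 100.0, "b": 200.0, "c": 150.0},
    "latency_ms": {"a": 10.0, "b": 30.0, "c": 20.0},
}
directions = {"throughput": "higher", "latency_ms": "lower"}

# Raw weights 3:1 become 0.75 and 0.25 after normalization.
scores = sketch_aggregate(values_by_metric, directions, {"throughput": 3.0, "latency_ms": 1.0})
```

In this example "b" has the best throughput but the worst latency, so it scores 0.75; "a" is the mirror image and scores 0.25; "c" sits in the middle on both metrics and scores 0.5.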
Comparison
calibrax.analysis.comparison
Multi-configuration benchmark comparison.
Compares benchmark runs across different configurations (frameworks, hardware, etc.) using MetricDef-aware direction logic and aggregate scoring.
MetricComparison(*, metric_name, values, rankings, best_label, improvement_factors)
dataclass
Comparison results for a single metric across configurations.
Attributes:
| Name | Type | Description |
|---|---|---|
| metric_name | str | Name of the compared metric. |
| values | dict[str, float] | Mapping of configuration label to metric value. |
| rankings | tuple[RankEntry, ...] | Ranked entries for this metric. |
| best_label | str | Label of the best-performing configuration. |
| improvement_factors | dict[str, float] | How much better the best is vs each config. |
to_dict()
Serialize to a JSON-compatible dictionary.
ComparisonReport(*, name, labels_compared, metric_comparisons, winner_by_metric, overall_winner)
dataclass
Full comparison across multiple metrics and configurations.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Name of this comparison. |
| labels_compared | tuple[str, ...] | Configuration labels included. |
| metric_comparisons | tuple[MetricComparison, ...] | Per-metric comparison results. |
| winner_by_metric | dict[str, str] | Best label for each metric. |
| overall_winner | str | Best label by aggregate score. |
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with comparison report fields. | required |
Returns:
| Type | Description |
|---|---|
| ComparisonReport | Reconstructed ComparisonReport instance. |
compare_configurations(runs, metrics=None, *, group_by_tag='framework')
Compare benchmark runs across different configurations.
Builds a merged Run from all provided runs, using configuration labels as framework tags, then applies rank_table and aggregate_score to the merged run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runs | dict[str, Run] | Mapping of configuration label to benchmark Run. | required |
| metrics | Sequence[str] \| None | Subset of metric names to compare. Defaults to all metrics found across all runs. | None |
| group_by_tag | str | Tag key used for grouping (default "framework"). | 'framework' |
Returns:
| Type | Description |
|---|---|
| ComparisonReport | ComparisonReport with per-metric comparisons and overall winner. |
Raises:
| Type | Description |
|---|---|
| ValueError | If fewer than 2 configurations are provided. |
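To make the per-metric winner and the two-configuration minimum concrete, here is a small sketch that works on plain dictionaries rather than Run objects. `sketch_winner_by_metric` is a hypothetical helper, and the input shape (label to {metric: value}) is an assumption chosen for illustration; it only mirrors the idea behind winner_by_metric, not how calibrax computes it.

```python
def sketch_winner_by_metric(values_by_label, directions):
    """Pick the best configuration label for each metric, honoring its direction."""
    if len(values_by_label) < 2:
        raise ValueError("need at least 2 configurations to compare")
    winners = {}
    for metric, direction in directions.items():
        # Only configurations that actually report this metric compete for it.
        candidates = {
            label: metrics[metric]
            for label, metrics in values_by_label.items()
            if metric in metrics
        }
        if not candidates:
            continue
        pick = max if direction == "higher" else min
        winners[metric] = pick(candidates, key=candidates.get)
    return winners


values_by_label = {
    "gpu_a": {"throughput": 500.0, "latency_ms": 8.0},
    "gpu_b": {"throughput": 650.0, "latency_ms": 9.5},
}
directions = {"throughput": "higher", "latency_ms": "lower"}

winners = sketch_winner_by_metric(values_by_label, directions)
```

The two metrics have different winners here, which is the situation where an aggregate score is needed to name an overall winner.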
Scaling Laws
calibrax.analysis.scaling
Scaling law fitting via log-linear regression.
Fits power-law relationships (value = a * size^b) using pure Python log-linear regression. No external dependencies required.
scaling_fit(sizes, values)
Fit power-law: value = a * size^b using log-linear regression.
Taking the logarithm of both sides gives log(value) = log(a) + b * log(size), which is then fit by linear regression. Pure Python (no scipy/numpy needed).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sizes | list[float] | Input sizes (e.g., batch sizes, dataset sizes). | required |
| values | list[float] | Measured values (e.g., throughput, latency). | required |
Returns:
| Type | Description |
|---|---|
| ScalingLaw | ScalingLaw with coefficient (a), exponent (b), r_squared, and complexity classification string. |
Raises:
| Type | Description |
|---|---|
| ValueError | If inputs are empty or have different lengths. |
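The log-linear fit can be written out in a few lines of standard-library Python. The sketch below is an independent illustration, not the calibrax source: `sketch_power_law_fit` is a hypothetical name, it returns a plain tuple instead of a ScalingLaw, it omits the complexity classification, and computing R² in log space is an assumption. Because it takes logarithms, it requires strictly positive sizes and values.

```python
import math


def sketch_power_law_fit(sizes, values):
    """Fit value = a * size**b by least squares on (log size, log value)."""
    if not sizes or len(sizes) != len(values):
        raise ValueError("sizes and values must be non-empty and of equal length")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("need at least two distinct sizes to fit a slope")
    b = sxy / sxx                 # slope in log space is the exponent
    log_a = mean_y - b * mean_x   # intercept in log space is log(a)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r_squared


# Synthetic data that follows value = 3 * size**2 exactly.
sizes = [1.0, 2.0, 4.0, 8.0]
values = [3.0 * s**2 for s in sizes]
a, b, r_squared = sketch_power_law_fit(sizes, values)
```

On this noise-free input the fit recovers a coefficient close to 3 and an exponent close to 2, with R² close to 1; real measurements will give a lower R².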
Pareto Front
calibrax.analysis.pareto
Pareto front identification for multi-objective benchmark analysis.
Identifies Pareto-optimal points for two metrics, respecting MetricDef.direction for dominance checks.
pareto_front(points, x_metric, y_metric, *, metric_defs=None)
Identify Pareto-optimal points for two metrics.
A point is Pareto-optimal if no other point is strictly better on both metrics. Uses MetricDef.direction to determine "better".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| points | list[Point] | List of benchmark points to analyze. | required |
| x_metric | str | First metric name. | required |
| y_metric | str | Second metric name. | required |
| metric_defs | dict[str, MetricDef] \| None | Optional metric definitions for direction. If not provided, defaults to higher-is-better for both metrics. | None |
Returns:
| Type | Description |
|---|---|
| list[Point] | List of Pareto-optimal points (subset of input, same order). |
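The dominance rule stated above (a point is dropped only when some other point is strictly better on both metrics) can be sketched with plain (x, y) tuples. `sketch_pareto_front` and its boolean direction flags are hypothetical stand-ins for Point and MetricDef, used here only to show the check and the order-preserving result.

```python
def sketch_pareto_front(points, x_higher=True, y_higher=True):
    """Keep points that no other point beats strictly on both coordinates."""

    def better(a, b, higher):
        return a > b if higher else a < b

    front = []
    for px, py in points:
        dominated = any(
            better(qx, px, x_higher) and better(qy, py, y_higher)
            for qx, qy in points
        )
        if not dominated:
            front.append((px, py))  # input order is preserved
    return front


# Hypothetical (accuracy, latency_ms) pairs: accuracy higher-is-better, latency lower-is-better.
points = [(0.90, 10.0), (0.92, 15.0), (0.88, 12.0), (0.95, 30.0), (0.91, 15.0)]
front = sketch_pareto_front(points, x_higher=True, y_higher=False)
```

Only (0.88, 12.0) is removed, because (0.90, 10.0) beats it on both metrics. The point (0.91, 15.0) stays: (0.92, 15.0) has higher accuracy but only ties on latency, so it is not strictly better on both.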
Change Point Detection
Optional Dependency
Requires ruptures: uv pip install "calibrax[changepoint]"
calibrax.analysis.changepoint
Change point detection for benchmark time series.
Uses the ruptures library to detect significant changes in metric trends, enabling automated identification of performance regressions or improvements over time. Requires the optional ruptures dependency (uv pip install "calibrax[changepoint]").
ChangePoint(*, index, timestamp=None, run_id=None, magnitude=0.0)
dataclass
A detected change point in a benchmark trend series.
Attributes:
| Name | Type | Description |
|---|---|---|
| index | int | Index in the trend series where the change was detected. |
| timestamp | datetime \| None | Timestamp of the change point, if available. |
| run_id | str \| None | Run ID at the change point, if available. |
| magnitude | float | Absolute difference in mean values before/after the change. |
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with change point fields. | required |
Returns:
| Type | Description |
|---|---|
| ChangePoint | Reconstructed ChangePoint instance. |
detect_change_points(trend, *, method='pelt', min_size=3, penalty=None)
Detect change points in a benchmark trend series.
Uses the ruptures library for change point detection with configurable algorithms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trend | TrendSeries | TrendSeries containing the metric values over time. | required |
| method | str | Detection method ("pelt", "binseg", or "window"). | 'pelt' |
| min_size | int | Minimum segment size between change points. | 3 |
| penalty | float \| None | Penalty value for PELT/BinSeg. Auto-calibrated if None. | None |
Returns:
| Type | Description |
|---|---|
| list[ChangePoint] | List of detected ChangePoint instances, ordered by index. |
Raises:
| Type | Description |
|---|---|
| ImportError | If ruptures is not installed. |
| ValueError | If the trend has fewer points than min_size. |
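The magnitude attribute of ChangePoint is described as the absolute difference in mean values before and after the change. One way to read that definition is sketched below in pure Python, with no ruptures dependency. `sketch_magnitudes` is a hypothetical helper, and comparing the means of the two segments adjacent to each change index (rather than all data before and after it) is an assumption; the change indices are taken as given instead of being detected.

```python
def sketch_magnitudes(values, change_indices):
    """Absolute difference between adjacent segment means at each change index.

    Assumes distinct indices with 0 < index < len(values).
    """
    bounds = [0, *sorted(change_indices), len(values)]
    magnitudes = []
    for k in range(1, len(bounds) - 1):
        before = values[bounds[k - 1]:bounds[k]]
        after = values[bounds[k]:bounds[k + 1]]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        magnitudes.append(abs(mean_after - mean_before))
    return magnitudes


# Hypothetical latency trend: a jump at index 4 and a partial recovery at index 8.
series = [10.0, 10.0, 10.0, 10.0, 14.0, 14.0, 14.0, 14.0, 11.0, 11.0, 11.0]
magnitudes = sketch_magnitudes(series, [4, 8])
```

The segment means here are 10, 14, and 11, so the two change points have magnitudes 4 and 3. Magnitude is unsigned, so deciding whether a change is a regression or an improvement still requires the metric's direction.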