calibrax.core.models¤

Data model classes for benchmark results. All dataclasses are frozen and immutable, with to_dict() / from_dict() support for JSON serialization.

Key types: Run, Point, Metric, MetricDef, MetricDirection, MetricPriority, Regression, RankEntry, SignificanceResult, ScalingLaw, TrendPoint, TrendSeries.

Key functions: extract_framework_metrics, is_higher_better.

Core data models for the calibrax benchmarking framework.

All types are frozen dataclasses with to_dict()/from_dict() for JSON serialization and deserialization. Immutable sequence fields use tuple, and fixed value sets use StrEnum.

In to_dict(), numeric fields are converted to Python primitives, because JAX scalars (jnp.float32, jnp.int32) are not JSON-serializable.
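
To illustrate why that conversion matters, here is a minimal standard-library sketch. FakeScalar is a hypothetical stand-in for a JAX scalar (it is not part of calibrax or JAX): it supports float() but is rejected by json.dumps, which is the situation the note above describes.

```python
import json


class FakeScalar:
    """Hypothetical stand-in for a JAX scalar: float()-convertible, not JSON-serializable."""

    def __init__(self, value):
        self._value = value

    def __float__(self):
        return float(self._value)


raw = {"value": FakeScalar(1.25)}

# Passing the scalar-like object straight to json.dumps raises TypeError.
try:
    json.dumps(raw)
    direct_ok = True
except TypeError:
    direct_ok = False

# Converting to a Python primitive first makes the payload serializable.
clean = {"value": float(raw["value"])}
encoded = json.dumps(clean)
```

The same float()/int() coercion applies to real JAX scalars; how calibrax performs it internally is not shown on this page.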

MetricDirection ¤

Bases: StrEnum

Direction indicating whether higher or lower values are better.

MetricPriority ¤

Bases: StrEnum

Priority level for a metric definition.

MetricDef(*, name, unit, direction, group='', priority=MetricPriority.SECONDARY, description='') dataclass ¤

How to interpret a metric — semantics for direction and grouping.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                                Default
data  dict[str, Any]  Dictionary with metric definition fields.  required

Returns:

Type       Description
MetricDef  Reconstructed MetricDef instance.

Metric(*, value, lower=None, upper=None, samples=None) dataclass ¤

Single metric value with optional confidence interval and samples.

to_dict() ¤

Serialize to a JSON-compatible dictionary, omitting None fields.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                     Default
data  dict[str, Any]  Dictionary with metric fields.  required

Returns:

Type    Description
Metric  Reconstructed Metric instance.

Point(*, name, scenario, tags=dict(), metrics=dict()) dataclass ¤

One benchmark measurement under one configuration.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary, reconstructing nested Metric objects.

Parameters:

Name  Type            Description                    Default
data  dict[str, Any]  Dictionary with point fields.  required

Returns:

Type   Description
Point  Reconstructed Point instance.

Run(*, points, id=(lambda: uuid4().hex[:12])(), timestamp=datetime.now(), commit=None, branch=None, environment=dict(), metadata=dict(), metric_defs=dict()) dataclass ¤

One execution of a benchmark suite.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary, reconstructing nested objects.

Parameters:

Name  Type            Description                  Default
data  dict[str, Any]  Dictionary with run fields.  required

Returns:

Type  Description
Run   Reconstructed Run instance.

Regression(*, metric, point_name, baseline_value, current_value, delta_pct, direction) dataclass ¤

A detected performance regression.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                         Default
data  dict[str, Any]  Dictionary with regression fields.  required

Returns:

Type        Description
Regression  Reconstructed Regression instance.

RankEntry(*, label, value, rank, is_best, delta_from_best) dataclass ¤

One row in a ranking table.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                         Default
data  dict[str, Any]  Dictionary with rank entry fields.  required

Returns:

Type       Description
RankEntry  Reconstructed RankEntry instance.

SignificanceResult(*, p_value, statistic, effect_size, significant, method) dataclass ¤

Result of a statistical significance test.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                                  Default
data  dict[str, Any]  Dictionary with significance result fields.  required

Returns:

Type                Description
SignificanceResult  Reconstructed SignificanceResult instance.

ScalingLaw(*, coefficient, exponent, r_squared, complexity) dataclass ¤

Power-law fit: value = coefficient * size^exponent.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                          Default
data  dict[str, Any]  Dictionary with scaling law fields.  required

Returns:

Type        Description
ScalingLaw  Reconstructed ScalingLaw instance.
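
To make the power-law relation value = coefficient * size^exponent concrete, the sketch below recovers coefficient, exponent, and r_squared with an ordinary least-squares fit in log-log space. fit_power_law is a hypothetical helper, not a calibrax function, and the library may fit differently. The complexity field is left out because this page does not define it.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of log(value) = log(coefficient) + exponent * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    intercept = mean_y - exponent * mean_x
    # r_squared is computed on the log-transformed data.
    ss_res = sum((y - (intercept + exponent * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "coefficient": math.exp(intercept),
        "exponent": exponent,
        "r_squared": 1.0 - ss_res / ss_tot,
    }


# Synthetic data that follows value = 3 * size^2 exactly.
sizes = [1, 2, 4, 8, 16]
values = [3.0 * s ** 2 for s in sizes]
law = fit_power_law(sizes, values)
```

For this synthetic input the fit recovers a coefficient close to 3 and an exponent close to 2, with r_squared close to 1.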

TrendPoint(*, run_id, timestamp, value, commit=None, lower=None, upper=None) dataclass ¤

One data point in a time-series trend.

to_dict() ¤

Serialize to a JSON-compatible dictionary, omitting None fields.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name  Type            Description                          Default
data  dict[str, Any]  Dictionary with trend point fields.  required

Returns:

Type        Description
TrendPoint  Reconstructed TrendPoint instance.

TrendSeries(*, metric, point_name, tags=dict(), points=()) dataclass ¤

Time-series trend for a single metric across multiple runs.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary, reconstructing nested TrendPoints.

Parameters:

Name  Type            Description                           Default
data  dict[str, Any]  Dictionary with trend series fields.  required

Returns:

Type         Description
TrendSeries  Reconstructed TrendSeries instance.

is_higher_better(md) ¤

Whether higher values are better for this metric.

Returns True for "higher" or unknown (None) metrics, False for "lower". "info" metrics return True by convention (no ranking semantics).

Parameters:

Name  Type              Description                                        Default
md    MetricDef | None  Metric definition to check, or None for unknown.  required

Returns:

Type  Description
bool  True if higher values are better or metric is unknown/info.
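
The decision rule above can be sketched as follows. DirectionSketch, MetricDefSketch, and is_higher_better_sketch are hypothetical stand-ins. The enum uses a str/Enum mix-in so that the sketch also runs on Python versions without StrEnum, and the member names are assumptions; only the values "higher", "lower", and "info" come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class DirectionSketch(str, Enum):
    """Hypothetical stand-in for MetricDirection."""

    HIGHER = "higher"
    LOWER = "lower"
    INFO = "info"


@dataclass(frozen=True)
class MetricDefSketch:
    """Hypothetical, reduced stand-in for MetricDef."""

    name: str
    direction: DirectionSketch


def is_higher_better_sketch(md):
    # Unknown metrics (None) default to higher-is-better.
    if md is None:
        return True
    # Only an explicit "lower" flips the answer; "higher" and "info" yield True.
    return md.direction != DirectionSketch.LOWER


latency = MetricDefSketch(name="latency_ms", direction=DirectionSketch.LOWER)
throughput = MetricDefSketch(name="throughput", direction=DirectionSketch.HIGHER)
note = MetricDefSketch(name="batch_size", direction=DirectionSketch.INFO)
```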

extract_framework_metrics(run, metric_names) ¤

Extract per-framework metric values from run points.

Iterates over each point in the run, groups by the framework tag (falling back to the point name), and collects values for the requested metrics. Commonly used by ranking, scoring, and publication modules.

Parameters:

Name          Type           Description                                                           Default
run           Run            Benchmark run with points tagged by framework.                        required
metric_names  Iterable[str]  Metric names to extract (only keys are used if a mapping is passed).  required

Returns:

Type                         Description
dict[str, dict[str, float]]  Mapping of {framework_label: {metric_name: value}}.
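
The grouping described above can be sketched with plain dictionaries in place of Run and Point objects. Several details are assumptions rather than documented behavior: the tag key is taken to be "framework", metrics missing from a point are skipped, and a later point with the same label overwrites earlier values. extract_framework_metrics_sketch is a hypothetical helper, not the library function.

```python
def extract_framework_metrics_sketch(points, metric_names):
    """Group metric values by framework tag, falling back to the point name."""
    wanted = list(metric_names)  # iterating a mapping yields only its keys
    result = {}
    for point in points:
        label = point["tags"].get("framework", point["name"])
        row = result.setdefault(label, {})
        for name in wanted:
            metric = point["metrics"].get(name)
            if metric is not None:
                row[name] = float(metric["value"])
    return result


points = [
    {
        "name": "matmul_jax",
        "tags": {"framework": "jax"},
        "metrics": {"latency_ms": {"value": 1.5}, "throughput": {"value": 200.0}},
    },
    {
        # No framework tag, so the point name becomes the label.
        "name": "matmul_baseline",
        "tags": {},
        "metrics": {"latency_ms": {"value": 2.0}},
    },
]

# A mapping is accepted too; only its keys are used.
table = extract_framework_metrics_sketch(points, {"latency_ms": None, "throughput": None})
```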