calibrax.core.models
Data model classes for benchmark results. All dataclasses are frozen
(immutable), with to_dict() / from_dict() support for JSON serialization.
Key types: Run, Point, Metric, MetricDef, MetricDirection,
MetricPriority, Regression, RankEntry, SignificanceResult, ScalingLaw,
TrendPoint, TrendSeries.
Key functions: extract_framework_metrics, is_higher_better.
Core data models for the calibrax benchmarking framework.
All types are frozen dataclasses with to_dict()/from_dict() for JSON serialization. Immutable sequence fields use tuple, and fixed value sets use StrEnum.
to_dict() converts numeric fields to Python primitives so that JAX scalars (jnp.float32, jnp.int32), which are not JSON-serializable, can be written out.
MetricDirection
Bases: StrEnum
Direction indicating whether higher or lower values are better.
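The following is a minimal sketch of how a StrEnum-based direction might behave. The class name `Direction` and its member names are hypothetical; the values are the "higher", "lower" and "info" strings mentioned in the description of is_higher_better, and the real enum may define them differently.

```python
import json

try:
    from enum import StrEnum  # available from Python 3.11
except ImportError:  # fallback for older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        """Minimal stand-in: members are also plain strings."""


class Direction(StrEnum):
    # Hypothetical members; values taken from the strings named on this page.
    HIGHER = "higher"
    LOWER = "lower"
    INFO = "info"


# Members compare equal to their string values and can be looked up by value.
is_equal_to_string = Direction.HIGHER == "higher"
looked_up = Direction("lower")

# Because members are strings, json.dumps needs no custom encoder.
encoded = json.dumps({"direction": Direction.LOWER})
```

A string-valued enum keeps serialized output readable while still rejecting unknown values at lookup time.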
MetricPriority
Bases: StrEnum
Priority level for a metric definition.
MetricDef(*, name, unit, direction, group='', priority=MetricPriority.SECONDARY, description='')
dataclass
How to interpret a metric — semantics for direction and grouping.
Metric(*, value, lower=None, upper=None, samples=None)
dataclass
Single metric value with optional confidence interval and samples.
to_dict()
Serialize to a JSON-compatible dictionary, omitting None fields.
Point(*, name, scenario, tags=dict(), metrics=dict())
dataclass
One benchmark measurement under one configuration.
to_dict()
Serialize to a JSON-compatible dictionary.
Run(*, points, id=(lambda: uuid4().hex[:12])(), timestamp=datetime.now(), commit=None, branch=None, environment=dict(), metadata=dict(), metric_defs=dict())
dataclass
One execution of a benchmark suite.
to_dict()
Serialize to a JSON-compatible dictionary.
Regression(*, metric, point_name, baseline_value, current_value, delta_pct, direction)
dataclass
A detected performance regression.
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with regression fields. | required |
Returns:
| Type | Description |
|---|---|
| Regression | Reconstructed Regression instance. |
RankEntry(*, label, value, rank, is_best, delta_from_best)
dataclass
One row in a ranking table.
SignificanceResult(*, p_value, statistic, effect_size, significant, method)
dataclass
Result of a statistical significance test.
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with significance result fields. | required |
Returns:
| Type | Description |
|---|---|
| SignificanceResult | Reconstructed SignificanceResult instance. |
ScalingLaw(*, coefficient, exponent, r_squared, complexity)
dataclass
Power-law fit: value = coefficient * size^exponent.
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with scaling law fields. | required |
Returns:
| Type | Description |
|---|---|
| ScalingLaw | Reconstructed ScalingLaw instance. |
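To make the power-law relation value = coefficient * size^exponent concrete, here is a small worked sketch that recovers a coefficient, exponent and R² from measurements by least squares in log-log space. The fitting method and the helper names `fit_power_law` and `predict` are assumptions for illustration; this page does not say how the library performs the fit, and the `complexity` field is not modelled here.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of log(value) = log(coefficient) + exponent * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    intercept = mean_y - exponent * mean_x
    # R^2 is computed on the log-transformed data.
    ss_res = sum((y - (intercept + exponent * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(intercept), exponent, r_squared


def predict(coefficient, exponent, size):
    """Evaluate value = coefficient * size^exponent."""
    return coefficient * size ** exponent


# Synthetic measurements that follow value = 3 * size^2 exactly.
sizes = [1, 2, 4, 8]
values = [3.0 * s ** 2 for s in sizes]
coefficient, exponent, r_squared = fit_power_law(sizes, values)
extrapolated = predict(coefficient, exponent, 16)
```

Fitting in log space turns the power law into a straight line, so the slope is the exponent and the exponentiated intercept is the coefficient.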
TrendPoint(*, run_id, timestamp, value, commit=None, lower=None, upper=None)
dataclass
One data point in a time-series trend.
to_dict()
Serialize to a JSON-compatible dictionary, omitting None fields.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with trend point fields. | required |
Returns:
| Type | Description |
|---|---|
| TrendPoint | Reconstructed TrendPoint instance. |
TrendSeries(*, metric, point_name, tags=dict(), points=())
dataclass
Time-series trend for a single metric across multiple runs.
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary, reconstructing nested TrendPoints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with trend series fields. | required |
Returns:
| Type | Description |
|---|---|
| TrendSeries | Reconstructed TrendSeries instance. |
is_higher_better(md)
Whether higher values are better for this metric.
Returns True for "higher" or unknown (None) metrics, False for "lower". "info" metrics return True by convention (no ranking semantics).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| md | MetricDef \| None | Metric definition to check, or None for unknown. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if higher values are better or metric is unknown/info. |
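The rule described above (True unless the direction is "lower") can be sketched in a few lines. `MetricDefSketch` and `is_higher_better_sketch` are hypothetical stand-ins that use a plain string for the direction; the actual function works with MetricDef and MetricDirection and may be written differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefSketch:
    name: str
    direction: str  # assumed to be one of "higher", "lower", "info"


def is_higher_better_sketch(md):
    # Unknown metrics (None) default to "higher is better".
    if md is None:
        return True
    # "higher" and "info" both return True; only "lower" returns False.
    return md.direction != "lower"


throughput = MetricDefSketch(name="throughput", direction="higher")
latency = MetricDefSketch(name="latency_ms", direction="lower")
note = MetricDefSketch(name="batch_size", direction="info")

results = [
    is_higher_better_sketch(throughput),
    is_higher_better_sketch(latency),
    is_higher_better_sketch(note),
    is_higher_better_sketch(None),
]
```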
extract_framework_metrics(run, metric_names)
Extract per-framework metric values from run points.
Iterates over each point in the run, groups by the framework tag
(falling back to the point name), and collects values for the requested
metrics. Commonly used by ranking, scoring, and publication modules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| run | Run | Benchmark run with points tagged by framework. | required |
| metric_names | Iterable[str] | Metric names to extract (only keys are used if a mapping is passed). | required |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, float]] | Mapping of |
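The grouping step described for extract_framework_metrics can be sketched with plain dictionaries standing in for Point objects. Everything here is an assumption made for illustration: the helper name `extract_sketch`, the "framework" tag key, the framework-to-metric-to-value shape of the result, and the choice to let later points overwrite earlier ones for the same framework.

```python
def extract_sketch(points, metric_names):
    """Group metric values by framework tag, falling back to the point name."""
    names = list(metric_names)  # iterating a mapping yields only its keys
    result = {}
    for point in points:
        # Assumed tag key; points without it are grouped under their own name.
        framework = point["tags"].get("framework", point["name"])
        bucket = result.setdefault(framework, {})
        for name in names:
            if name in point["metrics"]:
                bucket[name] = float(point["metrics"][name])
    return result


# Hypothetical points: one tagged with a framework, one untagged.
points = [
    {
        "name": "matmul-jax",
        "tags": {"framework": "jax"},
        "metrics": {"latency_ms": 1.2, "throughput": 800},
    },
    {
        "name": "matmul-plain",
        "tags": {},
        "metrics": {"latency_ms": 2.5},
    },
]

# Passing a mapping works too: only its keys are used as metric names.
extracted = extract_sketch(points, {"latency_ms": "ignored"})
```

Metrics that were not requested, such as `throughput` above, are left out of the result.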