calibrax.metrics.composition
Composition framework for grouping and combining metrics. MetricCollection
groups multiple metrics for batch computation, WeightedMetric produces
a single weighted score, MetricSuite organizes metrics by domain,
and ThresholdMetric wraps a metric with a pass/fail threshold for CI gates.
Metric composition: collections, weighted combinations, suites, thresholds.
Provides higher-level abstractions for grouping and combining metrics:
- MetricCollection: Group multiple metrics, compute all in one call.
- WeightedMetric: Weighted combination of metric values into a single score.
- MetricSuite: Named groups of metrics with domain awareness.
- ThresholdMetric: Wrap a metric with a pass/fail threshold for CI gates.
MetricCollection(metrics: dict[str, Callable[..., float]])
Group multiple metrics, compute all in one call.
Supports Tier 0 pure functions via callable references.
Usage:

    collection = MetricCollection({
        "mse": mse,
        "mae": mae,
    })
    results = collection.compute_functional(predictions, targets)
    # {"mse": 0.01, "mae": 0.05}
Attributes:
| Name | Type | Description |
|---|---|---|
| metrics | | Dictionary mapping metric names to callables. |
Initialize with a dictionary of named metric functions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | dict[str, Callable[..., float]] | Mapping of metric names to callable functions. | required |
names: list[str] (property)
Return all metric names in the collection.
compute_functional(predictions: Any, targets: Any, **kwargs: Any) -> dict[str, float]
Compute all functional metrics.
Calls each callable metric with (predictions, targets, **kwargs).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted values. | required |
| targets | Any | Ground truth values. | required |
| **kwargs | Any | Additional keyword arguments passed to each function. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dictionary mapping metric names to computed float values. |
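The call pattern described above can be sketched with a plain dictionary of callables. This is a minimal stand-in, not calibrax's implementation: the `mse` and `mae` helpers and the `compute_functional_sketch` name are assumptions made for illustration.

```python
from typing import Any, Callable


def mse(predictions: list[float], targets: list[float]) -> float:
    """Mean squared error over two equal-length sequences (illustrative helper)."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def mae(predictions: list[float], targets: list[float]) -> float:
    """Mean absolute error over two equal-length sequences (illustrative helper)."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def compute_functional_sketch(
    metrics: dict[str, Callable[..., float]],
    predictions: Any,
    targets: Any,
    **kwargs: Any,
) -> dict[str, float]:
    """Call every metric with (predictions, targets, **kwargs) and collect the results."""
    return {name: float(fn(predictions, targets, **kwargs)) for name, fn in metrics.items()}


results = compute_functional_sketch(
    {"mse": mse, "mae": mae},
    [1.0, 2.0, 4.0],
    [1.0, 2.0, 2.0],
)
```

In this sketch the same keyword arguments are forwarded to every function, so each metric in the dictionary has to accept them.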
add(name: str, metric: Callable[..., float]) -> None
Add a metric to the collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name for the metric. | required |
| metric | Callable[..., float] | Callable metric function. | required |
remove(name: str) -> None
Remove a metric by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the metric to remove. | required |
Raises:
| Type | Description |
|---|---|
| KeyError | If metric name not found. |
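The add/remove behaviour can be illustrated with a dictionary-backed stand-in. This is only a sketch: the module-level `metrics` mapping is hypothetical, and whether the real `add` overwrites an existing name is not stated here, so the overwrite in this version is an assumption.

```python
from typing import Callable

# Hypothetical stand-in for the collection's internal name -> callable mapping.
metrics: dict[str, Callable[..., float]] = {}


def add(name: str, metric: Callable[..., float]) -> None:
    """Register a metric under a name (assumption: an existing name is overwritten)."""
    metrics[name] = metric


def remove(name: str) -> None:
    """Drop a metric; a missing name raises KeyError, matching the table above."""
    del metrics[name]


add("max_error", lambda p, t: max(abs(a - b) for a, b in zip(p, t)))
remove("max_error")

try:
    remove("max_error")  # the name is already gone
    removed_twice = True
except KeyError:
    removed_twice = False
```

Callers that cannot be sure a name is present may want to catch `KeyError` as shown, or check the collection's `names` first.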
from_registry(*, domain: str | None = None, tier: MetricTier = MetricTier.PURE_FUNCTION) -> MetricCollection (classmethod)
Create a collection from all registered metrics matching filters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| domain | str \| None | Filter by domain (None = all domains). | None |
| tier | MetricTier | Filter by tier (default: PURE_FUNCTION). | PURE_FUNCTION |
Returns:
| Type | Description |
|---|---|
| MetricCollection | MetricCollection with matching metrics. |
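The filtering rules above can be sketched against a toy registry. Everything except `MetricTier.PURE_FUNCTION` is hypothetical here: the `STATEFUL` tier, the `RegisteredMetric` record, the `REGISTRY` list, and `select_metrics` are illustrative names, and the real registry may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class MetricTier(Enum):
    PURE_FUNCTION = 0  # named in the signature above
    STATEFUL = 1       # hypothetical second tier, for illustration only


@dataclass(frozen=True)
class RegisteredMetric:
    """Hypothetical registry entry; the real registry's layout may differ."""
    name: str
    fn: Callable[..., float]
    domain: str
    tier: MetricTier


REGISTRY = [
    RegisteredMetric("mse", lambda p, t: 0.0, "regression", MetricTier.PURE_FUNCTION),
    RegisteredMetric("accuracy", lambda p, t: 1.0, "classification", MetricTier.PURE_FUNCTION),
    RegisteredMetric("running_mean", lambda p, t: 0.0, "regression", MetricTier.STATEFUL),
]


def select_metrics(
    *, domain: Optional[str] = None, tier: MetricTier = MetricTier.PURE_FUNCTION
) -> dict[str, Callable[..., float]]:
    """Keep entries whose tier matches and whose domain matches (None = any domain)."""
    return {
        m.name: m.fn
        for m in REGISTRY
        if m.tier == tier and (domain is None or m.domain == domain)
    }


all_pure = select_metrics()
regression_only = select_metrics(domain="regression")
```

The resulting dictionary has the same shape as the `metrics` argument of the `MetricCollection` constructor.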
WeightedMetric(weights: dict[str, float])
Weighted combination of metric values into a single score.
Usage:

    weighted = WeightedMetric({"mse": 0.7, "mae": 0.3})
    score = weighted.compute({"mse": 0.01, "mae": 0.05})
    # 0.7 * 0.01 + 0.3 * 0.05 = 0.022
Attributes:
| Name | Type | Description |
|---|---|---|
| weights | dict[str, float] | Dictionary mapping metric names to float weights. |
Initialize with metric weights.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| weights | dict[str, float] | Metric name to weight mapping. Weights need not sum to 1. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If weights dict is empty. |
weights: dict[str, float] (property)
Return the weights dictionary.
normalized_weights: dict[str, float] (property)
Return weights normalized to sum to 1.0.
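Normalizing to a sum of 1.0 amounts to dividing each weight by the total. The sketch below shows that arithmetic only. `normalize` is a hypothetical helper, and how the library handles weights that sum to zero is not covered here.

```python
def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Divide each weight by the sum of all weights so the results add up to 1.0."""
    total = sum(weights.values())
    # Assumption: total is non-zero; a zero total would raise ZeroDivisionError here.
    return {name: w / total for name, w in weights.items()}


normalized = normalize({"mse": 7.0, "mae": 3.0})
```

With weights of 7 and 3, the normalized values come out to roughly 0.7 and 0.3, so the relative importance of each metric is preserved.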
compute(metric_values: dict[str, float]) -> float
Compute weighted sum of metric values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric_values | dict[str, float] | Dictionary of metric name to value. | required |
Returns:
| Type | Description |
|---|---|
| float | Weighted sum as a Python float. |
Raises:
| Type | Description |
|---|---|
| KeyError | If a required metric is missing from metric_values. |
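The weighted sum and its failure mode can be sketched in a few lines. This is an illustration, not the library's code: `weighted_sum` is a hypothetical name, and it assumes the raw (unnormalized) weights are applied and that extra keys in `metric_values` are ignored.

```python
def weighted_sum(weights: dict[str, float], metric_values: dict[str, float]) -> float:
    """Sum weight * value over every weighted metric; a missing value raises KeyError."""
    return float(sum(w * metric_values[name] for name, w in weights.items()))


weights = {"mse": 0.7, "mae": 0.3}

# 0.7 * 0.01 + 0.3 * 0.05, approximately 0.022 (subject to floating-point rounding)
score = weighted_sum(weights, {"mse": 0.01, "mae": 0.05})

missing = None
try:
    weighted_sum(weights, {"mse": 0.01})  # no "mae" value supplied
except KeyError as exc:
    missing = exc.args[0]  # name of the metric that was not provided
```

The dictionary returned by `MetricCollection.compute_functional` has the same shape as `metric_values`, so the two can be chained.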
MetricSuite()
Named groups of metrics with tier/domain awareness.
Organizes metrics into named groups for structured evaluation. Can auto-populate from the MetricRegistry.
Usage:

    suite = MetricSuite()
    suite.add_group("regression", ["mse", "mae", "rmse"])
    suite.add_group("classification", ["accuracy", "f1_score"])
    results = suite.compute_all(predictions, targets)
    # {"regression": {"mse": ..., "mae": ..., "rmse": ...},
    #  "classification": {"accuracy": ..., "f1_score": ...}}
Attributes:
| Name | Type | Description |
|---|---|---|
| groups | | Dictionary mapping group names to metric name lists. |
Initialize an empty metric suite.
add_group(group_name: str, metric_names: list[str]) -> None
Add a named group of metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| group_name | str | Name for the group. | required |
| metric_names | list[str] | List of metric names (must be registered in MetricRegistry). | required |
Raises:
| Type | Description |
|---|---|
| KeyError | If any metric name is not in the registry. |
compute_all(predictions: Any, targets: Any) -> dict[str, dict[str, float]]
Compute all metrics in all groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted values. | required |
| targets | Any | Ground truth values. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, float]] | Nested dict: {group_name: {metric_name: value}}. |
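The nested result shape can be illustrated with a small lookup table standing in for the registry. This is a sketch under assumptions: `REGISTRY`, `groups`, and `compute_all_sketch` are hypothetical names, and the metric functions are simplified list-based versions.

```python
from typing import Any, Callable

# Hypothetical name -> function lookup standing in for MetricRegistry.
REGISTRY: dict[str, Callable[[Any, Any], float]] = {
    "mse": lambda p, t: sum((a - b) ** 2 for a, b in zip(p, t)) / len(t),
    "mae": lambda p, t: sum(abs(a - b) for a, b in zip(p, t)) / len(t),
    "accuracy": lambda p, t: sum(a == b for a, b in zip(p, t)) / len(t),
}

# Group name -> metric names, mirroring what add_group records.
groups = {"regression": ["mse", "mae"], "classification": ["accuracy"]}


def compute_all_sketch(
    groups: dict[str, list[str]], predictions: Any, targets: Any
) -> dict[str, dict[str, float]]:
    """Evaluate every metric of every group on the same predictions and targets."""
    return {
        group: {name: float(REGISTRY[name](predictions, targets)) for name in names}
        for group, names in groups.items()
    }


results = compute_all_sketch(groups, [1, 0, 1, 1], [1, 0, 0, 1])
```

Every group in this sketch sees the same `predictions` and `targets`, which is what the two-argument signature suggests.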
list_groups() -> list[str]
Return all group names.
from_registry_domains() -> MetricSuite (classmethod)
Create a suite grouped by domain from the registry.
Returns:
| Type | Description |
|---|---|
| MetricSuite | MetricSuite with one group per domain containing all Tier 0 metrics in that domain. |
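Grouping by domain can be sketched as a single pass over registry entries. The `ENTRIES` records and the `groups_by_domain` helper are hypothetical, and the integer tier field is a simplification of whatever the registry actually stores.

```python
from collections import defaultdict

# Hypothetical (name, domain, tier) records standing in for registry entries.
ENTRIES = [
    ("mse", "regression", 0),
    ("mae", "regression", 0),
    ("accuracy", "classification", 0),
    ("running_mean", "regression", 1),  # not Tier 0, so it is skipped
]


def groups_by_domain(entries: list[tuple[str, str, int]]) -> dict[str, list[str]]:
    """Collect Tier 0 metric names into one list per domain."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, domain, tier in entries:
        if tier == 0:
            groups[domain].append(name)
    return dict(groups)


groups = groups_by_domain(ENTRIES)
```

Each key of the resulting dictionary corresponds to one group name that `list_groups()` would report on such a suite.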
ThresholdMetric(metric_name: str, *, min_value: float | None = None, max_value: float | None = None)
Wrap a metric with a pass/fail threshold.
Usage:

    threshold = ThresholdMetric("mse", max_value=0.01)
    result = threshold.evaluate(predictions, targets)
    # {"value": 0.005, "passed": True, "threshold": 0.01, "metric_name": "mse"}
Attributes:
| Name | Type | Description |
|---|---|---|
| metric_name | str | Name of the metric to evaluate. |
| min_value | float \| None | Minimum acceptable value (for HIGHER metrics). |
| max_value | float \| None | Maximum acceptable value (for LOWER metrics). |
Initialize threshold metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric_name | str | Registered metric name. | required |
| min_value | float \| None | Minimum acceptable value (metric must be >= this). | None |
| max_value | float \| None | Maximum acceptable value (metric must be <= this). | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If neither min_value nor max_value is provided. |
| KeyError | If metric_name is not in the registry. |
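The bound checks described in the parameter table can be sketched as a standalone function. This is an illustration under assumptions, not calibrax's code: `check_threshold` is a hypothetical name, it takes an already computed value instead of looking the metric up in the registry, and reporting `max_value` in the `threshold` field when both bounds are set is a guess.

```python
from typing import Any, Optional


def check_threshold(
    metric_name: str,
    value: float,
    *,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> dict[str, Any]:
    """Compare a metric value against inclusive lower and/or upper bounds."""
    if min_value is None and max_value is None:
        raise ValueError("provide min_value, max_value, or both")
    passed = (min_value is None or value >= min_value) and (
        max_value is None or value <= max_value
    )
    # Assumption: report max_value when it is set, otherwise min_value.
    threshold = max_value if max_value is not None else min_value
    return {
        "value": value,
        "passed": passed,
        "threshold": threshold,
        "metric_name": metric_name,
    }


ok = check_threshold("mse", 0.005, max_value=0.01)
too_low = check_threshold("accuracy", 0.8, min_value=0.9)
```

In a CI job, a result whose `passed` field is false would typically be turned into a non-zero exit code so the pipeline step fails.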
metric_name: str (property)
Get the metric name.
min_value: float | None (property)
Get the minimum threshold value.
max_value: float | None (property)
Get the maximum threshold value.
evaluate(predictions: Any, targets: Any) -> dict[str, Any]
Compute the metric and check against threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted values. | required |
| targets | Any | Ground truth values. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dict with "value" (float), "passed" (bool), "threshold" (float), "metric_name" (str). |