
calibrax.metrics.functional.calibration

Calibration metrics for evaluating probabilistic predictions. Measures how well predicted probabilities match observed frequencies. Includes Brier score, expected/maximum calibration error, adaptive ECE, classwise ECE, reliability diagram binning, and Brier decomposition.

Calibration metrics for probability calibration assessment.

Pure functions for measuring how well predicted probabilities match observed frequencies. Calibration metrics differ from classification metrics: classification asks "which class?" while calibration asks "how reliable is the stated confidence?"

Includes 7 functions: brier_score, expected_calibration_error, maximum_calibration_error, reliability_diagram_bins, brier_decomposition, adaptive_calibration_error, classwise_ece.

brier_score(predictions: Any, targets: Any) -> Any

Brier score: mean squared error between probabilities and outcomes.

Note

Direction: LOWER (0.0 = every predicted probability equals its outcome). Range: [0, 1]. Strictly proper scoring rule: its expected value is minimized only by the true predictive distribution.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |

Returns:

| Type | Description |
| --- | --- |
| Any | Brier score as a scalar value. |
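As a rough illustration of the definition above, the following pure-Python sketch computes the mean squared difference between predicted probabilities and binary outcomes. It is not calibrax's implementation: the name brier_score_sketch and the use of plain Python lists are assumptions made for this example.

```python
def brier_score_sketch(predictions, targets):
    """Mean squared error between probabilities and 0/1 outcomes.

    Hypothetical reference helper, not part of calibrax.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(predictions)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n


preds = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]

# Squared errors are 0.01, 0.04, 0.09, 0.01, so the mean is about 0.0375.
score = brier_score_sketch(preds, labels)
print(score)
```

Confident and correct predictions drive the score toward 0.0, while a single confident mistake (for example, predicting 0.0 for a positive outcome) contributes the maximum per-sample penalty of 1.0.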

expected_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any

Expected calibration error (ECE).

Weighted average of |accuracy - confidence| across equal-width bins.

Note

Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule. Sensitive to bin count — prefer adaptive_calibration_error for robustness.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |

Returns:

| Type | Description |
| --- | --- |
| Any | ECE as a scalar value. |
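The binning procedure can be sketched in plain Python as follows. This is an illustrative reading of the description, not calibrax's code. It assumes that "accuracy" in a bin is the observed fraction of positives, that "confidence" is the mean predicted probability, that each bin is weighted by its share of the samples, and that bins are left-closed with 1.0 falling into the last bin. The name ece_sketch is hypothetical.

```python
def ece_sketch(predictions, targets, num_bins=10):
    """Equal-width ECE for binary targets (illustrative, not calibrax's code)."""
    n = len(predictions)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(predictions, targets):
        # Assumed edge convention: [k/num_bins, (k+1)/num_bins), 1.0 in last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        total += (len(members) / n) * abs(accuracy - confidence)
    return total


preds = [0.2, 0.4, 0.6, 0.8]
labels = [0, 1, 1, 1]

ece_two_bins = ece_sketch(preds, labels, num_bins=2)   # about 0.25
ece_four_bins = ece_sketch(preds, labels, num_bins=4)  # about 0.35
print(ece_two_bins, ece_four_bins)
```

The same four samples give roughly 0.25 with two bins and 0.35 with four, which is the bin-count sensitivity mentioned in the note.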

maximum_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any

Maximum calibration error (MCE).

Maximum |accuracy - confidence| across all non-empty bins.

Note

Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule. Reports worst-case bin calibration.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |

Returns:

| Type | Description |
| --- | --- |
| Any | MCE as a scalar value. |
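A minimal sketch of the worst-case variant follows, under the same assumptions as a typical equal-width binning scheme: left-closed bins, bin accuracy taken as the observed fraction of positives, and bin confidence taken as the mean prediction. The name mce_sketch is hypothetical, and the code is not taken from calibrax.

```python
def mce_sketch(predictions, targets, num_bins=10):
    """Largest |accuracy - confidence| over non-empty equal-width bins."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(predictions, targets):
        # Assumed edge convention: p == 1.0 is placed in the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    gaps = []
    for members in bins:
        if not members:
            continue  # only non-empty bins are considered
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        gaps.append(abs(accuracy - confidence))
    # Assumption: return 0.0 when there is no data at all.
    return max(gaps, default=0.0)


preds = [0.2, 0.4, 0.6, 0.8]
labels = [0, 1, 1, 1]

# Bin gaps are about 0.2 (low bin) and 0.3 (high bin); MCE reports the larger.
mce = mce_sketch(preds, labels, num_bins=2)
print(mce)
```

Because the maximum ignores bin sizes, a bin holding a single badly calibrated sample can determine the result on its own.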

reliability_diagram_bins(predictions: Any, targets: Any, *, num_bins: int = 10) -> dict[str, Any]

Compute binned statistics for reliability diagram plotting.

Note

Not a scalar metric — returns a dict for visualization. NOT registered in MetricRegistry.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Dictionary with keys: bin_edges, bin_accuracies, bin_confidences, bin_counts. |
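To make the shape of the returned dictionary concrete, here is a small pure-Python sketch that produces the four documented keys. How calibrax fills empty bins and which array type it returns are not stated in this page, so the 0.0 placeholder for empty bins, the plain lists, and the name reliability_bins_sketch are all assumptions.

```python
def reliability_bins_sketch(predictions, targets, num_bins=10):
    """Per-bin statistics for a reliability diagram (illustrative only)."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(predictions, targets):
        # Assumed edge convention: left-closed bins, 1.0 in the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    accuracies, confidences, counts = [], [], []
    for members in bins:
        counts.append(len(members))
        if members:
            accuracies.append(sum(y for _, y in members) / len(members))
            confidences.append(sum(p for p, _ in members) / len(members))
        else:
            # Assumption: empty bins are reported as 0.0.
            accuracies.append(0.0)
            confidences.append(0.0)

    return {
        "bin_edges": [i / num_bins for i in range(num_bins + 1)],
        "bin_accuracies": accuracies,
        "bin_confidences": confidences,
        "bin_counts": counts,
    }


stats = reliability_bins_sketch([0.2, 0.4, 0.6, 0.8], [0, 1, 1, 1], num_bins=2)
for key, value in stats.items():
    print(key, value)
```

Plotting bin_confidences against bin_accuracies and comparing with the diagonal gives the usual reliability diagram; bin_counts indicates how much data supports each point.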

brier_decomposition(predictions: Any, targets: Any, *, num_bins: int = 10) -> dict[str, Any]

Decompose Brier score into calibration, resolution, uncertainty.

Property: brier_score = calibration - resolution + uncertainty.

Note

Not a scalar metric — returns a dict with decomposition components. NOT registered in MetricRegistry.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Dictionary with keys: calibration, resolution, uncertainty, brier. |
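The stated property can be checked numerically with a binned, Murphy-style decomposition. The sketch below is one possible reading rather than calibrax's implementation. It assumes equal-width, left-closed bins, and it assumes the brier entry holds the directly computed mean squared error. The name brier_decomposition_sketch is hypothetical. With binning, the identity is exact only when all predictions inside a bin are identical; otherwise it holds approximately.

```python
def brier_decomposition_sketch(predictions, targets, num_bins=10):
    """Binned calibration/resolution/uncertainty terms (illustrative only)."""
    n = len(predictions)
    base_rate = sum(targets) / n
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(predictions, targets):
        bins[min(int(p * num_bins), num_bins - 1)].append((p, y))

    calibration = 0.0
    resolution = 0.0
    for members in bins:
        if not members:
            continue
        weight = len(members) / n
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        calibration += weight * (confidence - frequency) ** 2
        resolution += weight * (frequency - base_rate) ** 2

    return {
        "calibration": calibration,
        "resolution": resolution,
        "uncertainty": base_rate * (1.0 - base_rate),
        # Assumption: "brier" is the plain mean squared error.
        "brier": sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n,
    }


# Predictions are constant within each bin, so the identity is exact
# up to floating-point rounding.
parts = brier_decomposition_sketch([0.2, 0.2, 0.8, 0.8], [0, 1, 1, 1])
recombined = parts["calibration"] - parts["resolution"] + parts["uncertainty"]
print(parts, recombined)
```

In this example the terms are roughly 0.065, 0.0625, and 0.1875, which recombine to the Brier score of about 0.19.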

adaptive_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any

Adaptive calibration error (ACE) with equal-mass binning.

Uses equal-mass bins (equal number of samples per bin) instead of ECE's equal-width bins. More robust to imbalanced confidence distributions.

Note

Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule, but more robust than ECE.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-mass bins. | 10 |

Returns:

| Type | Description |
| --- | --- |
| Any | ACE as a scalar value. |
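Equal-mass binning can be sketched by sorting the samples by predicted probability and cutting the sorted list into chunks of nearly equal size. The splitting rule below (integer boundaries at k * n // num_bins), the weighting of each chunk by its sample share, and the name ace_sketch are assumptions for illustration; calibrax may handle ties and remainders differently.

```python
def ace_sketch(predictions, targets, num_bins=10):
    """Calibration error with equal-mass bins (illustrative, not calibrax's code)."""
    n = len(predictions)
    # Sort by predicted probability so each chunk covers a contiguous range.
    pairs = sorted(zip(predictions, targets), key=lambda pair: pair[0])

    total = 0.0
    for k in range(num_bins):
        start = k * n // num_bins
        end = (k + 1) * n // num_bins
        members = pairs[start:end]
        if not members:
            continue  # can happen when there are fewer samples than bins
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        total += (len(members) / n) * abs(accuracy - confidence)
    return total


# All predictions sit in a narrow high-confidence range, which equal-width
# bins would lump into a single bin.
preds = [0.91, 0.93, 0.95, 0.97]
labels = [1, 1, 0, 1]

ace = ace_sketch(preds, labels, num_bins=2)  # about 0.27
print(ace)
```

Because the bin boundaries follow the data, every bin contributes a comparable number of samples even when the predictions cluster near 0 or 1.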

classwise_ece(predictions: Any, targets: Any, *, num_bins: int = 10, num_classes: int | None = None) -> Any

Classwise expected calibration error for multiclass problems.

Computes a one-vs-rest ECE for each class, then averages the per-class values. More informative than a single ECE for multiclass calibration.

Note

Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| predictions | Any | Predicted probability matrix of shape (n_samples, n_classes). | required |
| targets | Any | Ground truth class indices. | required |
| num_bins | int | Number of bins for per-class ECE. | 10 |
| num_classes | int \| None | Number of classes. Inferred from predictions if None. | None |

Returns:

| Type | Description |
| --- | --- |
| Any | Mean classwise ECE as a scalar value. |
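The one-vs-rest averaging described above can be sketched as follows. For each class, the class column is treated as a binary probability and the targets are converted to 0/1 indicators before a binary ECE is computed. The helper names, the equal-width binning convention, and the use of nested lists are assumptions for the example, not calibrax's actual code.

```python
def binary_ece_sketch(probabilities, indicators, num_bins=10):
    """Equal-width ECE for one class treated as a binary problem."""
    n = len(probabilities)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probabilities, indicators):
        bins[min(int(p * num_bins), num_bins - 1)].append((p, y))
    total = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        total += (len(members) / n) * abs(accuracy - confidence)
    return total


def classwise_ece_sketch(predictions, targets, num_bins=10, num_classes=None):
    """Mean of per-class one-vs-rest ECE values (illustrative only)."""
    if num_classes is None:
        # Infer the class count from the width of the probability matrix.
        num_classes = len(predictions[0])
    per_class = []
    for c in range(num_classes):
        column = [row[c] for row in predictions]
        indicators = [1 if t == c else 0 for t in targets]
        per_class.append(binary_ece_sketch(column, indicators, num_bins))
    return sum(per_class) / num_classes


probs = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]
classes = [0, 0, 1, 1]

# Each class has a per-class ECE of about 0.3, so the mean is about 0.3.
result = classwise_ece_sketch(probs, classes, num_bins=2)
print(result)
```

Averaging per-class values exposes miscalibration in classes that a single aggregate ECE could hide, at the cost of noisier estimates for rare classes.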