calibrax.metrics.functional.calibration
Pure functions for evaluating probabilistic predictions by measuring how well predicted probabilities match observed frequencies. Calibration metrics differ from classification metrics: classification asks "which class?" while calibration asks "how reliable is the stated confidence?"
The module provides seven functions: brier_score, expected_calibration_error, maximum_calibration_error, reliability_diagram_bins, brier_decomposition, adaptive_calibration_error, and classwise_ece.
brier_score(predictions: Any, targets: Any) -> Any
Brier score: mean squared error between probabilities and outcomes.
Note
Direction: LOWER (0.0 = perfect predictions, i.e. every probability equals its outcome). Range: [0, 1]. Strictly proper scoring rule: the expected score is minimized only by the true predictive distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
Returns:
| Type | Description |
|---|---|
| Any | Brier score as a scalar value. |
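
The definition above (mean squared error between probabilities and binary outcomes) can be sketched in plain NumPy. This is an illustrative reference, not the calibrax implementation; the name `brier_score_sketch` is hypothetical.

```python
import numpy as np


def brier_score_sketch(predictions, targets):
    """Mean squared difference between probabilities and binary outcomes."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    return float(np.mean((p - y) ** 2))


# Confident and correct on every sample: the score is 0.
perfect = brier_score_sketch([1.0, 0.0, 1.0], [1, 0, 1])
# Always predicting 0.5 on balanced data: every squared error is 0.25.
uninformative = brier_score_sketch([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The second case shows that a forecast can match the base rate and still score 0.25, because the Brier score also penalizes a lack of sharpness.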
expected_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any
Expected calibration error (ECE).
Average of |accuracy - confidence| over equal-width bins, weighted by the fraction of samples in each bin.
Note
Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule. Sensitive to bin count — prefer adaptive_calibration_error for robustness.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |
Returns:
| Type | Description |
|---|---|
| Any | ECE as a scalar value. |
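
A minimal NumPy sketch of equal-width ECE for the binary case. It assumes that "accuracy" in a bin is the mean of the targets, "confidence" is the mean of the predictions, and a prediction of exactly 1.0 falls into the last bin; the library's own binning conventions may differ. The name `ece_sketch` is hypothetical.

```python
import numpy as np


def ece_sketch(predictions, targets, num_bins=10):
    """Sample-weighted mean of |accuracy - confidence| over equal-width bins."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    # Bin index on [0, 1]; p == 1.0 is clipped into the last bin (assumption).
    idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if mask.any():  # empty bins contribute nothing
            gap = abs(y[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample fraction
    return float(ece)


# 75% of the 0.75 predictions and 25% of the 0.25 predictions are positive.
calibrated = ece_sketch([0.75] * 4 + [0.25] * 4, [1, 1, 1, 0, 0, 0, 0, 1])
# Predicting 0.95 when only half of the outcomes are positive.
overconfident = ece_sketch([0.95] * 4, [1, 1, 0, 0])
```

Here `calibrated` is 0.0 and `overconfident` is 0.45, the gap between the stated confidence of 0.95 and the observed frequency of 0.5.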
maximum_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any
Maximum calibration error (MCE).
Maximum |accuracy - confidence| across all non-empty bins.
Note
Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule. Reports worst-case bin calibration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |
Returns:
| Type | Description |
|---|---|
| Any | MCE as a scalar value. |
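
The worst-case variant can be sketched the same way: compute the per-bin gap and take the maximum over non-empty bins instead of a weighted average. This is an illustrative NumPy sketch under the same binning assumptions (equal-width bins on [0, 1], 1.0 clipped into the last bin), not the library's code; `mce_sketch` is a hypothetical name.

```python
import numpy as np


def mce_sketch(predictions, targets, num_bins=10):
    """Largest |accuracy - confidence| over the non-empty equal-width bins."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    # np.unique(idx) lists only bins that contain at least one sample.
    gaps = [abs(y[idx == b].mean() - p[idx == b].mean()) for b in np.unique(idx)]
    return float(max(gaps))


# One well-calibrated bin (0.25) and one overconfident bin (0.95).
worst = mce_sketch([0.95] * 4 + [0.25] * 4, [1, 1, 0, 0, 0, 0, 0, 1])
```

The result is 0.45: the well-calibrated bin does not dilute the error of the overconfident one, which is what distinguishes MCE from ECE.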
reliability_diagram_bins(predictions: Any, targets: Any, *, num_bins: int = 10) -> dict[str, Any]
Compute binned statistics for reliability diagram plotting.
Note
Not a scalar metric — returns a dict for visualization. NOT registered in MetricRegistry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with keys: bin_edges, bin_accuracies, bin_confidences, bin_counts. |
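
To illustrate the shape of the returned dictionary, the following NumPy sketch builds the four documented keys. How the library fills empty bins is not stated above; this sketch leaves them at 0.0, which is an assumption, and `reliability_bins_sketch` is a hypothetical name.

```python
import numpy as np


def reliability_bins_sketch(predictions, targets, num_bins=10):
    """Per-bin accuracy, confidence, and count for a reliability diagram."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    counts = np.bincount(idx, minlength=num_bins)
    # Divide per-bin sums by counts; empty bins stay at 0.0 (assumption).
    safe = np.maximum(counts, 1)
    accuracies = np.bincount(idx, weights=y, minlength=num_bins) / safe
    confidences = np.bincount(idx, weights=p, minlength=num_bins) / safe
    return {
        "bin_edges": edges,
        "bin_accuracies": accuracies,
        "bin_confidences": confidences,
        "bin_counts": counts,
    }


bins = reliability_bins_sketch(
    [0.1, 0.2, 0.6, 0.9, 0.95], [0, 0, 1, 1, 1], num_bins=4
)
```

Plotting `bin_confidences` against `bin_accuracies` for the non-empty bins, alongside the diagonal, gives the usual reliability diagram.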
brier_decomposition(predictions: Any, targets: Any, *, num_bins: int = 10) -> dict[str, Any]
Decompose Brier score into calibration, resolution, uncertainty.
Property: brier_score = calibration - resolution + uncertainty.
Note
Not a scalar metric — returns a dict with decomposition components. NOT registered in MetricRegistry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-width bins. | 10 |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with keys: calibration, resolution, uncertainty, brier. |
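
The stated property can be checked numerically with a sketch of the standard binned (Murphy) decomposition. The term definitions below are the textbook ones and are assumed rather than taken from the calibrax source; `brier_decomposition_sketch` is a hypothetical name. In this binned form the identity is exact when all predictions inside a bin are equal, and approximate otherwise.

```python
import numpy as np


def brier_decomposition_sketch(predictions, targets, num_bins=10):
    """Binned calibration, resolution, and uncertainty terms of the Brier score."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    base_rate = y.mean()
    calibration = 0.0
    resolution = 0.0
    for b in np.unique(idx):  # non-empty bins only
        mask = idx == b
        weight = mask.mean()   # fraction of samples in the bin
        acc = y[mask].mean()   # observed frequency in the bin
        conf = p[mask].mean()  # mean predicted probability in the bin
        calibration += weight * (conf - acc) ** 2
        resolution += weight * (acc - base_rate) ** 2
    return {
        "calibration": float(calibration),
        "resolution": float(resolution),
        "uncertainty": float(base_rate * (1.0 - base_rate)),
        "brier": float(np.mean((p - y) ** 2)),  # computed directly here
    }


parts = brier_decomposition_sketch(
    [0.75] * 4 + [0.25] * 4, [1, 1, 1, 0, 0, 0, 0, 1]
)
recombined = parts["calibration"] - parts["resolution"] + parts["uncertainty"]
```

For this input the calibration term is 0, the resolution is 0.0625, the uncertainty is 0.25, and `recombined` equals the directly computed Brier score of 0.1875.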
adaptive_calibration_error(predictions: Any, targets: Any, *, num_bins: int = 10) -> Any
Adaptive calibration error (ACE) with equal-mass binning.
Uses equal-mass bins (equal number of samples per bin) instead of ECE's equal-width bins. More robust to imbalanced confidence distributions.
Note
Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule, but more robust than ECE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probabilities in [0, 1]. | required |
| targets | Any | Binary ground truth (0 or 1). | required |
| num_bins | int | Number of equal-mass bins. | 10 |
Returns:
| Type | Description |
|---|---|
| Any | ACE as a scalar value. |
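
Equal-mass binning can be sketched by sorting the predictions and splitting them into groups of (nearly) equal size. This is one plausible reading of the description above, not the library's implementation; the use of `np.array_split` and the name `ace_sketch` are assumptions.

```python
import numpy as np


def ace_sketch(predictions, targets, num_bins=10):
    """Sample-weighted mean of |accuracy - confidence| over equal-mass bins."""
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(targets, dtype=float)
    order = np.argsort(p)  # sample indices sorted by predicted probability
    ace = 0.0
    # array_split yields num_bins index groups whose sizes differ by at most 1.
    for chunk in np.array_split(order, num_bins):
        if chunk.size:
            gap = abs(y[chunk].mean() - p[chunk].mean())
            ace += chunk.size / p.size * gap
    return float(ace)


# All predictions sit in [0.9, 1.0], so equal-width binning with 10 bins
# would place them in a single bin; equal-mass binning still separates them.
ace = ace_sketch(
    [0.91, 0.92, 0.93, 0.94, 0.96, 0.97, 0.98, 0.99],
    [1, 1, 1, 0, 1, 1, 1, 1],
    num_bins=2,
)
```

With two bins of four samples each, the gaps are 0.175 and 0.025, giving an ACE of 0.1.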
classwise_ece(predictions: Any, targets: Any, *, num_bins: int = 10, num_classes: int | None = None) -> Any
Classwise expected calibration error for multiclass problems.
Computes one-vs-rest ECE for each class, then averages. More informative than single ECE for multiclass calibration.
Note
Direction: LOWER (0.0 = perfectly calibrated). Range: [0, 1]. NOT a proper scoring rule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| predictions | Any | Predicted probability matrix of shape (n_samples, n_classes). | required |
| targets | Any | Ground truth class indices. | required |
| num_bins | int | Number of bins for per-class ECE. | 10 |
| num_classes | int \| None | Number of classes. Inferred from predictions if None. | None |
Returns:
| Type | Description |
|---|---|
| Any | Mean classwise ECE as a scalar value. |
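
The one-vs-rest averaging can be sketched as follows: for each class, treat its probability column as a binary prediction and the indicator `targets == k` as the binary target, compute an equal-width ECE, then take the unweighted mean over classes. This is an illustrative NumPy sketch under those assumptions; the helper names are hypothetical and the library may bin or average differently.

```python
import numpy as np


def _binary_ece(p, y, num_bins):
    """Equal-width ECE for one probability column and its 0/1 indicator."""
    idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    total = 0.0
    for b in np.unique(idx):  # non-empty bins only
        mask = idx == b
        total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total


def classwise_ece_sketch(predictions, targets, num_bins=10, num_classes=None):
    """Unweighted mean of one-vs-rest ECE over all classes."""
    probs = np.asarray(predictions, dtype=float)
    labels = np.asarray(targets)
    if num_classes is None:
        num_classes = probs.shape[1]  # inferred from the probability matrix
    per_class = [
        _binary_ece(probs[:, k], (labels == k).astype(float), num_bins)
        for k in range(num_classes)
    ]
    return float(np.mean(per_class))


probs = [[0.5, 0.25, 0.25]] * 4
# Class frequencies 0.5 / 0.25 / 0.25 match the predicted probabilities.
matched = classwise_ece_sketch(probs, [0, 0, 1, 2])
# Every sample is class 0, so all three columns are miscalibrated.
mismatched = classwise_ece_sketch(probs, [0, 0, 0, 0])
```

`matched` is 0.0, while `mismatched` averages the per-class gaps 0.5, 0.25, and 0.25 to one third. A top-label ECE would only look at the first column's gap, which is why the classwise variant can surface errors in the non-predicted classes.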