calibrax.statistics¤

Statistical analysis tools for benchmark measurements. Provides descriptive statistics with bootstrap confidence intervals, MAD-based outlier detection, significance tests (Welch's t, Mann-Whitney U, Wilcoxon), and Cohen's d effect size.

Analyzer¤

calibrax.statistics.analyzer ¤

Statistical analysis for benchmark measurements.

Provides summary statistics with bootstrap confidence intervals, outlier detection via modified Z-scores, and stability assessment.

StatisticalResult(*, mean, median, std, min, max, cv, ci_lower, ci_upper, n, is_stable) dataclass ¤

Summary statistics with confidence intervals.

Attributes:

Name | Type | Description
mean | float | Arithmetic mean.
median | float | Median value.
std | float | Sample standard deviation (ddof=1).
min | float | Minimum value.
max | float | Maximum value.
cv | float | Coefficient of variation (std / mean).
ci_lower | float | 95% bootstrap CI lower bound.
ci_upper | float | 95% bootstrap CI upper bound.
n | int | Number of samples.
is_stable | bool | True when CV < STABILITY_CV_THRESHOLD.

to_dict() ¤

Serialize to a JSON-compatible dictionary.

from_dict(data) classmethod ¤

Deserialize from a dictionary.

Parameters:

Name | Type | Description | Default
data | dict[str, Any] | Dictionary with statistical result fields. | required

Returns:

Type | Description
StatisticalResult | Reconstructed StatisticalResult instance.
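
The round trip described by to_dict() and from_dict() can be sketched with a stand-in dataclass. `ResultSketch` below is hypothetical: it copies the documented field names but is not the library's class, and the real methods may validate or convert fields differently.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in that mirrors the documented StatisticalResult fields."""

    mean: float
    median: float
    std: float
    min: float
    max: float
    cv: float
    ci_lower: float
    ci_upper: float
    n: int
    is_stable: bool

    def to_dict(self) -> dict[str, Any]:
        # Every field is a float, int, or bool, so the dict is JSON-compatible.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ResultSketch":
        # Assumes the keys match the field names exactly.
        return cls(**data)


original = ResultSketch(
    mean=10.0, median=9.5, std=1.0, min=8.0, max=12.0,
    cv=0.1, ci_lower=9.2, ci_upper=10.8, n=5, is_stable=False,
)
# Serialize to JSON text and back, then rebuild the dataclass.
restored = ResultSketch.from_dict(json.loads(json.dumps(original.to_dict())))
```

Passing the dictionary through `json.dumps` and `json.loads` checks that nothing in it needs a custom encoder.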

StatisticalAnalyzer(bootstrap_resamples=1000, seed=42) ¤

Statistical analysis for benchmark measurements.

Provides summary statistics with bootstrap confidence intervals, modified Z-score outlier detection, and stability assessment.

Parameters:

Name | Type | Description | Default
bootstrap_resamples | int | Number of bootstrap resamples for CI computation. | 1000
seed | int | Random seed for reproducible bootstrap sampling. | 42

summarize(samples) ¤

Compute summary statistics with bootstrap CI.

Parameters:

Name | Type | Description | Default
samples | Sequence[float] | Sequence of measurement values (at least 1). | required

Returns:

Type | Description
StatisticalResult | StatisticalResult with all computed statistics.
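
As a rough illustration of the descriptive part of summarize(), the sketch below computes the documented fields (except the confidence interval) with the standard library. The threshold value 0.05, the zero-std result for a single sample, and the handling of a zero mean are assumptions; the source does not state them.

```python
from __future__ import annotations

import statistics
from typing import Sequence

# Assumed value for illustration; the real STABILITY_CV_THRESHOLD is not given in the source.
STABILITY_CV_THRESHOLD = 0.05


def summarize_sketch(samples: Sequence[float]) -> dict[str, float | int | bool]:
    """Descriptive statistics only; the bootstrap CI is left out of this sketch."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean = statistics.fmean(samples)
    # Sample standard deviation (ddof=1) is undefined for n == 1; 0.0 is an assumption.
    std = statistics.stdev(samples) if n > 1 else 0.0
    # Guard against a zero mean (assumed behavior).
    cv = std / mean if mean != 0 else float("inf")
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "std": std,
        "min": min(samples),
        "max": max(samples),
        "cv": cv,
        "n": n,
        "is_stable": cv < STABILITY_CV_THRESHOLD,
    }


summary = summarize_sketch([10.0, 10.2, 9.8, 10.1, 9.9])
```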

bootstrap_ci(samples, confidence=0.95) ¤

Percentile bootstrap confidence interval.

Parameters:

Name | Type | Description | Default
samples | Sequence[float] | Sequence of measurement values. | required
confidence | float | Confidence level (default 0.95 for 95% CI). | 0.95

Returns:

Type | Description
tuple[float, float] | Tuple of (lower_bound, upper_bound).
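
A percentile bootstrap can be sketched in a few lines. This version assumes the bootstrapped statistic is the mean and uses a simple index-based percentile. The `resamples` and `seed` defaults copy the constructor defaults documented above, and the function name is hypothetical.

```python
from __future__ import annotations

import random
import statistics
from typing import Sequence


def bootstrap_ci_sketch(
    samples: Sequence[float],
    confidence: float = 0.95,
    resamples: int = 1000,
    seed: int = 42,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean (assumed statistic)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)  # a local RNG keeps the result reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * resamples)
    upper_idx = min(resamples - 1, int((1 - alpha / 2) * resamples))
    return means[lower_idx], means[upper_idx]


data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7]
lower, upper = bootstrap_ci_sketch(data)
```

Because the generator is seeded, calling the function twice on the same data returns the same interval.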

detect_outliers(samples, threshold=OUTLIER_Z_THRESHOLD) ¤

Modified Z-score outlier detection.

Uses median absolute deviation (MAD) instead of standard deviation for robustness against the outliers themselves.

Parameters:

Name | Type | Description | Default
samples | Sequence[float] | Sequence of values to check. | required
threshold | float | Modified Z-score threshold (default 3.5). | OUTLIER_Z_THRESHOLD

Returns:

Type | Description
list[int] | List of indices where outliers are detected.
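
The modified Z-score is commonly defined as 0.6745 * (x - median) / MAD. The sketch below follows that textbook form. Whether the library uses the 0.6745 constant, and what it returns when the MAD is zero, are assumptions here.

```python
from __future__ import annotations

import statistics
from typing import Sequence


def detect_outliers_sketch(samples: Sequence[float], threshold: float = 3.5) -> list[int]:
    """Return indices whose modified Z-score exceeds the threshold."""
    med = statistics.median(samples)
    # Median absolute deviation: the median of the distances from the median.
    mad = statistics.median([abs(x - med) for x in samples])
    if mad == 0:
        # Assumed behavior: with no spread around the median, flag nothing.
        return []
    return [
        i
        for i, x in enumerate(samples)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


flagged = detect_outliers_sketch([10.0, 10.1, 9.9, 10.2, 9.8, 25.0])
```

In this data the value 25.0 at index 5 is the only point far from the median relative to the MAD, so it is the only index flagged.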

Significance Testing¤

Optional Dependency

Significance tests require scipy: uv pip install "calibrax[stats]"

calibrax.statistics.significance ¤

Statistical significance tests for benchmark comparisons.

Provides Welch's t-test, Mann-Whitney U, paired Wilcoxon signed-rank test (with pure-Python sign test fallback), and Cohen's d effect size.

welch_t_test(a, b) ¤

Welch's t-test for unequal variances.

Requires scipy. Raises ImportError with a clear message if scipy is not installed.

Parameters:

Name | Type | Description | Default
a | Sequence[float] | First sample measurements. | required
b | Sequence[float] | Second sample measurements. | required

Returns:

Type | Description
tuple[float, float] | Tuple of (t_statistic, p_value).

Raises:

Type | Description
ImportError | If scipy is not installed.
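
To show what the statistic measures, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom by hand. It stops short of the p-value, which needs the t distribution. In scipy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full test; whether welch_t_test wraps that exact call is an assumption.

```python
from __future__ import annotations

import math
import statistics
from typing import Sequence


def welch_t_sketch(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]:
    """Return (t_statistic, degrees_of_freedom); no p-value is computed here."""
    # Per-sample variance of the mean, using the sample variance (ddof=1).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    # Raises ZeroDivisionError if both samples have zero variance.
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


t_stat, dof = welch_t_sketch([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```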

mann_whitney_u(a, b) ¤

Mann-Whitney U test for non-parametric distribution comparison.

Requires scipy. Raises ImportError with a clear message if scipy is not installed.

Parameters:

Name | Type | Description | Default
a | Sequence[float] | First sample measurements. | required
b | Sequence[float] | Second sample measurements. | required

Returns:

Type | Description
tuple[float, float] | Tuple of (u_statistic, p_value).

Raises:

Type | Description
ImportError | If scipy is not installed.
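
The U statistic itself is a pair count, which the sketch below makes explicit: each pair (x, y) with x from `a` and y from `b` contributes 1 when x > y and 0.5 on a tie. The p-value is left to scipy, and the function name is hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def mann_whitney_u_sketch(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]:
    """Return (U for a, U for b) by direct pair counting; O(len(a) * len(b))."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5  # ties are split evenly between the two samples
    # The two U values always sum to the number of pairs.
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


u_first, u_second = mann_whitney_u_sketch([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A U of 0 for the first sample means none of its values exceeds any value in the second sample, which is the most extreme separation possible.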

paired_significance_test(a, b, *, alpha=0.05) ¤

Wilcoxon signed-rank test for paired samples.

Tests whether two related samples have the same distribution. Uses scipy.stats.wilcoxon when scipy is available; otherwise falls back to a pure-Python sign test approximation for small samples.

Parameters:

Name | Type | Description | Default
a | list[float] | First sample (e.g., baseline measurements). | required
b | list[float] | Second sample (e.g., current measurements). Must be same length as a. | required
alpha | float | Significance threshold (default 0.05). | 0.05

Returns:

Type | Description
SignificanceResult | SignificanceResult with p_value, statistic, effect_size (Cohen's d), significant flag, and method name.

Raises:

Type | Description
ValueError | If samples are empty or have different lengths.
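
For intuition about the fallback path, here is a minimal exact two-sided sign test. It is a sketch under assumptions: the library describes its fallback as an approximation, so its p-values may differ, and this function returns a plain tuple rather than a SignificanceResult.

```python
from __future__ import annotations

import math
from typing import Sequence


def sign_test_sketch(
    a: Sequence[float], b: Sequence[float], alpha: float = 0.05
) -> tuple[float, bool]:
    """Exact two-sided sign test on paired samples; returns (p_value, significant)."""
    if not a or len(a) != len(b):
        raise ValueError("samples must be non-empty and of equal length")
    # Zero differences carry no sign information and are dropped.
    diffs = [y - x for x, y in zip(a, b) if y != x]
    n = len(diffs)
    if n == 0:
        return 1.0, False
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Probability of at most k successes under Binomial(n, 0.5), doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return p_value, p_value < alpha


baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
current = [x + 0.5 for x in baseline]
p_value, significant = sign_test_sketch(baseline, current)
```

With eight pairs that all move in the same direction, the two-sided p-value is 2 / 2**8, which is below the default alpha.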

effect_size(a, b) ¤

Cohen's d effect size for two independent samples.

Parameters:

Name | Type | Description | Default
a | Sequence[float] | First sample. | required
b | Sequence[float] | Second sample. | required

Returns:

Type | Description
float | Absolute Cohen's d value. Returns 0.0 if pooled std is zero.
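
A common form of Cohen's d divides the difference in means by a pooled standard deviation weighted by degrees of freedom. The sketch below uses that form and mirrors the documented zero-std behavior; the exact pooling formula used by effect_size is an assumption.

```python
from __future__ import annotations

import math
import statistics
from typing import Sequence


def cohens_d_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute Cohen's d with a pooled standard deviation (assumed pooling formula)."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance, ddof=1
    var_b = statistics.variance(b)
    pooled_std = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_std == 0:
        # Matches the documented behavior: no spread means no measurable effect size.
        return 0.0
    return abs(statistics.fmean(a) - statistics.fmean(b)) / pooled_std


d = cohens_d_sketch([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```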