calibrax.statistics
Statistical analysis tools for benchmark measurements. Provides descriptive statistics with bootstrap confidence intervals, MAD-based outlier detection, significance tests (Welch's t, Mann-Whitney U, Wilcoxon), and Cohen's d effect size.
Analyzer
calibrax.statistics.analyzer
Statistical analysis for benchmark measurements.
Provides summary statistics with bootstrap confidence intervals, outlier detection via modified Z-scores, and stability assessment.
StatisticalResult(*, mean, median, std, min, max, cv, ci_lower, ci_upper, n, is_stable)
dataclass
Summary statistics with confidence intervals.
Attributes:
| Name | Type | Description |
|---|---|---|
| mean | float | Arithmetic mean. |
| median | float | Median value. |
| std | float | Sample standard deviation (ddof=1). |
| min | float | Minimum value. |
| max | float | Maximum value. |
| cv | float | Coefficient of variation (std / mean). |
| ci_lower | float | 95% bootstrap CI lower bound. |
| ci_upper | float | 95% bootstrap CI upper bound. |
| n | int | Number of samples. |
| is_stable | bool | True when CV < STABILITY_CV_THRESHOLD. |
to_dict()
Serialize to a JSON-compatible dictionary.
from_dict(data)
classmethod
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with statistical result fields. | required |
Returns:
| Type | Description |
|---|---|
| StatisticalResult | Reconstructed StatisticalResult instance. |
StatisticalAnalyzer(bootstrap_resamples=1000, seed=42)
Statistical analysis for benchmark measurements.
Provides summary statistics with bootstrap confidence intervals, modified Z-score outlier detection, and stability assessment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bootstrap_resamples | int | Number of bootstrap resamples for CI computation. | 1000 |
| seed | int | Random seed for reproducible bootstrap sampling. | 42 |
Initialize with bootstrap parameters.
summarize(samples)
Compute summary statistics with bootstrap CI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | Sequence[float] | Sequence of measurement values (at least 1). | required |
Returns:
| Type | Description |
|---|---|
| StatisticalResult | StatisticalResult with all computed statistics. |
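To make the non-bootstrap fields concrete, here is a minimal standard-library sketch of how they could be derived. It is not the library's code: summarize_sketch is a hypothetical helper, ASSUMED_CV_THRESHOLD is a placeholder because the real value of STABILITY_CV_THRESHOLD is not stated here, and the handling of a single sample or a zero mean is an assumption.

```python
import statistics
from typing import Any, Sequence

# Assumed value for illustration; the real STABILITY_CV_THRESHOLD is not given here.
ASSUMED_CV_THRESHOLD = 0.05


def summarize_sketch(samples: Sequence[float]) -> dict[str, Any]:
    """Compute the non-bootstrap summary fields (hypothetical helper)."""
    if len(samples) < 1:
        raise ValueError("at least one sample is required")
    mean = statistics.fmean(samples)
    # Sample standard deviation (ddof=1) needs two points; 0.0 for n == 1 is an assumption.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    # cv = std / mean as documented; the zero-mean guard is an assumption.
    cv = std / mean if mean != 0 else float("inf")
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "std": std,
        "min": min(samples),
        "max": max(samples),
        "cv": cv,
        "n": len(samples),
        "is_stable": cv < ASSUMED_CV_THRESHOLD,
    }


summary = summarize_sketch([10.0, 10.2, 9.8, 10.1, 9.9])
```

For these five timings the standard deviation is about 0.158 around a mean of 10.0, so the CV is roughly 1.6% and the run counts as stable under the assumed threshold.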
bootstrap_ci(samples, confidence=0.95)
Percentile bootstrap confidence interval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | Sequence[float] | Sequence of measurement values. | required |
| confidence | float | Confidence level (default 0.95 for 95% CI). | 0.95 |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (lower_bound, upper_bound). |
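The percentile bootstrap resamples the data with replacement, recomputes a statistic on each resample, and reads the interval off the sorted results. The sketch below is a standard-library illustration under stated assumptions: bootstrap_ci_sketch is a hypothetical name, the bootstrapped statistic is assumed to be the mean, and the index-based percentile lookup is one of several reasonable conventions. The resamples and seed defaults mirror the documented constructor defaults.

```python
import random
import statistics
from typing import Sequence


def bootstrap_ci_sketch(
    samples: Sequence[float],
    confidence: float = 0.95,
    resamples: int = 1000,
    seed: int = 42,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean (assumed statistic)."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(samples)
    # Each resample draws n values with replacement and records its mean.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = 1.0 - confidence
    # Pick the alpha/2 and 1 - alpha/2 percentiles by index into the sorted means.
    lo_idx = int((alpha / 2) * resamples)
    hi_idx = min(resamples - 1, int((1 - alpha / 2) * resamples))
    return means[lo_idx], means[hi_idx]


timings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7]
lower, upper = bootstrap_ci_sketch(timings)
```

Because every resampled mean lies between the smallest and largest observation, the interval can never extend beyond the observed range, which is one reason small samples tend to give optimistic intervals.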
detect_outliers(samples, threshold=OUTLIER_Z_THRESHOLD)
Modified Z-score outlier detection.
Uses median absolute deviation (MAD) instead of standard deviation for robustness against the outliers themselves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | Sequence[float] | Sequence of values to check. | required |
| threshold | float | Modified Z-score threshold (default 3.5). | OUTLIER_Z_THRESHOLD |
Returns:
| Type | Description |
|---|---|
| list[int] | List of indices where outliers are detected. |
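The modified Z-score is commonly defined as 0.6745 * (x - median) / MAD, where the constant rescales MAD to be comparable to a standard deviation for normally distributed data. The following is a minimal sketch of that common definition, not a copy of the library's code: detect_outliers_sketch is a hypothetical name, the 0.6745 constant is assumed, and returning no outliers when MAD is zero is an assumption.

```python
import statistics
from typing import Sequence


def detect_outliers_sketch(
    samples: Sequence[float], threshold: float = 3.5
) -> list[int]:
    """Return indices whose modified Z-score exceeds the threshold."""
    med = statistics.median(samples)
    # Median absolute deviation: the median distance from the median.
    mad = statistics.median([abs(x - med) for x in samples])
    if mad == 0:
        # Assumption: with zero MAD the score is undefined, so flag nothing.
        return []
    return [
        i
        for i, x in enumerate(samples)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


outliers = detect_outliers_sketch([10.0, 10.1, 9.9, 10.2, 9.8, 25.0])
# → [5]
```

Here the median is 10.05 and the MAD is 0.15, so the value 25.0 scores about 67 while the remaining points all stay below 1.2. A mean and standard deviation would both be dragged toward 25.0, which is the robustness argument for MAD.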
Significance Testing
Optional Dependency
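welch_t_test and mann_whitney_u require scipy: uv pip install "calibrax[stats]". paired_significance_test uses scipy when it is available and otherwise falls back to a pure-Python sign test.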
calibrax.statistics.significance
Statistical significance tests for benchmark comparisons.
Provides Welch's t-test, Mann-Whitney U, paired Wilcoxon signed-rank test (with pure-Python sign test fallback), and Cohen's d effect size.
welch_t_test(a, b)
Welch's t-test for unequal variances.
Requires scipy. Raises ImportError with a clear message if scipy is not installed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Sequence[float] | First sample measurements. | required |
| b | Sequence[float] | Second sample measurements. | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (t_statistic, p_value). |
Raises:
| Type | Description |
|---|---|
| ImportError | If scipy is not installed. |
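For intuition about what the test measures, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed with the standard library alone. This is a sketch under assumptions, not the library's implementation: welch_t_sketch is a hypothetical helper, and it stops short of the p-value, which needs the Student t distribution (with scipy installed, scipy.stats.ttest_ind(a, b, equal_var=False) returns both the statistic and the p-value).

```python
import math
import statistics
from typing import Sequence


def welch_t_sketch(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]:
    """Return (t_statistic, degrees_of_freedom) for Welch's t-test."""
    n_a, n_b = len(a), len(b)
    # Per-sample squared standard errors; variances are NOT pooled.
    se2_a = statistics.variance(a) / n_a
    se2_b = statistics.variance(b) / n_b
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, df


baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
current = [11.0, 11.4, 10.6, 11.2, 10.8]
t_stat, df = welch_t_sketch(baseline, current)
```

With these samples the statistic is about -6.32 on roughly 5.88 degrees of freedom. The fractional degrees of freedom are what distinguish Welch's test from the pooled-variance Student test, which would use 8.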
mann_whitney_u(a, b)
Mann-Whitney U test for non-parametric distribution comparison.
Requires scipy. Raises ImportError with a clear message if scipy is not installed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Sequence[float] | First sample measurements. | required |
| b | Sequence[float] | Second sample measurements. | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (u_statistic, p_value). |
Raises:
| Type | Description |
|---|---|
| ImportError | If scipy is not installed. |
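The U statistic itself has a simple counting interpretation: for every pair (x, y) with x from a and y from b, count 1 when x > y and 0.5 on a tie. The sketch below shows only that counting step, under the assumption that the library delegates the full test, including the p-value, to scipy.stats.mannwhitneyu. The name mann_whitney_u_sketch is hypothetical.

```python
from typing import Sequence


def mann_whitney_u_sketch(
    a: Sequence[float], b: Sequence[float]
) -> tuple[float, float]:
    """Return (U_a, U_b) by direct pair counting; O(len(a) * len(b))."""
    # U_a counts the pairs in which the value from a beats the value from b.
    u_a = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )
    # The two statistics always sum to the number of pairs.
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


u_separated = mann_whitney_u_sketch([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# → (0.0, 9.0)
u_interleaved = mann_whitney_u_sketch([1.0, 3.0, 5.0], [2.0, 4.0])
# → (3.0, 3.0)
```

A U of 0 (or of len(a) * len(b)) means the samples are completely separated, while values near half the pair count indicate heavy overlap. Because only orderings matter, the test is insensitive to the heavy-tailed timings that distort a t-test.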
paired_significance_test(a, b, *, alpha=0.05)
Wilcoxon signed-rank test for paired samples.
Tests whether two related samples have the same distribution. Uses scipy.stats.wilcoxon when available; otherwise falls back to a pure-Python sign test approximation for small samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | list[float] | First sample (e.g., baseline measurements). | required |
| b | list[float] | Second sample (e.g., current measurements). Must be same length as a. | required |
| alpha | float | Significance threshold (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
| SignificanceResult | SignificanceResult with p_value, statistic, effect_size (Cohen's d), significant flag, and method name. |
Raises:
| Type | Description |
|---|---|
| ValueError | If samples are empty or have different lengths. |
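The sign test fallback mentioned above can be sketched in a few lines. This is an illustration of a textbook exact two-sided sign test, not the library's fallback, whose details are not documented here: sign_test_sketch is a hypothetical name, dropping tied pairs is an assumption, and the sketch returns a plain (p_value, significant) tuple rather than a SignificanceResult.

```python
import math
from typing import Sequence


def sign_test_sketch(
    a: Sequence[float], b: Sequence[float], alpha: float = 0.05
) -> tuple[float, bool]:
    """Exact two-sided sign test on paired samples."""
    if not a or len(a) != len(b):
        raise ValueError("samples must be non-empty and of equal length")
    # Keep only pairs that differ; tied pairs carry no sign information.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0, False  # every pair tied: no evidence of a difference
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Under H0 the number of positive signs follows Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    return p_value, p_value < alpha


base = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
slower = [x + 1.0 for x in base]
p_value, significant = sign_test_sketch(base, slower)
# → (0.03125, True)
```

With six pairs that all move in the same direction, the two-sided p-value is 2 / 64 = 0.03125. Five pairs can never go below 0.0625, so a sign test cannot reach significance at alpha = 0.05 with fewer than six non-tied pairs.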
effect_size(a, b)
Cohen's d effect size for two independent samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Sequence[float] | First sample. | required |
| b | Sequence[float] | Second sample. | required |
Returns:
| Type | Description |
|---|---|
| float | Absolute Cohen's d value. Returns 0.0 if pooled std is zero. |
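Cohen's d divides the difference in means by a pooled standard deviation. The sketch below uses the common degrees-of-freedom-weighted pooling. That pooling formula is an assumption, since the library's exact choice is not stated here, and cohens_d_sketch is a hypothetical name. The absolute value and the zero guard follow the documented return behaviour.

```python
import math
import statistics
from typing import Sequence


def cohens_d_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute Cohen's d using a weighted pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    # Weight each sample variance (ddof=1) by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    pooled_std = math.sqrt(pooled_var)
    if pooled_std == 0:
        return 0.0  # documented behaviour when the pooled std is zero
    return abs(statistics.fmean(a) - statistics.fmean(b)) / pooled_std


d = cohens_d_sketch(
    [10.0, 10.2, 9.8, 10.1, 9.9],
    [11.0, 11.4, 10.6, 11.2, 10.8],
)
```

For these samples the pooled standard deviation is 0.25 and the means differ by 1.0, giving d = 4.0. Unlike a p-value, d does not shrink or grow with sample size, so it complements the significance tests by reporting how large a difference is rather than how detectable it is.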