Statistics¤

Statistical analysis for benchmark measurements, including bootstrap confidence intervals and outlier detection.

calibrax.statistics ¤

Statistical analysis: bootstrap CI, hypothesis testing, and effect sizes.

BOOTSTRAP_CI_ALPHA module-attribute ¤

BOOTSTRAP_CI_ALPHA: float = 0.05

OUTLIER_Z_THRESHOLD module-attribute ¤

OUTLIER_Z_THRESHOLD: float = 3.5

STABILITY_CV_THRESHOLD module-attribute ¤

STABILITY_CV_THRESHOLD: float = 0.1

StatisticalAnalyzer ¤

StatisticalAnalyzer(bootstrap_resamples: int = 1000, seed: int = 42)

Statistical analysis for benchmark measurements.

Provides summary statistics with bootstrap confidence intervals, modified Z-score outlier detection, and stability assessment.

Parameters:

Name Type Description Default
bootstrap_resamples int

Number of bootstrap resamples for CI computation.

1000
seed int

Random seed for reproducible bootstrap sampling.

42

summarize ¤

summarize(samples: Sequence[float]) -> StatisticalResult

Compute summary statistics with bootstrap CI.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of measurement values (at least 1).

required

Returns:

Type Description
StatisticalResult

StatisticalResult with all computed statistics.

bootstrap_ci ¤

bootstrap_ci(samples: Sequence[float], confidence: float = 0.95) -> tuple[float, float]

Percentile bootstrap confidence interval.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of measurement values.

required
confidence float

Confidence level (default 0.95 for 95% CI).

0.95

Returns:

Type Description
tuple[float, float]

Tuple of (lower_bound, upper_bound).
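
The percentile bootstrap can be illustrated with a short standard-library sketch. This is not calibrax's implementation: the function name `percentile_bootstrap_ci` is hypothetical, and the choice of the mean as the resampled statistic and the index-based percentile lookup are assumptions made for the example.

```python
import random
import statistics
from typing import Sequence


def percentile_bootstrap_ci(
    samples: Sequence[float],
    confidence: float = 0.95,
    resamples: int = 1000,
    seed: int = 42,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean (illustrative sketch only)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)  # a fixed seed makes the interval reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = 1.0 - confidence
    # Assumption: take the nearest lower index for each percentile.
    lo_idx = int((alpha / 2) * (resamples - 1))
    hi_idx = int((1 - alpha / 2) * (resamples - 1))
    return means[lo_idx], means[hi_idx]
```

Because every resampled mean lies between the smallest and largest observation, the returned bounds always fall inside the observed range, and repeated calls with the same seed return the same interval.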

detect_outliers ¤

detect_outliers(samples: Sequence[float], threshold: float = OUTLIER_Z_THRESHOLD) -> list[int]

Modified Z-score outlier detection.

Uses median absolute deviation (MAD) instead of standard deviation for robustness against the outliers themselves.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of values to check.

required
threshold float

Modified Z-score threshold (default 3.5).

OUTLIER_Z_THRESHOLD

Returns:

Type Description
list[int]

List of indices where outliers are detected.
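
A minimal sketch of modified Z-score detection follows. It is an illustration rather than the library's code: the name `modified_z_outliers` is hypothetical, the 0.6745 scaling constant is the one conventionally used with MAD, and returning no outliers when MAD is zero is an assumption about edge-case handling.

```python
import statistics
from typing import Sequence


def modified_z_outliers(
    samples: Sequence[float], threshold: float = 3.5
) -> list[int]:
    """Return indices whose modified Z-score exceeds the threshold (sketch)."""
    if not samples:
        return []
    med = statistics.median(samples)
    # Median absolute deviation: the median distance from the median.
    mad = statistics.median(abs(x - med) for x in samples)
    if mad == 0:
        # Assumption: with zero spread, report no outliers.
        return []
    return [
        i
        for i, x in enumerate(samples)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]
```

Because both the center (median) and the spread (MAD) are rank-based, a single extreme value barely moves either of them, which is what keeps the outlier from masking itself.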

StatisticalResult dataclass ¤

StatisticalResult(*, mean: float, median: float, std: float, min: float, max: float, cv: float, ci_lower: float, ci_upper: float, n: int, is_stable: bool)

Summary statistics with confidence intervals.

Attributes:

Name Type Description
mean float

Arithmetic mean.

median float

Median value.

std float

Sample standard deviation (ddof=1).

min float

Minimum value.

max float

Maximum value.

cv float

Coefficient of variation (std / mean).

ci_lower float

95% bootstrap CI lower bound.

ci_upper float

95% bootstrap CI upper bound.

n int

Number of samples.

is_stable bool

True when CV < STABILITY_CV_THRESHOLD.
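
The relationship between `std`, `mean`, `cv`, and `is_stable` can be sketched as below. The helper name `coefficient_of_variation` is hypothetical, and the handling of a single sample or a zero mean is an assumption, since the reference does not specify those cases.

```python
import statistics
from typing import Sequence

STABILITY_CV_THRESHOLD = 0.1  # same value as the documented module attribute


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """CV = sample std (ddof=1) divided by the mean (illustrative sketch)."""
    mean = statistics.fmean(samples)
    # Assumption: a single sample has no spread, so std is treated as 0.0.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    # Assumption: avoid dividing by zero by reporting a CV of 0.0.
    return std / mean if mean != 0 else 0.0


quiet = [100.0, 102.0, 98.0, 101.0, 99.0]
noisy = [100.0, 150.0, 50.0, 120.0, 80.0]
quiet_is_stable = coefficient_of_variation(quiet) < STABILITY_CV_THRESHOLD
noisy_is_stable = coefficient_of_variation(noisy) < STABILITY_CV_THRESHOLD
```

Here the tightly clustered run falls under the 0.1 threshold while the widely scattered run does not.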

mean instance-attribute ¤

mean: float

median instance-attribute ¤

median: float

std instance-attribute ¤

std: float

min instance-attribute ¤

min: float

max instance-attribute ¤

max: float

cv instance-attribute ¤

cv: float

ci_lower instance-attribute ¤

ci_lower: float

ci_upper instance-attribute ¤

ci_upper: float

n instance-attribute ¤

n: int

is_stable instance-attribute ¤

is_stable: bool

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict classmethod ¤

from_dict(data: dict[str, Any]) -> StatisticalResult

Deserialize from a dictionary.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary with statistical result fields.

required

Returns:

Type Description
StatisticalResult

Reconstructed StatisticalResult instance.
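
A round trip through `to_dict` and `from_dict` might look like the following sketch. `ResultSketch` is a hypothetical stand-in that mirrors the documented field names; the use of `dataclasses.asdict` and keyword unpacking is an assumption about how such methods could be written, not a description of the library's code.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ResultSketch:
    """Stand-in with the same fields as the documented StatisticalResult."""

    mean: float
    median: float
    std: float
    min: float
    max: float
    cv: float
    ci_lower: float
    ci_upper: float
    n: int
    is_stable: bool

    def to_dict(self) -> dict[str, Any]:
        # All fields are floats, ints, or bools, so the dict is JSON-compatible.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ResultSketch":
        return cls(**data)


original = ResultSketch(
    mean=10.0, median=10.0, std=0.5, min=9.0, max=11.0,
    cv=0.05, ci_lower=9.6, ci_upper=10.4, n=20, is_stable=True,
)
# Serialize to JSON text and back to check nothing is lost.
restored = ResultSketch.from_dict(json.loads(json.dumps(original.to_dict())))
```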

effect_size ¤

effect_size(a: Sequence[float], b: Sequence[float]) -> float

Cohen's d effect size for two independent samples.

Parameters:

Name Type Description Default
a Sequence[float]

First sample.

required
b Sequence[float]

Second sample.

required

Returns:

Type Description
float

Absolute Cohen's d value. Returns 0.0 if pooled std is zero.
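
For reference, absolute Cohen's d with a pooled standard deviation can be computed as in this sketch. The name `cohens_d` is hypothetical, and the requirement of at least two values per sample is an assumption of the example; the zero-spread return value matches the behavior described above.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute Cohen's d for two independent samples (illustrative sketch)."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        # Assumption: sample variance needs at least two values per group.
        raise ValueError("each sample needs at least two values")
    var_a = statistics.variance(a)  # sample variance, ddof=1
    var_b = statistics.variance(b)
    # Pooled std weights each variance by its degrees of freedom.
    pooled = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled == 0:
        return 0.0
    return abs(statistics.fmean(a) - statistics.fmean(b)) / pooled
```

Taking the absolute value makes the result independent of argument order, so the value reports the magnitude of a difference but not its direction.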

mann_whitney_u ¤

mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]

Mann-Whitney U test for non-parametric distribution comparison.

Requires scipy. Raises ImportError with a clear message if scipy is unavailable.

Parameters:

Name Type Description Default
a Sequence[float]

First sample measurements.

required
b Sequence[float]

Second sample measurements.

required

Returns:

Type Description
tuple[float, float]

Tuple of (u_statistic, p_value).

Raises:

Type Description
ImportError

If scipy is not installed.
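
To show what the U statistic measures, the sketch below computes it directly from its pairwise definition. The name `u_statistic` is hypothetical and the function covers only the statistic; the p-value is left to scipy, as the documented function requires.

```python
from typing import Sequence


def u_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """U for sample a: pairs where a's value beats b's, ties count as 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The statistics for the two argument orders always sum to `len(a) * len(b)`, so a value near zero or near that product indicates that one sample consistently ranks below the other.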

paired_significance_test ¤

paired_significance_test(a: list[float], b: list[float], *, alpha: float = 0.05) -> SignificanceResult

Wilcoxon signed-rank test for paired samples.

Tests whether two related samples have the same distribution. Uses scipy.stats.wilcoxon when available; otherwise falls back to a pure-Python sign test approximation for small samples.

Parameters:

Name Type Description Default
a list[float]

First sample (e.g., baseline measurements).

required
b list[float]

Second sample (e.g., current measurements). Must be same length as a.

required
alpha float

Significance threshold (default 0.05).

0.05

Returns:

Type Description
SignificanceResult

SignificanceResult with p_value, statistic, effect_size (Cohen's d), significant flag, and method name.

Raises:

Type Description
ValueError

If samples are empty or have different lengths.
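
A pure-Python sign test of the kind mentioned as a fallback could be sketched as follows. This is an exact two-sided binomial sign test written for illustration; the name `sign_test_p_value` is hypothetical, and dropping tied pairs is an assumption. The library's own fallback is described as an approximation and may differ.

```python
import math
from typing import Sequence


def sign_test_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign test p-value for paired samples (sketch)."""
    if not a or len(a) != len(b):
        raise ValueError("samples must be non-empty and of equal length")
    # Assumption: pairs with no difference carry no sign and are dropped.
    diffs = [y - x for x, y in zip(a, b) if y != x]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(1 for d in diffs if d > 0)
    k = min(positives, n - positives)
    # Probability of k or fewer of the rarer sign under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

With six pairs that all move in the same direction, the p-value is 2 / 64 = 0.03125, which would be flagged as significant at the default alpha of 0.05.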

welch_t_test ¤

welch_t_test(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]

Welch's t-test for unequal variances.

Requires scipy. Raises ImportError with a clear message if scipy is unavailable.

Parameters:

Name Type Description Default
a Sequence[float]

First sample measurements.

required
b Sequence[float]

Second sample measurements.

required

Returns:

Type Description
tuple[float, float]

Tuple of (t_statistic, p_value).

Raises:

Type Description
ImportError

If scipy is not installed.
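
The t statistic itself needs only the standard library, as this sketch shows. The name `welch_t_and_df` is hypothetical; it returns the statistic and the Welch-Satterthwaite degrees of freedom rather than a p-value, since converting to a p-value needs the t distribution (for example from scipy). Raising on zero variance in both samples is an assumption of the example.

```python
import math
import statistics
from typing import Sequence


def welch_t_and_df(
    a: Sequence[float], b: Sequence[float]
) -> tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom (sketch)."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    # Per-sample variance of the mean; the variances are not pooled.
    se_a = statistics.variance(a) / n_a
    se_b = statistics.variance(b) / n_b
    if se_a + se_b == 0:
        # Assumption: the statistic is undefined when neither sample varies.
        raise ValueError("both samples have zero variance")
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))
    return t, df
```

Unlike Student's t-test, the two variances are kept separate, which is why the degrees of freedom are estimated from the data instead of being fixed at `n_a + n_b - 2`.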

analyzer ¤

Statistical analysis for benchmark measurements.

Provides summary statistics with bootstrap confidence intervals, outlier detection via modified Z-scores, and stability assessment.

STABILITY_CV_THRESHOLD module-attribute ¤

STABILITY_CV_THRESHOLD: float = 0.1

BOOTSTRAP_CI_ALPHA module-attribute ¤

BOOTSTRAP_CI_ALPHA: float = 0.05

OUTLIER_Z_THRESHOLD module-attribute ¤

OUTLIER_Z_THRESHOLD: float = 3.5

StatisticalResult dataclass ¤

StatisticalResult(*, mean: float, median: float, std: float, min: float, max: float, cv: float, ci_lower: float, ci_upper: float, n: int, is_stable: bool)

Summary statistics with confidence intervals.

Attributes:

Name Type Description
mean float

Arithmetic mean.

median float

Median value.

std float

Sample standard deviation (ddof=1).

min float

Minimum value.

max float

Maximum value.

cv float

Coefficient of variation (std / mean).

ci_lower float

95% bootstrap CI lower bound.

ci_upper float

95% bootstrap CI upper bound.

n int

Number of samples.

is_stable bool

True when CV < STABILITY_CV_THRESHOLD.

mean instance-attribute ¤

mean: float

median instance-attribute ¤

median: float

std instance-attribute ¤

std: float

min instance-attribute ¤

min: float

max instance-attribute ¤

max: float

cv instance-attribute ¤

cv: float

ci_lower instance-attribute ¤

ci_lower: float

ci_upper instance-attribute ¤

ci_upper: float

n instance-attribute ¤

n: int

is_stable instance-attribute ¤

is_stable: bool

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict classmethod ¤

from_dict(data: dict[str, Any]) -> StatisticalResult

Deserialize from a dictionary.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary with statistical result fields.

required

Returns:

Type Description
StatisticalResult

Reconstructed StatisticalResult instance.

StatisticalAnalyzer ¤

StatisticalAnalyzer(bootstrap_resamples: int = 1000, seed: int = 42)

Statistical analysis for benchmark measurements.

Provides summary statistics with bootstrap confidence intervals, modified Z-score outlier detection, and stability assessment.

Parameters:

Name Type Description Default
bootstrap_resamples int

Number of bootstrap resamples for CI computation.

1000
seed int

Random seed for reproducible bootstrap sampling.

42

summarize ¤

summarize(samples: Sequence[float]) -> StatisticalResult

Compute summary statistics with bootstrap CI.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of measurement values (at least 1).

required

Returns:

Type Description
StatisticalResult

StatisticalResult with all computed statistics.

bootstrap_ci ¤

bootstrap_ci(samples: Sequence[float], confidence: float = 0.95) -> tuple[float, float]

Percentile bootstrap confidence interval.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of measurement values.

required
confidence float

Confidence level (default 0.95 for 95% CI).

0.95

Returns:

Type Description
tuple[float, float]

Tuple of (lower_bound, upper_bound).

detect_outliers ¤

detect_outliers(samples: Sequence[float], threshold: float = OUTLIER_Z_THRESHOLD) -> list[int]

Modified Z-score outlier detection.

Uses median absolute deviation (MAD) instead of standard deviation for robustness against the outliers themselves.

Parameters:

Name Type Description Default
samples Sequence[float]

Sequence of values to check.

required
threshold float

Modified Z-score threshold (default 3.5).

OUTLIER_Z_THRESHOLD

Returns:

Type Description
list[int]

List of indices where outliers are detected.

significance ¤

Statistical significance tests for benchmark comparisons.

Provides Welch's t-test, Mann-Whitney U, paired Wilcoxon signed-rank test (with pure-Python sign test fallback), and Cohen's d effect size.

welch_t_test ¤

welch_t_test(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]

Welch's t-test for unequal variances.

Requires scipy. Raises ImportError with a clear message if scipy is unavailable.

Parameters:

Name Type Description Default
a Sequence[float]

First sample measurements.

required
b Sequence[float]

Second sample measurements.

required

Returns:

Type Description
tuple[float, float]

Tuple of (t_statistic, p_value).

Raises:

Type Description
ImportError

If scipy is not installed.

mann_whitney_u ¤

mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]

Mann-Whitney U test for non-parametric distribution comparison.

Requires scipy. Raises ImportError with a clear message if scipy is unavailable.

Parameters:

Name Type Description Default
a Sequence[float]

First sample measurements.

required
b Sequence[float]

Second sample measurements.

required

Returns:

Type Description
tuple[float, float]

Tuple of (u_statistic, p_value).

Raises:

Type Description
ImportError

If scipy is not installed.

paired_significance_test ¤

paired_significance_test(a: list[float], b: list[float], *, alpha: float = 0.05) -> SignificanceResult

Wilcoxon signed-rank test for paired samples.

Tests whether two related samples have the same distribution. Uses scipy.stats.wilcoxon when available; otherwise falls back to a pure-Python sign test approximation for small samples.

Parameters:

Name Type Description Default
a list[float]

First sample (e.g., baseline measurements).

required
b list[float]

Second sample (e.g., current measurements). Must be same length as a.

required
alpha float

Significance threshold (default 0.05).

0.05

Returns:

Type Description
SignificanceResult

SignificanceResult with p_value, statistic, effect_size (Cohen's d), significant flag, and method name.

Raises:

Type Description
ValueError

If samples are empty or have different lengths.

effect_size ¤

effect_size(a: Sequence[float], b: Sequence[float]) -> float

Cohen's d effect size for two independent samples.

Parameters:

Name Type Description Default
a Sequence[float]

First sample.

required
b Sequence[float]

Second sample.

required

Returns:

Type Description
float

Absolute Cohen's d value. Returns 0.0 if pooled std is zero.