Statistics¤
Statistical analysis for benchmark measurements, including bootstrap confidence intervals and outlier detection.
See Also¤
- Benchmarking Overview - All benchmarking tools
- Timing - Timing collection
- Regression - Regression detection
- Benchmarking Guide
calibrax.statistics ¤
Statistical analysis: bootstrap CI, hypothesis testing, and effect sizes.
StatisticalAnalyzer ¤
Statistical analysis for benchmark measurements.
Provides summary statistics with bootstrap confidence intervals, modified Z-score outlier detection, and stability assessment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bootstrap_resamples | int | Number of bootstrap resamples for CI computation. | 1000 |
| seed | int | Random seed for reproducible bootstrap sampling. | 42 |
summarize ¤
summarize(samples: Sequence[float]) -> StatisticalResult
Compute summary statistics with bootstrap CI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | Sequence[float] | Sequence of measurement values (at least 1). | required |
Returns:
| Type | Description |
|---|---|
| StatisticalResult | StatisticalResult with all computed statistics. |
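To make the bootstrap CI idea concrete, here is a minimal sketch of a percentile bootstrap for the mean using only the standard library. The function name `bootstrap_mean_ci`, the `confidence` parameter, and the choice of the percentile method are assumptions for illustration; the source does not say which bootstrap variant calibrax uses. The defaults mirror the documented `bootstrap_resamples=1000` and `seed=42`.

```python
import random
import statistics
from collections.abc import Sequence


def bootstrap_mean_ci(
    samples: Sequence[float],
    resamples: int = 1000,
    seed: int = 42,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean (illustrative only)."""
    if not samples:
        raise ValueError("samples must contain at least one value")
    rng = random.Random(seed)  # a fixed seed makes the interval reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * resamples)]
    upper = means[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lower, upper


timings = [10.0, 10.2, 9.9, 10.1, 10.3, 9.8, 10.0, 10.4]
ci_lower, ci_upper = bootstrap_mean_ci(timings)
```

Because the generator is seeded, calling the function twice on the same data returns the same bounds, which is the property the `seed` parameter is documented to provide.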
bootstrap_ci ¤
detect_outliers ¤
Modified Z-score outlier detection.
Uses median absolute deviation (MAD) instead of standard deviation for robustness against the outliers themselves.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | Sequence[float] | Sequence of values to check. | required |
| threshold | float | Modified Z-score threshold (default 3.5). | OUTLIER_Z_THRESHOLD |
Returns:
| Type | Description |
|---|---|
| list[int] | List of indices where outliers are detected. |
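The modified Z-score described above can be sketched as follows. This is an illustration of the general technique, not the calibrax implementation: the function name `modified_z_outliers`, the 0.6745 scaling constant (the conventional choice for this score), and the handling of a zero MAD are assumptions.

```python
import statistics
from collections.abc import Sequence


def modified_z_outliers(
    samples: Sequence[float], threshold: float = 3.5
) -> list[int]:
    """Return indices whose modified Z-score exceeds the threshold (illustrative)."""
    if not samples:
        return []
    med = statistics.median(samples)
    # Median absolute deviation: the median distance from the median.
    mad = statistics.median(abs(x - med) for x in samples)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    # 0.6745 rescales MAD so scores are comparable to ordinary Z-scores
    # when the data are normally distributed.
    return [
        i
        for i, x in enumerate(samples)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


flagged = modified_z_outliers([10.0, 10.1, 9.9, 10.2, 9.8, 50.0])
```

A single extreme value barely moves the median or the MAD, which is why this score flags the outlier where a mean-and-standard-deviation score could be masked by it.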
StatisticalResult dataclass ¤
StatisticalResult(*, mean: float, median: float, std: float, min: float, max: float, cv: float, ci_lower: float, ci_upper: float, n: int, is_stable: bool)
Summary statistics with confidence intervals.
Attributes:
| Name | Type | Description |
|---|---|---|
| mean | float | Arithmetic mean. |
| median | float | Median value. |
| std | float | Sample standard deviation (ddof=1). |
| min | float | Minimum value. |
| max | float | Maximum value. |
| cv | float | Coefficient of variation (std / mean). |
| ci_lower | float | 95% bootstrap CI lower bound. |
| ci_upper | float | 95% bootstrap CI upper bound. |
| n | int | Number of samples. |
| is_stable | bool | True when CV < STABILITY_CV_THRESHOLD. |
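As a rough illustration of how `cv` and `is_stable` relate, the sketch below computes the coefficient of variation from the sample standard deviation (ddof=1) and compares it with a caller-supplied threshold. The value of STABILITY_CV_THRESHOLD is not given in this page, so `cv_threshold` is a hypothetical parameter with no default; returning a standard deviation of 0.0 for a single sample is also an assumption.

```python
import statistics
from collections.abc import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation (ddof=1) divided by the mean (illustrative)."""
    mean = statistics.fmean(samples)
    # statistics.stdev divides by n - 1 and needs at least two values;
    # treating a single sample as having zero spread is an assumption.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return std / mean  # raises ZeroDivisionError when the mean is 0


def is_stable(samples: Sequence[float], cv_threshold: float) -> bool:
    """True when the coefficient of variation is below the given threshold."""
    return coefficient_of_variation(samples) < cv_threshold


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cv = coefficient_of_variation(values)
```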
from_dict classmethod ¤
from_dict(data: dict[str, Any]) -> StatisticalResult
Deserialize from a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with statistical result fields. | required |
Returns:
| Type | Description |
|---|---|
| StatisticalResult | Reconstructed StatisticalResult instance. |
effect_size ¤
mann_whitney_u ¤
Mann-Whitney U test for non-parametric distribution comparison.
Requires scipy. Raises ImportError with a clear message if scipy is unavailable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Sequence[float] | First sample measurements. | required |
| b | Sequence[float] | Second sample measurements. | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (u_statistic, p_value). |
Raises:
| Type | Description |
|---|---|
| ImportError | If scipy is not installed. |
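For intuition about what the U statistic measures, the sketch below counts, over all cross-sample pairs, how often a value from the first sample exceeds a value from the second, with ties counting one half. The function name `u_statistic` is hypothetical, the sketch computes no p-value, and it is not a substitute for the scipy-backed function documented above.

```python
from collections.abc import Sequence


def u_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """U for sample a: number of (x, y) pairs with x > y, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


baseline = [1.0, 2.0, 3.0]
current = [4.0, 5.0, 6.0]
u_base = u_statistic(baseline, current)
u_curr = u_statistic(current, baseline)
```

The two directions always sum to `len(a) * len(b)`, so a U near either extreme indicates that one sample tends to dominate the other.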
paired_significance_test ¤
paired_significance_test(a: list[float], b: list[float], *, alpha: float = 0.05) -> SignificanceResult
Wilcoxon signed-rank test for paired samples.
Tests whether two related samples have the same distribution. Uses scipy.stats.wilcoxon when available and falls back to a pure-Python sign test approximation for small samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | list[float] | First sample (e.g., baseline measurements). | required |
| b | list[float] | Second sample (e.g., current measurements). Must be the same length as a. | required |
| alpha | float | Significance threshold (default 0.05). | 0.05 |
Returns:
| Type | Description |
|---|---|
| SignificanceResult | SignificanceResult with p_value, statistic, effect_size (Cohen's d), significant flag, and method name. |
Raises:
| Type | Description |
|---|---|
| ValueError | If samples are empty or have different lengths. |
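A sign test for paired samples can be written in a few lines of standard-library Python. The sketch below is an exact two-sided binomial sign test; the library's fallback is described as an approximation and may differ in detail. The function name `sign_test_p_value` and the decision to drop tied pairs are assumptions.

```python
import math
from collections.abc import Sequence


def sign_test_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign test p-value for paired samples (illustrative)."""
    if not a or len(a) != len(b):
        raise ValueError("samples must be non-empty and of equal length")
    # Only the sign of each paired difference matters; ties are dropped.
    diffs = [y - x for x, y in zip(a, b) if y != x]
    n = len(diffs)
    if n == 0:
        return 1.0  # every pair tied, so there is no evidence of a shift
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Probability of k or fewer of the rarer sign under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


before = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
after = [11.0, 12.0, 11.0, 13.0, 12.0, 11.0]
p_value = sign_test_p_value(before, after)
```

With six pairs all shifted in the same direction, the p-value is 2 / 2**6, which is below the default alpha of 0.05.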
welch_t_test ¤
Welch's t-test for unequal variances.
Requires scipy. Raises ImportError with a clear message if scipy is unavailable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Sequence[float] | First sample measurements. | required |
| b | Sequence[float] | Second sample measurements. | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (t_statistic, p_value). |
Raises:
| Type | Description |
|---|---|
| ImportError | If scipy is not installed. |
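The statistic behind Welch's t-test can be computed without scipy, which may help when reading its output. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom; it does not compute a p-value, and the function name `welch_t` is hypothetical. Each sample needs at least two values, and the two samples must not both have zero variance.

```python
import math
import statistics
from collections.abc import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> tuple[float, float]:
    """Welch t statistic and approximate degrees of freedom (illustrative)."""
    # Squared standard error of each mean, using the sample variance (ddof=1).
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation; no equal-variance assumption.
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```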
analyzer ¤
Statistical analysis for benchmark measurements.
Provides summary statistics with bootstrap confidence intervals, outlier detection via modified Z-scores, and stability assessment.
significance ¤
Statistical significance tests for benchmark comparisons.
Provides Welch's t-test, Mann-Whitney U, paired Wilcoxon signed-rank test (with pure-Python sign test fallback), and Cohen's d effect size.
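Cohen's d expresses the difference between two means in units of standard deviation. The sketch below uses the pooled-standard-deviation form for two independent samples; which variant calibrax computes is not stated in this page, so both the formula choice and the function name `cohens_d` are assumptions.

```python
import math
import statistics
from collections.abc import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (illustrative)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

A p-value says whether a difference is distinguishable from noise; an effect size such as d says how large the difference is relative to the spread of the measurements.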