Comparative Benchmarking¤

Compare performance across configurations or versions.

Overview¤

calibrax provides comparative analysis through Run objects that hold multiple Point entries, one per framework or configuration. rank_table() ranks entries by any metric with direction-aware sorting, and compare_configurations() produces a full comparison report across two or more runs.

Quick Start¤

```python
from calibrax.analysis import rank_table, compare_configurations
from calibrax.core import Metric, MetricDef, MetricDirection, Point, Run

# Build a run with results from multiple configurations
run = Run(
    points=(
        Point(
            name="CV-1/baseline",
            scenario="CV-1",
            tags={"framework": "baseline"},
            metrics={"throughput": Metric(value=15000.0)},
        ),
        Point(
            name="CV-1/optimized",
            scenario="CV-1",
            tags={"framework": "optimized"},
            metrics={"throughput": Metric(value=20000.0)},
        ),
    ),
    metric_defs={
        "throughput": MetricDef(
            name="throughput",
            unit="elem/s",
            direction=MetricDirection.HIGHER,
        ),
    },
)

# Rank by throughput (direction-aware: higher is better)
rankings = rank_table(run, "throughput")
for row in rankings:
    marker = " (best)" if row.is_best else ""
    print(f"  {row.rank}. {row.label}: {row.value:.0f} elem/s{marker}")
```

calibrax.analysis ¤

Analysis: regression detection, comparison, ranking, scaling, Pareto fronts.

ComparisonReport dataclass ¤

ComparisonReport(*, name: str, labels_compared: tuple[str, ...], metric_comparisons: tuple[MetricComparison, ...], winner_by_metric: dict[str, str], overall_winner: str)

Full comparison across multiple metrics and configurations.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | Name of this comparison. |
| labels_compared | tuple[str, ...] | Configuration labels included. |
| metric_comparisons | tuple[MetricComparison, ...] | Per-metric comparison results. |
| winner_by_metric | dict[str, str] | Best label for each metric. |
| overall_winner | str | Best label by aggregate score. |

name instance-attribute ¤

name: str

labels_compared instance-attribute ¤

labels_compared: tuple[str, ...]

metric_comparisons instance-attribute ¤

metric_comparisons: tuple[MetricComparison, ...]

winner_by_metric instance-attribute ¤

winner_by_metric: dict[str, str]

overall_winner instance-attribute ¤

overall_winner: str

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict classmethod ¤

from_dict(data: dict[str, Any]) -> ComparisonReport

Deserialize from a dictionary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | dict[str, Any] | Dictionary with comparison report fields. | required |

Returns:

| Type | Description |
| --- | --- |
| ComparisonReport | Reconstructed ComparisonReport instance. |

MetricComparison dataclass ¤

MetricComparison(*, metric_name: str, values: dict[str, float], rankings: tuple[RankEntry, ...], best_label: str, improvement_factors: dict[str, float])

Comparison results for a single metric across configurations.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| metric_name | str | Name of the compared metric. |
| values | dict[str, float] | Mapping of configuration label to metric value. |
| rankings | tuple[RankEntry, ...] | Ranked entries for this metric. |
| best_label | str | Label of the best-performing configuration. |
| improvement_factors | dict[str, float] | How much better the best is vs. each config. |

metric_name instance-attribute ¤

metric_name: str

values instance-attribute ¤

values: dict[str, float]

rankings instance-attribute ¤

rankings: tuple[RankEntry, ...]

best_label instance-attribute ¤

best_label: str

improvement_factors instance-attribute ¤

improvement_factors: dict[str, float]

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

compare_configurations ¤

compare_configurations(runs: dict[str, Run], metrics: Sequence[str] | None = None, *, group_by_tag: str = 'framework') -> ComparisonReport

Compare benchmark runs across different configurations.

Builds a merged Run from all provided runs, using configuration labels as framework tags, then leverages rank_table and aggregate_score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| runs | dict[str, Run] | Mapping of configuration label to benchmark Run. | required |
| metrics | Sequence[str] \| None | Subset of metric names to compare. Defaults to all metrics found across all runs. | None |
| group_by_tag | str | Tag key used for grouping (default "framework"). | 'framework' |

Returns:

| Type | Description |
| --- | --- |
| ComparisonReport | ComparisonReport with per-metric comparisons and overall winner. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If fewer than 2 configurations are provided. |

pareto_front ¤

pareto_front(points: list[Point], x_metric: str, y_metric: str, *, metric_defs: dict[str, MetricDef] | None = None) -> list[Point]

Identify Pareto-optimal points for two metrics.

A point is Pareto-optimal if no other point is strictly better on both metrics. Uses MetricDef.direction to determine "better".

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| points | list[Point] | List of benchmark points to analyze. | required |
| x_metric | str | First metric name. | required |
| y_metric | str | Second metric name. | required |
| metric_defs | dict[str, MetricDef] \| None | Optional metric definitions for direction. If not provided, defaults to higher-is-better for both metrics. | None |

Returns:

| Type | Description |
| --- | --- |
| list[Point] | List of Pareto-optimal points (subset of input, same order). |
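The dominance rule stated above can be sketched without calibrax. This is an illustrative sketch, not the library's implementation: the function name `pareto_front_sketch`, the plain `(x, y)` tuples, and the boolean direction flags are assumptions made for the example. It applies the rule exactly as worded here, so a point is dropped only when some other point is strictly better on both metrics.

```python
from typing import Sequence


def pareto_front_sketch(
    points: Sequence[tuple[float, float]],
    x_higher_is_better: bool = True,
    y_higher_is_better: bool = True,
) -> list[tuple[float, float]]:
    """Keep points that no other point beats strictly on both axes (hypothetical helper)."""

    def better(a: float, b: float, higher: bool) -> bool:
        # Direction-aware strict comparison: is a better than b?
        return a > b if higher else a < b

    front = []
    for i, (px, py) in enumerate(points):
        dominated = any(
            better(qx, px, x_higher_is_better) and better(qy, py, y_higher_is_better)
            for j, (qx, qy) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((px, py))  # input order is preserved
    return front


# Throughput (higher is better) vs. latency in ms (lower is better).
candidates = [(15000.0, 4.0), (20000.0, 6.0), (12000.0, 7.0), (18000.0, 5.0)]
front = pareto_front_sketch(candidates, y_higher_is_better=False)
```

Here `(12000.0, 7.0)` is dropped because `(15000.0, 4.0)` has both higher throughput and lower latency; the other three points each win on at least one axis and stay on the front.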

aggregate_score ¤

aggregate_score(run: Run, weights: dict[str, float]) -> dict[str, float]

Weighted aggregate score across metrics.

Normalizes each metric to [0, 1] range (best = 1.0, worst = 0.0), then computes a weighted sum. Uses MetricDef.direction for normalization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| run | Run | Benchmark run with points and metric_defs. | required |
| weights | dict[str, float] | {metric_name: weight}; weights are normalized to sum to 1.0. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, float] | {framework_label: aggregate_score} where score is in [0, 1]. |
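The normalization and weighting described above can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions, not calibrax's code: the function name, the `{metric: {label: value}}` input shape, and the choice to score every label 1.0 when all values for a metric are equal are hypothetical.

```python
def aggregate_score_sketch(
    values: dict[str, dict[str, float]],
    higher_is_better: dict[str, bool],
    weights: dict[str, float],
) -> dict[str, float]:
    """Min-max normalize each metric to [0, 1], then take a weighted sum (hypothetical helper)."""
    total_weight = sum(weights.values())
    labels = {label for per_label in values.values() for label in per_label}
    scores = {label: 0.0 for label in labels}

    for metric, weight in weights.items():
        per_label = values[metric]
        lo, hi = min(per_label.values()), max(per_label.values())
        span = hi - lo
        for label, value in per_label.items():
            if span == 0:
                normalized = 1.0  # assumption: a tie counts as best for everyone
            else:
                normalized = (value - lo) / span
                if not higher_is_better[metric]:
                    normalized = 1.0 - normalized  # lower-is-better: flip the scale
            # Dividing by total_weight makes the weights sum to 1.0.
            scores[label] += (weight / total_weight) * normalized
    return scores


scores = aggregate_score_sketch(
    values={
        "throughput": {"baseline": 15000.0, "optimized": 20000.0},
        "latency": {"baseline": 4.0, "optimized": 6.0},
    },
    higher_is_better={"throughput": True, "latency": False},
    weights={"throughput": 3.0, "latency": 1.0},
)
```

With these weights, "optimized" scores 0.75 (best throughput, worst latency) and "baseline" scores 0.25 (the reverse).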

rank_table ¤

rank_table(run: Run, metric: str, group_by_tag: str = 'framework') -> list[RankEntry]

Rank entries by metric value, grouped by a tag.

Uses MetricDef.direction for determining best-is-highest vs best-is-lowest.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| run | Run | Benchmark run with points and metric_defs. | required |
| metric | str | Metric name to rank by. | required |
| group_by_tag | str | Tag key used to group points (default "framework"). | 'framework' |

Returns:

| Type | Description |
| --- | --- |
| list[RankEntry] | Sorted list of RankEntry, rank 1 = best. |
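Direction-aware ranking amounts to sorting in descending order for higher-is-better metrics and ascending order otherwise. The sketch below shows only that idea on a plain `{label: value}` mapping; `rank_sketch` and its tuple output are hypothetical and do not mirror the RankEntry type.

```python
def rank_sketch(
    values: dict[str, float], higher_is_better: bool
) -> list[tuple[int, str, float]]:
    """Return (rank, label, value) tuples, rank 1 = best (hypothetical helper)."""
    ordered = sorted(
        values.items(),
        key=lambda item: item[1],
        reverse=higher_is_better,  # descending when higher is better
    )
    return [(rank, label, value) for rank, (label, value) in enumerate(ordered, start=1)]


by_throughput = rank_sketch({"baseline": 15000.0, "optimized": 20000.0}, higher_is_better=True)
by_latency = rank_sketch({"baseline": 4.0, "optimized": 6.0}, higher_is_better=False)
```

The same two configurations swap places depending on the metric's direction: "optimized" ranks first on throughput, "baseline" ranks first on latency.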

detect_regressions ¤

detect_regressions(run: Run, baseline: Run, threshold: float = 0.05) -> list[Regression]

Flag metrics that degraded beyond threshold.

Uses MetricDef.direction: 'higher' metrics regress when they decrease, 'lower' metrics regress when they increase. 'info' metrics are skipped.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| run | Run | Current benchmark run. | required |
| baseline | Run | Baseline run to compare against. | required |
| threshold | float | Relative change threshold (e.g. 0.05 = 5%). | 0.05 |

Returns:

| Type | Description |
| --- | --- |
| list[Regression] | List of detected regressions. |
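The direction rule above can be made concrete for a single metric value. This sketch is an assumption-laden illustration rather than the library's logic: `is_regression` is a hypothetical helper, directions are passed as the strings 'higher', 'lower', and 'info', the comparison against the threshold is assumed to be strict, and a zero baseline is assumed to be skipped.

```python
def is_regression(
    current: float, baseline: float, direction: str, threshold: float = 0.05
) -> bool:
    """Flag a metric whose relative change is worse than the threshold (hypothetical helper)."""
    if direction == "info" or baseline == 0:
        return False  # informational metrics (and undefined ratios) are skipped
    relative_change = (current - baseline) / abs(baseline)
    if direction == "higher":
        return relative_change < -threshold  # a drop is a regression
    if direction == "lower":
        return relative_change > threshold  # a rise is a regression
    raise ValueError(f"unknown direction: {direction!r}")


# Throughput fell from 15000 to 14000 elem/s: about a 6.7% drop, beyond 5%.
throughput_regressed = is_regression(14000.0, 15000.0, "higher")
# Latency rose from 4.0 to 4.4 ms: a 10% increase, beyond 5%.
latency_regressed = is_regression(4.4, 4.0, "lower")
```

A drop to 14500 elem/s (about 3.3%) would stay under the default threshold and would not be flagged.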

scaling_fit ¤

scaling_fit(sizes: list[float], values: list[float]) -> ScalingLaw

Fit power-law: value = a * size^b using log-linear regression.

Takes log of both sides: log(value) = log(a) + b * log(size), then fits a linear regression. Pure Python (no scipy/numpy needed).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sizes | list[float] | Input sizes (e.g., batch sizes, dataset sizes). | required |
| values | list[float] | Measured values (e.g., throughput, latency). | required |

Returns:

| Type | Description |
| --- | --- |
| ScalingLaw | ScalingLaw with coefficient (a), exponent (b), r_squared, and complexity classification string. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If inputs are empty or have different lengths. |
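The log-linear fit described above is ordinary least squares on (log size, log value) pairs. The following standard-library sketch shows that computation; `fit_power_law` and its tuple return value are hypothetical stand-ins (the real function returns a ScalingLaw), and the complexity classification step is omitted.

```python
import math


def fit_power_law(sizes: list[float], values: list[float]) -> tuple[float, float, float]:
    """Fit value = a * size**b; return (a, b, r_squared). Hypothetical helper."""
    if not sizes or len(sizes) != len(values):
        raise ValueError("inputs must be non-empty and of equal length")

    # math.log raises ValueError for non-positive inputs.
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                # slope of the line = exponent
    log_a = mean_y - b * mean_x  # intercept = log of the coefficient

    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r_squared


# Synthetic data generated from value = 3 * size**2.
a, b, r_squared = fit_power_law([1.0, 2.0, 4.0, 8.0], [3.0, 12.0, 48.0, 192.0])
```

On this noise-free input the fit recovers a coefficient of about 3 and an exponent of about 2, with r_squared close to 1.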

changepoint ¤

Change point detection for benchmark time series.

Uses the ruptures library to detect significant changes in metric trends, enabling automated identification of performance regressions or improvements over time. Requires the optional ruptures dependency (uv pip install "calibrax[changepoint]").

RUPTURES_AVAILABLE module-attribute ¤

RUPTURES_AVAILABLE = True

logger module-attribute ¤

logger = getLogger(__name__)

ChangePoint dataclass ¤

ChangePoint(*, index: int, timestamp: datetime | None = None, run_id: str | None = None, magnitude: float = 0.0)

A detected change point in a benchmark trend series.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| index | int | Index in the trend series where the change was detected. |
| timestamp | datetime \| None | Timestamp of the change point, if available. |
| run_id | str \| None | Run ID at the change point, if available. |
| magnitude | float | Absolute difference in mean values before/after the change. |

index instance-attribute ¤

index: int

timestamp class-attribute instance-attribute ¤

timestamp: datetime | None = None

run_id class-attribute instance-attribute ¤

run_id: str | None = None

magnitude class-attribute instance-attribute ¤

magnitude: float = 0.0

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict classmethod ¤

from_dict(data: dict[str, Any]) -> ChangePoint

Deserialize from a dictionary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | dict[str, Any] | Dictionary with change point fields. | required |

Returns:

| Type | Description |
| --- | --- |
| ChangePoint | Reconstructed ChangePoint instance. |

detect_change_points ¤

detect_change_points(trend: TrendSeries, *, method: str = 'pelt', min_size: int = 3, penalty: float | None = None) -> list[ChangePoint]

Detect change points in a benchmark trend series.

Uses the ruptures library for change point detection with configurable algorithms.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| trend | TrendSeries | TrendSeries containing the metric values over time. | required |
| method | str | Detection method ("pelt", "binseg", or "window"). | 'pelt' |
| min_size | int | Minimum segment size between change points. | 3 |
| penalty | float \| None | Penalty value for PELT/BinSeg. Auto-calibrated if None. | None |

Returns:

| Type | Description |
| --- | --- |
| list[ChangePoint] | List of detected ChangePoint instances, ordered by index. |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If ruptures is not installed. |
| ValueError | If the trend has fewer points than min_size. |
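To give a feel for what a change point and its magnitude mean, here is a deliberately simple standard-library sketch. It is not PELT, BinSeg, or the window method, and it does not use ruptures: it brute-forces the single split that minimizes the summed squared error of the two segments, then reports the absolute difference of the segment means as the magnitude. The name `single_change_point` is hypothetical, and unlike a penalized method it always returns a split, even for a flat series.

```python
from statistics import fmean


def _sse(segment: list[float]) -> float:
    """Sum of squared deviations from the segment mean."""
    mean = fmean(segment)
    return sum((v - mean) ** 2 for v in segment)


def single_change_point(series: list[float], min_size: int = 3) -> tuple[int, float]:
    """Return (index, magnitude) of the best single split (hypothetical helper)."""
    if len(series) < 2 * min_size:
        raise ValueError("series too short for two segments of min_size")

    best_index, best_cost = min_size, float("inf")
    # Every candidate split leaves at least min_size points on each side.
    for k in range(min_size, len(series) - min_size + 1):
        cost = _sse(series[:k]) + _sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost

    magnitude = abs(fmean(series[best_index:]) - fmean(series[:best_index]))
    return best_index, magnitude


# Latency hovers near 10 ms, then jumps to about 12 ms at index 4.
latencies = [10.0, 10.2, 9.8, 10.1, 12.0, 12.1, 11.9, 12.2]
index, magnitude = single_change_point(latencies)
```

For this series the split lands at index 4, and the magnitude is the gap between the two segment means (12.05 and 10.025).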

comparison ¤

Multi-configuration benchmark comparison.

Compares benchmark runs across different configurations (frameworks, hardware, etc.) using MetricDef-aware direction logic and aggregate scoring.

pareto ¤

Pareto front identification for multi-objective benchmark analysis.

Identifies Pareto-optimal points for two metrics, respecting MetricDef.direction for dominance checks.

ranking ¤

Ranking and aggregate scoring for benchmark runs.

Ranks entries by metric value and computes weighted aggregate scores across multiple metrics.

regression ¤

Regression detection for benchmark runs.

Compares a current run against a baseline to flag metrics that degraded beyond a specified threshold.

scaling ¤

Scaling law fitting via log-linear regression.

Fits power-law relationships (value = a * size^b) using pure Python log-linear regression. No external dependencies required.
