Comparative Benchmarking¤

External package

This page documents calibrax, the benchmarking library datarax depends on.

Compare performance across configurations or versions.

Overview¤

calibrax provides comparative analysis through Run objects containing multiple Point entries (one per framework/configuration). The rank_table() function ranks entries by any metric with direction-aware sorting, while compare_configurations() produces a comparison report across a labeled mapping of runs.

Quick Start¤

```python
from calibrax.analysis import rank_table, compare_configurations
from calibrax.core import Metric, MetricDef, MetricDirection, Point, Run

# Build a run with results from multiple configurations
run = Run(
    points=(
        Point(
            name="CV-1/baseline",
            scenario="CV-1",
            tags={"framework": "baseline"},
            metrics={"throughput": Metric(value=15000.0)},
        ),
        Point(
            name="CV-1/optimized",
            scenario="CV-1",
            tags={"framework": "optimized"},
            metrics={"throughput": Metric(value=20000.0)},
        ),
    ),
    metric_defs={
        "throughput": MetricDef(
            name="throughput",
            unit="elem/s",
            direction=MetricDirection.HIGHER,
        ),
    },
)

# Rank by throughput (direction-aware: higher is better)
rankings = rank_table(run, "throughput")
for row in rankings:
    marker = " (best)" if row.is_best else ""
    print(f"  {row.rank}. {row.label}: {row.value:.0f} elem/s{marker}")
```

calibrax.analysis ¤

Analysis: regression detection, comparison, ranking, scaling, Pareto fronts.

ComparisonReport `dataclass` ¤

ComparisonReport(*, name: str, labels_compared: tuple[str, ...], metric_comparisons: tuple[MetricComparison, ...], winner_by_metric: dict[str, str], overall_winner: str)

Full comparison across multiple metrics and configurations.

Attributes:

Name	Type	Description
`name`	`str`	Name of this comparison.
`labels_compared`	`tuple[str, ...]`	Configuration labels included.
`metric_comparisons`	`tuple[MetricComparison, ...]`	Per-metric comparison results.
`winner_by_metric`	`dict[str, str]`	Best label for each metric.
`overall_winner`	`str`	Best label by aggregate score.

name `instance-attribute` ¤

name: str

labels_compared `instance-attribute` ¤

labels_compared: tuple[str, ...]

metric_comparisons `instance-attribute` ¤

metric_comparisons: tuple[MetricComparison, ...]

winner_by_metric `instance-attribute` ¤

winner_by_metric: dict[str, str]

overall_winner `instance-attribute` ¤

overall_winner: str

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict `classmethod` ¤

from_dict(data: dict[str, Any]) -> ComparisonReport

Deserialize from a dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary with comparison report fields.	required

Returns:

Type	Description
`ComparisonReport`	Reconstructed ComparisonReport instance.

MetricComparison `dataclass` ¤

MetricComparison(*, metric_name: str, values: dict[str, float], rankings: tuple[RankEntry, ...], best_label: str, improvement_factors: dict[str, float])

Comparison results for a single metric across configurations.

Attributes:

Name	Type	Description
`metric_name`	`str`	Name of the compared metric.
`values`	`dict[str, float]`	Mapping of configuration label to metric value.
`rankings`	`tuple[RankEntry, ...]`	Ranked entries for this metric.
`best_label`	`str`	Label of the best-performing configuration.
`improvement_factors`	`dict[str, float]`	How much better the best is vs each config.

metric_name `instance-attribute` ¤

metric_name: str

values `instance-attribute` ¤

values: dict[str, float]

rankings `instance-attribute` ¤

rankings: tuple[RankEntry, ...]

best_label `instance-attribute` ¤

best_label: str

improvement_factors `instance-attribute` ¤

improvement_factors: dict[str, float]

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

compare_configurations ¤

compare_configurations(runs: dict[str, Run], metrics: Sequence[str] | None = None, *, group_by_tag: str = 'framework') -> ComparisonReport

Compare benchmark runs across different configurations.

Builds a merged Run from all provided runs, using configuration labels as framework tags, then leverages rank_table and aggregate_score.

Parameters:

Name	Type	Description	Default
`runs`	`dict[str, Run]`	Mapping of configuration label to benchmark Run.	required
`metrics`	`Sequence[str] \| None`	Subset of metric names to compare. Defaults to all metrics found across all runs.	`None`
`group_by_tag`	`str`	Tag key used for grouping (default "framework").	`'framework'`

Returns:

Type	Description
`ComparisonReport`	ComparisonReport with per-metric comparisons and overall winner.

Raises:

Type	Description
`ValueError`	If fewer than 2 configurations are provided.

pareto_front ¤

pareto_front(points: list[Point], x_metric: str, y_metric: str, *, metric_defs: dict[str, MetricDef] | None = None) -> list[Point]

Identify Pareto-optimal points for two metrics.

A point is Pareto-optimal if no other point is strictly better on both metrics. Uses MetricDef.direction to determine "better".

Parameters:

Name	Type	Description	Default
`points`	`list[Point]`	List of benchmark points to analyze.	required
`x_metric`	`str`	First metric name.	required
`y_metric`	`str`	Second metric name.	required
`metric_defs`	`dict[str, MetricDef] \| None`	Optional metric definitions for direction. If not provided, defaults to higher-is-better for both metrics.	`None`

Returns:

Type	Description
`list[Point]`	List of Pareto-optimal points (subset of input, same order).
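The dominance rule described above can be sketched in plain Python. This is a standalone illustration, not calibrax's implementation: points are plain dicts rather than Point objects, and the `higher_is_better` mapping is a hypothetical stand-in for MetricDef.direction.

```python
def pareto_front_sketch(points, x_metric, y_metric, higher_is_better=None):
    """Return the points not strictly beaten on both metrics, in input order."""
    higher_is_better = higher_is_better or {}

    def better(a, b, metric):
        # Default to higher-is-better when no direction is given.
        if higher_is_better.get(metric, True):
            return a[metric] > b[metric]
        return a[metric] < b[metric]

    front = []
    for p in points:
        dominated = any(
            better(q, p, x_metric) and better(q, p, y_metric)
            for q in points
            if q is not p
        )
        if not dominated:
            front.append(p)
    return front


points = [
    {"name": "a", "throughput": 100.0, "latency_ms": 10.0},
    {"name": "b", "throughput": 200.0, "latency_ms": 20.0},
    {"name": "c", "throughput": 90.0, "latency_ms": 25.0},
]
front = pareto_front_sketch(
    points,
    "throughput",
    "latency_ms",
    higher_is_better={"throughput": True, "latency_ms": False},
)
front_names = [p["name"] for p in front]
```

Here "c" is dropped because "a" has both higher throughput and lower latency, while "a" and "b" each win on one metric and both stay on the front.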

aggregate_score ¤

aggregate_score(run: Run, weights: dict[str, float]) -> dict[str, float]

Weighted aggregate score across metrics.

Normalizes each metric to [0, 1] range (best = 1.0, worst = 0.0), then computes a weighted sum. Uses MetricDef.direction for normalization.

Parameters:

Name	Type	Description	Default
`run`	`Run`	Benchmark run with points and metric_defs.	required
`weights`	`dict[str, float]`	{metric_name: weight} — weights are normalized to sum to 1.0.	required

Returns:

Type	Description
`dict[str, float]`	{framework_label: aggregate_score} where score is in [0, 1].
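The normalize-then-weight scheme can be illustrated without calibrax. The sketch below assumes min-max normalization per metric and takes plain dicts as input; the function name, the `higher_is_better` mapping, and the handling of a metric where all values tie are assumptions made for illustration.

```python
def aggregate_score_sketch(values, weights, higher_is_better):
    """values: {label: {metric: value}}. Returns {label: score in [0, 1]}."""
    total_weight = sum(weights.values())
    scores = {label: 0.0 for label in values}
    for metric, weight in weights.items():
        column = [v[metric] for v in values.values()]
        lo, hi = min(column), max(column)
        span = hi - lo
        for label, v in values.items():
            if span == 0:
                norm = 1.0  # assumption: when every value ties, all count as best
            elif higher_is_better[metric]:
                norm = (v[metric] - lo) / span
            else:
                norm = (hi - v[metric]) / span
            # Dividing by total_weight normalizes the weights to sum to 1.0.
            scores[label] += (weight / total_weight) * norm
    return scores


scores = aggregate_score_sketch(
    values={
        "baseline": {"throughput": 15000.0, "latency_ms": 8.0},
        "optimized": {"throughput": 20000.0, "latency_ms": 10.0},
    },
    weights={"throughput": 3.0, "latency_ms": 1.0},
    higher_is_better={"throughput": True, "latency_ms": False},
)
```

With these inputs "optimized" wins throughput (weight 0.75 after normalization) and "baseline" wins latency (weight 0.25), so the scores are 0.75 and 0.25 respectively.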

rank_table ¤

rank_table(run: Run, metric: str, group_by_tag: str = 'framework') -> list[RankEntry]

Rank entries by metric value, grouped by a tag.

Uses MetricDef.direction for determining best-is-highest vs best-is-lowest.

Parameters:

Name	Type	Description	Default
`run`	`Run`	Benchmark run with points and metric_defs.	required
`metric`	`str`	Metric name to rank by.	required
`group_by_tag`	`str`	Tag key used to group points (default "framework").	`'framework'`

Returns:

Type	Description
`list[RankEntry]`	Sorted list of RankEntry, rank 1 = best.
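Direction-aware ranking amounts to sorting in descending order for higher-is-better metrics and ascending order otherwise. The sketch below shows this on a plain mapping of label to value, so the grouping by tag is assumed to have happened already. `RankRow` is a hypothetical stand-in for RankEntry that carries only the fields used in the Quick Start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankRow:
    """Hypothetical stand-in for RankEntry."""

    rank: int
    label: str
    value: float
    is_best: bool


def rank_sketch(values, higher_is_better=True):
    """Sort {label: value} so that rank 1 is the best entry."""
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    return [
        RankRow(rank=i, label=label, value=value, is_best=(i == 1))
        for i, (label, value) in enumerate(ordered, start=1)
    ]


by_throughput = rank_sketch({"baseline": 15000.0, "optimized": 20000.0})
by_latency = rank_sketch(
    {"baseline": 8.0, "optimized": 10.0}, higher_is_better=False
)
```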

detect_regressions ¤

detect_regressions(run: Run, baseline: Run, threshold: float = 0.05) -> list[Regression]

Flag metrics that degraded beyond threshold.

Uses MetricDef.direction: 'higher' metrics regress when they decrease, 'lower' metrics regress when they increase. 'info' metrics are skipped.

Parameters:

Name	Type	Description	Default
`run`	`Run`	Current benchmark run.	required
`baseline`	`Run`	Baseline run to compare against.	required
`threshold`	`float`	Relative change threshold (e.g. 0.05 = 5%).	`0.05`

Returns:

Type	Description
`list[Regression]`	List of detected regressions.
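The direction rule can be made concrete with a small standalone sketch. It works on plain dicts of metric values, uses the strings 'higher', 'lower', and 'info' from the description above, and assumes that a metric regresses only when its relative change is strictly greater than the threshold. Whether calibrax treats the boundary as inclusive is not stated here, and the function name and return shape are illustrative.

```python
def regressions_sketch(current, baseline, directions, threshold=0.05):
    """Return [(metric, relative_change)] for metrics that got worse."""
    flagged = []
    for metric, base in baseline.items():
        direction = directions.get(metric, "info")
        # Skip informational metrics, missing values, and zero baselines.
        if direction == "info" or metric not in current or base == 0:
            continue
        change = (current[metric] - base) / abs(base)
        # A drop is bad for 'higher' metrics, a rise is bad for 'lower' ones.
        degradation = -change if direction == "higher" else change
        if degradation > threshold:
            flagged.append((metric, change))
    return flagged


flagged = regressions_sketch(
    current={"throughput": 18000.0, "latency_ms": 10.2, "epochs": 9.0},
    baseline={"throughput": 20000.0, "latency_ms": 10.0, "epochs": 5.0},
    directions={"throughput": "higher", "latency_ms": "lower", "epochs": "info"},
)
```

Throughput fell by 10% and is flagged, latency rose by only 2% and stays under the 5% threshold, and the 'info' metric is ignored regardless of its change.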

scaling_fit ¤

scaling_fit(sizes: list[float], values: list[float]) -> ScalingLaw

Fit power-law: value = a * size^b using log-linear regression.

Takes log of both sides: log(value) = log(a) + b * log(size), then fits a linear regression. Pure Python (no scipy/numpy needed).

Parameters:

Name	Type	Description	Default
`sizes`	`list[float]`	Input sizes (e.g., batch sizes, dataset sizes).	required
`values`	`list[float]`	Measured values (e.g., throughput, latency).	required

Returns:

Type	Description
`ScalingLaw`	ScalingLaw with coefficient (a), exponent (b), r_squared, and complexity classification string.

Raises:

Type	Description
`ValueError`	If inputs are empty or have different lengths.
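The log-linear fit described above is ordinary least squares on (log(size), log(value)) pairs. The following pure-Python sketch shows the arithmetic. It returns a plain tuple instead of a ScalingLaw, omits the complexity classification, and its extra check for identical sizes is an assumption rather than documented calibrax behavior.

```python
import math


def power_law_fit_sketch(sizes, values):
    """Fit value = a * size**b and return (a, b, r_squared)."""
    if not sizes or len(sizes) != len(values):
        raise ValueError("sizes and values must be non-empty and equal in length")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope in log space is the exponent
    log_a = mean_y - b * mean_x  # intercept in log space is log(a)
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r_squared


# Synthetic data that follows value = 3 * size**2 exactly.
a, b, r_squared = power_law_fit_sketch([1, 2, 4, 8], [3, 12, 48, 192])
```

Because the fit happens in log space, all sizes and values must be strictly positive, and r_squared describes the quality of the fit to the logged data rather than to the raw values.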

changepoint ¤

Change point detection for benchmark time series.

Uses the ruptures library to detect significant changes in metric trends, enabling automated identification of performance regressions or improvements over time. Requires the optional ruptures dependency (uv pip install "calibrax[changepoint]").

RUPTURES_AVAILABLE `module-attribute` ¤

RUPTURES_AVAILABLE = True

logger `module-attribute` ¤

logger = logging.getLogger(__name__)

ChangePoint `dataclass` ¤

ChangePoint(*, index: int, timestamp: datetime | None = None, run_id: str | None = None, magnitude: float = 0.0)

A detected change point in a benchmark trend series.

Attributes:

Name	Type	Description
`index`	`int`	Index in the trend series where the change was detected.
`timestamp`	`datetime \| None`	Timestamp of the change point, if available.
`run_id`	`str \| None`	Run ID at the change point, if available.
`magnitude`	`float`	Absolute difference in mean values before/after the change.

index `instance-attribute` ¤

index: int

timestamp `class-attribute` `instance-attribute` ¤

timestamp: datetime | None = None

run_id `class-attribute` `instance-attribute` ¤

run_id: str | None = None

magnitude `class-attribute` `instance-attribute` ¤

magnitude: float = 0.0

to_dict ¤

to_dict() -> dict[str, Any]

Serialize to a JSON-compatible dictionary.

from_dict `classmethod` ¤

from_dict(data: dict[str, Any]) -> ChangePoint

Deserialize from a dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary with change point fields.	required

Returns:

Type	Description
`ChangePoint`	Reconstructed ChangePoint instance.

detect_change_points ¤

detect_change_points(trend: TrendSeries, *, method: str = 'pelt', min_size: int = 3, penalty: float | None = None) -> list[ChangePoint]

Detect change points in a benchmark trend series.

Uses the ruptures library for change point detection with configurable algorithms.

Parameters:

Name	Type	Description	Default
`trend`	`TrendSeries`	TrendSeries containing the metric values over time.	required
`method`	`str`	Detection method ("pelt", "binseg", or "window").	`'pelt'`
`min_size`	`int`	Minimum segment size between change points.	`3`
`penalty`	`float \| None`	Penalty value for PELT/BinSeg. Auto-calibrated if None.	`None`

Returns:

Type	Description
`list[ChangePoint]`	List of detected ChangePoint instances, ordered by index.

Raises:

Type	Description
`ImportError`	If ruptures is not installed.
`ValueError`	If the trend has fewer points than min_size.
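To give a feel for what a change point and its magnitude mean, here is a deliberately simple pure-Python sketch. It is not PELT, BinSeg, or anything from ruptures: it finds a single split that minimizes the within-segment squared error, and it always returns a split even when the series is flat. The function name and its stricter length requirement (two segments of min_size each) are assumptions made for this illustration.

```python
def single_change_point_sketch(values, min_size=3):
    """Return (index, magnitude) of the best single split of the series."""
    n = len(values)
    if n < 2 * min_size:
        raise ValueError("need at least 2 * min_size values")

    def sse(segment):
        # Sum of squared deviations from the segment mean.
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    # Try every split that leaves min_size points on each side.
    best_index = min(
        range(min_size, n - min_size + 1),
        key=lambda i: sse(values[:i]) + sse(values[i:]),
    )
    before, after = values[:best_index], values[best_index:]
    # Magnitude: absolute difference between the segment means.
    magnitude = abs(sum(after) / len(after) - sum(before) / len(before))
    return best_index, magnitude


series = [10.0, 10.2, 9.8, 10.0, 14.0, 14.2, 13.8, 14.0]
index, magnitude = single_change_point_sketch(series)
```

The series jumps from a mean of about 10 to a mean of about 14 at position 4, so the sketch reports index 4 with a magnitude close to 4.0.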

comparison ¤

Multi-configuration benchmark comparison.

Compares benchmark runs across different configurations (frameworks, hardware, etc.) using MetricDef-aware direction logic and aggregate scoring.

This module defines MetricComparison, ComparisonReport, and compare_configurations. All three are documented in full under calibrax.analysis above.

pareto ¤

Pareto front identification for multi-objective benchmark analysis.

Identifies Pareto-optimal points for two metrics, respecting MetricDef.direction for dominance checks.

This module defines pareto_front, which is documented in full under calibrax.analysis above.

ranking ¤

Ranking and aggregate scoring for benchmark runs.

Ranks entries by metric value and computes weighted aggregate scores across multiple metrics.

This module defines rank_table and aggregate_score. Both are documented in full under calibrax.analysis above.

regression ¤

Regression detection for benchmark runs.

Compares a current run against a baseline to flag metrics that degraded beyond a specified threshold.

This module defines detect_regressions, which is documented in full under calibrax.analysis above.

scaling ¤

Scaling law fitting via log-linear regression.

Fits power-law relationships (value = a * size^b) using pure Python log-linear regression. No external dependencies required.

This module defines scaling_fit, which is documented in full under calibrax.analysis above.
