
Dashboard & calibrax

calibrax is a standalone Python library that powers benchmark analysis and the W&B dashboard. It provides direction-aware metrics, regression detection, statistical analysis, and automated export to Weights & Biases.

graph LR
    R[Benchmark Runner] -->|JSON| S[calibrax Store]
    S -->|Analysis| A[Regressions / Rankings / CIs]
    S -->|Export| W[W&B Dashboard]
    A -->|CI Gate| G[Pass / Fail]

Why W&B?

Rather than building a custom visualization frontend, we use W&B as the dashboard UI. It provides interactive charts, run history, comparison tables, filtering, and team collaboration — all well-established and familiar to ML practitioners. W&B is free for open-source projects.

What calibrax adds on top of W&B:

  • Direction-aware metric definitions (higher throughput is better, lower latency is better)
  • Best-value highlighting with correct direction semantics
  • Regression detection with CI gate (calibrax check)
  • Bootstrap confidence intervals (pure Python, no scipy)
  • Framework ranking tables
  • JSON-per-run local storage (works fully offline)
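
To illustrate what a pure-Python bootstrap confidence interval can look like, here is a minimal sketch using only the standard library. It is not calibrax's implementation: the function name `bootstrap_ci`, its parameters, the percentile method, and the sample values are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Illustrative throughput samples (elem/s); not real benchmark data.
throughput_samples = [20158.0, 20071.0, 19890.0, 20240.0, 20012.0]
low, high = bootstrap_ci(throughput_samples)
print(f"mean={statistics.fmean(throughput_samples):.1f} CI=({low:.1f}, {high:.1f})")
```

The percentile method is the simplest bootstrap variant and needs nothing beyond `random` and `statistics`, which is why it suits a dependency-free setting.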

Installation

Install calibrax directly from GitHub:

# Core (analysis + CLI, no W&B dependency)
uv pip install "calibrax @ git+https://github.com/avitai/calibrax.git"

# With W&B export support
uv pip install "calibrax[wandb] @ git+https://github.com/avitai/calibrax.git"

Data Model

calibrax's data model captures the semantics that W&B doesn't track natively:

| Concept | Description |
| --- | --- |
| MetricDef | How to interpret a metric: name, unit, direction (higher/lower/info), group, priority |
| Metric | A single value with optional CI bounds and raw samples |
| Point | One benchmark + one configuration (e.g., "CV-1/small" for Datarax) |
| Run | One execution of a benchmark suite: a collection of Points |

All metric definitions live in a config.json file inside the local data store:

{
  "wandb_project": "datarax-benchmarks",
  "metric_defs": {
    "throughput": {
      "unit": "elem/s",
      "direction": "higher",
      "group": "Throughput",
      "priority": "primary"
    },
    "latency_p50": {
      "unit": "ms",
      "direction": "lower",
      "group": "Latency",
      "priority": "primary"
    }
  }
}
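
As a rough sketch of how a script could read these definitions without going through calibrax, the snippet below parses the same structure with the standard `json` module. The helper `load_metric_defs` and the validation it performs are assumptions for illustration, not part of the calibrax API; the three allowed direction values are taken from the Data Model table above.

```python
import json

# Same structure as the config.json shown above, inlined so the sketch runs as-is.
CONFIG_TEXT = """
{
  "wandb_project": "datarax-benchmarks",
  "metric_defs": {
    "throughput": {"unit": "elem/s", "direction": "higher",
                   "group": "Throughput", "priority": "primary"},
    "latency_p50": {"unit": "ms", "direction": "lower",
                    "group": "Latency", "priority": "primary"}
  }
}
"""

VALID_DIRECTIONS = {"higher", "lower", "info"}


def load_metric_defs(text):
    """Parse config text and return {metric_name: definition}, checking direction."""
    config = json.loads(text)
    defs = config.get("metric_defs", {})
    for name, definition in defs.items():
        direction = definition.get("direction")
        if direction not in VALID_DIRECTIONS:
            raise ValueError(f"{name}: unknown direction {direction!r}")
    return defs


metric_defs = load_metric_defs(CONFIG_TEXT)
for name, definition in metric_defs.items():
    print(f"{name}: {definition['direction']} is better ({definition['unit']})")
```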

CLI Reference

calibrax ingest

Import benchmark results from a JSON file into the local store.

calibrax ingest --data benchmark-data/ --input results.json

calibrax export

Export the latest run to W&B. Reads project/entity from config.json.

export WANDB_API_KEY="..."
calibrax export --data benchmark-data/

Override project or entity:

calibrax export --data benchmark-data/ --project my-project --entity my-team

calibrax check

Run regression detection against the baseline. Exits with code 1 if any metric regresses by more than the threshold. This is the CI gate: it runs fully offline and needs no W&B access.

calibrax check --data benchmark-data/ --threshold 0.05

Example output:

FAIL: 2 regression(s) detected:
  ↓ CV-1/small throughput: 20158.00 → 18000.00 (-10.7%)
  ↑ CV-1/small latency_p50: 12.00 → 15.00 (+25.0%)
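
The percentages in this output are consistent with a plain relative change against the baseline value. A short sketch of that arithmetic follows; `percent_change` is a hypothetical helper written for this example, not a calibrax function.

```python
def percent_change(baseline, current):
    """Relative change of `current` against `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


print(f"{percent_change(20158.0, 18000.0):+.1f}%")  # -10.7%
print(f"{percent_change(12.0, 15.0):+.1f}%")        # +25.0%
```

Both changes exceed the 0.05 (5%) threshold in the worse direction for their metric, so both are reported.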

calibrax baseline

Set a run as the regression baseline.

calibrax baseline --data benchmark-data/ --run latest

calibrax summary

Print a human-readable run summary to the terminal.

calibrax summary --data benchmark-data/

calibrax trend

Show how a metric changes across runs for a specific point and framework.

calibrax trend --data benchmark-data/ --metric throughput --point CV-1/small --framework Datarax
calibrax trend --data benchmark-data/ --metric throughput --point CV-1/small --framework Datarax --n-runs 10

Output shows timestamp, value, optional CI bounds, and commit hash for each run.


Python API

from calibrax.analysis import detect_regressions, rank_table
from calibrax.core import (
    Metric,
    MetricDef,
    MetricDirection,
    MetricPriority,
    Point,
    Run,
)
from calibrax.storage import Store

# Initialize store (JSON backend)
store = Store("benchmark-data/")

# Create and save a run
run = Run(
    points=(
        Point(
            name="CV-1/small",
            scenario="CV-1",
            tags={"framework": "Datarax"},
            metrics={"throughput": Metric(value=20158.0)},
        ),
        Point(
            name="CV-1/small",
            scenario="CV-1",
            tags={"framework": "Grain"},
            metrics={"throughput": Metric(value=20071.0)},
        ),
    ),
    metric_defs={
        "throughput": MetricDef(
            name="throughput",
            unit="elem/s",
            direction=MetricDirection.HIGHER,
            group="Throughput",
            priority=MetricPriority.PRIMARY,
        ),
    },
)
store.save(run)

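# Run analysis (all offline, no W&B needed)
baseline = store.get_baseline()
if baseline:
    # Regression detection needs a baseline to compare against
    regressions = detect_regressions(run, baseline)

# Ranking only uses the current run, so it does not depend on a baseline
ranks = rank_table(run, "throughput", group_by_tag="framework")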

Export to W&B

from calibrax.exporters.wandb import WandBExporter

# `run` and `baseline` are the objects created in the Python API example above
exporter = WandBExporter(project="datarax-benchmarks")
url = exporter.export_run(run)
exporter.export_analysis(run, baseline)
print(f"Dashboard: {url}")

W&B Authentication

Credentials are read exclusively from environment variables — never stored in config files or committed to git.

Local Development

export WANDB_API_KEY="your-key-here"
calibrax export --data benchmark-data/

CI (GitHub Actions)

Add WANDB_API_KEY as a repository secret, then reference it in workflows:

env:
  WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}

Offline Mode

For testing or environments without internet access:

export WANDB_MODE=offline
calibrax export --data benchmark-data/

Auth validation

calibrax validates authentication before calling wandb.init(). If WANDB_API_KEY is not set and offline mode is not enabled, the CLI prints an actionable error message instead of failing inside wandb.
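
A pre-flight check of this kind could be sketched as follows. This is an illustration of the idea, not calibrax's code: the function name `check_wandb_auth` and the wording of the message are assumptions. Only the two environment variables named above are consulted.

```python
import os


def check_wandb_auth(env=None):
    """Return None if an export can proceed, else an actionable error message."""
    env = os.environ if env is None else env
    if env.get("WANDB_MODE", "").lower() == "offline":
        return None  # offline runs need no credentials
    if env.get("WANDB_API_KEY"):
        return None  # a non-empty key is present
    return (
        "WANDB_API_KEY is not set. Export it, or set WANDB_MODE=offline "
        "to record the run locally."
    )


print(check_wandb_auth({"WANDB_API_KEY": "your-key-here"}))
print(check_wandb_auth({"WANDB_MODE": "offline"}))
print(check_wandb_auth({}))
```

Passing the environment as an argument keeps the check easy to unit-test without touching the real process environment.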


CI Integration

Two GitHub Actions workflows automate benchmarking:

Performance Gate (per-PR)

File: .github/workflows/benchmark-gate.yml

Runs on every PR that touches src/datarax/, benchmarks/, or pyproject.toml. Executes Tier 1 benchmarks and runs calibrax check for regression detection.

graph TD
    PR[Pull Request] --> CI[benchmark-gate.yml]
    CI --> T1[Run Tier 1 benchmarks]
    T1 --> CK[calibrax check --threshold 0.05]
    CK -->|No regressions| PASS[PR passes]
    CK -->|Regressions found| WARN[Warning logged]

Non-blocking gate

The regression check is currently non-blocking (continue-on-error: true in the workflow). Regressions are logged as warnings but will not prevent PR merge. This will change once a stable baseline is established.

Nightly Comparative (scheduled)

File: .github/workflows/benchmark-nightly.yml

Runs the full comparative suite daily at 2 AM UTC. Exports results to W&B when WANDB_API_KEY is available.

graph TD
    CRON[2 AM UTC] --> N[benchmark-nightly.yml]
    N --> FULL[Run full comparative suite]
    FULL --> EX[calibrax export to W&B]
    FULL --> ART[Upload artifacts - 90 days]

GPU and TPU jobs are defined but commented out until cloud credits are available.


Local Data Store

All results are saved locally as JSON regardless of W&B status. The default store directory is benchmark-data/ (gitignored — not committed to version control). The directory structure is:

<store-dir>/
├── runs/
│   ├── <timestamp>_<hash>.json
│   └── ...
├── baselines/
│   └── main.json
└── config.json

This ensures:

  • Offline access: analysis and regression detection work without internet
  • No lock-in: data is portable and not dependent on W&B
  • CI independence: the regression gate uses only local JSON files
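
Because the store is plain JSON on disk, it can be inspected with the standard library alone. The sketch below lists and loads the files under `runs/`. The helper `load_runs`, the file names, and the payloads are placeholders invented for this example, and the sketch assumes that sorting by file name gives chronological order, which depends on the timestamp format calibrax actually uses.

```python
import json
import tempfile
from pathlib import Path


def load_runs(store_dir):
    """Load every JSON file under <store_dir>/runs/, sorted by file name."""
    runs_dir = Path(store_dir) / "runs"
    return [
        (path.name, json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(runs_dir.glob("*.json"))
    ]


# Build a throwaway store with placeholder payloads to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    runs_dir = Path(tmp) / "runs"
    runs_dir.mkdir()
    (runs_dir / "20240102T000000_bbbb.json").write_text(
        '{"note": "second"}', encoding="utf-8"
    )
    (runs_dir / "20240101T000000_aaaa.json").write_text(
        '{"note": "first"}', encoding="utf-8"
    )
    loaded = load_runs(tmp)

print([name for name, _ in loaded])
```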

W&B Export Details

When calibrax export runs, it creates:

| W&B Artifact | Content |
| --- | --- |
| wandb.config | Environment fingerprint (CPU, GPU, OS, Python version) |
| Summary metrics | Slash-grouped: Throughput/throughput/Datarax, Latency/latency_p50/Grain |
| comparison table | wandb.Table with " *" suffix on best values |
| comparison_styled | wandb.Html with <b> tags on best values |
| rankings/* tables | Per-metric ranking with rank, value, is_best, delta % |
| Regression alerts | wandb.Alert for each detected regression |
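
To make the ranking columns concrete, here is a hedged sketch of a direction-aware ranking that yields rank, value, is_best, and a delta in percent. The function `rank_values` is hypothetical, and computing the delta relative to the best value is an assumption; calibrax's `rank_table` may define it differently.

```python
def rank_values(values, direction):
    """Rank {label: value}; rows are (rank, label, value, is_best, delta_pct)."""
    if direction not in ("higher", "lower"):
        raise ValueError("ranking needs direction 'higher' or 'lower'")
    ordered = sorted(
        values.items(), key=lambda kv: kv[1], reverse=(direction == "higher")
    )
    best = ordered[0][1]
    rows = []
    for rank, (label, value) in enumerate(ordered, start=1):
        # Delta relative to the best value (an assumption of this sketch).
        delta_pct = (value - best) / best * 100.0 if best else 0.0
        rows.append((rank, label, value, rank == 1, delta_pct))
    return rows


rows = rank_values({"Datarax": 20158.0, "Grain": 20071.0}, "higher")
for rank, label, value, is_best, delta_pct in rows:
    marker = " *" if is_best else ""  # same suffix the comparison table uses
    print(f"{rank}. {label}: {value:.0f}{marker} ({delta_pct:+.2f}%)")
```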

Slash notation

W&B automatically groups metrics by slash prefix. Throughput/throughput/Datarax appears under a "Throughput" panel group. This keeps the dashboard organized without manual configuration.
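
A small sketch of how such keys could be assembled from per-framework results. The helpers `summary_key` and `flatten_summary` and the input layout are assumptions for illustration; only the `group/metric/framework` key shape comes from the table above.

```python
def summary_key(group, metric, framework):
    """Build a slash-grouped key such as 'Throughput/throughput/Datarax'."""
    return "/".join((group, metric, framework))


def flatten_summary(results, groups):
    """Flatten {framework: {metric: value}} into slash-grouped summary keys."""
    return {
        summary_key(groups[metric], metric, framework): value
        for framework, metrics in results.items()
        for metric, value in metrics.items()
    }


summary = flatten_summary(
    {"Datarax": {"throughput": 20158.0}, "Grain": {"throughput": 20071.0}},
    {"throughput": "Throughput"},
)
print(summary)
```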


Regression Detection

calibrax's regression detection is direction-aware: a throughput drop is a regression, but a latency drop is an improvement.

| Direction | Regression when | Improvement when |
| --- | --- | --- |
| higher | Value decreases beyond threshold | Value increases |
| lower | Value increases beyond threshold | Value decreases |
| info | Never (skipped) | Never (skipped) |

The default threshold is 5%: a metric that moves in its worse direction by more than 5% relative to the baseline is flagged as a regression.
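
The direction rules above can be sketched as a single classification function. This is an illustrative reading of the table, not calibrax's `detect_regressions`: the name `classify_change`, the return labels, and the handling of a zero baseline are assumptions.

```python
def classify_change(baseline, current, direction, threshold=0.05):
    """Return 'regression', 'improvement', 'unchanged', or 'skipped'."""
    if direction == "info":
        return "skipped"  # informational metrics are never judged
    if baseline == 0:
        return "skipped"  # relative change is undefined (sketch-only choice)
    change = (current - baseline) / abs(baseline)
    # Flip the sign for "lower is better" so that positive always means better.
    signed = change if direction == "higher" else -change
    if signed < -threshold:
        return "regression"
    if signed > 0:
        return "improvement"
    return "unchanged"  # worse, but within the tolerance


print(classify_change(20158.0, 18000.0, "higher"))  # regression
print(classify_change(12.0, 15.0, "lower"))         # regression
print(classify_change(12.0, 10.0, "lower"))         # improvement
print(classify_change(20158.0, 19800.0, "higher"))  # unchanged
```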

Points are matched between runs using a composite key of (name, tags), ensuring that "CV-1/small for Datarax" is compared against the correct baseline even when multiple frameworks share the same point name.
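
A minimal sketch of composite-key matching, using plain dictionaries as stand-ins for calibrax's Point objects. The helpers `point_key` and `match_points` and the current-run values are hypothetical; the sketch only shows why including tags in the key keeps frameworks apart.

```python
def point_key(point):
    """Hashable composite key built from a point's name and tags."""
    return (point["name"], frozenset(point["tags"].items()))


def match_points(current, baseline):
    """Pair each current point with the baseline point sharing its key."""
    baseline_by_key = {point_key(p): p for p in baseline}
    return [
        (p, baseline_by_key[point_key(p)])
        for p in current
        if point_key(p) in baseline_by_key
    ]


baseline_points = [
    {"name": "CV-1/small", "tags": {"framework": "Datarax"}, "throughput": 20158.0},
    {"name": "CV-1/small", "tags": {"framework": "Grain"}, "throughput": 20071.0},
]
# Illustrative current-run values, listed in a different order on purpose.
current_points = [
    {"name": "CV-1/small", "tags": {"framework": "Grain"}, "throughput": 20100.0},
    {"name": "CV-1/small", "tags": {"framework": "Datarax"}, "throughput": 18000.0},
]

for cur, base in match_points(current_points, baseline_points):
    framework = cur["tags"]["framework"]
    print(f"{cur['name']} [{framework}]: {base['throughput']} -> {cur['throughput']}")
```

Matching on name alone would collapse both frameworks into one entry, so one of them would be compared against the wrong baseline.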