Dashboard & calibrax¤
calibrax is a standalone Python library that powers benchmark analysis and the W&B dashboard. It provides direction-aware metrics, regression detection, statistical analysis, and automated export to Weights & Biases.
graph LR
R[Benchmark Runner] -->|JSON| S[calibrax Store]
S -->|Analysis| A[Regressions / Rankings / CIs]
S -->|Export| W[W&B Dashboard]
A -->|CI Gate| G[Pass / Fail]
Why W&B?¤
Rather than building a custom visualization frontend, we use W&B as the dashboard UI. It provides interactive charts, run history, comparison tables, filtering, and team collaboration — all well-established and familiar to ML practitioners. W&B is free for open-source projects.
What calibrax adds on top of W&B:
- Direction-aware metric definitions (
higherthroughput is better,lowerlatency is better) - Best-value highlighting with correct direction semantics
- Regression detection with CI gate (
calibrax check) - Bootstrap confidence intervals (pure Python, no scipy)
- Framework ranking tables
- JSON-per-run local storage (works fully offline)
Installation¤
Install calibrax directly from GitHub:
# Core (analysis + CLI, no W&B dependency)
uv pip install "calibrax @ git+https://github.com/avitai/calibrax.git"
# With W&B export support
uv pip install "calibrax[wandb] @ git+https://github.com/avitai/calibrax.git"
Data Model¤
calibrax's data model captures the semantics that W&B doesn't track natively:
| Concept | Description |
|---|---|
MetricDef |
How to interpret a metric — name, unit, direction (higher/lower/info), group, priority |
Metric |
A single value with optional CI bounds and raw samples |
Point |
One benchmark + one configuration (e.g., "CV-1/small" for Datarax) |
Run |
One execution of a benchmark suite — a collection of Points |
All metric definitions live in a config.json file inside the local data store:
{
"wandb_project": "datarax-benchmarks",
"metric_defs": {
"throughput": {
"unit": "elem/s",
"direction": "higher",
"group": "Throughput",
"priority": "primary"
},
"latency_p50": {
"unit": "ms",
"direction": "lower",
"group": "Latency",
"priority": "primary"
}
}
}
CLI Reference¤
calibrax ingest¤
Import benchmark results from a JSON file into the local store.
calibrax export¤
Export the latest run to W&B. Reads project/entity from config.json.
Override project or entity:
calibrax check¤
Run regression detection against baseline. Exits with code 1 if regressions exceed the threshold. This is the CI gate — fully offline, no W&B needed.
Example output:
FAIL: 2 regression(s) detected:
↓ CV-1/small throughput: 20158.00 → 18000.00 (-10.7%)
↑ CV-1/small latency_p50: 12.00 → 15.00 (+25.0%)
calibrax baseline¤
Set a run as the regression baseline.
calibrax summary¤
Print a human-readable run summary to the terminal.
calibrax trend¤
Show metric trend across runs. Tracks how a specific metric for a specific framework/point changes over time.
calibrax trend --data benchmark-data/ --metric throughput --point CV-1/small --framework Datarax
calibrax trend --data benchmark-data/ --metric throughput --point CV-1/small --framework Datarax --n-runs 10
Output shows timestamp, value, optional CI bounds, and commit hash for each run.
Python API¤
from calibrax.analysis import detect_regressions, rank_table
from calibrax.core import (
Metric,
MetricDef,
MetricDirection,
MetricPriority,
Point,
Run,
)
from calibrax.storage import Store
# Initialize store (JSON backend)
store = Store("benchmark-data/")
# Create and save a run
run = Run(
points=(
Point(
name="CV-1/small",
scenario="CV-1",
tags={"framework": "Datarax"},
metrics={"throughput": Metric(value=20158.0)},
),
Point(
name="CV-1/small",
scenario="CV-1",
tags={"framework": "Grain"},
metrics={"throughput": Metric(value=20071.0)},
),
),
metric_defs={
"throughput": MetricDef(
name="throughput",
unit="elem/s",
direction=MetricDirection.HIGHER,
group="Throughput",
priority=MetricPriority.PRIMARY,
),
},
)
store.save(run)
# Run analysis (all offline, no W&B needed)
baseline = store.get_baseline()
if baseline:
regressions = detect_regressions(run, baseline)
ranks = rank_table(run, "throughput", group_by_tag="framework")
Export to W&B¤
from calibrax.exporters.wandb import WandBExporter
exporter = WandBExporter(project="datarax-benchmarks")
url = exporter.export_run(run)
exporter.export_analysis(run, baseline)
print(f"Dashboard: {url}")
W&B Authentication¤
Credentials are read exclusively from environment variables — never stored in config files or committed to git.
Local Development¤
CI (GitHub Actions)¤
Add WANDB_API_KEY as a repository secret, then reference it in workflows:
Offline Mode¤
For testing or environments without internet access:
Auth validation
calibrax validates authentication before calling wandb.init(). If WANDB_API_KEY is not set and offline mode is not enabled, the CLI prints an actionable error message instead of failing inside wandb.
CI Integration¤
Two GitHub Actions workflows automate benchmarking:
Performance Gate (per-PR)¤
File: .github/workflows/benchmark-gate.yml
Runs on every PR that touches src/datarax/, benchmarks/, or pyproject.toml. Executes Tier 1 benchmarks and runs calibrax check for regression detection.
graph TD
PR[Pull Request] --> CI[benchmark-gate.yml]
CI --> T1[Run Tier 1 benchmarks]
T1 --> CK[calibrax check --threshold 0.05]
CK -->|No regressions| PASS[PR passes]
CK -->|Regressions found| WARN[Warning logged]
Non-blocking gate
The regression check is currently non-blocking (continue-on-error: true in the workflow). Regressions are logged as warnings but will not prevent PR merge. This will change once a stable baseline is established.
Nightly Comparative (scheduled)¤
File: .github/workflows/benchmark-nightly.yml
Runs the full comparative suite daily at 2 AM UTC. Exports results to W&B when WANDB_API_KEY is available.
graph TD
CRON[2 AM UTC] --> N[benchmark-nightly.yml]
N --> FULL[Run full comparative suite]
FULL --> EX[calibrax export to W&B]
FULL --> ART[Upload artifacts - 90 days]
GPU and TPU jobs are defined but commented out until cloud credits are available.
Local Data Store¤
All results are saved locally as JSON regardless of W&B status. The default store directory is benchmark-data/ (gitignored — not committed to version control). The directory structure is:
<store-dir>/
├── runs/
│ ├── <timestamp>_<hash>.json
│ └── ...
├── baselines/
│ └── main.json
└── config.json
This ensures:
- Offline access: analysis and regression detection work without internet
- No lock-in: data is portable and not dependent on W&B
- CI independence: the regression gate uses only local JSON files
W&B Export Details¤
When calibrax export runs, it creates:
| W&B Artifact | Content |
|---|---|
wandb.config |
Environment fingerprint (CPU, GPU, OS, Python version) |
| Summary metrics | Slash-grouped: Throughput/throughput/Datarax, Latency/latency_p50/Grain |
comparison table |
wandb.Table with " *" suffix on best values |
comparison_styled |
wandb.Html with <b> tags on best values |
rankings/* tables |
Per-metric ranking with rank, value, is_best, delta % |
| Regression alerts | wandb.Alert for each detected regression |
Slash notation
W&B automatically groups metrics by slash prefix. Throughput/throughput/Datarax appears under a "Throughput" panel group. This keeps the dashboard organized without manual configuration.
Regression Detection¤
calibrax's regression detection is direction-aware: a throughput drop is a regression, but a latency drop is an improvement.
| Direction | Regression when | Improvement when |
|---|---|---|
higher |
Value decreases beyond threshold | Value increases |
lower |
Value increases beyond threshold | Value decreases |
info |
Never (skipped) | Never (skipped) |
The default threshold is 5% — any change beyond this triggers a regression flag.
Points are matched between runs using a composite key of (name, tags), ensuring that "CV-1/small for Datarax" is compared against the correct baseline even when multiple frameworks share the same point name.