Benchmark Methodology

This page summarizes the measurement methodology used in the Datarax benchmark suite.

Timing Protocol

All benchmarks use time.perf_counter() for wall-clock measurement, with optional GPU synchronization barriers for accurate accelerator timing.

sequenceDiagram
    participant Runner
    participant Adapter
    participant Timer

    loop R repetitions
        Runner->>Adapter: setup(config, data)
        Runner->>Adapter: warmup(W batches)
        Runner->>Timer: start measurement
        loop N batches
            Adapter->>Timer: record per-batch time
        end
        Runner->>Timer: stop measurement
        Runner->>Adapter: teardown()
    end
    Runner->>Runner: select median by wall_clock_sec
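
The loop in the diagram can be sketched in plain Python. This is a minimal illustration, not the suite's runner: `time_repetition`, `select_median`, and the `sync` callback are hypothetical names, and adapter setup/teardown is left out. For accelerator runs, `sync` would be whatever barrier the framework provides (for JAX arrays, something like `jax.block_until_ready`).

```python
import time
from typing import Callable, Iterable, Optional


def time_repetition(
    batches: Iterable,
    warmup: int,
    measured: int,
    sync: Optional[Callable[[object], None]] = None,
) -> dict:
    """Time one repetition: consume `warmup` batches untimed, then time `measured`."""
    it = iter(batches)
    for _ in range(warmup):
        batch = next(it)
        if sync is not None:
            sync(batch)  # finish warmup work before the clock starts

    per_batch_sec = []
    start = time.perf_counter()
    for _ in range(measured):
        t0 = time.perf_counter()
        batch = next(it)
        if sync is not None:
            sync(batch)  # barrier so asynchronous device work is counted
        per_batch_sec.append(time.perf_counter() - t0)
    wall_clock_sec = time.perf_counter() - start
    return {"wall_clock_sec": wall_clock_sec, "per_batch_sec": per_batch_sec}


def select_median(repetitions: list) -> dict:
    """Pick the repetition with the median wall_clock_sec (lower middle if R is even)."""
    ordered = sorted(repetitions, key=lambda r: r["wall_clock_sec"])
    return ordered[(len(ordered) - 1) // 2]


# Usage: 5 repetitions over a stand-in batch source, CI CPU profile sizes.
reps = [time_repetition(range(100), warmup=3, measured=20) for _ in range(5)]
chosen = select_median(reps)
```

Returning a whole repetition rather than averaging across repetitions keeps the per-batch times and the wall-clock figure from the same run.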

Warmup Strategy

Warmup ensures JIT compilation, caching, and pipeline priming are excluded from measurements:

Profile	Warmup Batches	Measurement Batches
CI CPU	3	20
GPU A100	8	50
GPU RTX 4090	6	40
TPU v5e	8	50

Why warmup matters

JAX traces and compiles a jitted function through XLA the first time it runs. Without warmup, the first batch includes that compilation overhead and can be 100x slower than subsequent batches.

Repetitions and Statistics

Each scenario runs multiple repetitions. The median result is selected to reduce sensitivity to outlier runs (cold caches, GC pauses, etc.).

Statistical analysis uses:

Coefficient of Variation (CV): Measurement stability check — CV < 10% required for publishable results
Bootstrap CI: 95% confidence intervals via 1000 bootstrap resamples (calibrax.statistics.StatisticalAnalyzer)
Threshold-based regression detection: Direction-aware comparison against baseline (calibrax.analysis.detect_regressions). Default threshold is 5% — see Dashboard & calibrax for details
Modified Z-score: Outlier detection using MAD-based robust statistics
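
The four checks above can be sketched with the standard library. This is an illustration under stated assumptions, not the calibrax implementation: the function names are hypothetical, the bootstrap uses the percentile method on the median, the modified Z-score uses the common 0.6745 scaling with a 3.5 cutoff, and "direction-aware" is read as "a drop is a regression for higher-is-better metrics, a rise is a regression for lower-is-better metrics".

```python
import random
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def bootstrap_ci(samples, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the median (assumed statistic)."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


def modified_z_scores(samples):
    """0.6745 * (x - median) / MAD, where MAD is the median absolute deviation."""
    med = statistics.median(samples)
    mad = statistics.median([abs(x - med) for x in samples])
    if mad == 0:
        return [0.0 for _ in samples]  # no spread, nothing to flag
    return [0.6745 * (x - med) / mad for x in samples]


def is_regression(baseline, current, higher_is_better, threshold=0.05):
    """Flag a relative change beyond `threshold` in the unfavorable direction."""
    change = (current - baseline) / baseline
    return change < -threshold if higher_is_better else change > threshold


# Usage: per-repetition throughput with one obvious outlier.
throughput = [100, 101, 99, 100, 102, 98, 100, 150]
outliers = [x for x, z in zip(throughput, modified_z_scores(throughput)) if abs(z) > 3.5]
stable = coefficient_of_variation([98, 100, 102]) < 0.10
```

The median and MAD are used for outlier detection because a single extreme run shifts them far less than it shifts the mean and standard deviation.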

Fairness Principles

Same data: All frameworks process identical synthetic datasets
Same hardware: All frameworks run on the same machine in sequence
Cache clearing: JAX caches, Python GC, and CUDA memory cleared between framework runs
Supported scenarios only: Each framework runs only the scenarios it supports (no penalty for missing features)
Equal transforms: Each adapter implements the same transforms required by a scenario (e.g., CV-1 requires Normalize + CastToFloat32). Adapters that cannot implement a scenario's transforms are excluded from that scenario rather than measured with less work
Profile-gated defaults: Hardware profile scenario include/exclude lists are applied by default; explicit --scenarios overrides profile gating
Backend truth: Each manifest records both requested_platform and active_backend; mismatches fail validation in the automated Vast workflow
Two fairness lenses:
- Same-backend head-to-head: compare frameworks only on the shared supported-scenario intersection.
- Native-optimal capability: evaluate each framework on scenarios that reflect its best native path.
Provisioning determinism: Automated Vast runs pin a named cluster, validate hardware/backend before benchmarking, and apply timeout -> status check -> same-cluster retry before optional fallback.
Stall fail-fast diagnostics: If launch output is silent past the configured stall threshold, automation runs sky queue and sky logs --tail and exits with a clear failure reason instead of waiting indefinitely.
Stage-level GPU reservation: Remote verify and benchmark stage commands request GPU resources explicitly (sky exec --gpus <class>:1) to prevent no-device stage execution.
Artifact layout integrity: Cloud artifact collection validates transfer method compatibility and normalizes nested results/results/* layouts that can occur with some scp variants.
Transform fidelity: Adapter setup fails fast on any transform it does not implement (resolve_transforms) instead of silently skipping it, so a measured run can never execute a lighter pipeline than the scenario requested. A determinism test asserts this for every installed adapter.
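
The fail-fast behavior described under "Transform fidelity" can be sketched as follows. Only the name `resolve_transforms` comes from this page; the signature, the registry dictionary, and `UnsupportedTransformError` are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Transform = Callable[[dict], dict]


class UnsupportedTransformError(ValueError):
    """Raised when a scenario requests a transform the adapter lacks (hypothetical)."""


def resolve_transforms(
    requested: List[str], registry: Dict[str, Transform]
) -> List[Transform]:
    """Return the requested transforms in order, or raise if any is missing."""
    missing = [name for name in requested if name not in registry]
    if missing:
        # Fail during setup so no timed run executes a lighter pipeline.
        raise UnsupportedTransformError(
            f"adapter does not implement: {', '.join(missing)}"
        )
    return [registry[name] for name in requested]


# Usage: CV-1 requires Normalize + CastToFloat32; this adapter only has Normalize.
partial_registry = {"Normalize": lambda batch: batch}
try:
    resolve_transforms(["Normalize", "CastToFloat32"], partial_registry)
    setup_failed = False
except UnsupportedTransformError as exc:
    setup_failed = True
    failure_reason = str(exc)
```

Raising instead of filtering the list is the point of the design: a silently shortened pipeline would still produce a throughput number, just not a comparable one.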

Materialization Semantics

The timed loop calls each adapter's _materialize_batch() per batch, so a batch's cost is whatever its framework's real delivery path costs — and that differs architecturally:

Per-record collation (Grain, PyTorch DataLoader, SPDL): every batch gathers and copies individual records. Throughput reflects real per-batch assembly work.
Zero-copy slicing (jax-dataloader over preloaded arrays): batches are views into memory that already exists; transform-free scenarios then measure little more than slice arithmetic and can report throughput near memory bandwidth (tens of millions of elements per second). This is the framework's genuine behavior for in-memory data, not a measurement error, but it is not comparable to per-record loaders on such scenarios.
Columnar views (HuggingFace Datasets): tabular columns convert from Arrow with near-zero copies while image columns pay a per-access conversion, which is why its numbers are strongly bimodal across modalities.

Cross-framework comparisons should therefore lead with scenarios that include transforms (where every adapter does real per-batch work) and treat transform-free iteration numbers as an architecture probe, not a ranking.
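
The difference between per-record collation and zero-copy slicing can be shown with a standard-library analogy. This is not code from any of the adapters: `collate` and `slice_batch` are hypothetical helpers, and `bytes` / `memoryview` stand in for records and preloaded arrays.

```python
RECORD_SIZE = 4
records = [bytes([i]) * RECORD_SIZE for i in range(8)]  # eight 4-byte records


def collate(start, stop):
    """Per-record collation: gather records and copy them into a new buffer."""
    return b"".join(records[start:stop])


# Zero-copy slicing: the dataset already lives in one contiguous buffer.
preloaded = bytearray(b"".join(records))
view = memoryview(preloaded)


def slice_batch(start, stop):
    """Return a view into the preloaded buffer; no bytes are copied."""
    return view[start * RECORD_SIZE : stop * RECORD_SIZE]


batch_copy = collate(0, 2)
batch_view = slice_batch(0, 2)
same_content = bytes(batch_view) == batch_copy

# Mutating the underlying buffer is visible through the view but not the copy,
# which shows that the sliced batch never allocated its own storage.
preloaded[0] = 255
view_sees_change = batch_view[0] == 255
copy_unchanged = batch_copy[0] == 0
```

The cost of `slice_batch` does not grow with batch size, which is why transform-free scenarios on preloaded data say more about architecture than about loader efficiency.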

Backend Truth Contract

Every canonical benchmark run must record and validate:

Field	Source	Expected value for GPU runs
`requested_platform`	Runner CLI/profile	`gpu`
`active_backend`	`init_platform()` / JAX	`gpu`
`environment.platform.devices`	Runtime probe	Includes `cuda` devices
`gpu_name`	Environment capture	Matches expected hardware class (for Vast automation: A100)

Automated Vast two-pass runs fail fast if any of these checks do not match expected values.

Scenario Categories

The catalog holds 37 scenarios: 28 standard scenarios plus 9 heavy (H*) production-realistic variants that run on the A100 cloud profile.

Category	IDs	What It Measures
Computer Vision	CV-1, CV-2, CV-3, CV-4	Image loading + augmentation throughput
NLP	NLP-1, NLP-2	Tokenization pipeline throughput
Tabular	TAB-1, TAB-2	Structured data loading
Multimodal	MM-1, MM-2	Multi-modal data interleaving
Pipeline Complexity	PC-1 to PC-5	DAG depth, branching, caching
I/O Patterns	IO-1, IO-2, IO-3, IO-4	Sequential vs random, streaming, caching
Distributed	DIST-1, DIST-2	Multi-device sharding and mesh config
Production	PR-1, PR-2	Checkpointing, determinism
Augmentation	AUG-1, AUG-2, AUG-3	Stochastic transform pipeline overhead
Datarax Unique	NNX-1, XFMR-1	Flax NNX integration, JIT+vmap acceleration
Computer Vision (heavy)	HCV-1, HCV-2	ImageNet-scale classification, dense prediction
NLP (heavy)	HNLP-1, HNLP-2	Long-context pretraining, tokenization at scale
Tabular (heavy)	HTAB-1	Large-scale recommendation pipeline
Multimodal (heavy)	HMM-1	CLIP-scale vision-language contrastive pipeline
Pipeline Complexity (heavy)	HPC-1, HPC-2	SSL augmentation chain, multi-view DAG
Distributed (heavy)	HDIST-1	Multi-device sharded data pipeline
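
As a cross-check of the counts above, the table can be written out as a dictionary and partitioned by the `H` prefix. The `CATALOG` structure is a hypothetical transcription of the table, not the suite's actual catalog object.

```python
CATALOG = {
    "Computer Vision": ["CV-1", "CV-2", "CV-3", "CV-4"],
    "NLP": ["NLP-1", "NLP-2"],
    "Tabular": ["TAB-1", "TAB-2"],
    "Multimodal": ["MM-1", "MM-2"],
    "Pipeline Complexity": ["PC-1", "PC-2", "PC-3", "PC-4", "PC-5"],
    "I/O Patterns": ["IO-1", "IO-2", "IO-3", "IO-4"],
    "Distributed": ["DIST-1", "DIST-2"],
    "Production": ["PR-1", "PR-2"],
    "Augmentation": ["AUG-1", "AUG-2", "AUG-3"],
    "Datarax Unique": ["NNX-1", "XFMR-1"],
    "Computer Vision (heavy)": ["HCV-1", "HCV-2"],
    "NLP (heavy)": ["HNLP-1", "HNLP-2"],
    "Tabular (heavy)": ["HTAB-1"],
    "Multimodal (heavy)": ["HMM-1"],
    "Pipeline Complexity (heavy)": ["HPC-1", "HPC-2"],
    "Distributed (heavy)": ["HDIST-1"],
}

all_ids = [sid for ids in CATALOG.values() for sid in ids]

# Heavy variants are the IDs with an "H" prefix; no standard ID starts with "H".
heavy_ids = [sid for sid in all_ids if sid.startswith("H")]
standard_ids = [sid for sid in all_ids if not sid.startswith("H")]

print(len(standard_ids), len(heavy_ids), len(all_ids))
```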

Stability Validation

Before publishing results, the StabilityValidator checks that all measurements have CV < 10%. Unstable scenarios are flagged for additional repetitions.

from benchmarks.analysis.stability import StabilityValidator
from benchmarks.runners.full_runner import ComparativeResults

# Load saved comparative results from the reports directory.
results = ComparativeResults.load("benchmark-data/reports/latest")

# cv_threshold=0.10 corresponds to the CV < 10% publication requirement.
validator = StabilityValidator(cv_threshold=0.10)
report = validator.validate(results)

print(f"Stable: {report.stable_count}/{report.total_results}")
# Each unstable entry unpacks to (scenario ID, adapter name, CV).
for sid, adapter, cv in report.unstable_scenarios:
    print(f"  UNSTABLE: {sid}/{adapter} CV={cv:.2%}")