Benchmark Results

Datarax includes a benchmark catalog of 37 scenarios and a canonical cross-framework comparison set across 16 adapters (Datarax iter + scan + 14 peer frameworks). Results are analyzed with calibrax and published to a W&B dashboard.

Per-adapter scenario coverage is recorded empirically in the Coverage Matrix; scenarios where a peer framework leads are tracked in the Optimization Backlog.

Overview

The benchmark suite evaluates:

Throughput: Elements processed per second
Latency: Per-batch processing time distribution (p50, p95, p99)
Memory: Peak RSS and GPU memory usage
Scaling: Performance across batch sizes, workers, and devices
Feature coverage: Which scenarios each framework supports
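
As a rough illustration of the latency metric, the sketch below computes p50, p95, and p99 from a list of per-batch timings using the nearest-rank method. The function names and the sample timings are assumptions made for this example; the percentile method the suite actually uses is not specified here.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_percentiles(batch_times_ms):
    """Summarize a per-batch latency distribution as p50, p95, and p99."""
    return {f"p{q}": percentile(batch_times_ms, q) for q in (50, 95, 99)}


# Hypothetical per-batch latencies in milliseconds, including one slow outlier.
timings = [10.0, 11.0, 10.5, 12.0, 10.2, 10.8, 11.5, 10.1, 10.9, 50.0]
summary = latency_percentiles(timings)
print(summary)
```

With only ten samples the tail percentiles both land on the outlier, which is one reason a profile measures more batches than a toy example like this.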

All metrics are direction-aware (higher throughput is better, lower latency is better), which drives automatic regression detection and ranking.
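
A minimal sketch of what "direction-aware" can mean in practice: each metric carries a direction, and both ranking and the better/worse test consult it. The `Direction` enum, the metric keys, and the adapter names are hypothetical and are not taken from calibrax.

```python
from enum import Enum


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


# Hypothetical registry; the metric keys are illustrative only.
METRIC_DIRECTIONS = {
    "throughput": Direction.HIGHER_IS_BETTER,
    "latency_p50_ms": Direction.LOWER_IS_BETTER,
}


def rank_adapters(metric, values):
    """Return adapter names ordered best-first according to the metric's direction."""
    best_is_largest = METRIC_DIRECTIONS[metric] is Direction.HIGHER_IS_BETTER
    return sorted(values, key=values.get, reverse=best_is_largest)


def is_worse(metric, baseline, current):
    """True when `current` moved in the undesirable direction relative to `baseline`."""
    if METRIC_DIRECTIONS[metric] is Direction.HIGHER_IS_BETTER:
        return current < baseline
    return current > baseline


# Made-up numbers for two placeholder adapters.
by_throughput = rank_adapters("throughput", {"adapter_a": 1200.0, "adapter_b": 900.0})
by_latency = rank_adapters("latency_p50_ms", {"adapter_a": 4.0, "adapter_b": 3.0})
print(by_throughput, by_latency)
```

Keeping the direction next to the metric name means the same ranking and regression code can serve every metric without per-metric special cases.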

Framework Comparison

We benchmark against frameworks across three tiers:

Tier	Frameworks
Tier 1 (JAX-native)	Grain, JAX DataLoader
Tier 2 (Cross-framework)	tf.data, PyTorch DataLoader, DALI, FFCV, SPDL
Tier 3 (Ecosystem)	MosaicML, WebDataset, HF Datasets, Ray Data, LitData, Energon, Deep Lake

graph LR
    D[Datarax] --> T1[Tier 1: JAX-native]
    D --> T2[Tier 2: Cross-framework]
    D --> T3[Tier 3: Ecosystem]
    T1 --> G[Grain]
    T1 --> JDL[JAX DataLoader]
    T2 --> TF[tf.data]
    T2 --> PT[PyTorch DL]
    T2 --> DA[DALI]
    T2 --> FF[FFCV]
    T2 --> SP[SPDL]

Sections

Methodology -- Timing protocol, warmup strategy, statistical analysis
Framework Comparison -- Results with charts and comparative analysis
Cloud Benchmarking -- Running benchmarks on Vast.ai, Lambda, GCP via SkyPilot
Dashboard & calibrax -- W&B dashboard, regression gates, and the calibrax analysis library

Running Benchmarks End-to-End

The full workflow: install dependencies, run benchmarks, analyze results, export to W&B.

1. Install

# Core datarax + all benchmark adapters
uv sync --all-extras

# calibrax (analysis CLI, installed from GitHub)
uv pip install "calibrax @ git+https://github.com/avitai/calibrax.git"

# Optional: W&B export support
uv pip install "calibrax[wandb] @ git+https://github.com/avitai/calibrax.git"

2. Run

Five entry points are available: the benchmarks.cli CLI (recommended), the shell script, a custom subset through full_runner, the automated Vast two-pass orchestrator, and the CI gate. Each is described in turn below.

The benchmarks.cli CLI (uv run python -m benchmarks.cli) is the preferred entry point. In a single command it runs benchmarks, converts and stores the results, and optionally exports them to W&B.

uv run python -m benchmarks.cli run --platform cpu --repetitions 3
uv run python -m benchmarks.cli run --platform cpu --scenarios CV-1 --scenarios NLP-1 --adapters Datarax --adapters "Google Grain" --repetitions 3
uv run python -m benchmarks.cli run --platform cpu --wandb --charts  # With W&B export and chart generation

Repeated flags for the benchmarks.cli CLI

The click-based benchmarks.cli run command takes repeated --scenarios/--adapters flags (one value each), not a space-separated list. The space-separated form (--scenarios CV-1 NLP-1) works only for the argparse runners (full_runner.py, benchmark_runner.py).
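
The difference between the two flag styles can be reproduced with a small stand-alone sketch. It uses only argparse, with `action="append"` standing in for the repeated-flag behaviour of the click command; the two parsers here are illustrative and are not the project's actual parser definitions.

```python
import argparse

# Repeated-flag style: every occurrence of the flag appends exactly one value.
# (Stand-in for the click-based `benchmarks.cli run` behaviour.)
repeated = argparse.ArgumentParser()
repeated.add_argument("--scenarios", action="append", default=None)

# Space-separated style: one flag consumes all values that follow it.
# (Stand-in for the argparse runners.)
spaced = argparse.ArgumentParser()
spaced.add_argument("--scenarios", nargs="+", default=None)

from_repeated = repeated.parse_args(["--scenarios", "CV-1", "--scenarios", "NLP-1"])
from_spaced = spaced.parse_args(["--scenarios", "CV-1", "NLP-1"])

print(from_repeated.scenarios)
print(from_spaced.scenarios)
```

Both invocations yield the same list of scenario IDs; only the command-line spelling differs.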

Nightly CI runs benchmarks.cli run

The nightly CI workflow runs uv run python -m benchmarks.cli run, the same preferred entry point used for all other benchmark runs.

The shell script runs the profile-defined scenario set across every installed adapter:

./scripts/run_full_benchmark.sh             # CPU, 5 repetitions
./scripts/run_full_benchmark.sh cpu 3       # CPU, 3 repetitions
./scripts/run_full_benchmark.sh gpu         # GPU (requires CUDA)

Or via Python directly:

uv run python -m benchmarks.runners.full_runner --platform cpu --repetitions 5

To run a custom subset, name specific scenarios and adapters:

uv run python -m benchmarks.runners.full_runner \
    --platform cpu \
    --scenarios CV-1 NLP-1 TAB-1 \
    --adapters Datarax "Google Grain" "PyTorch DataLoader" \
    --repetitions 3

The automated Vast two-pass run is preferred for canonical GPU reports and backend-truth validation:

./.venv/bin/python -m benchmarks.automation.vast_orchestrator \
    --infra vast \
    --cluster datarax-vast-a100 \
    --mode two-pass \
    --on-demand \
    --download-dir benchmark-data/reports/vast/latest \
    --analyze \
    --yes \
    --no-spot-fallback \
    --launch-timeout-sec 900 \
    --stall-timeout-sec 300

The orchestrator reports live progress from long-running sky commands using peek: lines (setup, launch, and benchmark execution milestones). For on-demand runs, timeout handling checks cluster visibility and retries the same cluster before any optional spot fallback. If launch output stalls, the orchestrator fails fast and automatically captures sky queue and sky logs --tail diagnostics. Stage runs reserve a GPU explicitly (sky exec --gpus A100:1), and the artifact download logic automatically normalizes the nested-layout edge cases that scp can produce.

Dry-run (no provisioning):

./.venv/bin/python -m benchmarks.automation.vast_orchestrator \
    --infra vast \
    --cluster datarax-vast-a100 \
    --mode two-pass \
    --download-dir benchmark-data/reports/vast/latest \
    --dry-run \
    --yes

The CI gate is a lightweight regression check that runs 6 fast gate scenarios on Datarax only:

uv run python -m benchmarks.runners.ci_runner --repetitions 3

The gate runs automatically on PRs touching src/datarax/ or benchmarks/. See Performance Gate.

Results are saved to a local benchmark-data/ directory (not committed to version control).

Runner Options

Defaults differ between the click CLI (benchmarks.cli run) and the argparse runner (benchmarks.runners.full_runner); both are shown where they diverge.

Flag	`cli run` default	`full_runner` default	Description
`--platform`	`cpu`	`cpu`	Target platform (see Platforms below)
`--scenarios`	profile include list	profile include list	Scenario IDs (see Scenarios below). `cli run` takes repeated flags (`--scenarios CV-1 --scenarios NLP-1`); `full_runner` takes a space-separated list (`--scenarios CV-1 NLP-1`). Explicit values override profile include/exclude lists.
`--adapters`	all installed	all installed	Adapter names (see Adapters below). `cli run` takes repeated flags; `full_runner` takes a space-separated list.
`--profile`	`ci_cpu`	`ci_cpu`	Hardware profile (see Profiles below)
`--repetitions`	`3`	`5`	Number of repetitions per scenario (median is selected)
`--output-dir`	`benchmark-data/reports/latest`	`benchmark-data/reports/releases/v1.0`	Output directory for result JSONs (local only, not committed)
`--wandb/--no-wandb`	`--wandb`	n/a	Enable/disable W&B export (`benchmarks.cli` only)
`--charts/--no-charts`	`--charts`	n/a	Enable/disable chart generation (`benchmarks.cli` only)
`--baseline/--no-baseline`	`--baseline`	n/a	Set run as baseline for future comparisons (`benchmarks.cli` only)
`--data`	`benchmark-data`	n/a	calibrax store directory path (`benchmarks.cli` only)
`--project`	from config	n/a	W&B project override (`benchmarks.cli` only)
`--entity`	from config	n/a	W&B entity override (`benchmarks.cli` only)

Platforms

Value	Description
`cpu`	CPU-only execution (default, works everywhere)
`gpu`	GPU execution (requires CUDA)
`tpu`	TPU execution (requires TRC access or GCP)

Scenarios

37 scenarios across 10 categories (28 standard + 9 heavy H* variants). Pass any combination to --scenarios:

ID	Category	Description
`CV-1`	Computer Vision	Image classification pipeline (canonical)
`CV-2`	Computer Vision	High-resolution medical imaging (3D U-Net)
`CV-3`	Computer Vision	Batch-level mixing (MixUp/CutMix)
`CV-4`	Computer Vision	Multi-resolution pipeline
`NLP-1`	NLP	Token-based LLM pretraining data
`NLP-2`	NLP	Variable-length text with dynamic padding
`TAB-1`	Tabular	Dense feature table loading
`TAB-2`	Tabular	Sparse feature processing (DLRM pattern)
`MM-1`	Multimodal	Image-text pair loading (CLIP-style)
`MM-2`	Multimodal	Audio-text pair loading (ASR-style)
`PC-1`	Pipeline Complexity	Deep transform chain scaling
`PC-2`	Pipeline Complexity	Branching/parallel pipeline (DAG)
`PC-3`	Pipeline Complexity	Differentiable rebatching
`PC-4`	Pipeline Complexity	Probabilistic & conditional pipeline
`PC-5`	Pipeline Complexity	End-to-end differentiable pipeline
`IO-1`	I/O Patterns	Source backend comparison
`IO-2`	I/O Patterns	Streaming vs eager loading
`IO-3`	I/O Patterns	Mixed-source pipeline
`IO-4`	I/O Patterns	Cache node effectiveness
`DIST-1`	Distributed	Multi-device sharding & prefetch
`DIST-2`	Distributed	Device mesh configuration
`PR-1`	Production	Checkpoint save/restore cycle
`PR-2`	Production	Multi-epoch determinism verification
`AUG-1`	Augmentation	Stochastic augmentation chain throughput
`AUG-2`	Augmentation	Deterministic vs stochastic transform overhead
`AUG-3`	Augmentation	Stochastic depth pipeline behavior
`NNX-1`	Datarax Unique	Flax NNX module integration overhead
`XFMR-1`	Datarax Unique	JIT + vmap transform acceleration
`HCV-1`	Computer Vision (heavy)	ImageNet-scale image classification
`HCV-2`	Computer Vision (heavy)	Dense prediction / segmentation pipeline
`HNLP-1`	NLP (heavy)	Long-context LLM pretraining data pipeline
`HNLP-2`	NLP (heavy)	Text tokenization pipeline
`HTAB-1`	Tabular (heavy)	Large-scale recommendation system pipeline
`HMM-1`	Multimodal (heavy)	Vision-language contrastive pipeline (CLIP-scale)
`HPC-1`	Pipeline Complexity (heavy)	SSL/contrastive learning augmentation chain
`HPC-2`	Pipeline Complexity (heavy)	Multi-view DAG augmentation
`HDIST-1`	Distributed (heavy)	Multi-device sharded data pipeline

Heavy (H*) scenarios

The H* variants use production-realistic, ImageNet/CLIP-scale data. They run on the A100 cloud profile (gpu_a100); the 24 GB gpu_rtx4090 profile runs the non-heavy set. See Hardware Profiles.

Adapters

16 adapters (Datarax iter + scan + 14 peer frameworks). Pass any combination to --adapters. Only adapters whose framework is installed will run.

Each adapter supports only the scenarios where it implements the required transforms (e.g., CV-1 requires Normalize + CastToFloat32). Adapters that cannot implement a scenario's transforms are excluded from that scenario rather than measured with less work. The Scenarios column below is the empirical per-adapter coverage count from the Coverage Matrix; counts for uninstalled frameworks reflect their declared support.

`--adapters` value	Tier	Framework	Scenarios
`Datarax`	--	Datarax iter-mode (always available)	37
`Datarax-scan`	--	Datarax whole-epoch `nnx.scan` variant	37
`Google Grain`	Tier 1	Google Grain	25
`jax-dataloader`	Tier 1	JAX DataLoader	13
`tf.data`	Tier 2	TensorFlow tf.data	25
`PyTorch DataLoader`	Tier 2	PyTorch DataLoader	25
`NVIDIA DALI`	Tier 2	NVIDIA DALI	9
`FFCV`	Tier 2	FFCV	13
`SPDL`	Tier 2	SPDL	25
`MosaicML Streaming`	Tier 3	MosaicML Streaming	2
`WebDataset`	Tier 3	WebDataset	2
`HuggingFace Datasets`	Tier 3	HuggingFace Datasets	13
`Ray Data`	Tier 3	Ray Data	2
`LitData`	Tier 3	LitData	1
`Energon`	Tier 3	Megatron Energon	1
`Deep Lake`	Tier 3	Deep Lake	13

25 of the 37 scenarios run on at least 3 frameworks; this is the set where cross-framework comparison is meaningful. The best-covered scenarios are CV-1 and NLP-1 (14 frameworks each) and TAB-1 (12). See the Coverage Matrix for the full per-scenario breakdown.
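
The exclusion rule and the "at least 3 frameworks" cut can be sketched as a set-containment test followed by a count. Everything in this example other than the CV-1 requirement (Normalize + CastToFloat32, as stated above) is hypothetical: the adapter names, their transform sets, and the second scenario are placeholders, not entries from the real Coverage Matrix.

```python
def eligible(adapter_transforms, required):
    """An adapter runs a scenario only if it implements every required transform."""
    return required <= adapter_transforms


def comparable_scenarios(coverage, min_frameworks=3):
    """Scenario IDs covered by at least `min_frameworks` adapters."""
    return sorted(sid for sid, names in coverage.items() if len(names) >= min_frameworks)


SCENARIO_REQUIREMENTS = {
    "CV-1": {"Normalize", "CastToFloat32"},  # requirement quoted in the text above
    "XYZ-1": {"Normalize", "RandomCrop"},    # hypothetical scenario
}

# Hypothetical adapters and the transforms each one implements.
ADAPTER_TRANSFORMS = {
    "adapter_a": {"Normalize", "CastToFloat32", "RandomCrop"},
    "adapter_b": {"Normalize", "CastToFloat32"},
    "adapter_c": {"Normalize", "CastToFloat32"},
    "adapter_d": {"Normalize"},
}

# Adapters that cannot do the full work are left out rather than measured with less work.
coverage = {
    sid: [name for name, have in ADAPTER_TRANSFORMS.items() if eligible(have, required)]
    for sid, required in SCENARIO_REQUIREMENTS.items()
}

print(coverage)
print(comparable_scenarios(coverage))
```

Excluding an adapter outright keeps every number in a scenario comparable, at the cost of thinner coverage for scenarios with demanding transform sets.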

Names with spaces require shell quotes

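Adapter names are exact-match. Names containing spaces must be quoted on the command line, for example --adapters Datarax "Google Grain" "PyTorch DataLoader" with full_runner, or --adapters Datarax --adapters "Google Grain" with benchmarks.cli run. Single-word names such as Datarax, FFCV, and SPDL need no quotes.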
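
To see why the quotes matter, the sketch below uses Python's shlex module, which splits a string following POSIX shell quoting rules, to show the argument list a runner would receive with and without quotes. The command strings are illustrative only.

```python
import shlex

# Quoted: each multi-word adapter name arrives as a single argument.
quoted = shlex.split('full_runner --adapters Datarax "Google Grain" "PyTorch DataLoader"')

# Unquoted: the shell splits on the space, so the exact-match lookup would
# see "Google" and "Grain" as two separate (unknown) adapter names.
unquoted = shlex.split("full_runner --adapters Datarax Google Grain")

print(quoted)
print(unquoted)
```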

Hardware Profiles

Profiles control warmup batches, measurement batches, and timeouts:

Profile	Backend	Warmup	Batches	Timeout	Default Scenario Set
`ci_cpu`	CPU	3	20	5 min	6 scenarios (CI gate set)
`gpu_a100`	GPU	8	50	10 min	15 scenarios (includes heavy `HCV-1`, `HPC-1`)
`gpu_rtx4090`	GPU	6	40	10 min	28 scenarios (all non-heavy; fit the 24 GB card)
`gpu_rtx4090_real`	GPU	6	40	10 min	5 real-data scenarios (CV-1, CV-3, NLP-1, TAB-1, MM-1 pinned to `real_*` variants)
`tpu_v5e`	TPU	8	50	10 min	10 TPU-compatible scenarios

Profile scenario include lists are enforced by default. Use --scenarios to run an explicit scenario subset outside the default profile list. The gpu_rtx4090_real profile serves every adapter the same raw numpy bytes from cached real datasets, pinning each scenario to its real_* variant (set DATARAX_BENCH_DOWNLOAD=1 for the one-time materialization).
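
One way to picture the relationship between a profile and an explicit --scenarios list is sketched below. The `Profile` dataclass and `resolve_scenarios` function are hypothetical and do not mirror the project's actual classes. The warmup, batch, and timeout values for ci_cpu come from the table above, while the include list is a shortened placeholder rather than the real 6-scenario CI gate set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    backend: str
    warmup_batches: int
    measure_batches: int
    timeout_sec: int
    include: tuple  # default scenario IDs for this profile


# Placeholder include list: NOT the actual CI gate set.
CI_CPU = Profile(
    name="ci_cpu",
    backend="cpu",
    warmup_batches=3,
    measure_batches=20,
    timeout_sec=5 * 60,
    include=("CV-1", "NLP-1", "TAB-1"),
)


def resolve_scenarios(profile, explicit=None):
    """Explicit --scenarios values win; otherwise fall back to the profile's include list."""
    return list(explicit) if explicit else list(profile.include)


default_set = resolve_scenarios(CI_CPU)
override_set = resolve_scenarios(CI_CPU, explicit=["PC-1"])
print(default_set, override_set)
```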

3. Analyze

After running, use calibrax to check for regressions and view a summary:

# Terminal summary
calibrax summary --data benchmark-data/

# Regression check against baseline (exits non-zero on failure)
calibrax check --data benchmark-data/ --threshold 0.05

# Set current run as the baseline for future comparisons
calibrax baseline --data benchmark-data/ --run latest
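
The regression check can be pictured as a relative-change test against the baseline. The sketch below assumes that --threshold 0.05 means a tolerated degradation of 5% relative to the baseline; that interpretation, the function names, and the sample numbers are all assumptions and do not describe calibrax's actual implementation.

```python
def relative_degradation(baseline, current, higher_is_better):
    """Fraction by which `current` is worse than `baseline` (negative means it improved)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (current - baseline) / abs(baseline)
    return -change if higher_is_better else change


def check(baseline, current, higher_is_better, threshold=0.05):
    """Return a process-style exit code: 0 for pass, 1 for a regression beyond the threshold."""
    degraded = relative_degradation(baseline, current, higher_is_better)
    return 1 if degraded > threshold else 0


# Hypothetical numbers: throughput fell from 1000 to 930 elements/s (7% worse).
exit_code_throughput = check(1000.0, 930.0, higher_is_better=True)

# Hypothetical numbers: latency rose from 10.0 ms to 10.3 ms (3% worse, within tolerance).
exit_code_latency = check(10.0, 10.3, higher_is_better=False)

print(exit_code_throughput, exit_code_latency)
```

Returning an exit code rather than a boolean matches how a CI job would consume the result of a non-zero exit on failure.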

4. Export to W&B (optional)

export WANDB_API_KEY="..."
calibrax export --data benchmark-data/

See Dashboard & calibrax for W&B setup details and the full calibrax CLI reference.

Reproducibility

All benchmarks use:

Deterministic synthetic data (seed=42) -- identical inputs across all frameworks
Fixed warmup protocol -- 3-8 batches depending on hardware profile
Median-of-N selection -- reduces sensitivity to outlier runs
Environment fingerprinting -- hardware/software version tracking
Backend-truth recording -- manifests include both requested_platform and active_backend
JSON-per-run storage -- every run saved locally for offline analysis
CLI run-state visibility -- orchestrator emits progress bars, heartbeat updates, live peek: command output, and failure summaries
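
Three of the items above (seeded synthetic data, a fixed warmup, and median-of-N selection) can be combined in a toy timing loop. This is a minimal sketch under stated assumptions, not the suite's timing protocol: the helper names are hypothetical, the "pipeline" is just Python's built-in sum, and the 3 warmup and 20 measured batches simply mirror the ci_cpu row of the profile table.

```python
import random
import statistics
import time


def make_batches(n_batches, batch_size, seed=42):
    """Deterministic synthetic data: the same seed always yields the same batches."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(batch_size)] for _ in range(n_batches)]


def run_once(batches, warmup, process):
    """Process warmup batches untimed, then time the rest; return elements per second."""
    for batch in batches[:warmup]:
        process(batch)
    start = time.perf_counter()
    elements = 0
    for batch in batches[warmup:]:
        process(batch)
        elements += len(batch)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return elements / elapsed


def median_of_n(batches, warmup, process, repetitions):
    """Repeat the measurement and keep the median to damp outlier runs."""
    runs = [run_once(batches, warmup, process) for _ in range(repetitions)]
    return statistics.median(runs)


batches = make_batches(n_batches=23, batch_size=32)  # 3 warmup + 20 measured
throughput = median_of_n(batches, warmup=3, process=sum, repetitions=5)
print(f"median throughput: {throughput:.0f} elements/s")
```

The median is preferred over the mean here because a single slow run (for example, one hit by background load) shifts the mean but leaves the median largely unchanged.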