HuggingFace Datasets Quick Reference

| Metadata | Value |
|---|---|
| Level | Beginner |
| Runtime | ~5 min |
| Prerequisites | Basic Datarax pipeline knowledge |
| Format | Python + Jupyter |

Overview

Load and process datasets from the HuggingFace Hub with Datarax's HFEagerSource. The source exposes Hub datasets to Datarax pipelines, so the thousands of datasets hosted there can be batched and transformed like any other source.

What You'll Learn

  1. Configure HFEagerSource for HuggingFace datasets
  2. Use streaming mode for large datasets
  3. Inspect dataset structure and contents
  4. Apply transformations to HuggingFace data
  5. Handle different data types (images, text, tabular)

Coming from PyTorch?

| PyTorch | Datarax |
|---|---|
| datasets.load_dataset("mnist") | HFEagerSource(HFEagerConfig(name="mnist")) |
| dataset["train"] | HFEagerConfig(split="train") |
| IterableDataset + DataLoader | HFEagerSource with streaming=True |
| dataset.map(transform) | Pipeline(source=..., stages=[operator], ...) |
| Manual batching in DataLoader | Pipeline(source=source, stages=[], batch_size=32, rngs=nnx.Rngs(0)) |

Key difference: Datarax integrates HuggingFace datasets directly into JAX pipelines with automatic array conversion.

Coming from TensorFlow?

| TensorFlow | Datarax |
|---|---|
| tfds.load("mnist") | HFEagerSource(HFEagerConfig(name="mnist")) |
| dataset.take(1000) | Use split syntax: split="train[:1000]" |
| dataset.batch(32).prefetch(2) | Pipeline(source=source, stages=[], batch_size=32, rngs=nnx.Rngs(0)) |
| dataset.map(preprocess) | Pipeline(source=..., stages=[operator], ...) |

Key difference: HuggingFace Hub has a larger dataset catalog (100,000+) compared to TFDS, and Datarax provides unified access.
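The split syntax in the table (split="train[:1000]") is a plain string, so it can be assembled programmatically. The following is a minimal sketch; the helper name slice_split is hypothetical and not part of Datarax or HuggingFace, and it only formats the string. It assumes the resulting value is passed on unchanged as the split argument.

```python
def slice_split(split, start=None, stop=None, percent=False):
    """Build a split-slicing string such as 'train[:1000]' or 'train[10%:20%]'.

    Hypothetical helper: it only formats the string and loads nothing.
    """
    unit = "%" if percent else ""
    lo = "" if start is None else f"{start}{unit}"
    hi = "" if stop is None else f"{stop}{unit}"
    if not lo and not hi:
        return split  # no slicing requested, use the whole split
    return f"{split}[{lo}:{hi}]"


print(slice_split("train", stop=1000))             # train[:1000]
print(slice_split("train", 10, 20, percent=True))  # train[10%:20%]
```

The returned string can then be used wherever a split name is expected, for example HFEagerConfig(name="mnist", split=slice_split("train", stop=1000)).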

Files

Quick Start

# Install datarax with data dependencies
uv pip install "datarax[data]"

# Run the Python script
python examples/integration/huggingface/01_hf_quickref.py

# Or launch the Jupyter notebook
jupyter lab examples/integration/huggingface/01_hf_quickref.ipynb

Note: First run may download dataset files from HuggingFace Hub.

Step 1: Configure HuggingFace Source

HFEagerConfig specifies which dataset to load.

Note: You can also use the factory function from_hf(name, split, ...) which auto-selects between eager and streaming modes.

Key Parameters

| Parameter | Description | Example |
|---|---|---|
| name | Dataset identifier | "mnist", "imdb", "squad" |
| split | Which split to use | "train", "test", "validation" |
| streaming | Enable for large datasets | True avoids full download |
| subset | Dataset variant/configuration | "en" for multilingual datasets |

import jax
from flax import nnx
from datarax.sources import HFEagerConfig, HFEagerSource

# Show which devices JAX will use
print(f"JAX devices: {jax.devices()}")

# Load MNIST dataset in streaming mode
config = HFEagerConfig(
    name="mnist",
    split="train",
    streaming=True,  # Stream data instead of downloading all
)

source = HFEagerSource(config, rngs=nnx.Rngs(0))
print(f"Loaded HuggingFace dataset: {config.name}")

# Check dataset size (may not be available in streaming mode)
try:
    print(f"Dataset size: {len(source)}")
except (NotImplementedError, TypeError):
    print("Dataset size: N/A (streaming mode)")

Terminal Output:

JAX devices: [CudaDevice(id=0)]
Loaded HuggingFace dataset: mnist
Dataset size: N/A (streaming mode)
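The try/except above can be wrapped in a small reusable helper when several sources need the same check. This is a sketch only: safe_len and FakeStreamingSource are hypothetical names used for illustration, and the fake class merely imitates a source that does not define a length.

```python
def safe_len(source):
    """Return len(source), or None when the source cannot report a length."""
    try:
        return len(source)
    except (NotImplementedError, TypeError):
        return None


class FakeStreamingSource:
    """Stand-in for a streaming source: iterable, but defines no __len__."""

    def __iter__(self):
        return iter(range(3))


size = safe_len(FakeStreamingSource())
print("Dataset size:", "N/A (streaming mode)" if size is None else size)
```

Returning None instead of raising lets calling code branch on "size unknown" without repeating the exception handling.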

Step 2: Create Pipeline and Inspect Data

Build a pipeline and examine what data the dataset provides.

flowchart LR
    subgraph HF["HuggingFace Hub"]
        HUB[Dataset Repository<br/>mnist]
    end

    subgraph Source["HFEagerSource"]
        CFG[HFEagerConfig<br/>streaming=True]
        SRC[HFEagerSource<br/>Load on demand]
    end

    subgraph Pipeline["Pipeline"]
        FS[Pipeline<br/>batch_size=32]
        OPS[Operators<br/>Transformations]
    end

    subgraph Output["Output"]
        OUT[Batched Data<br/>JAX arrays]
    end

    HUB --> CFG --> SRC --> FS --> OPS --> OUT

from datarax.pipeline import Pipeline

# Create pipeline with batch_size=1 for inspection
pipeline = Pipeline(source=source, stages=[], batch_size=1, rngs=nnx.Rngs(0))

# Get first few examples
print("First 3 examples:")
example_iter = iter(pipeline)

for i in range(3):
    batch = next(example_iter)
    data = batch.get_data()

    print(f"\nExample {i + 1}:")
    print(f"  Keys: {list(data.keys())}")

    for key, value in data.items():
        if hasattr(value, "shape"):
            print(f"  {key}: shape={value.shape}, dtype={value.dtype}")
        else:
            print(f"  {key}: {type(value).__name__} = {value}")

Terminal Output:

First 3 examples:

Example 1:
  Keys: ['image', 'label']
  image: shape=(1, 28, 28), dtype=uint8
  label: shape=(1,), dtype=int64

Example 2:
  Keys: ['image', 'label']
  image: shape=(1, 28, 28), dtype=uint8
  label: shape=(1,), dtype=int64

Example 3:
  Keys: ['image', 'label']
  image: shape=(1, 28, 28), dtype=uint8
  label: shape=(1,), dtype=int64
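The inspection loop above can be factored into a function that returns a summary instead of printing it, which makes it easy to compare examples or log the structure. This is a minimal sketch: describe is a hypothetical helper, and the SimpleNamespace object is only a stand-in for an array with shape and dtype attributes.

```python
from types import SimpleNamespace


def describe(data):
    """Map each key of a batch dictionary to a short description string."""
    summary = {}
    for key, value in data.items():
        if hasattr(value, "shape"):
            # Array-like values: report shape and dtype
            summary[key] = f"shape={tuple(value.shape)}, dtype={value.dtype}"
        else:
            # Anything else: report type and value
            summary[key] = f"{type(value).__name__} = {value!r}"
    return summary


# Stand-in data; a real batch would hold JAX arrays
fake_image = SimpleNamespace(shape=(1, 28, 28), dtype="uint8")
for key, line in describe({"image": fake_image, "label": 7}).items():
    print(f"  {key}: {line}")
```

Comparing describe(...) results across the first few examples is a quick way to confirm that every element has the same keys, shapes, and dtypes.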

Step 3: Apply Transformations

Add operators to transform the HuggingFace data.

import jax.numpy as jnp
from datarax.pipeline import Pipeline
from datarax.operators import ElementOperator, ElementOperatorConfig

# Define a normalization transform
def normalize_image(element, key=None):
    """Normalize image to [0, 1] range and add channel dimension."""
    image = element.data.get("image")
    if image is not None and hasattr(image, "dtype"):
        # Normalize to [0, 1]
        normalized = image.astype(jnp.float32) / 255.0
        # Add channel dimension if needed
        if normalized.ndim == 2:
            normalized = normalized[..., None]
        return element.update_data({"image": normalized})
    return element

# Create operator
normalizer = ElementOperator(
    ElementOperatorConfig(stochastic=False),
    fn=normalize_image,
    rngs=nnx.Rngs(0),
)

# Build transformed pipeline (need fresh source for new iteration)
source2 = HFEagerSource(config, rngs=nnx.Rngs(1))
transformed_pipeline = Pipeline(source=source2, stages=[normalizer], batch_size=32, rngs=nnx.Rngs(0))

# Process a batch
batch = next(iter(transformed_pipeline))
image_batch = batch["image"]

print("Transformed batch:")
print(f"  Image shape: {image_batch.shape}")
print(f"  Image range: [{image_batch.min():.3f}, {image_batch.max():.3f}]")

Terminal Output:

Transformed batch:
  Image shape: (32, 28, 28, 1)
  Image range: [0.000, 1.000]

Computer Vision

# CIFAR-10: 60K 32x32 color images, 10 classes
config = HFEagerConfig(name="cifar10", split="train", streaming=True)

# ImageNet-1K: 1.28M images, 1000 classes
config = HFEagerConfig(name="imagenet-1k", split="train", streaming=True)

# Fashion-MNIST: 70K 28x28 grayscale fashion items
config = HFEagerConfig(name="fashion_mnist", split="train", streaming=True)

Natural Language Processing

# IMDB: 50K movie reviews (sentiment analysis)
config = HFEagerConfig(name="imdb", split="train", streaming=True)

# SQuAD: Reading comprehension dataset
config = HFEagerConfig(name="squad", split="train", streaming=True)

# WikiText-103: Language modeling dataset
config = HFEagerConfig(name="wikitext", subset="wikitext-103-v1", split="train", streaming=True)

Multimodal

# COCO Captions: Image captioning
config = HFEagerConfig(name="coco", subset="2017", split="train", streaming=True)

# Conceptual Captions: 3.3M image-text pairs
config = HFEagerConfig(name="conceptual_captions", split="train", streaming=True)

Streaming vs Non-Streaming Mode

Streaming Mode

# Streaming: Downloads data on-demand
config = HFEagerConfig(
    name="imagenet-1k",
    split="train",
    streaming=True,  # No full download
)
source = HFEagerSource(config, rngs=nnx.Rngs(0))

# Advantages:
# - No large upfront download
# - Lower disk space usage
# - Start training immediately

# Disadvantages:
# - Requires network connection
# - May have variable latency
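To make "downloads data on-demand" concrete, the sketch below batches a lazy record stream and counts how many records are actually pulled. It is a generic illustration in plain Python, not Datarax's implementation; batched and records are hypothetical names, and the generator stands in for a remote dataset.

```python
from itertools import islice


def batched(stream, batch_size):
    """Yield lists of up to batch_size items, pulling from stream lazily."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pulled = []  # records every item the stream has served so far


def records():
    """Stand-in for a remote stream of one million records."""
    for i in range(1_000_000):
        pulled.append(i)
        yield {"label": i % 10}


first = next(batched(records(), batch_size=32))
print(len(first), len(pulled))  # 32 32
```

Only the 32 records needed for the first batch are produced. This is also why a streaming source often cannot report its length: nothing beyond the current position has been read.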

Non-Streaming Mode

# Non-streaming: Downloads full dataset first
config = HFEagerConfig(
    name="mnist",
    split="train",
    streaming=False,
)
source = HFEagerSource(config, rngs=nnx.Rngs(0))

# Advantages:
# - Faster iteration (local access)
# - Works offline after download
# - Deterministic ordering

# Disadvantages:
# - Large upfront download
# - Requires disk space
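The trade-offs listed for the two modes can be turned into a simple rule of thumb for setting the streaming flag. The sketch below is an assumption-laden illustration: should_stream is a hypothetical helper, the headroom factor of 2.0 is an arbitrary choice, and it does not describe how Datarax or from_hf selects a mode.

```python
def should_stream(dataset_bytes, free_disk_bytes, headroom=2.0):
    """Suggest streaming when the dataset will not fit comfortably on disk.

    headroom is the assumed ratio of free space to dataset size that is
    required before a full download is considered safe.
    """
    if dataset_bytes is None:
        return True  # unknown size: avoid committing to a full download
    return dataset_bytes * headroom > free_disk_bytes


GB = 1024**3
print(should_stream(50 * 1024**2, 100 * GB))  # False: small dataset, download it
print(should_stream(150 * GB, 100 * GB))      # True: larger than the free space
```

The free-space figure can be obtained from shutil.disk_usage(path).free, and the result passed as streaming=should_stream(...) when building the config.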

Results Summary

| Feature | Value |
|---|---|
| Dataset | MNIST from HuggingFace Hub |
| Mode | Streaming (no full download) |
| Batch Size | 32 |
| Output Shape | (32, 28, 28, 1) |
| Normalization | [0, 255] → [0, 1] |

HuggingFace Integration Benefits

  • Access to 100,000+ datasets across all domains
  • Automatic caching and versioning
  • Streaming for large datasets (TB-scale)
  • Seamless Datarax pipeline integration
  • Community-maintained datasets

Dataset Discovery

Explore datasets at HuggingFace Hub:

# Search datasets by keyword
# Visit: https://huggingface.co/datasets?search=mnist

# View dataset card for details
# Visit: https://huggingface.co/datasets/mnist

Next Steps