Array Record Source¤

Data source for ArrayRecord format (efficient record-oriented storage with parallel random access).

datarax.sources.array_record_source ¤

Data source for reading from ArrayRecord format files.

logger `module-attribute` ¤

logger = logging.getLogger(__name__)

ArrayRecordSourceConfig `dataclass` ¤

ArrayRecordSourceConfig(cacheable: bool = False, batch_stats_fn: Callable | Module | None = None, precomputed_stats: dict[str, Any] | None = None, stochastic: bool = False, stream_name: str | None = None, seed: int = 42, num_epochs: int = -1, shuffle_files: bool = False, local_files_only: bool = False)

Bases: StructuralConfig

Configuration for ArrayRecordSourceModule.

Inherits from StructuralConfig for runtime immutability.

Attributes:

Name	Type	Description
`seed`	`int`	Random seed for shuffling (used internally, not by Grain).
`num_epochs`	`int`	Number of epochs (-1 for infinite).
`shuffle_files`	`bool`	Whether to shuffle file order (handled internally).
`local_files_only`	`bool`	If True, validate every path exists at construction time and raise `FileNotFoundError` with path context if any are missing. ArrayRecord sources never download, so this flag is primarily a UX improvement over Grain's lower-level errors.
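
The `local_files_only` behaviour described in the table can be pictured as a fail-fast existence check. The block below is a minimal sketch, not datarax's implementation: `validate_local_paths` is a hypothetical helper, and it assumes every entry is a concrete file path rather than a glob pattern.

```python
from __future__ import annotations

import os


def validate_local_paths(paths: str | list[str]) -> list[str]:
    """Hypothetical helper: fail fast if any path is missing."""
    # Accept a single path or a list, mirroring the `paths` argument type.
    path_list = [paths] if isinstance(paths, str) else list(paths)
    missing = [p for p in path_list if not os.path.exists(p)]
    if missing:
        # Name every missing path so the error is actionable.
        raise FileNotFoundError(
            f"{len(missing)} of {len(path_list)} ArrayRecord path(s) not found: {missing}"
        )
    return path_list
```

Checking at construction time surfaces a bad path before any reader is opened, which is the UX benefit the flag description refers to.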

seed `class-attribute` `instance-attribute` ¤

seed: int = 42

num_epochs `class-attribute` `instance-attribute` ¤

num_epochs: int = -1

shuffle_files `class-attribute` `instance-attribute` ¤

shuffle_files: bool = False

local_files_only `class-attribute` `instance-attribute` ¤

local_files_only: bool = False

cacheable `class-attribute` `instance-attribute` ¤

cacheable: bool = False

batch_stats_fn `class-attribute` `instance-attribute` ¤

batch_stats_fn: Callable | Module | None = None

precomputed_stats `class-attribute` `instance-attribute` ¤

precomputed_stats: dict[str, Any] | None = None

stochastic `class-attribute` `instance-attribute` ¤

stochastic: bool = False

stream_name `class-attribute` `instance-attribute` ¤

stream_name: str | None = None

ArrayRecordSourceModule ¤

ArrayRecordSourceModule(config: ArrayRecordSourceConfig, paths: str | list[str], *, rngs: Rngs | None = None, name: str | None = None)

Bases: DataSourceModule

Stateful wrapper for Grain's ArrayRecordDataSource.

This module wraps Grain's ArrayRecordDataSource and keeps its iteration state (current index, current epoch, shuffle permutation) in NNX Variables, so that state can be saved and restored through get_state() and set_state().

Note: Grain's ArrayRecordDataSource doesn't accept a seed parameter directly. Shuffling is handled at the sampler level or through file ordering.

Resource management: the underlying ArrayRecord readers hold C++ file handles that are not reliably freed by garbage collection. Use the module as a context manager (with ArrayRecordSourceModule(...) as source:) or call close() explicitly between phases to avoid "Too many open files" on long-running jobs.

Parameters:

Name	Type	Description	Default
`config`	`ArrayRecordSourceConfig`	Configuration for the source.	required
`paths`	`str \| list[str]`	Path pattern or list of paths to ArrayRecord files.	required
`rngs`	`Rngs \| None`	NNX Rngs for additional randomness.	`None`
`name`	`str \| None`	Optional name for the module.	`None`

config `instance-attribute` ¤

config: ArrayRecordSourceConfig

grain_source `instance-attribute` ¤

grain_source = grain.sources.ArrayRecordDataSource(paths=paths)

current_index `instance-attribute` ¤

current_index = nnx.Variable(0)

current_epoch `instance-attribute` ¤

current_epoch = nnx.Variable(0)

total_records `instance-attribute` ¤

total_records = nnx.Variable(len(self.grain_source))

prefetch_cache `instance-attribute` ¤

prefetch_cache: Variable[dict[str, Any]] = nnx.Variable({})

iterator_initialized `instance-attribute` ¤

iterator_initialized = nnx.Variable(False)

shuffled_indices `instance-attribute` ¤

shuffled_indices: Variable[ndarray | None] = nnx.Variable(None)

rngs `instance-attribute` ¤

rngs = rngs

name `instance-attribute` ¤

name = nnx.static(name)

stochastic `instance-attribute` ¤

stochastic = config.stochastic

stream_name `instance-attribute` ¤

stream_name = config.stream_name

get_state ¤

get_state() -> dict[str, Any]

Get complete state for checkpointing.

set_state ¤

set_state(state: dict[str, Any]) -> None

Restore state from checkpoint.
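
A checkpoint round trip with get_state() and set_state() might look like the sketch below. `ToySource` is a stand-in written for this example, not the datarax class, and the state keys shown are assumptions; the real method may return additional fields.

```python
from __future__ import annotations

import copy
from typing import Any


class ToySource:
    """Stand-in for a stateful source; not the datarax class."""

    def __init__(self, total_records: int) -> None:
        self.total_records = total_records
        self.current_index = 0
        self.current_epoch = 0

    def advance(self, n: int) -> None:
        # Move forward n records, rolling over into the next epoch.
        position = self.current_index + n
        self.current_epoch += position // self.total_records
        self.current_index = position % self.total_records

    def get_state(self) -> dict[str, Any]:
        # Assumed keys, chosen for illustration only.
        return {
            "current_index": self.current_index,
            "current_epoch": self.current_epoch,
        }

    def set_state(self, state: dict[str, Any]) -> None:
        self.current_index = state["current_index"]
        self.current_epoch = state["current_epoch"]


source = ToySource(total_records=10)
source.advance(13)  # index 3, epoch 1
checkpoint = copy.deepcopy(source.get_state())

# A fresh instance resumes from the saved position.
resumed = ToySource(total_records=10)
resumed.set_state(checkpoint)
```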

get_batch_at ¤

get_batch_at(start: int, size: int, key: Any | None = None) -> list[Any]

Stateless indexed batch access for Pipeline-driven iteration.

Returns size records starting at logical position start, wrapping at the end of the dataset and applying any active shuffle permutation. Does not advance self.current_index or any other internal state.

ArrayRecord records are loaded host-side (Grain is a Python library), so this method requires a concrete Python int for start. Driving an ArrayRecord source under nnx.scan (Tier C of the pipeline integration story) would require wrapping the host-side fetch in jax.experimental.io_callback, which is not implemented yet and is left as a future enhancement. Tier A (Python iteration) and Tier B (single step()) work today.

Parameters:

Name	Type	Description	Default
`start`	`int`	Concrete starting index (Python int).	required
`size`	`int`	Number of records to return.	required
`key`	`Any \| None`	Reserved for future shuffled-mode support; currently ignored (shuffle uses `self.shuffled_indices`).	`None`

Returns:

Type	Description
`list[Any]`	List of `size` records as returned by the underlying Grain source. Records are typically Python dicts; callers (typically a parse / decode operator) handle structure.

Raises:

Type	Description
`TypeError`	If `start` is a JAX tracer (not host-side concrete). ArrayRecord cannot be traced through `nnx.scan` without an io_callback wrapper.
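
The wrapping and shuffle behaviour of get_batch_at can be illustrated with plain Python. This is a sketch under assumptions: `batch_at` is a hypothetical free function, the dataset is non-empty, and the permutation is applied after wrapping, which the description above does not spell out.

```python
from __future__ import annotations

from typing import Any, Sequence


def batch_at(
    records: Sequence[Any],
    start: int,
    size: int,
    permutation: Sequence[int] | None = None,
) -> list[Any]:
    """Hypothetical stand-in for stateless, wrapping batch access."""
    if not isinstance(start, int):
        # Mirrors the documented TypeError for non-concrete starts.
        raise TypeError(
            f"start must be a concrete Python int, got {type(start).__name__}"
        )
    n = len(records)
    # Logical positions wrap at the end of the dataset.
    positions = [(start + offset) % n for offset in range(size)]
    if permutation is not None:
        # An active shuffle maps logical positions to record indices.
        positions = [permutation[p] for p in positions]
    # No internal cursor is read or written, so calls are repeatable.
    return [records[p] for p in positions]


records = ["r0", "r1", "r2", "r3", "r4"]
wrapped = batch_at(records, start=3, size=4)
shuffled = batch_at(records, start=3, size=4, permutation=[4, 2, 0, 3, 1])
```

Here `wrapped` is `["r3", "r4", "r0", "r1"]` and `shuffled` is `["r3", "r1", "r4", "r2"]`. Because nothing is mutated, calling the function twice with the same arguments returns the same batch.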

close ¤

close() -> None

Release the underlying ArrayRecord C++ file handles.

ArrayRecord readers hold C++ file handles that Python garbage collection does not reliably release; long-running pipelines that repeatedly create sources can exhaust the file-descriptor limit ("Too many open files"). Call close() — or use the source as a context manager — between phases that open new sources.

Delegates to Grain's own cleanup: ArrayRecordDataSource.close() on grain >= 0.2.19, falling back to its context-manager __exit__ on grain 0.2.18. Idempotent and safe to call multiple times.
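
The idempotent close-with-fallback pattern described above can be sketched without Grain. Everything in this block is hypothetical: `ClosingWrapper` and the two toy reader classes only model a reader that exposes close() and one that supports only the context-manager protocol.

```python
from __future__ import annotations

from typing import Any


class ClosingWrapper:
    """Hypothetical wrapper showing idempotent close with a fallback."""

    def __init__(self, reader: Any) -> None:
        self._reader = reader
        self._closed = False

    def close(self) -> None:
        if self._closed:
            return  # Idempotent: later calls are no-ops.
        close_fn = getattr(self._reader, "close", None)
        if callable(close_fn):
            close_fn()  # Preferred path when the reader exposes close().
        elif hasattr(self._reader, "__exit__"):
            # Fallback: release through the context-manager protocol.
            self._reader.__exit__(None, None, None)
        self._closed = True

    def __enter__(self) -> "ClosingWrapper":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


class ReaderWithClose:
    """Toy reader that exposes close()."""

    def __init__(self) -> None:
        self.released = 0

    def close(self) -> None:
        self.released += 1


class ReaderWithExitOnly:
    """Toy reader that only supports the context-manager protocol."""

    def __init__(self) -> None:
        self.released = 0

    def __enter__(self) -> "ReaderWithExitOnly":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.released += 1


new_style, old_style = ReaderWithClose(), ReaderWithExitOnly()
with ClosingWrapper(new_style) as wrapper:
    pass  # Work with the source here.
wrapper.close()  # Second call does nothing.
ClosingWrapper(old_style).close()
```

Each toy reader is released exactly once, even though close() runs twice on the first wrapper.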

get_operation_stats ¤

get_operation_stats() -> dict[str, int]

Get operation statistics.

Note: This method converts JAX arrays to Python ints for introspection. It is intended for use outside of JIT-compiled functions.

Returns:

Type	Description
`dict[str, int]`	Dictionary with 'applied_count' and 'skipped_count'

reset_operation_stats ¤

reset_operation_stats() -> None

Reset operation statistics to zero.

Note: Creates new JAX arrays to reset the counters.

compute_statistics ¤

compute_statistics(data: Any) -> dict[str, Any] | None

Compute statistics from data using batch_stats_fn.

If batch_stats_fn is not configured, returns None. Computed statistics are cached in _computed_stats.

Parameters:

Name	Type	Description	Default
`data`	`Any`	Input data to compute statistics from	required

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary of statistics, or None if no batch_stats_fn configured

get_statistics ¤

get_statistics() -> dict[str, Any] | None

Get current statistics.

Returns precomputed_stats if configured (unless reset was called), otherwise returns cached computed statistics, or None if no statistics available.

Returns:

Type	Description
`dict[str, Any] \| None`	Dictionary of statistics, or None if no statistics available

set_statistics ¤

set_statistics(stats: dict[str, Any]) -> None

Manually set statistics.

This overwrites any previously computed statistics and clears the reset flag.

Parameters:

Name	Type	Description	Default
`stats`	`dict[str, Any]`	Dictionary of statistics to set	required

reset_statistics ¤

reset_statistics() -> None

Reset all statistics to None.

This clears both computed statistics and marks that precomputed_stats should be ignored (via internal flag). After reset, get_statistics() will return None until new statistics are set or computed.
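
The precedence rules for compute_statistics, get_statistics, set_statistics and reset_statistics can be modelled in a few lines. `StatsHolder` is a hypothetical stand-in, and `_reset` is an assumed name for the internal flag; only the ordering (precomputed first unless reset, then cached computed values, then None) is taken from the descriptions above.

```python
from __future__ import annotations

from typing import Any, Callable


class StatsHolder:
    """Hypothetical model of the statistics precedence."""

    def __init__(
        self,
        precomputed_stats: dict[str, Any] | None = None,
        batch_stats_fn: Callable[[Any], dict[str, Any]] | None = None,
    ) -> None:
        self._precomputed = precomputed_stats
        self._batch_stats_fn = batch_stats_fn
        self._computed_stats: dict[str, Any] | None = None
        self._reset = False  # Assumed name for the internal reset flag.

    def compute_statistics(self, data: Any) -> dict[str, Any] | None:
        if self._batch_stats_fn is None:
            return None
        # Cache the result so get_statistics can return it later.
        self._computed_stats = self._batch_stats_fn(data)
        return self._computed_stats

    def get_statistics(self) -> dict[str, Any] | None:
        # Precomputed values win unless a reset asked to ignore them.
        if self._precomputed is not None and not self._reset:
            return self._precomputed
        return self._computed_stats

    def set_statistics(self, stats: dict[str, Any]) -> None:
        self._computed_stats = stats
        self._reset = False

    def reset_statistics(self) -> None:
        self._computed_stats = None
        self._reset = True


holder = StatsHolder(
    precomputed_stats={"mean": 0.5},
    batch_stats_fn=lambda xs: {"mean": sum(xs) / len(xs)},
)
first = holder.get_statistics()          # precomputed stats
holder.reset_statistics()
after_reset = holder.get_statistics()    # None until set or computed
holder.compute_statistics([1.0, 3.0])
after_compute = holder.get_statistics()  # cached computed stats
```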

reset_cache ¤

reset_cache() -> None

Clear the cache.

Only has effect if cacheable=True in config.

copy ¤

copy(*, config: DataraxModuleConfig | None = None, rngs: Rngs | None = None, name: str | None = None) -> DataraxModule

Create a copy of this module with optional config/parameter changes.

This allows creating a new module instance with modified configuration while preserving other attributes. Useful for hyperparameter tuning.

Parameters:

Name	Type	Description	Default
`config`	`DataraxModuleConfig \| None`	New config (if None, uses current config)	`None`
`rngs`	`Rngs \| None`	New RNG state (if None, uses current rngs)	`None`
`name`	`str \| None`	New name (if None, uses current name)	`None`

Returns:

Type	Description
`DataraxModule`	New module instance with updated parameters

Examples:

# Change configuration
new_config = DataraxModuleConfig(cacheable=True)
new_module = module.copy(config=new_config)

# Change name only
renamed = module.copy(name="new_name")

Note

Subclasses can override this method to provide more fine-grained control over copying, such as allowing individual config field updates without requiring dataclass replace().

clone ¤

clone() -> DataraxModule

Create a new instance with the same state as this module.

Uses NNX's clone function for proper deep cloning of all state.

Returns:

Type	Description
`DataraxModule`	A new module instance with the same state.

requires_rng_streams ¤

requires_rng_streams() -> list[str] | None

Get the list of RNG streams required by this module.

Returns:

Type	Description
`list[str] \| None`	A list of required RNG stream names, or None if no RNG streams are required.

ensure_rng_streams ¤

ensure_rng_streams(stream_names: list[str]) -> None

Ensure that the required RNG streams are available.

Parameters:

Name	Type	Description	Default
`stream_names`	`list[str]`	A list of available RNG stream names.	required

Raises:

Type	Description
`ValueError`	If a required RNG stream is not available.
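
The check performed by ensure_rng_streams amounts to comparing the required stream names against the available ones. The function below is a hypothetical, standalone sketch of that comparison; the real method reads the required names from the module itself.

```python
from __future__ import annotations


def ensure_streams(required: list[str] | None, available: list[str]) -> None:
    """Hypothetical check: every required stream must be available."""
    # None means the module needs no RNG streams at all.
    missing = [name for name in (required or []) if name not in available]
    if missing:
        raise ValueError(
            f"Missing required RNG stream(s): {missing}; available: {available}"
        )


ensure_streams(None, ["default"])                   # nothing required
ensure_streams(["shuffle"], ["default", "shuffle"])  # requirement satisfied
```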

process ¤

process(input: Any, *args: Any, **kwargs: Any) -> Any

Process input structure.

This method transforms the structure/organization of input data without modifying the data values themselves.

Subclasses MUST implement this method.

The input/output types depend on the specific structural processor:

Batcher: list[Element] -> list[Batch]
Sampler: int -> list[int]
Sharder: Batch -> Sharded[Batch]
Splitter: Dataset -> tuple[Dataset, Dataset]

Parameters:

Name	Type	Description	Default
`input`	`Any`	Input to process (type varies by processor)	required
`*args`	`Any`	Additional positional arguments (processor-specific)	`()`
`**kwargs`	`Any`	Additional keyword arguments (processor-specific)	`{}`

Returns:

Type	Description
`Any`	Processed output (type varies by processor)

Examples:

Batcher implementation:

def process(self, elements: list[Element]) -> list[Batch]:
    batches = []
    for i in range(0, len(elements), self.config.batch_size):
        batch_elements = elements[i:i + self.config.batch_size]
        batches.append(Batch.from_elements(batch_elements))
    return batches

Sampler implementation (deterministic):

def process(self, dataset_size: int) -> list[int]:
    return list(range(min(self.config.num_samples, dataset_size)))

Sampler implementation (stochastic):

def process(self, dataset_size: int) -> list[int]:
    rng = self.rngs[self.config.stream_name]()
    indices = jax.random.choice(
        rng, dataset_size, shape=(self.config.num_samples,),
        replace=self.config.replacement
    )
    return indices.tolist()

supports_indexed_access ¤

supports_indexed_access() -> bool

Whether the source implements random-access get_batch_at.

Random-access (eager) sources return True; forward-only streaming sources return False, so the Pipeline drives them sequentially via get_batch instead of the jitted, indexed step.

Returns:

Type	Description
`bool`	`False` on the base class; random-access subclasses override it.
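
A caller can branch on supports_indexed_access() roughly as sketched below. The toy classes and the `fetch` helper are hypothetical, and the `get_batch(size)` signature is an assumption made for this example; only the two method names and the True/False contract come from the description above.

```python
from __future__ import annotations

from typing import Any, Iterable


class IndexedSource:
    """Toy random-access source."""

    def __init__(self, records: Iterable[Any]) -> None:
        self._records = list(records)

    def supports_indexed_access(self) -> bool:
        return True

    def get_batch_at(self, start: int, size: int) -> list[Any]:
        n = len(self._records)
        return [self._records[(start + i) % n] for i in range(size)]


class StreamingSource:
    """Toy forward-only source."""

    def __init__(self, records: Iterable[Any]) -> None:
        self._iterator = iter(records)

    def supports_indexed_access(self) -> bool:
        return False

    def get_batch(self, size: int) -> list[Any]:  # Assumed signature.
        return [next(self._iterator) for _ in range(size)]


def fetch(source: Any, step: int, batch_size: int) -> list[Any]:
    """Hypothetical dispatch between indexed and sequential access."""
    if source.supports_indexed_access():
        # Random access: the step number fully determines the batch.
        return source.get_batch_at(step * batch_size, batch_size)
    # Forward-only: the step number is ignored; the source keeps its cursor.
    return source.get_batch(batch_size)


indexed_batch = fetch(IndexedSource(range(6)), step=1, batch_size=2)
stream = StreamingSource(range(6))
stream_first = fetch(stream, step=0, batch_size=2)
stream_second = fetch(stream, step=5, batch_size=2)
```

The indexed source returns `[2, 3]` for step 1, while the streaming source returns `[0, 1]` and then `[2, 3]` regardless of the step passed in.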

element_spec ¤

element_spec() -> Any

Return a PyTree of jax.ShapeDtypeStruct describing per-element output.

Downstream consumers (operators, batchers, models) use this contract to pre-allocate buffers, auto-size learnable layers, and statically validate operator chains. Subclasses MUST override this method.

Returns:

Type	Description
`Any`	A PyTree (typically a dict) whose leaves are `jax.ShapeDtypeStruct` instances describing one emitted element.

Raises:

Type	Description
`NotImplementedError`	Always, on the base class. Subclasses must override.
