Roofline Model¤

Analyze whether an operation is limited by compute throughput or by memory bandwidth.

datarax.performance.roofline ¤

Roofline analysis for hardware-aware performance optimization.

This module provides tools for analyzing operations based on the roofline model to identify performance bottlenecks and suggest optimizations.
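As background, the roofline model bounds attainable throughput by the lower of two "roofs": the peak compute rate, and the memory bandwidth multiplied by the operation's arithmetic intensity (FLOPs per byte moved). The following is a minimal standalone sketch of that bound, not code from `datarax`. The function names are illustrative, and the A100 figures are the ones listed in `HARDWARE_SPECS`, assumed here to be in FLOP/s and bytes/s.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * bandwidth)


def is_compute_bound(intensity: float, peak_flops: float, bandwidth: float) -> bool:
    """True when the memory roof lies at or above the compute roof."""
    return intensity * bandwidth >= peak_flops


# A100 figures as listed in HARDWARE_SPECS (units assumed: FLOP/s and bytes/s).
A100_PEAK = 312e12
A100_BANDWIDTH = 1.55e12

low = attainable_flops(10.0, A100_PEAK, A100_BANDWIDTH)     # limited by memory
high = attainable_flops(1000.0, A100_PEAK, A100_BANDWIDTH)  # limited by compute
```

An operation at 10 FLOPs/byte sits on the sloped memory roof, while one at 1000 FLOPs/byte reaches the flat compute roof.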

logger module-attribute ¤

logger = getLogger(__name__)

HARDWARE_SPECS module-attribute ¤

HARDWARE_SPECS = {
    'tpu_v5e': HardwareSpecs(
        peak_flops_bf16=197000000000000.0,
        hbm_bandwidth=820000000000.0,
        vmem_bandwidth=18000000000000.0,
        critical_intensity=240,
        matrix_unit_size=(128, 128),
        optimal_batch_size=240,
        memory_layout='row_major',
        use_vmem_optimization=True,
        preferred_tile_size=128,
    ),
    'h100': HardwareSpecs(
        peak_flops_bf16=989000000000000.0,
        hbm_bandwidth=3350000000000.0,
        critical_intensity=298,
        tensor_core_shapes=[(16, 16, 8), (32, 8, 16)],
        optimal_batch_size=298,
        memory_layout='NHWC',
        use_vmem_optimization=False,
        preferred_tile_size=16,
    ),
    'a100': HardwareSpecs(
        peak_flops_bf16=312000000000000.0,
        hbm_bandwidth=1550000000000.0,
        critical_intensity=201,
        tensor_core_shapes=[(16, 16, 8)],
        optimal_batch_size=128,
        memory_layout='NHWC',
        use_vmem_optimization=False,
        preferred_tile_size=16,
    ),
    'cpu': HardwareSpecs(
        peak_flops_bf16=1000000000000.0,
        hbm_bandwidth=100000000000.0,
        critical_intensity=10,
        optimal_batch_size=32,
        memory_layout='row_major',
        use_vmem_optimization=False,
        preferred_tile_size=64,
    ),
}

HardwareSpecs dataclass ¤

HardwareSpecs(
    peak_flops_bf16: float,
    hbm_bandwidth: float,
    critical_intensity: float,
    optimal_batch_size: int,
    matrix_unit_size: tuple[int, int] | None = None,
    vmem_bandwidth: float | None = None,
    memory_layout: str = 'row_major',
    use_vmem_optimization: bool = False,
    tensor_core_shapes: list[tuple[int, int, int]] | None = None,
    preferred_tile_size: int = 128,
)

Hardware specifications for roofline analysis.

peak_flops_bf16 instance-attribute ¤

peak_flops_bf16: float

hbm_bandwidth instance-attribute ¤

hbm_bandwidth: float

critical_intensity instance-attribute ¤

critical_intensity: float
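In the roofline model, the critical intensity (or ridge point) is the arithmetic intensity at which an operation stops being memory-bound, which is the ratio of peak compute to memory bandwidth. The sketch below recomputes that ratio from the `peak_flops_bf16` and `hbm_bandwidth` values listed in `HARDWARE_SPECS`, assuming they are in FLOP/s and bytes/s. It is a standalone illustration: `ridge_point` is a hypothetical helper, not part of `datarax`, and the module may derive or round its stored values differently.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where the two roofs meet."""
    return peak_flops / bandwidth


# (peak_flops_bf16, hbm_bandwidth, listed critical_intensity) from HARDWARE_SPECS.
listed_specs = {
    "tpu_v5e": (197e12, 820e9, 240),
    "h100": (989e12, 3350e9, 298),
    "a100": (312e12, 1550e9, 201),
    "cpu": (1e12, 100e9, 10),
}

for name, (peak, bandwidth, listed) in listed_specs.items():
    computed = ridge_point(peak, bandwidth)
    print(f"{name}: computed {computed:.1f} FLOPs/byte, listed {listed}")
```

The listed values agree with the computed ratio to within roughly 1%.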

optimal_batch_size instance-attribute ¤

optimal_batch_size: int

matrix_unit_size class-attribute instance-attribute ¤

matrix_unit_size: tuple[int, int] | None = None

vmem_bandwidth class-attribute instance-attribute ¤

vmem_bandwidth: float | None = None

memory_layout class-attribute instance-attribute ¤

memory_layout: str = 'row_major'

use_vmem_optimization class-attribute instance-attribute ¤

use_vmem_optimization: bool = False

tensor_core_shapes class-attribute instance-attribute ¤

tensor_core_shapes: list[tuple[int, int, int]] | None = None

preferred_tile_size class-attribute instance-attribute ¤

preferred_tile_size: int = 128

RooflineAnalyzer ¤

RooflineAnalyzer(hardware: str = 'auto')

Analyze operations based on the roofline model for performance optimization.

Parameters:

Name      Type  Description                                                  Default
hardware  str   Target hardware ('tpu_v5e', 'h100', 'a100', 'cpu', 'auto')   'auto'

hardware_name instance-attribute ¤

hardware_name = hardware

hw_specs instance-attribute ¤

hw_specs = get(hardware, HARDWARE_SPECS['cpu'])
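The `hw_specs` expression above suggests a lookup that falls back to the `'cpu'` entry when the requested name is not found. The sketch below illustrates that pattern under the assumption that the `get` call is a dictionary lookup on `HARDWARE_SPECS`. `SPECS` and `resolve_specs` are hypothetical stand-ins, and how `'auto'` is resolved is not shown on this page.

```python
# Reduced stand-in for HARDWARE_SPECS, keeping one field per entry.
SPECS = {
    "a100": {"critical_intensity": 201},
    "cpu": {"critical_intensity": 10},
}


def resolve_specs(hardware: str) -> dict:
    """Return the entry for `hardware`, or the CPU entry if the name is unknown."""
    return SPECS.get(hardware, SPECS["cpu"])


known = resolve_specs("a100")
fallback = resolve_specs("some_unlisted_chip")  # silently resolves to the CPU entry
```

A silent fallback like this means a misspelled hardware name would yield CPU-based recommendations, so it is worth checking `hw_specs` after construction.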

analyze_operation ¤

analyze_operation(func: Callable, *args: Any, output_shape: tuple | None = None, **kwargs: Any) -> dict[str, Any]

Analyze a JAX operation using the roofline model.

Parameters:

Name          Type          Description                                   Default
func          Callable      Function to analyze                           required
args          Any           Arguments to the function                     ()
output_shape  tuple | None  Optional output shape for memory estimation   None
kwargs        Any           Keyword arguments to the function             {}

Returns:

Type            Description
dict[str, Any]  Analysis dict with performance metrics and recommendations
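The keys of the returned dict are not listed on this page. To show the kind of calculation a roofline analysis involves, here is a standalone sketch for a single matrix multiplication. It does not call `datarax`. The function name, the dict keys, and the simple cost model (each operand read once, the result written once, 2 bytes per element for bf16) are all assumptions made for illustration.

```python
def analyze_matmul(m: int, k: int, n: int,
                   bytes_per_element: int = 2,
                   critical_intensity: float = 240.0) -> dict:
    """Estimate FLOPs, bytes moved, and the likely bottleneck for (m, k) @ (k, n)."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A and B, write C
    intensity = flops / bytes_moved
    return {
        "flops": flops,
        "bytes": bytes_moved,
        "arithmetic_intensity": intensity,
        "bottleneck": "compute" if intensity >= critical_intensity else "memory",
    }


skinny = analyze_matmul(8, 1024, 1024)      # small batch against a large weight matrix
square = analyze_matmul(4096, 4096, 4096)   # large square matmul
print(skinny["bottleneck"], square["bottleneck"])
```

Under this cost model the skinny matmul has an intensity below 8 FLOPs/byte and is memory-bound on every listed device, while the large square matmul exceeds all of the listed critical intensities.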

find_optimal_batch_size ¤

find_optimal_batch_size(sample_input: Array, target_hardware: str | None = None) -> int

Find optimal batch size for compute-bound operation.

Parameters:

Name             Type        Description                        Default
sample_input     Array       Sample input tensor                required
target_hardware  str | None  Optional target hardware override  None

Returns:

Type  Description
int   Optimal batch size

optimize_for_arithmetic_intensity ¤

optimize_for_arithmetic_intensity(operation: Callable, target_intensity: float = 240) -> Callable

Optimize operation for target arithmetic intensity.

Parameters:

Name              Type      Description                  Default
operation         Callable  Operation to optimize        required
target_intensity  float     Target arithmetic intensity  240

Returns:

Type      Description
Callable  Optimized operation

optimize_shapes ¤

optimize_shapes(tensors: list[Array]) -> list[Array]

Optimize tensor shapes for hardware.

Parameters:

Name     Type         Description                  Default
tensors  list[Array]  List of tensors to optimize  required

Returns:

Type         Description
list[Array]  List of optimized tensors
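This page does not say what shape optimization involves. One common hardware-aware technique is to pad dimensions up to a multiple of the device's tile size, such as the `preferred_tile_size` field in `HardwareSpecs`. The sketch below computes padded shapes only. Both helpers are hypothetical, padding just the trailing two dimensions is an assumption, and `optimize_shapes` may behave differently.

```python
def round_up(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def padded_shape(shape: tuple[int, ...], tile: int = 128) -> tuple[int, ...]:
    """Pad the trailing (up to two) dimensions to a multiple of `tile`."""
    lead, tail = shape[:-2], shape[-2:]
    return tuple(lead) + tuple(round_up(dim, tile) for dim in tail)


tpu_shape = padded_shape((3, 100, 200), tile=128)  # tile size listed for tpu_v5e
gpu_shape = padded_shape((3, 100, 200), tile=16)   # tile size listed for h100 and a100
```

Padding trades extra memory and some wasted arithmetic for full utilization of the matrix units, so it pays off mainly when the padded fraction is small.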

cast_to_optimal_precision ¤

cast_to_optimal_precision(tensors: list[Array]) -> list[Array]

Cast tensors to optimal precision for hardware.

Parameters:

Name     Type         Description              Default
tensors  list[Array]  List of tensors to cast  required

Returns:

Type         Description
list[Array]  List of cast tensors