Benchmark Monitor¤
Real-time pipeline monitoring with GPU profiling and alerting.
See Also¤
- Benchmarking Overview - All benchmarking tools
- Monitoring - General monitoring
- Benchmarking Guide
Overview¤
This module provides three monitoring components:
- AdvancedMonitor: Real-time pipeline monitoring with configurable thresholds, background metric collection (CPU, memory, GPU), and trend analysis. Integrates with AlertManager for threshold-based alerts.
- ProductionMonitor: Extends AdvancedMonitor with production-specific features: performance baseline tracking, per-pipeline execution recording, error rate monitoring, and health reports.
- AlertManager: Manages threshold-based alerts with configurable severity levels and pluggable alert handlers.
Quick Start¤
from calibrax.monitoring import AdvancedMonitor
monitor = AdvancedMonitor()
# Configure thresholds
monitor.set_threshold("gpu_memory_utilization", 0.85)
monitor.set_threshold("memory_usage_mb", 4000)
# Start background monitoring (collects metrics every 5 seconds)
monitor.start_monitoring(interval=5.0)
# ... run your pipeline ...
# Stop and check results
monitor.stop_monitoring()
summary = monitor.get_monitoring_summary()
print(f"Alerts triggered: {summary['total_alerts']}")
print(f"Critical alerts: {summary['critical_alerts']}")
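If the pipeline raises between `start_monitoring` and `stop_monitoring`, the background thread is left running. One way to guard against that is a small context manager. The sketch below relies only on the two method names documented on this page; the `monitoring` helper and `StubMonitor` are hypothetical and not part of calibrax.

```python
from contextlib import contextmanager


@contextmanager
def monitoring(monitor, interval=5.0):
    """Run the body with background monitoring active, then always stop it."""
    monitor.start_monitoring(interval=interval)
    try:
        yield monitor
    finally:
        # Runs on normal exit and on exceptions alike.
        monitor.stop_monitoring()


class StubMonitor:
    """Stand-in used only to demonstrate the helper without calibrax installed."""

    def __init__(self):
        self.running = False

    def start_monitoring(self, interval=5.0):
        self.running = True

    def stop_monitoring(self):
        self.running = False


stub = StubMonitor()
try:
    with monitoring(stub, interval=1.0):
        raise RuntimeError("pipeline failed")
except RuntimeError:
    pass  # the monitor has already been stopped at this point
```

In real use, an `AdvancedMonitor` instance would presumably take the place of `StubMonitor`.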
calibrax.monitoring ¤
Monitoring: alerting, production monitoring, and threshold tracking.
AdvancedMonitor ¤
AdvancedMonitor(alert_manager: AlertManager | None = None, gpu_profiler: GPUProfilerProtocol | None = None, resource_monitor: ResourceMonitor | None = None)
Background resource monitor with threshold-based alerting.
Collects CPU, memory, and optional GPU metrics on a daemon thread. Triggers alerts when thresholds are exceeded.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `alert_manager` | `AlertManager \| None` | Alert manager for dispatching alerts. Created if not provided. | `None` |
| `gpu_profiler` | `GPUProfilerProtocol \| None` | Optional GPU profiler for GPU metrics. | `None` |
| `resource_monitor` | `ResourceMonitor \| None` | Optional ResourceMonitor for background sampling. | `None` |
set_threshold ¤
start_monitoring ¤
start_monitoring(interval: float = 5.0) -> None
Start background monitoring on a daemon thread.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `interval` | `float` | Seconds between metric collection cycles. | `5.0` |
stop_monitoring ¤
Stop background monitoring and wait for the thread to finish.
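The start/stop behaviour described above (a daemon thread that samples at a fixed interval and is joined on stop) can be sketched with the standard library alone. This is not calibrax's implementation: `SketchMonitor`, `read_metrics`, and `violations` are hypothetical names, and the "value greater than threshold" rule is an assumption about how thresholds are compared.

```python
from __future__ import annotations

import threading
import time
from typing import Callable


class SketchMonitor:
    """Illustrative stand-in, not calibrax's AdvancedMonitor."""

    def __init__(self, read_metrics: Callable[[], dict[str, float]]) -> None:
        self._read_metrics = read_metrics
        self._thresholds: dict[str, float] = {}
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None
        self.violations: list[tuple[str, float, float]] = []

    def set_threshold(self, metric_name: str, value: float) -> None:
        self._thresholds[metric_name] = value

    def start_monitoring(self, interval: float = 5.0) -> None:
        if self._thread is not None and self._thread.is_alive():
            return  # already running
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def stop_monitoring(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()  # wait for the sampling loop to exit
            self._thread = None

    def _run(self, interval: float) -> None:
        while not self._stop.is_set():
            for name, value in self._read_metrics().items():
                limit = self._thresholds.get(name)
                if limit is not None and value > limit:
                    self.violations.append((name, value, limit))
            # Event.wait doubles as a sleep that stop_monitoring can interrupt.
            self._stop.wait(interval)


monitor = SketchMonitor(lambda: {"memory_usage_mb": 5000.0})
monitor.set_threshold("memory_usage_mb", 4000)
monitor.start_monitoring(interval=0.01)
time.sleep(0.05)
monitor.stop_monitoring()
```

Waiting on the `Event` rather than calling `time.sleep` means a stop request does not have to wait out a full interval before the thread exits.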
Alert dataclass ¤
Alert(*, message: str, severity: AlertSeverity, metric_name: str, metric_value: float, threshold: float, timestamp: float = time(), metadata: dict[str, Any] = dict())
A single monitoring alert triggered by a threshold violation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `message` | `str` | Human-readable description of the alert. |
| `severity` | `AlertSeverity` | Alert severity level. |
| `metric_name` | `str` | Name of the metric that triggered the alert. |
| `metric_value` | `float` | Observed value that triggered the alert. |
| `threshold` | `float` | Threshold that was exceeded. |
| `timestamp` | `float` | When the alert was triggered. |
| `metadata` | `dict[str, Any]` | Additional context about the alert. |
metadata class-attribute instance-attribute ¤
AlertManager ¤
AlertManager(max_alerts: int = 1000)
Thread-safe alert storage with callback handlers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max_alerts` | `int` | Maximum number of alerts to retain (oldest dropped first). | `1000` |
add_alert_handler ¤
trigger_alert ¤
trigger_alert(message: str, severity: AlertSeverity, metric_name: str, metric_value: float, threshold: float, metadata: dict[str, Any] | None = None) -> None
Create and store an alert, notifying all registered handlers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `message` | `str` | Human-readable alert description. | *required* |
| `severity` | `AlertSeverity` | Severity level. | *required* |
| `metric_name` | `str` | Metric that triggered the alert. | *required* |
| `metric_value` | `float` | Observed metric value. | *required* |
| `threshold` | `float` | Threshold that was exceeded. | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |
get_recent_alerts ¤
get_alerts_by_severity ¤
get_alerts_by_severity(severity: AlertSeverity) -> list[Alert]
Return all alerts matching the given severity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `severity` | `AlertSeverity` | Severity level to filter by. | *required* |
Returns:
| Type | Description |
|---|---|
| `list[Alert]` | List of matching alerts. |
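The behaviour documented for AlertManager (bounded storage that drops the oldest alerts first, thread safety, handler callbacks, and filtering by severity) can be approximated in a few lines. The sketch below is an assumption about one reasonable design, not the library's code: `SketchAlertStore`, its method names, the dict-based alert records, and the string severities are all hypothetical.

```python
from __future__ import annotations

import threading
from collections import deque
from typing import Any, Callable

AlertRecord = dict[str, Any]


class SketchAlertStore:
    """Illustrative stand-in, not calibrax's AlertManager."""

    def __init__(self, max_alerts: int = 1000) -> None:
        # A deque with maxlen discards the oldest entry when it is full.
        self._alerts: deque[AlertRecord] = deque(maxlen=max_alerts)
        self._handlers: list[Callable[[AlertRecord], None]] = []
        self._lock = threading.Lock()

    def add_handler(self, handler: Callable[[AlertRecord], None]) -> None:
        with self._lock:
            self._handlers.append(handler)

    def trigger(self, message: str, severity: str, metric_name: str,
                metric_value: float, threshold: float) -> None:
        alert = {
            "message": message,
            "severity": severity,
            "metric_name": metric_name,
            "metric_value": metric_value,
            "threshold": threshold,
        }
        with self._lock:
            self._alerts.append(alert)
            handlers = list(self._handlers)
        # Call handlers outside the lock so a slow handler cannot block
        # other threads that are storing or reading alerts.
        for handler in handlers:
            handler(alert)

    def by_severity(self, severity: str) -> list[AlertRecord]:
        with self._lock:
            return [a for a in self._alerts if a["severity"] == severity]


received: list[AlertRecord] = []
store = SketchAlertStore(max_alerts=2)
store.add_handler(received.append)
store.trigger("first", "warning", "memory_usage_mb", 4100.0, 4000.0)
store.trigger("second", "critical", "gpu_memory_utilization", 0.95, 0.85)
store.trigger("third", "warning", "memory_usage_mb", 4200.0, 4000.0)
```

With `max_alerts=2`, the handler sees all three alerts, but only the two most recent remain in storage.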
ProductionMonitor ¤
ProductionMonitor(**kwargs: Any)
Bases: AdvancedMonitor
Extended monitor with pipeline health tracking and performance baselines.
Tracks pipeline execution times and success rates, and detects performance degradation against configured baselines.
set_threshold ¤
start_monitoring ¤
start_monitoring(interval: float = 5.0) -> None
Start background monitoring on a daemon thread.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `interval` | `float` | Seconds between metric collection cycles. | `5.0` |
stop_monitoring ¤
Stop background monitoring and wait for the thread to finish.
get_monitoring_summary ¤
set_performance_baseline ¤
record_pipeline_execution ¤
record_pipeline_execution(pipeline_name: str, execution_time: float, success: bool, metadata: dict[str, Any] | None = None) -> None
Record a pipeline execution for health tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pipeline_name` | `str` | Name of the pipeline that executed. | *required* |
| `execution_time` | `float` | Wall-clock execution time in seconds. | *required* |
| `success` | `bool` | Whether the execution succeeded. | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |
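To illustrate how recorded executions might feed error-rate monitoring and baseline comparison, here is a standalone sketch. It does not reproduce ProductionMonitor: `SketchPipelineTracker`, its method names, the report keys, and the `degradation_factor` rule (mean time more than 1.5 times the baseline) are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import fmean


class SketchPipelineTracker:
    """Illustrative stand-in, not calibrax's ProductionMonitor."""

    def __init__(self, degradation_factor: float = 1.5) -> None:
        self._factor = degradation_factor
        self._baselines: dict[str, float] = {}
        self._runs: defaultdict[str, list[tuple[float, bool]]] = defaultdict(list)

    def set_baseline(self, pipeline_name: str, seconds: float) -> None:
        self._baselines[pipeline_name] = seconds

    def record(self, pipeline_name: str, execution_time: float,
               success: bool) -> None:
        self._runs[pipeline_name].append((execution_time, success))

    def report(self, pipeline_name: str) -> dict[str, float | bool]:
        runs = self._runs.get(pipeline_name, [])
        if not runs:
            raise KeyError(f"no executions recorded for {pipeline_name!r}")
        mean_time = fmean(t for t, _ in runs)
        failures = sum(1 for _, ok in runs if not ok)
        baseline = self._baselines.get(pipeline_name)
        # Degraded only when a baseline exists and the mean exceeds it
        # by more than the configured factor.
        degraded = baseline is not None and mean_time > baseline * self._factor
        return {
            "executions": len(runs),
            "error_rate": failures / len(runs),
            "mean_time": mean_time,
            "degraded": degraded,
        }


tracker = SketchPipelineTracker()
tracker.set_baseline("etl", 1.0)
tracker.record("etl", 1.0, True)
tracker.record("etl", 3.0, False)
health = tracker.report("etl")
```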
monitor ¤
Alert management and background metric monitoring.
Provides threshold-based alerting with configurable handlers and background monitoring of system resources on a daemon thread.
This submodule defines Alert, AlertManager, and AdvancedMonitor; each is documented in full under calibrax.monitoring above.
production ¤
Production-grade monitoring with pipeline health tracking.
Extends AdvancedMonitor with performance baselines, pipeline execution tracking, and health report generation.
ProductionMonitor ¤
ProductionMonitor(**kwargs: Any)
Bases: AdvancedMonitor
Defined in this submodule; the constructor and the set_performance_baseline, record_pipeline_execution, set_threshold, start_monitoring, and stop_monitoring members are documented under calibrax.monitoring above.
get_pipeline_health_report ¤