calibrax.monitoring
Runtime monitoring and alerting. AlertManager handles alert collection and
dispatch, AdvancedMonitor adds threshold-based background monitoring, and
ProductionMonitor extends it with pipeline execution tracking and health
reports.
Monitor
calibrax.monitoring.monitor
Alert management and background metric monitoring.
Provides threshold-based alerting with configurable handlers, and monitors system resources in the background on a daemon thread.
AlertSeverity
Bases: StrEnum
Severity levels for monitoring alerts.
Alert(*, message, severity, metric_name, metric_value, threshold, timestamp=time.time(), metadata=dict())
dataclass
A single monitoring alert triggered by a threshold violation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `message` | `str` | Human-readable description of the alert. |
| `severity` | `AlertSeverity` | Alert severity level. |
| `metric_name` | `str` | Name of the metric that triggered the alert. |
| `metric_value` | `float` | Observed value that triggered the alert. |
| `threshold` | `float` | Threshold that was exceeded. |
| `timestamp` | `float` | When the alert was triggered. |
| `metadata` | `dict[str, Any]` | Additional context about the alert. |
to_dict()
Serialize to a JSON-compatible dictionary.
AlertManager(max_alerts=1000)
Thread-safe alert storage with callback handlers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max_alerts` | `int` | Maximum number of alerts to retain (oldest dropped first). | `1000` |
Initialize the alert manager.
add_alert_handler(handler)
Register a callback invoked on each new alert.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `handler` | `Callable[[Alert], None]` | Callable that receives an Alert instance. | required |
trigger_alert(message, severity, metric_name, metric_value, threshold, metadata=None)
Create and store an alert, notifying all registered handlers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `message` | `str` | Human-readable alert description. | required |
| `severity` | `AlertSeverity` | Severity level. | required |
| `metric_name` | `str` | Metric that triggered the alert. | required |
| `metric_value` | `float` | Observed metric value. | required |
| `threshold` | `float` | Threshold that was exceeded. | required |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |
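To illustrate how handler registration and dispatch fit together, here is a minimal stand-in, not the calibrax implementation. `SketchDispatcher` is hypothetical, alerts are plain dictionaries, and the severity string and metric name are invented for the example. Whether the real `trigger_alert` isolates exceptions raised by a handler is not stated on this page, so the sketch simply calls each handler in registration order.

```python
from typing import Any, Callable, Optional


class SketchDispatcher:
    """Hypothetical stand-in showing handler registration and dispatch."""

    def __init__(self) -> None:
        self._handlers: list[Callable[[dict], None]] = []

    def add_alert_handler(self, handler: Callable[[dict], None]) -> None:
        self._handlers.append(handler)

    def trigger_alert(
        self,
        message: str,
        severity: str,
        metric_name: str,
        metric_value: float,
        threshold: float,
        metadata: Optional[dict[str, Any]] = None,
    ) -> dict:
        alert = {
            "message": message,
            "severity": severity,
            "metric_name": metric_name,
            "metric_value": metric_value,
            "threshold": threshold,
            "metadata": dict(metadata or {}),  # copy so callers keep their dict
        }
        # Notify every registered handler, in registration order.
        for handler in self._handlers:
            handler(alert)
        return alert


received: list[dict] = []
dispatcher = SketchDispatcher()
dispatcher.add_alert_handler(received.append)
dispatcher.add_alert_handler(
    lambda alert: print(f"[{alert['severity']}] {alert['message']}")
)
dispatcher.trigger_alert("CPU usage high", "warning", "cpu_percent", 97.5, 90.0)
```

Any callable that accepts a single alert argument works as a handler, so a list's `append`, a logger method, or a function that posts to a chat webhook can all be registered the same way.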
get_recent_alerts(count=10)
Return the most recent alerts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | Maximum number of alerts to return. | `10` |
Returns:
| Type | Description |
|---|---|
| `list[Alert]` | List of recent alerts, newest first. |
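The retention rule ("oldest dropped first") and the return order ("newest first") can be sketched with a bounded `collections.deque` guarded by a lock. This is one plausible way to get the documented behavior, not a description of how calibrax stores alerts. `SketchAlertStore` and its `add` method are hypothetical names.

```python
import threading
from collections import deque


class SketchAlertStore:
    """Hypothetical bounded, thread-safe store with newest-first reads."""

    def __init__(self, max_alerts: int = 1000) -> None:
        # A deque with maxlen silently discards the oldest item when full.
        self._alerts: deque = deque(maxlen=max_alerts)
        self._lock = threading.Lock()

    def add(self, alert) -> None:
        with self._lock:
            self._alerts.append(alert)

    def get_recent_alerts(self, count: int = 10) -> list:
        with self._lock:
            newest_first = list(reversed(self._alerts))
        return newest_first[:count]


store = SketchAlertStore(max_alerts=3)
for i in range(5):
    store.add(f"alert-{i}")

recent = store.get_recent_alerts(count=2)
everything = store.get_recent_alerts()
```

After five insertions into a store capped at three, only the last three alerts remain, and a request for two returns the two most recent with the latest one first.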
get_alerts_by_severity(severity)
Return all alerts matching the given severity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `severity` | `AlertSeverity` | Severity level to filter by. | required |
Returns:
| Type | Description |
|---|---|
| `list[Alert]` | List of matching alerts. |
clear_alerts()
Remove all stored alerts.
AdvancedMonitor(alert_manager=None, gpu_profiler=None, resource_monitor=None)
Background resource monitor with threshold-based alerting.
Collects CPU, memory, and optional GPU metrics on a daemon thread. Triggers alerts when thresholds are exceeded.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `alert_manager` | `AlertManager \| None` | Alert manager for dispatching alerts. Created if not provided. | `None` |
| `gpu_profiler` | `GPUProfilerProtocol \| None` | Optional GPU profiler for GPU metrics. | `None` |
| `resource_monitor` | `ResourceMonitor \| None` | Optional ResourceMonitor for background sampling. | `None` |
Initialize the monitor.
alert_manager
property
Access the underlying alert manager.
set_threshold(metric_name, threshold)
Set an alerting threshold for a metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metric_name` | `str` | Name of the metric to monitor. | required |
| `threshold` | `float` | Value above which an alert is triggered. | required |
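The threshold rule ("value above which an alert is triggered") amounts to a per-metric comparison. The sketch below shows that check in isolation. `check_thresholds` and the metric names are hypothetical, and the strict `>` comparison is an assumption drawn from the wording "above"; the library may treat equality differently.

```python
def check_thresholds(
    metrics: dict[str, float],
    thresholds: dict[str, float],
) -> list[tuple[str, float, float]]:
    """Return (metric_name, observed_value, threshold) for each violation."""
    violations = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        # Metrics without a configured threshold are ignored.
        if limit is not None and value > limit:
            violations.append((name, value, limit))
    return violations


thresholds = {"cpu_percent": 90.0, "memory_percent": 80.0}
sample = {"cpu_percent": 95.0, "memory_percent": 40.0, "gpu_util": 99.0}
violations = check_thresholds(sample, thresholds)
```

Each violation carries the three values that `trigger_alert` expects as `metric_name`, `metric_value`, and `threshold`.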
start_monitoring(interval=5.0)
Start background monitoring on a daemon thread.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `interval` | `float` | Seconds between metric collection cycles. | `5.0` |
stop_monitoring()
Stop background monitoring and wait for the thread to finish.
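A common way to build this start/stop pair is a daemon thread that loops until a `threading.Event` is set, with `stop_monitoring` setting the event and joining the thread. The sketch below assumes that pattern. `SketchBackgroundMonitor` and its `collect` callback are hypothetical, and the real class may be structured differently.

```python
import threading
import time
from typing import Callable, Optional


class SketchBackgroundMonitor:
    """Hypothetical stand-in for the start/stop lifecycle."""

    def __init__(self, collect: Callable[[], None]) -> None:
        self._collect = collect
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start_monitoring(self, interval: float = 5.0) -> None:
        if self._thread is not None and self._thread.is_alive():
            return  # already running
        self._stop_event.clear()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def _run(self, interval: float) -> None:
        while not self._stop_event.is_set():
            self._collect()
            # wait() returns early once stop is requested, unlike time.sleep().
            self._stop_event.wait(interval)

    def stop_monitoring(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None


samples: list[float] = []
monitor = SketchBackgroundMonitor(lambda: samples.append(time.monotonic()))
monitor.start_monitoring(interval=0.01)
time.sleep(0.05)
monitor.stop_monitoring()
```

Waiting on the event instead of sleeping means a stop request does not have to wait out a full interval. The daemon flag only ensures the thread cannot keep the interpreter alive if `stop_monitoring` is never called.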
get_monitoring_summary()
Return a summary of current monitoring state.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with thresholds, alert counts, and metric history summaries. |
Production Monitor
calibrax.monitoring.production
Production-grade monitoring with pipeline health tracking.
Extends AdvancedMonitor with performance baselines, pipeline execution tracking, and health report generation.
ProductionMonitor(**kwargs)
Bases: AdvancedMonitor
Extended monitor with pipeline health tracking and performance baselines.
Tracks pipeline execution times, success rates, and detects performance degradation against configured baselines.
Initialize the production monitor.
set_performance_baseline(metric_name, baseline_value)
Set a performance baseline for degradation detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metric_name` | `str` | Metric to track against baseline. | required |
| `baseline_value` | `float` | Expected baseline value. | required |
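This page does not say how degradation is measured against a baseline. One simple possibility, shown purely as an illustration, is the relative change from the baseline compared with a tolerance. The function names, the 20% tolerance, and the assumption that higher values are worse (as with latency) are all hypothetical.

```python
def degradation_ratio(observed: float, baseline: float) -> float:
    """Relative change from the baseline; 0.5 means 50% above it."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline


def is_degraded(observed: float, baseline: float, tolerance: float = 0.2) -> bool:
    """Assumption: the metric is 'higher is worse', e.g. execution time."""
    return degradation_ratio(observed, baseline) > tolerance


baseline_seconds = 1.0
slow_run = is_degraded(1.5, baseline_seconds)    # 50% slower than baseline
normal_run = is_degraded(1.1, baseline_seconds)  # within the 20% tolerance
```

For metrics where higher is better, such as throughput, the sign of the comparison would need to be reversed.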
record_pipeline_execution(pipeline_name, execution_time, success, metadata=None)
Record a pipeline execution for health tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pipeline_name` | `str` | Name of the pipeline that executed. | required |
| `execution_time` | `float` | Wall-clock execution time in seconds. | required |
| `success` | `bool` | Whether the execution succeeded. | required |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |
get_pipeline_health_report()
Generate a health report across all tracked pipelines.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with per-pipeline statistics and overall health status. |
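To make "per-pipeline statistics and overall health status" concrete, the sketch below aggregates recorded executions into success rates and mean execution times. It is a stand-in, not the calibrax implementation. The class name, the report keys (`pipelines`, `executions`, `success_rate`, `mean_execution_time`, `overall_status`), the status strings, and the 90% cut-off are all assumptions, and the real report layout may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Optional


class SketchPipelineTracker:
    """Hypothetical stand-in for pipeline execution tracking."""

    def __init__(self) -> None:
        self._runs: dict[str, list[tuple[float, bool]]] = defaultdict(list)

    def record_pipeline_execution(
        self,
        pipeline_name: str,
        execution_time: float,
        success: bool,
        metadata: Optional[dict[str, Any]] = None,
    ) -> None:
        # metadata is accepted for signature parity but unused in this sketch.
        self._runs[pipeline_name].append((execution_time, success))

    def get_pipeline_health_report(self, min_success_rate: float = 0.9) -> dict[str, Any]:
        pipelines = {}
        for name, runs in self._runs.items():
            successes = sum(1 for _, ok in runs if ok)
            pipelines[name] = {
                "executions": len(runs),
                "success_rate": successes / len(runs),
                "mean_execution_time": mean(t for t, _ in runs),
            }
        healthy = all(
            stats["success_rate"] >= min_success_rate for stats in pipelines.values()
        )
        return {
            "pipelines": pipelines,
            "overall_status": "healthy" if healthy else "degraded",
        }


tracker = SketchPipelineTracker()
tracker.record_pipeline_execution("etl", 1.0, True)
tracker.record_pipeline_execution("etl", 3.0, False)
tracker.record_pipeline_execution("ingest", 0.5, True)
report = tracker.get_pipeline_health_report()
```

In this run the hypothetical `etl` pipeline succeeded once in two attempts, so its success rate falls below the assumed cut-off and the overall status is reported as degraded.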