calibrax.monitoring¤

Runtime monitoring and alerting. AlertManager collects alerts and dispatches them to registered handlers, AdvancedMonitor adds threshold-based background monitoring, and ProductionMonitor extends AdvancedMonitor with pipeline execution tracking and health reports.

Monitor¤

calibrax.monitoring.monitor ¤

Alert management and background metric monitoring.

Provides threshold-based alerting with configurable handlers, and background monitoring of system resources on a daemon thread.

AlertSeverity ¤

Bases: StrEnum

Severity levels for monitoring alerts.

Alert(*, message, severity, metric_name, metric_value, threshold, timestamp=time.time(), metadata=dict()) dataclass ¤

A single monitoring alert triggered by a threshold violation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `message` | `str` | Human-readable description of the alert. |
| `severity` | `AlertSeverity` | Alert severity level. |
| `metric_name` | `str` | Name of the metric that triggered the alert. |
| `metric_value` | `float` | Observed value that triggered the alert. |
| `threshold` | `float` | Threshold that was exceeded. |
| `timestamp` | `float` | When the alert was triggered. |
| `metadata` | `dict[str, Any]` | Additional context about the alert. |

to_dict() ¤

Serialize to a JSON-compatible dictionary.
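
As a rough illustration of the attributes and `to_dict()` described above, the stand-in dataclass below mirrors the documented fields. It is a sketch, not calibrax's implementation: `SketchSeverity` and its member names, `SketchAlert`, and the use of `default_factory` for the two defaulted fields are all assumptions made for the example.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any


class SketchSeverity(str, Enum):
    """Hypothetical stand-in for AlertSeverity; member names are assumed."""

    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class SketchAlert:
    """Stand-in with the same field names as the documented Alert."""

    message: str
    severity: SketchSeverity
    metric_name: str
    metric_value: float
    threshold: float
    # Assumption: each instance gets its own timestamp and metadata dict.
    timestamp: float = field(default_factory=time.time)
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Store the plain string value so the result is JSON-compatible.
        data["severity"] = self.severity.value
        return data


alert = SketchAlert(
    message="CPU usage high",
    severity=SketchSeverity.WARNING,
    metric_name="cpu_percent",
    metric_value=93.5,
    threshold=90.0,
)
payload = json.dumps(alert.to_dict())
```

The resulting `payload` string can be written to a log file or sent to an external system without further conversion.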

AlertManager(max_alerts=1000) ¤

Thread-safe alert storage with callback handlers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `max_alerts` | `int` | Maximum number of alerts to retain (oldest dropped first). | `1000` |

Initialize the alert manager.

add_alert_handler(handler) ¤

Register a callback invoked on each new alert.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `handler` | `Callable[[Alert], None]` | Callable that receives an Alert instance. | required |
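
A handler only needs to accept one alert object and return nothing. The sketch below shows one possible handler that formats the documented Alert attributes into a single line. The names `format_alert`, `collect_alert`, and `received` are hypothetical, and a `SimpleNamespace` stands in for a real Alert so the example runs without calibrax installed.

```python
from types import SimpleNamespace

received: list[str] = []


def format_alert(alert) -> str:
    """Render the documented Alert attributes as one log-style line."""
    return (
        f"[{alert.severity}] {alert.message}: "
        f"{alert.metric_name}={alert.metric_value} (threshold {alert.threshold})"
    )


def collect_alert(alert) -> None:
    """Handler with the Callable[[Alert], None] shape."""
    received.append(format_alert(alert))


# With calibrax installed, registration would presumably look like:
#     manager.add_alert_handler(collect_alert)
# Here the handler is called directly with a stand-in object instead.
fake_alert = SimpleNamespace(
    severity="warning",
    message="CPU usage high",
    metric_name="cpu_percent",
    metric_value=93.5,
    threshold=90.0,
)
collect_alert(fake_alert)
```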

trigger_alert(message, severity, metric_name, metric_value, threshold, metadata=None) ¤

Create and store an alert, notifying all registered handlers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `message` | `str` | Human-readable alert description. | required |
| `severity` | `AlertSeverity` | Severity level. | required |
| `metric_name` | `str` | Metric that triggered the alert. | required |
| `metric_value` | `float` | Observed metric value. | required |
| `threshold` | `float` | Threshold that was exceeded. | required |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |

get_recent_alerts(count=10) ¤

Return the most recent alerts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `count` | `int` | Maximum number of alerts to return. | `10` |

Returns:

| Type | Description |
| --- | --- |
| `list[Alert]` | List of recent alerts, newest first. |

get_alerts_by_severity(severity) ¤

Return all alerts matching the given severity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `severity` | `AlertSeverity` | Severity level to filter by. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[Alert]` | List of matching alerts. |

clear_alerts() ¤

Remove all stored alerts.
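
The retention and dispatch behaviour documented for AlertManager (bounded storage with oldest alerts dropped first, handler callbacks, newest-first retrieval) can be sketched with a `deque` and a lock. This is a minimal stand-in under stated assumptions, not the library's code: `SketchAlertManager` is hypothetical, alerts are plain dicts rather than Alert instances, and calling handlers outside the lock is a design choice of this sketch.

```python
import threading
from collections import deque


class SketchAlertManager:
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, max_alerts=1000):
        # A deque with maxlen discards the oldest entry once it is full.
        self._alerts = deque(maxlen=max_alerts)
        self._handlers = []
        self._lock = threading.Lock()

    def add_alert_handler(self, handler):
        with self._lock:
            self._handlers.append(handler)

    def trigger_alert(self, message, severity, metric_name,
                      metric_value, threshold, metadata=None):
        alert = {
            "message": message,
            "severity": severity,
            "metric_name": metric_name,
            "metric_value": metric_value,
            "threshold": threshold,
            "metadata": metadata or {},
        }
        with self._lock:
            self._alerts.append(alert)
            handlers = list(self._handlers)
        # Handlers run outside the lock so a slow handler cannot block writers.
        for handler in handlers:
            handler(alert)

    def get_recent_alerts(self, count=10):
        with self._lock:
            # Reverse first so the newest alert comes first, then truncate.
            return list(reversed(self._alerts))[:count]


manager = SketchAlertManager(max_alerts=3)
seen = []
manager.add_alert_handler(lambda alert: seen.append(alert["message"]))
for i in range(5):
    manager.trigger_alert(f"alert {i}", "warning", "cpu_percent", 90.0 + i, 90.0)

recent = [a["message"] for a in manager.get_recent_alerts(count=2)]
```

In this run every one of the five alerts reaches the handler, but only the last three are retained because `max_alerts=3`.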

AdvancedMonitor(alert_manager=None, gpu_profiler=None, resource_monitor=None) ¤

Background resource monitor with threshold-based alerting.

Collects CPU, memory, and optional GPU metrics on a daemon thread. Triggers alerts when thresholds are exceeded.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `alert_manager` | `AlertManager \| None` | Alert manager for dispatching alerts. Created if not provided. | `None` |
| `gpu_profiler` | `GPUProfilerProtocol \| None` | Optional GPU profiler for GPU metrics. | `None` |
| `resource_monitor` | `ResourceMonitor \| None` | Optional ResourceMonitor for background sampling. | `None` |

Initialize the monitor.

alert_manager property ¤

Access the underlying alert manager.

set_threshold(metric_name, threshold) ¤

Set an alerting threshold for a metric.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metric_name` | `str` | Name of the metric to monitor. | required |
| `threshold` | `float` | Value above which an alert is triggered. | required |
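
The threshold rule described above ("value above which an alert is triggered") amounts to a comparison per metric. The helper below is a hedged sketch of that check. `find_violations` and the metric names are hypothetical, and the strict `>` comparison is an assumption drawn from the wording of the parameter description.

```python
def find_violations(metrics, thresholds):
    """Return (name, value, threshold) for each metric above its threshold.

    Metrics without a configured threshold are ignored.
    """
    return [
        (name, value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical metric names and values for illustration.
thresholds = {"cpu_percent": 90.0, "memory_percent": 80.0}
metrics = {"cpu_percent": 95.0, "memory_percent": 60.0, "disk_percent": 99.0}
violations = find_violations(metrics, thresholds)
```

Only `cpu_percent` is reported here: `memory_percent` is below its threshold and `disk_percent` has no threshold set.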

start_monitoring(interval=5.0) ¤

Start background monitoring on a daemon thread.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `interval` | `float` | Seconds between metric collection cycles. | `5.0` |

stop_monitoring() ¤

Stop background monitoring and wait for the thread to finish.
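
The start/stop lifecycle above follows a common pattern: a daemon thread that loops until an event is set, and a stop method that sets the event and joins the thread. The sketch below shows that pattern in isolation. `SketchMonitor` and its `collect` callback are hypothetical and say nothing about how AdvancedMonitor is actually implemented.

```python
import threading
import time


class SketchMonitor:
    """Hypothetical stand-in for a background sampling loop."""

    def __init__(self, collect):
        self._collect = collect
        self._stop_event = threading.Event()
        self._thread = None
        self.samples = []

    def start_monitoring(self, interval=5.0):
        if self._thread is not None and self._thread.is_alive():
            return  # already running
        self._stop_event.clear()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def _run(self, interval):
        while True:
            self.samples.append(self._collect())
            # wait() returns True as soon as stop is requested, so shutdown
            # does not have to wait for a full interval to elapse.
            if self._stop_event.wait(interval):
                break

    def stop_monitoring(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None


monitor = SketchMonitor(collect=time.monotonic)
monitor.start_monitoring(interval=0.01)
time.sleep(0.1)
monitor.stop_monitoring()
```

Using `Event.wait(interval)` instead of `time.sleep(interval)` is what lets `stop_monitoring()` return promptly even with a long interval.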

get_monitoring_summary() ¤

Return a summary of current monitoring state.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Dictionary with thresholds, alert counts, and metric history summaries. |

Production Monitor¤

calibrax.monitoring.production ¤

Production-grade monitoring with pipeline health tracking.

Extends AdvancedMonitor with performance baselines, pipeline execution tracking, and health report generation.

ProductionMonitor(**kwargs) ¤

Bases: AdvancedMonitor

Extended monitor with pipeline health tracking and performance baselines.

Tracks pipeline execution times, success rates, and detects performance degradation against configured baselines.

Initialize the production monitor.

set_performance_baseline(metric_name, baseline_value) ¤

Set a performance baseline for degradation detection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metric_name` | `str` | Metric to track against baseline. | required |
| `baseline_value` | `float` | Expected baseline value. | required |
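
This page does not state how degradation against a baseline is decided. One plausible rule, shown here purely as an assumption, is to flag a metric once it exceeds its baseline by more than a relative tolerance. `is_degraded` and the 20% default tolerance are hypothetical, and the sketch assumes a "higher is worse" metric such as latency.

```python
def is_degraded(observed, baseline, tolerance=0.2):
    """Return True if observed exceeds baseline by more than the tolerance.

    Assumes higher values are worse (for example, execution time in seconds).
    """
    return observed > baseline * (1.0 + tolerance)


# Hypothetical baseline of 1.0 s for a pipeline stage.
slow_run = is_degraded(observed=1.3, baseline=1.0)
normal_run = is_degraded(observed=1.1, baseline=1.0)
```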

record_pipeline_execution(pipeline_name, execution_time, success, metadata=None) ¤

Record a pipeline execution for health tracking.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pipeline_name` | `str` | Name of the pipeline that executed. | required |
| `execution_time` | `float` | Wall-clock execution time in seconds. | required |
| `success` | `bool` | Whether the execution succeeded. | required |
| `metadata` | `dict[str, Any] \| None` | Optional additional context. | `None` |
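
Because `record_pipeline_execution` takes a wall-clock duration and a success flag, a small context manager can measure both around a block of code. The sketch below relies only on the signature documented above. `track_pipeline` and `RecordingStub` are hypothetical helpers, and the stub replaces a real ProductionMonitor so the example runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_pipeline(monitor, pipeline_name, metadata=None):
    """Time the wrapped block and report it, even if the block raises."""
    start = time.perf_counter()
    success = False
    try:
        yield
        success = True
    finally:
        monitor.record_pipeline_execution(
            pipeline_name,
            time.perf_counter() - start,
            success,
            metadata=metadata,
        )


class RecordingStub:
    """Stand-in exposing the documented record_pipeline_execution signature."""

    def __init__(self):
        self.calls = []

    def record_pipeline_execution(self, pipeline_name, execution_time,
                                  success, metadata=None):
        self.calls.append((pipeline_name, execution_time, success, metadata))


stub = RecordingStub()

with track_pipeline(stub, "nightly_etl"):
    sum(range(1000))  # placeholder for real pipeline work

try:
    with track_pipeline(stub, "nightly_etl", metadata={"attempt": 2}):
        raise ValueError("simulated failure")
except ValueError:
    pass  # the failure is still recorded before the exception propagates
```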

get_pipeline_health_report() ¤

Generate a health report across all tracked pipelines.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Dictionary with per-pipeline statistics and overall health status. |
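
The exact keys of the health report are not listed on this page. To give a sense of what "per-pipeline statistics and overall health status" could look like, the sketch below aggregates recorded executions into success rates and mean execution times. `SketchPipelineTracker`, the report keys, the status strings, and the 0.9 success-rate cutoff are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


class SketchPipelineTracker:
    """Hypothetical stand-in for pipeline health tracking."""

    def __init__(self, min_success_rate=0.9):
        self._min_success_rate = min_success_rate
        self._runs = defaultdict(list)  # pipeline name -> [(seconds, success)]

    def record_pipeline_execution(self, pipeline_name, execution_time,
                                  success, metadata=None):
        self._runs[pipeline_name].append((execution_time, success))

    def get_pipeline_health_report(self):
        pipelines = {}
        for name, runs in self._runs.items():
            successes = sum(1 for _, ok in runs if ok)
            pipelines[name] = {
                "executions": len(runs),
                "success_rate": successes / len(runs),
                "mean_execution_time": mean(t for t, _ in runs),
            }
        # Overall status is healthy only if every pipeline meets the cutoff.
        healthy = all(
            stats["success_rate"] >= self._min_success_rate
            for stats in pipelines.values()
        )
        return {
            "pipelines": pipelines,
            "overall_status": "healthy" if healthy else "degraded",
        }


tracker = SketchPipelineTracker()
tracker.record_pipeline_execution("etl", 1.0, True)
tracker.record_pipeline_execution("etl", 3.0, True)
tracker.record_pipeline_execution("train", 10.0, False)
tracker.record_pipeline_execution("train", 12.0, True)
report = tracker.get_pipeline_health_report()
```

In this run the `train` pipeline succeeds only half the time, so the sketch reports the overall status as degraded even though `etl` is fully successful.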