Production Monitoring

Calibrax provides a monitoring stack for tracking metrics at runtime, triggering alerts when thresholds are exceeded, and generating pipeline health reports.

Alert Management

AlertManager collects alerts and dispatches them to registered handlers:

from calibrax.monitoring.monitor import AlertManager, AlertSeverity

manager = AlertManager(max_alerts=1000)

# Register a handler (any callable that accepts an Alert)
def log_alert(alert):
    print(f"[{alert.severity.value}] {alert.message}: "
          f"{alert.metric_name}={alert.metric_value:.2f} "
          f"(threshold={alert.threshold:.2f})")

manager.add_alert_handler(log_alert)

# Trigger an alert
manager.trigger_alert(
    message="Throughput below minimum",
    severity=AlertSeverity.WARNING,
    metric_name="throughput",
    metric_value=850.0,
    threshold=1000.0,
)

[warning] Throughput below minimum: throughput=850.00 (threshold=1000.00)
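
Because a handler is just a callable, alerts can also be routed into Python's standard logging module. The following is a minimal sketch, not part of Calibrax: `make_logging_handler` is a hypothetical helper, and the severity-to-level mapping assumes that every `AlertSeverity` value is a lowercase string like the `warning` shown in the output above. A stand-in object replaces the real Alert so the sketch runs on its own.

```python
import logging
from types import SimpleNamespace

# Assumed mapping from AlertSeverity values to stdlib logging levels.
# Only "warning" is confirmed by the sample output; the rest are guesses.
_LEVELS = {
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def make_logging_handler(logger):
    """Return a callable that could be passed to manager.add_alert_handler()."""

    def handle(alert):
        # Fall back to WARNING if the severity value is not in the map.
        level = _LEVELS.get(str(alert.severity.value).lower(), logging.WARNING)
        logger.log(
            level,
            "%s: %s=%.2f (threshold=%.2f)",
            alert.message,
            alert.metric_name,
            alert.metric_value,
            alert.threshold,
        )

    return handle


# Stand-in with the same attributes the handler above reads from an Alert,
# so the sketch runs without Calibrax installed.
fake_alert = SimpleNamespace(
    severity=SimpleNamespace(value="warning"),
    message="Throughput below minimum",
    metric_name="throughput",
    metric_value=850.0,
    threshold=1000.0,
)

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log_handler = make_logging_handler(logging.getLogger("calibrax.alerts"))
log_handler(fake_alert)
```

With a real manager, the returned callable would be registered the same way as `log_alert` above.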

Alert Severity Levels

Severity	Use Case
`INFO`	Informational — metric crossed a soft threshold
`WARNING`	Performance degraded but within tolerance
`ERROR`	Significant regression requiring investigation
`CRITICAL`	Service-impacting issue requiring immediate action

Querying Alerts

# Most recent alerts
recent = manager.get_recent_alerts(count=5)

# Filter by severity
errors = manager.get_alerts_by_severity(AlertSeverity.ERROR)

# Clear all alerts
manager.clear_alerts()
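
The query results can be aggregated with ordinary Python. The sketch below assumes that `get_recent_alerts` and `get_alerts_by_severity` return an iterable of alert objects exposing the `metric_name` and `severity` attributes used in the handler example. `summarize_alerts` is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_alerts(alerts):
    """Count alerts per (metric_name, severity value) pair."""
    return Counter((a.metric_name, a.severity.value) for a in alerts)


# Stand-in alerts; with Calibrax you would pass the result of
# manager.get_recent_alerts(count=...) instead (assumed to be iterable).
sample = [
    SimpleNamespace(metric_name="latency", severity=SimpleNamespace(value="error")),
    SimpleNamespace(metric_name="latency", severity=SimpleNamespace(value="error")),
    SimpleNamespace(metric_name="throughput", severity=SimpleNamespace(value="warning")),
]

# Print the noisiest metric/severity combinations first.
for (metric, severity), count in summarize_alerts(sample).most_common():
    print(f"{metric} [{severity}]: {count}")
```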

Threshold-Based Monitoring

AdvancedMonitor runs a background thread that periodically checks metric values against configured thresholds:

from calibrax.monitoring.monitor import AdvancedMonitor

monitor = AdvancedMonitor()

# Set thresholds
monitor.set_threshold("latency", 1.5)       # alert if latency > 1.5
monitor.set_threshold("throughput", 500.0)   # alert if throughput > 500

# Start background monitoring (checks every 5 seconds)
monitor.start_monitoring(interval=5.0)

# ... run workloads ...

# Stop monitoring and get summary
monitor.stop_monitoring()
summary = monitor.get_monitoring_summary()
print(f"Thresholds: {summary['thresholds']}")
print(f"Alert count: {summary['alert_count']}")
for metric, history in summary["metric_history"].items():
    print(f"  {metric}: latest={history['latest']:.2f}, "
          f"min={history['min']:.2f}, max={history['max']:.2f}")

AdvancedMonitor optionally accepts a GPU profiler (an object implementing GPUProfilerProtocol) and a ResourceMonitor, so that GPU and resource metrics are included in the monitoring loop.
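
To make the poll-based model concrete, here is a rough sketch of the general pattern: a background thread wakes at a fixed interval, reads the current metric values, and reports any that exceed their thresholds. This is not Calibrax's implementation. `PollingThresholdChecker`, `read_metrics`, and `on_breach` are hypothetical names chosen for illustration.

```python
import threading


class PollingThresholdChecker:
    """Illustrative poll-based checker; not Calibrax's implementation."""

    def __init__(self, read_metrics, on_breach):
        self._read_metrics = read_metrics  # callable returning {name: value}
        self._on_breach = on_breach        # callable(name, value, threshold)
        self._thresholds = {}
        self._stop = threading.Event()
        self._thread = None

    def set_threshold(self, name, limit):
        # Call before start(); the sketch does not lock the dict.
        self._thresholds[name] = limit

    def check_once(self):
        """Compare current metric values with thresholds; report breaches."""
        values = self._read_metrics()
        for name, limit in self._thresholds.items():
            value = values.get(name)
            if value is not None and value > limit:
                self._on_breach(name, value, limit)

    def _run(self, interval):
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop() sets the event, which ends the loop promptly.
        while not self._stop.wait(interval):
            self.check_once()

    def start(self, interval=5.0):
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None


breaches = []
checker = PollingThresholdChecker(
    read_metrics=lambda: {"latency": 2.0, "throughput": 450.0},
    on_breach=lambda name, value, limit: breaches.append((name, value, limit)),
)
checker.set_threshold("latency", 1.5)
checker.set_threshold("throughput", 500.0)
checker.check_once()  # one synchronous check, no thread needed
print(breaches)       # only latency exceeds its threshold
```

Using an event rather than `time.sleep` means that stopping does not have to wait out the remainder of the polling interval.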

Production Monitor

ProductionMonitor extends AdvancedMonitor with pipeline execution tracking and health reporting:

from calibrax.monitoring.production import ProductionMonitor

monitor = ProductionMonitor()

# Set performance baselines
monitor.set_performance_baseline("inference_latency", 0.05)

# Record pipeline executions
monitor.record_pipeline_execution(
    pipeline_name="inference",
    execution_time=0.048,
    success=True,
)

monitor.record_pipeline_execution(
    pipeline_name="inference",
    execution_time=0.150,
    success=False,
    metadata={"error": "timeout"},
)

# Get health report
report = monitor.get_pipeline_health_report()
print(f"Overall health: {report['overall_health']}")
for name, stats in report["pipelines"].items():
    print(f"\n{name}:")
    print(f"  Total executions: {stats['total_executions']}")
    print(f"  Success rate: {stats['success_rate']:.1%}")
    print(f"  Mean time: {stats['mean_execution_time']:.4f}s")
    print(f"  Health: {stats['health']}")

Overall health: degraded

inference:
  Total executions: 2
  Success rate: 50.0%
  Mean time: 0.0990s
  Health: degraded
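
In real code the execution time and success flag usually come from the workload itself rather than from literals. One possible approach is a small context manager that times a block and forwards the result to `record_pipeline_execution`, using the keyword arguments shown above. `track_execution` and `RecordingMonitor` are hypothetical; the latter is a stand-in that mimics only the call signature, so the sketch runs without Calibrax.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_execution(monitor, pipeline_name):
    """Time the with-block and report it via record_pipeline_execution()."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        monitor.record_pipeline_execution(
            pipeline_name=pipeline_name,
            execution_time=time.perf_counter() - start,
            success=False,
            metadata={"error": type(exc).__name__},
        )
        raise  # never swallow the caller's exception
    else:
        monitor.record_pipeline_execution(
            pipeline_name=pipeline_name,
            execution_time=time.perf_counter() - start,
            success=True,
        )


class RecordingMonitor:
    """Stand-in that mimics only the call signature used above."""

    def __init__(self):
        self.calls = []

    def record_pipeline_execution(self, pipeline_name, execution_time,
                                  success, metadata=None):
        self.calls.append((pipeline_name, success, metadata))


demo = RecordingMonitor()

with track_execution(demo, "inference"):
    sum(range(1000))  # placeholder workload

try:
    with track_execution(demo, "inference"):
        raise TimeoutError("simulated failure")
except TimeoutError:
    pass

print(demo.calls)
```

Passing a ProductionMonitor instead of the stand-in should work as long as its method accepts the same keyword arguments as in the example above.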

Health Status Values

Status	Meaning
`healthy`	All pipelines have high success rate and normal execution times
`degraded`	Some pipelines have elevated error rates or slow execution
`critical`	One or more pipelines are failing consistently
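
The exact rules Calibrax uses to assign a status are not documented here. As a rough illustration of how such a classification could work, the sketch below derives a status from the success rate and from the mean execution time relative to a baseline. Every cutoff (`degraded_below`, `critical_below`, `slow_factor`) is an assumption, chosen only so that the two executions recorded above come out as `degraded`.

```python
from statistics import fmean


def classify_pipeline(executions, baseline=None, degraded_below=0.95,
                      critical_below=0.50, slow_factor=1.5):
    """Return (stats, health) for a list of (execution_time, success) pairs.

    All cutoffs are illustrative defaults, not Calibrax's actual rules.
    """
    if not executions:
        raise ValueError("no executions recorded")

    success_rate = sum(1 for _, ok in executions if ok) / len(executions)
    mean_time = fmean(t for t, _ in executions)
    too_slow = baseline is not None and mean_time > slow_factor * baseline

    if success_rate < critical_below:
        health = "critical"
    elif success_rate < degraded_below or too_slow:
        health = "degraded"
    else:
        health = "healthy"

    stats = {
        "total_executions": len(executions),
        "success_rate": success_rate,
        "mean_execution_time": mean_time,
    }
    return stats, health


# The two executions recorded in the example above, with its 0.05 s baseline.
stats, health = classify_pipeline([(0.048, True), (0.150, False)], baseline=0.05)
print(f"{stats['success_rate']:.1%} {stats['mean_execution_time']:.4f}s {health}")
# prints: 50.0% 0.0990s degraded
```

The mean of 0.048 s and 0.150 s is 0.099 s, which matches the report output shown earlier.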

Best Practices

Use AlertManager directly when monitoring is event-driven (you push metrics)
Use AdvancedMonitor when monitoring is poll-based (periodic checks)
Use ProductionMonitor for full pipeline health tracking with execution history
Set conservative thresholds initially and tighten them as you understand your workload's normal variance
Register multiple handlers (e.g., logging + Slack webhook) for alert fan-out
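
For the fan-out suggestion above, a second handler might post alerts to a chat webhook. The sketch below uses only the standard library and assumes a Slack-style incoming webhook that accepts a JSON body with a `text` field. `build_slack_request` and `make_slack_handler` are hypothetical helpers, and the URL is a placeholder. The alert attributes are the ones used in the handler example under Alert Management.

```python
import json
import urllib.request
from types import SimpleNamespace


def build_slack_request(webhook_url, alert):
    """Build (but do not send) a POST request carrying the alert text."""
    text = (f"[{alert.severity.value}] {alert.message}: "
            f"{alert.metric_name}={alert.metric_value:.2f} "
            f"(threshold={alert.threshold:.2f})")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def make_slack_handler(webhook_url, timeout=5.0):
    """Return an alert handler that posts to the given webhook URL."""

    def handle(alert):
        request = build_slack_request(webhook_url, alert)
        try:
            with urllib.request.urlopen(request, timeout=timeout):
                pass
        except OSError as exc:
            # URLError subclasses OSError; a failed notification should
            # not break the code path that raised the alert.
            print(f"Webhook notification failed: {exc}")

    return handle


# Building the request does not touch the network.
fake_alert = SimpleNamespace(
    severity=SimpleNamespace(value="warning"),
    message="Throughput below minimum",
    metric_name="throughput",
    metric_value=850.0,
    threshold=1000.0,
)
request = build_slack_request("https://example.invalid/webhook", fake_alert)
print(request.get_method(), json.loads(request.data)["text"])
```

Each handler would then be registered with its own `manager.add_alert_handler(...)` call, for example `log_alert` first and `make_slack_handler(url)` second.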

Next Steps

CI Integration: Use monitoring thresholds to gate CI pipelines
Writing Adapters: Wrap external models for monitoring with Calibrax