Model Monitoring
Model Monitoring Categories
Data Quality & Drift
- Data Drift: Input feature distribution changes
- Concept Drift: Relationship between features and target changes
- Schema Validation: Data type and format compliance
- Missing Data: Null value rates and patterns
- Outlier Detection: Anomalous input values
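The bullets above map to simple programmatic checks. A minimal sketch with pandas follows; the expected schema, column names, and the 3-sigma outlier rule are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np
import pandas as pd

# Expected schema: column name -> pandas dtype (illustrative).
EXPECTED_SCHEMA = {"age": "int64", "income": "float64", "country": "object"}

def data_quality_report(df: pd.DataFrame) -> dict:
    """Run basic schema, missing-data, and outlier checks on a feature batch."""
    report = {}

    # Schema validation: missing columns and dtype mismatches.
    report["missing_columns"] = [c for c in EXPECTED_SCHEMA if c not in df.columns]
    report["dtype_mismatches"] = {
        c: str(df[c].dtype)
        for c, expected in EXPECTED_SCHEMA.items()
        if c in df.columns and str(df[c].dtype) != expected
    }

    # Missing data: null rate per column.
    report["null_rates"] = df.isna().mean().to_dict()

    # Outlier detection: share of values more than 3 standard deviations from the mean.
    numeric = df.select_dtypes(include=np.number)
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    report["outlier_rates"] = (z.abs() > 3).mean().to_dict()

    return report
```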
Model Performance
- Prediction Accuracy: Precision, recall, F1, RMSE
- Prediction Drift: Output distribution changes
- Confidence Scores: Model certainty tracking
- Business Metrics: Revenue, conversion, engagement
- Comparative Analysis: vs baseline/champion model
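A small sketch of a performance snapshot over a window of labeled production traffic, using scikit-learn; the function name and the baseline comparison field are illustrative, mirroring the champion-model bullet above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def performance_snapshot(y_true, y_pred, baseline_f1=None):
    """Classification metrics for a window of labeled production predictions."""
    snapshot = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    # Comparative analysis: delta vs. the baseline/champion model, if known.
    if baseline_f1 is not None:
        snapshot["f1_delta_vs_baseline"] = snapshot["f1"] - baseline_f1
    return snapshot
```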
Infrastructure & System
- Latency: P50, P95, P99 response times
- Throughput: Requests per second
- Error Rates: HTTP errors, exceptions
- Resource Usage: CPU, memory, GPU utilization
- Cost Metrics: Inference cost per request
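Turning a window of raw request latencies into the percentile SLIs above is a one-liner with NumPy; the sample values are made up.

```python
import numpy as np

def latency_percentiles(latencies_ms):
    """Summarize request latencies (in milliseconds) into P50/P95/P99."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}

# Example: latencies collected over one monitoring window.
print(latency_percentiles([12, 15, 14, 220, 18, 16, 13, 19, 17, 450]))
```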
Model Health
- Model Version: Deployed version tracking
- Dependency Health: Feature store and database connectivity
- Data Freshness: Feature recency
- Retraining Status: Last training date
- A/B Test Progress: Traffic splits and metrics
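A minimal freshness and retraining-age check, assuming UTC timestamps are available from the feature pipeline and model registry; both thresholds are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds; tune per feature pipeline and retraining cadence.
MAX_FEATURE_AGE = timedelta(hours=6)
MAX_MODEL_AGE = timedelta(days=30)

def health_check(last_feature_update: datetime, last_trained: datetime) -> dict:
    """Flag stale features and overdue retraining (expects timezone-aware UTC timestamps)."""
    now = datetime.now(timezone.utc)
    return {
        "features_stale": now - last_feature_update > MAX_FEATURE_AGE,
        "retraining_overdue": now - last_trained > MAX_MODEL_AGE,
    }
```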
Drift Detection Methods
| Method | Type | Description | Best For |
|---|---|---|---|
| KL Divergence | Statistical | Measures how one probability distribution diverges from a reference (asymmetric) | Continuous features, distribution comparison |
| Kolmogorov-Smirnov Test | Statistical | Two-sample test comparing distributions | Continuous variables, univariate drift |
| Population Stability Index (PSI) | Statistical | Measures distribution shift in binned data | Credit scoring, finance applications |
| Chi-Square Test | Statistical | Tests independence of categorical variables | Categorical features |
| Adversarial Validation | ML-based | Train classifier to distinguish train vs production data | Multivariate drift, complex patterns |
| Domain Classifier | ML-based | Model predicting if data is from training or production | High-dimensional data, deep learning |
| CUSUM | Sequential | Cumulative sum control chart for change detection | Time-series data, gradual drift |
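Two of the statistical methods above, sketched in Python: a hand-rolled, quantile-binned PSI and SciPy's two-sample Kolmogorov-Smirnov test. The bin count, epsilon clip, and simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def psi(reference, current, bins=10, eps=1e-4):
    """Population Stability Index between a reference and a current sample."""
    # Bin edges from reference quantiles so each bin holds ~equal reference mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    expected = np.histogram(reference, bins=edges)[0] / len(reference)
    # Clip current values into the reference range so out-of-range data lands in the end bins.
    actual = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    expected, actual = np.clip(expected, eps, None), np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Kolmogorov-Smirnov two-sample test for univariate continuous drift.
ref = np.random.normal(0, 1, 5000)          # training-time reference sample
cur = np.random.normal(0.3, 1, 5000)        # simulated shifted production sample
ks_stat, p_value = stats.ks_2samp(ref, cur)
print(f"PSI={psi(ref, cur):.3f}  KS={ks_stat:.3f}  p={p_value:.4f}")
```

A common rule of thumb reads PSI below 0.1 as stable, 0.1-0.25 as moderate shift, and above 0.25 as significant shift, though thresholds should be calibrated per feature.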
ML Monitoring Tools
Open-source ML monitoring tool for data drift, model performance, and target drift detection with interactive reports.
Key features:
- Data drift detection
- Model performance monitoring
- Test suites and reports
- Integration with ML platforms
- Custom metrics support
Best for:
- Production ML monitoring
- Data quality validation
- A/B test analysis
- Model debugging
AI observability platform with data and ML monitoring, providing privacy-preserving model and data health insights.
Key features:
- Data profiling and drift
- Model performance tracking
- Privacy-preserving monitoring
- Anomaly detection
- Integration with MLflow, SageMaker
Best for:
- Enterprise ML monitoring
- Regulated industries
- Privacy-sensitive applications
- Multi-model observability
ML observability platform for monitoring, explaining, and troubleshooting production ML models with drift detection.
Key features:
- Model performance monitoring
- Drift and data quality
- Explainability (SHAP)
- Embedding analysis
- Automated troubleshooting
Best for:
- Production model monitoring
- NLP and CV models
- Embedding drift tracking
- Model debugging
Enterprise MLOps platform for monitoring, explaining, and analyzing ML models with a focus on responsible AI.
Key features:
- Model monitoring and alerts
- Explainable AI
- Fairness and bias detection
- Performance tracking
- Root cause analysis
Best for:
- Enterprise ML governance
- Regulated industries
- Responsible AI programs
- High-stakes predictions
Extends W&B experiment tracking with production monitoring and real-time performance tracking for deployed models.
Key features:
- Production model tracking
- Performance dashboards
- Alerting on degradation
- Integration with the W&B ecosystem
- Custom metrics
Best for:
- End-to-end ML lifecycle
- Experiment to production
- Team collaboration
- Research to deployment
Open-source monitoring stack for metrics collection, visualization, and alerting, adaptable for ML monitoring.
Key features:
- Time-series metrics storage
- Custom dashboards
- Alerting rules
- Large ecosystem
- Self-hosted option
Best for:
- Custom ML metrics
- Infrastructure monitoring
- Cost-conscious teams
- Deployments requiring full control
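If the open-source stack described above is Prometheus plus Grafana, custom ML metrics can be exported with the official Python client roughly as follows; the metric names, labels, and port are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric names; Grafana dashboards and alert rules key off these.
PREDICTIONS = Counter("model_predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("model_inference_latency_seconds", "Inference latency")
DRIFT_SCORE = Gauge("model_feature_drift_psi", "Latest PSI drift score per feature", ["feature"])

def record_prediction(version: str, latency_s: float) -> None:
    PREDICTIONS.labels(model_version=version).inc()
    LATENCY.observe(latency_s)

if __name__ == "__main__":
    start_http_server(9100)                  # expose /metrics for Prometheus to scrape
    DRIFT_SCORE.labels(feature="income").set(0.07)
    record_prediction("v2.3.1", 0.042)
    # In a real service the process keeps running and serving predictions here.
```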
Alerting Best Practices
Alert Severity Levels
P0 - Critical: Model serving failures, major accuracy drops (>20%)
P1 - High: Moderate drift (>10%), latency spikes (≥2x baseline)
P2 - Medium: Minor drift (5-10%), warning-level thresholds crossed
P3 - Low: Informational, trends, scheduled reports
Alert Routing
PagerDuty/Opsgenie: P0/P1 on-call escalation
Slack/Teams: P2 team channels
Email: P3 daily/weekly digests
Dashboards: All severities visible
Alert Fatigue Prevention
Aggregation: Group similar alerts
Thresholds: Tune to reduce false positives
Rate Limiting: Max alerts per time window
Auto-resolution: Clear when conditions normalize
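A minimal sketch tying the severity levels above to threshold rules with a per-key rate limit; the metric names, thresholds, and limit are illustrative assumptions, and the returned severity would be routed to PagerDuty, Slack, or email as described.

```python
import time
from collections import defaultdict

# Illustrative rules mapping monitoring metrics to the severity levels above.
SEVERITY_RULES = [
    ("P0", lambda m: m["accuracy_drop"] > 0.20 or m["serving_errors"] > 0),
    ("P1", lambda m: m["drift_score"] > 0.10 or m["latency_ratio"] >= 2.0),
    ("P2", lambda m: m["drift_score"] >= 0.05),
]

MAX_ALERTS_PER_HOUR = 3          # rate limit per (severity, alert-key) pair
_sent = defaultdict(list)

def maybe_alert(metrics: dict, key: str = "model_health"):
    """Return the severity to alert on, or None if nothing fired or the rate limit was hit."""
    for severity, rule in SEVERITY_RULES:
        if rule(metrics):
            bucket, now = (severity, key), time.time()
            _sent[bucket] = [t for t in _sent[bucket] if now - t < 3600]
            if len(_sent[bucket]) < MAX_ALERTS_PER_HOUR:
                _sent[bucket].append(now)
                return severity
            return None              # suppressed: too many alerts this hour
    return None
```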
Retraining Triggers & Strategies
| Trigger Type | Condition | Frequency | Pros/Cons |
|---|---|---|---|
| Time-Based | Fixed schedule (daily, weekly, monthly) | Regular intervals | ✓ Predictable, simple ✗ May retrain unnecessarily |
| Performance-Based | Accuracy drops below threshold | On demand | ✓ Reactive to issues ✗ May be too late |
| Data Drift | Distribution shift detected | On demand | ✓ Proactive ✗ Requires drift detection |
| Data Volume | N new samples accumulated | Variable | ✓ Data-driven ✗ May miss temporal patterns |
| Hybrid | Combination of above | Adaptive | ✓ Flexible, comprehensive ✗ More complex |
| Continuous Learning | Online learning, constant updates | Real-time | ✓ Always current ✗ Resource intensive, stability risk |
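A hybrid trigger from the table above can be expressed as a single predicate combining drift, performance, volume, and schedule conditions; every threshold here is illustrative.

```python
from datetime import datetime, timedelta, timezone

def should_retrain(drift_score: float, current_f1: float, baseline_f1: float,
                   new_samples: int, last_trained: datetime) -> bool:
    """Hybrid retraining trigger: fire if any single condition is met."""
    age = datetime.now(timezone.utc) - last_trained
    return (
        drift_score > 0.25                      # data drift trigger (e.g. PSI)
        or current_f1 < baseline_f1 - 0.05      # performance-based trigger
        or new_samples >= 100_000               # data volume trigger
        or age > timedelta(days=30)             # time-based fallback
    )
```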
Key Dashboard Metrics
Performance SLIs
- Latency: P50, P95, P99
- Throughput (RPS)
- Error rate
- Availability %
Model Metrics
- Accuracy/F1/RMSE
- Prediction distribution
- Confidence scores
- Drift scores
Business KPIs
- Conversion rate
- Revenue impact
- User engagement
- Cost per prediction
