ML Monitoring Guide
Learn about different aspects of monitoring in ML-specific services.
Overview
In regular software engineering, we tend to monitor whether the software is at least working (no errors, acceptable response times, etc.), and that is usually enough. But what can go wrong with machine learning code at runtime?
Regular software (say, a CRM system) rarely breaks without code changes or significant changes in its input data. ML software, in contrast, can be sensitive to even minor distribution shifts: seasonality, trends, or new cameras and microphones for visual/audio data.
Good monitoring comes with the following benefits:
- We’re alerted when things break.
- We can learn what’s broken and why.
- We can inspect trends over long time frames.
- We can compare system behavior across different versions and experimental groups (e.g., A/B testing).
ML-specific monitoring
In ML engineering, we should also monitor the quality of our models and pipelines and carefully watch for issues like concept and data drift. At the same time, regular software problems don't go away and can't be ignored either.
We’ll cover three main aspects of machine learning monitoring in this lesson:
- Service Health
- Data Health
- Model Health
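As a quick, hedged illustration of the data drift idea mentioned above, here is a minimal sketch that compares a reference feature sample against recent production values with a two-sample Kolmogorov-Smirnov test from SciPy. The function name, the significance threshold, and the synthetic data are assumptions for the example, not part of the lesson.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the current sample looks drifted from the reference sample.

    Uses a two-sample Kolmogorov-Smirnov test on a single numeric feature;
    `alpha` is the significance level for rejecting "same distribution".
    """
    _statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Illustrative usage with synthetic data: the production batch is shifted,
# so the check should flag drift.
rng = np.random.default_rng(42)
reference_batch = rng.normal(loc=0.0, scale=1.0, size=1_000)
production_batch = rng.normal(loc=0.5, scale=1.0, size=1_000)

print(detect_drift(reference_batch, production_batch))  # Likely True: distributions differ
```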
Service health
Any machine learning service is, first of all, code. So above all, we should be aware of the standard things that can go wrong with any service. A model with high accuracy that doesn't work, is unstable, or periodically crashes at runtime is a bad model.
Metrics
Let’s look at the metrics we should be aware of when talking about service health.
- Uptime/downtime: Is our ML service even working? What percentage of time has the service been up and available (uptime)? What percentage of time has it not (downtime)? Does it meet our SLA (service-level agreement)?
- QPS (queries per second): What is the throughput of our service? How many requests does it process per second? What is the maximum it can handle?
- Latency and response time: What is the median or maximum time taken to process each request? What is the 99th percentile? Does it meet our SLA? How does it affect our users' experience?
  Note: A good practice is to abort a model run once it reaches 95 percent of the time limit and return some default value (as if the model had returned an error); see the sketch after this list.
- RAM: How much memory does our service consume (separately during training and inference)? Does this value grow over time as we gather more data?
- Training time: Similarly, how much time does our model need for retraining? Is this acceptable for our retraining cycle? Does it grow with the number of samples or features?
Note: Some of the metrics above are relevant only for real-time services, while others are more universal.
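To make the latency note concrete, here is a minimal sketch of the abort-and-fall-back pattern. The `run_model` callable, the 200 ms time budget, and the default prediction are illustrative assumptions, not values from the lesson.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIME_BUDGET_SECONDS = 0.200            # Assumed per-request SLA (not from the lesson)
ABORT_AT = 0.95 * TIME_BUDGET_SECONDS  # Give up once 95% of the budget is spent
DEFAULT_PREDICTION = 0.0               # Placeholder fallback, as if the model had errored

# One long-lived worker pool for the whole process, not one per request.
_executor = ThreadPoolExecutor(max_workers=4)

def predict_with_deadline(run_model, features):
    """Run `run_model(features)`, but return a default if it misses the deadline."""
    future = _executor.submit(run_model, features)
    try:
        return future.result(timeout=ABORT_AT)
    except TimeoutError:
        # The result is abandoned (the worker thread may still finish in the
        # background); the caller gets the default value right away.
        return DEFAULT_PREDICTION
```

A real service would usually also count these fallbacks as a separate metric, since a rising fallback rate is itself a health signal.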
Tools
A typical monitoring stack (not only for ML services) is the Prometheus and Grafana bundle.
- Prometheus is an open-source monitoring and alerting system. It collects and stores metrics as time series data, i.e., metric values recorded with a timestamp.
- Grafana is open-source software that visualizes metrics from data sources such as Prometheus in dashboards and supports alerting on them.
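As a sketch of how a Python service might expose such metrics for this stack, the example below uses the `prometheus_client` library. The metric names, the port, and the dummy `predict` function are illustrative assumptions, not something prescribed by the lesson.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric definitions; Prometheus scrapes them from the /metrics endpoint.
REQUESTS_TOTAL = Counter("ml_requests_total", "Total prediction requests served")
LATENCY_SECONDS = Histogram("ml_request_latency_seconds", "Prediction latency in seconds")

@LATENCY_SECONDS.time()            # Records how long each call takes
def predict(features):
    REQUESTS_TOTAL.inc()           # QPS can be derived in Grafana with rate()
    time.sleep(random.uniform(0.01, 0.05))  # Stand-in for real model inference
    return 0.0

if __name__ == "__main__":
    # Expose metrics at http://localhost:8000/metrics for Prometheus to scrape.
    # (On Linux, the client's default process collector also reports memory and CPU.)
    start_http_server(8000)
    while True:
        predict({"feature": 1.0})
```

Grafana would then use Prometheus as a data source to chart request rate, latency percentiles, and memory over time, and to define alerts on them.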