Overview

Once the code of our machine learning service is written and deployed to production, it should run as smoothly as a living organism, because we don’t always have the option to intervene and fix something on the go. Only a handful of practices make this code reliable, and we’ll discuss most of them in this section.

Fail fast vs. fail safe

Depending on the domain and application, we should either handle exceptions and bugs without interrupting the service (log the error or trigger an alert, then continue running) or stop execution completely. The first approach is fail safe; the second is fail fast. This decision is also known as the robustness vs. correctness issue.
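To make the two policies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a prescribed implementation: `predict_batch`, its `fail_fast` flag, and `toy_model` are hypothetical names invented for this example, and the "model" is just a square root that rejects negative inputs.

```python
import logging
from typing import Callable, Iterable, List, Optional

logger = logging.getLogger("ml_service")


def predict_batch(
    model: Callable[[float], float],
    inputs: Iterable[float],
    fail_fast: bool,
) -> List[Optional[float]]:
    """Run `model` on each input under one of the two error policies."""
    results: List[Optional[float]] = []
    for x in inputs:
        try:
            results.append(model(x))
        except Exception:
            if fail_fast:
                # Fail fast: re-raise so processing stops at the first error.
                raise
            # Fail safe: log the error with its traceback, record that no
            # result is available for this input, and keep running.
            logger.exception("Prediction failed for input %r", x)
            results.append(None)
    return results


def toy_model(x: float) -> float:
    """Hypothetical stand-in for a real model; rejects negative inputs."""
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x ** 0.5
```

With `fail_fast=False`, `predict_batch(toy_model, [4.0, -1.0, 9.0], fail_fast=False)` returns `[2.0, None, 3.0]` and logs the failure for `-1.0`. With `fail_fast=True`, the same call raises `ValueError` at the second input and the third is never processed. Note that the fail-safe branch here still returns `None` rather than a guessed value; substituting a fallback prediction instead would push the design further toward keeping the service running at the cost of possibly inaccurate results.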

Robustness vs. correctness

Steve McConnell, in Code Complete, precisely highlights the difference between robustness and correctness:

"As the video game and x-ray examples show us, the most appropriate error processing style depends on the kind of software the error occurs in. These examples also illustrate that error processing generally favors more correctness or robustness. Developers tend to use these terms informally, but, strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; returning no result is better than returning an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that ...
