Image Classification Metrics
Learn about various metrics to evaluate the performance of our image classification model.
This lesson covers how the performance of image classification algorithms is measured. Metrics quantify how well each of the models we’ve trained performs, which lets us select the model that best fits our needs.
Accuracy
Accuracy is the most commonly used metric for classification algorithms due to its simplicity.
Accuracy refers to the total number of correct predictions divided by the total number of predictions made. It’s often multiplied by 100 to express the value as a percentage between 0 and 100.
For example, if our model makes 10 predictions and 7 of them are correct, the accuracy is 70%.
For binary classification, the concept is the same, but it consists of the following items:
True Positive (TP): The number of positive class samples our model predicted correctly.
True Negative (TN): The number of negative class samples our model predicted correctly.
False Positive (FP): The number of negative class samples our model predicted incorrectly. In statistical terminology, it’s known as a Type-I error.
False Negative (FN): The number of positive class samples our model predicted incorrectly. In statistical terminology, it’s known as a Type-II error.
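As a minimal sketch with hypothetical counts, accuracy can be computed directly from these four outcomes:

```python
# Hypothetical counts of the four binary-classification outcomes
tp, tn, fp, fn = 4, 3, 2, 1

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")  # 70% for these counts
```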
Precision
Precision refers to the ratio of correctly predicted positive samples to the total number of samples predicted as positive.
For example, given a binary classification model that predicts whether an image is a fruit or not a fruit, we can calculate precision as follows:
Precision = Correctly predicted as fruit / (Correctly predicted as fruit + Non-fruit incorrectly predicted as fruit)
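In terms of the counts defined earlier, this is TP / (TP + FP). Here’s a minimal sketch with hypothetical numbers, treating fruit as the positive class:

```python
# Hypothetical counts, treating fruit as the positive class
tp = 4  # fruit images correctly predicted as fruit
fp = 2  # non-fruit images incorrectly predicted as fruit

precision = tp / (tp + fp)
print(f"Precision: {precision:.2f}")  # 0.67
```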
Precision is an important metric for use cases where false negatives aren’t a concern. Many of these use cases revolve around recommendation systems.
For example, a model that classifies an input image submitted by users for a search engine should have high precision. We want to return search results as precisely as possible. It’s fine if we don’t return some of the relevant results.
Recall
Recall, on the other hand, refers to the ratio of correctly predicted positive samples to all the available positive samples. It’s also known as sensitivity or hit rate.
We can calculate recall as follows:
Recall = Correctly predicted as fruit / (Correctly predicted as fruit + Fruit incorrectly predicted as non-fruit)
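In the same terms, this is TP / (TP + FN). A minimal sketch with hypothetical numbers:

```python
# Hypothetical counts, treating fruit as the positive class
tp = 4  # fruit images correctly predicted as fruit
fn = 1  # fruit images incorrectly predicted as non-fruit

recall = tp / (tp + fn)
print(f"Recall: {recall:.2f}")  # 0.80
```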
The recall metric is useful for evaluating models where many false positives are tolerable but false negatives incur high costs.
For example, a model that takes lung images of a patient and predicts if the patient is suffering from COVID-19 should have a high recall. The main goal is to identify all of the infected patients. It’s acceptable if the model wrongly classifies a person as COVID-19 positive, but it can be disastrous if it lets a COVID-19 patient go undetected.
F1-Score
Most developers and practitioners use the F1-score to get the best of both worlds, since it is the harmonic mean of precision and recall:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
The value of the F1-score ranges from zero to one. A high score indicates that our model generalizes well and performs well on both precision and recall. However, a low F1-score alone won’t tell us the root problem, which could be a surplus of false positives, false negatives, or both.
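As a sketch, assuming scikit-learn is installed and using made-up labels for the fruit/non-fruit example, all three metrics can be computed together:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and predictions (1 = fruit, 0 = non-fruit)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```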
Confusion matrix
The confusion matrix serves as a complement to these metrics. It isn’t a metric itself, but rather a two-dimensional tabular visualization of the ground-truth labels versus the model’s predictions.
The following example showcases the confusion matrix for a 3-class classification model:
If we analyze the model, we see that it can’t correctly classify banana. However, it performs well when identifying apple. It also struggles to classify grape, mistaking it for banana. We can determine that the problem lies with banana. It might be due to imbalanced training data or some of the training samples being labeled incorrectly as banana or grape.
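As a sketch, assuming scikit-learn and made-up labels for the same three fruit classes, a confusion matrix can be built like this, with rows as ground-truth labels and columns as predictions:

```python
from sklearn.metrics import confusion_matrix

labels = ["apple", "banana", "grape"]

# Hypothetical ground-truth labels and model predictions
y_true = ["apple", "apple", "apple", "banana", "banana", "grape", "grape"]
y_pred = ["apple", "apple", "apple", "apple",  "grape",  "banana", "grape"]

# Rows are ground-truth classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[3 0 0]   apple:  all three classified correctly
#  [1 0 1]   banana: confused with apple and grape
#  [0 1 1]]  grape:  one mistaken for banana
```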
Top-5 accuracy/error
Sometimes, top-1 and top-5 accuracy/error help to evaluate the performance of an image classification model. Top-1 accuracy works just like the standard accuracy described above: the prediction with the highest probability must exactly match the expected label.
Meanwhile, the top-5 accuracy means that one of the five highest predictions must match the expected label. The label counts as correct as long as the expected label is within the top five predictions.
Let’s assume we have a test image of an apple, and the prediction is as follows:
| Prediction | Score |
|---|---|
| Peach | 0.4 |
| Papaya | 0.3 |
| Apple | 0.2 |
| Banana | 0.08 |
| Grape | 0.02 |
In this case, the prediction counts as wrong for top-1 accuracy since the model predicts peach, but the ground truth is apple.
If we use top-5 accuracy, it counts as correct since apple is within the top five predictions. On a side note, we should only use top-5 accuracy if we have many labels.
On the other hand, the top-5 error is simply the complement of the top-5 accuracy. We can calculate it by subtracting the top-5 accuracy from 100%. If the top-5 accuracy is 72%, then the top-5 error will be 28%.
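As a sketch using the hypothetical scores from the table above, we can check top-1 and top-5 correctness and derive the top-5 error:

```python
# Hypothetical class scores for a test image of an apple
scores = {"Peach": 0.4, "Papaya": 0.3, "Apple": 0.2, "Banana": 0.08, "Grape": 0.02}
expected = "Apple"

# Rank labels by score, highest first
ranked = sorted(scores, key=scores.get, reverse=True)

top1_correct = ranked[0] == expected   # False: the top prediction is Peach
top5_correct = expected in ranked[:5]  # True: Apple is within the top five
print(top1_correct, top5_correct)

# Top-5 error is simply 100% minus the top-5 accuracy
top5_accuracy = 72.0                   # hypothetical accuracy over a whole test set
top5_error = 100.0 - top5_accuracy     # 28.0
```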