Loss

Understand the steps involved in computing the appropriate loss, and explore the binary cross-entropy loss.

Defining the appropriate loss

We already have a model, and now we need to define an appropriate loss for it. A binary classification problem calls for the binary cross-entropy (BCE) loss, which is sometimes known as log loss.

The BCE loss requires the predicted probabilities, as returned by the sigmoid function, and the true labels (y) for its computation. For each data point i in the training set, it starts by computing the error corresponding to the point’s true class.

If the data point belongs to the positive class (y=1), we would like our model to predict a probability close to one, right? A perfect prediction of one would result in the logarithm of one, which is zero. It makes sense: a perfect prediction means zero loss. It goes like this:

y_i = 1 \Rightarrow \text{error}_i = \log\big(P(y_i = 1)\big)

What if the data point belongs to the negative class (y=0)? Then, we cannot simply use the predicted probability. Why not? Because the model outputs the probability of a point belonging to the positive class, not the negative one. Luckily, the latter can be easily computed:

P(y_i = 0) = 1 - P(y_i = 1)

And thus, the error associated with a data point belonging to the negative class goes like this:

y_i = 0 \Rightarrow \text{error}_i = \log\big(1 - P(y_i = 1)\big)
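To make this concrete, here is a minimal sketch of the per-point error in NumPy. The function name point_error is just illustrative, and it assumes prob is the model's predicted probability of the positive class, P(y_i = 1):

```python
import numpy as np

def point_error(prob, label):
    # prob: predicted probability of the positive class, P(y_i = 1)
    # label: true class, either 1 (positive) or 0 (negative)
    if label == 1:
        return np.log(prob)       # error for a positive point
    return np.log(1 - prob)       # error for a negative point

# A perfect prediction for a positive point gives log(1) = 0, that is, zero error
print(point_error(1.0, 1))   # 0.0
# A confidently wrong prediction gives a large negative error
print(point_error(0.01, 1))  # ≈ -4.6
```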

Once all errors are computed, they are aggregated into a loss value.

Binary cross-entropy loss

For the binary cross-entropy loss, we simply take the average of the errors and invert its sign. This gives us the following equation:

BCE(y) = -\dfrac{1}{N_{pos} + N_{neg}} \left[ \sum_{i=1}^{N_{pos}} \log\big(P(y_i = 1)\big) + \sum_{i=1}^{N_{neg}} \log\big(1 - P(y_i = 1)\big) \right]
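The formula translates almost directly into code. The sketch below assumes probs holds the predicted probabilities P(y_i = 1) (already passed through the sigmoid) and labels holds the true classes as 0s and 1s; the function name bce_loss is just illustrative:

```python
import numpy as np

def bce_loss(probs, labels):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # log(P(y=1)) for positive points, log(1 - P(y=1)) for negative points
    errors = np.where(labels == 1, np.log(probs), np.log(1 - probs))
    # take the average of the errors and invert its sign
    return -errors.mean()
```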

Let us assume we have two dummy data points, one for each class. Then, let us pretend our model made predictions for them: 0.9 and 0.2. These predictions are not bad, since the model assigns a 90% probability of being positive to an actual positive and only a 20% probability of being positive to an actual negative. What does this look like in code?
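Here is a minimal sketch (the variable names dummy_labels and dummy_predictions are just illustrative):

```python
import numpy as np

dummy_labels = np.array([1.0, 0.0])       # one positive and one negative point
dummy_predictions = np.array([0.9, 0.2])  # predicted probabilities of being positive

# Positive point: error = log(0.9); negative point: error = log(1 - 0.2)
loss = -(np.log(0.9) + np.log(1 - 0.2)) / 2
print(loss)  # ≈ 0.1643
```

The result, roughly 0.164, is a fairly low loss, which matches the intuition that these are reasonably good predictions. The bce_loss sketch above gives the same value, and, if you are working in PyTorch, torch.nn.BCELoss() with its default mean reduction should produce it as well.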
