Appendix A: Ideal Loss Values
An introduction to ideal loss values: what they are, and which loss function is best suited to a classification task.
When training a GAN, the ideal state we want to reach is a balance between the generator and the discriminator. When this happens, the discriminator is no longer able to separate real data from generated data. This is because the generator has learned to create data that looks like it could have come from the real dataset.
Let’s work out what the discriminator loss should be when this balance is reached. We’ll do this for both the mean squared error loss and the binary cross-entropy loss.
MSE Loss
The mean squared error loss has a simple definition. The difference between the value that emerges from an output node and the desired target value is the error. It can be positive or negative. If we square this error, the value is always positive. The mean squared error is the mean average of these squared errors.
This loss is written mathematically as follows. For each of the $n$ output layer nodes, the actual output is $o_i$, and the desired target is $t_i$:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (t_i - o_i)^2$$
Because a discriminator has only one output node, we can simplify this to $\text{MSE} = (t - o)^2$, the squared error for that single node.
When the discriminator can’t tell the real data from the generated data, it doesn’t output 1 because that would mean it is totally confident the data is real. It doesn’t output 0 because that would mean it is fully confident the data is generated.
The discriminator outputs 0.5 because it is equally unsure whether the data is real or generated.
If the output is 0.5, and the target is 1, the error is 0.5. When the target is 0, the error is -0.5. When these errors are squared, they both become 0.25.
So the MSE loss of a balanced GAN is 0.25.
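As a quick check, here is a minimal sketch, assuming PyTorch's built-in `torch.nn.MSELoss`, that feeds the balanced discriminator output of 0.5 in with both a real target of 1.0 and a generated target of 0.0:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

# a balanced discriminator outputs 0.5 whatever data it is shown
output = torch.tensor([0.5])

# loss against a real target (1.0) and against a generated target (0.0)
loss_real = mse(output, torch.tensor([1.0]))
loss_fake = mse(output, torch.tensor([0.0]))

print(loss_real.item())   # 0.25
print(loss_fake.item())   # 0.25
```

Both losses come out at 0.25, which is the value to look for when monitoring a GAN trained with the MSE loss.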
BCE Loss
The binary cross-entropy loss is based on the ideas of probability and uncertainty. Let’s take it step by step.
Looking back at the MNIST classifier, the neural network has 10 output nodes, one for each possible classification. If the trained network thought an image was of the digit 4, the output node for the digit 4 would have a high value, and the other nodes would have low values.
We previously talked about these values as a measure of confidence in the classification. It is an easy step to think of these values as probabilities. This step is made easier by the fact that the output nodes only have values between 0 and 1, just like probabilities.
The figure above shows an image of a 4 and the outputs of the classifier. The network has assigned a high probability to the node for the digit 4, which means it thinks the image is highly likely to be a 4. It has also assigned a moderate probability to the node for the digit 9, to say it thinks the image could be a 9. It has assigned very low probabilities to the other nodes because it doesn't think the image looks at all like a 2 or a 3, for example.
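To make this concrete, here is a small illustrative sketch; the numbers are made up for this example rather than taken from a real trained network:

```python
# hypothetical outputs of a 10-node MNIST classifier shown an image of a 4
# (illustrative values only - a real network would produce its own numbers)
outputs = [0.01, 0.02, 0.01, 0.03, 0.90, 0.02, 0.01, 0.02, 0.03, 0.40]

# the node for the digit 4 carries the highest value ...
best_digit = max(range(10), key=lambda digit: outputs[digit])
print(best_digit)    # 4

# ... while the node for the digit 9 carries a moderate value
print(outputs[9])    # 0.4
```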
Have a look at the following table, which shows examples of an output node's value x and the desired output y.

| output x | desired output y |
| -------- | ---------------- |
| 0.9      | 1.0              |
| 0.1      | 1.0              |
In the first row, the neural network has given a classification with a probability of 0.9. The target is 1.0, so it’s almost correct. A good loss function would have a low value for this output.
In the second row, the classification has a very low probability of 0.1, which is the network saying it doesn't really think the classification applies. The target is 1.0, so the network has got this very wrong. A good loss function would have a high value for this output.
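As a preview of how the binary cross-entropy loss behaves, here is a minimal sketch, assuming PyTorch's built-in `torch.nn.BCELoss`, applied to the two rows of the table:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

# first row: output 0.9 against target 1.0 - almost correct, so a low loss
loss_good = bce(torch.tensor([0.9]), torch.tensor([1.0]))

# second row: output 0.1 against target 1.0 - very wrong, so a high loss
loss_bad = bce(torch.tensor([0.1]), torch.tensor([1.0]))

print(loss_good.item())   # roughly 0.105
print(loss_bad.item())    # roughly 2.303
```

The loss is indeed small when the output is close to the target and large when it is far away, which is exactly the behaviour we asked for above.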
Let’s now move from probability to uncertainty.
Entropy
Entropy is a mathematical idea for describing uncertainty.
- If we had a very biased coin, with both sides having a head, the chance of getting heads is 100%, and the chance of getting a tail is 0%. In both cases, we're 100% certain of the outcome. The uncertainty is zero, so we say the entropy is 0.
- If we had a fair coin, with heads on one side and tails on the other, we're maximally uncertain about the outcome. The entropy is at its highest.
The mathematical expression that gives us this entropy is:

$$H = - \sum_{i} p_i \log(p_i)$$

The sum is over all potential outcomes, and $p_i$ is the probability of each outcome. We won't go into where this expression comes from, but we will see visually why it has the right shape.
The following graph shows the entropy, calculated using the above expression, for a coin where the probability $p$ is the chance of getting a head.
The graph shows us that when we have a coin where both sides are a head and p(heads) = 1, the uncertainty is 0. It also shows us that the uncertainty is zero when both sides are tails and p(heads) = 0. The entropy is highest when the coin is fair and p(heads) = 0.5.
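If you want to reproduce that graph yourself, here is a minimal sketch using only Python's standard `math` module; it assumes natural logarithms, which is also what PyTorch's cross-entropy losses use:

```python
import math

def coin_entropy(p_heads):
    """Entropy -sum(p * log(p)) of a coin toss, treating 0 * log(0) as 0."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:
            entropy -= p * math.log(p)
    return entropy

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(coin_entropy(p), 3))

# 0.0  0.0    <- both sides tails, no uncertainty
# 0.25 0.562
# 0.5  0.693  <- fair coin, maximum uncertainty
# 0.75 0.562
# 1.0  0.0    <- both sides heads, no uncertainty
```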
To see how this works, let's do the calculation for a coin with both sides heads, so p(heads) = 1.
...