Entropy, in information theory, is a measure of the amount of information contained in an event. The theory was proposed and developed by Claude Shannon at Bell Labs.
Let's say we have a coin toss whose outcome is either heads or tails. For a fair coin, we know there is a 50/50 chance of getting either. Now suppose a person claims that the probability of getting heads on a specific toss is 95%. This is an unusual claim but, at the same time, an intriguing one.
If that claim holds, then almost every time the coin is tossed, heads will come up. There is very little uncertainty about the outcome, and hence very little information to be gained from observing it. On the other hand, if someone else tosses a fair coin with 50/50 probabilities and gets heads three times in a row (a much less likely outcome), then observing that result conveys far more information.
Hence, if the probability of an event is low, the information gained from observing it is high. This can be stated by the function defined below:

$$I(x) = -\log_2 p(x)$$

Here, $I(x)$ is the information content (also called the self-information) of the event $x$, and $p(x)$ is the probability of that event; using a base-2 logarithm means the result is measured in bits.
In our short example, a single fair coin toss never yields a large information gain, because both outcomes (heads and tails) share the same 50/50 probability. By comparison, consider an event with a smaller chance of occurring, such as picking the one red marble from a bag of uniquely colored marbles, say with a probability of 1/9. Observing that event gives an information gain greater than the one obtained from the typical coin toss.
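To make the comparison concrete, here is a minimal Python sketch (the `information_content` helper and the example probabilities are purely illustrative, not part of any particular library) that evaluates $I(x) = -\log_2 p(x)$ for both events:

```python
import math

def information_content(p: float, base: float = 2) -> float:
    """Self-information I(x) = -log_b(p(x)) of an outcome with probability p."""
    return -math.log(p, base)

# Fair coin toss: p(heads) = 0.5
print(information_content(0.5))    # 1.0 bit

# Picking the red marble: p(red) = 1/9
print(information_content(1 / 9))  # ~3.17 bits
```

The rarer event carries roughly three times as much information as a single fair coin flip.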
Applying this idea to a discrete random variable $X$ with possible outcomes $x_1, \dots, x_n$, we can state the entropy as:

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_b p(x_i)$$

In the equation above, $b$ is the base of the logarithm, which is usually 2 (giving entropy in bits), $e$ (nats), or 10 (hartleys).
Another way to define entropy, in the context of a random variable $X$, is as the expected value of the information content:

$$H(X) = \mathbb{E}\big[I(X)\big] = \mathbb{E}\big[-\log_b p(X)\big]$$

Where $\mathbb{E}$ denotes the expectation operator and $I(X)$ is the information content of $X$.
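As a rough sketch of this expected-value view (the helper names below are hypothetical and simply mirror the formulas above), the entropy is the probability-weighted average of each outcome's information content:

```python
import math

def self_information(p: float, base: float = 2) -> float:
    """Information content I(x) = -log_b(p(x)) of a single outcome."""
    return -math.log(p, base)

def entropy(probabilities, base: float = 2) -> float:
    """H(X) = E[I(X)]: average information content, weighted by probability."""
    return sum(p * self_information(p, base) for p in probabilities if p > 0)
```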
Now it's time to go ahead and take a look at an example.
Let's take our coin toss example that we discussed at the start, where heads had a 95% chance of occurring. Then we can find the entropy as follows:

$$H(X) = -\big(0.95\,\log_2 0.95 + 0.05\,\log_2 0.05\big) \approx 0.286 \text{ bits}$$

However, for the fair 50/50 toss, the entropy would be:

$$H(X) = -\big(0.5\,\log_2 0.5 + 0.5\,\log_2 0.5\big) = 1 \text{ bit}$$

So the entropy is greater when the outcome is uncertain than when heads is nearly certain to occur.
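These two values can be checked with a few lines of Python; the sketch below uses `scipy.stats.entropy`, which computes the same Shannon entropy when given a list of probabilities:

```python
from scipy.stats import entropy

# Near-certain heads: very little surprise on average
print(entropy([0.95, 0.05], base=2))  # ~0.286 bits

# Fair coin: maximum uncertainty for two outcomes
print(entropy([0.5, 0.5], base=2))    # 1.0 bit
```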
Some of the properties of entropy are stated below:
Adding or removing a zero-probability event will not change the entropy.
The entropy of two simultaneous events is at most the sum of the entropies of the individual events, i.e. $H(X, Y) \le H(X) + H(Y)$, with equality when the two events are independent.
The entropy of a variable can only decrease (or stay the same) when the variable is passed through a function, i.e. $H(f(X)) \le H(X)$ (a small numerical check of this appears after the list of properties).
If $X$ and $Y$ are independent random variables, then knowing the value of $Y$ does not influence our knowledge of $X$, i.e. $H(X \mid Y) = H(X)$.
Entropy is concave in the probability mass function $p$.
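As a quick numerical sanity check of the function property above (the distribution and the mapping $f(x) = x \bmod 2$ are made up purely for illustration), passing a uniform four-valued variable through $f$ cannot increase its entropy:

```python
import math
from collections import defaultdict

def entropy(probabilities, base: float = 2) -> float:
    """Shannon entropy, ignoring zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

# X is uniform over four values, so H(X) = 2 bits.
p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

# Apply f(x) = x % 2 and collect the induced distribution of f(X).
p_fx = defaultdict(float)
for x, p in p_x.items():
    p_fx[x % 2] += p

print(entropy(p_x.values()))   # 2.0 bits
print(entropy(p_fx.values()))  # 1.0 bit -- H(f(X)) <= H(X), as stated above
```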
Entropy is used in a number of fields, including:
Combinatorics (Loomis–Whitney inequality and binomial coefficient approximation).
Machine Learning (decision trees, Bayesian inference).