Visualization with Distributions
This lesson introduces the types of distributions used to visualize data and shows how Python supports them through its libraries.
Introduction to distributions #
A probability distribution is a mathematical function that provides the probabilities of the occurrence of different possible outcomes.
For example, you might have a program that returns 1 with a 50% probability and 0 with a 50% probability. Thus, 50% of your probability distribution would be assigned to 1 and 50% to 0.
If you were to plot this expected distribution, you would have two bars of equal height for 1 and 0.
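To make this concrete, here is a minimal sketch that simulates such a program and compares the expected probabilities with the frequencies observed over many runs. The function name `coin_program`, the seed, and the number of runs are assumptions chosen for illustration.

```python
import random


def coin_program(rng):
    """Hypothetical program: returns 1 or 0, each with probability 0.5."""
    return 1 if rng.random() < 0.5 else 0


rng = random.Random(42)  # fixed seed so the run is repeatable
n_runs = 10_000          # assumed number of runs, chosen for illustration
results = [coin_program(rng) for _ in range(n_runs)]

# Expected distribution: half of the probability on each outcome.
expected = {0: 0.5, 1: 0.5}
# Observed frequencies: the share of runs that produced each outcome.
observed = {outcome: results.count(outcome) / n_runs for outcome in (0, 1)}

for outcome in (0, 1):
    print(f"outcome {outcome}: expected {expected[outcome]:.2f}, "
          f"observed {observed[outcome]:.3f}")
```

With this many runs, the observed frequencies should land close to 0.5 each, which is why the two bars in the plot would be of nearly equal height.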
Often, you don’t know the mathematical function that generated your data, so instead you observe the empirical distribution. You might sample 10 colored balls from a bag and get 2 red, 3 yellow, and 5 green. That would be your empirical distribution, and you could represent it graphically with 3 bars: one of height 2 for red, one of height 3 for yellow, and one of height 5 for green.
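The ball example can be sketched in a few lines of plain Python. The list `draws` is a hypothetical sample built to match the counts in the text, and the text bars stand in for the bars a plotting library would draw.

```python
from collections import Counter

# Hypothetical sample of 10 balls, matching the counts in the example above.
draws = ["red"] * 2 + ["yellow"] * 3 + ["green"] * 5

counts = Counter(draws)            # number of balls observed per color
total = sum(counts.values())
proportions = {color: n / total for color, n in counts.items()}

for color, n in counts.items():
    # Crude text bar chart: one '#' per ball of that color.
    print(f"{color:<7} {'#' * n} ({proportions[color]:.0%})")
```

Dividing each count by the sample size turns the bar heights into proportions, which is the form you would compare against a theoretical probability distribution.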
Seaborn has a few ways to plot distributions:
- Histograms
- Box plots
- Violin plots
- Joint plots
Types of distributions #
Histogram #
As explained by Wikipedia, a histogram is an estimate of the probability distribution of a continuous variable. To construct a histogram, the first step is to bin the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
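The binning step can be written out by hand. This is a minimal sketch, assuming equal-width bins and a sample that contains at least two distinct values; the helper `bin_counts` and the data are made up for illustration, and plotting libraries perform the same step internally.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width intervals.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    # Bin edges: num_bins consecutive, non-overlapping intervals need num_bins + 1 edges.
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


values = [1, 2, 2, 3, 5, 6, 7, 9, 10, 10]  # illustrative data
counts, edges = bin_counts(values, num_bins=4)

for i, n in enumerate(counts):
    print(f"[{edges[i]:.2f}, {edges[i + 1]:.2f}): {n}")
```

Each printed row is one bar of the histogram: the interval gives the bar's position and width, and the count gives its height.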
You create a histogram with the distplot() function in Seaborn (deprecated since Seaborn 0.11 in favor of histplot() and displot()). You only need to pass one argument: the continuous variable for which you would like to construct a histogram.
Histograms, though, have a very important parameter: the number of bins. You specify the number of bins with the bins parameter. If it is unspecified, Seaborn tries to find a useful number of bins to use. It is important to remember, though, that the more bins you use, the higher the variance of your plot, because it follows the noise in the sample; the fewer bins you use, the higher the bias, because real structure gets smoothed away. Be thoughtful when choosing the number of bins, as it might change the way you view your data.
Let’s take a look: