Visualization with Distributions

This lesson introduces the types of distribution plots used to visualize data and shows how Python's libraries support them.

Introduction to distributions

A probability distribution is a mathematical function that provides the probabilities of the occurrence of different possible outcomes.

For example, you might have a program that returns 1 with a 50% probability and 0 with a 50% probability. Thus, 50% of your probability distribution would be assigned to 1 and 50% to 0.

If you were to plot this expected distribution, you would have two bars of equal height for 1 and 0.

Often, you don't know the mathematical function that generated your data, so you observe the empirical distribution instead. For example, you might sample 10 colored balls from a bag and get 2 red, 3 yellow, and 5 green. That is your empirical distribution, and you could represent it graphically with three bars: one of height 2 for red, one of height 3 for yellow, and one of height 5 for green.
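To make the ball example concrete, here is a minimal sketch that tallies the draws and prints a rough text bar chart. It uses only the standard library; the variable names (draws, counts, proportions) are illustrative and not part of any library API.

```python
from collections import Counter

# The 10 sampled balls from the example: 2 red, 3 yellow, 5 green.
draws = ["red"] * 2 + ["yellow"] * 3 + ["green"] * 5

# Counter tallies how often each color was observed (the empirical distribution).
counts = Counter(draws)

# Dividing by the sample size turns counts into observed proportions.
total = sum(counts.values())
proportions = {color: n / total for color, n in counts.items()}

# A text "bar chart": one row per color, bar length equal to the count.
for color, n in counts.items():
    print(f"{color:<7}{'#' * n} ({n})")
```

The counts are the bar heights described above; the proportions are what you would compare against an expected distribution if you knew it.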

Seaborn has a few ways to plot distributions:

  • Histograms
  • Box plots
  • Violin plots
  • Joint plots

Types of distributions

Histogram

As explained by Wikipedia, a histogram is an estimate of the probability distribution of a continuous variable. To construct a histogram, the first step is to bin the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
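The binning procedure can be sketched by hand in a few lines. This is a simplified illustration using equal-width bins, not how Seaborn implements it; the function name bin_counts and the sample heights are made up for the example.

```python
def bin_counts(values, num_bins):
    """Split [min, max] into num_bins equal-width intervals and count the values in each.

    Assumes the values are not all identical (otherwise the bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Each bin is [left, right), except the last one, which also
        # includes its right edge so the maximum value is counted.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical sample of ten heights in centimeters.
heights = [150, 152, 158, 160, 161, 165, 170, 171, 179, 190]
edges, counts = bin_counts(heights, num_bins=4)

for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {n}")
```

Each printed row corresponds to one bar of the histogram: the interval is the bar's position on the x-axis and the count is its height.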

You create a histogram with the distplot() function in Seaborn. You only need to pass one argument: the continuous variable for which you would like to construct a histogram. Note that distplot() is deprecated as of Seaborn 0.11; histplot() and displot() are its replacements.

Histograms have one very important parameter: the number of bins. You specify it with the bins parameter. If it is unspecified, Seaborn tries to find a useful number of bins. Remember, though, that the more bins you use, the higher the variance of your plot; the fewer bins, the more bias. In other words, many bins make the plot noisy, while few bins can hide real structure. Be thoughtful when choosing the number of bins, as it might change the way you view your data.

Let’s take a look: