
How to plot a histogram in Python using Matplotlib

Saif Ali
Jul 03, 2024

Histograms are essential tools in data analysis and visualization. They provide a graphical representation of the distribution of a dataset, helping analysts understand the underlying patterns and characteristics of the data. A histogram divides the data into intervals called bins and displays the frequency of data points falling into each bin as bars.
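To make the idea of bins concrete, here is a minimal pure-Python sketch of the counting a histogram performs before anything is drawn. The `bin_counts` helper and the `scores` list are hypothetical, introduced only for illustration, and the sketch assumes equal-width bins and at least two distinct values in the data.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample data, chosen only for illustration.
scores = [52, 55, 61, 64, 67, 70, 71, 73, 78, 80, 84, 92]
counts = bin_counts(scores, 4)

# Print a rough text histogram: one '#' per data point in each bin.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

Each bar in a plotted histogram corresponds to one entry of `counts`; Matplotlib performs this binning for you.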

Understanding how to plot histograms in Python using Matplotlib is crucial for software developers and data professionals. It equips individuals with the skills to visualize and analyze data effectively, a fundamental aspect of many software development and data science tasks. Whether you’re building machine learning models, designing data-driven applications, or simply exploring datasets for insights, the ability to create informative histograms can significantly enhance your capabilities.

Matplotlib, a popular Python library for data visualization, offers a straightforward way to create histograms. You can customize various aspects of the histogram, such as the number of bins, colors, and labels, to effectively communicate insights from your data.

In this blog, we’ll explore how to plot histograms in Python using Matplotlib, covering the syntax, parameters, and implementation details.

Features of the histogram plot

Histograms offer several features that make them valuable for data analysis:

  • Distribution visualization

  • Bin customization

  • Data comparison

  • Insight generation

Distribution visualization

Histograms visually represent the data distribution, making it easier to identify patterns, trends, and outliers. The following are some common distribution shapes:

  • Normal distribution: A bell-shaped symmetric distribution where most data cluster around the mean, with fewer data points in the tails. In a histogram, a normal distribution appears as a symmetrical mound centered around the mean.

  • Uniform distribution: A distribution where all outcomes are equally likely. In a histogram, a uniform distribution appears roughly flat, with bars of similar height, indicating that all values occur with approximately equal frequency.

  • Poisson distribution: A discrete probability distribution that represents the number of events occurring in a fixed interval of time or space, given a constant mean rate of occurrence. In a histogram, a Poisson distribution has bars only at non-negative integer values, with a peak near the mean and a tail extending to the right. The skew is more pronounced when the mean is small.

  • Exponential distribution: A distribution that represents the time between events in a Poisson process. Events occur continuously and independently at a constant average rate. In a histogram, an exponential distribution appears skewed with a long tail on the right side.

  • Left-skewed distribution: This is also known as a negatively skewed distribution, where the tail of the distribution extends to the left, indicating more extreme low values. A left-skewed distribution appears in a histogram with a longer tail on the left side.

  • Right-skewed distribution: Also known as a positively skewed distribution, this type of distribution has a tail that extends to the right, indicating more extreme high values. It appears in a histogram with a longer tail on the right side.

  • Bimodal distribution: A distribution with two distinct peaks, indicating two separate modes or clusters of data. A bimodal distribution appears in a histogram with two distinct peaks separated by valleys.

  • Log-normal distribution: A distribution where the logarithm of the data follows a normal distribution. In a histogram with a linear x-axis, a log-normal distribution appears skewed with a long tail on the right side. After transforming the x-axis to a logarithmic scale, it appears as a symmetric bell shape.

Some examples of these distributions are as follows:

[Figure: Normal distribution (image 1 of 8)]
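To reproduce shapes like the ones described above, one option is to draw random samples with NumPy and plot each in its own panel. This is a minimal sketch under assumptions: the distribution parameters, sample size, and output filename are arbitrary choices for illustration, not values from this blog.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
size = 5000  # arbitrary sample size

# Parameters below are illustrative assumptions.
samples = {
    "Normal": rng.normal(loc=0.0, scale=1.0, size=size),
    "Uniform": rng.uniform(low=0.0, high=1.0, size=size),
    "Poisson": rng.poisson(lam=4.0, size=size),
    "Exponential": rng.exponential(scale=1.0, size=size),
    "Bimodal": np.concatenate([
        rng.normal(loc=-2.0, scale=0.7, size=size // 2),
        rng.normal(loc=2.0, scale=0.7, size=size // 2),
    ]),
    "Log-normal": rng.lognormal(mean=0.0, sigma=0.6, size=size),
}

# One panel per distribution, all with the same number of bins.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, (name, values) in zip(axes.flat, samples.items()):
    ax.hist(values, bins=30, color="skyblue", edgecolor="black")
    ax.set_title(name)

fig.tight_layout()
fig.savefig("distribution_shapes.png")
```

Remove the `matplotlib.use("Agg")` line and call `plt.show()` instead of `savefig()` if you prefer an interactive window.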

Bin customization

Users can customize the number and size of bins to control the granularity of the histogram, allowing for better data exploration.

The bin customization in histograms is illustrated below. Each histogram represents the same dataset but with different numbers of bins: 10, 20, 30, 40, 50, and 60 bins:

Six histograms with varying bin counts
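A comparison like this can be sketched in a few lines. The dataset here is synthetic (an assumption, since the data behind the figure is not given), and the filename is arbitrary. Note that the total count stays the same for every bin setting; only the granularity changes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data standing in for the dataset in the figure.
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=50, scale=10, size=1000)

bin_choices = [10, 20, 30, 40, 50, 60]
totals = []

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, num_bins in zip(axes.flat, bin_choices):
    counts, edges, _ = ax.hist(data, bins=num_bins, color="skyblue", edgecolor="black")
    ax.set_title(f"{num_bins} bins")
    # Every data point lands in exactly one bin, whatever the bin count.
    totals.append(int(counts.sum()))

fig.tight_layout()
fig.savefig("bin_comparison.png")
```

Too few bins can hide structure, while too many make the plot noisy, so it is worth trying several values.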

Data comparison

Histograms enable the comparison of different datasets or subsets of data, facilitating the identification of similarities and differences.

Below is an image that compares two schools’ exam scores using histograms. The x-axis represents exam scores, while the y-axis represents the number of students. This visualization allows for a direct comparison between the performance of students from each school, highlighting any disparities or similarities in their scores:

Histogram comparison example
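Here is a minimal sketch of such a comparison. The scores for both schools are synthetic and the names `school_a` and `school_b` are hypothetical. Passing the same bin edges to both calls keeps the bars aligned, and `alpha` lets the overlapping bars show through.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic exam scores (assumed values), clipped to the 0-100 range.
rng = np.random.default_rng(seed=2)
school_a = rng.normal(loc=70, scale=10, size=300).clip(0, 100)
school_b = rng.normal(loc=62, scale=14, size=300).clip(0, 100)

# Shared bin edges: 20 bins of width 5 covering 0 to 100.
shared_edges = np.linspace(0, 100, 21)

fig, ax = plt.subplots(figsize=(8, 5))
counts_a, _, _ = ax.hist(school_a, bins=shared_edges, alpha=0.6, label="School A")
counts_b, _, _ = ax.hist(school_b, bins=shared_edges, alpha=0.6, label="School B")

ax.set_xlabel("Exam score")
ax.set_ylabel("Number of students")
ax.set_title("Exam scores by school")
ax.legend()
fig.savefig("school_comparison.png")
```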

Insight generation

Examining the histogram’s shape, central tendency, spread, and skewness can help analysts gain insights into the data’s underlying characteristics.

Let’s analyze a histogram concerning these parameters to gain deeper insights:

  • Shape: The histogram’s shape can reveal the data’s distribution pattern. For example, a symmetrical, bell-like shape is consistent with a normal distribution, with the data spread evenly on both sides of the mean. A skewed shape, whether left or right, suggests an imbalance in the distribution, with more data points clustered toward one end of the range. This insight allows analysts to understand the general trend and variability within the dataset.

  • Central tendency: The central tendency, represented by measures such as the mean, median, or mode, provides insight into the typical or central value around which the data is clustered. Examining the central tendency in the histogram helps analysts identify the dataset’s most common or typical outcome, providing a reference point for further analysis and decision-making.

  • Spread: The spread of the data, also known as dispersion, describes the extent to which the data points are scattered or clustered. A wider spread indicates greater variability within the dataset, while a narrower spread suggests less variability and a more consistent pattern. Understanding the spread in the histogram allows analysts to assess the range of possible values and the degree of uncertainty associated with the data.

  • Skewness: Skewness measures the asymmetry of the distribution. A positively skewed distribution has a longer right tail, indicating more extreme values on the higher end of the range. Conversely, a negatively skewed distribution has a longer left tail, indicating more extreme values on the lower end of the range. Analyzing skewness in the histogram provides insights into the direction and extent of the skew, helping analysts understand the distribution’s shape and potential outliers.
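The visual impressions above can be backed up with numbers. The following sketch computes central tendency, spread, and a moment-based skewness with NumPy for a synthetic right-skewed sample. The data and parameters are assumptions for illustration, and the skewness formula used is the mean of the cubed standardized values.

```python
import numpy as np

# Synthetic right-skewed sample (exponential), used only for illustration.
rng = np.random.default_rng(seed=3)
data = rng.exponential(scale=2.0, size=10_000)

mean = data.mean()          # central tendency
median = np.median(data)    # central tendency, robust to extreme values
std = data.std()            # spread

# Moment-based skewness: average of the cubed standardized values.
skewness = np.mean(((data - mean) / std) ** 3)

print(f"mean={mean:.2f}, median={median:.2f}, std={std:.2f}, skewness={skewness:.2f}")
```

For a right-skewed sample like this one, the mean sits above the median and the skewness is positive, which matches the long right tail you would see in the histogram.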

Syntax

The syntax for creating a histogram using Matplotlib is straightforward:

import matplotlib.pyplot as plt
# Data
data = [...] # Input your dataset here
# Plotting the histogram
plt.hist(data, bins=..., color='...', edgecolor='...', alpha=..., ...)
# Adding labels and title
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Title of Histogram')
# Displaying the plot
plt.show()

This syntax outlines the essential steps for creating a histogram using Matplotlib. You’ll need to replace [...] with your dataset, specify the number of bins with bins=..., choose a color for the bars with color='...', set the edge color of the bars with edgecolor='...', and adjust the transparency of the bars with alpha=..., among other parameters as needed. Finally, don’t forget to add appropriate labels to the x-axis and y-axis and provide a title for the histogram.
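As a concrete instance of the template, here is a small runnable sketch with the placeholders filled in. The data values, colors, labels, and filename are arbitrary assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data, chosen only for illustration.
values = [2.1, 2.4, 2.9, 3.0, 3.2, 3.3, 3.7, 4.0, 4.1, 4.8]

plt.figure(figsize=(6, 4))

# Plot the histogram with 5 equal-width bins.
counts, edges, bars = plt.hist(values, bins=5, color="steelblue", edgecolor="black", alpha=0.8)

# Add labels and a title.
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of example values")

# Save the figure; use plt.show() instead in an interactive session.
plt.savefig("example_histogram.png")
```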

Parameters

The plt.hist() function accepts several parameters that control the appearance and behavior of the histogram:

  • x: The input data to be plotted as a histogram. This can be a list, a NumPy array, or any other sequence-like object, such as a pandas Series.

  • bins: This parameter controls the binning of the histogram. It can be an integer specifying the number of equal-width bins, a sequence of bin edges defining the bin boundaries, or a string naming a binning strategy such as 'auto'. If not specified, a default of 10 bins will be used.

  • range: This parameter sets the lower and upper limits of the bins. It is a tuple specifying the minimum and maximum values, and data points outside this range are ignored. If not specified, the range is automatically determined from the minimum and maximum of the input data. It has no effect when bins is a sequence of edges.

  • density: This is a boolean parameter that, if set to True, normalizes the histogram such that the area under the histogram equals 1, converting the counts to a probability density. The default is False.

  • weights: This array-like parameter assigns weights to each data point. This can be used to specify the importance of individual data points in the histogram.

  • cumulative: This is a boolean parameter that plots a cumulative histogram if set to True. The default is False.

  • histtype: This parameter specifies the type of histogram to be plotted. Options include 'bar', 'barstacked', 'step', 'stepfilled'. The default is 'bar'.

  • align: This parameter controls the alignment of the bars with the bin edges. Options include 'left', 'mid', 'right'. The default is 'mid'.

  • orientation: This parameter specifies the orientation of the histogram. Options include 'vertical' or 'horizontal'. The default is 'vertical'.

  • color: This is the color or sequence of colors used for the bars in the histogram.

  • label: This is a label for the histogram, which can be used for creating a legend.

  • stacked: This is a boolean parameter that, if set to True, stacks multiple histograms on top of each other. The default is False.

  • alpha: This is the transparency of the bars in the histogram. It ranges from 0 (transparent) to 1 (opaque).

  • edgecolor: This is the color of the edges of the bars in the histogram.

  • linewidth: This is the width of the edges of the bars in the histogram.

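These are the main parameters of the hist() function in Matplotlib. Note that alpha, edgecolor, and linewidth are not named parameters of hist() itself; they are passed through as keyword arguments to the patches that draw the bars. You can adjust these parameters to customize the appearance and behavior of your histogram according to your specific requirements.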
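To see a few of these parameters working together, here is a minimal sketch that plots a density histogram and a cumulative histogram side by side. The data is synthetic and the specific values for `bins`, `range`, and the styling are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, used only for illustration.
rng = np.random.default_rng(seed=4)
data = rng.normal(loc=0.0, scale=1.0, size=2000)

fig, (ax_density, ax_cumulative) = plt.subplots(1, 2, figsize=(10, 4))

# density=True rescales the bars so that their total area is 1.
density, edges, _ = ax_density.hist(
    data, bins=25, range=(-4, 4), density=True,
    histtype="stepfilled", color="lightgreen", edgecolor="black", alpha=0.7,
)
ax_density.set_title("density=True")
area = float(np.sum(density * np.diff(edges)))  # bar heights times bar widths

# cumulative=True makes each bin include the counts of all bins before it.
cumulative, _, _ = ax_cumulative.hist(
    data, bins=25, range=(-4, 4), cumulative=True,
    histtype="step", linewidth=1.5,
)
ax_cumulative.set_title("cumulative=True")

fig.tight_layout()
fig.savefig("hist_parameters.png")
```

Here `area` comes out as 1 (up to floating-point rounding), and the last value of `cumulative` equals the number of data points that fall inside the chosen range.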

Return type

The hist() function in Matplotlib returns a tuple containing three elements:

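  1. n: This is an array of the values for each histogram bin: counts by default, or densities when density=True. The length of this array corresponds to the number of bins. If several datasets are passed, a list of such arrays is returned.

  2. bins: This is an array of bin edges defining the boundaries of each bin in the histogram. Its length is the number of bins plus one.

  3. patches: This is a container of the Patch objects representing the histogram bars.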

After creation, you can use these returned values to further analyze or manipulate the histogram. For example, you might want to:

  • Use the counts of each bin (n): The counts can be normalized, rescaled, or compared with the counts from other datasets. Note that changing the returned array does not redraw the bars.

  • Reuse the bin edges (bins): Passing the same edges to another hist() call keeps multiple histograms aligned, and the edges can be used to compute bin centers or widths for specific ranges of interest.

  • Modify the properties of the histogram bars (patches): Customizing the appearance of the bars, such as their color, transparency, or edge style, can enhance the visualization for better clarity and presentation.
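The following sketch shows one way to put the three returned values to work: it highlights the tallest bar through `patches` and labels each bar with its count using `n` and `bins`. The data is synthetic and the color and filename are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, used only for illustration.
rng = np.random.default_rng(seed=5)
data = rng.normal(loc=5.8, scale=0.8, size=150)

fig, ax = plt.subplots(figsize=(8, 5))
n, bins, patches = ax.hist(data, bins=10, color="skyblue", edgecolor="black")

# patches: recolor the bar belonging to the fullest bin.
tallest = int(np.argmax(n))
patches[tallest].set_facecolor("orange")

# bins: compute the center of each bin from neighboring edges.
centers = (bins[:-1] + bins[1:]) / 2

# n: write each count just above its bar.
for count, center in zip(n, centers):
    ax.text(center, count, str(int(count)), ha="center", va="bottom")

fig.savefig("annotated_histogram.png")
```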

Implementation

We’ll use the Iris dataset, a popular dataset in machine learning and statistics. It contains measurements of iris flowers’ sepal and petal lengths and widths. For this histogram, we’ll focus on the sepal length.

import matplotlib.pyplot as plt
import seaborn as sns

# Load example dataset from seaborn
data = sns.load_dataset("iris")

# Set up the figure
plt.figure(figsize=(8, 6))

# Extract a single column for the histogram
sepal_length = data["sepal_length"]

# Plot the histogram
n, bins, patches = plt.hist(sepal_length, bins=10, color='skyblue', edgecolor='black', alpha=0.7)

# Add title and labels
plt.title('Histogram of Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')

# Add grid lines
plt.grid(True)

# Show the plot
plt.show()

The following is the output of the code above:

Let’s review the code above:

  • Line 1: We import the pyplot module of Matplotlib and alias it as plt.

  • Line 2: We import the Seaborn library and alias it as sns. Seaborn provides datasets and functions for statistical data visualization.

  • Line 5: We load the Iris dataset using Seaborn’s load_dataset() function. This dataset contains information about iris flowers, including sepal and petal measurements.

  • Line 8: We create a new figure with a specified size of 8x6 inches using plt.figure(). This sets up the canvas for our histogram.

  • Line 11: We extract the "sepal_length" column from the dataset and assign it to the variable sepal_length.

  • Line 14: We plot the histogram of sepal lengths using plt.hist(). We specify 10 bins and customize the bar color, edge color, and transparency of the histogram bars.

  • Line 17: We set the plot title to Histogram of Sepal Length using plt.title().

  • Line 18: We label the x-axis as Sepal Length (cm) using plt.xlabel().

  • Line 19: We label the y-axis as Frequency using plt.ylabel().

  • Line 22: We add grid lines to the plot using plt.grid(True) to improve readability.

  • Line 25: Finally, we display the histogram plot using plt.show().

Note: This implementation demonstrates how to plot a histogram in Python using Matplotlib, utilizing a sample dataset from the Seaborn library. Following these steps lets you visualize the data distribution and gain insights into its characteristics.

Conclusion

Mastering the creation of histograms in Python using Matplotlib opens up a world of possibilities for software developers and data professionals. Histograms serve as powerful tools for understanding the distribution and characteristics of data, enabling informed decision-making and insightful analysis. By delving into the syntax, parameters, and implementation details outlined in this blog, individuals can enhance their data visualization skills and excel in various domains, from software development to data science. Whether you’re visualizing trends in financial data, analyzing customer behavior in e-commerce, or exploring patterns in scientific research, the ability to craft informative histograms empowers you to unlock valuable insights and drive impactful outcomes in your projects and career endeavors.

Next steps

This blog has provided an in-depth exploration of plotting histograms using Matplotlib, a fundamental data analysis and visualization skill. However, the journey in data science is vast and rich, offering numerous opportunities for further growth and exploration. Here are some potential next steps to continue enhancing your data science skill set:

Matplotlib for Python: Visually Represent Data with Plots

For data science, Matplotlib is one of the most popular tools for representing data in a visual manner. There are many other tools, but for the Python user, Matplotlib is a must-know. In this course, you will learn how to visually represent data in several different ways. You will learn how to use figures and axes to plot a chart, as well as how to plot from multiple types of objects and modules. You will also discover ways to control the spine of an axes and how to create complex layouts for a figure using GridSpec so you can create visually stunning charts. In the latter half of the course, you will focus on how to draw various types of plots, whether it be a line plot, a stem plot, or a heatmap plot. Overall, this is your no-fuss introduction to creating impactful data charts. By the end, you will have an important new skill to add to your resume. As any data scientist knows, it is necessary that you be able to show insights found from analyzing data.

6hrs
Intermediate
69 Playgrounds
2 Quizzes

Data Storytelling through Visualizations in Python

Mining the insights from data is the next critical step after parsing data and generating visualizations. This activity is called data storytelling, where you form a cohesive story explaining the strengths, weaknesses, and trends of your dataset with the help of predictions through machine learning models. In this course, you will learn how to identify and evaluate your data for trends, handle common real-world challenges of messy data such as large datasets and missing values, and present the right visualizations for different kinds of data. We will use Python, Matplotlib, Seaborn, and Plotly as the data science libraries for this course. This course will help you develop the key skills to translate the technical indicators in line with business objectives. It also aids in building your technical skills and processes to create effective data visualizations and narratives. Data storytelling can help you unlock actionable insights from your data.

9hrs
Intermediate
80 Playgrounds
5 Quizzes

Introduction to Data Science with Python

Python is one of the most popular programming languages for data science and analytics. It’s used across a wide range of industries. It’s easy to learn, highly flexible, and its various libraries can expand functionality to natively perform statistical functions and plotting. This course is a comprehensive introduction to statistical analysis using Python. You’ll start with a step-by-step guide to the fundamentals of programming in Python. You’ll learn to apply these functions to numerical data. You’ll first look at strings, lists, dictionaries, loops, functions, and data maps. After mastering these, you’ll take a deep dive through various Python libraries, including pandas, NumPy, Matplotlib, Seaborn, and Plotly. You’ll wrap up with guided projects to clean, analyze, and visualize unique datasets using these libraries. By the end of this course, you will be proficient in data science, including data management, analysis, and visualization.

4hrs 10mins
Beginner
11 Challenges
7 Quizzes

Data Science Interview Handbook

This course will increase your skills to crack the data science or machine learning interview. You will cover all the most common data science and ML concepts coupled with relevant interview questions. You will start by covering Python basics as well as the most widely used algorithms and data structures. From there, you will move on to more advanced topics like feature engineering, unsupervised learning, as well as neural networks and deep learning. This course takes a non-traditional approach to interview prep, in that it focuses on data science fundamentals instead of open-ended questions. In all, this course will get you ready for data science interviews. By the time you finish this course, you will have reviewed all the major concepts in data science and will have a good idea of what interview questions you can expect.

9hrs
Intermediate
140 Playgrounds
128 Quizzes


  
