How to do data analysis and visualization in Python

When the British mathematician Clive Humby said that “data is the new oil,” the comparison implied two things: data isn’t useful in its raw state, and data will be critical for the economy and progress. Data analysis and visualization are the key to identifying useful patterns and trends in raw data.

Key Takeaways:

  • Data analysis and visualization are essential in fields like data science and big data, helping to uncover patterns and trends from raw data inputs.

  • Python is a popular choice for data analysis due to its simplicity and the availability of powerful libraries like pandas for data handling and Matplotlib and seaborn for visualization.

  • Installing the required libraries using pip is the first step in setting up your environment for data analysis, ensuring you have the necessary tools to explore and visualize data.

  • Various plot types can be used for data visualization, including bar charts for comparing categories, line graphs for showing trends over time, and scatter plots for illustrating relationships between two variables.

  • Choosing the right visualization is important because it can greatly affect how clearly the audience sees the trends and patterns in the data.

The need for data analysis and visualization

Data analysis and visualization play a major role in fields such as data science and big data. Their applications are widespread, from analyzing stock market trends to optimizing business operations. By turning raw data into meaningful insights, analysis and visualization help us understand patterns, correlations, and trends, enabling better decision-making across industries.

This Answer will help you learn how to represent data in its most suitable visual form and how to interpret the result.

Tools for data analysis

Some of the most commonly used and easy-to-learn tools for data analysis are:

  • Python programming

  • R programming

  • Power BI

  • Microsoft Excel

Although each has its own strengths, this Answer keeps things simple and explains the data analysis and visualization concepts through Python. Python is the choice here because it is a high-level language and offers many visualization libraries.

Data analysis and visualization libraries in Python

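When it comes to analyzing and visualizing data, Python has some great libraries that make the job much easier. These libraries help you explore your data and create visuals that tell a story. Here are a few popular ones:

  • pandas: for loading, cleaning, and summarizing tabular data

  • NumPy: for numerical arrays and calculations

  • Matplotlib: for general-purpose plotting

  • seaborn: for statistical plots built on top of Matplotlib

These libraries can be used to import data from file formats, such as CSV or Excel, and convert raw data into graphs, pie charts, scatter plots, etc.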

Steps to perform data analysis and visualization in Python

Data analysis and visualization in Python involve the following steps:

  1. Install the libraries.

  2. Import the libraries.

  3. Choose and import the dataset.

  4. Perform data visualization and analysis.

An example of data analysis and visualization in Python

Now, let’s discuss each of these steps individually through a Python example:

1. Install Python data visualization libraries

To install the latest release of these libraries, you can use pip. Make sure you have pip installed, and then you can run these commands in your terminal or command prompt:

pip install matplotlib
pip install pandas
pip install seaborn
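After installation, it can be useful to confirm that the packages are visible to the Python interpreter you plan to use. The following is a minimal sketch that relies only on the standard library; the helper name `installed_versions` is hypothetical and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical helper: report which of the three libraries are installed
report = installed_versions(["matplotlib", "pandas", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver if ver else 'not installed'}")
```

If a package is reported as not installed, the pip command may have run against a different Python environment than the one executing the script.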

2. Import Python data visualization libraries

Once the libraries are installed, the next step is to import them into the environment so that they can be used.

You can use the following code to import these libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

3. Choose and import the datasets

Select the dataset you want to analyze. You can import it from various file formats, such as CSV or Excel.

The dataset we are using in this Answer covers county-level results from swing states in the 2008 US elections. Here is a glimpse of the dataset, in .csv format:

state,county,total_votes,dem_votes,rep_votes,dem_share
PA,Erie County,127691,75775,50351,60.08
PA,Bradford County,25787,10306,15057,40.64
PA,Tioga County,17984,6390,11326,36.07
PA,McKean County,15947,6465,9224,41.21
PA,Potter County,7507,2300,5109,31.04
PA,Wayne County,22835,9892,12702,43.78
PA,Susquehanna County,19286,8381,10633,44.08
PA,Warren County,18517,8537,9685,46.85
OH,Ashtabula County,44874,25027,18949,56.94
OH,Lake County,121335,60155,59142,50.46
PA,Crawford County,38134,16780,20750,44.71
OH,Lucas County,219830,142852,73706,65.99
OH,Fulton County,21973,9900,11689,45.88
OH,Geauga County,51102,21250,29096,42.23
OH,Williams County,18397,8174,9880,45.26
PA,Wyoming County,13138,5985,6983,46.15
PA,Lackawanna County,107876,67520,39488,63.1
PA,Elk County,14271,7290,6676,52.2
PA,Forest County,2444,1038,1366,43.18
PA,Venango County,23307,9238,13718,40.24
OH,Erie County,41229,23148,17432,57.01
OH,Wood County,65022,34285,29648,53.61
PA,Cameron County,2245,879,1323,39.92
PA,Pike County,24284,11493,12518,47.87

Note: When working locally, make sure the CSV file is downloaded and saved in your working directory.
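A small check before loading can turn a confusing error into a clear message when the file is missing. The sketch below assumes the file name `2008_Election.csv` used in this Answer; the function `load_election_data` is a hypothetical helper, not a pandas API.

```python
from pathlib import Path

import pandas as pd


def load_election_data(path="2008_Election.csv"):
    """Load the CSV into a DataFrame, failing early if the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found. Download the CSV and place it "
            "in the working directory, or pass its full path."
        )
    return pd.read_csv(csv_path)
```

Calling `load_election_data()` with no argument looks for the file in the current working directory, which is the same place a bare `pd.read_csv('2008_Election.csv')` would look.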

In Python, you can use the following code to load and explore the CSV dataset.

import pandas as pd
# Load the CSV file into a DataFrame using the pandas read_csv() function
df = pd.read_csv('2008_Election.csv')
# Display the first 5 rows using the head() method
print(df.head())
# Summarize the numeric columns using the describe() method
# This includes the count, mean, standard deviation, minimum, quartiles, and maximum
print(df.describe())
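Beyond `head()` and `describe()`, a quick consistency check helps confirm that the columns mean what you think they mean. The sketch below uses three rows copied from the glimpse above, loaded from an in-memory string so that it runs without the file. It assumes, based only on these sample values, that `dem_share` is the Democratic percentage of the two-party (Democratic plus Republican) vote; the column names `two_party_share` and `difference` are introduced here for illustration.

```python
import io

import pandas as pd

# Three rows copied from the dataset glimpse
sample_csv = """state,county,total_votes,dem_votes,rep_votes,dem_share
PA,Erie County,127691,75775,50351,60.08
PA,Tioga County,17984,6390,11326,36.07
OH,Ashtabula County,44874,25027,18949,56.94
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Assumption: dem_share = 100 * dem_votes / (dem_votes + rep_votes)
sample["two_party_share"] = (
    100 * sample["dem_votes"] / (sample["dem_votes"] + sample["rep_votes"])
)
# Absolute gap between the recomputed share and the recorded share
sample["difference"] = (sample["two_party_share"] - sample["dem_share"]).abs()

print(sample[["county", "dem_share", "two_party_share", "difference"]].round(2))
```

For these rows, the recomputed share stays within a few hundredths of a percentage point of the recorded value. A large gap would be a sign to revisit the assumption or inspect the data.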

4. Perform data visualization and analysis

Python libraries like seaborn and Matplotlib offer a wide array of plot types. The right choice depends on the data that you want to visualize and the problem at hand.

Select the right kind of plot

For the election data, if we want to see the distribution of the Democratic vote share across different counties, a histogram makes sense. Histograms support univariate analysis: they group the values of a single variable into bins and show how many observations fall into each bin, which reveals the shape and spread of the distribution.
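To make the binning idea concrete, the sketch below counts how many of the 24 `dem_share` values from the dataset glimpse fall into each 10-point interval, using NumPy's `histogram` function. The bin edges are chosen here for illustration and are not taken from the original example.

```python
import numpy as np

# dem_share values copied from the 24 rows of the dataset glimpse
dem_share = np.array([
    60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85,
    56.94, 50.46, 44.71, 65.99, 45.88, 42.23, 45.26, 46.15,
    63.10, 52.20, 43.18, 40.24, 57.01, 53.61, 39.92, 47.87,
])

# Illustrative bin edges: [30, 40), [40, 50), [50, 60), [60, 70]
edges = [30, 40, 50, 60, 70]
counts, _ = np.histogram(dem_share, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}%: {count} counties")
```

A histogram plot draws one bar per bin, with these counts as the bar heights.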

Plot the data

The data can provide us with different insights depending on the type of chart we select to present it.

a) Plotting with histograms

Let’s plot the data in Matplotlib first. Here is the code (with comments providing the necessary context):

import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset so that this example runs on its own
df = pd.read_csv('2008_Election.csv')
# Plot the histogram of the Democratic vote share using plt.hist()
plt.hist(df['dem_share'], bins=10, color='blue', alpha=0.7) # Number of bins, color, and transparency
# Add axis labels and a title
plt.xlabel('Percentage of Votes for Democrats')
plt.ylabel('Number of Counties')
plt.title('Distribution of Democratic Vote Share Across Counties')
# Add a horizontal grid for better readability (optional)
plt.grid(axis='y')
plt.show()

Now, let’s see how we can achieve the same using the seaborn library:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset so that this example runs on its own
df = pd.read_csv('2008_Election.csv')
# Set the style for seaborn
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6)) # Set the figure size (optional)
# Create the histogram using histplot(); the older distplot() is deprecated since seaborn 0.11
sns.histplot(df['dem_share'], bins=10, color='blue', kde=False) # Set kde=True to overlay a density curve
# Add axis labels and a title
plt.xlabel('Percentage of Votes for Democrats')
plt.ylabel('Number of Counties')
plt.title('Distribution of Democratic Vote Share Across Counties')
# Show the plot
plt.show()

b) Making an ECDF

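ECDF stands for empirical cumulative distribution function. For each value x, it gives the fraction of observations that are less than or equal to x. Plotting it means sorting a feature from lowest to highest and plotting each value against its cumulative fraction. Because it shows every data point and does not depend on a choice of bins, it is often considered an alternative to a histogram.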

Let's first create an ECDF in matplotlib:

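import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset so that this example runs on its own
df = pd.read_csv('2008_Election.csv')
x = np.sort(df['dem_share']) # Sort the vote shares from lowest to highest
y = np.arange(1, len(x) + 1) / len(x) # Cumulative fraction of counties at each sorted value
plt.plot(x, y, marker='.', linestyle='none') # Draw points only, without connecting lines
plt.xlabel('Percentage of vote for Obama')
plt.ylabel('ECDF')
plt.margins(0.02) # Keeps data off plot edges
plt.show()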

Now, let’s see how we can do the same with seaborn:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset so that this example runs on its own
df = pd.read_csv('2008_Election.csv')
# Set the style for seaborn
sns.set(style="whitegrid")
# plt.figure(figsize=(10, 6)) # Optional: Set the figure size
# Create the ECDF plot (requires seaborn 0.11 or later)
sns.ecdfplot(data=df, x='dem_share', marker='o') # The marker parameter draws a point at each observation
# Add axis labels and a title
plt.xlabel('Percentage of Votes for Democrats')
plt.ylabel('ECDF')
plt.title('Empirical Cumulative Distribution Function of Democratic Vote Share')
plt.margins(0.02) # Keeps data off plot edges
# Show the plot
plt.show()

Look at the results closely and try to infer what the plot is trying to present.
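One way to read an ECDF is to pick a value on the x-axis and look up the fraction of observations at or below it. The sketch below does this numerically for the 24 sample values in the dataset glimpse; the helper `ecdf_at` is hypothetical and written only for this illustration.

```python
import numpy as np

# dem_share values copied from the 24 rows of the dataset glimpse
dem_share = np.array([
    60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85,
    56.94, 50.46, 44.71, 65.99, 45.88, 42.23, 45.26, 46.15,
    63.10, 52.20, 43.18, 40.24, 57.01, 53.61, 39.92, 47.87,
])


def ecdf_at(data, value):
    """Return the fraction of observations less than or equal to value."""
    sorted_data = np.sort(data)
    # side="right" returns the number of elements <= value
    return np.searchsorted(sorted_data, value, side="right") / len(sorted_data)


fraction = ecdf_at(dem_share, 50.0)
print(f"Fraction of sample counties with a Democratic share of at most 50%: {fraction:.2f}")
```

On the plot, this number corresponds to the height of the ECDF at x = 50.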

Now, let’s say you wanted to compare each county’s share of votes for the Republican and Democratic parties. What plot would you use: a pie chart or a histogram? You can learn the differences and use cases for the different charts and decide which one is best suited to your problem.
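Other chart types are worth considering for this comparison too. As one possible approach, not necessarily the intended answer, a scatter plot of Republican votes against Democratic votes with a diagonal reference line shows which party led in each county. The sketch uses five rows from the dataset glimpse; the function name `plot_party_votes` is hypothetical.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the dataset glimpse
sample_csv = """state,county,total_votes,dem_votes,rep_votes,dem_share
PA,Erie County,127691,75775,50351,60.08
PA,Bradford County,25787,10306,15057,40.64
OH,Ashtabula County,44874,25027,18949,56.94
OH,Geauga County,51102,21250,29096,42.23
PA,Lackawanna County,107876,67520,39488,63.1
"""


def plot_party_votes(df):
    """Scatter Democratic vs. Republican votes, one point per county."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(df["dem_votes"], df["rep_votes"], color="purple", alpha=0.7)
    # Diagonal where both parties received the same number of votes
    limit = max(df["dem_votes"].max(), df["rep_votes"].max())
    ax.plot([0, limit], [0, limit], linestyle="--", color="gray")
    ax.set_xlabel("Democratic votes")
    ax.set_ylabel("Republican votes")
    ax.set_title("Democratic vs. Republican Votes by County")
    return ax


sample = pd.read_csv(io.StringIO(sample_csv))
ax = plot_party_votes(sample)
# Call plt.show() to display the figure when running interactively
```

Points below the dashed line are counties where the Democratic vote count was higher, and points above it are counties where the Republican vote count was higher.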

The next step: Enhancing data visualization with interactivity

While static data visualizations provide valuable insights, interactive data visualization takes it a step further by allowing users to explore data dynamically, uncovering deeper trends and patterns in real time.

To implement interactive visualizations, libraries like Plotly and Bokeh offer powerful tools that enable users to create dynamic, responsive charts and dashboards with ease. These tools allow for real-time exploration and manipulation of data, making it more engaging and insightful. You can explore the following exciting projects from Educative to apply interactive visualization techniques and further enhance your understanding of dynamic data exploration:

  1. Data Analysis and Visualization with sidetable and Bokeh

  2. Time Series Analysis and Visualization Using Python and Plotly

  3. Visualize Geospatial Data Using Plotly and Mapbox

Frequently asked questions



Can I do data analysis with Python?

Yes, you can analyze data with Python using libraries like pandas and NumPy to handle and analyze your data easily.


Is Python good for data visualization?

Yes, Python is great for data visualization because it has powerful libraries that make it easy to create beautiful and informative charts. The blog “Exploring data visualization: Matplotlib vs. seaborn” gives a hands-on introduction to two such libraries, Matplotlib and seaborn.


How do you create data visualizations using Python?

You can create data visualizations in Python by using libraries like Matplotlib and seaborn to draw charts and graphs that clearly show your data.


How do you analyze a data visualization?

To analyze a data visualization, you look at the patterns, trends, and insights that the chart shows to understand what the data means and make decisions based on it.

