...

Understanding the Role of Data Manipulation Skills

Learn how to describe and analyze datasets using pandas.

In practice, data rarely arrives in the format we want. We usually have several datasets that we want to merge, and we often need to normalize and clean up the data. For these reasons, data manipulation and preparation play a big part in any data visualization process, so we focus on them in this chapter and throughout the course.

The plan for preparing our dataset is roughly the following:

  • Explore the different files one by one.
  • Check the available data and data types and explore how each can help us categorize and analyze the data.
  • Reshape the data where required.
  • Combine different DataFrames to add more ways to describe our data.
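
As a rough illustration of these four steps, here is a minimal sketch on toy data. The two small DataFrames, their column names, and their values are invented for this example and are not taken from the course files.

```python
import pandas as pd

# Toy data, invented for illustration only; not taken from the course files.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [1.0, 2.0],
    "2020": [1.5, 2.5],
})
regions = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["North", "South"],
})

# Steps 1 and 2: look at the size and the data types.
print(wide.shape)
print(wide.dtypes)

# Step 3: reshape from wide (one column per year)
# to long (one row per country and year).
long = wide.melt(id_vars="country", var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# Step 4: combine with another DataFrame to add a descriptive column.
combined = long.merge(regions, on="country", how="left")
print(combined)
```

The real files will likely need different column names and more cleanup; the point here is only the order of operations.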

Let’s go through these steps right away.

Exploring the data files

We start by reading in the files in the data folder.

import os
import pandas as pd
pd.options.display.max_columns = None
print(os.listdir('../data'))

To keep things clear, note that each file name follows the pattern 'PovStats<name>.csv'. We'll use the distinct <name> part as the variable name for each DataFrame.
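
The naming convention can be sketched as a small helper. This is only an illustration: `frame_name` is a hypothetical function, the second file name is a made-up placeholder (only 'PovStatsSeries.csv' is named in the text), and `str.removeprefix` assumes Python 3.9 or later.

```python
from pathlib import Path

# Only 'PovStatsSeries.csv' appears in the text;
# 'PovStatsExample.csv' is a made-up placeholder.
file_names = ["PovStatsSeries.csv", "PovStatsExample.csv"]


def frame_name(file_name: str) -> str:
    """Return the distinct part of 'PovStats<name>.csv' in lowercase."""
    stem = Path(file_name).stem  # drop the '.csv' extension
    return stem.removeprefix("PovStats").lower()


# Map each file name to the variable name we would use for its DataFrame.
names = {f: frame_name(f) for f in file_names}
print(names)
```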

The series file

We start by exploring the series file using the following code:

series = pd.read_csv('../data/PovStatsSeries.csv')
print(series.shape)
print(series.head())

Line 2 prints the DataFrame's shape as a (rows, columns) tuple, and line 3 prints its first five rows.

It seems we have 64 different indicators, and for each one we have 21 columns of attributes, explanations, and notes. This table is already in long format: each column holds data about one attribute, and each row is a complete representation of one indicator, so there is nothing to reshape. We just need to explore what data is available and get familiar with this table.
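
One possible way to get familiar with a table like this is to check the column types, the share of missing values, and whether one column uniquely identifies each row. The sketch below uses a toy DataFrame whose column names and values are invented; it does not reflect the actual contents of the series file.

```python
import pandas as pd

# Toy stand-in for a metadata table; column names and values are invented.
toy = pd.DataFrame({
    "code": ["X1", "X2", "X3"],
    "name": ["First", "Second", "Third"],
    "note": ["some text", None, None],
})

# Data type of each column.
print(toy.dtypes)

# Fraction of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)

# Does the 'code' column identify each row uniquely?
print(toy["code"].is_unique)
```

Columns that are mostly empty may be less useful for describing indicators, while a unique column is a natural candidate for an identifier.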

Using this information, we can easily imagine creating a special dashboard for each indicator and placing it on a separate page. Each row seems to have enough information to produce an independent page with a title, description, details, and so on. The main content area of the page could be a visualization of that indicator for all countries and across all years. This is just one idea.

Let’s take a closer look at some interesting columns.

print(series['Topic'].value_counts())

We can see that the indicators are spread across four topics, and the output of the code above shows how many indicators fall under each one. ...
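
For readers new to `value_counts`, here is a minimal sketch on a toy Series. The topic labels are invented and are not the actual topics in the file.

```python
import pandas as pd

# Invented topic labels, for illustration only.
topics = pd.Series(["Health", "Income", "Health", "Health"], name="Topic")

# Number of rows per distinct value, sorted from most to least frequent.
counts = topics.value_counts()
print(counts)

# The same counts expressed as fractions of the total.
shares = topics.value_counts(normalize=True)
print(shares)
```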