
Statistics and Counts

Explore how to gather key statistics and counts from data using Pandas in Python. Understand data types, convert data as needed, examine unique values, apply grouping functions, calculate correlations, and generate percentiles. This lesson helps you use these techniques to better describe and analyze datasets for clearer insights.

In the last chapter, we used Pandas to read data from CSV files. Now we will look more closely at how Pandas can process that data.

To do so, we will use the same data set as in the lesson on reading CSV data. Here are the first five rows to refresh your memory.

Python 3.5
import pandas as pd

# The file is read with header=None, so the column names are supplied explicitly.
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
         'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.head())  # head() returns the first five rows by default

Gathering statistics on data

A good place to start is simply looking at your data with a few pandas functions to see what issues there might be. The describe() method returns counts and summary statistics for the numeric columns.

Python 3.5
import pandas as pd

# The file is read with header=None, so the column names are supplied explicitly.
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
         'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.describe())  # by default, summarizes only the numeric columns

For each numeric column, we now have the count, the mean, the standard deviation (std), the min, the max, and the 25th, 50th, and 75th percentiles.
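Individual statistics can also be read out of the describe() result, which is itself a DataFrame. The sketch below uses a small made-up DataFrame (toy_df and its values are hypothetical, chosen so the example runs without adult.data); the same lookups should work on train_df.

```python
import pandas as pd

# Hypothetical toy data standing in for two numeric columns of the data set.
toy_df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "hoursperweek": [40, 50, 40, 40, 30],
})

stats = toy_df.describe()
print(stats)

# The rows of the result are the statistics; the columns are the numeric columns.
stat_names = list(stats.index)

# Look up a single statistic by row label and column name.
mean_age = stats.loc["mean", "age"]
median_age = stats.loc["50%", "age"]  # the 50th percentile is the median
print(mean_age, median_age)
```

Selecting by label with .loc keeps the code readable, since the row labels match the names printed by describe().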

Note: The mean is influenced more by outliers than the median, which describe() reports as the 50% row. Also, you can always square the standard deviation to get the variance.
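Both points in the note can be checked on a few numbers. This is a minimal sketch with invented values (the hours Series is hypothetical, not taken from the data set): one extreme value pulls the mean upward while the median stays put, and squaring the standard deviation gives the variance.

```python
import pandas as pd

# Hypothetical values: five typical working weeks plus one extreme outlier.
hours = pd.Series([38, 40, 40, 42, 40, 99])

mean_hours = hours.mean()      # pulled upward by the 99
median_hours = hours.median()  # stays at the typical value
print("mean:", mean_hours, "median:", median_hours)

# std() and var() both use the sample formula (ddof=1) by default,
# which matches the std reported by describe().
std_hours = hours.std()
var_hours = hours.var()
print("std squared:", std_hours ** 2, "variance:", var_hours)
```

When the mean and the median of a column differ a lot, that is often a hint that the column is skewed or contains outliers worth inspecting.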

You may have noticed that some ...