...

/

Statistics and Counts

Statistics and Counts

This lesson discusses principal data processing and data analytics techniques using Pandas.

In the last chapter, while studying how to read data from CSV files, we used Pandas. Now, we will look deeper at Pandas to process data.

To do so, we will be using the same data set we used when learning about how to read in CSV data. Here are the first 5 rows to refresh your memory.

Press + to interact
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.head())

Gathering statistics on data #

A good place to start is just looking at your data using some pandas functions to better understand what issues there might be. Describe will give you counts and some statistics for continuous variables.

Press + to interact
import pandas as pd
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.describe())

For all of our numeric values, we now have the mean, the std, the min, the max, and a few different percentiles.

Note: It is good to remember that the mean value will be influenced more by outliers than the median. Also, you can always square the standard deviation to get the variance.

You may have noticed that some ...