Statistics and Counts
This lesson discusses principal data processing and data analytics techniques using Pandas.
In the last chapter, we used Pandas while studying how to read data from CSV files. Now we will take a closer look at how Pandas can be used to process data.
To do so, we will use the same data set we used when learning how to read in CSV data. Here are the first 5 rows to refresh your memory.
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum',
         'maritalstatus', 'occupation', 'relationship', 'race', 'sex',
         'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.head())
Gathering statistics on data
A good place to start is to look at your data with a few pandas functions to get a sense of what issues it might contain. The describe() method gives you counts and summary statistics for the continuous (numeric) variables.
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum',
         'maritalstatus', 'occupation', 'relationship', 'race', 'sex',
         'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("adult.data", header=None, names=names)
print(train_df.describe())
For all of our numeric values, we now have the mean, the std (standard deviation), the min, the max, and a few different percentiles (25%, 50%, and 75% by default).
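Because describe() returns a regular DataFrame, individual statistics can be pulled out by row and column label. The sketch below illustrates this on a tiny hand-made frame rather than on adult.data, so that it runs without the file; the variable names sample_df and stats and the sample values are assumptions made for illustration only.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not read from adult.data)
sample_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Private"],
    "hoursperweek": [40, 13, 40, 40],
})

# With mixed column types, describe() summarizes only the numeric columns by default
stats = sample_df.describe()
print(stats)

# Rows of the result are the statistic names, columns are the numeric column names
print(list(stats.index))
print(stats.loc["mean", "age"])
```

The text column is left out of the result, which is why only the numeric columns appear in the summary.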
Note: It is good to remember that the mean value will be influenced more by outliers than the median. Also, you can always square the standard deviation to get the variance.
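Both points in the note can be checked with a short experiment. This is a minimal sketch using made-up numbers (the series ages and ages_with_outlier are hypothetical and not taken from the data set): one extreme value pulls the mean far away while the median stays put, and squaring the standard deviation gives the variance.

```python
import pandas as pd

# Hypothetical values chosen for illustration
ages = pd.Series([25, 30, 35, 40, 45])
ages_with_outlier = pd.Series([25, 30, 35, 40, 450])  # last value is an outlier

# Without the outlier, mean and median agree
print(ages.mean(), ages.median())

# With the outlier, the mean shifts a lot but the median does not
print(ages_with_outlier.mean(), ages_with_outlier.median())

# Squaring the standard deviation recovers the variance
# (pandas uses the sample formula, ddof=1, for both by default)
variance_from_std = ages.std() ** 2
print(variance_from_std, ages.var())
```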
You may have noticed that some ...