Outliers

This lesson explains what outliers are, why they happen, and how to remove them.

What is an outlier?

Dealing with outliers is another part of data cleaning. First off, how do you define an outlier? The answer can depend on domain knowledge and other context, but a simple way to start is to look at a box plot:

Box Plot of Hours Per Week

The plot above was generated with this command:

bbox = train_df['hoursperweek'].plot(kind="box")

Detection of an outlier

Here, anything outside the “whiskers” could be considered an outlier. As a refresher, the “whiskers” are the lines sticking out from the box; each one reaches to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box edge. ...
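
To make the whisker rule concrete, here is a minimal sketch in plain Python of how the cutoffs could be computed by hand. The function names `iqr_fences` and `find_outliers` and the sample `hours` list are made up for illustration; they are not part of the lesson's dataset or of any library.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between sorted data points,
    # which is the same rule pandas and NumPy use by default.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR-based cutoffs."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical weekly-hours sample, for illustration only.
hours = [40, 40, 38, 45, 50, 35, 40, 42, 99, 1]

print(iqr_fences(hours))     # (29.875, 52.875)
print(find_outliers(hours))  # [99, 1]
```

With a pandas column such as `train_df['hoursperweek']`, the same quartiles could be obtained with `Series.quantile(0.25)` and `Series.quantile(0.75)`. Whether a flagged value should actually be removed is still a judgment call that depends on the data.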