
Kaggle Challenge - Data Preprocessing


2. Data Preprocessing - Prepare the Data for Machine Learning Algorithms


From the results above, we can assume that the attributes from PoolQC through the Bsmt columns are missing for houses that lack these facilities (houses without pools, basements, garages, etc.). Therefore, these missing values can be filled in with “None”. MasVnrType and MasVnrArea both have 8 missing values, most likely for houses without masonry veneer.

What should we do with all this missing data?

Most machine learning algorithms cannot work with missing features, so we need to take care of them. Essentially, we have three options:

Get rid of the corresponding houses.
Get rid of the whole attribute or remove the whole column.
Set the missing values to some value (zero, the mean, the median, etc.).

We can accomplish these easily using DataFrame’s dropna(), drop(), and fillna() methods.
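As a minimal sketch of the three options, the snippet below applies each method to a small toy DataFrame. The data and the variable names (toy, option1, option2, option3) are invented for illustration and are not part of the housing dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the Kaggle housing set)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "SalePrice": [200000, 180000, 250000, 210000],
})

# Option 1: get rid of the rows (houses) with a missing LotFrontage
option1 = toy.dropna(subset=["LotFrontage"])

# Option 2: get rid of the whole attribute (column)
option2 = toy.drop(columns=["PoolQC"])

# Option 3: set the missing values to some value, here the column median
median = toy["LotFrontage"].median()
option3 = toy.fillna({"LotFrontage": median})
```

None of these calls modifies toy in place; each returns a new DataFrame, so the three results can be compared side by side.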

📌Note: Whenever you choose the third option, say imputing values with the median, compute the median on the training set and use it to fill the missing values in the training set. Remember to reuse that same median later to fill missing values in the test set when you evaluate your system, and in new, unseen data once the model is deployed.
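The idea in this note can be sketched as follows. The train and test DataFrames here are hypothetical stand-ins with made-up values; the only point is that the median is learned once, from the training data, and then reused.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits, invented for illustration
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 75.0]})

# Learn the median from the training set only
train_median = train["LotFrontage"].median()

# Reuse the same value for the training set and for the test set
train["LotFrontage"] = train["LotFrontage"].fillna(train_median)
test["LotFrontage"] = test["LotFrontage"].fillna(train_median)
```

Storing train_median (rather than recomputing it on the test set) is what keeps information from the test data out of the preprocessing step.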

We are going to apply different approaches to fix our missing values, so that we can see various approaches in action:

We are going to replace values for categorical attributes with None.
For LotFrontage, we are going to go a bit fancy: instead of using the plain median of the entire column, we compute the median LotFrontage of all the houses in the same neighborhood and use that to impute on a neighborhood-by-neighborhood basis.
We are going to replace missing values for most of the numerical columns with zero, and fill one column, Electrical, with its mode.
We are going to drop one non-interesting column, Utilities.
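To make the neighborhood-wise imputation more concrete, here is a small sketch on toy data (the neighborhoods A, B, and C and their values are invented). It also adds one step that the lesson itself does not use, labeled as an assumption: a fallback to the overall median for a neighborhood where every LotFrontage value is missing, since the group median is then undefined.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; neighborhood C has no observed value
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Median per neighborhood, broadcast back so it lines up with each row
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill from the neighborhood median first; then (an added assumption, not
# part of the lesson) fall back to the overall median where the
# neighborhood median itself is NaN
overall_median = toy["LotFrontage"].median()
filled = toy["LotFrontage"].fillna(group_median).fillna(overall_median)
```

Whether such a fallback is needed depends on the data; if every neighborhood has at least one observed LotFrontage, the second fillna() changes nothing.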

Right now, we are going to look at how to do these fixes by explicitly writing the name of the column in the code. Later, in the upcoming section on transformation pipelines, we will learn how to handle them in an automated manner as well.


# Imputing Missing Values
# Work on a copy so that the original housing DataFrame is left untouched
housing_processed = housing.copy()
# Categorical columns:
cat_cols_fill_none = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                     'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType',
                     'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'BsmtCond',
                     'MasVnrType']
# Replace missing values for categorical columns with None
for cat in cat_cols_fill_none:
    housing_processed[cat] = housing_processed[cat].fillna("None")
    
# Group by neighborhood and fill missing values with the median LotFrontage of that neighborhood
housing_processed['LotFrontage'] = housing_processed.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
# Garage: GarageYrBlt, GarageArea and GarageCars are numerical columns, so replace missing values with zero
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    housing_processed[col] = housing_processed[col].fillna(0)

# MasVnrArea: replace missing values with zero
housing_processed['MasVnrArea'] = housing_processed['MasVnrArea'].fillna(0)
# Electrical: fill missing values with the mode (the most frequent value) of the column
housing_processed['Electrical'] = housing_processed['Electrical'].fillna(housing_processed['Electrical'].mode()[0])
# We do not need Utilities, so let's just drop this column
housing_processed = housing_processed.drop(['Utilities'], axis=1)
# Get the count again to verify that we do not have any more missing values
housing_processed.isnull().sum().max()
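If the final check returns something other than zero, it helps to see which columns are responsible. The helper below is a hypothetical convenience function (missing_report is not part of pandas or of the lesson), shown on toy data; it could be called as missing_report(housing_processed).

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the columns that still have missing values, largest count first."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy data, invented for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", None, "z"],
    "c": [1, 2, 3],
})
report = missing_report(toy)
```

An empty result means that no column has missing values left.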
