Kaggle Challenge - Data Preprocessing

2. Data Preprocessing - Prepare the Data for Machine Learning Algorithms

We took notes during the exploratory phase; now it’s time to act on them and prepare our data for the machine learning algorithms. Instead of doing everything manually, we will also learn how to write functions where possible.

Deal With Missing Values

Let’s get a sorted count of the missing values for all the attributes.

housing.isnull().sum().sort_values(ascending=False)

From the results above, we can assume that the attributes from PoolQC down to the Bsmt attributes are missing for houses that do not have these facilities (houses without pools, basements, garages, etc.). Therefore, these missing values can be filled in with “None”. MasVnrType and MasVnrArea both have 8 missing values, most likely for houses without masonry veneer.
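One way to sanity-check this assumption is to compare a quality column against its matching size column. The sketch below is only an illustration: it uses a tiny made-up frame instead of the real housing DataFrame, and it assumes the dataset has a PoolArea column alongside PoolQC.

```python
import numpy as np
import pandas as pd

# Toy data with invented values, standing in for the real housing DataFrame.
# PoolArea is assumed to be the size column that accompanies PoolQC.
toy = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan],
})

# Rows where the quality rating is missing
missing_quality = toy["PoolQC"].isnull()

# If the assumption holds, every one of those rows has no pool at all
assumption_holds = bool((toy.loc[missing_quality, "PoolArea"] == 0).all())
print(assumption_holds)
```

If a check like this came back False on the real data, some of the gaps would be genuinely unknown values rather than absent facilities, and “None” would be the wrong fill for them.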

What should we do with all this missing data?

Most machine learning algorithms cannot work with missing features, so we need to take care of them. Essentially, we have three options:

  • Get rid of the corresponding houses.

  • Get rid of the whole attribute or remove the whole column.

  • Set the missing values to some value (zero, the mean, the median, etc.).

We can accomplish these easily using DataFrame’s dropna(), drop(), and fillna() methods.
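As a minimal sketch of the three options, here is each method applied to a small made-up frame. The values are invented for illustration; only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd

# Toy data with invented values, one missing LotFrontage
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Option 1: drop the rows (houses) that have a missing LotFrontage
dropped_rows = toy.dropna(subset=["LotFrontage"])

# Option 2: drop the whole attribute (column)
dropped_column = toy.drop("LotFrontage", axis=1)

# Option 3: fill the gaps with some value, here the column median
median = toy["LotFrontage"].median()
filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(median)
```

All three calls return new objects and leave toy unchanged, so the results can be compared side by side before committing to one approach.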

📌 Note: Whenever you choose the third option, say imputing values using the median, compute the median on the training set and use it to fill the missing values in the training set. Remember to use that same median later to replace missing values in the test set when you evaluate your system, and again in new, unseen data once the model is deployed.
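The note above can be sketched as follows. The train and test frames here are hypothetical stand-ins with invented values; the point is only that the median is computed once, on the training data, and then reused.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test sets with invented values
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 90.0]})

# Learn the statistic from the training set only
train_median = train["LotFrontage"].median()

# Reuse the same value for both sets (and later for new, unseen data)
train_filled = train.fillna({"LotFrontage": train_median})
test_filled = test.fillna({"LotFrontage": train_median})
```

Computing a separate median on the test set would leak information from data the model is not supposed to have seen, which is why the training value is stored and reused.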

We are going to apply different approaches to fix our missing values, so that we can see various approaches in action:

  • We are going to replace missing values for categorical attributes with “None”.
  • For LotFrontage, we are going to go a bit fancy and compute the median LotFrontage for all the houses in the same neighborhood, instead of the plain median for the entire column, and use that to impute on a neighborhood-by-neighborhood basis.
  • We are going to replace missing values in the numerical columns with zero, and in one column, Electrical, with the mode.
  • We are going to drop one non-interesting column, Utilities.

Right now, we are going to look at how to do these fixes by explicitly writing the name of the column in the code. Later, in the upcoming section on transformation pipelines, we will learn how to handle them in an automated manner as well.

# Imputing Missing Values
# Work on a copy so the original housing DataFrame is left untouched
housing_processed = housing.copy()
# Categorical columns:
cat_cols_fill_none = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
                      'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType',
                      'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'BsmtCond',
                      'MasVnrType']
# Replace missing values for categorical columns with the string "None"
for cat in cat_cols_fill_none:
    housing_processed[cat] = housing_processed[cat].fillna("None")
# Group by neighborhood and fill in missing values with the median LotFrontage of each neighborhood
housing_processed['LotFrontage'] = housing_processed.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
# Garage: GarageYrBlt, GarageArea and GarageCars are numerical columns, replace missing values with zero
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    housing_processed[col] = housing_processed[col].fillna(0)
# MasVnrArea: replace missing values with zero
housing_processed['MasVnrArea'] = housing_processed['MasVnrArea'].fillna(0)
# Electrical: fill missing values with the mode (the most frequent value) of the column
housing_processed['Electrical'] = housing_processed['Electrical'].fillna(housing_processed['Electrical'].mode()[0])
# There is no need for Utilities, so let's just drop this column
housing_processed = housing_processed.drop(['Utilities'], axis=1)
# Get the count again to verify that we do not have any more missing values
housing_processed.isnull().sum().max()
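Since the goal is to write functions where possible, the neighborhood-median step could be wrapped in a small helper. This is a sketch, not part of the lesson's code: the function name fill_with_group_median is hypothetical, and the toy frame uses invented values.

```python
import numpy as np
import pandas as pd

def fill_with_group_median(df, group_col, target_col):
    """Return a copy of df where NaNs in target_col are replaced
    by the median of target_col within each group_col group."""
    result = df.copy()
    result[target_col] = result.groupby(group_col)[target_col].transform(
        lambda values: values.fillna(values.median())
    )
    return result

# Toy data with invented values: one gap in each neighborhood
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
})
toy_filled = fill_with_group_median(toy, "Neighborhood", "LotFrontage")
print(toy_filled)
```

The gap in neighborhood A is filled with 70.0 (the median of 60 and 80) and the gap in B with 40.0. Note that if every value in a group is missing, the group median is itself NaN and those gaps stay unfilled, so the final missing-value count is still worth checking.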

Deal With Outliers

To remove noisy data, we are going to remove houses where we ...
