Feature Engineering

Learn how to discover and extract features through domain knowledge and data wrangling techniques.

Not every feature in a dataset should be used to train a machine learning model. Some features improve model performance, while others add noise or introduce bias. In this lesson, we’ll extract the important features for our model using the insights gathered in the data exploration step.

Load the dataset

Before we start, let’s import the pertinent libraries and load the dataset.

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
# load the retail transactions and preview the first rows
df_retail = pd.read_csv('retail_transactions.csv')
print(df_retail.head())

Data wrangling

In the previous lesson, we learned about exploratory data analysis techniques. Using that knowledge, we’ll shape and prepare our data for our model.

# remove unnecessary columns
df_retail = df_retail.drop(columns=['StockCode', 'Description'])
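# keep UK records only; .copy() avoids SettingWithCopyWarning on the assignments below
df_retail = df_retail[df_retail['Country'] == 'United Kingdom'].copy()
# store CustomerID as a string identifier rather than a number
df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)
# parse InvoiceDate as datetime and reset the time component to midnight
df_retail['InvoiceDate'] = pd.to_datetime(df_retail['InvoiceDate']).dt.normalize()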
# calculate revenue and transaction year
df_retail['Revenue'] = df_retail['UnitPrice'] * df_retail['Quantity']
df_retail['Year'] = df_retail['InvoiceDate'].dt.year
# take a look at the current dataset
df_retail.info()
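
To see what each wrangling step does without needing the CSV file, the same operations can be run on a few hand-made rows. This is only a minimal sketch: the `sample` values and the `wrangle` helper are hypothetical and not part of the lesson's dataset. Only the column names are taken from the code above.

```python
import pandas as pd

# Hypothetical sample rows (made up for illustration); only the column
# names are taken from the lesson's wrangling code.
sample = pd.DataFrame({
    'StockCode': ['A1', 'B2', 'C3'],
    'Description': ['item one', 'item two', 'item three'],
    'Quantity': [6, 2, 3],
    'InvoiceDate': ['2010-12-01 08:26:00', '2010-12-01 08:28:00',
                    '2011-01-04 10:00:00'],
    'UnitPrice': [2.55, 3.39, 2.10],
    'CustomerID': [17850, 13047, 12583],
    'Country': ['United Kingdom', 'United Kingdom', 'France'],
})


def wrangle(df):
    """Apply the lesson's wrangling steps to a copy of df (hypothetical helper)."""
    # remove columns that are not needed for modeling
    out = df.drop(columns=['StockCode', 'Description'])
    # keep UK records only; .copy() makes the result an independent frame
    out = out[out['Country'] == 'United Kingdom'].copy()
    # treat the customer identifier as a string, not a number
    out['CustomerID'] = out['CustomerID'].astype(str)
    # parse the timestamp and reset the time component to midnight
    out['InvoiceDate'] = pd.to_datetime(out['InvoiceDate']).dt.normalize()
    # derive revenue per row and the transaction year
    out['Revenue'] = out['UnitPrice'] * out['Quantity']
    out['Year'] = out['InvoiceDate'].dt.year
    return out


result = wrangle(sample)
print(result)
```

The non-UK row is dropped, and every remaining row gains a `Revenue` and a `Year` column. One caveat: if `CustomerID` is read from the CSV as a float column with missing values, `astype(str)` produces strings such as '17850.0' and 'nan', so it is worth checking the output of `df_retail.info()` before relying on that column.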

Explanation ...