Kaggle Challenge - Exploratory Data Analysis

Our project is based on the Kaggle Housing Prices Competition. In this challenge we are given a dataset with different attributes for houses and their prices. Our goal is to develop a model that can predict the prices of houses based on this data.

At this point we already know two things:

  1. We are given labeled training examples
  2. We are asked to predict a value

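What do these tell us in terms of framing our problem? The first point tells us that this is a supervised learning task, because every training example comes with its label (the sale price). The second tells us that this is a regression task, because the quantity we predict is a continuous value rather than a category.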
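To make this framing concrete, here is a minimal sketch of how a labeled dataset splits into features and a target. The column names come from the competition's data description, but the three rows of values are invented purely for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy rows with invented values; only the column names come from the data description.
toy = pd.DataFrame({
    "LotArea": [1000, 2000, 3000],
    "OverallQual": [5, 6, 7],
    "YearBuilt": [1990, 2000, 2010],
    "SalePrice": [100000, 150000, 200000],
})

# Supervised learning: every row carries its label, so we can separate
# the inputs (features) from the value we want to predict (target).
X = toy.drop(columns=["SalePrice"])
y = toy["SalePrice"]

# Regression: the target is numeric and continuous, not a category.
print(X.columns.tolist())
print(pd.api.types.is_numeric_dtype(y))
```

The same split applies to the real data once it is loaded: SalePrice is the target, and the remaining columns are candidate features.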


The data description file, data_description.txt, lists the attributes recorded for each house. Here is a sneak peek at some of the more interesting attributes and their descriptions from that file:

  • SalePrice: The property’s sale price in dollars. This is the target variable that we are trying to predict.
  • MSSubClass: The building class.
  • LotFrontage: Linear feet of street connected to property.
  • LotArea: Lot size in square feet.
  • Street: Type of road access.
  • Alley: Type of alley access.
  • LotShape: General shape of property.
  • LandContour: Flatness of the property.
  • LotConfig: Lot configuration.
  • LandSlope: Slope of property.
  • Neighborhood: Physical locations within Ames city limits.
  • Condition1: Proximity to main road or railroad.
  • HouseStyle: Style of dwelling.
  • OverallQual: Overall material and finish quality.
  • OverallCond: Overall condition rating.
  • YearBuilt: Original construction date.

📌 Note: Before moving further, download the dataset, train.csv, from here. Launch your Jupyter notebook and then follow along! It is important to get your hands dirty; don’t just read through these lessons!

📌 You can also find the Jupyter notebook with all the code for this project on my Git profile, here.

📌 A live, runnable version of the code is available in the Jupyter notebook at the end of the lesson, so you can experiment with it there as well.

1. Exploratory Data Analysis

Importing modules and getting the data

Let’s start by importing the modules and getting the data. The code snippet below assumes that you have downloaded the CSV file and saved it as ‘./data/train.csv’, that is, inside a data folder in your working directory.

# Core Modules
import pandas as pd
import numpy as np
# Basic modules for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load data into a pandas DataFrame from given filepath
housing = pd.read_csv('./data/train.csv')
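If the file is not where the snippet expects it, pd.read_csv fails with a fairly terse error. As an optional sketch, the loading step could be wrapped in a small helper that checks the path first. The function name load_housing is hypothetical and not part of the lesson's notebook.

```python
from pathlib import Path

import pandas as pd


def load_housing(path="./data/train.csv"):
    """Load the training CSV, failing early with a readable message.

    `load_housing` is a hypothetical helper for illustration only.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected the training data at {csv_path.resolve()}. "
            "Download train.csv and save it there first."
        )
    return pd.read_csv(csv_path)
```

With the file in place, housing = load_housing() yields the same DataFrame as the direct pd.read_csv call above.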

Understand the Data Structure

We now have our DataFrame in place, so let’s get familiar with it by looking at the columns it contains:

# Get column names of the df
housing.columns
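For readers who want to try these inspection calls before downloading the data, here is a small sketch on a toy DataFrame. The values are invented; only the column names are borrowed from the data description.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (invented values).
toy = pd.DataFrame({
    "LotArea": [1000, 2000],
    "YearBuilt": [1990, 2000],
    "SalePrice": [100000, 150000],
})

# .columns returns a pandas Index; tolist() turns it into a plain Python list.
column_names = toy.columns.tolist()

# .shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape

print(column_names)
print(n_rows, n_cols)
```

The same two attributes work unchanged on the housing DataFrame loaded from train.csv.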

How many attributes do we have in total? Of course, ...