Welcome and Data Overview
Learn to load and explore data.
In this chapter, we'll learn the linear regression (supervised machine learning) model hands-on.
Welcome
We are very excited because it takes tremendous effort to reach this stage, where we start our first machine learning project. We'll go through the process step by step. In the following lessons, we'll work through these steps without explaining each one in detail. However, we'll revisit them several times throughout the machine learning section.
Let’s start with a very famous, real dataset. Our task is to build a machine learning model to predict housing prices in the Boston (USA) area. This housing dataset is built into scikit-learn, so let’s load it from there. In this way, we’ll also learn the process of loading built-in datasets from scikit-learn.
Relating the project to a context is always helpful. Let’s create a context.
Context building
Imagine that a real estate company hires us to achieve its business goals. The company wants to predict housing prices in the Boston area. Based on the community and other criteria, some areas are in high demand. The company is interested in an automated way of suggesting a house price based on its features. The given dataset contains features such as the age of the house, number of rooms, crime rate by town, the proportion of residential land, nitric oxide concentration, property tax, and so on.
Since the goal is to predict a continuous value (the price) from numeric features, linear regression is a natural first model to try for this problem. We have the data, so let’s start working on the model. The features are described below:
CRIM: Per capita crime rate by town.
ZN: Proportion of residential land zoned for lots over 25,000 square feet.
INDUS: Proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
NOX: Nitric oxides concentration (parts per 10 million).
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
RAD: Index of accessibility to radial highways.
TAX: Full-value property tax rate per USD 10,000.
PTRATIO: Pupil-teacher ratio by town.
MEDV: Median value of owner-occupied homes in $1,000s.
Note: Because we’ll work with more than one feature, this is a multiple linear regression problem. We could also build a model with a single feature, for example, predicting the house price from the number of rooms only. That would be a simple linear regression problem.
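To make the difference between simple and multiple linear regression concrete, here is a minimal sketch on synthetic, made-up data (not the Boston dataset). It uses plain least squares from NumPy rather than the scikit-learn model used later in the chapter, and the helper name fit_least_squares as well as the coefficients used to generate the data are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic, made-up data: price depends on the number of rooms and the age.
rng = np.random.default_rng(0)
n = 200
rooms = rng.uniform(3, 9, size=n)
age = rng.uniform(0, 100, size=n)
price = 5.0 * rooms - 0.1 * age + 2.0 + rng.normal(0, 0.5, size=n)


def fit_least_squares(features, target):
    """Fit target ~ intercept + features by ordinary least squares."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef  # [intercept, slope_1, slope_2, ...]


# Simple linear regression: one feature (rooms only).
simple_coef = fit_least_squares(rooms.reshape(-1, 1), price)

# Multiple linear regression: two features (rooms and age).
multiple_coef = fit_least_squares(np.column_stack([rooms, age]), price)

print("simple:", simple_coef)
print("multiple:", multiple_coef)
```

The simple model returns one intercept and one slope, while the multiple model returns one slope per feature. With this synthetic data, the multiple model should recover slopes close to the values used to generate the prices.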
Let’s import datasets from scikit-learn, load the built-in housing price dataset boston into bh, and check its keys.
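# importing the datasets module from sklearn
from sklearn import datasets

# loading the Boston data
# (load_boston was removed in scikit-learn 1.2, so this call needs an older version)
bh = datasets.load_boston()

# displaying the keys of bh
print(bh.keys())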
So, bh contains data (the feature values), target (the price), feature_names (the column names), and DESCR (the description of the dataset). We can start by exploring the description of the dataset.
# displaying bh['DESCR'], the data description
print(bh['DESCR'])
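The lesson reads the same object both with key access (bh['DESCR']) and with attribute access (bh.data). scikit-learn's dataset loaders return a Bunch, a dictionary subclass that supports both styles. The sketch below builds a small, hypothetical Bunch by hand (the values are made up for illustration) to show that the two styles return the same object.

```python
from sklearn.utils import Bunch

# Hypothetical toy container standing in for the object returned by a loader.
toy = Bunch(
    data=[[6.5, 65.2], [6.4, 78.9]],
    target=[24.0, 21.6],
    feature_names=["RM", "AGE"],
)

# A Bunch behaves like a dictionary...
print(list(toy.keys()))

# ...and also exposes each key as an attribute.
same_object = toy["target"] is toy.target
print(same_object)
```

Either style can be used; which one reads better is largely a matter of taste.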
Let's create a pandas data frame with bh.feature_names as columns so that each column of bh.data goes to its respective column. We can also add the target as another column named price.
# importing pandas
import pandas as pd

# creating a data frame using data and feature_names
df = pd.DataFrame(data=bh.data, columns=bh.feature_names)

# adding the price column
df['price'] = bh.target

# displaying the first two rows of data
print(df.head(2))
Let's get some information on the data using info().
# displaying the data info (info() prints its report directly and returns None)
df.info()
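The non-null counts that info() reports can also be obtained directly by counting missing values per column. The sketch below uses a small, made-up data frame (not the housing data) with one deliberately missing value so that the result is easy to verify.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in the RM column.
toy = pd.DataFrame(
    {
        "RM": [6.5, np.nan, 7.1],
        "AGE": [65.2, 78.9, 61.1],
        "price": [24.0, 21.6, 34.7],
    }
)

# isnull() marks missing entries; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values. Applying the same two calls to df gives a per-column count for the housing data.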
Looking at the non-null count of each column, we can see that there is no missing data. The price column (dependent variable) is our target, and the remaining columns are the features (independent variables). We can use describe() on the data frame object df to get a quick view of basic statistics.
# displaying the basic statistics
print(df.describe())
If we look at the output from describe(), we have max, min, mean, and std (standard deviation), which give a sense of how the values of each feature are distributed.
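Besides these values, describe() also reports the quartiles (25%, 50%, and 75%). Comparing the mean with the median (the 50% row) is one quick, informal way to spot a skewed column. The sketch below illustrates this on made-up values, not on the housing data.

```python
import pandas as pd

# Made-up values: mostly small, with one large outlier.
toy = pd.DataFrame({"CRIM": [0.01, 0.02, 0.03, 0.05, 9.0]})

stats = toy.describe()

# Rows of the describe() output are indexed by statistic name.
mean = stats.loc["mean", "CRIM"]
median = stats.loc["50%", "CRIM"]

print("mean:", mean)
print("median:", median)
```

Here the mean is far above the median because a single large value pulls it up. A similar gap in a real column suggests a long right tail that is worth inspecting further.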