Case Study: Seattle House Prices-I

Work through a case study of Seattle house prices using the moderndive package, beginning with an exploratory data analysis (EDA).


Kaggle is a machine-learning and predictive-modeling competition website that hosts datasets uploaded by companies, governmental organizations, and individuals. One of its datasets is “House Sales in King County, USA.” It consists of the sale prices of homes sold between May 2014 and May 2015 in King County, Washington, USA, which includes the greater Seattle metropolitan area. This dataset is available as the house_prices data frame included in the moderndive package.

The dataset consists of 21,613 houses and 21 variables describing these houses (for a full list and description of these variables, see the help file by running ?house_prices in the console). In this case study, we’ll create a multiple regression model where:

  • The outcome variable y is the sale price of houses.

  • There are two explanatory variables:

    • A numerical explanatory variable x_1: house size, sqft_living, measured in square feet of living space. Note that 1 square foot is about 0.09 square meters.

    • A categorical explanatory variable x_2: house condition, condition, a categorical variable with five levels, where 1 indicates poor and 5 indicates excellent.
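As a minimal sketch of this setup, the three variables can be pulled out of house_prices before any modeling. It assumes the moderndive and dplyr packages are installed and that the outcome column is named price; the name house_prices_subset is only illustrative and does not come from the case study.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)       # provides select() and the %>% pipe

# Number of rows (houses) and columns (variables) in the full dataset
dim(house_prices)

# Keep only the outcome variable and the two explanatory variables.
# house_prices_subset is a hypothetical name used for illustration.
house_prices_subset <- house_prices %>%
  select(price, sqft_living, condition)

# Inspect the levels of the categorical explanatory variable
levels(house_prices_subset$condition)
```

Running ?house_prices in the console remains the authoritative description of each variable.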

Exploratory data analysis

As we’ve said numerous times throughout, a crucial first step when presented with data is to perform an EDA. This can give us a sense of our data, help identify issues with our data, bring to light any outliers, and help inform model construction.

Recall the three common steps in an EDA:

  1. Looking at raw data values

  2. Computing summary statistics

  3. Creating data visualizations

First, let’s look at the raw data using View() to bring up RStudio’s spreadsheet viewer and the glimpse() function from the dplyr package:
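A minimal sketch of this step, assuming the moderndive and dplyr packages are installed. The interactive() guard is an addition here so that the script also runs outside RStudio; it is not part of the original case study.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)       # provides glimpse()

# View() opens a spreadsheet viewer, so only call it in an interactive session
if (interactive()) {
  View(house_prices)
}

# Print a compact overview: one line per variable with its type and first few values
glimpse(house_prices)
```

glimpse() is convenient for wide data frames like this one because every variable gets its own line instead of being cut off at the console width.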
