
Case Study: Seattle House Prices-II

Continue the ModernDive case study on Seattle house prices with a multivariate exploratory data analysis (EDA).

Exploratory data analysis

Let’s now continue our EDA by creating multivariate visualizations. Unlike univariate histograms and barplots, multivariate visualizations show relationships between more than one variable. This is an important step of an EDA because the goal of modeling is to explore relationships between variables.

Our model involves a numerical outcome variable (log10_price), a numerical explanatory variable (log10_size), and a categorical explanatory variable (condition), so we’re in a multiple regression modeling situation.

We, therefore, have two choices of models we can fit. First, an interaction model where the regression line for each condition level will have both a different slope and a different intercept. Second, a parallel slopes model where the regression line for each condition level will have the same slope but different intercepts.
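To make the difference between the two models concrete, here is a minimal sketch of how they could be specified with lm(). It assumes that house_prices comes from the moderndive package and that the log10 variables are derived from its price and sqft_living columns. The object names price_interaction and price_parallel are chosen here purely for illustration.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)

# Assumption: the log10 variables are derived from price and sqft_living
house_prices <- house_prices %>%
  mutate(log10_price = log10(price),
         log10_size = log10(sqft_living))

# Interaction model: `*` gives each condition level its own intercept and slope
price_interaction <- lm(log10_price ~ log10_size * condition,
                        data = house_prices)

# Parallel slopes model: `+` gives each condition level its own intercept,
# but a single shared slope for log10_size
price_parallel <- lm(log10_price ~ log10_size + condition,
                     data = house_prices)

# Compare how many coefficients each model estimates
length(coef(price_interaction))
length(coef(price_parallel))
```

The only difference between the two formulas is `*` versus `+`. The interaction model estimates extra slope terms, one for each non-baseline condition level, which is why its lines are free to cross while the parallel slopes lines are not.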

The geom_parallel_slopes() function is a special-purpose function that Evgeni Chasnovski created and included in the moderndive package. It exists because the geom_smooth() function in the ggplot2 package doesn’t have a convenient way to plot parallel slopes models. We plot both models below, starting with the interaction model.

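# Plot interaction model
# (house_prices is assumed to already contain log10_size and log10_price)
library(ggplot2)
library(moderndive)
ggplot(house_prices,
       aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.05) +
  # One regression line per condition level, each with its own slope
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "log10 price",
       x = "log10 size",
       title = "House prices in Seattle")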

Next, we plot the parallel slopes model.

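# Plot parallel slopes model
# (house_prices is assumed to already contain log10_size and log10_price)
library(ggplot2)
library(moderndive)
ggplot(house_prices,
       aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.05) +
  # One regression line per condition level, all sharing the same slope
  geom_parallel_slopes(se = FALSE) +
  labs(y = "log10 price",
       x = "log10 size",
       title = "House prices in Seattle")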

In both cases, we see there’s a positive relationship between house price and size, meaning that larger houses tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes. This is evidenced by the fact that the line for ...