Exploratory Data Analysis of One Categorical Explanatory Variable

Explore how to perform exploratory data analysis on one categorical explanatory variable using R and the Tidyverse. Learn to inspect data frames, summarize data with skim(), and visualize distributions with faceted histograms and boxplots. Learn to compare means and medians across categories to uncover patterns in life expectancy across continents.

The data on the 142 countries can be found in the gapminder data frame included in the gapminder package. However, to keep things simple, let’s filter() for only those observations/rows corresponding to the year 2007. Additionally, let’s select() only the subset of the variables we’ll consider. We’ll save this data in a new data frame called gapminder2007, as follows:

R
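library(gapminder)
library(dplyr)  # provides %>%, filter(), select(), and glimpse()

# Keep only the 2007 observations and the four variables of interest
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)
gapminder2007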

Let’s perform the first common step in an exploratory data analysis, which is looking at the raw data values. We’ll do this by using the glimpse() command for exploring data frames:

R
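glimpse(gapminder2007)  # row and column counts, then each variable's type and first values
?gapminder              # open the help file describing every variable in gapminder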

Observe that Rows: 142 indicates that there are 142 rows/observations in gapminder2007, where each row corresponds to one country. In other words, the observational unit is an individual country. Furthermore, observe that the variable continent is of type <fct>, which stands for factor and is R’s way of encoding categorical variables.
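The claims in the paragraph above can also be checked directly, without reading them off the glimpse() output. The following is a minimal sketch, assuming the gapminder and dplyr packages are installed; the names n_rows, continent_class, continent_levels, and continent_counts are illustrative and not part of the original lesson.

```r
library(gapminder)
library(dplyr)

# Recreate the 2007 subset so this sketch runs on its own
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)

# Number of rows: one per country (the observational unit)
n_rows <- nrow(gapminder2007)

# Storage class of the explanatory variable
continent_class <- class(gapminder2007$continent)

# The categories (levels) that the factor can take
continent_levels <- levels(gapminder2007$continent)

# How many countries fall into each category
continent_counts <- gapminder2007 %>%
  count(continent)

n_rows
continent_class
continent_levels
continent_counts
```

Counting the rows per level is a quick way to see how unevenly the countries are spread across continents, which is worth keeping in mind when comparing groups later.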

A full description of all the variables included in gapminder can be found by reading the associated help file, which can be accessed by executing the ?gapminder command, as demonstrated above. However, let’s fully describe only the four variables we selected in gapminder2007:

  1. country: This is an identification variable used to distinguish the 142 countries in the dataset. It holds text labels, although glimpse() reports it as a factor (<fct>) rather than as a character column.

  2. lifeExp: This is a numerical variable recording that country’s life expectancy at birth. This is the outcome variable 𝑦 of interest.

  3. continent: This is a categorical variable with five levels. Here, levels correspond to the possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable 𝑥 ...