Microsoft Excel

Learn to import data from Microsoft Excel files using readxl and purrr in the tidyverse.

For many reasons, a lot of data and analyses are often stored in Microsoft Excel. One of those reasons is that spreadsheets are very accessible and very flexible. Analysts using Excel don’t need to concern themselves with any constraints instilled by databases or programming languages. They can freely do anything, structuring their data in whatever format they like. However, that freedom gives rise to potential issues when we try to tackle more robust analytics, where data tends to be messy, files get lost, and version control is confusing or non-existent.

Press + to interact
Data stored in Excel can be difficult to navigate
Data stored in Excel can be difficult to navigate

As a result, data science teams often extract data from Excel into R and eventually into databases for more robust storage of their analyses and the underlying data. That being said, the tidyverse provides a convenient tool for accessing Excel files:

install.packages("readxl")

The readxl package

The readxl package is fully integrated with the tidyverse, so we’ll be able to stick with our usual tidyverse code structures, as in the example below, where we load a sample Excel file, read and print a worksheet containing data regarding earthquakes, and then also read all of the worksheets within the workbook.

Press + to interact
#load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
#load readxl package
library(readxl)
#Create a reference to the Excel file
VAR_ExampleFile <- readxl_example("datasets.xlsx")
#Read a specific worksheert from the Excel file
VAR_ExcelQuakeData <- read_excel(VAR_ExampleFile, sheet = "quakes", range = "C1:E4")
VAR_ExcelQuakeData #Print the contents of that Excel worksheet
#Read all sheets in a workbook
VAR_TibbleListAllSheets <- VAR_ExampleFile %>%
excel_sheets() %>%
set_names() %>%
map(read_excel, path = VAR_ExampleFile)
#Print the contents of all of the worksheets, one after the other
print(VAR_TibbleListAllSheets[1:length(VAR_TibbleListAllSheets)])

In this code, we did the following:

...