Microsoft Excel
Learn to import data from Microsoft Excel files using readxl and purrr in the tidyverse.
We'll cover the following...
For many reasons, a lot of data and analyses are often stored in Microsoft Excel. One of those reasons is that spreadsheets are very accessible and very flexible. Analysts using Excel don’t need to concern themselves with any constraints instilled by databases or programming languages. They can freely do anything, structuring their data in whatever format they like. However, that freedom gives rise to potential issues when we try to tackle more robust analytics, where data tends to be messy, files get lost, and version control is confusing or non-existent.
As a result, data science teams often extract data from Excel into R and eventually into databases for more robust storage of their analyses and the underlying data. That being said, the tidyverse provides a convenient tool for accessing Excel files:
install.packages("readxl")
The readxl
package
The readxl
package is fully integrated with the tidyverse, so we’ll be able to stick with our usual tidyverse code structures, as in the example below, where we load a sample Excel file, read and print a worksheet containing data regarding earthquakes, and then also read all of the worksheets within the workbook.
#load tidyverse librarieslibrary(ggplot2)library(purrr)library(tibble)library(dplyr, warn.conflicts = FALSE)library(tidyr)library(stringr)library(readr)library(forcats)#load readxl packagelibrary(readxl)#Create a reference to the Excel fileVAR_ExampleFile <- readxl_example("datasets.xlsx")#Read a specific worksheert from the Excel fileVAR_ExcelQuakeData <- read_excel(VAR_ExampleFile, sheet = "quakes", range = "C1:E4")VAR_ExcelQuakeData #Print the contents of that Excel worksheet#Read all sheets in a workbookVAR_TibbleListAllSheets <- VAR_ExampleFile %>%excel_sheets() %>%set_names() %>%map(read_excel, path = VAR_ExampleFile)#Print the contents of all of the worksheets, one after the otherprint(VAR_TibbleListAllSheets[1:length(VAR_TibbleListAllSheets)])
In this code, we did the following: