Database Connections
Learn to pull database data into R in the tidyverse using DBI, odbc, dbplyr, and other database packages.
Most data science tasks begin with pulling data from some external source, sometimes it’s a csv
file, but often it’s something else, e.g. live databases, Microsoft Excel sheets, and websites. Those aren’t the only options, but they’re the most common.
Among those options, the most common is to pull from a live database. Live databases have a few significant advantages that support the data science team:
- Typically, data is up to date. There’s often an automatic pipeline that feeds new data into the database.
- They can host a vast number of records, depending on the type of database.
- The database engine can manipulate data without needing to load it into our computer’s memory first.
In particular, manipulating data outside of memory is critical for data scientists. When we leverage csv
files, our first step is:
VAR_MyData <- read_csv(“Mycsvfile.csv”)
This loads the entire csv
file into memory. That’s not a big deal when the csv
file is a few thousand rows long, but once we get into huge datasets, the kinds that we’re often going to be dealing with, it becomes very problematic. We quickly hit our computer’s limitations when manipulating the data.
Fortunately, the tidyverse provides a convenient mechanism for connecting to a database and performing data manipulations at the database end rather than in-memory. That is, we can interact with a database directly and pare down the data to something manageable before transferring data back to our local machine. So, to interact with a database in R, we’ll need to install the following packages:
#Install packages for tidyverse database connectionsinstall.packages("DBI")install.packages("odbc")install.packages("RSQLite")install.packages("dbplyr")
DBI
The DBI
package is our standard, underlying database engine. It provides the necessary functionality to perform overhead tasks when interacting with a database, such as opening and closing connections. It also provides the ability to query and manipulate data. We’ll mainly leverage the DBI
package because it interfaces well with most types of databases (MySQL, Oracle, Snowflake, etc.). So, we’ll be able to build a common codebase for database interaction regardless ...