Data Frames
Learn to use data frames—one of the most powerful tools in R—and key functions for manipulating them.
The data frame is R’s most common way of storing the kinds of data sets that we’ll use as data scientists, and almost every piece of data science code in R involves one in some form. Data frames are, in essence, two-dimensional tabular data storage objects in which both rows and columns are named. Those familiar with spreadsheet software (Excel, Google Sheets) can think of them as variables containing spreadsheets. Those who have used database software can think of them as database tables.
The critical elements in a data frame are:
- Columns: Each column contains one consistent data type, but different columns can be different data types (as in the image).
- Rows: Each row represents a record, one set of associated data values. For instance, one row could be one respondent in a survey.
- Cells: Every row and column intersection is called a cell. It contains a single data point.
- Column and row names: Convenient names that are used to make the data both more readable and more easily accessed.
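To make these four elements concrete, here is a minimal sketch in base R. The data frame `survey` and its columns `age`, `city`, and `subscribed` are hypothetical names chosen only for illustration (they are not part of this lesson's example), and `stringsAsFactors = FALSE` is passed on the assumption that we want text kept as plain character data on older R versions too. How the data frame itself is built is covered in the next section.

```r
# Hypothetical illustration data; none of these names come from the lesson.
survey <- data.frame(
  age = c(34, 27, 45),
  city = c("Oslo", "Lima", "Pune"),
  subscribed = c(TRUE, FALSE, TRUE),
  row.names = c("r1", "r2", "r3"),
  stringsAsFactors = FALSE
)

# Columns: each column holds one data type, but types differ between columns.
column_types <- sapply(survey, class)
print(column_types)

# Rows: one record, selected here by its row name.
print(survey["r2", ])

# Cells: the single value where one row and one column intersect.
cell <- survey["r2", "city"]
print(cell)

# Column and row names: the labels used to read and access the data.
print(colnames(survey))
print(rownames(survey))
```

Selecting by name, as in `survey["r2", "city"]`, is one of several ways to reach a cell; numeric positions such as `survey[2, 2]` would address the same value in this sketch.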
Creating a data frame
Creating a data frame from scratch in R is supported by the data.frame() function and the row.names() function. The code to do so is shown in the example and is dissected below.
# Store some survey data in a data frame object
VAR_DataFrame <- data.frame(
  Q1_Ans = c(1, 4, 3, 5, 1, 2),
  Q2_Ans = c(5, 3, 2, 2, 5, 1),
  Q3_Ans = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
# Name the rows of our data frame to make it easy to read
row.names(VAR_DataFrame) <- c("John", "Katie", "Chris", "Kirtan", "David", "Peng")
# Print the data frame
print("Survey data in data frame:")
VAR_DataFrame
- Line 2: The data.frame() function is a part of base-R and tells R that we’d like to include everything within the brackets inside a data frame. In our case, we’ve included three columns, Q1_Ans, Q2_Ans, and Q3_Ans. Note that there are three items in our case, but we can have as many or as few as we’d like to include.
- Lines 3–5: The c() function is a part of base-R and says we’re creating a vector. A vector is a single column of data, and in its definition, we separate elements with a comma. Thus, the line Q1_Ans = c(1,4,3,5,1,2) is simply creating a vector with data points 1, 4, 3, 5, 1, 2. The fact that it’s contained in a data.frame() function tells R that the Q1_Ans vector is one column of our data frame.
- Line