Data Exploration and Error Checking
Explore the data using R.
Whenever we start working with a dataset in R, we should first devote substantial time to checking it for errors. These are some questions we should ask ourselves:
- Did the data import correctly?
- Are the column names correct?
- Are the types of data appropriate? (e.g., factor vs numerical)
- Are the numbers of columns and rows appropriate?
- Are there typos?
For example, if a column that’s supposed to be numerical shows up as a factor, that likely indicates a typo where we accidentally have text in place of a number. Remember, each column in a data frame is a vector, and vectors can only have one mode. So, a vector with both numbers and characters is treated as if it’s all characters. Similarly, if we have a factor that should have three categories but imports with four, we likely have a typo (for example, “predator” versus “predtaor”), and the misspelled version is showing up as a separate category. These sorts of mistakes are widespread!
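Both symptoms described above can be reproduced on a tiny example. The sketch below uses a made-up data frame (the names `toy`, `SVL`, and `Pred` and all values are hypothetical, not taken from the lesson's dataset) and sets `stringsAsFactors = TRUE` so that text columns become factors.

```r
# Hypothetical toy data (not the lesson's dataset): one numeric typo, one misspelled level.
toy <- data.frame(
  SVL  = c("18.2", "19.5", "2O.1", "17.8"),                 # "2O.1" contains a letter O, not a zero
  Pred = c("control", "nonlethal", "predator", "predtaor"), # "predtaor" is a typo
  stringsAsFactors = TRUE
)

str(toy)            # SVL is reported as a Factor instead of num
levels(toy$Pred)    # four levels where only three were intended

# Converting back to numeric turns unparseable entries into NA,
# which points directly at the row holding the typo.
svl_num  <- suppressWarnings(as.numeric(as.character(toy$SVL)))
bad_rows <- which(is.na(svl_num))
toy[bad_rows, ]
```

The conversion goes through `as.character()` first because calling `as.numeric()` directly on a factor returns the internal level codes rather than the original values.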
Because this dataset has been very thoroughly examined, these types of errors aren’t present. However, we may want to change the names of columns or remove outliers, which we’ll cover in the subsequent sections.
Data structure
We begin by examining the structure of the data frame with the str() function.
str(RxP)
We can see that our dataset has 2502 observations of 14 different variables, some of which are integers, some are factors, and some are numerical. The following are things to notice:
- Several variables are listed twice but are coded in different ways. For example, there’s a column titled Tank and one titled Tank.Unique. As stated earlier, there are 12 tanks in each of the eight blocks. The variable Tank lists what number a tank is (1 through 12) in a given block, whereas Tank.Unique provides each tank with a unique number out of the entire 96.
- Similarly, we have the columns Age.DPO and Age.FromEmergence. The first column, Age.DPO, is the age of the frogs at the time of their emergence from the water in terms of days post-oviposition (DPO), whereas the Age.FromEmergence column counts the day the first animal crawled out of the water as day 1, so the age of the animals is recorded in terms of days relative to when the emergence began. Sometimes, it can be helpful to view the same data in two different ways.
- We have three categorical predictors or factors: Hatching age, Predator treatment, and Resource level. Each factor has several levels or categories, which we can see in the str() output.
- We have several response variables, for example SVL or Mass, which are measured at the initial point when the froglets left the water, at the end of metamorphosis when the tail was fully resorbed, or both.
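Beyond str(), a few other base R calls answer the checklist questions about dimensions, column names, and column types directly. The sketch below runs them on a small invented data frame (`toy` and its columns are hypothetical stand-ins); the same calls should work unchanged on a real data frame.

```r
# Hypothetical stand-in data frame; names and values are invented for illustration.
toy <- data.frame(
  Block = rep(1:2, each = 3),
  Tank  = rep(1:3, times = 2),
  Pred  = factor(c("C", "L", "NL", "C", "L", "NL")),
  Mass  = c(0.41, 0.35, 0.38, 0.44, 0.33, 0.40)
)

dim(toy)              # number of rows, then number of columns
names(toy)            # column names, useful for spotting import problems
sapply(toy, class)    # one class per column: integer, factor, numeric, ...
colSums(is.na(toy))   # count of missing values in each column
```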
Data exploration and visualization
We begin by plotting the data to check for errors. The default plot() function creates a simple graphic based on the data we provide. To access a named variable within a data frame, we use the $ operator, as in data_frame$variable. The data frame always goes first, then the $, then the name of the column we’re interested in. For example, we may type in the following to look at our data:
plot(RxP$SVL.initial)
The graph generated by the code given above shows that SVL generally varies between 14 and 24 mm (millimeters), but one animal is much smaller.
plot(RxP$Tail.initial)
The illustration generated by the code given above shows that the Tail length at emergence varies from about 0 to 15 mm in most individuals, but three froglets have tails that are longer than 15 mm.
plot(RxP$Mass.final)
The plot generated by the code given above shows that the Mass of the frogs varies from about 0.2 g (grams) to over 1 g.
plot(RxP$Pred)
The illustration generated by the code given above shows that the number of individuals surviving metamorphosis in the three Predator treatments varies considerably, from around 1,200 froglets in the Control group to approximately 500 in the Nonlethal group.
Note: The plot style changes depending on whether we plot a continuous variable or a factor. For the continuous variable, the default is to plot the data in order, from the first row to the last. In the case of a factor, the default is to plot the number of observations in each group.
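The difference described in the note can be seen with two columns of a small invented data frame (`toy`, `SVL`, and `Pred` are hypothetical names and values). The sketch also calls table(), which gives the same per-level counts that the factor plot draws as bars.

```r
# Hypothetical toy data: one continuous column and one factor.
set.seed(1)
toy <- data.frame(
  SVL  = rnorm(30, mean = 19, sd = 2),
  Pred = factor(rep(c("C", "L", "NL"), times = c(15, 10, 5)))
)

plot(toy$SVL)    # continuous: values plotted in row order against their index
plot(toy$Pred)   # factor: one bar per level, height = number of observations

counts <- table(toy$Pred)   # the counts behind the bar heights
counts
```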
Further data exploration and identifying mistakes
Plotting a single variable by itself can be helpful, for example when we want to check for outliers or find typos (such as a numeric variable that plots as a factor). However, it’s often more helpful to plot response data against an explanatory variable. For example, we may want to know how the final mass of metamorphs varies across predator treatment. Here, we use the ~ sign to separate our response variable from a predictor variable. Let’s examine the relationship of Mass.final and Predator treatment by plotting Mass.final~Pred.
plot(Mass.final~Pred, data=RxP)
Note: By providing a categorical variable as our predictor, R automatically knew to make a box-and-whisker plot, also known as a boxplot. There aren’t many instances when R will think for us, but this is one where it will.
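To make this behavior concrete, the sketch below applies the same formula syntax to a small invented data frame (`toy`, `Mass`, and `Pred` are hypothetical). It also calls boxplot() with `plot = FALSE`, which returns the numbers behind each box instead of drawing them.

```r
# Hypothetical toy data: three groups of three masses (grams).
toy <- data.frame(
  Mass = c(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70),
  Pred = factor(rep(c("C", "L", "NL"), each = 3))
)

plot(Mass ~ Pred, data = toy)   # factor predictor, so R draws a boxplot

# Same grouping, but return the summary numbers instead of a figure.
bp <- boxplot(Mass ~ Pred, data = toy, plot = FALSE)
bp$names        # group labels
bp$n            # number of observations in each group
bp$stats[3, ]   # third row holds the median of each group
```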
Looking at the plot generated by the code given above, there are several things to know about how R draws a boxplot.
- First, the top and bottom of each box mark the upper and lower quartiles, so the box spans the interquartile range, that is, the middle 50% of our data. Thus, 25% of the metamorphs in each predator treatment are heavier than the top of their respective box, and 25% are lighter than the bottom.
- Second, the heavy dark line in the middle of the box is the median, not the mean as many observers may initially think.
- Third, the extremes of the “whiskers” are either of the following:
  - The maximum or minimum value of the data.
  - The most extreme data point that lies within 1.5 times the interquartile range (IQR) of the edge of the box.
In the event of the second option, R individually plots all the points that fall more than 1.5 times the IQR beyond the box. So, what does that mean in practice? If we look at the plot generated by the code above, we can see that the bottom whiskers are all just that, a whisker. That means each one has been drawn to the smallest value in its group and that this value falls within 1.5 times the IQR of the bottom of the box. The upper whiskers have many points above them, meaning that those whiskers stop at the last data point within 1.5 times the IQR of the top of the box, and the points plotted above the whisker fall outside that range.
- Lastly, notice that we’ve introduced a new syntax. We can use the ~ sign to denote a relationship between two vectors, usually thought of as response~predictor. This structure will be used later for defining statistical models and can be expanded to incorporate multiple predictors (for example, response~predictor1 + predictor2 + ...).
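The whisker rule can be checked by hand with boxplot.stats(), the helper that computes the numbers boxplot() draws. The sketch below uses ten made-up masses (the vector `mass` is hypothetical) with one deliberately large value.

```r
# Hypothetical masses in grams for a single treatment group.
mass <- c(0.20, 0.25, 0.30, 0.32, 0.35, 0.38, 0.40, 0.45, 0.50, 1.10)

stats <- boxplot.stats(mass)   # default coef = 1.5, the same rule boxplot() uses

stats$stats   # lower whisker, bottom of box, median, top of box, upper whisker
stats$out     # values drawn as individual points beyond the whiskers

median(mass)  # matches the middle entry of stats$stats
mean(mass)    # pulled upward by the large value, so it differs from the median
```

Here the lower whisker reaches the minimum (0.20) because it lies within 1.5 times the IQR of the box, while the upper whisker stops at 0.50 and the value 1.10 is plotted as a separate point. Note that boxplot.stats() uses "hinges" for the box edges, which can differ slightly from the quartiles returned by quantile().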
What happens if we plot two continuous variables against one another instead of a continuous response versus a categorical predictor? Maybe we want to see a relationship between mass at the end of metamorphosis and SVL at the end of metamorphosis. Since we have provided two continuous variables, R will know to make a scatterplot automatically.
plot(Mass.final~SVL.final, data=RxP)
The plot generated by the code given above tells us several things.
- There appear to be several outliers. These individuals have a very small SVL but a large mass, or vice versa. These almost certainly represent mistakes made during data entry since they’re biologically unrealistic, maybe even impossible, and should therefore be removed.
- The relationship between SVL and mass isn’t linear. It curves upward, which indicates that longer frogs with greater SVL.final values seem to have disproportionately larger masses. This is expected in many length-to-mass relationships in nature, and perhaps plotting the data on logarithmic axes would make this relationship linear.
Now, let’s see if plotting on log-log axes makes the length-to-mass relationship linear. The following code takes the log of each variable and plots them against one another. Thus, the values on the axes will be in terms of the logarithm of either SVL or mass.
plot(log(Mass.final)~log(SVL.final), data=RxP)
The code given below plots the same figure but labels the axes in the original units; only the spacing of the axes follows a logarithmic scale.
plot(Mass.final~SVL.final, data=RxP, log="xy")
Note: By denoting log='x' or log='y', we can log transform one axis.
By comparing these figures, we observe that log transformation made the data more linear. The reason why some numbers are missing from the axes is that R won’t plot numbers on top of other numbers by default, so it makes some decisions about what to include versus what to exclude when spacing gets too tight.
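One way to see why the log-log plot straightens the curve: if mass follows a power law, mass = a * SVL^b, then log(mass) = log(a) + b * log(SVL), which is a straight line with slope b. The sketch below simulates data under that assumption; the constants `a = 1e-4` and `b = 3`, the noise level, and the variable names are all hypothetical choices, not estimates from the lesson's dataset.

```r
# Simulated data under an ASSUMED power law (hypothetical constants, not real measurements).
set.seed(42)
svl  <- runif(200, min = 14, max = 24)               # lengths in mm
mass <- 1e-4 * svl^3 * exp(rnorm(200, sd = 0.1))     # mass = a * svl^b with multiplicative noise

plot(mass ~ svl)               # raw axes: curves upward
plot(mass ~ svl, log = "xy")   # log-spaced axes: roughly a straight line

# On the log-log scale a linear fit recovers the exponent as its slope.
fit_log <- lm(log(mass) ~ log(svl))
b_hat   <- coef(fit_log)[["log(svl)"]]
b_hat                          # should be close to the assumed exponent of 3
```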