What is R?

Get familiar with R and determine when it’s preferable to dedicated statistical software.

Let's briefly examine what R is and when it’s preferred over other statistical software.

Overview of R

Before getting into the technical details of R, we’ll start with a high-level overview. The goal is to understand what R is, why it exists, and what we can do with it. That will tell us when and why to use R. In fact, one of the most common questions from new data scientists is whether they should learn R or another programming language. In truth, knowing R is currently important for success in data science, but data scientists may need to learn other languages as well.

R is a programming environment oriented toward statistics: it was designed and built with statistical work in mind. There are other statistical platforms, such as SAS, SPSS, or MATLAB, but what sets R apart is how closely it resembles general-purpose programming languages like C++ or Python. So, while R was designed around statistics, it looks and operates much like a general-purpose language, as we can see in the code example below. It doesn’t necessarily feel like a statistical platform in the way that MATLAB or SPSS does.

# Create a data frame with two columns called x and y
# with 50 rows of randomly generated data
VAR_Data <- data.frame(x = runif(50, min = 0, max = 100),
                       y = runif(50, min = 0, max = 100))

# Scatter plot the results against each other
plot(VAR_Data$x, VAR_Data$y)
# Show the line of best fit between x and y
abline(lm(VAR_Data$y ~ VAR_Data$x))
# Print the regression statistics associated with the line of best fit
summary(lm(VAR_Data$y ~ VAR_Data$x))

We won’t dive into this code in detail yet, but here’s what each part does:

  • Lines 3–4: We create a data frame of randomly generated values with 50 observations (rows) and two variables (columns).

  • Line 7: We draw a scatter plot of the two variables against each other.

  • Line 9: We add a line of best fit to the plot.

  • Line 11: We print the summary statistics for the ...