Why Data Scientists Need R

Learn which projects are most suitable for R, and why it’s important to know R for data science.

When data scientists first encounter R, their first question is often “Why choose R over another general-purpose language?” and, in particular, “Why choose R over Python?” The answer is that there isn’t a perfect choice one way or the other. If you’re just starting out in data science, either language is a good starting point, and the decision isn’t worth fussing over.

To learn R or Python

If you’re already comfortable with other general-purpose languages but not Python or R, the transition to either will be equally straightforward (or similarly tricky, as the case may be). If you come from a statistical background and only know SAS, SPSS, MATLAB, or Excel, then R is the easier first step, and the skills you learn will make Python easier to understand later. And if you already know either R or Python, learning the other is achievable, with some work, of course.

How teams choose

For many data science teams, the choice between R and Python comes down to the organization’s existing knowledge base. If the organization has already done a lot of work in Python, it sets up its data science teams in Python. If it is more comfortable with statistically specialized tools (SAS, SPSS, MATLAB, Excel), it chooses R as the primary tool.

However, in truth, many organizations blend the two languages. Some projects are run in R, others in Python. For that reason, the best data scientists eventually make both options available to themselves. They may have a primary language of choice, but they can work in either.

Project-by-project selection

The main factor in choosing R or Python for a project is the final deliverable: is it the model itself, which will then be integrated into some wider application, or is it the output of the model in the form of statistics, visualizations, or dashboards?

Suppose the final deliverable is the model itself, to be integrated into a wider application by a different team of developers or engineers. For example, a team might be developing an engine that recommends movies to users based on their watch history. In that case, Python is typically a better choice.

On the other hand, if the final deliverable is the model’s statistical output, then R is typically a better choice. For example, a team might run a classification model on a database and deliver a distribution of how records are classified. These needs are often better addressed in R, even if the output is delivered as part of a visual dashboard and even if the output will be updated frequently. Of course, this rule has many exceptions, but it is a good guiding principle.
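
The guiding principle above can be written down as a tiny decision function. This is only an illustrative sketch: the names `Deliverable` and `suggest_language` are hypothetical, and the function captures the rule of thumb alone, not the many exceptions a real project would raise.

```python
from enum import Enum


class Deliverable(Enum):
    """What the project ultimately hands over (hypothetical categories)."""
    MODEL = "model integrated into a wider application"
    OUTPUT = "statistics, visualizations, or dashboards"


def suggest_language(deliverable: Deliverable) -> str:
    """Return the language the rule of thumb leans towards.

    Encodes only the guiding principle; it is not a substitute
    for judging each project on its own terms.
    """
    if deliverable is Deliverable.MODEL:
        return "Python"
    return "R"


# The two examples from the text:
print(suggest_language(Deliverable.MODEL))   # movie recommendation engine
print(suggest_language(Deliverable.OUTPUT))  # classification distribution report
```

In practice, factors such as the organization’s existing knowledge base would also feed into the decision.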

Figure: Balancing both R and Python

Where R and Python diverge

Based on the above rule, we can see the divergence in the R vs. Python communities. R lends itself a little better to those who care more deeply about the statistical nature of their project—researchers, academics, business analysts, etc. In contrast, Python lends itself a little better to those who care more deeply about the functional performance of their model in a broader context, such as application developers.

However, this distinction is not clear-cut, especially as CRAN and Python’s libraries continue to expand, creating more and more overlap between the two languages. In some projects, an application developed primarily in Python might eventually call an R model, or a model developed primarily in R might eventually call a Python module.

Specific selection criteria

When considering R vs. Python or another general-purpose language, the debate can be summarized in a few points. Keep in mind, though, that very few of these are clear-cut, and the gaps can often be bridged with extensions and some effort.

Aspect                         R    Python
Statistical methodology        X
Charts and visualization       X
Specialized statistics         X
Scalability                         X
Complex machine learning            X
Integration with other code         X

Let's summarize the discussion with a few questions.

Questions

1. Is the organization already strongly biased towards either R or Python?


R as a data scientist

If you’re in the early stages of your data science career and have landed on this course, you’re in a great place. If you’re already experienced in Python and are looking to expand your skill set, this course will build your knowledge of R so that you can better serve your statistical needs and work more effectively in organizations with a heavy bias towards R or other statistical platforms. If you don’t know Python, you’ll finish this course as a strong R data scientist with enough confidence to get comfortable in Python in the future.