Analytical Best Practices

Learn best practices in data science and how to implement them.

In data science, building high-quality analytical models is of the utmost importance. This lesson discusses conceptual best practices for ensuring that our models perform well and meet our project needs. While there are many best practices, here we focus on general practices that apply widely across model types and fields of analysis.

Input data quality

The adage “garbage in, garbage out” is as true in data science as in any other field. Bad data leads to unreliable models and incorrect predictions, so it’s crucial to identify and handle it. We should always check our input data early in a project: when bad data goes unchecked, the consequences can be disastrous. For instance, if issues come to light late in the project, they might force an extensive re-evaluation of decisions made much earlier.

Verifying input data quality can have a big impact on project success.

R offers two convenient tools for checking input data with minimal code: the base summary() function and the skim() function from the skimr package. With either one, one of the simplest checks is to look for missing values. Missing values often indicate data entry errors, corruption, or poor data collection overall. Of course, we must be familiar with the data to know when missing values are and aren’t expected—but unexpected missing values are often symptomatic of significant data issues.
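As a minimal sketch of these checks, consider the built-in airquality dataset, which ships with R and contains missing values (the choice of dataset here is ours, for illustration only):

# summary() reports per-column statistics, including NA counts
summary(airquality)

# Count missing values per column explicitly
colSums(is.na(airquality))

# skim() adds n_missing and complete_rate for each variable
library(skimr)
skim(airquality)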

Similarly, significant outliers can be symptomatic of data issues. Outliers are observations that lie far from the rest of the data points; they can be caused by data entry errors, or they may represent true anomalies. In R, outliers can be quickly identified using skim() or boxplots. Keep in mind that even when an outlier represents a true anomaly rather than a data issue, it may warrant further investigation or removal, or it may require adjusting our modeling procedures—for example, we may need to log transform the data, depending on the circumstances.
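Boxplots flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a minimal sketch, the same rule can be applied numerically to flag candidate outliers for inspection (the 1.5 multiplier is the usual convention, an assumption on our part rather than something the lesson prescribes):

# Flag values beyond 1.5 * IQR from the quartiles (the same rule boxplots use)
VAR_SepalWidths <- iris$Sepal.Width
VAR_Quartiles <- quantile(VAR_SepalWidths, probs = c(0.25, 0.75))
VAR_IQR <- IQR(VAR_SepalWidths)
VAR_IsOutlier <- VAR_SepalWidths < VAR_Quartiles[1] - 1.5 * VAR_IQR |
  VAR_SepalWidths > VAR_Quartiles[2] + 1.5 * VAR_IQR
iris[VAR_IsOutlier, ]   # inspect the flagged observations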

The following example, main.R, demonstrates these checks on the iris dataset:
# Load the tidyverse packages and skimr
library(ggplot2)
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(skimr)
# Use the built-in iris dataset as a tibble
VAR_IrisData <- as_tibble(iris)
skim(VAR_IrisData)
# A boxplot of sepal lengths by species
VAR_IrisData %>% ggplot(mapping = aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

In this example, we perform fundamental checks for input data quality: skim() summarizes every variable, including missing-value counts and distributions, while the boxplot makes potential outliers in sepal length visible for each species.
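Finally, when a variable is heavily right-skewed and its extreme values would distort a model, a log transformation is one common adjustment. Continuing with the libraries loaded above, here is a minimal sketch with simulated data (the variable and its parameters are illustrative assumptions, not part of the lesson’s example):

# Simulate a right-skewed variable
set.seed(42)
VAR_SkewedData <- tibble(x = rlnorm(1000, meanlog = 0, sdlog = 1))

# Raw scale: a long right tail with apparent extreme outliers
VAR_SkewedData %>% ggplot(mapping = aes(x = x)) +
  geom_histogram(bins = 30)

# Log scale: roughly symmetric, with extreme values pulled in
VAR_SkewedData %>% ggplot(mapping = aes(x = log(x))) +
  geom_histogram(bins = 30)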