Analytical Best Practices
Learn best practices in data science and how to implement them.
Building high-quality analytical models is central to data science. This lesson discusses some conceptual best practices for ensuring that our models perform well and meet our project needs. While there are many best practices, here we focus on general ones that apply widely across model types and fields of analysis.
Input data quality
The adage “garbage in, garbage out” is as true in data science as in any other field. Bad data can lead to unreliable models and incorrect predictions, so it’s crucial to identify and handle it. We must always check our input data early in our projects. When bad data goes unchecked, the consequences can be severe. For instance, if issues come to light late in the project, we might have to re-evaluate several decisions we made earlier.
R offers two convenient tools for checking input data with minimal coding: the base summary() function and the skim() function from the skimr package. With both of these, one of the simplest ways to identify bad data is to look for missing values. Missing values often indicate data entry errors, corruption, or poor data collection overall. Of course, we must be familiar with the data to know when missing values are and aren’t expected, but unexpected missing values are often symptomatic of significant data issues.
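As a minimal sketch of the missing-value check described above, the following uses only base R. The built-in airquality dataset is an assumption chosen here for illustration because it contains missing values; it is not the dataset used elsewhere in this lesson.

```r
# airquality is a built-in dataset, used here only for illustration
air <- airquality

# summary() reports the number of NA values alongside the quartiles
print(summary(air))

# Count missing values in each column
na_counts <- colSums(is.na(air))
print(na_counts)

# Share of rows that contain at least one missing value
incomplete_share <- mean(!complete.cases(air))
print(round(incomplete_share, 3))
```

Whether these counts are a problem depends on what we know about how the data was collected; the code only tells us where to look.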
Similarly, significant outliers can be symptomatic of data issues. These observations, which lie far from the rest of the data points in a dataset, can be caused by data entry errors or may represent true anomalies in the data. In R, outliers can be quickly spotted in the quantiles and inline histograms that skim() reports, or in boxplots. Keep in mind that even if an outlier reflects a true anomaly rather than a data issue, it may be worth further investigation or removal, or it may require us to adjust our modeling procedures. For example, we may need to log transform the data, depending on the circumstances.
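One way to sketch the outlier check in code is the 1.5 × IQR rule that boxplots conventionally use to draw their whiskers. The helper flag_outliers() and the sample values below are hypothetical, introduced only for illustration. The helper uses quantile(), so its fences can differ slightly from the hinges a boxplot draws.

```r
# Hypothetical helper: flag values more than k * IQR beyond the quartiles
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up values for illustration; 250 is a deliberately extreme entry
values <- c(10, 12, 11, 13, 12, 11, 250)
is_outlier <- flag_outliers(values)
print(values[is_outlier])

# A log transform compresses large values (requires strictly positive data)
log_values <- log(values)
print(round(range(log_values), 2))
```

Flagging is only the first step. Whether to investigate, remove, or transform a flagged value is a judgment call that depends on what the value represents.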
    # Load tidyverse libraries and skimr
    library(ggplot2)
    library(purrr)
    library(tibble)
    library(dplyr, warn.conflicts = FALSE)
    library(tidyr)
    library(stringr)
    library(readr)
    library(forcats)
    library(skimr)

    # Use the iris dataset
    VAR_IrisData <- as_tibble(iris)

    # Summarize every column: missing values, means, quantiles, histograms
    skim(VAR_IrisData)

    # A boxplot of sepal lengths by species
    VAR_IrisData %>%
      ggplot(mapping = aes(x = Species, y = Sepal.Length)) +
      geom_boxplot()
In this example, we perform fundamental checks for input data quality. Here we find that there are no missing values in any of the data columns. We can observe that the means and medians for the data columns are ...