Introduction to Data Bias

Learn what data bias is and where it comes from.

While many of the pre-pipeline biases are not directly observed or created by data scientists, it’s important to be conscious of where and under what conditions data is sourced. In this lesson, we focus primarily on data bias.

Defined simply, data bias is a skew or tendency in the data that leads a model to draw potentially erroneous conclusions. In other words, it’s a property of a dataset that increases ML risk downstream in the pipeline. Data bias is a general phenomenon that doesn’t necessarily relate to discrimination, but some of the most widely reported cases of data bias come from improperly sourced datasets that lead to discriminatory models.

Misrepresentation in data

The most common source of data bias is skewed representation in the data. Let’s consider a brief example. Suppose we have a highly imbalanced dataset in which one class (the 0s) makes up 90% of the target variable. We build a random forest classifier using this data.
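
To make the 90/10 split concrete, here is a minimal sketch in plain Python. It does not train the random forest. It only measures the baseline that the skew creates. The labels are made up for illustration (900 zeros and 100 ones), and the always-predict-the-majority "classifier" is a hypothetical stand-in, not part of the lesson's own code.

```python
from collections import Counter

# Hypothetical target variable: 900 zeros and 100 ones (a 90/10 split).
y_true = [0] * 900 + [1] * 100

# Share of each class in the target.
counts = Counter(y_true)
proportions = {label: n / len(y_true) for label, n in counts.items()}

# Naive baseline: always predict the majority class, ignoring all features.
majority_class = counts.most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1.
minority_total = sum(t == 1 for t in y_true)
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / minority_total

print("class proportions:", proportions)
print("baseline accuracy:", accuracy)
print("minority recall:", minority_recall)
```

The baseline scores 90% accuracy while never identifying a single member of the minority class. Any model trained on such data, the random forest included, should therefore be compared against this baseline and checked per class, not judged by overall accuracy alone.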
