...


Theory of Data Bias Mitigation

Learn some mathematical approaches to bias mitigation in ML.

Let’s explore theoretical approaches to mitigating various types of data bias.

Non-ML approaches

These solutions are useful for data scientists to know because they’re generally cost- and time-effective tools and procedures that can greatly enhance the quality of the data when done properly. In many cases, they are simplified versions of the higher-grade fixes that ML debiasers can provide, but they are still very much worth knowing.

Oversampling and undersampling

One very simple approach is to change the sampling structure of the underlying data. In essence, we either duplicate rows from the minority class until they match the count of the majority class (oversampling), or randomly remove rows from the majority class until they match the count of the minority class (undersampling).
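To make both operations concrete before the full walkthrough below, here is a minimal sketch on a made-up toy DataFrame; the column names, the 90/10 split, and the seed are illustrative assumptions rather than part of the lesson’s dataset.

import pandas as pd
import numpy as np

np.random.seed(0)

# toy data: 90 rows in the majority class (0) and 10 in the minority class (1)
toy = pd.DataFrame({
    'feature': np.random.normal(size=100),
    'group': [0] * 90 + [1] * 10
})
majority = toy.loc[toy['group'] == 0]
minority = toy.loc[toy['group'] == 1]

# oversampling: duplicate minority rows (with replacement) until they match the majority count
oversampled = pd.concat([toy, minority.sample(n=len(majority) - len(minority),
                                              replace=True, random_state=0)], axis=0)

# undersampling: randomly keep only as many majority rows as there are minority rows
undersampled = pd.concat([majority.sample(n=len(minority), random_state=0),
                          minority], axis=0)

print(oversampled['group'].value_counts())   # 90 rows per class
print(undersampled['group'].value_counts())  # 10 rows per class

Note that undersampling trades the duplicate-row problem of oversampling for a different one: we discard majority-class rows and, with them, real information.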

Oversampling

Let's consider a dataset with three variables: age, credit score, and race. We’ll use a binary race variable for simplicity and set the prior distribution so that a race value of 0 is drawn 80% of the time. That way, we can quickly calculate the change in representation rate.

# import libraries
import pandas as pd
import numpy as np

# set seed for reproducibility
np.random.seed(101)

# create fake data via a normal distribution
original_data = pd.DataFrame(data={
    'age': np.round(np.abs(np.random.normal(loc=50, scale=20, size=1000))),
    'credit_score': np.round(np.random.normal(loc=650, scale=50, size=1000)),
    'race': np.random.choice([0, 1], p=[0.80, 0.20], size=1000)
})

# get counts of minority class for oversampling
minority = original_data.loc[original_data['race'] == 1]
class_counts = original_data['race'].value_counts()
rows_to_add = class_counts[0] - class_counts[1]
print('Initial Distribution\n0: {}, 1: {}'.format(class_counts[0],
                                                  class_counts[1]))

# oversample from minority class
sampled_rows = np.random.choice(range(minority.shape[0]),
                                replace=True,
                                size=rows_to_add)
altered_data = pd.concat([original_data, minority.iloc[sampled_rows]], axis=0)
class_counts = altered_data['race'].value_counts()
print('Oversampled Distribution\n0: {}, 1: {}'.format(class_counts[0],
                                                      class_counts[1]))

print('\nOriginal Data Stats:\n{}'.format(original_data.describe()))
print('\nAltered Data Stats:\n{}'.format(altered_data.describe()))

Lines 9–13: We generate a fake dataset with normal distributions.

Lines 16–18: We count the representation of the minority class so we know how much to sample.

Lines 23–29: We sample from the minority rows and concatenate them to the original dataset.

We can see that oversampling does indeed correct the representation rate difference. With an 80/20 prior over 1,000 rows, we expect roughly 800 rows with race 0 and 200 with race 1, so roughly 600 duplicated minority rows are appended to make the two classes equal.

Note, however, that by duplicating rows, we also change the fundamental statistics of the data. Repeating rows inadvertently reduces variance, because the exact copies pull the average row-to-row difference closer to zero. In ML models, variance is extremely helpful because it allows for a greater degree of separation between observations and, therefore, better performance.
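One quick way to see this effect is to count exact duplicate rows before and after oversampling. The short sketch below assumes the original_data and altered_data frames from the snippet above are still in scope:

# every oversampled row is an exact copy of an existing minority row,
# so the number of duplicate rows jumps after oversampling
print('Duplicate rows before: {}'.format(original_data.duplicated().sum()))
print('Duplicate rows after: {}'.format(altered_data.duplicated().sum()))

# the copies add no new information: the number of distinct minority rows
# is unchanged even though the class counts are now equal
print('Distinct minority rows before: {}'.format(
    original_data.loc[original_data['race'] == 1].drop_duplicates().shape[0]))
print('Distinct minority rows after: {}'.format(
    altered_data.loc[altered_data['race'] == 1].drop_duplicates().shape[0]))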

There ...