Non-ML approaches

These techniques are worth knowing because they are generally cheap and quick to apply and, done properly, can substantially improve the quality of the data. In many cases, they are simplified versions of the more sophisticated fixes that ML debiasers provide.

Oversampling and undersampling

One very simple approach is to change the sampling structure of the underlying data. We either duplicate randomly chosen rows of the minority group until it matches the size of the majority group (oversampling), or randomly remove rows of the majority group until it matches the size of the minority group (undersampling).
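As a rough illustration of both operations, the sketch below resamples a small toy table with pandas. The column name `group`, the helpers `oversample` and `undersample`, and the toy counts are assumptions made for this example only.

```python
import pandas as pd

def oversample(df, column, seed=0):
    """Keep every original row and duplicate random rows of each smaller
    group until all groups match the size of the largest group."""
    target = int(df[column].value_counts().max())
    parts = [df]
    for _, rows in df.groupby(column):
        shortfall = target - len(rows)
        if shortfall > 0:
            # sample with replacement so a row may be duplicated more than once
            parts.append(rows.sample(n=shortfall, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

def undersample(df, column, seed=0):
    """Randomly keep only as many rows of each group as the smallest group has."""
    target = int(df[column].value_counts().min())
    parts = [
        rows.sample(n=target, replace=False, random_state=seed)
        for _, rows in df.groupby(column)
    ]
    return pd.concat(parts, ignore_index=True)

# toy data: 8 rows in group 0 (majority), 2 rows in group 1 (minority)
toy = pd.DataFrame({"group": [0] * 8 + [1] * 2, "value": range(10)})
oversampled = oversample(toy, "group")
undersampled = undersample(toy, "group")
print(oversampled["group"].value_counts().to_dict())
print(undersampled["group"].value_counts().to_dict())
```

Oversampling keeps all of the original information but adds exact duplicates, while undersampling adds nothing artificial but discards majority rows, so the better choice depends on how much data can be spared.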

Oversampling

Let's consider a dataset with three variables: age, credit score, and race. We’ll use a binary race variable for simplicity and draw race 0 with probability 0.80, so roughly 80% of the rows belong to the majority group. That way, we can quickly calculate the change in representation rate.

Python

# import libraries
import pandas as pd
import numpy as np
# set seed for reproducibility
np.random.seed(101)
# create fake data via a normal distribution
original_data = pd.DataFrame(data={
  'age': np.round(np.abs(np.random.normal(loc=50, scale=20, size=1000))),
  'credit_score': np.round(np.random.normal(loc=650, scale=50, size=1000)),
  'race': np.random.choice([0, 1], p=[0.80, 0.20], size=1000)
})
# get counts of minority class for oversampling
minority = original_data.loc[original_data['race'] == 1]
class_counts = original_data['race'].value_counts()
rows_to_add = class_counts[0] - class_counts[1]
print('Initial Distribution\n0: {}, 1: {}'.format(class_counts[0], class_counts[1]))
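# oversample from minority class: draw row positions with replacement
sampled_rows = np.random.choice(range(minority.shape[0]),
                                replace=True,
                                size=rows_to_add)
# append the duplicated rows; ignore_index avoids repeated index labels
altered_data = pd.concat([original_data, minority.iloc[sampled_rows]],
                         axis=0, ignore_index=True)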
class_counts = altered_data['race'].value_counts()
print('Oversampled Distribution\n0: {}, 1: {}'.format(class_counts[0], class_counts[1]))
print('\nOriginal Data Stats:\n{}'.format(original_data.describe()))
print('\nAltered Data Stats:\n{}'.format(altered_data.describe()))
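To make the change in representation rate concrete, the arithmetic can be checked with a few lines of Python. This is a minimal sketch: `representation_rate` is a hypothetical helper, and the counts 800 and 200 are the expected values under the 80/20 prior, not the exact counts the random draw above produces.

```python
def representation_rate(counts):
    """Return each group's share of the total number of rows."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# expected counts for 1,000 rows drawn with p = [0.80, 0.20] (assumed, not observed)
expected_before = {0: 800, 1: 200}

# oversampling adds enough minority rows to match the majority count
rows_to_add = expected_before[0] - expected_before[1]
expected_after = {0: expected_before[0], 1: expected_before[1] + rows_to_add}

print(representation_rate(expected_before))  # {0: 0.8, 1: 0.2}
print(representation_rate(expected_after))   # {0: 0.5, 1: 0.5}
```

Under these assumed counts, the minority group's share rises from 20% to 50%, and the dataset grows from 1,000 to 1,600 rows, 600 of which are duplicates.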
