...


Case Study: Identifying Bias in Personal and Sensitive Data

Learn how to identify bias in personal and sensitive data using Fairlearn.

Bias in data can lead to unfair and discriminatory outcomes in AI systems. By actively seeking out and addressing bias, we can work toward ensuring fair treatment and nondiscrimination for all individuals and groups.

Understanding personal and sensitive attributes in data

In the context of bias in AI solutions, sensitive data refers to a characteristic or attribute that is closely associated with protected or vulnerable groups.

Sensitive features can include attributes such as race, ethnicity, gender, age, religion, sexual orientation, disability status, and socioeconomic background. These features are considered sensitive because they have historically been associated with discrimination or marginalization in various domains.

Personally identifiable information (PII) refers to any information that can be used to identify an individual uniquely. It includes personally identifiable attributes such as full name, social security number, date of birth, address, phone number, email address, financial account numbers, and more. PII is considered sensitive because its exposure or misuse can lead to privacy breaches, identity theft, or other forms of harm.
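Because PII identifies individuals directly, it is usually removed before modeling, while sensitive attributes are retained only for bias analysis. A minimal sketch of this separation, using made-up column names that are illustrative and not taken from the case-study dataset:

```python
import pandas as pd

# Hypothetical loan records containing PII columns
# (column names are illustrative, not from the real dataset).
records = pd.DataFrame({
    "Full_Name": ["A. Khan", "B. Lee"],
    "Email": ["a@example.com", "b@example.com"],
    "Gender": ["Male", "Female"],
    "Loan_Amount": [120, 85],
})

# Columns treated as directly identifying (PII).
pii_columns = ["Full_Name", "Email"]

# Drop PII before modeling; sensitive attributes such as Gender are
# kept, but only to audit the model for bias, not as model features.
safe_data = records.drop(columns=pii_columns)
print(list(safe_data.columns))
```

The key design point is that the two kinds of columns are handled differently: PII is discarded outright, whereas sensitive attributes must remain available so that fairness checks can group outcomes by them.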

[Image: Personally identifiable information]

Overview of the credit loan dataset

Previously, we examined the credit loan data and performed some exploratory analysis. Now that we understand sensitive data and PII, let's examine these attributes to see whether any bias is present.

Identifying bias in sensitive attributes of loan data

We focus our attention on three personal attributes present in the data:

  • Gender
  • Married
  • Self_Employed

In an ideal world, an AI solution should not base loan decisions on any of the attributes above.

However, does the training data have an equal representation of the above attributes?
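One simple way to make "equal representation" concrete is to compare the shares of the most and least common categories within each attribute. A minimal sketch on synthetic values (the numbers below are made up for illustration and are not from the loan dataset):

```python
import pandas as pd

# Synthetic stand-in for the loan data (values are illustrative).
data = pd.DataFrame({
    "Gender": ["Male"] * 8 + ["Female"] * 2,
    "Married": ["Yes"] * 6 + ["No"] * 4,
    "Self_Employed": ["No"] * 9 + ["Yes"] * 1,
})

# For each sensitive feature, divide the largest category share by the
# smallest; a ratio near 1 means balanced representation, a large
# ratio signals imbalance.
for feature in ["Gender", "Married", "Self_Employed"]:
    shares = data[feature].value_counts(normalize=True)
    ratio = shares.max() / shares.min()
    print(f"{feature}: imbalance ratio = {ratio:.1f}")
```

With these synthetic values, Self_Employed is the most skewed (a 9:1 ratio), which is the kind of signal the hypothesis below is designed to probe.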

Hypothesis 1: The training data has equal representation for the Gender, Married, and Self_Employed attributes.

# Import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
# Read the data
data = pd.read_csv('loan_approval.csv')
data.drop(['Loan_Id'], axis=1, inplace=True)
# Identify sensitive features
sensitive_features = ['Gender', 'Married', 'Self_Employed']
print('\nDistribution of Gender')
print(data['Gender'].value_counts(normalize=True))
print('\n')
print('\nDistribution of Married Status')
print(data['Married'].value_counts(normalize=True))
print('\n')
print('\nDistribution of Self_Employed')
print(data['Self_Employed'].value_counts(normalize=True))
print('\n')

We test the hypothesis using the above script.

  • Lines 1–5: We import libraries for our analysis.
  • Lines 7–8: We read the loan data and drop the Loan_Id identifier column.
...