Case Study: Identifying Bias in Personal and Sensitive Data
Learn how to identify bias in personal and sensitive data using Fairlearn.
We'll cover the following...
- Understanding personal and sensitive attributes in data
- Overview of the credit loan dataset
- Identifying bias in sensitive attributes of loan data
- Train a classification model to predict loan approval
- Compute demographic parity fairness metric using Fairlearn
- Compute equalized odds fairness metrics using Fairlearn
Bias in data can lead to unfair and discriminatory outcomes in AI systems. By actively seeking out and addressing bias, we can work toward ensuring fair treatment and nondiscrimination for all individuals and groups.
Understanding personal and sensitive attributes in data
In the context of bias in AI solutions, sensitive data refers to a characteristic or attribute that is closely associated with protected or vulnerable groups.
Sensitive features can include attributes such as race, ethnicity, gender, age, religion, sexual orientation, disability status, and socioeconomic background. These features are considered sensitive because they have historically been associated with discrimination or marginalization in various domains.
Personally identifiable information (PII) refers to any information that can be used to identify an individual uniquely. It includes personally identifiable attributes such as full name, social security number, date of birth, address, phone number, email address, financial account numbers, and more. PII is considered sensitive because its exposure or misuse can lead to privacy breaches, identity theft, or other forms of harm.
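One common precaution, illustrated below, is to keep an explicit list of identifier columns and drop them before any analysis or model training. This is only a minimal sketch; the column names (other than Loan_Id, which appears in our dataset) are hypothetical examples.

```python
import pandas as pd

# Hypothetical identifier columns that would count as PII if present
pii_columns = ['Loan_Id', 'Full_Name', 'Email', 'Phone_Number', 'SSN']

def drop_pii(df, pii_columns):
    """Remove any PII columns that appear in the dataframe."""
    present = [col for col in pii_columns if col in df.columns]
    return df.drop(columns=present)

# Example usage with a small, made-up dataframe
df = pd.DataFrame({'Loan_Id': ['LP001'], 'Gender': ['Male'], 'LoanAmount': [128]})
print(drop_pii(df, pii_columns).columns.tolist())  # ['Gender', 'LoanAmount']
```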
Overview of the credit loan dataset
Previously, we examined the credit loan data and performed some exploratory analysis. Now that we are aware of sensitive data and PII, let's analyze these attributes to see whether any bias is present.
Identifying bias in sensitive attributes of loan data
We focus our attention on three personal attributes present in the data:
- Gender
- Married
- Self_Employed
In an ideal world, an AI solution should not base a loan decision on any of the above attributes. However, does the training data have equal representation across these attributes?
Hypothesis 1: The training data has equal representation for the Gender, Married, and Self_Employed attributes.
```python
#Import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

#Read the data
data = pd.read_csv('loan_approval.csv')
data.drop(['Loan_Id'], axis=1, inplace=True)

#Identify sensitive features
sensitive_features = ['Gender', 'Married', 'Self_Employed']

print('\n Distribution of Gender')
print(data['Gender'].value_counts(normalize=True))
print('\n')
print('\n Distribution of Married Status')
print(data['Married'].value_counts(normalize=True))
print('\n')
print('\n Distribution of Self_Employed')
print(data['Self_Employed'].value_counts(normalize=True))
print('\n')
```
We test the hypothesis using the above script.
- Lines 1–5: We import libraries for our analysis.
- Lines 7–9: We read the loan data and drop the Loan_Id column, since it is an identifier rather than a predictive feature.
- Lines 11–12: We list the sensitive features we want to examine.
- Lines 14–22: We print the normalized distribution of each sensitive attribute to check how balanced its representation is.
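Although matplotlib and seaborn are imported in the script, they are not used there. As an optional visual check, the sketch below plots the same distributions as count plots. It assumes the script above has already been run, so data and sensitive_features are defined.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot each sensitive attribute's distribution side by side
# (reuses `data` and `sensitive_features` from the script above).
fig, axes = plt.subplots(1, len(sensitive_features), figsize=(15, 4))
for ax, col in zip(axes, sensitive_features):
    sns.countplot(x=col, data=data, ax=ax)
    ax.set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
```

A heavily skewed count plot signals the same imbalance that the value_counts output reveals numerically.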