Factorize and Cross-Tabulate
Learn how to factorize categorical variables and perform cross-tabulation on the categories.
Factorizing categorical data
So far we’ve seen how the astype() method converts a DataFrame column into the category data type while keeping the categories at their original values. If we instead want to encode the column and obtain a numeric representation of the categories, we can use the factorize() method.
For example, we can apply the factorize() method on the Ethnicity column of the credit card dataset to retrieve a numerical representation of the different ethnicities.
# Factorize Ethnicity categorical column
codes, uniques = df['Ethnicity'].factorize()

# View outputs
print(codes)
print('=' * 80)
print(uniques)
The factorize() method returns two objects, codes and uniques:
The codes output is an integer array that is the numerical representation of the original categories. A better way to visualize this long sequence of numbers is to store them in a new DataFrame column, as shown below:
# Factorize and store numerical representation in column
df['Ethnicity_codes'], _ = df['Ethnicity'].factorize()

# View Ethnicity columns
print(df[['Ethnicity', 'Ethnicity_codes']])
The output DataFrame above shows that each ethnicity category has been assigned a corresponding numerical representation, i.e., Caucasian: 0; Asian: 1; African American: 2. This operation is also known as label encoding.
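To make the label encoding concrete, here is a minimal, self-contained sketch. The small DataFrame is a hypothetical stand-in for the credit card dataset, not the real data. It suggests that codes are assigned in order of first appearance, and that the original labels can be recovered by indexing uniques with the codes.

```python
import pandas as pd

# Hypothetical toy data standing in for the credit card dataset
df = pd.DataFrame({
    "Ethnicity": ["Caucasian", "Asian", "Caucasian", "African American", "Asian"]
})

# Label-encode the column: one integer per distinct value
codes, uniques = df["Ethnicity"].factorize()

print(codes.tolist())   # [0, 1, 0, 2, 1]
print(list(uniques))    # ['Caucasian', 'Asian', 'African American']

# Round trip: look up each code in uniques to rebuild the original labels
decoded = uniques.take(codes)
print(list(decoded) == df["Ethnicity"].tolist())  # True
```

Because the integers follow the order in which values first appear, they carry no inherent ranking between categories.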
The uniques output represents the unique category values. Depending on the original values in the column, it can be in the form of a NumPy array, Index object, or CategoricalIndex object. In our ...
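As a rough illustration of how the type of uniques depends on the input, the sketch below factorizes the same hypothetical toy values three ways: as a NumPy array passed to the top-level pd.factorize() function, as a plain Series, and as a Series with the category data type. The values and variable names are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

values = ["b", "a", "b", "c"]  # hypothetical toy values

# NumPy array passed to pd.factorize(): uniques is a NumPy array
_, uniques_array = pd.factorize(np.array(values, dtype=object))
print(type(uniques_array).__name__)  # ndarray

# Plain Series: uniques is an Index
_, uniques_index = pd.Series(values).factorize()
print(type(uniques_index).__name__)  # Index

# Series with category dtype: uniques is a CategoricalIndex
_, uniques_cat = pd.Series(values, dtype="category").factorize()
print(type(uniques_cat).__name__)  # CategoricalIndex
```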