Factorize and Cross-Tabulate

Learn how to factorize categorical variables and perform cross-tabulation on the categories.

Factorizing categorical data

So far, we've seen how the astype() method converts a DataFrame column to the category data type while keeping the category labels at their original values. If we instead want to encode the column and obtain a numeric representation of the categories, we can use the factorize() method.
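To make the contrast concrete, here is a minimal sketch on a small, made-up Series (the values are illustrative and are not taken from the credit card dataset). It assumes only that pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
s = pd.Series(['Caucasian', 'Asian', 'Caucasian', 'African American'])

# astype('category') changes the dtype but keeps the original labels
as_category = s.astype('category')
print(as_category.dtype)
print(as_category.tolist())

# factorize() returns integer codes plus the unique labels they refer to
codes, uniques = pd.factorize(s)
print(codes)
print(uniques)
```

Here pd.factorize(s) is the top-level function form; calling s.factorize() on the Series gives the same pair of outputs.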

For example, we can call factorize() on the Ethnicity column of the credit card dataset to retrieve a numerical representation of the different ethnicities.

# Factorize Ethnicity categorical column
codes, uniques = df['Ethnicity'].factorize()
# View outputs
print(codes)
print('=' * 80)
print(uniques)

The factorize() method returns two objects, codes and uniques:

  • The codes output is an integer array holding one code per row, which is the numerical representation of the original categories. Because this sequence is long, it is easier to inspect when stored in a new DataFrame column, as shown below:

# Factorize and store numerical representation in column
df['Ethnicity_codes'], _ = df['Ethnicity'].factorize()
# View Ethnicity columns
print(df[['Ethnicity', 'Ethnicity_codes']])

The output DataFrame above shows that each ethnicity category has been assigned a corresponding integer code: Caucasian is 0, Asian is 1, and African American is 2. By default, factorize() assigns codes in the order in which the categories first appear in the column. This operation is also known as label encoding.
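As a rough illustration of how the codes relate back to the labels, the sketch below builds a lookup from the two factorize() outputs and uses it to decode the integer column. The df_demo DataFrame and its values are hypothetical stand-ins, not the lesson's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's DataFrame
df_demo = pd.DataFrame(
    {'Ethnicity': ['Caucasian', 'Asian', 'Caucasian', 'African American']}
)

# Label-encode the column
codes, uniques = df_demo['Ethnicity'].factorize()
df_demo['Ethnicity_codes'] = codes

# Each code is a position in uniques, so enumerate() yields the lookup table
mapping = dict(enumerate(uniques))
print(mapping)

# Decode the integer codes back into the original labels
decoded = df_demo['Ethnicity_codes'].map(mapping)
print(decoded.tolist())
```

Keeping the uniques output around is what makes the encoding reversible; without it, the integer codes alone carry no label information.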

  • The uniques output represents the unique category values. Depending on the original values in the column, it can be in the form of a NumPy array, Index object, or CategoricalIndex object. In our ...
