Many Categories Impurity

Learn how CART decision trees handle categorical features with more than two categories.

Multivalue attributes

When building decision trees, the CART algorithm uses only two-way (i.e., binary) splits of the data, and CART classification trees are constructed using the Gini gain calculation. This lesson builds on that knowledge by showing how the CART classification tree algorithm handles a situation that is widespread in business data: categorical features with more than two values.
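As a quick refresher on the calculation mentioned above, here is a minimal sketch of Gini impurity and Gini gain for a single binary split. It assumes the usual definitions (impurity is one minus the sum of squared class proportions, and gain is the parent impurity minus the size-weighted impurity of the two children). The function names and the toy labels are illustrative and do not come from the lesson.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Toy labels (hypothetical): a balanced parent split into two mostly pure halves.
parent = ["no"] * 5 + ["yes"] * 5
left = ["no"] * 4 + ["yes"]
right = ["no"] + ["yes"] * 4

print(round(gini(parent), 4))                   # 0.5
print(round(gini_gain(parent, left, right), 4)) # 0.18
```

A larger gain means the split leaves the two child nodes purer than the parent, which is why a tree builder would prefer it among candidate splits.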

Consider the following Adult Census Income data sample:

Adult Census Income Data Sample

Occupation           Income
Adm-clerical         <=50K
Exec-managerial      <=50K
Handlers-cleaners    <=50K
Handlers-cleaners    <=50K
Prof-specialty       <=50K
Exec-managerial      <=50K
Other-service        <=50K
Exec-managerial      >50K
Prof-specialty       >50K
Exec-managerial      >50K
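Because CART splits are binary, a feature with several categories has to be divided into two groups of values. The sketch below illustrates one way to do that on the ten rows above: it brute-forces every two-group partition of the occupation values and scores each with Gini gain. This exhaustive search is an assumption made for illustration, not necessarily how any particular library implements it, and the names `rows`, `split_gain`, and `candidates` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The ten (occupation, income) rows from the data sample above.
rows = [
    ("Adm-clerical", "<=50K"), ("Exec-managerial", "<=50K"),
    ("Handlers-cleaners", "<=50K"), ("Handlers-cleaners", "<=50K"),
    ("Prof-specialty", "<=50K"), ("Exec-managerial", "<=50K"),
    ("Other-service", "<=50K"), ("Exec-managerial", ">50K"),
    ("Prof-specialty", ">50K"), ("Exec-managerial", ">50K"),
]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(rows, left_values):
    """Gini gain of sending rows whose occupation is in left_values to the left."""
    parent = [y for _, y in rows]
    left = [y for x, y in rows if x in left_values]
    right = [y for x, y in rows if x not in left_values]
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


values = sorted({x for x, _ in rows})
first, rest = values[0], values[1:]

# Pin the first value to the left group so each partition is listed only once,
# and skip the case where every value lands on the left (no split at all).
candidates = []
for k in range(len(rest) + 1):
    for extra in combinations(rest, k):
        left_values = frozenset((first, *extra))
        if len(left_values) < len(values):
            candidates.append(left_values)

best = max(candidates, key=lambda s: split_gain(rows, s))
print(len(candidates))  # 15
print(sorted(best), round(split_gain(rows, best), 4))
# ['Adm-clerical', 'Handlers-cleaners', 'Other-service'] 0.12
```

On this sample, the highest-gain partition puts the three occupations that only appear with <=50K on one side, and Exec-managerial and Prof-specialty on the other. The number of candidate partitions roughly doubles with each additional category, so exhaustive enumeration is only practical for features with few distinct values.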

In this data sample, the occupation feature has five ...