Many Categories SSE
Build knowledge of how CART regression trees find optimal splits for categorical data.
Splitting binary categorical features
The simplest case for categorical features is when there are only two possible values (i.e., the feature is binary). For a binary categorical feature, the CART regression tree algorithm calculates the feature's SSE by splitting the data on one of the two categories, calculating the SSE for the left-hand and right-hand partitions, and then adding the two SSEs together.
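To make the calculation concrete, here is a minimal Python sketch of a binary-category split. The feature and target values are made up for illustration and are not the course's dataset.

```python
# Hypothetical example: SSE of a split on a binary categorical feature.
# The "sex" and "age" values below are illustrative only.
import numpy as np

sex = np.array(["male", "female", "male", "female", "male", "female"])
age = np.array([22.0, 38.0, 35.0, 27.0, 54.0, 4.0])  # target to predict

def sse(values: np.ndarray) -> float:
    """Sum of squared errors around the partition's mean."""
    return float(np.sum((values - values.mean()) ** 2)) if values.size else 0.0

# Split on one category: left = rows matching the category, right = the rest.
left_mask = sex == "male"
split_sse = sse(age[left_mask]) + sse(age[~left_mask])
print(f"SSE for split on sex == 'male': {split_sse:.2f}")
```

Because the feature has only two categories, this single split is the only one the algorithm needs to evaluate for the feature.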
Splitting many-category features
When a categorical feature has three or more categories (i.e., levels), evaluating every possible grouping of categories would require checking 2^(k−1) − 1 candidate splits for k categories, so the CART regression tree algorithm uses an optimization that evaluates only a minimal number of candidate splits, as sketched below. Consider a hypothetical model imputing missing values of the Age feature from the Titanic dataset.
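Below is a minimal sketch of one standard reduction used for regression trees: order the categories by their mean target value, then evaluate only the k − 1 splits along that ordering instead of all 2^(k−1) − 1 groupings. The Pclass and Age values here are illustrative, not the actual Titanic sample shown in the table.

```python
# Sketch: reduce candidate splits for a many-category feature by ordering
# categories by mean target value (illustrative Pclass/Age data only).
import numpy as np

pclass = np.array([1, 1, 2, 2, 3, 3, 3])
age = np.array([38.0, 54.0, 27.0, 35.0, 22.0, 4.0, 19.0])

def sse(values: np.ndarray) -> float:
    """Sum of squared errors around the partition's mean."""
    return float(np.sum((values - values.mean()) ** 2)) if values.size else 0.0

# Order the categories by their mean Age.
categories = np.unique(pclass)
order = sorted(categories, key=lambda c: age[pclass == c].mean())

# Evaluate only the k - 1 splits along this ordering.
for i in range(1, len(order)):
    left_cats = order[:i]
    left_mask = np.isin(pclass, left_cats)
    total = sse(age[left_mask]) + sse(age[~left_mask])
    print(f"left categories = {sorted(left_cats)}: SSE = {total:.2f}")
```

The split with the lowest total SSE among these ordered candidates is the one the feature contributes when competing against other features for the node.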
The following table represents a sample of Titanic data for the Pclass and Age features available to the CART regression tree algorithm for a split: