Feature Engineering and Categorical Variables Encoding
In this lesson, we'll explore the ideas behind categorical variable encoding, which converts categorical variables into numerical ones so that machine learning algorithms can process them.
Feature Engineering
Feature Engineering helps us build more capable models from the preprocessed features at hand. Feature Selection means keeping only a subset of the preprocessed features for building the model. Both steps are part of most model-building pipelines.
Missing values
Features in the input dataset can contain missing values for various reasons. Filling in the missing values, or dropping the features or instances that have a large number of them, is an important part of the pipeline. Data Imputation is the technique used for estimating the missing values.
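Before choosing how to handle missing values, it usually helps to measure how many there are per feature and per instance. The snippet below is a minimal sketch that assumes the data is held in a pandas DataFrame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, np.nan],
    "city": ["Paris", None, "Rome", "Paris"],
})

# Number of missing values in each feature (column).
missing_per_feature = df.isna().sum()

# Fraction of missing values in each feature.
missing_fraction = df.isna().mean()

# Number of missing values in each instance (row).
missing_per_instance = df.isna().sum(axis=1)

print(missing_per_feature)
print(missing_fraction)
print(missing_per_instance)
```

Counts like these can inform whether a feature or instance is a candidate for dropping or for imputation.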
Dealing with missing values
-
Drop the instances: The first technique is to drop the instances or features that have at least one missing value. A variation is to drop only the instances in which values are missing in any of a defined set of features.
-
Mean or Median Imputation: This replaces each missing value with the mean or median of the respective feature. The mean or median is calculated on the training dataset, and the same value is used to fill in missing values in the test dataset.
-
Mode or Frequent Category Imputation: This imputation is used mostly for categorical variables and replaces the missing values with the most common value in the feature.
-
Arbitrary Number Imputation: This replaces the missing value with an arbitrary value. The most commonly used values for numerical features are 999, 9999, or -1. A missing value in a categorical variable is replaced by the string "Missing".
-
Random Sampling Imputation: This fills each missing value with an observation drawn at random from the pool of available values in the feature. Unlike the other imputation techniques discussed in this lesson, random sampling imputation preserves the original distribution of the feature, and it is suitable for numerical and categorical variables alike.
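The techniques listed above can be sketched with pandas as follows. This is an illustrative sketch rather than a reference implementation: the toy data, the column names, the choice of -1 as the arbitrary value, and the helper `random_sample_impute` are all assumptions made for the example. Drawing the test set's random samples from the training pool is also an assumption, chosen to mirror how the mean or median is reused.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test sets; names and values are made up.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0, 100.0],
    "city": ["Paris", "Rome", "Paris", None, "Oslo"],
})
test = pd.DataFrame({
    "age": [np.nan, 35.0],
    "city": [None, "Rome"],
})

# 1. Drop instances: with any missing value, or only where a chosen
#    subset of features is missing.
dropped_any = train.dropna()
dropped_subset = train.dropna(subset=["age"])

# 2. Mean or median imputation: the statistic is computed on the
#    training set only and then reused for the test set.
age_median = train["age"].median()
train_median = train["age"].fillna(age_median)
test_median = test["age"].fillna(age_median)

# 3. Mode (most frequent category) imputation.
city_mode = train["city"].mode()[0]
train_mode = train["city"].fillna(city_mode)

# 4. Arbitrary value imputation: -1 for the numerical feature,
#    the string "Missing" for the categorical one.
train_arbitrary_age = train["age"].fillna(-1)
train_arbitrary_city = train["city"].fillna("Missing")


# 5. Random sampling imputation (hypothetical helper).
def random_sample_impute(series, pool, seed=0):
    """Fill missing entries of `series` with random draws from `pool`."""
    result = series.copy()
    mask = result.isna()
    sampled = pool.dropna().sample(
        n=int(mask.sum()), replace=True, random_state=seed
    )
    # Use the raw values so that index labels of the sample are ignored.
    result[mask] = sampled.to_numpy()
    return result


train_random = random_sample_impute(train["age"], train["age"])
test_random = random_sample_impute(test["age"], train["age"])

print(dropped_any)
print(test_median)
print(train_mode)
print(train_random)
```

The median is used in step 2 because the toy `age` column contains an outlier (100); `train["age"].mean()` would be the drop-in alternative for mean imputation.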