Kaggle Challenge - Data Transformation
3. Transformation Pipelines
As you can see, from imputing missing values to feature scaling to handling categorical attributes, we have many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn makes our lives easier: it provides the Pipeline class to help with such sequences of transformations.
Note: Creating transformation pipelines is optional. Pipelines are handy when dealing with a large number of attributes, so they are a good-to-know feature of Scikit-Learn. At this point we could move directly on to creating our machine learning model, but to learn how things are done, we are going to work with pipelines.
Some Scikit-Learn terminology:
- Estimators: Any object that can estimate some parameters based on a dataset (e.g., an imputer is an estimator). The estimation itself is performed by calling the fit() method.
- Transformers: Some estimators (such as an imputer) can also transform a dataset; these are called transformers. The transformation is performed by the transform() method, which takes the dataset to transform as a parameter.
- Predictors: Some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model is a predictor. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set.
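The three roles above can be sketched on a toy dataset. This is a minimal illustration, not part of the challenge code: the arrays X and y are made up, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy data (made up): one feature with a missing value. The labels are chosen
# so that y = 2 * x holds once the gap is filled with the column median.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 4.0, 8.0])

# Estimator: fit() learns parameters from the data (here, the column median).
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
learned_median = imputer.statistics_[0]

# Transformer: transform() applies the learned parameters to a dataset.
X_filled = imputer.transform(X)

# Predictor: fit() on features and labels, then predict() and score().
model = LinearRegression()
model.fit(X_filled, y)
predictions = model.predict(np.array([[5.0]]))
r2 = model.score(X_filled, y)  # R^2 for a regressor

print(learned_median, predictions, r2)
```

Note that the imputer is both an estimator and a transformer, while LinearRegression is both an estimator and a predictor.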
Based on the data preparation steps we have identified so far, we are going to create a transformation pipeline that uses the SimpleImputer (*) and StandardScaler classes for the numerical attributes and OneHotEncoder for the categorical attributes.
(*) Scikit-Learn provides a very handy class, SimpleImputer, to take care of missing values. You just tell it the type of imputation, e.g., by median, and voilà, the job is done. We have already talked about the other two classes.
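To make "telling it the type of imputation" concrete, here is a small sketch that compares the median and mean strategies on a made-up column containing an outlier. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: three known values (one of them an outlier) and one gap.
X = np.array([[1.0], [3.0], [100.0], [np.nan]])

# The strategy argument selects the type of imputation.
median_imputer = SimpleImputer(strategy="median")
mean_imputer = SimpleImputer(strategy="mean")

X_median = median_imputer.fit_transform(X)
X_mean = mean_imputer.fit_transform(X)

# The value used to fill the gap is stored per column in statistics_.
print("median fill:", median_imputer.statistics_[0])
print("mean fill:", mean_imputer.statistics_[0])
```

The median is less affected by the outlier than the mean, which is one reason it is a common choice for skewed numerical attributes.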
First, we will look at a simple example pipeline to impute and scale numerical attributes. Then we will create a full pipeline to handle both numerical and categorical attributes in one go.
The numerical pipeline:
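A minimal sketch of such a numerical pipeline, assuming a small made-up array in place of the challenge's numerical attributes. The names X_num and num_pipeline and the step labels "imputer" and "std_scaler" are illustrative choices, not taken from the course code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numerical attributes with missing values (stand-in for real data).
X_num = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [np.nan, 400.0],
    [4.0, 600.0],
])

# Steps run in the listed order: impute missing values first, then scale.
# Each step is a (name, transformer) pair; the names are arbitrary labels.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform() fits each step in turn and passes its output to the next.
X_num_prepared = num_pipeline.fit_transform(X_num)
print(X_num_prepared)
```

After fitting, the same pipeline object can be reused on new data with num_pipeline.transform(...), so the medians and scaling parameters learned from the training set are applied consistently.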