Data Preprocessing
In this lesson, we present some useful methods for data preprocessing.
In the real world, data is not perfect. You need to spend a lot of time on data preprocessing, such as cleaning, scaling, and normalizing. Data preprocessing may be the most important step in the entire Machine Learning process. You may have heard the phrase "Garbage in, garbage out". If the data quality is poor, even the most sophisticated model will not produce a good result. A commonly quoted rule of thumb is that engineers spend around 70 percent of their time processing data.
The sklearn.preprocessing package in scikit-learn provides several common utility functions and transformer classes that change raw feature vectors into a representation more suitable for downstream estimators.
Notice: There are many types of preprocessing. In this lesson, we will cover some of the most commonly used methods. If you want to learn more, just launch the Jupyter file at the end of this lesson.
Scale numerical features
Most of the time, the features in your dataset have very different ranges. This is a problem because many Machine Learning algorithms, such as k-nearest neighbors and k-means, use Euclidean distance as the metric to measure how far apart two data points are. A feature with a large range then dominates the distance, and features with small ranges are effectively ignored. To solve this problem, you need to scale your data so that the features are in the same range. There are many ways to do this.
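To make the distance problem concrete, here is a minimal sketch in plain Python. The two people, their values, and the feature ranges are made up for illustration and are not taken from the lesson's dataset.

```python
import math

# Hypothetical people described by (age in years, income in dollars).
a = (25.0, 50_000.0)
b = (65.0, 51_000.0)

# Raw Euclidean distance: the income gap (1,000) swamps the age gap (40).
raw = math.dist(a, b)
income_share = (b[1] - a[1]) ** 2 / raw ** 2

# Assumed feature ranges, used to bring both features to a comparable scale.
AGE_RANGE = 50.0          # assumption: ages span 20 to 70
INCOME_RANGE = 100_000.0  # assumption: incomes span 0 to 100,000


def rescale(point):
    """Divide each feature by its assumed range."""
    return (point[0] / AGE_RANGE, point[1] / INCOME_RANGE)


scaled = math.dist(rescale(a), rescale(b))

print(f"raw distance: {raw:.1f}")
print(f"income share of squared raw distance: {income_share:.2%}")
print(f"scaled distance: {scaled:.3f}")
```

Before scaling, almost all of the distance comes from income, even though a 40-year age gap is arguably the larger difference. After dividing by the ranges, the age gap is what drives the distance.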
MinMax
Min-max scaling linearly rescales each feature so that its smallest value becomes zero and its largest value becomes one: x_scaled = (x - min) / (max - min).
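The formula can be sketched from scratch in a few lines of Python. The helper name min_max_scale and the sample ages are hypothetical. In scikit-learn, the MinMaxScaler class applies the same formula per feature, with a default feature_range of (0, 1). How a constant feature is handled below is a choice made for this sketch only.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale a list of numbers into feature_range."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: map everything to the lower bound
        # instead of dividing by zero (a choice made for this sketch).
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


ages = [18, 25, 40, 60]  # made-up sample values
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest age maps to 0, the largest maps to 1, and the values in between keep their relative spacing.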
...