Preparing Data with Scikit-learn

Learn how to clean and prepare data for analysis using the scikit-learn package.

scikit-learn is one of the most widely used and comprehensive machine learning libraries in Python. It integrates well with the rest of the data science ecosystem, including NumPy, pandas, and matplotlib. We will use it to model our data and to do some preprocessing as well.

Before modeling, we need to tackle two issues: missing values and data scaling. Let’s look at a simple example of each, and then apply both techniques to our dataset.

Let’s start with missing values.

Handling missing values

Models need data, and most of them can’t work with a set of numbers that contains missing values. Our dataset has many such gaps, so we need to decide what to do with them.
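To make the problem concrete, here is a minimal sketch, assuming a made-up toy array and LinearRegression as a stand-in model (neither comes from this lesson’s dataset). It shows an estimator refusing to fit data that contains a missing value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration only; np.nan marks the missing entry.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as error:
    # scikit-learn validates its input and rejects NaN for this estimator.
    fit_failed = True
    print("Fitting failed:", error)
```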

There are several options, and the right choice depends on the application as well as the nature of the data, but we won’t get into those details. For simplicity, we will make a generic choice: replacing missing data with suitable values, a process known as imputation.

Let’s explore how we can impute missing values with a simple example, as follows:
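The sketch below is one possible version of such an example, assuming mean imputation with scikit-learn’s SimpleImputer. The small array is made up for illustration and is not the dataset used in this lesson; other strategies, such as "median" or "most_frequent", may suit real data better.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data for illustration only; np.nan marks the missing entries.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

# Replace each missing value with the mean of the observed values
# in the same column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column means learned during fit
print(X_imputed)            # the same array with the gaps filled in
```

Here the first column’s observed values (1, 3, 5) average to 3 and the second column’s (10, 20, 60) average to 30, so those are the values used to fill the two gaps.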
