Preparing Data with Scikit-learn

Learn how to clean and prepare data for analysis using the scikit-learn package.

We'll cover the following

Handling missing values
Scaling data with scikit-learn
- Standard scaling

scikit-learn is one of the most widely used and comprehensive machine learning libraries in Python. It integrates well with the rest of the data science ecosystem, including NumPy, pandas, and matplotlib. We will use it to model our data and for some preprocessing as well.

Before modeling, we need to tackle two issues: missing values and data scaling. Let’s look at a simple example of each, and then apply both to our dataset.

Let’s start with missing values.

Handling missing values

Models need data, and most of them can’t handle a set of numbers that contains missing values. In such cases (and there are many in our dataset), we need to decide what to do with those missing values.
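To make the problem concrete, here is a minimal sketch of what happens when a model is fitted on data with a gap in it. The toy arrays and the choice of `LinearRegression` are assumptions made for illustration only; they are not taken from our dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration); one feature value is missing.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn rejects the NaN while validating the input.
    fit_failed = True
    print(f"Fit failed: {err}")
```

The estimator refuses to train rather than guessing what the missing value should be, which is why the decision has to be made beforehand.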

There are several options, and the right choice depends on the application as well as the nature of the data, but we won’t get into those details. For simplicity, we will make a generic choice of replacing missing data with suitable values.

Let’s explore how we can impute missing values with a simple example, as follows:
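A minimal sketch of imputation, assuming scikit-learn's `SimpleImputer` with the `mean` strategy. The small array below is made up for illustration, and the mean is only one of several possible replacement values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical values) with one missing entry per column.
data = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [6.0, 50.0],
])

# Replace each missing value with the mean of the known values in its column.
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(data)

print(imputer.statistics_)  # per-column means used as fill values
print(imputed)
```

Here the first column's known values (1, 2, 6) average to 3, and the second column's (10, 30, 50) average to 30, so those are the values that fill the gaps. Other strategies, such as `"median"` or `"most_frequent"`, can be chosen through the same `strategy` parameter.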
