Supervised Learning with Sklearn

Get hands-on experience with data science basics using sklearn.

What is sklearn?

The sklearn library, better known as scikit-learn, is one of the most useful, efficient, and robust libraries for machine learning in Python. It's built on top of NumPy, SciPy, and Matplotlib, and it provides implementations of most popular machine learning algorithms. In this lesson, we'll focus on the supervised learning pipeline.


Datasets

There are a few toy datasets available in sklearn that don’t require downloading from any external source. The code for loading different datasets is consistent:

from sklearn import datasets
X, y = datasets.load_name(return_X_y=True)

Here, name in the datasets.load_name() call is a placeholder for the dataset's name. For example, the regression dataset for diabetes analysis is named diabetes and can be loaded as:

X, y = datasets.load_diabetes(return_X_y=True)

Let’s check the total number of samples in the diabetes dataset along with its feature count:

from sklearn import datasets

# Load the diabetes dataset as (features, target) NumPy arrays
X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

Toy datasets

The list of the available toy datasets is as follows:

Name           | Type
---------------|--------------------------
boston         | Regression
iris           | Classification
diabetes       | Regression
digits         | Classification
linnerud       | Regression (multitarget)
wine           | Classification
breast_cancer  | Classification

Note: The boston dataset (load_boston) was deprecated and removed in scikit-learn 1.2 due to ethical concerns about the dataset.
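All of these follow the same load_<name> calling convention. As a quick check, the following sketch loads a few of them and prints their shapes:

```python
from sklearn import datasets

# Each toy dataset has a corresponding load_<name> function
# with the same signature, so they can be iterated over uniformly.
for loader in (datasets.load_iris, datasets.load_wine, datasets.load_digits):
    X, y = loader(return_X_y=True)
    print(f'{loader.__name__}: {X.shape[0]} samples, {X.shape[1]} features')
```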

Large datasets

There are several large datasets available from external sources. For example, a popular dataset often used for face recognition is lfw_people and can be downloaded using the following code:

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)

The number of face images varies from person to person; the min_faces_per_person parameter restricts the dataset to people who have at least that many images. Other filtering parameters are available as well.

Note: This dataset has to be downloaded and, therefore, might take a while.

from sklearn import datasets

# Fetch the LFW faces dataset, keeping only people with at least 70 images
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

Synthetic datasets

For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created using calls to make_regression and make_classification, respectively.

import numpy as np
from sklearn import datasets

# Synthetic regression dataset: half of the features carry signal
n_samples = 1000
n_features = 10
n_informative = n_features // 2
X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

# Synthetic classification dataset with 3 classes
n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes,
                                    n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')

Here is the explanation for the code above:

  • The make_regression call generates a dataset for regression tasks with 1000 samples, 10 features, and 5 informative features (half of the total features).

  • The make_classification call generates a dataset for classification tasks with 1000 samples, 10 features, 3 classes, 1 cluster per class, and 5 informative features (half of the total features). After each call, we print the dimensions of the generated dataset, including the number of samples, features, and classes (for the classification dataset).

Note: For multilabel classification, see make_multilabel_classification; for multitarget regression, use the n_targets parameter in the make_regression call.
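As a quick sketch of those two variants (the sample and feature counts below are arbitrary):

```python
import numpy as np
from sklearn import datasets

# Multitarget regression: n_targets gives y one column per target
X, y = datasets.make_regression(n_samples=100, n_features=10, n_targets=3)
print(y.shape)  # (100, 3)

# Multilabel classification: y is a binary indicator matrix,
# one column per possible label
X, y = datasets.make_multilabel_classification(n_samples=100, n_features=10,
                                               n_classes=4)
print(y.shape)  # (100, 4)
```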

Feature transformation

Feature transformation is the process of converting a dataset’s original features or variables into a new set of features using various mathematical functions. The aim of feature transformation is to improve the performance of machine learning algorithms by transforming the data to make it easier for algorithms to learn patterns and make accurate predictions.
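As a small illustration, a log transform is one such mathematical function: it compresses a heavily skewed feature into a more symmetric range that many algorithms handle better (the data below is synthetic and purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=100.0, size=1000)  # a right-skewed raw feature
x_log = np.log1p(x)                          # log(1 + x), defined at x = 0

print(f'raw:   max = {x.max():.1f}')
print(f'log1p: max = {x_log.max():.2f}')     # much smaller dynamic range
```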

Standardization

Data preprocessing often improves performance. Although there are several ways to preprocess the data, we’ll discuss one of the most popular, that is, StandardScaler, which scales each feature (column) so that ...