Supervised Learning with Sklearn

Get hands-on experience with data science basics using sklearn.

What is sklearn?

The sklearn library, better known as scikit-learn, is one of the most useful, efficient, and robust libraries for machine learning in Python. It's built on top of NumPy, SciPy, and Matplotlib, and it provides implementations of most popular machine learning algorithms. In this lesson, we'll focus on the supervised learning pipeline.


Datasets

There are a few toy datasets available in sklearn that don’t require downloading from any external source. The code for loading different datasets is consistent:

from sklearn import datasets
X, y = datasets.load_name(return_X_y=True)

Here, name in the datasets.load_name() call is a placeholder for the dataset's name. For example, the regression dataset for diabetes analysis is named diabetes and can be loaded as:

X, y = datasets.load_diabetes(return_X_y=True)

Let’s check the total number of samples in the diabetes dataset along with its feature count:

from sklearn import datasets

# Load the diabetes dataset as (features, target) NumPy arrays
X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

Toy datasets

The list of the available toy datasets is as follows:

Name           | Type
---------------|--------------------------
boston         | Regression
iris           | Classification
diabetes       | Regression
digits         | Classification
linnerud       | Regression (multitarget)
wine           | Classification
breast_cancer  | Classification

Note: The boston dataset (load_boston) was deprecated and removed in scikit-learn 1.2 due to ethical concerns about the dataset.
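All of these follow the same load_<name> calling convention. As a quick check, the following sketch loads a few of them and prints their shapes:

```python
from sklearn import datasets

# Each toy dataset has a corresponding load_<name> function
# with the same signature, so they can be iterated over uniformly.
for loader in (datasets.load_iris, datasets.load_wine, datasets.load_digits):
    X, y = loader(return_X_y=True)
    print(f'{loader.__name__}: {X.shape[0]} samples, {X.shape[1]} features')
```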

Large datasets

There are several large datasets available from external sources. For example, a popular dataset often used for face recognition is lfw_people and can be downloaded using the following code:

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)

The number of face images varies from person to person; the min_faces_per_person parameter restricts the dataset to people who have at least that many images. Other filtering parameters are available as well.

Note: This dataset has to be downloaded and, therefore, might take a while.

from sklearn import datasets

# Fetch the LFW faces dataset, keeping only people with at least 70 images
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

Synthetic datasets

For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created using calls to make_regression and make_classification, respectively.

import numpy as np
from sklearn import datasets

# Synthetic regression dataset: half of the features carry signal
n_samples = 1000
n_features = 10
n_informative = n_features // 2
X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

# Synthetic classification dataset with 3 classes
n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes,
                                    n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')

Here is the explanation for the code above:

  • The make_regression call generates a dataset for regression tasks with 1000 samples, 10 features, and 5 informative features (half of the total features).

  • The make_classification call generates a dataset for classification tasks with 1000 samples, 10 features, 3 classes, 1 cluster per class, and 5 informative features (half of the total features). After each call, we print the dimensions of the generated dataset, including the number of samples, features, and classes (for the classification dataset).

Note: For multilabel classification, see make_multilabel_classification; for multitarget regression, use the n_targets parameter in the make_regression call.
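As a quick sketch of those two variants (the sample and feature counts below are arbitrary):

```python
import numpy as np
from sklearn import datasets

# Multitarget regression: n_targets gives y one column per target
X, y = datasets.make_regression(n_samples=100, n_features=10, n_targets=3)
print(y.shape)  # (100, 3)

# Multilabel classification: y is a binary indicator matrix,
# one column per possible label
X, y = datasets.make_multilabel_classification(n_samples=100, n_features=10,
                                               n_classes=4)
print(y.shape)  # (100, 4)
```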

Feature transformation

Feature transformation is the process of converting a dataset’s original features or variables into a new set of features using various mathematical functions. The aim of feature transformation is to improve the performance of machine learning algorithms by transforming the data to make it easier for algorithms to learn patterns and make accurate predictions.
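As a small illustration, a log transform is one such mathematical function: it compresses a heavily skewed feature into a more symmetric range that many algorithms handle better (the data below is synthetic and purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=100.0, size=1000)  # a right-skewed raw feature
x_log = np.log1p(x)                          # log(1 + x), defined at x = 0

print(f'raw:   max = {x.max():.1f}')
print(f'log1p: max = {x_log.max():.2f}')     # much smaller dynamic range
```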

Standardization

Data preprocessing often improves performance. Although there are several ways to preprocess the data, we’ll discuss one of the most popular, that is, StandardScaler, which scales each feature (column) so that ...