Supervised Learning with Sklearn
Get hands-on experience with data science basics using sklearn.
What is sklearn?
The `sklearn` library, better known as scikit-learn, is one of the most useful, efficient, and robust machine learning libraries in Python. It's built on top of `numpy`, `scipy`, and `matplotlib`, and it provides tools for almost all popular machine learning algorithms. In this lesson, however, we'll focus on the supervised learning pipeline.
Datasets
A few toy datasets are bundled with `sklearn` and don't require downloading from any external source. The code for loading different datasets is consistent:
```python
from sklearn import datasets

X, y = datasets.load_name(return_X_y=True)
```
Here, `name` in the `datasets.load_name()` call is the name of the dataset. For example, there's a regression dataset for diabetes analysis named `diabetes`, which can be loaded as:
```python
X, y = datasets.load_diabetes(return_X_y=True)
```
Let’s check the total number of samples in the diabetes dataset along with its feature count:
```python
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
```
Toy datasets
The list of the available toy datasets is as follows:
| Name | Type |
| --- | --- |
| `boston` | Regression |
| `iris` | Classification |
| `diabetes` | Regression |
| `digits` | Classification |
| `linnerud` | Regression (multitarget) |
| `wine` | Classification |
| `breast_cancer` | Classification |

Note: The `boston` dataset was removed in scikit-learn 1.2 because of ethical concerns about its features, so it's only available in older versions of the library.
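Every entry in the table above loads with the same `load_*` pattern. As a quick sketch (the loaders shown are part of sklearn's documented API, but the choice of which three to print is ours):

```python
from sklearn import datasets

# Each toy dataset ships with sklearn, so no download is needed.
for loader in (datasets.load_iris, datasets.load_wine, datasets.load_breast_cancer):
    X, y = loader(return_X_y=True)
    print(f'{loader.__name__}: {X.shape[0]} samples, {X.shape[1]} features')
```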
Large datasets
There are several large datasets available from external sources. For example, a popular dataset often used for face recognition is `lfw_people`, which can be downloaded using the following code:

```python
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
```
Different people have different numbers of face images in the dataset, and the `min_faces_per_person` parameter keeps only the people who have at least that many images. There are other filtering parameters as well.
Note: This dataset has to be downloaded and, therefore, might take a while.
```python
import numpy as np
from sklearn import datasets

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
```
Synthetic datasets
For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created with calls to `make_regression` and `make_classification`, respectively.
```python
import numpy as np
from sklearn import datasets

n_samples = 1000
n_features = 10
n_informative = n_features // 2

X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes,
                                    n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')
```
Here is the explanation for the code above:

- The `make_regression` call generates a dataset for regression tasks with `1000` samples, `10` features, and `5` informative features (half of the total features).
- The `make_classification` call generates a dataset for classification tasks with `1000` samples, `10` features, `3` classes, `1` cluster per class, and `5` informative features (half of the total features). After each generation, we print the dimensions of the dataset, including the number of samples, features, and classes (for the classification dataset).
Note: For multilabel classification, see `make_multilabel_classification`; for multi-target regression, use the `n_targets` parameter in the `make_regression` call.
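Synthetic datasets make it easy to smoke-test a complete supervised pipeline. A minimal sketch (the model choice, `LogisticRegression`, and the train/test split are our own additions, not prescribed by the lesson):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a 3-class dataset and hold out 25% of it for evaluation.
X, y = datasets.make_classification(n_samples=1000, n_features=10,
                                    n_informative=5, n_classes=3,
                                    n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training split and score on the held-out split.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f'Test accuracy = {clf.score(X_test, y_test):.2f}')
```

Because the informative features dominate, a linear model should score well above the 1/3 chance level for three classes.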
Feature transformation
Feature transformation is the process of converting a dataset's original features or variables into a new set of features using various mathematical functions. The aim is to make patterns in the data easier for machine learning algorithms to learn, improving the accuracy of their predictions.
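For instance, a log transform can compress a heavily skewed feature into a more even range. A minimal sketch using sklearn's `FunctionTransformer` (the sample values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A skewed feature: a few very large values dominate the scale.
X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# log1p(x) = log(1 + x) keeps zero-valued entries valid and compresses the range.
log_transform = FunctionTransformer(np.log1p)
print(log_transform.fit_transform(X).ravel())
```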
Standardization
Data preprocessing often improves performance. Although there are several ways to preprocess the data, we'll discuss one of the most popular: `StandardScaler`, which scales each feature (column) so that ...
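A minimal sketch of `StandardScaler` in action (it centers each column to zero mean and scales it to unit variance; the tiny example matrix is ours):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has (approximately) zero mean and unit variance.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```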