What is scikit-learn?

Machine learning has changed many industries by enabling computers to learn from data and make predictions or decisions. Python, a versatile programming language, offers numerous libraries for machine learning tasks. One library that stands out is scikit-learn.

In this Answer, we will explore scikit-learn's features, its importance in the machine learning ecosystem, and how to leverage its capabilities through practical code examples.

scikit-learn

scikit-learn, popularly known as sklearn, is an open-source Python library that provides a comprehensive set of machine learning algorithms and tools for data preprocessing, classification, model selection, and more. It is built upon other fundamental scientific libraries, including NumPy, SciPy, and matplotlib, making it a powerful and user-friendly machine learning toolkit.

Key features of scikit-learn

scikit-learn offers a wide set of functionalities for different machine learning tasks. Some of the key features include:

  • Easy-to-use API: Provides a user-friendly and consistent interface for implementing machine learning models.

  • Broad algorithm selection: Offers a diverse range of machine learning algorithms for various tasks such as classification, regression, clustering, and more.

  • Preprocessing and feature extraction: Provides tools for data preprocessing, handling missing values, scaling features, and extracting relevant features.

  • Model evaluation and validation: Supports model evaluation with metrics and techniques for cross-validation and hyperparameter tuning. Hyperparameters are parameters that are set before the learning process starts. They function as controls that can be adjusted to improve how the model learns.
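To illustrate the consistent interface mentioned in the list above, here is a minimal sketch that trains three different classifiers with the same fit() and score() calls. The choice of estimators, the max_iter=1000 setting, and the variable names are assumptions made for illustration, and the scores are measured on the training data only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different algorithms that share one interface: fit() and score()
estimators = {
    "logistic_regression": LogisticRegression(max_iter=1000),  # max_iter is an assumption
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

training_accuracy = {}
for name, estimator in estimators.items():
    estimator.fit(X, y)                               # same call for every model
    training_accuracy[name] = estimator.score(X, y)   # accuracy on the training data

for name, accuracy in training_accuracy.items():
    print(f"{name}: {accuracy:.3f}")
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.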

Applications of scikit-learn

Here are some common applications of scikit-learn:

[Figure: Applications of scikit-learn]

Getting started with scikit-learn

Getting started with scikit-learn is relatively straightforward. Follow the steps below to begin using scikit-learn in machine learning projects:

Step 1: Install scikit-learn

First, we must ensure that Python is installed on the system. Older scikit-learn releases supported Python 3.6, but recent releases require a newer Python 3 version, so check the installation guide for the current minimum. We can install scikit-learn using pip, the package installer for Python, by running the following command in the terminal:

pip install scikit-learn

Step 2: Import the scikit-learn library

In the Python script or notebook, import scikit-learn as shown below:

import sklearn
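A quick way to confirm that the installation and import worked is to print the installed version. This is a small check, not a required step:

```python
import sklearn

# Prints the installed scikit-learn version string
installed_version = sklearn.__version__
print(installed_version)
```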

Step 3: Load a dataset

scikit-learn provides various datasets for experimentation. We can load sample datasets or import our own dataset using pandas or other data manipulation libraries. For example, we will load the iris dataset as shown below:

from sklearn.datasets import load_iris
irisDataset = load_iris()
X = irisDataset.data # Features
y = irisDataset.target # Labels
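Before modeling, it can help to look at what was loaded. The sketch below inspects the shape of the feature matrix and the names stored with the dataset. The variable name iris is chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)      # (number of samples, number of features)
print(iris.feature_names)   # column names for the feature matrix
print(iris.target_names)    # class names that the integer labels refer to
```

The Iris dataset contains 150 samples with 4 features each, and the labels 0, 1, and 2 correspond to the three species listed in target_names.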

Step 4: Choose a model and split the data

Next, we will select a machine learning model that suits our task, such as classification, regression, or clustering. Partitioning the data into training and testing sets allows us to assess the model's performance. scikit-learn has a train_test_split() function for this purpose. Here's an example:

from sklearn.model_selection import train_test_split
X_data_train, X_data_test, y_data_train, y_data_test = train_test_split(X, y, test_size=0.3, random_state=39)
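To see what the split produces, the sketch below prints the sizes of the resulting sets. It also passes stratify=y, an optional argument not used in the original example, which keeps the class proportions the same in both sets. Treat that choice as an assumption for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an optional addition that preserves class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=39, stratify=y
)

print(X_train.shape, X_test.shape)   # 70% of the rows for training, 30% for testing
test_class_counts = Counter(y_test.tolist())
print(test_class_counts)             # number of test samples per class
```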

Step 5: Train and evaluate the model

We will instantiate the selected model and train it using the provided training data. Subsequently, we will utilize the trained model to generate predictions on the test data. Finally, we will evaluate the model's performance using appropriate metrics. Here's a simple example using logistic regression for classification:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create and train the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_data_train, y_data_train)
# Make predictions on the test set
y_predict_data = lr_model.predict(X_data_test)
# Calculate the accuracy of the model
lr_model_accuracy = accuracy_score(y_data_test, y_predict_data)
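Accuracy is a single number, so it can hide which classes the model confuses. As a hedged sketch of a more detailed evaluation, the example below uses confusion_matrix and classification_report from sklearn.metrics. The max_iter=1000 setting and the shorter variable names are assumptions, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=39
)

model = LogisticRegression(max_iter=1000)  # max_iter raised as an assumption
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred, target_names=list(iris.target_names)))
```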

Step 6: Refine and fine-tune your model

We can experiment with different models, hyperparameters, and feature engineering techniques to improve the model's performance. scikit-learn offers utilities for model selection, hyperparameter tuning, and feature preprocessing to help refine the models.
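One possible way to combine these utilities is a pipeline that scales the features and a grid search that tries several values of a hyperparameter with cross-validation. The sketch below is one option among many. The candidate values for C, the 5-fold setting, and max_iter=1000 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=39
)

# Preprocessing and model chained together, so scaling is fit on training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)            # best value of C found by the search
print(search.best_score_)             # mean cross-validated accuracy for that value
print(search.score(X_test, y_test))   # accuracy of the refit best model on the test set
```

Keeping the test set out of the search means the final score is measured on data the tuning process never saw.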

Code example

Here's an executable code example that implements the steps above:

import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Install scikit-learn
# pip install scikit-learn

# Step 2: Import the scikit-learn library
import sklearn

# Step 3: Load a dataset
irisDataset = load_iris()
X = irisDataset.data # Features
y = irisDataset.target # Labels

# Step 4: Choose a model and split the data
X_data_train, X_data_test, y_data_train, y_data_test = train_test_split(X, y, test_size=0.3, random_state=39)

# Step 5: Train and evaluate the model
# Create and train the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_data_train, y_data_train)

# Make predictions on the test set
y_predict_data = lr_model.predict(X_data_test)

# Calculate the accuracy of the model
lr_model_accuracy = accuracy_score(y_data_test, y_predict_data)
print(lr_model_accuracy)

Code explanation

Here’s the explanation for each part of the code:

  • Lines 1–5: Import the necessary libraries from scikit-learn. load_iris is used to load the Iris dataset, train_test_split for splitting the data into training and testing sets, LogisticRegression is the chosen model, and accuracy_score for calculating the accuracy of the model.

  • Line 11: Import the scikit-learn library.

  • Line 14: The Iris dataset is loaded using load_iris() and stored in the irisDataset variable.

  • Lines 15–16: Separate the features (X) and labels (y) from the dataset. The features are stored in X, and the labels are stored in y.

  • Line 19: The data is split into training and testing sets using train_test_split(). test_size=0.3 indicates that 30% of the data will be used for testing, and random_state=39 sets a specific random seed for reproducibility, that is, the ability to obtain consistent and identical results when an experiment is rerun using the same data, code, and settings.

  • Line 23: A logistic regression model is created by instantiating the LogisticRegression() class.

  • Line 24: Train the logistic regression model using fit(). This step involves finding the optimal parameters for the model based on the training data.

  • Line 27: Predictions are made on the testing set using predict(). The model predicts the labels for the testing set based on the learned parameters.

  • Line 30: The accuracy of our model is calculated by comparing the predicted labels (y_predict_data) with the actual labels (y_data_test) using the accuracy_score() function.

  • Line 31: Print the accuracy of the model.
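Once trained, the model can also classify measurements it has not seen before. The sketch below is an illustrative extension of the example: the new_flower values, the max_iter=1000 setting, and the variable names are assumptions, and predict_proba() is used to show how confident the model is in each class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=39
)

model = LogisticRegression(max_iter=1000)  # max_iter raised as an assumption
model.fit(X_train, y_train)

# Illustrative measurement in cm: sepal length, sepal width, petal length, petal width
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

predicted_class = model.predict(new_flower)[0]             # integer class label
class_probabilities = model.predict_proba(new_flower)[0]   # one probability per class

print(iris.target_names[predicted_class])
for name, probability in zip(iris.target_names, class_probabilities):
    print(f"{name}: {probability:.3f}")
```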

Copyright ©2025 Educative, Inc. All rights reserved