
Building Machine Learning Classification Models with Python

Najeeb Ul Hassan
6 min read

The idea of classification permeates every aspect of our daily life. We continuously categorize things based on their traits and characteristics, whether we’re sorting laundry by color or identifying species of birds. We frequently take classification for granted since it is ingrained in our mental processes. However, classifying things accurately and efficiently is essential for making informed decisions and navigating the complexities of real-world problems.

Classification, a key concept on any machine learning journey, assigns a label or category to a given input based on its traits or attributes. In machine learning and statistics, classification is the task of predicting the class or category of a new observation based on its similarity to previously observed examples.

For instance, the classification task in email spam filtering is to decide whether an incoming email is spam. A machine learning algorithm may evaluate the email’s content, sender, and other features to reach this decision. It then categorizes the email as spam or not spam based on patterns learned from previously labeled examples.

Email classifier

Loading the dataset#

Let’s start by building a classification model. We’ll use the Iris dataset as an example; it contains measurements of different attributes of three species of iris flowers. The following code loads the dataset and prints a few randomly sampled rows of the DataFrame.

# Importing libraries
import pandas as pd
# Importing the Iris dataset loader
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
# Organizing the columns of the DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Adding the target column and mapping numeric labels to species names
df['species'] = iris.target
df['species'] = df['species'].replace(to_replace=[0, 1, 2], value=['setosa', 'versicolor', 'virginica'])
print(df.sample(n=5))
  • We import the pandas library to work with the DataFrame.
  • We import the load_iris function from the sklearn.datasets module and use it to load the Iris dataset.
  • We create a DataFrame from the dataset’s feature data, then add a species column that maps the numeric targets to the species names.
  • Finally, we print five randomly selected rows of the DataFrame using the sample() function.

The DataFrame contains N=3 iris species classes, “setosa,” “versicolor,” and “virginica,” with the following four features:

  • Sepal length: The length of the sepal in centimeters.
  • Sepal width: The width of the sepal in centimeters.
  • Petal length: The length of the petal in centimeters.
  • Petal width: The width of the petal in centimeters.

In machine learning, the features are the independent variables, and the target is the dependent variable. Our goal is to build a classification model in Python that predicts a flower’s species using the four features as input.
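To make this split concrete, here is a minimal sketch that separates the features from the target in the DataFrame we built above (the column names come from iris.feature_names):

# Features (independent variables): the four measurement columns
X = df[iris.feature_names]
# Target (dependent variable): the species label we want to predict
y = df['species']
print(X.shape)      # (150, 4)
print(y.unique())   # ['setosa' 'versicolor' 'virginica']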

Classification models with Python#

In this blog, we will focus on logistic regression. Logistic regression is a method that statistically models a binary classification task: it predicts the probability p that the input features belong to a specific class. Although the formulation below is binary, scikit-learn extends logistic regression to multiclass problems, so we can apply it to all three species directly.

Mathematically, we model the logistic regression model as follows:

p = 1 / (1 + e^{-z}).

Here, z is the weighted linear combination of the input features and is calculated as follows:

z = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n.

An optimization algorithm, such as gradient descent, finds the values of the weights w_0, ..., w_n that maximize the likelihood of the observed data.
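To make the two formulas concrete, here is a small sketch that evaluates p for a single sample. The weights below are made up purely for illustration; they are not learned values:

import numpy as np

def sigmoid(z):
    # The logistic function maps any real number to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical weights (w_1 ... w_4) and intercept (w_0) -- illustrative only
w = np.array([0.5, -1.2, 0.8, 2.1])
w0 = -0.3
# One iris sample: sepal length, sepal width, petal length, petal width
x = np.array([5.1, 3.5, 1.4, 0.2])

z = w0 + np.dot(w, x)
p = sigmoid(z)
print(p)  # approximately 0.40 for these made-up weights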

Let’s see how this can be done using Python:

# Importing libraries and dataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Creating the logistic regression model
# (max_iter is raised so the solver converges without warnings)
model = LogisticRegression(max_iter=1000)
# Training the model
model.fit(X_train, y_train)
  • We import LogisticRegression and train_test_split from the sklearn library, along with the metrics module that we will use later for evaluation.

  • We split the features X and target y into training and test datasets using train_test_split(). The training dataset trains the model, while the test dataset evaluates its performance on unseen data.

  • We create a logistic regression model and train the classifier on the training data X_train and y_train. Once training is complete, the learned weights can be inspected, as shown below.
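After fitting, the learned weights from the equations above are available as model attributes. Because scikit-learn trains one weight vector per class for our three-class problem, coef_ has one row per species:

# Inspecting the learned parameters
print(model.coef_)       # shape (3, 4): weights w_1 ... w_4 for each class
print(model.intercept_)  # the intercept w_0 for each class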

Validation#

Now that we have created and trained a classification model using the training data, we will proceed toward evaluating the model.

We will start by defining the confusion matrix. A confusion matrix is an N×N matrix used to evaluate a classification model’s performance; it compares the model’s predictions with the actual labels. For a given class i, the elements of a confusion matrix are defined as follows:

  • True positive (TP): The number of instances that belong to class i and are correctly predicted by the model as class i.

  • False positive (FP): The number of instances that do not belong to class i but are incorrectly predicted by the model as class i.

  • False negative (FN): The number of instances that belong to class i but are incorrectly predicted by the model as a different class.

  • True negative (TN): The number of instances that do not belong to class i and are correctly predicted by the model as not belonging to class i.
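To see the layout, consider a toy binary example. Scikit-learn arranges the matrix with true labels along the rows and predicted labels along the columns, so for two classes it reads [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

# Toy binary example: rows are true labels, columns are predicted labels
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]  -> TN=1, FP=1, FN=1, TP=2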

Let’s see how to calculate the confusion matrix in Python for our Iris dataset:

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Validation
# Evaluating the trained model on the test data
y_pred = model.predict(X_test)

# Computing the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plotting the confusion matrix using Matplotlib
fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=iris.target_names, yticklabels=iris.target_names,
       title='Confusion Matrix',
       ylabel='True label',
       xlabel='Predicted label')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Annotating each cell with its count
fmt = 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], fmt),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()
plt.show()

Here, the model.predict() function generates predictions for the test data, and the confusion_matrix() function compares them against the true labels. The diagonal elements of the resulting matrix are the TP counts for each class.
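As a side note, recent versions of scikit-learn (1.0 and later) can produce the same plot in far fewer lines; here is a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay

# Let scikit-learn compute and draw the confusion matrix in one call
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
                                        display_labels=iris.target_names,
                                        cmap=plt.cm.Blues)
plt.show()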

With the confusion matrix in hand, it is easier to define the following performance metrics:

  • Accuracy: The ratio of the sum of TP and TN to the total number of predictions. Accuracy tells us about the overall correctness of the model’s predictions; a higher accuracy indicates a better-performing model.

  • Precision: The ratio of TP to the sum of TP and FP. Precision matters when the cost of a false positive is high. A higher precision tells us that instances predicted as class i by our model are more likely to actually belong to class i.

  • Recall: The ratio of TP to the sum of TP and FN. Recall matters when the cost of a false negative is high; it measures the model’s ability to find all instances of a class.
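Written out, accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), and recall = TP / (TP + FN). As a sketch, the per-class values can also be read directly off the confusion matrix cm computed above:

# Per-class precision and recall derived directly from the confusion matrix
TP = np.diag(cm)              # correct predictions for each class
FP = cm.sum(axis=0) - TP      # column sums minus the diagonal
FN = cm.sum(axis=1) - TP      # row sums minus the diagonal
print(TP / (TP + FP))         # per-class precision
print(TP / (TP + FN))         # per-class recall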

Now that we understand the performance metrics, let’s calculate our model’s accuracy, precision, and recall on the test dataset.

# Validation
# Evaluating the trained model on the test data
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred, average='macro')
recall = metrics.recall_score(y_test, y_pred, average='macro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Here, we evaluate the model’s performance by comparing the expected output in y_test with the model’s predicted output in y_pred.

  • The accuracy of the model is calculated using the metrics.accuracy_score() function.
  • The precision of the model is calculated using the metrics.precision_score() function.
  • The recall of the model is calculated using the metrics.recall_score() function.

The average parameter passed to precision_score() and recall_score() is set to 'macro'. This computes the corresponding value for each class individually and then takes the unweighted mean.
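If you want the per-class values instead of the average, passing average=None returns one score per class:

# Per-class scores: one value each for setosa, versicolor, and virginica
print(metrics.precision_score(y_test, y_pred, average=None))
print(metrics.recall_score(y_test, y_pred, average=None))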

On this test split, all three metrics come out at or near 1.0, showing that our classification model fits well and makes accurate predictions on unseen data.

This blog has briefly introduced a logistic regression classification model with Python. We encourage you to explore other classification models like the random forest classifier, support vector machine (SVM), K-nearest neighbors (KNN), and decision trees to build accurate and robust models. Additionally, you can check out the following courses on Educative:

A Practical Guide to Machine Learning with Python


This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and overviews of specific algorithms are interspersed throughout the course.

72hrs 30mins · Beginner · 108 Playgrounds · 12 Quizzes

Hands-on Machine Learning with Scikit-Learn


Scikit-Learn is a powerful library that provides a wide range of supervised and unsupervised learning algorithms. If you’re serious about a career in machine learning, then scikit-learn is a must-know. In this course, you will start by learning about the various built-in datasets that scikit-learn offers, such as iris and mnist. You will then learn about feature engineering, and more specifically, feature selection, feature extraction, and dimensionality reduction. In the latter half of the course, you will dive into linear and logistic regression, where you’ll work through a few challenges to test your understanding. Lastly, you will focus on unsupervised learning and deep learning, where you’ll get into k-means clustering and neural networks. By the end of this course, you will have a great new skill to add to your resume, and you’ll be ready to start working on your own projects that utilize scikit-learn.

5hrs · Intermediate · 5 Challenges · 2 Quizzes

Machine Learning with Python Libraries


Machine learning enables software applications to generate increasingly accurate predictions. It is a branch of artificial intelligence used worldwide and offers high-paying career opportunities. This path provides a hands-on guide to multiple Python libraries that play an important role in machine learning. It also teaches you about neural networks, PyTorch Tensor, PyCaret, and GAN. By the end of this module, you’ll have hands-on experience using Python libraries to automate your applications.

53hrs · Beginner · 56 Challenges · 62 Quizzes

If you want to learn how to build regression models with Python, we encourage you to check our blog on the topic.

Frequently Asked Questions

What is classification model building in ML?

A classification model is a form of supervised machine learning that assigns input data to specific categories based on its features. The model learns from labeled data, which includes both the input features and their corresponding labels. By training on this data, the model gains the ability to categorize new inputs according to the predefined categories, making it a valuable tool for automating data classification.


  
