The idea of classification permeates every aspect of our daily life. We categorize things continuously depending on their traits and characteristics, whether we’re doing laundry by color or identifying species of birds. We frequently take classification for granted since it is ingrained in our mental processes. However, classifying things accurately and efficiently is essential for making informed decisions and navigating the complexities of real-world problems.
Classification, a key concept in machine learning and statistics, is the task of predicting the class or category of a new observation based on its similarity to previously observed, labeled examples.
For instance, the classification task in email spam filtering is to decide whether an incoming email is spam. A machine learning algorithm may evaluate the email’s content, sender, and other features to reach this conclusion, and then categorize the email as spam or not based on patterns it discovered from previously labeled examples.
Let’s start by building a classification model. We’ll use the Iris dataset, which contains measurements of different attributes of three species of iris flowers. The following code loads the dataset and displays a few random rows of the DataFrame.
# Importing libraries
import pandas as pd

# Importing the Iris dataset
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Organizing the columns of the DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].replace(to_replace=[0, 1, 2], value=['setosa', 'versicolor', 'virginica'])

# Displaying five random rows
print(df.sample(n=5))
Here, we import the load_iris function from the sklearn.datasets module, which allows us to load the Iris dataset on line 8. The last line prints five random rows of the DataFrame using the sample() function. The DataFrame contains the iris species classes, “setosa,” “versicolor,” and “virginica,” along with the following four features: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).
In machine learning, features are the independent variables, and the target is the dependent variable. Our goal is to build a classification model in Python that predicts the flower’s species using the four features as input.
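As a quick illustration, here’s a minimal sketch, reusing the load_iris call from above, of how the features and target are arranged (the shapes in the comments are the actual dimensions of the dataset):

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # features: shape (150, 4) -- one row per flower, one column per measurement
y = iris.target  # target: shape (150,) -- values 0, 1, 2 encode the three species

print(X.shape, y.shape)  # (150, 4) (150,)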
In this blog, we will focus on logistic regression, a method that statistically models a binary classification task: it predicts the probability that the input features belong to a specific class. Although logistic regression is inherently binary, scikit-learn extends it to multiclass problems, such as our three iris species, using a multinomial (or one-vs-rest) scheme.
Mathematically, we model logistic regression as follows:

P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}

Here, $z$ defines the weighted linear combination of the input features and is calculated as follows:

z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
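For intuition, here is a minimal NumPy sketch of the sigmoid function $\sigma(z)$ (the function name is ours, purely illustrative):

import numpy as np

def sigmoid(z):
    # Maps any real-valued z to a probability in the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 -- right on the decision boundary
print(sigmoid(3.0))   # ~0.95 -- confidently in the positive class
print(sigmoid(-3.0))  # ~0.05 -- confidently in the negative class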
An optimization algorithm, such as gradient descent, finds the values of the weights that maximize the likelihood of the observed data.
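scikit-learn performs this optimization for us, but purely for intuition, a hand-rolled gradient-descent sketch for binary logistic regression might look like the following (fit_logistic is a hypothetical helper; the learning rate and epoch count are illustrative, not tuned):

import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=1000):
    # Gradient descent on the negative log-likelihood; y holds binary labels in {0, 1}
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # weights w_1 ... w_n
    b = 0.0                   # bias term w_0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(X @ w + b)))     # predicted probabilities
        w -= lr * (X.T @ (p - y)) / n_samples  # gradient step for the weights
        b -= lr * np.mean(p - y)               # gradient step for the bias
    return w, b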
Let’s see how this can be done using Python:
# Importing libraries and dataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating the logistic regression model
model = LogisticRegression(max_iter=1000)  # higher max_iter avoids a convergence warning

# Training the model
model.fit(X_train, y_train)
Lines 4–5: We import LogisticRegression and train_test_split from the sklearn library. (Line 6 imports the metrics module, which we’ll use later for evaluation.)
Line 14: We split the features X and target y into training and test datasets. The training dataset trains the model, while the test dataset evaluates its performance.
Lines 17–20: We create a logistic regression model and train the classifier on the training data X_train and y_train.
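With test_size=0.2, 20% of the 150 samples (30 rows) are held out for testing, which we can verify by printing the shapes of the split arrays:

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)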
Now that we have created and trained a classification model using the training data, we can proceed to evaluate the model.
We will start by defining the confusion matrix. A confusion matrix is a table used to evaluate a classification model’s performance by comparing the model’s predictions against the actual labels. The elements of a confusion matrix are defined as follows:
True positive (TP): The number of instances that belong to a species of class $c$ and are correctly predicted by the model as class $c$.
False positive (FP): The number of instances that do not belong to class $c$ but are incorrectly predicted by the model as class $c$.
False negative (FN): The number of instances that belong to class $c$ but are incorrectly predicted by the model as a different class.
True negative (TN): The number of instances that do not belong to class $c$ and are correctly predicted by the model as not belonging to class $c$.
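To make these definitions concrete, here is a small sketch on hand-made binary labels (the values are purely illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]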
Let’s see how to calculate the confusion matrix in Python for our Iris dataset:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Validation
# Evaluating the trained model on test data
y_pred = model.predict(X_test)

# Computing the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plotting the confusion matrix using Matplotlib
fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=iris.target_names, yticklabels=iris.target_names,
       title='Confusion Matrix',
       ylabel='True label',
       xlabel='Predicted label')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Annotating each cell with its count
fmt = 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], fmt),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()

# Saving the figure
plt.savefig('output/graph.png')
Here, the model.predict() function is applied to the test data on line 6, and the confusion matrix is calculated on line 9 using the confusion_matrix() function. The diagonal elements of the confusion matrix show the TP counts for each class.
With the confusion matrix in hand, it’s easier to define the following performance metrics.
Accuracy: The ratio of the sum of TP and TN to the total number of predictions. Accuracy tells us about the overall correctness of the model’s predictions; a higher accuracy indicates a better-performing model.
Precision: The ratio of TP to the sum of TP and FP. Precision is important when the cost of a false positive is high. A higher precision tells us that the instances predicted as class $c$ by our model are more likely to truly belong to class $c$.
Recall: The ratio of TP to the sum of TP and FN. Recall is important when the cost of a false negative is high, and it focuses on the model’s ability to avoid false negatives.
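In formula terms, the three metrics just defined are:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}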
Now that we understand the performance metrics, let’s calculate our model’s accuracy, precision, and recall on the test dataset.
# Validation
# Evaluating the trained model on test data
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred, average='macro')
recall = metrics.recall_score(y_test, y_pred, average='macro')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
Here, we evaluate the model’s performance using the expected output in y_test and the model’s predicted output in y_pred.
Accuracy is calculated using the metrics.accuracy_score function, precision using the metrics.precision_score function, and recall using the metrics.recall_score function. The average parameter in calculating precision and recall on lines 7–8 is set to macro, which computes the corresponding value for each class individually and then averages them.
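To see what macro averaging does, we can compute the per-class values ourselves and average them; here is a short sketch reusing y_test and y_pred from the code above:

import numpy as np
from sklearn import metrics

# One precision value per species
per_class_precision = metrics.precision_score(y_test, y_pred, average=None)
print(per_class_precision)

# 'macro' is simply the unweighted mean of the per-class values
print(np.mean(per_class_precision))
print(metrics.precision_score(y_test, y_pred, average='macro'))  # same number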
The results show that our classification model fits the data well and makes accurate predictions on unseen examples.
This blog has briefly introduced a logistic regression classification model with Python. We encourage you to explore other classification models like the random forest classifier, support vector machine (SVM), K-nearest neighbors (KNN), and decision trees to build accurate and robust models. Additionally, you can check out the following courses on Educative:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, and decision trees. If you need a refresher, key machine learning concepts are summarized for you, and overviews of specific algorithms are interspersed throughout the course.
Hands-on Machine Learning with Scikit-Learn
Scikit-learn is a powerful library that provides a wide range of supervised and unsupervised learning algorithms. If you’re serious about a career in machine learning, then scikit-learn is a must-know. In this course, you will start by exploring the built-in datasets that scikit-learn offers, such as Iris and MNIST. You will then learn about feature engineering, specifically feature selection, feature extraction, and dimensionality reduction. In the latter half of the course, you will dive into linear and logistic regression, where you’ll work through a few challenges to test your understanding. Lastly, you will focus on unsupervised learning and deep learning, covering k-means clustering and neural networks. By the end of this course, you will have a great new skill to add to your resume, and you’ll be ready to start working on your own projects that utilize scikit-learn.
Machine Learning with Python Libraries
Machine learning enables software applications to generate more accurate predictions. It is a branch of artificial intelligence used worldwide and offers high-paying careers. This path provides a hands-on guide to multiple Python libraries that play an important role in machine learning, and it also covers neural networks, PyTorch tensors, PyCaret, and GANs. By the end of this module, you’ll have hands-on experience using Python libraries to automate your applications.
If you want to learn how to build regression models with Python, we encourage you to check out our blog on the topic.