
Introduction to SVM

Gain an understanding of SVM, and the concepts of signed and unsigned distance.

Support vector machine (SVM) is a popular and powerful supervised learning algorithm for classification and regression problems. It works by finding the best possible boundary between different classes of data points. In this lesson, we’ll cover the basic concepts and principles behind SVMs and see how they can be applied in practice.

What is SVM?

Suppose a person works for a bank, and their job is to decide whether to approve or reject loan applications based on the applicant’s financial history. They have a loan dataset with various features such as credit score, income, and debt-to-income ratio, along with past approval and rejection records. The task is to use SVM to build a predictive model for future loan applications.

First, they map each loan application into a feature space based on its features and label each loan application as either “approved” or “rejected,” which creates two different classes in the dataset. Next, they try to find a decision boundary that will separate the data linearly.
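
As a rough sketch of this workflow, the snippet below fits a linear SVM on a small, made-up loan dataset using scikit-learn. The feature values, labels, and the `SVC` hyperparameters are illustrative assumptions, not part of the lesson's dataset; scaling the features first is a common practice for SVMs because the margin is distance-based.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical loan applications: [credit_score, income (k$), debt_to_income_ratio]
X = np.array([
    [720, 85, 0.20],   # approved
    [690, 60, 0.25],   # approved
    [710, 95, 0.30],   # approved
    [580, 40, 0.55],   # rejected
    [600, 35, 0.60],   # rejected
    [560, 45, 0.50],   # rejected
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = approved, 0 = rejected

# Scale features, then fit a linear SVM (maximum-margin hyperplane)
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)

# Predict a new, hypothetical application
new_applicant = np.array([[700, 80, 0.25]])
print(model.predict(new_applicant))  # expected: array([1]) -> approved
```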

SVM finds the best hyperplane that separates the two classes. A hyperplane is simply a decision boundary: a line in 2D, a plane in 3D, or an $(N-1)$-dimensional flat subspace in an $N$-dimensional feature space.

The best hyperplane is the one that maximizes the margin, which is the distance between the hyperplane and the closest data points from each class. These closest points, which lie on the margin boundaries, are called the support vectors. This means the goal is to position the hyperplane so it is as far away as possible from the nearest approved and rejected loan applications.

Maximizing the margin

Many lines (or planes) can separate the data, but only one is the maximum-margin classifier. Maximizing this margin achieves two crucial goals:

  1. Increased generalization: A larger margin provides a safety buffer. If the hyperplane is too close to a data point, a small change in a new applicant’s features could cause the model to misclassify them. A large margin ensures the decision boundary is robust and makes the most confident prediction possible for unseen data.

  2. Focus on support vectors: We don’t need to consider every loan application when determining the hyperplane. Only the support vectors—the loan applications that lie closest to the hyperplane—are used to determine its precise position and orientation. All other points can be removed, and the final decision boundary wouldn’t change, as shown in the sketch below. This approach makes SVM memory-efficient and allows us to create models that are driven by the most critical, hardest-to-classify data points.
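
A minimal sketch of this property, using a tiny synthetic 2-D dataset (an illustrative assumption, not the loan data), shows how scikit-learn exposes the support vectors and that refitting on only those points reproduces the same boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Small synthetic 2-D dataset (made-up values for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)

print(svm.support_vectors_)  # only these points pin down the hyperplane
print(svm.support_)          # their indices in the training set
print(svm.n_support_)        # count of support vectors per class

# Refit using only the support vectors: the decision boundary stays the same,
# because non-support-vector points do not influence the hyperplane.
svm_sv_only = SVC(kernel="linear", C=1.0).fit(X[svm.support_], y[svm.support_])
print(np.allclose(svm.coef_, svm_sv_only.coef_))  # expected: True
```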

SVM vs. other classifier

The plot above shows two classifiers that separate the positive and negative classes of a dataset. The blue line represents the SVM classifier, whereas the green line represents the other classifier. The points on the dotted lines are called support vectors because they’re the closest to the hyperplane, and the distance between the blue dotted lines is called the margin, which is what we want to maximize in SVM to get the best possible classifier. The green line is not an SVM hyperplane because it doesn’t maximize the margin.

Note: SVM can be thought of as a generalized linear discriminant with maximum margin.
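
A plot of this kind can be sketched in a few lines of code. The synthetic blobs and the particular green line below are illustrative assumptions, chosen only to contrast the maximum-margin boundary (and its margin lines) with an arbitrary alternative separator:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two linearly separable 2-D blobs (synthetic data)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([5, 5], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1e3).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)

plt.scatter(X[:, 0], X[:, 1], c=y)
# SVM hyperplane (blue solid) and its margin boundaries (blue dotted)
plt.plot(xs, -(w[0] * xs + b) / w[1], "b-")
plt.plot(xs, -(w[0] * xs + b - 1) / w[1], "b:")
plt.plot(xs, -(w[0] * xs + b + 1) / w[1], "b:")
# Some other separating line (green) that does not maximize the margin
plt.plot(xs, -1.2 * xs + 8.5, "g-")
plt.show()
```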

Signed & unsigned distance

In SVM, the hyperplane is defined by a weight vector $\bold w$ and a bias term $b$. The hyperplane equation can be written as $\bold w^T\bold x + b = 0$. Here, $\bold x$ represents a data point, $\bold w$ represents the normal vector to the hyperplane, and $b$ represents the offset of the hyperplane from the origin. The signed distance of a point $\bold x_i$ from the hyperplane is the distance between $\bold x_i$ and the hyperplane, taking into account the direction of the normal vector: $\frac{\bold w^T\bold x_i + b}{\|\bold w\|}$. This distance is signed because it can be positive or negative depending on which side of the hyperplane the point lies on.

If the vector from the hyperplane to the point points in the same direction as the normal vector $\bold w$, the distance is positive; if they point in opposite directions, the distance is negative. All the points above the hyperplane have a positive distance, while all the points below the hyperplane have a negative distance, as shown in the figure below.

Note: Unless stated otherwise, we assume the bias parameter $b$ is part of the vector $\bold w$, and we append $1$ to the feature vectors.

Signed and unsigned distance
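
As a minimal sketch of these definitions (with made-up values for $\bold w$, $b$, and the points), the snippet below computes signed and unsigned distances, absorbing the bias into the weight vector and appending $1$ to each feature vector as described in the note above:

```python
import numpy as np

# Hand-picked hyperplane parameters: w^T x + b = 0
w = np.array([2.0, -1.0])
b = -1.0

# Absorb the bias into the weight vector and append 1 to each feature vector
w_aug = np.append(w, b)                      # [2, -1, -1]
points = np.array([[2.0, 1.0],               # above the hyperplane
                   [0.0, 0.0],               # below the hyperplane
                   [1.0, 1.0]])              # on the hyperplane
points_aug = np.hstack([points, np.ones((len(points), 1))])

# Signed distance: (w^T x + b) / ||w||  -- sign depends on the side of the hyperplane
signed = points_aug @ w_aug / np.linalg.norm(w)
# Unsigned distance: the absolute value of the signed distance
unsigned = np.abs(signed)

print(signed)    # approx. [ 0.894, -0.447, 0. ]
print(unsigned)  # approx. [ 0.894,  0.447, 0. ]
```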

A hyperplane in the feature space defined by the mapping $\phi$ can be written as $\bold w^T\phi(\bold x) = 0$. Given a binary classification dataset $D = \{(\bold x_1, y_1), (\bold x_2, y_2), \dots, (\bold x_n, y_n)\}$ ...