Introduction to Support Vector Machine
Get introduced to the support vector machine.
In regression-based algorithms like linear regression, we use a line of best fit to predict a continuous target.
What if we try to use the linear regression algorithm to predict classes instead (for example, after converting the categorical targets or labels to the integer classes 0/1)? That isn’t a good idea.
To accomplish a classification task, we consider the line as a boundary that splits the space instead of fitting the points.
Enter logistic regression, where a transfer function lets a linear, regression-based algorithm tackle the classification problem. The logistic transfer function converts real-valued scores into probabilities, and the algorithm then uses these class probabilities as a proxy for the class predictions.
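As a minimal sketch (the function and variable names here are illustrative, not taken from the lesson's code), this is how the logistic transfer function squashes real-valued scores into probabilities that can then be thresholded into class predictions:

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear scores produced by a regression-style model (illustrative values)
scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(scores)               # class-1 probabilities
preds = (probs >= 0.5).astype(int)    # threshold the probabilities into 0/1 labels
print(probs, preds)
```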
This creates a difficulty: we are tackling a problem whose answer is yes/no (1/0) by solving a regression-style problem, and our usual loss functions (MSE, MAE, and others) are not helpful. Instead, we use the log loss, or binary cross-entropy, loss function. Cross-entropy returns a score that summarizes the average difference between the actual and predicted probability distributions for class 1. The score is minimized, and a perfect cross-entropy value of 0 corresponds to an accuracy of 1.
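A minimal sketch of that loss, assuming the usual convention that y holds the true 0/1 labels and p holds the predicted probabilities of class 1 (the helper name below is made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss between true 0/1 labels and predicted class-1 probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p_good = np.array([0.95, 0.05, 0.9, 0.8])  # confident, correct predictions -> loss near 0
p_poor = np.array([0.4, 0.6, 0.5, 0.3])    # uncertain or wrong predictions -> larger loss
print(binary_cross_entropy(y, p_good), binary_cross_entropy(y, p_poor))
```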
We set the best possible line as a boundary and, ideally, expect this boundary to make as few classification mistakes as possible. For evaluation, we compare each true label with its predicted label: if they match, the loss is 0; if they do not, the loss is 1. Under this misclassification loss, every mistake is treated as equally bad. However, some misclassifications could be worse than others, depending on how close the points are to the decision boundary, and we can imagine how poorly a linear separation between the two classes chosen this way could behave.
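To make "equally bad" concrete, here is a small illustrative sketch (not from the lesson's code) of the zero-one misclassification loss, which charges every mistake the same penalty of 1 no matter how far the point sits from the boundary:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Misclassification loss: 1 where the predicted label differs from the true label, else 0."""
    return (y_true != y_pred).astype(int)

y_true = np.array([1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1])  # two mistakes, counted identically however "bad" they are
print(zero_one_loss(y_true, y_pred), zero_one_loss(y_true, y_pred).mean())
```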
Can we think of an alternative approach that replaces "equally bad" with "how bad"? Let’s see how support vector machines (SVMs) are helpful.
What if we treated misclassifications according to how badly they were misclassified? For example, we could impose stronger penalties on observations found deeper in the territory of the other class.
Further, the farther a misclassification is from the decision boundary, the more wrong it is, so the higher the penalty should be. We would also like a margin for error, so even correctly classified observations close to the boundary (almost misclassified) could contribute to the penalty.
The support vector machine (SVM) algorithm follows this different approach to classification. Like logistic regression, the SVM algorithm still fits a decision boundary, but it uses a different loss function called the hinge loss, an alternative to cross-entropy for binary classification problems that is used for maximum-margin classification. It was developed primarily for SVM models and for binary classification where the targets are -1/1.
The function encourages observations to have the correct sign, assigning a larger error the greater the difference between the true and predicted class values.
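As a minimal sketch, assuming the -1/1 label convention above and treating the model's raw score as a signed, distance-like quantity relative to the boundary (the names below are illustrative), the hinge loss looks like this:

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Hinge loss for -1/1 labels: zero only when the score has the right sign
    and the point lies outside the margin (y * score >= 1)."""
    return np.maximum(0.0, 1.0 - y_true * scores)

y = np.array([1, 1, -1, -1])
scores = np.array([2.5, 0.3, -1.8, 0.7])  # the last point is misclassified
# Correct and outside the margin -> 0; correct but inside the margin -> small penalty;
# misclassified -> penalty grows with how far the score lands on the wrong side.
print(hinge_loss(y, scores))
```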
How does the SVM classify?
In the figure below, we have two types of example datasets for classification: linearly separable and not linearly separable.
Let’s start with the linearly separable example to see how the SVM algorithm works in such cases. The algorithm fits a decision boundary defined by the largest margin between the closest points of each class. This boundary is commonly called the maximum margin hyperplane (MMH). The intuition behind finding it is straightforward: the algorithm looks for the surface that lies maximally far from any data point of either class.
The points used by SVM to fit that boundary are the support vectors.
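A short sketch of this in practice, assuming scikit-learn is available and using a made-up, linearly separable toy dataset: a fitted linear SVM exposes the points that define the boundary through its support_vectors_ attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard-margin SVM
clf.fit(X, y)

print(clf.support_vectors_)        # the points that define the maximum margin hyperplane
print(clf.coef_, clf.intercept_)   # the learned weight vector and bias of the boundary
```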
The figure below shows two decision boundaries (red and green) and their margins as dotted lines. The boundary in green has a wider margin than the one in red.
The arrowed lines show the respective distances between the support vectors and the decision boundaries. The green boundary is the maximum margin hyperplane.
The maximum margin hyperplane
For linearly separable data, we can select two parallel hyperplanes (the dotted lines in the figure below) that separate the two classes so that the distance between them is as large as possible. The region bounded by these two hyperplanes is the margin, and the maximum margin hyperplane is the hyperplane that lies halfway between them.
With a normalized or standardized dataset, these hyperplanes can be described by the equations below. We choose the normalizing constant such that the distance from the plane to the closest points (the support vectors) of either class is 1:
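The equations themselves are not reproduced in this text, so the following shows the standard hard-margin convention, assuming a weight vector w and bias b (the lesson's exact symbols may differ):

```latex
% Margin hyperplanes (assumed notation: w = weight vector, b = bias)
\mathbf{w} \cdot \mathbf{x} - b = +1 \quad \text{(support vectors of the positive class lie on this plane)}
\mathbf{w} \cdot \mathbf{x} - b = -1 \quad \text{(support vectors of the negative class lie on this plane)}
```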
If the normalizer for the weights is the Euclidean norm ‖w‖, then the distance between these two parallel hyperplanes is 2/‖w‖, so maximizing the margin amounts to minimizing ‖w‖.