What is SIFT?

Overview

Scale Invariant Feature Transform (SIFT) was introduced by David Lowe, a former professor at the University of British Columbia, in 2004. SIFT is a feature extraction method that reduces the image content to a set of key points used to detect similar patterns in other images. The algorithm is widely used in computer vision applications, including image matching and object detection.

Key terminologies

  • Feature Extraction: These methods aim to reduce the number of features in the dataset by passing them through a mapping function.
  • Key points: An image's key points are spatial locations that are rotation and scale-invariant. These key points highlight what stands out in an image and which pixels are of utmost importance.
  • Descriptors: Descriptors are vectors that describe the local surroundings around the key points present in the image. These descriptors are used to make associations between different images.
  • Gaussian Blur: This is a method used to reduce the noise in the image, with the help of the Gaussian function, so that the key points can be detected efficiently.

Advantages of SIFT

  • Locality: The features are local, so they are robust to noise and clutter.
  • Distinctiveness: The extracted features can be matched against a large database of objects.
  • Quantity: SIFT can generate many features, even from small objects.
  • Efficiency: The algorithm achieves close to real-time performance.

Steps of execution

Overview of the SIFT Algorithm

Building the scale-space

The first step is to ensure that we can identify the distinct points while ignoring the noise. To ensure this, we use the Gaussian blur technique, in which each pixel value is recomputed as a weighted average of the pixels in its neighborhood. The following images show the results after Gaussian blur has been applied:

Test image after Gaussian blur is applied

As we can see above, Gaussian blur has successfully suppressed the noise so that the essential features stand out. The next step is to ensure that the features are not scale-dependent. For this, we use different scales of the original image; this is referred to as building the scale-space. Now we can take a look at how the scaled images (from 300 x 300 down to 180 x 180) appear after we apply the Gaussian blur:

The creation of scale-space using the test image

This process is repeated for multiple octaves. Each octave contains a set of progressively blurred images at one image size, and every new octave starts from a downscaled version of the image processed in the previous octave. The ideal number of octaves is 4, as shown below:

Visualization of 4 Octaves
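As a rough illustration, such a scale-space could be built with OpenCV as sketched below. The function name build_scale_space and the parameter values (4 octaves, 5 blur levels per octave, a starting sigma of 1.6) are illustrative assumptions, not part of the original example:

import cv2

def build_scale_space(img, num_octaves=4, blurs_per_octave=5, sigma=1.6, k=2 ** 0.5):
    # Each octave holds progressively blurred copies at one image size;
    # the next octave starts from a half-size version of the image.
    octaves = []
    base = img.astype('float32')
    for _ in range(num_octaves):
        octave, s = [], sigma
        for _ in range(blurs_per_octave):
            octave.append(cv2.GaussianBlur(base, (0, 0), s))
            s *= k
        octaves.append(octave)
        base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2))
    return octaves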

Difference of Gaussian (DoG)

After we create the scale-space, we must amplify the features using a technique called Difference of Gaussian (DoG). DoG subtracts each blurred image from the next, more blurred image in the same octave, which creates another set of images for each octave in which edges and fine details stand out. The following is the visualization of DoG:

The concept of 'Difference of Gaussian'

This method is repeated for all four octaves, and the resultant images are then used to find key points.
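Continuing the earlier sketch (and assuming octaves is the list returned by build_scale_space, with each octave ordered from least to most blurred), the DoG images could be computed as follows:

def difference_of_gaussian(octave):
    # Subtract each blur level from the next, more blurred one;
    # an octave with n blurred images yields n - 1 DoG images.
    return [octave[i + 1] - octave[i] for i in range(len(octave) - 1)]

dog_octaves = [difference_of_gaussian(octave) for octave in octaves]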

Key point localization

After our images are processed, we must locate the key points in the image. This is done by comparing each pixel value with the other pixels in its locality: each pixel is compared with its 8 neighboring pixels in the same scale, plus the 9 pixels in the scale above and the 9 pixels in the scale below within the same octave. A total of 26 comparisons are therefore made to check whether a point is a local extremum, and only then is it classified as a potential key point.

However, this process results in a lot of key points being generated. So, we drop the key points which do not have enough contrast or are lying along an edge in our test image. This way we get a set of legitimate key points for our test image.
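A minimal sketch of this 26-neighbour test is shown below. It assumes three consecutive DoG images from the same octave (dog_below, dog_current, dog_above, hypothetical NumPy float arrays) and an interior pixel position (r, c):

import numpy as np

def is_local_extremum(dog_below, dog_current, dog_above, r, c):
    # Gather the 3 x 3 x 3 cube of values around (r, c) across the three
    # neighbouring DoG images and test the centre against all 26 neighbours.
    value = dog_current[r, c]
    cube = np.stack([dog_below[r - 1:r + 2, c - 1:c + 2],
                     dog_current[r - 1:r + 2, c - 1:c + 2],
                     dog_above[r - 1:r + 2, c - 1:c + 2]])
    return value == cube.max() or value == cube.min()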

Orientation assignment

To make the set of legitimate key points invariant to rotation, we must assign each of them a magnitude (the strength of the intensity change at that pixel) and an orientation (the direction of that change). The magnitude and orientation are calculated from the image gradients as follows:

magnitude = √(G_x² + G_y²)
orientation = arctan(G_y / G_x)

Here, 'G_x' and 'G_y' represent the gradients in the x and y directions, respectively, for the test image.

The magnitude and orientation values are used to create a histogram. The x-axis represents the angle bins from 0 to 360 degrees in 10-degree intervals (36 bins in total), and the y-axis represents the accumulated magnitude. Each pixel in the key point's neighborhood contributes its magnitude to the bin corresponding to its orientation.

Graph of magnitude against orientation

As seen above, the histogram has a peak value; the bin containing the peak gives the orientation of the key point. Any other peak that is higher than 80% of the highest peak produces an additional key point at the same location with that separate orientation.
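As an illustrative sketch (not the exact SIFT implementation), the gradient magnitude and orientation, together with a 36-bin histogram for one neighbourhood, could be computed with OpenCV's Sobel operator. The image path and the patch coordinates are placeholders:

import cv2
import numpy as np

img = cv2.imread('/dogimage.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)        # G_x: gradient in the x direction
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)        # G_y: gradient in the y direction
magnitude = np.sqrt(gx ** 2 + gy ** 2)
orientation = np.rad2deg(np.arctan2(gy, gx)) % 360

# 36 bins of 10 degrees each, weighted by the gradient magnitude,
# built over an example 16 x 16 neighbourhood around a key point
patch_mag = magnitude[100:116, 100:116]
patch_ori = orientation[100:116, 100:116]
hist, _ = np.histogram(patch_ori, bins=36, range=(0, 360), weights=patch_mag)
dominant_orientation = hist.argmax() * 10    # bin of the highest peak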

Key point descriptor

The last step in the SIFT algorithm is to compute a descriptor for each key point. The pixels surrounding a key point are used to build its descriptor, which makes the descriptor invariant to viewpoint and illumination changes to a certain extent. A 16 x 16 grid is formed around the key point and further sliced into sixteen 4 x 4 sub-blocks.

An 8-bin orientation histogram, weighted by magnitude, is created for each sub-block; the 16 sub-blocks therefore yield 16 × 8 = 128 bin values, which are concatenated into the feature vector.

Visualization of key point descriptors

The feature vector’s rotation and illumination dependence have to be eliminated to get an accurate descriptor. Rotation dependence is eliminated by taking the difference between each gradient orientation and the key point’s orientation. Similarly, illumination dependence is reduced by clipping large values against a threshold and normalizing the descriptor vector.
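A rough sketch of building one such descriptor is shown below. It assumes the 16 x 16 gradient magnitudes and orientations (in degrees) around a key point are already available as NumPy arrays mag and ori, together with the key point's dominant orientation kp_ori; the function name and the 0.2 clipping threshold are illustrative assumptions:

import numpy as np

def keypoint_descriptor(mag, ori, kp_ori, num_bins=8):
    # mag, ori: 16 x 16 arrays of gradient magnitudes and orientations (degrees)
    # around the key point; kp_ori: the key point's dominant orientation.
    rel_ori = (ori - kp_ori) % 360             # remove rotation dependence
    desc = []
    for i in range(0, 16, 4):                  # 4 x 4 sub-blocks of 4 x 4 pixels
        for j in range(0, 16, 4):
            hist, _ = np.histogram(rel_ori[i:i + 4, j:j + 4],
                                   bins=num_bins, range=(0, 360),
                                   weights=mag[i:i + 4, j:j + 4])
            desc.extend(hist)
    desc = np.array(desc, dtype=np.float32)    # 16 sub-blocks * 8 bins = 128 values
    desc /= (np.linalg.norm(desc) + 1e-7)      # normalize for illumination
    desc = np.clip(desc, 0, 0.2)               # clip large values with a threshold
    desc /= (np.linalg.norm(desc) + 1e-7)      # re-normalize
    return desc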

Key point matching

The key points extracted from the previous steps are used for pattern matching in other images. This signifies the importance of SIFT in object detection and image matching.
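A brief sketch of matching SIFT descriptors between two images with OpenCV's brute-force matcher and a ratio test is shown below; the file names and the 0.75 ratio threshold are assumptions for illustration:

import cv2

img1 = cv2.imread('/dogimage.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('/dogimage_rotated.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching of descriptors with a ratio test
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)    # two best matches per descriptor
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

matched_image = cv2.drawMatches(img1, kp1, img2, kp2, good, None)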

Implementation of SIFT

The following is an implementation of SIFT using OpenCV's Python bindings (the cv2 library). The test image is first read and then converted to a grayscale image, reducing the number of channels from 3 to 1. A SIFT object is created with cv2.SIFT_create() and used to detect and compute the key points and descriptors of the test image via detectAndCompute. Another method, cv2.drawKeypoints, is then used to highlight the detected key points on the image. After the code is executed, a test image with all the key points highlighted is obtained.

import cv2
from matplotlib import pyplot as plt

# Read the test image and convert it to grayscale (3 channels -> 1)
img = cv2.imread('/dogimage.jpg')
imgGray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Create the SIFT object and detect the key points and descriptors
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(imgGray, None)

# Highlight the detected key points on the image
sift_image = cv2.drawKeypoints(imgGray, keypoints, None)

plt.figure()
plt.imshow(sift_image, cmap='gray')
plt.savefig('output/sift_image')

Conclusion

SIFT is an efficient and easy-to-use feature extraction technique with many advantages. It helps reduce the dimensionality of the feature space by removing redundant features, which greatly benefits the training of the machine learning models used in large-scale applications worldwide.