Locality Sensitive Hashing (LSH) is a technique widely used for approximate similarity search. It’s used because exhaustively comparing billions of data points in current-day searches is not practically feasible. Therefore, we need a dimensionality-reduction method that limits our search scope to highly similar data points only.
LSH is an algorithmic approach to clustering similar input points into the same buckets using their hash values.
Unlike other hashing techniques, LSH aims to maximize collisions between similar input points, ensuring a high probability of placing them in a single bucket.
There are different versions of LSH depending on the choice of hashing function and similarity metric. The two prominent versions are as follows:

- The random hyperplane (projection) approach, which targets cosine similarity
- The MinHash approach, which targets Jaccard similarity
Here, we explore the hyperplane approach in more detail. Let’s go through its algorithm.
1. Place a hyperplane at a random position in the vector space.
2. Assign a 0 (negative) value to all the points in the upward portion of the hyperplane and a 1 (positive) value to the points residing in the downward portion.

Note: Practically, we take the dot product of the hyperplane’s normal vector and each input vector and assign values based on the result: 1 if they both point in the same direction and 0 if they point in opposite directions.

3. Repeat steps 1 and 2 a fixed number of times, varying the placement of the hyperplane each time. This generates hashed vectors with lower dimensionality (equal to the number of hyperplanes). In short, each iteration produces a different bucket configuration of the input vectors, and together they improve the final bucket allocation.
4. Once we have our index of hashed vectors for the input points, we can measure the similarity of any query vector by calculating its Hamming distance to each of the index vectors (as shown in the sketch below). The higher the Hamming distance, the lower the similarity, and vice versa.
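The following is a minimal Python/NumPy sketch of these steps. The point coordinates, the choice of K = 3 hyperplanes, and the helper names are illustrative assumptions rather than anything fixed by the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hyperplanes(num_planes, dim):
    # Each row is the normal vector of one random hyperplane through the origin.
    return rng.normal(size=(num_planes, dim))

def hash_points(points, normals):
    # The sign of the dot product with each normal picks the side of the
    # hyperplane: 1 if the point lies in the normal's direction, 0 otherwise.
    return (points @ normals.T > 0).astype(np.uint8)

def hamming(a, b):
    # Number of bit positions where two hash codes differ.
    return int(np.count_nonzero(a != b))

# Three 2-D input points and one query point (illustrative values).
points = np.array([[1.0, 2.0], [1.2, 1.8], [-2.0, -1.0]])
query = np.array([0.9, 2.1])

planes = random_hyperplanes(num_planes=3, dim=2)   # K = 3 hyperplanes
index_codes = hash_points(points, planes)          # one 3-bit code per input point
query_code = hash_points(query[None, :], planes)[0]

# Rank the indexed points by Hamming distance to the query's code;
# the smallest distance marks the approximate nearest neighbor.
distances = [hamming(code, query_code) for code in index_codes]
print(index_codes, query_code, distances)
```

Because the hyperplanes are random, a single set of K planes can occasionally separate genuinely close points; production systems commonly build several independent sets of hyperplanes and take the union of the candidate buckets to reduce misses.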
Based on the above illustration, the hashed vectors for the three input points come out to be 101, 101, and 001, respectively.
Let’s assume a query point arrives with the hash value 111.
We can calculate the Hamming distance between the query’s hash and each of the indexed hashes. The following are the Hamming distances of all the comparisons:

Hamming(101, 111) = 1
Hamming(101, 111) = 1
Hamming(001, 111) = 2

Since we take the lowest Hamming distance as an indicator of the closest match, points 1 and 2 are approximately similar to the query point.
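To double-check the arithmetic, a short bit-string helper (a hypothetical function, not part of the original text) reproduces the three distances:

```python
def hamming(a: str, b: str) -> int:
    # Count the positions at which the two bit strings differ.
    return sum(x != y for x, y in zip(a, b))

print(hamming("101", "111"))  # 1
print(hamming("101", "111"))  # 1
print(hamming("001", "111"))  # 2
```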
Let’s assume we have P input points, each N-dimensional, and we want to generate a total of K hyperplanes, approximately equal to the natural logarithm of P (K ≈ ln P).
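As a quick worked example of this rule of thumb (the concrete value of P is an assumption for illustration):

```python
import math

P = 1_000_000           # number of input points (illustrative value)
K = round(math.log(P))  # K ≈ ln(P) hyperplanes
print(K, 2 ** K)        # 14 hyperplanes -> 14-bit hashes and up to 2**14 = 16384 buckets
```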