What is a Bloom Filter?

What is a Bloom filter?

  • It is a probabilistic data structure designed to answer the set membership question- Is this element present in the Set?
  • It is highly space efficient and does not store the actual items.
  • It can tell very quickly if an item does not exist in a set or if it may be existing in the Set.
  • However, it cannot tell if an item is definitely present in a Set.

How does a Bloom filter work?

An empty Bloom filter is a Bit Vector with all bits set to zero. In the image below, each cell represents a bit. The number below the bit is its index for a 10-bit vector.

An empty Bloom filter

Adding an item to the Bloom Filter

In order to add an element, it must be hashed using multiple hash functions. Bits are set at the index of the hashes in the bit vector.

For example, let’s assume we need to add the james@gmail.com element using three efficient hash functions: H1(james@gmail.com) = 12021 H2(james@gmail.com) = 23324 H3(james@gmail.com) = 23237

We can take the mod of 10 for all these values to get an index within the bounds of the bit vector: 12021 % 10 = 1 23324 % 10= 4 23237 % 10 = 7

Adding a item to the Bloom Filter

Testing membership of an item

For an item whose membership needs to be tested, it is also hashed via the same hash functions. If all the bits are already set for this, the element may exist in the set.

If any bit is not set, the element is definitely not in the set.

Checking membership of the item in Bloom Filter

Why Bloom filters give false positive results

Let’s assume we have added the two members below to the bloom filter.

  • Monkey with Hash Output H(“Monkey”) = {1,2,5}
  • Lion with Hash Output H(“Lion”) = {7,4,3}

Now, if we want to check whether or not Tiger exists in the set, we can hash it via the same hash functions.

H(“Tiger”) = {2,7,3}

We have not added “Tiger” to the bloom filter, but all the bits at index {2,7,3} have already been set by the previous two elements; thus, Bloom Filter claims that “Tiger” is present in the set. This is a false positive result.

Bloom filter applications

A Bloom filters is a space-efficient data structure, but it does not store the actual items since it is just a bit vector. It has many applications such as:

  • Medium uses Bloom filters in its Recommendation module to avoid showing those posts that have already been seen by the user.
  • Cassandra uses bloom filters to optimize the search of data in an SSTable on the disk.
  • CDNs use bloom filters to avoid caching items that are rarely searched.