Bloom Filters & Probabilistic Data Structures

0/12 in this phase0/57 across the roadmap

📖 Concept

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is possibly in a set or definitely not in a set.

Properties:

  • No false negatives — if it says "not present", it's 100% certain
  • Possible false positives — if it says "present", it MIGHT be wrong
  • Cannot delete elements (standard version)
  • Extremely space-efficient — uses a fraction of the memory vs hash set

How it works:

  1. A bit array of size m, initialized to all 0s
  2. k independent hash functions
  3. Insert: compute k hashes, set those k bit positions to 1
  4. Query: compute k hashes, check if ALL k positions are 1
    • All 1 → "probably yes" (may be false positive)
    • Any 0 → "definitely no"

Trade-offs:

  • More bits (m) → lower false positive rate
  • More hash functions (k) → sweet spot exists (too many = too many bits set)
  • Optimal k = (m/n) × ln(2) where n = expected elements

Real-world uses:

  • Chrome: checks URLs against malware database (millions of URLs, tiny memory)
  • Medium: recommends articles you haven't read
  • Cassandra/HBase: avoids disk reads for non-existent rows
  • CDN caching: quickly check if content is cached
  • Spell checkers: quick "might be misspelled" check

🏠 Real-world analogy: A bouncer with a fuzzy memory. If they say "you're NOT on the list," you're definitely not on the list. If they say "you might be on the list," you could be, but they might be confused with someone else.

💻 Code Example

codeTap to expand ⛶
1// BLOOM FILTER — Implementation
2class BloomFilter {
3 constructor(size = 1000, numHashes = 3) {
4 this.size = size;
5 this.numHashes = numHashes;
6 this.bits = new Array(size).fill(false);
7 }
8
9 // Simple hash functions using different seeds
10 _hash(value, seed) {
11 let hash = 0;
12 const str = String(value);
13 for (let i = 0; i < str.length; i++) {
14 hash = (hash * seed + str.charCodeAt(i)) % this.size;
15 }
16 return Math.abs(hash);
17 }
18
19 _getPositions(value) {
20 const positions = [];
21 const seeds = [31, 37, 41, 43, 47, 53, 59];
22 for (let i = 0; i < this.numHashes; i++) {
23 positions.push(this._hash(value, seeds[i]));
24 }
25 return positions;
26 }
27
28 // Add element
29 add(value) {
30 for (const pos of this._getPositions(value)) {
31 this.bits[pos] = true;
32 }
33 }
34
35 // Check if element might exist
36 mightContain(value) {
37 return this._getPositions(value).every(pos => this.bits[pos]);
38 }
39
40 // False positive rate ≈ (1 - e^(-kn/m))^k
41 estimateFPRate(numElements) {
42 const k = this.numHashes;
43 const m = this.size;
44 return Math.pow(1 - Math.exp(-k * numElements / m), k);
45 }
46}
47
48// USAGE
49const bloom = new BloomFilter(10000, 5);
50bloom.add("apple");
51bloom.add("banana");
52bloom.add("cherry");
53
54console.log(bloom.mightContain("apple")); // true (definitely added)
55console.log(bloom.mightContain("banana")); // true
56console.log(bloom.mightContain("grape")); // false (definitely NOT added)
57console.log(bloom.mightContain("mango")); // false OR true (false positive possible)
58
59// COUNTING BLOOM FILTER — Supports deletion
60class CountingBloomFilter {
61 constructor(size = 1000, numHashes = 3) {
62 this.size = size;
63 this.numHashes = numHashes;
64 this.counters = new Array(size).fill(0); // Counters instead of bits
65 }
66
67 _getPositions(value) {
68 const positions = [];
69 const seeds = [31, 37, 41];
70 for (let i = 0; i < this.numHashes; i++) {
71 let hash = 0;
72 const str = String(value);
73 for (let j = 0; j < str.length; j++)
74 hash = (hash * seeds[i] + str.charCodeAt(j)) % this.size;
75 positions.push(Math.abs(hash));
76 }
77 return positions;
78 }
79
80 add(value) { for (const p of this._getPositions(value)) this.counters[p]++; }
81 remove(value) { for (const p of this._getPositions(value)) this.counters[p]--; }
82 mightContain(value) { return this._getPositions(value).every(p => this.counters[p] > 0); }
83}

🏋️ Practice Exercise

Practice Problems:

  1. Implement a basic Bloom filter with add and mightContain
  2. Calculate the optimal number of hash functions for given m and n
  3. Implement a counting Bloom filter that supports deletion
  4. Estimate the false positive rate for a given configuration
  5. Design a URL shortener that uses a Bloom filter to avoid collisions
  6. Use a Bloom filter to find elements in one set but not another
  7. Compare memory usage: Bloom filter vs HashSet for 1 million URLs
  8. Implement HyperLogLog for approximate distinct count (conceptual)
  9. Design a cache system that uses a Bloom filter to avoid cache misses
  10. Explain when a Bloom filter is NOT appropriate

⚠️ Common Mistakes

  • Assuming Bloom filters have no errors — they DO have false positives (just no false negatives)

  • Using too few hash functions — increases false positive rate

  • Using too many hash functions — fills the bit array too quickly, also increases false positives

  • Trying to delete from a standard Bloom filter — use counting Bloom filter for deletion

  • Using Bloom filters when exact results are needed — they're probabilistic by design

💼 Interview Questions

🎤 Mock Interview

Practice a live interview for Bloom Filters & Probabilistic Data Structures