Bloom Filters & Probabilistic Data Structures
📖 Concept
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is possibly in a set or definitely not in a set.
Properties:
- No false negatives — if it says "not present", it's 100% certain
- Possible false positives — if it says "present", it MIGHT be wrong
- Cannot delete elements (standard version)
- Extremely space-efficient — uses a fraction of the memory vs hash set
How it works:
- A bit array of size m, initialized to all 0s
- k independent hash functions
- Insert: compute k hashes, set those k bit positions to 1
- Query: compute k hashes, check if ALL k positions are 1
- All 1 → "probably yes" (may be false positive)
- Any 0 → "definitely no"
Trade-offs:
- More bits (m) → lower false positive rate
- More hash functions (k) → sweet spot exists (too many = too many bits set)
- Optimal k = (m/n) × ln(2) where n = expected elements
Real-world uses:
- Chrome: checks URLs against malware database (millions of URLs, tiny memory)
- Medium: recommends articles you haven't read
- Cassandra/HBase: avoids disk reads for non-existent rows
- CDN caching: quickly check if content is cached
- Spell checkers: quick "might be misspelled" check
🏠 Real-world analogy: A bouncer with a fuzzy memory. If they say "you're NOT on the list," you're definitely not on the list. If they say "you might be on the list," you could be, but they might be confused with someone else.
💻 Code Example
1// BLOOM FILTER — Implementation2class BloomFilter {3 constructor(size = 1000, numHashes = 3) {4 this.size = size;5 this.numHashes = numHashes;6 this.bits = new Array(size).fill(false);7 }89 // Simple hash functions using different seeds10 _hash(value, seed) {11 let hash = 0;12 const str = String(value);13 for (let i = 0; i < str.length; i++) {14 hash = (hash * seed + str.charCodeAt(i)) % this.size;15 }16 return Math.abs(hash);17 }1819 _getPositions(value) {20 const positions = [];21 const seeds = [31, 37, 41, 43, 47, 53, 59];22 for (let i = 0; i < this.numHashes; i++) {23 positions.push(this._hash(value, seeds[i]));24 }25 return positions;26 }2728 // Add element29 add(value) {30 for (const pos of this._getPositions(value)) {31 this.bits[pos] = true;32 }33 }3435 // Check if element might exist36 mightContain(value) {37 return this._getPositions(value).every(pos => this.bits[pos]);38 }3940 // False positive rate ≈ (1 - e^(-kn/m))^k41 estimateFPRate(numElements) {42 const k = this.numHashes;43 const m = this.size;44 return Math.pow(1 - Math.exp(-k * numElements / m), k);45 }46}4748// USAGE49const bloom = new BloomFilter(10000, 5);50bloom.add("apple");51bloom.add("banana");52bloom.add("cherry");5354console.log(bloom.mightContain("apple")); // true (definitely added)55console.log(bloom.mightContain("banana")); // true56console.log(bloom.mightContain("grape")); // false (definitely NOT added)57console.log(bloom.mightContain("mango")); // false OR true (false positive possible)5859// COUNTING BLOOM FILTER — Supports deletion60class CountingBloomFilter {61 constructor(size = 1000, numHashes = 3) {62 this.size = size;63 this.numHashes = numHashes;64 this.counters = new Array(size).fill(0); // Counters instead of bits65 }6667 _getPositions(value) {68 const positions = [];69 const seeds = [31, 37, 41];70 for (let i = 0; i < this.numHashes; i++) {71 let hash = 0;72 const str = String(value);73 for (let j = 0; j < str.length; j++)74 hash = (hash * seeds[i] + str.charCodeAt(j)) % this.size;75 positions.push(Math.abs(hash));76 }77 return positions;78 }7980 add(value) { for (const p of this._getPositions(value)) this.counters[p]++; }81 remove(value) { for (const p of this._getPositions(value)) this.counters[p]--; }82 mightContain(value) { return this._getPositions(value).every(p => this.counters[p] > 0); }83}
🏋️ Practice Exercise
Practice Problems:
- Implement a basic Bloom filter with add and mightContain
- Calculate the optimal number of hash functions for given m and n
- Implement a counting Bloom filter that supports deletion
- Estimate the false positive rate for a given configuration
- Design a URL shortener that uses a Bloom filter to avoid collisions
- Use a Bloom filter to find elements in one set but not another
- Compare memory usage: Bloom filter vs HashSet for 1 million URLs
- Implement HyperLogLog for approximate distinct count (conceptual)
- Design a cache system that uses a Bloom filter to avoid cache misses
- Explain when a Bloom filter is NOT appropriate
⚠️ Common Mistakes
Assuming Bloom filters have no errors — they DO have false positives (just no false negatives)
Using too few hash functions — increases false positive rate
Using too many hash functions — fills the bit array too quickly, also increases false positives
Trying to delete from a standard Bloom filter — use counting Bloom filter for deletion
Using Bloom filters when exact results are needed — they're probabilistic by design
💼 Interview Questions
🎤 Mock Interview
Practice a live interview for Bloom Filters & Probabilistic Data Structures