Probabilistic Data Structures
📖 Concept
Probabilistic data structures trade perfect accuracy for massive efficiency. They answer questions like "have I seen this before?" using a fraction of the memory that exact data structures would need.
Bloom Filter
"Is this element in the set?" → Yes (probably) or No (definitely)
- False positives: Might say "yes" when the answer is "no" (~1% error rate)
- No false negatives: If it says "no," the element is definitely not in the set
- Space efficient: 10 bytes per element (vs 100+ bytes for a hash set)
Used by: Chrome (malicious URL checking), Cassandra (avoiding disk reads for missing rows), Medium (article recommendation dedup)
HyperLogLog (HLL)
"How many unique elements?" → Approximate count with ~0.81% error
- Counts billions of unique elements using only 12 KB of memory
- Redis:
PFADD visitors user123,PFCOUNT visitorsUsed by: Redis unique visitor counting, database query optimization
Count-Min Sketch
"How many times has this element appeared?" → Approximate frequency count
Used by: Network traffic analysis, finding heavy hitters (trending topics)
When to Use Probabilistic Structures
| Question | Structure | Memory | Accuracy |
|---|---|---|---|
| "Seen this before?" | Bloom Filter | Very low | No false negatives |
| "How many unique?" | HyperLogLog | 12 KB! | ~0.81% error |
| "How often?" | Count-Min Sketch | Low | Slight overcount |
Interview tip: Mention bloom filters when designing systems with "have I seen this?" checks (web crawlers, recommendation dedup, spam filters). They're a great way to show depth.
💻 Code Example
1// ============================================2// Probabilistic Data Structures3// ============================================45// ---------- Bloom Filter ----------6class BloomFilter {7 constructor(size = 1024, hashCount = 3) {8 this.size = size;9 this.hashCount = hashCount;10 this.bitArray = new Uint8Array(size);11 }1213 add(item) {14 for (let i = 0; i < this.hashCount; i++) {15 const index = this.hash(item, i) % this.size;16 this.bitArray[index] = 1;17 }18 }1920 mightContain(item) {21 for (let i = 0; i < this.hashCount; i++) {22 const index = this.hash(item, i) % this.size;23 if (this.bitArray[index] === 0) return false; // Definitely not in set24 }25 return true; // Probably in set (maybe false positive)26 }2728 hash(item, seed) {29 let hash = seed;30 const str = String(item);31 for (let i = 0; i < str.length; i++) {32 hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;33 }34 return Math.abs(hash);35 }36}3738// ---------- HyperLogLog (Simplified) ----------39class HyperLogLog {40 constructor(precision = 14) {41 this.precision = precision;42 this.m = 1 << precision; // Number of registers (16384)43 this.registers = new Uint8Array(this.m);44 }4546 add(item) {47 const hash = this.hash64(item);48 const index = hash >>> (32 - this.precision); // First p bits → register index49 const remaining = hash << this.precision; // Remaining bits50 const leadingZeros = this.countLeadingZeros(remaining) + 1;51 this.registers[index] = Math.max(this.registers[index], leadingZeros);52 }5354 count() {55 // Harmonic mean of 2^register values56 let sum = 0;57 let zeros = 0;58 for (let i = 0; i < this.m; i++) {59 sum += Math.pow(2, -this.registers[i]);60 if (this.registers[i] === 0) zeros++;61 }62 const alpha = 0.7213 / (1 + 1.079 / this.m);63 let estimate = alpha * this.m * this.m / sum;6465 // Small range correction66 if (estimate < 2.5 * this.m && zeros > 0) {67 estimate = this.m * Math.log(this.m / zeros);68 }69 return Math.round(estimate);70 }7172 hash64(item) {73 let hash = 0x811c9dc5;74 const str = String(item);75 for (let i = 0; i < str.length; i++) {76 hash ^= str.charCodeAt(i);77 hash = (hash * 0x01000193) >>> 0;78 }79 return hash;80 }8182 countLeadingZeros(n) {83 if (n === 0) return 32;84 let count = 0;85 while ((n & 0x80000000) === 0) { n <<= 1; count++; }86 return count;87 }88}8990// Demo91const bloom = new BloomFilter(10000, 5);92bloom.add('hello');93bloom.add('world');94console.log('Contains "hello":', bloom.mightContain('hello')); // true95console.log('Contains "missing":', bloom.mightContain('missing')); // false (probably)9697const hll = new HyperLogLog();98for (let i = 0; i < 100000; i++) hll.add(`user_\${i}`);99console.log('Estimated unique:', hll.count()); // ~100000 (±1%)
🏋️ Practice Exercise
Bloom Filter for Web Crawler: Design a bloom filter to track 1 billion crawled URLs. Calculate: optimal bit array size and number of hash functions for < 0.1% false positive rate.
Unique Visitors: Use HyperLogLog to count unique daily visitors across 1000 pages. Each page has its own HLL. How do you count unique visitors across ALL pages?
Recommendation Dedup: Design a system that avoids recommending articles a user has already read, using bloom filters. Handle: 10M users, 100K articles, and bloom filter per user.
Trending Topics: Use a Count-Min Sketch to find the top 10 trending hashtags from a stream of 1M tweets/minute.
⚠️ Common Mistakes
Using bloom filters when false positives are unacceptable — bloom filters are probabilistic. If a false positive causes a payment to be skipped or a security check to pass, don't use a bloom filter.
Not sizing bloom filters correctly — too small = high false positive rate 2. Too large = wasted memory. Use the formula: m = -n*ln(p) / (ln(2))^2 where n = elements, p = false positive rate.
Forgetting you can't remove from a bloom filter — standard bloom filters only support add and query. Use a Counting Bloom Filter if you need deletions.
💼 Interview Questions
🎤 Mock Interview
Practice a live interview for Probabilistic Data Structures