Probabilistic Data Structures

0/2 in this phase0/45 across the roadmap

📖 Concept

Probabilistic data structures trade perfect accuracy for massive efficiency. They answer questions like "have I seen this before?" using a fraction of the memory that exact data structures would need.

Bloom Filter

"Is this element in the set?" → Yes (probably) or No (definitely)

  • False positives: Might say "yes" when the answer is "no" (~1% error rate)
  • No false negatives: If it says "no," the element is definitely not in the set
  • Space efficient: 10 bytes per element (vs 100+ bytes for a hash set)

Used by: Chrome (malicious URL checking), Cassandra (avoiding disk reads for missing rows), Medium (article recommendation dedup)

HyperLogLog (HLL)

"How many unique elements?" → Approximate count with ~0.81% error

  • Counts billions of unique elements using only 12 KB of memory
  • Redis: PFADD visitors user123, PFCOUNT visitors Used by: Redis unique visitor counting, database query optimization

Count-Min Sketch

"How many times has this element appeared?" → Approximate frequency count

Used by: Network traffic analysis, finding heavy hitters (trending topics)

When to Use Probabilistic Structures

Question Structure Memory Accuracy
"Seen this before?" Bloom Filter Very low No false negatives
"How many unique?" HyperLogLog 12 KB! ~0.81% error
"How often?" Count-Min Sketch Low Slight overcount

Interview tip: Mention bloom filters when designing systems with "have I seen this?" checks (web crawlers, recommendation dedup, spam filters). They're a great way to show depth.

💻 Code Example

codeTap to expand ⛶
1// ============================================
2// Probabilistic Data Structures
3// ============================================
4
5// ---------- Bloom Filter ----------
6class BloomFilter {
7 constructor(size = 1024, hashCount = 3) {
8 this.size = size;
9 this.hashCount = hashCount;
10 this.bitArray = new Uint8Array(size);
11 }
12
13 add(item) {
14 for (let i = 0; i < this.hashCount; i++) {
15 const index = this.hash(item, i) % this.size;
16 this.bitArray[index] = 1;
17 }
18 }
19
20 mightContain(item) {
21 for (let i = 0; i < this.hashCount; i++) {
22 const index = this.hash(item, i) % this.size;
23 if (this.bitArray[index] === 0) return false; // Definitely not in set
24 }
25 return true; // Probably in set (maybe false positive)
26 }
27
28 hash(item, seed) {
29 let hash = seed;
30 const str = String(item);
31 for (let i = 0; i < str.length; i++) {
32 hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
33 }
34 return Math.abs(hash);
35 }
36}
37
38// ---------- HyperLogLog (Simplified) ----------
39class HyperLogLog {
40 constructor(precision = 14) {
41 this.precision = precision;
42 this.m = 1 << precision; // Number of registers (16384)
43 this.registers = new Uint8Array(this.m);
44 }
45
46 add(item) {
47 const hash = this.hash64(item);
48 const index = hash >>> (32 - this.precision); // First p bits → register index
49 const remaining = hash << this.precision; // Remaining bits
50 const leadingZeros = this.countLeadingZeros(remaining) + 1;
51 this.registers[index] = Math.max(this.registers[index], leadingZeros);
52 }
53
54 count() {
55 // Harmonic mean of 2^register values
56 let sum = 0;
57 let zeros = 0;
58 for (let i = 0; i < this.m; i++) {
59 sum += Math.pow(2, -this.registers[i]);
60 if (this.registers[i] === 0) zeros++;
61 }
62 const alpha = 0.7213 / (1 + 1.079 / this.m);
63 let estimate = alpha * this.m * this.m / sum;
64
65 // Small range correction
66 if (estimate < 2.5 * this.m && zeros > 0) {
67 estimate = this.m * Math.log(this.m / zeros);
68 }
69 return Math.round(estimate);
70 }
71
72 hash64(item) {
73 let hash = 0x811c9dc5;
74 const str = String(item);
75 for (let i = 0; i < str.length; i++) {
76 hash ^= str.charCodeAt(i);
77 hash = (hash * 0x01000193) >>> 0;
78 }
79 return hash;
80 }
81
82 countLeadingZeros(n) {
83 if (n === 0) return 32;
84 let count = 0;
85 while ((n & 0x80000000) === 0) { n <<= 1; count++; }
86 return count;
87 }
88}
89
90// Demo
91const bloom = new BloomFilter(10000, 5);
92bloom.add('hello');
93bloom.add('world');
94console.log('Contains "hello":', bloom.mightContain('hello')); // true
95console.log('Contains "missing":', bloom.mightContain('missing')); // false (probably)
96
97const hll = new HyperLogLog();
98for (let i = 0; i < 100000; i++) hll.add(`user_\${i}`);
99console.log('Estimated unique:', hll.count()); // ~100000 (±1%)

🏋️ Practice Exercise

  1. Bloom Filter for Web Crawler: Design a bloom filter to track 1 billion crawled URLs. Calculate: optimal bit array size and number of hash functions for < 0.1% false positive rate.

  2. Unique Visitors: Use HyperLogLog to count unique daily visitors across 1000 pages. Each page has its own HLL. How do you count unique visitors across ALL pages?

  3. Recommendation Dedup: Design a system that avoids recommending articles a user has already read, using bloom filters. Handle: 10M users, 100K articles, and bloom filter per user.

  4. Trending Topics: Use a Count-Min Sketch to find the top 10 trending hashtags from a stream of 1M tweets/minute.

⚠️ Common Mistakes

  • Using bloom filters when false positives are unacceptable — bloom filters are probabilistic. If a false positive causes a payment to be skipped or a security check to pass, don't use a bloom filter.

  • Not sizing bloom filters correctly — too small = high false positive rate 2. Too large = wasted memory. Use the formula: m = -n*ln(p) / (ln(2))^2 where n = elements, p = false positive rate.

  • Forgetting you can't remove from a bloom filter — standard bloom filters only support add and query. Use a Counting Bloom Filter if you need deletions.

💼 Interview Questions

🎤 Mock Interview

Practice a live interview for Probabilistic Data Structures