Probabilistic Data Structures

0/2 in this phase0/45 across the roadmap

📖 Concept

Probabilistic data structures trade perfect accuracy for massive efficiency. They answer questions like "have I seen this before?" using a fraction of the memory that exact data structures would need.

Bloom Filter

"Is this element in the set?" → Yes (probably) or No (definitely)

False positives: Might say "yes" when the answer is "no" (~1% error rate)
No false negatives: If it says "no," the element is definitely not in the set
Space efficient: 10 bytes per element (vs 100+ bytes for a hash set)

Used by: Chrome (malicious URL checking), Cassandra (avoiding disk reads for missing rows), Medium (article recommendation dedup)

HyperLogLog (HLL)

"How many unique elements?" → Approximate count with ~0.81% error

Counts billions of unique elements using only 12 KB of memory
Redis: PFADD visitors user123, PFCOUNT visitors Used by: Redis unique visitor counting, database query optimization

Count-Min Sketch

"How many times has this element appeared?" → Approximate frequency count

Used by: Network traffic analysis, finding heavy hitters (trending topics)

When to Use Probabilistic Structures

Question	Structure	Memory	Accuracy
"Seen this before?"	Bloom Filter	Very low	No false negatives
"How many unique?"	HyperLogLog	12 KB!	~0.81% error
"How often?"	Count-Min Sketch	Low	Slight overcount

Interview tip: Mention bloom filters when designing systems with "have I seen this?" checks (web crawlers, recommendation dedup, spam filters). They're a great way to show depth.

💻 Code Example

codeTap to expand ⛶

1// ============================================
2// Probabilistic Data Structures
3// ============================================
4
5// ---------- Bloom Filter ----------
6class BloomFilter {
7  constructor(size = 1024, hashCount = 3) {
8    this.size = size;
9    this.hashCount = hashCount;
10    this.bitArray = new Uint8Array(size);
11  }
12
13  add(item) {
14    for (let i = 0; i < this.hashCount; i++) {
15      const index = this.hash(item, i) % this.size;
16      this.bitArray[index] = 1;
17    }
18  }
19
20  mightContain(item) {
21    for (let i = 0; i < this.hashCount; i++) {
22      const index = this.hash(item, i) % this.size;
23      if (this.bitArray[index] === 0) return false; // Definitely not in set
24    }
25    return true; // Probably in set (maybe false positive)
26  }
27
28  hash(item, seed) {
29    let hash = seed;
30    const str = String(item);
31    for (let i = 0; i < str.length; i++) {
32      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
33    }
34    return Math.abs(hash);
35  }
36}
37
38// ---------- HyperLogLog (Simplified) ----------
39class HyperLogLog {
40  constructor(precision = 14) {
41    this.precision = precision;
42    this.m = 1 << precision; // Number of registers (16384)
43    this.registers = new Uint8Array(this.m);
44  }
45
46  add(item) {
47    const hash = this.hash64(item);
48    const index = hash >>> (32 - this.precision); // First p bits → register index
49    const remaining = hash << this.precision; // Remaining bits
50    const leadingZeros = this.countLeadingZeros(remaining) + 1;
51    this.registers[index] = Math.max(this.registers[index], leadingZeros);
52  }
53
54  count() {
55    // Harmonic mean of 2^register values
56    let sum = 0;
57    let zeros = 0;
58    for (let i = 0; i < this.m; i++) {
59      sum += Math.pow(2, -this.registers[i]);
60      if (this.registers[i] === 0) zeros++;
61    }
62    const alpha = 0.7213 / (1 + 1.079 / this.m);
63    let estimate = alpha * this.m * this.m / sum;
64
65    // Small range correction
66    if (estimate < 2.5 * this.m && zeros > 0) {
67      estimate = this.m * Math.log(this.m / zeros);
68    }
69    return Math.round(estimate);
70  }
71
72  hash64(item) {
73    let hash = 0x811c9dc5;
74    const str = String(item);
75    for (let i = 0; i < str.length; i++) {
76      hash ^= str.charCodeAt(i);
77      hash = (hash * 0x01000193) >>> 0;
78    }
79    return hash;
80  }
81
82  countLeadingZeros(n) {
83    if (n === 0) return 32;
84    let count = 0;
85    while ((n & 0x80000000) === 0) { n <<= 1; count++; }
86    return count;
87  }
88}
89
90// Demo
91const bloom = new BloomFilter(10000, 5);
92bloom.add('hello');
93bloom.add('world');
94console.log('Contains "hello":', bloom.mightContain('hello'));     // true
95console.log('Contains "missing":', bloom.mightContain('missing')); // false (probably)
96
97const hll = new HyperLogLog();
98for (let i = 0; i < 100000; i++) hll.add(`user_\${i}`);
99console.log('Estimated unique:', hll.count()); // ~100000 (±1%)

🏋️ Practice Exercise

Bloom Filter for Web Crawler: Design a bloom filter to track 1 billion crawled URLs. Calculate: optimal bit array size and number of hash functions for < 0.1% false positive rate.
Unique Visitors: Use HyperLogLog to count unique daily visitors across 1000 pages. Each page has its own HLL. How do you count unique visitors across ALL pages?
Recommendation Dedup: Design a system that avoids recommending articles a user has already read, using bloom filters. Handle: 10M users, 100K articles, and bloom filter per user.
Trending Topics: Use a Count-Min Sketch to find the top 10 trending hashtags from a stream of 1M tweets/minute.

⚠️ Common Mistakes

Using bloom filters when false positives are unacceptable — bloom filters are probabilistic. If a false positive causes a payment to be skipped or a security check to pass, don't use a bloom filter.
Not sizing bloom filters correctly — too small = high false positive rate 2. Too large = wasted memory. Use the formula: m = -n*ln(p) / (ln(2))^2 where n = elements, p = false positive rate.
Forgetting you can't remove from a bloom filter — standard bloom filters only support add and query. Use a Counting Bloom Filter if you need deletions.

💼 Interview Questions

🎤 Mock Interview

Practice a live interview for Probabilistic Data Structures

Was this topic helpful?

← PreviousDistributed Consensus & Leader Election Next →Three Pillars of Observability