Bloom filters

What is it?

A Bloom filter is a tiny, probabilistic data structure for set membership:

Answers “possibly in the set” or “definitely not”.
No false negatives, but false positives are possible.
Extremely space-efficient and fast.

How it works (intuition)

Allocate a bit array of length m, all zeros.
Pick k independent hash functions: h1..hk : item → [0, m-1].
Insert(x): set bits at positions h1(x), …, hk(x) to 1.
Query(x): if all k bits are 1 → possibly present; if any bit is 0 → definitely not present.

Tiny example (m=10, k=3)


Start: 0000000000  
Insert "ana": set [1,4,7] → 0100100100  
Insert "ben": set [2,5,9] → 0110110101  
Query "ava": check its 3 positions [1,5,7]

- if bits at 1,5,7 are 1 → "maybe"
    
- if any is 0 → "no"

False positives happen because different items can flip the same bits.

Why it’s useful

Skip work cheaply: e.g., “Does this partition/segment possibly contain key K?”
Guard remote calls: avoid cache/database lookups for definitely-missing keys.
Accelerate joins: build a Bloom filter on small table’s join keys; probe the big table to discard non-matching rows early.

Columnar/Lakehouse examples

Parquet/ORC/Iceberg/Delta/Hudi often store per-row-group or per-file Bloom filters on selective columns (e.g., email, user_id).
Query WHERE email='a@b.com':
1. Partition prune by directory (e.g., dt=...).
2. Bloom filter test per file/row group → skip most data.
3. Scan only the files/groups that passed.

Sizing & math (handy formulas)

Let:

n = expected items,
m = bits in the filter,
k = number of hash functions,
p = false positive rate.

False positive (approx.):
p ≈ (1 - e^(-k·n/m))^k

Optimal k for given m,n:
k* = (m/n) · ln 2

Bits needed for target p:
m ≈ - (n · ln p) / (ln 2)^2

Quick sizing example
Target: n = 10,000,000, p = 1% (0.01)

Bits per item ≈ -ln(0.01)/(ln2)^2 ≈ 9.6 → m ≈ 96M bits ≈ 12 MB
k* ≈ (m/n)·ln2 ≈ 9.6 · 0.693 ≈ 6.6 → use k = 7

Variants & cousins

Counting Bloom Filter: replaces bits with small counters → supports deletes (extra memory).
Scalable Bloom Filter: grows in stages to keep p bounded as n grows.
Blocked/Partitioned Bloom: better CPU cache locality (group bits by block).
Cuckoo / XOR filters: alternative structures; support deletes and often lower p at similar sizes (implementation dependent).

Limitations & gotchas

Can’t list elements or get counts (only “maybe/definitely not”).
As the filter fills up (n >> expectation), p rises; rebuild or use scalable variant.
Requires good hash functions; poor hashing = high collision = higher p.
For tiny sets, a sorted array + binary search can be simpler/faster.

Minimal pseudocode

init(m, k): bitset[0..m-1] = 0
 
insert(x):
  for i in 1..k:
    j = hi(x) mod m
    bitset[j] = 1
 
mightContain(x):
  for i in 1..k:
    j = hi(x) mod m
    if bitset[j] == 0: return false   // definitely not
  return true                          // possibly yes

Jonas Notes

Explorer