What is it?
A Bloom filter is a tiny, probabilistic data structure for set membership:
- Answers “possibly in the set” or “definitely not”.
- No false negatives, but false positives are possible.
- Extremely space-efficient and fast.
How it works (intuition)
- Allocate a bit array of length
m, all zeros. - Pick
kindependent hash functions:h1..hk : item → [0, m-1]. - Insert(x): set bits at positions
h1(x), …,hk(x)to 1. - Query(x): if all
kbits are 1 → possibly present; if any bit is 0 → definitely not present.
Tiny example (m=10, k=3)
Start: 0000000000
Insert "ana": set [1,4,7] → 0100100100
Insert "ben": set [2,5,9] → 0110110101
Query "ava": check its 3 positions [1,5,7]
- if bits at 1,5,7 are 1 → "maybe"
- if any is 0 → "no"
False positives happen because different items can flip the same bits.
Why it’s useful
- Skip work cheaply: e.g., “Does this partition/segment possibly contain key K?”
- Guard remote calls: avoid cache/database lookups for definitely-missing keys.
- Accelerate joins: build a Bloom filter on small table’s join keys; probe the big table to discard non-matching rows early.
Columnar/Lakehouse examples
- Parquet/ORC/Iceberg/Delta/Hudi often store per-row-group or per-file Bloom filters on selective columns (e.g.,
email,user_id).
QueryWHERE email='a@b.com':- Partition prune by directory (e.g.,
dt=...). - Bloom filter test per file/row group → skip most data.
- Scan only the files/groups that passed.
- Partition prune by directory (e.g.,
Sizing & math (handy formulas)
Let:
n= expected items,m= bits in the filter,k= number of hash functions,p= false positive rate.
False positive (approx.):
p ≈ (1 - e^(-k·n/m))^k
Optimal k for given m,n:
k* = (m/n) · ln 2
Bits needed for target p:
m ≈ - (n · ln p) / (ln 2)^2
Quick sizing example
Target: n = 10,000,000, p = 1% (0.01)
- Bits per item ≈
-ln(0.01)/(ln2)^2 ≈ 9.6→ m ≈ 96M bits ≈ 12 MB k* ≈ (m/n)·ln2 ≈ 9.6 · 0.693 ≈ 6.6→ use k = 7
Variants & cousins
- Counting Bloom Filter: replaces bits with small counters → supports deletes (extra memory).
- Scalable Bloom Filter: grows in stages to keep
pbounded asngrows. - Blocked/Partitioned Bloom: better CPU cache locality (group bits by block).
- Cuckoo / XOR filters: alternative structures; support deletes and often lower p at similar sizes (implementation dependent).
Limitations & gotchas
- Can’t list elements or get counts (only “maybe/definitely not”).
- As the filter fills up (n >> expectation),
prises; rebuild or use scalable variant. - Requires good hash functions; poor hashing = high collision = higher
p. - For tiny sets, a sorted array + binary search can be simpler/faster.
Minimal pseudocode
init(m, k): bitset[0..m-1] = 0
insert(x):
for i in 1..k:
j = hi(x) mod m
bitset[j] = 1
mightContain(x):
for i in 1..k:
j = hi(x) mod m
if bitset[j] == 0: return false // definitely not
return true // possibly yes