What is it?

A Bloom filter is a tiny, probabilistic data structure for set membership:

  • Answers “possibly in the set” or “definitely not”.
  • No false negatives, but false positives are possible.
  • Extremely space-efficient and fast.

How it works (intuition)

  • Allocate a bit array of length m, all zeros.
  • Pick k independent hash functions: h1..hk : item → [0, m-1].
  • Insert(x): set bits at positions h1(x), …, hk(x) to 1.
  • Query(x): if all k bits are 1 → possibly present; if any bit is 0 → definitely not present.
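The four steps above can be sketched in Python. The salted-SHA-256 scheme used here to derive k hash functions is one common choice, an assumption rather than the only way:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k salted hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _positions(self, item: str):
        # Derive k positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str):
        for j in self._positions(item):
            self.bits[j // 8] |= 1 << (j % 8)

    def might_contain(self, item: str) -> bool:
        # Any unset bit means the item was definitely never inserted.
        return all(self.bits[j // 8] & (1 << (j % 8))
                   for j in self._positions(item))

bf = BloomFilter(m=1024, k=5)
bf.insert("ana")
print(bf.might_contain("ana"))   # True (no false negatives)
print(bf.might_contain("zoe"))   # almost certainly False at this fill level
```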

Tiny example (m=10, k=3)

Start:        0000000000
Insert "ana": set bits [1,4,7] → 0100100100
Insert "ben": set bits [2,5,9] → 0110110101
Query "ava":  check its 3 positions [1,5,7]

  • if bits 1, 5, 7 are all 1 → "maybe"
  • if any is 0 → "no"

False positives happen because different items can set the same bits: here "ava" was never inserted, yet all three of its positions are already 1, so the filter answers "maybe".
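The walk-through can be replayed in Python with the positions hard-coded to the values above (they stand in for real hash outputs):

```python
m = 10
bits = [0] * m

# Hand-picked positions from the example above, not real hashes.
positions = {"ana": [1, 4, 7], "ben": [2, 5, 9], "ava": [1, 5, 7]}

def insert(item):
    for j in positions[item]:
        bits[j] = 1

def might_contain(item):
    return all(bits[j] == 1 for j in positions[item])

insert("ana")
insert("ben")
print("".join(map(str, bits)))   # 0110110101
print(might_contain("ava"))      # True: a false positive, "ava" was never inserted
```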


Why it’s useful

  • Skip work cheaply: e.g., “Does this partition/segment possibly contain key K?”
  • Guard remote calls: avoid cache/database lookups for definitely-missing keys.
  • Accelerate joins: build a Bloom filter on the small table’s join keys, then probe it while scanning the big table to discard non-matching rows early.

Columnar/Lakehouse examples

  • Parquet/ORC/Iceberg/Delta/Hudi often store per-row-group or per-file Bloom filters on selective columns (e.g., email, user_id).
    Query WHERE email='a@b.com':
    1. Partition prune by directory (e.g., dt=...).
    2. Bloom filter test per file/row group → skip most data.
    3. Scan only the files/groups that passed.
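The three-step skip flow can be sketched like this; the file layout and the per-file key sets standing in for Bloom filters are invented for illustration:

```python
# Hypothetical file catalog: each entry carries its partition value and a
# set of keys standing in for that file's Bloom filter.
files = [
    {"path": "dt=2024-01-01/a.parquet", "dt": "2024-01-01",
     "bloom": {"a@b.com", "c@d.com"}},
    {"path": "dt=2024-01-01/b.parquet", "dt": "2024-01-01",
     "bloom": {"e@f.com"}},
    {"path": "dt=2024-01-02/c.parquet", "dt": "2024-01-02",
     "bloom": {"a@b.com"}},
]

def files_to_scan(dt, email):
    survivors = [f for f in files if f["dt"] == dt]            # 1. partition prune
    survivors = [f for f in survivors if email in f["bloom"]]  # 2. Bloom filter test
    return [f["path"] for f in survivors]                      # 3. scan only these

print(files_to_scan("2024-01-01", "a@b.com"))   # ['dt=2024-01-01/a.parquet']
```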

Sizing & math (handy formulas)

Let:

  • n = expected items,
  • m = bits in the filter,
  • k = number of hash functions,
  • p = false positive rate.

False-positive rate (approx.):
p ≈ (1 - e^(-k·n/m))^k

Optimal k for given m,n:
k* = (m/n) · ln 2

Bits needed for target p:
m ≈ - (n · ln p) / (ln 2)^2

Quick sizing example
Target: n = 10,000,000, p = 1% (0.01)

  • Bits per item ≈ -ln(0.01)/(ln 2)^2 ≈ 9.6, so m ≈ 96M bits ≈ 12 MB
  • k* ≈ (m/n)·ln 2 ≈ 9.6 · 0.693 ≈ 6.6 → use k = 7
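The formulas and the sizing example translate directly into Python (`math.log` is the natural log):

```python
import math

def bloom_size(n: int, p: float):
    """Return (m_bits, k) for n expected items and target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m ≈ -n·ln p / (ln 2)^2
    k = round(m / n * math.log(2))                      # k* = (m/n)·ln 2
    return m, k

m, k = bloom_size(10_000_000, 0.01)
print(m / 10_000_000)     # ≈ 9.6 bits per item
print(m / 8 / 1_000_000)  # ≈ 12 MB
print(k)                  # 7
```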

Variants & cousins

  • Counting Bloom Filter: replaces bits with small counters → supports deletes (extra memory).
  • Scalable Bloom Filter: grows in stages to keep p bounded as n grows.
  • Blocked/Partitioned Bloom: better CPU cache locality (group bits by block).
  • Cuckoo / XOR filters: alternative structures; support deletes and often lower p at similar sizes (implementation dependent).

Limitations & gotchas

  • Can’t list elements or get counts (only “maybe/definitely not”).
  • As the filter fills beyond its planned capacity (actual n ≫ expected n), p climbs; rebuild the filter or use a scalable variant.
  • Requires good hash functions; poor hashing = high collision = higher p.
  • For tiny sets, a sorted array + binary search can be simpler/faster.

Minimal pseudocode

init(m, k): bitset[0..m-1] = 0

insert(x):
  for i in 1..k:
    j = h_i(x) mod m
    bitset[j] = 1

mightContain(x):
  for i in 1..k:
    j = h_i(x) mod m
    if bitset[j] == 0: return false   // definitely not
  return true                         // possibly yes
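A runnable Python rendering of the pseudocode. Deriving the k indices from two base hashes (double hashing) is an assumption here, a common trick rather than part of the pseudocode:

```python
import hashlib

def _two_hashes(x: str):
    # Split one SHA-256 digest into two 64-bit base hashes.
    d = hashlib.sha256(x.encode()).digest()
    return int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big")

def init(m, k):
    return {"m": m, "k": k, "bits": [0] * m}

def insert(bf, x):
    h1, h2 = _two_hashes(x)
    for i in range(bf["k"]):
        j = (h1 + i * h2) % bf["m"]   # h_i(x) via double hashing
        bf["bits"][j] = 1

def might_contain(bf, x):
    h1, h2 = _two_hashes(x)
    for i in range(bf["k"]):
        if bf["bits"][(h1 + i * h2) % bf["m"]] == 0:
            return False              # definitely not present
    return True                       # possibly present

bf = init(m=1024, k=7)
insert(bf, "ana")
print(might_contain(bf, "ana"))   # True
```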