0) Setup (Example Tables)
We’ll use small tables to illustrate:
-- Orders(A)
A(order_id, customer_id, amount)
-- Customers(B)
B(customer_id, name)Sample rows:
-
A: (1, 10, 50), (2, 20, 20), (3, 10, 30)
-
B: (10, ‘Ana’), (20, ‘Ben’)
Goal:SELECT * FROM A JOIN B ON A.customer_id = B.customer_id;
1) Nested Loop Join (NLJ)
How it works
For each row in the outer table, scan the inner table and test the join predicate.
-
Simple NLJ: scan inner table row-by-row each time.
-
Index Nested Loop (INLJ): if inner has an index on join key, do an index lookup per outer row.
-
Block Nested Loop (BNLJ): read a block/batch of outer rows, then scan inner once per block (fewer passes).
Example (INLJ)
-
Take A(1,10,50) → index lookup in B on
customer_id=10→ match B(10,‘Ana). -
Take A(2,20,20) → lookup
20→ B(20,‘Ben). -
Take A(3,10,30) → lookup
10→ B(10,‘Ana).
Pros / Cons
-
✅ Best when outer is small and inner has an index on the join key.
-
✅ Supports any predicate (equality or inequality).
-
❌ Quadratic if no index: ~O(|A|×|B|).
-
❌ Can thrash I/O if outer is large and inner not indexed.
Rules of Thumb
-
Use INLJ for OLTP-ish workloads: many point lookups, selective outer.
-
Use BNLJ when memory allows block buffering but no useful index exists.
2) Hash Join (HJ)
How it works
-
Choose the smaller input as build side; hash it into in-memory buckets by the join key.
-
Probe: scan the larger side; for each row, compute hash(key) and match in buckets.
Example
-
Build hash table from B:
bucket[10] → (‘Ana’), bucket[20] → (‘Ben’) -
Probe A:
-
A(1,10,50) → bucket[10] → match ‘Ana’
-
A(2,20,20) → bucket[20] → match ‘Ben’
-
A(3,10,30) → bucket[10] → match ‘Ana’
-
Pros / Cons
-
✅ Linear time ~O(|A|+|B|) for equality joins.
-
✅ Great for large joins; doesn’t need pre-sorted data.
-
❌ Needs memory for hash table; may spill to disk (Grace/Hybrid Hash Join).
-
❌ Primarily for equality predicates (theta-joins harder).
Rules of Thumb
-
Default for big equality joins.
-
Make sure build side is small enough (or partition/spill gracefully).
-
Watch for skew: hot keys → unbalanced buckets.
3) Merge (Sort-Merge) Join (SMJ)
How it works
-
Ensure both inputs are sorted on join key (via index order or explicit sort).
-
Walk the two sorted streams in lockstep, joining matching keys; advance the side with the smaller key.
Example (sorted by customer_id)
-
A keys: [10, 10, 20] ; B keys: [10, 20]
-
Compare heads:
-
10==10 → output (A(1),B(10)), then also (A(3),B(10)) due to duplicate 10.
-
Advance sides where equal exhausted, then 20==20 → output (A(2),B(20)).
-
Pros / Cons
-
✅ Excellent for very large datasets when inputs are already sorted (e.g., clustered index, sorted ETL).
-
✅ Supports equality well; with some variants can handle range joins efficiently on ordered keys.
-
❌ If not sorted, you pay sort cost first.
-
❌ Handling many duplicates can require buffering groups.
Rules of Thumb
-
Use when sorted inputs are available “for free” (index order, prior step).
-
Good alternative to hash join when memory is tight but sequential I/O is cheap.
4) Choosing the Algorithm (Single-Node)
Quick decision tree
-
Equality join, large inputs:
-
Inputs not sorted → Hash Join.
-
Inputs already sorted or cheap to sort → Merge Join.
-
-
Outer small, inner indexed → Index Nested Loop.
-
Non-equality (e.g.,
<,BETWEEN):- Merge Join (on sorted keys) or Nested Loop (often with range index scans).
-
Memory-bound:
- Prefer Merge Join (sequential) or BNLJ if no order; avoid huge hash tables.
5) Distributed Systems (MPP, Spark/Trino/StarRocks)
-
Broadcast Join: send small table to all workers; local join with big table partition.
- ✅ Simple, fast for small build side. ❌ Fails if “small” isn’t actually small.
-
Shuffle (Repartition) Hash Join: hash-partition both tables by join key so matching keys co-locate.
- ✅ Scales to large inputs. ❌ Network shuffle cost; key skew hurts.
-
Co-located Join: if both inputs are already partitioned by the same key in the same way, no shuffle needed.
- ✅ Best case at scale.
-
Skew handling: salting hot keys, two-phase joins, or adaptive execution (split heavy buckets).
6) Cost & Complexity (High Level)
-
INLJ: O(|A| × log|B|) with index; O(|A|×|B|) without. Great if |A| is small & index exists.
-
BNLJ: reduces I/O passes with buffering; still worst-case quadratic on rows.
-
Hash Join: ~O(|A|+|B|) for eq-joins; memory/spill sensitive.
-
Merge Join: ~O(|A|+|B|) if already sorted; else add sort cost O(|A|log|A| + |B|log|B|).
7) Practical Tips
-
Pick the smaller input as build (hash), or broadcast the smaller input (MPP).
-
Keep stats fresh (cardinality, NDV) so the optimizer doesn’t pick a bad plan.
-
Use appropriate indexes:
-
Point/range predicates → B-tree indexes help (INLJ/merge preconditions).
-
Equality heavy + large data → Hash Join typically wins.
-
-
Handle skew early: identify hot keys; consider salting or split strategies.
-
Push filters down before joins to shrink inputs.
-
Use covering indexes to make NLJ much faster (avoid extra heap lookups).
8) Worked Micro-Examples
8.1 Index Nested Loop
-- Index on B(customer_id)
SELECT /*+ USE_NLJ */ A.*, B.*
FROM A JOIN B ON A.customer_id = B.customer_id;
-- For each row in A, do an index lookup into B.8.2 Hash Join
-- Equality join, large inputs
SELECT /*+ USE_HASH */ A.*, B.*
FROM A JOIN B ON A.customer_id = B.customer_id;
-- Build hash on smaller input (say B), then probe with A.8.3 Merge Join
-- Inputs already sorted (e.g., via clustered index or ORDER BY in pipeline)
SELECT /*+ USE_MERGE */ A.*, B.*
FROM (SELECT * FROM A ORDER BY customer_id) A
JOIN (SELECT * FROM B ORDER BY customer_id) B
ON A.customer_id = B.customer_id;
-- Walk both sorted streams together.9) Cheat-Sheet (One-liners)
-
INLJ: small outer + index on inner → great. No index → avoid.
-
BNLJ: no index but decent memory → batch the outer to reduce inner scans.
-
Hash: default for big equality joins; watch memory and skew.
-
Merge: best when inputs are already sorted or sorts are cheap; memory friendly and sequential.