0) Setup (Example Tables)

We’ll use small tables to illustrate:

-- Orders(A)
A(order_id, customer_id, amount)
-- Customers(B)
B(customer_id, name)

Sample rows:

  • A: (1, 10, 50), (2, 20, 20), (3, 10, 30)

  • B: (10, ‘Ana’), (20, ‘Ben’)
    Goal: SELECT * FROM A JOIN B ON A.customer_id = B.customer_id;


1) Nested Loop Join (NLJ)

How it works

For each row in the outer table, scan the inner table and test the join predicate.

  • Simple NLJ: scan inner table row-by-row each time.

  • Index Nested Loop (INLJ): if inner has an index on join key, do an index lookup per outer row.

  • Block Nested Loop (BNLJ): read a block/batch of outer rows, then scan inner once per block (fewer passes).

Example (INLJ)

  1. Take A(1,10,50) → index lookup in B on customer_id=10 → match B(10,‘Ana).

  2. Take A(2,20,20) → lookup 20 → B(20,‘Ben).

  3. Take A(3,10,30) → lookup 10 → B(10,‘Ana).

Pros / Cons

  • ✅ Best when outer is small and inner has an index on the join key.

  • ✅ Supports any predicate (equality or inequality).

  • Quadratic if no index: ~O(|A|×|B|).

  • ❌ Can thrash I/O if outer is large and inner not indexed.

Rules of Thumb

  • Use INLJ for OLTP-ish workloads: many point lookups, selective outer.

  • Use BNLJ when memory allows block buffering but no useful index exists.


2) Hash Join (HJ)

How it works

  1. Choose the smaller input as build side; hash it into in-memory buckets by the join key.

  2. Probe: scan the larger side; for each row, compute hash(key) and match in buckets.

Example

  • Build hash table from B:
    bucket[10] → (‘Ana’), bucket[20] → (‘Ben’)

  • Probe A:

    • A(1,10,50) → bucket[10] → match ‘Ana’

    • A(2,20,20) → bucket[20] → match ‘Ben’

    • A(3,10,30) → bucket[10] → match ‘Ana’

Pros / Cons

  • Linear time ~O(|A|+|B|) for equality joins.

  • ✅ Great for large joins; doesn’t need pre-sorted data.

  • ❌ Needs memory for hash table; may spill to disk (Grace/Hybrid Hash Join).

  • ❌ Primarily for equality predicates (theta-joins harder).

Rules of Thumb

  • Default for big equality joins.

  • Make sure build side is small enough (or partition/spill gracefully).

  • Watch for skew: hot keys → unbalanced buckets.


3) Merge (Sort-Merge) Join (SMJ)

How it works

  1. Ensure both inputs are sorted on join key (via index order or explicit sort).

  2. Walk the two sorted streams in lockstep, joining matching keys; advance the side with the smaller key.

Example (sorted by customer_id)

  • A keys: [10, 10, 20] ; B keys: [10, 20]

  • Compare heads:

    • 10==10 → output (A(1),B(10)), then also (A(3),B(10)) due to duplicate 10.

    • Advance sides where equal exhausted, then 20==20 → output (A(2),B(20)).

Pros / Cons

  • ✅ Excellent for very large datasets when inputs are already sorted (e.g., clustered index, sorted ETL).

  • ✅ Supports equality well; with some variants can handle range joins efficiently on ordered keys.

  • ❌ If not sorted, you pay sort cost first.

  • ❌ Handling many duplicates can require buffering groups.

Rules of Thumb

  • Use when sorted inputs are available “for free” (index order, prior step).

  • Good alternative to hash join when memory is tight but sequential I/O is cheap.


4) Choosing the Algorithm (Single-Node)

Quick decision tree

  • Equality join, large inputs:

    • Inputs not sorted → Hash Join.

    • Inputs already sorted or cheap to sort → Merge Join.

  • Outer small, inner indexedIndex Nested Loop.

  • Non-equality (e.g., <, BETWEEN):

    • Merge Join (on sorted keys) or Nested Loop (often with range index scans).
  • Memory-bound:

    • Prefer Merge Join (sequential) or BNLJ if no order; avoid huge hash tables.

5) Distributed Systems (MPP, Spark/Trino/StarRocks)

  • Broadcast Join: send small table to all workers; local join with big table partition.

    • ✅ Simple, fast for small build side. ❌ Fails if “small” isn’t actually small.
  • Shuffle (Repartition) Hash Join: hash-partition both tables by join key so matching keys co-locate.

    • ✅ Scales to large inputs. ❌ Network shuffle cost; key skew hurts.
  • Co-located Join: if both inputs are already partitioned by the same key in the same way, no shuffle needed.

    • ✅ Best case at scale.
  • Skew handling: salting hot keys, two-phase joins, or adaptive execution (split heavy buckets).


6) Cost & Complexity (High Level)

  • INLJ: O(|A| × log|B|) with index; O(|A|×|B|) without. Great if |A| is small & index exists.

  • BNLJ: reduces I/O passes with buffering; still worst-case quadratic on rows.

  • Hash Join: ~O(|A|+|B|) for eq-joins; memory/spill sensitive.

  • Merge Join: ~O(|A|+|B|) if already sorted; else add sort cost O(|A|log|A| + |B|log|B|).


7) Practical Tips

  • Pick the smaller input as build (hash), or broadcast the smaller input (MPP).

  • Keep stats fresh (cardinality, NDV) so the optimizer doesn’t pick a bad plan.

  • Use appropriate indexes:

    • Point/range predicates → B-tree indexes help (INLJ/merge preconditions).

    • Equality heavy + large data → Hash Join typically wins.

  • Handle skew early: identify hot keys; consider salting or split strategies.

  • Push filters down before joins to shrink inputs.

  • Use covering indexes to make NLJ much faster (avoid extra heap lookups).


8) Worked Micro-Examples

8.1 Index Nested Loop

-- Index on B(customer_id)
SELECT /*+ USE_NLJ */ A.*, B.*
FROM A JOIN B ON A.customer_id = B.customer_id;
-- For each row in A, do an index lookup into B.

8.2 Hash Join

-- Equality join, large inputs
SELECT /*+ USE_HASH */ A.*, B.*
FROM A JOIN B ON A.customer_id = B.customer_id;
-- Build hash on smaller input (say B), then probe with A.

8.3 Merge Join

-- Inputs already sorted (e.g., via clustered index or ORDER BY in pipeline)
SELECT /*+ USE_MERGE */ A.*, B.*
FROM (SELECT * FROM A ORDER BY customer_id) A
JOIN (SELECT * FROM B ORDER BY customer_id) B
  ON A.customer_id = B.customer_id;
-- Walk both sorted streams together.

9) Cheat-Sheet (One-liners)

  • INLJ: small outer + index on inner → great. No index → avoid.

  • BNLJ: no index but decent memory → batch the outer to reduce inner scans.

  • Hash: default for big equality joins; watch memory and skew.

  • Merge: best when inputs are already sorted or sorts are cheap; memory friendly and sequential.