MergeTree is the foundational table engine family in ClickHouse: columnar, append-most, sorted by a key, and continuously merging small parts into bigger ones. It gives fast range scans via a sparse primary index, powerful pruning, and high ingest.

https://www.youtube.com/watch?v=U_VGke_0-Dk


1) Big Picture

  • Columnar files on disk, sorted by an ORDER BY key (often called the sorting key).
  • Data is written as many small parts; background threads merge parts to form fewer, larger parts.
  • Fast reads via data skipping: a sparse primary index + per-granule stats tell the engine which ranges to skip.
  • Family members (e.g., ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, ReplicatedMergeTree) add specialized merge semantics.

2) Core Storage Concepts

2.1 Partitions, Parts, Granules

  • PARTITION BY (optional): groups rows (e.g., by month) so merges happen within each partition.
  • Part: an immutable bundle of column files + metadata for a partition slice.
  • Granule: the minimal read unit (~index granularity rows, default 8192). Each granule has:
    • compressed column pages,
    • a mark entry (offset) in .mrk2 files,
    • min/max and other stats (used for skipping).

2.2 Sorting & Primary Key

  • ORDER BY defines on-disk sort order.
  • PRIMARY KEY (optional) defaults to ORDER BY, but can be its prefix (reduces RAM for the in-memory index).
  • The “primary index” is sparse: it stores key values at granule boundaries, not per row.

2.3 On-disk Layout (per Part)

  • One file per column (compressed), plus .mrk2 (marks) and metadata.
  • Per-column codecs: LZ4 (default), ZSTD, plus column encodings like Delta, DoubleDelta, Gorilla, etc.
  • Part levels reflect merge depth (new merges → higher level).

3) Write Path (Ingest → Parts → Merges)

  1. INSERT creates a new part (or several) with sorted rows (by ORDER BY).
  2. Background merge selector picks compatible parts (same partition) and merges them:
    • Concatenate in sort order,
    • Rebuild marks & stats,
    • Apply engine-specific logic (e.g., dedup in ReplacingMergeTree, sum in SummingMergeTree).
  3. TTL actions (delete/move) and mutations may be applied during merges.
  4. Old parts become Outdated/Deleting, then removed.

Many tiny parts → poor pruning & overhead. Healthy tables maintain few, large parts per partition.


4) Read Path (Pruning → Vectorized Scan)

Given a query with predicates:

  1. Partition pruning: skip partitions via PARTITION BY expression.
  2. Primary-index range selection: compare predicates with sparse index → pick relevant granule ranges.
  3. Secondary skip indexes (optional; see §6) further reduce ranges.
  4. Vectorized scan: read only needed columns, apply filters/aggregations with SIMD.

5) Family Variants (When to Use)

  • MergeTree: vanilla, no special merge rules.
  • ReplacingMergeTree(ver): deduplicate equal sorting-key rows, keeping the latest by ver (or arbitrary latest if ver omitted).
    • Use for upserts / late data corrections (best when data eventually converges).
  • SummingMergeTree(cols): during merges, sum listed numeric columns for rows with identical sorting keys.
    • Use for pre-aggregated rollups (beware of partial-row semantics).
  • AggregatingMergeTree: stores aggregate function states (e.g., sumState, uniqCombinedState); merges combine states.
    • Use with Materialized Views for OLAP cubes.
  • CollapsingMergeTree(Sign): pairs of +1/-1 rows with same key collapse during merge (delete vs insert semantics).
    • Use for event sourcing patterns.
  • VersionedCollapsingMergeTree(Sign, Version): like Collapsing but resolves conflicts via a version column.
  • GraphiteMergeTree: retention & rollup policies for Graphite metrics.
  • ReplicatedMergeTree: any of the above with replication (via ClickHouse-Keeper/ZooKeeper); adds deduplication of inserts per partition via block IDs.

6) Data-Skipping Indexes (Secondary)

Add optional skip indexes to prune more granules beyond the primary key:

-- Typical examples
INDEX minmax_price price TYPE minmax GRANULARITY 1;
INDEX bf_email email TYPE bloom_filter GRANULARITY 64;
INDEX set_country country TYPE set(1000) GRANULARITY 1;
INDEX ngram_title title TYPE ngrambf_v1(3, 256, 2) GRANULARITY 2;
  • minmax: per-granule min/max, great for ranges.

  • bloom_filter: probabilistic membership for equality / IN filters on high-cardinality columns.

  • set(N): exact small set per granule (for low-cardinality values).

  • tokenbf_v1 / ngrambf_v1: text token/sketch filters.

Set GRANULARITY thoughtfully: fewer = more index entries & RAM, more pruning potential.


7) Mutations (ALTER UPDATE/DELETE)

  • ClickHouse doesn’t update in place. ALTER TABLE … UPDATE/DELETE creates mutations that rewrite parts in the background.

  • Mutations are applied during merges or as dedicated rewrites; track in system.mutations.

  • Heavy, frequent mutations on huge partitions can backlog merges → consider partitioning and MV-based corrections.


8) TTLs, Volumes & Storage Policies

  • Row TTL: delete rows when condition expires.

  • Column TTL: null/expire specific columns over time (cost-saving for cold attributes).

  • MOVE TTL: move parts to storage volumes (hot → warm → cold) via storage policies.

TTL toDate(event_time) + INTERVAL 90 DAY DELETE,
    toDate(event_time) + INTERVAL 30 DAY TO VOLUME 'warm';
  • Storage policies define volumes mapped to disks (NVMe/SSD/HDD/object store). Merges respect target volumes.

9) Sampling (Approximate Queries)

  • SAMPLE BY defines a sampling key (usually a hash of a stable ID).

  • Queries can use SAMPLE 1/10 or SAMPLE 10 (percent) for approximate results with uniform coverage.

CREATE TABLE t
(
  id UInt64,
  ts DateTime,
  ...
)
ENGINE = MergeTree
ORDER BY (ts, id)
SAMPLE BY cityHash64(id);

10) DDL Templates & Patterns

10.1 Vanilla MergeTree with pruning & TTL

CREATE TABLE events
(
  event_date Date,
  event_time DateTime,
  user_id UInt64,
  country LowCardinality(String),
  amount Float64,
  -- skip indexes
  INDEX minmax_amount amount TYPE minmax GRANULARITY 1,
  INDEX bf_user user_id TYPE bloom_filter GRANULARITY 64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)         -- sorting/primary key
SETTINGS index_granularity = 8192
TTL event_date + INTERVAL 90 DAY DELETE;

10.2 ReplacingMergeTree for upserts

CREATE TABLE users
(
  id UInt64,
  name String,
  updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (id);
-- Latest row per id wins during merges

10.3 AggregatingMergeTree with MV

CREATE TABLE sales_agg
(
  key Tuple(country String, bucket Date),
  sum_amount AggregateFunction(sum, Float64),
  uniq_users AggregateFunction(uniqCombined, UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY (bucket)
ORDER BY (key);
 
-- Materialized view to feed states
CREATE MATERIALIZED VIEW sales_mv
TO sales_agg AS
SELECT
  (country, toStartOfDay(ts)) AS key,
  sumState(amount) AS sum_amount,
  uniqCombinedState(user_id) AS uniq_users
FROM sales
GROUP BY key;

11) Observability & Maintenance

System tables

  • system.parts — parts state, rows, bytes, levels, partition_id.

  • system.merges — running merges.

  • system.mutations — mutation queues/progress, failures.

  • system.detached_parts — orphaned/detached parts (manual ops, failures).

  • system.replicas — (replicated) lag, queue length, errors.

Useful ops

  • OPTIMIZE TABLE ... FINAL — force a full merge per partition (use sparingly; heavy).

  • ALTER TABLE ... FREEZE — create backups (hard links).

  • SYSTEM STOP/START MERGES <table> — maintenance windows.

  • SYSTEM FLUSH LOGS — flush query/system logs.

Red flags

  • Many small parts (per partition).

  • Large backlog in system.mutations or system.merges.

  • Insufficient pruning (high “rows read” vs “rows matched”).

  • Excessive large merges during peak ingest (tune threads/limits or partitioning).


12) Primary Key Design & Tuning Tips

  • Put high-selectivity, commonly filtered columns early in ORDER BY to maximize pruning.

  • Avoid overly wide ORDER BY (primary index is cached in RAM at marks).

  • Choose partitioning that matches retention & heavy filters (e.g., by month/day).

  • Keep index granularity near defaults unless you have strong reasons:

    • Smaller granularity → more marks → more RAM/index & better pruning.

    • Larger granularity → fewer marks → less RAM & potentially more reads.

  • Control ingest: batch rows into reasonably sized parts (MBs–hundreds of MBs), not thousands of tiny parts.


13) Replication (ReplicatedMergeTree)

  • Same MergeTree semantics but with replication and insert dedup per partition by block ID.

  • Automatic part fetch/consistency via Keeper; merges coordinated loosely per replica.

  • Monitor system.replicas (queue size, log pointer lag, errors).


14) Common Pitfalls

  • Relying on ReplacingMergeTree to fix heavy in-flight duplicates without a proper version → non-deterministic winners.

  • Overusing mutations for massive backfills/rewrites → merge backlog; consider MV pipelines or reload.

  • Too fine granularity or too many skip indexes → RAM overhead, diminishing returns.

  • Tiny files (too many small parts) → metadata & open-file pressure; fix ingest batch size or OPTIMIZE off-hours.

  • Mismatch between partitioning and query filters → poor pruning and slow scans.


15) Quick Glossary

  • Granule: ~index_granularity rows; minimal read/prune unit.

  • Mark: pointer/offset to the start of a granule in column files.

  • Part: immutable set of column files + marks for a partition slice.

  • Level: how many merges created a part (higher = more merged).

  • Mutation: background rewrite triggered by ALTER ... UPDATE/DELETE.

  • Skip index: auxiliary metadata per granule for pruning (minmax/bloom/set/ngram).


16) Sanity-Check Queries

-- See pruning effectiveness for a query
EXPLAIN indexes = 1
SELECT ... FROM events WHERE event_date >= today()-7 AND user_id = 42;
 
-- Parts overview
SELECT partition_id, count() parts, sum(rows) rows, sum(bytes_on_disk) bytes
FROM system.parts WHERE table = 'events' AND active
GROUP BY partition_id ORDER BY partition_id;
 
-- Merges & mutations
SELECT * FROM system.merges WHERE table = 'events';
SELECT database, table, mutation_id, command, parts_to_do, parts_done, is_done
FROM system.mutations WHERE table = 'events' ORDER BY create_time DESC;