MergeTree is the foundational table engine family in ClickHouse: columnar, append-most, sorted by a key, and continuously merging small parts into bigger ones. It gives fast range scans via a sparse primary index, powerful pruning, and high ingest.
https://www.youtube.com/watch?v=U_VGke_0-Dk
1) Big Picture
- Columnar files on disk, sorted by an ORDER BY key (often called the sorting key).
- Data is written as many small parts; background threads merge parts to form fewer, larger parts.
- Fast reads via data skipping: a sparse primary index + per-granule stats tell the engine which ranges to skip.
- Family members (e.g., ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, ReplicatedMergeTree) add specialized merge semantics.
2) Core Storage Concepts
2.1 Partitions, Parts, Granules
- PARTITION BY (optional): groups rows (e.g., by month) so merges happen within each partition.
- Part: an immutable bundle of column files + metadata for a partition slice.
- Granule: the minimal read unit (~index_granularity rows, default 8192). Each granule has:
  - compressed column pages,
  - a mark entry (offset) in .mrk2 files,
  - min/max and other stats (used for skipping).
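Granule sizing can be inspected directly: the marks count in system.parts is roughly the number of granules in a part. A sketch, assuming a table named events:

```sql
-- Approximate rows per granule for each active part of a hypothetical 'events' table.
SELECT
    name,
    rows,
    marks,
    round(rows / marks) AS approx_rows_per_granule
FROM system.parts
WHERE table = 'events' AND active;
```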
2.2 Sorting & Primary Key
- ORDER BY defines the on-disk sort order.
- PRIMARY KEY (optional) defaults to ORDER BY, but can be a prefix of it (reduces RAM for the in-memory index).
- The “primary index” is sparse: it stores key values at granule boundaries, not per row.
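A minimal sketch of a PRIMARY KEY that is a prefix of ORDER BY (table and column names are illustrative):

```sql
CREATE TABLE hits
(
    site_id UInt32,
    ts DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY (site_id, ts, url)   -- full on-disk sort order
PRIMARY KEY (site_id, ts);    -- sparse in-memory index covers only this prefix
```

Trailing ORDER BY columns still improve locality and compression; only the prefix is held in the in-memory index.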
2.3 On-disk Layout (per Part)
- One file per column (compressed), plus .mrk2 (marks) and metadata.
- Per-column codecs: LZ4 (default), ZSTD, plus column encodings like Delta, DoubleDelta, Gorilla, etc.
- Part levels reflect merge depth (new merges → higher level).
3) Write Path (Ingest → Parts → Merges)
- INSERT creates a new part (or several) with rows sorted by ORDER BY.
- A background merge selector picks compatible parts (same partition) and merges them:
  - concatenate in sort order,
  - rebuild marks & stats,
  - apply engine-specific logic (e.g., dedup in ReplacingMergeTree, sums in SummingMergeTree).
- TTL actions (delete/move) and mutations may be applied during merges.
- Old parts become Outdated/Deleting, then removed.
Many tiny parts → poor pruning & overhead. Healthy tables maintain few, large parts per partition.
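When clients cannot batch on their side, server-side batching can keep part counts down. A sketch using ClickHouse's async insert settings (the events table and values are illustrative):

```sql
-- Buffer many small client inserts server-side so they land as fewer, larger parts.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01', '2024-01-01 12:00:00', 42, 'US', 9.99);
```

With wait_for_async_insert = 1 the client blocks until the buffered batch is flushed; setting it to 0 acknowledges sooner at the cost of durability guarantees.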
4) Read Path (Pruning → Vectorized Scan)
Given a query with predicates:
- Partition pruning: skip partitions via the PARTITION BY expression.
- Primary-index range selection: compare predicates with the sparse index → pick relevant granule ranges.
- Secondary skip indexes (optional; see §6) further reduce ranges.
- Vectorized scan: read only needed columns, apply filters/aggregations with SIMD.
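How much survives pruning can be estimated before running the query; a sketch with EXPLAIN ESTIMATE (assumes the events table from the templates in this document):

```sql
-- Estimated parts, rows, and marks to be read after partition and index pruning.
EXPLAIN ESTIMATE
SELECT count()
FROM events
WHERE event_date >= today() - 7;
```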
5) Family Variants (When to Use)
- MergeTree: vanilla, no special merge rules.
- ReplacingMergeTree(ver): deduplicates rows with equal sorting keys, keeping the latest by ver (or an effectively arbitrary row if ver is omitted).
  - Use for upserts / late data corrections (best when data eventually converges).
- SummingMergeTree(cols): during merges, sum listed numeric columns for rows with identical sorting keys.
- Use for pre-aggregated rollups (beware of partial-row semantics).
- AggregatingMergeTree: stores aggregate function states (e.g., sumState, uniqCombinedState); merges combine states.
  - Use with Materialized Views for OLAP cubes.
- CollapsingMergeTree(Sign): pairs of +1/-1 rows with same key collapse during merge (delete vs insert semantics).
- Use for event sourcing patterns.
- VersionedCollapsingMergeTree(Sign, Version): like Collapsing but resolves conflicts via a version column.
- GraphiteMergeTree: retention & rollup policies for Graphite metrics.
- ReplicatedMergeTree: any of the above with replication (via ClickHouse-Keeper/ZooKeeper); adds deduplication of inserts per partition via block IDs.
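Note that dedup/collapse semantics only apply when merges actually run; before that, duplicates are still visible. Reads that need collapsed results can force merge semantics with FINAL. A sketch, assuming a ReplacingMergeTree table users keyed by id:

```sql
-- FINAL applies the engine's merge logic at read time (duplicate-free, but slower).
SELECT *
FROM users FINAL
WHERE id = 42;
```

For hot paths, argMax-style queries over a version column are often cheaper than FINAL.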
6) Data-Skipping Indexes (Secondary)
Add optional skip indexes to prune more granules beyond the primary key:
-- Typical examples (declared inside CREATE TABLE, comma-separated)
INDEX minmax_price price TYPE minmax GRANULARITY 1,
INDEX bf_email email TYPE bloom_filter GRANULARITY 64,
INDEX set_country country TYPE set(1000) GRANULARITY 1,
INDEX ngram_title title TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 2
- minmax: per-granule min/max, great for ranges.
- bloom_filter: probabilistic membership for equality / IN filters on high-cardinality columns.
- set(N): exact small set per granule (for low-cardinality values).
- tokenbf_v1 / ngrambf_v1: token / n-gram Bloom filters for text.
Set GRANULARITY thoughtfully: it is the number of data granules covered by one index entry, so a lower value means more index entries (more RAM) but finer pruning potential.
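Skip indexes can also be added to an existing table; newly added indexes only cover new parts until materialized. A sketch, assuming an events table with a user_id column:

```sql
-- Add a skip index, then build it for parts written before the index existed.
ALTER TABLE events ADD INDEX bf_user user_id TYPE bloom_filter GRANULARITY 64;
ALTER TABLE events MATERIALIZE INDEX bf_user;
```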
7) Mutations (ALTER UPDATE/DELETE)
- ClickHouse doesn’t update in place: ALTER TABLE … UPDATE/DELETE creates mutations that rewrite parts in the background.
- Mutations are applied during merges or as dedicated rewrites; track them in system.mutations.
- Heavy, frequent mutations on huge partitions can backlog merges → consider partitioning and MV-based corrections.
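A minimal sketch of issuing a mutation and watching its progress (assumes an events table with an event_date column):

```sql
-- Schedules a background rewrite of every part containing matching rows.
ALTER TABLE events DELETE WHERE event_date < '2023-01-01';

-- Track outstanding mutations.
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'events' AND NOT is_done;
```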
8) TTLs, Volumes & Storage Policies
- Row TTL: delete rows when a condition expires.
- Column TTL: null/expire specific columns over time (cost-saving for cold attributes).
- MOVE TTL: move parts between storage volumes (hot → warm → cold) via storage policies.
TTL toDate(event_time) + INTERVAL 90 DAY DELETE,
    toDate(event_time) + INTERVAL 30 DAY TO VOLUME 'warm';
- Storage policies define volumes mapped to disks (NVMe/SSD/HDD/object store). Merges respect target volumes.
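Tiered storage is wired up by naming a storage policy in the table settings; a sketch, where the policy hot_cold and its volumes are assumed to be defined in the server's storage configuration:

```sql
CREATE TABLE cold_events
(
    event_time DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY event_time
TTL event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'hot_cold';
```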
9) Sampling (Approximate Queries)
- SAMPLE BY defines a sampling key (usually a hash of a stable ID); the sampling expression must also appear in the primary key.
- Queries can use SAMPLE 1/10 (or the equivalent SAMPLE 0.1) for approximate results with uniform coverage.
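A sampled aggregate is then scaled back up to estimate the full-table value; a sketch against a table t with a sampling key:

```sql
-- Approximate count over ~10% of the data, scaled back to the full table.
SELECT count() * 10 AS approx_total
FROM t
SAMPLE 1 / 10;
```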
CREATE TABLE t
(
id UInt64,
ts DateTime,
...
)
ENGINE = MergeTree
ORDER BY (ts, cityHash64(id))
SAMPLE BY cityHash64(id);
10) DDL Templates & Patterns
10.1 Vanilla MergeTree with pruning & TTL
CREATE TABLE events
(
event_date Date,
event_time DateTime,
user_id UInt64,
country LowCardinality(String),
amount Float64,
-- skip indexes
INDEX minmax_amount amount TYPE minmax GRANULARITY 1,
INDEX bf_user user_id TYPE bloom_filter GRANULARITY 64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id) -- sorting/primary key
SETTINGS index_granularity = 8192
TTL event_date + INTERVAL 90 DAY DELETE;
10.2 ReplacingMergeTree for upserts
CREATE TABLE users
(
id UInt64,
name String,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (id);
-- Latest row per id wins during merges
10.3 AggregatingMergeTree with MV
CREATE TABLE sales_agg
(
    country LowCardinality(String),
    bucket Date,
    sum_amount AggregateFunction(sum, Float64),
    uniq_users AggregateFunction(uniqCombined, UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(bucket)
ORDER BY (country, bucket);
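State columns cannot be read directly; queries must finalize them with -Merge combinators. A sketch:

```sql
-- Collapse stored aggregate states into final values.
SELECT
    sumMerge(sum_amount) AS total_amount,
    uniqCombinedMerge(uniq_users) AS total_users
FROM sales_agg;
```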
-- Materialized view to feed states
CREATE MATERIALIZED VIEW sales_mv
TO sales_agg AS
SELECT
    country,
    toDate(ts) AS bucket,
    sumState(amount) AS sum_amount,
    uniqCombinedState(user_id) AS uniq_users
FROM sales
GROUP BY country, bucket;
11) Observability & Maintenance
System tables
- system.parts — parts state, rows, bytes, levels, partition_id.
- system.merges — running merges.
- system.mutations — mutation queues/progress, failures.
- system.detached_parts — orphaned/detached parts (manual ops, failures).
- system.replicas — (replicated) lag, queue length, errors.
Useful ops
- OPTIMIZE TABLE ... FINAL — force a full merge per partition (use sparingly; heavy).
- ALTER TABLE ... FREEZE — create backups (hard links).
- SYSTEM STOP/START MERGES <table> — maintenance windows.
- SYSTEM FLUSH LOGS — flush query/system logs.
Red flags
- Many small parts (per partition).
- Large backlog in system.mutations or system.merges.
- Insufficient pruning (high “rows read” vs “rows matched”).
- Excessive large merges during peak ingest (tune threads/limits or partitioning).
12) Primary Key Design & Tuning Tips
- Put commonly filtered, selective columns early in ORDER BY to maximize pruning.
- Avoid an overly wide ORDER BY (the primary index is held in RAM, one entry per mark).
- Choose partitioning that matches retention & heavy filters (e.g., by month/day).
- Keep index granularity near the default unless you have strong reasons:
  - smaller granularity → more marks → more index RAM & better pruning,
  - larger granularity → fewer marks → less RAM & potentially more reads.
- Control ingest: batch rows into reasonably sized parts (MBs–hundreds of MBs), not thousands of tiny parts.
13) Replication (ReplicatedMergeTree)
- Same MergeTree semantics, plus replication and per-partition insert dedup by block ID.
- Automatic part fetch/consistency via Keeper; merges are coordinated loosely per replica.
- Monitor system.replicas (queue size, log pointer lag, errors).
14) Common Pitfalls
- Relying on ReplacingMergeTree to fix heavy in-flight duplicates without a proper version → non-deterministic winners.
- Overusing mutations for massive backfills/rewrites → merge backlog; consider MV pipelines or a reload.
- Too fine granularity or too many skip indexes → RAM overhead, diminishing returns.
- Tiny files (too many small parts) → metadata & open-file pressure; fix ingest batch size or OPTIMIZE off-hours.
- Mismatch between partitioning and query filters → poor pruning and slow scans.
15) Quick Glossary
- Granule: ~index_granularity rows; the minimal read/prune unit.
- Mark: pointer/offset to the start of a granule in column files.
- Part: immutable set of column files + marks for a partition slice.
- Level: how many merges created a part (higher = more merged).
- Mutation: background rewrite triggered by ALTER ... UPDATE/DELETE.
- Skip index: auxiliary metadata per granule for pruning (minmax/bloom/set/ngram).
16) Sanity-Check Queries
-- See pruning effectiveness for a query
EXPLAIN indexes = 1
SELECT ... FROM events WHERE event_date >= today()-7 AND user_id = 42;
-- Parts overview
SELECT partition_id, count() parts, sum(rows) rows, sum(bytes_on_disk) bytes
FROM system.parts WHERE table = 'events' AND active
GROUP BY partition_id ORDER BY partition_id;
-- Merges & mutations
SELECT * FROM system.merges WHERE table = 'events';
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations WHERE table = 'events' ORDER BY create_time DESC;