MergeTree is the foundational table engine family in ClickHouse: columnar, append-most, sorted by a key, and continuously merging small parts into bigger ones. It gives fast range scans via a sparse primary index, powerful pruning, and high ingest.
https://www.youtube.com/watch?v=U_VGke_0-Dk
1) Big Picture
- Columnar files on disk, sorted by an ORDER BY key (often called the sorting key).
- Data is written as many small parts; background threads merge parts to form fewer, larger parts.
- Fast reads via data skipping: a sparse primary index + per-granule stats tell the engine which ranges to skip.
- Family members (e.g., ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, ReplicatedMergeTree) add specialized merge semantics.
2) Core Storage Concepts
2.1 Partitions, Parts, Granules
- PARTITION BY (optional): groups rows (e.g., by month) so merges happen within each partition.
- Part: an immutable bundle of column files + metadata for a partition slice.
- Granule: the minimal read unit (~index_granularity rows, default 8192). Each granule has:
  - compressed column pages,
  - a mark entry (offset) in .mrk2 files,
  - min/max and other stats (used for skipping).
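Granule sizing can be inspected directly: the marks count in system.parts is roughly the number of granules in a part. A sketch, assuming a table named events:

```sql
-- Approximate rows per granule for each active part of a hypothetical 'events' table.
SELECT
    name,
    rows,
    marks,
    round(rows / marks) AS approx_rows_per_granule
FROM system.parts
WHERE table = 'events' AND active;
```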
2.2 Sorting & Primary Key
- ORDER BY defines the on-disk sort order.
- PRIMARY KEY (optional) defaults to ORDER BY, but can be a prefix of it (reduces RAM for the in-memory index).
- The “primary index” is sparse: it stores key values at granule boundaries, not per row.
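A minimal sketch of a PRIMARY KEY that is a prefix of ORDER BY (table and column names are illustrative):

```sql
CREATE TABLE hits
(
    site_id UInt32,
    ts DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY (site_id, ts, url)   -- full on-disk sort order
PRIMARY KEY (site_id, ts);    -- sparse in-memory index covers only this prefix
```

Trailing ORDER BY columns still improve locality and compression; only the prefix is held in the in-memory index.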
2.3 On-disk Layout (per Part)
- One file per column (compressed), plus .mrk2 (marks) and metadata.
- Per-column codecs: LZ4 (default), ZSTD, plus column encodings like Delta, DoubleDelta, Gorilla, etc.
- Part levels reflect merge depth (new merges → higher level).
3) Write Path (Ingest → Parts → Merges)
- INSERT creates a new part (or several) with rows sorted by ORDER BY.
- A background merge selector picks compatible parts (same partition) and merges them:
  - concatenate in sort order,
  - rebuild marks & stats,
  - apply engine-specific logic (e.g., dedup in ReplacingMergeTree, sums in SummingMergeTree).
- TTL actions (delete/move) and mutations may be applied during merges.
- Old parts become Outdated/Deleting, then removed.
Many tiny parts → poor pruning & overhead. Healthy tables maintain few, large parts per partition.
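When clients cannot batch on their side, server-side batching can keep part counts down. A sketch using ClickHouse's async insert settings (the events table and values are illustrative):

```sql
-- Buffer many small client inserts server-side so they land as fewer, larger parts.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01', '2024-01-01 12:00:00', 42, 'US', 9.99);
```

With wait_for_async_insert = 1 the client blocks until the buffered batch is flushed; setting it to 0 acknowledges sooner at the cost of durability guarantees.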
4) Read Path (Pruning → Vectorized Scan)
Given a query with predicates:
- Partition pruning: skip partitions via the PARTITION BY expression.
- Primary-index range selection: compare predicates with the sparse index → pick relevant granule ranges.
- Secondary skip indexes (optional; see §6) further reduce ranges.
- Vectorized scan: read only needed columns, apply filters/aggregations with SIMD.
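How much survives pruning can be estimated before running the query; a sketch with EXPLAIN ESTIMATE (assumes the events table from the templates in this document):

```sql
-- Estimated parts, rows, and marks to be read after partition and index pruning.
EXPLAIN ESTIMATE
SELECT count()
FROM events
WHERE event_date >= today() - 7;
```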
5) Family Variants (When to Use)
- MergeTree: vanilla, no special merge rules.
- ReplacingMergeTree(ver): deduplicates rows with equal sorting keys, keeping the latest by ver (or an effectively arbitrary row if ver is omitted).
  - Use for upserts / late data corrections (best when data eventually converges).
- SummingMergeTree(cols): during merges, sum listed numeric columns for rows with identical sorting keys.
- Use for pre-aggregated rollups (beware of partial-row semantics).
- AggregatingMergeTree: stores aggregate function states (e.g., sumState, uniqCombinedState); merges combine states.
  - Use with Materialized Views for OLAP cubes.
- CollapsingMergeTree(Sign): pairs of +1/-1 rows with same key collapse during merge (delete vs insert semantics).
- Use for event sourcing patterns.
- VersionedCollapsingMergeTree(Sign, Version): like Collapsing but resolves conflicts via a version column.
- GraphiteMergeTree: retention & rollup policies for Graphite metrics.
- ReplicatedMergeTree: any of the above with replication (via ClickHouse-Keeper/ZooKeeper); adds deduplication of inserts per partition via block IDs.
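Note that dedup/collapse semantics only apply when merges actually run; before that, duplicates are still visible. Reads that need collapsed results can force merge semantics with FINAL. A sketch, assuming a ReplacingMergeTree table users keyed by id:

```sql
-- FINAL applies the engine's merge logic at read time (duplicate-free, but slower).
SELECT *
FROM users FINAL
WHERE id = 42;
```

For hot paths, argMax-style queries over a version column are often cheaper than FINAL.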
6) Data-Skipping Indexes (Secondary)
Add optional skip indexes to prune more granules beyond the primary key:
-- Typical examples (declared inside CREATE TABLE, comma-separated)
INDEX minmax_price price TYPE minmax GRANULARITY 1,
INDEX bf_email email TYPE bloom_filter GRANULARITY 64,
INDEX set_country country TYPE set(1000) GRANULARITY 1,
INDEX ngram_title title TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 2
- minmax: per-granule min/max, great for ranges.
- bloom_filter: probabilistic membership for equality / IN filters on high-cardinality columns.
- set(N): exact small set per granule (for low-cardinality values).
- tokenbf_v1 / ngrambf_v1: token / n-gram Bloom filters for text.
Set GRANULARITY thoughtfully: it is the number of data granules covered by one index entry, so a lower value means more index entries (more RAM) but finer pruning potential.
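Skip indexes can also be added to an existing table; newly added indexes only cover new parts until materialized. A sketch, assuming an events table with a user_id column:

```sql
-- Add a skip index, then build it for parts written before the index existed.
ALTER TABLE events ADD INDEX bf_user user_id TYPE bloom_filter GRANULARITY 64;
ALTER TABLE events MATERIALIZE INDEX bf_user;
```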
7) Mutations (ALTER UPDATE/DELETE)
- ClickHouse doesn’t update in place: ALTER TABLE … UPDATE/DELETE creates mutations that rewrite parts in the background.
- Mutations are applied during merges or as dedicated rewrites; track them in system.mutations.
- Heavy, frequent mutations on huge partitions can backlog merges → consider partitioning and MV-based corrections.
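A minimal sketch of issuing a mutation and watching its progress (assumes an events table with an event_date column):

```sql
-- Schedules a background rewrite of every part containing matching rows.
ALTER TABLE events DELETE WHERE event_date < '2023-01-01';

-- Track outstanding mutations.
SELECT mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'events' AND NOT is_done;
```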
8) TTLs, Volumes & Storage Policies
- Row TTL: delete rows when a condition expires.
- Column TTL: null/expire specific columns over time (cost-saving for cold attributes).
- MOVE TTL: move parts between storage volumes (hot → warm → cold) via storage policies.
TTL toDate(event_time) + INTERVAL 90 DAY DELETE,
    toDate(event_time) + INTERVAL 30 DAY TO VOLUME 'warm';
- Storage policies define volumes mapped to disks (NVMe/SSD/HDD/object store). Merges respect target volumes.
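Tiered storage is wired up by naming a storage policy in the table settings; a sketch, where the policy hot_cold and its volumes are assumed to be defined in the server's storage configuration:

```sql
CREATE TABLE cold_events
(
    event_time DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY event_time
TTL event_time + INTERVAL 30 DAY TO VOLUME 'warm',
    event_time + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'hot_cold';
```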
9) Sampling (Approximate Queries)
- SAMPLE BY defines a sampling key (usually a hash of a stable ID); the sampling expression must also appear in the primary key.
- Queries can use SAMPLE 1/10 (or the equivalent SAMPLE 0.1) for approximate results with uniform coverage.
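A sampled aggregate is then scaled back up to estimate the full-table value; a sketch against a table t with a sampling key:

```sql
-- Approximate count over ~10% of the data, scaled back to the full table.
SELECT count() * 10 AS approx_total
FROM t
SAMPLE 1 / 10;
```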
CREATE TABLE t
(
id UInt64,
ts DateTime,
...
)
ENGINE = MergeTree
ORDER BY (ts, cityHash64(id))
SAMPLE BY cityHash64(id);
10) DDL Templates & Patterns
10.1 Vanilla MergeTree with pruning & TTL
CREATE TABLE events
(
event_date Date,
event_time DateTime,
user_id UInt64,
country LowCardinality(String),
amount Float64,
-- skip indexes
INDEX minmax_amount amount TYPE minmax GRANULARITY 1,
INDEX bf_user user_id TYPE bloom_filter GRANULARITY 64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id) -- sorting/primary key
SETTINGS index_granularity = 8192
TTL event_date + INTERVAL 90 DAY DELETE;
10.2 ReplacingMergeTree for upserts
CREATE TABLE users
(
id UInt64,
name String,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (id);
-- Latest row per id wins during merges
10.3 AggregatingMergeTree with MV
CREATE TABLE sales_agg
(
    country LowCardinality(String),
    bucket Date,
    sum_amount AggregateFunction(sum, Float64),
    uniq_users AggregateFunction(uniqCombined, UInt64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(bucket)
ORDER BY (country, bucket);
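State columns cannot be read directly; queries must finalize them with -Merge combinators. A sketch:

```sql
-- Collapse stored aggregate states into final values.
SELECT
    sumMerge(sum_amount) AS total_amount,
    uniqCombinedMerge(uniq_users) AS total_users
FROM sales_agg;
```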
-- Materialized view to feed states
CREATE MATERIALIZED VIEW sales_mv
TO sales_agg AS
SELECT
    country,
    toDate(ts) AS bucket,
    sumState(amount) AS sum_amount,
    uniqCombinedState(user_id) AS uniq_users
FROM sales
GROUP BY country, bucket;
11) Observability & Maintenance
System tables
- system.parts — parts state, rows, bytes, levels, partition_id.
- system.merges — running merges.
- system.mutations — mutation queues/progress, failures.
- system.detached_parts — orphaned/detached parts (manual ops, failures).
- system.replicas — (replicated) lag, queue length, errors.
Useful ops
- OPTIMIZE TABLE ... FINAL — force a full merge per partition (use sparingly; heavy).
- ALTER TABLE ... FREEZE — create backups (hard links).
- SYSTEM STOP/START MERGES <table> — maintenance windows.
- SYSTEM FLUSH LOGS — flush query/system logs.
Red flags
- Many small parts (per partition).
- Large backlog in system.mutations or system.merges.
- Insufficient pruning (high “rows read” vs “rows matched”).
- Excessive large merges during peak ingest (tune threads/limits or partitioning).
12) Primary Key Design & Tuning Tips
- Put commonly filtered, selective columns early in ORDER BY to maximize pruning.
- Avoid an overly wide ORDER BY (the primary index is held in RAM, one entry per mark).
- Choose partitioning that matches retention & heavy filters (e.g., by month/day).
- Keep index granularity near the default unless you have strong reasons:
  - smaller granularity → more marks → more index RAM & better pruning,
  - larger granularity → fewer marks → less RAM & potentially more reads.
- Control ingest: batch rows into reasonably sized parts (MBs–hundreds of MBs), not thousands of tiny parts.
13) Replication (ReplicatedMergeTree)
- Same MergeTree semantics, plus replication and per-partition insert dedup by block ID.
- Automatic part fetch/consistency via Keeper; merges are coordinated loosely per replica.
- Monitor system.replicas (queue size, log pointer lag, errors).
14) Common Pitfalls
- Relying on ReplacingMergeTree to fix heavy in-flight duplicates without a proper version → non-deterministic winners.
- Overusing mutations for massive backfills/rewrites → merge backlog; consider MV pipelines or a reload.
- Too fine granularity or too many skip indexes → RAM overhead, diminishing returns.
- Tiny files (too many small parts) → metadata & open-file pressure; fix ingest batch size or OPTIMIZE off-hours.
- Mismatch between partitioning and query filters → poor pruning and slow scans.
15) Quick Glossary
- Granule: ~index_granularity rows; the minimal read/prune unit.
- Mark: pointer/offset to the start of a granule in column files.
- Part: immutable set of column files + marks for a partition slice.
- Level: how many merges created a part (higher = more merged).
- Mutation: background rewrite triggered by ALTER ... UPDATE/DELETE.
- Skip index: auxiliary metadata per granule for pruning (minmax/bloom/set/ngram).
16) Sanity-Check Queries
-- See pruning effectiveness for a query
EXPLAIN indexes = 1
SELECT ... FROM events WHERE event_date >= today()-7 AND user_id = 42;
-- Parts overview
SELECT partition_id, count() parts, sum(rows) rows, sum(bytes_on_disk) bytes
FROM system.parts WHERE table = 'events' AND active
GROUP BY partition_id ORDER BY partition_id;
-- Merges & mutations
SELECT * FROM system.merges WHERE table = 'events';
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations WHERE table = 'events' ORDER BY create_time DESC;