Columnar = store each column contiguously on disk. Great for analytics: scan a few columns over many rows, compress aggressively, and skip irrelevant data.


1) Why Columnar?

  • Better compression: same-type, similar values sit together → RLE, dictionary, delta/frame-of-reference (FOR), and bit-packing encodings shine.
  • Less I/O: read only the columns you need.
  • Vectorized compute: operate on batches (vectors) using SIMD instructions for throughput.
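The compression benefit above comes from adjacency: a sorted or partitioned column has long runs of identical values, which run-length encoding collapses almost for free. A minimal sketch (data values are illustrative):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column: list of (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A column sorted by country has long runs of identical values.
country = ["SG"] * 500 + ["US"] * 300 + ["DE"] * 200
encoded = rle_encode(country)
print(encoded)  # [('SG', 500), ('US', 300), ('DE', 200)]
```

1,000 values collapse to three pairs; the same column in row-major order, interleaved with other fields, would show no such runs.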

2) Core Building Blocks

2.1 Physical layout

  • Columns split into chunks (segments / stripes / row groups / pages).
  • Each chunk carries statistics (e.g., min/max, NDV, null count), often Bloom filters too.
  • Files are immutable (append-only) and compacted/merged in the background.
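The chunk-plus-statistics layout can be sketched in a few lines. This is a toy model, not any real file format's API; real formats (Parquet, ORC) store the same kind of stats in footers/stripe indexes:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    values: list        # raw values in this chunk
    min_val: int        # per-chunk statistics used later for pruning
    max_val: int
    null_count: int

def chunk_column(values, chunk_size):
    """Split one column into fixed-size chunks, each carrying stats."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        present = [v for v in part if v is not None]
        chunks.append(Chunk(part, min(present), max(present),
                            part.count(None)))
    return chunks

chunks = chunk_column([5, 3, None, 9, 120, 110, 130, None], chunk_size=4)
print([(c.min_val, c.max_val, c.null_count) for c in chunks])
# [(3, 9, 1), (110, 130, 1)]
```

The stats are written once at chunk-creation time; because files are immutable, they never need invalidation.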

2.2 Data skipping (pruning)

  • Partition pruning: skip entire folders (e.g., dt=2025-08-31/).
  • File/segment pruning via zone maps (min/max) and Bloom filters.
  • In-file pruning: skip row groups/pages whose stats prove they can’t match.

Mini-example
Query: WHERE price BETWEEN 100 AND 120 AND country='SG'

  1. Skip partitions where country != 'SG'.
  2. Skip files whose price min/max doesn’t intersect [100,120].
  3. Inside kept files, skip row groups by price stats/Bloom.
  4. Read only the needed columns (price, country, plus those you SELECT).
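Step 3 above (row-group pruning by zone maps) is just an interval-intersection test. A sketch, with hypothetical per-row-group stats as a planner might read them from file footers (field names are illustrative, not a real API):

```python
row_groups = [
    {"price_min": 10,  "price_max": 95},    # max < 100 -> skip
    {"price_min": 90,  "price_max": 150},   # overlaps [100, 120] -> keep
    {"price_min": 130, "price_max": 400},   # min > 120 -> skip
]

lo, hi = 100, 120  # WHERE price BETWEEN 100 AND 120

def may_match(stats, lo, hi):
    # Keep a row group only if its [min, max] range intersects [lo, hi].
    return stats["price_min"] <= hi and stats["price_max"] >= lo

kept = [rg for rg in row_groups if may_match(rg, lo, hi)]
print(len(kept))  # 1
```

Note the asymmetry: stats can prove a row group contains no match, but a kept row group may still yield zero rows after the real filter runs.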

2.3 Vectorized execution

  • Engine pulls batches (e.g., 1K–16K values) per column, sized to stay resident in CPU cache.
  • Applies the same predicate/operation to the whole vector using SIMD.
  • Fewer function calls, branchless tight loops → high IPC.
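The batch-at-a-time idea is easy to see with NumPy, whose array operations compile down to the kind of tight, SIMD-friendly loops described above (values are illustrative):

```python
import numpy as np

# One batch (vector) of column values; one comparison call covers
# every element instead of a per-row interpreted loop.
price = np.array([80, 105, 110, 200, 118], dtype=np.int64)
mask = (price >= 100) & (price <= 120)   # branch-free selection mask
print(mask.sum())  # 3 rows survive the filter
```

Engines keep the mask as a selection vector and pass it down the pipeline, so later operators touch only surviving positions.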

3) Updates, Deletes & Consistency (Append-Mostly)

Columnar systems typically avoid in-place updates:

  • Copy-on-Write (CoW): write new files with changes, old files become obsolete.
  • Merge-on-Read (MoR): write delta files (inserts/deletes) and merge them during reads; compaction later bakes them in.
  • Snapshots/Manifests: a table’s “current state” is just a pointer to a set of files and metadata → time travel & concurrent writers.
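Merge-on-Read can be sketched as a read-time merge of a base file with delete and upsert deltas. All names here are illustrative; real formats track files and row ids through manifests/logs rather than in-memory dicts:

```python
# Base rows keyed by primary key, plus two kinds of delta "files".
base = {1: {"id": 1, "price": 100}, 2: {"id": 2, "price": 200}}
deletes = {2}                            # delete-delta: removed row ids
upserts = {3: {"id": 3, "price": 300},   # insert of a new row
           1: {"id": 1, "price": 110}}   # update of an existing row

def read_snapshot(base, deletes, upserts):
    """Merge deltas at read time; compaction would bake this into new base files."""
    merged = {k: v for k, v in base.items() if k not in deletes}
    merged.update(upserts)
    return merged

rows = read_snapshot(base, deletes, upserts)
print(sorted(rows))  # [1, 3]
```

Copy-on-Write does the same merge eagerly at write time, trading write amplification for cheaper reads.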

Lakehouse table formats:

  • Apache Iceberg: manifests, metadata tables; snapshot isolation; hidden partitioning.
  • Delta Lake: _delta_log (JSON/Parquet) transaction log; CoW or MoR; time travel.
  • Apache Hudi: CoW and MoR table types; indexed upserts, commit timeline.

Snowflake is proprietary: micro-partitions + rich metadata (stats, pruning); no user-visible B-trees/LSM. It does not use Delta Lake’s _delta_log.


4) Auxiliary Indexing & Metadata

  • Zone maps (min/max): cheap, universal.
  • Bloom filters: fast probabilistic “might contain value X?” checks.
  • Bitmaps: per-value or per-range bitmaps (common in Druid/Pinot).
  • Sparse primary indexes / data-skipping indexes: e.g., ClickHouse’s primary key and skip indexes.
  • Traditional B-trees are rare; column stores lean on metadata-pruning + scans.
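The Bloom filter's "might contain" check from the list above can be sketched in a few lines; parameters (bit count, hash count) are arbitrary here, and the hashing scheme is a toy, not what any particular system uses:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: false positives possible, false negatives never."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, value):
        # Derive k bit positions by salting a hash of the value.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = 1

    def might_contain(self, value):
        return all(self.bits[p] for p in self._positions(value))

bf = BloomFilter()
bf.add("SG")
print(bf.might_contain("SG"))  # True — never a false negative
```

Engines store one small filter per segment/row group; a negative answer lets the reader skip the block without touching its data pages.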

5) Compute–Storage Separation

  • Data sits in object storage (S3/GCS/OSS).
  • Stateless compute (engines like Trino/Presto, Spark, Snowflake) pull files, prune aggressively, and process in parallel.
  • Scale-out by adding workers; scale-down by stopping compute — storage persists independently.

6) Parallelism & Scalability

  • File/partition-level parallelism: many workers scan different files at once.
  • Pipeline parallelism: scan → filter → project → aggregate operate as streaming stages.
  • Locality matters less with object storage; bandwidth & metadata ops dominate.

7) File Formats (Columnar)

  • Parquet: row groups → column chunks → pages; rich stats, encodings per column.
  • ORC: stripes → row index + bloom + stats; strong predicate pushdown.
  • (Others exist, but Parquet/ORC dominate the lakehouse.)

Encodings you’ll meet

  • Dictionary (categoricals), RLE (runs), Delta/Frame-of-Reference (nearby numbers), Bit-packing (small integer widths), ZSTD/LZ4 for general compression.
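Dictionary encoding, the first item above, replaces repeated values with small integer codes plus a lookup table. A minimal sketch (input values are illustrative):

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

dictionary, codes = dictionary_encode(["SG", "US", "SG", "SG", "DE"])
print(dictionary)  # ['SG', 'US', 'DE']
print(codes)       # [0, 1, 0, 0, 2]
```

With few distinct values the codes fit in a handful of bits, so bit-packing and RLE then compress the code stream further — the encodings compose.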

8) Systems & How They Map

  System         What it is           Notes
  Snowflake      Cloud DWH            Proprietary columnar; micro-partitions, heavy metadata pruning; no user B-trees/LSM.
  ClickHouse     Columnar DBMS        MergeTree family; sparse primary index, data-skipping indexes; blazing vectorized execution.
  Apache Druid   Realtime OLAP store  Segments with bitmap indexes; great for time-series & slice-and-dice; ingestion + query tiers.
  Apache Pinot   Realtime OLAP store  Columnar segment files; star-schema analytics, low-latency serving.
  DuckDB         Embedded columnar    In-process analytics; Parquet integration; vectorized exec.
  Trino/Presto   SQL query engine     Stateless compute; scans Parquet/ORC/Iceberg/Delta/Hudi; no storage of its own.

9) Partitioning & Layout

  • Partition by high-selectivity, common filters (date, tenant, country).
  • Keep file sizes balanced (e.g., 128–1024 MB) for efficient scans.
  • Avoid tiny files (metadata overhead) and huge files (slow splits).
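The file-size guidance above turns into a simple sizing calculation when planning compaction or write jobs. Numbers here are hypothetical:

```python
import math

total_bytes = 250 * 1024**3   # hypothetical table size: 250 GiB
target_file = 512 * 1024**2   # target file size: 512 MiB (inside 128-1024 MB band)

# How many output files a compaction/write job should aim for.
n_files = math.ceil(total_bytes / target_file)
print(n_files)  # 500
```

500 half-gigabyte files keeps per-file metadata cheap while still giving hundreds of independent units for parallel scans.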

10) When Columnar Shines vs. Struggles

Shines

  • Aggregations, scans over a subset of columns, ad-hoc analytics, time-series summarization.

Struggles

  • Heavy point lookups, high-churn OLTP updates, lots of single-row transactions.
    → Use row stores (or hybrids) for write-hot OLTP paths; ETL into columnar for analytics.

11) The “Non-Negotiables”

  1. Columnar data layout (per-column files/chunks).
  2. Aggressive per-column compression (choose encoding per column).
  3. Per-block/segment stats for pruning/skipping.
  4. Parallel scans across partitions/files.
  5. Efficient metadata management (fast listing, manifests, stats indexes).

12) Quick FAQ

Q: Is “zone map” the same as an index?
A: It’s a lightweight index: per-chunk min/max (and sometimes NDV). It doesn’t point to rows; it tells you what to skip.

Q: What makes vectorization fast?
A: Process thousands of values per call with SIMD and branch-free loops; fewer virtual calls, better cache.

Q: How do updates work if files are immutable?
A: Append delta files (upserts/deletes), then compact/merge to produce new base files; readers consult the latest snapshot to find which files are current.


13) Field Guide: Picking Tools

  • Need a managed warehouse with strong pruning? → Snowflake / BigQuery.
  • Need self-hosted, ultra-fast analytics with indexes? → ClickHouse.
  • Realtime dashboarding over streaming data? → Druid / Pinot.
  • Lakehouse DIY with open tables? → Iceberg / Delta Lake / Hudi on Trino/Spark.
  • Local analytics or app-embedded OLAP? → DuckDB.