Columnar = store each column contiguously on disk. Great for analytics: scan a few columns over many rows, compress aggressively, and skip irrelevant data.


1) Why Columnar?

  • Better compression: same-type, similar values sit together → RLE, dictionary, delta/frame-of-reference (FOR), and bit-packing encodings shine.
  • Less I/O: read only the columns you need.
  • Vectorized compute: operate on batches (vectors) using SIMD instructions for throughput.
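The compression benefit above comes from adjacency: a sorted or partitioned column has long runs of identical values, which run-length encoding collapses almost for free. A minimal sketch (data values are illustrative):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column: list of (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A column sorted by country has long runs of identical values.
country = ["SG"] * 500 + ["US"] * 300 + ["DE"] * 200
encoded = rle_encode(country)
print(encoded)  # [('SG', 500), ('US', 300), ('DE', 200)]
```

1,000 values collapse to three pairs; the same column in row-major order, interleaved with other fields, would show no such runs.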

2) Core Building Blocks

2.1 Physical layout

  • Columns split into chunks (segments / stripes / row groups / pages).
  • Each chunk carries statistics (e.g., min/max, NDV, null count), often Bloom filters too.
  • Files are immutable (append-only) and compacted/merged in the background.
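The chunk-plus-statistics layout can be sketched in a few lines. This is a toy model, not any real file format's API; real formats (Parquet, ORC) store the same kind of stats in footers/stripe indexes:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    values: list        # raw values in this chunk
    min_val: int        # per-chunk statistics used later for pruning
    max_val: int
    null_count: int

def chunk_column(values, chunk_size):
    """Split one column into fixed-size chunks, each carrying stats."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        present = [v for v in part if v is not None]
        chunks.append(Chunk(part, min(present), max(present),
                            part.count(None)))
    return chunks

chunks = chunk_column([5, 3, None, 9, 120, 110, 130, None], chunk_size=4)
print([(c.min_val, c.max_val, c.null_count) for c in chunks])
# [(3, 9, 1), (110, 130, 1)]
```

The stats are written once at chunk-creation time; because files are immutable, they never need invalidation.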

2.2 Data skipping (pruning)

  • Partition pruning: skip entire folders (e.g., dt=2025-08-31/).
  • File/segment pruning via zone maps (min/max) and Bloom filters.
  • In-file pruning: skip row groups/pages whose stats prove they can’t match.

Mini-example
Query: WHERE price BETWEEN 100 AND 120 AND country='SG'

  1. Skip partitions where country != 'SG'.
  2. Skip files whose price min/max doesn’t intersect [100,120].
  3. Inside kept files, skip row groups by price stats/Bloom.
  4. Read only the needed columns (price, country, plus those you SELECT).
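Step 3 above (row-group pruning by zone maps) is just an interval-intersection test. A sketch, with hypothetical per-row-group stats as a planner might read them from file footers (field names are illustrative, not a real API):

```python
row_groups = [
    {"price_min": 10,  "price_max": 95},    # max < 100 -> skip
    {"price_min": 90,  "price_max": 150},   # overlaps [100, 120] -> keep
    {"price_min": 130, "price_max": 400},   # min > 120 -> skip
]

lo, hi = 100, 120  # WHERE price BETWEEN 100 AND 120

def may_match(stats, lo, hi):
    # Keep a row group only if its [min, max] range intersects [lo, hi].
    return stats["price_min"] <= hi and stats["price_max"] >= lo

kept = [rg for rg in row_groups if may_match(rg, lo, hi)]
print(len(kept))  # 1
```

Note the asymmetry: stats can prove a row group contains no match, but a kept row group may still yield zero rows after the real filter runs.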

2.3 Vectorized execution

  • Engine pulls batches (e.g., 1K–16K values) per column, sized to stay resident in CPU cache.
  • Applies the same predicate/operation to the whole vector using SIMD.
  • Fewer function calls, branchless tight loops → high IPC.
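The batch-at-a-time idea is easy to see with NumPy, whose array operations compile down to the kind of tight, SIMD-friendly loops described above (values are illustrative):

```python
import numpy as np

# One batch (vector) of column values; one comparison call covers
# every element instead of a per-row interpreted loop.
price = np.array([80, 105, 110, 200, 118], dtype=np.int64)
mask = (price >= 100) & (price <= 120)   # branch-free selection mask
print(mask.sum())  # 3 rows survive the filter
```

Engines keep the mask as a selection vector and pass it down the pipeline, so later operators touch only surviving positions.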

3) Updates, Deletes & Consistency (Append-Mostly)

Columnar systems typically avoid in-place updates:

  • Copy-on-Write (CoW): write new files with changes, old files become obsolete.
  • Merge-on-Read (MoR): write delta files (inserts/deletes) and merge them during reads; compaction later bakes them in.
  • Snapshots/Manifests: a table’s “current state” is just a pointer to a set of files and metadata → time travel & concurrent writers.
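Merge-on-Read can be sketched as a read-time merge of a base file with delete and upsert deltas. All names here are illustrative; real formats track files and row ids through manifests/logs rather than in-memory dicts:

```python
# Base rows keyed by primary key, plus two kinds of delta "files".
base = {1: {"id": 1, "price": 100}, 2: {"id": 2, "price": 200}}
deletes = {2}                            # delete-delta: removed row ids
upserts = {3: {"id": 3, "price": 300},   # insert of a new row
           1: {"id": 1, "price": 110}}   # update of an existing row

def read_snapshot(base, deletes, upserts):
    """Merge deltas at read time; compaction would bake this into new base files."""
    merged = {k: v for k, v in base.items() if k not in deletes}
    merged.update(upserts)
    return merged

rows = read_snapshot(base, deletes, upserts)
print(sorted(rows))  # [1, 3]
```

Copy-on-Write does the same merge eagerly at write time, trading write amplification for cheaper reads.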

Lakehouse table formats:

  • Apache Iceberg: manifests, metadata tables; snapshot isolation; hidden partitioning.
  • Delta Lake: _delta_log (JSON/Parquet) transaction log; CoW or MoR; time travel.
  • Apache Hudi: CoW and MoR table types; indexed upserts, commit timeline.

Snowflake is proprietary: micro-partitions + rich metadata (stats, pruning); no user-visible B-trees/LSM. It does not use Delta Lake’s _delta_log.


4) Auxiliary Indexing & Metadata

  • Zone maps (min/max): cheap, universal.
  • Bloom filters: fast probabilistic “might contain value X?” checks.
  • Bitmaps: per-value or per-range bitmaps (common in Druid/Pinot).
  • Sparse primary indexes / data-skipping indexes: e.g., ClickHouse’s primary key and skip indexes.
  • Traditional B-trees are rare; column stores lean on metadata-pruning + scans.
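The Bloom filter's "might contain" check from the list above can be sketched in a few lines; parameters (bit count, hash count) are arbitrary here, and the hashing scheme is a toy, not what any particular system uses:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: false positives possible, false negatives never."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, value):
        # Derive k bit positions by salting a hash of the value.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = 1

    def might_contain(self, value):
        return all(self.bits[p] for p in self._positions(value))

bf = BloomFilter()
bf.add("SG")
print(bf.might_contain("SG"))  # True — never a false negative
```

Engines store one small filter per segment/row group; a negative answer lets the reader skip the block without touching its data pages.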

5) Compute–Storage Separation

  • Data sits in object storage (S3/GCS/OSS).
  • Stateless compute (engines like Trino/Presto, Spark, Snowflake) pull files, prune aggressively, and process in parallel.
  • Scale-out by adding workers; scale-down by stopping compute — storage persists independently.

6) Parallelism & Scalability

  • File/partition-level parallelism: many workers scan different files at once.
  • Pipeline parallelism: scan → filter → project → aggregate operate as streaming stages.
  • Locality matters less with object storage; bandwidth & metadata ops dominate.

7) File Formats (Columnar)

  • Parquet: row groups → column chunks → pages; rich stats, encodings per column.
  • ORC: stripes → row index + bloom + stats; strong predicate pushdown.
  • (Others exist, but Parquet/ORC dominate the lakehouse.)

Encodings you’ll meet

  • Dictionary (categoricals), RLE (runs), Delta/Frame-of-Reference (nearby numbers), Bit-packing (small integer widths), ZSTD/LZ4 for general compression.
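Dictionary encoding, the first item above, replaces repeated values with small integer codes plus a lookup table. A minimal sketch (input values are illustrative):

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

dictionary, codes = dictionary_encode(["SG", "US", "SG", "SG", "DE"])
print(dictionary)  # ['SG', 'US', 'DE']
print(codes)       # [0, 1, 0, 0, 2]
```

With few distinct values the codes fit in a handful of bits, so bit-packing and RLE then compress the code stream further — the encodings compose.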

8) Systems & How They Map

  System         What it is           Notes
  Snowflake      Cloud DWH            Proprietary columnar; micro-partitions, heavy metadata pruning; no user B-trees/LSM.
  ClickHouse     Columnar DBMS        MergeTree family; sparse primary index, data-skipping indexes; blazing vectorized execution.
  Apache Druid   Realtime OLAP store  Segments with bitmap indexes; great for time-series & slice-and-dice; ingestion + query tiers.
  Apache Pinot   Realtime OLAP store  Columnar segment files; star-schema analytics, low-latency serving.
  DuckDB         Embedded columnar    In-process analytics; Parquet integration; vectorized exec.
  Trino/Presto   SQL query engine     Stateless compute; scans Parquet/ORC/Iceberg/Delta/Hudi; no storage of its own.

9) Partitioning & Layout

  • Partition by high-selectivity, common filters (date, tenant, country).
  • Keep file sizes balanced (e.g., 128–1024 MB) for efficient scans.
  • Avoid tiny files (metadata overhead) and huge files (slow splits).
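The file-size guidance above turns into a simple sizing calculation when planning compaction or write jobs. Numbers here are hypothetical:

```python
import math

total_bytes = 250 * 1024**3   # hypothetical table size: 250 GiB
target_file = 512 * 1024**2   # target file size: 512 MiB (inside 128-1024 MB band)

# How many output files a compaction/write job should aim for.
n_files = math.ceil(total_bytes / target_file)
print(n_files)  # 500
```

500 half-gigabyte files keeps per-file metadata cheap while still giving hundreds of independent units for parallel scans.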

10) When Columnar Shines vs. Struggles

Shines

  • Aggregations, scans over a subset of columns, ad-hoc analytics, time-series summarization.

Struggles

  • Heavy point lookups, high-churn OLTP updates, lots of single-row transactions.
    → Use row stores (or hybrids) for write-hot OLTP paths; ETL into columnar for analytics.

11) The “Non-Negotiables”

  1. Columnar data layout (per-column files/chunks).
  2. Aggressive per-column compression (choose encoding per column).
  3. Per-block/segment stats for pruning/skipping.
  4. Parallel scans across partitions/files.
  5. Efficient metadata management (fast listing, manifests, stats indexes).

12) Quick FAQ

Q: Is “zone map” the same as an index?
A: It’s a lightweight index: per-chunk min/max (and sometimes NDV). It doesn’t point to rows; it tells you what to skip.

Q: What makes vectorization fast?
A: Process thousands of values per call with SIMD and branch-free loops; fewer virtual calls, better cache.

Q: How do updates work if files are immutable?
A: Append delta files (upserts/deletes), then compact/merge to produce new base files; readers consult the latest snapshot to find which files are current.


13) Field Guide: Picking Tools

  • Need a managed warehouse with strong pruning? → Snowflake / BigQuery.
  • Need self-hosted, ultra-fast analytics with indexes? → ClickHouse.
  • Realtime dashboarding over streaming data? → Druid / Pinot.
  • Lakehouse DIY with open tables? → Iceberg / Delta Lake / Hudi on Trino/Spark.
  • Local analytics or app-embedded OLAP? → DuckDB.