Columnar = store each column contiguously on disk. Great for analytics: scan a few columns over many rows, compress aggressively, and skip irrelevant data.
1) Why Columnar?
- Better compression: same-type, similar values sit together → RLE, dictionary, delta/frame-of-reference (FOR), and bit-packing encodings shine.
- Less I/O: read only the columns you need.
- Vectorized compute: operate on batches (vectors) using SIMD instructions for throughput.
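To make the compression point concrete, here is a minimal sketch (assuming pyarrow is installed) that writes the same table row-wise as CSV and column-wise as Parquet with ZSTD; exact byte counts depend on the data and encoder settings, but the columnar file is typically far smaller.

```python
# Minimal sketch: the same table stored row-wise (CSV) vs. columnar (Parquet).
# Illustrative only -- exact sizes depend on data and encoder settings.
import os
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

n = 300_000
table = pa.table({
    "country": pa.array(["SG", "US", "DE"] * (n // 3)),  # low cardinality -> dictionary
    "price":   pa.array(range(n)),                        # nearby ints -> delta/bit-packing
})

pacsv.write_csv(table, "rows.csv")
pq.write_table(table, "cols.parquet", compression="zstd")

print("csv bytes:    ", os.path.getsize("rows.csv"))
print("parquet bytes:", os.path.getsize("cols.parquet"))
```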
2) Core Building Blocks
2.1 Physical layout
- Columns split into chunks (segments / stripes / row groups / pages).
- Each chunk carries statistics (e.g., min/max, distinct-value count (NDV), null count), often Bloom filters too.
- Files are immutable (append-only) and compacted/merged in the background.
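A minimal sketch of what those per-chunk statistics look like in practice, using pyarrow to inspect Parquet row-group metadata (the file name and row-group size are illustrative):

```python
# Write a file with small row groups so there are several chunks to inspect,
# then read back the per-row-group statistics from the footer.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"price": list(range(10_000))})
pq.write_table(table, "prices.parquet", row_group_size=2_500)

meta = pq.ParquetFile("prices.parquet").metadata
for rg in range(meta.num_row_groups):
    st = meta.row_group(rg).column(0).statistics
    print(f"row group {rg}: rows={meta.row_group(rg).num_rows}, "
          f"min={st.min}, max={st.max}, nulls={st.null_count}")
```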
2.2 Data skipping (pruning)
- Partition pruning: skip entire folders (e.g., dt=2025-08-31/).
- File/segment pruning via zone maps (min/max) and Bloom filters.
- In-file pruning: skip row groups/pages whose stats prove they can’t match.
Mini-example
Query: WHERE price BETWEEN 100 AND 120 AND country='SG'
- Skip partitions where country != 'SG'.
- Skip files whose price min/max doesn’t intersect [100, 120].
- Inside kept files, skip row groups by price stats/Bloom.
- Read only the needed columns (price, country, plus those you SELECT).
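A minimal sketch of the file-pruning step above, assuming hypothetical per-file min/max stats; real engines read these from file footers or table metadata:

```python
# Zone-map (min/max) pruning for: WHERE price BETWEEN 100 AND 120.
# The per-file stats below are hypothetical.
file_stats = [
    {"path": "part-0.parquet", "price_min": 5,   "price_max": 90},
    {"path": "part-1.parquet", "price_min": 95,  "price_max": 130},  # overlaps [100, 120] -> keep
    {"path": "part-2.parquet", "price_min": 200, "price_max": 450},
]

lo, hi = 100, 120
kept = [f["path"] for f in file_stats
        if not (f["price_max"] < lo or f["price_min"] > hi)]  # keep only intersecting ranges
print(kept)  # ['part-1.parquet'] -- the other files are skipped without being read
```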
2.3 Vectorized execution
- Engine pulls batches (e.g., 1K–16K values) per column into CPU registers.
- Applies the same predicate/operation to the whole vector using SIMD.
- Fewer function calls, branchless tight loops → high IPC.
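A minimal sketch of the idea, using NumPy as a stand-in for an engine's SIMD kernels (the batch size and data are illustrative):

```python
# Apply one predicate to a whole batch of column values at once,
# instead of evaluating it row by row.
import numpy as np

batch_price = np.random.randint(0, 500, size=4096)  # one "vector" of column values

# Row-at-a-time, roughly what a tuple-at-a-time interpreter does:
matches_scalar = [100 <= p <= 120 for p in batch_price]

# Vectorized: one call produces a selection bitmap for the whole batch,
# which downstream operators (project, aggregate) consume directly.
matches_vector = (batch_price >= 100) & (batch_price <= 120)

assert list(matches_vector) == matches_scalar
```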
3) Updates, Deletes & Consistency (Append-Mostly)
Columnar systems typically avoid in-place updates:
- Copy-on-Write (CoW): write new files with changes, old files become obsolete.
- Merge-on-Read (MoR): write delta files (inserts/deletes) and merge them during reads; compaction later bakes them in.
- Snapshots/Manifests: a table’s “current state” is just a pointer to a set of files and metadata → time travel & concurrent writers.
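A minimal merge-on-read sketch using plain Python dictionaries; real table formats track base and delta files via snapshots and manifests, this only shows the read-time merge:

```python
# Immutable base rows plus a delta of upserts/deletes, merged when reading.
base = {1: {"id": 1, "price": 100},
        2: {"id": 2, "price": 200},
        3: {"id": 3, "price": 300}}            # contents of immutable base files

delta_upserts = {2: {"id": 2, "price": 250}}   # newer delta file
delta_deletes = {3}                            # delete markers

def read_current(base, upserts, deletes):
    merged = {**base, **upserts}               # newest version of each key wins
    return [row for key, row in merged.items() if key not in deletes]

print(read_current(base, delta_upserts, delta_deletes))
# [{'id': 1, 'price': 100}, {'id': 2, 'price': 250}]
# Compaction later rewrites base files with the delta applied ("bakes them in").
```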
Lakehouse table formats:
- Apache Iceberg: manifests, metadata tables; snapshot isolation; hidden partitioning.
- Delta Lake: _delta_log (JSON/Parquet) transaction log; CoW or MoR; time travel.
- Apache Hudi: CoW and MoR table types; indexed upserts, commit timeline.
Snowflake is proprietary: micro-partitions + rich metadata (stats, pruning); no user-visible B-trees/LSM. It does not use Delta Lake’s _delta_log.
4) Auxiliary Indexing & Metadata
- Zone maps (min/max): cheap, universal.
- Bloom filters: fast probabilistic “might contain value X?” checks.
- Bitmaps: per-value or per-range bitmaps (common in Druid/Pinot).
- Sparse primary indexes / data-skipping indexes: e.g., ClickHouse’s primary key and skip indexes.
- Traditional B-trees are rare; column stores lean on metadata-pruning + scans.
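A minimal Bloom-filter sketch for the “might contain value X?” check; the sizing and hash choices are illustrative, not what any particular engine uses:

```python
# Per-chunk Bloom filter: set bits for each value's hash positions at write time,
# then answer "might contain?" at query time. False positives are possible,
# false negatives are not.
import hashlib

BITS = 1 << 16   # filter size in bits (illustrative)
K = 3            # number of hash functions (illustrative)

def _positions(value: str):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % BITS

def make_filter(values):
    bits = bytearray(BITS // 8)
    for v in values:
        for p in _positions(v):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def might_contain(bits, value):
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(value))

chunk_filter = make_filter(["SG", "US", "DE"])
print(might_contain(chunk_filter, "SG"))  # True
print(might_contain(chunk_filter, "FR"))  # False (almost certainly) -> skip the chunk
```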
5) Compute–Storage Separation
- Data sits in object storage (S3/GCS/OSS).
- Stateless compute (engines like Trino/Presto, Spark, Snowflake) pull files, prune aggressively, and process in parallel.
- Scale-out by adding workers; scale-down by stopping compute — storage persists independently.
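A minimal sketch of the pattern with DuckDB as the stateless compute; the bucket path is hypothetical, and S3 credentials/region configuration are omitted:

```python
# Ephemeral in-process compute querying Parquet that lives in object storage.
import duckdb

con = duckdb.connect()            # throwaway compute; no local storage to manage
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")        # enables s3:// paths (credentials/config omitted)

rows = con.execute("""
    SELECT country, avg(price) AS avg_price
    FROM read_parquet('s3://my-bucket/sales/dt=2025-08-31/*.parquet')
    GROUP BY country
""").fetchall()
print(rows)   # the compute can be shut down; the Parquet files persist in S3
```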
6) Parallelism & Scalability
- File/partition-level parallelism: many workers scan different files at once.
- Pipeline parallelism: scan → filter → project → aggregate operate as streaming stages.
- Locality matters less with object storage; bandwidth & metadata ops dominate.
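A minimal sketch of file-level parallelism, assuming hypothetical Parquet files: each worker scans and pre-aggregates one file, and the partial results are combined at the end.

```python
# One unit of parallel work = scan one file, reading only the needed column.
from concurrent.futures import ThreadPoolExecutor
import pyarrow.compute as pc
import pyarrow.parquet as pq

files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]  # hypothetical paths

def partial_sum(path):
    t = pq.read_table(path, columns=["price"])   # column projection per file
    return pc.sum(t["price"]).as_py()

with ThreadPoolExecutor(max_workers=len(files)) as pool:
    total = sum(pool.map(partial_sum, files))    # combine the partial aggregates
print(total)
```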
7) File Formats (Columnar)
- Parquet: row groups → column chunks → pages; rich stats, encodings per column.
- ORC: stripes → row index + bloom + stats; strong predicate pushdown.
- (Others exist, but Parquet/ORC dominate the lakehouse.)
Encodings you’ll meet
- Dictionary (categoricals), RLE (runs), Delta/Frame-of-Reference (nearby numbers), Bit-packing (small integer widths), ZSTD/LZ4 for general compression.
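A minimal sketch of steering encodings and codecs per column when writing Parquet with pyarrow; the parameter choices are illustrative, and the defaults are often already sensible:

```python
# Choose dictionary encoding and compression codec per column.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["SG", "US", "SG", "DE"],   # low cardinality -> dictionary encoding
    "price":   [100, 101, 102, 103],       # nearby ints -> delta/bit-packing friendly
})

pq.write_table(
    table,
    "encoded.parquet",
    use_dictionary=["country"],                          # dictionary-encode only this column
    compression={"country": "zstd", "price": "lz4"},     # per-column codecs
)
```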
8) Systems & How They Map
| System | What it is | Notes |
|---|---|---|
| Snowflake | Cloud DWH | Proprietary columnar; micro-partitions, heavy metadata pruning; no user B-trees/LSM. |
| ClickHouse | Columnar DBMS | MergeTree family; sparse primary index, data-skipping indexes; blazing vectorized execution. |
| Apache Druid | Realtime OLAP store | Segments with bitmap indexes, great for time-series & slice-and-dice; ingestion + query tiers. |
| Apache Pinot | Realtime OLAP store | Columnar segment files; star-schema analytics, low-latency serving. |
| DuckDB | Embedded columnar | In-process analytics; Parquet integration; vectorized exec. |
| Trino/Presto | SQL query engine | Stateless compute; scans Parquet/ORC/Iceberg/Delta/Hudi; no storage of its own. |
9) Partitioning & Layout
- Partition by high-selectivity, common filters (date, tenant, country).
- Keep file sizes balanced (e.g., 128–1024 MB) for efficient scans.
- Avoid tiny files (metadata overhead) and huge files (slow splits).
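A minimal sketch of writing a partitioned layout with pyarrow, producing one directory per dt value (paths and values are illustrative); this is the structure that partition pruning later skips over:

```python
# Write a dataset partitioned by dt: sales/dt=2025-08-30/..., sales/dt=2025-08-31/...
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "dt":      ["2025-08-30", "2025-08-31", "2025-08-31"],
    "country": ["SG", "SG", "US"],
    "price":   [110, 95, 400],
})

pq.write_to_dataset(table, root_path="sales", partition_cols=["dt"])
```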
10) When Columnar Shines vs. Struggles
Shines
- Aggregations, scans over a subset of columns, ad-hoc analytics, time-series summarization.
Struggles
- Heavy point lookups, high-churn OLTP updates, lots of single-row transactions.
→ Use row stores (or hybrids) for write-hot OLTP paths; ETL into columnar for analytics.
11) The “Non-Negotiables”
- Columnar data layout (per-column files/chunks).
- Aggressive per-column compression (choose encoding per column).
- Per-block/segment stats for pruning/skipping.
- Parallel scans across partitions/files.
- Efficient metadata management (fast listing, manifests, stats indexes).
12) Quick FAQ
Q: Is “zone map” the same as an index?
A: It’s a lightweight index: per-chunk min/max (and sometimes NDV). It doesn’t point to rows; it tells you what to skip.
Q: What makes vectorization fast?
A: Process thousands of values per call with SIMD and branch-free loops; fewer virtual calls, better cache.
Q: How do updates work if files are immutable?
A: Append delta files (upserts/deletes), then compact/merge to produce new base files; readers consult the latest snapshot to find which files are current.
13) Field Guide: Picking Tools
- Need a managed warehouse with strong pruning? → Snowflake / BigQuery.
- Need self-hosted, ultra-fast analytics with indexes? → ClickHouse.
- Realtime dashboarding over streaming data? → Druid / Pinot.
- Lakehouse DIY with open tables? → Iceberg / Delta Lake / Hudi on Trino/Spark.
- Local analytics or app-embedded OLAP? → DuckDB.