Row-store OLTP

1) Why Row-Based Storage?

Spatial locality: entire row lives together ⇒ great for point lookups, inserts/updates, OLTP workflows.
Strengths: primary-key lookups, selective predicates that fetch full rows, frequent small writes.
Trade-offs: analytical scans touching few columns incur extra I/O vs columnar stores.
Typical systems: InnoDB, PostgreSQL heap tables, SQL Server/Oracle row stores.

2) Storage Building Blocks

2.1 Data Model

Schema: column names, types, nullability, constraints (PK, FK, UNIQUE, CHECK).
Row format: fixed/variable attributes, NULL bitmap, alignment/padding.

2.2 Physical Storage Layer (Pages)

Pages (a.k.a. blocks): fixed-size units (e.g., 4–32 KB). Smallest I/O and buffer-pool unit.
Slotted-page layout (common):
- Page header: LSN/checksum, free-space pointer, tuple count, flags.
- Slot directory: array of offsets to tuples (allows row movement without changing row ID).
- Free space: grows/shrinks between header/slots and tuple area (“free space hole”).
- Tuples (records): may have per-row header (null bitmap, varlena pointers).
Large attributes:
- Out-of-line storage: BLOBs/CLOBs/TOAST (Postgres) stored elsewhere; row stores a pointer.
- Page overflow: very wide rows can spill to overflow pages.

2.3 Fixed vs Variable Length

Fixed: fast offset math, but wastes space (internal fragmentation).
Variable: compact, but requires indirection (offsets/lengths) and may fragment more easily.

2.4 Fragmentation Types

Internal: wasted space inside pages/rows.
External: logical neighbors scattered across pages ⇒ more random I/O.
Mitigation: fillfactor, autovacuum/vacuum, REORG/CLUSTER, page compaction.

3) CRUD Operations (Row Store)

3.1 Insert

Use free-space maps / free lists to find pages with room.
Insert tuple, update slot directory + free-space metadata.
Hot pages can cause latch contention ⇒ batch/partitioning can help.

3.2 Update

In-place if size unchanged and space available.
Row move if it grows (new page chosen, old becomes dead/redirected) ⇒ increases fragmentation.
Secondary index maintenance: if indexed columns change, index entries must be updated.
HOT updates (heap-only tuple, Postgres): if PK/indexed columns unchanged, avoid touching indexes.

3.3 Delete

Usually logical (mark dead, keep space for MVCC visibility).
Deferred cleanup: vacuum/compaction reclaims space and defragments.

4) Indexing (Row Store)

4.1 Primary Structures

B+-Trees: default for range scans and ordered access; leaf pages hold row IDs (or the row itself in clustered tables).
Hash indexes: fast equality lookups, no ordering; less common as primary in RDBMS.

4.2 Clustered vs Heap

Clustered index (InnoDB, SQL Server): table data physically ordered by the PK ⇒ range scans are sequential; only one per table.
Heap (Postgres): no inherent ordering; CLUSTER can reorganize to follow an index (not maintained automatically).

4.3 Covering & Index-Only Scans

Covering index: query can be satisfied by index alone (include columns in leaf).
Index-only scans: if visibility info allows skipping heap fetch (e.g., visibility map in Postgres).

4.4 Composite & Selectivity

Composite indexes order columns lexicographically; put most selective/leading filter first.
Beware index bloat and write amplification on heavy-write tables.

5) Concurrency Control & Isolation

5.1 MVCC

Core idea: writers create new versions; readers see a consistent snapshot (no blocking).
Implementation flavours:
- Undo-log based (e.g., InnoDB): old versions reconstructed from undo segments.
- Append-new-version (e.g., Postgres): new tuple version inserted; old one stays until vacuum.
Visibility: determined by transaction IDs/timestamps; vacuum removes dead versions.
Phantoms: solved with predicate/next-key locks at higher isolation.

5.2 Locks vs Latches

Locks: logical concurrency control (row/table); S=shared (read), X=exclusive (write).
Latches: short-term in-memory mutual exclusion for data structures (page/index).
Granularity: table, page, row; escalation may occur under pressure.

5.3 Levels of Isolation (ANSI)

Read Uncommitted → may see dirty reads (rarely exposed).
Read Committed → no dirty reads; nonrepeatable reads/phantoms possible.
Repeatable Read → stable reads; phantom behavior depends on engine (InnoDB uses next-key locks).
Serializable → as if executed one-by-one; highest overhead.

5.4 Deadlocks

Detection (waits-for graph) vs timeouts; keep lock order consistent and keep transactions short.

6) Query Execution & Joins

6.1 Join Algorithms

Nested Loop: great with index on inner; worst-case O(N×M).
Hash Join: equality joins; build hash on smaller input; memory sensitive.
Merge Join: inputs sorted on join keys; great for large ordered scans.

6.2 Optimizer Basics

Uses statistics (histograms, NDV, correlation) to estimate costs.
Cardinality errors lead to bad plans; keep stats fresh (ANALYZE/auto-analyze).

7) Durability & Recovery (WAL/ARIES)

7.1 Write Ahead Logs (WAL)

Rule: log (redo/undo info) must reach disk before corresponding data pages.
Group commit: batch many transactions’ WAL flushes to reduce fsyncs.
Checksums & LSNs: ensure page/WAL consistency.

7.2 ARIES-style Recovery (typical)

Analysis: find dirty pages/active transactions at crash.
REDO: repeat history from last checkpoint (idempotent).
UNDO: roll back incomplete transactions.
Doublewrite buffers (InnoDB) / full page writes (Postgres option) mitigate torn-page writes.

7.3 Checkpoints

Periodically persist dirty pages/WAL pointers to cap recovery time.
Spiky I/O if not tuned; spread writes (background writer) to smooth latency.

8) Buffer Pool / Caching

8.1 Responsibilities

Cache pages (data & indexes), manage dirty pages, coordinate flushes.
Read-ahead (sequential scan detection) and write-behind help throughput.

8.2 Eviction Policies

LRU variants (2Q, LRU-K), CLOCK-sweep; protect hot pages from scan pollution.
Pin pages during I/O/operations to avoid eviction.

8.3 Sizing & Hot Sets

Too small: thrash and random I/O. Too large: OS cache duplication, long recovery.
Monitor cache hit ratio, dirty page percentage, checkpoint age.

9) Bulk Operations

Batching reduces per-row overhead (fewer WAL records/index changes).
Use COPY/bulk-insert APIs; disable/deferrable constraints/secondary indexes during massive loads if safe.
Consider minimal logging modes (vendor-specific) for ETL phases.

10) Compression & Encoding

Row/page compression: dictionary, prefix, RLE at page level.
Pros: more rows/page ⇒ fewer I/Os; Cons: CPU overhead.
Hot vs cold pages**: compress colder data; beware update penalties on compressed pages.

11) Partitioning, Clustering & Layout

Partitioning: range/list/hash; improves pruning, maintenance windows, and vacuum scope.
Local vs global indexes: affects cross-partition queries & maintenance.
Clustering: reorder heap following an index to improve locality (needs periodic re-cluster on heaps).

12) Practical Tuning & Maintenance

Fillfactor: leave headroom on pages to reduce page splits and row-move updates.
Vacuum/Autovacuum (Postgres): reclaim space, freeze XIDs to avoid wraparound.
Reindex: fix bloat after heavy churn.
Analyze/Statistics: keep histograms/NDV up-to-date for the optimizer.
Hot tables: consider sharding/partitioning or moving to faster storage.
Contended sequences/auto-increment: cache sizes, monotonicity vs gaps trade-off.

13) Common Pathologies (What to Watch)

Page/Index bloat: frequent updates/deletes without maintenance.
Hot pages: latch contention on popular leaf/pages; mitigate with partitioning or different keys.
Random I/O storms: missing indexes, poor plans, small buffer pool.
Checkpoint spikes: too infrequent or too aggressive flushing.
Isolation surprises: phantom reads, next-key locks blocking unexpected statements.

14) Quick Glossary

Tuple/Row: a record in a table.
TID/RowID: physical row identifier (slot + page).
LSN: log sequence number (WAL position).
FSM/Free Space Map: tracks pages with available space.
Dirty Page: modified in memory, not yet flushed.
Checkpoint: persists positions to bound recovery time.
HOT Update: heap-only tuple update (skip index rewrite).

15) TL;DR Cheatsheet

Row store = OLTP king: fast point lookups & updates; weaker for wide scans.
B+-tree everywhere: PK/range scans; choose composite order carefully.
MVCC: versions for readers; vacuum cleans the past.
WAL + Group Commit: durability with amortized fsync.
Keep stats fresh; watch bloat; right-size buffer; partition hot/large tables.

Jonas Notes

Explorer

Row-store OLTP

1) Why Row-Based Storage?

2) Storage Building Blocks

2.1 Data Model

2.2 Physical Storage Layer (Pages)

2.3 Fixed vs Variable Length

2.4 Fragmentation Types

3) CRUD Operations (Row Store)

3.1 Insert

3.2 Update

3.3 Delete

4) Indexing (Row Store)

4.1 Primary Structures

4.2 Clustered vs Heap

4.3 Covering & Index-Only Scans

4.4 Composite & Selectivity

5) Concurrency Control & Isolation

5.1 MVCC

5.2 Locks vs Latches

5.3 Levels of Isolation (ANSI)

5.4 Deadlocks

6) Query Execution & Joins

6.1 Join Algorithms

6.2 Optimizer Basics

7) Durability & Recovery (WAL/ARIES)

7.1 Write Ahead Logs (WAL)

7.2 ARIES-style Recovery (typical)

7.3 Checkpoints

8) Buffer Pool / Caching

8.1 Responsibilities

8.2 Eviction Policies

8.3 Sizing & Hot Sets

9) Bulk Operations

10) Compression & Encoding

11) Partitioning, Clustering & Layout

12) Practical Tuning & Maintenance

13) Common Pathologies (What to Watch)

14) Quick Glossary

15) TL;DR Cheatsheet

Table of Contents

Backlinks