1) Why Row-Based Storage?
- Spatial locality: entire row lives together ⇒ great for point lookups, inserts/updates, OLTP workflows.
- Strengths: primary-key lookups, selective predicates that fetch full rows, frequent small writes.
- Trade-offs: analytical scans touching few columns incur extra I/O vs columnar stores.
- Typical systems: InnoDB, PostgreSQL heap tables, SQL Server/Oracle row stores.
2) Storage Building Blocks
2.1 Data Model
- Schema: column names, types, nullability, constraints (PK, FK, UNIQUE, CHECK).
- Row format: fixed/variable attributes, NULL bitmap, alignment/padding.
2.2 Physical Storage Layer (Pages)
- Pages (a.k.a. blocks): fixed-size units (e.g., 4–32 KB). Smallest I/O and buffer-pool unit.
- Slotted-page layout (common):
- Page header: LSN/checksum, free-space pointer, tuple count, flags.
- Slot directory: array of offsets to tuples (allows row movement without changing row ID).
- Free space: grows/shrinks between header/slots and tuple area (“free space hole”).
- Tuples (records): may have per-row header (null bitmap, varlena pointers).
- Large attributes:
- Out-of-line storage: BLOBs/CLOBs/TOAST (Postgres) stored elsewhere; row stores a pointer.
- Page overflow: very wide rows can spill to overflow pages.
2.3 Fixed vs Variable Length
- Fixed: fast offset math, but wastes space (internal fragmentation).
- Variable: compact, but requires indirection (offsets/lengths) and may fragment more easily.
2.4 Fragmentation Types
- Internal: wasted space inside pages/rows.
- External: logical neighbors scattered across pages ⇒ more random I/O.
- Mitigation: fillfactor, autovacuum/vacuum, REORG/CLUSTER, page compaction.
3) CRUD Operations (Row Store)
3.1 Insert
- Use free-space maps / free lists to find pages with room.
- Insert tuple, update slot directory + free-space metadata.
- Hot pages can cause latch contention ⇒ batch/partitioning can help.
3.2 Update
- In-place if size unchanged and space available.
- Row move if it grows (new page chosen, old becomes dead/redirected) ⇒ increases fragmentation.
- Secondary index maintenance: if indexed columns change, index entries must be updated.
- HOT updates (heap-only tuple, Postgres): if PK/indexed columns unchanged, avoid touching indexes.
3.3 Delete
- Usually logical (mark dead, keep space for MVCC visibility).
- Deferred cleanup: vacuum/compaction reclaims space and defragments.
4) Indexing (Row Store)
4.1 Primary Structures
- B+-Trees: default for range scans and ordered access; leaf pages hold row IDs (or the row itself in clustered tables).
- Hash indexes: fast equality lookups, no ordering; less common as primary in RDBMS.
4.2 Clustered vs Heap
- Clustered index (InnoDB, SQL Server): table data physically ordered by the PK ⇒ range scans are sequential; only one per table.
- Heap (Postgres): no inherent ordering; CLUSTER can reorganize to follow an index (not maintained automatically).
4.3 Covering & Index-Only Scans
- Covering index: query can be satisfied by index alone (include columns in leaf).
- Index-only scans: if visibility info allows skipping heap fetch (e.g., visibility map in Postgres).
4.4 Composite & Selectivity
- Composite indexes order columns lexicographically; put most selective/leading filter first.
- Beware index bloat and write amplification on heavy-write tables.
5) Concurrency Control & Isolation
- Core idea: writers create new versions; readers see a consistent snapshot (no blocking).
- Implementation flavours:
- Undo-log based (e.g., InnoDB): old versions reconstructed from undo segments.
- Append-new-version (e.g., Postgres): new tuple version inserted; old one stays until vacuum.
- Visibility: determined by transaction IDs/timestamps; vacuum removes dead versions.
- Phantoms: solved with predicate/next-key locks at higher isolation.
5.2 Locks vs Latches
- Locks: logical concurrency control (row/table); S=shared (read), X=exclusive (write).
- Latches: short-term in-memory mutual exclusion for data structures (page/index).
- Granularity: table, page, row; escalation may occur under pressure.
- Read Uncommitted → may see dirty reads (rarely exposed).
- Read Committed → no dirty reads; nonrepeatable reads/phantoms possible.
- Repeatable Read → stable reads; phantom behavior depends on engine (InnoDB uses next-key locks).
- Serializable → as if executed one-by-one; highest overhead.
5.4 Deadlocks
- Detection (waits-for graph) vs timeouts; keep lock order consistent and keep transactions short.
6) Query Execution & Joins
- Nested Loop: great with index on inner; worst-case O(N×M).
- Hash Join: equality joins; build hash on smaller input; memory sensitive.
- Merge Join: inputs sorted on join keys; great for large ordered scans.
6.2 Optimizer Basics
- Uses statistics (histograms, NDV, correlation) to estimate costs.
- Cardinality errors lead to bad plans; keep stats fresh (ANALYZE/auto-analyze).
7) Durability & Recovery (WAL/ARIES)
- Rule: log (redo/undo info) must reach disk before corresponding data pages.
- Group commit: batch many transactions’ WAL flushes to reduce fsyncs.
- Checksums & LSNs: ensure page/WAL consistency.
7.2 ARIES-style Recovery (typical)
- Analysis: find dirty pages/active transactions at crash.
- REDO: repeat history from last checkpoint (idempotent).
- UNDO: roll back incomplete transactions.
- Doublewrite buffers (InnoDB) / full page writes (Postgres option) mitigate torn-page writes.
7.3 Checkpoints
- Periodically persist dirty pages/WAL pointers to cap recovery time.
- Spiky I/O if not tuned; spread writes (background writer) to smooth latency.
8) Buffer Pool / Caching
8.1 Responsibilities
- Cache pages (data & indexes), manage dirty pages, coordinate flushes.
- Read-ahead (sequential scan detection) and write-behind help throughput.
8.2 Eviction Policies
- LRU variants (2Q, LRU-K), CLOCK-sweep; protect hot pages from scan pollution.
- Pin pages during I/O/operations to avoid eviction.
8.3 Sizing & Hot Sets
- Too small: thrash and random I/O. Too large: OS cache duplication, long recovery.
- Monitor cache hit ratio, dirty page percentage, checkpoint age.
9) Bulk Operations
- Batching reduces per-row overhead (fewer WAL records/index changes).
- Use COPY/bulk-insert APIs; disable/deferrable constraints/secondary indexes during massive loads if safe.
- Consider minimal logging modes (vendor-specific) for ETL phases.
10) Compression & Encoding
- Row/page compression: dictionary, prefix, RLE at page level.
- Pros: more rows/page ⇒ fewer I/Os; Cons: CPU overhead.
- Hot vs cold pages**: compress colder data; beware update penalties on compressed pages.
11) Partitioning, Clustering & Layout
- Partitioning: range/list/hash; improves pruning, maintenance windows, and vacuum scope.
- Local vs global indexes: affects cross-partition queries & maintenance.
- Clustering: reorder heap following an index to improve locality (needs periodic re-cluster on heaps).
12) Practical Tuning & Maintenance
- Fillfactor: leave headroom on pages to reduce page splits and row-move updates.
- Vacuum/Autovacuum (Postgres): reclaim space, freeze XIDs to avoid wraparound.
- Reindex: fix bloat after heavy churn.
- Analyze/Statistics: keep histograms/NDV up-to-date for the optimizer.
- Hot tables: consider sharding/partitioning or moving to faster storage.
- Contended sequences/auto-increment: cache sizes, monotonicity vs gaps trade-off.
13) Common Pathologies (What to Watch)
- Page/Index bloat: frequent updates/deletes without maintenance.
- Hot pages: latch contention on popular leaf/pages; mitigate with partitioning or different keys.
- Random I/O storms: missing indexes, poor plans, small buffer pool.
- Checkpoint spikes: too infrequent or too aggressive flushing.
- Isolation surprises: phantom reads, next-key locks blocking unexpected statements.
14) Quick Glossary
- Tuple/Row: a record in a table.
- TID/RowID: physical row identifier (slot + page).
- LSN: log sequence number (WAL position).
- FSM/Free Space Map: tracks pages with available space.
- Dirty Page: modified in memory, not yet flushed.
- Checkpoint: persists positions to bound recovery time.
- HOT Update: heap-only tuple update (skip index rewrite).
15) TL;DR Cheatsheet
- Row store = OLTP king: fast point lookups & updates; weaker for wide scans.
- B+-tree everywhere: PK/range scans; choose composite order carefully.
- MVCC: versions for readers; vacuum cleans the past.
- WAL + Group Commit: durability with amortized fsync.
- Keep stats fresh; watch bloat; right-size buffer; partition hot/large tables.