1️⃣ WAL Concept Recap
The core idea of Write-Ahead Logging (WAL) is:
- Never modify data files directly before recording the change somewhere durable.
- First, generate a WAL record that describes the change (e.g. “insert tuple X into page Y”).
- Append the WAL record to the WAL buffer in memory, and on COMMIT flush it (fsync) to stable storage (disk/SSD).
- Only once the WAL is safely persisted can the database acknowledge success to the client.
- The actual data pages (table/index files) are modified in memory (buffer pool) immediately, but are only flushed to disk later by background processes (checkpoints, eviction, etc.).
That way:
- If the database crashes before WAL is flushed, the transaction is considered never committed and is discarded.
- If WAL is flushed but the data pages are not, the database can replay the WAL during recovery to redo the changes safely.
- WAL records are idempotent (or guarded by LSN checks), so even if a page was already written to disk, replaying the same WAL entry after a crash won’t corrupt or duplicate the data.
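The commit path above can be sketched as a toy log in Python. This is an illustrative sketch, not any real engine's implementation; the `MiniWAL` class, file path, and record format are all invented for the example:

```python
import os

class MiniWAL:
    """Toy write-ahead log: buffer records in memory, fsync on commit."""

    def __init__(self, path):
        # O_APPEND ensures every write lands at the end of the log file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buffer = []          # in-memory WAL buffer

    def log(self, record: str):
        # Step 1: describe the change and buffer it in memory.
        self.buffer.append(record + "\n")

    def commit(self):
        # Step 2: append the buffered records to the log file...
        os.write(self.fd, "".join(self.buffer).encode())
        # Step 3: ...and force them to stable storage.
        os.fsync(self.fd)
        self.buffer.clear()
        # Only now may we acknowledge success to the client; the actual
        # data pages can be flushed later, in the background.
        return "COMMIT OK"

wal = MiniWAL("/tmp/mini.wal")
wal.log("insert tuple X into page Y")
print(wal.commit())
```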
2️⃣ Recovery and Checkpoints
- Databases maintain a checkpoint pointer that marks the position in WAL up to which all changes are guaranteed to be reflected in data files.
- Once dirty pages covered by certain WAL records are flushed to disk, those WAL records are no longer needed and can be truncated or recycled.
- If a crash occurs before the checkpoint pointer is advanced, WAL replay will redo some operations unnecessarily, but this is safe due to idempotency.
- This design ensures correctness always comes first: better to redo too much than to risk missing a committed change.
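The redo pass with its LSN guard can be sketched as follows. The record and page shapes are invented for illustration; the point is that replaying a record whose LSN is not newer than the page's LSN is a no-op, which is what makes over-replaying safe:

```python
# Toy redo pass: replay WAL from the last checkpoint, guarded by LSN
# comparisons so re-applying an already-flushed record changes nothing.

wal = [  # each record: (lsn, page_id, new_value)
    (1, "A", "a1"),
    (2, "B", "b1"),
    (3, "A", "a2"),
]
checkpoint_lsn = 1                # everything up to LSN 1 is on disk
pages = {"A": {"lsn": 1, "value": "a1"},   # page A already reflects LSN 1
         "B": {"lsn": 0, "value": "b0"}}   # page B was never flushed

for lsn, page_id, value in wal:
    if lsn <= checkpoint_lsn:
        continue                   # covered by the checkpoint, skip
    page = pages[page_id]
    if lsn > page["lsn"]:          # idempotency guard: apply at most once
        page["value"] = value
        page["lsn"] = lsn

print(pages)   # both pages now reflect the newest committed state
```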
3️⃣ Performance Benefits of WAL
At first glance, WAL looks like extra overhead (writing to WAL and later writing data pages).
But in practice, WAL often improves performance:
- Sequential I/O is fast: WAL writes are appended sequentially, which is far faster than random writes to data files.
- Batching data flushes: Data pages can be written lazily and in bulk, reducing random I/O.
- Group commit: Multiple transactions can share the same WAL flush, amortizing fsync cost.
- Reduced contention: Clients don’t wait for every page write, only for the WAL fsync.
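Group commit in particular can be sketched without real concurrency: several transactions queue their records, then share one write()+fsync() instead of paying one fsync each. The file name and record strings are invented for illustration:

```python
import os

# Toy group commit: three transactions, one fsync.
fd = os.open("/tmp/group.wal", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
pending = ["T1: insert ...", "T2: update ...", "T3: delete ..."]

os.write(fd, ("\n".join(pending) + "\n").encode())
os.fsync(fd)    # one fsync durably commits all three transactions
os.close(fd)
print(f"{len(pending)} commits amortized over 1 fsync")
```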
4️⃣ Replication and WAL
WAL is often the foundation for replication in databases.
There are different levels of replication abstraction:
Physical Replication
- Ships raw WAL records (page/block-level changes).
- Replica replays WAL to remain byte-for-byte identical.
- Pros: Efficient, exact copy.
- Cons: No filtering, must match primary’s schema/version.
- Example: PostgreSQL streaming replication.
Transactional Replication
- Replicates entire transactions (commit boundaries preserved).
- Replica applies all operations atomically.
- Pros: Keeps consistent transaction boundaries.
- Cons: Less flexible than logical.
- Example: SQL Server transactional replication.
Logical Replication
- Replicates row-level or statement-level changes decoded from WAL.
- Replica re-executes changes (e.g. INSERT/UPDATE/DELETE).
- Pros: Can filter, transform, replicate to a different schema or DB engine.
- Cons: More overhead; divergence possible if changes depend on non-deterministic functions (e.g. NOW(), RAND()).
- Example: PostgreSQL logical replication, Debezium CDC.
Analogy
- Physical = “copy the disk blocks.”
- Transactional = “copy the finished chapters.”
- Logical = “copy the text of edits and retype them.”
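The three levels can be contrasted with toy record shapes. All field names here are invented for illustration, not any engine's actual wire format:

```python
# Physical: a raw page/block delta — opaque bytes tied to an exact layout.
physical = {"lsn": 42, "page": 7, "offset": 128, "bytes": b"\x01\x02\x03"}

# Transactional: a whole transaction shipped with its commit boundary.
transactional = {"xid": 1001, "ops": [("INSERT", "users", {"id": 5}),
                                      ("UPDATE", "stats", {"hits": 6})]}

# Logical: a decoded row change that a different schema or engine
# could filter, transform, and re-execute.
logical = {"op": "INSERT", "table": "users", "row": {"id": 5, "name": "bo"}}

for level, rec in [("physical", physical),
                   ("transactional", transactional),
                   ("logical", logical)]:
    print(level, rec)
```

The physical record only makes sense on a byte-identical replica; the logical one carries enough semantic context to be rewritten before it is applied.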
5️⃣ Differences Across Database Systems
While the principle is the same, implementations differ:
- PostgreSQL
- Strict WAL-first design.
- WAL is always fsynced before commit is acknowledged.
- Supports streaming replication by sending WAL to replicas.
- Checkpoints control WAL recycling.
- SQLite
- WAL is optional; default is rollback journal mode.
- In WAL mode, writes go into the WAL file; readers can continue reading the main DB file without being blocked.
- Periodic checkpoints merge WAL into the DB.
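The behavior above can be checked from Python's built-in sqlite3 module. The database path is invented; `PRAGMA journal_mode=WAL` and `PRAGMA wal_checkpoint` are real SQLite statements:

```python
import sqlite3

conn = sqlite3.connect("/tmp/demo.db")
# Switch from the default rollback journal to WAL mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # → wal

conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()    # the change lands in demo.db-wal first

# Merge the WAL back into the main DB file (a manual checkpoint).
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
```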
- MySQL (InnoDB)
- WAL is called the redo log.
- Uses configurable flush policies (innodb_flush_log_at_trx_commit):
  - 0 — write + fsync once per second (not at commit) → can lose up to 1s even if only mysqld crashes.
  - 1 — write + fsync at every commit → safest, lowest potential loss, highest fsync overhead.
  - 2 — write at commit, fsync every N seconds (default 1s) → middle ground; safe if only mysqld dies, but up to ~N seconds loss on OS/power crash.
- Distributed OLAP systems (e.g. StarRocks, ClickHouse)
- Often replace traditional WAL with a replicated consensus log (Raft, Paxos).
- WAL entries are replicated to quorum nodes before acknowledging success.
- This provides both durability and consistency in a cluster.
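The quorum rule behind such replicated logs is simple to state: an entry counts as committed once a majority of nodes have persisted it. A minimal sketch (cluster sizes and ack counts invented):

```python
def committed(acks: int, cluster_size: int) -> bool:
    """Raft-style majority quorum: commit once most nodes persisted it."""
    return acks >= cluster_size // 2 + 1

print(committed(2, 3))   # → True: 2 of 3 nodes have the entry
print(committed(2, 5))   # → False: a 5-node cluster needs 3 acks
```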
6️⃣ Mental Model
- Client success = WAL record safely persisted.
- Data durability = guaranteed by WAL, not by immediate page flushes.
- Data pages = updated in memory first, flushed later in bulk.
- Crash recovery = replay WAL from last checkpoint until the end.
- Replication = WAL stream can be shipped to other instances.
This balances performance (fast sequential WAL writes), durability (guaranteed recovery), and scalability (replication support).
7️⃣ How WAL is Written to Disk
Every transactional engine follows a similar 3-step path for WAL durability:
[DB Engine Log Buffer] (e.g. InnoDB redo log buffer, Postgres WAL buffer)
│ write()
▼
[OS Page Cache] (kernel memory, still volatile if OS/power crash)
│ fsync()
▼
[Log File on Disk] (durable, survives crashes)
- Log buffer (engine memory) → private to the DB process; lost if mysqld/postgres/oracle dies.
- OS page cache → outside the DB, managed by the OS; lost if the OS or power dies.
- Disk → only after fsync() is it guaranteed durable.
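The two hops in the diagram map directly onto two system calls. A minimal sketch (file path and record invented): os.write moves bytes into the OS page cache, os.fsync forces them to disk.

```python
import os

fd = os.open("/tmp/wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

record = b"insert tuple X into page Y\n"   # contents of the engine log buffer
os.write(fd, record)   # log buffer → OS page cache (lost on OS/power crash)
os.fsync(fd)           # OS page cache → disk (now durable)
os.close(fd)
```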
Analogy
- Log buffer = letter on your desk.
- OS page cache = letter dropped at the post office.
- Disk = letter delivered to recipient.
Commit policies
- Flush at every COMMIT (safest, more fsync overhead).
- Flush at intervals (e.g. once per second) (faster, may lose ≤ N seconds of data on crash).
- Some engines allow group commit, batching multiple transactions into one fsync.
This pattern is universal: PostgreSQL, InnoDB, Oracle, SQL Server, SQLite (WAL mode) all follow this “buffer → OS cache → fsync” sequence.
8️⃣ Shadow Paging vs WAL
Shadow Paging is an alternative crash-recovery mechanism:
- Changes are written to new pages (“shadow copies”) instead of overwriting existing ones.
- A page map (table of pointers) is updated at commit to “point” to the new versions.
- Old pages remain until no longer needed.
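The page-map swap can be sketched as follows; the structures are invented for illustration. The key property is that a crash before the map is swapped leaves the old, consistent state untouched:

```python
# Toy shadow paging: updates copy pages; commit swaps the page map.
pages = {1: "old-A", 2: "old-B"}     # page store: page_id -> contents
current_map = {"A": 1, "B": 2}       # live map: name -> page_id
next_id = 3

def update(name, value):
    """Write a shadow copy; the live map still points at the old page."""
    global next_id
    pages[next_id] = value
    shadow_map = dict(current_map)
    shadow_map[name] = next_id       # shadow map points at the new page
    next_id += 1
    return shadow_map

shadow = update("A", "new-A")
# Crash here? current_map still references consistent old pages — no replay.
current_map = shadow                 # commit = install the new page map
print(pages[current_map["A"]])       # → new-A
print(pages[current_map["B"]])       # → old-B (untouched, shared page)
```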
Pros
- Instant crash recovery (no WAL replay needed).
- Always-consistent snapshots.
Cons
- High write amplification (every update copies a full page).
- Poor performance for frequent updates.
- Harder to support fine-grained concurrency.
Use cases
- Copy-on-write file systems (e.g. ZFS, btrfs).
- Some embedded DBs (LMDB, FoundationDB’s storage layer).
Why WAL is usually better
- Sequential appends are faster than random page copies.
- Scales better for OLTP workloads.
9️⃣ WAL in Other Systems
Kafka
- Kafka is essentially a distributed WAL.
- Each topic partition is an append-only log stored on disk.
- Producers append sequentially, consumers read sequentially.
- Durability: messages are written to the page cache and flushed/fsynced based on settings (acks, flush.ms).
- Replication: log entries are replicated to follower brokers; modern Kafka also uses Raft (KRaft) for cluster metadata.
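The partition-log shape can be sketched in a few lines; the file name and framing are invented, but the structure (append-only writes, offset-addressed sequential reads) is the same one a WAL uses:

```python
import os

LOG = "/tmp/partition0.log"
if os.path.exists(LOG):
    os.remove(LOG)        # start from a fresh log for the demo

def append(fd, msg: bytes) -> int:
    """Append one message; return its offset (here: its byte position)."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, msg + b"\n")
    return offset

fd = os.open(LOG, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
off = append(fd, b"event-1")
append(fd, b"event-2")
os.fsync(fd)    # real Kafka flushes by policy, not on every append
os.close(fd)

# Consumers read sequentially from a saved offset.
with open(LOG, "rb") as f:
    f.seek(off)
    print(f.readline())   # → b'event-1\n'
```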
HDFS
- The NameNode uses WAL to protect metadata.
- Every change (create, delete, rename) is appended to the EditLog (its WAL).
- Periodically merged with the FsImage (a checkpoint snapshot).
- On restart, recovery = replay EditLog.
- DataNodes don’t use WAL at the HDFS layer (blocks are immutable), but higher-level systems like HBase add their own WAL.