PostgreSQL Write-Ahead Log — Record Insertion, LSNs, and the Durability Spine
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A write-ahead log exists to answer one question after a crash: which of
the changes a running system had in flight actually made it to durable
state, and how do we get back to a consistent point? The canonical
answer is ARIES (Mohan et al., ARIES: A Transaction Recovery Method
Supporting Fine-Granularity Locking and Partial Rollbacks Using
Write-Ahead Logging, ACM TODS 1992; captured at
knowledge/research/dbms-papers/aries.md). ARIES rests on three
principles, and PostgreSQL’s WAL is a faithful — if selectively
implemented — instance of them:
- Write-ahead logging. The log record describing a page change must reach stable storage before the changed data page does. This is the namesake invariant. It lets the buffer manager keep dirty pages in memory indefinitely (a no-force policy) and lets dirty pages be written before commit (a steal policy), because the log alone is enough to redo lost work and undo uncommitted work.
- Repeating history during redo. On restart, replay the log forward from the last checkpoint, re-applying every logged change — even those of transactions that ultimately aborted — to reconstruct the exact page state at the moment of the crash, then roll back the losers.
- Logging changes during undo. Undo actions are themselves logged (as compensation log records) so that a crash during recovery does not lose the rollback progress.
The single mechanism that ties the log to the data pages is the LSN
(Log Sequence Number). In ARIES every page carries the LSN of the most
recent log record that modified it (the PageLSN). Two comparisons fall
out of that and run the whole machine:
- The WAL rule, enforced by comparison. A page may be flushed only when the log is durable up to that page’s LSN. Equivalently, flushedLSN >= PageLSN(page) is the precondition for writing the page.
- Idempotent redo, also by comparison. During replay, a log record at LSN L affecting a page is skipped if PageLSN(page) >= L, because the change is already there. This is what makes replay safe to restart.
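The two comparisons can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL code: `DemoLsn`, `DemoPage`, `demo_can_write_page`, and `demo_redo_set_value` are hypothetical names, and the page "contents" are reduced to a single integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;        /* hypothetical: a byte position in the log */

typedef struct
{
    DemoLsn page_lsn;            /* LSN of the last record applied to this page */
    int     value;               /* stand-in for the page contents */
} DemoPage;

/* WAL rule: a page may be written only once the log is durable up to its LSN. */
static bool demo_can_write_page(const DemoPage *page, DemoLsn flushed_lsn)
{
    assert(page != NULL);
    return flushed_lsn >= page->page_lsn;
}

/*
 * Idempotent redo: apply the record only if the page has not seen it yet.
 * Returns true when the change was applied, false when it was skipped.
 */
static bool demo_redo_set_value(DemoPage *page, DemoLsn record_lsn, int new_value)
{
    assert(page != NULL);
    if (page->page_lsn >= record_lsn)
        return false;            /* the change is already on the page */
    page->value = new_value;
    page->page_lsn = record_lsn; /* stamp the page with the record's LSN */
    return true;
}
```

Replaying the same record twice leaves the page unchanged the second time, which is the property that makes an interrupted recovery safe to restart.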
Database Internals (Petrov, ch. 5, “Transaction Processing and Recovery”) frames this as the dividing line between the operational log (physical/physiological records that describe byte-level page edits) and the logical log (which records higher-level operations). PostgreSQL’s WAL is physiological: a record names a page and carries enough data to re-apply a specific physical edit, but the edit is expressed in terms a resource manager understands (insert this tuple at this line pointer) rather than raw byte diffs. Database System Concepts (Silberschatz et al., 7e, ch. 19, “Recovery System”) gives the same model under the names “deferred” vs. “immediate” modification; PostgreSQL is immediate-modification with steal/no-force, which is exactly the regime ARIES was designed for.
The design space a WAL implementer chooses within:
- Record granularity — physical (byte ranges), logical (operations), or physiological (operations scoped to one page). PostgreSQL picks physiological, per-resource-manager.
- Torn-page protection — how to survive a page that was half-written when power failed. The ARIES assumption that “a page write is atomic” is false on most storage, so a real engine needs an answer. PostgreSQL’s answer is full-page writes (full-page images, FPIs).
- Group commit — whether each committing transaction issues its own fsync, or many piggyback on one. PostgreSQL batches.
- Async durability — whether COMMIT may return before its record is fsynced, trading a bounded window of loss for latency. PostgreSQL offers this as synchronous_commit = off.
The rest of this document is, in effect, how PostgreSQL turned those four dials.
Common DBMS Design
Almost every ARIES-lineage engine converges on the same set of engineering conventions to realize the theory. Naming them here makes PostgreSQL’s specific symbols (next section) read as one set of choices within a shared playbook rather than as inventions.
The log is one logical append-only stream
Whatever the on-disk file layout, the log is presented to the rest of the engine as a single, ever-growing sequence of bytes. The LSN is a position in that stream — monotonically increasing, never reused. Making the LSN a byte offset (rather than an opaque counter) means “is the log durable up to here?” is a single integer comparison, and “how far apart are these two records?” is a subtraction. Engines split the stream into fixed-size segment files for recycling and archiving, but the segment boundary is an implementation detail hidden behind the LSN.
Records are self-describing and typed
A log record carries a header (length, owning transaction, a back-link to
the previous record, a type tag, a checksum) followed by type-specific
payload. The type tag routes the record to the right redo handler at
recovery time. This is the resource-manager pattern: the log subsystem
knows how to frame, locate, checksum, and chain records, but delegates
meaning to per-subsystem callbacks. (PostgreSQL’s full rmgr catalog —
Heap, Btree, Xact, CLOG, and the rest — is the subject of the sibling doc
postgres-wal-records-rmgr.md; this doc treats the rmgr id as an opaque
tag on the record header.)
Insertion is split from durability
High-throughput engines never let a backend hold a global log lock while doing I/O. The universal pattern is two stages:
- An insertion stage that reserves space in the log inside a short critical region (claims a byte range, assigns the record its LSN) and then copies the bytes into an in-memory log buffer.
- A separate, asynchronous path that writes and flushes the buffer to disk, driven by a background writer and by explicit flush requests at commit.
The reservation must be as small as possible because it is globally serialized: it is the one point where the single stream’s tip advances. Everything else (assembling the record, computing its checksum, copying it into the buffer, fsyncing) is pushed outside that region.
Three watermarks track progress
Because insertion and durability are decoupled, the engine maintains
ordered progress markers: how far records have been inserted into the
buffer, how far the buffer has been written to the kernel, and how far
the kernel has been flushed (fsynced) to stable media. The invariant
Insert >= Write >= Flush holds at all times, and a commit waits until
Flush >= commit-LSN.
Pages carry their LSN; the buffer manager gates on it
Every data page stores the LSN of the last log record that touched it. The buffer manager consults it before evicting/writing a dirty page, enforcing the WAL rule at the one choke point where pages leave memory. This is the seam between the WAL subsystem and the storage engine.
Torn pages are defeated by logging a whole page
Since hardware page writes are not atomic, the first time a page is modified after a checkpoint, the engine logs the entire page (a full-page image). On replay, restoring that image gives a known-good page regardless of whether the original write was torn. The cost is WAL volume; the mitigations are hole-removal and compression.
Theory ↔ PostgreSQL mapping
| ARIES / textbook concept | PostgreSQL name |
|---|---|
| Log Sequence Number (position in the stream) | XLogRecPtr — a uint64 byte offset (xlogdefs.h) |
| The single append-only log | the WAL / “xlog”; physical files in pg_wal/ |
| PageLSN on each data page | PageGetLSN / PageSetLSN (the page header pd_lsn) |
| Log record header | XLogRecord (xlogrecord.h) |
| Record type tag → redo handler | xl_rmid (rmgr id) + xl_info |
| Back-link to previous record | xl_prev |
| Record checksum | xl_crc (CRC-32C) |
| Reserve-then-copy insertion | ReserveXLogInsertLocation then CopyXLogRecordToWAL |
| In-memory log buffer | the WAL buffers (XLogCtl->pages, wal_buffers) |
| Insert / Write / Flush watermarks | logInsertResult / logWriteResult / logFlushResult |
| Full-page image (torn-page defense) | full-page write (FPW / FPI), BKPBLOCK_HAS_IMAGE |
| Group commit | XLogFlush group-commit loop + CommitDelay |
| Async durability | synchronous_commit = off, XLogBackgroundFlush |
| Segment files | WAL segments, default 16 MB (DEFAULT_XLOG_SEG_SIZE) |
PostgreSQL’s Approach
PostgreSQL calls the WAL subsystem xlog in the code. Its design intent
is captured tersely in src/backend/access/transam/README: “A basic
assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe.” Everything below is
how that assumption is mechanized.
An LSN is a byte position, and the log is one stream
The foundational decision: an XLogRecPtr is a 64-bit byte offset into a
single conceptual log that began at the start of time and grows forever.

```c
// XLogRecPtr — src/include/access/xlogdefs.h
typedef uint64 XLogRecPtr;

#define InvalidXLogRecPtr 0
#define XLogRecPtrIsValid(r) ((r) != InvalidXLogRecPtr)
```

Because the value is a byte offset, the conventional human-readable form
%X/%X (via LSN_FORMAT_ARGS) is just the high and low 32 bits of that
offset. Zero is reserved as “invalid” — bootstrap deliberately starts the
first real WAL page one segment in, so no genuine record ever begins at
offset 0.
That single stream is chopped into segment files on disk (default
16 MB), but the segmentation is invisible to the LSN. The conversion
macros live in xlog_internal.h:

```c
// segment math — src/include/access/xlog_internal.h
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
    logSegNo = (xlrp) / (wal_segsz_bytes)

#define XLogSegmentOffset(xlogptr, wal_segsz_bytes) \
    ((xlogptr) & ((wal_segsz_bytes) - 1))
```

A WAL file’s name is TLI + high-32 + low-32 of the segment number, all in
hex — the timeline ID, then the segment broken into a 4-GB “log id” and a
within-log segment index:

```c
// XLogFileName — src/include/access/xlog_internal.h
snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,
         (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
         (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)));
```

There is a subtlety the LSN math has to absorb: each WAL page (8 KB,
XLOG_BLCKSZ) begins with a header, so not every byte of the file is
“usable” for record data. PostgreSQL handles this by tracking insertion
progress in a separate space of usable byte positions that exclude all
page headers, then converting to a real XLogRecPtr outside the hot lock
(see XLogBytePosToRecPtr below). The page header is small but mandatory:

```c
// XLogPageHeaderData — src/include/access/xlog_internal.h
typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;      /* magic value for correctness checks */
    uint16      xlp_info;       /* flag bits, see below */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
    uint32      xlp_rem_len;    /* total len of remaining data for record */
} XLogPageHeaderData;
```

xlp_pageaddr is the page’s own LSN — a self-check exploited by the reader
(§ Source Walkthrough) to detect a recycled-but-not-overwritten segment.
xlp_rem_len is how a record that overflows a page boundary is continued:
the next page’s header records how many bytes of the straddling record
remain.
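The LSN arithmetic described in this section can be tried out with a small standalone sketch. It assumes the default 16 MB segment size and uses hypothetical `demo_*` helper names; it mirrors the macros quoted above but is not the PostgreSQL source.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_SEG_SIZE (UINT64_C(16) * 1024 * 1024)  /* assumed default: 16 MB */

/* Segment number that contains this LSN. */
static uint64_t demo_lsn_to_segno(uint64_t lsn)
{
    return lsn / DEMO_SEG_SIZE;
}

/* Byte offset of this LSN inside its segment (segment size is a power of 2). */
static uint64_t demo_segment_offset(uint64_t lsn)
{
    return lsn & (DEMO_SEG_SIZE - 1);
}

/* Print an LSN as the high and low 32 bits of the byte offset, in hex. */
static void demo_format_lsn(uint64_t lsn, char *buf, size_t len)
{
    snprintf(buf, len, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Build the 24-hex-digit file name: timeline, 4-GB "log id", segment index. */
static void demo_wal_file_name(uint32_t tli, uint64_t segno, char *buf, size_t len)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / DEMO_SEG_SIZE;

    assert(len >= 25);           /* 24 hex digits plus the terminator */
    snprintf(buf, len, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

For the LSN printed as 1/6B374D48 on timeline 1, the segment number is 0x16B, the within-segment offset is 0x374D48, and the file name comes out as 00000001000000010000006B.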
The record format: header, blocks, data
A WAL record is a fixed header followed by a variable run of block references and data chunks. The header is small and rigid:

```c
// XLogRecord — src/include/access/xlogrecord.h
typedef struct XLogRecord
{
    uint32        xl_tot_len;   /* total len of entire record */
    TransactionId xl_xid;       /* xact id */
    XLogRecPtr    xl_prev;      /* ptr to previous record in log */
    uint8         xl_info;      /* flag bits, see below */
    RmgrId        xl_rmid;      /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    pg_crc32c     xl_crc;       /* CRC for this record */

    /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
```

The six header fields are the ARIES record header made concrete:
xl_tot_len frames it, xl_xid ties it to a transaction, xl_prev
back-links it to the previous record (a chain validated on read),
xl_rmid+xl_info route it to a resource manager’s redo handler, and
xl_crc (CRC-32C) is the integrity check that lets recovery trust a
record before applying it. Everything after the header — block references,
full-page images, and “main data” — is laid out as documented in the
header comment of xlogrecord.h:

```
Fixed-size header (XLogRecord struct)
XLogRecordBlockHeader struct (block ref 0)
XLogRecordBlockHeader struct (block ref 1)
...
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data
```

Each modified page is described by an XLogRecordBlockHeader, identified
by a small block id, optionally followed by a full-page image header and
the page’s relfilelocator/forknum/blocknumber:

```c
// XLogRecordBlockHeader — src/include/access/xlogrecord.h
typedef struct XLogRecordBlockHeader
{
    uint8   id;             /* block reference ID */
    uint8   fork_flags;     /* fork within the relation, and flags */
    uint16  data_length;    /* number of payload bytes (excl. page image) */

    /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader follows */
    /* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
    /* BlockNumber follows */
} XLogRecordBlockHeader;
```

The fork_flags byte multiplexes the fork number (low 4 bits) with flags
in the high bits — BKPBLOCK_HAS_IMAGE (this block carries a full-page
image), BKPBLOCK_WILL_INIT (redo will rebuild the page from scratch, no
old contents needed), and BKPBLOCK_SAME_REL (omit the relfilelocator, it
equals the previous block’s). Block IDs within a single record run from 0
through XLR_MAX_BLOCK_ID (32).
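As an illustration of how one byte carries both the fork number and the flags, here is a small packing and unpacking sketch. The mask and bit values are written from memory of xlogrecord.h and should be treated as assumptions; the `DEMO_*` and `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments; check xlogrecord.h before relying on them. */
#define DEMO_FORK_MASK  0x0F    /* low 4 bits: fork number */
#define DEMO_HAS_IMAGE  0x10    /* block carries a full-page image */
#define DEMO_WILL_INIT  0x40    /* redo re-initializes the page */
#define DEMO_SAME_REL   0x80    /* locator omitted, same as previous block */

typedef struct
{
    uint8_t fork;
    bool    has_image;
    bool    will_init;
    bool    same_rel;
} DemoBlockFlags;

/* Combine a fork number and flag bits into one fork_flags byte. */
static uint8_t demo_pack_fork_flags(uint8_t fork, uint8_t flags)
{
    assert((fork & ~DEMO_FORK_MASK) == 0);   /* fork must fit in 4 bits */
    assert((flags & DEMO_FORK_MASK) == 0);   /* flags live in the high bits */
    return (uint8_t) (fork | flags);
}

/* Split a fork_flags byte back into its parts. */
static DemoBlockFlags demo_unpack_fork_flags(uint8_t fork_flags)
{
    DemoBlockFlags out;

    out.fork = (uint8_t) (fork_flags & DEMO_FORK_MASK);
    out.has_image = (fork_flags & DEMO_HAS_IMAGE) != 0;
    out.will_init = (fork_flags & DEMO_WILL_INIT) != 0;
    out.same_rel = (fork_flags & DEMO_SAME_REL) != 0;
    return out;
}
```

Packing fork 2 with the image and same-relation flags yields the single byte 0x92 under these assumed values.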
flowchart TB
subgraph REC["one XLOG record (contiguous bytes in the stream)"]
direction TB
HDR["XLogRecord header<br/>xl_tot_len, xl_xid, xl_prev,<br/>xl_info, xl_rmid, xl_crc"]
B0["XLogRecordBlockHeader 0<br/>id, fork_flags, data_length<br/>(+ image hdr if FPI)<br/>(+ RelFileLocator + BlockNumber)"]
B1["XLogRecordBlockHeader 1<br/>(BKPBLOCK_SAME_REL -> no locator)"]
DH["XLogRecordDataHeaderShort/Long<br/>(main-data length)"]
BD0["block 0 data / full-page image"]
BD1["block 1 data"]
MD["main data (rmgr-specific)"]
end
HDR --> B0 --> B1 --> DH --> BD0 --> BD1 --> MD
Figure 1 — The on-stream layout of one WAL record. The header is fixed and
MAXALIGN’d at its start; everything after it is unaligned and packed. Block
references come first (each optionally carrying a full-page image), then the
short/long main-data header, then the block data and the rmgr’s main data.
The redo handler for xl_rmid knows how to walk this back apart. (Layout
from the header comment in xlogrecord.h.)
Constructing a record: Begin / Register / Insert
A caller never hand-builds the byte layout above. It declares intent with three families of calls, and the assembler does the packing. From the README’s worked example:

```c
// the WAL-logged action recipe — src/backend/access/transam/README
XLogBeginInsert();
XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);
XLogRegisterData(&xlrec, SizeOfFictionalAction);
XLogRegisterBufData(0, tuple->data, tuple->len);
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
PageSetLSN(dp, recptr);     /* stamp the page with the record's LSN */
```

XLogRegisterBuffer records that a shared buffer was modified. The
assembler will decide whether that page needs a full-page image; the
caller does not. The function asserts the buffer is exclusive-locked and
dirty — the WAL rule’s preconditions — unless the caller passes
REGBUF_NO_CHANGE:

```c
// XLogRegisterBuffer — src/backend/access/transam/xloginsert.c
void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
    registered_buffer *regbuf;

    Assert(begininsert_called);

#ifdef USE_ASSERT_CHECKING
    if (!(flags & REGBUF_NO_CHANGE))
        Assert(BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer));
#endif

    regbuf = &registered_buffers[block_id];

    BufferGetTag(buffer, &regbuf->rlocator, &regbuf->forkno, &regbuf->block);
    regbuf->page = BufferGetPage(buffer);
    regbuf->flags = flags;
    /* ... condensed ... */
    regbuf->in_use = true;
}
```

XLogRegisterData appends rmgr-specific “main data” (the logical payload —
e.g. the tuple header to redo an insert); XLogRegisterBufData appends
data associated with a registered block, which the assembler may drop if
it takes a full-page image of that block (because the image already
contains it).
The final XLogInsert(rmid, info) is the public entry point. It is a
retry loop around two helpers — assemble, then insert — and it retries when
the insert path discovers that full-page-write state changed underneath it:

```c
// XLogInsert — src/backend/access/transam/xloginsert.c
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
    XLogRecPtr  EndPos;

    if (!begininsert_called)
        elog(ERROR, "XLogBeginInsert was not called");

    /* ... info-mask validation, bootstrap shortcut ... */

    do
    {
        XLogRecPtr   RedoRecPtr;
        bool         doPageWrites;
        XLogRecPtr   fpw_lsn;
        XLogRecData *rdt;
        int          num_fpi = 0;

        /* values needed to decide on full-page writes; rechecked under lock */
        GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);

        rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
                                 &fpw_lsn, &num_fpi, &topxid_included);

        EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
                                  topxid_included);
    } while (EndPos == InvalidXLogRecPtr);

    XLogResetInsertion();

    return EndPos;
}
```

XLogRecordAssemble walks the registered buffers and data, building the
XLogRecData chain (a linked list of (data, len) fragments) and the
record header in a scratch buffer, and computing the CRC over the record
body. The header is left out of the CRC at this stage because xl_prev is
not yet known. XLogRecordAssemble never signals a retry itself: it is
XLogInsertRecord that returns InvalidXLogRecPtr to force a re-assemble. The
return value of XLogInsert is the LSN of the end of the record, which
the caller stamps onto the page with PageSetLSN.
Full-page writes: torn-page defense
Inside XLogRecordAssemble, the decision of whether to embed a full-page
image is per-block. The rule: if full-page writes are on and this is the
first modification of the page since the last checkpoint
(page_lsn <= RedoRecPtr), back it up.

```c
// XLogRecordAssemble — needs_backup decision — xloginsert.c
else if (!doPageWrites)
    needs_backup = false;
else
{
    /* page LSN is the first data on every page passed to XLogInsert */
    XLogRecPtr  page_lsn = PageGetLSN(regbuf->page);

    needs_backup = (page_lsn <= RedoRecPtr);
    if (!needs_backup)
    {
        if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
            *fpw_lsn = page_lsn;
    }
}
```

RedoRecPtr is the LSN of the most recent checkpoint’s redo point. A page
whose LSN already exceeds it has been modified (and thus imaged) since the
checkpoint, so it needs no new image. The fpw_lsn returned here is the
lowest LSN among not-backed-up pages — it is the value XLogInsertRecord
rechecks under the insertion lock to detect that a checkpoint snuck in and
the page now does need backing up, triggering the re-assemble loop.
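The decision and its later recheck can be modeled as two pure functions. This is a simplified sketch with hypothetical names (`demo_needs_backup`, `demo_must_reassemble`); it leaves out the forced-image and no-image flags, and the recheck omits the case where full-page writes were switched on after assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;
#define DEMO_INVALID_LSN 0

/*
 * Assembly-time decision for one block. When no image is taken, remember
 * the lowest page LSN seen in *fpw_lsn so it can be rechecked later.
 */
static bool demo_needs_backup(bool do_page_writes, DemoLsn page_lsn,
                              DemoLsn redo_rec_ptr, DemoLsn *fpw_lsn)
{
    assert(fpw_lsn != NULL);
    if (!do_page_writes)
        return false;
    if (page_lsn <= redo_rec_ptr)
        return true;             /* first change since the checkpoint */
    if (*fpw_lsn == DEMO_INVALID_LSN || page_lsn < *fpw_lsn)
        *fpw_lsn = page_lsn;
    return false;
}

/*
 * Insert-time recheck: if the redo pointer has advanced past a page that
 * was not imaged, the record must be assembled again.
 */
static bool demo_must_reassemble(bool do_page_writes, DemoLsn fpw_lsn,
                                 DemoLsn current_redo_rec_ptr)
{
    return do_page_writes &&
           fpw_lsn != DEMO_INVALID_LSN &&
           fpw_lsn <= current_redo_rec_ptr;
}
```

With a redo pointer of 1000, a page at LSN 500 is imaged, while pages at 1500 and 1200 are not and leave `fpw_lsn` at 1200; if a checkpoint then moves the redo pointer to 1300, the recheck asks for a re-assemble.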
When a page is imaged, PostgreSQL shrinks the WAL cost two ways. For a
standard page layout it omits the hole — the unused middle region
between pd_lower and pd_upper that is all zeros — recording only its
offset and length:

```c
// XLogRecordAssemble — hole removal — xloginsert.c
if (regbuf->flags & REGBUF_STANDARD)
{
    uint16  lower = ((PageHeader) page)->pd_lower;
    uint16  upper = ((PageHeader) page)->pd_upper;

    if (lower >= SizeOfPageHeaderData && upper > lower && upper <= BLCKSZ)
    {
        bimg.hole_offset = lower;
        cbimg.hole_length = upper - lower;
    }
    /* ... condensed ... */
}
```

And if wal_compression is on, the (hole-removed) image is run through
PGLZ / LZ4 / ZSTD, with the algorithm recorded in bimg_info. The image
header carries BKPIMAGE_APPLY when redo must actually restore it (as
opposed to images logged only for wal_consistency_checking).
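To make the saving from hole removal concrete, the following sketch packs a page image without its hole and restores it by zero-filling the gap. It assumes an 8 KB data page; `demo_pack_image` and `demo_restore_image` are hypothetical helpers, not functions from xloginsert.c or the redo code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192        /* assumed data page size */

/*
 * Copy a page into out[], skipping the hole that starts at hole_offset
 * and spans hole_length bytes. Returns the number of bytes written.
 */
static size_t demo_pack_image(const uint8_t *page, uint16_t hole_offset,
                              uint16_t hole_length, uint8_t *out)
{
    size_t tail_start = (size_t) hole_offset + hole_length;

    assert(tail_start <= DEMO_BLCKSZ);
    memcpy(out, page, hole_offset);
    memcpy(out + hole_offset, page + tail_start, DEMO_BLCKSZ - tail_start);
    return DEMO_BLCKSZ - hole_length;
}

/* Rebuild the full page from a packed image, zero-filling the hole. */
static void demo_restore_image(const uint8_t *image, uint16_t hole_offset,
                               uint16_t hole_length, uint8_t *page)
{
    size_t tail_start = (size_t) hole_offset + hole_length;

    assert(tail_start <= DEMO_BLCKSZ);
    memcpy(page, image, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + tail_start, image + hole_offset, DEMO_BLCKSZ - tail_start);
}
```

A page with 100 bytes in use at the front and 200 at the back packs down to 300 bytes, and restoring it reproduces the original page exactly as long as the hole really was all zeros.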
flowchart TD
START["block registered<br/>in XLogRecordAssemble"] --> Q1{REGBUF_FORCE_IMAGE?}
Q1 -- yes --> BK["take full-page image"]
Q1 -- no --> Q2{REGBUF_NO_IMAGE<br/>or doPageWrites off?}
Q2 -- yes --> NOBK["no image;<br/>log only block data"]
Q2 -- no --> Q3{"page_lsn <= RedoRecPtr?<br/>first change since checkpoint"}
Q3 -- yes --> BK
Q3 -- no --> REC["record page_lsn as candidate fpw_lsn;<br/>no image this time"]
BK --> Q4{REGBUF_STANDARD?}
Q4 -- yes --> HOLE["omit pd_lower..pd_upper hole"]
Q4 -- no --> FULL["image whole BLCKSZ"]
HOLE --> COMP{wal_compression on?}
FULL --> COMP
COMP -- yes --> CZ["compress image (PGLZ/LZ4/ZSTD)"]
COMP -- no --> RAW["store raw image"]
Figure 2 — Per-block full-page-image decision inside XLogRecordAssemble.
The load-bearing branch is page_lsn <= RedoRecPtr: the first time a page
is touched after a checkpoint, its whole image is logged so replay can
restore a torn page. Subsequent changes within the same checkpoint cycle
log only the incremental edit. Hole-removal and compression shrink the
image’s WAL footprint. (Logic from xloginsert.c.)
Inserting the record: reserve, then copy
XLogInsertRecord is where the assembled chain meets shared memory. Its
own comment lays out the two-step contract; the key is that step 1 is
globally serialized and tiny, step 2 is concurrent:
- Reserve the right amount of space from the WAL. The current head of reserved space is kept in Insert->CurrBytePos, and is protected by insertpos_lck.
- Copy the record to the reserved WAL space … This can be done concurrently in multiple processes.
The reservation is the hot path’s hot spot, so it is reduced to a few
arithmetic operations under a spinlock. CurrBytePos is a “usable byte
position” (page headers excluded), so reserving N bytes is essentially
CurrBytePos += N:

```c
// ReserveXLogInsertLocation — src/backend/access/transam/xlog.c
SpinLockAcquire(&Insert->insertpos_lck);

startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
Insert->CurrBytePos = endbytepos;
Insert->PrevBytePos = startbytepos;

SpinLockRelease(&Insert->insertpos_lck);

*StartPos = XLogBytePosToRecPtr(startbytepos);
*EndPos = XLogBytePosToEndRecPtr(endbytepos);
*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
```

The byte-position→XLogRecPtr conversion (which adds back the page
headers) is done outside the spinlock by XLogBytePosToRecPtr. The
spinlock holds only long enough to bump two integers — this is the single
serialization point where the log’s tip advances, and PostgreSQL spends
real effort keeping it microscopic (the comment even notes the function is
pg_attribute_always_inline to shave cycles).
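A sketch of what such a conversion has to do is shown below. It assumes 8 KB WAL pages, 16 MB segments, a 24-byte ordinary page header, and a larger 40-byte header on the first page of each segment; those sizes and the `demo_*` names are assumptions for illustration, not values taken from this document.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ     8192u                   /* WAL page size */
#define DEMO_SEG_SIZE   (16u * 1024u * 1024u)   /* assumed segment size */
#define DEMO_SHORT_PHD  24u     /* assumed ordinary page header size */
#define DEMO_LONG_PHD   40u     /* assumed first-page-of-segment header size */

#define DEMO_USABLE_PER_PAGE (DEMO_BLCKSZ - DEMO_SHORT_PHD)
#define DEMO_USABLE_PER_SEG \
    ((DEMO_SEG_SIZE / DEMO_BLCKSZ) * DEMO_USABLE_PER_PAGE - \
     (DEMO_LONG_PHD - DEMO_SHORT_PHD))

/* Map a "usable byte position" (headers excluded) to a real log offset. */
static uint64_t demo_bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / DEMO_USABLE_PER_SEG;
    uint64_t left = bytepos % DEMO_USABLE_PER_SEG;
    uint64_t seg_offset;

    if (left < DEMO_BLCKSZ - DEMO_LONG_PHD)
    {
        /* lands on the first page of the segment, after its long header */
        seg_offset = left + DEMO_LONG_PHD;
    }
    else
    {
        /* skip the first page, then whole pages, then one short header */
        left -= DEMO_BLCKSZ - DEMO_LONG_PHD;
        seg_offset = DEMO_BLCKSZ
            + (left / DEMO_USABLE_PER_PAGE) * DEMO_BLCKSZ
            + DEMO_SHORT_PHD
            + left % DEMO_USABLE_PER_PAGE;
    }
    assert(seg_offset < DEMO_SEG_SIZE);
    return fullsegs * (uint64_t) DEMO_SEG_SIZE + seg_offset;
}
```

Under these assumptions usable position 0 maps to offset 40, and the first usable byte of the second page maps to offset 8216. All of this arithmetic depends only on the reserved integers, which is why it can run after the spinlock is released.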
Before the reservation, XLogInsertRecord takes one of the WAL
insertion locks and rechecks full-page-write state:

```c
// XLogInsertRecord — under the insertion lock — xlog.c
WALInsertLockAcquire();

if (RedoRecPtr != Insert->RedoRecPtr)
{
    Assert(RedoRecPtr < Insert->RedoRecPtr);
    RedoRecPtr = Insert->RedoRecPtr;
}
doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);

if (doPageWrites &&
    (!prevDoPageWrites ||
     (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
{
    /* a page now needs backup that the caller didn't back up. Start over. */
    WALInsertLockRelease();
    END_CRIT_SECTION();
    return InvalidXLogRecPtr;   /* -> XLogInsert re-assembles */
}

ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
                          &rechdr->xl_prev);
```

This is the re-assemble trigger: RedoRecPtr is only authoritative while
holding an insertion lock, so a checkpoint that advanced it between
assembly and insertion can invalidate the FPI decision. Once space is
reserved and xl_prev is filled in, the CRC of the header is finalized
(the body CRC was computed during assembly) and the bytes are copied into
the WAL buffers by CopyXLogRecordToWAL.
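Finishing the checksum in two phases works because a CRC can be accumulated incrementally. The sketch below is a plain bitwise CRC-32C with hypothetical `demo_crc32c_*` helpers; it is not PostgreSQL's implementation (which this document does not show), only a demonstration that feeding the body first and the header afterwards gives the same result as one pass over the bytes in that order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start value for the running CRC. */
static uint32_t demo_crc32c_init(void)
{
    return 0xFFFFFFFFu;
}

/* Fold len bytes into the running CRC (reflected polynomial 0x82F63B78). */
static uint32_t demo_crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    assert(data != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Final inversion that turns the running value into the stored checksum. */
static uint32_t demo_crc32c_final(uint32_t crc)
{
    return crc ^ 0xFFFFFFFFu;
}
```

A caller can run `demo_crc32c_update` over the record body during assembly, keep the intermediate value, and call it once more over the header bytes after `xl_prev` is known before applying `demo_crc32c_final`.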
Why eight insertion locks
The whole point of separating “reserve” (spinlock) from “copy” (insertion
lock) is that copies run in parallel. There are NUM_XLOGINSERT_LOCKS = 8
insertion locks. A backend picks one, preferring the one it used last, for cache
affinity, and migrating away on contention:

```c
// WALInsertLockAcquire — xlog.c
if (lockToTry == -1)
    lockToTry = MyProcNumber % NUM_XLOGINSERT_LOCKS;
MyLockNo = lockToTry;

immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
if (!immed)
    lockToTry = (lockToTry + 1) % NUM_XLOGINSERT_LOCKS;  /* try another next time */
```

Each lock carries more than the LWLock: an insertingAt atomic and a
lastImportantAt LSN.

```c
// WALInsertLock — xlog.c
typedef struct
{
    LWLock           lock;
    pg_atomic_uint64 insertingAt;
    XLogRecPtr       lastImportantAt;
} WALInsertLock;
```

insertingAt is how a flusher knows it is safe to write a page: a backend
that crosses a page boundary publishes how far it has inserted, so
WaitXLogInsertionsToFinish can wait only for insertions to the pages it
intends to write, ignoring insertions further ahead. lastImportantAt
tracks the LSN of the last important record under each lock (records not
flagged XLOG_MARK_UNIMPORTANT), which the checkpointer uses to decide
whether anything worth checkpointing has happened.
A few record types — an XLOG_SWITCH (force a segment boundary) and the
checkpoint-redo record — need to freeze all inserters; they take all
eight locks via WALInsertLockAcquireExclusive, which is why the insert
path branches on a WalInsertClass.
flowchart TB
A["XLogInsert(rmid, info)"] --> B["XLogRecordAssemble<br/>build XLogRecData chain + header,<br/>decide full-page images, body CRC"]
B --> C["XLogInsertRecord"]
C --> D["START_CRIT_SECTION"]
D --> E["WALInsertLockAcquire<br/>(1 of 8 LWLocks)"]
E --> F{RedoRecPtr stale<br/>or FPI now needed?}
F -- yes --> G["release lock, return Invalid<br/>-> retry assemble"]
F -- no --> H["ReserveXLogInsertLocation<br/>spinlock: CurrBytePos += size<br/>-> StartPos/EndPos/xl_prev"]
H --> I["finalize header CRC"]
I --> J["CopyXLogRecordToWAL<br/>(concurrent; into WAL buffers)"]
J --> K["update lastImportantAt"]
K --> L["WALInsertLockRelease<br/>END_CRIT_SECTION"]
L --> M["bump shared LogwrtRqst.Write<br/>if crossed a page boundary"]
M --> N["return EndPos (LSN)"]
Figure 3 — The insert path. The globally serialized work is only the
spinlock-guarded CurrBytePos += size in step H; record assembly, CRC, and
the copy into WAL buffers are all done off the global tip, parallelized
across the eight insertion LWLocks. The retry edge (F→G) handles a
checkpoint advancing RedoRecPtr between assembly and insertion. (Flow from
XLogInsertRecord in xlog.c.)
Durability: write, flush, and the three watermarks
Copying a record into the WAL buffers makes it inserted, not durable.
Three monotonic watermarks, held as atomics in XLogCtlData, track the
gap:

```c
// XLogCtlData watermarks — src/backend/access/transam/xlog.c
pg_atomic_uint64 logInsertResult;   /* last byte + 1 inserted to buffers */
pg_atomic_uint64 logWriteResult;    /* last byte + 1 written out */
pg_atomic_uint64 logFlushResult;    /* last byte + 1 flushed */
```

The ordering Insert >= Write >= Flush is maintained with explicit memory
barriers — RefreshXLogWriteResult reads Flush before Write behind a
read barrier, and XLogWrite publishes Write before Flush behind a write
barrier — so any observer always sees a Flush value that trails the Write
value it saw.
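The same publish and read ordering can be expressed with C11 atomics. This is a sketch under stated assumptions: `DemoWatermarks`, `demo_publish`, and `demo_refresh` are hypothetical names, C11 fences stand in for PostgreSQL's own barrier primitives, and a single writer is assumed to advance both values monotonically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    _Atomic uint64_t write_result;   /* how far the log has been written */
    _Atomic uint64_t flush_result;   /* how far the log has been flushed */
} DemoWatermarks;

typedef struct
{
    uint64_t write;
    uint64_t flush;
} DemoSnapshot;

/* Writer side: publish Write first, then Flush, separated by a release fence. */
static void demo_publish(DemoWatermarks *w, uint64_t write_lsn, uint64_t flush_lsn)
{
    assert(flush_lsn <= write_lsn);
    atomic_store_explicit(&w->write_result, write_lsn, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* plays the write barrier */
    atomic_store_explicit(&w->flush_result, flush_lsn, memory_order_relaxed);
}

/* Reader side: read Flush first, then Write, separated by an acquire fence. */
static DemoSnapshot demo_refresh(DemoWatermarks *w)
{
    DemoSnapshot s;

    s.flush = atomic_load_explicit(&w->flush_result, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* plays the read barrier */
    s.write = atomic_load_explicit(&w->write_result, memory_order_relaxed);
    return s;
}
```

Because the reader loads Flush before Write while the writer stores them in the opposite order, any Flush value the reader picks up was published after a Write value at least as large, so the snapshot never shows Flush ahead of Write.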
XLogFlush(record) is the durability request: “make the log durable at
least up to record.” Its first two lines are the fast path — if we are in
recovery we are reading not writing, and if it is already flushed we are
done:

```c
// XLogFlush — src/backend/access/transam/xlog.c
if (!XLogInsertAllowed())
{
    UpdateMinRecoveryPoint(record, false);
    return;
}

if (record <= LogwrtResult.Flush)   /* already durable */
    return;
```

Otherwise it loops to acquire WALWriteLock, but with a group-commit
twist: it uses LWLockAcquireOrWait, so if another backend is already
flushing, this one waits and then rechecks — frequently finding the work
already done for it. When it does flush, it deliberately flushes as far as
any in-progress insertion has reached (WaitXLogInsertionsToFinish),
piggybacking later transactions’ records onto this fsync:

```c
// XLogFlush — group commit core — xlog.c
insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);

if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
    continue;   /* someone may have flushed it already */

RefreshXLogWriteResult(LogwrtResult);
if (record <= LogwrtResult.Flush)   /* yes they did */
{
    LWLockRelease(WALWriteLock);
    break;
}

/* optional CommitDelay sleep to gather more followers */

WriteRqst.Write = insertpos;
WriteRqst.Flush = insertpos;
XLogWrite(WriteRqst, insertTLI, false);
```

The optional pg_usleep(CommitDelay) before the actual write is the
explicit group-commit knob: a brief pause that lets more committers join
the batch, trading latency for throughput.
XLogWrite does the actual pg_pwrite of whole 8-KB buffer pages,
coalescing physically-contiguous pages into one syscall, and then the
fsync:

```c
// XLogWrite — the write — xlog.c
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
/* ... loop ... */
written = pg_pwrite(openLogFile, from, nleft, startoffset);
```

The fsync itself is issue_xlog_fsync, which dispatches on
wal_sync_method (and is a no-op when the method is open_sync/
open_datasync, because in that mode the write() already synced):

```c
// issue_xlog_fsync — xlog.c
if (!enableFsync ||
    wal_sync_method == WAL_SYNC_METHOD_OPEN ||
    wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
    return;

switch (wal_sync_method)
{
    case WAL_SYNC_METHOD_FSYNC:
        if (pg_fsync_no_writethrough(fd) != 0)
            msg = ...;
        break;
    case WAL_SYNC_METHOD_FDATASYNC:
        if (pg_fdatasync(fd) != 0)
            msg = ...;
        break;
    /* ... */
}

if (msg)
    ereport(PANIC, ...);    /* a failed WAL fsync is unrecoverable */
```

A failed WAL fsync is a PANIC: the system cannot continue if it cannot
prove its log is durable.
flowchart LR
subgraph MEM["in shared memory"]
INS["records copied into<br/>WAL buffers (XLogCtl->pages)<br/>watermark: logInsertResult"]
end
INS -->|"XLogWrite: pg_pwrite<br/>whole 8KB pages"| KERN["kernel page cache<br/>watermark: logWriteResult"]
KERN -->|"issue_xlog_fsync<br/>(fsync/fdatasync)"| DISK["stable storage<br/>watermark: logFlushResult"]
COMMIT["COMMIT / XLogFlush(commitLSN)"] -.->|"waits until<br/>Flush >= commitLSN"| DISK
BUF["buffer manager:<br/>flush dirty page"] -.->|"waits until<br/>Flush >= PageLSN"| DISK
Figure 4 — The durability pipeline and its three watermarks. Insertion
populates the WAL buffers (logInsertResult); XLogWrite pushes pages to
the kernel (logWriteResult); issue_xlog_fsync forces them to stable
media (logFlushResult). Both a COMMIT and any dirty-page flush block on
Flush reaching the LSN they care about — this is the WAL rule and commit
durability expressed as the same gate. (Watermarks from XLogCtlData;
pipeline from XLogFlush/XLogWrite/issue_xlog_fsync.)
How the rest of the engine gates on the LSN
This subsystem earns the name “spine” through two gates the README states plainly.
Gate 1 — the buffer manager (the WAL rule). “Before the bufmgr can
write out a dirty page, it must ensure that xlog has been flushed to disk
at least up to the page’s LSN.” The page’s LSN was stamped by
PageSetLSN(dp, recptr) at insert time; the buffer manager reads it with
PageGetLSN and calls XLogFlush up to that value before writing. The
mechanism lives in postgres-buffer-manager.md; this doc owns the LSN it
reads. Note the README’s caveat: this check exists only in the shared
buffer manager, which is precisely why temp-table operations (local buffer
manager) must not be WAL-logged.
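As an illustration of Gate 1, here is a small self-contained sketch of the ordering the buffer manager has to respect. `ToyPage`, `toy_xlog_flush`, and `toy_flush_page` are hypothetical stand-ins, not the real bufmgr or `XLogFlush` signatures; the point is only that the log flush up to the page's LSN comes before the page write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;

typedef struct ToyPage
{
    Lsn  lsn;                   /* LSN of the last record that modified the page */
    bool dirty;
} ToyPage;

static Lsn toy_flushed_upto = 0;        /* hypothetical Flush watermark */
static int toy_wal_flush_calls = 0;     /* counts flushes that did real work */

/* Stand-in for a log flush: a no-op if the target is already durable. */
static void
toy_xlog_flush(Lsn target)
{
    if (target <= toy_flushed_upto)
        return;
    toy_flushed_upto = target;
    toy_wal_flush_calls++;
}

/* Write out a dirty page only after the WAL covering it is durable. */
static void
toy_flush_page(ToyPage *page)
{
    if (!page->dirty)
        return;
    toy_xlog_flush(page->lsn);                  /* the WAL rule */
    assert(toy_flushed_upto >= page->lsn);
    /* ... the data page write would happen here ... */
    page->dirty = false;
}
```

A page whose LSN is already behind the Flush watermark costs no extra log flush, which is the common case once commits have pushed the watermark forward.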
Gate 2 — commit (durability = a flush). A synchronous COMMIT is, at
bottom, XLogFlush(commit_record_LSN). The commit record’s insertion and
the wait for its durability are owned by postgres-xact.md; what this doc
contributes is that “the transaction is durable” reduces to “the Flush
watermark passed the commit LSN.” Under synchronous_commit = off
(asynchronous commit) the backend skips the flush, records the LSN in
shared asyncXactLSN, and returns; XLogBackgroundFlush — the walwriter’s
periodic flush — makes it durable within a bounded number of
wal_writer_delay cycles. The README is explicit that abort records are
never force-flushed: after a crash, an absent commit record is read as an
abort anyway.
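The difference between the two commit modes can be sketched as follows. This is a toy model under stated assumptions: `ToyWal`, `toy_commit`, and `toy_background_flush` are hypothetical names, `async_lsn` plays the role the text assigns to asyncXactLSN, and the bounded delay of the real background flush is reduced to a single explicit call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;

typedef struct ToyWal
{
    Lsn flushed;        /* Flush watermark */
    Lsn async_lsn;      /* highest LSN an asynchronous commit has asked for */
} ToyWal;

/* Advance the Flush watermark to at least `target`. */
static void
toy_flush(ToyWal *wal, Lsn target)
{
    if (target > wal->flushed)
        wal->flushed = target;
}

/* Returns true if the commit is already durable when the call returns. */
static bool
toy_commit(ToyWal *wal, Lsn commit_lsn, bool synchronous)
{
    assert(commit_lsn != 0);            /* 0 is the invalid LSN */
    if (synchronous)
        toy_flush(wal, commit_lsn);     /* wait for durability */
    else if (commit_lsn > wal->async_lsn)
        wal->async_lsn = commit_lsn;    /* just remember the request */
    return wal->flushed >= commit_lsn;
}

/* One background cycle: catch up to the outstanding async request. */
static void
toy_background_flush(ToyWal *wal)
{
    toy_flush(wal, wal->async_lsn);
}
```

The asynchronous path returns before the gate is satisfied, so a crash between `toy_commit` and the next `toy_background_flush` would lose that commit in this model.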
The checkpointer (which advances RedoRecPtr, the FPI cutoff) and segment
recycling are owned by postgres-checkpoint.md. Replaying all of these
records on restart — the redo loop, consistency points, timeline handling —
is owned by postgres-recovery-redo.md. This doc deliberately stops at the
point a record is durable on disk.
Source Walkthrough
Symbols grouped by sub-system. Files are under
/data/hgryoo/references/postgres/.
Record format and LSN types (headers)
- XLogRecPtr (in xlogdefs.h): the LSN; a uint64 byte offset into the logical log. InvalidXLogRecPtr is 0; LSN_FORMAT_ARGS splits it into the %X/%X print form.
- XLogRecord (in xlogrecord.h): the fixed record header, with fields xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, xl_crc. SizeOfXLogRecord is its MAXALIGN-relevant size.
- XLogRecordBlockHeader / XLogRecordBlockImageHeader / XLogRecordBlockCompressHeader (in xlogrecord.h): per-block reference framing and the full-page-image sub-headers. Flags BKPBLOCK_HAS_IMAGE, BKPBLOCK_WILL_INIT, BKPBLOCK_SAME_REL; image flags BKPIMAGE_APPLY, BKPIMAGE_HAS_HOLE, BKPIMAGE_COMPRESS_*.
- XLR_MAX_BLOCK_ID (32) and the reserved block ids XLR_BLOCK_ID_DATA_SHORT / LONG / ORIGIN / TOPLEVEL_XID (in xlogrecord.h).
- XLogRecordMaxSize (in xlogrecord.h): the ~1 GB ceiling on one record, bounded by what the reader can allocate in a single chunk.
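To make the %X/%X print form concrete, here is a small sketch that splits a 64-bit LSN into its high and low 32-bit halves and parses it back. `toy_format_lsn` and `toy_parse_lsn` are hypothetical helpers written for this example; they are not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;   /* stand-in for XLogRecPtr */

/* Format an LSN as "high/low" in hex; returns the number of characters. */
static int
toy_format_lsn(Lsn lsn, char *buf, size_t buflen)
{
    assert(buf != NULL);
    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
                    (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Parse the same form back into a 64-bit value; returns 1 on success. */
static int
toy_parse_lsn(const char *text, Lsn *out)
{
    uint32_t hi;
    uint32_t lo;

    if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
        return 0;
    *out = ((Lsn) hi << 32) | lo;
    return 1;
}
```

With this split, the value 0x0000000116B3D800 prints as 1/16B3D800, the form seen in server logs and monitoring views.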
Segment / page layout (headers)
- XLogPageHeaderData / XLogLongPageHeaderData (in xlog_internal.h): the 8-KB WAL-page header (xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len) and its long variant on the first page of each segment. XLOG_PAGE_MAGIC is the version stamp.
- XLByteToSeg / XLogSegmentOffset / XLogSegNoOffsetToRecPtr (in xlog_internal.h): LSN↔segment arithmetic.
- XLogFileName / XLogFilePath / IsXLogFileName (in xlog_internal.h): the %08X%08X%08X segment file naming.
- WalSegMinSize / WalSegMaxSize / DEFAULT_XLOG_SEG_SIZE (the last in pg_config_manual.h): segment size bounds (1 MB–1 GB) and the 16-MB default.
- XLogRecData (in xlog_internal.h): the (next, data, len) fragment list the assembler builds.
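The LSN↔segment arithmetic and the %08X%08X%08X naming can be sketched as below. The `toy_*` functions are hypothetical re-implementations for illustration, not the macros from xlog_internal.h; the split of the segment number by "segments per 4 GB of log" reflects the usual naming scheme and should be checked against the header before relying on it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

/* Segment number that contains an LSN. */
static uint64_t
toy_lsn_to_segno(Lsn lsn, uint64_t seg_size)
{
    return lsn / seg_size;
}

/* Byte offset of an LSN inside its segment (seg_size is a power of two). */
static uint64_t
toy_segment_offset(Lsn lsn, uint64_t seg_size)
{
    assert(seg_size != 0 && (seg_size & (seg_size - 1)) == 0);
    return lsn & (seg_size - 1);
}

/* Inverse direction: rebuild the LSN from (segment number, offset). */
static Lsn
toy_segno_offset_to_lsn(uint64_t segno, uint64_t offset, uint64_t seg_size)
{
    return segno * seg_size + offset;
}

/* 24 hex digits: timeline, then the segment number split per 4 GB of log. */
static int
toy_segment_file_name(char *buf, size_t buflen, uint32_t tli,
                      uint64_t segno, uint64_t seg_size)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;

    return snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
                    tli,
                    (uint32_t) (segno / segs_per_id),
                    (uint32_t) (segno % segs_per_id));
}
```

For example, with the 16-MB default, LSN 1/16B3D800 on timeline 1 falls in segment 278 at offset 0xB3D800, and the sketch names that segment 000000010000000100000016.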
Record construction (xloginsert.c)
- XLogBeginInsert: start a record; resets the registration workspace.
- XLogRegisterBuffer / XLogRegisterBlock: declare a modified page (in or out of the shared pool); the assembler decides on its FPI.
- XLogRegisterData / XLogRegisterBufData: append main data / per-block data.
- XLogEnsureRecordSpace: raise the default limits (5 block refs, 20 data chunks) for unusually large records.
- XLogSetRecordFlags: set XLOG_MARK_UNIMPORTANT and friends.
- XLogInsert: the public entry; the assemble→insert retry loop.
- XLogRecordAssemble: packs header + block refs + data into the XLogRecData chain, makes per-block FPI decisions (hole removal, compression), and computes the body CRC.
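The (next, data, len) fragment chain that the assembler produces can be illustrated with a toy version. `ToyRecData` mirrors only the shape described above; the field names and the two helpers are assumptions made for this sketch, not the real struct or copy routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy fragment with the (next, data, len) shape. */
typedef struct ToyRecData
{
    struct ToyRecData *next;    /* next fragment, or NULL at the end */
    const char *data;           /* start of this fragment's bytes */
    uint32_t    len;            /* number of bytes in this fragment */
} ToyRecData;

/* Total record length is the sum over the chain. */
static uint32_t
toy_chain_length(const ToyRecData *rdt)
{
    uint32_t total = 0;

    for (; rdt != NULL; rdt = rdt->next)
        total += rdt->len;
    return total;
}

/* Copy the fragments back to back into one contiguous buffer. */
static uint32_t
toy_chain_copy(const ToyRecData *rdt, char *dst, uint32_t dstlen)
{
    uint32_t used = 0;

    assert(toy_chain_length(rdt) <= dstlen);
    for (; rdt != NULL; rdt = rdt->next)
    {
        memcpy(dst + used, rdt->data, rdt->len);
        used += rdt->len;
    }
    return used;
}
```

The benefit of a chain is that the caller's data never has to be gathered into one buffer before the insert; the only copy happens when the fragments land in their final location.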
Insert into shared memory (xlog.c)
- XLogInsertRecord: the two-step insert (reserve, copy) under the insertion lock; the FPI-recheck retry; the WalInsertClass branch for XLOG_SWITCH / checkpoint-redo.
- ReserveXLogInsertLocation / ReserveXLogSwitch: the spinlock-guarded CurrBytePos bump that assigns the LSN range and xl_prev.
- XLogBytePosToRecPtr / XLogBytePosToEndRecPtr / XLogRecPtrToBytePos: the usable-byte-position ↔ XLogRecPtr conversions that add/remove page headers, run outside the spinlock.
- CopyXLogRecordToWAL: copies the chain into XLogCtl->pages.
- WALInsertLockAcquire / WALInsertLockAcquireExclusive / WALInsertLockRelease / WALInsertLockUpdateInsertingAt: the 8 insertion locks and the all-locks path.
- WALInsertLock (struct) / NUM_XLOGINSERT_LOCKS (8): the lock array with insertingAt and lastImportantAt.
- XLogCtlInsert (struct): CurrBytePos / PrevBytePos, insertpos_lck, RedoRecPtr, fullPageWrites, runningBackups.
- XLogCtlData (struct): the whole shared WAL state: the watermarks, pages, xlblocks, InsertTimeLineID.
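The reserve step and the usable-byte-position conversion can be sketched together. This is a simplified model, not the code in xlog.c: the header sizes (24 and 40 bytes), the 16-MB segment, and all `toy_*` / `TOY_*` names are assumptions for illustration, the spinlock is omitted, and record-size alignment is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define TOY_BLCKSZ      8192u                   /* WAL page size */
#define TOY_SHORT_PHD   24u                     /* assumed short page header */
#define TOY_LONG_PHD    40u                     /* assumed long header, first page of a segment */
#define TOY_SEG_SIZE    (16u * 1024u * 1024u)   /* 16-MB segment */

#define TOY_USABLE_PAGE (TOY_BLCKSZ - TOY_SHORT_PHD)
#define TOY_USABLE_SEG  ((TOY_SEG_SIZE / TOY_BLCKSZ) * TOY_USABLE_PAGE \
                         - (TOY_LONG_PHD - TOY_SHORT_PHD))

typedef struct ToyInsert
{
    uint64_t curr_byte_pos;     /* next free usable byte */
    uint64_t prev_byte_pos;     /* start of the previous record */
} ToyInsert;

/* The serialized step: a plain addition (the real code holds a spinlock). */
static void
toy_reserve(ToyInsert *ins, uint64_t size,
            uint64_t *start, uint64_t *end, uint64_t *prev)
{
    assert(size > 0);
    *start = ins->curr_byte_pos;
    *end = *start + size;
    *prev = ins->prev_byte_pos;
    ins->curr_byte_pos = *end;
    ins->prev_byte_pos = *start;
}

/* Outside the lock: add page headers back to get a log byte offset. */
static uint64_t
toy_bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / TOY_USABLE_SEG;
    uint64_t left = bytepos % TOY_USABLE_SEG;
    uint64_t seg_offset;

    if (left < TOY_BLCKSZ - TOY_LONG_PHD)
        seg_offset = left + TOY_LONG_PHD;       /* still on the first page */
    else
    {
        left -= TOY_BLCKSZ - TOY_LONG_PHD;      /* skip the first page */
        seg_offset = TOY_BLCKSZ
            + (left / TOY_USABLE_PAGE) * TOY_BLCKSZ
            + TOY_SHORT_PHD + left % TOY_USABLE_PAGE;
    }
    return fullsegs * TOY_SEG_SIZE + seg_offset;
}
```

Because the reservation counts only usable bytes, the contended section never has to reason about page or segment boundaries; that arithmetic is deferred to the conversion, which each backend runs on its own.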
Write / flush / fsync (xlog.c)
- XLogFlush: make the log durable up to a given LSN; the group-commit loop (LWLockAcquireOrWait, CommitDelay).
- XLogWrite: pg_pwrite whole pages, coalesced; segment switching; the end-of-segment fsync + archiver notify + checkpoint trigger.
- XLogBackgroundFlush: the walwriter’s periodic write/flush, including the async-commit catch-up.
- issue_xlog_fsync: the fsync / fdatasync dispatch on wal_sync_method; PANIC on failure.
- RefreshXLogWriteResult (macro): barriered read of the Write/Flush watermarks.
- WaitXLogInsertionsToFinish: wait for in-flight inserters to pass a target LSN, using each lock’s insertingAt.
- GetXLogInsertRecPtr: current insert tip (reads CurrBytePos).
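The barriered publish and read of the Write/Flush watermarks can be approximated with C11 atomics. This sketch swaps PostgreSQL's own atomics and barrier primitives for `<stdatomic.h>` fences, and `ToyWatermarks`, `toy_publish`, and `toy_refresh` are hypothetical names; it assumes a single publisher at a time and watermarks that only move forward.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct ToyWatermarks
{
    _Atomic uint64_t write;     /* handed to the kernel */
    _Atomic uint64_t flush;     /* forced to stable storage */
} ToyWatermarks;

/* Publisher: store Write first, fence, then Flush. */
static void
toy_publish(ToyWatermarks *wm, uint64_t write, uint64_t flush)
{
    assert(write >= flush);
    atomic_store_explicit(&wm->write, write, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&wm->flush, flush, memory_order_relaxed);
}

/* Reader: load in the opposite order (Flush, fence, Write). */
static void
toy_refresh(ToyWatermarks *wm, uint64_t *write, uint64_t *flush)
{
    *flush = atomic_load_explicit(&wm->flush, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    *write = atomic_load_explicit(&wm->write, memory_order_relaxed);
}
```

Reading in the reverse order of the writes is what keeps a reader from ever observing a Flush value ahead of the Write value it sees, without taking a lock.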
Read side (xlogreader.c) — framing only; redo lives elsewhere
- XLogReadRecord / XLogNextRecord: pull the next record from the reader’s decode queue.
- DecodeXLogRecord: split a raw record back into header + block refs + data (the inverse of XLogRecordAssemble).
- ValidXLogRecordHeader / ValidXLogRecord: validate the header (including the xl_prev back-link) and re-check the CRC-32C before any record is trusted.
- XLogRecGetBlockTag / XLogRecGetBlockTagExtended / XLogRecGetData: the accessors a redo handler uses (documented for completeness; the redo side is postgres-recovery-redo.md).
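The CRC re-check on the read side can be sketched with a plain bitwise CRC-32C. `ToyRecord` is a made-up layout for this example (the real header has more fields), and the bit-at-a-time loop stands in for whatever optimized CRC routine a production build would use; what the sketch preserves is the order: body first, then the header up to but excluding the stored CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
toy_crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical record: a tiny header whose last field is the CRC. */
typedef struct ToyRecord
{
    uint32_t      tot_len;
    uint32_t      crc;
    unsigned char body[16];
} ToyRecord;

/* CRC over the body, then over the header bytes that precede the crc field. */
static uint32_t
toy_record_crc(const ToyRecord *rec, size_t body_len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(body_len <= sizeof rec->body);
    crc = toy_crc32c_update(crc, rec->body, body_len);
    crc = toy_crc32c_update(crc, rec, offsetof(ToyRecord, crc));
    return crc ^ 0xFFFFFFFFu;
}

/* A record is trusted only if the recomputed CRC matches the stored one. */
static int
toy_record_is_valid(const ToyRecord *rec, size_t body_len)
{
    return toy_record_crc(rec, body_len) == rec->crc;
}
```

A reader that hits a mismatch treats it as the end of valid log, which is how a torn or partially written tail is detected.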
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| XLogRecPtr (typedef) | src/include/access/xlogdefs.h | 21 |
| InvalidXLogRecPtr | src/include/access/xlogdefs.h | 28 |
| LSN_FORMAT_ARGS | src/include/access/xlogdefs.h | 44 |
| XLogRecord (struct) | src/include/access/xlogrecord.h | 41 |
| XLogRecordMaxSize | src/include/access/xlogrecord.h | 74 |
| XLogRecordBlockHeader (struct) | src/include/access/xlogrecord.h | 103 |
| XLR_MAX_BLOCK_ID | src/include/access/xlogrecord.h | 241 |
| XLOG_PAGE_MAGIC | src/include/access/xlog_internal.h | 34 |
| XLogPageHeaderData (struct) | src/include/access/xlog_internal.h | 36 |
| XLogLongPageHeaderData (struct) | src/include/access/xlog_internal.h | 61 |
| WalSegMinSize | src/include/access/xlog_internal.h | 88 |
| XLByteToSeg | src/include/access/xlog_internal.h | 117 |
| XLogFileName | src/include/access/xlog_internal.h | 166 |
| XLogRecData (struct) | src/include/access/xlog_internal.h | 312 |
| DEFAULT_XLOG_SEG_SIZE | src/include/pg_config_manual.h | 20 |
| NUM_XLOGINSERT_LOCKS | src/backend/access/transam/xlog.c | 151 |
| WALInsertLock (struct) | src/backend/access/transam/xlog.c | 368 |
| XLogCtlInsert (struct end) | src/backend/access/transam/xlog.c | 446 |
| XLogCtlData (struct end) | src/backend/access/transam/xlog.c | 555 |
| XLogRecPtrToBufIdx | src/backend/access/transam/xlog.c | 592 |
| UsableBytesInPage | src/backend/access/transam/xlog.c | 598 |
| XLogInsertRecord | src/backend/access/transam/xlog.c | 748 |
| ReserveXLogInsertLocation | src/backend/access/transam/xlog.c | 1111 |
| WALInsertLockAcquire | src/backend/access/transam/xlog.c | 1374 |
| WALInsertLockAcquireExclusive | src/backend/access/transam/xlog.c | 1419 |
| XLogBytePosToRecPtr | src/backend/access/transam/xlog.c | 1861 |
| XLogWrite | src/backend/access/transam/xlog.c | 2304 |
| XLogFlush | src/backend/access/transam/xlog.c | 2780 |
| XLogBackgroundFlush | src/backend/access/transam/xlog.c | 2968 |
| issue_xlog_fsync | src/backend/access/transam/xlog.c | 8744 |
| XLogBeginInsert | src/backend/access/transam/xloginsert.c | 149 |
| XLogRegisterBuffer | src/backend/access/transam/xloginsert.c | 242 |
| XLogRegisterData | src/backend/access/transam/xloginsert.c | 364 |
| XLogRegisterBufData | src/backend/access/transam/xloginsert.c | 405 |
| XLogInsert | src/backend/access/transam/xloginsert.c | 474 |
| XLogRecordAssemble | src/backend/access/transam/xloginsert.c | 548 |
| XLogReadRecord | src/backend/access/transam/xlogreader.c | 390 |
| ValidXLogRecord | src/backend/access/transam/xlogreader.c | 1204 |
| DecodeXLogRecord | src/backend/access/transam/xlogreader.c | 1682 |
| XLogRecGetBlockTagExtended | src/backend/access/transam/xlogreader.c | 2017 |
Source verification (as of 2026-06-05)
Verified facts
- An LSN is literally a 64-bit byte offset into the logical log, and 0 means invalid. Verified in xlogdefs.h: typedef uint64 XLogRecPtr with #define InvalidXLogRecPtr 0. The header comment states the width is 64 bits “because we don’t want them ever to overflow” and that bootstrap skips the first segment so no record starts at 0.
- There are exactly 8 WAL insertion locks at REL_18. #define NUM_XLOGINSERT_LOCKS 8 in xlog.c. This is a compile-time constant, not a GUC; changing it requires a source edit. Each lock is cache-line padded via the WALInsertLockPadded union.
- The globally serialized part of insertion is only a spinlock-guarded integer bump. Verified in ReserveXLogInsertLocation: under Insert->insertpos_lck the function does startbytepos = Insert->CurrBytePos; endbytepos = startbytepos + size; Insert->CurrBytePos = endbytepos; and releases. The XLogBytePosToRecPtr conversions run after release. The reservation tracks usable byte positions (page headers excluded) so the bump is a plain addition.
- XLogInsert retries the whole assemble-then-insert cycle when full-page-write state changes under it. The do { ... } while (EndPos == InvalidXLogRecPtr) loop in XLogInsert re-runs XLogRecordAssemble whenever XLogInsertRecord returns InvalidXLogRecPtr, which happens (in the WALINSERT_NORMAL branch) when a page now needs an image that the caller did not back up. Confirmed against both functions.
- The first modification of a page after a checkpoint is logged as a full-page image. Verified in XLogRecordAssemble: needs_backup = (page_lsn <= RedoRecPtr), where RedoRecPtr is the latest checkpoint’s redo point. For standard pages the pd_lower..pd_upper hole is omitted; wal_compression then optionally compresses with PGLZ/LZ4/ZSTD.
- A WAL fsync failure is a PANIC. Verified in issue_xlog_fsync: every sync-method branch sets msg on failure and the trailing if (msg) ereport(PANIC, ...) terminates the backend. The function is a no-op when wal_sync_method is open / open_datasync (the write() already synced) or when enableFsync is off.
- The three durability watermarks are atomics ordered Insert ≥ Write ≥ Flush, maintained with explicit memory barriers. Verified in XLogCtlData (logInsertResult, logWriteResult, logFlushResult are pg_atomic_uint64) and in XLogWrite’s publish sequence (pg_atomic_write_u64(logWriteResult); pg_write_barrier(); pg_atomic_write_u64(logFlushResult)) plus the symmetric barriered read in RefreshXLogWriteResult.
- Commit durability and the buffer-manager WAL rule are the same gate on the Flush watermark. Verified at the boundary: XLogFlush early-exits when record <= LogwrtResult.Flush, and the README states the bufmgr must flush “at least up to the page’s LSN” before writing a dirty page. Both reduce to Flush >= target-LSN. The bufmgr and commit mechanisms themselves are in postgres-buffer-manager.md / postgres-xact.md (intentionally not re-verified here).
- Records are CRC-checked (CRC-32C) before they are trusted on read. Verified in ValidXLogRecord: it recomputes the CRC over the body then the header (excluding xl_crc) and rejects on mismatch. xl_prev is separately validated in ValidXLogRecordHeader.
- The default WAL segment size is 16 MB, tunable from 1 MB to 1 GB. DEFAULT_XLOG_SEG_SIZE = 16*1024*1024 (pg_config_manual.h); WalSegMinSize / WalSegMaxSize and IsValidWalSegSize (power-of-two) bound it (xlog_internal.h).
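The full-page-image decision and the hole removal described in the verified facts can be expressed in a few lines. This is a sketch under assumptions: `toy_needs_backup` and `toy_image_length` are hypothetical helpers, an 8 KB page is assumed, and the validity checks on `pd_lower` / `pd_upper` are reduced to the minimum needed to avoid a nonsensical hole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192u        /* assumed data page size */

typedef uint64_t Lsn;

/* A page untouched since the redo point needs a full image on first change. */
static bool
toy_needs_backup(Lsn page_lsn, Lsn redo_rec_ptr)
{
    return page_lsn <= redo_rec_ptr;
}

/* Bytes to log for the image once the free-space hole is dropped. */
static uint32_t
toy_image_length(uint32_t pd_lower, uint32_t pd_upper)
{
    assert(pd_lower <= TOY_BLCKSZ);
    if (pd_upper > pd_lower && pd_upper <= TOY_BLCKSZ)
        return TOY_BLCKSZ - (pd_upper - pd_lower);
    return TOY_BLCKSZ;          /* no usable hole: log the whole page */
}
```

A mostly empty page therefore costs far less than 8 KB of WAL even before compression, while a page modified a second time within the same checkpoint cycle needs no image at all.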
Open questions
- Is NUM_XLOGINSERT_LOCKS = 8 still the right number on many-core hardware? It has been 8 for many releases while core counts grew. The scalable-lock-manager work (dbms-papers/scalable-lock-manager.md) targets the heavyweight lock table, not WAL insertion, so this is a distinct contention surface. Investigation path: benchmark insert throughput vs. lock count on a high-core box; check the pgsql-hackers archives for any proposal to make it a GUC.
- What exactly does CopyXLogRecordToWAL do when a record straddles a page boundary, and how does it interact with AdvanceXLInsertBuffer page initialization? This doc treats the copy as a black box and quotes only its call site. The xlp_rem_len / XLP_FIRST_IS_CONTRECORD continuation machinery and the WAL-buffer page-init path (AdvanceXLInsertBuffer, WALBufMappingLock) deserve their own walkthrough. Investigation path: read CopyXLogRecordToWAL (xlog.c ~1227) and AdvanceXLInsertBuffer in full.
- How much WAL volume is full-page writes in practice, and when is it safe to disable them? The mechanism is verified but the cost/benefit (FPI volume vs. torn-page risk on a given storage stack, and the interaction with wal_compression and checkpoint spacing) is a tuning question this doc does not answer. Investigation path: pg_waldump --stats on a representative workload; cross-reference postgres-checkpoint.md once it exists.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- ARIES (Mohan et al., TODS 1992): the theory PostgreSQL implements. PostgreSQL adopts repeating-history redo and the PageLSN/WAL-rule machinery, but notably does not use ARIES-style undo via compensation log records for normal transaction rollback: MVCC means an aborted transaction’s tuples are simply left invisible and reclaimed by vacuum, so there is no page-level undo pass. A focused comparison of “ARIES undo vs. MVCC no-undo” would clarify exactly which ARIES pieces PostgreSQL keeps. (dbms-papers/aries.md.)
- CUBRID’s WAL: CUBRID is also ARIES-based but keeps a classic physical/physiological log with undo: it writes undo and redo log records and performs an undo pass on rollback and recovery, because its storage is update-in-place rather than no-overwrite MVCC. Comparing CUBRID’s log_append / LOG_LSA to PostgreSQL’s XLogInsert / XLogRecPtr side-by-side would surface how the no-overwrite-heap decision propagates all the way down to whether the log needs undo at all. (See the cubrid recovery analysis in knowledge/code-analysis/cubrid/.)
- Per-core / distributed logging: the single serialized log tip (CurrBytePos under one spinlock) is a known scalability ceiling. Research lines like multi-stream WAL and per-core log partitions (e.g. the “Scalable Logging through Emerging Non-Volatile Memory” and Aether/ELEDA work on log buffer contention) explore removing the single tip. A note tracing whether PostgreSQL’s 8 insertion locks plus the byte-position trick already capture most of that benefit, and where they fall short, would be the natural follow-up.
- Constant-time recovery / instant-restart: engines such as SQL Server’s Accelerated Database Recovery and Hekaton-style persistent logging decouple recovery time from log length. PostgreSQL’s recovery is still linear in WAL replayed since the last checkpoint; the design tension (checkpoint frequency vs. FPI volume vs. recovery time) is a frontier postgres-checkpoint.md and postgres-recovery-redo.md should pick up.
Sources
In-tree design docs:
- src/backend/access/transam/README: “Write-Ahead Log Coding”, “Constructing a WAL record”, “Writing Hints”, “Asynchronous Commit”.
Source files (REL_18_STABLE, commit 273fe94):
- src/backend/access/transam/xlog.c: insert/reserve/write/flush/fsync, shared WAL state, insertion locks.
- src/backend/access/transam/xloginsert.c: record construction and assembly, full-page-image logic.
- src/backend/access/transam/xlogreader.c: record decode and validation (read side).
- src/include/access/xlogrecord.h: XLogRecord and block-reference formats.
- src/include/access/xlog_internal.h: page/segment layout, file naming, XLogRecData, the rmgr method table.
- src/include/access/xlogdefs.h: XLogRecPtr / LSN definitions.
- src/include/pg_config_manual.h: DEFAULT_XLOG_SEG_SIZE.
Papers and textbooks:
- knowledge/research/dbms-papers/aries.md: Mohan et al., ARIES (TODS 1992): the WAL/PageLSN/repeating-history model.
- knowledge/research/dbms-papers/scalable-lock-manager.md: context for the lock-count open question (heavyweight locks, not WAL insertion).
- Database Internals (Petrov), ch. 5: operational vs. logical logs, physiological logging.
- Database System Concepts (Silberschatz et al., 7e), ch. 19: recovery, steal/no-force, immediate modification.
Cross-references (mechanism owned elsewhere, not duplicated here):
- postgres-recovery-redo.md: the redo loop, consistency, timelines.
- postgres-wal-records-rmgr.md: the resource-manager catalog and per-record formats.
- postgres-xact.md: commit-record insertion and the commit-time flush.
- postgres-checkpoint.md: RedoRecPtr advancement, segment recycling.
- postgres-buffer-manager.md: the dirty-page flush that gates on PageLSN.
- postgres-architecture-overview.md: Axis 3, the WAL durability spine.