PostgreSQL Write-Ahead Log — Record Insertion, LSNs, and the Durability Spine

A write-ahead log exists to answer one question after a crash: which of the changes a running system had in flight actually made it to durable state, and how do we get back to a consistent point? The canonical answer is ARIES (Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured at knowledge/research/dbms-papers/aries.md). ARIES rests on three principles, and PostgreSQL’s WAL is a faithful — if selectively implemented — instance of them:

  1. Write-ahead logging. The log record describing a page change must reach stable storage before the changed data page does. This is the namesake invariant. It lets the buffer manager keep dirty pages in memory indefinitely (a no-force policy) and lets dirty pages be written before commit (a steal policy), because the log alone is enough to redo lost work and undo uncommitted work.
  2. Repeating history during redo. On restart, replay the log forward from the last checkpoint, re-applying every logged change — even those of transactions that ultimately aborted — to reconstruct the exact page state at the moment of the crash, then roll back the losers.
  3. Logging changes during undo. Undo actions are themselves logged (as compensation log records) so that a crash during recovery does not lose the rollback progress.

The single mechanism that ties the log to the data pages is the LSN (Log Sequence Number). In ARIES every page carries the LSN of the most recent log record that modified it (the PageLSN). Two comparisons fall out of that and run the whole machine:

  • The WAL rule, enforced by comparison. A page may be flushed only when the log is durable up to that page’s LSN. Equivalently: flushedLSN >= PageLSN(page) is the precondition for writing the page.
  • Idempotent redo, also by comparison. During replay, a log record at LSN L affecting a page is skipped if PageLSN(page) >= L — the change is already there. This is what makes replay safe to restart.
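
The two comparisons above can be sketched as a pair of predicates. This is a minimal illustration, not PostgreSQL code: `lsn_t`, `page_may_be_flushed`, and `redo_must_apply` are hypothetical names chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;   /* hypothetical stand-in for a log sequence number */

/* WAL rule: a dirty page may leave memory only once the log is
 * durable at least up to the page's own LSN. */
static bool page_may_be_flushed(lsn_t flushed_lsn, lsn_t page_lsn)
{
    return flushed_lsn >= page_lsn;
}

/* Idempotent redo: a record is applied only if the page does not
 * already reflect it, i.e. PageLSN(page) < L. */
static bool redo_must_apply(lsn_t page_lsn, lsn_t record_lsn)
{
    return page_lsn < record_lsn;
}
```

Because both checks are plain integer comparisons, replay can be interrupted and restarted without tracking which records were already applied: the page itself carries that information.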

Database Internals (Petrov, ch. 5, “Transaction Processing and Recovery”) frames this as the dividing line between the operational log (physical/physiological records that describe byte-level page edits) and the logical log (which records higher-level operations). PostgreSQL’s WAL is physiological: a record names a page and carries enough data to re-apply a specific physical edit, but the edit is expressed in terms a resource manager understands (insert this tuple at this line pointer) rather than raw byte diffs. Database System Concepts (Silberschatz et al., 7e, ch. 19, “Recovery System”) gives the same model under the names “deferred” vs. “immediate” modification; PostgreSQL is immediate-modification with steal/no-force, which is exactly the regime ARIES was designed for.

The design space a WAL implementer chooses within:

  1. Record granularity — physical (byte ranges), logical (operations), or physiological (operations scoped to one page). PostgreSQL picks physiological, per-resource-manager.
  2. Torn-page protection — how to survive a page that was half-written when power failed. The ARIES assumption that “a page write is atomic” is false on most storage, so a real engine needs an answer. PostgreSQL’s answer is full-page writes (full-page images, FPIs).
  3. Group commit — whether each committing transaction issues its own fsync, or many piggyback on one. PostgreSQL batches.
  4. Async durability — whether COMMIT may return before its record is fsynced, trading a bounded window of loss for latency. PostgreSQL offers this as synchronous_commit = off.

The rest of this document is, in effect, how PostgreSQL turned those four dials.

Almost every ARIES-lineage engine converges on the same set of engineering conventions to realize the theory. Naming them here makes PostgreSQL’s specific symbols (next section) read as one set of choices within a shared playbook rather than as inventions.

Whatever the on-disk file layout, the log is presented to the rest of the engine as a single, ever-growing sequence of bytes. The LSN is a position in that stream — monotonically increasing, never reused. Making the LSN a byte offset (rather than an opaque counter) means “is the log durable up to here?” is a single integer comparison, and “how far apart are these two records?” is a subtraction. Engines split the stream into fixed-size segment files for recycling and archiving, but the segment boundary is an implementation detail hidden behind the LSN.

A log record carries a header (length, owning transaction, a back-link to the previous record, a type tag, a checksum) followed by type-specific payload. The type tag routes the record to the right redo handler at recovery time. This is the resource-manager pattern: the log subsystem knows how to frame, locate, checksum, and chain records, but delegates meaning to per-subsystem callbacks. (PostgreSQL’s full rmgr catalog — Heap, Btree, Xact, CLOG, and the rest — is the subject of the sibling doc postgres-wal-records-rmgr.md; this doc treats the rmgr id as an opaque tag on the record header.)

High-throughput engines never let a backend hold a global log lock while doing I/O. The universal pattern is two stages:

  1. A short critical region that reserves space in the log (claims a byte range, assigns the record its LSN) and copies the bytes into an in-memory log buffer.
  2. A separate, asynchronous path that writes and flushes the buffer to disk, driven by a background log-writer process and by explicit flush requests at commit.

The reservation stage must be as small as possible because it is globally serialized — it is the one point where the single stream’s tip advances. Everything else (assembling the record, computing its checksum, copying it into the buffer, fsyncing) is pushed outside that region.

Because insertion and durability are decoupled, the engine maintains ordered progress markers: how far records have been inserted into the buffer, how far the buffer has been written to the kernel, and how far the kernel has been flushed (fsynced) to stable media. The invariant Insert >= Write >= Flush holds at all times, and a commit waits until Flush >= commit-LSN.
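
A small model of the three progress markers may make the invariant easier to see. Everything here (`log_progress`, `lp_insert`, `lp_write`, `lp_flush`, `commit_is_durable`) is a hypothetical single-threaded sketch, not an engine API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical progress markers, all expressed as byte positions. */
typedef struct {
    uint64_t insert;  /* copied into the in-memory log buffer */
    uint64_t write;   /* handed to the kernel */
    uint64_t flush;   /* fsynced to stable media */
} log_progress;

static void lp_insert(log_progress *p, uint64_t nbytes)
{
    p->insert += nbytes;
}

/* Write cannot run ahead of what has been inserted. */
static void lp_write(log_progress *p, uint64_t upto)
{
    if (upto > p->insert)
        upto = p->insert;
    if (upto > p->write)
        p->write = upto;
}

/* Flush makes everything written so far durable. */
static void lp_flush(log_progress *p)
{
    p->flush = p->write;
    assert(p->insert >= p->write && p->write >= p->flush);
}

/* A commit is durable once Flush has passed its LSN. */
static bool commit_is_durable(const log_progress *p, uint64_t commit_lsn)
{
    return p->flush >= commit_lsn;
}
```

In this model a commit record that has only been inserted or written is not yet durable; only the flush step moves the marker that a committing transaction waits on.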

Pages carry their LSN; the buffer manager gates on it

Every data page stores the LSN of the last log record that touched it. The buffer manager consults it before evicting/writing a dirty page, enforcing the WAL rule at the one choke point where pages leave memory. This is the seam between the WAL subsystem and the storage engine.

Torn pages are defeated by logging a whole page

Since hardware page writes are not atomic, the first time a page is modified after a checkpoint, the engine logs the entire page (a full-page image). On replay, restoring that image gives a known-good page regardless of whether the original write was torn. The cost is WAL volume; the mitigations are hole-removal and compression.

| ARIES / textbook concept | PostgreSQL name |
| --- | --- |
| Log Sequence Number (position in the stream) | XLogRecPtr, a uint64 byte offset (xlogdefs.h) |
| The single append-only log | the WAL / “xlog”; physical files in pg_wal/ |
| PageLSN on each data page | PageGetLSN / PageSetLSN (the page header pd_lsn) |
| Log record header | XLogRecord (xlogrecord.h) |
| Record type tag → redo handler | xl_rmid (rmgr id) + xl_info |
| Back-link to previous record | xl_prev |
| Record checksum | xl_crc (CRC-32C) |
| Reserve-then-copy insertion | ReserveXLogInsertLocation then CopyXLogRecordToWAL |
| In-memory log buffer | the WAL buffers (XLogCtl->pages, wal_buffers) |
| Insert / Write / Flush watermarks | logInsertResult / logWriteResult / logFlushResult |
| Full-page image (torn-page defense) | full-page write (FPW / FPI), BKPBLOCK_HAS_IMAGE |
| Group commit | XLogFlush group-commit loop + CommitDelay |
| Async durability | synchronous_commit = off, XLogBackgroundFlush |
| Segment files | WAL segments, default 16 MB (DEFAULT_XLOG_SEG_SIZE) |

PostgreSQL calls the WAL subsystem xlog in the code. Its design intent is captured tersely in src/backend/access/transam/README: “A basic assumption of a write AHEAD log is that log entries must reach stable storage before the data-page changes they describe.” Everything below is how that assumption is mechanized.

An LSN is a byte position, and the log is one stream

The foundational decision: an XLogRecPtr is a 64-bit byte offset into a single conceptual log that began at the start of time and grows forever.

// XLogRecPtr — src/include/access/xlogdefs.h
typedef uint64 XLogRecPtr;
#define InvalidXLogRecPtr 0
#define XLogRecPtrIsValid(r) ((r) != InvalidXLogRecPtr)

Because the value is a byte offset, the conventional human-readable form %X/%X (via LSN_FORMAT_ARGS) is just the high and low 32 bits of that offset. Zero is reserved as “invalid” — bootstrap deliberately starts the first real WAL page one segment in, so no genuine record ever begins at offset 0.
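
The high/low split behind the %X/%X form can be sketched with two small helpers. `lsn_format` and `lsn_parse` are hypothetical names for this example; PostgreSQL itself goes through LSN_FORMAT_ARGS rather than helpers like these.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Print a 64-bit LSN as "high32/low32" in hex.
 * Returns the number of characters written, as snprintf does. */
static int lsn_format(uint64_t lsn, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
                    (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Parse the same form back into a 64-bit byte offset.
 * Returns 0 on success, -1 if the text does not match. */
static int lsn_parse(const char *text, uint64_t *lsn)
{
    uint32_t hi, lo;

    if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}
```

For instance, the byte offset 0x116B37480 prints as 1/16B37480, and subtracting two parsed LSNs gives the number of WAL bytes between them.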

That single stream is chopped into segment files on disk (default 16 MB), but the segmentation is invisible to the LSN. The conversion macros live in xlog_internal.h:

// segment math — src/include/access/xlog_internal.h
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
logSegNo = (xlrp) / (wal_segsz_bytes)
#define XLogSegmentOffset(xlogptr, wal_segsz_bytes) \
((xlogptr) & ((wal_segsz_bytes) - 1))

A WAL file’s name is TLI + high-32 + low-32 of the segment number, all in hex — the timeline ID, then the segment broken into a 4-GB “log id” and a within-log segment index:

// XLogFileName — src/include/access/xlog_internal.h
snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,
(uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
(uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)));
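
As a worked instance of the segment arithmetic and file naming above, the following sketch hard-codes the default 16 MB segment size. `SEG_SIZE`, `seg_no`, `seg_offset`, and `seg_file_name` are hypothetical names, and a real cluster may have been initialized with a different segment size.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE (UINT64_C(16) * 1024 * 1024)   /* assumed: default 16 MB */

/* Which segment does this LSN fall in? */
static uint64_t seg_no(uint64_t lsn)
{
    return lsn / SEG_SIZE;
}

/* Offset inside that segment; the mask works because SEG_SIZE is a power of 2. */
static uint32_t seg_offset(uint64_t lsn)
{
    return (uint32_t) (lsn & (SEG_SIZE - 1));
}

/* Build the 24-hex-digit name: timeline, "log id", segment within the log id. */
static void seg_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / SEG_SIZE;  /* 256 at 16 MB */

    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

Under these assumptions, LSN 1/16B37480 is byte offset 0xB37480 of segment 278, which on timeline 1 is the file 000000010000000100000016.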

There is a subtlety the LSN math has to absorb: each WAL page (8 KB, XLOG_BLCKSZ) begins with a header, so not every byte of the file is “usable” for record data. PostgreSQL handles this by tracking insertion progress in a separate space of usable byte positions that exclude all page headers, then converting to a real XLogRecPtr outside the hot lock (see XLogBytePosToRecPtr below). The page header is small but mandatory:

// XLogPageHeaderData — src/include/access/xlog_internal.h
typedef struct XLogPageHeaderData
{
uint16 xlp_magic; /* magic value for correctness checks */
uint16 xlp_info; /* flag bits, see below */
TimeLineID xlp_tli; /* TimeLineID of first record on page */
XLogRecPtr xlp_pageaddr; /* XLOG address of this page */
uint32 xlp_rem_len; /* total len of remaining data for record */
} XLogPageHeaderData;

xlp_pageaddr is the page’s own LSN — a self-check exploited by the reader (§ Source Walkthrough) to detect a recycled-but-not-overwritten segment. xlp_rem_len is how a record that overflows a page boundary is continued: the next page’s header records how many bytes of the straddling record remain.
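
The "usable byte position" idea mentioned above can be sketched as follows. The constants are assumptions for illustration: 8 KB WAL pages, 16 MB segments, a 24-byte short page header, and a 40-byte long header on the first page of each segment. `bytepos_to_lsn` is a hypothetical name modeled loosely on what XLogBytePosToRecPtr has to do; it is not a copy of that function.

```c
#include <stdint.h>

/* Assumed sizes; real values come from the PostgreSQL headers and initdb. */
#define PAGE_SZ   UINT64_C(8192)
#define SEG_SZ    (UINT64_C(16) * 1024 * 1024)
#define SHORT_HDR UINT64_C(24)   /* assumed short page header size */
#define LONG_HDR  UINT64_C(40)   /* assumed long header, first page of a segment */

#define USABLE_PER_PAGE (PAGE_SZ - SHORT_HDR)
#define USABLE_PER_SEG \
    ((SEG_SZ / PAGE_SZ) * USABLE_PER_PAGE - (LONG_HDR - SHORT_HDR))

/* Convert a header-free byte position into a real stream offset by
 * adding back every page header that precedes it. */
static uint64_t bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_PER_SEG;
    uint64_t left = bytepos % USABLE_PER_SEG;
    uint64_t off;

    if (left < PAGE_SZ - LONG_HDR) {
        /* still on the first page of the segment */
        off = left + LONG_HDR;
    } else {
        /* skip the first page, then whole pages with short headers */
        left -= PAGE_SZ - LONG_HDR;
        off = PAGE_SZ
            + (left / USABLE_PER_PAGE) * PAGE_SZ
            + SHORT_HDR + left % USABLE_PER_PAGE;
    }
    return fullsegs * SEG_SZ + off;
}
```

Keeping the reserved position in the header-free space is what lets a reservation be a plain addition; the division and modulo work above only has to happen once the lock has been released.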

A WAL record is a fixed header followed by a variable run of block references and data chunks. The header is small and rigid:

// XLogRecord — src/include/access/xlogrecord.h
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;

The six header fields are the ARIES record header made concrete: xl_tot_len frames it, xl_xid ties it to a transaction, xl_prev back-links it to the previous record (a chain validated on read), xl_rmid and xl_info route it to a resource manager’s redo handler, and xl_crc (CRC-32C) is the integrity check that lets recovery trust a record before applying it. Everything after the header (block references, full-page images, and “main data”) is laid out as documented in the header comment of xlogrecord.h:

Fixed-size header (XLogRecord struct)
XLogRecordBlockHeader struct (block ref 0)
XLogRecordBlockHeader struct (block ref 1)
...
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data

Each modified page is described by an XLogRecordBlockHeader, identified by a small block id, optionally followed by a full-page image header and the page’s relfilelocator/forknum/blocknumber:

// XLogRecordBlockHeader — src/include/access/xlogrecord.h
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (excl. page image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;

The fork_flags byte multiplexes the fork number (low 4 bits) with flags in the high bits: BKPBLOCK_HAS_IMAGE (this block carries a full-page image), BKPBLOCK_WILL_INIT (redo will rebuild the page from scratch, no old contents needed), and BKPBLOCK_SAME_REL (omit the relfilelocator, it equals the previous block’s). Block reference IDs run from 0 to XLR_MAX_BLOCK_ID (32); the IDs above that range are reserved for the data, origin, and top-level-XID headers.
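
The packing of the fork number and flags into one byte can be sketched as below. The macro names are shortened, hypothetical versions of the BKPBLOCK_* flags, and the bit values are given as I understand them from xlogrecord.h; treat them as assumptions and check the header before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of fork_flags (verify against xlogrecord.h). */
#define FORK_MASK  0x0F   /* low 4 bits: fork number */
#define FLAG_MASK  0xF0   /* high 4 bits: flags */
#define HAS_IMAGE  0x10   /* block carries a full-page image */
#define HAS_DATA   0x20   /* block carries per-block data */
#define WILL_INIT  0x40   /* redo re-initializes the page */
#define SAME_REL   0x80   /* relfilelocator omitted, same as previous block */

static uint8_t pack_fork_flags(uint8_t forknum, uint8_t flags)
{
    return (uint8_t) ((forknum & FORK_MASK) | (flags & FLAG_MASK));
}

static uint8_t fork_of(uint8_t fork_flags)
{
    return (uint8_t) (fork_flags & FORK_MASK);
}

static bool has_flag(uint8_t fork_flags, uint8_t flag)
{
    return (fork_flags & flag) != 0;
}
```

A block on fork 1 with a full-page image and the same relation as its predecessor would pack to 0x91 under this layout.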

flowchart TB
  subgraph REC["one XLOG record (contiguous bytes in the stream)"]
    direction TB
    HDR["XLogRecord header<br/>xl_tot_len, xl_xid, xl_prev,<br/>xl_info, xl_rmid, xl_crc"]
    B0["XLogRecordBlockHeader 0<br/>id, fork_flags, data_length<br/>(+ image hdr if FPI)<br/>(+ RelFileLocator + BlockNumber)"]
    B1["XLogRecordBlockHeader 1<br/>(BKPBLOCK_SAME_REL -> no locator)"]
    DH["XLogRecordDataHeaderShort/Long<br/>(main-data length)"]
    BD0["block 0 data / full-page image"]
    BD1["block 1 data"]
    MD["main data (rmgr-specific)"]
  end
  HDR --> B0 --> B1 --> DH --> BD0 --> BD1 --> MD

Figure 1 — The on-stream layout of one WAL record. The header is fixed and MAXALIGN’d at its start; everything after it is unaligned and packed. Block references come first (each optionally carrying a full-page image), then the short/long main-data header, then the block data and the rmgr’s main data. The redo handler for xl_rmid knows how to walk this back apart. (Layout from the header comment in xlogrecord.h.)

Constructing a record: Begin / Register / Insert

A caller never hand-builds the byte layout above. It declares intent with three families of calls, and the assembler does the packing. From the README’s worked example:

// the WAL-logged action recipe — src/backend/access/transam/README
XLogBeginInsert();
XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);
XLogRegisterData(&xlrec, SizeOfFictionalAction);
XLogRegisterBufData(0, tuple->data, tuple->len);
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
PageSetLSN(dp, recptr); /* stamp the page with the record's LSN */

XLogRegisterBuffer records that a shared buffer was modified. The assembler will decide whether that page needs a full-page image; the caller does not. The function asserts the buffer is exclusive-locked and dirty — the WAL rule’s preconditions — unless the caller passes REGBUF_NO_CHANGE:

// XLogRegisterBuffer — src/backend/access/transam/xloginsert.c
void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
registered_buffer *regbuf;
Assert(begininsert_called);
#ifdef USE_ASSERT_CHECKING
if (!(flags & REGBUF_NO_CHANGE))
Assert(BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer));
#endif
regbuf = &registered_buffers[block_id];
BufferGetTag(buffer, &regbuf->rlocator, &regbuf->forkno, &regbuf->block);
regbuf->page = BufferGetPage(buffer);
regbuf->flags = flags;
/* ... condensed ... */
regbuf->in_use = true;
}

XLogRegisterData appends rmgr-specific “main data” (the logical payload — e.g. the tuple header to redo an insert); XLogRegisterBufData appends data associated with a registered block, which the assembler may drop if it takes a full-page image of that block (because the image already contains it).

The final XLogInsert(rmid, info) is the public entry point. It is a retry loop around two helpers — assemble, then insert — and it retries when the insert path discovers that full-page-write state changed underneath it:

// XLogInsert — src/backend/access/transam/xloginsert.c
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");
/* ... info-mask validation, bootstrap shortcut ... */
do
{
XLogRecPtr RedoRecPtr;
bool doPageWrites;
XLogRecPtr fpw_lsn;
XLogRecData *rdt;
int num_fpi = 0;
/* values needed to decide on full-page writes; rechecked under lock */
GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
&fpw_lsn, &num_fpi, &topxid_included);
EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
topxid_included);
} while (EndPos == InvalidXLogRecPtr);
XLogResetInsertion();
return EndPos;
}

XLogRecordAssemble walks the registered buffers and data, building the XLogRecData chain (a linked list of (data, len) fragments) and the record header in a scratch buffer. It computes the CRC over the record body only; the header cannot be included yet because xl_prev is not known until space is reserved. The assembler itself never produces InvalidXLogRecPtr: it is XLogInsertRecord that returns the invalid value, which makes the loop re-assemble. The return value of XLogInsert is the LSN of the end of the record, which the caller stamps onto the page with PageSetLSN.
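
A fragment chain of this shape is easy to model. The `frag` type and the two helpers below are hypothetical; they mirror the (next, data, len) idea only and are not the XLogRecData definition.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of a (next, data, len) fragment list. */
typedef struct frag {
    struct frag *next;
    const void  *data;
    uint32_t     len;
} frag;

/* Total record length is the sum of the fragment lengths. */
static size_t chain_total_len(const frag *f)
{
    size_t total = 0;

    for (; f != NULL; f = f->next)
        total += f->len;
    return total;
}

/* Copy the fragments back to back into dst.
 * Returns bytes copied, or 0 if dst is too small. */
static size_t chain_copy(const frag *f, char *dst, size_t cap)
{
    size_t used = 0;

    for (; f != NULL; f = f->next) {
        if (f->len > cap - used)
            return 0;
        memcpy(dst + used, f->data, f->len);
        used += f->len;
    }
    return used;
}
```

The point of the chain is that assembly never has to copy the caller's data into one contiguous record; the only copy happens later, straight into the reserved WAL buffer space.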

Inside XLogRecordAssemble, the decision of whether to embed a full-page image is per-block. The rule: if full-page writes are on and this is the first modification of the page since the last checkpoint (page_lsn <= RedoRecPtr), back it up.

// XLogRecordAssemble — needs_backup decision — xloginsert.c
else if (!doPageWrites)
needs_backup = false;
else
{
/* page LSN is the first data on every page passed to XLogInsert */
XLogRecPtr page_lsn = PageGetLSN(regbuf->page);
needs_backup = (page_lsn <= RedoRecPtr);
if (!needs_backup)
{
if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
*fpw_lsn = page_lsn;
}
}

RedoRecPtr is the LSN of the most recent checkpoint’s redo point. A page whose LSN already exceeds it has been modified (and thus imaged) since the checkpoint, so it needs no new image. The fpw_lsn returned here is the lowest LSN among not-backed-up pages — it is the value XLogInsertRecord rechecks under the insertion lock to detect that a checkpoint snuck in and the page now does need backing up, triggering the re-assemble loop.
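
The image decision and the later recheck can be written as two pure functions. This is a sketch under simplifying assumptions: it ignores the REGBUF_* override flags, and `needs_backup` and `must_reassemble` are hypothetical names for the two pieces of logic described above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;
#define INVALID_LSN 0

/* Assembly-time decision for one block. When no image is taken, remember
 * the lowest page LSN seen so the insert step can recheck it. */
static bool needs_backup(lsn_t page_lsn, lsn_t redo_ptr,
                         bool do_page_writes, lsn_t *fpw_lsn)
{
    if (!do_page_writes)
        return false;
    if (page_lsn <= redo_ptr)
        return true;                      /* first change since the checkpoint */
    if (*fpw_lsn == INVALID_LSN || page_lsn < *fpw_lsn)
        *fpw_lsn = page_lsn;
    return false;
}

/* Insert-time recheck against the authoritative redo pointer: if page writes
 * were switched on, or a skipped page now falls at or below the redo point,
 * the record has to be assembled again. */
static bool must_reassemble(lsn_t fpw_lsn, lsn_t current_redo_ptr,
                            bool do_page_writes, bool prev_do_page_writes)
{
    return do_page_writes &&
           (!prev_do_page_writes ||
            (fpw_lsn != INVALID_LSN && fpw_lsn <= current_redo_ptr));
}
```

Tracking only the minimum LSN is enough because the recheck asks a single question: did the redo point move past any page that was left without an image?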

When a page is imaged, PostgreSQL shrinks the WAL cost two ways. For a standard page layout it omits the hole — the unused middle region between pd_lower and pd_upper that is all zeros — recording only its offset and length:

// XLogRecordAssemble — hole removal — xloginsert.c
if (regbuf->flags & REGBUF_STANDARD)
{
uint16 lower = ((PageHeader) page)->pd_lower;
uint16 upper = ((PageHeader) page)->pd_upper;
if (lower >= SizeOfPageHeaderData && upper > lower && upper <= BLCKSZ)
{
bimg.hole_offset = lower;
cbimg.hole_length = upper - lower;
}
/* ... condensed ... */
}

And if wal_compression is on, the (hole-removed) image is run through PGLZ / LZ4 / ZSTD, with the algorithm recorded in bimg_info. The image header carries BKPIMAGE_APPLY when redo must actually restore it (as opposed to images logged only for wal_consistency_checking).
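
Hole removal and its inverse can be sketched in a few lines. `PAGE_SIZE`, `image_without_hole`, and `restore_with_hole` are hypothetical names, the page size is assumed to be 8192, and the caller is assumed to have validated that lower <= upper <= PAGE_SIZE, as the bounds check in the excerpt above does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192   /* assumed block size */

/* Copy a page into out, skipping the unused region [lower, upper).
 * Returns the image length. Caller guarantees lower <= upper <= PAGE_SIZE. */
static size_t image_without_hole(const uint8_t *page, uint16_t lower,
                                 uint16_t upper, uint8_t *out)
{
    memcpy(out, page, lower);
    memcpy(out + lower, page + upper, (size_t) PAGE_SIZE - upper);
    return (size_t) lower + (PAGE_SIZE - upper);
}

/* Rebuild the full page from the image, zero-filling the hole. */
static void restore_with_hole(const uint8_t *img, uint16_t hole_offset,
                              uint16_t hole_length, uint8_t *page)
{
    size_t tail = (size_t) PAGE_SIZE - hole_offset - hole_length;

    memcpy(page, img, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + hole_offset + hole_length, img + hole_offset, tail);
}
```

On a mostly empty page the hole is most of the 8 KB, so the logged image can be far smaller than the page it restores.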

flowchart TD
  START["block registered<br/>in XLogRecordAssemble"] --> Q1{REGBUF_FORCE_IMAGE?}
  Q1 -- yes --> BK["take full-page image"]
  Q1 -- no --> Q2{REGBUF_NO_IMAGE<br/>or doPageWrites off?}
  Q2 -- yes --> NOBK["no image;<br/>log only block data"]
  Q2 -- no --> Q3{"page_lsn &lt;= RedoRecPtr?<br/>first change since checkpoint"}
  Q3 -- yes --> BK
  Q3 -- no --> REC["record page_lsn as candidate fpw_lsn;<br/>no image this time"]
  BK --> Q4{REGBUF_STANDARD?}
  Q4 -- yes --> HOLE["omit pd_lower..pd_upper hole"]
  Q4 -- no --> FULL["image whole BLCKSZ"]
  HOLE --> COMP{wal_compression on?}
  FULL --> COMP
  COMP -- yes --> CZ["compress image (PGLZ/LZ4/ZSTD)"]
  COMP -- no --> RAW["store raw image"]

Figure 2 — Per-block full-page-image decision inside XLogRecordAssemble. The load-bearing branch is page_lsn <= RedoRecPtr: the first time a page is touched after a checkpoint, its whole image is logged so replay can restore a torn page. Subsequent changes within the same checkpoint cycle log only the incremental edit. Hole-removal and compression shrink the image’s WAL footprint. (Logic from xloginsert.c.)

XLogInsertRecord is where the assembled chain meets shared memory. Its own comment lays out the two-step contract; the key is that step 1 is globally serialized and tiny, step 2 is concurrent:

  1. Reserve the right amount of space from the WAL. The current head of reserved space is kept in Insert->CurrBytePos, and is protected by insertpos_lck.
  2. Copy the record to the reserved WAL space … This can be done concurrently in multiple processes.

The reservation is the hot path’s hot spot, so it is reduced to a few arithmetic operations under a spinlock. CurrBytePos is a “usable byte position” (page headers excluded), so reserving N bytes is essentially CurrBytePos += N:

// ReserveXLogInsertLocation — src/backend/access/transam/xlog.c
SpinLockAcquire(&Insert->insertpos_lck);
startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
Insert->CurrBytePos = endbytepos;
Insert->PrevBytePos = startbytepos;
SpinLockRelease(&Insert->insertpos_lck);
*StartPos = XLogBytePosToRecPtr(startbytepos);
*EndPos = XLogBytePosToEndRecPtr(endbytepos);
*PrevPtr = XLogBytePosToRecPtr(prevbytepos);

The byte-position→XLogRecPtr conversion (which adds back the page headers) is done outside the spinlock by XLogBytePosToRecPtr. The spinlock holds only long enough to bump two integers — this is the single serialization point where the log’s tip advances, and PostgreSQL spends real effort keeping it microscopic (the comment even notes the function is pg_attribute_always_inline to shave cycles).
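
The shape of the reservation step can be reproduced with a C11 spinlock. This is a sketch, not the PostgreSQL implementation: `insert_ctl` and `reserve` are hypothetical, and the lock is a plain `atomic_flag` rather than PostgreSQL's own spinlock primitive.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared insertion state. */
typedef struct {
    atomic_flag lck;            /* spinlock guarding the two positions */
    uint64_t    curr_bytepos;   /* end of reserved space */
    uint64_t    prev_bytepos;   /* start of the previously reserved record */
} insert_ctl;

/* Claim `size` bytes. Only a handful of loads and stores happen while
 * the lock is held; any conversion work is left to the caller. */
static void reserve(insert_ctl *ctl, uint64_t size,
                    uint64_t *start, uint64_t *end, uint64_t *prev)
{
    while (atomic_flag_test_and_set_explicit(&ctl->lck, memory_order_acquire))
        ;   /* spin until the lock is free */

    *start = ctl->curr_bytepos;
    *end = *start + size;
    *prev = ctl->prev_bytepos;
    ctl->curr_bytepos = *end;
    ctl->prev_bytepos = *start;

    atomic_flag_clear_explicit(&ctl->lck, memory_order_release);
}
```

Each caller leaves with a private byte range and the start of the record before it, which is what it needs to fill in the back-link without further coordination.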

Before the reservation, XLogInsertRecord takes one of the WAL insertion locks and rechecks full-page-write state:

// XLogInsertRecord — under the insertion lock — xlog.c
WALInsertLockAcquire();
if (RedoRecPtr != Insert->RedoRecPtr)
{
Assert(RedoRecPtr < Insert->RedoRecPtr);
RedoRecPtr = Insert->RedoRecPtr;
}
doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
if (doPageWrites &&
(!prevDoPageWrites ||
(fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
{
/* a page now needs backup that the caller didn't back up. Start over. */
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr; /* -> XLogInsert re-assembles */
}
ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
&rechdr->xl_prev);

This is the re-assemble trigger: RedoRecPtr is only authoritative while holding an insertion lock, so a checkpoint that advanced it between assembly and insertion can invalidate the FPI decision. Once space is reserved and xl_prev is filled in, the CRC of the header is finalized (the body CRC was computed during assembly) and the bytes are copied into the WAL buffers by CopyXLogRecordToWAL.
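
Finishing the CRC in two stages works because a CRC can be computed incrementally. The bitwise CRC-32C below is a slow reference sketch with hypothetical names (`crc32c_update`, `CRC32C_INIT`, `CRC32C_FINAL`); PostgreSQL uses its own, much faster implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_INIT      0xFFFFFFFFu
#define CRC32C_FINAL(c)  ((c) ^ 0xFFFFFFFFu)

/* Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Feeding data in several calls gives the same result as one call
 * over the concatenation. */
static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
```

With this property the body can be checksummed during assembly, and the running value is extended over the header once xl_prev has been filled in, without re-reading the body.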

The whole point of separating “reserve” (spinlock) from “copy” (insertion lock) is that copies run in parallel. There are NUM_XLOGINSERT_LOCKS = 8 insertion locks. A backend picks one, preferring the one it used last for cache affinity and moving to a different one after it hits contention:

// WALInsertLockAcquire — xlog.c
if (lockToTry == -1)
lockToTry = MyProcNumber % NUM_XLOGINSERT_LOCKS;
MyLockNo = lockToTry;
immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
if (!immed)
lockToTry = (lockToTry + 1) % NUM_XLOGINSERT_LOCKS; /* try another next time */

Each lock carries more than the LWLock: an insertingAt atomic and a lastImportantAt LSN.

// WALInsertLock — xlog.c
typedef struct
{
LWLock lock;
pg_atomic_uint64 insertingAt;
XLogRecPtr lastImportantAt;
} WALInsertLock;

insertingAt is how a flusher knows it is safe to write a page: a backend that crosses a page boundary publishes how far it has inserted, so WaitXLogInsertionsToFinish can wait only for insertions to the pages it intends to write, ignoring insertions further ahead. lastImportantAt tracks the LSN of the last important record under each lock (records not flagged XLOG_MARK_UNIMPORTANT), which the checkpointer uses to decide whether anything worth checkpointing has happened.

A few record types — an XLOG_SWITCH (force a segment boundary) and the checkpoint-redo record — need to freeze all inserters; they take all eight locks via WALInsertLockAcquireExclusive, which is why the insert path branches on a WalInsertClass.

flowchart TB
  A["XLogInsert(rmid, info)"] --> B["XLogRecordAssemble<br/>build XLogRecData chain + header,<br/>decide full-page images, body CRC"]
  B --> C["XLogInsertRecord"]
  C --> D["START_CRIT_SECTION"]
  D --> E["WALInsertLockAcquire<br/>(1 of 8 LWLocks)"]
  E --> F{RedoRecPtr stale<br/>or FPI now needed?}
  F -- yes --> G["release lock, return Invalid<br/>-> retry assemble"]
  F -- no --> H["ReserveXLogInsertLocation<br/>spinlock: CurrBytePos += size<br/>-> StartPos/EndPos/xl_prev"]
  H --> I["finalize header CRC"]
  I --> J["CopyXLogRecordToWAL<br/>(concurrent; into WAL buffers)"]
  J --> K["update lastImportantAt"]
  K --> L["WALInsertLockRelease<br/>END_CRIT_SECTION"]
  L --> M["bump shared LogwrtRqst.Write<br/>if crossed a page boundary"]
  M --> N["return EndPos (LSN)"]

Figure 3 — The insert path. The globally serialized work is only the spinlock-guarded CurrBytePos += size in step H; record assembly, CRC, and the copy into WAL buffers are all done off the global tip, parallelized across the eight insertion LWLocks. The retry edge (F→G) handles a checkpoint advancing RedoRecPtr between assembly and insertion. (Flow from XLogInsertRecord in xlog.c.)

Durability: write, flush, and the three watermarks

Copying a record into the WAL buffers makes it inserted, not durable. Three monotonic watermarks, held as atomics in XLogCtlData, track the gap:

// XLogCtlData watermarks — src/backend/access/transam/xlog.c
pg_atomic_uint64 logInsertResult; /* last byte + 1 inserted to buffers */
pg_atomic_uint64 logWriteResult; /* last byte + 1 written out */
pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */

The ordering Insert >= Write >= Flush is maintained with explicit memory barriers — RefreshXLogWriteResult reads Flush before Write behind a read barrier, and XLogWrite publishes Write before Flush behind a write barrier — so any observer always sees a Flush value that trails the Write value it saw.
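
The barrier pairing described above can be approximated with C11 fences. This is an analogy under assumptions, not the PostgreSQL code: `watermarks`, `publish`, and `refresh` are hypothetical, a single writer is assumed, and release/acquire fences stand in for PostgreSQL's write and read barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t write_result;
    _Atomic uint64_t flush_result;
} watermarks;

typedef struct {
    uint64_t write;
    uint64_t flush;
} snapshot;

/* Writer side: publish Write first, then Flush, with a release fence
 * between the two stores. */
static void publish(watermarks *w, uint64_t write, uint64_t flush)
{
    atomic_store_explicit(&w->write_result, write, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&w->flush_result, flush, memory_order_relaxed);
}

/* Reader side: read Flush first, then Write, with an acquire fence
 * between the two loads. A reader that observes a given Flush value
 * therefore observes a Write value at least as new. */
static snapshot refresh(watermarks *w)
{
    snapshot s;

    s.flush = atomic_load_explicit(&w->flush_result, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    s.write = atomic_load_explicit(&w->write_result, memory_order_relaxed);
    return s;
}
```

The opposite read order would allow a reader to pair an old Write with a newer Flush and briefly see Flush ahead of Write, which is the inversion the ordering rules out.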

XLogFlush(record) is the durability request: “make the log durable at least up to record.” Its first two checks are the fast path: if WAL insertion is not allowed (for example, during recovery) there is nothing to write, so it only advances the minimum recovery point; and if the record is already flushed, it returns immediately:

// XLogFlush — src/backend/access/transam/xlog.c
if (!XLogInsertAllowed())
{
UpdateMinRecoveryPoint(record, false);
return;
}
if (record <= LogwrtResult.Flush) /* already durable */
return;

Otherwise it loops to acquire WALWriteLock, but with a group-commit twist: it uses LWLockAcquireOrWait, so if another backend is already flushing, this one waits and then rechecks — frequently finding the work already done for it. When it does flush, it deliberately flushes as far as any in-progress insertion has reached (WaitXLogInsertionsToFinish), piggybacking later transactions’ records onto this fsync:

// XLogFlush — group commit core — xlog.c
insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
continue; /* someone may have flushed it already */
RefreshXLogWriteResult(LogwrtResult);
if (record <= LogwrtResult.Flush) /* yes they did */
{
LWLockRelease(WALWriteLock);
break;
}
/* optional CommitDelay sleep to gather more followers */
WriteRqst.Write = insertpos;
WriteRqst.Flush = insertpos;
XLogWrite(WriteRqst, insertTLI, false);

The optional pg_usleep(CommitDelay) before the actual write is the explicit group-commit knob: a brief pause that lets more committers join the batch, trading latency for throughput.
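
The piggybacking effect can be shown with a single-threaded toy model. `wal_sim` and `sim_flush` are hypothetical, and the model leaves out locking and the CommitDelay sleep entirely; it only captures "recheck, then flush as far as insertion has reached".

```c
#include <stdint.h>

/* Hypothetical model of the flush state. */
typedef struct {
    uint64_t inserted;   /* how far records have been inserted */
    uint64_t flushed;    /* how far the log is durable */
    unsigned fsyncs;     /* number of flushes actually issued */
} wal_sim;

/* Request durability up to `record`. If an earlier flush already covered
 * it, return at once; otherwise flush everything inserted so far. */
static void sim_flush(wal_sim *w, uint64_t record)
{
    if (record <= w->flushed)
        return;                     /* someone else's flush covered us */
    w->flushed = w->inserted;       /* flush as far as insertion has reached */
    w->fsyncs++;
}
```

If three commit records ending at 100, 200, and 300 are all inserted before the first committer flushes, the three sim_flush calls issue one flush between them; the second and third return on the fast path.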

XLogWrite does the actual pg_pwrite of whole 8-KB buffer pages, coalescing physically-contiguous pages into one syscall, and then the fsync:

// XLogWrite — the write — xlog.c
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
nbytes = npages * (Size) XLOG_BLCKSZ;
/* ... loop ... */
written = pg_pwrite(openLogFile, from, nleft, startoffset);

The fsync itself is issue_xlog_fsync, which dispatches on wal_sync_method (and is a no-op when the method is open_sync/open_datasync, because in that mode the write() already synced):

// issue_xlog_fsync — xlog.c
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
return;
switch (wal_sync_method)
{
case WAL_SYNC_METHOD_FSYNC:
if (pg_fsync_no_writethrough(fd) != 0) msg = ...; break;
case WAL_SYNC_METHOD_FDATASYNC:
if (pg_fdatasync(fd) != 0) msg = ...; break;
/* ... */
}
if (msg) ereport(PANIC, ...); /* a failed WAL fsync is unrecoverable */

A failed WAL fsync is a PANIC: the system cannot continue if it cannot prove its log is durable.

flowchart LR
  subgraph MEM["in shared memory"]
    INS["records copied into<br/>WAL buffers (XLogCtl->pages)<br/>watermark: logInsertResult"]
  end
  INS -->|"XLogWrite: pg_pwrite<br/>whole 8KB pages"| KERN["kernel page cache<br/>watermark: logWriteResult"]
  KERN -->|"issue_xlog_fsync<br/>(fsync/fdatasync)"| DISK["stable storage<br/>watermark: logFlushResult"]
  COMMIT["COMMIT / XLogFlush(commitLSN)"] -.->|"waits until<br/>Flush >= commitLSN"| DISK
  BUF["buffer manager:<br/>flush dirty page"] -.->|"waits until<br/>Flush >= PageLSN"| DISK

Figure 4 — The durability pipeline and its three watermarks. Insertion populates the WAL buffers (logInsertResult); XLogWrite pushes pages to the kernel (logWriteResult); issue_xlog_fsync forces them to stable media (logFlushResult). Both a COMMIT and any dirty-page flush block on Flush reaching the LSN they care about — this is the WAL rule and commit durability expressed as the same gate. (Watermarks from XLogCtlData; pipeline from XLogFlush/XLogWrite/issue_xlog_fsync.)

How the rest of the engine gates on the LSN

This subsystem earns the name “spine” through two gates the README states plainly.

Gate 1: the buffer manager (the WAL rule). “Before the bufmgr can write out a dirty page, it must ensure that xlog has been flushed to disk at least up to the page’s LSN.” The page’s LSN was stamped by PageSetLSN(dp, recptr) at insert time; the buffer manager reads it with PageGetLSN and calls XLogFlush up to that value before writing. The mechanism lives in postgres-buffer-manager.md; this doc owns the LSN it reads. Note the README’s caveat: this check exists only in the shared buffer manager, which is precisely why temp-table operations (local buffer manager) must not be WAL-logged.

Gate 2: commit (durability = a flush). A synchronous COMMIT is, at bottom, XLogFlush(commit_record_LSN). The commit record’s insertion and the wait for its durability are owned by postgres-xact.md; what this doc contributes is that “the transaction is durable” reduces to “the Flush watermark passed the commit LSN.” Under synchronous_commit = off (asynchronous commit) the backend skips the flush, records the LSN in shared asyncXactLSN, and returns; XLogBackgroundFlush, the walwriter’s periodic flush, makes it durable within a bounded number of wal_writer_delay cycles. The README is explicit that abort records are never force-flushed: after a crash, an absent commit record is read as an abort anyway.

The checkpointer (which advances RedoRecPtr, the FPI cutoff) and segment recycling are owned by postgres-checkpoint.md. Replaying all of these records on restart — the redo loop, consistency points, timeline handling — is owned by postgres-recovery-redo.md. This doc deliberately stops at the point a record is durable on disk.
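Because RedoRecPtr is the FPI cutoff and a checkpoint can advance it at any time, the insert path has to cope with a cutoff that moves between assembling a record and inserting it. The following is a minimal sketch of that interaction under simplifying assumptions: one page per record, no locking, arbitrary record sizes, and hypothetical `toy_*` names. It is modelled on the `page_lsn <= RedoRecPtr` test and the `XLogInsert` retry loop described in this document, not copied from them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;
#define TOY_INVALID_LSN ((Lsn) 0)

typedef struct ToyShared {
    Lsn redo_rec_ptr;   /* authoritative FPI cutoff, advanced by checkpoints */
    Lsn insert_pos;     /* end of the last inserted record */
} ToyShared;

typedef struct ToyRecord {
    bool     has_image; /* full-page image attached? */
    uint32_t len;
} ToyRecord;

/* First change since the redo point: page LSN is at or before the cutoff. */
static bool toy_needs_backup(Lsn page_lsn, Lsn redo_rec_ptr)
{
    return page_lsn <= redo_rec_ptr;
}

/* Assemble using the backend's cached (possibly stale) copy of the cutoff. */
static ToyRecord toy_assemble(Lsn page_lsn, Lsn cached_redo)
{
    ToyRecord r;

    r.has_image = toy_needs_backup(page_lsn, cached_redo);
    r.len = r.has_image ? 8192u + 64u : 64u;   /* arbitrary toy sizes */
    return r;
}

/* Returns the end LSN, or TOY_INVALID_LSN if the record must be re-assembled. */
static Lsn toy_insert_record(ToyShared *sh, const ToyRecord *r,
                             Lsn page_lsn, Lsn *cached_redo)
{
    if (*cached_redo != sh->redo_rec_ptr) {
        *cached_redo = sh->redo_rec_ptr;        /* refresh the stale copy */
        if (!r->has_image && toy_needs_backup(page_lsn, *cached_redo))
            return TOY_INVALID_LSN;             /* an image is now required */
    }
    sh->insert_pos += r->len;
    return sh->insert_pos;
}

/* The assemble-then-insert retry loop. */
static Lsn toy_xlog_insert(ToyShared *sh, Lsn page_lsn,
                           Lsn *cached_redo, int *attempts)
{
    Lsn end;

    *attempts = 0;
    do {
        ToyRecord r = toy_assemble(page_lsn, *cached_redo);

        end = toy_insert_record(sh, &r, page_lsn, cached_redo);
        (*attempts)++;
    } while (end == TOY_INVALID_LSN);
    assert(end != TOY_INVALID_LSN);
    return end;
}
```

With a stale cached cutoff of 100, a shared cutoff of 500, and a page LSN of 300, the first attempt is rejected and the second carries the image, so the loop runs twice.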

Symbols grouped by sub-system. Files are under /data/hgryoo/references/postgres/.

  • XLogRecPtr (in xlogdefs.h) — the LSN; a uint64 byte offset into the logical log. InvalidXLogRecPtr is 0; LSN_FORMAT_ARGS splits it into the %X/%X print form.
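  • XLogRecord (in xlogrecord.h) — the fixed record header: xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, xl_crc. SizeOfXLogRecord is the header’s size up to and including xl_crc (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c)), so it excludes any trailing struct padding.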
  • XLogRecordBlockHeader / XLogRecordBlockImageHeader / XLogRecordBlockCompressHeader (in xlogrecord.h) — per-block reference framing and the full-page-image sub-headers. Flags BKPBLOCK_HAS_IMAGE, BKPBLOCK_WILL_INIT, BKPBLOCK_SAME_REL; image flags BKPIMAGE_APPLY, BKPIMAGE_HAS_HOLE, BKPIMAGE_COMPRESS_*.
  • XLR_MAX_BLOCK_ID (32) and the reserved block ids XLR_BLOCK_ID_DATA_SHORT/LONG/ORIGIN/TOPLEVEL_XID (in xlogrecord.h).
  • XLogRecordMaxSize (in xlogrecord.h) — the ~1 GB ceiling on one record, bounded by what the reader can allocate in a single chunk.
  • XLogPageHeaderData / XLogLongPageHeaderData (in xlog_internal.h) — the header at the start of every WAL page (XLOG_BLCKSZ, 8 KB by default): xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len. The long variant sits on the first page of each segment. XLOG_PAGE_MAGIC is the version stamp.
  • XLByteToSeg / XLogSegmentOffset / XLogSegNoOffsetToRecPtr (in xlog_internal.h) — LSN↔segment arithmetic.
  • XLogFileName / XLogFilePath / IsXLogFileName (in xlog_internal.h) — the %08X%08X%08X segment file naming.
  • WalSegMinSize / WalSegMaxSize / DEFAULT_XLOG_SEG_SIZE (the last in pg_config_manual.h) — segment size bounds (1 MB–1 GB) and the 16-MB default.
  • XLogRecData (in xlog_internal.h) — the (next, data, len) fragment list the assembler builds.
  • XLogBeginInsert — start a record; resets the registration workspace.
  • XLogRegisterBuffer / XLogRegisterBlock — declare a modified page (in or out of the shared pool); the assembler decides on its FPI.
  • XLogRegisterData / XLogRegisterBufData — append main data / per-block data.
  • XLogEnsureRecordSpace — raise the default limits (5 block refs, 20 data chunks) for unusually large records.
  • XLogSetRecordFlags — set XLOG_MARK_UNIMPORTANT and friends.
  • XLogInsert — the public entry; the assemble→insert retry loop.
  • XLogRecordAssemble — packs header + block refs + data into the XLogRecData chain, makes per-block FPI decisions (hole removal, compression), and computes the body CRC.
  • XLogInsertRecord — the two-step insert (reserve, copy) under the insertion lock; the FPI-recheck retry; the WalInsertClass branch for XLOG_SWITCH / checkpoint-redo.
  • ReserveXLogInsertLocation / ReserveXLogSwitch — the spinlock-guarded CurrBytePos bump that assigns the LSN range and xl_prev.
  • XLogBytePosToRecPtr / XLogBytePosToEndRecPtr / XLogRecPtrToBytePos — the usable-byte-position ↔ XLogRecPtr conversions that add/remove page headers, run outside the spinlock.
  • CopyXLogRecordToWAL — copies the chain into XLogCtl->pages.
  • WALInsertLockAcquire / WALInsertLockAcquireExclusive / WALInsertLockRelease / WALInsertLockUpdateInsertingAt — the 8 insertion locks and the all-locks path.
  • WALInsertLock (struct) / NUM_XLOGINSERT_LOCKS (8) — the lock array with insertingAt and lastImportantAt.
  • XLogCtlInsert (struct) — CurrBytePos/PrevBytePos, insertpos_lck, RedoRecPtr, fullPageWrites, runningBackups.
  • XLogCtlData (struct) — the whole shared WAL state: the watermarks, pages, xlblocks, InsertTimeLineID.
  • XLogFlush — make the log durable up to a given LSN; the group-commit loop (LWLockAcquireOrWait, CommitDelay).
  • XLogWrite — pg_pwrite whole pages, coalesced; segment switching; the end-of-segment fsync + archiver notify + checkpoint trigger.
  • XLogBackgroundFlush — the walwriter’s periodic write/flush, including the async-commit catch-up.
  • issue_xlog_fsync — the fsync/fdatasync dispatch on wal_sync_method; PANIC on failure.
  • RefreshXLogWriteResult (macro) — barriered read of the Write/Flush watermarks.
  • WaitXLogInsertionsToFinish — wait for in-flight inserters to pass a target LSN, using each lock’s insertingAt.
  • GetXLogInsertRecPtr — current insert tip (reads CurrBytePos).
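The reservation and conversion entries in the list above (ReserveXLogInsertLocation and the XLogBytePosToRecPtr family) can be sketched together. This is a single-threaded toy under stated assumptions: the spinlock is omitted, the caller is assumed to pass an already aligned size, and the page-header sizes (24 bytes short, 40 bytes long) are assumed values typical of a 64-bit build rather than quantities taken from this document. All `toy_*` / `TOY_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ       8192u                  /* WAL page size */
#define TOY_SEG_SIZE     (16u * 1024u * 1024u)  /* WAL segment size */
#define TOY_SHORT_PHD    24u    /* assumed short page-header size */
#define TOY_LONG_PHD     40u    /* assumed long page-header size */
#define TOY_USABLE_PAGE  (TOY_BLCKSZ - TOY_SHORT_PHD)
#define TOY_USABLE_SEG   ((uint64_t) (TOY_SEG_SIZE / TOY_BLCKSZ) * TOY_USABLE_PAGE \
                          - (TOY_LONG_PHD - TOY_SHORT_PHD))

typedef struct ToyInsert {
    uint64_t curr_byte_pos;   /* next free usable byte */
    uint64_t prev_byte_pos;   /* start of the previously reserved record */
} ToyInsert;

/* The serialized step: a plain addition on usable byte positions. */
static void toy_reserve(ToyInsert *ins, uint32_t size,
                        uint64_t *start, uint64_t *end, uint64_t *prev)
{
    assert(size > 0);
    *start = ins->curr_byte_pos;
    *end = *start + size;
    *prev = ins->prev_byte_pos;
    ins->curr_byte_pos = *end;
    ins->prev_byte_pos = *start;
}

/* Outside the lock: map a usable byte position to an LSN by re-adding headers. */
static uint64_t toy_bytepos_to_lsn(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / TOY_USABLE_SEG;
    uint64_t left = bytepos % TOY_USABLE_SEG;
    uint64_t seg_offset;

    if (left < TOY_BLCKSZ - TOY_LONG_PHD) {
        /* Still on the first page of the segment, after the long header. */
        seg_offset = left + TOY_LONG_PHD;
    } else {
        /* Skip the first page, then count whole pages plus a short header. */
        left -= TOY_BLCKSZ - TOY_LONG_PHD;
        seg_offset = TOY_BLCKSZ
            + (left / TOY_USABLE_PAGE) * TOY_BLCKSZ
            + (left % TOY_USABLE_PAGE) + TOY_SHORT_PHD;
    }
    return fullsegs * TOY_SEG_SIZE + seg_offset;
}
```

Under these assumed header sizes, usable position 0 maps to offset 40 (just past the long header), and position 8152, the first usable byte of the second page, maps to 8192 + 24 = 8216. Keeping headers out of the counter is what lets the locked section stay a bare addition.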

Read side (xlogreader.c) — framing only; redo lives elsewhere

  • XLogReadRecord / XLogNextRecord — pull the next record from the reader’s decode queue.
  • DecodeXLogRecord — split a raw record back into header + block refs + data (the inverse of XLogRecordAssemble).
  • ValidXLogRecordHeader / ValidXLogRecord — validate the header (including the xl_prev back-link) and re-check the CRC-32C before any record is trusted.
  • XLogRecGetBlockTag / XLogRecGetBlockTagExtended / XLogRecGetData — the accessors a redo handler uses (documented for completeness; the redo side is postgres-recovery-redo.md).
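The check order attributed to ValidXLogRecord in this document (body first, then the header up to but excluding xl_crc) can be illustrated with a toy record. The header layout below is a hypothetical simplification, not XLogRecord itself, and the CRC routine is a plain bitwise CRC-32C written for clarity; PostgreSQL's own implementation is table- or hardware-based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy fixed header; crc is last so "header minus CRC" is a clean prefix. */
typedef struct ToyRecordHeader {
    uint32_t tot_len;   /* header + body length */
    uint32_t xid;
    uint64_t prev;      /* back-link to the previous record */
    uint32_t crc;       /* CRC-32C of body, then header up to this field */
} ToyRecordHeader;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(len == 0 || data != NULL);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Body first, then the header bytes that precede the crc field. */
static uint32_t toy_record_crc(const ToyRecordHeader *hdr,
                               const void *body, size_t body_len)
{
    uint32_t crc = 0xFFFFFFFFu;                              /* init */

    crc = toy_crc32c_update(crc, body, body_len);
    crc = toy_crc32c_update(crc, hdr, offsetof(ToyRecordHeader, crc));
    return crc ^ 0xFFFFFFFFu;                                /* finalize */
}

/* Reject the record unless the recomputed CRC matches the stored one. */
static bool toy_valid_record(const ToyRecordHeader *hdr,
                             const void *body, size_t body_len)
{
    return toy_record_crc(hdr, body, body_len) == hdr->crc;
}
```

A writer would fill in the header, store `toy_record_crc(...)` into `crc`, and a reader would call `toy_valid_record` before trusting any field of the body.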

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| XLogRecPtr (typedef) | src/include/access/xlogdefs.h | 21 |
| InvalidXLogRecPtr | src/include/access/xlogdefs.h | 28 |
| LSN_FORMAT_ARGS | src/include/access/xlogdefs.h | 44 |
| XLogRecord (struct) | src/include/access/xlogrecord.h | 41 |
| XLogRecordMaxSize | src/include/access/xlogrecord.h | 74 |
| XLogRecordBlockHeader (struct) | src/include/access/xlogrecord.h | 103 |
| XLR_MAX_BLOCK_ID | src/include/access/xlogrecord.h | 241 |
| XLOG_PAGE_MAGIC | src/include/access/xlog_internal.h | 34 |
| XLogPageHeaderData (struct) | src/include/access/xlog_internal.h | 36 |
| XLogLongPageHeaderData (struct) | src/include/access/xlog_internal.h | 61 |
| WalSegMinSize | src/include/access/xlog_internal.h | 88 |
| XLByteToSeg | src/include/access/xlog_internal.h | 117 |
| XLogFileName | src/include/access/xlog_internal.h | 166 |
| XLogRecData (struct) | src/include/access/xlog_internal.h | 312 |
| DEFAULT_XLOG_SEG_SIZE | src/include/pg_config_manual.h | 20 |
| NUM_XLOGINSERT_LOCKS | src/backend/access/transam/xlog.c | 151 |
| WALInsertLock (struct) | src/backend/access/transam/xlog.c | 368 |
| XLogCtlInsert (struct end) | src/backend/access/transam/xlog.c | 446 |
| XLogCtlData (struct end) | src/backend/access/transam/xlog.c | 555 |
| XLogRecPtrToBufIdx | src/backend/access/transam/xlog.c | 592 |
| UsableBytesInPage | src/backend/access/transam/xlog.c | 598 |
| XLogInsertRecord | src/backend/access/transam/xlog.c | 748 |
| ReserveXLogInsertLocation | src/backend/access/transam/xlog.c | 1111 |
| WALInsertLockAcquire | src/backend/access/transam/xlog.c | 1374 |
| WALInsertLockAcquireExclusive | src/backend/access/transam/xlog.c | 1419 |
| XLogBytePosToRecPtr | src/backend/access/transam/xlog.c | 1861 |
| XLogWrite | src/backend/access/transam/xlog.c | 2304 |
| XLogFlush | src/backend/access/transam/xlog.c | 2780 |
| XLogBackgroundFlush | src/backend/access/transam/xlog.c | 2968 |
| issue_xlog_fsync | src/backend/access/transam/xlog.c | 8744 |
| XLogBeginInsert | src/backend/access/transam/xloginsert.c | 149 |
| XLogRegisterBuffer | src/backend/access/transam/xloginsert.c | 242 |
| XLogRegisterData | src/backend/access/transam/xloginsert.c | 364 |
| XLogRegisterBufData | src/backend/access/transam/xloginsert.c | 405 |
| XLogInsert | src/backend/access/transam/xloginsert.c | 474 |
| XLogRecordAssemble | src/backend/access/transam/xloginsert.c | 548 |
| XLogReadRecord | src/backend/access/transam/xlogreader.c | 390 |
| ValidXLogRecord | src/backend/access/transam/xlogreader.c | 1204 |
| DecodeXLogRecord | src/backend/access/transam/xlogreader.c | 1682 |
| XLogRecGetBlockTagExtended | src/backend/access/transam/xlogreader.c | 2017 |

  • An LSN is literally a 64-bit byte offset into the logical log, and 0 means invalid. Verified in xlogdefs.h: typedef uint64 XLogRecPtr with #define InvalidXLogRecPtr 0. The header comment states the width is 64 bits “because we don’t want them ever to overflow” and that bootstrap skips the first segment so no record starts at 0.

  • There are exactly 8 WAL insertion locks at REL_18. #define NUM_XLOGINSERT_LOCKS 8 in xlog.c. This is a compile-time constant, not a GUC — changing it requires a source edit. Each lock is cache-line padded via the WALInsertLockPadded union.

  • The globally serialized part of insertion is only a spinlock-guarded integer bump. Verified in ReserveXLogInsertLocation: under Insert->insertpos_lck the function does startbytepos = Insert->CurrBytePos; endbytepos = startbytepos + size; Insert->CurrBytePos = endbytepos; and releases — the XLogBytePosToRecPtr conversions run after release. The reservation tracks usable byte positions (page headers excluded) so the bump is a plain addition.

  • XLogInsert retries the whole assemble-then-insert cycle when full-page-write state changes under it. The do { ... } while (EndPos == InvalidXLogRecPtr) loop in XLogInsert re-runs XLogRecordAssemble whenever XLogInsertRecord returns InvalidXLogRecPtr, which happens (in the WALINSERT_NORMAL branch) when a page now needs an image that the caller did not back up. Confirmed against both functions.

  • The first modification of a page after a checkpoint is logged as a full-page image. Verified in XLogRecordAssemble: needs_backup = (page_lsn <= RedoRecPtr), where RedoRecPtr is the latest checkpoint’s redo point. For standard pages the pd_lower..pd_upper hole is omitted; wal_compression then optionally compresses with PGLZ/LZ4/ZSTD.

  • A WAL fsync failure is a PANIC. Verified in issue_xlog_fsync: every sync-method branch sets msg on failure and the trailing if (msg) ereport(PANIC, ...) aborts the process, which forces a crash restart of the whole server rather than ending one backend. The function is a no-op when wal_sync_method is open_sync/open_datasync (the write() already synced) or when enableFsync is off.

  • The three durability watermarks are atomics ordered Insert ≥ Write ≥ Flush, maintained with explicit memory barriers. Verified in XLogCtlData (logInsertResult, logWriteResult, logFlushResult are pg_atomic_uint64) and in XLogWrite’s publish sequence (pg_atomic_write_u64(logWriteResult); pg_write_barrier(); pg_atomic_write_u64(logFlushResult)) plus the symmetric barriered read in RefreshXLogWriteResult.

  • Commit durability and the buffer-manager WAL rule are the same gate on the Flush watermark. Verified at the boundary: XLogFlush early-exits when record <= LogwrtResult.Flush, and the README states the bufmgr must flush “at least up to the page’s LSN” before writing a dirty page. Both reduce to Flush >= target-LSN. The bufmgr and commit mechanisms themselves are in postgres-buffer-manager.md / postgres-xact.md (intentionally not re-verified here).

  • Records are CRC-checked (CRC-32C) before they are trusted on read. Verified in ValidXLogRecord: it recomputes the CRC over the body then the header (excluding xl_crc) and rejects on mismatch. xl_prev is separately validated in ValidXLogRecordHeader.

  • The default WAL segment size is 16 MB, tunable from 1 MB to 1 GB. DEFAULT_XLOG_SEG_SIZE = 16*1024*1024 (pg_config_manual.h); WalSegMinSize/WalSegMaxSize and IsValidWalSegSize (power-of-two) bound it (xlog_internal.h).
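The segment-size bounds in the last item, and the LSN-to-segment arithmetic and file naming that depend on the chosen size, can be sketched as plain functions. This is a simplified illustration with hypothetical `toy_*` names; the real code uses macros in xlog_internal.h, and the function signatures here are assumptions made for readability.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_SEG_MIN_SIZE  (UINT64_C(1) << 20)    /* 1 MB */
#define TOY_SEG_MAX_SIZE  (UINT64_C(1) << 30)    /* 1 GB */
#define TOY_SEG_DEFAULT   (UINT64_C(16) << 20)   /* 16 MB */
#define TOY_FNAME_LEN     24                     /* three 8-digit hex fields */

/* Valid sizes are powers of two between the minimum and maximum. */
static bool toy_is_valid_seg_size(uint64_t size)
{
    bool pow2 = size > 0 && (size & (size - 1)) == 0;

    return pow2 && size >= TOY_SEG_MIN_SIZE && size <= TOY_SEG_MAX_SIZE;
}

/* Which segment does this LSN fall in? */
static uint64_t toy_lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
    assert(toy_is_valid_seg_size(seg_size));
    return lsn / seg_size;
}

/* Byte offset of this LSN inside its segment (power-of-two mask). */
static uint32_t toy_segment_offset(uint64_t lsn, uint64_t seg_size)
{
    assert(toy_is_valid_seg_size(seg_size));
    return (uint32_t) (lsn & (seg_size - 1));
}

/* Timeline, then the segment number split at the 4-GB "log id" boundary. */
static void toy_segment_file_name(char *buf, size_t buflen, uint32_t tli,
                                  uint64_t segno, uint64_t seg_size)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;

    assert(buflen > TOY_FNAME_LEN);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

With the 16-MB default, LSN 1/6B374D48 on timeline 1 falls in segment 0x16B at offset 0x374D48, and the file name comes out as 00000001000000010000006B. The power-of-two requirement is what makes the offset a simple mask.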

  1. Is NUM_XLOGINSERT_LOCKS = 8 still the right number on many-core hardware? It has been 8 for many releases while core counts grew. The scalable-lock-manager work (dbms-papers/scalable-lock-manager.md) targets the heavyweight lock table, not WAL insertion, so this is a distinct contention surface. Investigation path: benchmark insert throughput vs. lock count on a high-core box; check the pgsql-hackers archives for any proposal to make it a GUC.

  2. What exactly does CopyXLogRecordToWAL do when a record straddles a page boundary, and how does it interact with AdvanceXLInsertBuffer page initialization? This doc treats the copy as a black box and quotes only its call site. The xlp_rem_len / XLP_FIRST_IS_CONTRECORD continuation machinery and the WAL-buffer page-init path (AdvanceXLInsertBuffer, WALBufMappingLock) deserve their own walkthrough. Investigation path: read CopyXLogRecordToWAL (xlog.c ~1227) and AdvanceXLInsertBuffer in full.

  3. How much WAL volume is full-page writes in practice, and when is it safe to disable them? The mechanism is verified but the cost/benefit (FPI volume vs. torn-page risk on a given storage stack, and the interaction with wal_compression and checkpoint spacing) is a tuning question this doc does not answer. Investigation path: pg_waldump --stats on a representative workload; cross-reference postgres-checkpoint.md once it exists.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • ARIES (Mohan et al., TODS 1992) — the theory PostgreSQL implements. PostgreSQL adopts repeating-history redo and the PageLSN/WAL-rule machinery, but notably does not use ARIES-style undo via compensation log records for normal transaction rollback: MVCC means an aborted transaction’s tuples are simply left invisible and reclaimed by vacuum, so there is no page-level undo pass. A focused comparison of “ARIES undo vs. MVCC no-undo” would clarify exactly which ARIES pieces PostgreSQL keeps. (dbms-papers/aries.md.)

  • CUBRID’s WAL — CUBRID is also ARIES-based but keeps a classic physical/physiological log with undo: it writes undo and redo log records and performs an undo pass on rollback and recovery, because its storage is update-in-place rather than no-overwrite MVCC. Comparing CUBRID’s log_append/LOG_LSA to PostgreSQL’s XLogInsert/XLogRecPtr side-by-side would surface how the no-overwrite-heap decision propagates all the way down to whether the log needs undo at all. (See the cubrid recovery analysis in knowledge/code-analysis/cubrid/.)

  • Per-core / distributed logging — the single serialized log tip (CurrBytePos under one spinlock) is a known scalability ceiling. Research lines such as multi-stream WAL and per-core log partitions (e.g. the “Scalable Logging through Emerging Non-Volatile Memory” and Aether/ELEDA work on log buffer contention) explore removing the single tip. The natural follow-up is a note tracing whether PostgreSQL’s 8 insertion locks plus the byte-position trick already capture most of that benefit, and where they fall short.

  • Constant-time recovery / instant-restart — engines such as SQL Server’s Accelerated Database Recovery and Hekaton-style persistent logging decouple recovery time from log length. PostgreSQL’s recovery is still linear in WAL replayed since the last checkpoint; the design tension (checkpoint frequency vs. FPI volume vs. recovery time) is a frontier postgres-checkpoint.md and postgres-recovery-redo.md should pick up.

In-tree design docs:

  • src/backend/access/transam/README — “Write-Ahead Log Coding”, “Constructing a WAL record”, “Writing Hints”, “Asynchronous Commit”.

Source files (REL_18_STABLE, commit 273fe94):

  • src/backend/access/transam/xlog.c — insert/reserve/write/flush/fsync, shared WAL state, insertion locks.
  • src/backend/access/transam/xloginsert.c — record construction and assembly, full-page-image logic.
  • src/backend/access/transam/xlogreader.c — record decode and validation (read side).
  • src/include/access/xlogrecord.h — XLogRecord and block-reference formats.
  • src/include/access/xlog_internal.h — page/segment layout, file naming, XLogRecData, the rmgr method table.
  • src/include/access/xlogdefs.h — XLogRecPtr/LSN definitions.
  • src/include/pg_config_manual.h — DEFAULT_XLOG_SEG_SIZE.

Papers and textbooks:

  • knowledge/research/dbms-papers/aries.md — Mohan et al., ARIES (TODS 1992): the WAL/PageLSN/repeating-history model.
  • knowledge/research/dbms-papers/scalable-lock-manager.md — context for the lock-count open question (heavyweight locks, not WAL insertion).
  • Database Internals (Petrov), ch. 5 — operational vs. logical logs, physiological logging.
  • Database System Concepts (Silberschatz et al., 7e), ch. 19 — recovery, steal/no-force, immediate modification.

Cross-references (mechanism owned elsewhere — not duplicated here):

  • postgres-recovery-redo.md — the redo loop, consistency, timelines.
  • postgres-wal-records-rmgr.md — the resource-manager catalog and per-record formats.
  • postgres-xact.md — commit-record insertion and the commit-time flush.
  • postgres-checkpoint.md — RedoRecPtr advancement, segment recycling.
  • postgres-buffer-manager.md — the dirty-page flush that gates on PageLSN.
  • postgres-architecture-overview.md — Axis 3, the WAL durability spine.