Skip to content

PostgreSQL Logical Decoding — Reorder Buffer, Snapshot Builder, and Output Plugins

Contents:

Every durable relational engine writes a write-ahead log: before a data page is allowed to reach disk, the redo information that would recreate the change is forced to stable storage first. That log exists to serve crash recovery and physical replication — its records are phrased in the vocabulary of the storage engine (this byte range of this block of this relation file changed to these bytes), not in the vocabulary of the schema (this row of this table was updated from these column values to those). Recovery replays those records into the same block layout; it never needs to know that a record corresponds to “an UPDATE on accounts that set balance = 90”.

Change Data Capture (CDC) is the discipline of recovering exactly that logical meaning. Kleppmann’s Designing Data-Intensive Applications (ch. 11, “Stream Processing”) frames it as the problem of turning a database into a source of an event stream: “change data capture is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems” (§“Change Data Capture”). The canonical implementation strategy named there is to parse the replication log — to reuse the WAL the database already writes for durability, rather than bolt triggers onto every table. Kleppmann is explicit about the payoff and the cost: log-based CDC is “asynchronous” and keeps the downstream consumer “a log consumer” so the source database is not slowed by it, but it requires “parsing … the database’s internal storage format”, which couples the consumer to the engine’s on-disk encoding (§“Change Data Capture”, “Implementing change data capture”).

Three theoretical obligations fall out of “parse the replication log”, and they are exactly the three modules this document covers:

  1. Decode the physical encoding. The log speaks in blocks and byte ranges. Something must reverse-map a redo record back to “an insert of this tuple into this relation”. This is lossy in general — the WAL was never designed as a logical stream — so the engine must either log extra information or accept that some operations are undecodable.

  2. Reassemble transactions and impose commit order. A WAL is a physical interleaving: records from many concurrent transactions are written in the order their backends happened to flush, and a single logical transaction’s records may be split across many WAL positions and across subtransactions/savepoints. Kleppmann notes that a useful change stream is “in the same order as they were written” per row, but consumers overwhelmingly want transaction-atomic, commit-time-ordered delivery: a downstream replica must see all of a transaction’s changes together, only once the transaction commits, and in the order transactions committed — not the order their records hit the log. So the decoder must buffer per-transaction change streams and emit them only at the commit record, aborting buffered changes when an abort record arrives.

  3. Interpret tuples against a time-consistent catalog. To turn raw tuple bytes into named, typed columns the decoder needs the table definition as it was when the change was logged — not as it is now. If a column was added or a type altered after the WAL record was written, decoding the record against the current catalog produces garbage. This is the subtlest obligation: the decoder needs a historic view of the system catalogs, reconstructed for the exact log position being decoded, even though those catalog rows may since have been updated or vacuumed.

The first obligation is a parsing problem; the second is a buffering and ordering problem isomorphic to a k-way merge of sorted streams; the third is an MVCC visibility problem (see the companion postgres-mvcc-snapshots.md) specialized to catalog access. PostgreSQL names these three responsibilities decode.c, reorderbuffer.c, and snapbuild.c, and binds them to a consumer via an output plugin API (logical.c). The rest of this document reads those four files against the theory above.

Log-based CDC is now a standard feature across engines, and the design space has settled into a recognizable shape. Three recurring decisions dominate.

Where the logical information comes from. An engine can (a) make its existing physical log self-describing enough to decode, (b) write a separate logical log alongside the physical one, or (c) capture changes outside the log entirely (triggers, audit tables). MySQL’s binlog in ROW format is option (b): the binlog is a distinct, logical-by-construction log of before/after row images, separate from the InnoDB redo log that handles durability. Oracle GoldenGate and LogMiner take option (a), mining the physical redo/undo stream. Trigger-based capture (option c) is portable but pays a synchronous cost on every write and is the approach Kleppmann warns is “often fragile and has significant performance overhead”. PostgreSQL chose a hybrid of (a): one WAL serves durability and logical decoding, but only when wal_level = logical, which makes the engine log extra information (full old-tuple images for the replica identity, and command-id records for catalog changes) that physical recovery does not need.

How transactions are reassembled. Because any real log interleaves concurrent transactions, every decoder needs a buffering layer keyed by transaction id that accumulates changes and flushes them at commit. The universal challenges are (1) memory — a single transaction can be larger than RAM, forcing spill-to-disk or streaming; (2) subtransactions — savepoints create child transaction ids that must be spliced into their parent in LSN order; and (3) abort handling — buffered changes for a rolled-back transaction must be discarded without ever reaching the consumer. The merge itself is the classic replacement-selection / k-way merge from external sorting: each (sub)transaction contributes a stream already sorted by log position, and a priority queue (binary heap) keyed on the next record’s position yields a single globally LSN-ordered stream.

How the schema is recovered. This is where engines differ most. MySQL sidesteps it by embedding table metadata (the TABLE_MAP event) directly in the binlog ahead of each row event, so the consumer never consults a live catalog. Oracle LogMiner reconstructs DDL from a dictionary either mined from redo or snapshotted. PostgreSQL takes the most MVCC-native route: it reuses the engine’s own time-travel machinery to build a historic snapshot — a snapshot object that the normal catalog-scan code (syscache, relcache, heap visibility) accepts in place of a live MVCC snapshot, so that catalog lookups during decoding transparently see the catalog “as of” the WAL position being decoded. This means the decoder can call ordinary RelationIdGetRelation and get the historic table definition for free, at the cost of having to build that historic snapshot from the WAL — the job of the snapshot builder.

The output boundary. Finally, every decoder must decide what the consumer sees. A monolithic design bakes one wire format in; a plugin design exposes a callback interface and ships format decisions to loadable modules. PostgreSQL chose the plugin route: a small set of callbacks (begin, change, commit, plus optional truncate/message/streaming/two-phase variants) that the core invokes during replay, leaving the bytes-on-the-wire decision to the plugin (pgoutput for built-in logical replication; test_decoding, wal2json, and others for bespoke consumers).

PostgreSQL’s logical decoding pipeline is a four-stage assembly line driven by a WAL reader. A walsender (or the SQL SRF pg_logical_slot_get_changes) repeatedly reads the next WAL record and hands it to LogicalDecodingProcessRecord, which is the top of the pipeline.

flowchart TD
  WAL["WAL stream<br/>XLogReadRecord()"] --> LDPR["LogicalDecodingProcessRecord<br/>(decode.c)"]
  LDPR -->|"GetRmgr(rmid).rm_decode"| DISP{"per-rmgr<br/>decode handler"}
  DISP -->|"RM_HEAP_ID"| HEAP["heap_decode<br/>DecodeInsert/Update/Delete"]
  DISP -->|"RM_HEAP2_ID"| HEAP2["heap2_decode<br/>MultiInsert / NEW_CID"]
  DISP -->|"RM_XACT_ID"| XACT["xact_decode<br/>DecodeCommit/Abort/Prepare"]
  DISP -->|"RM_STANDBY_ID"| STBY["standby_decode<br/>RUNNING_XACTS"]
  HEAP -->|"ReorderBufferQueueChange"| RB["Reorder Buffer<br/>(reorderbuffer.c)"]
  HEAP2 -->|"SnapBuildProcessNewCid"| SB["Snapshot Builder<br/>(snapbuild.c)"]
  STBY -->|"SnapBuildProcessRunningXacts"| SB
  XACT -->|"SnapBuildCommitTxn"| SB
  XACT -->|"ReorderBufferCommit"| RB
  SB -.->|"historic catalog snapshot"| RB
  RB -->|"replay in commit order"| CB["Output-plugin callbacks<br/>begin / change / commit<br/>(logical.c wrappers)"]
  CB --> PLUGIN["Output plugin<br/>(pgoutput, test_decoding, …)"]

Stage 1 — rmgr dispatch (decode.c). Every WAL record carries a resource manager id (rmid). LogicalDecodingProcessRecord looks up the rmgr’s optional rm_decode callback and calls it; resource managers that are irrelevant to logical decoding (most of them) leave it NULL and the record’s xid is merely registered. The crucial invariant, stated in the file header, is that every record’s xid must pass through the reorder buffer even when the record carries no decodable change, because the reorder buffer is what tracks which transactions exist:

// LogicalDecodingProcessRecord — src/backend/replication/logical/decode.c
rmgr = GetRmgr(XLogRecGetRmid(record));
if (rmgr.rm_decode != NULL)
rmgr.rm_decode(ctx, &buf);
else
{
/* just deal with xid, and done */
ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
buf.origptr);
}

The per-rmgr handlers (heap_decode, heap2_decode, xact_decode, standby_decode, xlog_decode, logicalmsg_decode) share a common shape: register the xid, bail out early if no consistent snapshot exists yet, then switch on the record’s info bits. A heap insert becomes a ReorderBufferChange:

// DecodeInsert — src/backend/replication/logical/decode.c
change = ReorderBufferAllocChange(ctx->reorder);
if (!(xlrec->flags & XLH_INSERT_IS_SPECULATIVE))
change->action = REORDER_BUFFER_CHANGE_INSERT;
else
change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT;
change->origin_id = XLogRecGetOrigin(r);
memcpy(&change->data.tp.rlocator, &target_locator, sizeof(RelFileLocator));
tupledata = XLogRecGetBlockData(r, 0, &datalen);
tuplelen = datalen - SizeOfHeapHeader;
change->data.tp.newtuple = ReorderBufferAllocTupleBuf(ctx->reorder, tuplelen);
DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr,
change, xlrec->flags & XLH_INSERT_ON_TOAST_RELATION);

Note the two filters applied before any change is queued: the record’s target relation must belong to this slot’s database (target_locator.dbOid != ctx->slot->data.database → return), and the output plugin’s optional origin filter is consulted (FilterByOrigin) so that changes replayed from another node can be skipped to avoid replication loops.

Stage 2 — transaction reassembly (reorderbuffer.c). The reorder buffer is the heart of the system. It keeps a hash table of ReorderBufferTXN keyed by xid (fronted by a one-entry lookup cache for the extremely common “same xid as last time” case), and each TXN owns a doubly-linked list of ReorderBufferChange in LSN order. Subtransactions are tracked separately until their parent is known (ReorderBufferAssignChild links a child xid to its top-level parent, called from decode.c before the rmgr switch). Changes are held in purpose-built memory contexts — a SlabContext for fixed-size change and TXN structs, a GenerationContext for variable-length tuple data — and total memory is metered against logical_decoding_work_mem. When the limit is crossed, the largest transaction is evicted, either streamed to the plugin (if streaming is enabled and the txn is streamable) or spilled to a per-xid disk file:

// ReorderBufferCheckMemoryLimit — src/backend/replication/logical/reorderbuffer.c
while (rb->size >= logical_decoding_work_mem * (Size) 1024 || ...)
{
if (ReorderBufferCanStartStreaming(rb) &&
(txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL)
ReorderBufferStreamTXN(rb, txn); /* stream in-progress txn */
else
{
txn = ReorderBufferLargestTXN(rb); /* pick biggest (sub)txn */
ReorderBufferSerializeTXN(rb, txn);/* spill to disk */
}
}

When the commit record for a top-level transaction is decoded, xact_decodeDecodeCommitReorderBufferCommit triggers replay. Replay merges all of the transaction’s (and its subtransactions’) LSN-ordered change lists into one globally ordered stream using a binary heap keyed on each stream’s current change LSN — the k-way merge from the theory section:

flowchart LR
  T0["top-txn changes<br/>(LSN-sorted list)"] --> H[["binary heap<br/>keyed on current LSN<br/>ReorderBufferIterCompare"]]
  S1["subtxn A changes<br/>(LSN-sorted)"] --> H
  S2["subtxn B changes<br/>(LSN-sorted)"] --> H
  H -->|"ReorderBufferIterTXNNext<br/>pops smallest LSN"| OUT["single commit-ordered<br/>change stream"]
  OUT --> APPLY["ReorderBufferProcessTXN<br/>begin → change* → commit"]
  APPLY -->|"per change"| CBW["change_cb_wrapper<br/>(logical.c)"]

ReorderBufferProcessTXN sets up the historic snapshot, opens an internal transaction (so syscache/relcache lookups work and so any attempt to write the database is caught), fires the begin callback, iterates changes calling change/truncate callbacks, and finally fires commit. Aborts (DecodeAbortReorderBufferAbort) and cross-database commits (ReorderBufferForget) discard the buffered changes without invoking the change callback.

Stage 3 — historic catalog snapshot (snapbuild.c). To decode a tuple’s columns the reorder buffer needs the catalog as of the change’s LSN. The snapshot builder constructs a SNAPSHOT_HISTORIC_MVCC snapshot purely from the WAL. Its key economy: because most transactions never touch the catalog, it tracks only the committed catalog-modifying xids that fall in [xmin, xmax), treating those as “committed” and everything else as aborted/in-progress. It learns which xids modified the catalog from XLOG_HEAP2_NEW_CID records (which heapam emits only when wal_level = logical and a catalog row changes) and uses the same records to recover the cmin/cmax command ids that the in-tuple fields cannot supply during decoding. The builder cannot be ready instantly: it must observe enough of the WAL to know all in-progress transactions have ended, so it walks a four-state machine (detailed in the walkthrough) from START to CONSISTENT. The rmgr handlers refuse to decode data until SnapBuildCurrentState(builder) >= SNAPBUILD_FULL_SNAPSHOT:

// heap_decode — src/backend/replication/logical/decode.c
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
return;
switch (info)
{
case XLOG_HEAP_INSERT:
if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
!ctx->fast_forward)
DecodeInsert(ctx, buf);
break;
...
}

Stage 4 — output plugin (logical.c). logical.c loads the plugin named in the replication slot, validates that it registered the three mandatory callbacks, and exposes the OutputPluginPrepareWrite / OutputPluginWrite pair that plugins use to push bytes to the walsender. Every callback is invoked through a wrapper that pushes an error-context entry (so a plugin error reports the slot, plugin name, and LSN) and toggles ctx->accept_writes so the plugin can only emit output from inside begin/change/commit. The reorder buffer’s function pointers (rb->begin, rb->apply_change, rb->commit, …) are set to these wrappers at context startup, which is how stage 2’s replay reaches the plugin.

This section traces the four files by call flow, anchoring on stable symbol names. Line numbers live only in the position-hint table at the end.

4.1 — rmgr dispatch and record parsing (decode.c)

Section titled “4.1 — rmgr dispatch and record parsing (decode.c)”

The entry point LogicalDecodingProcessRecord is called once per WAL record. It first stashes the record’s start/end LSNs into a local XLogRecordBuffer, then — unconditionally, before the dispatch — assigns any subtransaction to its top-level parent if the record carries a top xid:

// LogicalDecodingProcessRecord — src/backend/replication/logical/decode.c
txid = XLogRecGetTopXid(record);
if (TransactionIdIsValid(txid))
ReorderBufferAssignChild(ctx->reorder, txid,
XLogRecGetXid(record), buf.origptr);
rmgr = GetRmgr(XLogRecGetRmid(record));
if (rmgr.rm_decode != NULL)
rmgr.rm_decode(ctx, &buf);

The rm_decode slot is populated from the resource-manager table; only RM_XLOG_ID, RM_XACT_ID, RM_STANDBY_ID, RM_HEAP2_ID, RM_HEAP_ID, and RM_LOGICALMSG_ID provide one. The handlers split into two roles:

  • Snapshot-feeding handlers. standby_decode forwards XLOG_RUNNING_XACTS to SnapBuildProcessRunningXacts (which drives the state machine and the slot’s catalog_xmin) and calls ReorderBufferAbortOld to forget transactions older than the record’s oldestRunningXid. xlog_decode recognizes shutdown / end-of-recovery checkpoints as SnapBuildSerializationPoints and enforces the wal_level >= logical guard on standbys. heap2_decode’s XLOG_HEAP2_NEW_CID case calls SnapBuildProcessNewCid to record catalog command-ids.

  • Change-producing handlers. heap_decode and heap2_decode’s data cases gate every change on SnapBuildProcessChange (which both answers “is this txn decodable yet?” and lazily hands the txn a base snapshot) and on !ctx->fast_forward, then call the DecodeInsert / DecodeUpdate / DecodeDelete / DecodeTruncate / DecodeMultiInsert / DecodeSpecConfirm parsers. Each parser builds a ReorderBufferChange, fills data.tp.rlocator and the new/old HeapTuples via DecodeXLogTuple, and queues it with ReorderBufferQueueChange.

DecodeXLogTuple is the low-level reverse of WAL tuple encoding: it copies the unaligned on-disk image into an aligned HeapTupleHeader, restoring t_infomask, t_infomask2, and t_hoff, and deliberately leaves t_tableOid = InvalidOid and an invalid t_self because those “can only be figured out after reassembling the transactions.”

Transaction-boundary records flow through xact_decode, which refuses to act until SNAPBUILD_FULL_SNAPSHOT and then branches on the xact op-mask. XLOG_XACT_COMMIT[_PREPARED]DecodeCommit, XLOG_XACT_ABORT[_PREPARED]DecodeAbort, XLOG_XACT_PREPAREDecodePrepare, and XLOG_XACT_INVALIDATIONS accumulates cache invalidations to be executed at commit time:

// xact_decode (XLOG_XACT_INVALIDATIONS) — decode.c
if (TransactionIdIsValid(xid))
{
if (!ctx->fast_forward)
ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
invals->nmsgs, invals->msgs);
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
}

DecodeCommit is where snapshot building and reassembly meet. It first tells the snapshot builder about the commit (SnapBuildCommitTxn), checks whether the transaction is interesting at all (DecodeTXNNeedSkip — wrong database, filtered origin, before the start LSN, or fast-forward), and if interesting calls ReorderBufferCommit (or ReorderBufferFinishPrepared for two-phase):

// DecodeCommit — src/backend/replication/logical/decode.c
SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
parsed->nsubxacts, parsed->subxacts, parsed->xinfo);
if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
{
for (i = 0; i < parsed->nsubxacts; i++)
ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
ReorderBufferForget(ctx->reorder, xid, buf->origptr);
return;
}
for (i = 0; i < parsed->nsubxacts; i++)
ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
buf->origptr, buf->endptr);
if (two_phase)
ReorderBufferFinishPrepared(ctx->reorder, xid, ...);
else
ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
commit_time, origin_id, origin_lsn);

4.2 — the reorder buffer (reorderbuffer.c)

Section titled “4.2 — the reorder buffer (reorderbuffer.c)”

ReorderBufferTXNByXid is the lookup primitive: it consults a one-entry cache (by_txn_last_xid / by_txn_last_txn), then the by_txn hash table, creating a new ReorderBufferTXN on demand and (for top-level txns) pushing it onto toplevel_by_lsn so that SnapBuildDistributeSnapshotAndInval can later walk all in-progress transactions in LSN order. Every queued change runs through it:

// ReorderBufferQueueChange — src/backend/replication/logical/reorderbuffer.c
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
if (rbtxn_is_aborted(txn))
{
ReorderBufferFreeChange(rb, change, false); /* known-aborted: drop */
return;
}
change->lsn = lsn;
change->txn = txn;
dlist_push_tail(&txn->changes, &change->node);
txn->nentries++;
txn->nentries_mem++;
ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
ReorderBufferChangeSize(change));
ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
ReorderBufferCheckMemoryLimit(rb); /* may spill or stream */

Two invariants are worth highlighting. First, memory accounting is incrementalReorderBufferChangeMemoryUpdate adjusts both the per-txn size and the buffer-wide rb->size, and also maintains a max-heap of transactions by size so ReorderBufferLargestTXN is O(1) to find. Second, the abort check up front means a transaction already discovered to be aborted (via CLOG check during streaming) never accumulates more changes.

The ReorderBufferTXN struct is the unit of reassembly. Its fields encode everything decode.c and snapbuild.c need to feed in incrementally and everything replay needs to drain in order:

// ReorderBufferTXN — src/include/replication/reorderbuffer.h
typedef struct ReorderBufferTXN
{
bits32 txn_flags; /* RBTXN_IS_PREPARED, _HAS_CATALOG_CHANGES, … */
TransactionId xid; /* top-level or sub xid */
TransactionId toplevel_xid; /* known once linked */
XLogRecPtr first_lsn; /* first data record for this xid */
XLogRecPtr final_lsn; /* commit/prepare/abort record LSN */
XLogRecPtr end_lsn; /* end of commit record + 1 */
struct ReorderBufferTXN *toptxn; /* NULL for top-level */
XLogRecPtr restart_decoding_lsn; /* where to resume to recover this txn */
Snapshot base_snapshot; /* historic snapshot attached at first change */
uint64 nentries; /* total changes (excl. subxacts) */
uint64 nentries_mem; /* of which still in memory (rest spilled) */
dlist_head changes; /* LSN-ordered change list */
dlist_head subtxns; /* child transactions */
...
};

The split between nentries and nentries_mem is what ReorderBufferIterTXNNext checks to decide whether a stream needs a disk reload, and base_snapshot is the per-transaction historic snapshot that SnapBuildProcessChange attaches lazily on the first decodable change. restart_decoding_lsn is the field that, aggregated across all in-progress transactions, becomes the slot’s restart_lsn — the point from which decoding can safely resume after a restart (see postgres-replication-slots.md).

Replay is the merge. ReorderBufferCommit resolves the TXN and delegates to ReorderBufferReplayReorderBufferProcessTXN, which:

  1. builds the ctid→(cmin,cmax) hash (ReorderBufferBuildTupleCidHash) and installs the historic snapshot (SetupHistoricSnapshot);
  2. opens an internal (sub)transaction so catalog access works and database writes are forbidden;
  3. fires begin/begin_prepare;
  4. iterates via the binary-heap merge, dispatching on change->action;
  5. fires commit.

The merge is set up by ReorderBufferIterTXNInit, which allocates one heap entry per (sub)transaction that has changes and primes each entry with the head change of that stream:

// ReorderBufferIterTXNInit — reorderbuffer.c
state->heap = binaryheap_allocate(state->nr_txns,
ReorderBufferIterCompare, state);
...
if (txn->nentries > 0)
{
if (rbtxn_is_serialized(txn)) /* spilled? restore first batch */
{
ReorderBufferSerializeTXN(rb, txn);
ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file,
&state->entries[off].segno);
}
cur_change = dlist_head_element(ReorderBufferChange, node, &txn->changes);
state->entries[off].lsn = cur_change->lsn;
state->entries[off].change = cur_change;
state->entries[off].txn = txn;
binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
}

ReorderBufferIterTXNNext pops the entry with the smallest LSN, advances that stream (loading the next on-disk batch with ReorderBufferRestoreChanges if the stream was spilled), and reinserts it with binaryheap_replace_first:

// ReorderBufferIterTXNNext — reorderbuffer.c
if (state->heap->bh_size == 0)
return NULL;
off = DatumGetInt32(binaryheap_first(state->heap));
entry = &state->entries[off];
change = entry->change;
if (dlist_has_next(&entry->txn->changes, &entry->change->node))
{
ReorderBufferChange *next_change = /* next in this stream */ ...;
state->entries[off].lsn = next_change->lsn;
state->entries[off].change = next_change;
binaryheap_replace_first(state->heap, Int32GetDatum(off));
return change;
}

The spill path, ReorderBufferSerializeTXN, writes the in-memory changes to a per-segment file under pg_replslot/<slot>/ and bumps rb->spillCount; the restore path reads them back in max_changes_in_memory-sized batches. Streaming (ReorderBufferStreamTXN) instead replays an in-progress transaction immediately through the streaming callbacks, used when the consumer opted into streaming = on.

TOAST reassembly. A second reassembly problem lives inside the reorder buffer. When a row has out-of-line (TOASTed) columns, the WAL records the TOAST chunks as separate heap inserts into the TOAST relation, emitted just before the main-table row record. During decoding those chunk inserts arrive as ordinary INSERT changes on the TOAST relation, but the consumer must see the reassembled datum, not the chunks. reorderbuffer.c solves this with a per-txn hash of ReorderBufferToastEnt (declared in the file): ReorderBufferToastAppendChunk accumulates chunks keyed by TOAST value id as they are decoded, and when the main row’s change is processed, ReorderBufferToastReplace substitutes the reassembled value into the tuple before the change callback fires. ReorderBufferToastReset clears the hash afterward — which is exactly why DecodeInsert/DecodeMultiInsert thread a clear_toast_afterwards flag through each change: it marks the last row of a multi-insert batch so the TOAST state is reset only after the final row that could reference the pending chunks. The file header is explicit that “within a single toplevel transaction there will be no other data carrying records between a row’s toast chunks and the row data itself,” which is the invariant that makes this single-pass reassembly correct.

4.3 — the snapshot builder (snapbuild.c)

Section titled “4.3 — the snapshot builder (snapbuild.c)”

The builder is a small state machine over SnapBuild->state. The states (from snapbuild.h) are SNAPBUILD_START = -1, SNAPBUILD_BUILDING_SNAPSHOT = 0, SNAPBUILD_FULL_SNAPSHOT = 1, SNAPBUILD_CONSISTENT = 2. Transitions are driven entirely by xl_running_xacts records arriving at SnapBuildProcessRunningXactsSnapBuildFindSnapshot:

flowchart TD
  START["SNAPBUILD_START<br/>(no snapshot)"] -->|"running_xacts with<br/>zero running xacts"| CONS["SNAPBUILD_CONSISTENT<br/>can decode everything after"]
  START -->|"running_xacts with<br/>running xacts: record nextXid"| BUILD["SNAPBUILD_BUILDING_SNAPSHOT"]
  BUILD -->|"all xacts running at #1<br/>have finished"| FULL["SNAPBUILD_FULL_SNAPSHOT<br/>txns starting now are decodable"]
  FULL -->|"all xacts running at #2<br/>have finished"| CONS
  START -.->|"on-disk serialized<br/>snapshot found"| CONS

SnapBuildFindSnapshot encodes the three ways to reach a snapshot named in its own comment: (a) jump straight to CONSISTENT if the running_xacts record shows no transactions were running (running->oldestRunningXid == running->nextXid); (b) restore a serialized snapshot from disk; or (c) incrementally build, recording next_phase_at = running->nextXid at each phase boundary:

// SnapBuildFindSnapshot (case a) — src/backend/replication/logical/snapbuild.c
if (running->oldestRunningXid == running->nextXid)
{
builder->xmin = running->nextXid; /* < are finished */
builder->xmax = running->nextXid; /* >= are running */
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
ereport(LOG, (errmsg("logical decoding found consistent point at %X/%X",
LSN_FORMAT_ARGS(lsn)),
errdetail("There are no running transactions.")));
return false;
}

Two callbacks maintain the catalog-modifying-xid set. SnapBuildProcessNewCid, fired for each XLOG_HEAP2_NEW_CID, marks the txn as catalog-changing (ReorderBufferXidSetCatalogChanges), feeds the ctid→(cmin,cmax) mapping into the reorder buffer (ReorderBufferAddNewTupleCids), and advances the txn’s command id (ReorderBufferAddNewCommandId). SnapBuildCommitTxn, fired from DecodeCommit, decides whether the just-committed txn needs to be remembered as a committed catalog-changer (and whether a new snapshot must be distributed to in-progress txns):

// SnapBuildCommitTxn — src/backend/replication/logical/snapbuild.c
for (nxact = 0; nxact < nsubxacts; nxact++)
{
TransactionId subxid = subxacts[nxact];
if (SnapBuildXidHasCatalogChanges(builder, subxid, xinfo))
{
sub_needs_timetravel = true;
needs_snapshot = true;
SnapBuildAddCommittedTxn(builder, subxid); /* record as committed */
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
}
else if (needs_timetravel)
SnapBuildAddCommittedTxn(builder, subxid);
}
if (SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
{
needs_snapshot = true;
needs_timetravel = true;
SnapBuildAddCommittedTxn(builder, xid);
}

The snapshot object itself is materialized by SnapBuildBuildSnapshot, which reuses SnapshotData but reinterprets its arrays: xip holds the (sorted, bsearch-able) committed catalog-modifying xids rather than the in-progress xids of a normal MVCC snapshot, and subxip is filled only when the snapshot is copied into a catalog-modifying transaction’s context. The resulting snapshot is tagged SNAPSHOT_HISTORIC_MVCC so a dedicated visibility routine (HeapTupleSatisfiesHistoricMVCC, in heapam_visibility.c) handles it. SnapBuildProcessChange is what binds the snapshot to a transaction: when a decodable change arrives for a txn that has no base snapshot yet, it builds (or reuses) the current snapshot and attaches it via ReorderBufferSetBaseSnapshot.

CreateInitDecodingContext / CreateDecodingContext call StartupDecodingContext, which allocates the LogicalDecodingContext, its ReorderBuffer, and its SnapBuild, then LoadOutputPlugins the slot’s named plugin and validates the mandatory callbacks:

// LoadOutputPlugin — src/backend/replication/logical/logical.c
plugin_init = (LogicalOutputPluginInit)
load_external_function(plugin, "_PG_output_plugin_init", false, NULL);
plugin_init(callbacks); /* plugin fills the OutputPluginCallbacks struct */
if (callbacks->begin_cb == NULL)
elog(ERROR, "output plugins have to register a begin callback");
if (callbacks->change_cb == NULL)
elog(ERROR, "output plugins have to register a change callback");
if (callbacks->commit_cb == NULL)
elog(ERROR, "output plugins have to register a commit callback");

The reorder buffer never calls plugin callbacks directly; it calls the *_cb_wrapper functions installed on rb. Each wrapper pushes an ErrorContextCallback so failures are annotated with slot/plugin/LSN, sets ctx->accept_writes and ctx->write_xid/write_location, then calls the real callback:

// change_cb_wrapper — src/backend/replication/logical/logical.c
LogicalDecodingContext *ctx = cache->private_data;
...
errcallback.callback = output_plugin_error_callback;
error_context_stack = &errcallback;
ctx->accept_writes = true;
ctx->write_xid = txn->xid;
ctx->write_location = change->lsn;
ctx->end_xact = false;
ctx->callbacks.change_cb(ctx, txn, relation, change);
error_context_stack = errcallback.previous;

Inside a callback the plugin emits bytes by bracketing them with OutputPluginPrepareWrite / OutputPluginWrite, which route through the context’s prepare_write / write function pointers (set by the walsender to the COPY-protocol writers). accept_writes guards against a plugin trying to emit output outside a begin/change/commit callback.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
LogicalDecodingProcessRecordsrc/backend/replication/logical/decode.c88
xlog_decodesrc/backend/replication/logical/decode.c129
xact_decodesrc/backend/replication/logical/decode.c201
standby_decodesrc/backend/replication/logical/decode.c359
heap2_decodesrc/backend/replication/logical/decode.c405
heap_decodesrc/backend/replication/logical/decode.c469
logicalmsg_decodesrc/backend/replication/logical/decode.c586
DecodeCommitsrc/backend/replication/logical/decode.c667
DecodeInsertsrc/backend/replication/logical/decode.c894
DecodeXLogTuplesrc/backend/replication/logical/decode.c1252
DecodeTXNNeedSkipsrc/backend/replication/logical/decode.c1296
logical_decoding_work_mem (GUC)src/backend/replication/logical/reorderbuffer.c225
ReorderBufferAllocChangesrc/backend/replication/logical/reorderbuffer.c506
ReorderBufferTXNByXidsrc/backend/replication/logical/reorderbuffer.c652
ReorderBufferQueueChangesrc/backend/replication/logical/reorderbuffer.c809
ReorderBufferAssignChildsrc/backend/replication/logical/reorderbuffer.c1098
ReorderBufferIterComparesrc/backend/replication/logical/reorderbuffer.c1260
ReorderBufferIterTXNInitsrc/backend/replication/logical/reorderbuffer.c1283
ReorderBufferIterTXNNextsrc/backend/replication/logical/reorderbuffer.c1411
ReorderBufferProcessTXNsrc/backend/replication/logical/reorderbuffer.c2210
ReorderBufferCommitsrc/backend/replication/logical/reorderbuffer.c2874
ReorderBufferProcessXidsrc/backend/replication/logical/reorderbuffer.c3279
ReorderBufferLargestTXNsrc/backend/replication/logical/reorderbuffer.c3792
ReorderBufferCheckMemoryLimitsrc/backend/replication/logical/reorderbuffer.c3883
ReorderBufferSerializeTXNsrc/backend/replication/logical/reorderbuffer.c3963
ReorderBufferStreamTXNsrc/backend/replication/logical/reorderbuffer.c4308
ReorderBufferRestoreChangessrc/backend/replication/logical/reorderbuffer.c4510
SnapBuildCurrentStatesrc/backend/replication/logical/snapbuild.c277
SnapBuildXactNeedsSkipsrc/backend/replication/logical/snapbuild.c304
SnapBuildBuildSnapshotsrc/backend/replication/logical/snapbuild.c360
SnapBuildGetOrBuildSnapshotsrc/backend/replication/logical/snapbuild.c579
SnapBuildProcessChangesrc/backend/replication/logical/snapbuild.c639
SnapBuildProcessNewCidsrc/backend/replication/logical/snapbuild.c689
SnapBuildAddCommittedTxnsrc/backend/replication/logical/snapbuild.c829
SnapBuildCommitTxnsrc/backend/replication/logical/snapbuild.c940
SnapBuildProcessRunningXactssrc/backend/replication/logical/snapbuild.c1136
SnapBuildFindSnapshotsrc/backend/replication/logical/snapbuild.c1238
SnapBuildSerializesrc/backend/replication/logical/snapbuild.c1497
StartupDecodingContextsrc/backend/replication/logical/logical.c152
CreateInitDecodingContextsrc/backend/replication/logical/logical.c332
CreateDecodingContextsrc/backend/replication/logical/logical.c500
OutputPluginPrepareWritesrc/backend/replication/logical/logical.c694
OutputPluginWritesrc/backend/replication/logical/logical.c707
LoadOutputPluginsrc/backend/replication/logical/logical.c735
startup_cb_wrappersrc/backend/replication/logical/logical.c776
begin_cb_wrappersrc/backend/replication/logical/logical.c837
commit_cb_wrappersrc/backend/replication/logical/logical.c868
change_cb_wrappersrc/backend/replication/logical/logical.c1088
OutputPluginCallbacks (struct)src/include/replication/output_plugin.h216
SnapBuild->state enumsrc/include/replication/snapbuild.h22

All claims in this document were checked against the REL_18_STABLE working tree at commit 273fe94.

  • rmgr dispatch. LogicalDecodingProcessRecord calls rmgr.rm_decode when non-NULL and otherwise only ReorderBufferProcessXid — verified in decode.c. The six logical rmgr handlers (xlog_decode, xact_decode, standby_decode, heap2_decode, heap_decode, logicalmsg_decode) all exist with the signatures quoted.
  • Snapshot gating. Both heap_decode and heap2_decode return early when SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT, and gate each data change on SnapBuildProcessChange(...) && !ctx->fast_forward — verified verbatim.
  • Change action enum. REORDER_BUFFER_CHANGE_INSERT, _UPDATE, _DELETE, _MESSAGE, _TRUNCATE, and the internal speculative variants are members of enum ReorderBufferChangeType in reorderbuffer.h — verified.
  • Memory limit / eviction. ReorderBufferCheckMemoryLimit loops while rb->size >= logical_decoding_work_mem * 1024, preferring ReorderBufferStreamTXN on the largest streamable top-txn and falling back to ReorderBufferSerializeTXN on the largest (sub)txn — verified.
  • k-way merge. The binary heap is allocated in ReorderBufferIterTXNInit with comparator ReorderBufferIterCompare, and ReorderBufferIterTXNNext pops binaryheap_first and reinserts with binaryheap_replace_first — verified.
  • State machine. The four states and their -1/0/1/2 values are defined in snapbuild.h; the START → CONSISTENT shortcut on zero running xacts and the next_phase_at = running->nextXid phase-boundary recording are in SnapBuildFindSnapshot — verified.
  • Historic snapshot reinterpretation. SnapBuildBuildSnapshot sets snapshot->snapshot_type = SNAPSHOT_HISTORIC_MVCC and fills snapshot->xip from builder->committed.xip, qsorting it for bsearch — verified.
  • Output plugin contract. LoadOutputPlugin requires _PG_output_plugin_init and enforces non-NULL begin_cb, change_cb, commit_cb; the *_cb_wrapper functions install error contexts and toggle accept_writes — verified.
  • REL_18 sanity. No PG-19-only symbols are asserted; no XLOG2 rmgr or B_DATACHECKSUMSWORKER_* references appear. The two-phase decode path (DecodePrepare, ReorderBufferFinishPrepared) and streaming path (ReorderBufferStreamTXN) are present as described.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Where the three modules sit in the bigger system. Logical decoding is only the production half of CDC. The decoded stream is shipped to a consumer by the walsender (postgres-wal-sender-receiver.md) over the streaming replication protocol, the consumer is anchored by a logical replication slot that pins restart_lsn and catalog_xmin so the WAL and the catalog rows the snapshot builder needs are not removed (postgres-replication-slots.md), and the built-in consumer encodes changes with the pgoutput plugin (postgres-pgoutput.md). The snapshot builder’s catalog_xmin feedback is the mechanism that couples decoding back into vacuum: a slow or abandoned logical slot holds back catalog vacuuming cluster-wide, which is the standard operational hazard of the design.

Comparison with MySQL row-based binlog. MySQL’s binlog is a purpose-built logical log, so it has no analogue of snapbuild.c: each row event is preceded by a TABLE_MAP event carrying the column layout, making the stream self-describing and decoding stateless with respect to the catalog. The cost is a second log (binlog + InnoDB redo) and the durability/ordering coordination between them (the infamous two-phase commit between the binlog and the storage engine). PostgreSQL’s single-log design avoids that coordination but pays for it with the snapshot builder’s complexity and the catalog_xmin retention pressure. PostgreSQL also has no equivalent of binlog’s STATEMENT format — logical decoding is fundamentally row-level, because the WAL records it mines are row-level.

Comparison with Oracle GoldenGate / LogMiner. Oracle mines the redo and undo streams and reconstructs both the change and a dictionary view of the schema, optionally from a snapshotted data dictionary. The architectural kinship with PostgreSQL is strong — both reverse-engineer a physical log — but Oracle’s mature offering layers conflict detection, heterogeneous targets, and DDL replication on top. PostgreSQL deliberately stops at the plugin boundary and lets the ecosystem (Debezium over pgoutput, wal2json, etc.) build those layers.

Research and evolution frontiers.

  • Streaming of in-progress transactions (PG 14+, the ReorderBufferStreamTXN path) attacks the core latency weakness of commit-time decoding: a long transaction need not be fully buffered before any of it reaches the consumer. The open trade-off is that the consumer must now handle speculative changes that may later abort — the SetupCheckXidLive / concurrent-abort machinery in ReorderBufferProcessTXN exists precisely to detect aborts while a transaction is being streamed.

  • Eviction policy. The header comment in reorderbuffer.c is candid that the current “evict the largest transaction” policy is “very simple” and that an age-aware (LSN-aware) policy could free memory more effectively with the generational allocator. This is a live tuning frontier; logical_decoding_work_mem is the only knob today.

  • Failover slots and slot synchronization (slotsync.c, PG 17+) let a logical slot survive a physical failover by synchronizing restart_lsn/catalog_xmin to a standby — addressing the historical complaint that logical replication could not survive an HA failover of the publisher.

  • Decoding on standbys (PG 16+) lets a physical standby host logical slots, guarded by the wal_level >= logical on the primary check in xlog_decode’s XLOG_PARAMETER_CHANGE case quoted above. This shifts decoding load off the primary, a direction the streaming-systems literature (Kleppmann ch. 11, on “deriving” multiple consumers from one log) treats as the natural end state: one durable log, many independently-paced derived consumers.

  • DDL replication. The conspicuous gap remains: logical decoding replicates DML and TRUNCATE but not arbitrary DDL. The historic snapshot machinery exists to decode against schema changes, but the schema changes themselves are not emitted as logical events; the XLOG_XACT_INVALIDATIONS handling shows the infrastructure is aware of catalog churn even though it does not surface it to the plugin. Closing this gap (event triggers feeding logical messages, or native DDL decoding) is a recurring roadmap item.

  • Source tree — PostgreSQL REL_18_STABLE at commit 273fe94 (/data/hgryoo/references/postgres):
    • src/backend/replication/logical/decode.c — rmgr dispatch and record parsers.
    • src/backend/replication/logical/reorderbuffer.c — transaction reassembly, memory management, spill/stream, k-way merge replay.
    • src/backend/replication/logical/snapbuild.c — historic catalog snapshot construction and state machine.
    • src/backend/replication/logical/logical.c — decoding context, output-plugin loading, callback wrappers.
    • src/include/replication/reorderbuffer.h, src/include/replication/snapbuild.h, src/include/replication/output_plugin.h — struct/enum/callback definitions.
  • Textbook anchor — Martin Kleppmann, Designing Data-Intensive Applications, ch. 11 “Stream Processing”, §“Change Data Capture” (log-based CDC, parsing the replication log, deriving consumers from a single durable log).
  • Sibling docs (cross-reference, not duplicated here)postgres-wal-sender-receiver.md (the transport that ships the decoded stream), postgres-replication-slots.md (restart_lsn / catalog_xmin retention), postgres-pgoutput.md (the built-in output plugin wire format), postgres-mvcc-snapshots.md (the MVCC visibility model the historic snapshot specializes), postgres-xlog-wal.md (the physical WAL records being decoded), postgres-xact.md (commit/abort/prepare record structure).