PostgreSQL WAL Sender/Receiver — Streaming Replication Transport
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Streaming replication is one answer to a question every database system must answer: how does a standby (replica) server stay current with the primary’s durable state? The design space has three broad families:
1. Log shipping: completed WAL segment files are copied to the standby after they fill. Recovery replays them in order. Simple, but the standby can be a full segment (default 16 MB) behind the primary at any moment, and failover forfeits any WAL not yet shipped.
2. Streaming replication: the standby connects to the primary over a persistent TCP link and receives WAL records in near-real time, as they are flushed. The standby can be within a few milliseconds of the primary, limited by network RTT and the standby’s apply rate.
3. Shared-storage / semi-synchronous: the primary and standby share a disk volume (SAN), or the primary waits for at least one standby to acknowledge WAL before returning from commit. These collapse the log-shipping gap to zero at the cost of coupling latency to the replication link.
PostgreSQL 9.0 introduced streaming replication as option 2. The
theoretical grounding is the same WAL durability spine described in
postgres-xlog-wal.md: an LSN is a byte offset into the append-only
WAL stream; a standby is simply a consumer of that stream from some
starting position. Every design choice in walsender/walreceiver flows
from that observation.
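To make the "LSN is a byte offset" observation concrete, here is a minimal sketch of the arithmetic a consumer of the stream performs. The names `lsn_t`, `lsn_segment_number`, `lsn_segment_offset`, `lsn_format`, and `lsn_lag_bytes` are hypothetical (PostgreSQL's own type is `XLogRecPtr`), and the 16 MB segment size is the default mentioned above, assumed here as a constant.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a WAL position: a byte offset into the stream. */
typedef uint64_t lsn_t;

/* Assumption: the default 16 MB WAL segment size. */
#define SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Which segment file holds this byte of the stream? */
static uint64_t lsn_segment_number(lsn_t lsn)
{
    return lsn / SEGMENT_SIZE;
}

/* Offset of this byte inside its segment file. */
static uint64_t lsn_segment_offset(lsn_t lsn)
{
    return lsn % SEGMENT_SIZE;
}

/* Render an LSN as two 32-bit halves in hexadecimal, separated by a slash. */
static void lsn_format(lsn_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Bytes a consumer at `standby` must still receive to reach `primary`. */
static uint64_t lsn_lag_bytes(lsn_t primary, lsn_t standby)
{
    return primary > standby ? primary - standby : 0;
}
```

Because positions are plain integers, replication lag in bytes is a subtraction and "where does streaming start" is a single number; whether the low half is zero-padded when printed is a display detail this sketch does not try to match.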
The two supplementary ideas that streaming replication adds on top of plain WAL replay are:
Replication slots. Without a slot, the primary does not know how far
behind a standby is, so it may recycle WAL segments the standby still
needs, leaving the standby unable to catch up by streaming alone. A
physical replication slot records the LSN up to which the standby has
confirmed flushing WAL (kept as the slot’s restart_lsn) and prevents WAL
reclaim below it. The slot mechanism is defined in postgres-replication-slots.md;
walsender acquires/releases a slot if one is named in START_REPLICATION.
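The retention rule can be sketched as a minimum over the positions that slots still need. This is an illustration under stated assumptions, not PostgreSQL's implementation: `retention_horizon` and `segment_is_recyclable` are hypothetical names, and the segment size is the 16 MB default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the default 16 MB WAL segment size. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * Oldest LSN any slot still needs.  With no slots nothing is pinned,
 * which this sketch represents as UINT64_MAX.
 */
static uint64_t retention_horizon(const uint64_t *slot_lsns, size_t nslots)
{
    uint64_t horizon = UINT64_MAX;

    for (size_t i = 0; i < nslots; i++)
        if (slot_lsns[i] < horizon)
            horizon = slot_lsns[i];
    return horizon;
}

/*
 * A segment may be recycled only if every byte in it lies below the
 * horizon, i.e. the segment ends at or before the horizon.
 */
static bool segment_is_recyclable(uint64_t segno, uint64_t horizon)
{
    return segno < horizon / SEG_SIZE;
}
```

The slowest slot decides the horizon, which is why an abandoned slot can pin an unbounded amount of WAL on the primary.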
Hot Standby feedback. A standby running queries under hot_standby = on may
have to cancel them when WAL replay removes rows that the primary has
vacuumed but the standby is still reading. Hot-standby feedback lets the
standby advertise its oldest active xmin back to the primary, which then
holds off removing those rows until the standby’s transactions finish.
This is a pure replication-protocol addition, invisible to the storage or
lock layers on the primary.
Database Internals (Petrov, ch. 11, “Replication and Consistency”) frames the streaming model as a two-node pipeline where the primary is the log producer and the standby is the log consumer; the transport layer’s job is to deliver each log record exactly once and in order, with durable acknowledgement flowing back upstream. Every design element of walsender/walreceiver is one piece of that pipeline.
The design space a streaming-replication implementer chooses within:
- Sync vs. async: does COMMIT wait for the standby to acknowledge? PostgreSQL supports both. With a synchronous standby configured, synchronous_commit = on waits for the standby’s flush acknowledgement and = remote_apply waits for apply; = remote_write waits only for the standby’s write, and = local or = off do not wait for the standby at all, trading durability for latency.
- Physical vs. logical: does the stream carry raw WAL bytes (same page layout as the primary) or decoded change records (row-level, usable across schema versions)? PostgreSQL supports both from the same walsender process; the transport is identical, only the data source differs (XLogSendPhysical vs. XLogSendLogical).
- Transport coupling: is the replication channel a dedicated process, a thread, or an in-process queue? PostgreSQL uses a dedicated walsender process per standby and a dedicated walreceiver process on the standby side, which fits the postmaster fork model.
Common DBMS Design
Almost every streaming-replication implementation converges on the same engineering conventions. Naming them here makes PostgreSQL’s specific symbols read as one set of choices within a shared playbook.
A dedicated process or thread per replica
Log shipping can be batched and handled by a file-copy daemon. Streaming cannot: the send loop must react to new WAL, to standby replies, and to keepalive timeouts at sub-second granularity. The universal pattern is a dedicated replication-sender process (or thread) per connected replica, so each sender’s event loop is its own context and a slow or disconnected replica does not block others.
A persistent connection with a handshake phase
The standby connects, authenticates, and runs a handshake exchange before streaming begins: identify the primary system, optionally negotiate a timeline, then declare the starting LSN. The handshake’s job is to catch mismatches early (wrong cluster, already-diverged timeline) rather than silently streaming garbage.
WAL streaming in COPY mode over the existing wire protocol
Re-using the database wire protocol for replication avoids a separate
port and authentication path. The practical choice is to multiplex
replication over the standard client connection: a START_REPLICATION
command puts the connection into streaming mode (PostgreSQL uses
CopyBothResponse, a bidirectional COPY sub-protocol), and raw WAL bytes
flow as CopyData messages. Keepalive and acknowledgement messages share
the same channel.
Acknowledgement with three LSN cursors
The standby’s progress is not a single number. Three ordered cursors advance at different rates:
- write — the standby has written WAL to its local disk buffer.
- flush — the standby has fsynced WAL to stable storage.
- apply — the standby’s startup process has replayed through this LSN.
The primary tracks all three per sender (in its WalSnd struct).
Synchronous replication waits on write, flush, or apply, depending on
whether synchronous_commit is remote_write, on, or remote_apply; all
three cursors also feed lag monitoring.
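The release decision for a waiting commit reduces to a comparison against one of the three cursors. The following is a minimal sketch; `standby_progress`, `wait_mode`, and `commit_is_acknowledged` are hypothetical names, and the write mode is included on the assumption that it corresponds to synchronous_commit = remote_write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;

/* Hypothetical mirror of the three cursors a standby reports. */
typedef struct
{
    lsn_t write;   /* written to the standby's OS, not yet fsynced */
    lsn_t flush;   /* fsynced on the standby */
    lsn_t apply;   /* replayed by the standby */
} standby_progress;

/* Which cursor a committing backend chooses to wait for. */
typedef enum { WAIT_WRITE, WAIT_FLUSH, WAIT_APPLY } wait_mode;

static lsn_t progress_for_mode(const standby_progress *p, wait_mode mode)
{
    switch (mode)
    {
        case WAIT_WRITE: return p->write;
        case WAIT_FLUSH: return p->flush;
        case WAIT_APPLY: return p->apply;
    }
    return 0;   /* unreachable for valid modes */
}

/* A commit at commit_lsn is released once the chosen cursor reaches it. */
static bool commit_is_acknowledged(const standby_progress *p,
                                   wait_mode mode, lsn_t commit_lsn)
{
    return progress_for_mode(p, mode) >= commit_lsn;
}
```

Since write >= flush >= apply normally holds on a standby, waiting on a later cursor is strictly stronger and strictly slower.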
Keepalives and timeouts in both directions
A streaming connection can be idle when the primary has no new WAL.
Silence is indistinguishable from a dead connection, so both sides send
periodic messages. The primary sends a keepalive with the
replyRequested flag set once it has heard nothing from the standby for
wal_sender_timeout / 2, and drops the connection if the full timeout
passes without a reply; the standby sends a status reply whenever it
flushes new WAL or when wal_receiver_status_interval elapses.
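The sender-side timing can be sketched as two threshold checks against a single timestamp. This is a simplified model, assuming the sender measures silence from the last reply it received; `sender_timer` and the three functions are hypothetical names, with `timeout_ms` playing the role of wal_sender_timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t millis_t;

/* Hypothetical per-connection timer state. */
typedef struct
{
    millis_t timeout_ms;        /* stands in for wal_sender_timeout; 0 disables */
    millis_t last_reply_ms;     /* when the peer was last heard from */
    bool     keepalive_pending; /* a ping was already sent for this silence */
} sender_timer;

/* Ping once, after half the timeout has passed without a reply. */
static bool should_send_keepalive(const sender_timer *t, millis_t now_ms)
{
    if (t->timeout_ms <= 0 || t->keepalive_pending)
        return false;
    return now_ms >= t->last_reply_ms + t->timeout_ms / 2;
}

/* Give up after the full timeout has passed without a reply. */
static bool connection_timed_out(const sender_timer *t, millis_t now_ms)
{
    if (t->timeout_ms <= 0)
        return false;
    return now_ms >= t->last_reply_ms + t->timeout_ms;
}

/* Any reply from the peer resets both thresholds. */
static void note_reply(sender_timer *t, millis_t now_ms)
{
    t->last_reply_ms = now_ms;
    t->keepalive_pending = false;
}
```

Pinging at the halfway point gives a live but idle peer half the timeout to answer before the connection is declared dead.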
Theory ↔ PostgreSQL mapping
| Concept | PostgreSQL name |
|---|---|
| Replication sender process | walsender (B_WAL_SENDER BackendType) |
| Replication receiver process | walreceiver (B_WAL_RECEIVER BackendType) |
| Handshake commands | IDENTIFY_SYSTEM, TIMELINE_HISTORY |
| Stream-start command | START_REPLICATION |
| Streaming sub-protocol | CopyBothResponse (bidirectional COPY) |
| WAL data message (primary→standby) | 'w' CopyData with (dataStart, walEnd, sendTime) header |
| Keepalive message (primary→standby) | 'k' CopyData with (walEnd, sendTime, replyRequested) |
| Acknowledgement (standby→primary) | 'r' CopyData with (write, flush, apply, sendTime, replyRequested) |
| Hot-standby feedback message | 'h' CopyData with (xmin, epoch, catalogXmin, ...) |
| Per-sender shared-memory slot | WalSnd struct in WalSndCtl->walsnds[] |
| Receiver shared-memory state | WalRcvData (global WalRcv) |
| Receiver state machine | WalRcvState enum (WALRCV_STOPPED … WALRCV_STOPPING) |
| Sender state machine | WalSndState enum (WALSNDSTATE_STARTUP … WALSNDSTATE_STOPPING) |
| Receiver flush pointer (shared memory) | WalRcv->flushedUpto |
| Replication slot (WAL retention) | MyReplicationSlot (see postgres-replication-slots.md) |
PostgreSQL’s Approach
The walsender process: a backend variant
A walsender starts life as a regular backend: the postmaster fork()s a
child to handle the incoming connection, authenticates it, and only then
determines — from the connection’s replication= parameter — that it
should become a walsender. The file header of walsender.c says it
plainly: “A walsender is similar to a regular backend, ie. there is a
one-to-one relationship between a connection and a walsender process, but
instead of processing SQL queries, it understands a small set of special
replication-mode commands.”
InitWalSender (called from PostgresMain after the replication decision)
marks the process as a walsender and claims a slot in the
WalSndCtl->walsnds[] array, which is sized max_wal_senders * sizeof(WalSnd):
```c
// WalSnd (struct) - src/include/replication/walsender_private.h
typedef struct WalSnd
{
    pid_t           pid;        /* this walsender's PID, or 0 if not active */
    WalSndState     state;      /* this walsender's state */
    XLogRecPtr      sentPtr;    /* WAL has been sent up to this point */
    bool            needreload; /* does currently-open file need reloading? */
    XLogRecPtr      write;      /* standby's write position */
    XLogRecPtr      flush;      /* standby's flush position */
    XLogRecPtr      apply;      /* standby's apply position */
    TimeOffset      writeLag;
    TimeOffset      flushLag;
    TimeOffset      applyLag;
    int             sync_standby_priority;
    slock_t         mutex;
    TimestampTz     replyTime;
    ReplicationKind kind;
} WalSnd;
```

The five state values span the life of a sender:

```c
// WalSndState - src/include/replication/walsender_private.h
typedef enum WalSndState
{
    WALSNDSTATE_STARTUP = 0,
    WALSNDSTATE_BACKUP,
    WALSNDSTATE_CATCHUP,
    WALSNDSTATE_STREAMING,
    WALSNDSTATE_STOPPING,
} WalSndState;
```

CATCHUP is the initial streaming state: the standby is behind the
primary. The transition to STREAMING fires when WalSndLoop observes
WalSndCaughtUp true and flips it via WalSndSetState.
The replication command language
The walsender’s main loop (PostgresMain in tcop) calls
exec_replication_command for every message it receives in replication
mode. That function parses the command with a dedicated
replication_scanner / replication_yyparse, then dispatches:
- IDENTIFY_SYSTEM: returns the cluster system identifier, timeline, flush LSN, and (for db-mode) database name. The standby uses this to verify it is connecting to the right primary.
- TIMELINE_HISTORY tli: returns the contents of the timeline history file for the requested timeline, so the standby can follow a switchover.
- START_REPLICATION: enters streaming mode; the entry point is StartReplication for physical streams and StartLogicalReplication for logical.

The commands form a small grammar (repl_gram.y, repl_scanner.l) that is
separate from the SQL parser; physical walsenders do not process SQL.
StartReplication and WalSndLoop
StartReplication resolves the timeline (current or historical), sends a
CopyBothResponse to put the connection in bidirectional COPY mode, sets
sentPtr to the requested start position, calls SyncRepInitConfig, and
then hands off to the main send loop:
```c
// StartReplication - src/backend/replication/walsender.c
WalSndSetState(WALSNDSTATE_CATCHUP);
/* ... send CopyBothResponse ... */
sentPtr = cmd->startpoint;
SpinLockAcquire(&MyWalSnd->mutex);
MyWalSnd->sentPtr = sentPtr;
SpinLockRelease(&MyWalSnd->mutex);
SyncRepInitConfig();
replication_active = true;
WalSndLoop(XLogSendPhysical);
```

WalSndLoop(send_data) is the inner steady-state loop. It iterates until
both sides exchange CopyDone:

```c
// WalSndLoop - src/backend/replication/walsender.c
for (;;)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
    /* handle config reload, check for replies */
    ProcessRepliesIfAny();

    if (streamingDoneReceiving && streamingDoneSending &&
        !pq_is_send_pending())
        break;

    if (!pq_is_send_pending())
        send_data();            /* XLogSendPhysical or XLogSendLogical */
    else
        WalSndCaughtUp = false;

    if (pq_flush_if_writable() != 0)
        WalSndShutdown();

    /* CATCHUP -> STREAMING transition when caught up */
    if (WalSndCaughtUp && !pq_is_send_pending())
    {
        if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
            WalSndSetState(WALSNDSTATE_STREAMING);
        if (got_SIGUSR2)
            WalSndDone(send_data);
    }

    WalSndCheckTimeOut();
    WalSndKeepaliveIfNecessary();

    /* block on latch when caught up or output buffer full */
    WalSndWait(...);
}
```

The loop’s structure is: process any standby replies, fill the output buffer with more WAL once it is empty, flush what can be written, and sleep on a latch when idle.
```mermaid
flowchart TB
    A["WalSndLoop iteration start<br/>ResetLatch"] --> B["ProcessRepliesIfAny<br/>(drain standby messages)"]
    B --> C{CopyDone both ways?}
    C -- yes --> EXIT["exit loop"]
    C -- no --> D{output buffer empty?}
    D -- yes --> E["send_data<br/>XLogSendPhysical or XLogSendLogical"]
    D -- no --> F["WalSndCaughtUp = false"]
    E --> G["pq_flush_if_writable"]
    F --> G
    G --> H{WalSndCaughtUp<br/>and buffer empty?}
    H -- yes --> I["if CATCHUP: WalSndSetState STREAMING<br/>if got_SIGUSR2: WalSndDone"]
    H -- no --> J["WalSndCheckTimeOut<br/>WalSndKeepaliveIfNecessary"]
    I --> J
    J --> K["WalSndWait on latch<br/>(sleep when idle)"]
    K --> A
```
Figure 1 — The steady-state WalSndLoop iteration. The loop drains
standby messages, sends WAL, flushes the output buffer, and sleeps on a
latch when the standby is caught up. The CATCHUP → STREAMING transition
fires here. (Flow from WalSndLoop in walsender.c.)
XLogSendPhysical: reading WAL and packaging it
XLogSendPhysical is the physical-stream data source. It decides how far
it can safely send (GetFlushRecPtr on a primary, GetStandbyFlushRecPtr
on a cascading standby), then reads WAL from disk via WALRead, at most
MAX_SEND_SIZE bytes per message (16 WAL blocks, 128 kB with the default
8 kB block size), wraps it in the streaming protocol header, and
enqueues it with pq_putmessage_noblock:
```c
// XLogSendPhysical (excerpt) - src/backend/replication/walsender.c
/* How far can we send? */
SendRqstPtr = GetFlushRecPtr(NULL);             /* primary path */
/* ... read WAL into output_message ... */
/* header: dataStart (int64), walEnd (int64), sendTime (int64) */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'w');
pq_sendint64(&output_message, sentPtr);         /* dataStart */
pq_sendint64(&output_message, SendRqstPtr);     /* walEnd */
pq_sendint64(&output_message, sendTime);
pq_sendbytes(&output_message, (char *) buf, nbytes);
pq_putmessage_noblock('d', ...);
```

The 'w' message type byte identifies WAL data; dataStart is the LSN
of the first byte in the payload; walEnd is the primary’s current flush
pointer (telling the standby “nothing new beyond this point yet”);
sendTime is used to measure replication lag.
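The header layout described above (one type byte followed by three 64-bit integers in network byte order) can be sketched without any PostgreSQL headers. The function and constant names below are hypothetical; only the field order and widths come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1 type byte + dataStart + walEnd + sendTime, 8 bytes each. */
#define WAL_DATA_HEADER_LEN 25

/* Store a 64-bit value in network (big-endian) byte order. */
static void put_be64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Read a 64-bit big-endian value back. */
static uint64_t get_be64(const uint8_t *src)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | src[i];
    return v;
}

/*
 * Write the 'w' header into out.  Returns the number of bytes written,
 * or 0 if the buffer is too small.  The WAL payload would follow.
 */
static size_t encode_wal_data_header(uint8_t *out, size_t outlen,
                                     uint64_t data_start, uint64_t wal_end,
                                     int64_t send_time)
{
    if (outlen < WAL_DATA_HEADER_LEN)
        return 0;
    out[0] = 'w';
    put_be64(out + 1, data_start);
    put_be64(out + 9, wal_end);
    put_be64(out + 17, (uint64_t) send_time);
    return WAL_DATA_HEADER_LEN;
}
```

A receiver can compute the end of the chunk as dataStart plus the payload length, and compare it with walEnd to see how much the primary still has to send.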
Standby reply processing
ProcessRepliesIfAny drains, without blocking, any messages pending on the
standby connection. Two message types matter:
Status update ('r'): the standby reports its write, flush, and
apply positions. ProcessStandbyReplyMessage unpacks them, updates the
WalSnd struct under the spinlock, and calls SyncRepReleaseWaiters if
this is not a cascading standby (so synchronous-commit waiters on the
primary can proceed):
```c
// ProcessStandbyReplyMessage - src/backend/replication/walsender.c
writePtr = pq_getmsgint64(&reply_message);
flushPtr = pq_getmsgint64(&reply_message);
applyPtr = pq_getmsgint64(&reply_message);
/* ... compute lag times ... */
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
walsnd->apply = applyPtr;
walsnd->writeLag = writeLag;
walsnd->flushLag = flushLag;
walsnd->applyLag = applyLag;
SpinLockRelease(&walsnd->mutex);

if (!am_cascading_walsender)
    SyncRepReleaseWaiters();
```

Hot-standby feedback ('h'): the standby reports its oldest active
xmin. ProcessStandbyHSFeedbackMessage calls
PhysicalReplicationSlotNewXmin (if a slot is active) or directly updates
MyProc->xmin, preventing the primary’s vacuum from removing rows the
standby might still need.
```mermaid
sequenceDiagram
    participant P as Primary (walsender)
    participant S as Standby (walreceiver)
    S->>P: START_REPLICATION LSN/TLI
    P->>S: CopyBothResponse
    loop steady-state streaming
        P->>S: 'w' WAL data (dataStart, walEnd, sendTime)
        S->>P: 'r' reply (write, flush, apply, time)
        P->>S: 'k' keepalive (walEnd, time, replyRequested)
        S->>P: 'r' reply (triggered by keepalive)
        Note over S: hot_standby_feedback on
        S->>P: 'h' HS feedback (xmin, catalogXmin)
    end
    S->>P: CopyDone
    P->>S: CopyDone
```
Figure 2 — The streaming protocol message exchange. The primary sends
'w' WAL messages and 'k' keepalives; the standby sends 'r' status
replies and 'h' hot-standby feedback. Both sides exchange CopyDone to
end the stream. (Protocol from the PostgreSQL documentation and
walsender.c / walreceiver.c.)
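The 'r' status reply in Figure 2 carries the fields listed in the mapping table: write, flush, apply, sendTime, and replyRequested. A minimal decoder, assuming each position and the timestamp are 8-byte big-endian integers and the flag is a single byte, might look like this. `status_reply` and `decode_status_reply` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 'r' + write + flush + apply + sendTime (8 bytes each) + 1 flag byte. */
#define STATUS_REPLY_LEN 34

/* Hypothetical decoded form of a standby status reply. */
typedef struct
{
    uint64_t write;
    uint64_t flush;
    uint64_t apply;
    int64_t  send_time;
    bool     reply_requested;
} status_reply;

/* Read a 64-bit big-endian value. */
static uint64_t read_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Returns false if the message has the wrong type byte or length. */
static bool decode_status_reply(const uint8_t *msg, size_t len,
                                status_reply *out)
{
    if (len != STATUS_REPLY_LEN || msg[0] != 'r')
        return false;
    out->write = read_be64(msg + 1);
    out->flush = read_be64(msg + 9);
    out->apply = read_be64(msg + 17);
    out->send_time = (int64_t) read_be64(msg + 25);
    out->reply_requested = msg[33] != 0;
    return true;
}
```

Rejecting a message of unexpected length up front keeps the field reads within bounds, which matters for anything parsed off a network socket.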
The walreceiver process
The walreceiver runs WalReceiverMain, launched by the postmaster in
response to PMSIGNAL_START_WALRECEIVER sent by the startup process when
WAL recovery has exhausted archive/local WAL and streaming is configured.
The startup process fills in WalRcvData->conninfo, ->slotname, and
->receiveStart before signalling the postmaster. WalReceiverMain reads
those values from the shared-memory WalRcvData struct, marks itself
WALRCV_STREAMING, then dynamically loads libpqwalreceiver to get the
actual libpq transport functions:
```c
// WalReceiverMain - src/backend/replication/walreceiver.c
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_STREAMING;
strlcpy(conninfo, walrcv->conninfo, MAXCONNINFO);
startpoint = walrcv->receiveStart;
startpointTLI = walrcv->receiveStartTLI;
SpinLockRelease(&walrcv->mutex);

load_file("libpqwalreceiver", false);
/* ... connect to primary ... */
wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);
```

After the IDENTIFY_SYSTEM handshake verifies the system identifier
matches, the loop calls walrcv_startstreaming (which issues
START_REPLICATION) and then receives messages in a walrcv_receive
loop, dispatching each to XLogWalRcvProcessMsg.
The WalRcvData shared structure is the receiver’s communication channel
with the startup process and with cascading walsenders:
```c
// WalRcvData (excerpt) - src/include/replication/walreceiver.h
typedef struct
{
    ProcNumber  procno;             /* walreceiver's proc number */
    pid_t       pid;
    WalRcvState walRcvState;
    ConditionVariable walRcvStoppedCV;
    XLogRecPtr  receiveStart;       /* where startup told us to begin */
    TimeLineID  receiveStartTLI;
    XLogRecPtr  flushedUpto;        /* last byte fsynced to pg_wal */
    TimeLineID  receivedTLI;
    XLogRecPtr  latestChunkStart;   /* previous flushedUpto before last flush */
    /* ... timing, conninfo, slotname ... */
    pg_atomic_uint64 writtenUpto;   /* written (not yet fsynced) */
    sig_atomic_t force_reply;
} WalRcvData;
```

flushedUpto is the canonical progress indicator: after each XLogWalRcvFlush
call the startup process wakes up and can advance WAL replay to that point.
writtenUpto is an atomic that advances ahead of flushedUpto — it lets a
cascading walsender serve WAL that has been written but not yet fsynced.
XLogWalRcvProcessMsg, XLogWalRcvWrite, XLogWalRcvFlush
XLogWalRcvProcessMsg dispatches on the one-byte message type:
- 'w' (WAL data): strips the three-int64 header, calls XLogWalRcvWrite with the payload bytes.
- 'k' (keepalive): reads walEnd and sendTime, optionally triggers an immediate reply.

XLogWalRcvWrite appends bytes to the current WAL segment file in
pg_wal, opening a new segment via XLogFileInit when the current one is
full, and updating the writtenUpto atomic:
```c
// XLogWalRcvWrite (excerpt) - src/backend/replication/walreceiver.c
while (nbytes > 0)
{
    if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
    {
        /* open/create new segment */
        XLByteToSeg(recptr, recvSegNo, wal_segment_size);
        recvFile = XLogFileInit(recvSegNo, tli);
    }
    startoff = XLogSegmentOffset(recptr, wal_segment_size);
    byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
    /* ... error handling ... */
    recptr += byteswritten;
    nbytes -= byteswritten;
}
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
```

XLogWalRcvFlush fsyncs the current file via issue_xlog_fsync,
advances LogstreamResult.Flush, and then — crucially — wakes the startup
process and any cascading walsender:
```c
// XLogWalRcvFlush (excerpt) - src/backend/replication/walreceiver.c
issue_xlog_fsync(recvFile, recvSegNo, tli);
LogstreamResult.Flush = LogstreamResult.Write;

SpinLockAcquire(&walrcv->mutex);
if (walrcv->flushedUpto < LogstreamResult.Flush)
{
    walrcv->latestChunkStart = walrcv->flushedUpto;
    walrcv->flushedUpto = LogstreamResult.Flush;
    walrcv->receivedTLI = tli;
}
SpinLockRelease(&walrcv->mutex);

WakeupRecovery();
if (AllowCascadeReplication())
    WalSndWakeup(true, false);

/* send reply and hot-standby feedback back to primary */
XLogWalRcvSendReply(false, false);
XLogWalRcvSendHSFeedback(false);
```

WakeupRecovery() signals the startup process latch; the startup process
reads GetWalRcvFlushRecPtr and advances recovery up to that LSN.
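The segment handling inside the XLogWalRcvWrite loop amounts to cutting an incoming byte range wherever it crosses a segment boundary. The sketch below isolates that arithmetic with no file I/O; `seg_chunk` and `split_by_segment` are hypothetical names, and the 16 MB segment size is an assumed constant rather than the run-time wal_segment_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the default 16 MB WAL segment size. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* One contiguous write that stays inside a single segment file. */
typedef struct
{
    uint64_t segno;    /* which segment file */
    uint64_t offset;   /* byte offset inside that file */
    uint64_t length;   /* bytes to write there */
} seg_chunk;

/*
 * Split the range [start, start + nbytes) into per-segment chunks.
 * Returns how many chunks were produced (at most max_chunks).
 */
static size_t split_by_segment(uint64_t start, uint64_t nbytes,
                               seg_chunk *chunks, size_t max_chunks)
{
    size_t n = 0;

    while (nbytes > 0 && n < max_chunks)
    {
        uint64_t off = start % SEG_SIZE;
        uint64_t room = SEG_SIZE - off;             /* left in this segment */
        uint64_t len = nbytes < room ? nbytes : room;

        chunks[n].segno = start / SEG_SIZE;
        chunks[n].offset = off;
        chunks[n].length = len;
        n++;

        start += len;
        nbytes -= len;
    }
    return n;
}
```

Each chunk maps to one positioned write against one file, so a message that straddles a boundary becomes a write at the tail of one segment followed by a write at offset 0 of the next.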
Receiver state machine and IPC
The WalRcvState enum tracks the walreceiver’s life from the startup
process’s perspective:
```mermaid
stateDiagram-v2
    [*] --> WALRCV_STOPPED
    WALRCV_STOPPED --> WALRCV_STARTING : RequestXLogStreaming\nlaunches process
    WALRCV_STARTING --> WALRCV_STREAMING : WalReceiverMain\ninitializes
    WALRCV_STREAMING --> WALRCV_WAITING : primary ends stream\nbut stays connected
    WALRCV_WAITING --> WALRCV_RESTARTING : startup nudges\nexisting process
    WALRCV_RESTARTING --> WALRCV_STREAMING : process loops\nand reconnects
    WALRCV_STREAMING --> WALRCV_STOPPING : ShutdownWalRcv\nor SIGTERM
    WALRCV_WAITING --> WALRCV_STOPPING : ShutdownWalRcv
    WALRCV_STOPPING --> WALRCV_STOPPED : WalRcvDie\non shmem_exit
    WALRCV_STOPPED --> [*]
```
Figure 3 — WalRcvState lifecycle. RequestXLogStreaming (called by the
startup process) transitions STOPPED → STARTING and signals the
postmaster to fork; if a WAITING receiver already exists, it transitions
WAITING → RESTARTING and wakes the existing process instead of forking a
new one. (States from walreceiver.h; transitions from walreceiverfuncs.c
and walreceiver.c.)
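The edges drawn in Figure 3 can be written down as a small transition predicate, which is handy for reasoning about which state changes a given caller may request. This sketch encodes only the edges shown in the figure; the `rcv_state` enum and `rcv_transition_allowed` are hypothetical names, not PostgreSQL symbols, and the real code may permit paths the figure omits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the six receiver states in Figure 3. */
typedef enum
{
    RCV_STOPPED,
    RCV_STARTING,
    RCV_STREAMING,
    RCV_WAITING,
    RCV_RESTARTING,
    RCV_STOPPING
} rcv_state;

/* True only for the edges drawn in the figure. */
static bool rcv_transition_allowed(rcv_state from, rcv_state to)
{
    switch (from)
    {
        case RCV_STOPPED:
            return to == RCV_STARTING;
        case RCV_STARTING:
            return to == RCV_STREAMING;
        case RCV_STREAMING:
            return to == RCV_WAITING || to == RCV_STOPPING;
        case RCV_WAITING:
            return to == RCV_RESTARTING || to == RCV_STOPPING;
        case RCV_RESTARTING:
            return to == RCV_STREAMING;
        case RCV_STOPPING:
            return to == RCV_STOPPED;
    }
    return false;
}
```

Read this way, the startup process drives two edges (STOPPED to STARTING, WAITING to RESTARTING) and the receiver process itself drives the rest.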
RequestXLogStreaming in walreceiverfuncs.c is the startup-process
entry point. It fills in WalRcvData under the spinlock, then either
sends PMSIGNAL_START_WALRECEIVER to the postmaster (first start) or
wakes the existing receiver’s latch (restart):
```c
// RequestXLogStreaming - src/backend/replication/walreceiverfuncs.c
SpinLockAcquire(&walrcv->mutex);
if (walrcv->walRcvState == WALRCV_STOPPED)
{
    launch = true;
    walrcv->walRcvState = WALRCV_STARTING;
    strlcpy(walrcv->conninfo, conninfo, MAXCONNINFO);
}
else
    walrcv->walRcvState = WALRCV_RESTARTING;
walrcv->receiveStart = recptr;
walrcv->receiveStartTLI = tli;
walrcv_proc = walrcv->procno;
SpinLockRelease(&walrcv->mutex);

if (launch)
    SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
else if (walrcv_proc != INVALID_PROC_NUMBER)
    SetLatch(&GetPGProcByNumber(walrcv_proc)->procLatch);
```

libpqwalreceiver: pluggable transport
The walreceiver does not link libpq directly into the server binary.
Instead, walreceiver.c calls through the WalReceiverFunctions function
pointer table, which is populated by load_file("libpqwalreceiver") at
startup. The README’s first paragraph captures the rationale: “The
transport-specific part of walreceiver … is loaded dynamically to avoid
having to link the main server binary with libpq.”
The function pointer table is defined in walreceiver.h as
WalReceiverFunctionsType; the actual libpq implementation lives in
replication/libpqwalreceiver/. The indirection means a third party could
in principle supply an alternative transport (e.g. RDMA, a shared-memory
channel for same-host standbys) by providing a compatible shared library,
though this API is currently described as “internal.”
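The indirection pattern itself is ordinary C: core code holds a pointer to a table of function pointers, and a module publishes its table when it is loaded. The sketch below shows the shape with a toy loopback transport. Everything here (`transport_ops`, `install_transport`, `core_receive_once`, the loopback functions) is hypothetical and far smaller than the real WalReceiverFunctionsType; no dynamic loading is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced transport interface. */
typedef struct transport_ops
{
    int  (*connect)(const char *conninfo);
    int  (*receive)(char *buf, size_t buflen);
    void (*disconnect)(void);
} transport_ops;

/* Core code sees only this pointer; NULL means no transport is loaded. */
static const transport_ops *active_transport = NULL;

/* A loopback transport standing in for a dynamically loaded module. */
static int loop_connected = 0;

static int loop_connect(const char *conninfo)
{
    loop_connected = (conninfo != NULL);
    return loop_connected ? 0 : -1;
}

static int loop_receive(char *buf, size_t buflen)
{
    static const char payload[] = "wal";

    if (!loop_connected || buflen < sizeof payload)
        return -1;
    memcpy(buf, payload, sizeof payload);   /* includes the terminating NUL */
    return (int) (sizeof payload - 1);
}

static void loop_disconnect(void)
{
    loop_connected = 0;
}

static const transport_ops loopback_ops = {
    loop_connect, loop_receive, loop_disconnect
};

/* What a module's init hook would do: publish its table. */
static void install_transport(const transport_ops *ops)
{
    active_transport = ops;
}

/* Core code calls through the table and never names the transport. */
static int core_receive_once(const char *conninfo, char *buf, size_t buflen)
{
    int n;

    if (active_transport == NULL || active_transport->connect(conninfo) != 0)
        return -1;
    n = active_transport->receive(buf, buflen);
    active_transport->disconnect();
    return n;
}
```

The core never references a transport symbol directly, so the binary that contains `core_receive_once` has no link-time dependency on whatever library implements the table.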
Walsender shutdown sequence
The postmaster-driven shutdown differs from a regular backend’s. The README explains: walsenders must deliver the shutdown checkpoint record to standbys before terminating. The sequence is:
1. After all regular backends have exited, the checkpointer sends PROCSIG_WALSND_INIT_STOPPING to every walsender.
2. Each walsender transitions to WALSNDSTATE_STOPPING, rejects new commands, and signals readiness.
3. Checkpointer begins the shutdown checkpoint only once all walsenders confirm stopping (WalSndWaitStopping).
4. When the shutdown checkpoint finishes, the postmaster sends SIGUSR2 to each walsender.
5. WalSndDone flushes any remaining WAL, waits for standby acknowledgement, then calls proc_exit(0).

This choreography ensures standbys receive the shutdown checkpoint record, so after a clean primary shutdown, a promoted standby does not need to replay past that point.
Source Walkthrough
Symbols grouped by process/subsystem. Files are under
/data/hgryoo/references/postgres/.
Walsender: shared memory and state (walsender_private.h, walsender.c)
- WalSndState (enum): WALSNDSTATE_STARTUP through WALSNDSTATE_STOPPING.
- WalSnd (struct): per-sender slot holding pid, state, sentPtr, write, flush, apply, lag offsets, mutex, replyTime, kind.
- WalSndCtlData (struct): the shared control block, containing SyncRepQueue[], lsn[], sync_standbys_status, wal_flush_cv, wal_replay_cv, wal_confirm_rcv_cv, walsnds[] (flexible array).
- WalSndCtl: global pointer to WalSndCtlData.
- MyWalSnd: this process’s WalSnd slot.
- WalSndShmemSize / WalSndShmemInit: allocate/initialize the WalSndCtlData shmem block.
- WalSndSetState: transition MyWalSnd->state and log the event.
- NUM_SYNC_REP_WAIT_MODE: array dimension for SyncRepQueue[] / lsn[].

Walsender: command dispatch (walsender.c)
- InitWalSender: claim a WalSnd slot; called from PostgresMain.
- exec_replication_command: parse and dispatch a replication command string; returns false if not a replication command (SQL passthrough for db-mode walsenders).
- IdentifySystem: handle IDENTIFY_SYSTEM; returns sysid, timeline, LSN, dbname.
- StartReplication: handle START_REPLICATION (physical); sets up xlogreader, resolves timeline, enters COPY mode, calls WalSndLoop(XLogSendPhysical).
- StartLogicalReplication: handle START_REPLICATION (logical); calls WalSndLoop(XLogSendLogical).

Walsender: send loop (walsender.c)
- WalSndLoop: the steady-state loop; dispatches to send_data, manages keepalives, processes replies, detects catchup→streaming transition, handles shutdown.
- XLogSendPhysical: compute SendRqstPtr (flush pointer on primary, standby flush ptr on cascading), read WAL via WALRead, enqueue 'w' message.
- XLogSendLogical: pull decoded changes from the logical decoding context; calls WalSndWriteData via the LogicalDecodingContext output plugin callback.
- ProcessRepliesIfAny: non-blocking drain of the client socket; dispatches 'd' CopyData to ProcessStandbyMessage, handles CopyDone and Terminate.
- ProcessStandbyMessage / ProcessStandbyReplyMessage / ProcessStandbyHSFeedbackMessage: update WalSnd slots; release sync waiters; update slot xmin.
- WalSndCheckTimeOut: enforce wal_sender_timeout.
- WalSndKeepalive / WalSndKeepaliveIfNecessary: send 'k' message.
- WalSndDone: graceful shutdown: drain remaining WAL, wait for standby ack, proc_exit(0).
- WalSndWakeup / WalSndWakeupRequest / WalSndWakeupProcessRequests: condition-variable / latch wakeup from WAL flush path.

Walreceiver functions (walreceiverfuncs.c)
- RequestXLogStreaming: startup-process entry: fill WalRcvData, transition state, signal postmaster or nudge existing receiver.
- GetWalRcvFlushRecPtr: return flushedUpto (and optionally latestChunkStart, receivedTLI) under spinlock.
- WalRcvRunning / WalRcvStreaming: status predicates; detect WALRCV_STARTUP_TIMEOUT and force-stop if startup took too long.
- ShutdownWalRcv: send SIGTERM and wait on walRcvStoppedCV.

Walreceiver main loop (walreceiver.c)
- WalReceiverMain: the main process function: AuxiliaryProcessMainCommon, claim WalRcvData, load libpqwalreceiver, connect, handshake, stream loop.
- WalRcvWaitForStartPosition: when the primary ends a stream without disconnecting, wait here for WALRCV_RESTARTING or termination.
- XLogWalRcvProcessMsg: dispatch 'w' to XLogWalRcvWrite and 'k' to keepalive reply.
- XLogWalRcvWrite: segment-aligned pg_pwrite loop; updates writtenUpto atomic.
- XLogWalRcvFlush: issue_xlog_fsync, advance flushedUpto, signal startup process and cascading walsenders, send reply.
- XLogWalRcvSendReply: build and send 'r' status reply.
- XLogWalRcvSendHSFeedback: build and send 'h' feedback when hot_standby_feedback is on.

Shared data structures (walreceiver.h)
- WalRcvState (enum): WALRCV_STOPPED, WALRCV_STARTING, WALRCV_STREAMING, WALRCV_WAITING, WALRCV_RESTARTING, WALRCV_STOPPING.
- WalRcvData (struct): procno, pid, walRcvState, walRcvStoppedCV, receiveStart/TLI, flushedUpto, receivedTLI, latestChunkStart, writtenUpto (atomic), force_reply, conninfo, slotname, timing fields.
- WalRcv: global pointer to the single WalRcvData shmem block.
- WalReceiverFunctionsType / WalReceiverFunctions: the dispatch table populated by libpqwalreceiver.
- AllowCascadeReplication(): macro: EnableHotStandby && max_wal_senders > 0.

Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| WalSndState (enum) | src/include/replication/walsender_private.h | 24 |
| WalSnd (struct) | src/include/replication/walsender_private.h | 41 |
| WalSndCtlData (struct) | src/include/replication/walsender_private.h | 80 |
| WalRcvState (enum) | src/include/replication/walreceiver.h | 46 |
| WalRcvData (struct) | src/include/replication/walreceiver.h | 56 |
| AllowCascadeReplication (macro) | src/include/replication/walreceiver.h | 42 |
| am_walsender | src/include/replication/walsender.h | 22 |
| max_wal_senders | src/backend/replication/walsender.c | 126 |
| InitWalSender | src/backend/replication/walsender.c | 297 |
| IdentifySystem | src/backend/replication/walsender.c | 397 |
| StartLogicalReplication | src/backend/replication/walsender.c | 1447 |
| exec_replication_command | src/backend/replication/walsender.c | 1990 |
| ProcessRepliesIfAny | src/backend/replication/walsender.c | 2246 |
| ProcessStandbyReplyMessage | src/backend/replication/walsender.c | 2423 |
| ProcessStandbyHSFeedbackMessage | src/backend/replication/walsender.c | 2611 |
| WalSndCheckTimeOut | src/backend/replication/walsender.c | 2779 |
| WalSndLoop | src/backend/replication/walsender.c | 2806 |
| InitWalSenderSlot | src/backend/replication/walsender.c | 2948 |
| XLogSendPhysical | src/backend/replication/walsender.c | 3118 |
| WalSndDone | src/backend/replication/walsender.c | 3521 |
| WalSndShmemSize | src/backend/replication/walsender.c | 3669 |
| WalSndShmemInit | src/backend/replication/walsender.c | 3681 |
| WalSndSetState | src/backend/replication/walsender.c | 3869 |
| WalSndKeepalive | src/backend/replication/walsender.c | 4094 |
| WalSndKeepaliveIfNecessary | src/backend/replication/walsender.c | 4117 |
| WalReceiverMain | src/backend/replication/walreceiver.c | 152 |
| WalRcvWaitForStartPosition | src/backend/replication/walreceiver.c | 645 |
| XLogWalRcvProcessMsg | src/backend/replication/walreceiver.c | 819 |
| XLogWalRcvWrite | src/backend/replication/walreceiver.c | 890 |
| XLogWalRcvFlush | src/backend/replication/walreceiver.c | 985 |
| WalRcvRunning | src/backend/replication/walreceiverfuncs.c | 76 |
| WalRcvStreaming | src/backend/replication/walreceiverfuncs.c | 127 |
| RequestXLogStreaming | src/backend/replication/walreceiverfuncs.c | 246 |
| GetWalRcvFlushRecPtr | src/backend/replication/walreceiverfuncs.c | 336 |
Source verification (as of 2026-06-05)
Verified facts
- A walsender is a backend variant, not a dedicated auxiliary process. Verified in the walsender.c file header: “A walsender is similar to a regular backend, ie. there is a one-to-one relationship between a connection and a walsender process.” InitWalSender is called from PostgresMain after the replication= parameter is detected. The BackendType is B_WAL_SENDER.
- max_wal_senders is a GUC with a default of 10; the WalSndCtl shmem block scales with it. Verified: int max_wal_senders = 10 in walsender.c line 126; WalSndShmemSize returns sizeof(WalSndCtlData) + max_wal_senders * sizeof(WalSnd) (line 3674). Changing max_wal_senders requires a server restart (shmem is allocated at postmaster startup).
- The streaming protocol uses CopyBoth mode over the standard wire protocol. Verified in StartReplication: pq_beginmessage(&buf, PqMsg_CopyBothResponse) is sent before entering WalSndLoop. The WAL data messages use pq_putmessage_noblock('d', ...) (CopyData).
- Three LSN cursors (write / flush / apply) are reported by the standby and stored in WalSnd under a spinlock. Verified in ProcessStandbyReplyMessage: writePtr, flushPtr, applyPtr are unpacked from the 'r' message and stored into walsnd->write, ->flush, ->apply under SpinLockAcquire(&walsnd->mutex).
- flushedUpto is updated only after issue_xlog_fsync in XLogWalRcvFlush. The advance of flushedUpto in WalRcvData is inside the spinlock after issue_xlog_fsync returns. Writes advance writtenUpto (atomic, no lock) earlier, letting cascading walsenders serve unsynced data while keeping the flushedUpto guarantee clean.
- WakeupRecovery() is called directly after updating flushedUpto. Verified in XLogWalRcvFlush: the call to WakeupRecovery() is immediately after the spinlock release that advances flushedUpto. This is the mechanism by which the startup process learns that WAL replay can advance.
- The walreceiver loads libpq dynamically via load_file. Verified in WalReceiverMain: load_file("libpqwalreceiver", false) followed by an assertion that WalReceiverFunctions != NULL. The README confirms the rationale: avoiding linking the server binary with libpq.
- When the primary ends streaming without disconnecting, the walreceiver enters WALRCV_WAITING and waits for the startup process to issue new instructions. Verified in WalReceiverMain’s loop: after walrcv_endstreaming returns, the process calls WalRcvWaitForStartPosition, which sleeps until either WALRCV_RESTARTING is set (startup nudges it) or termination is requested. The startup process then calls RequestXLogStreaming again, which sets WALRCV_RESTARTING and wakes the receiver’s latch without forking a new process.
- Shutdown choreography gates the shutdown checkpoint on walsender readiness. Verified in the file header comment and WalSndInitStopping / WalSndWaitStopping: the checkpointer calls WalSndInitStopping (which sends PROCSIG_WALSND_INIT_STOPPING to each walsender), then WalSndWaitStopping (which loops until all walsenders are in WALSNDSTATE_STOPPING or WALSNDSTATE_STARTUP). Only then does the checkpointer proceed with the shutdown checkpoint.
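The 'r' reply that carries the three cursors can be pictured as a fixed-layout CopyData payload. The sketch below is illustrative only. It assumes the Standby status update layout given in the streaming replication protocol documentation: a type byte 'r', then the write, flush, and apply positions, a client timestamp, and a reply-requested flag, with integers in network byte order. StandbyReply, read_be64, and decode_standby_reply are hypothetical names, not functions from walsender.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the decoded reply; not PostgreSQL's WalSnd. */
typedef struct StandbyReply
{
    uint64_t write_lsn;        /* last WAL byte + 1 written by the standby */
    uint64_t flush_lsn;        /* last WAL byte + 1 flushed to disk */
    uint64_t apply_lsn;        /* last WAL byte + 1 replayed */
    int64_t  send_time_us;     /* microseconds since 2000-01-01 */
    int      reply_requested;  /* nonzero if the standby wants an answer */
} StandbyReply;

/* Read a big-endian (network order) 64-bit integer from p. */
static uint64_t
read_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode one reply payload: 1 type byte + 4 * 8-byte integers + 1 flag byte.
 * Returns 0 on success, -1 if the buffer is short or is not an 'r' message.
 */
static int
decode_standby_reply(const unsigned char *buf, size_t len, StandbyReply *out)
{
    if (len < 34 || buf[0] != 'r')
        return -1;

    out->write_lsn = read_be64(buf + 1);
    out->flush_lsn = read_be64(buf + 9);
    out->apply_lsn = read_be64(buf + 17);
    out->send_time_us = (int64_t) read_be64(buf + 25);
    out->reply_requested = buf[33] != 0;
    return 0;
}
```

Feeding it a 34-byte buffer that starts with 'r' fills the three cursors; a short buffer or any other type byte is rejected. The sketch stops at decoding and leaves out the spinlock-protected store into shared memory that the server performs afterwards.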
Open questions
- Does writtenUpto risk serving torn WAL to cascading walsenders? writtenUpto advances in XLogWalRcvWrite before XLogWalRcvFlush is called, so a cascading walsender could read bytes that have been pwrite()’d but not yet fsync()’d. If the standby crashes before fsyncing, those bytes are lost, but the cascading standby might already have sent them downstream. Investigation path: read GetStandbyFlushRecPtr (which XLogSendPhysical calls on a cascading sender) vs. GetWalRcvFlushRecPtr. Are these the same pointer or different?
- How does hot_standby_feedback interact with slot xmin when both are active? ProcessStandbyHSFeedbackMessage calls PhysicalReplicationSlotNewXmin when MyReplicationSlot is set, and also sets MyProc->xmin directly. The interaction between the slot’s effective_xmin and the proc’s xmin in the global xmin horizon calculation (ComputeXidHorizons, which replaced GetOldestXmin in PostgreSQL 14) is non-obvious. Investigation path: read ReplicationSlotsComputeRequiredXmin and trace how both contribute to the horizon.
- WALRCV_STARTUP_TIMEOUT: what happens when walreceiver is slow to start? WalRcvRunning and WalRcvStreaming both check whether WALRCV_STARTING has persisted past WALRCV_STARTUP_TIMEOUT (10 seconds, a compile-time constant) and force-transition to WALRCV_STOPPED. The startup process then re-requests streaming. Is there a risk of a tight loop if the primary is unreachable? Investigation path: trace RequestXLogStreaming re-invocation in xlogrecovery.c.
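The startup-timeout check in the last question amounts to a small state demotion performed by the observer rather than by the walreceiver itself. The following is a rough sketch of that idea, not the code in walreceiverfuncs.c: RcvState, RcvShared, and rcv_running are hypothetical names, the timeout is passed as a parameter instead of being a compile-time constant, and the locking that protects the shared state in the server is omitted.

```c
#include <time.h>

/* Hypothetical subset of receiver states, modelled on the WALRCV_* names. */
typedef enum RcvState
{
    RCV_STOPPED,
    RCV_STARTING,
    RCV_STREAMING,
    RCV_WAITING,
    RCV_RESTARTING
} RcvState;

/* Hypothetical shared record; the server keeps this in shared memory. */
typedef struct RcvShared
{
    RcvState state;
    time_t   start_time;   /* when the launch was requested */
} RcvShared;

/*
 * Report whether the receiver counts as running at time `now`.
 * A receiver that has sat in STARTING for longer than timeout_secs is
 * assumed to have failed to launch and is demoted to STOPPED, so the
 * caller is free to request a fresh start.
 */
static int
rcv_running(RcvShared *rcv, time_t now, int timeout_secs)
{
    if (rcv->state == RCV_STARTING &&
        (now - rcv->start_time) > timeout_secs)
        rcv->state = RCV_STOPPED;

    return rcv->state != RCV_STOPPED;
}
```

Because the demotion happens only when some caller polls, how often the caller retries after seeing STOPPED is decided entirely on the caller's side, which is where the tight-loop question has to be answered.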
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- MySQL/InnoDB Group Replication and Galera (synchronous multi-master): PostgreSQL’s walsender/walreceiver is strictly primary-to-standby (one direction, one primary). Group Replication and Galera use a certification-based protocol (Paxos / Galera replication) where every node participates in write ordering. The comparison would clarify what PostgreSQL trades by keeping a simple unidirectional pipe rather than a multi-master ring. The relevant PostgreSQL analog for the write-ordering half is synchronous_commit = remote_apply with multiple synchronous standbys, though this still has a single primary.
- Raft-based replication (CockroachDB, etcd-backed PostgreSQL HA): Raft provides leader election and log replication in a single protocol. PostgreSQL separates these concerns: WAL streaming handles log transport, Patroni/Stolon/repmgr handles leader election externally. A note on how the split of concerns maps onto Raft’s log-append, commit-quorum, and leader-lease primitives would be useful for the postgres-evolution-replication.md arc.
- Logical replication vs. physical streaming: this doc covers only the transport layer that both share. The logical decoding path (XLogSendLogical, LogicalDecodingContext) that turns WAL bytes into decoded row changes is the subject of postgres-logical-decoding.md. The key cross-reference: StartLogicalReplication in walsender.c calls the same WalSndLoop as the physical path. The transport is identical; only the data source differs.
- Replication lag and the lag-tracking mechanism: LagTrackerWrite and LagTrackerRead in walsender.c record when each WAL position was sent and when the standby acknowledged it. The resulting writeLag, flushLag, applyLag fields in WalSnd feed pg_stat_replication. A dedicated note on lag estimation accuracy (sampling bias, the interaction with CommitDelay, and how large-transaction lags are attributed) would complement the overview in postgres-overview-replication-ha.md.
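The lag-tracking idea in the last item, remembering when each WAL position was sent and comparing against the time it is acknowledged, can be sketched with a small ring of samples. This is a simplified illustration under stated assumptions, not the walsender.c implementation: LagRing, lag_record_send, and lag_on_ack are hypothetical names, the ring is deliberately tiny, a single read head stands in for the separate write/flush/apply cursors, and samples are simply dropped when the ring is full.

```c
#include <stdint.h>

#define LAG_SAMPLES 8   /* hypothetical capacity, small on purpose */

typedef struct LagSample
{
    uint64_t lsn;          /* WAL position that was sent */
    int64_t  sent_at_us;   /* local clock when it was sent */
} LagSample;

/* Zero-initialize before use: both heads start at slot 0 (empty ring). */
typedef struct LagRing
{
    LagSample samples[LAG_SAMPLES];
    int       write_head;  /* next slot to fill */
    int       read_head;   /* oldest sample not yet acknowledged */
} LagRing;

/* Remember that WAL up to `lsn` was sent at `sent_at_us`. */
static void
lag_record_send(LagRing *r, uint64_t lsn, int64_t sent_at_us)
{
    int next = (r->write_head + 1) % LAG_SAMPLES;

    if (next == r->read_head)
        return;            /* ring full: this sketch just drops the sample */
    r->samples[r->write_head].lsn = lsn;
    r->samples[r->write_head].sent_at_us = sent_at_us;
    r->write_head = next;
}

/*
 * The standby acknowledged `acked_lsn` at `now_us`. Consume every sample at
 * or below that position and return the lag of the newest one consumed,
 * or -1 if the acknowledgement covered no outstanding sample.
 */
static int64_t
lag_on_ack(LagRing *r, uint64_t acked_lsn, int64_t now_us)
{
    int64_t lag = -1;

    while (r->read_head != r->write_head &&
           r->samples[r->read_head].lsn <= acked_lsn)
    {
        lag = now_us - r->samples[r->read_head].sent_at_us;
        r->read_head = (r->read_head + 1) % LAG_SAMPLES;
    }
    return lag;
}
```

Even this toy version shows one source of sampling bias: an acknowledgement that covers several samples reports only the newest one, so the time the older positions spent waiting never surfaces in the result.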
Sources
In-tree design docs:
- src/backend/replication/README: walreceiver IPC, walsender IPC, walsender–walreceiver protocol (“See manual”).
Source files (REL_18_STABLE, commit 273fe94):
- src/backend/replication/walsender.c: the walsender process, covering command dispatch, send loop, standby reply processing, keepalives, shutdown.
- src/backend/replication/walreceiver.c: the walreceiver process, covering main loop, message dispatch, write/flush to pg_wal, reply sending.
- src/backend/replication/walreceiverfuncs.c: the startup-process API, namely RequestXLogStreaming, GetWalRcvFlushRecPtr, WalRcvRunning, WalRcvStreaming, ShutdownWalRcv.
- src/include/replication/walsender.h: public walsender API, GUC declarations, WalSndWakeupRequest macro.
- src/include/replication/walsender_private.h: WalSndState, WalSnd, WalSndCtlData.
- src/include/replication/walreceiver.h: WalRcvState, WalRcvData, WalReceiverFunctionsType.
Textbooks:
- Database Internals (Petrov), ch. 11: replication, consistency, leader and follower roles, log shipping vs. streaming.
Cross-references (mechanism owned elsewhere, not duplicated here):
- postgres-xlog-wal.md: WAL record format, LSN, durability pipeline; the flush pointer that walsender reads via GetFlushRecPtr.
- postgres-replication-slots.md: slot creation, WAL retention, xmin horizon management.
- postgres-logical-decoding.md: decoded change stream; the logical path that shares WalSndLoop.
- postgres-synchronous-replication.md: SyncRepReleaseWaiters, the synchronous_standby_names policy, and the commit-wait path.
- postgres-overview-replication-ha.md: subcategory router with reading order across all replication-ha docs.
- postgres-architecture-overview.md: Axis 3, the WAL-centric durability spine that makes streaming replication possible.