PostgreSQL WAL Sender/Receiver — Streaming Replication Transport

Streaming replication is one answer to a question every database system must answer: how does a standby (replica) server stay current with the primary’s durable state? The design space has three broad families:

  1. Log shipping — completed WAL segment files are copied to the standby after they fill. Recovery replays them in order. Simple, but the standby can be a full segment (default 16 MB) behind the primary at any moment, and failover forfeits any WAL not yet shipped.

  2. Streaming replication — the standby connects to the primary over a persistent TCP link and receives WAL records in near-real time, as they are flushed. The standby can be within a few milliseconds of the primary, limited by network RTT and the standby’s apply rate.

  3. Shared-storage / semi-synchronous — the primary and standby share a disk volume (SAN), or the primary waits for at least one standby to acknowledge WAL before returning from commit. These collapse the log-shipping gap to zero at the cost of coupling latency to the replication link.

PostgreSQL 9.0 introduced streaming replication as option 2. The theoretical grounding is the same WAL durability spine described in postgres-xlog-wal.md: an LSN is a byte offset into the append-only WAL stream; a standby is simply a consumer of that stream from some starting position. Every design choice in walsender/walreceiver flows from that observation.
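Because an LSN is a plain byte offset, most replication bookkeeping reduces to integer arithmetic. The following is a minimal sketch, assuming the default 16 MB segment size mentioned earlier; the names `Lsn`, `lsn_segno`, `lsn_segoff`, `lsn_lag_bytes`, and `lsn_format` are hypothetical illustrations, not PostgreSQL symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr: a 64-bit byte offset. */
typedef uint64_t Lsn;

/* Assumption: the default WAL segment size of 16 MB. */
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Number of the segment file that contains this byte. */
static uint64_t lsn_segno(Lsn lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/* Byte offset of this LSN inside its segment file. */
static uint64_t lsn_segoff(Lsn lsn)
{
    return lsn % WAL_SEGMENT_SIZE;
}

/* How many bytes the standby is behind; 0 if it is not behind. */
static uint64_t lsn_lag_bytes(Lsn primary, Lsn standby)
{
    return primary > standby ? primary - standby : 0;
}

/* Format as "hi/lo" in hex, the usual textual LSN notation. */
static int lsn_format(Lsn lsn, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

A standby that starts streaming at some LSN is, in these terms, a reader positioned at `lsn_segoff(start)` inside segment `lsn_segno(start)`.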

The two supplementary ideas that streaming replication adds on top of plain WAL replay are:

Replication slots. Without a slot, the primary does not know how far behind a standby is, so it may recycle WAL segments the standby still needs; the standby then cannot resume streaming and must fetch the missing WAL from an archive or be rebuilt. A physical replication slot records a restart_lsn, advanced as the standby confirms flushed WAL, and prevents WAL reclaim below it. The slot mechanism is defined in postgres-replication-slots.md; walsender acquires/releases a slot if one is named in START_REPLICATION.

Hot Standby feedback. A standby running queries under hot_standby = on may have to cancel those queries when it replays the primary's vacuum of rows they are still reading. Hot-standby feedback lets the standby advertise its oldest active xmin back to the primary, so the primary does not remove those rows until the standby’s transactions finish. This is a pure replication-protocol addition, invisible to the storage or lock layers on the primary.
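The retention rule that replication slots add can be sketched as a minimum over the slots' positions. This is a simplified illustration, assuming a 16 MB segment size and ignoring every other retention setting a real server applies; `Slot`, `INVALID_LSN`, and `oldest_segno_to_keep` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;                       /* byte offset into the WAL stream */

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)   /* assumed default */
#define INVALID_LSN      ((Lsn) 0)          /* slot reserves no WAL */

/* Hypothetical, reduced slot: only the field that drives WAL retention. */
typedef struct
{
    Lsn restart_lsn;    /* oldest WAL this consumer may still need */
} Slot;

/*
 * Return the oldest segment number that must be kept. Without slots the
 * bound is the segment holding redo_lsn (what crash recovery needs);
 * each valid slot can only pull the bound further back.
 */
static uint64_t oldest_segno_to_keep(const Slot *slots, size_t nslots,
                                     Lsn redo_lsn)
{
    Lsn keep = redo_lsn;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].restart_lsn != INVALID_LSN &&
            slots[i].restart_lsn < keep)
            keep = slots[i].restart_lsn;
    }
    return keep / WAL_SEGMENT_SIZE;
}
```

Segments numbered below the returned value are the only ones this model would allow to be recycled, which is why a stalled standby with a slot can make pg_wal grow without bound.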

Database Internals (Petrov, ch. 11, “Replication and Consistency”) frames the streaming model as a two-node pipeline where the primary is the log producer and the standby is the log consumer; the transport layer’s job is to deliver each log record exactly once and in order, with durable acknowledgement flowing back upstream. Every design element of walsender/walreceiver is one piece of that pipeline.

The design space a streaming-replication implementer chooses within:

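  1. Sync vs. async — does COMMIT wait for the standby to acknowledge? PostgreSQL supports both. When a synchronous standby is configured (synchronous_standby_names), synchronous_commit selects what COMMIT waits for: remote_write waits for the standby to write the WAL, on waits for the standby to flush it, and remote_apply waits for the standby to replay it. local and off do not wait for any standby; off does not even wait for the local flush, trading durability for latency.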

  2. Physical vs. logical — does the stream carry raw WAL bytes (same page layout as the primary) or decoded change records (row-level, usable across schema versions)? PostgreSQL supports both from the same walsender process; the transport is identical, only the data source differs (XLogSendPhysical vs. XLogSendLogical).

  3. Transport coupling — is the replication channel a dedicated process, a thread, or an in-process queue? PostgreSQL uses a dedicated walsender process per standby and a dedicated walreceiver process on the standby side, which fits the postmaster fork model.

Almost every streaming-replication implementation converges on the same engineering conventions. Naming them here makes PostgreSQL’s specific symbols read as one set of choices within a shared playbook.

Log shipping can be batched and handled by a file-copy daemon. Streaming cannot: the send loop must react to new WAL, to standby replies, and to keepalive timeouts at sub-second granularity. The universal pattern is a dedicated replication-sender process (or thread) per connected replica, so each sender’s event loop is its own context and a slow or disconnected replica does not block others.

A persistent connection with a handshake phase

The standby connects, authenticates, and runs a handshake exchange before streaming begins: identify the primary system, optionally negotiate a timeline, then declare the starting LSN. The handshake’s job is to catch mismatches early (wrong cluster, already-diverged timeline) rather than silently streaming garbage.

WAL streaming in COPY mode over the existing wire protocol

Re-using the database wire protocol for replication avoids a separate port and authentication path. The practical choice is to multiplex replication over the standard client connection: a START_REPLICATION command puts the connection into streaming mode (PostgreSQL uses CopyBothResponse, a bidirectional COPY sub-protocol), and raw WAL bytes flow as CopyData messages. Keepalive and acknowledgement messages share the same channel.

The standby’s progress is not a single number. Three ordered cursors advance at different rates:

  • write — the standby has written WAL to its local WAL file, but has not necessarily fsynced it.
  • flush — the standby has fsynced WAL to stable storage.
  • apply — the standby’s startup process has replayed through this LSN.

The primary tracks all three per sender (in its WalSnd struct) and uses flush for durability decisions and apply for lag monitoring. Synchronous replication waits on write, flush, or apply depending on synchronous_commit (remote_write, on, and remote_apply respectively).

Keepalives and timeouts in both directions

A streaming connection can be idle when the primary has no new WAL. Silence is indistinguishable from a dead connection, so both sides send periodic messages. The primary sends a keepalive with the replyRequested flag set once half of wal_sender_timeout has passed without any reply from the standby, and terminates the connection if the full timeout passes; the standby sends a status reply whenever it flushes new WAL or when wal_receiver_status_interval elapses.
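The sender-side timer logic can be sketched as a pure function of the clock. This is an illustration only: it assumes the timer is measured from the last message received from the standby and that a timeout of 0 disables the check; `SenderAction` and `sender_timer_action` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    SENDER_IDLE,            /* nothing to do yet */
    SENDER_SEND_KEEPALIVE,  /* ask the standby for an immediate reply */
    SENDER_TIMED_OUT        /* give up on this connection */
} SenderAction;

/*
 * Decide what the sender should do at time now_ms, given when it last
 * heard from the standby. All times are in milliseconds.
 */
static SenderAction sender_timer_action(int64_t now_ms,
                                        int64_t last_reply_ms,
                                        int64_t timeout_ms,
                                        bool keepalive_already_sent)
{
    int64_t silent_ms;

    assert(now_ms >= last_reply_ms);

    if (timeout_ms <= 0)
        return SENDER_IDLE;             /* timeout checking disabled */

    silent_ms = now_ms - last_reply_ms;

    if (silent_ms >= timeout_ms)
        return SENDER_TIMED_OUT;
    if (silent_ms >= timeout_ms / 2 && !keepalive_already_sent)
        return SENDER_SEND_KEEPALIVE;
    return SENDER_IDLE;
}
```

Sending the probe at the halfway point leaves the standby the second half of the timeout to answer before the connection is declared dead.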

| Concept | PostgreSQL name |
| --- | --- |
| Replication sender process | walsender (B_WAL_SENDER BackendType) |
| Replication receiver process | walreceiver (B_WAL_RECEIVER BackendType) |
| Handshake commands | IDENTIFY_SYSTEM, TIMELINE_HISTORY |
| Stream-start command | START_REPLICATION |
| Streaming sub-protocol | CopyBothResponse (bidirectional COPY) |
| WAL data message (primary→standby) | 'w' CopyData with (dataStart, walEnd, sendTime) header |
| Keepalive message (primary→standby) | 'k' CopyData with (walEnd, sendTime, replyRequested) |
| Acknowledgement (standby→primary) | 'r' CopyData with (write, flush, apply, sendTime, replyRequested) |
| Hot-standby feedback message | 'h' CopyData with (xmin, epoch, catalogXmin, ...) |
| Per-sender shared-memory slot | WalSnd struct in WalSndCtl->walsnds[] |
| Receiver shared-memory state | WalRcvData (global WalRcv) |
| Receiver state machine | WalRcvState enum (WALRCV_STOPPED through WALRCV_STOPPING) |
| Sender state machine | WalSndState enum (WALSNDSTATE_STARTUP through WALSNDSTATE_STOPPING) |
| Receiver flush pointer (shared memory) | WalRcv->flushedUpto |
| Replication slot (WAL retention) | MyReplicationSlot (see postgres-replication-slots.md) |

A walsender starts life as a regular backend: the postmaster fork()s a child to handle the incoming connection, authenticates it, and only then determines — from the connection’s replication= parameter — that it should become a walsender. The file header of walsender.c says it plainly: “A walsender is similar to a regular backend, ie. there is a one-to-one relationship between a connection and a walsender process, but instead of processing SQL queries, it understands a small set of special replication-mode commands.”

InitWalSender (called from PostgresMain after the replication decision) marks the process as a walsender and claims a slot in the WalSndCtl->walsnds[] array, which has max_wal_senders entries:

// WalSnd (struct) — src/include/replication/walsender_private.h
typedef struct WalSnd
{
    pid_t           pid;            /* this walsender's PID, or 0 if not active */
    WalSndState     state;          /* this walsender's state */
    XLogRecPtr      sentPtr;        /* WAL has been sent up to this point */
    bool            needreload;     /* does currently-open file need reloading? */
    XLogRecPtr      write;          /* standby's write position */
    XLogRecPtr      flush;          /* standby's flush position */
    XLogRecPtr      apply;          /* standby's apply position */
    TimeOffset      writeLag;
    TimeOffset      flushLag;
    TimeOffset      applyLag;
    int             sync_standby_priority;
    slock_t         mutex;
    TimestampTz     replyTime;
    ReplicationKind kind;
} WalSnd;

The five state values span the life of a sender:

// WalSndState — src/include/replication/walsender_private.h
typedef enum WalSndState
{
    WALSNDSTATE_STARTUP = 0,
    WALSNDSTATE_BACKUP,
    WALSNDSTATE_CATCHUP,
    WALSNDSTATE_STREAMING,
    WALSNDSTATE_STOPPING,
} WalSndState;

CATCHUP is the initial streaming state — the standby is behind the primary. The transition to STREAMING fires when WalSndLoop observes WalSndCaughtUp true and flips it via WalSndSetState.

The walsender’s main loop (PostgresMain in tcop) calls exec_replication_command for every message it receives in replication mode. That function parses the command with a dedicated replication_scanner / replication_yyparse, then dispatches:

  • IDENTIFY_SYSTEM — returns the cluster system identifier, timeline, flush LSN, and (for db-mode) database name. The standby uses this to verify it is connecting to the right primary.
  • TIMELINE_HISTORY tli — returns the contents of the timeline history file for the requested timeline, so the standby can follow a switchover.
  • START_REPLICATION — enters streaming mode; the entry point is StartReplication for physical streams and StartLogicalReplication for logical.

The commands form a small grammar (repl_gram.y, repl_scanner.l) that is separate from the SQL parser; physical walsenders do not process SQL.
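To make the command grammar concrete, the following sketch builds the text of a physical START_REPLICATION command as a standby-side client might send it. It is an illustration under stated assumptions: the helper `build_start_replication` is hypothetical, and the layout follows the documented form `START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a physical START_REPLICATION command into buf.
 * slot_name may be NULL or empty to stream without a slot.
 * Returns the command length, or -1 if buf is too small.
 */
static int build_start_replication(char *buf, size_t buflen,
                                   const char *slot_name,
                                   uint64_t start_lsn, uint32_t timeline)
{
    unsigned int hi = (unsigned int) (start_lsn >> 32);
    unsigned int lo = (unsigned int) (start_lsn & 0xFFFFFFFFu);
    int n;

    assert(buf != NULL && buflen > 0);

    if (slot_name != NULL && strlen(slot_name) > 0)
        n = snprintf(buf, buflen,
                     "START_REPLICATION SLOT \"%s\" PHYSICAL %X/%X TIMELINE %u",
                     slot_name, hi, lo, (unsigned int) timeline);
    else
        n = snprintf(buf, buflen,
                     "START_REPLICATION PHYSICAL %X/%X TIMELINE %u",
                     hi, lo, (unsigned int) timeline);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n >= 0 && (size_t) n < buflen) ? n : -1;
}
```

The string is sent as an ordinary simple-query message on a connection opened with the replication parameter; the replication grammar, not the SQL parser, interprets it.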

StartReplication resolves the timeline (current or historical), sends a CopyBothResponse to put the connection in bidirectional COPY mode, sets sentPtr to the requested start position, calls SyncRepInitConfig, and then hands off to the main send loop:

// StartReplication — src/backend/replication/walsender.c
WalSndSetState(WALSNDSTATE_CATCHUP);
/* ... send CopyBothResponse ... */
sentPtr = cmd->startpoint;
SpinLockAcquire(&MyWalSnd->mutex);
MyWalSnd->sentPtr = sentPtr;
SpinLockRelease(&MyWalSnd->mutex);
SyncRepInitConfig();
replication_active = true;
WalSndLoop(XLogSendPhysical);

WalSndLoop(send_data) is the inner steady-state loop. It iterates until both sides exchange CopyDone:

// WalSndLoop — src/backend/replication/walsender.c
for (;;)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
    /* handle config reload, check for replies */
    ProcessRepliesIfAny();
    if (streamingDoneReceiving && streamingDoneSending && !pq_is_send_pending())
        break;
    if (!pq_is_send_pending())
        send_data();    /* XLogSendPhysical or XLogSendLogical */
    else
        WalSndCaughtUp = false;
    if (pq_flush_if_writable() != 0)
        WalSndShutdown();
    /* CATCHUP -> STREAMING transition when caught up */
    if (WalSndCaughtUp && !pq_is_send_pending())
    {
        if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
            WalSndSetState(WALSNDSTATE_STREAMING);
        if (got_SIGUSR2)
            WalSndDone(send_data);
    }
    WalSndCheckTimeOut();
    WalSndKeepaliveIfNecessary();
    /* block on latch when caught up or output buffer full */
    WalSndWait(...);
}

The loop’s structure is: process any standby replies, fill the output buffer with more WAL if it is empty, flush what can be written without blocking, and sleep on a latch when idle.

flowchart TB
  A["WalSndLoop iteration start<br/>ResetLatch"] --> B["ProcessRepliesIfAny<br/>(drain standby messages)"]
  B --> C{CopyDone both ways?}
  C -- yes --> EXIT["exit loop"]
  C -- no --> D{output buffer empty?}
  D -- yes --> E["send_data<br/>XLogSendPhysical or XLogSendLogical"]
  D -- no --> F["WalSndCaughtUp = false"]
  E --> G["pq_flush_if_writable"]
  F --> G
  G --> H{WalSndCaughtUp<br/>and buffer empty?}
  H -- yes --> I["if CATCHUP: WalSndSetState STREAMING<br/>if got_SIGUSR2: WalSndDone"]
  H -- no --> J["WalSndCheckTimeOut<br/>WalSndKeepaliveIfNecessary"]
  I --> J
  J --> K["WalSndWait on latch<br/>(sleep when idle)"]
  K --> A

Figure 1 — The steady-state WalSndLoop iteration. The loop drains standby messages, sends WAL, flushes the output buffer, and sleeps on a latch when the standby is caught up. The CATCHUP → STREAMING transition fires here. (Flow from WalSndLoop in walsender.c.)

XLogSendPhysical: reading WAL and packaging it

XLogSendPhysical is the physical-stream data source. It decides how far it can safely send (GetFlushRecPtr on a primary, GetStandbyFlushRecPtr on a cascading standby), then reads WAL from disk via WALRead in chunks of at most MAX_SEND_SIZE (16 WAL blocks, 128 kB at the default 8 kB XLOG_BLCKSZ), wraps each chunk in the streaming protocol header, and enqueues it with pq_putmessage_noblock:

// XLogSendPhysical (excerpt) — src/backend/replication/walsender.c
/* How far can we send? */
SendRqstPtr = GetFlushRecPtr(NULL); /* primary path */
/* ... read WAL into output_message ... */
/* header: dataStart (int64), walEnd (int64), sendTime (int64) */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'w');
pq_sendint64(&output_message, sentPtr); /* dataStart */
pq_sendint64(&output_message, SendRqstPtr); /* walEnd */
pq_sendint64(&output_message, sendTime);
pq_sendbytes(&output_message, (char *) buf, nbytes);
pq_putmessage_noblock('d', ...);

The 'w' message type byte identifies WAL data; dataStart is the LSN of the first byte in the payload; walEnd is the primary’s current flush pointer (telling the standby “nothing new beyond this point yet”); sendTime is used to measure replication lag.
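The header layout just described can be sketched as a byte-level encoder and decoder. This is a minimal illustration, assuming the three integers are 64-bit values in network (big-endian) byte order as in the rest of the wire protocol; `WalDataHeader` and the helper functions are hypothetical names, not PostgreSQL symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAL_DATA_HDR_LEN 25     /* 1 type byte + three 8-byte integers */

typedef struct
{
    uint64_t data_start;    /* LSN of the first payload byte */
    uint64_t wal_end;       /* sender's current end of WAL */
    int64_t  send_time;     /* sender's clock at transmission */
} WalDataHeader;

/* Store v at p, most significant byte first. */
static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--)
    {
        p[i] = (uint8_t) (v & 0xFF);
        v >>= 8;
    }
}

/* Load a big-endian 64-bit value from p. */
static uint64_t get_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Write the header into out (at least WAL_DATA_HDR_LEN bytes). */
static size_t wal_data_header_encode(uint8_t *out, const WalDataHeader *h)
{
    assert(out != NULL && h != NULL);
    out[0] = 'w';
    put_be64(out + 1, h->data_start);
    put_be64(out + 9, h->wal_end);
    put_be64(out + 17, (uint64_t) h->send_time);
    return WAL_DATA_HDR_LEN;
}

/* Parse a header; returns 0 on success, -1 if it is not a 'w' message. */
static int wal_data_header_decode(const uint8_t *in, size_t len,
                                  WalDataHeader *h)
{
    assert(in != NULL && h != NULL);
    if (len < WAL_DATA_HDR_LEN || in[0] != 'w')
        return -1;
    h->data_start = get_be64(in + 1);
    h->wal_end    = get_be64(in + 9);
    h->send_time  = (int64_t) get_be64(in + 17);
    return 0;
}
```

Everything after the 25th byte of the CopyData payload is raw WAL, to be written at the position given by `data_start`.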

ProcessRepliesIfAny drains, without blocking, any messages the standby has sent. Two message types matter:

Status update ('r') — the standby reports its write, flush, and apply positions. ProcessStandbyReplyMessage unpacks them, updates the WalSnd struct under the spinlock, and calls SyncRepReleaseWaiters if this is not a cascading standby (so synchronous-commit waiters on the primary can proceed):

// ProcessStandbyReplyMessage — src/backend/replication/walsender.c
writePtr = pq_getmsgint64(&reply_message);
flushPtr = pq_getmsgint64(&reply_message);
applyPtr = pq_getmsgint64(&reply_message);
/* ... compute lag times ... */
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
walsnd->apply = applyPtr;
walsnd->writeLag = writeLag;
walsnd->flushLag = flushLag;
walsnd->applyLag = applyLag;
SpinLockRelease(&walsnd->mutex);
if (!am_cascading_walsender)
    SyncRepReleaseWaiters();

Hot-standby feedback ('h') — the standby reports its oldest active xmin. ProcessStandbyHSFeedbackMessage calls PhysicalReplicationSlotNewXmin (if a slot is active) or directly updates MyProc->xmin, preventing the primary’s vacuum from removing rows the standby might still need.
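The effect of the reported xmin on row removal can be sketched as taking the older of two transaction IDs. Transaction IDs are 32-bit counters that wrap around, so the comparison is circular rather than a plain less-than. This is a simplified model that ignores reserved low-numbered XIDs; `xid_precedes` and `removal_horizon` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

#define INVALID_XID ((Xid) 0)   /* "no feedback received" in this sketch */

/*
 * Circular comparison: a precedes b when a is behind b by less than
 * 2^31 transactions, even if the counter wrapped in between.
 */
static bool xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Rows deleted by transactions older than the returned XID may be
 * removed. Feedback from the standby can only hold the horizon back.
 */
static Xid removal_horizon(Xid primary_oldest_xmin, Xid standby_feedback_xmin)
{
    assert(primary_oldest_xmin != INVALID_XID);

    if (standby_feedback_xmin != INVALID_XID &&
        xid_precedes(standby_feedback_xmin, primary_oldest_xmin))
        return standby_feedback_xmin;
    return primary_oldest_xmin;
}
```

A long-running query on the standby therefore has the same bloat cost on the primary as a long-running local transaction, which is the trade-off hot_standby_feedback makes.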

sequenceDiagram
  participant P as Primary (walsender)
  participant S as Standby (walreceiver)

  S->>P: START_REPLICATION LSN/TLI
  P->>S: CopyBothResponse
  loop steady-state streaming
    P->>S: 'w' WAL data (dataStart, walEnd, sendTime)
    S->>P: 'r' reply (write, flush, apply, time)
    P->>S: 'k' keepalive (walEnd, time, replyRequested)
    S->>P: 'r' reply (triggered by keepalive)
    Note over S: hot_standby_feedback on
    S->>P: 'h' HS feedback (xmin, catalogXmin)
  end
  S->>P: CopyDone
  P->>S: CopyDone

Figure 2 — The streaming protocol message exchange. The primary sends 'w' WAL messages and 'k' keepalives; the standby sends 'r' status replies and 'h' hot-standby feedback. Both sides exchange CopyDone to end the stream. (Protocol from the PostgreSQL documentation and walsender.c / walreceiver.c.)

The walreceiver runs WalReceiverMain, launched by the postmaster in response to PMSIGNAL_START_WALRECEIVER sent by the startup process when WAL recovery has exhausted archive/local WAL and streaming is configured.

The startup process fills in WalRcvData->conninfo, ->slotname, and ->receiveStart before signalling the postmaster. WalReceiverMain reads those values from the shared-memory WalRcvData struct, marks itself WALRCV_STREAMING, then dynamically loads libpqwalreceiver to get the actual libpq transport functions:

// WalReceiverMain — src/backend/replication/walreceiver.c
SpinLockAcquire(&walrcv->mutex);
/* ... validate the current state ... */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_STREAMING;
strlcpy(conninfo, walrcv->conninfo, MAXCONNINFO);
startpoint = walrcv->receiveStart;
startpointTLI = walrcv->receiveStartTLI;
SpinLockRelease(&walrcv->mutex);
load_file("libpqwalreceiver", false);
/* ... connect to primary ... */
wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);

After the IDENTIFY_SYSTEM handshake verifies the system identifier matches, the loop calls walrcv_startstreaming (which issues START_REPLICATION) and then receives messages in a walrcv_receive loop, dispatching each to XLogWalRcvProcessMsg.
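The handshake check can be sketched as two comparisons made before any WAL flows. The system-identifier comparison is the one described above; the timeline comparison is an assumption of this sketch, included to show how a primary that is behind the standby's recovery timeline would be rejected early. `HandshakeResult` and `check_identify_system` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    HANDSHAKE_OK,
    HANDSHAKE_WRONG_CLUSTER,            /* system identifiers differ */
    HANDSHAKE_PRIMARY_TIMELINE_BEHIND   /* primary cannot serve our timeline */
} HandshakeResult;

/*
 * Compare what IDENTIFY_SYSTEM returned with what the standby expects.
 * start_tli is the timeline the standby wants to stream from.
 */
static HandshakeResult check_identify_system(uint64_t local_sysid,
                                             uint64_t primary_sysid,
                                             uint32_t start_tli,
                                             uint32_t primary_tli)
{
    assert(start_tli >= 1);     /* timeline IDs start at 1 */

    if (primary_sysid != local_sysid)
        return HANDSHAKE_WRONG_CLUSTER;
    if (primary_tli < start_tli)
        return HANDSHAKE_PRIMARY_TIMELINE_BEHIND;
    return HANDSHAKE_OK;
}
```

Only a result of `HANDSHAKE_OK` would lead on to START_REPLICATION; either failure is reported instead of streaming WAL that could never be replayed.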

The WalRcvData shared structure is the receiver’s communication channel with the startup process and with cascading walsenders:

// WalRcvData (excerpt) — src/include/replication/walreceiver.h
typedef struct
{
    ProcNumber        procno;           /* walreceiver's proc number */
    pid_t             pid;
    WalRcvState       walRcvState;
    ConditionVariable walRcvStoppedCV;
    XLogRecPtr        receiveStart;     /* where startup told us to begin */
    TimeLineID        receiveStartTLI;
    XLogRecPtr        flushedUpto;      /* last byte fsynced to pg_wal */
    TimeLineID        receivedTLI;
    XLogRecPtr        latestChunkStart; /* previous flushedUpto before last flush */
    /* ... timing, conninfo, slotname ... */
    pg_atomic_uint64  writtenUpto;      /* written (not yet fsynced) */
    sig_atomic_t      force_reply;
} WalRcvData;

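flushedUpto is the canonical progress indicator: after each XLogWalRcvFlush call the startup process wakes up and can advance WAL replay to that point. writtenUpto is an atomic that advances ahead of flushedUpto; it exposes how far WAL has been written but not yet fsynced (reported as written_lsn in pg_stat_wal_receiver). A cascading walsender takes its send limit from GetStandbyFlushRecPtr, so whether it can ever serve written-but-unsynced WAL depends on which of the two pointers that function reads.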

XLogWalRcvProcessMsg, XLogWalRcvWrite, XLogWalRcvFlush

XLogWalRcvProcessMsg dispatches on the one-byte message type:

  • 'w' (WAL data) — strips the three-int64 header, calls XLogWalRcvWrite with the payload bytes.
  • 'k' (keepalive) — reads walEnd and sendTime, optionally triggers an immediate reply.

XLogWalRcvWrite appends bytes to the current WAL segment file in pg_wal, opening a new segment via XLogFileInit when the current one is full, and updating the writtenUpto atomic:

// XLogWalRcvWrite (excerpt) — src/backend/replication/walreceiver.c
while (nbytes > 0)
{
    int         segbytes;

    if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
    {
        /* open/create new segment */
        XLByteToSeg(recptr, recvSegNo, wal_segment_size);
        recvFile = XLogFileInit(recvSegNo, tli);
    }
    startoff = XLogSegmentOffset(recptr, wal_segment_size);
    /* never write past the end of the current segment */
    if (startoff + nbytes > wal_segment_size)
        segbytes = wal_segment_size - startoff;
    else
        segbytes = nbytes;
    byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
    /* ... error handling ... */
    recptr += byteswritten;
    nbytes -= byteswritten;
    buf += byteswritten;
    LogstreamResult.Write = recptr;
}
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
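The segment-boundary arithmetic in that loop can be isolated as a function that splits one incoming chunk into per-segment writes. This is a standalone sketch with no file I/O; `SegChunk` and `split_by_segment` are hypothetical names, and the segment size is passed in rather than read from server configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous write that stays inside a single segment file. */
typedef struct
{
    uint64_t segno;     /* which segment file */
    uint64_t offset;    /* byte offset inside that file */
    uint64_t length;    /* number of bytes to write there */
} SegChunk;

/*
 * Split the byte range [recptr, recptr + nbytes) at segment boundaries.
 * Fills at most max_chunks entries of out and returns how many it used.
 */
static size_t split_by_segment(uint64_t recptr, uint64_t nbytes,
                               uint64_t seg_size,
                               SegChunk *out, size_t max_chunks)
{
    size_t n = 0;

    assert(seg_size > 0 && out != NULL);

    while (nbytes > 0 && n < max_chunks)
    {
        uint64_t startoff = recptr % seg_size;
        uint64_t segbytes = seg_size - startoff;    /* room left in segment */

        if (segbytes > nbytes)
            segbytes = nbytes;

        out[n].segno  = recptr / seg_size;
        out[n].offset = startoff;
        out[n].length = segbytes;
        n++;

        recptr += segbytes;
        nbytes -= segbytes;
    }
    return n;
}
```

A message that straddles a boundary yields two chunks, matching the point in the loop where the receiver closes one segment file and opens the next.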

XLogWalRcvFlush fsyncs the current file via issue_xlog_fsync, advances LogstreamResult.Flush, and then — crucially — wakes the startup process and any cascading walsender:

// XLogWalRcvFlush (excerpt) — src/backend/replication/walreceiver.c
issue_xlog_fsync(recvFile, recvSegNo, tli);
LogstreamResult.Flush = LogstreamResult.Write;
SpinLockAcquire(&walrcv->mutex);
if (walrcv->flushedUpto < LogstreamResult.Flush)
{
    walrcv->latestChunkStart = walrcv->flushedUpto;
    walrcv->flushedUpto = LogstreamResult.Flush;
    walrcv->receivedTLI = tli;
}
SpinLockRelease(&walrcv->mutex);
WakeupRecovery();
if (AllowCascadeReplication())
    WalSndWakeup(true, false);
/* send reply and hot-standby feedback back to primary */
XLogWalRcvSendReply(false, false);
XLogWalRcvSendHSFeedback(false);

WakeupRecovery() signals the startup process latch; the startup process reads GetWalRcvFlushRecPtr and advances recovery up to that LSN.

The WalRcvState enum tracks the walreceiver’s life from the startup process’s perspective:

stateDiagram-v2
  [*] --> WALRCV_STOPPED
  WALRCV_STOPPED --> WALRCV_STARTING : RequestXLogStreaming\nlaunches process
  WALRCV_STARTING --> WALRCV_STREAMING : WalReceiverMain\ninitializes
  WALRCV_STREAMING --> WALRCV_WAITING : primary ends stream\nbut stays connected
  WALRCV_WAITING --> WALRCV_RESTARTING : startup nudges\nexisting process
  WALRCV_RESTARTING --> WALRCV_STREAMING : process loops\nand reconnects
  WALRCV_STREAMING --> WALRCV_STOPPING : ShutdownWalRcv\nor SIGTERM
  WALRCV_WAITING --> WALRCV_STOPPING : ShutdownWalRcv
  WALRCV_STOPPING --> WALRCV_STOPPED : WalRcvDie\non shmem_exit
  WALRCV_STOPPED --> [*]

Figure 3 — WalRcvState lifecycle. RequestXLogStreaming (called by the startup process) transitions STOPPED → STARTING and signals the postmaster to fork; if a WAITING receiver already exists, it transitions WAITING → RESTARTING and wakes the existing process instead of forking a new one. (States from walreceiver.h; transitions from walreceiverfuncs.c and walreceiver.c.)
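The lifecycle in Figure 3 can be written down as a transition table, which is a convenient way to check a proposed state change against the diagram. This is a model of the figure only, not of the server's code; the `RCV_` names and `rcv_transition_allowed` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the six receiver states in Figure 3. */
typedef enum
{
    RCV_STOPPED,
    RCV_STARTING,
    RCV_STREAMING,
    RCV_WAITING,
    RCV_RESTARTING,
    RCV_STOPPING,
    RCV_NSTATES
} RcvState;

/* Is the edge from -> to drawn in the state diagram? */
static bool rcv_transition_allowed(RcvState from, RcvState to)
{
    /* Entries not listed default to false. */
    static const bool allowed[RCV_NSTATES][RCV_NSTATES] = {
        [RCV_STOPPED]    = { [RCV_STARTING]   = true },
        [RCV_STARTING]   = { [RCV_STREAMING]  = true },
        [RCV_STREAMING]  = { [RCV_WAITING]    = true, [RCV_STOPPING] = true },
        [RCV_WAITING]    = { [RCV_RESTARTING] = true, [RCV_STOPPING] = true },
        [RCV_RESTARTING] = { [RCV_STREAMING]  = true },
        [RCV_STOPPING]   = { [RCV_STOPPED]    = true },
    };

    assert(from < RCV_NSTATES && to < RCV_NSTATES);
    return allowed[from][to];
}
```

The table makes the restart path visible: a receiver can return to streaming from WAITING only by passing through RESTARTING, which is the state the startup process sets when it nudges an existing process.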

RequestXLogStreaming in walreceiverfuncs.c is the startup-process entry point. It fills in WalRcvData under the spinlock, then either sends PMSIGNAL_START_WALRECEIVER to the postmaster (first start) or wakes the existing receiver’s latch (restart):

// RequestXLogStreaming — src/backend/replication/walreceiverfuncs.c
SpinLockAcquire(&walrcv->mutex);
if (walrcv->walRcvState == WALRCV_STOPPED)
{
    launch = true;
    walrcv->walRcvState = WALRCV_STARTING;
    strlcpy(walrcv->conninfo, conninfo, MAXCONNINFO);
}
else
    walrcv->walRcvState = WALRCV_RESTARTING;
walrcv->receiveStart = recptr;
walrcv->receiveStartTLI = tli;
walrcv_proc = walrcv->procno;
SpinLockRelease(&walrcv->mutex);
if (launch)
    SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
else if (walrcv_proc != INVALID_PROC_NUMBER)
    SetLatch(&GetPGProcByNumber(walrcv_proc)->procLatch);

The walreceiver does not link libpq directly into the server binary. Instead, walreceiver.c calls through the WalReceiverFunctions function pointer table, which is populated by load_file("libpqwalreceiver") at startup. The README’s first paragraph captures the rationale: “The transport-specific part of walreceiver … is loaded dynamically to avoid having to link the main server binary with libpq.”

The function pointer table is defined in walreceiver.h as WalReceiverFunctionsType; the actual libpq implementation lives in replication/libpqwalreceiver/. The indirection means a third party could in principle supply an alternative transport (e.g. RDMA, a shared-memory channel for same-host standbys) by providing a compatible shared library, though this API is currently described as “internal.”

The postmaster-driven shutdown differs from a regular backend’s. The README explains: walsenders must deliver the shutdown checkpoint record to standbys before terminating. The sequence is:

  1. After all regular backends have exited, the checkpointer sends PROCSIG_WALSND_INIT_STOPPING to every walsender.
  2. Each walsender transitions to WALSNDSTATE_STOPPING, rejects new commands, and signals readiness.
  3. Checkpointer begins the shutdown checkpoint only once all walsenders confirm stopping (WalSndWaitStopping).
  4. When the shutdown checkpoint finishes, the postmaster sends SIGUSR2 to each walsender.
  5. WalSndDone flushes any remaining WAL, waits for standby acknowledgement, then calls proc_exit(0).

This choreography ensures standbys receive the shutdown checkpoint record, so after a clean primary shutdown a connected standby has received all WAL the primary wrote and can be promoted without losing committed transactions.

Symbols grouped by process/subsystem. Files are under /data/hgryoo/references/postgres/.

Walsender: shared memory and state (walsender_private.h, walsender.c)

  • WalSndState (enum) — WALSNDSTATE_STARTUP through WALSNDSTATE_STOPPING.
  • WalSnd (struct) — per-sender slot: pid, state, sentPtr, write, flush, apply, lag offsets, mutex, replyTime, kind.
  • WalSndCtlData (struct) — the shared control block: SyncRepQueue[], lsn[], sync_standbys_status, wal_flush_cv, wal_replay_cv, wal_confirm_rcv_cv, walsnds[] (flexible array).
  • WalSndCtl — global pointer to WalSndCtlData.
  • MyWalSnd — this process’s WalSnd slot.
  • WalSndShmemSize / WalSndShmemInit — allocate/initialize the WalSndCtlData shmem block.
  • WalSndSetState — transition MyWalSnd->state and log the event.
  • NUM_SYNC_REP_WAIT_MODE — array dimension for SyncRepQueue[] / lsn[].
  • InitWalSender — claim a WalSnd slot; called from PostgresMain.
  • exec_replication_command — parse and dispatch a replication command string; returns false if not a replication command (SQL passthrough for db-mode walsenders).
  • IdentifySystem — handle IDENTIFY_SYSTEM; returns sysid, timeline, LSN, dbname.
  • StartReplication — handle START_REPLICATION (physical); sets up xlogreader, resolves timeline, enters COPY mode, calls WalSndLoop(XLogSendPhysical).
  • StartLogicalReplication — handle START_REPLICATION (logical); calls WalSndLoop(XLogSendLogical).
  • WalSndLoop — the steady-state loop; dispatches to send_data, manages keepalives, processes replies, detects catchup→streaming transition, handles shutdown.
  • XLogSendPhysical — compute SendRqstPtr (flush pointer on primary, standby flush ptr on cascading), read WAL via WALRead, enqueue 'w' message.
  • XLogSendLogical — pull decoded changes from the logical decoding context; calls WalSndWriteData via the LogicalDecodingContext output plugin callback.
  • ProcessRepliesIfAny — non-blocking drain of the client socket; dispatches 'd' CopyData to ProcessStandbyMessage, handles CopyDone and Terminate.
  • ProcessStandbyMessage / ProcessStandbyReplyMessage / ProcessStandbyHSFeedbackMessage — update WalSnd slots; release sync waiters; update slot xmin.
  • WalSndCheckTimeOut — enforce wal_sender_timeout.
  • WalSndKeepalive / WalSndKeepaliveIfNecessary — send 'k' message.
  • WalSndDone — graceful shutdown: drain remaining WAL, wait for standby ack, proc_exit(0).
  • WalSndWakeup / WalSndWakeupRequest / WalSndWakeupProcessRequests — condition-variable / latch wakeup from WAL flush path.

Walreceiver functions (walreceiverfuncs.c)

  • RequestXLogStreaming — startup-process entry: fill WalRcvData, transition state, signal postmaster or nudge existing receiver.
  • GetWalRcvFlushRecPtr — return flushedUpto (and optionally latestChunkStart, receivedTLI) under spinlock.
  • WalRcvRunning / WalRcvStreaming — status predicates; detect WALRCV_STARTUP_TIMEOUT and force-stop if startup took too long.
  • ShutdownWalRcv — send SIGTERM and wait on walRcvStoppedCV.
  • WalReceiverMain — the main process function: AuxiliaryProcessMainCommon, claim WalRcvData, load libpqwalreceiver, connect, handshake, stream loop.
  • WalRcvWaitForStartPosition — when the primary ends a stream without disconnecting, wait here for WALRCV_RESTARTING or termination.
  • XLogWalRcvProcessMsg — dispatch 'w' to XLogWalRcvWrite and 'k' to keepalive reply.
  • XLogWalRcvWrite — segment-aligned pg_pwrite loop; updates writtenUpto atomic.
  • XLogWalRcvFlush — issue_xlog_fsync, advance flushedUpto, signal startup process and cascading walsenders, send reply.
  • XLogWalRcvSendReply — build and send 'r' status reply.
  • XLogWalRcvSendHSFeedback — build and send 'h' feedback when hot_standby_feedback is on.
  • WalRcvState (enum) — WALRCV_STOPPED, WALRCV_STARTING, WALRCV_STREAMING, WALRCV_WAITING, WALRCV_RESTARTING, WALRCV_STOPPING.
  • WalRcvData (struct) — procno, pid, walRcvState, walRcvStoppedCV, receiveStart/TLI, flushedUpto, receivedTLI, latestChunkStart, writtenUpto (atomic), force_reply, conninfo, slotname, timing fields.
  • WalRcv — global pointer to the single WalRcvData shmem block.
  • WalReceiverFunctionsType / WalReceiverFunctions — the dispatch table populated by libpqwalreceiver.
  • AllowCascadeReplication() — macro: EnableHotStandby && max_wal_senders > 0.

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| WalSndState (enum) | src/include/replication/walsender_private.h | 24 |
| WalSnd (struct) | src/include/replication/walsender_private.h | 41 |
| WalSndCtlData (struct) | src/include/replication/walsender_private.h | 80 |
| WalRcvState (enum) | src/include/replication/walreceiver.h | 46 |
| WalRcvData (struct) | src/include/replication/walreceiver.h | 56 |
| AllowCascadeReplication (macro) | src/include/replication/walreceiver.h | 42 |
| am_walsender | src/include/replication/walsender.h | 22 |
| max_wal_senders | src/backend/replication/walsender.c | 126 |
| InitWalSender | src/backend/replication/walsender.c | 297 |
| IdentifySystem | src/backend/replication/walsender.c | 397 |
| StartLogicalReplication | src/backend/replication/walsender.c | 1447 |
| exec_replication_command | src/backend/replication/walsender.c | 1990 |
| ProcessRepliesIfAny | src/backend/replication/walsender.c | 2246 |
| ProcessStandbyReplyMessage | src/backend/replication/walsender.c | 2423 |
| ProcessStandbyHSFeedbackMessage | src/backend/replication/walsender.c | 2611 |
| WalSndCheckTimeOut | src/backend/replication/walsender.c | 2779 |
| WalSndLoop | src/backend/replication/walsender.c | 2806 |
| InitWalSenderSlot | src/backend/replication/walsender.c | 2948 |
| XLogSendPhysical | src/backend/replication/walsender.c | 3118 |
| WalSndDone | src/backend/replication/walsender.c | 3521 |
| WalSndShmemSize | src/backend/replication/walsender.c | 3669 |
| WalSndShmemInit | src/backend/replication/walsender.c | 3681 |
| WalSndSetState | src/backend/replication/walsender.c | 3869 |
| WalSndKeepalive | src/backend/replication/walsender.c | 4094 |
| WalSndKeepaliveIfNecessary | src/backend/replication/walsender.c | 4117 |
| WalReceiverMain | src/backend/replication/walreceiver.c | 152 |
| WalRcvWaitForStartPosition | src/backend/replication/walreceiver.c | 645 |
| XLogWalRcvProcessMsg | src/backend/replication/walreceiver.c | 819 |
| XLogWalRcvWrite | src/backend/replication/walreceiver.c | 890 |
| XLogWalRcvFlush | src/backend/replication/walreceiver.c | 985 |
| WalRcvRunning | src/backend/replication/walreceiverfuncs.c | 76 |
| WalRcvStreaming | src/backend/replication/walreceiverfuncs.c | 127 |
| RequestXLogStreaming | src/backend/replication/walreceiverfuncs.c | 246 |
| GetWalRcvFlushRecPtr | src/backend/replication/walreceiverfuncs.c | 336 |

  • A walsender is a backend variant, not a dedicated auxiliary process. Verified in walsender.c file header: “A walsender is similar to a regular backend, ie. there is a one-to-one relationship between a connection and a walsender process.” InitWalSender is called from PostgresMain after the replication= parameter is detected. The BackendType is B_WAL_SENDER.

  • max_wal_senders is a GUC with a default of 10; the WalSndCtl shmem block scales with it. Verified: int max_wal_senders = 10 in walsender.c line 126; WalSndShmemSize returns offsetof(WalSndCtlData, walsnds) + max_wal_senders * sizeof(WalSnd) (line 3674). Changing max_wal_senders requires a server restart (shmem is allocated at postmaster startup).

  • The streaming protocol uses CopyBoth mode over the standard wire protocol. Verified in StartReplication: pq_beginmessage(&buf, PqMsg_CopyBothResponse) is sent before entering WalSndLoop. The WAL data messages use pq_putmessage_noblock('d', ...) (CopyData).

  • Three LSN cursors (write / flush / apply) are reported by the standby and stored in WalSnd under a spinlock. Verified in ProcessStandbyReplyMessage: writePtr, flushPtr, applyPtr are unpacked from the 'r' message and stored into walsnd->write, ->flush, ->apply under SpinLockAcquire(&walsnd->mutex).

  • flushedUpto is updated only after issue_xlog_fsync in XLogWalRcvFlush. The advance of flushedUpto in WalRcvData is inside the spinlock after issue_xlog_fsync returns. Writes advance writtenUpto (atomic, no lock) earlier, so the written-but-unsynced position is visible separately while the flushedUpto guarantee stays clean.

  • WakeupRecovery() is called directly after updating flushedUpto. Verified in XLogWalRcvFlush: the call to WakeupRecovery() is immediately after the spinlock release that advances flushedUpto. This is the mechanism by which the startup process learns that WAL replay can advance.

  • The walreceiver loads the libpq-based libpqwalreceiver module dynamically via load_file. Verified in WalReceiverMain: load_file("libpqwalreceiver", false) followed by a check that WalReceiverFunctions is non-NULL, raising an error otherwise. The README confirms the rationale: avoiding linking the server binary with libpq.

  • When the primary ends streaming without disconnecting, the walreceiver enters WALRCV_WAITING and waits for the startup process to issue new instructions. Verified in WalReceiverMain’s loop: after walrcv_endstreaming returns, the process calls WalRcvWaitForStartPosition which sleeps until either WALRCV_RESTARTING is set (startup nudges it) or termination is requested. The startup process then calls RequestXLogStreaming again, which sets WALRCV_RESTARTING and wakes the receiver’s latch without forking a new process.

  • Shutdown choreography gates the shutdown checkpoint on walsender readiness. Verified in the file header comment and WalSndInitStopping / WalSndWaitStopping: the checkpointer calls WalSndInitStopping (which sends PROCSIG_WALSND_INIT_STOPPING to each walsender), then WalSndWaitStopping (which loops until all walsenders are in WALSNDSTATE_STOPPING or WALSNDSTATE_STARTUP). Only then does the checkpointer proceed with the shutdown checkpoint.
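
The standby reply handling described in the list above can be sketched as a small standalone C model. This is a minimal sketch, not the walsender.c code: ModelWalSnd, spin_acquire, spin_release, and process_standby_reply are hypothetical names, and a C11 atomic_flag stands in for PostgreSQL's spinlock. The byte layout assumed here follows the documented Standby Status Update message: the byte 'r', three big-endian 64-bit LSNs (write, flush, apply), a 64-bit timestamp, and a one-byte reply-requested flag.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* an LSN: byte offset into the WAL stream */

/* Hypothetical stand-in for one walsender's shared-memory entry. */
typedef struct ModelWalSnd {
    atomic_flag mutex;              /* stands in for the spinlock */
    XLogRecPtr  write;              /* standby has written up to here */
    XLogRecPtr  flush;              /* standby has fsynced up to here */
    XLogRecPtr  apply;              /* standby has replayed up to here */
} ModelWalSnd;

static void spin_acquire(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                           /* busy-wait; critical section is tiny */
}

static void spin_release(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Parse a status update and publish the three cursors.
 * Returns 0 on success, -1 if the message is not a well-formed 'r' message.
 */
static int process_standby_reply(ModelWalSnd *ws, const unsigned char *msg,
                                 size_t len, int *reply_requested)
{
    if (len != 34 || msg[0] != 'r')
        return -1;

    /* Decode outside the lock so the critical section is three stores. */
    XLogRecPtr w = read_be64(msg + 1);
    XLogRecPtr f = read_be64(msg + 9);
    XLogRecPtr a = read_be64(msg + 17);
    /* Bytes 25..32 carry the standby's send timestamp; unused in this model. */
    *reply_requested = (msg[33] != 0);

    spin_acquire(&ws->mutex);
    ws->write = w;
    ws->flush = f;
    ws->apply = a;
    spin_release(&ws->mutex);
    return 0;
}
```

Decoding before taking the lock mirrors the point of the bullet: the three cursors change together under the mutex, so a reader that takes the same lock never sees a flush value from one reply paired with an apply value from another.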

  1. Does writtenUpto risk serving torn WAL to cascading walsenders? writtenUpto advances in XLogWalRcvWrite before XLogWalRcvFlush is called, so a cascading walsender could read bytes that have been pwrite()’d but not yet fsync()’d. If the standby crashes before fsyncing, those bytes are lost but the cascading standby might already have sent them downstream. Investigation path: read GetStandbyFlushRecPtr (which XLogSendPhysical calls on a cascading sender) vs. GetWalRcvFlushRecPtr — are these the same pointer or different?

  2. How does hot_standby_feedback interact with slot xmin when both are active? ProcessStandbyHSFeedbackMessage calls PhysicalReplicationSlotNewXmin when MyReplicationSlot is set, and otherwise sets MyProc->xmin directly. How the slot’s effective_xmin and a slotless walsender’s proc xmin combine in the global xmin horizon calculation (ComputeXidHorizons, which replaced GetOldestXmin in PostgreSQL 14) is non-obvious. Investigation path: read ReplicationSlotsComputeRequiredXmin and trace how both contribute to the horizon.

  3. WALRCV_STARTUP_TIMEOUT: what happens when walreceiver is slow to start? WalRcvRunning and WalRcvStreaming both check whether WALRCV_STARTING has persisted past WALRCV_STARTUP_TIMEOUT (10 seconds, a compile-time constant) and force-transition to WALRCV_STOPPED. The startup process then re-requests streaming. Is there a risk of a tight loop if the primary is unreachable? Investigation path: trace RequestXLogStreaming re-invocation in xlogrecovery.c.
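
To make the third question concrete, the startup-timeout check can be modeled in a few lines of C. This is a sketch under stated assumptions, not the walreceiverfuncs.c code: ModelWalRcv, model_request_streaming, and model_walrcv_running are hypothetical names, the state enum is reduced to four values, and the timeout is passed as a parameter rather than read from the compile-time constant.

```c
#include <stdbool.h>
#include <time.h>

/* Reduced, hypothetical version of the walreceiver state machine. */
typedef enum {
    MODEL_STOPPED,
    MODEL_STARTING,     /* launch requested, process not yet up */
    MODEL_STREAMING,
    MODEL_WAITING
} ModelWalRcvState;

typedef struct {
    ModelWalRcvState state;
    time_t           start_time;   /* when the launch was requested */
} ModelWalRcv;

/* Models the startup process asking for a walreceiver to be launched. */
static void model_request_streaming(ModelWalRcv *rcv, time_t now)
{
    rcv->state = MODEL_STARTING;
    rcv->start_time = now;          /* every request restarts the clock */
}

/*
 * Models the liveness check: a receiver stuck in STARTING for longer than
 * the timeout is declared dead so that the caller can request a new one.
 */
static bool model_walrcv_running(ModelWalRcv *rcv, time_t now,
                                 int startup_timeout_secs)
{
    if (rcv->state == MODEL_STARTING &&
        (now - rcv->start_time) > startup_timeout_secs)
        rcv->state = MODEL_STOPPED;

    return rcv->state != MODEL_STOPPED;
}
```

In this model each new request resets start_time, so a receiver that never leaves STARTING is retried no more often than once per timeout interval. Whether the real retry path is paced the same way, or by some other wait, is exactly what the investigation path is meant to establish.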

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • MySQL/InnoDB Group Replication and Galera (synchronous multi-master) — PostgreSQL’s walsender/walreceiver is strictly primary-to-standby (one direction, one primary). Group Replication and Galera use a certification-based protocol (Paxos / Galera replication) where every node participates in write ordering. The comparison would clarify what PostgreSQL trades by keeping a simple unidirectional pipe rather than a multi-master ring. The relevant PostgreSQL analog for the write-ordering half is synchronous_commit = remote_apply with multiple synchronous standbys — though this still has a single primary.

  • Raft-based replication (CockroachDB, etcd-backed PostgreSQL HA) — Raft provides leader election and log replication in a single protocol. PostgreSQL separates these concerns: WAL streaming handles log transport, Patroni/Stolon/repmgr handles leader election externally. A note on how the split of concerns maps onto Raft’s log-append, commit-quorum, and leader-lease primitives would be useful for the postgres-evolution-replication.md arc.

  • Logical replication vs. physical streaming — this doc covers only the transport layer that both share. The logical decoding path (XLogSendLogical, LogicalDecodingContext) that turns WAL bytes into decoded row changes is the subject of postgres-logical-decoding.md. The key cross-reference: StartLogicalReplication in walsender.c calls the same WalSndLoop as the physical path — the transport is identical, only the data source differs.

  • Replication lag and the lag-tracking mechanism: LagTrackerWrite and LagTrackerRead in walsender.c record when each WAL position was sent and when the standby acknowledged it. The resulting writeLag, flushLag, applyLag fields in WalSnd feed pg_stat_replication. A dedicated note on lag estimation accuracy (sampling bias, the interaction with CommitDelay, and how large-transaction lags are attributed) would complement the overview in postgres-overview-replication-ha.md.
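
The record-then-match idea behind the lag tracker can be illustrated with a small ring buffer in C. This is a deliberately simplified sketch, not the walsender.c implementation: ModelLagTracker, lag_write, and lag_read are hypothetical names, there is a single read head instead of separate write/flush/apply heads, there is no interpolation between samples, and a full buffer simply drops new samples.

```c
#include <stdint.h>

#define LAG_BUF_SIZE 8              /* illustrative; holds LAG_BUF_SIZE - 1 samples */

typedef uint64_t XLogRecPtr;        /* an LSN */
typedef int64_t  TimestampUs;       /* microseconds on any monotonic scale */

typedef struct {
    XLogRecPtr  lsn;                /* WAL position that was sent */
    TimestampUs sent_at;            /* local time when it was sent */
} LagSample;

typedef struct {
    LagSample buf[LAG_BUF_SIZE];
    int       write_head;           /* next slot to fill */
    int       read_head;            /* oldest unacknowledged sample */
} ModelLagTracker;

/* Record that WAL up to 'lsn' was sent at 'now'. */
static void lag_write(ModelLagTracker *t, XLogRecPtr lsn, TimestampUs now)
{
    int next = (t->write_head + 1) % LAG_BUF_SIZE;

    if (next == t->read_head)
        return;                     /* buffer full: drop this sample */
    t->buf[t->write_head].lsn = lsn;
    t->buf[t->write_head].sent_at = now;
    t->write_head = next;
}

/*
 * The standby acknowledged 'acked' at local time 'now'. Consume every sample
 * at or below that LSN and return the lag of the newest one consumed, or -1
 * if the acknowledgement did not pass any recorded sample.
 */
static TimestampUs lag_read(ModelLagTracker *t, XLogRecPtr acked,
                            TimestampUs now)
{
    TimestampUs sent = -1;

    while (t->read_head != t->write_head &&
           t->buf[t->read_head].lsn <= acked)
    {
        sent = t->buf[t->read_head].sent_at;
        t->read_head = (t->read_head + 1) % LAG_BUF_SIZE;
    }
    return (sent < 0) ? -1 : now - sent;
}
```

Even this reduced model shows one source of the sampling bias mentioned above: lag is only measured at LSNs that happened to be recorded, so an acknowledgement that falls between two samples is attributed to the older one.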

In-tree design docs:

  • src/backend/replication/README — walreceiver IPC, walsender IPC, walsender–walreceiver protocol (“See manual”).

Source files (REL_18_STABLE, commit 273fe94):

  • src/backend/replication/walsender.c — walsender process: command dispatch, send loop, standby reply processing, keepalives, shutdown.
  • src/backend/replication/walreceiver.c — walreceiver process: main loop, message dispatch, write/flush to pg_wal, reply sending.
  • src/backend/replication/walreceiverfuncs.c — startup-process API: RequestXLogStreaming, GetWalRcvFlushRecPtr, WalRcvRunning, WalRcvStreaming, ShutdownWalRcv.
  • src/include/replication/walsender.h — public walsender API, GUC declarations, WalSndWakeupRequest macro.
  • src/include/replication/walsender_private.h: WalSndState, WalSnd, WalSndCtlData.
  • src/include/replication/walreceiver.h: WalRcvState, WalRcvData, WalReceiverFunctionsType.

Textbooks:

  • Database Internals (Petrov), ch. 11 — replication, consistency, leader and follower roles, log shipping vs. streaming.

Cross-references (mechanism owned elsewhere — not duplicated here):

  • postgres-xlog-wal.md — WAL record format, LSN, durability pipeline; the flush pointer that walsender reads via GetFlushRecPtr.
  • postgres-replication-slots.md — slot creation, WAL retention, xmin horizon management.
  • postgres-logical-decoding.md — decoded change stream; the logical path that shares WalSndLoop.
  • postgres-synchronous-replication.md: SyncRepReleaseWaiters, the synchronous_standby_names policy, and the commit-wait path.
  • postgres-overview-replication-ha.md — subcategory router with reading order across all replication-ha docs.
  • postgres-architecture-overview.md — Axis 3, the WAL-centric durability spine that makes streaming replication possible.