PostgreSQL Synchronous Replication — synchronous_commit and the Wait Queue

Asynchronous streaming replication (the transport described in postgres-wal-sender-receiver.md) answers how WAL reaches a standby, but it deliberately does not couple the primary’s COMMIT to the standby’s progress. The primary flushes its own WAL, returns success to the client, and lets the standby catch up whenever the network and the standby’s I/O allow. That decoupling is exactly what makes asynchronous replication fast — and exactly what makes it lossy. If the primary’s disk and the standby both vanish in the window between local flush and remote receipt, an acknowledged commit is gone.

Synchronous replication closes that window by holding the commit. After the primary flushes its commit WAL record locally, the committing backend does not immediately tell the client “committed.” Instead it waits until a configured set of standbys confirm that they too hold (or have flushed, or have applied) that same LSN. Only then does the backend return. The price is latency: every synchronous commit now pays at least one primary→standby network round trip on top of the local fsync.

This is the classic durability-vs-latency trade that every replicated log system must make. Designing Data-Intensive Applications (Kleppmann, ch. 5, “Replication”) frames it as synchronous vs. asynchronous follower replication: a synchronous follower guarantees an up-to-date copy at the cost of blocking the leader if the follower is slow or down; an asynchronous follower never blocks the leader but can lose recently acknowledged writes on failover. Kleppmann’s pragmatic middle ground — semi-synchronous replication, where one follower is synchronous and the rest asynchronous — is precisely the shape of PostgreSQL’s synchronous_standby_names with FIRST 1 (...).

There are three orthogonal axes a synchronous-replication design must pin down, and PostgreSQL exposes all three as configuration:

  1. What event counts as “acknowledged”? Three monotonically stronger points exist on the standby: the WAL bytes have been written to the OS (may still be in page cache), flushed (fsynced to durable storage), or applied (replayed into the standby’s data pages so a read on the standby would see the change). Waiting for flush guarantees no data loss on standby crash; waiting for apply additionally guarantees a read on the standby is causally consistent with the commit.

  2. How many standbys, and which ones? With more than one standby the primary can require all of them, any k of them (quorum), or the first k by priority. A quorum of k out of N keeps committing while up to N − k members are down or lagging; priority gives a deterministic failover order.

  3. Where does the policy live? The primary can push durability requirements down to standbys (each standby knows it must ack), or keep all policy on the primary (standbys just stream and reply, blissfully unaware). PostgreSQL chose the second: the file header of syncrep.c states the design “isolates all logic about waiting/releasing onto the primary … The standbys are completely unaware of the durability requirements of transactions on the primary.”

The theoretical grounding is the same LSN-as-byte-offset model from postgres-xlog-wal.md: a commit produces a WAL record ending at some LSN (XactLastRecEnd), and “the standby acknowledged the commit” reduces to “the standby’s reported write/flush/apply LSN has reached or passed that value.” Synchronous replication is therefore a waiting-on-an-LSN problem, and the entire module is an LSN-ordered wait queue plus the policy that decides when the queue head’s LSN has been satisfied.
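The reduction from "acknowledged" to an LSN comparison can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL code: `Lsn`, `AckLevel`, `StandbyReport` and `ack_reached` are hypothetical names, and a plain 64-bit integer stands in for the LSN type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical stand-in for a WAL byte position */

typedef enum { ACK_WRITE, ACK_FLUSH, ACK_APPLY } AckLevel;

/* The three positions a standby reports back to the primary. */
typedef struct
{
    Lsn write;   /* handed to the standby's OS, maybe only in page cache */
    Lsn flush;   /* fsynced on the standby */
    Lsn apply;   /* replayed into the standby's data pages */
} StandbyReport;

/* True once the position reported for `level` has reached or passed
 * the end of the commit record. */
static bool
ack_reached(const StandbyReport *report, AckLevel level, Lsn commit_end)
{
    Lsn reported;

    assert(report != NULL);
    reported = (level == ACK_WRITE) ? report->write
             : (level == ACK_FLUSH) ? report->flush
             : report->apply;
    return reported >= commit_end;
}
```

With a report of write = 300, flush = 200, apply = 100, a commit ending at LSN 250 counts as acknowledged at the write level but not yet at the flush or apply level.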

Replicated databases that offer a synchronous mode converge on a small set of structural choices. Naming them makes PostgreSQL’s specific symbols read as one point in a shared design space.

Commit-path interception after local durability

The wait is inserted after the local WAL flush, never before. The committing transaction is already durable on the primary; what remains is to confirm remote durability. This ordering matters for crash semantics: if the primary crashes mid-wait, the transaction is recoverable locally even though the client never got an acknowledgement. PostgreSQL places the call exactly here — RecordTransactionCommit flushes XLOG, marks CLOG, then calls SyncRepWaitForLSN.

A shared-memory wait structure keyed by LSN

The committing backend cannot busy-poll the standby’s progress; it must sleep and be woken. The universal pattern is a shared-memory structure that records “process P is waiting for LSN L,” plus a waker (the replication receiver/sender side) that, on each progress report, releases every waiter whose LSN has been reached. Keeping the structure ordered by LSN turns “wake everyone who is satisfied” into a single walk from the head that stops at the first unsatisfied waiter — O(woken) rather than O(all-waiters) per reply.

Because “acknowledged” has several meanings (write / flush / apply), a mature design keeps a separate queue per level: a backend waiting for apply must not be woken by a mere flush report. PostgreSQL keeps NUM_SYNC_REP_WAIT_MODE == 3 independent queues and three separate “released up to here” cursors.

With N candidate standbys and a requirement of k, the design must define the synced position — the LSN that is guaranteed to be on at least k standbys. For a quorum (any k), that is the k-th largest reported LSN. For priority (first k), it is the oldest (smallest) LSN among the top-k by priority, since all k must have it. PostgreSQL implements both: SyncRepGetNthLatestSyncRecPtr for quorum, SyncRepGetOldestSyncRecPtr for priority.

Cancellation that cannot lie to the client

A backend blocked in the wait is in a delicate spot: the transaction is already committed locally. If the wait is interrupted (query cancel, backend termination, postmaster death), the system must not tell the client “aborted” — that would be a lie. The standard resolution is to emit a warning that explicitly says “committed locally, maybe not replicated” and then proceed, never converting the situation into a transaction abort.

| Concept | PostgreSQL name |
| --- | --- |
| Commit-path wait entry point | SyncRepWaitForLSN(XactLastRecEnd, true) (from RecordTransactionCommit) |
| Durability level GUC | synchronous_commit (SyncCommitLevel enum) |
| Internal wait mode | SyncRepWaitMode (SYNC_REP_WAIT_WRITE/FLUSH/APPLY) |
| Standby-selection GUC | synchronous_standby_names |
| Parsed selection policy | SyncRepConfigData (syncrep_method, num_sync, member_names) |
| Priority method | FIRST k (...) → SYNC_REP_PRIORITY |
| Quorum method | ANY k (...) → SYNC_REP_QUORUM |
| LSN-ordered wait queues | WalSndCtl->SyncRepQueue[NUM_SYNC_REP_WAIT_MODE] |
| Per-backend wait record | MyProc->waitLSN, ->syncRepState, ->syncRepLinks |
| “Released up to” cursor | WalSndCtl->lsn[mode] |
| Waiter wakeup (from walsender) | SyncRepReleaseWaiters → SyncRepWakeQueue |
| Synced-position computation | SyncRepGetSyncRecPtr |
| Candidate standby snapshot | SyncRepGetCandidateStandbys |
| Per-walsender sync priority | WalSnd->sync_standby_priority |
| Config-initialized flag | WalSndCtl->sync_standbys_status (SYNC_STANDBY_INIT/DEFINED) |

Synchronous replication is configured by exactly two user-facing parameters, and they answer two different questions.

synchronous_commit answers “how strong an acknowledgement do I want?” It is an enum (SyncCommitLevel in xact.h):

// SyncCommitLevel — src/include/access/xact.h
typedef enum
{
    SYNCHRONOUS_COMMIT_OFF,          /* asynchronous commit */
    SYNCHRONOUS_COMMIT_LOCAL_FLUSH,  /* wait for local flush only */
    SYNCHRONOUS_COMMIT_REMOTE_WRITE, /* wait for local flush and remote write */
    SYNCHRONOUS_COMMIT_REMOTE_FLUSH, /* wait for local and remote flush */
    SYNCHRONOUS_COMMIT_REMOTE_APPLY, /* wait for local and remote flush and remote apply */
} SyncCommitLevel;
/* Define the default setting for synchronous_commit */
#define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH

The familiar setting names map onto these: off → OFF (don’t even wait for the local flush before returning; asynchronous commit), local → LOCAL_FLUSH (local fsync only, no remote wait), remote_write → REMOTE_WRITE, on → REMOTE_FLUSH (via the SYNCHRONOUS_COMMIT_ON alias; on is the default setting), and remote_apply → REMOTE_APPLY. Only the last three involve a remote wait; off and local never enter the sync-rep queue at all.

The GUC assign hook translates the user-visible level into the internal wait mode that indexes the queues:

// assign_synchronous_commit — src/backend/replication/syncrep.c
void
assign_synchronous_commit(int newval, void *extra)
{
    switch (newval)
    {
        case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
            SyncRepWaitMode = SYNC_REP_WAIT_WRITE; /* 0 */
            break;
        case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
            SyncRepWaitMode = SYNC_REP_WAIT_FLUSH; /* 1 */
            break;
        case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
            SyncRepWaitMode = SYNC_REP_WAIT_APPLY; /* 2 */
            break;
        default:
            SyncRepWaitMode = SYNC_REP_NO_WAIT; /* -1: off / local */
            break;
    }
}
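The combined effect of the setting names and the assign hook can be sketched as a single lookup from the user-visible string to the queue index. This is only an illustration: `wait_mode_for_setting` and the `WAIT_*` constants are hypothetical names, and unlike the real GUC machinery the sketch treats any unrecognized string as "no remote wait".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the internal wait modes; the three
 * non-negative values double as queue indexes. */
enum { WAIT_NONE = -1, WAIT_WRITE = 0, WAIT_FLUSH = 1, WAIT_APPLY = 2 };

/* Map a synchronous_commit setting name to the wait mode it implies. */
static int
wait_mode_for_setting(const char *setting)
{
    assert(setting != NULL);
    if (strcmp(setting, "remote_write") == 0)
        return WAIT_WRITE;
    if (strcmp(setting, "on") == 0)
        return WAIT_FLUSH;      /* "on" is an alias for remote flush */
    if (strcmp(setting, "remote_apply") == 0)
        return WAIT_APPLY;
    return WAIT_NONE;           /* "off", "local", and (in this sketch) anything else */
}
```

Because the result is used directly as an array index, the negative "no wait" value must be filtered out before any queue is touched, which is what the fast path in the commit-path entry point does.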

synchronous_standby_names answers the orthogonal question “acknowledged by which standbys, and how many?” Its value is a tiny grammar (syncrep_gram.y):

// productions — src/backend/replication/syncrep_gram.y
standby_list -> create_syncrep_config("1", list, SYNC_REP_PRIORITY)
NUM '(' standby_list ')' -> create_syncrep_config(NUM, list, SYNC_REP_PRIORITY)
ANY NUM '(' standby_list ')' -> create_syncrep_config(NUM, list, SYNC_REP_QUORUM)
FIRST NUM '(' standby_list ')'-> create_syncrep_config(NUM, list, SYNC_REP_PRIORITY)

So s1, s2, s3 means FIRST 1 (s1, s2, s3) (backward compatible with the pre-9.6 single-standby behaviour: one synchronous standby, chosen in priority order); ANY 2 (s1, s2, s3) is a quorum of any two; FIRST 2 (s1, s2, s3) waits for the two highest-priority connected standbys. The parse result is a flat, malloc-able SyncRepConfigData:

// SyncRepConfigData — src/include/replication/syncrep.h
typedef struct SyncRepConfigData
{
    int   config_size;    /* total size of this struct, in bytes */
    int   num_sync;       /* number of sync standbys that we need to wait for */
    uint8 syncrep_method; /* SYNC_REP_PRIORITY or SYNC_REP_QUORUM */
    int   nmembers;       /* number of members in the following list */
    char  member_names[FLEXIBLE_ARRAY_MEMBER]; /* nmembers NUL-terminated names */
} SyncRepConfigData;
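How the four accepted forms map onto num_sync and syncrep_method can be sketched with a hand-written classifier. The real parser is a bison grammar; the version below is only an approximation under stated assumptions: `SketchPolicy`, `SketchMethod` and `sketch_parse_policy` are hypothetical names, quoted standby names are not handled, the member list itself is not validated, and an empty string is reported as an error rather than as "no synchronous standbys".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>   /* strncasecmp (POSIX) */

typedef enum { SKETCH_PRIORITY, SKETCH_QUORUM } SketchMethod;

typedef struct
{
    SketchMethod method;
    int          num_sync;
} SketchPolicy;

static const char *
skip_spaces(const char *s)
{
    while (isspace((unsigned char) *s))
        s++;
    return s;
}

/* Classify the leading part of a synchronous_standby_names value.
 * Returns 0 on success, -1 if no form matches. */
static int
sketch_parse_policy(const char *value, SketchPolicy *out)
{
    const char *p;
    char       *end;
    long        n;

    assert(value != NULL && out != NULL);
    p = skip_spaces(value);
    out->method = SKETCH_PRIORITY;   /* default for a bare list */
    out->num_sync = 1;

    if (strncasecmp(p, "ANY", 3) == 0 && isspace((unsigned char) p[3]))
    {
        out->method = SKETCH_QUORUM;
        p = skip_spaces(p + 3);
    }
    else if (strncasecmp(p, "FIRST", 5) == 0 && isspace((unsigned char) p[5]))
        p = skip_spaces(p + 5);
    else if (!isdigit((unsigned char) *p))
        return (*p != '\0') ? 0 : -1;   /* bare list means FIRST 1 */

    n = strtol(p, &end, 10);             /* the NUM before the parenthesis */
    if (end == p || n < 1)
        return -1;
    if (*skip_spaces(end) != '(')
        return -1;
    out->num_sync = (int) n;
    return 0;
}
```

Under these assumptions, `s1, s2, s3` and `FIRST 1 (s1, s2, s3)` classify identically, while `2 (s1, s2)` is treated as priority with num_sync = 2, matching the second grammar production.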

check_synchronous_standby_names runs the parser in the GUC check hook and stashes the parsed struct as the GUC’s “extra”; assign_synchronous_standby_names simply publishes it into the global SyncRepConfig. Because the result is flat malloc’d memory, it can live as GUC extra data without a memory context.
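The "flat" property is easiest to see by building such a blob by hand. The sketch below uses a C99 flexible array member and plain malloc; `FlatConfig`, `flat_config_build` and `flat_config_member` are hypothetical names chosen for illustration and are not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    int  config_size;     /* total bytes, so the blob can be copied wholesale */
    int  num_sync;
    int  nmembers;
    char member_names[];  /* nmembers NUL-terminated names, back to back */
} FlatConfig;

/* Pack the names into one allocation with no internal pointers.
 * Returns NULL on allocation failure; the caller frees with free(). */
static FlatConfig *
flat_config_build(int num_sync, const char *const *names, int n)
{
    size_t      size = offsetof(FlatConfig, member_names);
    FlatConfig *cfg;
    char       *dst;

    for (int i = 0; i < n; i++)
        size += strlen(names[i]) + 1;

    cfg = malloc(size);
    if (cfg == NULL)
        return NULL;
    cfg->config_size = (int) size;
    cfg->num_sync = num_sync;
    cfg->nmembers = n;

    dst = cfg->member_names;
    for (int i = 0; i < n; i++)
    {
        size_t len = strlen(names[i]) + 1;

        memcpy(dst, names[i], len);
        dst += len;
    }
    return cfg;
}

/* Return the i-th name by stepping over the preceding strings. */
static const char *
flat_config_member(const FlatConfig *cfg, int i)
{
    const char *p;

    assert(cfg != NULL && i >= 0 && i < cfg->nmembers);
    p = cfg->member_names;
    while (i-- > 0)
        p += strlen(p) + 1;
    return p;
}
```

Since the struct holds no pointers into other allocations, a single free() releases everything and a memcpy of config_size bytes produces a valid copy, which is the property that lets the parsed result travel as GUC extra data.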

The committing backend’s view: SyncRepWaitForLSN

The entry point is called once per commit, from RecordTransactionCommit in xact.c, only when the transaction actually wrote WAL and assigned an XID:

// RecordTransactionCommit (excerpt) — src/backend/access/transam/xact.c
if (wrote_xlog && markXidCommitted)
    SyncRepWaitForLSN(XactLastRecEnd, true);

XactLastRecEnd is the LSN of the end of this transaction’s commit record — exactly the position the standbys must reach. The commit = true argument tells the wait that this LSN is a commit record, which matters for remote_apply (only commit records produce apply feedback, so a non-commit LSN is capped to flush level).

SyncRepWaitForLSN is built to be cheap when sync rep is off, because it runs on every commit. The fast path checks SyncRepRequested() and the sync_standbys_status flag without taking any lock:

// SyncRepWaitForLSN fast path — src/backend/replication/syncrep.c
if (!SyncRepRequested() ||
    ((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
     (SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
    return;
/* Cap the level for anything other than commit to remote flush only. */
if (commit)
    mode = SyncRepWaitMode;
else
    mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);

SyncRepRequested() is the macro (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH) — if the level is off or local, there is nothing to do. The second clause is the “checkpointer has initialized the status data, and it says no sync standbys are defined” case: SYNC_STANDBY_INIT set but SYNC_STANDBY_DEFINED clear means we can skip without the lock.
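The two-bit test in the fast path is compact enough to misread, so here is a small truth-table sketch. The bit values and the names `STATUS_INIT`, `STATUS_DEFINED` and `can_skip_without_lock` are assumptions made for illustration; only the masking logic follows the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_INIT    (1u << 0)   /* hypothetical: status data has been initialized */
#define STATUS_DEFINED (1u << 1)   /* hypothetical: sync standbys are configured */

/* True when the committing backend may return without taking the lock. */
static bool
can_skip_without_lock(bool sync_requested, uint8_t status)
{
    assert((status & ~(STATUS_INIT | STATUS_DEFINED)) == 0);
    if (!sync_requested)
        return true;    /* level is off/local, or no walsenders allowed */
    /* Skip only when we KNOW nothing is defined: INIT set, DEFINED clear. */
    return (status & (STATUS_INIT | STATUS_DEFINED)) == STATUS_INIT;
}
```

The case worth noticing is status == 0: the data has not been initialized yet, so the lock-free check cannot conclude anything and the backend falls through to the locked re-check.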

If the fast path does not exit, the backend takes SyncRepLock exclusively and re-checks under the lock (closing the race in SyncRepUpdateSyncStandbysDefined), then enqueues itself:

// SyncRepWaitForLSN enqueue — src/backend/replication/syncrep.c
MyProc->waitLSN = lsn;
MyProc->syncRepState = SYNC_REP_WAITING;
SyncRepQueueInsert(mode);
Assert(SyncRepQueueIsOrderedByLSN(mode));
LWLockRelease(SyncRepLock);

After releasing the lock it sleeps in a latch loop, waking only on its own procLatch. The state machine for one backend is NOT_WAITING → WAITING → WAIT_COMPLETE → NOT_WAITING:

// SyncRepWaitForLSN wait loop (condensed) — src/backend/replication/syncrep.c
for (;;)
{
    ResetLatch(MyLatch);
    if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
        break;
    if (ProcDiePending)
    {
        /* WARNING: committed locally, maybe not replicated */
        whereToSendOutput = DestNone;
        SyncRepCancelWait();
        break;
    }
    if (QueryCancelPending)
    {
        QueryCancelPending = false;
        /* WARNING ... */
        SyncRepCancelWait();
        break;
    }
    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
                   WAIT_EVENT_SYNC_REP);
    if (rc & WL_POSTMASTER_DEATH)
    {
        ProcDiePending = true;
        whereToSendOutput = DestNone;
        SyncRepCancelWait();
        break;
    }
}

The crucial detail is what happens on cancellation: the code never aborts the transaction, because it has already committed locally. The most it does is emit a WARNING (“The transaction has already committed locally, but might not have been replicated to the standby.”) and, when the backend is being terminated or the postmaster has died, stop sending output to the client. On normal completion (SYNC_REP_WAIT_COMPLETE, set by a walsender) it breaks, issues a pg_read_barrier(), asserts it has been detached from the queue, and resets syncRepState/waitLSN.

flowchart TB
  A["RecordTransactionCommit<br/>XLogFlush local WAL"] --> B["SyncRepWaitForLSN(XactLastRecEnd, true)"]
  B --> C{SyncRepRequested<br/>and standbys defined?}
  C -- no --> R["return immediately<br/>(async / no sync standby)"]
  C -- yes --> D["acquire SyncRepLock<br/>set waitLSN, syncRepState=WAITING"]
  D --> E["SyncRepQueueInsert(mode)<br/>insert sorted by LSN"]
  E --> F["release SyncRepLock"]
  F --> G["latch wait loop"]
  G --> H{syncRepState ==<br/>WAIT_COMPLETE?}
  H -- yes --> I["read barrier<br/>reset state, return to client"]
  H -- no --> J{ProcDie / QueryCancel /<br/>PostmasterDeath?}
  J -- yes --> K["WARNING: committed locally<br/>SyncRepCancelWait, break"]
  J -- no --> L["WaitLatch on procLatch"]
  L --> G

Figure 1 — The committing backend’s path through SyncRepWaitForLSN. The transaction is already locally durable before the wait begins; the only exits are completion (woken by a walsender) or an interruption that still preserves the local commit. (Flow from SyncRepWaitForLSN in syncrep.c and the call site in RecordTransactionCommit.)

The wait queue: three LSN-ordered lists in shared memory

The waiters live in WalSndCtl, the same shared-memory control block the walsenders use. The sync-rep portion is three doubly-linked-list heads (one per wait mode) and three “released up to” cursors:

// WalSndCtlData (sync-rep fields) — src/include/replication/walsender_private.h
typedef struct
{
    dlist_head SyncRepQueue[NUM_SYNC_REP_WAIT_MODE]; /* one queue per request type */
    XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];          /* head-of-queue release cursor */
    bits8      sync_standbys_status;                 /* SYNC_STANDBY_INIT | _DEFINED */
    /* ... condition variables, walsnds[] flexible array ... */
} WalSndCtlData;

Each backend’s link node and wait LSN live in its own PGPROC:

// PGPROC (sync-rep fields) — src/include/storage/proc.h
XLogRecPtr waitLSN; /* waiting for this LSN or higher */
int syncRepState; /* wait state for sync rep */
dlist_node syncRepLinks; /* list link if process is in syncrep queue */

SyncRepQueueInsert keeps each queue sorted ascending by waitLSN. Most commits arrive in LSN order (later commits have larger XactLastRecEnd), so the common case appends at the tail; the function therefore scans from the tail backward to find the insertion point:

// SyncRepQueueInsert — src/backend/replication/syncrep.c
queue = &WalSndCtl->SyncRepQueue[mode];
dlist_reverse_foreach(iter, queue)
{
    PGPROC *proc = dlist_container(PGPROC, syncRepLinks, iter.cur);
    /* Stop at the element we should insert after, to keep queue LSN-ordered. */
    if (proc->waitLSN < MyProc->waitLSN)
    {
        dlist_insert_after(&proc->syncRepLinks, &MyProc->syncRepLinks);
        return;
    }
}
/* list was empty, or we belong at the head */
dlist_push_head(queue, &MyProc->syncRepLinks);

The sorted invariant is what makes wakeup O(woken): the waker walks from the head and stops at the first waiter whose waitLSN exceeds the released position. Under assertions, SyncRepQueueIsOrderedByLSN validates the invariant (and that no two waiters share an LSN) on every insert and wake.
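The tail-first insertion and the ordering check can be reproduced outside PostgreSQL with a hand-rolled doubly linked list. This is a sketch under assumptions: `Waiter`, `WaitQueue`, `queue_insert` and `queue_is_ordered` are hypothetical names, there is no locking, and LSN 0 is treated as "invalid" so that the ordering check can start from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Waiter
{
    uint64_t       wait_lsn;
    struct Waiter *prev;
    struct Waiter *next;
} Waiter;

typedef struct
{
    Waiter *head;   /* smallest wait_lsn */
    Waiter *tail;   /* largest wait_lsn */
} WaitQueue;

/* Insert w so the queue stays ascending by wait_lsn, scanning from the
 * tail because new commits usually carry the largest LSN so far. */
static void
queue_insert(WaitQueue *q, Waiter *w)
{
    Waiter *after;

    assert(q != NULL && w != NULL);
    after = q->tail;
    while (after != NULL && after->wait_lsn >= w->wait_lsn)
        after = after->prev;            /* step past larger or equal LSNs */

    w->prev = after;
    w->next = (after != NULL) ? after->next : q->head;
    if (w->next != NULL)
        w->next->prev = w;
    else
        q->tail = w;                    /* w is the new tail */
    if (after != NULL)
        after->next = w;
    else
        q->head = w;                    /* list was empty, or w belongs at the head */
}

/* Strictly ascending: also rejects two waiters sharing one LSN. */
static bool
queue_is_ordered(const WaitQueue *q)
{
    uint64_t last = 0;

    for (const Waiter *w = q->head; w != NULL; w = w->next)
    {
        if (w->wait_lsn <= last)
            return false;
        last = w->wait_lsn;
    }
    return true;
}
```

In the common in-order case the while loop exits on its first test, so the insert costs a constant number of pointer updates; only an out-of-order commit pays for a longer backward scan.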

The wakers are the walsenders. When a walsender’s ProcessStandbyReplyMessage (see postgres-wal-sender-receiver.md) updates a standby’s reported write/flush/apply positions, it calls SyncRepReleaseWaiters (for non-cascading standbys). That function decides whether this walsender’s standby is one of the synchronous ones, recomputes the synced positions across all sync standbys, advances the release cursors, and wakes the satisfied waiters.

It bails out immediately if this walsender is not a potential sync standby, is not yet streaming, or has no valid flush position:

// SyncRepReleaseWaiters guard — src/backend/replication/syncrep.c
if (MyWalSnd->sync_standby_priority == 0 ||
    (MyWalSnd->state != WALSNDSTATE_STREAMING &&
     MyWalSnd->state != WALSNDSTATE_STOPPING) ||
    XLogRecPtrIsInvalid(MyWalSnd->flush))
{
    announce_next_takeover = true;
    return;
}

Otherwise it takes SyncRepLock exclusively, computes the synced positions, and — only if it really is a synchronous standby and there are enough of them — advances each per-mode cursor and wakes that mode’s queue:

// SyncRepReleaseWaiters release loop — src/backend/replication/syncrep.c
got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
/* ... if (!got_recptr || !am_sync) release lock and leave ... */
if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
    numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
    numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
    numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
}
LWLockRelease(SyncRepLock);

Note the cursors advance monotonically and only forward — the < guards prevent a late or reordered reply from rewinding the release point. The three modes are released independently from their own queues.

SyncRepWakeQueue is the actual wakeup: walk the mode’s queue from the head, and for each waiter whose waitLSN has been reached, detach it, set its state to SYNC_REP_WAIT_COMPLETE, and set its latch:

// SyncRepWakeQueue — src/backend/replication/syncrep.c
dlist_foreach_modify(iter, &WalSndCtl->SyncRepQueue[mode])
{
    PGPROC *proc = dlist_container(PGPROC, syncRepLinks, iter.cur);
    /* Queue is ordered by LSN: stop at first unsatisfied waiter. */
    if (!all && walsndctl->lsn[mode] < proc->waitLSN)
        return numprocs;
    dlist_delete_thoroughly(&proc->syncRepLinks);
    pg_write_barrier(); /* publish detach before state */
    proc->syncRepState = SYNC_REP_WAIT_COMPLETE;
    SetLatch(&(proc->procLatch));
    numprocs++;
}

The pg_write_barrier() pairs with the pg_read_barrier() in SyncRepWaitForLSN: the waiter reads syncRepState without the lock, so the waker must make the queue-detach visible before the state flips to WAIT_COMPLETE, ensuring the woken backend sees itself off the queue.
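The forward-only cursor and the head-first wake can be combined in one small model. To stay self-contained the sketch replaces the linked list with a sorted array and a head index, and it marks waiters complete instead of setting latches; memory barriers and locking are deliberately left out. `Waiter`, `ModeQueue` and `release_upto` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint64_t wait_lsn;
    bool     complete;   /* stands in for WAIT_COMPLETE plus a latch set */
} Waiter;

typedef struct
{
    Waiter  *waiters;        /* ascending by wait_lsn */
    size_t   count;
    size_t   head;           /* index of the first waiter still queued */
    uint64_t released_upto;  /* per-mode release cursor */
} ModeQueue;

/* Advance the cursor to synced_lsn if that moves it forward, then wake
 * every waiter at the head whose LSN is now covered. Returns the number woken. */
static size_t
release_upto(ModeQueue *q, uint64_t synced_lsn)
{
    size_t woken = 0;

    assert(q != NULL);
    if (synced_lsn <= q->released_upto)
        return 0;                        /* stale report: never rewind */
    q->released_upto = synced_lsn;

    while (q->head < q->count &&
           q->waiters[q->head].wait_lsn <= q->released_upto)
    {
        q->waiters[q->head].complete = true;
        q->head++;                       /* detach from the queue */
        woken++;
    }
    return woken;                        /* stopped at first unsatisfied waiter */
}
```

For waiters at LSNs 100, 200 and 300, a synced position of 250 wakes the first two; a later report of 150 changes nothing because the cursor only moves forward; a report of 300 then wakes the last one.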

Choosing the synced position: priority vs. quorum

SyncRepGetSyncRecPtr is where the FIRST/ANY policy materializes. It first snapshots the candidate standbys, checks whether this walsender is among them, and verifies there are at least num_sync candidates:

// SyncRepGetSyncRecPtr (excerpt) — src/backend/replication/syncrep.c
num_standbys = SyncRepGetCandidateStandbys(&sync_standbys);
for (i = 0; i < num_standbys; i++)
{
    if (sync_standbys[i].is_me)
    {
        *am_sync = true;
        break;
    }
}
if (!(*am_sync) || num_standbys < SyncRepConfig->num_sync)
{
    pfree(sync_standbys);
    return false; /* not enough sync standbys yet; don't release */
}
if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
    SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr,
                               sync_standbys, num_standbys);
else
    SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr,
                                  sync_standbys, num_standbys,
                                  SyncRepConfig->num_sync);

For priority (FIRST k), SyncRepGetCandidateStandbys has already trimmed the candidate list to the top-num_sync by priority, so the guaranteed position is the oldest (minimum) LSN among them — every one of those k standbys is at least that far along:

// SyncRepGetOldestSyncRecPtr (excerpt) — src/backend/replication/syncrep.c
for (i = 0; i < num_standbys; i++)
{
    if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > sync_standbys[i].flush)
        *flushPtr = sync_standbys[i].flush;
    /* ... same for write and apply ... */
}

For quorum (ANY k), any k of the candidates suffices, so the guaranteed position is the k-th largest LSN: sort descending and take index nth - 1:

// SyncRepGetNthLatestSyncRecPtr (excerpt) — src/backend/replication/syncrep.c
for (i = 0; i < num_standbys; i++) { write_array[i] = sync_standbys[i].write; /* ... */ }
qsort(write_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn); /* descending */
qsort(flush_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn);
qsort(apply_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn);
*writePtr = write_array[nth - 1];
*flushPtr = flush_array[nth - 1];
*applyPtr = apply_array[nth - 1];

cmp_lsn sorts descending via pg_cmp_u64(lsn2, lsn1), so element [nth-1] is the k-th highest — the LSN that at least k standbys have reached.
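The difference between the two policies shows up clearly on a three-standby example. The sketch below computes both synced positions for a single level; `cmp_desc`, `synced_quorum` and `synced_priority` are hypothetical helpers, and the priority variant assumes the caller passes only the k standbys already selected by priority.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: larger LSNs sort first. */
static int
cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x < y) - (x > y);
}

/* ANY k of n: the k-th largest reported LSN. Sorts lsns in place. */
static uint64_t
synced_quorum(uint64_t *lsns, size_t n, size_t k)
{
    assert(lsns != NULL && k >= 1 && k <= n);
    qsort(lsns, n, sizeof lsns[0], cmp_desc);
    return lsns[k - 1];
}

/* FIRST k: the smallest LSN among the k standbys chosen by priority. */
static uint64_t
synced_priority(const uint64_t *lsns, size_t k)
{
    uint64_t oldest;

    assert(lsns != NULL && k >= 1);
    oldest = lsns[0];
    for (size_t i = 1; i < k; i++)
        if (lsns[i] < oldest)
            oldest = lsns[i];
    return oldest;
}
```

Take flush positions 500, 300 and 400 for s1, s2 and s3 in priority order. FIRST 2 looks only at s1 and s2 and yields 300, whereas ANY 2 yields 400 because s1 and s3 together already hold everything up to that point. Quorum can therefore release commits that priority mode would still hold back behind a lagging high-priority standby.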

flowchart TB
  A["walsender: standby reply<br/>ProcessStandbyReplyMessage"] --> B["SyncRepReleaseWaiters"]
  B --> C{this walsender a sync<br/>candidate, streaming,<br/>valid flush?}
  C -- no --> Z["return (announce_next_takeover)"]
  C -- yes --> D["SyncRepGetCandidateStandbys<br/>snapshot WalSnd[] under spinlocks"]
  D --> E{am_sync and<br/>num_standbys >= num_sync?}
  E -- no --> Z
  E -- yes --> F{syncrep_method?}
  F -- PRIORITY (FIRST) --> G["SyncRepGetOldestSyncRecPtr<br/>min LSN of top-k by priority"]
  F -- QUORUM (ANY) --> H["SyncRepGetNthLatestSyncRecPtr<br/>k-th largest LSN (qsort desc)"]
  G --> I["advance WalSndCtl->lsn[mode]<br/>if greater"]
  H --> I
  I --> J["SyncRepWakeQueue(false, mode)<br/>wake waiters up to lsn[mode]"]
  J --> K["set syncRepState=WAIT_COMPLETE<br/>SetLatch on each woken proc"]

Figure 2 — How a standby reply releases waiters. The walsender computes the synced position with the policy from synchronous_standby_names (oldest-of-top-k for FIRST, k-th-largest for ANY), advances the per-mode release cursor, and wakes every queued backend whose waitLSN has been satisfied. (Flow from SyncRepReleaseWaiters / SyncRepGetSyncRecPtr / SyncRepWakeQueue in syncrep.c.)

Candidate snapshot and per-walsender priority

SyncRepGetCandidateStandbys walks WalSndCtl->walsnds[], copying each active walsender’s reported positions and sync_standby_priority out from under that walsender’s spinlock into a private SyncRepStandbyData array. A walsender qualifies only if it has a PID, is STREAMING or STOPPING, has non-zero priority, and has a valid flush position. In priority mode, if more than num_sync qualify, the array is sorted by priority and trimmed to num_sync:

// SyncRepGetCandidateStandbys (excerpt) — src/backend/replication/syncrep.c
if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY &&
    n > SyncRepConfig->num_sync)
{
    qsort(*standbys, n, sizeof(SyncRepStandbyData), standby_priority_comparator);
    n = SyncRepConfig->num_sync; /* keep only the highest-priority ones */
}

Each walsender computes its own priority once at startup and after each SIGHUP via SyncRepInitConfig → SyncRepGetStandbyPriority, which matches the standby’s application_name against the parsed member_names list (or a * wildcard). The position in the list is the priority for FIRST; under ANY, every member has priority 1. Cascading walsenders always get priority 0 (synchronous cascade replication is not supported):

// SyncRepGetStandbyPriority (excerpt) — src/backend/replication/syncrep.c
if (am_cascading_walsender)
    return 0;
if (!SyncStandbysDefined() || SyncRepConfig == NULL)
    return 0;
standby_name = SyncRepConfig->member_names;
for (priority = 1; priority <= SyncRepConfig->nmembers; priority++)
{
    if (pg_strcasecmp(standby_name, application_name) == 0 ||
        strcmp(standby_name, "*") == 0)
    {
        found = true;
        break;
    }
    standby_name += strlen(standby_name) + 1;
}
if (!found)
    return 0;
return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ? priority : 1;
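The same walk over NUL-separated names can be tried in isolation. In the sketch below `standby_priority` is a hypothetical free-standing function, POSIX strcasecmp stands in for pg_strcasecmp, and the cascading and "nothing defined" early exits are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* names: nmembers NUL-terminated strings stored back to back.
 * Returns 0 when application_name is not a sync candidate, otherwise its
 * 1-based list position (priority mode) or 1 (quorum mode). */
static int
standby_priority(const char *names, int nmembers,
                 const char *application_name, int quorum)
{
    const char *name;

    assert(names != NULL && application_name != NULL);
    name = names;
    for (int pos = 1; pos <= nmembers; pos++)
    {
        if (strcasecmp(name, application_name) == 0 ||
            strcmp(name, "*") == 0)
            return quorum ? 1 : pos;
        name += strlen(name) + 1;   /* hop to the next packed name */
    }
    return 0;
}
```

With the packed list `s1`, `s2`, `*`, a standby connecting as `S2` gets priority 2 (the match is case-insensitive), and any other name falls through to the wildcard at position 3. In quorum mode both would get priority 1.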

synchronous_standby_names can be set to empty at runtime (SIGHUP). A naive implementation would leave backends wedged forever on the queue with nobody left to wake them. The checkpointer owns the sync_standbys_status flag and reconciles it in SyncRepUpdateSyncStandbysDefined: when the GUC transitions to empty, it wakes all queues (SyncRepWakeQueue(true, i) for every mode) and clears SYNC_STANDBY_DEFINED; the flag also gates joining the queue, so a backend that hasn’t yet reloaded its config cannot enqueue after the queues have been drained:

// SyncRepUpdateSyncStandbysDefined (excerpt) — src/backend/replication/syncrep.c
if (!sync_standbys_defined)
{
    for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
        SyncRepWakeQueue(true, i); /* all=true: drain the queue unconditionally */
}
WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
    (sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);

This is why SyncRepWaitForLSN re-checks SYNC_STANDBY_DEFINED under the lock before enqueuing: the flag and the queue are manipulated atomically under SyncRepLock, so there is no window where a backend joins a queue that has just been drained.

Symbols grouped by role. Files are under /data/hgryoo/references/postgres/.

Configuration: the two GUCs (xact.h, syncrep.h, syncrep_gram.y, syncrep.c)

  • SyncCommitLevel (enum): SYNCHRONOUS_COMMIT_OFF through _REMOTE_APPLY; SYNCHRONOUS_COMMIT_ON aliases _REMOTE_FLUSH.
  • synchronous_commit (GUC int) — current level.
  • SyncRepWaitMode (file-static int) — internal mode derived from the level: SYNC_REP_NO_WAIT (-1), SYNC_REP_WAIT_WRITE/FLUSH/APPLY (0/1/2).
  • NUM_SYNC_REP_WAIT_MODE — the constant 3; dimensions the queues and cursors.
  • SyncRepRequested() (macro): max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH.
  • assign_synchronous_commit — GUC assign hook; maps level → SyncRepWaitMode.
  • SyncRepConfigData (struct) — parsed synchronous_standby_names: num_sync, syncrep_method, nmembers, member_names[].
  • SYNC_REP_PRIORITY / SYNC_REP_QUORUM: syncrep_method values.
  • SyncStandbysDefined() (macro) — SyncRepStandbyNames is non-empty.
  • check_synchronous_standby_names / assign_synchronous_standby_names — GUC check/assign hooks; parse via syncrep_yyparse, publish SyncRepConfig.

The committing backend (syncrep.c, xact.c, proc.h)

  • SyncRepWaitForLSN — commit-path entry; fast-path exit, enqueue, latch wait loop, cancellation handling.
  • RecordTransactionCommit — call site: SyncRepWaitForLSN(XactLastRecEnd, true) after local XLogFlush.
  • MyProc->waitLSN / ->syncRepState / ->syncRepLinks — per-backend wait record in PGPROC.
  • SYNC_REP_NOT_WAITING / SYNC_REP_WAITING / SYNC_REP_WAIT_COMPLETE: syncRepState values.
  • SyncRepQueueInsert — insert MyProc into a mode’s queue, sorted by waitLSN (scan from tail).
  • SyncRepCancelWait — detach from queue under lock on interruption.
  • SyncRepCleanupAtProcExit — detach on backend exit (lock-free pre-check).

Shared-memory wait structure (walsender_private.h)

  • WalSndCtlData.SyncRepQueue[NUM_SYNC_REP_WAIT_MODE] — the three LSN-ordered queues.
  • WalSndCtlData.lsn[NUM_SYNC_REP_WAIT_MODE] — per-mode “released up to” cursor.
  • WalSndCtlData.sync_standbys_status: SYNC_STANDBY_INIT / SYNC_STANDBY_DEFINED bits.
  • SyncRepQueueIsOrderedByLSN (assert-only) — validates the sort invariant.
  • SyncRepReleaseWaiters — called from ProcessStandbyReplyMessage; advances cursors and wakes queues.
  • SyncRepGetSyncRecPtr — compute synced write/flush/apply; dispatch on syncrep_method; set am_sync.
  • SyncRepGetOldestSyncRecPtr — priority: minimum LSN over candidates.
  • SyncRepGetNthLatestSyncRecPtr — quorum: k-th largest LSN (qsort descending).
  • cmp_lsn — descending XLogRecPtr comparator.
  • SyncRepGetCandidateStandbys: snapshot WalSnd[]; filter active/streaming/priority/valid-flush; trim to num_sync in priority mode.
  • standby_priority_comparator — sort by priority, tie-break by walsnd_index.
  • SyncRepWakeQueue — walk a queue from head, wake satisfied waiters, set WAIT_COMPLETE.
  • SyncRepInitConfig / SyncRepGetStandbyPriority — per-walsender priority computation at startup / SIGHUP.
  • WalSnd->sync_standby_priority — this walsender’s priority (0 = not a sync candidate).
  • SyncRepStandbyData (struct) — private copy of a candidate’s write/flush/apply/priority/is_me.
  • SyncRepUpdateSyncStandbysDefined — maintain sync_standbys_status; drain queues when the GUC goes empty.

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| SyncCommitLevel (enum) | src/include/access/xact.h | 68 |
| SYNCHRONOUS_COMMIT_ON (macro) | src/include/access/xact.h | 80 |
| SyncRepRequested (macro) | src/include/replication/syncrep.h | 18 |
| SYNC_REP_WAIT_WRITE (macro) | src/include/replication/syncrep.h | 23 |
| NUM_SYNC_REP_WAIT_MODE (macro) | src/include/replication/syncrep.h | 27 |
| SYNC_REP_NOT_WAITING (macro) | src/include/replication/syncrep.h | 30 |
| SYNC_REP_PRIORITY (macro) | src/include/replication/syncrep.h | 35 |
| SyncRepStandbyData (struct) | src/include/replication/syncrep.h | 42 |
| SyncRepConfigData (struct) | src/include/replication/syncrep.h | 63 |
| WalSndCtlData.SyncRepQueue | src/include/replication/walsender_private.h | 90 |
| WalSndCtlData.lsn | src/include/replication/walsender_private.h | 96 |
| WalSndCtlData.sync_standbys_status | src/include/replication/walsender_private.h | 103 |
| SYNC_STANDBY_INIT (macro) | src/include/replication/walsender_private.h | 125 |
| SYNC_STANDBY_DEFINED (macro) | src/include/replication/walsender_private.h | 132 |
| PGPROC.waitLSN | src/include/storage/proc.h | 267 |
| PGPROC.syncRepState | src/include/storage/proc.h | 268 |
| PGPROC.syncRepLinks | src/include/storage/proc.h | 269 |
| SyncRepWaitForLSN | src/backend/replication/syncrep.c | 148 |
| SyncRepQueueInsert | src/backend/replication/syncrep.c | 372 |
| SyncRepCancelWait | src/backend/replication/syncrep.c | 406 |
| SyncRepCleanupAtProcExit | src/backend/replication/syncrep.c | 416 |
| SyncRepInitConfig | src/backend/replication/syncrep.c | 445 |
| SyncRepReleaseWaiters | src/backend/replication/syncrep.c | 474 |
| SyncRepGetSyncRecPtr | src/backend/replication/syncrep.c | 586 |
| SyncRepGetOldestSyncRecPtr | src/backend/replication/syncrep.c | 660 |
| SyncRepGetNthLatestSyncRecPtr | src/backend/replication/syncrep.c | 693 |
| cmp_lsn | src/backend/replication/syncrep.c | 738 |
| SyncRepGetCandidateStandbys | src/backend/replication/syncrep.c | 754 |
| standby_priority_comparator | src/backend/replication/syncrep.c | 833 |
| SyncRepGetStandbyPriority | src/backend/replication/syncrep.c | 860 |
| SyncRepWakeQueue | src/backend/replication/syncrep.c | 907 |
| SyncRepUpdateSyncStandbysDefined | src/backend/replication/syncrep.c | 964 |
| SyncRepQueueIsOrderedByLSN | src/backend/replication/syncrep.c | 1024 |
| check_synchronous_standby_names | src/backend/replication/syncrep.c | 1058 |
| assign_synchronous_standby_names | src/backend/replication/syncrep.c | 1118 |
| assign_synchronous_commit | src/backend/replication/syncrep.c | 1124 |
| SyncRepWaitForLSN call site | src/backend/access/transam/xact.c | 1557 |

All claims below were re-read against REL_18_STABLE at commit 273fe94 on 2026-06-05. Symbol line numbers are in the position-hint table above.

  • All synchronous-replication logic runs on the primary; standbys are unaware they are synchronous. Verified in the syncrep.c file header: “All code in this module executes on the primary. … it isolates all logic about waiting/releasing onto the primary. The primary defines which standbys it wishes to wait for. The standbys are completely unaware of the durability requirements of transactions on the primary.” A standby’s walreceiver sends the same write/flush/apply feedback regardless of whether it is named in any primary’s synchronous_standby_names.

  • The wait is inserted after the local WAL flush, never before. Verified at the call site in RecordTransactionCommit (xact.c): the local XLogFlush of XactLastRecEnd and the CLOG status update precede the SyncRepWaitForLSN(XactLastRecEnd, true) call. A primary crash mid-wait therefore leaves the transaction recoverable locally even though the client never received an acknowledgement.

  • synchronous_commit has five levels; on is remote_flush. Verified in SyncCommitLevel (xact.h): OFF, LOCAL_FLUSH, REMOTE_WRITE, REMOTE_FLUSH, REMOTE_APPLY, with #define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH. assign_synchronous_commit maps only the three REMOTE_* levels to a real wait mode; off and local yield SYNC_REP_NO_WAIT (-1).

  • There are exactly three wait modes and three queues. Verified: NUM_SYNC_REP_WAIT_MODE == 3 (syncrep.h); WalSndCtlData carries dlist_head SyncRepQueue[NUM_SYNC_REP_WAIT_MODE] and XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] (walsender_private.h). The three modes are released independently in SyncRepReleaseWaiters.

  • SyncRepWaitForLSN has a lock-free fast path keyed on SyncRepRequested() and sync_standbys_status. Verified: the function returns immediately when !SyncRepRequested() or when the status flag equals exactly SYNC_STANDBY_INIT (initialized, but no sync standby defined). Only if neither short-circuit fires does it take SyncRepLock exclusively and re-check SYNC_STANDBY_DEFINED.

  • The wait queues are kept strictly ordered by ascending waitLSN. Verified in SyncRepQueueInsert (reverse scan from the tail, insert after the first element with a smaller waitLSN) and asserted on every insert/wake by SyncRepQueueIsOrderedByLSN, which also asserts no two waiters share an LSN. This invariant is what makes SyncRepWakeQueue stop at the first unsatisfied waiter.

  • FIRST k (priority) uses the oldest LSN of the top-k candidates; ANY k (quorum) uses the k-th largest LSN. Verified: SyncRepGetSyncRecPtr dispatches on SyncRepConfig->syncrep_method, calling SyncRepGetOldestSyncRecPtr (minimum over candidates) for SYNC_REP_PRIORITY and SyncRepGetNthLatestSyncRecPtr (qsort descending via cmp_lsn, take index nth - 1) for SYNC_REP_QUORUM. In priority mode SyncRepGetCandidateStandbys has already trimmed the candidate set to num_sync highest-priority members.

  • A candidate walsender must be active, be in the streaming or stopping state, have a non-zero sync priority, and report a valid flush LSN. Verified in SyncRepGetCandidateStandbys: it skips (via continue) any walsender with pid == 0, with state other than WALSNDSTATE_STREAMING/WALSNDSTATE_STOPPING, with sync_standby_priority == 0, or with XLogRecPtrIsInvalid(flush). Cascading walsenders get priority 0 in SyncRepGetStandbyPriority (am_cascading_walsender → return 0), so synchronous cascade replication is unsupported.

  • Release cursors only move forward. Verified in SyncRepReleaseWaiters: each walsndctl->lsn[mode] is overwritten only when it is strictly less than the newly computed position (the < guards), so a late or reordered standby reply cannot rewind a release point.

  • Cancellation never aborts the transaction; it emits a WARNING and proceeds. Verified in the SyncRepWaitForLSN wait loop. On ProcDiePending it raises WARNING “canceling the wait for synchronous replication and terminating connection due to administrator command” with detail “The transaction has already committed locally, but might not have been replicated to the standby.”, sets whereToSendOutput = DestNone, and calls SyncRepCancelWait. On QueryCancelPending it clears the flag and raises a parallel WARNING (“canceling wait for synchronous replication due to user request”). On WL_POSTMASTER_DEATH it sets ProcDiePending and cancels. None of these paths roll back.

  • When synchronous_standby_names is cleared at runtime, every queue is drained unconditionally. Verified in SyncRepUpdateSyncStandbysDefined: when !sync_standbys_defined it calls SyncRepWakeQueue(true, i) for every mode (the all = true argument bypasses the LSN check and wakes every waiter), then clears SYNC_STANDBY_DEFINED in sync_standbys_status. The flag is also re-checked under SyncRepLock in SyncRepWaitForLSN before enqueue, so no backend can join a queue that was just drained.

  • The pairing of pg_write_barrier() in SyncRepWakeQueue (after dlist_delete_thoroughly, before setting SYNC_REP_WAIT_COMPLETE) with pg_read_barrier() in SyncRepWaitForLSN (after observing SYNC_REP_WAIT_COMPLETE) is the lock-free handoff that lets the woken backend read syncRepState without SyncRepLock and still see itself detached from the queue. Both barriers are present in the source; the memory-ordering argument is the documented intent, restated here.

  • The “most commits append at the tail” rationale for scanning SyncRepQueueInsert from the tail backward is a performance observation consistent with the reverse-scan code, not a separately measured claim.
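
The release-side rules verified above (candidate filtering, FIRST versus ANY, the forward-only cursor, and the LSN-ordered wake) can be modeled in a few lines. The following is a minimal Python sketch, not PostgreSQL code: the Standby and WaitQueue classes, the integer LSNs, and the convention that 0 stands for an invalid LSN are all assumptions made for illustration, and only one of the three wait modes is modeled.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    """Illustrative stand-in for one walsender's view of its standby."""
    name: str
    priority: int          # 0 means "not a synchronous candidate"
    flush_lsn: int         # 0 stands in for an invalid LSN in this model
    streaming: bool = True


def candidates(standbys, method, num_sync):
    """Keep eligible standbys; in FIRST mode trim to the top num_sync."""
    eligible = [s for s in standbys
                if s.streaming and s.priority > 0 and s.flush_lsn > 0]
    if method == "FIRST":
        # Lower number means higher priority (1 is the first listed name).
        eligible.sort(key=lambda s: s.priority)
        eligible = eligible[:num_sync]
    return eligible


def sync_rec_ptr(standbys, method, num_sync):
    """Return the LSN that may be released, or None with too few candidates."""
    cands = candidates(standbys, method, num_sync)
    if len(cands) < num_sync:
        return None
    lsns = [s.flush_lsn for s in cands]
    if method == "FIRST":
        return min(lsns)                           # oldest of the top-k
    return sorted(lsns, reverse=True)[num_sync - 1]  # k-th largest


class WaitQueue:
    """One wait mode's queue, kept in ascending wait-LSN order."""

    def __init__(self):
        self.waiters = []   # wait LSNs, ascending
        self.released = 0   # release cursor; never moves backward

    def insert(self, wait_lsn):
        # Scan from the tail and insert after the first smaller element.
        i = len(self.waiters)
        while i > 0 and self.waiters[i - 1] >= wait_lsn:
            i -= 1
        self.waiters.insert(i, wait_lsn)

    def release(self, new_lsn):
        """Advance the cursor if new_lsn is ahead; return the woken LSNs."""
        if new_lsn is None or new_lsn <= self.released:
            return []       # late or reordered reply: nothing to do
        self.released = new_lsn
        woken = []
        # Ordering lets the scan stop at the first unsatisfied waiter.
        while self.waiters and self.waiters[0] <= new_lsn:
            woken.append(self.waiters.pop(0))
        return woken


if __name__ == "__main__":
    standbys = [Standby("s1", 1, 300), Standby("s2", 2, 500),
                Standby("s3", 3, 400)]
    print(sync_rec_ptr(standbys, "FIRST", 2))  # min(300, 500) -> 300
    print(sync_rec_ptr(standbys, "ANY", 2))    # 2nd largest of 500,400,300 -> 400
```

With the same three standbys, FIRST 2 releases only up to 300 because the slower of the two highest-priority standbys gates the commit, while ANY 2 releases up to 400 because any two acknowledgements suffice. That difference is the practical trade between the two methods.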

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • Semi-synchronous replication (DDIA ch. 5). PostgreSQL’s FIRST 1 (...) is exactly Kleppmann’s semi-synchronous configuration: one follower is synchronous, the rest asynchronous, and if the synchronous follower stalls, another is promoted to take its place (announce_next_takeover in SyncRepReleaseWaiters is the seam where a higher-priority standby reclaims the synchronous slot). A focused note mapping the DDIA “single synchronous follower” failure modes onto PostgreSQL’s priority-takeover logic would sharpen the availability story. (raw/system/textbooks/Designing Data Intensive...pdf.)

  • Quorum replication vs. Dynamo-style R + W > N. ANY k (s1..sN) is a write quorum of size k over N candidate standbys, but PostgreSQL fixes the read side at “read the primary” rather than offering tunable read quorums. Comparing SyncRepGetNthLatestSyncRecPtr’s “k-th largest LSN” rule against Dynamo’s sloppy-quorum + hinted-handoff model frames what PostgreSQL gives up (no leaderless writes, no read-repair) for a single authoritative primary. Paper: dbms-papers/dynamo.md.

  • No consensus: PostgreSQL sync rep is not Raft/Paxos. A committing backend waits for k acknowledgements, but there is no leader election, no log-matching safety property, and no automatic failover in core — the primary simply blocks. On a primary crash the cluster needs an external arbiter (Patroni, repmgr, pg_auto_failover) to promote a standby. The contrast with Raft (where the commit index is the replicated agreement, and leadership is part of the protocol) is the cleanest way to explain why “synchronous_commit guarantees durability, not availability.” Papers: dbms-papers/raft.md, dbms-papers/paxos.md.

  • CAP positioning. Synchronous replication trades A for C under partition: if the required standbys are unreachable, committing backends block indefinitely rather than acknowledge a possibly-lost write. This is the CP corner of the dbms-papers/cap.md triangle, made operational by the fact that the wait has no timeout (only query-cancel / termination / postmaster-death exits). A note on “why there is no synchronous_replication_timeout” — and the explicit project decision that a timeout would silently downgrade durability — belongs here.

  • ARIES and the commit-record LSN. The whole mechanism rests on the ARIES-style WAL invariant (dbms-papers/aries.md, and postgres-xlog-wal.md): a commit produces a single WAL record ending at XactLastRecEnd, and “replicated” reduces to “the standby’s reported LSN ≥ XactLastRecEnd.” Synchronous replication adds no new log semantics; it only delays the client acknowledgement until the existing redo stream is known to be remote-durable.

  • Apply-level waits and read-your-writes on standbys. remote_apply is the one level that makes a query routed to the standby causally consistent with the commit, at the cost of waiting for standby replay (the slowest of the three feedback points). Comparing this against systems that expose explicit session-level “read your writes” tokens (LSN handoff to the application) would show where PostgreSQL stops: it offers the cluster-wide guarantee but no per-session causal token in core.
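
To make the "LSN handoff to the application" idea in the last item concrete, here is a small Python sketch of what a per-session token check could look like outside core. It assumes the application has already obtained two LSN strings in the textual pg_lsn form (high and low 32-bit halves in hexadecimal, separated by a slash), for instance from pg_current_wal_lsn() on the primary after commit and pg_last_wal_replay_lsn() on the standby. The helper names parse_lsn and standby_has_replayed are hypothetical, and no database connection is made.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_has_replayed(commit_token: str, replay_lsn: str) -> bool:
    """True if the standby's replay position has reached the session token.

    LSNs must be compared numerically: as strings, '9/0' would sort
    after '10/0' even though it is the smaller position.
    """
    return parse_lsn(replay_lsn) >= parse_lsn(commit_token)


if __name__ == "__main__":
    token = "16/B374D848"   # captured by the writer after its commit
    print(standby_has_replayed(token, "16/B374D848"))
    print(standby_has_replayed(token, "15/FFFFFFFF"))
```

A router built on this check would send the session's next read to the standby only once the check passes. That is the per-session counterpart of what remote_apply enforces for every commit.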

In-tree source files (REL_18_STABLE, commit 273fe94)

  • src/backend/replication/syncrep.c — the entire module: SyncRepWaitForLSN (commit-path wait, fast path, latch loop, cancellation), SyncRepQueueInsert / SyncRepWakeQueue / SyncRepQueueIsOrderedByLSN (the LSN-ordered queues), SyncRepReleaseWaiters / SyncRepGetSyncRecPtr / SyncRepGetOldestSyncRecPtr / SyncRepGetNthLatestSyncRecPtr / SyncRepGetCandidateStandbys (the release side and FIRST/ANY policy), SyncRepGetStandbyPriority / SyncRepInitConfig (per-walsender priority), SyncRepUpdateSyncStandbysDefined (checkpointer reconciliation), assign_synchronous_commit / check_synchronous_standby_names / assign_synchronous_standby_names (GUC hooks).
  • src/backend/replication/syncrep_gram.y and syncrep_scanner.l — the synchronous_standby_names grammar producing SyncRepConfigData with SYNC_REP_PRIORITY (FIRST/bare list) or SYNC_REP_QUORUM (ANY).
  • src/include/replication/syncrep.h — SyncRepRequested / SyncStandbysDefined macros, the SYNC_REP_WAIT_* and NUM_SYNC_REP_WAIT_MODE constants, the SYNC_REP_NOT_WAITING/WAITING/WAIT_COMPLETE states, SyncRepStandbyData, and SyncRepConfigData.
  • src/include/replication/walsender_private.h — WalSndCtlData (the SyncRepQueue[], lsn[], sync_standbys_status fields) and the SYNC_STANDBY_INIT / SYNC_STANDBY_DEFINED flag bits.
  • src/include/storage/proc.h — PGPROC.waitLSN / syncRepState / syncRepLinks, the per-backend wait record.
  • src/include/access/xact.h — SyncCommitLevel enum and SYNCHRONOUS_COMMIT_ON alias.
  • src/backend/access/transam/xact.c — RecordTransactionCommit, the ordinary-commit call site of SyncRepWaitForLSN (the two-phase commit paths in twophase.c call it as well).
  • Designing Data-Intensive Applications (Kleppmann 2017), ch. 5 “Replication” — synchronous vs. asynchronous followers, the semi-synchronous middle ground that FIRST 1 (...) implements. (raw/system/textbooks/.)
  • DeCandia et al. (2007), “Dynamo.” dbms-papers/dynamo.md — quorum replication (R + W > N) as the comparison point for ANY k.
  • Ongaro & Ousterhout (2014), “Raft”; Lamport, “Paxos.” dbms-papers/raft.md, dbms-papers/paxos.md — consensus replication, the contrast that explains why sync rep is durability-not-availability.
  • Gilbert & Lynch (2002), CAP. dbms-papers/cap.md — the CP positioning of a no-timeout synchronous wait.
  • Mohan et al. (1992), “ARIES.” dbms-papers/aries.md — the WAL / commit-record-LSN foundation the wait is keyed on.
  • Database Internals (Petrov 2019), ch. on replication and consensus — general framing of leader-based log replication. (knowledge/research/dbms-general/database-internals.md.)

Sibling docs (cross-references — mechanism owned there, not duplicated here)

  • postgres-wal-sender-receiver.md — the streaming transport and ProcessStandbyReplyMessage, which calls SyncRepReleaseWaiters. The write/flush/apply feedback positions originate there; this doc only consumes them.
  • postgres-xact.md — RecordTransactionCommit and the commit path that calls SyncRepWaitForLSN after local flush.
  • postgres-xlog-wal.md — XactLastRecEnd, XLogFlush, and the LSN-as-byte-offset model the entire wait is built on.
  • postgres-replication-slots.md — slot-based standby tracking, orthogonal to but often co-deployed with synchronous standbys.
  • postgres-overview-replication-ha.md — the axis-level overview where synchronous replication sits among the replication/HA subsystems.