PostgreSQL Synchronous Replication — synchronous_commit and the Wait Queue
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Asynchronous streaming replication (the transport described in
postgres-wal-sender-receiver.md) answers how WAL reaches a standby,
but it deliberately does not couple the primary’s COMMIT to the
standby’s progress. The primary flushes its own WAL, returns success to
the client, and lets the standby catch up whenever the network and the
standby’s I/O allow. That decoupling is exactly what makes asynchronous
replication fast — and exactly what makes it lossy. If the primary’s disk
and the standby both vanish in the window between local flush and remote
receipt, an acknowledged commit is gone.
Synchronous replication closes that window by holding the commit. After the primary flushes its commit WAL record locally, the committing backend does not immediately tell the client “committed.” Instead it waits until a configured set of standbys confirm that they too hold (or have flushed, or have applied) that same LSN. Only then does the backend return. The price is latency: every synchronous commit now pays at least one primary→standby network round trip on top of the local fsync.
This is the classic durability-vs-latency trade that every replicated
log system must make. Designing Data-Intensive Applications (Kleppmann,
ch. 5, “Replication”) frames it as synchronous vs. asynchronous follower
replication: a synchronous follower guarantees an up-to-date copy at the
cost of blocking the leader if the follower is slow or down; an
asynchronous follower never blocks the leader but can lose recently
acknowledged writes on failover. Kleppmann’s pragmatic middle ground —
semi-synchronous replication, where one follower is synchronous and the
rest asynchronous — is precisely the shape of PostgreSQL’s
synchronous_standby_names with FIRST 1 (...).
There are three orthogonal axes a synchronous-replication design must pin down, and PostgreSQL exposes all three as configuration:
- What event counts as “acknowledged”? Three monotonically stronger points exist on the standby: the WAL bytes have been written to the OS (may still be in page cache), flushed (fsynced to durable storage), or applied (replayed into the standby’s data pages so a read on the standby would see the change). Waiting for flush guarantees no data loss on standby crash; waiting for apply additionally guarantees a read on the standby is causally consistent with the commit.
- How many standbys, and which ones? With more than one standby the primary can require all of them, any k of them (quorum), or the first k by priority. A quorum of k out of N keeps accepting commits while up to N − k members are down or lagging; priority gives a deterministic failover order.
- Where does the policy live? The primary can push durability requirements down to standbys (each standby knows it must ack), or keep all policy on the primary (standbys just stream and reply, blissfully unaware). PostgreSQL chose the second: the file header of syncrep.c states that the design “isolates all logic about waiting/releasing onto the primary … The standbys are completely unaware of the durability requirements of transactions on the primary.”
The theoretical grounding is the same LSN-as-byte-offset model from
postgres-xlog-wal.md: a commit produces a WAL record ending at some LSN
(XactLastRecEnd), and “the standby acknowledged the commit” reduces to
“the standby’s reported write/flush/apply LSN has reached or passed that
value.” Synchronous replication is therefore a waiting-on-an-LSN
problem, and the entire module is an LSN-ordered wait queue plus the
policy that decides when the queue head’s LSN has been satisfied.
Common DBMS Design
Replicated databases that offer a synchronous mode converge on a small set of structural choices. Naming them makes PostgreSQL’s specific symbols read as one point in a shared design space.
Commit-path interception after local durability
The wait is inserted after the local WAL flush, never before. The
committing transaction is already durable on the primary; what remains is
to confirm remote durability. This ordering matters for crash semantics:
if the primary crashes mid-wait, the transaction is recoverable locally
even though the client never got an acknowledgement. PostgreSQL places the
call exactly here — RecordTransactionCommit flushes XLOG, marks CLOG,
then calls SyncRepWaitForLSN.
A shared-memory wait structure keyed by LSN
The committing backend cannot busy-poll the standby’s progress; it must sleep and be woken. The usual pattern is a shared-memory structure that records “process P is waiting for LSN L,” plus a waker (the replication receiver/sender side) that, on each progress report, releases every waiter whose LSN has been reached. Keeping the structure ordered by LSN turns “wake everyone who is satisfied” into a single walk from the head that stops at the first unsatisfied waiter, so each reply costs O(woken) rather than O(all-waiters).
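The head-first walk can be sketched over a plain sorted array. This is a simplified model under stated assumptions: `Waiter` and `wake_satisfied` are hypothetical names, and a real implementation would use a shared-memory list and a wakeup primitive rather than a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Hypothetical waiter record for this sketch. */
typedef struct
{
    Lsn wait_lsn;   /* position this waiter needs */
    int woken;      /* set to 1 by the waker */
} Waiter;

/*
 * Wake every waiter at the front of an LSN-sorted array whose wait_lsn is
 * at or below `released`. The loop stops at the first unsatisfied waiter,
 * so the cost is proportional to the number woken. Returns that number;
 * the caller would then drop that many entries from the front.
 */
static size_t
wake_satisfied(Waiter *queue, size_t n, Lsn released)
{
    assert(queue != NULL || n == 0);
    size_t woken = 0;

    while (woken < n && queue[woken].wait_lsn <= released)
    {
        queue[woken].woken = 1;
        woken++;
    }
    return woken;
}
```

For a queue waiting on LSNs 100, 200, 300, 400 and a released position of 250, the walk wakes the first two waiters and inspects only the third before stopping.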
Multiple acknowledgement levels
Because “acknowledged” has several meanings (write / flush / apply), a
mature design keeps a separate queue per level: a backend waiting for
apply must not be woken by a mere flush report. PostgreSQL keeps
NUM_SYNC_REP_WAIT_MODE == 3 independent queues and three separate
“released up to here” cursors.
A policy for “enough” standbys
With N candidate standbys and a requirement of k, the design must define
the synced position — the LSN that is guaranteed to be on at least k
standbys. For a quorum (any k), that is the k-th largest reported LSN.
For priority (first k), it is the oldest (smallest) LSN among the top-k
by priority, since all k must have it. PostgreSQL implements both:
SyncRepGetNthLatestSyncRecPtr for quorum, SyncRepGetOldestSyncRecPtr
for priority.
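The two definitions of the synced position can be compared on a small example. The sketch below is an assumption-laden illustration, not the PostgreSQL functions named above: `synced_quorum`, `synced_priority`, and `cmp_lsn_desc` are hypothetical helpers operating on bare LSN arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t Lsn;

/* Descending comparator for qsort. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn x = *(const Lsn *) a;
    Lsn y = *(const Lsn *) b;

    return (x < y) - (x > y);
}

/* Quorum (any k of n): the k-th largest reported LSN. Sorts in place. */
static Lsn
synced_quorum(Lsn *lsns, size_t n, size_t k)
{
    assert(k >= 1 && k <= n);
    qsort(lsns, n, sizeof(Lsn), cmp_lsn_desc);
    return lsns[k - 1];
}

/* Priority (first k): input is in priority order; take the minimum of the first k. */
static Lsn
synced_priority(const Lsn *lsns_by_priority, size_t n, size_t k)
{
    assert(k >= 1 && k <= n);
    Lsn oldest = lsns_by_priority[0];

    for (size_t i = 1; i < k; i++)
        if (lsns_by_priority[i] < oldest)
            oldest = lsns_by_priority[i];
    return oldest;
}
```

Take three standbys in priority order reporting 100, 300, and 200, with k = 2. The priority rule yields 100, the minimum over the first two. The quorum rule yields 200, the second-largest value, because two standbys (300 and 200) have reached it.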
Cancellation that cannot lie to the client
A backend blocked in the wait is in a delicate spot: the transaction is already committed locally. If the wait is interrupted (query cancel, backend termination, postmaster death), the system must not tell the client “aborted”, because that would be a lie. The standard resolution is to emit a warning that explicitly says “committed locally, maybe not replicated” and then proceed, never converting the situation into a transaction abort.
Theory ↔ PostgreSQL mapping
| Concept | PostgreSQL name |
|---|---|
| Commit-path wait entry point | SyncRepWaitForLSN(XactLastRecEnd, true) (from RecordTransactionCommit) |
| Durability level GUC | synchronous_commit (SyncCommitLevel enum) |
| Internal wait mode | SyncRepWaitMode (SYNC_REP_WAIT_WRITE/FLUSH/APPLY) |
| Standby-selection GUC | synchronous_standby_names |
| Parsed selection policy | SyncRepConfigData (syncrep_method, num_sync, member_names) |
| Priority method | FIRST k (...) → SYNC_REP_PRIORITY |
| Quorum method | ANY k (...) → SYNC_REP_QUORUM |
| LSN-ordered wait queues | WalSndCtl->SyncRepQueue[NUM_SYNC_REP_WAIT_MODE] |
| Per-backend wait record | MyProc->waitLSN, ->syncRepState, ->syncRepLinks |
| “Released up to” cursor | WalSndCtl->lsn[mode] |
| Waiter wakeup (from walsender) | SyncRepReleaseWaiters → SyncRepWakeQueue |
| Synced-position computation | SyncRepGetSyncRecPtr |
| Candidate standby snapshot | SyncRepGetCandidateStandbys |
| Per-walsender sync priority | WalSnd->sync_standby_priority |
| Config-initialized flag | WalSndCtl->sync_standbys_status (SYNC_STANDBY_INIT/DEFINED) |
PostgreSQL’s Approach
The two GUCs and what they each control
Synchronous replication is configured by exactly two user-facing parameters, and they answer two different questions.
synchronous_commit answers “how strong an acknowledgement do I want?”
It is an enum (SyncCommitLevel in xact.h):
```c
// SyncCommitLevel (src/include/access/xact.h)
typedef enum
{
    SYNCHRONOUS_COMMIT_OFF,             /* asynchronous commit */
    SYNCHRONOUS_COMMIT_LOCAL_FLUSH,     /* wait for local flush only */
    SYNCHRONOUS_COMMIT_REMOTE_WRITE,    /* wait for local flush and remote write */
    SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
    SYNCHRONOUS_COMMIT_REMOTE_APPLY,    /* wait for local and remote flush and remote apply */
} SyncCommitLevel;

/* Define the default setting for synchronous_commit */
#define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH
```

The familiar setting names map onto these: off → OFF (asynchronous
commit: do not wait even for the local flush before returning), local →
LOCAL_FLUSH (local fsync only, no remote wait), remote_write →
REMOTE_WRITE, on → REMOTE_FLUSH (on is the default setting), and
remote_apply → REMOTE_APPLY. Only the last three involve a remote wait;
off and local never enter the sync-rep queue at all.
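The name-to-level mapping and the "remote wait or not" split can be sketched as follows. The enum and both functions are hypothetical mirrors written for this example; they only reproduce the ordering described above and are not PostgreSQL's GUC machinery.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the levels described above; the ordering matters. */
typedef enum
{
    LEVEL_OFF,
    LEVEL_LOCAL_FLUSH,
    LEVEL_REMOTE_WRITE,
    LEVEL_REMOTE_FLUSH,
    LEVEL_REMOTE_APPLY
} Level;

/* Map a synchronous_commit setting name to a level; returns -1 if unknown. */
static int
level_for_setting(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "off") == 0)          return LEVEL_OFF;
    if (strcmp(name, "local") == 0)        return LEVEL_LOCAL_FLUSH;
    if (strcmp(name, "remote_write") == 0) return LEVEL_REMOTE_WRITE;
    if (strcmp(name, "on") == 0)           return LEVEL_REMOTE_FLUSH;
    if (strcmp(name, "remote_apply") == 0) return LEVEL_REMOTE_APPLY;
    return -1;
}

/* Only levels strictly above local flush involve waiting on a standby. */
static int
needs_remote_wait(int level)
{
    return level > LEVEL_LOCAL_FLUSH;
}
```

Because the levels are ordered by strength, a single comparison against the local-flush level separates the settings that enter the wait queue from those that do not.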
The GUC assign hook translates the user-visible level into the internal wait mode that indexes the queues:
```c
// assign_synchronous_commit (src/backend/replication/syncrep.c)
void
assign_synchronous_commit(int newval, void *extra)
{
    switch (newval)
    {
        case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
            SyncRepWaitMode = SYNC_REP_WAIT_WRITE;  /* 0 */
            break;
        case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
            SyncRepWaitMode = SYNC_REP_WAIT_FLUSH;  /* 1 */
            break;
        case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
            SyncRepWaitMode = SYNC_REP_WAIT_APPLY;  /* 2 */
            break;
        default:
            SyncRepWaitMode = SYNC_REP_NO_WAIT;     /* -1: off / local */
            break;
    }
}
```

synchronous_standby_names answers the orthogonal question “acknowledged
by which standbys, and how many?” Its value is a tiny grammar
(syncrep_gram.y):
```
// productions (src/backend/replication/syncrep_gram.y)
standby_list                    -> create_syncrep_config("1", list, SYNC_REP_PRIORITY)
NUM '(' standby_list ')'        -> create_syncrep_config(NUM, list, SYNC_REP_PRIORITY)
ANY NUM '(' standby_list ')'    -> create_syncrep_config(NUM, list, SYNC_REP_QUORUM)
FIRST NUM '(' standby_list ')'  -> create_syncrep_config(NUM, list, SYNC_REP_PRIORITY)
```

So s1, s2, s3 means FIRST 1 (s1, s2, s3) (backward compatible with the
pre-9.6 single-standby behaviour: one synchronous standby, priority
order); ANY 2 (s1, s2, s3) is a quorum of any two; FIRST 2 (s1, s2, s3)
waits for the two highest-priority connected standbys. The parse
result is a flat, malloc-able SyncRepConfigData:
```c
// SyncRepConfigData (src/include/replication/syncrep.h)
typedef struct SyncRepConfigData
{
    int     config_size;    /* total size of this struct, in bytes */
    int     num_sync;       /* number of sync standbys that we need to wait for */
    uint8   syncrep_method; /* SYNC_REP_PRIORITY or SYNC_REP_QUORUM */
    int     nmembers;       /* number of members in the following list */
    char    member_names[FLEXIBLE_ARRAY_MEMBER];    /* nmembers NUL-terminated names */
} SyncRepConfigData;
```

check_synchronous_standby_names runs the parser in the GUC check hook
and stashes the parsed struct as the GUC’s “extra”; assign_synchronous_standby_names
simply publishes it into the global SyncRepConfig. Because the result is
flat malloc’d memory, it can live as GUC extra data without a memory
context.
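The flat layout of member_names, several NUL-terminated names packed back to back, can be walked without any pointers stored inside the struct. The sketch below assumes a hypothetical helper `member_position` and a hand-built buffer; unlike the real matching code it compares case-sensitively and ignores the `*` wildcard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the 1-based position of `name` in a packed buffer holding
 * `nmembers` NUL-terminated strings, or 0 if it is absent.
 * Case-sensitive, no wildcard handling (simplifications for this sketch).
 */
static int
member_position(const char *packed, int nmembers, const char *name)
{
    assert(packed != NULL && name != NULL);
    const char *cur = packed;

    for (int pos = 1; pos <= nmembers; pos++)
    {
        if (strcmp(cur, name) == 0)
            return pos;
        cur += strlen(cur) + 1;     /* step over the name and its NUL */
    }
    return 0;
}
```

A buffer written as `"s1\0s2\0s3"` holds three members. Since the whole list lives in one contiguous allocation, the struct can be copied or freed as a single block, which is what makes it suitable as GUC extra data.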
The committing backend’s view: SyncRepWaitForLSN
The entry point is called once per commit, from RecordTransactionCommit
in xact.c, only when the transaction actually wrote WAL and assigned an
XID:
```c
// RecordTransactionCommit (excerpt, src/backend/access/transam/xact.c)
if (wrote_xlog && markXidCommitted)
    SyncRepWaitForLSN(XactLastRecEnd, true);
```

XactLastRecEnd is the LSN of the end of this transaction’s commit
record — exactly the position the standbys must reach. The commit = true
argument tells the wait that this LSN is a commit record, which matters
for remote_apply (only commit records produce apply feedback, so a
non-commit LSN is capped to flush level).
SyncRepWaitForLSN is built to be cheap when sync rep is off, because it
runs on every commit. The fast path checks SyncRepRequested() and the
sync_standbys_status flag without taking any lock:
```c
// SyncRepWaitForLSN fast path (src/backend/replication/syncrep.c)
if (!SyncRepRequested() ||
    ((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
     (SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
    return;

/* Cap the level for anything other than commit to remote flush only. */
if (commit)
    mode = SyncRepWaitMode;
else
    mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
```

SyncRepRequested() is the macro (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH): if the level is
off or local, there is nothing to do. The second clause is the
“checkpointer has initialized the status data, and it says no sync standbys
are defined” case: SYNC_STANDBY_INIT set but SYNC_STANDBY_DEFINED
clear means we can skip without the lock.
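The bit test in the fast path can be checked against all four flag combinations. In this sketch the bit values of `STATUS_INIT` and `STATUS_DEFINED` and the function `can_skip_without_lock` are assumptions made for illustration; only the shape of the mask-and-compare follows the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values are assumptions for this sketch. */
#define STATUS_INIT     (1u << 0)   /* status data has been initialized */
#define STATUS_DEFINED  (1u << 1)   /* sync standbys are configured */

/* True when the lock-free check alone proves there is nothing to wait for. */
static bool
can_skip_without_lock(bool sync_requested, uint8_t status)
{
    if (!sync_requested)
        return true;
    /* Skip only when INIT is set and DEFINED is clear. */
    return (status & (STATUS_INIT | STATUS_DEFINED)) == STATUS_INIT;
}
```

In this model an all-zero status does not permit the skip: with the INIT bit clear, the flag carries no information yet, so the caller has to fall through to the locked path.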
If the fast path does not exit, the backend takes SyncRepLock exclusively
and re-checks under the lock (closing the race in
SyncRepUpdateSyncStandbysDefined), then enqueues itself:
```c
// SyncRepWaitForLSN enqueue (src/backend/replication/syncrep.c)
MyProc->waitLSN = lsn;
MyProc->syncRepState = SYNC_REP_WAITING;
SyncRepQueueInsert(mode);
Assert(SyncRepQueueIsOrderedByLSN(mode));
LWLockRelease(SyncRepLock);
```

After releasing the lock it sleeps in a latch loop, waking only on its own
procLatch. The state machine for one backend is NOT_WAITING → WAITING → WAIT_COMPLETE → NOT_WAITING:
```c
// SyncRepWaitForLSN wait loop (condensed, src/backend/replication/syncrep.c)
for (;;)
{
    ResetLatch(MyLatch);

    if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
        break;

    if (ProcDiePending)
    {
        /* WARNING: committed locally, maybe not replicated */
        whereToSendOutput = DestNone;
        SyncRepCancelWait();
        break;
    }

    if (QueryCancelPending)
    {
        QueryCancelPending = false;
        /* WARNING ... */
        SyncRepCancelWait();
        break;
    }

    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
                   WAIT_EVENT_SYNC_REP);

    if (rc & WL_POSTMASTER_DEATH)
    {
        ProcDiePending = true;
        whereToSendOutput = DestNone;
        SyncRepCancelWait();
        break;
    }
}
```

The crucial detail is what happens on cancellation: the code never aborts
the transaction, because it has already committed locally. The most it
does is emit a WARNING whose detail reads “The transaction has already
committed locally, but might not have been replicated to the standby.” On
backend termination or postmaster death it also sets whereToSendOutput to
DestNone so that nothing further is sent to the client. On normal
completion (SYNC_REP_WAIT_COMPLETE, set by a walsender) it breaks, issues
a pg_read_barrier(), asserts it has been detached from the queue, and
resets syncRepState/waitLSN.
```mermaid
flowchart TB
A["RecordTransactionCommit<br/>XLogFlush local WAL"] --> B["SyncRepWaitForLSN(XactLastRecEnd, true)"]
B --> C{SyncRepRequested<br/>and standbys defined?}
C -- no --> R["return immediately<br/>(async / no sync standby)"]
C -- yes --> D["acquire SyncRepLock<br/>set waitLSN, syncRepState=WAITING"]
D --> E["SyncRepQueueInsert(mode)<br/>insert sorted by LSN"]
E --> F["release SyncRepLock"]
F --> G["latch wait loop"]
G --> H{syncRepState ==<br/>WAIT_COMPLETE?}
H -- yes --> I["read barrier<br/>reset state, return to client"]
H -- no --> J{ProcDie / QueryCancel /<br/>PostmasterDeath?}
J -- yes --> K["WARNING: committed locally<br/>SyncRepCancelWait, break"]
J -- no --> L["WaitLatch on procLatch"]
L --> G
```
Figure 1 — The committing backend’s path through SyncRepWaitForLSN. The
transaction is already locally durable before the wait begins; the only
exits are completion (woken by a walsender) or an interruption that still
preserves the local commit. (Flow from SyncRepWaitForLSN in syncrep.c
and the call site in RecordTransactionCommit.)
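The exits of the wait loop can be summarized as a small outcome table. The types and the function `outcome_for` below are hypothetical and model only the condensed loop shown earlier: every exit leaves the transaction committed, and the exits differ only in whether a warning is raised and whether client output is suppressed.

```c
#include <assert.h>

/* Hypothetical event and outcome types for this sketch. */
typedef enum
{
    EV_RELEASED,            /* a walsender marked the wait complete */
    EV_QUERY_CANCEL,
    EV_PROC_DIE,
    EV_POSTMASTER_DEATH
} WaitEvent;

typedef struct
{
    int committed;      /* always 1: the local commit is never undone */
    int warned;         /* 1 if a "maybe not replicated" warning is raised */
    int output_muted;   /* 1 if no further output goes to the client */
} WaitOutcome;

/* Classify how the backend leaves the wait for a given event. */
static WaitOutcome
outcome_for(WaitEvent ev)
{
    WaitOutcome out = { 1, 0, 0 };

    assert(ev >= EV_RELEASED && ev <= EV_POSTMASTER_DEATH);
    switch (ev)
    {
        case EV_RELEASED:                                          break;
        case EV_QUERY_CANCEL:     out.warned = 1;                  break;
        case EV_PROC_DIE:         out.warned = 1;
                                  out.output_muted = 1;            break;
        case EV_POSTMASTER_DEATH: out.output_muted = 1;            break;
    }
    return out;
}
```

No branch ever clears `committed`, which is the property the section describes: an interrupted wait changes what the client is told, not what happened to the transaction.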
The wait queue: three LSN-ordered lists in shared memory
The waiters live in WalSndCtl, the same shared-memory control block the
walsenders use. The sync-rep portion is three doubly-linked-list heads
(one per wait mode) and three “released up to” cursors:
```c
// WalSndCtlData (sync-rep fields, src/include/replication/walsender_private.h)
typedef struct
{
    dlist_head  SyncRepQueue[NUM_SYNC_REP_WAIT_MODE];   /* one queue per request type */
    XLogRecPtr  lsn[NUM_SYNC_REP_WAIT_MODE];            /* head-of-queue release cursor */
    bits8       sync_standbys_status;                   /* SYNC_STANDBY_INIT | _DEFINED */
    /* ... condition variables, walsnds[] flexible array ... */
} WalSndCtlData;
```

Each backend’s link node and wait LSN live in its own PGPROC:

```c
// PGPROC (sync-rep fields, src/include/storage/proc.h)
XLogRecPtr  waitLSN;        /* waiting for this LSN or higher */
int         syncRepState;   /* wait state for sync rep */
dlist_node  syncRepLinks;   /* list link if process is in syncrep queue */
```

SyncRepQueueInsert keeps each queue sorted ascending by waitLSN. Most
commits arrive in LSN order (later commits have larger XactLastRecEnd),
so the common case appends at the tail; the function therefore scans from
the tail backward to find the insertion point:
```c
// SyncRepQueueInsert (src/backend/replication/syncrep.c)
queue = &WalSndCtl->SyncRepQueue[mode];
dlist_reverse_foreach(iter, queue)
{
    PGPROC *proc = dlist_container(PGPROC, syncRepLinks, iter.cur);

    /* Stop at the element we should insert after, to keep queue LSN-ordered. */
    if (proc->waitLSN < MyProc->waitLSN)
    {
        dlist_insert_after(&proc->syncRepLinks, &MyProc->syncRepLinks);
        return;
    }
}
/* list was empty, or we belong at the head */
dlist_push_head(queue, &MyProc->syncRepLinks);
```

The sorted invariant is what makes wakeup O(woken): the waker walks from
the head and stops at the first waiter whose waitLSN exceeds the released
position. Under assertions, SyncRepQueueIsOrderedByLSN validates the
invariant (and that no two waiters share an LSN) on every insert and wake.
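The tail-first insertion can be sketched on an array instead of a linked list. `insert_from_tail` is a hypothetical helper for this example; it shifts elements where a list would relink nodes, but the scan direction and stopping rule follow the excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/*
 * Insert `lsn` into the ascending array `q` that currently holds `*n`
 * entries (the caller guarantees room for one more). The scan starts at
 * the tail and moves left past every entry that is >= lsn, so an arrival
 * in LSN order costs a single comparison. Returns the index used.
 */
static size_t
insert_from_tail(Lsn *q, size_t *n, Lsn lsn)
{
    assert(q != NULL && n != NULL);
    size_t i = *n;

    while (i > 0 && q[i - 1] >= lsn)
    {
        q[i] = q[i - 1];    /* shift the larger entry one slot right */
        i--;
    }
    q[i] = lsn;
    (*n)++;
    return i;
}
```

Starting from 100, 200, 400, inserting 500 lands at the tail immediately, inserting 300 stops after passing two entries, and inserting 50 falls through to the head, which corresponds to the push-head case in the real function.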
The walsender’s view: releasing waiters
The wakers are the walsenders. When a walsender’s ProcessStandbyReplyMessage
(see postgres-wal-sender-receiver.md) updates a standby’s reported
write/flush/apply positions, it calls SyncRepReleaseWaiters (for
non-cascading standbys). That function decides whether this walsender’s
standby is one of the synchronous ones, recomputes the synced positions
across all sync standbys, advances the release cursors, and wakes the
satisfied waiters.
It bails out immediately if this walsender is not a potential sync standby, is not yet streaming, or has no valid flush position:
```c
// SyncRepReleaseWaiters guard (src/backend/replication/syncrep.c)
if (MyWalSnd->sync_standby_priority == 0 ||
    (MyWalSnd->state != WALSNDSTATE_STREAMING &&
     MyWalSnd->state != WALSNDSTATE_STOPPING) ||
    XLogRecPtrIsInvalid(MyWalSnd->flush))
{
    announce_next_takeover = true;
    return;
}
```

Otherwise it takes SyncRepLock exclusively, computes the synced
positions, and — only if it really is a synchronous standby and there are
enough of them — advances each per-mode cursor and wakes that mode’s
queue:
```c
// SyncRepReleaseWaiters release loop (src/backend/replication/syncrep.c)
got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
/* ... if (!got_recptr || !am_sync) release lock and leave ... */
if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
    numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
    numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
{
    walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
    numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
}
LWLockRelease(SyncRepLock);
```

Note the cursors advance monotonically and only forward: the < guards
prevent a late or reordered reply from rewinding the release point. The
three modes are released independently from their own queues.
SyncRepWakeQueue is the actual wakeup: walk the mode’s queue from the
head, and for each waiter whose waitLSN has been reached, detach it, set
its state to SYNC_REP_WAIT_COMPLETE, and set its latch:
```c
// SyncRepWakeQueue (src/backend/replication/syncrep.c)
dlist_foreach_modify(iter, &WalSndCtl->SyncRepQueue[mode])
{
    PGPROC *proc = dlist_container(PGPROC, syncRepLinks, iter.cur);

    /* Queue is ordered by LSN: stop at first unsatisfied waiter. */
    if (!all && walsndctl->lsn[mode] < proc->waitLSN)
        return numprocs;

    dlist_delete_thoroughly(&proc->syncRepLinks);
    pg_write_barrier();     /* publish detach before state */
    proc->syncRepState = SYNC_REP_WAIT_COMPLETE;
    SetLatch(&(proc->procLatch));
    numprocs++;
}
```

The pg_write_barrier() pairs with the pg_read_barrier() in
SyncRepWaitForLSN: the waiter reads syncRepState without the lock, so
the waker must make the queue-detach visible before the state flips to
WAIT_COMPLETE, ensuring the woken backend sees itself off the queue.
Choosing the synced position: priority vs. quorum
SyncRepGetSyncRecPtr is where the FIRST/ANY policy materializes. It first
snapshots the candidate standbys, checks whether this walsender is among
them, and verifies there are at least num_sync candidates:
```c
// SyncRepGetSyncRecPtr (excerpt, src/backend/replication/syncrep.c)
num_standbys = SyncRepGetCandidateStandbys(&sync_standbys);
for (i = 0; i < num_standbys; i++)
    if (sync_standbys[i].is_me)
    {
        *am_sync = true;
        break;
    }
if (!(*am_sync) || num_standbys < SyncRepConfig->num_sync)
{
    pfree(sync_standbys);
    return false;   /* not enough sync standbys yet: don't release */
}
if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
    SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr,
                               sync_standbys, num_standbys);
else
    SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr,
                                  sync_standbys, num_standbys,
                                  SyncRepConfig->num_sync);
```

For priority (FIRST k), SyncRepGetCandidateStandbys has already
trimmed the candidate list to the top-num_sync by priority, so the
guaranteed position is the oldest (minimum) LSN among them — every one of
those k standbys is at least that far along:
```c
// SyncRepGetOldestSyncRecPtr (excerpt, src/backend/replication/syncrep.c)
for (i = 0; i < num_standbys; i++)
{
    if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > sync_standbys[i].flush)
        *flushPtr = sync_standbys[i].flush;
    /* ... same for write and apply ... */
}
```

For quorum (ANY k), any k of the candidates suffices, so the
guaranteed position is the k-th largest LSN: sort descending and take
index nth - 1:
```c
// SyncRepGetNthLatestSyncRecPtr (excerpt, src/backend/replication/syncrep.c)
for (i = 0; i < num_standbys; i++)
{
    write_array[i] = sync_standbys[i].write;
    /* ... */
}
qsort(write_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn);  /* descending */
qsort(flush_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn);
qsort(apply_array, num_standbys, sizeof(XLogRecPtr), cmp_lsn);
*writePtr = write_array[nth - 1];
*flushPtr = flush_array[nth - 1];
*applyPtr = apply_array[nth - 1];
```

cmp_lsn sorts descending via pg_cmp_u64(lsn2, lsn1), so element
[nth-1] is the k-th highest — the LSN that at least k standbys have
reached.
```mermaid
flowchart TB
A["walsender: standby reply<br/>ProcessStandbyReplyMessage"] --> B["SyncRepReleaseWaiters"]
B --> C{this walsender a sync<br/>candidate, streaming,<br/>valid flush?}
C -- no --> Z["return (announce_next_takeover)"]
C -- yes --> D["SyncRepGetCandidateStandbys<br/>snapshot WalSnd[] under spinlocks"]
D --> E{am_sync and<br/>num_standbys >= num_sync?}
E -- no --> Z
E -- yes --> F{syncrep_method?}
F -- PRIORITY (FIRST) --> G["SyncRepGetOldestSyncRecPtr<br/>min LSN of top-k by priority"]
F -- QUORUM (ANY) --> H["SyncRepGetNthLatestSyncRecPtr<br/>k-th largest LSN (qsort desc)"]
G --> I["advance WalSndCtl->lsn[mode]<br/>if greater"]
H --> I
I --> J["SyncRepWakeQueue(false, mode)<br/>wake waiters up to lsn[mode]"]
J --> K["set syncRepState=WAIT_COMPLETE<br/>SetLatch on each woken proc"]
```
Figure 2 — How a standby reply releases waiters. The walsender computes
the synced position with the policy from synchronous_standby_names
(oldest-of-top-k for FIRST, k-th-largest for ANY), advances the per-mode
release cursor, and wakes every queued backend whose waitLSN has been
satisfied. (Flow from SyncRepReleaseWaiters / SyncRepGetSyncRecPtr /
SyncRepWakeQueue in syncrep.c.)
Candidate snapshot and per-walsender priority
SyncRepGetCandidateStandbys walks WalSndCtl->walsnds[], copying each
active walsender’s reported positions and sync_standby_priority out from
under that walsender’s spinlock into a private SyncRepStandbyData array.
A walsender qualifies only if it has a PID, is STREAMING or STOPPING,
has non-zero priority, and has a valid flush position. In priority mode, if
more than num_sync qualify, the array is sorted by priority and trimmed
to num_sync:
```c
// SyncRepGetCandidateStandbys (excerpt, src/backend/replication/syncrep.c)
if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY &&
    n > SyncRepConfig->num_sync)
{
    qsort(*standbys, n, sizeof(SyncRepStandbyData), standby_priority_comparator);
    n = SyncRepConfig->num_sync;    /* keep only the highest-priority ones */
}
```

Each walsender computes its own priority once at startup and after each
SIGHUP via SyncRepInitConfig → SyncRepGetStandbyPriority, which matches
the standby’s application_name against the parsed member_names list (or
a * wildcard). The position in the list is the priority for FIRST;
under ANY, every member has priority 1. Cascading walsenders always get
priority 0 (synchronous cascade replication is not supported):
```c
// SyncRepGetStandbyPriority (excerpt, src/backend/replication/syncrep.c)
if (am_cascading_walsender)
    return 0;
if (!SyncStandbysDefined() || SyncRepConfig == NULL)
    return 0;

standby_name = SyncRepConfig->member_names;
for (priority = 1; priority <= SyncRepConfig->nmembers; priority++)
{
    if (pg_strcasecmp(standby_name, application_name) == 0 ||
        strcmp(standby_name, "*") == 0)
    {
        found = true;
        break;
    }
    standby_name += strlen(standby_name) + 1;
}
if (!found)
    return 0;
return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ? priority : 1;
```

The race the status flag closes
synchronous_standby_names can be set to empty at runtime (SIGHUP). A
naive implementation would leave backends wedged forever on the queue with
nobody left to wake them. The checkpointer owns the
sync_standbys_status flag and reconciles it in
SyncRepUpdateSyncStandbysDefined: when the GUC transitions to empty, it
wakes all queues (SyncRepWakeQueue(true, i) for every mode) and clears
SYNC_STANDBY_DEFINED; the flag also gates joining the queue, so a
backend that hasn’t yet reloaded its config cannot enqueue after the
queues have been drained:
```c
// SyncRepUpdateSyncStandbysDefined (excerpt, src/backend/replication/syncrep.c)
if (!sync_standbys_defined)
{
    for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
        SyncRepWakeQueue(true, i);  /* all=true: drain the queue unconditionally */
}
WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
    (sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);
```

This is why SyncRepWaitForLSN re-checks SYNC_STANDBY_DEFINED under the
lock before enqueuing: the flag and the queue are manipulated atomically
under SyncRepLock, so there is no window where a backend joins a queue
that has just been drained.
Source Walkthrough
Symbols grouped by role. Files are under /data/hgryoo/references/postgres/.
Configuration: the two GUCs (xact.h, syncrep.h, syncrep_gram.y, syncrep.c)
- SyncCommitLevel (enum): SYNCHRONOUS_COMMIT_OFF … _REMOTE_APPLY; SYNCHRONOUS_COMMIT_ON aliases _REMOTE_FLUSH.
- synchronous_commit (GUC int): current level.
- SyncRepWaitMode (file-static int): internal mode derived from the level, one of SYNC_REP_NO_WAIT (-1) or SYNC_REP_WAIT_WRITE/FLUSH/APPLY (0/1/2).
- NUM_SYNC_REP_WAIT_MODE: the constant 3; dimensions the queues and cursors.
- SyncRepRequested() (macro): `max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH`.
- assign_synchronous_commit: GUC assign hook; maps level → SyncRepWaitMode.
- SyncRepConfigData (struct): parsed synchronous_standby_names, with fields num_sync, syncrep_method, nmembers, member_names[].
- SYNC_REP_PRIORITY / SYNC_REP_QUORUM: syncrep_method values.
- SyncStandbysDefined() (macro): SyncRepStandbyNames is non-empty.
- check_synchronous_standby_names / assign_synchronous_standby_names: GUC check/assign hooks; parse via syncrep_yyparse, publish SyncRepConfig.
The committing backend (syncrep.c, xact.c, proc.h)
- SyncRepWaitForLSN: commit-path entry; fast-path exit, enqueue, latch wait loop, cancellation handling.
- RecordTransactionCommit: the call site, SyncRepWaitForLSN(XactLastRecEnd, true), after the local XLogFlush.
- MyProc->waitLSN / ->syncRepState / ->syncRepLinks: per-backend wait record in PGPROC.
- SYNC_REP_NOT_WAITING / SYNC_REP_WAITING / SYNC_REP_WAIT_COMPLETE: syncRepState values.
- SyncRepQueueInsert: insert MyProc into a mode’s queue, sorted by waitLSN (scan from tail).
- SyncRepCancelWait: detach from queue under lock on interruption.
- SyncRepCleanupAtProcExit: detach on backend exit (lock-free pre-check).
Shared-memory wait structure (walsender_private.h)
- WalSndCtlData.SyncRepQueue[NUM_SYNC_REP_WAIT_MODE]: the three LSN-ordered queues.
- WalSndCtlData.lsn[NUM_SYNC_REP_WAIT_MODE]: per-mode “released up to” cursor.
- WalSndCtlData.sync_standbys_status: SYNC_STANDBY_INIT / SYNC_STANDBY_DEFINED bits.
- SyncRepQueueIsOrderedByLSN (assert-only): validates the sort invariant.
The walsender release side (syncrep.c)
- SyncRepReleaseWaiters: called from ProcessStandbyReplyMessage; advances cursors and wakes queues.
- SyncRepGetSyncRecPtr: compute synced write/flush/apply; dispatch on syncrep_method; set am_sync.
- SyncRepGetOldestSyncRecPtr: priority method; minimum LSN over candidates.
- SyncRepGetNthLatestSyncRecPtr: quorum method; k-th largest LSN (qsort descending).
- cmp_lsn: descending XLogRecPtr comparator.
- SyncRepGetCandidateStandbys: snapshot WalSnd[]; filter active/streaming/priority/valid-flush; trim to num_sync in priority mode.
- standby_priority_comparator: sort by priority, tie-break by walsnd_index.
- SyncRepWakeQueue: walk a queue from head, wake satisfied waiters, set WAIT_COMPLETE.
- SyncRepInitConfig / SyncRepGetStandbyPriority: per-walsender priority computation at startup / SIGHUP.
- WalSnd->sync_standby_priority: this walsender’s priority (0 = not a sync candidate).
- SyncRepStandbyData (struct): private copy of a candidate’s write/flush/apply/priority/is_me.
Checkpointer reconciliation (syncrep.c)
- SyncRepUpdateSyncStandbysDefined: maintain sync_standbys_status; drain queues when the GUC goes empty.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| SyncCommitLevel (enum) | src/include/access/xact.h | 68 |
| SYNCHRONOUS_COMMIT_ON (macro) | src/include/access/xact.h | 80 |
| SyncRepRequested (macro) | src/include/replication/syncrep.h | 18 |
| SYNC_REP_WAIT_WRITE (macro) | src/include/replication/syncrep.h | 23 |
| NUM_SYNC_REP_WAIT_MODE (macro) | src/include/replication/syncrep.h | 27 |
| SYNC_REP_NOT_WAITING (macro) | src/include/replication/syncrep.h | 30 |
| SYNC_REP_PRIORITY (macro) | src/include/replication/syncrep.h | 35 |
| SyncRepStandbyData (struct) | src/include/replication/syncrep.h | 42 |
| SyncRepConfigData (struct) | src/include/replication/syncrep.h | 63 |
| WalSndCtlData.SyncRepQueue | src/include/replication/walsender_private.h | 90 |
| WalSndCtlData.lsn | src/include/replication/walsender_private.h | 96 |
| WalSndCtlData.sync_standbys_status | src/include/replication/walsender_private.h | 103 |
| SYNC_STANDBY_INIT (macro) | src/include/replication/walsender_private.h | 125 |
| SYNC_STANDBY_DEFINED (macro) | src/include/replication/walsender_private.h | 132 |
| PGPROC.waitLSN | src/include/storage/proc.h | 267 |
| PGPROC.syncRepState | src/include/storage/proc.h | 268 |
| PGPROC.syncRepLinks | src/include/storage/proc.h | 269 |
| SyncRepWaitForLSN | src/backend/replication/syncrep.c | 148 |
| SyncRepQueueInsert | src/backend/replication/syncrep.c | 372 |
| SyncRepCancelWait | src/backend/replication/syncrep.c | 406 |
| SyncRepCleanupAtProcExit | src/backend/replication/syncrep.c | 416 |
| SyncRepInitConfig | src/backend/replication/syncrep.c | 445 |
| SyncRepReleaseWaiters | src/backend/replication/syncrep.c | 474 |
| SyncRepGetSyncRecPtr | src/backend/replication/syncrep.c | 586 |
| SyncRepGetOldestSyncRecPtr | src/backend/replication/syncrep.c | 660 |
| SyncRepGetNthLatestSyncRecPtr | src/backend/replication/syncrep.c | 693 |
| cmp_lsn | src/backend/replication/syncrep.c | 738 |
| SyncRepGetCandidateStandbys | src/backend/replication/syncrep.c | 754 |
| standby_priority_comparator | src/backend/replication/syncrep.c | 833 |
| SyncRepGetStandbyPriority | src/backend/replication/syncrep.c | 860 |
| SyncRepWakeQueue | src/backend/replication/syncrep.c | 907 |
| SyncRepUpdateSyncStandbysDefined | src/backend/replication/syncrep.c | 964 |
| SyncRepQueueIsOrderedByLSN | src/backend/replication/syncrep.c | 1024 |
| check_synchronous_standby_names | src/backend/replication/syncrep.c | 1058 |
| assign_synchronous_standby_names | src/backend/replication/syncrep.c | 1118 |
| assign_synchronous_commit | src/backend/replication/syncrep.c | 1124 |
| SyncRepWaitForLSN call site | src/backend/access/transam/xact.c | 1557 |
Source verification (as of 2026-06-05)
All claims below were re-read against REL_18_STABLE at commit 273fe94
on 2026-06-05. Symbol line numbers are in the position-hint table above.
Verified facts
- All synchronous-replication logic runs on the primary; standbys are unaware they are synchronous. Verified in the syncrep.c file header: “All code in this module executes on the primary. … it isolates all logic about waiting/releasing onto the primary. The primary defines which standbys it wishes to wait for. The standbys are completely unaware of the durability requirements of transactions on the primary.” A standby’s walreceiver sends the same write/flush/apply feedback regardless of whether it is named in any primary’s synchronous_standby_names.
- The wait is inserted after the local WAL flush, never before. Verified at the call site in RecordTransactionCommit (xact.c): the local XLogFlush of XactLastRecEnd and the CLOG status update precede the SyncRepWaitForLSN(XactLastRecEnd, true) call. A primary crash mid-wait therefore leaves the transaction recoverable locally even though the client never received an acknowledgement.
- synchronous_commit has five levels; on is remote_flush. Verified in SyncCommitLevel (xact.h): OFF, LOCAL_FLUSH, REMOTE_WRITE, REMOTE_FLUSH, REMOTE_APPLY, with #define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH. assign_synchronous_commit maps only the three REMOTE_* levels to a real wait mode; off and local yield SYNC_REP_NO_WAIT (-1).
- There are exactly three wait modes and three queues. Verified: NUM_SYNC_REP_WAIT_MODE == 3 (syncrep.h); WalSndCtlData carries dlist_head SyncRepQueue[NUM_SYNC_REP_WAIT_MODE] and XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] (walsender_private.h). The three modes are released independently in SyncRepReleaseWaiters.
- SyncRepWaitForLSN has a lock-free fast path keyed on SyncRepRequested() and sync_standbys_status. Verified: the function returns immediately when !SyncRepRequested() or when the status flag equals exactly SYNC_STANDBY_INIT (initialized, but no sync standby defined). Only if neither short-circuit fires does it take SyncRepLock exclusively and re-check SYNC_STANDBY_DEFINED.
- The wait queues are kept strictly ordered by ascending waitLSN. Verified in SyncRepQueueInsert (reverse scan from the tail, insert after the first element with a smaller waitLSN) and asserted on every insert/wake by SyncRepQueueIsOrderedByLSN, which also asserts no two waiters share an LSN. This invariant is what makes SyncRepWakeQueue stop at the first unsatisfied waiter.
- FIRST k (priority) uses the oldest LSN of the top-k candidates; ANY k (quorum) uses the k-th largest LSN. Verified: SyncRepGetSyncRecPtr dispatches on SyncRepConfig->syncrep_method, calling SyncRepGetOldestSyncRecPtr (minimum over candidates) for SYNC_REP_PRIORITY and SyncRepGetNthLatestSyncRecPtr (qsort descending via cmp_lsn, take index nth - 1) for SYNC_REP_QUORUM. In priority mode SyncRepGetCandidateStandbys has already trimmed the candidate set to the num_sync highest-priority members.
- A candidate walsender must be active, streaming/stopping, have non-zero sync priority, and a valid flush LSN. Verified in SyncRepGetCandidateStandbys: it continues past any walsender with pid == 0, with state other than WALSNDSTATE_STREAMING / WALSNDSTATE_STOPPING, with sync_standby_priority == 0, or with XLogRecPtrIsInvalid(flush). Cascading walsenders get priority 0 in SyncRepGetStandbyPriority (am_cascading_walsender → return 0), so synchronous cascade replication is unsupported.
- Release cursors advance monotonically forward only. Verified in SyncRepReleaseWaiters: each walsndctl->lsn[mode] is updated only when the newly computed position is strictly greater (< guards), so a late or reordered standby reply cannot rewind a release point.
- Cancellation never aborts the transaction; it emits a WARNING and proceeds. Verified in the SyncRepWaitForLSN wait loop. On ProcDiePending it raises WARNING “canceling the wait for synchronous replication and terminating connection due to administrator command” with detail “The transaction has already committed locally, but might not have been replicated to the standby.”, sets whereToSendOutput = DestNone, and calls SyncRepCancelWait. On QueryCancelPending it clears the flag and raises a parallel WARNING (“canceling wait for synchronous replication due to user request”). On WL_POSTMASTER_DEATH it sets ProcDiePending and cancels. None of these paths roll back.
- When synchronous_standby_names is cleared at runtime, every queue is drained unconditionally. Verified in SyncRepUpdateSyncStandbysDefined: when !sync_standbys_defined it calls SyncRepWakeQueue(true, i) for every mode (the all = true argument bypasses the LSN check and wakes every waiter), then clears SYNC_STANDBY_DEFINED in sync_standbys_status. The flag is also re-checked under SyncRepLock in SyncRepWaitForLSN before enqueue, so no backend can join a queue that was just drained.
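The queue-ordering and forward-only-cursor facts above fit together in a small model. The sketch below is a simplified standalone illustration, not the PostgreSQL implementation: it uses a hand-rolled doubly linked list instead of dlist, has no locking or latches, and every name in it (`SketchWaiter`, `SketchQueue`, `sketch_queue_insert`, `sketch_queue_release`) is hypothetical. It also folds the cursor update and the wake pass into one function for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SketchLsn;     /* hypothetical stand-in for XLogRecPtr */

typedef struct SketchWaiter
{
    SketchLsn   wait_lsn;       /* commit LSN this waiter needs confirmed */
    int         done;           /* set to 1 when released */
    struct SketchWaiter *prev;
    struct SketchWaiter *next;
} SketchWaiter;

typedef struct SketchQueue
{
    SketchWaiter *head;         /* smallest wait_lsn */
    SketchWaiter *tail;         /* largest wait_lsn */
    SketchLsn   released;       /* release cursor; only moves forward */
} SketchQueue;

/* Scan backward from the tail; insert after the first smaller wait_lsn. */
void
sketch_queue_insert(SketchQueue *q, SketchWaiter *w)
{
    SketchWaiter *pos = q->tail;

    assert(w != NULL);
    while (pos != NULL && pos->wait_lsn >= w->wait_lsn)
        pos = pos->prev;

    w->done = 0;
    w->prev = pos;
    w->next = (pos != NULL) ? pos->next : q->head;
    if (w->next != NULL)
        w->next->prev = w;
    else
        q->tail = w;
    if (pos != NULL)
        pos->next = w;
    else
        q->head = w;
}

/*
 * Advance the cursor only if 'synced' is strictly greater, then release
 * waiters from the head, stopping at the first one still unsatisfied.
 * Returns the number of waiters released.
 */
int
sketch_queue_release(SketchQueue *q, SketchLsn synced)
{
    int         woken = 0;

    if (synced <= q->released)
        return 0;               /* late or reordered reply: no rewind */
    q->released = synced;

    while (q->head != NULL && q->head->wait_lsn <= synced)
    {
        SketchWaiter *w = q->head;

        q->head = w->next;
        if (q->head != NULL)
            q->head->prev = NULL;
        else
            q->tail = NULL;
        w->prev = w->next = NULL;
        w->done = 1;
        woken++;
    }
    return woken;
}
```

Because the list is sorted, the release loop never needs to look past the first waiter whose LSN exceeds the cursor, and because commits mostly arrive in increasing LSN order, the backward scan in the insert usually terminates after one comparison.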
Inference (not line-anchored)
- The pairing of pg_write_barrier() in SyncRepWakeQueue (after dlist_delete_thoroughly, before setting SYNC_REP_WAIT_COMPLETE) with pg_read_barrier() in SyncRepWaitForLSN (after observing SYNC_REP_WAIT_COMPLETE) is the lock-free handoff that lets the woken backend read syncRepState without SyncRepLock and still see itself detached from the queue. Both barriers are present in the source; the memory-ordering argument is the documented intent, restated here.
- The “most commits append at the tail” rationale for scanning SyncRepQueueInsert from the tail backward is a performance observation consistent with the reverse-scan code, not a separately measured claim.
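The barrier pairing in the first inference can be illustrated with portable C11 atomics. This is a sketch under stated assumptions, not the PostgreSQL code: `atomic_thread_fence` with release/acquire ordering is used here as an analogue of pg_write_barrier() / pg_read_barrier(), a plain `on_queue` flag stands in for the queue links, and all `Sketch*` / `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum
{
    SKETCH_NOT_WAITING,
    SKETCH_WAITING,
    SKETCH_WAIT_COMPLETE
};

typedef struct SketchWaiter
{
    bool        on_queue;       /* plain field: stands in for queue links */
    atomic_int  state;          /* stands in for syncRepState */
} SketchWaiter;

/* Waiter side: mark ourselves queued and waiting. */
void
sketch_waiter_init(SketchWaiter *w)
{
    w->on_queue = true;
    atomic_init(&w->state, SKETCH_WAITING);
}

/*
 * Waker side: detach first, then publish completion. The release fence
 * keeps the detach from being reordered after the state store.
 */
void
sketch_wake(SketchWaiter *w)
{
    assert(w->on_queue);
    w->on_queue = false;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&w->state, SKETCH_WAIT_COMPLETE,
                          memory_order_relaxed);
}

/*
 * Waiter side, no lock held: once WAIT_COMPLETE is observed, the acquire
 * fence guarantees the earlier detach is visible too.
 */
bool
sketch_try_finish(SketchWaiter *w)
{
    if (atomic_load_explicit(&w->state, memory_order_relaxed)
        != SKETCH_WAIT_COMPLETE)
        return false;
    atomic_thread_fence(memory_order_acquire);
    return !w->on_queue;
}
```

The ordering matters in one direction only: the waker must finish unlinking before the state change becomes visible, and the waiter must not read the link state until after it has seen the state change.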
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Semi-synchronous replication (DDIA ch. 5). PostgreSQL’s FIRST 1 (...) is exactly Kleppmann’s semi-synchronous configuration: one follower is synchronous, the rest asynchronous, and if the synchronous follower stalls, another is promoted to take its place (announce_next_takeover in SyncRepReleaseWaiters is the seam where a higher-priority standby reclaims the synchronous slot). A focused note mapping the DDIA “single synchronous follower” failure modes onto PostgreSQL’s priority-takeover logic would sharpen the availability story. (raw/system/textbooks/Designing Data Intensive...pdf.)
- Quorum replication vs. Dynamo-style R + W > N. ANY k (s1..sN) is a write quorum of size k over N candidate standbys, but PostgreSQL fixes the read side at “read the primary” rather than offering tunable read quorums. Comparing SyncRepGetNthLatestSyncRecPtr’s “k-th largest LSN” rule against Dynamo’s sloppy-quorum + hinted-handoff model frames what PostgreSQL gives up (no leaderless writes, no read-repair) for a single authoritative primary. Paper: dbms-papers/dynamo.md.
- No consensus: PostgreSQL sync rep is not Raft/Paxos. A committing backend waits for k acknowledgements, but there is no leader election, no log-matching safety property, and no automatic failover in core; the primary simply blocks. On a primary crash the cluster needs an external arbiter (Patroni, repmgr, pg_auto_failover) to promote a standby. The contrast with Raft (where the commit index is the replicated agreement, and leadership is part of the protocol) is the cleanest way to explain why “synchronous_commit guarantees durability, not availability.” Papers: dbms-papers/raft.md, dbms-papers/paxos.md.
- CAP positioning. Synchronous replication trades A for C under partition: if the required standbys are unreachable, committing backends block indefinitely rather than acknowledge a possibly-lost write. This is the CP corner of the dbms-papers/cap.md triangle, made operational by the fact that the wait has no timeout (only query-cancel / termination / postmaster-death exits). A note on “why there is no synchronous_replication_timeout”, including the explicit project decision that a timeout would silently downgrade durability, belongs here.
- ARIES and the commit-record LSN. The whole mechanism rests on the ARIES-style WAL invariant (dbms-papers/aries.md, and postgres-xlog-wal.md): a commit produces a single WAL record ending at XactLastRecEnd, and “replicated” reduces to “the standby’s reported LSN ≥ XactLastRecEnd.” Synchronous replication adds no new log semantics; it only delays the client acknowledgement until the existing redo stream is known to be remote-durable.
- Apply-level waits and read-your-writes on standbys. remote_apply is the one level that makes a query routed to the standby causally consistent with the commit, at the cost of waiting for standby replay (the slowest of the three feedback points). Comparing this against systems that expose explicit session-level “read your writes” tokens (LSN handoff to the application) would show where PostgreSQL stops: it offers the cluster-wide guarantee but no per-session causal token in core.
Sources
In-tree source files (REL_18_STABLE, commit 273fe94)
- src/backend/replication/syncrep.c (the entire module): SyncRepWaitForLSN (commit-path wait, fast path, latch loop, cancellation), SyncRepQueueInsert / SyncRepWakeQueue / SyncRepQueueIsOrderedByLSN (the LSN-ordered queues), SyncRepReleaseWaiters / SyncRepGetSyncRecPtr / SyncRepGetOldestSyncRecPtr / SyncRepGetNthLatestSyncRecPtr / SyncRepGetCandidateStandbys (the release side and FIRST/ANY policy), SyncRepGetStandbyPriority / SyncRepInitConfig (per-walsender priority), SyncRepUpdateSyncStandbysDefined (checkpointer reconciliation), assign_synchronous_commit / check_synchronous_standby_names / assign_synchronous_standby_names (GUC hooks).
- src/backend/replication/syncrep_gram.y and syncrep_scanner.l: the synchronous_standby_names grammar producing SyncRepConfigData with SYNC_REP_PRIORITY (FIRST / bare list) or SYNC_REP_QUORUM (ANY).
- src/include/replication/syncrep.h: SyncRepRequested / SyncStandbysDefined macros, the SYNC_REP_WAIT_* and NUM_SYNC_REP_WAIT_MODE constants, the SYNC_REP_NOT_WAITING / WAITING / WAIT_COMPLETE states, SyncRepStandbyData, and SyncRepConfigData.
- src/include/replication/walsender_private.h: WalSndCtlData (the SyncRepQueue[], lsn[], sync_standbys_status fields) and the SYNC_STANDBY_INIT / SYNC_STANDBY_DEFINED flag bits.
- src/include/storage/proc.h: PGPROC.waitLSN / syncRepState / syncRepLinks, the per-backend wait record.
- src/include/access/xact.h: SyncCommitLevel enum and SYNCHRONOUS_COMMIT_ON alias.
- src/backend/access/transam/xact.c: RecordTransactionCommit, the single in-core call site of SyncRepWaitForLSN.
Papers and textbook chapters
- Designing Data-Intensive Applications (Kleppmann 2017), ch. 5 “Replication”: synchronous vs. asynchronous followers, the semi-synchronous middle ground that FIRST 1 (...) implements. (raw/system/textbooks/.)
- DeCandia et al. (2007), “Dynamo.” dbms-papers/dynamo.md: quorum replication (R + W > N) as the comparison point for ANY k.
- Ongaro & Ousterhout (2014), “Raft”; Lamport, “Paxos.” dbms-papers/raft.md, dbms-papers/paxos.md: consensus replication, the contrast that explains why sync rep is durability-not-availability.
- Gilbert & Lynch (2002), CAP. dbms-papers/cap.md: the CP positioning of a no-timeout synchronous wait.
- Mohan et al. (1992), “ARIES.” dbms-papers/aries.md: the WAL / commit-record-LSN foundation the wait is keyed on.
- Database Internals (Petrov 2019), ch. on replication and consensus: general framing of leader-based log replication. (knowledge/research/dbms-general/database-internals.md.)
Sibling docs (cross-references — mechanism owned there, not duplicated here)
- postgres-wal-sender-receiver.md: the streaming transport and ProcessStandbyReplyMessage, which calls SyncRepReleaseWaiters. The write/flush/apply feedback positions originate there; this doc only consumes them.
- postgres-xact.md: RecordTransactionCommit and the commit path that calls SyncRepWaitForLSN after local flush.
- postgres-xlog-wal.md: XactLastRecEnd, XLogFlush, and the LSN-as-byte-offset model the entire wait is built on.
- postgres-replication-slots.md: slot-based standby tracking, orthogonal to but often co-deployed with synchronous standbys.
- postgres-overview-replication-ha.md: the axis-level overview where synchronous replication sits among the replication/HA subsystems.