PostgreSQL Replication — From WAL File Shipping to Logical Pub/Sub
PostgreSQL never added replication so much as it kept finding new layers of its own write-ahead log to reuse. Every era in this story reuses the same WAL stream that already existed for crash recovery, and re-exposes it at a finer grain: first as whole 16 MB segment files copied by an external script, then as a continuous byte stream pushed over libpq, then as a decoded sequence of logical row changes that a subscriber can apply through ordinary SQL. The through-line is “the WAL is already the truth; find a cheaper, more selective way to ship it.”
This document traces how replication changed across major releases and why, and ends at the REL_18 design. It does not re-derive the mechanism of any single subsystem — for that, follow the cross-links to the current-state module docs at each era.
Contents:
- Why this subsystem had to evolve: the original 8.x limitation
- Timeline: eras and the releases that introduced them
- 8.x: Warm standby via WAL file shipping (restore_command, pg_standby)
- 9.0: Streaming replication + Hot Standby (walsender/walreceiver, read-only standbys)
- 9.1–9.2: Synchronous replication, then cascading (syncrep, standby-of-a-standby)
- 9.4: Logical decoding + replication slots (reorder buffer, snapshot builder, slot bookkeeping)
- 10: Built-in logical replication (pub/sub): PUBLICATION/SUBSCRIPTION, pgoutput, apply workers
- 14–15: Selective and transactional logical replication (two-phase, row filters, column lists)
- 17: Failover slots and slot synchronization (synced slots, slotsync worker)
- Where it stands at REL_18: the current design and the PG19 forward note
- Sources
Why this subsystem had to evolve (the original limitation)
The original WAL existed for exactly one purpose: crash recovery. After a crash, the startup process replays WAL records from the last checkpoint forward to bring the data files back to a consistent state. That replay loop is the entire substrate everything below is built on. The insight that drove a decade of replication features is that the same replay loop that recovers a crashed server can keep replaying forever on a second server fed a copy of the WAL.
But in the 8.x era the only way to get the WAL to a second server was the
archive: archive_command copied each completed 16 MB segment somewhere, and a
standby’s restore_command fetched and replayed it. This worked, and it was a
genuine HA story for its day, but it had hard structural limits:
- Granularity is a whole segment. A standby could not see a transaction
until the entire 16 MB WAL segment containing it had filled and been
archived. Replication lag was therefore bounded below by “how long until the
current segment fills,” which on a quiet system could be minutes or (with
archive_timeout) whatever timeout you forced.
- The standby was useless while it ran. A server in recovery could not accept connections at all. The replica was a cold spare: it burned a whole machine producing zero read capacity.
- No back-pressure, no feedback. The primary shipped files into the archive blind. It had no idea whether any standby had consumed them, how far behind a standby was, or whether a segment it was about to recycle was still needed downstream.
- Physical only, all-or-nothing. You replicated the entire cluster, byte for byte, to an identical-version, identical-architecture standby. You could not replicate one database, one table, or filter rows; you could not replicate across major versions; you could not have the replica also accept writes.
Each subsequent era chips away at one of these limits. Streaming attacks granularity and the cold spare; slots attack the blind-shipping problem; logical decoding and pub/sub attack the “physical, all-or-nothing” constraint; the 17-era work attacks the durability gap that logical slots opened up on failover.
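The segment-granularity floor on lag can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope model: the function name and the steady WAL rate are assumptions for illustration, and it ignores the time spent copying and replaying the file.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # the 16 MB segment size discussed above


def min_shipping_lag_seconds(wal_bytes_per_sec, archive_timeout=None):
    """Longest wait before a record written at the start of a segment can be archived.

    wal_bytes_per_sec: assumed steady WAL generation rate (hypothetical input).
    archive_timeout:   forced segment-switch interval in seconds, or None if unset.
    """
    if wal_bytes_per_sec <= 0:
        fill_time = float("inf")  # an idle server never fills the segment
    else:
        fill_time = SEGMENT_BYTES / wal_bytes_per_sec
    if archive_timeout is None:
        return fill_time
    return min(fill_time, archive_timeout)


busy = min_shipping_lag_seconds(4 * 1024 * 1024)                 # 4 MiB/s of WAL
quiet = min_shipping_lag_seconds(16 * 1024, archive_timeout=60)  # 16 KiB/s, 60 s timeout
```

Under these assumed rates the busy server ships a segment every few seconds, while the quiet one is governed entirely by the timeout, which is the trade-off the first bullet describes.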
Timeline
timeline
title PostgreSQL Replication Evolution
section File era
8.2 / 8.3 : Continuous archiving<br/>restore_command : pg_standby contrib<br/>warm standby
section Streaming era
9.0 : Streaming replication<br/>walsender / walreceiver : Hot Standby<br/>read-only queries on standby
9.1 : Synchronous replication<br/>synchronous_standby_names : pg_basebackup<br/>streaming-only setup
9.2 : Cascading replication<br/>standby feeds standby : Streaming-only standby<br/>no archive required
section Logical era
9.4 : Logical decoding<br/>reorder buffer + snapshot builder : Replication slots<br/>physical and logical
section Pub/Sub era
10 : Built-in logical replication<br/>PUBLICATION / SUBSCRIPTION : pgoutput plugin<br/>apply + tablesync workers
14 : Two-phase decoding<br/>output plugin PREPARE : Streaming of in-progress txns<br/>large transactions
15 : Row filters + column lists<br/>selective replication : Two-phase subscriptions<br/>two_phase = true
16 : Parallel apply<br/>large streamed transactions
section Resilience era
17 : Failover slots<br/>failover = true : Slot synchronization<br/>slotsync worker on standby
18 : Current state<br/>REL_18_STABLE : (PG19 next: more logical DDL / sequences)
8.x — Warm standby via WAL file shipping
Released: continuous archiving and point-in-time recovery landed in 8.0;
the warm-standby pattern — a server that stays permanently in recovery,
continuously replaying archived segments — was documented and made practical in
8.2, with the pg_standby contrib helper added in 8.3.
What changed. PostgreSQL 8.0 introduced archive_command: when a WAL segment
fills, the server runs an operator-supplied shell command to copy that 16 MB file
somewhere durable (an NFS mount, another host, tape). The mirror image is
restore_command in a recovery.conf on the standby: a permanently-recovering
server that, each time it exhausts the WAL it has, runs that command to fetch
the next segment and replays it. pg_standby (8.3) was a contrib program written
specifically to be a smart restore_command — it would wait for the next segment
to appear rather than declaring end-of-recovery, and it cleaned up consumed
segments.
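The "wait for the next segment instead of ending recovery" behaviour can be sketched as a small polling loop. This is a toy stand-in, not pg_standby itself: the function name, the polling parameters, and the give-up rule are assumptions made so the example terminates.

```python
import os
import shutil
import tempfile
import time


def wait_and_restore(archive_dir, segment_name, dest_path, poll_interval=0.01, max_polls=100):
    """Toy stand-in for a waiting restore_command.

    Returns True once the segment has been copied to dest_path, or False if it
    never appeared within max_polls (a real helper would keep waiting or watch
    for a failover trigger instead of giving up).
    """
    source = os.path.join(archive_dir, segment_name)
    for _ in range(max_polls):
        if os.path.exists(source):
            shutil.copyfile(source, dest_path)
            return True
        time.sleep(poll_interval)
    return False


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "archive")
    os.mkdir(archive)
    segment = "000000010000000000000001"  # WAL segment file names are 24 hex digits
    with open(os.path.join(archive, segment), "wb") as f:
        f.write(b"\0" * 1024)  # placeholder bytes, not real WAL
    restored = wait_and_restore(archive, segment, os.path.join(tmp, "restored_segment"))
    missing = wait_and_restore(archive, "000000010000000000000002",
                               os.path.join(tmp, "restored_segment"), max_polls=3)
```

The sketch deliberately skips the hard parts, such as detecting a half-copied file and cleaning up segments that have already been replayed.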
Why. This was the first time the WAL — which had existed since 7.1 purely for crash recovery — was deliberately reused as a replication transport. The recovery replay loop was already there and already correct; warm standby simply arranged to never let it finish.
Structural shape — before → after.
flowchart LR
subgraph Primary8x["Primary 8.x"]
BE["backends<br/>write WAL"] --> WALSEG["WAL segment<br/>16 MB file fills"]
WALSEG -->|archive_command| ARCH["archive<br/>NFS / scp / tape"]
end
subgraph Standby8x["Warm standby 8.x"]
REST["restore_command<br/>pg_standby"] --> STARTUP["startup process<br/>redo loop"]
STARTUP --> DATA["data files<br/>kept consistent"]
end
ARCH -.->|poll for next<br/>16 MB segment| REST
STARTUP -.->|cannot accept<br/>connections| X(("read<br/>blocked"))
The defining properties are visible in the diagram: the unit of transfer is a
whole segment file, the link between the two servers is an out-of-band archive
(PostgreSQL processes never talk to each other), and the standby is in pure
recovery — X shows it refusing all client connections. There is no socket
between primary and standby, no feedback channel, and no way to query the standby.
Why it had to change. Three pain points dominated: minutes-scale lag bounded by segment fill time, a standby that produced zero read capacity, and a primary that shipped segments completely blind to whether anyone consumed them. The next two releases attacked all three.
Cross-link: the segment/archive machinery this era introduced is still the fallback path for streaming today, and is documented in postgres-archiving-walsummary.md and the WAL format in postgres-xlog-wal.md.
9.0 — Streaming replication + Hot Standby
Released: PostgreSQL 9.0, arguably the single most important release in this entire arc. It shipped two complementary features that together turned the cold file-shipping spare into a live, queryable, low-lag replica.
What changed — streaming replication. Instead of waiting for a 16 MB segment to fill and be archived, the standby now opens a replication connection over the normal libpq wire protocol directly to the primary and streams WAL as it is generated, record by record. Two new processes appear:
- On the primary, a walsender — a special backend, forked per replication connection, that reads WAL and pushes it down the socket.
- On the standby, a walreceiver — a process that connects out to the primary’s walsender, receives the WAL stream, writes it to local WAL, and wakes the startup process to replay it.
This is the birth of src/backend/replication/walsender.c and
src/backend/replication/walreceiver.c, both of which are still the transport
core in REL_18. The replication protocol got new message types
(START_REPLICATION, the XLogData / 'w' byte stream, keepalives) layered on
top of the wire protocol.
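As a rough illustration of the XLogData framing, the sketch below packs and parses a message using the layout given in the current streaming replication protocol documentation: a 'w' tag followed by three 64-bit big-endian integers and then the WAL bytes. The helper names are hypothetical, and the exact layout used by the original 9.0 release may have differed.

```python
import struct

# 'w' tag, then three big-endian 64-bit integers: WAL start position,
# current end of WAL on the server, and the server's send timestamp.
XLOGDATA_HEADER = struct.Struct("!cQQq")


def pack_xlogdata(start_lsn, end_lsn, send_time, payload):
    """Build one XLogData message body (hypothetical helper)."""
    return XLOGDATA_HEADER.pack(b"w", start_lsn, end_lsn, send_time) + payload


def unpack_xlogdata(message):
    """Split an XLogData message body into its header fields and WAL payload."""
    tag, start_lsn, end_lsn, send_time = XLOGDATA_HEADER.unpack_from(message)
    if tag != b"w":
        raise ValueError("not an XLogData message")
    return start_lsn, end_lsn, send_time, message[XLOGDATA_HEADER.size:]


def format_lsn(lsn):
    """Render a 64-bit WAL position in the usual hi/lo hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


msg = pack_xlogdata(0x16B3748, 0x16B3800, 0, b"wal-bytes")
start, end, sent, data = unpack_xlogdata(msg)
```

On the wire this body travels inside a CopyData message; the sketch covers only the inner framing.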
What changed: Hot Standby. This was a separate, equally large change. A server
in recovery can now accept read-only connections and run queries while it
replays. This required teaching recovery to maintain a running snapshot of the
transactions in progress on the primary (so the standby knows which XIDs are
visible), to manage recovery conflicts (a query on the standby vs. a VACUUM
cleaning up rows the query still needs), and to expose
max_standby_streaming_delay to tune how long replay waits for a conflicting
query. hot_standby_feedback, which lets the standby ask the primary to hold
back that cleanup, followed in 9.1.
Why. Streaming kills the granularity limit (lag drops from minutes to milliseconds) and removes the dependence on an external archive for the live path. Hot Standby kills the cold-spare waste — the replica now serves read traffic, turning HA hardware into read-scaling hardware.
Structural shape — before → after.
flowchart LR
subgraph Primary90["Primary 9.0"]
BE["backends"] --> WAL["local WAL"]
WAL --> WS["walsender<br/>per connection"]
end
subgraph Standby90["Hot Standby 9.0"]
WR["walreceiver"] --> SWAL["standby WAL"]
SWAL --> SU["startup process<br/>continuous redo"]
SU --> SDATA["data files"]
RO["read-only<br/>backends"] --> SDATA
end
WS -->|libpq replication conn<br/>XLogData stream| WR
WR -.->|standby reply<br/>flush/apply LSN| WS
Compare with the 8.x diagram: the dotted out-of-band archive is replaced by a
direct socket (walsender → walreceiver), the unit of transfer is a WAL record
rather than a 16 MB file, a feedback channel now exists (the standby reports its
received/flushed/applied LSN back to the walsender, a reply mechanism that
matured in 9.1 alongside synchronous replication), and, crucially, the
read-only backends (RO) now hang off the standby’s data files. The archive path
still exists as a fallback when the standby falls too far behind and the
primary has recycled the needed segment.
Why it still had to evolve. 9.0 streaming was asynchronous — the primary committed without waiting for the standby, so a primary crash could lose the last few committed transactions that hadn’t reached the replica. And the topology was flat: every standby connected directly to the primary, so N standbys meant N walsenders all reading WAL on the primary. The next two releases fixed both.
Cross-link: the modern walsender/walreceiver transport, replication protocol messages, and standby reply feedback are documented in postgres-wal-sender-receiver.md. Recovery redo and Hot Standby conflict handling are in postgres-recovery-redo.md.
9.1–9.2 — Synchronous replication, then cascading
These two releases harden the topology that 9.0 created: 9.1 adds a durability guarantee, and 9.2 adds fan-out without piling all the load on the primary.
9.1 — Synchronous replication
What changed. 9.1 added synchronous_standby_names and the
synchronous_commit levels that interact with it. When a standby is named as
synchronous, a committing transaction on the primary does not return success to
the client until the standby has confirmed it received (and, depending on the
level, flushed or applied) the commit’s WAL. This is implemented by making the
committing backend block in a wait queue after writing its commit record, and
having the walsender wake it once the standby’s reply LSN advances past the
commit LSN. This is the birth of src/backend/replication/syncrep.c, still the
core of the feature in REL_18.
Why. 9.0 streaming was asynchronous: a primary crash could silently lose the last handful of committed transactions that had not yet crossed the wire. Workloads that needed “if the client was told it committed, it survives a primary failure” had no answer. Synchronous replication trades a round-trip of commit latency for zero data loss on failover.
The cost is deliberately tunable through synchronous_commit: off (don’t even
wait for local flush), local (wait for local flush only, ignoring standbys),
on (standby flushed to disk), and, added in later releases, remote_write
(standby received the WAL and wrote it to its OS; 9.2) and remote_apply
(standby has replayed it, so a read on the standby will see it; 9.6). The whole
point is that durability is now a per-transaction dial, not a cluster-wide mode.
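The wait-queue idea (a backend blocks after writing its commit record and is woken once the standby's acknowledged LSN passes that commit) can be modelled in a few lines. This is a toy under stated assumptions: the class, its method names, and the single-standby simplification are hypothetical, and it says nothing about how syncrep.c is actually structured.

```python
import heapq


class SyncRepQueue:
    """Toy model: committing backends wait until the standby confirms their commit LSN."""

    def __init__(self):
        self._waiters = []      # min-heap of (commit_lsn, backend_name)
        self.confirmed_lsn = 0  # highest LSN the standby has acknowledged

    def wait_for(self, backend_name, commit_lsn):
        """Called after a backend writes its commit record. Returns True if it must block."""
        if commit_lsn <= self.confirmed_lsn:
            return False
        heapq.heappush(self._waiters, (commit_lsn, backend_name))
        return True

    def standby_reply(self, lsn):
        """Called when a standby reply arrives; returns the backends that can now be woken."""
        self.confirmed_lsn = max(self.confirmed_lsn, lsn)
        released = []
        while self._waiters and self._waiters[0][0] <= self.confirmed_lsn:
            released.append(heapq.heappop(self._waiters)[1])
        return released


queue = SyncRepQueue()
queue.wait_for("backend-a", 100)
queue.wait_for("backend-b", 250)
queue.wait_for("backend-c", 180)
first_wake = queue.standby_reply(200)   # releases every waiter at or below LSN 200
second_wake = queue.standby_reply(300)  # releases the remaining waiter
```

Which reply position counts as "confirmed" (written, flushed, or applied) is what the synchronous_commit level selects; the model collapses that into a single number.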
9.2 — Cascading replication
What changed. 9.2 let a standby run its own walsender and feed WAL onward to
further standbys. A standby is no longer a leaf; it can be an interior node in a
replication tree. 9.2 also made a standby able to run entirely from streaming with
no archive at all, and pg_basebackup (introduced 9.1) made cloning a standby a
single built-in command instead of a manual filesystem copy plus
pg_start_backup/pg_stop_backup dance.
Why. In the flat 9.0 topology, every standby connected directly to the primary, so each one added a walsender reading WAL and consuming network bandwidth on the primary. Ten reporting replicas meant ten walsenders on the primary. Cascading lets you build a tree — the primary feeds two standbys, each of those feeds five more — so the primary’s fan-out cost is bounded and downstream bandwidth is spent on the intermediate nodes.
Structural shape — before → after.
flowchart TB
subgraph Flat["9.0 — flat fan-out"]
P0["primary<br/>N walsenders"]
P0 --> A0["standby A"]
P0 --> B0["standby B"]
P0 --> C0["standby C"]
end
subgraph Casc["9.2 — cascading tree"]
P1["primary<br/>2 walsenders"]
P1 --> M1["standby<br/>cascading: own walsenders"]
P1 --> M2["standby<br/>cascading: own walsenders"]
M1 --> L1["leaf standby"]
M1 --> L2["leaf standby"]
M2 --> L3["leaf standby"]
M2 --> L4["leaf standby"]
end
The flat layout puts all N walsenders on the primary; the cascading layout caps
the primary at two and pushes the rest of the fan-out cost into the interior
nodes, each of which now runs walsenders of its own.
Why it still had to evolve. Everything so far is physical: byte-for-byte, whole-cluster, same-version, read-only on the standby. None of it can replicate a single table, transform data, replicate between major versions, or feed a non-PostgreSQL consumer. The WAL was being shipped, but never interpreted. The 9.4 release cracked that open.
Cross-link: the synchronous commit wait queue, the synchronous_standby_names grammar (FIRST/ANY quorum sets), and the synchronous_commit levels are documented in postgres-synchronous-replication.md. The standby-reply feedback that drives the wake-up is in postgres-wal-sender-receiver.md.
9.4 — Logical decoding + replication slots
Released: PostgreSQL 9.4, the conceptual turning point. Up to here, “the WAL” meant an opaque byte stream you replayed verbatim. 9.4 made it possible to decode the WAL back into a stream of logical row changes, and introduced the bookkeeping object that makes any continuous consumer safe.
Replication slots
What changed. A replication slot is a small piece of persistent server-side state that records how far a particular consumer has gotten and what resources the server must therefore retain on the consumer’s behalf. Two kinds exist:
- Physical slots keep the primary from recycling WAL segments the standby has not yet received (via restart_lsn), closing the 8.x/9.0 race where the primary recycled a segment the standby still needed and forced a fall back to the archive.
- Logical slots additionally pin catalog_xmin (the oldest transaction ID whose catalog row versions must be kept) so that a logical consumer can still decode old WAL by looking up the catalog as it was when that WAL was written.
This is the birth of src/backend/replication/slot.c, still the slot machinery in
REL_18. Slots are what finally fixed the “ship blind, hope nobody needed that
segment” problem from the 8.x era: the server now knows who is consuming and how
far behind they are, and will refuse to throw away what they still need (at the
cost of unbounded WAL growth if a consumer dies — hence max_slot_wal_keep_size,
added later).
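The retention rule a slot imposes can be sketched as "never remove a segment at or beyond the oldest restart_lsn any slot holds". The model below is illustrative only: the function name, the checkpoint_segment parameter, and the slot names are assumptions, and it leaves out limits such as max_slot_wal_keep_size.

```python
SEGMENT_BYTES = 16 * 1024 * 1024


def removable_segments(existing_segments, restart_lsns, checkpoint_segment):
    """Segments strictly older than anything the checkpoint or any slot still needs.

    restart_lsns maps slot name -> restart_lsn in bytes (hypothetical shape).
    checkpoint_segment is the oldest segment crash recovery would need.
    """
    floor = checkpoint_segment
    if restart_lsns:
        floor = min(floor, min(restart_lsns.values()) // SEGMENT_BYTES)
    return sorted(seg for seg in existing_segments if seg < floor)


slots = {
    "standby_a": 5 * SEGMENT_BYTES + 4096,     # still reading segment 5
    "reporting_sub": 2 * SEGMENT_BYTES + 100,  # lagging consumer in segment 2
}
on_disk = [0, 1, 2, 3, 4, 5, 6]
with_laggard = removable_segments(on_disk, slots, checkpoint_segment=6)
without_laggard = removable_segments(
    on_disk, {"standby_a": slots["standby_a"]}, checkpoint_segment=6
)
```

A single stalled consumer pins everything from its position onward, which is the unbounded-growth risk noted above.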
Logical decoding
What changed. Logical decoding reads the physical WAL and reconstructs, per transaction, the sequence of logical changes (INSERT/UPDATE/DELETE on which table, with which column values) in commit order. Three hard problems had to be solved, and each became a module:
- Reassembling transactions. WAL records from concurrent transactions are interleaved on disk, but a logical consumer wants each transaction whole and in commit order. The reorder buffer (src/backend/replication/logical/reorderbuffer.c) buffers changes per XID and releases them at commit.
- Knowing what the rows mean. To turn a heap tuple back into named, typed column values you need the catalog as of the moment that WAL was written, even though the live catalog has since changed. The snapshot builder (src/backend/replication/logical/snapbuild.c) constructs a historical catalog snapshot by watching catalog-changing transactions go by, and catalog_xmin (pinned by the slot) guarantees the needed catalog rows still exist.
- Pluggable output. The decoded stream is handed to an output plugin that formats it however the consumer wants. 9.4 shipped only the framework plus the test_decoding contrib plugin; a built-in plugin would come in 10.
src/backend/replication/logical/decode.c is the WAL-record-to-logical-change
translator at the bottom of all this.
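The reassembly step is easiest to see on a tiny interleaved stream. The sketch below is a minimal model of the idea, buffering per XID and releasing at commit; the record tuple shape and function name are hypothetical, and it omits subtransactions, spilling, and catalog snapshots.

```python
from collections import defaultdict


def reorder(wal_records):
    """Group interleaved changes by XID and release each transaction at its commit.

    wal_records is an iterable of (xid, kind, payload) tuples, where kind is
    "change", "commit", or "abort". This tuple shape is hypothetical.
    """
    pending = defaultdict(list)
    for xid, kind, payload in wal_records:
        if kind == "change":
            pending[xid].append(payload)
        elif kind == "commit":
            yield xid, pending.pop(xid, [])
        elif kind == "abort":
            pending.pop(xid, None)  # aborted work is never shown to the consumer


interleaved = [
    (701, "change", "INSERT a"),
    (702, "change", "INSERT b"),
    (701, "change", "UPDATE a"),
    (703, "change", "DELETE c"),
    (702, "commit", None),
    (703, "abort", None),
    (701, "commit", None),
]
decoded = list(reorder(interleaved))
```

Transaction 702 is emitted first because it committed first, even though 701 started earlier, and the aborted 703 never reaches the consumer.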
Why. This single release lifted every “physical only” constraint at once, in principle: a logical change stream can be filtered to one table, transformed, shipped across major versions, or consumed by something that isn’t PostgreSQL at all (a message queue, an analytics sink). 9.4 did not yet ship an end-to-end replication product on top of it — that was deliberately left to extensions and to a future release — but it built the entire substrate.
Structural shape — physical vs. logical consumption.
flowchart LR
WAL["WAL on disk<br/>interleaved records"]
subgraph Physical["Physical path (9.0)"]
WAL --> WSP["walsender<br/>physical"]
WSP --> WRP["walreceiver"]
WRP --> REDOP["startup redo<br/>verbatim replay"]
end
subgraph Logical["Logical path (9.4)"]
WAL --> DEC["decode.c<br/>record to change"]
DEC --> RB["reorder buffer<br/>per-txn, commit order"]
SB["snapshot builder<br/>historical catalog"] -.-> RB
SLOT["logical slot<br/>restart_lsn + catalog_xmin"] -.-> DEC
RB --> PLUG["output plugin<br/>test_decoding (9.4)"]
PLUG --> CONS["any consumer<br/>SQL / queue / app"]
end
The physical path replays opaque bytes; the logical path decodes them into named changes, reorders them per transaction, dresses them with a historical catalog, and hands them to a pluggable formatter — and a slot underwrites the whole thing by keeping the needed WAL and catalog rows alive.
Why it still had to evolve. 9.4 gave you a change stream and a contrib plugin, but no subscriber, no DDL to declare “replicate these tables,” and no worker to apply changes on the other end. Building a real replication pipeline still meant gluing together an extension (like the out-of-tree pglogical). 10 made it first-class.
Cross-link: the reorder buffer, snapshot builder, and decode framework are documented in postgres-logical-decoding.md; the slot lifecycle, restart_lsn, and catalog_xmin retention in postgres-replication-slots.md.
10 — Built-in logical replication (pub/sub)
Released: PostgreSQL 10, the release that turned 9.4’s decoding substrate into an end-to-end, SQL-managed replication product shipped in the core server.
What changed. 10 added the whole publish/subscribe surface and the machinery behind it:
- CREATE PUBLICATION on the source declares a named set of tables (and which operations: insert/update/delete) to expose as a logical change stream.
- CREATE SUBSCRIPTION on the destination connects to a publisher, creates a logical replication slot there, and starts pulling and applying changes.
- pgoutput (src/backend/replication/pgoutput/pgoutput.c) is the built-in output plugin, finally filling the role that 9.4 had left to the test_decoding contrib example. It speaks a compact binary logical-replication protocol that the subscriber understands, and it is publication-aware: it only emits changes for tables in the subscribed publications.
- Apply worker + table sync workers. On the subscriber, a background apply worker receives the pgoutput stream and applies each change through the executor as ordinary heap operations. Before steady-state apply can begin, a per-table table sync worker copies the existing table contents (initial COPY) and then catches up to the apply position. These live in src/backend/replication/logical/worker.c and tablesync.c.
- The logical replication launcher (launcher.c) is a supervisor background worker that watches pg_subscription and starts/stops apply workers as subscriptions are created and dropped.
Why. Before 10, building logical replication meant assembling an extension (pglogical) on top of the 9.4 primitives: you supplied your own output plugin, your own apply process, your own DDL. 10 standardized all of that into core SQL objects and core workers, so the lifted constraints from 9.4 became usable: replicate a subset of tables, replicate between different major versions (the subscriber can be a newer release), and even consolidate many publishers into one subscriber. The physical “all-or-nothing, same-version, read-only” model finally had a first-class logical counterpart.
Structural shape — physical stream vs. logical pub/sub.
flowchart LR
subgraph Pub["Publisher (PG 10)"]
PWAL["WAL"] --> PDEC["logical decoding<br/>decode + reorder"]
PSLOT["logical slot<br/>per subscription"] -.-> PDEC
PDEC --> PGO["pgoutput<br/>publication-filtered"]
PUBDEF["CREATE PUBLICATION<br/>table set"] -.-> PGO
PGO --> PWS["walsender<br/>logical mode"]
end
subgraph Sub["Subscriber (PG 10)"]
LAUN["launcher<br/>watches pg_subscription"] --> APPLY["apply worker"]
LAUN --> TSYNC["table sync worker<br/>initial COPY"]
PWS -->|logical repl protocol| APPLY
APPLY --> EXEC["executor<br/>heap insert/update/delete"]
TSYNC --> EXEC
EXEC --> SDATA["subscriber tables<br/>writable, own indexes"]
end
The walsender is reused — but now in logical mode, driving pgoutput instead of shipping raw WAL — and on the far end a launcher-supervised apply worker replays decoded changes through the normal executor into tables that are fully writable and can carry their own indexes and triggers. That last property is the whole point: the subscriber is a real, independent database, not a byte-for-byte mirror.
Why it still had to evolve. 10’s pub/sub was coarse in three ways that later releases refined: it replicated whole rows of whole tables (no row filter, no column projection); it could only replicate transactions after they committed (no PREPARE/two-phase, and very large transactions had to be fully buffered before anything was sent); and the apply side was single-threaded per subscription. 14 and 15 closed the first two gaps, and parallel apply in 16 addressed the third.
Cross-link: pgoutput’s protocol, message types, and publication awareness are in postgres-pgoutput.md; the launcher, apply worker, and table sync state machine in postgres-logical-replication-apply.md.
14–15 — Selective and transactional logical replication
The 10 pub/sub system was complete but blunt. The 14 and 15 releases sharpened it along two axes: transaction semantics (stream in-progress and two-phase transactions) and selectivity (replicate only the rows and columns you want).
14 — Two-phase decoding and streaming of in-progress transactions
What changed. Two related improvements to how transactions flow:
- Streaming of large in-progress transactions. Previously the reorder buffer had to hold an entire transaction until commit before pgoutput could emit anything, so a huge transaction meant a huge memory spike on the publisher (or spilling to disk) and a long stall. 14 lets the decoder stream changes for a still-open transaction to the subscriber, which buffers them and only materializes them at commit, bounding publisher memory via logical_decoding_work_mem.
- Two-phase commit decoding at the output-plugin level. The decoding framework gained the ability to decode a PREPARE TRANSACTION and hand it to an output plugin as a distinct event, rather than only seeing the final COMMIT PREPARED. The matching two_phase output-plugin option is still visible in pgoutput.c; the subscription-side switch that uses it arrived in 15.
Why. Large-transaction memory pressure was a real operational hazard, and true two-phase support is a prerequisite for replicating distributed transactions faithfully — you want the prepared state to cross to the subscriber, not just the eventual commit.
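The memory-bounding effect of streaming can be illustrated with a toy decoder that, whenever buffered changes exceed a limit, sends the largest open transaction downstream early. The eviction rule (largest first), the function name, and the event shape are assumptions for this sketch rather than a description of reorderbuffer.c.

```python
def decode_with_streaming(changes, work_mem_limit):
    """Buffer (xid, size) changes; when the total exceeds the limit, stream the largest txn.

    Returns (events, buffered): events is a list of ("stream", xid, bytes_sent)
    tuples, buffered maps xid -> bytes still held. Both shapes are hypothetical.
    """
    buffered = {}  # xid -> bytes currently held in memory
    events = []
    for xid, size in changes:
        buffered[xid] = buffered.get(xid, 0) + size
        while sum(buffered.values()) > work_mem_limit:
            victim = max(buffered, key=buffered.get)
            events.append(("stream", victim, buffered.pop(victim)))
    return events, buffered


events, still_buffered = decode_with_streaming(
    [(1, 40), (2, 10), (1, 40), (2, 10), (1, 30)], work_mem_limit=100
)
```

In this run the publisher never holds more than the limit for long: transaction 1 is pushed out as soon as it tips the total over, while the small transaction 2 stays buffered until its commit.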
15 — Row filters, column lists, and two-phase subscriptions
What changed. 15 turned the publication into a genuinely selective declaration and finished the two-phase story on the subscriber side:
- Row filters. CREATE PUBLICATION ... FOR TABLE ... WHERE (expr) lets a publication emit only the rows matching a predicate. pgoutput evaluates the filter per row before emitting; the RowFilterPubAction machinery and per-action ExprState array are still in pgoutput.c. The filter state is kept separately for each action (insert/update/delete).
- Column lists. CREATE PUBLICATION ... FOR TABLE ... (col1, col2) lets a publication emit only a projection of each row’s columns, so the subscriber can have a narrower schema or simply avoid shipping wide unused columns.
- Two-phase subscriptions. CREATE SUBSCRIPTION ... WITH (two_phase = true) made the subscriber honor the prepared-transaction events that the 14 decoding work made available, so a PREPARE on the publisher becomes a PREPARE on the subscriber, and the COMMIT PREPARED crosses over too.
Why. Row filters and column lists are what make logical replication usable for
real data-distribution patterns: sharding by tenant (filter on tenant_id),
replicating only non-sensitive columns to a reporting replica, or feeding a
downstream system a narrow slice. Two-phase subscriptions extend correctness
guarantees to distributed-transaction workloads.
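The combined effect of a row filter and a column list is "decide whether the row is emitted, then decide which columns go with it". The sketch below models that with plain dictionaries; the function, the table contents, and the column names are all hypothetical, and real filters are SQL expressions evaluated inside pgoutput rather than Python callables.

```python
def publish_row(row, row_filter=None, column_list=None):
    """Return the projected row to emit, or None if the filter rejects it."""
    if row_filter is not None and not row_filter(row):
        return None
    if column_list is None:
        return dict(row)
    return {col: row[col] for col in column_list}


orders = [
    {"id": 1, "tenant_id": 7, "total": 120, "card_number": "4111-xxxx"},
    {"id": 2, "tenant_id": 9, "total": 80, "card_number": "5500-xxxx"},
]
emitted = [
    out
    for out in (
        publish_row(
            r,
            row_filter=lambda row: row["tenant_id"] == 7,  # per-tenant slice
            column_list=["id", "tenant_id", "total"],      # drop the sensitive column
        )
        for r in orders
    )
    if out is not None
]
```

Only tenant 7's order survives the filter, and the sensitive column never leaves the publisher, which covers both the sharding and the reporting-replica patterns mentioned above.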
Parallel apply for large streamed transactions followed in 16
(applyparallelworker.c), letting the subscriber apply a big streamed
transaction in a dedicated worker rather than serializing everything through
the single apply worker.
Why it still had to evolve. Logical replication now had rich selectivity and transactional fidelity — but a structural durability gap remained. A logical slot lived only on the primary; if the primary failed and a physical standby was promoted, the slot did not exist on the new primary, so every subscriber’s replication position was lost and the subscription broke. 17 closed that gap.
Cross-link: row-filter and column-list evaluation inside the output plugin are documented in postgres-pgoutput.md; streamed and two-phase apply (including the parallel apply worker) in postgres-logical-replication-apply.md.
17 — Failover slots and slot synchronization
Released: PostgreSQL 17, the release that made logical replication survive a physical failover, closing the last structural gap opened back in 9.4 when slots were introduced as primary-only state.
What changed. 17 added the ability to mark a logical slot for failover and to synchronize it to a physical standby so that, when that standby is promoted, the slot already exists there and subscribers can reconnect and continue from where they left off:
- Failover slots. A logical slot created (or altered) with failover = true is flagged so the system knows it must be kept in sync to standbys.
- Slot synchronization. A new slotsync worker on the physical standby (src/backend/replication/logical/slotsync.c) periodically fetches failover slot state from the primary and creates/advances matching synced slots locally, driven either automatically by the sync_replication_slots GUC or manually via pg_sync_replication_slots(). The standby’s copy advances its restart_lsn and catalog_xmin to track the primary’s, so it retains exactly the WAL and catalog rows a future-promoted instance would need.
The synced slot is held back so it never gets ahead of what the standby has
actually received and what the primary’s physical replication has confirmed —
otherwise a subscriber could resume past data the promoted standby never got. The
slotsync.c copyright header (2024-2025) marks this as 17-era development.
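The hold-back invariant can be stated as a one-line rule: a synced slot should never move backwards and never pass WAL the standby has not yet received. The function below is a toy expression of that invariant only; its name, its parameters, and the clamping behaviour are assumptions, and the real worker may instead skip or retry a sync when the primary's slot is ahead.

```python
def synced_slot_position(primary_slot_lsn, standby_received_lsn, current_synced_lsn):
    """Pick the position a synced slot may advance to (toy rule, an assumption).

    The slot never moves backwards and never passes WAL the standby has not
    yet received.
    """
    candidate = min(primary_slot_lsn, standby_received_lsn)
    return max(current_synced_lsn, candidate)


caught_up = synced_slot_position(500, 800, 300)   # standby has everything the slot needs
lagging = synced_slot_position(500, 420, 300)     # standby is behind the primary's slot
no_regress = synced_slot_position(250, 800, 300)  # stale remote value must not rewind
```

If the slot were allowed to reach 500 in the lagging case, a subscriber resuming after promotion could ask for changes the new primary never received.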
Why. Before 17, “logical replication + HA via physical standby” was a contradiction: promoting the standby destroyed every subscriber’s position because the slots only existed on the dead primary. Operators worked around it with fragile external scripting. Failover slots make the position durable across promotion, so logical subscribers ride through a primary failover the same way physical standbys always have.
Structural shape — slot durability across failover.
flowchart LR
subgraph Before["Before 17 — slot lost on failover"]
P0["primary<br/>logical slot"] -->|physical WAL| S0["standby<br/>no slot"]
P0 -->|logical| SUB0["subscriber"]
S0 -.->|promote| S0P["new primary<br/>slot GONE"]
S0P -.->|subscription<br/>breaks| SUB0
end
subgraph After["PG 17 — synced failover slot"]
P1["primary<br/>failover=true slot"] -->|physical WAL| S1["standby<br/>slotsync worker"]
P1 -.->|slot state| S1
S1 --> SYN["synced slot<br/>restart_lsn tracks primary"]
P1 -->|logical| SUB1["subscriber"]
S1 -.->|promote| S1P["new primary<br/>slot PRESENT"]
S1P -->|subscriber resumes| SUB1
end
In the “before” topology the standby carries the data but not the slot, so promotion strands the subscriber. In the “after” topology the slotsync worker has been mirroring the failover slot all along, so the promoted standby already has a ready slot and the subscriber simply reconnects and continues.
Cross-link: slot internals (restart_lsn, catalog_xmin, persistence, and the synced/failover flags) are documented in postgres-replication-slots.md; the consuming apply side in postgres-logical-replication-apply.md.
Where it stands at REL_18
At REL_18_STABLE (the source this document tracks) all of the above coexist as layers of one design, and the directory layout mirrors the history:
- Physical streaming transport: src/backend/replication/walsender.c and walreceiver.c carry both the physical WAL stream (9.0) and, in logical mode, the pgoutput change stream (10). This is the common transport for every kind of replication. See postgres-wal-sender-receiver.md.
- Synchronous commit: src/backend/replication/syncrep.c provides the per-transaction durability dial (9.1) with FIRST/ANY quorum sets. See postgres-synchronous-replication.md.
- Replication slots: src/backend/replication/slot.c provides resource retention for both physical and logical consumers (9.4), now including the failover/synced flags and standby synchronization (17). See postgres-replication-slots.md.
- Logical decoding: src/backend/replication/logical/ holds the decode framework, reorder buffer, and snapshot builder (9.4) plus streaming and two-phase support (14). See postgres-logical-decoding.md.
- Built-in pub/sub: the same logical/ directory holds the apply worker, launcher, table sync, parallel apply, and slot sync workers (10, 14–16, 17), while src/backend/replication/pgoutput/pgoutput.c is the publication-aware output plugin with row filters and column lists (15). See postgres-logical-replication-apply.md and postgres-pgoutput.md.
The net effect: a modern cluster can run physical standbys for HA and logical subscribers for selective data distribution off the same primary, with synchronous durability where it matters, and the logical positions now survive a physical failover. The decade-long arc — coarse file shipping → byte streaming → decoded change streams → selective pub/sub → failover-durable slots — has converged on a single WAL-reuse architecture with multiple read-out layers.
PG19 next step. Development past REL_18 continues to extend the logical side toward the gaps it still has relative to physical replication — broader DDL and sequence handling so that schema changes and sequence advances propagate without manual intervention, and further hardening of slot synchronization and conflict detection. These are forward notes, not current REL_18 behavior; treat the REL_18 design above as the authoritative current state.
Sources
Release notes (feature attribution):
- PostgreSQL 8.3 release notes: pg_standby contrib, warm standby refinements.
- PostgreSQL 9.0 release notes: streaming replication, Hot Standby.
- PostgreSQL 9.1 release notes: synchronous replication, pg_basebackup.
- PostgreSQL 9.2 release notes: cascading replication, streaming-only standby.
- PostgreSQL 9.4 release notes: logical decoding, replication slots.
- PostgreSQL 10 release notes: built-in logical replication, PUBLICATION/SUBSCRIPTION, pgoutput.
- PostgreSQL 14 release notes: two-phase decoding, streaming of in-progress transactions.
- PostgreSQL 15 release notes: publication row filters and column lists, two-phase subscriptions.
- PostgreSQL 17 release notes: logical replication failover slots and slot synchronization.
Current-state module docs (mechanism; do not re-derive here):
- postgres-wal-sender-receiver.md: streaming transport.
- postgres-synchronous-replication.md: synchronous_commit and the wait queue.
- postgres-replication-slots.md: restart_lsn, catalog_xmin, failover/synced slots.
- postgres-logical-decoding.md: reorder buffer, snapshot builder, decode framework.
- postgres-logical-replication-apply.md: launcher, apply worker, table sync, parallel apply.
- postgres-pgoutput.md: built-in output plugin, row filters, column lists.
- postgres-overview-replication-ha.md: the replication/HA subsystem overview.
Key source files (observable on REL_18_STABLE, commit 273fe94):
- src/backend/replication/walsender.c, walreceiver.c: streaming transport (9.0+).
- src/backend/replication/syncrep.c: synchronous replication (9.1+).
- src/backend/replication/slot.c, slotfuncs.c: replication slots (9.4+).
- src/backend/replication/logical/decode.c, reorderbuffer.c, snapbuild.c: logical decoding (9.4+).
- src/backend/replication/logical/worker.c, launcher.c, tablesync.c, applyparallelworker.c: apply side (10, 14–16).
- src/backend/replication/logical/slotsync.c: slot synchronization (17).
- src/backend/replication/pgoutput/pgoutput.c: built-in output plugin (10), row filters/column lists (15).