PostgreSQL Replication — From WAL File Shipping to Logical Pub/Sub

PostgreSQL never added replication so much as it kept finding new layers of its own write-ahead log to reuse. Every era in this story reuses the same WAL stream that already existed for crash recovery, and re-exposes it at a finer grain: first as whole 16 MB segment files copied by an external script, then as a continuous byte stream pushed over libpq, then as a decoded sequence of logical row changes that a subscriber can apply through ordinary SQL. The through-line is “the WAL is already the truth — find a cheaper, more selective way to ship it.”

This document traces how replication changed across major releases and why, and ends at the REL_18 design. It does not re-derive the mechanism of any single subsystem — for that, follow the cross-links to the current-state module docs at each era.

Contents:

Why this subsystem had to evolve — the original 8.x limitation
Timeline — eras and the releases that introduced them
8.x — Warm standby via WAL file shipping — restore_command, pg_standby
9.0 — Streaming replication + Hot Standby — walsender/walreceiver, read-only standbys
9.1–9.2 — Synchronous replication, then cascading — syncrep, standby-of-a-standby
9.4 — Logical decoding + replication slots — reorder buffer, snapshot builder, slot bookkeeping
10 — Built-in logical replication (pub/sub) — PUBLICATION/SUBSCRIPTION, pgoutput, apply workers
14–15 — Selective and transactional logical replication — two-phase, row filters, column lists
17 — Failover slots and slot synchronization — synced slots, slotsync worker
Where it stands at REL_18 — the current design and the PG19 forward note
Sources

Why this subsystem had to evolve (the original limitation)

The original WAL existed for exactly one purpose: crash recovery. After a crash, the startup process replays WAL records from the last checkpoint forward to bring the data files back to a consistent state. That replay loop is the entire substrate everything below is built on — the insight that drove a decade of replication features is that the same replay loop that recovers a crashed server can keep replaying forever on a second server fed a copy of the WAL.

But in the 8.x era the only way to get the WAL to a second server was the archive: archive_command copied each completed 16 MB segment somewhere, and a standby’s restore_command fetched and replayed it. This worked, and it was a genuine HA story for its day, but it had hard structural limits:

Granularity is a whole segment. A standby could not see a transaction until the entire 16 MB WAL segment containing it had filled and been archived. Replication lag was therefore bounded below by “how long until the current segment fills,” which on a quiet system could be minutes or (with archive_timeout) whatever timeout you forced.
The standby was useless while it ran. A server in recovery could not accept connections at all. The replica was a cold spare — it burned a whole machine producing zero read capacity.
No back-pressure, no feedback. The primary shipped files into the archive blind. It had no idea whether any standby had consumed them, how far behind a standby was, or whether a segment it was about to recycle was still needed downstream.
Physical only, all-or-nothing. You replicated the entire cluster, byte for byte, to an identical-version, identical-architecture standby. You could not replicate one database, one table, or filter rows; you could not replicate across major versions; you could not have the replica also accept writes.
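
The granularity limit in the first item is easy to put numbers on. The following is a minimal sketch, assuming a constant WAL generation rate (real write rates are bursty) and using a hypothetical helper name and parameters that do not come from PostgreSQL itself. It models only the time for a 16 MB segment to fill, capped by archive_timeout when one is set.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size


def worst_case_ship_delay(wal_bytes_per_sec, archive_timeout_s=None):
    """Seconds until a freshly started segment is handed to archive_command.

    Hypothetical helper: it ignores copy time and the standby's polling
    interval, and treats archive_timeout_s=None as "no timeout configured".
    """
    if wal_bytes_per_sec <= 0:
        fill_time = float("inf")  # an idle server never fills the segment
    else:
        fill_time = SEGMENT_BYTES / wal_bytes_per_sec
    if archive_timeout_s is None:
        return fill_time
    return min(fill_time, archive_timeout_s)


busy = worst_case_ship_delay(1024 * 1024)      # 1 MB/s of WAL
quiet = worst_case_ship_delay(16 * 1024)       # 16 kB/s of WAL
forced = worst_case_ship_delay(16 * 1024, 60)  # same, with a 60 s timeout
```

Under these assumptions the busy server ships a segment every 16 seconds, the quiet one waits about 17 minutes, and the timeout brings the quiet case down to one minute at the price of archiving mostly empty segments.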

Each subsequent era chips away at one of these limits. Streaming attacks granularity and the cold spare; slots attack the blind-shipping problem; logical decoding and pub/sub attack the “physical, all-or-nothing” constraint; the 17-era work attacks the durability gap that logical slots opened up on failover.

Timeline

timeline
    title PostgreSQL Replication Evolution
    section File era
        8.2 / 8.3 : Continuous archiving<br/>restore_command : pg_standby contrib<br/>warm standby
    section Streaming era
        9.0 : Streaming replication<br/>walsender / walreceiver : Hot Standby<br/>read-only queries on standby
        9.1 : Synchronous replication<br/>synchronous_standby_names : pg_basebackup<br/>streaming-only setup
        9.2 : Cascading replication<br/>standby feeds standby : Streaming-only standby<br/>no archive required
    section Logical era
        9.4 : Logical decoding<br/>reorder buffer + snapshot builder : Replication slots<br/>physical and logical
    section Pub/Sub era
        10 : Built-in logical replication<br/>PUBLICATION / SUBSCRIPTION : pgoutput plugin<br/>apply + tablesync workers
        14 : Two-phase decoding<br/>output plugin PREPARE : Streaming of in-progress txns<br/>large transactions
        15 : Row filters + column lists<br/>selective replication : Two-phase subscriptions<br/>parallel apply (16)
    section Resilience era
        17 : Failover slots<br/>failover = true : Slot synchronization<br/>slotsync worker on standby
        18 : Current state<br/>REL_18_STABLE : (PG19 next: more logical DDL / sequences)

8.x — Warm standby via WAL file shipping

Released: continuous archiving and point-in-time recovery landed in 8.0; the warm-standby pattern — a server that stays permanently in recovery, continuously replaying archived segments — was documented and made practical in 8.2, with the pg_standby contrib helper added in 8.3.

What changed. PostgreSQL 8.0 introduced archive_command: when a WAL segment fills, the server runs an operator-supplied shell command to copy that 16 MB file somewhere durable (an NFS mount, another host, tape). The mirror image is restore_command in a recovery.conf on the standby: a permanently-recovering server that, each time it exhausts the WAL it has, runs that command to fetch the next segment and replays it. pg_standby (8.3) was a contrib program written specifically to be a smart restore_command — it would wait for the next segment to appear rather than declaring end-of-recovery, and it cleaned up consumed segments.
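
The "wait instead of giving up" behaviour of a smart restore_command can be sketched as a polling loop. This is an illustration only, not pg_standby's actual code: the function name, parameters, and polling limits are hypothetical, and the real program does more (for example, cleaning up consumed segments, as described above).

```python
import os
import shutil
import tempfile
import time


def restore_segment(archive_dir, segment_name, dest_path,
                    poll_interval_s=0.01, max_polls=100):
    """Copy segment_name from archive_dir to dest_path, waiting for it to appear.

    Returns True once the file has been copied, or False if it never showed
    up within max_polls attempts. A real helper would keep waiting until it
    is told to stop; the bound here only keeps the sketch finite.
    """
    source = os.path.join(archive_dir, segment_name)
    for _ in range(max_polls):
        if os.path.exists(source):
            shutil.copyfile(source, dest_path)
            return True
        time.sleep(poll_interval_s)  # segment not archived yet: keep polling
    return False
```

A plain copy command would fail as soon as the next segment was missing, which recovery treats as the end of the archive. Blocking until the file appears is what keeps the standby in recovery indefinitely.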

Why. This was the first time the WAL — which had existed since 7.1 purely for crash recovery — was deliberately reused as a replication transport. The recovery replay loop was already there and already correct; warm standby simply arranged to never let it finish.

Structural shape — before → after.

flowchart LR
    subgraph Primary8x["Primary 8.x"]
        BE["backends<br/>write WAL"] --> WALSEG["WAL segment<br/>16 MB file fills"]
        WALSEG -->|archive_command| ARCH["archive<br/>NFS / scp / tape"]
    end
    subgraph Standby8x["Warm standby 8.x"]
        REST["restore_command<br/>pg_standby"] --> STARTUP["startup process<br/>redo loop"]
        STARTUP --> DATA["data files<br/>kept consistent"]
    end
    ARCH -.->|poll for next<br/>16 MB segment| REST
    STARTUP -.->|cannot accept<br/>connections| X(("read<br/>blocked"))

The defining properties are visible in the diagram: the unit of transfer is a whole segment file, the link between the two servers is an out-of-band archive (PostgreSQL processes never talk to each other), and the standby is in pure recovery — X shows it refusing all client connections. There is no socket between primary and standby, no feedback channel, and no way to query the standby.

Why it had to change. Three pain points dominated: minutes-scale lag bounded by segment fill time, a standby that produced zero read capacity, and a primary that shipped segments completely blind to whether anyone consumed them. The next two releases attacked all three.

Cross-link: the segment/archive machinery this era introduced is still the fallback path for streaming today and is documented in postgres-archiving-walsummary.md; the WAL format is covered in postgres-xlog-wal.md.

9.0 — Streaming replication + Hot Standby

Released: PostgreSQL 9.0 — arguably the single most important release in this entire arc. It shipped two complementary features that together turned the cold file-shipping spare into a live, queryable, low-lag replica.

What changed — streaming replication. Instead of waiting for a 16 MB segment to fill and be archived, the standby now opens a replication connection over the normal libpq wire protocol directly to the primary and streams WAL as it is generated, record by record. Two new processes appear:

On the primary, a walsender — a special backend, forked per replication connection, that reads WAL and pushes it down the socket.
On the standby, a walreceiver — a process that connects out to the primary’s walsender, receives the WAL stream, writes it to local WAL, and wakes the startup process to replay it.

This is the birth of src/backend/replication/walsender.c and src/backend/replication/walreceiver.c, both of which are still the transport core in REL_18. The replication protocol got new message types (START_REPLICATION, the XLogData / 'w' byte stream, keepalives) layered on top of the wire protocol.

What changed: Hot Standby. A separate, equally large change: a server in recovery can now accept read-only connections and run queries while it replays. This required teaching recovery to track a running snapshot of the transactions in progress on the primary (so the standby knows which XIDs are visible), to manage recovery conflicts (a query on the standby vs. a VACUUM cleaning up rows the query still needs), and to expose max_standby_archive_delay / max_standby_streaming_delay to bound how long replay waits for a conflicting query. hot_standby_feedback, which lets the standby ask the primary to hold back that cleanup, followed in 9.1.

Why. Streaming kills the granularity limit (lag drops from minutes to milliseconds) and removes the dependence on an external archive for the live path. Hot Standby kills the cold-spare waste — the replica now serves read traffic, turning HA hardware into read-scaling hardware.

Structural shape — before → after.

flowchart LR
    subgraph Primary90["Primary 9.0"]
        BE["backends"] --> WAL["local WAL"]
        WAL --> WS["walsender<br/>per connection"]
    end
    subgraph Standby90["Hot Standby 9.0"]
        WR["walreceiver"] --> SWAL["standby WAL"]
        SWAL --> SU["startup process<br/>continuous redo"]
        SU --> SDATA["data files"]
        RO["read-only<br/>backends"] --> SDATA
    end
    WS -->|libpq replication conn<br/>XLogData stream| WR
    WR -.->|standby reply<br/>flush/apply LSN| WS

Compare with the 8.x diagram: the dotted out-of-band archive is replaced by a direct socket (walsender → walreceiver), the unit of transfer is a WAL record not a 16 MB file, a feedback channel now exists (the standby reports its received/flushed/applied LSN back to the walsender), and crucially RO read-only backends now hang off the standby’s data files. The archive path still exists as a fallback when the standby falls too far behind and the primary has recycled the needed segment.
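
The feedback channel reports positions as LSNs, which PostgreSQL prints as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit byte position). A small sketch of turning a standby reply into byte lag follows; the helper names and the sample LSN values are made up for illustration.

```python
def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hexadecimal text form to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(primary_lsn, standby_lsn):
    """Bytes of WAL the standby still has to cover to reach primary_lsn."""
    return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)


# Hypothetical reply from one standby, plus the primary's current position.
reply = {"received": "0/3000148", "flushed": "0/3000148", "applied": "0/3000060"}
primary_current = "0/3000148"
lags = {stage: lag_bytes(primary_current, lsn) for stage, lsn in reply.items()}
```

In this example the standby has received and flushed everything but is 232 bytes behind on replay, which is exactly the kind of distinction the file-shipping era could not observe.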

Why it still had to evolve. 9.0 streaming was asynchronous — the primary committed without waiting for the standby, so a primary crash could lose the last few committed transactions that hadn’t reached the replica. And the topology was flat: every standby connected directly to the primary, so N standbys meant N walsenders all reading WAL on the primary. The next two releases fixed both.

Cross-link: the modern walsender/walreceiver transport, replication protocol messages, and standby reply feedback are documented in postgres-wal-sender-receiver.md. Recovery redo and Hot Standby conflict handling are in postgres-recovery-redo.md.

9.1–9.2 — Synchronous replication, then cascading

These two releases harden the topology that 9.0 created: 9.1 adds a durability guarantee, and 9.2 adds fan-out without piling all the load on the primary.

9.1 — Synchronous replication

What changed. 9.1 added synchronous_standby_names and the synchronous_commit levels that interact with it. When a standby is named as synchronous, a committing transaction on the primary does not return success to the client until the standby has confirmed it received (and, depending on the level, flushed or applied) the commit’s WAL. This is implemented by making the committing backend block in a wait queue after writing its commit record, and having the walsender wake it once the standby’s reply LSN advances past the commit LSN. This is the birth of src/backend/replication/syncrep.c, still the core of the feature in REL_18.

Why. 9.0 streaming was asynchronous: a primary crash could silently lose the last handful of committed transactions that had not yet crossed the wire. Workloads that needed “if the client was told it committed, it survives a primary failure” had no answer. Synchronous replication trades a round-trip of commit latency for zero data loss on failover.

The cost is deliberately tunable through synchronous_commit: off (do not even wait for the local flush), local (wait for the local flush only and ignore standbys), and on (the standby has flushed the commit to disk). Later releases added remote_write (9.2: the standby has received the WAL and written it to the OS, but not yet flushed it) and remote_apply (9.6: the standby has replayed it, so a read on the standby will see it). The whole point is that durability is now a per-transaction dial, not a cluster-wide mode.
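
The wait queue and the commit levels can be modelled together in a few lines. This is a toy model, not the syncrep.c implementation: LSNs are plain integers, there is a single synchronous standby, and the class, table, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandbyReply:
    write_lsn: int  # received and written to the OS
    flush_lsn: int  # flushed to disk
    apply_lsn: int  # replayed


# Which reported position each synchronous_commit level waits on.
# None means the commit does not wait for the standby at all.
WAITS_ON = {
    "off": None,
    "local": None,
    "remote_write": "write_lsn",
    "on": "flush_lsn",
    "remote_apply": "apply_lsn",
}


def can_return(commit_lsn, level, reply):
    """True if a backend waiting at this level may report success to its client."""
    field = WAITS_ON[level]
    if field is None:
        return True
    return getattr(reply, field) >= commit_lsn


def release_waiters(queue, reply):
    """Split a queue of (commit_lsn, level) entries into (released, still_waiting)."""
    released = [w for w in queue if can_return(w[0], w[1], reply)]
    waiting = [w for w in queue if not can_return(w[0], w[1], reply)]
    return released, waiting


queue = [(100, "remote_write"), (100, "on"), (100, "remote_apply"), (250, "on")]
reply = StandbyReply(write_lsn=200, flush_lsn=150, apply_lsn=90)
released, waiting = release_waiters(queue, reply)
```

With this reply the two commits at LSN 100 that wait on write or flush are released, while the remote_apply waiter and the later commit at LSN 250 stay queued until another reply arrives.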

9.2 — Cascading replication

What changed. 9.2 let a standby run its own walsender and feed WAL onward to further standbys. A standby is no longer a leaf; it can be an interior node in a replication tree. 9.2 also made a standby able to run entirely from streaming with no archive at all, and pg_basebackup (introduced 9.1) made cloning a standby a single built-in command instead of a manual filesystem copy plus pg_start_backup/pg_stop_backup dance.

Why. In the flat 9.0 topology, every standby connected directly to the primary, so each one added a walsender reading WAL and consuming network bandwidth on the primary. Ten reporting replicas meant ten walsenders on the primary. Cascading lets you build a tree — the primary feeds two standbys, each of those feeds five more — so the primary’s fan-out cost is bounded and downstream bandwidth is spent on the intermediate nodes.

Structural shape — before → after.

flowchart TB
    subgraph Flat["9.0 — flat fan-out"]
        P0["primary<br/>N walsenders"]
        P0 --> A0["standby A"]
        P0 --> B0["standby B"]
        P0 --> C0["standby C"]
    end
    subgraph Casc["9.2 — cascading tree"]
        P1["primary<br/>2 walsenders"]
        P1 --> M1["standby<br/>cascading: own walsenders"]
        P1 --> M2["standby<br/>cascading: own walsenders"]
        M1 --> L1["leaf standby"]
        M1 --> L2["leaf standby"]
        M2 --> L3["leaf standby"]
        M2 --> L4["leaf standby"]
    end

The flat layout puts all N walsenders on the primary; the cascading layout caps the primary at two and pushes the rest of the fan-out cost into the interior nodes, each of which now runs walsenders of its own.
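
The fan-out argument can be checked by counting direct children per node. The sketch below uses a hypothetical dictionary representation of the two topologies in the diagram, with the flat case given the same six standbys as the tree.

```python
def walsenders_per_node(topology):
    """topology maps each node name to the list of standbys it feeds directly.

    Each directly connected standby costs its upstream node one walsender.
    """
    return {node: len(children) for node, children in topology.items()}


flat = {"primary": ["A", "B", "C", "D", "E", "F"]}
cascading = {
    "primary": ["M1", "M2"],
    "M1": ["L1", "L2"],
    "M2": ["L3", "L4"],
}

flat_load = walsenders_per_node(flat)
tree_load = walsenders_per_node(cascading)
```

For the same six standbys the primary runs six walsenders in the flat layout and two in the tree, with the remaining four moved onto the interior standbys.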

Why it still had to evolve. Everything so far is physical: byte-for-byte, whole-cluster, same-version, read-only on the standby. None of it can replicate a single table, transform data, replicate between major versions, or feed a non-PostgreSQL consumer. The WAL was being shipped, but never interpreted. The 9.4 release cracked that open.

Cross-link: the synchronous commit wait queue, the synchronous_standby_names grammar (FIRST priority lists and ANY quorum sets), and the synchronous_commit levels are documented in postgres-synchronous-replication.md. The standby-reply feedback that drives the wake-up is in postgres-wal-sender-receiver.md.

9.4 — Logical decoding + replication slots

Released: PostgreSQL 9.4 — the conceptual turning point. Up to here, “the WAL” meant an opaque byte stream you replayed verbatim. 9.4 made it possible to decode the WAL back into a stream of logical row changes, and introduced the bookkeeping object that makes any continuous consumer safe.

Replication slots

What changed. A replication slot is a small piece of persistent server-side state that records how far a particular consumer has gotten and what resources the server must therefore retain on the consumer’s behalf. Two kinds exist:

Physical slots keep the primary from recycling WAL segments the standby has not yet received (via restart_lsn), closing the 8.x/9.0 race where the primary recycled a segment the standby still needed and forced a fall back to the archive.
Logical slots additionally pin catalog_xmin — the oldest transaction ID whose catalog row versions must be kept — so that a logical consumer can still decode old WAL by looking up the catalog as it was when that WAL was written.

This is the birth of src/backend/replication/slot.c, still the slot machinery in REL_18. Slots are what finally fixed the “ship blind, hope nobody needed that segment” problem from the 8.x era: the server now knows who is consuming and how far behind they are, and will refuse to throw away what they still need. The cost is unbounded WAL growth if a consumer dies, which is why max_slot_wal_keep_size was added in 13.
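
The retention rule and its failure mode fit in one function. This is a toy model under stated assumptions: LSNs are plain integers, only slots pin WAL (checkpoints and other retention settings are ignored), and the function and parameter names are hypothetical. In the real server a slot that falls behind the cap is invalidated rather than silently skipped.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size


def oldest_segment_to_keep(current_lsn, slot_restart_lsns, max_keep_bytes=None):
    """Return the number of the oldest WAL segment that must stay on disk.

    Every slot pins WAL from its restart_lsn onward; the optional cap bounds
    how far behind the current position any slot may hold the server.
    """
    keep_from = min(slot_restart_lsns, default=current_lsn)
    if max_keep_bytes is not None:
        keep_from = max(keep_from, current_lsn - max_keep_bytes)
    return keep_from // SEGMENT_BYTES


current = 100 * SEGMENT_BYTES
slots = [98 * SEGMENT_BYTES + 5, 40 * SEGMENT_BYTES]  # one healthy, one dead consumer

uncapped = oldest_segment_to_keep(current, slots)
capped = oldest_segment_to_keep(current, slots, max_keep_bytes=20 * SEGMENT_BYTES)
no_slots = oldest_segment_to_keep(current, [])
```

The dead consumer at segment 40 forces the server to keep 60 segments when no cap is set; with a cap of 20 segments, retention stops at segment 80.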

Logical decoding

What changed. Logical decoding reads the physical WAL and reconstructs, per transaction, the sequence of logical changes (INSERT/UPDATE/DELETE on which table, with which column values) in commit order. Three hard problems had to be solved, and each became a module:

Reassembling transactions — WAL records from concurrent transactions are interleaved on disk, but a logical consumer wants each transaction whole and in commit order. The reorder buffer (src/backend/replication/logical/reorderbuffer.c) buffers changes per XID and releases them at commit.
Knowing what the rows mean — to turn a heap tuple back into named, typed column values you need the catalog as of the moment that WAL was written, even though the live catalog has since changed. The snapshot builder (src/backend/replication/logical/snapbuild.c) constructs a historical catalog snapshot by watching catalog-changing transactions go by, and catalog_xmin (pinned by the slot) guarantees the needed catalog rows still exist.
Pluggable output — the decoded stream is handed to an output plugin that formats it however the consumer wants. 9.4 shipped only the framework plus the test_decoding contrib plugin; a built-in plugin would come in 10.

src/backend/replication/logical/decode.c is the WAL-record-to-logical-change translator at the bottom of all this.
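
The reassembly problem is the easiest of the three to show in miniature. The sketch below is a toy reorder buffer, not reorderbuffer.c: records are hypothetical (xid, kind, payload) tuples, everything stays in memory, and subtransactions, spilling, and snapshots are left out.

```python
from collections import defaultdict


def reorder(wal_records):
    """Group interleaved (xid, kind, payload) records into whole transactions.

    Yields (xid, [changes]) in commit order; aborted transactions are dropped.
    """
    pending = defaultdict(list)
    for xid, kind, payload in wal_records:
        if kind == "change":
            pending[xid].append(payload)      # buffer until the outcome is known
        elif kind == "commit":
            yield xid, pending.pop(xid, [])   # release the whole transaction
        elif kind == "abort":
            pending.pop(xid, None)            # discard without emitting


wal = [
    (501, "change", "INSERT a"),
    (502, "change", "INSERT b"),
    (501, "change", "UPDATE a"),
    (503, "change", "DELETE c"),
    (502, "commit", None),
    (503, "abort", None),
    (501, "commit", None),
]
decoded = list(reorder(wal))
```

Transaction 502 comes out first even though 501 started earlier, because the consumer sees transactions in commit order, and the aborted 503 never reaches the output plugin.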

Why. This single release lifted every “physical only” constraint at once, in principle: a logical change stream can be filtered to one table, transformed, shipped across major versions, or consumed by something that isn’t PostgreSQL at all (a message queue, an analytics sink). 9.4 did not yet ship an end-to-end replication product on top of it — that was deliberately left to extensions and to a future release — but it built the entire substrate.

Structural shape — physical vs. logical consumption.

flowchart LR
    WAL["WAL on disk<br/>interleaved records"]
    subgraph Physical["Physical path (9.0)"]
        WAL --> WSP["walsender<br/>physical"]
        WSP --> WRP["walreceiver"]
        WRP --> REDOP["startup redo<br/>verbatim replay"]
    end
    subgraph Logical["Logical path (9.4)"]
        WAL --> DEC["decode.c<br/>record to change"]
        DEC --> RB["reorder buffer<br/>per-txn, commit order"]
        SB["snapshot builder<br/>historical catalog"] -.-> RB
        SLOT["logical slot<br/>restart_lsn + catalog_xmin"] -.-> DEC
        RB --> PLUG["output plugin<br/>test_decoding (9.4)"]
        PLUG --> CONS["any consumer<br/>SQL / queue / app"]
    end

The physical path replays opaque bytes; the logical path decodes them into named changes, reorders them per transaction, dresses them with a historical catalog, and hands them to a pluggable formatter — and a slot underwrites the whole thing by keeping the needed WAL and catalog rows alive.

Why it still had to evolve. 9.4 gave you a change stream and a contrib plugin, but no subscriber, no DDL to declare “replicate these tables,” and no worker to apply changes on the other end. Building a real replication pipeline still meant gluing together an extension (like the out-of-tree pglogical). 10 made it first-class.

Cross-link: the reorder buffer, snapshot builder, and decode framework are documented in postgres-logical-decoding.md; the slot lifecycle, restart_lsn, and catalog_xmin retention in postgres-replication-slots.md.

10 — Built-in logical replication (pub/sub)

Released: PostgreSQL 10 — the release that turned 9.4’s decoding substrate into an end-to-end, SQL-managed replication product shipped in the core server.

What changed. 10 added the whole publish/subscribe surface and the machinery behind it:

CREATE PUBLICATION on the source declares a named set of tables (and which operations: insert/update/delete) to expose as a logical change stream.
CREATE SUBSCRIPTION on the destination connects to a publisher, creates a logical replication slot there, and starts pulling and applying changes.
pgoutput (src/backend/replication/pgoutput/pgoutput.c) is the built-in output plugin, taking the place that 9.4 had left to the test_decoding contrib plugin. It speaks a compact binary logical-replication protocol that the subscriber understands, and it is publication-aware: it only emits changes for tables in the subscribed publications.
Apply worker + table sync workers — on the subscriber, a background apply worker receives the pgoutput stream and applies each change through the executor as ordinary heap operations. Before steady-state apply can begin, a per-table table sync worker copies the existing table contents (initial COPY) and then catches up to the apply position. These live in src/backend/replication/logical/worker.c and tablesync.c.
The logical replication launcher (launcher.c) — a supervisor background worker that watches pg_subscription and starts/stops apply workers as subscriptions are created and dropped.
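
The hand-off between the initial copy and steady-state apply is the subtle part of the list above. The following is a simplified model of that catch-up step, assuming integer LSNs, a table represented as a dictionary keyed by primary key, and hypothetical names throughout; the real table sync state machine in tablesync.c has more states than this.

```python
def sync_table(snapshot_rows, snapshot_lsn, change_stream, apply_lsn):
    """Initial copy followed by catch-up to the apply worker's position.

    change_stream holds (lsn, key, value) tuples in LSN order, where a value
    of None means the row was deleted.
    """
    table = dict(snapshot_rows)          # the initial COPY
    for lsn, key, value in change_stream:
        if lsn <= snapshot_lsn:
            continue                     # already reflected in the copied data
        if lsn > apply_lsn:
            break                        # from here on the apply worker takes over
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


changes = [
    (90, 1, "a"),     # older than the snapshot: skipped
    (110, 2, "b2"),
    (120, 3, "c"),
    (130, 1, None),
    (200, 4, "d"),    # past the apply position: left to the apply worker
]
synced = sync_table({1: "a", 2: "b"}, snapshot_lsn=100,
                    change_stream=changes, apply_lsn=150)
```

The two boundaries matter equally: skipping changes at or before the snapshot avoids applying them twice, and stopping at the apply position avoids racing the apply worker for later changes.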

Why. Before 10, building logical replication meant assembling an extension (pglogical) on top of the 9.4 primitives: you supplied your own output plugin, your own apply process, your own DDL. 10 standardized all of that into core SQL objects and core workers, so the lifted constraints from 9.4 became usable: replicate a subset of tables, replicate between different major versions (the subscriber can be a newer release), and even consolidate many publishers into one subscriber. The physical “all-or-nothing, same-version, read-only” model finally had a first-class logical counterpart.

Structural shape — physical stream vs. logical pub/sub.

flowchart LR
    subgraph Pub["Publisher (PG 10)"]
        PWAL["WAL"] --> PDEC["logical decoding<br/>decode + reorder"]
        PSLOT["logical slot<br/>per subscription"] -.-> PDEC
        PDEC --> PGO["pgoutput<br/>publication-filtered"]
        PUBDEF["CREATE PUBLICATION<br/>table set"] -.-> PGO
        PGO --> PWS["walsender<br/>logical mode"]
    end
    subgraph Sub["Subscriber (PG 10)"]
        LAUN["launcher<br/>watches pg_subscription"] --> APPLY["apply worker"]
        LAUN --> TSYNC["table sync worker<br/>initial COPY"]
        PWS -->|logical repl protocol| APPLY
        APPLY --> EXEC["executor<br/>heap insert/update/delete"]
        TSYNC --> EXEC
        EXEC --> SDATA["subscriber tables<br/>writable, own indexes"]
    end

The walsender is reused — but now in logical mode, driving pgoutput instead of shipping raw WAL — and on the far end a launcher-supervised apply worker replays decoded changes through the normal executor into tables that are fully writable and can carry their own indexes and triggers. That last property is the whole point: the subscriber is a real, independent database, not a byte-for-byte mirror.

Why it still had to evolve. 10’s pub/sub was coarse in three ways that later releases refined: it replicated whole rows of whole tables (no row filter, no column projection); it could only replicate transactions after they committed (no PREPARE/two-phase, and very large transactions had to be fully buffered before anything was sent); and the apply side was single-threaded per subscription. 14 and 15 closed the first two gaps; parallel apply followed in 16.

Cross-link: pgoutput’s protocol, message types, and publication awareness are in postgres-pgoutput.md; the launcher, apply worker, and table sync state machine in postgres-logical-replication-apply.md.

14–15 — Selective and transactional logical replication

The 10 pub/sub system was complete but blunt. The 14 and 15 releases sharpened it along two axes: transaction semantics (stream in-progress and two-phase transactions) and selectivity (replicate only the rows and columns you want).

14 — Two-phase decoding and streaming of in-progress transactions

What changed. Two related improvements to how transactions flow:

Streaming of large in-progress transactions. Previously the reorder buffer had to hold an entire transaction until commit before pgoutput could emit anything, so a huge transaction meant a huge memory spike on the publisher (or spilling to disk) and a long stall. 14 lets the decoder stream changes for a still-open transaction to the subscriber, which buffers them and only materializes them at commit — bounding publisher memory via logical_decoding_work_mem.
Two-phase commit decoding at the output-plugin level. The decoding framework and the output-plugin API gained the ability to decode and emit a PREPARE TRANSACTION as a distinct event, rather than only seeing the final COMMIT PREPARED. pgoutput's own two_phase option (still visible in pgoutput.c) followed in 15, together with the subscriber-side support.
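
The streaming behaviour in the first item can be sketched as a memory-bounded buffer. This is a toy model, not the reorder buffer's real eviction logic: memory is counted in changes rather than bytes, the limit stands in for logical_decoding_work_mem, and the event names are hypothetical.

```python
def decode_with_streaming(records, work_mem_changes):
    """Return output events for (xid, kind) records, kind in {'change', 'commit'}.

    When the total number of buffered changes exceeds work_mem_changes, the
    largest buffered transaction is streamed out before it has committed.
    """
    buffered = {}     # xid -> number of changes held in memory
    streamed = set()  # xids that have already sent part of their changes
    out = []
    for xid, kind in records:
        if kind == "change":
            buffered[xid] = buffered.get(xid, 0) + 1
            if sum(buffered.values()) > work_mem_changes:
                victim = max(buffered, key=buffered.get)
                out.append(("stream", victim, buffered.pop(victim)))
                streamed.add(victim)
        else:  # commit
            remaining = buffered.pop(xid, 0)
            if xid in streamed:
                if remaining:
                    out.append(("stream", xid, remaining))
                out.append(("stream_commit", xid))
            else:
                out.append(("commit", xid, remaining))
    return out


records = [
    (700, "change"), (700, "change"), (701, "change"), (700, "change"),
    (700, "change"), (701, "commit"), (700, "change"), (700, "commit"),
]
events = decode_with_streaming(records, work_mem_changes=3)
```

The large transaction 700 is sent in two chunks and finished with a separate commit event, while the small transaction 701 still travels as a single unit at commit. Publisher memory never holds more than the limit plus one change.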

Why. Large-transaction memory pressure was a real operational hazard, and true two-phase support is a prerequisite for replicating distributed transactions faithfully — you want the prepared state to cross to the subscriber, not just the eventual commit.

15 — Row filters, column lists, and two-phase subscriptions

What changed. 15 turned the publication into a genuinely selective declaration and finished the two-phase story on the subscriber side:

Row filters. CREATE PUBLICATION ... FOR TABLE ... WHERE (expr) lets a publication emit only the rows matching a predicate. pgoutput evaluates the filter per row before emitting; the RowFilterPubAction machinery and per-action ExprState array are still in pgoutput.c. The filter state is kept per action (insert/update/delete) because a table can belong to several publications that publish different actions.
Column lists. CREATE PUBLICATION ... (col1, col2) lets a publication emit only a projection of each row’s columns, so the subscriber can have a narrower schema or simply avoid shipping wide unused columns.
Two-phase subscriptions. CREATE SUBSCRIPTION ... WITH (two_phase = true) made the subscriber honor the prepared-transaction events that 14 taught the publisher to emit — so a PREPARE on the publisher becomes a PREPARE on the subscriber, and the COMMIT PREPARED crosses over too.
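
Row filters and column lists compose as a predicate followed by a projection. The sketch below models that for a single change, using hypothetical dictionaries for the change and the publication; it is not how pgoutput represents either. It also ignores the special handling of UPDATE, where the old and new row versions are both checked against the filter.

```python
def publish(change, publication):
    """Return the change as the subscriber would see it, or None if filtered out.

    change: {"table": str, "action": str, "row": dict}
    publication: {table: {"where": callable or None, "columns": list or None}}
    """
    rule = publication.get(change["table"])
    if rule is None:
        return None  # table is not part of the publication
    row = change["row"]
    if rule["where"] is not None and not rule["where"](row):
        return None  # row filter rejected the row
    if rule["columns"] is not None:
        row = {col: row[col] for col in rule["columns"]}  # column list projection
    return {**change, "row": row}


publication = {
    "orders": {
        "where": lambda r: r["tenant_id"] == 42,
        "columns": ["id", "tenant_id", "total"],
    }
}
kept = publish(
    {"table": "orders", "action": "insert",
     "row": {"id": 1, "tenant_id": 42, "total": 10, "card_number": "x"}},
    publication,
)
dropped = publish(
    {"table": "orders", "action": "insert",
     "row": {"id": 2, "tenant_id": 7, "total": 99, "card_number": "y"}},
    publication,
)
```

The first insert reaches the subscriber without its card_number column; the second never leaves the publisher because its tenant does not match the filter.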

Why. Row filters and column lists are what make logical replication usable for real data-distribution patterns: sharding by tenant (filter on tenant_id), replicating only non-sensitive columns to a reporting replica, or feeding a downstream system a narrow slice. Two-phase subscriptions extend correctness guarantees to distributed-transaction workloads.

Parallel apply for large streamed transactions followed in 16 (applyparallelworker.c, enabled with the subscription option streaming = parallel), letting the subscriber apply a big streamed transaction in a dedicated worker rather than serializing everything through the single apply worker.

Why it still had to evolve. Logical replication now had rich selectivity and transactional fidelity — but a structural durability gap remained. A logical slot lived only on the primary; if the primary failed and a physical standby was promoted, the slot did not exist on the new primary, so every subscriber’s replication position was lost and the subscription broke. 17 closed that gap.

Cross-link: row-filter and column-list evaluation inside the output plugin are documented in postgres-pgoutput.md; streamed and two-phase apply (including the parallel apply worker) in postgres-logical-replication-apply.md.

17 — Failover slots and slot synchronization

Released: PostgreSQL 17 — the release that made logical replication survive a physical failover, closing the last structural gap opened back in 9.4 when slots were introduced as primary-only state.

What changed. 17 added the ability to mark a logical slot for failover and to synchronize it to a physical standby so that, when that standby is promoted, the slot already exists there and subscribers can reconnect and continue from where they left off:

Failover slots. A logical slot created (or altered) with failover = true is flagged so the system knows it must be kept in sync to standbys.
Slot synchronization. A new slotsync worker on the physical standby (src/backend/replication/logical/slotsync.c) periodically fetches failover slot state from the primary and creates/advances matching synced slots locally — driven either automatically by the sync_replication_slots GUC or manually via pg_sync_replication_slots(). The standby’s copy advances its restart_lsn and catalog_xmin to track the primary’s, so it retains exactly the WAL and catalog rows a future-promoted instance would need.

The synced slot is held back so it never gets ahead of what the standby has actually received and what the primary’s physical replication has confirmed — otherwise a subscriber could resume past data the promoted standby never got. The slotsync.c copyright header (2024-2025) marks this as 17-era development.
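
The hold-back rule can be written as a clamp. This is a deliberately simplified model with integer LSNs and a hypothetical function name. The "never moves backwards" part is an assumption of this sketch rather than something stated above, and the real worker's handling of a slot that is ahead of the standby is more involved (see slotsync.c).

```python
def sync_slot(local_lsn, remote_lsn, standby_received_lsn):
    """Return the new position of the standby's synced copy of a failover slot.

    The local copy follows the primary's slot but never passes what the
    standby has actually received. Keeping it from moving backwards is an
    assumption made for this sketch.
    """
    target = min(remote_lsn, standby_received_lsn)
    return max(local_lsn, target)


caught_up = sync_slot(local_lsn=100, remote_lsn=180, standby_received_lsn=500)
held_back = sync_slot(local_lsn=100, remote_lsn=180, standby_received_lsn=150)
```

In the second call the primary's slot is at 180 but the standby has only received WAL up to 150, so the synced slot stops there. A subscriber resuming on the promoted standby therefore cannot be pointed past data that standby never got.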

Why. Before 17, “logical replication + HA via physical standby” was a contradiction: promoting the standby destroyed every subscriber’s position because the slots only existed on the dead primary. Operators worked around it with fragile external scripting. Failover slots make the position durable across promotion, so logical subscribers ride through a primary failover the same way physical standbys always have.

Structural shape — slot durability across failover.

flowchart LR
    subgraph Before["Before 17 — slot lost on failover"]
        P0["primary<br/>logical slot"] -->|physical WAL| S0["standby<br/>no slot"]
        P0 -->|logical| SUB0["subscriber"]
        S0 -.->|promote| S0P["new primary<br/>slot GONE"]
        S0P -.->|subscription<br/>breaks| SUB0
    end
    subgraph After["PG 17 — synced failover slot"]
        P1["primary<br/>failover=true slot"] -->|physical WAL| S1["standby<br/>slotsync worker"]
        P1 -.->|slot state| S1
        S1 --> SYN["synced slot<br/>restart_lsn tracks primary"]
        P1 -->|logical| SUB1["subscriber"]
        S1 -.->|promote| S1P["new primary<br/>slot PRESENT"]
        S1P -->|subscriber resumes| SUB1
    end

In the “before” topology the standby carries the data but not the slot, so promotion strands the subscriber. In the “after” topology the slotsync worker has been mirroring the failover slot all along, so the promoted standby already has a ready slot and the subscriber simply reconnects and continues.

Cross-link: slot internals — restart_lsn, catalog_xmin, persistence, and the synced/failover flags — are documented in postgres-replication-slots.md; the consuming apply side in postgres-logical-replication-apply.md.

Where it stands at REL_18

At REL_18_STABLE (the source this document tracks) all of the above coexist as layers of one design, and the directory layout mirrors the history:

Physical streaming transport: src/backend/replication/walsender.c and walreceiver.c carry both the physical WAL stream (9.0) and, in logical mode, the pgoutput change stream (10). This is the common transport for every kind of replication. See postgres-wal-sender-receiver.md.
Synchronous commit: src/backend/replication/syncrep.c provides the per-transaction durability dial (9.1), with FIRST priority lists and ANY quorum sets in synchronous_standby_names. See postgres-synchronous-replication.md.
Replication slots: src/backend/replication/slot.c provides resource retention for both physical and logical consumers (9.4), now including the failover/synced flags and standby synchronization (17). See postgres-replication-slots.md.
Logical decoding: src/backend/replication/logical/ holds the decode framework, reorder buffer, and snapshot builder (9.4) plus streaming and two-phase support (14). See postgres-logical-decoding.md.
Built-in pub/sub: the same logical/ directory holds the apply worker, launcher, and table sync (10), parallel apply (16), and the slot sync worker (17), while src/backend/replication/pgoutput/pgoutput.c is the publication-aware output plugin with row filters and column lists (15). See postgres-logical-replication-apply.md and postgres-pgoutput.md.

The net effect: a modern cluster can run physical standbys for HA and logical subscribers for selective data distribution off the same primary, with synchronous durability where it matters, and the logical positions now survive a physical failover. The arc from 8.x to 17 (coarse file shipping → byte streaming → decoded change streams → selective pub/sub → failover-durable slots) spans nearly two decades and has converged on a single WAL-reuse architecture with multiple read-out layers.

PG19 next step. Development past REL_18 continues to extend the logical side toward the gaps it still has relative to physical replication — broader DDL and sequence handling so that schema changes and sequence advances propagate without manual intervention, and further hardening of slot synchronization and conflict detection. These are forward notes, not current REL_18 behavior; treat the REL_18 design above as the authoritative current state.

Sources

Release notes (feature attribution):

PostgreSQL 8.3 release notes — pg_standby contrib, warm standby refinements.
PostgreSQL 9.0 release notes — streaming replication, Hot Standby.
PostgreSQL 9.1 release notes — synchronous replication, pg_basebackup.
PostgreSQL 9.2 release notes — cascading replication, streaming-only standby.
PostgreSQL 9.4 release notes — logical decoding, replication slots.
PostgreSQL 10 release notes — built-in logical replication, PUBLICATION/SUBSCRIPTION, pgoutput.
PostgreSQL 14 release notes — two-phase decoding, streaming of in-progress transactions.
PostgreSQL 15 release notes — publication row filters and column lists, two-phase subscriptions.
PostgreSQL 17 release notes — logical replication failover slots and slot synchronization.

Current-state module docs (mechanism — do not re-derive here):

postgres-wal-sender-receiver.md — streaming transport.
postgres-synchronous-replication.md — synchronous_commit and the wait queue.
postgres-replication-slots.md — restart_lsn, catalog_xmin, failover/synced slots.
postgres-logical-decoding.md — reorder buffer, snapshot builder, decode framework.
postgres-logical-replication-apply.md — launcher, apply worker, table sync, parallel apply.
postgres-pgoutput.md — built-in output plugin, row filters, column lists.
postgres-overview-replication-ha.md — the replication/HA subsystem overview.

Key source files (observable on REL_18_STABLE, commit 273fe94):

src/backend/replication/walsender.c, walreceiver.c — streaming transport (9.0+).
src/backend/replication/syncrep.c — synchronous replication (9.1+).
src/backend/replication/slot.c, slotfuncs.c — replication slots (9.4+).
src/backend/replication/logical/decode.c, reorderbuffer.c, snapbuild.c — logical decoding (9.4+).
src/backend/replication/logical/worker.c, launcher.c, tablesync.c, applyparallelworker.c — apply side (10, 14/15).
src/backend/replication/logical/slotsync.c — slot synchronization (17).
src/backend/replication/pgoutput/pgoutput.c — built-in output plugin (10), row filters/column lists (15).