PostgreSQL Replication Slots — Retaining WAL and catalog_xmin
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every replication scheme that ships a log rather than full snapshots faces the same garbage-collection problem: the primary continuously produces a write-ahead log, and a downstream consumer reads it at its own pace. The primary would like to recycle old log segments as soon as they are no longer needed for its own crash recovery, but “no longer needed” now has a second clause: no longer needed by any consumer either. If the primary recycles a segment a lagging standby still has to read, that standby can never catch up; it must be rebuilt from a fresh base backup.
Kleppmann’s Designing Data-Intensive Applications (ch. 5, “Replication”)
frames this as the core tension of log-based replication: the leader’s log
is the single source of truth, followers replay it position by position, and
the leader must retain log entries until all followers have consumed them.
The naïve fix — keep all WAL forever — trades a correctness bug for an
operational one: unbounded disk growth that eventually crashes the primary.
The naïve opposite — keep a fixed window (wal_keep_size, archive-and-forget)
— is a guess: too small and a briefly disconnected standby falls off the
window; too large and you waste disk on consumers that have long since caught
up. What the system actually wants is a feedback channel: each consumer
publishes how far it has consumed, and the primary retains exactly back to the
slowest consumer and no further.
For physical replication that feedback is a single number — the oldest WAL position (LSN) any standby still needs. For logical replication (DDIA ch. 11, “Stream Processing” / change data capture) there is a second, subtler resource. A logical decoder reads WAL and reconstructs row-level changes, which requires interpreting each change against the system catalog as it existed when the change was made — the table’s column layout, types, and TOAST mapping at that historical moment. Ordinary MVCC vacuum is free to prune old catalog row versions once no running transaction can see them. But a logical decoder replaying month-old WAL needs the month-old catalog. So a logical consumer must pin not only old WAL but also an old catalog snapshot — expressed as a transaction-id horizon below which catalog tuples must not be vacuumed.
Database Internals (Petrov, 2019, the WAL chapter) supplies the mechanism
vocabulary on the producer side: the log is a sequence of records addressed by
monotonically increasing offsets (LSNs), segments are recycled once a
“low-water mark” passes them, and durability rests on fsync ordering. A
replication slot is precisely a named, persistent low-water mark, registered
by a consumer, that the producer’s recycling logic must respect. Two
properties make it more than a stored number:
- Durability. The bookmark must survive a primary crash. If it lived only in volatile memory, a crash-restart would lose the consumer’s position and the primary might recycle WAL the consumer still needs. So the slot is written to disk and reloaded before recovery begins.
- Atomic publication. Many slots may exist; the retention decision is the minimum across all of them. Computing that minimum, and the inverse operation of reserving WAL when a slot is created, must interleave safely with concurrent checkpoints that are simultaneously deciding what to recycle — otherwise a slot could reserve a segment the checkpoint just decided to delete.
These two properties — crash-safe persistence and race-free min/reserve against the recycler — are the whole engineering problem, and they dominate the source.
It is worth dwelling on why the catalog horizon is separate from the WAL
horizon, because the two-number design is the single most distinctive feature
of PostgreSQL’s slot. WAL retention is a question about bytes: which physical
segments on disk must not be unlinked. Catalog retention is a question about
visibility: which historical row versions of pg_class, pg_attribute,
pg_type, and the rest must remain visible to a snapshot taken at decode time.
These are different units (LSN versus transaction id) consumed by different
subsystems (the WAL recycler versus vacuum’s oldest-xmin), so they cannot be a
single value. A physical standby needs only the first; it replays WAL block by
block and never interprets row contents, so it has no opinion about the
catalog. A logical decoder needs both: the WAL bytes to read the changes, and
the catalog snapshot to interpret them. The slot therefore carries both
horizons, and the discriminator database == InvalidOid selects which ones are
actually populated.
A further consequence of the feedback-channel model is that the direction of
movement matters enormously. A consumer’s reported position may only move
forward — a slot can confirm it has consumed up to LSN X, after which the
producer may discard everything below X, but it can never un-confirm. If a
slot were allowed to move backward, the producer might already have recycled
the WAL the slot now claims to need. So monotonicity of restart_lsn and
confirmed_flush is a correctness invariant, not a nicety, and the advance path
enforces it explicitly by rejecting any target below the current minimum.
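The forward-only rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PostgreSQL code: `ToySlot`, `Lsn`, and `toy_slot_advance` are hypothetical names, and the slot is reduced to a single confirmed position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* hypothetical stand-in for XLogRecPtr */

typedef struct ToySlot
{
    Lsn confirmed;               /* furthest position the consumer has acknowledged */
} ToySlot;

/*
 * Try to advance the slot to `target`, clamped to `flushed` (a consumer
 * cannot confirm past what the producer has durably written). Returns false
 * and leaves the slot untouched if the request would move it backward.
 */
static bool
toy_slot_advance(ToySlot *slot, Lsn target, Lsn flushed)
{
    if (target > flushed)
        target = flushed;
    if (target < slot->confirmed)
        return false;            /* backward move: log below confirmed may be gone */
    slot->confirmed = target;
    assert(slot->confirmed <= flushed);
    return true;
}
```

Starting from a confirmed position of 100, a request for 150 succeeds, a later request for 120 is rejected, and a request beyond the flushed position is clamped to it.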
Common DBMS Design
Log-shipping systems converge on a small design space for “how does the producer know what to retain”:
- Fixed retention window. Keep the last N segments / T hours / S bytes regardless of consumers (PostgreSQL’s wal_keep_segments, renamed wal_keep_size in version 13; MySQL’s binlog_expire_logs_seconds). Simple and bounded, but a consumer outside the window is lost. The window is a guess decoupled from actual consumer progress.
- Consumer-registered retention (slots / bookmarks). Each consumer registers a durable marker; the producer retains back to the oldest marker. This is PostgreSQL’s replication slot and Oracle GoldenGate’s checkpoint table; Kafka’s consumer-group committed offset is a similar durable bookmark, although Kafka’s own log retention is time- and size-based and does not wait for it. The producer’s recycler takes the min over all markers. Pro: never lose a registered consumer. Con: a dead consumer that never un-registers pins resources forever, the classic “slot left behind” disk-fill incident.
- Pull-from-archive. Recycle aggressively on local disk but archive every segment to cheap storage (S3, archive_command); a lagging consumer restores from the archive. Decouples local disk from consumer lag, at the cost of archive latency and a separate restore path.
For the catalog-snapshot half of the problem, systems that do logical CDC all
need some form of “don’t garbage-collect schema/row versions the decoder still
needs.” Approaches range from snapshotting the schema at capture-start
(Debezium-style external CDC) to integrating with the engine’s own MVCC GC so
the decoder’s horizon participates directly in the vacuum decision. PostgreSQL
takes the integrated route: a logical slot’s catalog_xmin is published into
the same global oldest-xmin computation that bounds vacuum, so the decoder’s
needs and vacuum’s freedom are reconciled in one place rather than via an
external schema registry.
The locking shape is also conventional. The set of slots is a small, fixed-size shared array (slots are scarce and long-lived, so no dynamic hash is warranted). A coarse lock guards membership (which array entries are live); fine-grained per-entry locks guard the mutable fields. The retention min is recomputed on every event that could change it — slot create, advance, drop, release — and published to the consumer subsystems (the xlog recycler, the ProcArray’s vacuum horizon) through narrow setters.
The recompute-and-publish cadence is itself a design choice with two viable poles. One pole is to push every change immediately and synchronously (low staleness, high contention); the other is to recompute lazily only when a consumer subsystem asks (low contention, possibly stale). PostgreSQL takes a hybrid: the two hot minima — required LSN and required xmin — are maintained eagerly (recomputed on every horizon-moving event and cached in the xlog module and the ProcArray respectively), because the WAL recycler and vacuum consult them at high frequency and cannot afford to walk the slot array themselves. The colder logical-restart LSN is computed on demand, because its callers are rare. This split — cache what is read often, compute what is read seldom — is the same tradeoff every shared-state subsystem makes; what is notable here is that the eager recomputation deliberately uses only a shared lock and accepts a momentarily stale (always-conservative) result rather than serializing all slot activity behind an exclusive lock.
A last structural decision worth naming is fusing creation with ownership.
In many systems “create a resource” and “acquire it for use” are separate calls;
PostgreSQL fuses them for slots (ReplicationSlotCreate leaves the creating
backend as the active owner) because a freshly created slot that nobody owns is
a liability — it pins WAL but has no consumer driving it forward, exactly the
dead-slot failure mode. Making the creator the owner means the slot’s pin is
tied to a live process from birth; the slot only becomes a standalone durable
object once the owner explicitly persists and releases it.
PostgreSQL’s Approach
PostgreSQL implements a slot as a ReplicationSlot struct living in a
fixed-size shared-memory array sized by max_replication_slots. The split
between durable and volatile state is explicit in the type: the on-disk part is
a nested ReplicationSlotPersistentData data substruct; everything outside it
is rebuilt at startup.

    // ReplicationSlotPersistentData (src/include/replication/slot.h)
    typedef struct ReplicationSlotPersistentData
    {
        NameData    name;               /* the slot's identifier */
        Oid         database;           /* InvalidOid => physical, else logical */
        ReplicationSlotPersistency persistency; /* PERSISTENT/EPHEMERAL/TEMPORARY */
        TransactionId xmin;             /* data xmin horizon */
        TransactionId catalog_xmin;     /* catalog xmin horizon (logical) */
        XLogRecPtr  restart_lsn;        /* oldest WAL this slot may require */
        ReplicationSlotInvalidationCause invalidated; /* RS_INVAL_NONE if valid */
        XLogRecPtr  confirmed_flush;    /* oldest LSN the client has acked */
        /* ... two_phase_at, two_phase, plugin, synced, failover ... */
    } ReplicationSlotPersistentData;

The two retention knobs are restart_lsn (the oldest WAL byte position the
consumer might still read — the WAL low-water mark) and catalog_xmin (the
oldest catalog transaction id the consumer might still need — the vacuum
horizon, logical only). database == InvalidOid is the discriminator between
physical and logical slots, encoded by two macros:

    // slot type discriminators (src/include/replication/slot.h)
    #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
    #define SlotIsLogical(slot)  ((slot)->data.database != InvalidOid)

A physical slot pins only WAL; it leaves catalog_xmin invalid. A logical slot
is database-scoped (it can only decode the database it was created in) and pins
both WAL and a catalog snapshot. The reason physical slots also bother to pin
xmin at all is hot-standby feedback: a standby can ask the primary not to
vacuum rows its running queries still need, and that horizon rides on the
physical slot’s xmin.
The persistency field is a three-valued enum, and the three values encode a
small state machine over a slot’s durability:
- RS_TEMPORARY: the slot lives only for the current session and is dropped on disconnect or error. Used for short-lived consumers (e.g. a pg_basebackup that wants WAL retained only while it runs).
- RS_EPHEMERAL: a “not quite ready” intermediate state used while building a persistent slot: it is dropped if released before being promoted, so a backend that crashes mid-creation leaves no orphan. ReplicationSlotPersist promotes an ephemeral slot to persistent once it is fully initialized.
- RS_PERSISTENT: crash-safe and durable; survives release, session end, and restart, and is only removed by an explicit drop or by invalidation.
This staging is the durability analogue of the effective-vs-persistent xmin rule: a slot becomes a durable, resource-pinning object only after it is fully formed, so a failure at any earlier point cleans up automatically rather than leaking a pin.
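The three states and their transitions can be modelled as a toy state machine. This is a hedged sketch, not the PostgreSQL implementation: every `Toy`/`toy_` name is hypothetical, and the promotion function only handles the ephemeral case the text describes.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyPersistency
{
    TOY_PERSISTENT,
    TOY_EPHEMERAL,
    TOY_TEMPORARY
} ToyPersistency;

typedef struct ToySlot
{
    bool           in_use;       /* does the slot still exist? */
    bool           active;       /* is a backend attached to it? */
    ToyPersistency persistency;
} ToySlot;

/* Promote a fully initialized ephemeral slot; other states are left alone. */
static void
toy_persist(ToySlot *slot)
{
    assert(slot->in_use);
    if (slot->persistency == TOY_EPHEMERAL)
        slot->persistency = TOY_PERSISTENT;
}

/* Owner detaches: ephemeral slots vanish, the others merely go inactive. */
static void
toy_release(ToySlot *slot)
{
    slot->active = false;
    if (slot->persistency == TOY_EPHEMERAL)
        slot->in_use = false;
}

/* Session ends: temporary slots are dropped, persistent ones survive. */
static void
toy_session_end(ToySlot *slot)
{
    if (slot->persistency == TOY_TEMPORARY)
        slot->in_use = false;
}
```

The point of the model is the failure path: a slot released while still ephemeral disappears, so only a slot that reached `toy_persist` can outlive its creator.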
The shared struct wraps the persistent data with volatile coordination fields:

    // ReplicationSlot (src/include/replication/slot.h)
    typedef struct ReplicationSlot
    {
        slock_t     mutex;              /* protects the individual fields below */
        bool        in_use;             /* is this array entry a live slot? */
        pid_t       active_pid;         /* who is streaming this slot? 0 = nobody */
        bool        just_dirtied;
        bool        dirty;              /* unsaved changes since last flush? */
        TransactionId effective_xmin;   /* latest xmin actually on disk */
        TransactionId effective_catalog_xmin;
        ReplicationSlotPersistentData data; /* the crash-safe part */
        LWLock      io_in_progress_lock;
        ConditionVariable active_cv;    /* signaled when active_pid changes */
        /* logical-only candidate fields used to advance horizons lazily: */
        TransactionId candidate_catalog_xmin;
        XLogRecPtr  candidate_xmin_lsn;
        XLogRecPtr  candidate_restart_valid;
        XLogRecPtr  candidate_restart_lsn;
        XLogRecPtr  last_saved_confirmed_flush;
        XLogRecPtr  last_saved_restart_lsn;
        /* ... inactive_since ... */
    } ReplicationSlot;

Two locking concepts, spelled out in the header’s comment, govern access:
- ReplicationSlotControlLock (cluster-wide LWLock) guards the in_use flag. A backend flips in_use (create / drop) only while holding it exclusively; a backend scanning the array to read other slots’ data holds it shared.
- Per-slot mutex (spinlock) guards the mutable fields. The slot’s owner may read its own fields without the spinlock but must take it to write; concurrent non-owners take the spinlock to read.
The effective_xmin / effective_catalog_xmin pair versus the persistent
data.xmin / data.catalog_xmin is the subtlest part of the design. For
logical decoding it is “extremely important that we never remove any data
that’s still needed … even after a crash” (header comment). So the effective
value, the one that actually bounds vacuum, is only allowed
to advance after the corresponding persistent value has been flushed to disk.
This guarantees that the horizon vacuum enforces is never ahead of the one on
disk, so a slot restored after a crash never needs rows that were already pruned. For physical
slots the worst case of premature removal is a query cancellation on the
standby, so the effective value simply tracks the persistent value with no
extra lag.
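The ordering for a logical slot can be sketched as three separate steps. This is a minimal model under stated assumptions: the `Toy`/`toy_` names are hypothetical, transaction ids are plain integers with wraparound ignored, and the "disk" is just another struct field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;            /* hypothetical stand-in for TransactionId */

typedef struct ToySlot
{
    Xid data_catalog_xmin;       /* persistent value, possibly not on disk yet */
    Xid on_disk_catalog_xmin;    /* what a crash-restart would read back */
    Xid effective_catalog_xmin;  /* what vacuum is actually held to */
} ToySlot;

/* Step 1: the decoder wants a newer horizon; only the in-memory copy moves. */
static void
toy_set_catalog_xmin(ToySlot *slot, Xid newer)
{
    if (newer > slot->data_catalog_xmin)
        slot->data_catalog_xmin = newer;
}

/* Step 2: flush the persistent value (stands in for the real state-file write). */
static void
toy_flush(ToySlot *slot)
{
    slot->on_disk_catalog_xmin = slot->data_catalog_xmin;
}

/* Step 3: only now may vacuum's bound move, and only to the flushed value. */
static void
toy_publish_effective(ToySlot *slot)
{
    slot->effective_catalog_xmin = slot->on_disk_catalog_xmin;
}

/* The invariant the ordering protects: enforced horizon never ahead of disk. */
static bool
toy_crash_safe(const ToySlot *slot)
{
    return slot->effective_catalog_xmin <= slot->on_disk_catalog_xmin;
}
```

A crash between any two steps leaves `toy_crash_safe` true, because vacuum has only ever been allowed to prune below a horizon that the on-disk state already reflects.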
flowchart TB
subgraph SHMEM["Shared memory: ReplicationSlotCtl"]
A["replication_slots[0..max_replication_slots-1]<br/>fixed array of ReplicationSlot"]
A --> S0["slot[0] in_use<br/>data.restart_lsn / data.catalog_xmin<br/>mutex + active_pid"]
A --> S1["slot[1] in_use ..."]
A --> SN["slot[N] free (in_use=false)"]
end
S0 -->|"min restart_lsn<br/>ComputeRequiredLSN"| XLOG["xlog recycler<br/>XLogSetReplicationSlotMinimumLSN"]
S0 -->|"min catalog_xmin<br/>ComputeRequiredXmin"| PROC["ProcArray vacuum horizon<br/>ProcArraySetReplicationSlotXmin"]
S0 -.->|"checkpoint: SaveSlotToPath"| DISK["pg_replslot/<name>/state<br/>tmp + fsync + rename"]
DISK -.->|"StartupReplicationSlots<br/>before redo"| A
The retention pipeline is the heart of the module. On every event that could move a horizon — slot creation reserving WAL, an advance, a release, a drop — PostgreSQL recomputes the cluster-wide minima and republishes them:
- ReplicationSlotsComputeRequiredLSN() walks the array, takes the minimum valid restart_lsn, and calls XLogSetReplicationSlotMinimumLSN(). The next checkpoint will not recycle segments below that LSN.
- ReplicationSlotsComputeRequiredXmin() walks the array, takes the minimum effective_xmin and effective_catalog_xmin, and calls ProcArraySetReplicationSlotXmin(). Vacuum’s global oldest-xmin computation then refuses to prune below that horizon.
The candidate fields (candidate_catalog_xmin, candidate_xmin_lsn,
candidate_restart_lsn, candidate_restart_valid) implement the lazy-advance
machinery for logical slots and deserve a sentence even though their consumers
live in the logical-decoding subsystem. When a logical decoder reads WAL, it
learns of a potential new horizon long before it can safely commit to it — it
must wait until the client confirms it has flushed the corresponding output.
The candidate fields stage that potential horizon: “once the client confirms
flushes at or beyond candidate_xmin_lsn, it is safe to advance the catalog
xmin to candidate_catalog_xmin; once candidate_restart_valid is passed,
restart_lsn may rise to candidate_restart_lsn.” Only when those conditions
are met does the effective horizon move and a recompute fire. This staging is
exactly the mechanism that upholds the effective-never-ahead-of-flushed
invariant for the catalog horizon, mirroring on the xmin axis what
last_saved_restart_lsn does on the LSN axis. The logic that stages and applies
these fields lives in logical.c (LogicalIncreaseXminForSlot and
LogicalIncreaseRestartDecodingForSlot stage the candidates;
LogicalConfirmReceivedLocation applies them), deferred to the logical-decoding doc.
Both retention computations (ReplicationSlotsComputeRequiredLSN and ReplicationSlotsComputeRequiredXmin) deliberately use a shared lock and tolerate a transiently stale result: a concurrent advance might make the computed minimum momentarily older than reality, which is harmless (it only delays GC, never enables premature removal). This is the recurring safety asymmetry of the whole module: every race is resolved in the direction of retaining too much, never too little.
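The candidate staging for the catalog horizon can be sketched as a stage-then-confirm pair. This is a simplified illustration, not the logic in logical.c: the `Toy`/`toy_` names are hypothetical, 0 is used as the "nothing staged" sentinel, and the intermediate flush-to-disk step is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;            /* hypothetical stand-in for TransactionId */
typedef uint64_t Lsn;            /* hypothetical stand-in for XLogRecPtr */

typedef struct ToyLogicalSlot
{
    Xid catalog_xmin;            /* horizon currently in force */
    Xid candidate_catalog_xmin;  /* staged horizon, 0 = none staged */
    Lsn candidate_xmin_lsn;      /* client must confirm up to here first */
    Lsn confirmed_flush;         /* furthest LSN the client has confirmed */
} ToyLogicalSlot;

/* Decoder side: stage a newer horizon together with the LSN that unlocks it. */
static void
toy_stage_xmin(ToyLogicalSlot *s, Xid xmin, Lsn safe_at)
{
    if (s->candidate_catalog_xmin == 0 && xmin > s->catalog_xmin)
    {
        s->candidate_catalog_xmin = xmin;
        s->candidate_xmin_lsn = safe_at;
    }
}

/* Client feedback: apply the staged horizon once its LSN is confirmed. */
static bool
toy_confirm(ToyLogicalSlot *s, Lsn flushed)
{
    if (flushed > s->confirmed_flush)
        s->confirmed_flush = flushed;
    if (s->candidate_catalog_xmin != 0 &&
        s->candidate_xmin_lsn <= s->confirmed_flush)
    {
        s->catalog_xmin = s->candidate_catalog_xmin;
        s->candidate_catalog_xmin = 0;
        s->candidate_xmin_lsn = 0;
        return true;             /* horizon moved: caller would recompute minima */
    }
    return false;
}
```

A horizon staged for LSN 800 stays pending while the client has only confirmed 700, and takes effect on the first confirmation at or beyond 800.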
Source Walkthrough
The module’s lifecycle splits into five flows: allocation (create / acquire
/ release / drop), retention computation (the min/reserve functions),
persistence (save / restore at checkpoint and startup), advance / drop
(the SQL-facing operations in slotfuncs.c), and invalidation + failover
(falling-behind detection and slot sync). They share the array and the two
locks introduced above.
Allocation: create, acquire, release, drop
ReplicationSlotCreate (slot.c) is the entry point. It validates the name,
then under ReplicationSlotAllocationLock (which serializes all slot
creation/cleanup so two backends can’t grab the same free entry or fight over
the same directory) it scans the array for a name collision and for a free
slot, initializes the persistent and volatile fields, writes the slot to disk,
and only then flips in_use under an exclusive ReplicationSlotControlLock:

    // ReplicationSlotCreate (src/backend/replication/slot.c)
    LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
            ereport(ERROR,
                    (errcode(ERRCODE_DUPLICATE_OBJECT),
                     errmsg("replication slot \"%s\" already exists", name)));
        if (!s->in_use && slot == NULL)
            slot = s;                   /* first free entry */
    }
    LWLockRelease(ReplicationSlotControlLock);
    if (slot == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("all replication slots are in use"), ...));
    /* ... memset(&slot->data, 0, ...); set name/database/persistency ... */
    CreateSlotOnDisk(slot);             /* materialize before marking in_use */
    LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
    slot->in_use = true;                /* now visible to scanners */

slot->data.database = db_specific ? MyDatabaseId : InvalidOid is the line that
decides physical-vs-logical at birth. The newly created slot is also marked
active (active_pid = MyProcPid, MyReplicationSlot = slot) so the creating
backend owns it immediately — creation and acquisition are fused.
ReplicationSlotAcquire is the re-attach path (used by pg_replication_slot_advance,
a starting walsender, etc.). It searches by name, then either claims the slot
(active_pid = MyProcPid) or, if another PID holds it, sleeps on the slot’s
active_cv condition variable until released — unless nowait was requested,
in which case it errors with ERRCODE_OBJECT_IN_USE. Crucially, it checks
invalidation after taking ownership, to close a race with the checkpointer:

    // ReplicationSlotAcquire (src/backend/replication/slot.c)
    MyReplicationSlot = s;              /* the slot is ours now */
    /* check invalidation AFTER acquiring to avoid a race with the checkpointer */
    if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
        ereport(ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("can no longer access replication slot \"%s\"",
                       NameStr(s->data.name)),
                errdetail("This replication slot has been invalidated due to \"%s\".",
                          GetSlotInvalidationCauseName(s->data.invalidated)));

ReplicationSlotRelease is the inverse. Its branch on persistency is the
whole point of the three-valued enum: an RS_EPHEMERAL slot (a
not-quite-finished persistent slot) is dropped on release; an RS_PERSISTENT
slot is merely marked inactive (active_pid = 0) and survives; a temporary slot
is handled likewise but disappears at session end. Releasing a slot that had
temporarily constrained xmin for an initial catalog snapshot clears that
constraint and recomputes the xmin horizon.
ReplicationSlotDropPtr performs the durable removal: it renames the slot
directory to <name>.tmp (the atomic “this is no longer a valid slot” step),
fsyncs the parent, clears in_use under the control lock, and then recomputes
both retention minima — because a dropped slot no longer pins anything:

    // ReplicationSlotDropPtr (src/backend/replication/slot.c)
    if (rename(path, tmppath) == 0)
    {
        /* atomic invalidation on disk */
        START_CRIT_SECTION();
        fsync_fname(tmppath, true);
        fsync_fname(PG_REPLSLOT_DIR, true);
        END_CRIT_SECTION();
    }
    LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
    slot->active_pid = 0;
    slot->in_use = false;               /* entry returns to the free pool */
    LWLockRelease(ReplicationSlotControlLock);
    /* Slot is dead and doesn't prevent resource removal anymore, recompute limits. */
    ReplicationSlotsComputeRequiredXmin(false);
    ReplicationSlotsComputeRequiredLSN();

Reserving WAL and computing the retention minima
ReplicationSlotReserveWal sets a brand-new slot’s restart_lsn. The choice of
anchor differs by slot type and recovery state — a physical slot anchors at the
last checkpoint redo pointer (where a base backup would start replay); a logical
slot on a primary anchors at the current insert pointer and logs a standby
snapshot so decoding has a consistent starting point:

    // ReplicationSlotReserveWal (src/backend/replication/slot.c)
    LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE); /* serialize vs checkpoint */
    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
        restart_lsn = GetXLogReplayRecPtr(NULL);
    else
        restart_lsn = GetXLogInsertRecPtr();
    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);
    ReplicationSlotsComputeRequiredLSN();   /* prevent WAL removal ASAP */
    XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
    if (XLogGetLastRemovedSegno() >= segno) /* lost the race: segment already gone */
        elog(ERROR, "WAL required by replication slot %s has been removed concurrently", ...);

The exclusive ReplicationSlotAllocationLock here is the lock that interlocks
with the checkpointer (which holds it shared while flushing slots, see below):
either the reservation happens first and the checkpoint must see the new
restart_lsn, or the checkpoint happens first and the reservation lands at/after
that checkpoint’s redo pointer. The post-hoc XLogGetLastRemovedSegno check
catches the residual window.
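The post-hoc check is plain segment arithmetic: an LSN is a byte offset into the WAL stream, and its segment number is that offset divided by the segment size. The sketch below illustrates the comparison with hypothetical `toy_` helpers and assumes 16 MiB segments (the PostgreSQL default); it is not the XLByteToSeg macro itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* hypothetical stand-in for XLogRecPtr */
typedef uint64_t SegNo;          /* hypothetical stand-in for XLogSegNo */

/* Number of the segment that contains byte position `lsn`. */
static SegNo
toy_lsn_to_segno(Lsn lsn, uint64_t segment_size)
{
    assert(segment_size > 0);
    return lsn / segment_size;
}

/* True if the segment holding `restart_lsn` has already been removed. */
static bool
toy_reservation_lost(Lsn restart_lsn, SegNo last_removed, uint64_t segment_size)
{
    return last_removed >= toy_lsn_to_segno(restart_lsn, segment_size);
}
```

With 16 MiB segments, LSN 0x5000000 falls in segment 5 and LSN 0x4FFFFFF in segment 4, so a reservation at 0x5000000 is lost if segment 5 has been removed and still valid if removal stopped at segment 4.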
ReplicationSlotsComputeRequiredLSN is the WAL low-water-mark aggregator. Note
the persistent-slot subtlety: it uses last_saved_restart_lsn, not the live
restart_lsn, because the segments between the two might still be needed if the
primary crashes before the newer restart_lsn is flushed:

    // ReplicationSlotsComputeRequiredLSN (src/backend/replication/slot.c)
    for (i = 0; i < max_replication_slots; i++)
    {
        /* ... read restart_lsn, last_saved_restart_lsn, persistency, invalidated under mutex ... */
        if (invalidated)
            continue;                   /* invalidated slots pin nothing */
        if (persistency == RS_PERSISTENT)
        {
            if (last_saved_restart_lsn != InvalidXLogRecPtr &&
                restart_lsn > last_saved_restart_lsn)
                restart_lsn = last_saved_restart_lsn;   /* be conservative across crash */
        }
        if (restart_lsn != InvalidXLogRecPtr &&
            (min_required == InvalidXLogRecPtr || restart_lsn < min_required))
            min_required = restart_lsn;
    }
    XLogSetReplicationSlotMinimumLSN(min_required);

ReplicationSlotsComputeRequiredXmin is the catalog-horizon analogue. It
aggregates the minimum effective_xmin and effective_catalog_xmin (the
flushed-to-disk values, per the effective-vs-persistent rule) and pushes them
into the ProcArray, which is where vacuum reads the global oldest-xmin:

    // ReplicationSlotsComputeRequiredXmin (src/backend/replication/slot.c)
    for (i = 0; i < max_replication_slots; i++)
    {
        /* ... read effective_xmin, effective_catalog_xmin, invalidated under mutex ... */
        if (invalidated)
            continue;
        if (TransactionIdIsValid(effective_xmin) &&
            (!TransactionIdIsValid(agg_xmin) ||
             TransactionIdPrecedes(effective_xmin, agg_xmin)))
            agg_xmin = effective_xmin;
        if (TransactionIdIsValid(effective_catalog_xmin) &&
            (!TransactionIdIsValid(agg_catalog_xmin) ||
             TransactionIdPrecedes(effective_catalog_xmin, agg_catalog_xmin)))
            agg_catalog_xmin = effective_catalog_xmin;
    }
    ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);

ReplicationSlotsComputeLogicalRestartLSN is a logical-only variant (skips
physical slots) used by callers that need the oldest WAL required specifically
for decoding; it is computed on demand rather than cached.
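The xmin aggregator compares with TransactionIdPrecedes rather than a plain `<` because transaction ids are 32-bit counters that wrap around, whereas LSNs are 64-bit and compare directly. The sketch below shows a wraparound-aware minimum in that style; the `toy_` names and constants are hypothetical stand-ins, and the comparison is a simplified rendering of the modulo-2^32 rule, not the PostgreSQL function itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;                   /* hypothetical stand-in for TransactionId */

#define TOY_INVALID_XID       0u        /* "no horizon" sentinel */
#define TOY_FIRST_NORMAL_XID  3u        /* ids below this are special, never wrap */

/* Does `a` logically precede `b` on the circular 32-bit xid space? */
static bool
toy_xid_precedes(Xid a, Xid b)
{
    if (a < TOY_FIRST_NORMAL_XID || b < TOY_FIRST_NORMAL_XID)
        return a < b;                   /* special ids compare as plain integers */
    return (int32_t) (a - b) < 0;       /* signed distance handles wraparound */
}

/* Oldest valid xid in the array, or TOY_INVALID_XID if none is valid. */
static Xid
toy_min_xid(const Xid *xids, int n)
{
    Xid agg = TOY_INVALID_XID;

    for (int i = 0; i < n; i++)
    {
        if (xids[i] == TOY_INVALID_XID)
            continue;                   /* slot pins no xmin */
        if (agg == TOY_INVALID_XID || toy_xid_precedes(xids[i], agg))
            agg = xids[i];
    }
    return agg;
}
```

Given the ids 50 and 4294967000, a plain integer minimum would pick 50, but the counter has wrapped: 4294967000 is the older transaction, and `toy_min_xid` returns it.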
Persistence: dirty-flush at checkpoint, restore before redo
A slot is dirtied (ReplicationSlotMarkDirty) whenever its persistent fields
change, and explicitly flushed (ReplicationSlotSave → SaveSlotToPath) at slot
creation and persist time. The bulk of flushing, though, happens lazily at every
checkpoint via CheckPointReplicationSlots, which walks all in-use slots and
calls SaveSlotToPath for each under a shared ReplicationSlotAllocationLock
(strong enough to freeze in_use, weak enough to let acquisition proceed):

    // CheckPointReplicationSlots (src/backend/replication/slot.c)
    LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (!s->in_use)
            continue;
        sprintf(path, "%s/%s", PG_REPLSLOT_DIR, NameStr(s->data.name));
        if (is_shutdown && SlotIsLogical(s))
        {
            /* force-flush confirmed_flush at shutdown */
            SpinLockAcquire(&s->mutex);
            if (s->data.invalidated == RS_INVAL_NONE &&
                s->data.confirmed_flush > s->last_saved_confirmed_flush)
            {
                s->just_dirtied = true;
                s->dirty = true;
            }
            SpinLockRelease(&s->mutex);
        }
        if (s->last_saved_restart_lsn != s->data.restart_lsn)
            last_saved_restart_lsn_updated = true;
        SaveSlotToPath(s, path, LOG);
    }
    LWLockRelease(ReplicationSlotAllocationLock);
    if (last_saved_restart_lsn_updated)
        ReplicationSlotsComputeRequiredLSN();   /* WAL can now be recycled further */

SaveSlotToPath is the atomic-write workhorse: skip if !dirty, copy
slot->data under the spinlock into a checksummed ReplicationSlotOnDisk
record, write it to state.tmp, fsync, then rename to state and fsync the
directory. The write-tmp-then-rename idiom guarantees that a crash mid-write
leaves the previous good state file intact. Its tail is where the two
last_saved_* fields are updated and the dirty bit is cleared — but only if
nobody re-dirtied the slot during the I/O, which is exactly why just_dirtied
exists as a separate flag from dirty:

    // SaveSlotToPath (src/backend/replication/slot.c), tail
    if (rename(tmppath, path) != 0)
    {
        /* ... unlink, release io lock, ereport ... */
        return;
    }
    START_CRIT_SECTION();
    fsync_fname(path, false);
    fsync_fname(dir, true);
    fsync_fname(PG_REPLSLOT_DIR, true);
    END_CRIT_SECTION();
    /* Successfully wrote; unset dirty unless somebody dirtied again already, */
    /* and remember the flushed confirmed_flush / restart_lsn. */
    SpinLockAcquire(&slot->mutex);
    if (!slot->just_dirtied)
        slot->dirty = false;
    slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
    slot->last_saved_restart_lsn = cp.slotdata.restart_lsn;
    SpinLockRelease(&slot->mutex);
    LWLockRelease(&slot->io_in_progress_lock);

Setting last_saved_restart_lsn = cp.slotdata.restart_lsn here is the moment
the conservative “use the saved, not live, restart_lsn” rule in
ReplicationSlotsComputeRequiredLSN is allowed to relax: only after a
successful flush does the saved value catch up to the live one, which is why the
checkpoint loop recomputes the required LSN whenever it observes
last_saved_restart_lsn != data.restart_lsn. RestoreSlotFromDisk is the
reverse, called by StartupReplicationSlots before crash recovery so that
slots pin WAL before redo can examine what to recycle:

    // StartupReplicationSlots (src/backend/replication/slot.c)
    replication_dir = AllocateDir(PG_REPLSLOT_DIR);
    while ((replication_de = ReadDir(replication_dir, PG_REPLSLOT_DIR)) != NULL)
    {
        /* skip ".", ".."; rmtree leftover "*.tmp" dirs from a crash mid-create/drop */
        if (pg_str_endswith(replication_de->d_name, ".tmp"))
        {
            rmtree(path, true);
            ...
            continue;
        }
        RestoreSlotFromDisk(replication_de->d_name);    /* validate magic/version/CRC, fill array */
    }
    FreeDir(replication_dir);
    if (max_replication_slots <= 0)
        return;
    ReplicationSlotsComputeRequiredXmin(false); /* republish horizons from restored slots */
    ReplicationSlotsComputeRequiredLSN();

RestoreSlotFromDisk PANICs on a magic/version/length/CRC mismatch (a corrupt
slot file is unrecoverable and must stop startup), and on restore it seeds the
volatile effective_* and last_saved_* fields from the persisted values —
re-establishing the “effective never ahead of on-disk” invariant from a clean
base.
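The write-tmp-then-rename idiom used for the state file can be reproduced with plain POSIX calls. The following is a minimal sketch, assuming a POSIX system where a directory can be opened read-only and fsynced; `toy_save_state` and `toy_fsync_path` are hypothetical names, and the checksum, critical section, and error reporting of the real code are left out.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* fsync a file or directory given its path; returns 0 on success. */
static int
toy_fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Replace dir/state with `len` bytes from `buf` so that a crash leaves either
 * the old file or the new one, never a torn mix. Returns 0 on success.
 */
static int
toy_save_state(const char *dir, const void *buf, size_t len)
{
    char tmppath[1024];
    char path[1024];
    int  fd;

    snprintf(tmppath, sizeof tmppath, "%s/state.tmp", dir);
    snprintf(path, sizeof path, "%s/state", dir);

    fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    /* The new contents must be durable before they can replace the old file. */
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0 || rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    /* Make the rename itself durable by syncing the containing directory. */
    return toy_fsync_path(dir);
}
```

Because rename replaces the target atomically, a reader that opens `state` at any moment sees one complete version; the temporary file is the only place a partial write can ever exist, which is why leftover `.tmp` entries can simply be deleted at startup.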
Advance and drop: the SQL surface
slotfuncs.c is the SQL-callable layer. pg_replication_slot_advance clamps the
target LSN to what has actually been flushed/replayed, acquires the slot,
rejects backward moves, and dispatches to the physical or logical advancer:

    // pg_replication_slot_advance (src/backend/replication/slotfuncs.c)
    if (!RecoveryInProgress())
        moveto = Min(moveto, GetFlushRecPtr(NULL));     /* can't advance past durable WAL */
    else
        moveto = Min(moveto, GetXLogReplayRecPtr(NULL));
    ReplicationSlotAcquire(NameStr(*slotname), true, true);
    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
        ereport(ERROR, ...);                            /* never reserved WAL */
    minlsn = OidIsValid(MyReplicationSlot->data.database)
        ? MyReplicationSlot->data.confirmed_flush       /* logical: consumed point */
        : MyReplicationSlot->data.restart_lsn;          /* physical: restart point */
    if (moveto < minlsn)
        ereport(ERROR, ...);                            /* monotonic: forward only */
    endlsn = OidIsValid(MyReplicationSlot->data.database)
        ? pg_logical_replication_slot_advance(moveto)
        : pg_physical_replication_slot_advance(moveto);
    ReplicationSlotsComputeRequiredXmin(false);         /* horizons may have moved */
    ReplicationSlotsComputeRequiredLSN();
    ReplicationSlotRelease();

The physical advancer is trivial: it just moves restart_lsn forward, dirties
the slot, and wakes any logical failover walsenders that key off this physical
slot’s position:

    // pg_physical_replication_slot_advance (src/backend/replication/slotfuncs.c)
    if (startlsn < moveto)
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.restart_lsn = moveto;
        SpinLockRelease(&MyReplicationSlot->mutex);
        ReplicationSlotMarkDirty();     /* persisted at next checkpoint */
        PhysicalWakeupLogicalWalSnd();
    }

The logical advancer defers to LogicalSlotAdvanceAndCheckSnapState (in
logical.c, deferred to the logical-decoding doc) because advancing a logical
slot means replaying decoding far enough to move confirmed_flush and the
catalog horizon safely. create_physical_replication_slot and
create_logical_replication_slot are the create wrappers; pg_drop_replication_slot
calls ReplicationSlotDrop.
pg_get_replication_slots is the read path behind the pg_replication_slots
system view. It scans the array under a shared control lock and projects each
in-use slot’s restart_lsn, catalog_xmin, confirmed_flush, wal_status,
invalidation_reason, inactive_since, and the synced / failover flags —
the columns operators use to spot a lagging or invalidated slot before it fills
pg_wal. It is the observability half of the safety story discussed in the
Beyond section: the slot mechanism is only production-safe because its state is
inspectable.
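A monitoring check built on those columns usually reduces to LSN arithmetic: how many bytes of WAL lie between a slot's restart_lsn and the current write position. The sketch below assumes LSNs arrive in the textual pg_lsn form (two hexadecimal halves separated by a slash). The function names and the alert threshold are hypothetical, and the parser is deliberately lenient (it relies on sscanf's hexadecimal conversion).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the textual form "HI/LO" (high and low 32 bits in hexadecimal) into
 * a 64-bit byte position. Returns false on malformed input.
 */
bool
toy_parse_lsn(const char *text, uint64_t *lsn)
{
    uint32_t hi, lo;
    int      consumed = 0;

    assert(text != NULL && lsn != NULL);

    if (sscanf(text, "%" SCNx32 "/%" SCNx32 "%n", &hi, &lo, &consumed) != 2)
        return false;
    if (text[consumed] != '\0')
        return false;                    /* trailing garbage after the LSN */

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* Bytes of WAL between a slot's restart_lsn and the current position. */
uint64_t
toy_retained_bytes(uint64_t current_lsn, uint64_t restart_lsn)
{
    return current_lsn > restart_lsn ? current_lsn - restart_lsn : 0;
}

/* True when the slot pins more WAL than the operator's alert threshold. */
bool
toy_slot_needs_attention(uint64_t current_lsn, uint64_t restart_lsn,
                         uint64_t alert_bytes)
{
    return toy_retained_bytes(current_lsn, restart_lsn) > alert_bytes;
}
```

A caller would feed it the restart_lsn column of each row together with the server's current WAL position and alert on any slot for which toy_slot_needs_attention returns true.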
ReplicationSlotsDropDBSlots handles the DROP DATABASE interaction. Because a
logical slot is database-scoped, dropping a database must first drop every
logical slot belonging to it; this function iterates the array, skips physical
and non-matching slots, and reuses the acquire-then-ReplicationSlotDropAcquired
path for each match:
```c
// ReplicationSlotsDropDBSlots (src/backend/replication/slot.c)
for (i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

    if (!s->in_use)
        continue;
    if (!SlotIsLogical(s))
        continue;                  /* only logical slots are db-specific */
    if (s->data.database != dboid)
        continue;                  /* not our database */

    /* NB: intentionally including invalidated slots */

    SpinLockAcquire(&s->mutex);
    active_pid = s->active_pid;
    if (active_pid == 0)
    {
        /* claim it so we can drop it */
        MyReplicationSlot = s;
        s->active_pid = MyProcPid;
    }
    SpinLockRelease(&s->mutex);

    /* ... if still active in another backend, bail out (rare); else DropAcquired ... */
}
```

The companion ReplicationSlotsCountDBSlots is the pre-check dropdb calls to
decide whether to error or proceed; it counts how many slots (and how many
active ones) belong to the target database.
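The counting pre-check described above follows the same filter as the drop loop. A minimal sketch, assuming a simplified slot record (ToyDbSlot) in which a zero database OID marks a physical slot; all names here are hypothetical and no locking is modeled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyOid;
#define TOY_INVALID_OID ((ToyOid) 0)

typedef struct ToyDbSlot
{
    bool   in_use;
    ToyOid database;                 /* TOY_INVALID_OID marks a physical slot */
    int    active_pid;               /* 0 when no backend holds the slot */
} ToyDbSlot;

/*
 * Count the logical slots that belong to dboid and, separately, how many of
 * those are currently held by a backend. Returns true if any slot matched.
 */
bool
toy_count_db_slots(const ToyDbSlot *slots, size_t nslots, ToyOid dboid,
                   int *nslots_out, int *nactive_out)
{
    assert(slots != NULL || nslots == 0);
    assert(nslots_out != NULL && nactive_out != NULL);

    *nslots_out = 0;
    *nactive_out = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        const ToyDbSlot *s = &slots[i];

        if (!s->in_use)
            continue;
        if (s->database == TOY_INVALID_OID)
            continue;                /* physical slots are not db-specific */
        if (s->database != dboid)
            continue;                /* belongs to another database */

        (*nslots_out)++;
        if (s->active_pid != 0)
            (*nactive_out)++;
    }

    return *nslots_out > 0;
}

/* dropdb-style decision: proceed only if no matching slot is still active. */
bool
toy_can_drop_database(const ToyDbSlot *slots, size_t nslots, ToyOid dboid)
{
    int n, active;

    toy_count_db_slots(slots, nslots, dboid, &n, &active);
    return active == 0;
}
```

Separating the total from the active count lets the caller distinguish "slots exist but are idle, so they can be dropped along with the database" from "a consumer is attached right now, so the drop must fail".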
Invalidation and failover slots (slot sync)
A slot that falls too far behind, loses catalog rows it still needs, outlives
the wal_level it requires, or sits idle past a timeout is invalidated explicitly
rather than being left to silently corrupt the system.
InvalidateObsoleteReplicationSlots (driven by the checkpointer, for example when
max_slot_wal_keep_size is exceeded) iterates slots and calls
InvalidatePossiblyObsoleteSlot, which under the slot mutex determines a cause
via DetermineSlotInvalidationCause and, if the slot is not held by another
PID, marks it invalid in place:
```c
// InvalidatePossiblyObsoleteSlot (src/backend/replication/slot.c)
SpinLockAcquire(&s->mutex);
restart_lsn = s->data.restart_lsn;

if (s->data.invalidated == RS_INVAL_NONE)
    invalidation_cause = DetermineSlotInvalidationCause(possible_causes, s,
                                                        oldestLSN, dboid,
                                                        snapshotConflictHorizon,
                                                        &inactive_since, now);

if (invalidation_cause == RS_INVAL_NONE)
{
    SpinLockRelease(&s->mutex);
    ...
    break;
}

if (active_pid == 0)
{
    MyReplicationSlot = s;
    s->active_pid = MyProcPid;
    s->data.invalidated = invalidation_cause;    /* RS_INVAL_WAL_REMOVED, _HORIZON, ... */

    if (invalidation_cause == RS_INVAL_WAL_REMOVED)
    {
        s->data.restart_lsn = InvalidXLogRecPtr; /* it pins nothing now */
        s->last_saved_restart_lsn = InvalidXLogRecPtr;
    }
    *invalidated = true;
}
SpinLockRelease(&s->mutex);
```

The four causes are RS_INVAL_WAL_REMOVED (required WAL exceeded the keep
limit), RS_INVAL_HORIZON (required catalog rows were removed, which happens on a
standby whose primary vacuumed), RS_INVAL_WAL_LEVEL (wal_level dropped below
what the slot needs), and RS_INVAL_IDLE_TIMEOUT (idle_replication_slot_timeout).
Once invalidated, the slot is skipped by both retention aggregators, so it stops
pinning resources. That is the deliberate tradeoff: lose the lagging consumer,
protect the primary.
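The interplay between cause determination and the retention aggregators can be illustrated with a reduced model. This is a sketch under stated assumptions: only two of the four causes are modeled, the TOY_ names and the seconds-based idle clock are hypothetical, and the real DetermineSlotInvalidationCause takes more inputs than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)

/* Power-of-two values so a caller can pass a mask of causes to check. */
typedef enum ToyInvalCause
{
    TOY_INVAL_NONE = 0,
    TOY_INVAL_WAL_REMOVED = 1 << 0,
    TOY_INVAL_IDLE_TIMEOUT = 1 << 1
} ToyInvalCause;

typedef struct ToyInvalSlot
{
    ToyLsn        restart_lsn;
    int64_t       inactive_since;    /* seconds; 0 means currently active */
    ToyInvalCause invalidated;
} ToyInvalSlot;

/* Return the first applicable cause among those enabled in the mask. */
ToyInvalCause
toy_determine_cause(unsigned possible_causes, const ToyInvalSlot *s,
                    ToyLsn oldest_lsn, int64_t now, int64_t idle_timeout)
{
    assert(s != NULL);

    if ((possible_causes & TOY_INVAL_WAL_REMOVED) &&
        s->restart_lsn != TOY_INVALID_LSN && s->restart_lsn < oldest_lsn)
        return TOY_INVAL_WAL_REMOVED;

    if ((possible_causes & TOY_INVAL_IDLE_TIMEOUT) &&
        idle_timeout > 0 && s->inactive_since > 0 &&
        now - s->inactive_since >= idle_timeout)
        return TOY_INVAL_IDLE_TIMEOUT;

    return TOY_INVAL_NONE;
}

/* Mark the slot invalid in place; a WAL-removed slot pins nothing afterwards. */
void
toy_invalidate(ToyInvalSlot *s, ToyInvalCause cause)
{
    assert(s != NULL && cause != TOY_INVAL_NONE);

    s->invalidated = cause;
    if (cause == TOY_INVAL_WAL_REMOVED)
        s->restart_lsn = TOY_INVALID_LSN;
}

/* Oldest restart_lsn over valid slots; invalidated slots are skipped. */
ToyLsn
toy_required_lsn(const ToyInvalSlot *slots, size_t n)
{
    ToyLsn min_required = TOY_INVALID_LSN;

    for (size_t i = 0; i < n; i++)
    {
        if (slots[i].invalidated != TOY_INVAL_NONE)
            continue;
        if (slots[i].restart_lsn == TOY_INVALID_LSN)
            continue;
        if (min_required == TOY_INVALID_LSN ||
            slots[i].restart_lsn < min_required)
            min_required = slots[i].restart_lsn;
    }
    return min_required;
}
```

Once the lagging slot is invalidated, the aggregated minimum jumps to the next-oldest valid slot, which is what allows the recycler to reclaim the WAL the dead consumer was pinning.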
Failover slots / slot sync. A logical slot created with failover = true is
a candidate for synchronization to physical standbys, so that after a failover
the promoted standby already has the logical slot at the right position and the
logical consumer can reconnect without losing data. pg_sync_replication_slots
runs on the standby, connects to the primary, and copies the primary’s
failover slots into local slots:
```c
// pg_sync_replication_slots (src/backend/replication/slotfuncs.c)
if (!RecoveryInProgress())
    ereport(ERROR,
            errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
            errmsg("replication slots can only be synchronized to a standby server"));

ValidateSlotSyncParams(ERROR);

load_file("libpqwalreceiver", false);

wrconn = walrcv_connect(PrimaryConnInfo, false, false, false,
                        app_name.data, &err);

SyncReplicationSlots(wrconn);    /* the actual copy loop lives in slotsync.c */

walrcv_disconnect(wrconn);
```

Synced slots carry data.synced = true and are not directly consumable on the
standby (they exist only to be promoted into real slots at failover). The
ReplicationSlotCreate guard rails enforce the matching invariants: failover
cannot be enabled on a standby-created slot (no cascading sync) or on a temporary
slot (temporaries aren’t synced), except when the slot-sync worker itself is
the creator. The detailed sync worker loop (slotsync.c) is deferred to
postgres-wal-sender-receiver.md and postgres-logical-decoding.md.
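The guard rails just described amount to a small predicate. The sketch below encodes one reading of that prose, in which the slot-sync worker is exempt from both restrictions; the enum and function names are hypothetical, and the exact conditions in ReplicationSlotCreate may be narrower than this model.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyPersistency
{
    TOY_PERSISTENT,
    TOY_EPHEMERAL,
    TOY_TEMPORARY
} ToyPersistency;

typedef enum ToyCreateCheck
{
    TOY_CREATE_OK,
    TOY_CREATE_FAILOVER_ON_STANDBY,  /* no cascading sync */
    TOY_CREATE_FAILOVER_TEMPORARY    /* temporaries are not synced */
} ToyCreateCheck;

/* Decide whether a create request with the given failover flag is allowed. */
ToyCreateCheck
toy_check_failover_request(bool failover, bool in_recovery,
                           ToyPersistency persistency,
                           bool creator_is_sync_worker)
{
    assert(persistency <= TOY_TEMPORARY);    /* reject out-of-range values */

    if (!failover || creator_is_sync_worker)
        return TOY_CREATE_OK;
    if (in_recovery)
        return TOY_CREATE_FAILOVER_ON_STANDBY;
    if (persistency == TOY_TEMPORARY)
        return TOY_CREATE_FAILOVER_TEMPORARY;
    return TOY_CREATE_OK;
}

/*
 * A synced slot is a placeholder until promotion: model it as unusable for
 * consumption while the server is still in recovery.
 */
bool
toy_slot_consumable_here(bool synced, bool in_recovery)
{
    return !(synced && in_recovery);
}
```

Keeping the two checks separate from slot allocation means a rejected request never touches the shared slot array.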
```mermaid
flowchart LR
  C["pg_create_*_replication_slot"] --> CR["ReplicationSlotCreate<br/>find free entry, set in_use"]
  CR --> RW["ReplicationSlotReserveWal<br/>set restart_lsn"]
  RW --> CL["ComputeRequiredLSN / Xmin<br/>publish horizons"]
  ADV["pg_replication_slot_advance"] --> AQ["ReplicationSlotAcquire<br/>own the slot"]
  AQ --> MV["move restart_lsn / confirmed_flush forward<br/>MarkDirty"]
  MV --> CL
  CKPT["CheckPointReplicationSlots"] --> SV["SaveSlotToPath<br/>tmp+fsync+rename"]
  SV --> CL
  INV["InvalidateObsoleteReplicationSlots<br/>max_slot_wal_keep_size exceeded"] --> IPO["InvalidatePossiblyObsoleteSlot<br/>set data.invalidated"]
  IPO --> CL
  CL --> XLOGR["xlog recycler skips needed WAL"]
  CL --> VAC["vacuum honors catalog_xmin"]
```
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| ReplicationSlotPersistentData (struct) | src/include/replication/slot.h | 70 |
| ReplicationSlot (struct) | src/include/replication/slot.h | 155 |
| SlotIsPhysical / SlotIsLogical (macros) | src/include/replication/slot.h | 228 |
| ReplicationSlotCtlData (struct) | src/include/replication/slot.h | 234 |
| ReplicationSlotsShmemSize | src/backend/replication/slot.c | 186 |
| ReplicationSlotsShmemInit | src/backend/replication/slot.c | 204 |
| ReplicationSlotCreate | src/backend/replication/slot.c | 353 |
| SearchNamedReplicationSlot | src/backend/replication/slot.c | 509 |
| ReplicationSlotAcquire | src/backend/replication/slot.c | 589 |
| ReplicationSlotRelease | src/backend/replication/slot.c | 716 |
| ReplicationSlotDrop | src/backend/replication/slot.c | 846 |
| ReplicationSlotDropPtr | src/backend/replication/slot.c | 978 |
| ReplicationSlotPersist | src/backend/replication/slot.c | 1120 |
| ReplicationSlotsComputeRequiredXmin | src/backend/replication/slot.c | 1145 |
| ReplicationSlotsComputeRequiredLSN | src/backend/replication/slot.c | 1227 |
| ReplicationSlotsComputeLogicalRestartLSN | src/backend/replication/slot.c | 1297 |
| ReplicationSlotsCountDBSlots | src/backend/replication/slot.c | 1376 |
| ReplicationSlotsDropDBSlots | src/backend/replication/slot.c | 1434 |
| ReplicationSlotReserveWal | src/backend/replication/slot.c | 1565 |
| InvalidatePossiblyObsoleteSlot | src/backend/replication/slot.c | 1833 |
| InvalidateObsoleteReplicationSlots | src/backend/replication/slot.c | 2061 |
| CheckPointReplicationSlots | src/backend/replication/slot.c | 2121 |
| StartupReplicationSlots | src/backend/replication/slot.c | 2199 |
| CreateSlotOnDisk | src/backend/replication/slot.c | 2260 |
| SaveSlotToPath | src/backend/replication/slot.c | 2321 |
| RestoreSlotFromDisk | src/backend/replication/slot.c | 2484 |
| pg_physical_replication_slot_advance | src/backend/replication/slotfuncs.c | 465 |
| pg_logical_replication_slot_advance | src/backend/replication/slotfuncs.c | 501 |
| pg_replication_slot_advance | src/backend/replication/slotfuncs.c | 510 |
| pg_create_physical_replication_slot | src/backend/replication/slotfuncs.c | 65 |
| pg_create_logical_replication_slot | src/backend/replication/slotfuncs.c | 169 |
| pg_drop_replication_slot | src/backend/replication/slotfuncs.c | 218 |
| pg_get_replication_slots | src/backend/replication/slotfuncs.c | 236 |
| pg_sync_replication_slots | src/backend/replication/slotfuncs.c | 895 |
Source verification (as of 2026-06-05)
Verified against /data/hgryoo/references/postgres at REL_18_STABLE, commit
273fe94 (PG 18.x). Method: read slot.c and slotfuncs.c in full for the
quoted regions, cross-checked struct/enum/macro definitions in
src/include/replication/slot.h, and confirmed every quoted symbol exists at
the stated line.
- Struct shape confirmed. ReplicationSlotPersistentData carries restart_lsn, xmin, catalog_xmin, confirmed_flush, plus the PG-recent fields two_phase, two_phase_at, synced, and failover. The volatile ReplicationSlot carries effective_xmin / effective_catalog_xmin, last_saved_restart_lsn, last_saved_confirmed_flush, and inactive_since. All are present on REL_18; none are pre-18 names.
- Invalidation causes confirmed. ReplicationSlotInvalidationCause has exactly four power-of-two causes plus RS_INVAL_NONE, with RS_INVAL_MAX_CAUSES == 4; the StaticAssertDecl at slot.c:124 enforces the SlotInvalidationCauses table length. RS_INVAL_IDLE_TIMEOUT is the PG-18-era addition and is present.
- Locking model confirmed. ReplicationSlotControlLock (shared for scans, exclusive for in_use flips), ReplicationSlotAllocationLock (exclusive for create/reserve, shared for the checkpoint flush loop), and per-slot mutex spinlocks match the header’s documented two-tier model.
- Effective-vs-persistent invariant confirmed. The header comment on effective_xmin, the use of effective_* (not data.*) inside ReplicationSlotsComputeRequiredXmin, and the use of last_saved_restart_lsn (not the live restart_lsn) inside ReplicationSlotsComputeRequiredLSN for persistent slots jointly confirm the “effective never ahead of on-disk” safety property.
- No fabricated symbols. Every function quoted (ReplicationSlotCreate, ReplicationSlotReserveWal, CheckPointReplicationSlots, StartupReplicationSlots, InvalidatePossiblyObsoleteSlot, pg_replication_slot_advance, pg_sync_replication_slots, etc.) was located by direct grep of the two source files. SyncReplicationSlots, LogicalSlotAdvanceAndCheckSnapState, and PhysicalWakeupLogicalWalSnd are referenced as call targets but defined in slotsync.c / logical.c / walsender.c (out of scope here; cross-referenced).
- Scope boundary. This doc does not assert the contents of slotsync.c’s worker loop, the reorder-buffer/snapbuild internals, or the walsender streaming protocol; those are deferred to the cross-referenced docs.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s slot is a specific point in a broad design space. Comparing it to neighboring systems sharpens what is essential versus incidental.
MySQL binlog + GTIDs. MySQL does not register per-consumer markers on the
producer; the binary log is retained by a time-based expiry policy
(binlog_expire_logs_seconds), with max_binlog_size only setting the size at
which log files rotate, and a replica tracks its own position (file+offset, or
a GTID set) on the replica side. The producer is
stateless about consumers. This is the “fixed retention window” design: simpler
on the primary (no shared array, no min computation, no slot files to fsync),
but a replica that disconnects longer than the window must be re-cloned. There
is no analogue of catalog_xmin, because MySQL’s binlog in ROW format embeds
enough column metadata (and the replica has the schema) that historical catalog
pinning is not required. PostgreSQL deliberately pays the slot-bookkeeping cost
to get the never-lose-a-consumer guarantee, and pays a second cost,
catalog_xmin, specifically because its logical decoding reconstructs changes
against historical versions of its own catalog rather than embedding schema in
the log.
Oracle GoldenGate / LogMiner. Oracle’s redo/archive logs are retained by RMAN
retention policy; CDC tools track a checkpoint (SCN) in their own metadata
table. The “don’t vacuum what the miner needs” problem is handled by Oracle’s
undo retention and explicit LogMiner dictionary snapshots rather than by a
horizon fed into the GC computation. The architectural contrast is the same as
MySQL: consumer position lives outside the engine’s recycler, so the engine’s
GC is consumer-oblivious and the CDC tool owns the retention risk.
Kafka consumer offsets. Kafka is the cleanest “consumer-registered marker”
analogue: each consumer group commits an offset, and log retention is by
time/size with an optional compaction policy — but notably Kafka does not by
default retain back to the slowest consumer; a consumer that lags past retention
silently loses messages (auto.offset.reset). PostgreSQL’s slot is stricter:
the primary will retain WAL unboundedly to protect a slot, which is why
max_slot_wal_keep_size and slot invalidation exist as the escape valve. The
PostgreSQL design thus sits between Kafka (consumer can be sacrificed silently)
and a naïve unbounded slot (primary fills its disk): invalidation makes the
sacrifice explicit and observable (pg_replication_slots.invalidation_reason).
The dead-slot disk-fill failure mode. The single most common
slot-operational incident — a forgotten or crashed consumer pinning WAL until
the primary’s pg_wal fills and the database halts — drove a multi-release arc
of features: max_slot_wal_keep_size (bound the pin, invalidate on breach),
idle_replication_slot_timeout (invalidate slots inactive too long), and the
inactive_since field plus monitoring columns so operators can spot a lagging
slot before it bites. The research/engineering lesson is that a
consumer-registered retention marker is only safe in production if paired with a
bounded override and observability — the marker alone is a foot-gun.
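The "marker plus bounded override" lesson can be written down as one retention pass. This is a toy model, not PostgreSQL's algorithm: toy_retention_floor and ToyPinSlot are hypothetical names, the keep limit is expressed in bytes (a negative value meaning unbounded), and segment granularity is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef struct ToyPinSlot
{
    ToyLsn restart_lsn;
    bool   invalidated;
} ToyPinSlot;

/*
 * Return the oldest LSN that must be kept. Slots older than the bound
 * (current_lsn - keep_limit_bytes) are invalidated instead of honored;
 * *ninvalidated reports how many were sacrificed in this pass.
 */
ToyLsn
toy_retention_floor(ToyPinSlot *slots, size_t n, ToyLsn current_lsn,
                    int64_t keep_limit_bytes, size_t *ninvalidated)
{
    ToyLsn keep_from = current_lsn;  /* with no slots, nothing older is pinned */
    ToyLsn bound = 0;                /* 0 means no lower bound on the pin */

    assert(ninvalidated != NULL);
    assert(slots != NULL || n == 0);
    *ninvalidated = 0;

    if (keep_limit_bytes >= 0 && (uint64_t) keep_limit_bytes < current_lsn)
        bound = current_lsn - (uint64_t) keep_limit_bytes;

    for (size_t i = 0; i < n; i++)
    {
        if (slots[i].invalidated)
            continue;                /* already sacrificed: pins nothing */

        if (slots[i].restart_lsn < bound)
        {
            slots[i].invalidated = true;   /* protect the disk, lose the consumer */
            (*ninvalidated)++;
            continue;
        }

        if (slots[i].restart_lsn < keep_from)
            keep_from = slots[i].restart_lsn;
    }

    return keep_from;
}
```

With a negative limit the function degenerates to the pure slowest-consumer rule, which is exactly the configuration in which a forgotten slot can fill the disk.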
Failover slots as a distributed-systems problem. Synchronizing a logical
slot to a standby (slot sync, PG 17+) is a small instance of a hard distributed
problem: keeping a piece of consumer-progress state replicated and consistent
across a leader change without losing or double-processing data. The PostgreSQL
solution — copy the failover slot’s restart_lsn/catalog_xmin/confirmed_flush
to the standby and only let the standby’s synced slot lag behind the primary’s
(never ahead) — is the same “be conservative across the boundary” invariant that
governs the effective-vs-persistent split, applied across nodes instead of
across a crash. Active research frontiers in this space include reducing the
coordination cost of synchronous slot sync (it adds a round-trip to the
critical path) and extending logical replication to multi-active topologies
where a single linear confirmed_flush no longer captures consumer progress.
Where the abstraction leaks. The catalog_xmin mechanism couples logical
replication to vacuum in a way that surprises operators: a lagging logical slot
can hold back catalog vacuum cluster-wide, causing catalog bloat far from the
replicated tables. This is the price of reconstructing changes against the
server’s own catalog as it stood when each change was made. Systems that embed
schema in the log (MySQL ROW format, Debezium schema history) avoid this
coupling at the cost of larger logs and a separate schema store: a genuine,
unresolved tradeoff rather than a bug.
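The coupling is easy to see in a toy horizon computation. The sketch assumes, as a simplification, that the data horizon is held back by running transactions and slot xmin only, while the catalog horizon is additionally held back by slot catalog_xmin. The names are hypothetical, and the wraparound-aware comparison is valid for normal transaction IDs only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID ((ToyXid) 0)

/* Modulo-2^32 "a is older than b" for normal transaction ids. */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Older of two xids, treating TOY_INVALID_XID as "no constraint". */
ToyXid
toy_older_xid(ToyXid a, ToyXid b)
{
    if (a == TOY_INVALID_XID)
        return b;
    if (b == TOY_INVALID_XID)
        return a;
    return toy_xid_precedes(a, b) ? a : b;
}

typedef struct ToyHorizons
{
    ToyXid data_horizon;             /* user tables: rows older than this may go */
    ToyXid catalog_horizon;          /* catalog tables */
} ToyHorizons;

ToyHorizons
toy_compute_horizons(ToyXid oldest_running_xid, ToyXid slot_xmin,
                     ToyXid slot_catalog_xmin)
{
    ToyHorizons h;

    assert(oldest_running_xid != TOY_INVALID_XID);

    h.data_horizon = toy_older_xid(oldest_running_xid, slot_xmin);
    h.catalog_horizon = toy_older_xid(h.data_horizon, slot_catalog_xmin);
    return h;
}
```

In this model a logical slot stuck at catalog_xmin 1200 leaves the data horizon at the oldest running transaction but drags the catalog horizon back to 1200, which is the catalog-only bloat pattern described above.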
Sources
- Source tree. /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94 (PG 18.x):
  - src/backend/replication/slot.c: slot lifecycle, retention computation, persistence, invalidation, standby-slot helpers.
  - src/backend/replication/slotfuncs.c: SQL-callable create / drop / advance / copy / sync functions and pg_get_replication_slots.
  - src/include/replication/slot.h: ReplicationSlot, ReplicationSlotPersistentData, ReplicationSlotCtlData, persistency and invalidation enums, SlotIsPhysical / SlotIsLogical.
- Textbook theory.
  - Kleppmann, Designing Data-Intensive Applications (2017), ch. 5 (Replication: leader log retention) and ch. 11 (Stream Processing / change data capture: schema-aware change reconstruction). Captured in the KB bibliography (raw/system/textbooks/).
  - Petrov, Database Internals (2019), the WAL chapter (LSN-addressed log, segment recycling low-water marks, fsync-ordered durability). Captured at knowledge/research/dbms-general/database-internals.md.
- Cross-references within this KB.
  - postgres-wal-sender-receiver.md: the walsender/walreceiver transport that consumes a slot and drives confirmed_flush; also the slot-sync worker.
  - postgres-logical-decoding.md: reorder buffer, snapshot builder, and LogicalSlotAdvanceAndCheckSnapState that move a logical slot’s horizons.
  - postgres-overview-replication-ha.md: how slots sit among the other replication-ha consumers of the WAL stream.
  - postgres-xlog-wal.md / postgres-checkpoint.md: the WAL recycler and the checkpoint that reads the minimum LSN published through XLogSetReplicationSlotMinimumLSN and flushes slots.
  - postgres-procarray.md / postgres-vacuum.md: the ProcArray oldest-xmin computation that honors ReplicationSlotsComputeRequiredXmin.