PostgreSQL Replication Slots — Retaining WAL and catalog_xmin

Every replication scheme that ships a log rather than full snapshots faces the same garbage-collection problem: the primary continuously produces a write-ahead log, and a downstream consumer reads it at its own pace. The primary would like to recycle old log segments as soon as they are no longer needed for its own crash recovery — but “no longer needed” now has a second clause: no longer needed by any consumer either. If the primary recycles a segment a lagging standby still has to read, that standby can never catch up; it must be rebuilt from a fresh base backup.

Kleppmann’s Designing Data-Intensive Applications (ch. 5, “Replication”) frames this as the core tension of log-based replication: the leader’s log is the single source of truth, followers replay it position by position, and the leader therefore has to retain log entries until every follower has consumed them. The naïve fix (keep all WAL forever) trades a correctness bug for an operational one: unbounded disk growth that eventually crashes the primary. The naïve opposite, a fixed window (wal_keep_size, archive-and-forget), is a guess: too small and a briefly disconnected standby falls off the window; too large and you waste disk on consumers that have long since caught up. What the system actually wants is a feedback channel: each consumer publishes how far it has consumed, and the primary retains exactly back to the slowest consumer and no further.

For physical replication that feedback is a single number — the oldest WAL position (LSN) any standby still needs. For logical replication (DDIA ch. 11, “Stream Processing” / change data capture) there is a second, subtler resource. A logical decoder reads WAL and reconstructs row-level changes, which requires interpreting each change against the system catalog as it existed when the change was made — the table’s column layout, types, and TOAST mapping at that historical moment. Ordinary MVCC vacuum is free to prune old catalog row versions once no running transaction can see them. But a logical decoder replaying month-old WAL needs the month-old catalog. So a logical consumer must pin not only old WAL but also an old catalog snapshot — expressed as a transaction-id horizon below which catalog tuples must not be vacuumed.

Database Internals (Petrov, 2019, the WAL chapter) supplies the mechanism vocabulary on the producer side: the log is a sequence of records addressed by monotonically increasing offsets (LSNs), segments are recycled once a “low-water mark” passes them, and durability rests on fsync ordering. A replication slot is precisely a named, persistent low-water mark, registered by a consumer, that the producer’s recycling logic must respect. Two properties make it more than a stored number:

  1. Durability. The bookmark must survive a primary crash. If it lived only in volatile memory, a crash-restart would lose the consumer’s position and the primary might recycle WAL the consumer still needs. So the slot is written to disk and reloaded before recovery begins.
  2. Atomic publication. Many slots may exist; the retention decision is the minimum across all of them. Computing that minimum, and the inverse operation of reserving WAL when a slot is created, must interleave safely with concurrent checkpoints that are simultaneously deciding what to recycle — otherwise a slot could reserve a segment the checkpoint just decided to delete.

These two properties — crash-safe persistence and race-free min/reserve against the recycler — are the whole engineering problem, and they dominate the source.

It is worth dwelling on why the catalog horizon is separate from the WAL horizon, because the two-number design is the single most distinctive feature of PostgreSQL’s slot. WAL retention is a question about bytes: which physical segments on disk must not be unlinked. Catalog retention is a question about visibility: which historical row versions of pg_class, pg_attribute, pg_type, and the rest must remain visible to a snapshot taken at decode time. These are different units (LSN versus transaction id) consumed by different subsystems (the WAL recycler versus vacuum’s oldest-xmin), so they cannot be a single value. A physical standby needs only the first; it replays WAL block by block and never interprets row contents, so it has no opinion about the catalog. A logical decoder needs both: the WAL bytes to read the changes, and the catalog snapshot to interpret them. The slot therefore carries both horizons, and the discriminator database == InvalidOid selects which ones are actually populated.

A further consequence of the feedback-channel model is that the direction of movement matters enormously. A consumer’s reported position may only move forward — a slot can confirm it has consumed up to LSN X, after which the producer may discard everything below X, but it can never un-confirm. If a slot were allowed to move backward, the producer might already have recycled the WAL the slot now claims to need. So monotonicity of restart_lsn and confirmed_flush is a correctness invariant, not a nicety, and the advance path enforces it explicitly by rejecting any target below the current minimum.
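The forward-only rule can be illustrated with a minimal sketch. The `toy_slot` type, the `lsn_t` alias, and `toy_slot_advance` are hypothetical names invented for this example, not PostgreSQL's API; the sketch only assumes that positions are plain 64-bit offsets and that the caller knows how far the log is durably written.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;   /* hypothetical stand-in for a log position */

/* Hypothetical, much-reduced slot: only the confirmed position. */
typedef struct
{
    lsn_t confirmed;
} toy_slot;

/*
 * Try to move the slot to `target`, never past `durable_end` (the end of
 * the log that is safely on disk). Returns false and leaves the slot
 * untouched when the request would move it backward.
 */
bool
toy_slot_advance(toy_slot *slot, lsn_t target, lsn_t durable_end)
{
    if (target > durable_end)
        target = durable_end;       /* clamp: cannot confirm unwritten log */
    if (target < slot->confirmed)
        return false;               /* reject: positions only move forward */
    slot->confirmed = target;
    return true;
}
```

Rejecting the backward move outright, rather than silently ignoring it, makes a misbehaving consumer visible instead of letting it believe it still owns log the producer may already have discarded.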

Log-shipping systems converge on a small design space for “how does the producer know what to retain”:

  • Fixed retention window. Keep the last N segments / T hours / S bytes regardless of consumers (PostgreSQL’s wal_keep_segments, which predates slots and was renamed wal_keep_size in version 13; MySQL’s binlog_expire_logs_seconds). Simple and bounded, but a consumer outside the window is lost. The window is a guess decoupled from actual consumer progress.
  • Consumer-registered retention (slots / bookmarks). Each consumer registers a durable marker; the producer retains back to the oldest marker. This is PostgreSQL’s replication slot. Kafka’s consumer-group committed offset and Oracle GoldenGate’s checkpoint table are the same kind of durable marker, although Kafka brokers still expire log segments by time or size rather than by committed offset. The producer’s recycler takes the min over all markers. Pro: never lose a registered consumer. Con: a dead consumer that never un-registers pins resources forever, the classic “slot left behind” disk-fill incident.
  • Pull-from-archive. Recycle aggressively to local disk but archive every segment to cheap storage (S3, archive_command); a lagging consumer restores from the archive. Decouples local disk from consumer lag, at the cost of archive latency and a separate restore path.
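The difference between the first two policies shows up as soon as one consumer lags. The sketch below contrasts them; every name in it (`keep_from_fixed_window`, `keep_from_markers`, `consumer_can_resume`) is hypothetical, and positions are assumed to be plain byte offsets.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;   /* hypothetical log position: a byte offset */

/* Fixed window: keep the last `window` bytes before `head`, whoever is reading. */
lsn_t
keep_from_fixed_window(lsn_t head, uint64_t window)
{
    return head > window ? head - window : 0;
}

/* Registered markers: keep everything from the slowest consumer onward. */
lsn_t
keep_from_markers(lsn_t head, const lsn_t *markers, size_t n)
{
    lsn_t keep = head;          /* with no consumers, nothing before head is pinned */

    for (size_t i = 0; i < n; i++)
        if (markers[i] < keep)
            keep = markers[i];
    return keep;
}

/* A consumer can resume only if its position is still retained. */
int
consumer_can_resume(lsn_t pos, lsn_t keep_from)
{
    return pos >= keep_from;
}
```

With the head at 1000 and a 100-byte window, a consumer at 700 is lost under the fixed window but retained under markers, at the price of holding 300 bytes instead of 100.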

For the catalog-snapshot half of the problem, systems that do logical CDC all need some form of “don’t garbage-collect schema/row versions the decoder still needs.” Approaches range from snapshotting the schema at capture-start (Debezium-style external CDC) to integrating with the engine’s own MVCC GC so the decoder’s horizon participates directly in the vacuum decision. PostgreSQL takes the integrated route: a logical slot’s catalog_xmin is published into the same global oldest-xmin computation that bounds vacuum, so the decoder’s needs and vacuum’s freedom are reconciled in one place rather than via an external schema registry.

The locking shape is also conventional. The set of slots is a small, fixed-size shared array (slots are scarce and long-lived, so no dynamic hash is warranted). A coarse lock guards membership (which array entries are live); fine-grained per-entry locks guard the mutable fields. The retention min is recomputed on every event that could change it — slot create, advance, drop, release — and published to the consumer subsystems (the xlog recycler, the ProcArray’s vacuum horizon) through narrow setters.

The recompute-and-publish cadence is itself a design choice with two viable poles. One pole is to push every change immediately and synchronously (low staleness, high contention); the other is to recompute lazily only when a consumer subsystem asks (low contention, possibly stale). PostgreSQL takes a hybrid: the two hot minima — required LSN and required xmin — are maintained eagerly (recomputed on every horizon-moving event and cached in the xlog module and the ProcArray respectively), because the WAL recycler and vacuum consult them at high frequency and cannot afford to walk the slot array themselves. The colder logical-restart LSN is computed on demand, because its callers are rare. This split — cache what is read often, compute what is read seldom — is the same tradeoff every shared-state subsystem makes; what is notable here is that the eager recomputation deliberately uses only a shared lock and accepts a momentarily stale (always-conservative) result rather than serializing all slot activity behind an exclusive lock.

A last structural decision worth naming is fusing creation with ownership. In many systems “create a resource” and “acquire it for use” are separate calls; PostgreSQL fuses them for slots (ReplicationSlotCreate leaves the creating backend as the active owner) because a freshly created slot that nobody owns is a liability — it pins WAL but has no consumer driving it forward, exactly the dead-slot failure mode. Making the creator the owner means the slot’s pin is tied to a live process from birth; the slot only becomes a standalone durable object once the owner explicitly persists and releases it.

PostgreSQL implements a slot as a ReplicationSlot struct living in a fixed-size shared-memory array sized by max_replication_slots. The split between durable and volatile state is explicit in the type: the on-disk part is a nested ReplicationSlotPersistentData data substruct; everything outside it is rebuilt at startup.

// ReplicationSlotPersistentData — src/include/replication/slot.h
typedef struct ReplicationSlotPersistentData
{
    NameData name; /* the slot's identifier */
    Oid database; /* InvalidOid => physical, else logical */
    ReplicationSlotPersistency persistency; /* PERSISTENT/EPHEMERAL/TEMPORARY */
    TransactionId xmin; /* data xmin horizon */
    TransactionId catalog_xmin; /* catalog xmin horizon (logical) */
    XLogRecPtr restart_lsn; /* oldest WAL this slot may require */
    ReplicationSlotInvalidationCause invalidated; /* RS_INVAL_NONE if valid */
    XLogRecPtr confirmed_flush; /* oldest LSN the client has acked */
    /* ... two_phase_at, two_phase, plugin, synced, failover ... */
} ReplicationSlotPersistentData;

The two retention knobs are restart_lsn (the oldest WAL byte position the consumer might still read — the WAL low-water mark) and catalog_xmin (the oldest catalog transaction id the consumer might still need — the vacuum horizon, logical only). database == InvalidOid is the discriminator between physical and logical slots, encoded by two macros:

// slot type discriminators — src/include/replication/slot.h
#define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid)
#define SlotIsLogical(slot) ((slot)->data.database != InvalidOid)

A physical slot pins only WAL; it leaves catalog_xmin invalid. A logical slot is database-scoped (it can only decode the database it was created in) and pins both WAL and a catalog snapshot. The reason physical slots also bother to pin xmin at all is hot-standby feedback: a standby can ask the primary not to vacuum rows its running queries still need, and that horizon rides on the physical slot’s xmin.

The persistency field is a three-valued enum, and the three values encode a small state machine over a slot’s durability:

  • RS_TEMPORARY — the slot lives only for the current session and is dropped on disconnect or error. Used for short-lived consumers (e.g. a pg_basebackup that wants WAL retained only while it runs).
  • RS_EPHEMERAL — a “not quite ready” intermediate state used while building a persistent slot: it is dropped if released before being promoted, so a backend that crashes mid-creation leaves no orphan. ReplicationSlotPersist promotes an ephemeral slot to persistent once it is fully initialized.
  • RS_PERSISTENT — crash-safe and durable; survives release, session end, and restart, and is only removed by an explicit drop or by invalidation.

This staging is the durability analogue of the effective-vs-persistent xmin rule: a slot becomes a durable, resource-pinning object only after it is fully formed, so a failure at any earlier point cleans up automatically rather than leaking a pin.
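The three persistency values can be read as a small state machine. The following is a simplified model, not PostgreSQL's code: the `staged_*` names are hypothetical, and on-disk effects are reduced to an `in_use` flag.

```c
#include <stdbool.h>

typedef enum { STAGE_PERSISTENT, STAGE_EPHEMERAL, STAGE_TEMPORARY } staged_persistency;

/* Hypothetical, reduced slot: liveness, owner, and durability stage. */
typedef struct
{
    bool               in_use;
    int                active_pid;   /* 0 = nobody */
    staged_persistency persistency;
} staged_slot;

/* Creation leaves the creator as owner; non-temporary slots start ephemeral. */
void
staged_create(staged_slot *s, int pid, bool temporary)
{
    s->in_use = true;
    s->active_pid = pid;
    s->persistency = temporary ? STAGE_TEMPORARY : STAGE_EPHEMERAL;
}

/* Promotion: only a fully initialized ephemeral slot becomes persistent. */
void
staged_persist(staged_slot *s)
{
    if (s->persistency == STAGE_EPHEMERAL)
        s->persistency = STAGE_PERSISTENT;
}

/* Release: an unpromoted slot vanishes; others merely lose their owner. */
void
staged_release(staged_slot *s)
{
    if (s->persistency == STAGE_EPHEMERAL)
        s->in_use = false;
    s->active_pid = 0;
}

/* Session end: temporary slots do not outlive the session. */
void
staged_session_end(staged_slot *s)
{
    if (s->in_use && s->persistency == STAGE_TEMPORARY)
    {
        s->in_use = false;
        s->active_pid = 0;
    }
}
```

The point of the model is that the only path to a slot that survives release is through `staged_persist`, so a failure before promotion cannot leave a pin behind.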

The shared struct wraps the persistent data with volatile coordination fields:

// ReplicationSlot — src/include/replication/slot.h
typedef struct ReplicationSlot
{
    slock_t mutex; /* protects the individual fields below */
    bool in_use; /* is this array entry a live slot? */
    pid_t active_pid; /* who is streaming this slot? 0 = nobody */
    bool just_dirtied;
    bool dirty; /* unsaved changes since last flush? */
    TransactionId effective_xmin; /* latest xmin actually on disk */
    TransactionId effective_catalog_xmin;
    ReplicationSlotPersistentData data; /* the crash-safe part */
    LWLock io_in_progress_lock;
    ConditionVariable active_cv; /* signaled when active_pid changes */
    /* logical-only candidate fields used to advance horizons lazily: */
    TransactionId candidate_catalog_xmin;
    XLogRecPtr candidate_xmin_lsn;
    XLogRecPtr candidate_restart_valid;
    XLogRecPtr candidate_restart_lsn;
    XLogRecPtr last_saved_confirmed_flush;
    XLogRecPtr last_saved_restart_lsn;
    /* ... inactive_since ... */
} ReplicationSlot;

Two locking concepts, spelled out in the header’s comment, govern access:

  • ReplicationSlotControlLock (cluster-wide LWLock) guards the in_use flag. A backend flips in_use (create / drop) only while holding it exclusively; a backend scanning the array to read other slots’ data holds it shared.
  • Per-slot mutex (spinlock) guards the mutable fields. The slot’s owner may read its own fields without the spinlock but must take it to write; concurrent non-owners take the spinlock to read.

The effective_xmin / effective_catalog_xmin pair versus the persistent data.xmin / data.catalog_xmin is the subtlest part of the design. For logical decoding it is “extremely important that we never remove any data that’s still needed … even after a crash” (header comment). So the effective value, the one that actually bounds vacuum, is only allowed to advance after the corresponding persistent value has been flushed to disk. This guarantees that vacuum never acts on a horizon newer than the one a restart would read back: at the moment of a crash the effective horizon can only be equal to or older (more conservative) than the on-disk one, never ahead of it. For physical slots the worst case of premature removal is a query cancellation on the standby, so the effective value simply tracks the persistent value with no extra lag.
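The ordering constraint (update in memory, flush, and only then let the garbage collector see the new horizon) can be sketched as follows. This is an illustrative model under simplifying assumptions: `horizon_slot`, `flush_to_disk`, and `horizon_advance` are hypothetical names, the disk is simulated by a struct field, and transaction ids are compared as plain integers with no wraparound handling.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;   /* hypothetical transaction-id type */

typedef struct
{
    xid_t persistent_xmin;   /* in-memory copy of the durable field */
    xid_t on_disk_xmin;      /* what a restart would read back (simulated) */
    xid_t effective_xmin;    /* what the garbage collector obeys */
} horizon_slot;

/* Simulated flush; `io_ok == false` models an I/O failure or a crash. */
static bool
flush_to_disk(horizon_slot *s, bool io_ok)
{
    if (!io_ok)
        return false;
    s->on_disk_xmin = s->persistent_xmin;
    return true;
}

/* Advance the horizon; the effective value moves only after a good flush. */
bool
horizon_advance(horizon_slot *s, xid_t new_xmin, bool io_ok)
{
    s->persistent_xmin = new_xmin;      /* step 1: update in memory */
    if (!flush_to_disk(s, io_ok))       /* step 2: make it durable */
        return false;                   /* effective value stays put */
    s->effective_xmin = new_xmin;       /* step 3: now GC may use it */
    return true;
}
```

If the flush fails, the garbage collector keeps obeying the old horizon, which is exactly the horizon a restart would restore.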

flowchart TB
  subgraph SHMEM["Shared memory: ReplicationSlotCtl"]
    A["replication_slots[0..max_replication_slots-1]<br/>fixed array of ReplicationSlot"]
    A --> S0["slot[0] in_use<br/>data.restart_lsn / data.catalog_xmin<br/>mutex + active_pid"]
    A --> S1["slot[1] in_use ..."]
    A --> SN["slot[N] free (in_use=false)"]
  end
  S0 -->|"min restart_lsn<br/>ComputeRequiredLSN"| XLOG["xlog recycler<br/>XLogSetReplicationSlotMinimumLSN"]
  S0 -->|"min catalog_xmin<br/>ComputeRequiredXmin"| PROC["ProcArray vacuum horizon<br/>ProcArraySetReplicationSlotXmin"]
  S0 -.->|"checkpoint: SaveSlotToPath"| DISK["pg_replslot/&lt;name&gt;/state<br/>tmp + fsync + rename"]
  DISK -.->|"StartupReplicationSlots<br/>before redo"| A

The retention pipeline is the heart of the module. On every event that could move a horizon — slot creation reserving WAL, an advance, a release, a drop — PostgreSQL recomputes the cluster-wide minima and republishes them:

  • ReplicationSlotsComputeRequiredLSN() walks the array, takes the minimum valid restart_lsn, and calls XLogSetReplicationSlotMinimumLSN(). The next checkpoint will not recycle segments below that LSN.
  • ReplicationSlotsComputeRequiredXmin() walks the array, takes the minimum effective_xmin and effective_catalog_xmin, and calls ProcArraySetReplicationSlotXmin(). Vacuum’s global oldest-xmin computation then refuses to prune below that horizon.

The candidate fields (candidate_catalog_xmin, candidate_xmin_lsn, candidate_restart_lsn, candidate_restart_valid) implement the lazy-advance machinery for logical slots and deserve a sentence even though their consumers live in the logical-decoding subsystem. When a logical decoder reads WAL, it learns of a potential new horizon long before it can safely commit to it — it must wait until the client confirms it has flushed the corresponding output. The candidate fields stage that potential horizon: “once the client confirms flushes at or beyond candidate_xmin_lsn, it is safe to advance the catalog xmin to candidate_catalog_xmin; once candidate_restart_valid is passed, restart_lsn may rise to candidate_restart_lsn.” Only when those conditions are met does the effective horizon move and a recompute fire. This staging is exactly the mechanism that upholds the effective-never-ahead-of-flushed invariant for the catalog horizon, mirroring on the xmin axis what last_saved_restart_lsn does on the LSN axis. The slot.c functions in this doc read these fields but the logic that sets them lives in logical.c (LogicalConfirmReceivedLocation), deferred to the logical-decoding doc.

Both recompute functions (ReplicationSlotsComputeRequiredLSN and ReplicationSlotsComputeRequiredXmin) deliberately use a shared lock and tolerate a transiently regressing result: a concurrent advance might make the computed minimum momentarily older than reality, which is harmless (it only delays GC, never enables premature removal). This is the recurring safety asymmetry of the whole module: every race is resolved in the direction of retaining too much, never too little.
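The candidate-field staging described above can be sketched as a function that runs when the client confirms a flush position. This is a simplified illustration, not the logic in logical.c: `cand_slot` and `cand_confirm` are hypothetical, a zero LSN is assumed to mean "no candidate staged", and the flush-before-effective step is left out to keep the example focused on the staging conditions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;   /* hypothetical transaction-id type */
typedef uint64_t lsn_t;   /* hypothetical log position; 0 = none */

typedef struct
{
    xid_t catalog_xmin;
    xid_t candidate_catalog_xmin;
    lsn_t candidate_xmin_lsn;        /* 0 = no xmin candidate staged */
    lsn_t restart_lsn;
    lsn_t candidate_restart_lsn;
    lsn_t candidate_restart_valid;   /* 0 = no restart candidate staged */
} cand_slot;

/*
 * The client has confirmed everything up to `confirmed`. Promote each
 * staged candidate whose precondition is now met. Returns true when a
 * horizon moved, i.e. when the caller should recompute the global minima.
 */
bool
cand_confirm(cand_slot *s, lsn_t confirmed)
{
    bool moved = false;

    if (s->candidate_xmin_lsn != 0 && s->candidate_xmin_lsn <= confirmed)
    {
        s->catalog_xmin = s->candidate_catalog_xmin;
        s->candidate_catalog_xmin = 0;
        s->candidate_xmin_lsn = 0;
        moved = true;
    }
    if (s->candidate_restart_valid != 0 && s->candidate_restart_valid <= confirmed)
    {
        s->restart_lsn = s->candidate_restart_lsn;
        s->candidate_restart_lsn = 0;
        s->candidate_restart_valid = 0;
        moved = true;
    }
    return moved;
}
```

Returning whether anything moved lets the caller skip the array walk on the common path where a confirmation changes no horizon.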

The module’s lifecycle splits into five flows: allocation (create / acquire / release / drop), retention computation (the min/reserve functions), persistence (save / restore at checkpoint and startup), advance / drop (the SQL-facing operations in slotfuncs.c), and invalidation + failover (falling-behind detection and slot sync). They share the array and the two locks introduced above.

Allocation: create, acquire, release, drop

ReplicationSlotCreate (slot.c) is the entry point. It validates the name, then under ReplicationSlotAllocationLock (which serializes all slot creation/cleanup so two backends can’t grab the same free entry or fight over the same directory) it scans the array for a name collision and for a free slot, initializes the persistent and volatile fields, writes the slot to disk, and only then flips in_use under an exclusive ReplicationSlotControlLock:

// ReplicationSlotCreate — src/backend/replication/slot.c
LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
for (i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
    if (s->in_use && strcmp(name, NameStr(s->data.name)) == 0)
        ereport(ERROR, (errcode(ERRCODE_DUPLICATE_OBJECT),
                        errmsg("replication slot \"%s\" already exists", name)));
    if (!s->in_use && slot == NULL)
        slot = s; /* first free entry */
}
LWLockRelease(ReplicationSlotControlLock);
if (slot == NULL)
    ereport(ERROR, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                    errmsg("all replication slots are in use"), ...));
/* ... memset(&slot->data, 0, ...); set name/database/persistency ... */
CreateSlotOnDisk(slot); /* materialize before marking in_use */
LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
slot->in_use = true; /* now visible to scanners */

slot->data.database = db_specific ? MyDatabaseId : InvalidOid is the line that decides physical-vs-logical at birth. The newly created slot is also marked active (active_pid = MyProcPid, MyReplicationSlot = slot) so the creating backend owns it immediately — creation and acquisition are fused.

ReplicationSlotAcquire is the re-attach path (used by pg_replication_slot_advance, a starting walsender, etc.). It searches by name, then either claims the slot (active_pid = MyProcPid) or, if another PID holds it, sleeps on the slot’s active_cv condition variable until released — unless nowait was requested, in which case it errors with ERRCODE_OBJECT_IN_USE. Crucially, it checks invalidation after taking ownership, to close a race with the checkpointer:

// ReplicationSlotAcquire — src/backend/replication/slot.c
MyReplicationSlot = s; /* the slot is ours now */
/* check invalidation AFTER acquiring to avoid a race with the checkpointer */
if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
    ereport(ERROR,
            errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
            errmsg("can no longer access replication slot \"%s\"",
                   NameStr(s->data.name)),
            errdetail("This replication slot has been invalidated due to \"%s\".",
                      GetSlotInvalidationCauseName(s->data.invalidated)));

ReplicationSlotRelease is the inverse. Its branch on persistency is the whole point of the three-valued enum: an RS_EPHEMERAL slot (a not-quite-finished persistent slot) is dropped on release; an RS_PERSISTENT slot is merely marked inactive (active_pid = 0) and survives; a temporary slot is handled likewise but disappears at session end. Releasing a slot that had temporarily constrained xmin for an initial catalog snapshot clears that constraint and recomputes the xmin horizon.

ReplicationSlotDropPtr performs the durable removal: it renames the slot directory to <name>.tmp (the atomic “this is no longer a valid slot” step), fsyncs the parent, clears in_use under the control lock, and then recomputes both retention minima — because a dropped slot no longer pins anything:

// ReplicationSlotDropPtr — src/backend/replication/slot.c
if (rename(path, tmppath) == 0) { /* atomic invalidation on disk */
    START_CRIT_SECTION();
    fsync_fname(tmppath, true);
    fsync_fname(PG_REPLSLOT_DIR, true);
    END_CRIT_SECTION();
}
LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
slot->active_pid = 0;
slot->in_use = false; /* entry returns to the free pool */
LWLockRelease(ReplicationSlotControlLock);
/* Slot is dead and doesn't prevent resource removal anymore, recompute limits. */
ReplicationSlotsComputeRequiredXmin(false);
ReplicationSlotsComputeRequiredLSN();

Reserving WAL and computing the retention minima

ReplicationSlotReserveWal sets a brand-new slot’s restart_lsn. The choice of anchor differs by slot type and recovery state: a physical slot anchors at the last checkpoint redo pointer (where a base backup would start replay); a logical slot created during recovery anchors at the current replay pointer; and a logical slot on a primary anchors at the current insert pointer and logs a standby snapshot so decoding has a consistent starting point:

// ReplicationSlotReserveWal — src/backend/replication/slot.c
LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE); /* serialize vs checkpoint */
if (SlotIsPhysical(slot))
    restart_lsn = GetRedoRecPtr();
else if (RecoveryInProgress())
    restart_lsn = GetXLogReplayRecPtr(NULL);
else
    restart_lsn = GetXLogInsertRecPtr();
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = restart_lsn;
SpinLockRelease(&slot->mutex);
ReplicationSlotsComputeRequiredLSN(); /* prevent WAL removal ASAP */
XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
if (XLogGetLastRemovedSegno() >= segno) /* lost the race: segment already gone */
    elog(ERROR, "WAL required by replication slot %s has been removed concurrently", ...);

The exclusive ReplicationSlotAllocationLock here is the lock that interlocks with the checkpointer (which holds it shared while flushing slots, see below): either the reservation happens first and the checkpoint must see the new restart_lsn, or the checkpoint happens first and the reservation lands at/after that checkpoint’s redo pointer. The post-hoc XLogGetLastRemovedSegno check catches the residual window.
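The publish-then-verify shape of the reservation can be shown in isolation. The sketch below is single-threaded, so it only demonstrates the order of the two steps, not the locking; `toy_recycler`, `segment_of`, and `toy_reserve_wal` are hypothetical names, and segment numbers are assumed to start at 1 so that 0 can mean "nothing removed yet".

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;     /* hypothetical log position */
typedef uint64_t segno_t;   /* hypothetical segment number */

/* Hypothetical recycler state shared between slots and the checkpointer. */
typedef struct
{
    lsn_t   min_required;         /* 0 = no slot pins anything */
    segno_t last_removed_segno;   /* 0 = nothing removed yet */
} toy_recycler;

/* Segment that contains a byte position, for fixed-size segments. */
static segno_t
segment_of(lsn_t lsn, uint64_t seg_size)
{
    return lsn / seg_size;
}

/*
 * Reserve log starting at `anchor`. Returns false when the segment holding
 * the anchor had already been removed, in which case the caller must treat
 * the reservation as failed.
 */
bool
toy_reserve_wal(toy_recycler *r, lsn_t anchor, uint64_t seg_size)
{
    /* 1. Publish the pin first so later recycling passes respect it. */
    if (r->min_required == 0 || anchor < r->min_required)
        r->min_required = anchor;

    /* 2. Then verify nothing was removed before the pin became visible. */
    return r->last_removed_segno < segment_of(anchor, seg_size);
}
```

Publishing before checking means a recycling pass that starts after step 1 already sees the pin, so the verification only has to catch removals that completed earlier.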

ReplicationSlotsComputeRequiredLSN is the WAL low-water-mark aggregator. Note the persistent-slot subtlety: it uses last_saved_restart_lsn, not the live restart_lsn, because the segments between the two might still be needed if the primary crashes before the newer restart_lsn is flushed:

// ReplicationSlotsComputeRequiredLSN — src/backend/replication/slot.c
for (i = 0; i < max_replication_slots; i++) {
    /* ... read restart_lsn, last_saved_restart_lsn, persistency, invalidated under mutex ... */
    if (invalidated) continue; /* invalidated slots pin nothing */
    if (persistency == RS_PERSISTENT) {
        if (last_saved_restart_lsn != InvalidXLogRecPtr &&
            restart_lsn > last_saved_restart_lsn)
            restart_lsn = last_saved_restart_lsn; /* be conservative across crash */
    }
    if (restart_lsn != InvalidXLogRecPtr &&
        (min_required == InvalidXLogRecPtr || restart_lsn < min_required))
        min_required = restart_lsn;
}
XLogSetReplicationSlotMinimumLSN(min_required);

ReplicationSlotsComputeRequiredXmin is the catalog-horizon analogue. It aggregates the minimum effective_xmin and effective_catalog_xmin (the flushed-to-disk values, per the effective-vs-persistent rule) and pushes them into the ProcArray, which is where vacuum reads the global oldest-xmin:

// ReplicationSlotsComputeRequiredXmin — src/backend/replication/slot.c
for (i = 0; i < max_replication_slots; i++) {
    /* ... read effective_xmin, effective_catalog_xmin, invalidated under mutex ... */
    if (invalidated) continue;
    if (TransactionIdIsValid(effective_xmin) &&
        (!TransactionIdIsValid(agg_xmin) ||
         TransactionIdPrecedes(effective_xmin, agg_xmin)))
        agg_xmin = effective_xmin;
    if (TransactionIdIsValid(effective_catalog_xmin) &&
        (!TransactionIdIsValid(agg_catalog_xmin) ||
         TransactionIdPrecedes(effective_catalog_xmin, agg_catalog_xmin)))
        agg_catalog_xmin = effective_catalog_xmin;
}
ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);

ReplicationSlotsComputeLogicalRestartLSN is a logical-only variant (skips physical slots) used by callers that need the oldest WAL required specifically for decoding; it is computed on demand rather than cached.
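The xmin aggregation uses TransactionIdPrecedes rather than `<` because transaction ids are 32-bit counters that wrap around. The sketch below shows the modulo-2^32 comparison idea with hypothetical names (`xid_precedes`, `oldest_xid`); it is modeled on how PostgreSQL compares normal transaction ids, with ids below 3 treated as special values that compare as plain integers, but it is an illustration rather than the project's source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;
#define INVALID_XID       ((xid_t) 0)
#define FIRST_NORMAL_XID  ((xid_t) 3)   /* ids below this are special values */

/* "a is older than b" on a circular 32-bit id space. */
bool
xid_precedes(xid_t a, xid_t b)
{
    int32_t diff;

    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a < b;               /* special ids compare as plain integers */
    diff = (int32_t) (a - b);       /* signed distance modulo 2^32 */
    return diff < 0;
}

/* Oldest valid id in an array, or INVALID_XID when none is valid. */
xid_t
oldest_xid(const xid_t *xids, size_t n)
{
    xid_t agg = INVALID_XID;

    for (size_t i = 0; i < n; i++)
    {
        if (xids[i] == INVALID_XID)
            continue;               /* this slot pins no xmin */
        if (agg == INVALID_XID || xid_precedes(xids[i], agg))
            agg = xids[i];
    }
    return agg;
}
```

An id just below 2^32 therefore counts as older than a small id issued after the counter wrapped, which a plain `<` comparison would get backwards.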

Persistence: dirty-flush at checkpoint, restore before redo

A slot is dirtied (ReplicationSlotMarkDirty) whenever its persistent fields change, and explicitly flushed (ReplicationSlotSave, which calls SaveSlotToPath) at slot creation and persist time. The bulk of flushing, though, happens lazily at every checkpoint via CheckPointReplicationSlots, which walks all in-use slots and calls SaveSlotToPath for each under a shared ReplicationSlotAllocationLock (strong enough to freeze in_use, weak enough to let acquisition proceed):

// CheckPointReplicationSlots — src/backend/replication/slot.c
LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
for (i = 0; i < max_replication_slots; i++) {
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
    if (!s->in_use) continue;
    sprintf(path, "%s/%s", PG_REPLSLOT_DIR, NameStr(s->data.name));
    if (is_shutdown && SlotIsLogical(s)) { /* force-flush confirmed_flush at shutdown */
        SpinLockAcquire(&s->mutex);
        if (s->data.invalidated == RS_INVAL_NONE &&
            s->data.confirmed_flush > s->last_saved_confirmed_flush) {
            s->just_dirtied = true;
            s->dirty = true;
        }
        SpinLockRelease(&s->mutex);
    }
    if (s->last_saved_restart_lsn != s->data.restart_lsn)
        last_saved_restart_lsn_updated = true;
    SaveSlotToPath(s, path, LOG);
}
LWLockRelease(ReplicationSlotAllocationLock);
if (last_saved_restart_lsn_updated)
    ReplicationSlotsComputeRequiredLSN(); /* WAL can now be recycled further */

SaveSlotToPath is the atomic-write workhorse: skip if !dirty, copy slot->data under the spinlock into a checksummed ReplicationSlotOnDisk record, write it to state.tmp, fsync, then rename to state and fsync the directory. The write-tmp-then-rename idiom guarantees that a crash mid-write leaves the previous good state file intact. Its tail is where the two last_saved_* fields are updated and the dirty bit is cleared — but only if nobody re-dirtied the slot during the I/O, which is exactly why just_dirtied exists as a separate flag from dirty:

// SaveSlotToPath — src/backend/replication/slot.c (tail)
if (rename(tmppath, path) != 0) { /* ... unlink, release io lock, ereport ... */ return; }
START_CRIT_SECTION();
fsync_fname(path, false);
fsync_fname(dir, true);
fsync_fname(PG_REPLSLOT_DIR, true);
END_CRIT_SECTION();
/* Successfully wrote; unset dirty unless somebody dirtied again already, */
/* and remember the flushed confirmed_flush / restart_lsn. */
SpinLockAcquire(&slot->mutex);
if (!slot->just_dirtied)
    slot->dirty = false;
slot->last_saved_confirmed_flush = cp.slotdata.confirmed_flush;
slot->last_saved_restart_lsn = cp.slotdata.restart_lsn;
SpinLockRelease(&slot->mutex);
LWLockRelease(&slot->io_in_progress_lock);

Setting last_saved_restart_lsn = cp.slotdata.restart_lsn here is the moment the conservative “use the saved, not live, restart_lsn” rule in ReplicationSlotsComputeRequiredLSN is allowed to relax: only after a successful flush does the saved value catch up to the live one, which is why the checkpoint loop recomputes the required LSN whenever it observes last_saved_restart_lsn != data.restart_lsn. RestoreSlotFromDisk is the reverse, called by StartupReplicationSlots before crash recovery so that slots pin WAL before redo can examine what to recycle:

// StartupReplicationSlots — src/backend/replication/slot.c
replication_dir = AllocateDir(PG_REPLSLOT_DIR);
while ((replication_de = ReadDir(replication_dir, PG_REPLSLOT_DIR)) != NULL) {
    /* skip ".", ".."; rmtree leftover "*.tmp" dirs from a crash mid-create/drop */
    if (pg_str_endswith(replication_de->d_name, ".tmp")) { rmtree(path, true); ... continue; }
    RestoreSlotFromDisk(replication_de->d_name); /* validate magic/version/CRC, fill array */
}
FreeDir(replication_dir);
if (max_replication_slots <= 0) return;
ReplicationSlotsComputeRequiredXmin(false); /* republish horizons from restored slots */
ReplicationSlotsComputeRequiredLSN();

RestoreSlotFromDisk PANICs on a magic/version/length/CRC mismatch (a corrupt slot file is unrecoverable and must stop startup), and on restore it seeds the volatile effective_* and last_saved_* fields from the persisted values — re-establishing the “effective never ahead of on-disk” invariant from a clean base.
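The write-tmp-then-rename idiom that SaveSlotToPath relies on can be sketched with plain POSIX calls. This is a minimal illustration, not PostgreSQL's implementation: `atomic_write_state` and `fsync_dir` are hypothetical helpers, there is no checksum or I/O lock, and error handling is reduced to returning -1.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* fsync a directory so that a rename performed inside it becomes durable. */
static int
fsync_dir(const char *dir)
{
    int fd = open(dir, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Atomically replace <dir>/state with `len` bytes from `buf`.
 * A crash at any point leaves either the old file or the new one,
 * never a partially written mix. Returns 0 on success, -1 on failure.
 */
int
atomic_write_state(const char *dir, const void *buf, size_t len)
{
    char    tmppath[1024];
    char    path[1024];
    int     fd;
    ssize_t written;

    snprintf(tmppath, sizeof tmppath, "%s/state.tmp", dir);
    snprintf(path, sizeof path, "%s/state", dir);

    fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    /* Write and flush the new contents under the temporary name. */
    written = write(fd, buf, len);
    if (written < 0 || (size_t) written != len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* rename() replaces the old file in a single atomic step. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* Persist the directory entry change itself. */
    return fsync_dir(dir);
}
```

The temporary file is flushed before the rename so that the name `state` never points at data that is not yet on disk; the final directory fsync covers the rename.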

slotfuncs.c is the SQL-callable layer. pg_replication_slot_advance clamps the target LSN to what has actually been flushed/replayed, acquires the slot, rejects backward moves, and dispatches to the physical or logical advancer:

// pg_replication_slot_advance — src/backend/replication/slotfuncs.c
if (!RecoveryInProgress())
    moveto = Min(moveto, GetFlushRecPtr(NULL)); /* can't advance past durable WAL */
else
    moveto = Min(moveto, GetXLogReplayRecPtr(NULL));
ReplicationSlotAcquire(NameStr(*slotname), true, true);
if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
    ereport(ERROR, ...); /* never reserved WAL */
minlsn = OidIsValid(MyReplicationSlot->data.database)
    ? MyReplicationSlot->data.confirmed_flush /* logical: consumed point */
    : MyReplicationSlot->data.restart_lsn; /* physical: restart point */
if (moveto < minlsn) ereport(ERROR, ...); /* monotonic: forward only */
endlsn = OidIsValid(MyReplicationSlot->data.database)
    ? pg_logical_replication_slot_advance(moveto)
    : pg_physical_replication_slot_advance(moveto);
ReplicationSlotsComputeRequiredXmin(false); /* horizons may have moved */
ReplicationSlotsComputeRequiredLSN();
ReplicationSlotRelease();

The physical advancer is trivial — it just moves restart_lsn forward, dirties the slot, and wakes any logical failover walsenders that key off this physical slot’s position:

// pg_physical_replication_slot_advance — src/backend/replication/slotfuncs.c
if (startlsn < moveto) {
    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.restart_lsn = moveto;
    SpinLockRelease(&MyReplicationSlot->mutex);
    ReplicationSlotMarkDirty(); /* persisted at next checkpoint */
    PhysicalWakeupLogicalWalSnd();
}

The logical advancer defers to LogicalSlotAdvanceAndCheckSnapState (in logical.c, deferred to the logical-decoding doc) because advancing a logical slot means replaying decoding far enough to move confirmed_flush and the catalog horizon safely. create_physical_replication_slot and create_logical_replication_slot are the create wrappers; pg_drop_replication_slot calls ReplicationSlotDrop.

pg_get_replication_slots is the read path behind the pg_replication_slots system view. It scans the array under a shared control lock and projects each in-use slot’s restart_lsn, catalog_xmin, confirmed_flush (exposed by the view as confirmed_flush_lsn), wal_status, invalidation_reason, inactive_since, and the synced / failover flags. These are the columns operators use to spot a lagging or invalidated slot before it fills pg_wal. It is the observability half of the safety story discussed in the Beyond section: the slot mechanism is only production-safe because its state is inspectable.

ReplicationSlotsDropDBSlots handles the DROP DATABASE interaction. Because a logical slot is database-scoped, dropping a database must first drop every logical slot belonging to it; this function iterates the array, skips physical and non-matching slots, and reuses the acquire-then-ReplicationSlotDropAcquired path for each match:

// ReplicationSlotsDropDBSlots — src/backend/replication/slot.c
for (i = 0; i < max_replication_slots; i++) {
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
    if (!s->in_use) continue;
    if (!SlotIsLogical(s)) continue; /* only logical slots are db-specific */
    if (s->data.database != dboid) continue; /* not our database */
    /* NB: intentionally including invalidated slots */
    SpinLockAcquire(&s->mutex);
    active_pid = s->active_pid;
    if (active_pid == 0) { /* claim it so we can drop it */
        MyReplicationSlot = s;
        s->active_pid = MyProcPid;
    }
    SpinLockRelease(&s->mutex);
    /* ... if still active in another backend, bail out (rare); else DropAcquired ... */
}

The companion ReplicationSlotsCountDBSlots is the pre-check dropdb calls to decide whether to error or proceed; it counts how many slots (and how many active ones) belong to the target database.
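
The filter shared by the drop loop and its counting companion (in use, logical, same database) is easy to model in isolation. This is a minimal sketch under stated assumptions: `DbSlot`, `DbOid`, `NO_DATABASE`, and `count_db_slots` are hypothetical names, and the real function reads these fields under locks that the sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DbOid;             /* hypothetical stand-in for Oid */
#define NO_DATABASE ((DbOid) 0)     /* physical slots are not tied to a database */

/* Hypothetical, heavily reduced view of a slot array entry. */
typedef struct {
    bool  in_use;
    DbOid database;
    int   active_pid;               /* 0 when no backend holds the slot */
} DbSlot;

/*
 * Count the logical slots that belong to dboid, and how many of them
 * are currently held by a backend. Returns true if any slot matched.
 */
static bool
count_db_slots(const DbSlot *slots, size_t n, DbOid dboid,
               int *nslots, int *nactive)
{
    *nslots = 0;
    *nactive = 0;

    for (size_t i = 0; i < n; i++) {
        const DbSlot *s = &slots[i];

        if (!s->in_use)
            continue;                       /* free array entry */
        if (s->database == NO_DATABASE)
            continue;                       /* physical slot: not db-specific */
        if (s->database != dboid)
            continue;                       /* some other database */

        (*nslots)++;
        if (s->active_pid != 0)
            (*nactive)++;
    }
    return *nslots > 0;
}
```

A caller in the position of dropdb could refuse to proceed whenever `nactive` is non-zero, since an active slot cannot be claimed and dropped by someone else.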

Invalidation and failover slots (slot sync)

A slot is invalidated, rather than left to silently corrupt the system, when it falls too far behind, when catalog rows it needs have been removed, when wal_level drops below what it requires, or when it sits idle past a timeout. InvalidateObsoleteReplicationSlots (driven by the checkpointer when, for example, max_slot_wal_keep_size is exceeded) iterates slots and calls InvalidatePossiblyObsoleteSlot, which under the slot mutex determines a cause via DetermineSlotInvalidationCause and, if no other PID holds the slot, marks it invalid in place:

// InvalidatePossiblyObsoleteSlot — src/backend/replication/slot.c
SpinLockAcquire(&s->mutex);
restart_lsn = s->data.restart_lsn;
if (s->data.invalidated == RS_INVAL_NONE) /* already-invalid slots are left alone */
    invalidation_cause = DetermineSlotInvalidationCause(possible_causes, s,
                                                        oldestLSN, dboid, snapshotConflictHorizon,
                                                        &inactive_since, now);
if (invalidation_cause == RS_INVAL_NONE) { SpinLockRelease(&s->mutex); ... break; }
active_pid = s->active_pid; /* 0 means nobody currently holds the slot */
if (active_pid == 0) {
    MyReplicationSlot = s;
    s->active_pid = MyProcPid;
    s->data.invalidated = invalidation_cause; /* RS_INVAL_WAL_REMOVED, _HORIZON, ... */
    if (invalidation_cause == RS_INVAL_WAL_REMOVED) {
        s->data.restart_lsn = InvalidXLogRecPtr; /* it pins nothing now */
        s->last_saved_restart_lsn = InvalidXLogRecPtr;
    }
    *invalidated = true;
}
SpinLockRelease(&s->mutex);

The four causes are RS_INVAL_WAL_REMOVED (required WAL exceeded the keep limit), RS_INVAL_HORIZON (required catalog rows were removed — happens on a standby whose primary vacuumed), RS_INVAL_WAL_LEVEL (wal_level dropped below what the slot needs), and RS_INVAL_IDLE_TIMEOUT (idle_replication_slot_timeout). Once invalidated, the slot is skipped by both retention aggregators, so it stops pinning resources — the deliberate tradeoff: lose the lagging consumer, protect the primary.
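
To make the "invalidate, then stop pinning" tradeoff concrete, here is a minimal sketch that models two of the four causes and a WAL-retention aggregator that skips invalidated slots. It is an illustration under stated assumptions, not the server's logic: every name (`Cause`, `InvSlot`, `determine_cause`, `required_lsn`) is hypothetical, timestamps are plain seconds, the exact comparison used for the idle timeout is assumed, and the horizon and wal_level causes are declared but not evaluated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit-flag causes, mirroring the power-of-two layout described above. */
typedef enum {
    CAUSE_NONE         = 0,
    CAUSE_WAL_REMOVED  = 1 << 0,
    CAUSE_HORIZON      = 1 << 1,    /* not evaluated in this sketch */
    CAUSE_WAL_LEVEL    = 1 << 2,    /* not evaluated in this sketch */
    CAUSE_IDLE_TIMEOUT = 1 << 3
} Cause;

/* Hypothetical, heavily reduced view of a slot. */
typedef struct {
    uint64_t restart_lsn;           /* 0 = no WAL reserved */
    int64_t  inactive_since;        /* seconds; 0 = currently active */
    Cause    invalidated;
} InvSlot;

/* Return the first cause in `possible` that applies to the slot, else CAUSE_NONE. */
static Cause
determine_cause(unsigned possible, const InvSlot *s, uint64_t oldest_lsn,
                int64_t now, int64_t idle_timeout)
{
    if ((possible & CAUSE_WAL_REMOVED) &&
        s->restart_lsn != 0 && s->restart_lsn < oldest_lsn)
        return CAUSE_WAL_REMOVED;   /* the WAL it needs is already gone */

    if ((possible & CAUSE_IDLE_TIMEOUT) && idle_timeout > 0 &&
        s->inactive_since > 0 && now - s->inactive_since >= idle_timeout)
        return CAUSE_IDLE_TIMEOUT;  /* idle longer than the configured limit */

    return CAUSE_NONE;
}

/* Oldest restart_lsn any valid slot still needs; 0 when nothing is pinned. */
static uint64_t
required_lsn(const InvSlot *slots, size_t n)
{
    uint64_t min = 0;

    for (size_t i = 0; i < n; i++) {
        if (slots[i].invalidated != CAUSE_NONE)
            continue;               /* invalidated slots pin nothing */
        if (slots[i].restart_lsn == 0)
            continue;
        if (min == 0 || slots[i].restart_lsn < min)
            min = slots[i].restart_lsn;
    }
    return min;
}
```

With two slots at positions 100 and 500, the aggregator reports 100; once the first slot is marked invalid it reports 500, which is the moment the recycler is free to reclaim the older WAL.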

Failover slots / slot sync. A logical slot created with failover = true is a candidate for synchronization to physical standbys, so that after a failover the promoted standby already has the logical slot at the right position and the logical consumer can reconnect without losing data. pg_sync_replication_slots runs on the standby, connects to the primary, and copies the primary’s failover slots into local slots:

// pg_sync_replication_slots — src/backend/replication/slotfuncs.c
if (!RecoveryInProgress())
    ereport(ERROR, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
            errmsg("replication slots can only be synchronized to a standby server"));
ValidateSlotSyncParams(ERROR);
load_file("libpqwalreceiver", false);
wrconn = walrcv_connect(PrimaryConnInfo, false, false, false, app_name.data, &err);
SyncReplicationSlots(wrconn); /* the actual copy loop lives in slotsync.c */
walrcv_disconnect(wrconn);

Synced slots carry data.synced = true and are not directly consumable on the standby (they exist only to be promoted into real slots at failover). The ReplicationSlotCreate guard rails enforce the matching invariants: failover cannot be enabled on a standby-created slot (no cascading sync) or on a temporary slot (temporaries aren’t synced) — except when the slot-sync worker itself is the creator. The detailed sync worker loop (slotsync.c) is deferred to postgres-wal-sender-receiver.md and postgres-logical-decoding.md.
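
The guard rails described above reduce to a small decision function. The sketch below is an assumption-laden model rather than the code in ReplicationSlotCreate: `FailoverCheck` and `check_failover_request` are hypothetical names, and the four boolean inputs stand in for state the server would derive itself.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    FAILOVER_OK,
    FAILOVER_ON_STANDBY,            /* no cascading sync from a standby */
    FAILOVER_ON_TEMPORARY           /* temporary slots are not synced */
} FailoverCheck;

/*
 * Decide whether a slot-creation request may carry failover = true.
 * The slot-sync worker is exempt from both restrictions because it
 * must reproduce the remote slot's properties locally.
 */
static FailoverCheck
check_failover_request(bool failover, bool in_recovery,
                       bool temporary, bool is_sync_worker)
{
    if (!failover || is_sync_worker)
        return FAILOVER_OK;
    if (in_recovery)
        return FAILOVER_ON_STANDBY;
    if (temporary)
        return FAILOVER_ON_TEMPORARY;
    return FAILOVER_OK;
}
```

Keeping the exemption in one place, at the top of the function, makes it hard to add a new restriction that accidentally applies to the sync worker.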

flowchart LR
  C["pg_create_*_replication_slot"] --> CR["ReplicationSlotCreate<br/>find free entry, set in_use"]
  CR --> RW["ReplicationSlotReserveWal<br/>set restart_lsn"]
  RW --> CL["ComputeRequiredLSN / Xmin<br/>publish horizons"]
  ADV["pg_replication_slot_advance"] --> AQ["ReplicationSlotAcquire<br/>own the slot"]
  AQ --> MV["move restart_lsn / confirmed_flush forward<br/>MarkDirty"]
  MV --> CL
  CKPT["CheckPointReplicationSlots"] --> SV["SaveSlotToPath<br/>tmp+fsync+rename"]
  SV --> CL
  INV["InvalidateObsoleteReplicationSlots<br/>max_slot_wal_keep_size exceeded"] --> IPO["InvalidatePossiblyObsoleteSlot<br/>set data.invalidated"]
  IPO --> CL
  CL --> XLOGR["xlog recycler skips needed WAL"]
  CL --> VAC["vacuum honors catalog_xmin"]

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| ReplicationSlotPersistentData (struct) | src/include/replication/slot.h | 70 |
| ReplicationSlot (struct) | src/include/replication/slot.h | 155 |
| SlotIsPhysical / SlotIsLogical (macros) | src/include/replication/slot.h | 228 |
| ReplicationSlotCtlData (struct) | src/include/replication/slot.h | 234 |
| ReplicationSlotsShmemSize | src/backend/replication/slot.c | 186 |
| ReplicationSlotsShmemInit | src/backend/replication/slot.c | 204 |
| ReplicationSlotCreate | src/backend/replication/slot.c | 353 |
| SearchNamedReplicationSlot | src/backend/replication/slot.c | 509 |
| ReplicationSlotAcquire | src/backend/replication/slot.c | 589 |
| ReplicationSlotRelease | src/backend/replication/slot.c | 716 |
| ReplicationSlotDrop | src/backend/replication/slot.c | 846 |
| ReplicationSlotDropPtr | src/backend/replication/slot.c | 978 |
| ReplicationSlotPersist | src/backend/replication/slot.c | 1120 |
| ReplicationSlotsComputeRequiredXmin | src/backend/replication/slot.c | 1145 |
| ReplicationSlotsComputeRequiredLSN | src/backend/replication/slot.c | 1227 |
| ReplicationSlotsComputeLogicalRestartLSN | src/backend/replication/slot.c | 1297 |
| ReplicationSlotsCountDBSlots | src/backend/replication/slot.c | 1376 |
| ReplicationSlotsDropDBSlots | src/backend/replication/slot.c | 1434 |
| ReplicationSlotReserveWal | src/backend/replication/slot.c | 1565 |
| InvalidatePossiblyObsoleteSlot | src/backend/replication/slot.c | 1833 |
| InvalidateObsoleteReplicationSlots | src/backend/replication/slot.c | 2061 |
| CheckPointReplicationSlots | src/backend/replication/slot.c | 2121 |
| StartupReplicationSlots | src/backend/replication/slot.c | 2199 |
| CreateSlotOnDisk | src/backend/replication/slot.c | 2260 |
| SaveSlotToPath | src/backend/replication/slot.c | 2321 |
| RestoreSlotFromDisk | src/backend/replication/slot.c | 2484 |
| pg_physical_replication_slot_advance | src/backend/replication/slotfuncs.c | 465 |
| pg_logical_replication_slot_advance | src/backend/replication/slotfuncs.c | 501 |
| pg_replication_slot_advance | src/backend/replication/slotfuncs.c | 510 |
| pg_create_physical_replication_slot | src/backend/replication/slotfuncs.c | 65 |
| pg_create_logical_replication_slot | src/backend/replication/slotfuncs.c | 169 |
| pg_drop_replication_slot | src/backend/replication/slotfuncs.c | 218 |
| pg_get_replication_slots | src/backend/replication/slotfuncs.c | 236 |
| pg_sync_replication_slots | src/backend/replication/slotfuncs.c | 895 |

Verified against /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94 (PG 18.x). Method: read slot.c and slotfuncs.c in full for the quoted regions, cross-checked struct/enum/macro definitions in src/include/replication/slot.h, and confirmed every quoted symbol exists at the stated line.

  • Struct shape confirmed. ReplicationSlotPersistentData carries restart_lsn, xmin, catalog_xmin, confirmed_flush, plus the PG-recent fields two_phase, two_phase_at, synced, and failover. The volatile ReplicationSlot carries effective_xmin / effective_catalog_xmin, last_saved_restart_lsn, last_saved_confirmed_flush, and inactive_since — all present on REL_18, none are pre-18 names.
  • Invalidation causes confirmed. ReplicationSlotInvalidationCause has exactly four power-of-two causes plus RS_INVAL_NONE, with RS_INVAL_MAX_CAUSES == 4; the StaticAssertDecl at slot.c:124 enforces the SlotInvalidationCauses table length. RS_INVAL_IDLE_TIMEOUT is the PG-18-era addition and is present.
  • Locking model confirmed. ReplicationSlotControlLock (shared for scans, exclusive for in_use flips), ReplicationSlotAllocationLock (exclusive for create/reserve, shared for the checkpoint flush loop), and per-slot mutex spinlocks match the header’s documented two-tier model.
  • Effective-vs-persistent invariant confirmed. The header comment on effective_xmin and the use of effective_* (not data.*) inside ReplicationSlotsComputeRequiredXmin, and of last_saved_restart_lsn (not live restart_lsn) inside ReplicationSlotsComputeRequiredLSN for persistent slots, jointly confirm the “on-disk horizon never ahead of in-memory” safety property.
  • No fabricated symbols. Every function quoted (ReplicationSlotCreate, ReplicationSlotReserveWal, CheckPointReplicationSlots, StartupReplicationSlots, InvalidatePossiblyObsoleteSlot, pg_replication_slot_advance, pg_sync_replication_slots, etc.) was located by direct grep of the two source files. SyncReplicationSlots, LogicalSlotAdvanceAndCheckSnapState, and PhysicalWakeupLogicalWalSnd are referenced as call targets but defined in slotsync.c / logical.c / walsender.c (out of scope here; cross-referenced).
  • Scope boundary. This doc does not assert the contents of slotsync.c’s worker loop, the reorder-buffer/snapbuild internals, or the walsender streaming protocol; those are deferred to the cross-referenced docs.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

PostgreSQL’s slot is a specific point in a broad design space. Comparing it to neighboring systems sharpens what is essential versus incidental.

MySQL binlog + GTIDs. MySQL does not register per-consumer markers on the producer; the binary log is purged by a time-based policy (binlog_expire_logs_seconds; max_binlog_size only sets when a file rotates) and a replica tracks its own position (file+offset, or a GTID set) on the replica side. The producer is stateless about consumers. This is the “fixed retention window” design: simpler on the primary (no shared array, no min computation, no slot files to fsync), but a replica that disconnects longer than the window must be re-cloned. There is no analogue of catalog_xmin, because MySQL’s binlog in ROW format embeds enough column metadata (and the replica has the schema) that historical catalog pinning is not required. PostgreSQL deliberately pays the slot-bookkeeping cost to get the never-lose-a-consumer guarantee, and pays a second cost, catalog_xmin, specifically because its logical decoding reconstructs changes against historical versions of the server’s own catalog rather than embedding schema in the log.

Oracle GoldenGate / LogMiner. Oracle’s redo/archive logs are retained by RMAN retention policy; CDC tools track a checkpoint (SCN) in their own metadata table. The “don’t vacuum what the miner needs” problem is handled by Oracle’s undo retention and explicit LogMiner dictionary snapshots rather than by a horizon fed into the GC computation. The architectural contrast is the same as MySQL: consumer position lives outside the engine’s recycler, so the engine’s GC is consumer-oblivious and the CDC tool owns the retention risk.

Kafka consumer offsets. Kafka is the cleanest “consumer-registered marker” analogue: each consumer group commits an offset, and log retention is by time/size with an optional compaction policy — but notably Kafka does not by default retain back to the slowest consumer; a consumer that lags past retention silently loses messages (auto.offset.reset). PostgreSQL’s slot is stricter: the primary will retain WAL unboundedly to protect a slot, which is why max_slot_wal_keep_size and slot invalidation exist as the escape valve. The PostgreSQL design thus sits between Kafka (consumer can be sacrificed silently) and a naïve unbounded slot (primary fills its disk): invalidation makes the sacrifice explicit and observable (pg_replication_slots.invalidation_reason).

The dead-slot disk-fill failure mode. The single most common slot-operational incident — a forgotten or crashed consumer pinning WAL until the primary’s pg_wal fills and the database halts — drove a multi-release arc of features: max_slot_wal_keep_size (bound the pin, invalidate on breach), idle_replication_slot_timeout (invalidate slots inactive too long), and the inactive_since field plus monitoring columns so operators can spot a lagging slot before it bites. The research/engineering lesson is that a consumer-registered retention marker is only safe in production if paired with a bounded override and observability — the marker alone is a foot-gun.

Failover slots as a distributed-systems problem. Synchronizing a logical slot to a standby (slot sync, PG 17+) is a small instance of a hard distributed problem: keeping a piece of consumer-progress state replicated and consistent across a leader change without losing or double-processing data. The PostgreSQL solution — copy the failover slot’s restart_lsn/catalog_xmin/confirmed_flush to the standby and only let the standby’s synced slot lag behind the primary’s (never ahead) — is the same “be conservative across the boundary” invariant that governs the effective-vs-persistent split, applied across nodes instead of across a crash. Active research frontiers in this space include reducing the coordination cost of synchronous slot sync (it adds a round-trip to the critical path) and extending logical replication to multi-active topologies where a single linear confirmed_flush no longer captures consumer progress.

Where the abstraction leaks. The catalog_xmin mechanism couples logical replication to vacuum in a way that surprises operators: a lagging logical slot can hold back catalog vacuum cluster-wide, causing catalog bloat far from the replicated tables. This is the price of reconstructing changes against historical versions of the server’s own catalog. Systems that embed schema in the log (MySQL ROW format, Debezium schema history) avoid this coupling at the cost of larger logs and a separate schema store; this is a genuine, unresolved tradeoff rather than a bug.

  • Source tree. /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94 (PG 18.x):
    • src/backend/replication/slot.c — slot lifecycle, retention computation, persistence, invalidation, standby-slot helpers.
    • src/backend/replication/slotfuncs.c — SQL-callable create / drop / advance / copy / sync functions and pg_get_replication_slots.
    • src/include/replication/slot.h: ReplicationSlot, ReplicationSlotPersistentData, ReplicationSlotCtlData, persistency and invalidation enums, SlotIsPhysical / SlotIsLogical.
  • Textbook theory.
    • Kleppmann, Designing Data-Intensive Applications (2017), ch. 5 (Replication — leader log retention) and ch. 11 (Stream Processing / change data capture — schema-aware change reconstruction). Captured in the KB bibliography (raw/system/textbooks/).
    • Petrov, Database Internals (2019), the WAL chapter (LSN-addressed log, segment recycling low-water marks, fsync-ordered durability). Captured at knowledge/research/dbms-general/database-internals.md.
  • Cross-references within this KB.
    • postgres-wal-sender-receiver.md — the walsender/walreceiver transport that consumes a slot and drives confirmed_flush; also the slot-sync worker.
    • postgres-logical-decoding.md — reorder buffer, snapshot builder, and LogicalSlotAdvanceAndCheckSnapState that move a logical slot’s horizons.
    • postgres-overview-replication-ha.md — how slots sit among the other replication-ha consumers of the WAL stream.
    • postgres-xlog-wal.md / postgres-checkpoint.md: the WAL recycler and the checkpoint that reads the minimum LSN published via XLogSetReplicationSlotMinimumLSN and flushes slots.
    • postgres-procarray.md / postgres-vacuum.md — the ProcArray oldest-xmin computation that honors ReplicationSlotsComputeRequiredXmin.