PostgreSQL Vacuum — Dead-Tuple Reclamation, Freezing, and XID Wraparound Prevention

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

MVCC (Multi-Version Concurrency Control) achieves read-without-block by keeping old row versions visible until no running transaction can see them any more (Database Internals, Petrov, ch. 5 §“MVCC Versions and Cleanup”; Database System Concepts, 7e, ch. 15 §“Snapshot Isolation”). The price is that delete and update operations do not reclaim space immediately — they stamp the old version as dead (set xmax) and leave the bytes on the page. Without a background reclamation process, a table that receives a steady stream of updates would grow without bound, and reads would scan ever-longer chains of dead versions to find the live ones. The process that does the reclaiming is conventionally called a garbage collector or vacuum.

The vacuum problem has two separable sub-problems.

Dead-tuple reclamation. A row version is dead when its deleting transaction (xmax) has committed and no snapshot in any active transaction can see the row any more. The oldest active snapshot sets the horizon: row versions deleted before that horizon are safe to remove. The implementation challenge is that “safe to remove” is a global predicate — the vacuum must know about every active transaction, not just the one on its own backend — and computing that global minimum must not itself block writers.

Transaction-ID (XID) wraparound prevention. PostgreSQL uses 32-bit XIDs. The wraparound window is 2^31 transactions (~2.1 billion) ahead of the current XID. A row inserted by a transaction that is more than 2^31 XIDs in the past looks newer than the current transaction, making it invisible. To prevent this, vacuum freezes old XIDs: it rewrites a tuple’s xmin/ xmax to the special value FrozenTransactionId (XID 2) or sets the HEAP_XMIN_FROZEN hint bit, after which the tuple is visible to all future transactions regardless of their XID. Without regular freezing, a database will corrupt data when XID space wraps. This is the hard real-time obligation that makes VACUUM non-optional even for read-only tables.

Both sub-problems interact with the visibility map (VM), a per-relation two-bit-per-page bitmap maintained by the heap access method. A page whose all-visible (AV) bit is set contains no dead tuples visible to any snapshot; vacuum can skip it for dead-tuple reclamation. A page whose all-frozen (AF) bit is set additionally contains no unfrozen XIDs; vacuum can skip it for freezing. The VM is the primary mechanism that makes repeated vacuums cheap on a stable table. See postgres-heap-am.md for the VM bit mechanics; this document covers how vacuum drives and updates the VM.

Common DBMS Design

Most MVCC engines need some form of background reclamation. The patterns PostgreSQL follows are not unique to it; they appear in recognizable form in InnoDB’s purge thread, Oracle’s undo segment reuse, and SQL Server’s ghost-record cleanup.

Horizon computation via a global snapshot of active XIDs

The reclaimer needs the oldest XID still visible to any running transaction. In PostgreSQL this is OldestXmin, derived from the procarray (the shared array of per-backend PGPROC entries holding each backend’s current XID and snapshot horizon). No tuple deleted by a transaction older than OldestXmin can be seen by any new snapshot. Computing OldestXmin is a snapshot of the procarray — a read of shared memory under a spin-lock acquisition, not a full transaction. Competing designs include a low-watermark counter maintained by each backend on every transaction start/commit (avoids the scan) or a global epoch mechanism (used by several NewSQL engines).

Three-phase scan → index-delete → heap-reap loop

Cleaning up dead tuples from indexes is more expensive than cleaning them from the heap, because each index has its own structure. The common pattern across nearly all MVCC storage engines is:

Scan heap pages. Find dead items; record their TIDs (tuple identifiers) in a buffer.
Bulk-delete from indexes. Walk every index once, deleting entries whose TID is in the buffer. This is the expensive index ambulkdelete pass.
Reap heap items. Return to the heap pages, mark the dead line pointers LP_UNUSED to free the slot for future inserts, and update the free-space map (FSM).

The buffer of dead TIDs is the coordination point between phases. When it fills before the heap scan completes, the engine must flush early (run phases 2–3 on the partial batch, then resume phase 1). This creates a trade-off between maintenance_work_mem (buffer size) and the number of index passes.

Cost-based throttling

Background reclamation competes with foreground I/O. Unconstrained vacuum on a large table saturates I/O and inflates query latency. The standard solution is cost-based rate limiting: count the number of buffer reads and writes the vacuum performs per unit time, and insert a sleep when the rate exceeds a threshold. This is a soft real-time knob — vacuum trades throughput for latency headroom, at the cost of taking longer to complete.

Theory ↔ PostgreSQL mapping

Theory concept	PostgreSQL entity
MVCC garbage collector	`vacuum()` / `heap_vacuum_rel()`
Global oldest-snapshot horizon	`OldestXmin` in `VacuumCutoffs`
XID freeze horizon	`FreezeLimit` in `VacuumCutoffs`
Dead-TID buffer	`TidStore *dead_items` in `LVRelState`
Visibility skip map	visibility map (VM) AV/AF bits
Phase 1 (heap scan)	`lazy_scan_heap()`
Phase 2 (index delete)	`lazy_vacuum_all_indexes()` / `ambulkdelete`
Phase 3 (heap reap)	`lazy_vacuum_heap_rel()`
Cost throttle sleep	`vacuum_delay_point()`
Wraparound protection	aggressive vacuum + `lazy_check_wraparound_failsafe()`
Parallel index vacuuming	`ParallelVacuumState` in `vacuumparallel.c`

PostgreSQL’s Approach

Two command variants: lazy VACUUM and VACUUM FULL

PostgreSQL distinguishes two fundamentally different operations sharing the VACUUM command name.

Lazy VACUUM (the default, also called concurrent vacuum) runs in-place: it reclaims dead tuple slots and marks them reusable without rewriting the page or holding exclusive table locks. Readers and writers continue normally during lazy VACUUM. This is what heap_vacuum_rel() in vacuumlazy.c implements, and it is the subject of this document.

VACUUM FULL rewrites the entire relation into a new heap file (via the CLUSTER code path), acquiring an exclusive AccessExclusiveLock on the table. It reclaims all free space but blocks all other access. It is seldom needed in practice — lazy VACUUM is sufficient for ongoing maintenance.

Entry point and parameter flow

The SQL command VACUUM enters via ExecVacuum in vacuum.c, which parses options into a VacuumParams struct and calls vacuum(). vacuum() expands the target relation list (or scans all vacuumable relations for a plain VACUUM), then calls vacuum_rel() per relation. For the heap AM, vacuum_rel() eventually calls table_relation_vacuum() which dispatches to heap_vacuum_rel().

// VacuumParams — src/include/commands/vacuum.h
typedef struct VacuumParams
{
    bits32      options;        /* VACOPT_* bitmask */
    int         freeze_min_age; /* min XID age before freeze, -1 = default */
    int         freeze_table_age;   /* whole-table scan age threshold */
    int         multixact_freeze_min_age;
    int         multixact_freeze_table_age;
    bool        is_wraparound;  /* force for-wraparound vacuum */
    int         log_min_duration;
    VacOptValue index_cleanup;  /* VACOPTVALUE_AUTO/ENABLED/DISABLED */
    VacOptValue truncate;
    Oid         toast_parent;
    double      max_eager_freeze_failure_rate;
    int         nworkers;       /* 0 = auto, -1 = no parallel */
} VacuumParams;

VacuumCutoffs: the immutable XID horizons

Before scanning a single page, vacuum_get_cutoffs() computes four XID horizons that remain constant for the entire VACUUM run:

// VacuumCutoffs — src/include/commands/vacuum.h
struct VacuumCutoffs
{
    TransactionId relfrozenxid;  /* current pg_class.relfrozenxid */
    MultiXactId   relminmxid;

    TransactionId OldestXmin;    /* tuples deleted before this are DEAD */
    MultiXactId   OldestMxact;

    TransactionId FreezeLimit;   /* XIDs older than this must be frozen */
    MultiXactId   MultiXactCutoff;
};

OldestXmin is computed by scanning the procarray for the oldest horizon visible to any running transaction. Tuples whose xmax committed before OldestXmin are dead to all current and future snapshots.

FreezeLimit is derived from vacuum_freeze_min_age (default 50 million): any XID older than current_xid - vacuum_freeze_min_age must be frozen. Freezing replaces the stored XID with FrozenTransactionId (XID 2) or sets the HEAP_XMIN_FROZEN hint bit, making the tuple permanently visible.

LVRelState: per-relation vacuum working state

heap_vacuum_rel() allocates one LVRelState on the heap that lives for the duration of the VACUUM on this relation. It is the single central state object passed through all sub-functions:

// LVRelState (abridged) — src/backend/access/heap/vacuumlazy.c
typedef struct LVRelState
{
    Relation        rel;
    Relation       *indrels;
    int             nindexes;

    BufferAccessStrategy bstrategy;
    ParallelVacuumState *pvs;       /* non-NULL if parallel */

    bool            aggressive;     /* must scan every unfrozen tuple? */
    bool            skipwithvm;     /* use VM to skip pages? */

    struct VacuumCutoffs cutoffs;
    GlobalVisState *vistest;

    TransactionId   NewRelfrozenXid; /* lowest unfrozen XID seen so far */
    MultiXactId     NewRelminMxid;

    TidStore       *dead_items;      /* TIDs of LP_DEAD line pointers */
    VacDeadItemsInfo *dead_items_info;

    BlockNumber     rel_pages;
    BlockNumber     scanned_pages;
    int64           tuples_deleted;
    int64           tuples_frozen;
    int64           lpdead_items;
    int64           live_tuples;
    /* ... more counters ... */
} LVRelState;

The aggressive flag controls whether the vacuum must visit every unfrozen tuple to advance relfrozenxid. An aggressive vacuum is triggered when the table’s relfrozenxid approaches vacuum_freeze_table_age (default 150 million transactions before current_xid). Without it, the table risks XID wraparound.

The three-phase loop in lazy_scan_heap

lazy_scan_heap() drives the entire work loop. It sets up a ReadStream for the relation (PG18’s async I/O infrastructure) and iterates over blocks. The block selection is guided by heap_vac_scan_next_block(), which consults the VM and LVRelState skip-logic to decide what to read next.

// lazy_scan_heap — src/backend/access/heap/vacuumlazy.c
static void
lazy_scan_heap(LVRelState *vacrel)
{
    ReadStream *stream;
    /* ... */
    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                        vacrel->bstrategy,
                                        vacrel->rel,
                                        MAIN_FORKNUM,
                                        heap_vac_scan_next_block,
                                        vacrel,
                                        sizeof(uint8));

    while (true)
    {
        /* Check TID store full → flush early via lazy_vacuum() */
        if (vacrel->dead_items_info->num_items > 0 &&
            TidStoreMemoryUsage(vacrel->dead_items) >
                vacrel->dead_items_info->max_bytes)
        {
            lazy_vacuum(vacrel);       /* phases 2+3 on current batch */
            FreeSpaceMapVacuumRange(vacrel->rel, ...);
        }

        buf = read_stream_next_buffer(stream, &per_buffer_data);
        if (!BufferIsValid(buf))
            break;

        /* Phase 1: prune + freeze this page */
        /* ... cleanup lock, lazy_scan_prune() or lazy_scan_noprune() ... */
    }
    /* Final phase 2+3 after heap scan completes */
    lazy_vacuum(vacrel);
}

Figure 1 — The three-phase vacuum loop

flowchart TD
    A[heap_vacuum_rel] --> B[vacuum_get_cutoffs]
    B --> C[dead_items_alloc]
    C --> D[lazy_scan_heap\nPhase I: heap scan]
    D --> E{TID store full?}
    E -- yes --> F[lazy_vacuum\nPhases II+III]
    F --> D
    E -- no --> G{more pages?}
    G -- yes --> H[lazy_scan_prune / lazy_scan_noprune]
    H --> G
    G -- no --> I[lazy_vacuum\nfinal flush]
    I --> J[lazy_truncate_heap?]
    J --> K[update pg_class stats]

Figure 1 — Outer loop of heap_vacuum_rel / lazy_scan_heap. Phase I (heap scan) loops over pages with optional early flushes when the dead-items TID store fills. Phases II+III (index + heap vacuum) run on each full TID batch and once more after the heap scan completes.

Phase I: lazy_scan_prune and page-level pruning

For each block that clears the visibility-map skip check, vacuum acquires a cleanup lock (LockBufferForCleanup) — exclusive but released as soon as page processing completes — and calls lazy_scan_prune():

// lazy_scan_prune — src/backend/access/heap/vacuumlazy.c
static int
lazy_scan_prune(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
                Page page, Buffer vmbuffer,
                bool all_visible_according_to_vm,
                bool *has_lpdead_items, bool *vm_page_frozen)
{
    PruneFreezeResult presult;
    int  prune_options = HEAP_PAGE_PRUNE_FREEZE;
    if (vacrel->nindexes == 0)
        prune_options |= HEAP_PAGE_PRUNE_MARK_UNUSED_NOW;

    heap_page_prune_and_freeze(rel, buf, vacrel->vistest,
                               prune_options, &vacrel->cutoffs,
                               &presult, PRUNE_VACUUM_SCAN,
                               &vacrel->offnum,
                               &vacrel->NewRelfrozenXid,
                               &vacrel->NewRelminMxid);

    if (presult.lpdead_items > 0)
        dead_items_add(vacrel, blkno,
                       presult.deadoffsets, presult.lpdead_items);
    /* update per-relation counters ... */
}

heap_page_prune_and_freeze() (in pruneheap.c, line 350) does two conceptually distinct things in one page pass under the cleanup lock:

Prune HOT chains. It walks all line pointers on the page, detects dead versions in HOT update chains (where t_ctid chains through multiple versions of the same logical row), collapses the chain, and marks dead line pointers LP_DEAD.
Freeze tuples. For every live tuple whose xmin or xmax XID is older than cutoffs.FreezeLimit (or whose MultiXact is older than cutoffs.MultiXactCutoff), it replaces the stored XID with the frozen sentinel or sets the HEAP_XMIN_FROZEN hint bit.

Dead line pointers are added to vacrel->dead_items (a TidStore, allocated from maintenance_work_mem) so the index passes can remove the corresponding index entries.

Page-skipping via the visibility map

heap_vac_scan_next_block() implements the skip logic. For a normal (non-aggressive) vacuum:

A page with the all-visible bit set and no forced scan can be skipped for dead-tuple collection.
A page with both all-visible and all-frozen bits set can also be skipped for freezing.

For an aggressive vacuum the all-frozen bit is still honoured (fully frozen pages need no freeze work), but the all-visible-only pages must still be visited to check for unfrozen XIDs that need freezing to advance relfrozenxid.

A small threshold SKIP_PAGES_THRESHOLD (32 pages) prevents vacuum from skipping a run of clean pages that is too short to justify the loss of kernel readahead.

Eager freeze scanning (PG18 addition)

PG18 introduced eager freeze scanning: a normal vacuum may proactively visit all-visible-but-not-yet-all-frozen pages to freeze them, amortizing the work that would otherwise all fall on the next aggressive vacuum.

heap_vacuum_eager_scan_setup() computes per-relation eager-scan parameters stored in LVRelState:

// heap_vacuum_eager_scan_setup — src/backend/access/heap/vacuumlazy.c
static void
heap_vacuum_eager_scan_setup(LVRelState *vacrel, VacuumParams *params)
{
    /* cap on successes: MAX_EAGER_FREEZE_SUCCESS_RATE (0.2) of
     * all-visible-but-not-all-frozen pages */
    vacrel->eager_scan_remaining_successes =
        (BlockNumber) (MAX_EAGER_FREEZE_SUCCESS_RATE *
                       (allvisible - allfrozen));

    /* failure tolerance per EAGER_SCAN_REGION_SIZE (4096) block region */
    vacrel->eager_scan_max_fails_per_region =
        params->max_eager_freeze_failure_rate * EAGER_SCAN_REGION_SIZE;
}

The success cap (MAX_EAGER_FREEZE_SUCCESS_RATE = 0.2, i.e. at most 20% of eligible pages per vacuum) prevents a single normal vacuum from doing an unlimited amount of extra freeze work. The per-region failure tolerance prevents wasting I/O in table regions where all-visible pages turn out to have no freezable XIDs yet (too recently written).

Phase II: index bulk-delete

lazy_vacuum_all_indexes() calls ambulkdelete() on every index, passing vac_tid_reaped() as the callback. The callback returns true for any TID present in vacrel->dead_items. Each index AM walks its own structure and deletes the matching entries in a single pass.

If ParallelVacuumIsActive(vacrel) (i.e., pvs != NULL), the indexes are distributed among worker processes via the ParallelVacuumState DSM segment (in vacuumparallel.c). Each worker calls lazy_vacuum_one_index() for its assigned index. The leader coordinates via shared-memory counters and ParallelVacuumKey DSM keys.

Index vacuuming bypass optimization. If lpdead_item_pages is below BYPASS_THRESHOLD_PAGES (2%) of rel_pages and the TID store fits in 32 MB, lazy_vacuum() bypasses the index and heap vacuum phases entirely — skipping an ambulkdelete pass for a negligible number of dead items avoids the cost discontinuity of a full index scan for a nearly-clean table. The threshold is expressed in pages with at least one LP_DEAD item, not in the raw dead-item count, precisely because the page count is the proxy for how many VM bits stay unset when vacuuming is skipped — and unset VM bits are what cost future scans. The guard insists the optimization only fire when num_index_scans == 0 (a single batch), because once a relation has been large enough to force an earlier index pass, the cheap-bypass assumption no longer holds:

// lazy_vacuum (bypass decision) — src/backend/access/heap/vacuumlazy.c
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold &&
          TidStoreMemoryUsage(vacrel->dead_items) < 32 * 1024 * 1024);
/* ... */
if (bypass)
    vacrel->do_index_vacuuming = false;   /* but still do index cleanup */
else if (lazy_vacuum_all_indexes(vacrel))
    lazy_vacuum_heap_rel(vacrel);          /* full phase II then phase III */
else
    Assert(VacuumFailsafeActive);          /* round aborted by failsafe */

Note the asymmetry: a bypass still runs index cleanup (amvacuumcleanup) later, because cleanup is cheap when no ambulkdelete ran and it is what lets the index AM update its own statistics. Only the expensive bulk-delete pass is skipped.

Dead-TID store sizing and the index-pass count. The size of vacrel->dead_items is bounded by dead_items_info->max_bytes, derived from maintenance_work_mem (or autovacuum_work_mem under autovacuum). When the store reaches that ceiling mid-scan, lazy_scan_heap() must flush early — run phases II + III on the partial batch — and then resume the heap scan. Each early flush is one additional full sweep of every index. A table large enough to overflow the store therefore pays (number of flushes + 1) index scans. This is the single most important reason to give vacuum generous maintenance_work_mem: it is not heap-scan memory, it is the buffer that keeps the index-pass count at one. The PG17 TidStore redesign (radix-tree backed, replacing the old flat ItemPointerData array) raised the effective dead-item density per byte dramatically, so the same memory budget now tolerates far more dead tuples before forcing a second index pass.

Phase III: heap LP_DEAD reap

After index entries are removed, lazy_vacuum_heap_rel() makes a second pass over only the heap pages that contributed LP_DEAD items (tracked in the TID store). It acquires a regular exclusive buffer lock (not a cleanup lock) and calls lazy_vacuum_heap_page(), which marks the LP_DEAD line pointers LP_UNUSED, freeing their slots. After each page, the freed space is recorded in the FSM (FreeSpaceMapVacuum), and if the page is now fully clean, the VM all-visible bit is set.

The reason this phase is order-dependent on phase II is correctness, not mere efficiency. A LP_DEAD line pointer must not be turned into LP_UNUSED — and thus made reusable by a future INSERT — until every index entry that points at that TID has been removed. If the heap slot were recycled while a stale index entry still pointed at it, an index scan could follow that entry to a brand-new, logically-unrelated tuple. The dead-TID TidStore is what ties the two passes together: phase II reads it to drive ambulkdelete, and phase III iterates the same store to find exactly which heap pages and offsets to reap. This is why lazy_vacuum() always runs phase II to completion (or the failsafe path) before invoking lazy_vacuum_heap_rel().

Figure 2 — The per-batch phase pipeline (scan+prune → dead-TID store → index vacuum → heap reap → VM/freeze)

flowchart TD
    P1["lazy_scan_prune<br/>heap_page_prune_and_freeze"] --> PR{"presult.lpdead_items > 0?"}
    PR -- yes --> DS["dead_items_add<br/>TidStore *dead_items"]
    PR -- no --> VM
    DS --> VMSET{"page now all-visible?"}
    VMSET -- yes, all-frozen --> VMF["visibilitymap_set<br/>ALL_VISIBLE | ALL_FROZEN"]
    VMSET -- yes, unfrozen XIDs --> VMV["visibilitymap_set<br/>ALL_VISIBLE"]
    VMSET -- no --> VM["NewRelfrozenXid / NewRelminMxid<br/>tracked per page"]
    VMF --> VM
    VMV --> VM
    VM --> FULL{"TidStore over max_bytes<br/>or scan complete?"}
    FULL -- no --> P1
    FULL -- yes --> LV["lazy_vacuum"]
    LV --> BYP{"bypass? lpdead_item_pages<br/>< BYPASS_THRESHOLD_PAGES<br/>and TidStore < 32MB"}
    BYP -- yes --> RST["dead_items_reset<br/>skip index + heap vacuum"]
    BYP -- no --> P2["lazy_vacuum_all_indexes<br/>Phase II: ambulkdelete per index"]
    P2 --> OK{"full index round done?"}
    OK -- no, failsafe --> FS["VacuumFailsafeActive<br/>do_index_vacuuming = false"]
    OK -- yes --> P3["lazy_vacuum_heap_rel<br/>Phase III: LP_DEAD to LP_UNUSED"]
    P3 --> RST

Figure 2 — One full pass through the phases on a TID batch. Phase I (lazy_scan_prune) prunes and freeze-stamps a page, feeds dead TIDs to the TidStore, and opportunistically sets VM all-visible / all-frozen bits and tracks NewRelfrozenXid. When the store fills (or the heap scan ends), lazy_vacuum() either bypasses cheaply, runs phase II (ambulkdelete) followed by phase III (lazy_vacuum_heap_rel), or trips the wraparound failsafe. The store is reset after every batch.

The VM/freeze update at the tail of lazy_scan_prune() is the point at which a freshly-cleaned page earns its all-visible (and possibly all-frozen) status for future skipping:

// lazy_scan_prune (VM-set tail) — src/backend/access/heap/vacuumlazy.c
/* page became all-visible: set ALL_VISIBLE, plus ALL_FROZEN if applicable */
PageSetAllVisible(page);
MarkBufferDirty(buf);
old_vmbits = visibilitymap_set(vacrel->rel, blkno, buf,
                               InvalidXLogRecPtr,
                               vmbuffer, presult.vm_conflict_horizon,
                               flags);
if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
    vacrel->vm_new_visible_pages++;
    if (presult.all_frozen)
    {
        vacrel->vm_new_visible_frozen_pages++;
        *vm_page_frozen = true;
    }
}

The presult.vm_conflict_horizon returned by heap_page_prune_and_freeze() is the snapshot-conflict horizon that recovery uses: a hot-standby replica replaying the VM-set record must cancel any query whose snapshot predates that horizon, since the page is about to be declared all-visible. This is the mechanism that keeps VM-driven index-only scans on a standby from returning rows that the standby’s own readers should still be able to see.

// lazy_vacuum_heap_rel — src/backend/access/heap/vacuumlazy.c
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
    TidStoreIter *iter = TidStoreBeginIterate(vacrel->dead_items);
    stream = read_stream_begin_relation(
                 READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
                 vacrel->bstrategy, vacrel->rel, MAIN_FORKNUM,
                 vacuum_reap_lp_read_stream_next, iter,
                 sizeof(TidStoreIterResult));

    while (true) {
        buf = read_stream_next_buffer(stream, (void **) &iter_result);
        if (!BufferIsValid(buf)) break;
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        lazy_vacuum_heap_page(vacrel, blkno, buf,
                              offsets, num_offsets, vmbuffer);
        /* update FSM with freed space */
    }
}

Wraparound failsafe

lazy_check_wraparound_failsafe() is called every FAILSAFE_EVERY_PAGES (~512K pages, i.e. 4 GB at 8 KB/page) during the scan. It calls vacuum_xid_failsafe_check(), which computes the distance from the current XID to the table’s relfrozenxid. If the distance exceeds vacuum_failsafe_age (default 1.6 billion), the failsafe engages:

// lazy_check_wraparound_failsafe — src/backend/access/heap/vacuumlazy.c
if (unlikely(vacuum_xid_failsafe_check(&vacrel->cutoffs)))
{
    VacuumFailsafeActive = true;
    vacrel->bstrategy = NULL;       /* use all shared_buffers */
    vacrel->do_index_vacuuming = false;
    vacrel->do_index_cleanup   = false;
    vacrel->do_rel_truncate    = false;
    VacuumCostActive = false;       /* disable cost delay */
    ereport(WARNING, ...);
    return true;
}

The failsafe trades a possibly-incomplete index cleanup for guaranteed progress through the heap freeze scan. It is a last resort: the normal path is that autovacuum runs often enough to keep all tables well below the failsafe threshold.

Cost-based pacing

vacuum_delay_point(false) is called at the top of each page iteration. It computes the current vacuum cost (buffer reads + dirtied buffers, weighted by vacuum_cost_page_hit, _miss, and _dirty) and sleeps for vacuum_cost_delay milliseconds when the running balance exceeds vacuum_cost_limit. For parallel vacuum, workers report their cost via parallel_vacuum_worker_delay_ns and the leader aggregates; the VacuumSharedCostBalance and VacuumActiveNWorkers atomic counters coordinate proportional delay across workers.

Relation truncation

After all three phases complete, should_attempt_truncation() decides whether to try truncating empty pages from the end of the relation. lazy_truncate_heap() acquires AccessExclusiveLock, confirms the trailing pages are empty (using a backwards scan), and calls smgrtruncate() to shrink the physical file. The truncation path requires the exclusive lock and therefore can be blocked by concurrent readers; it retries with a timeout (VACUUM_TRUNCATE_LOCK_TIMEOUT = 5000 ms) to avoid indefinite blocking.

Statistics and pg_class update

At the end of heap_vacuum_rel(), vac_update_relstats() updates pg_class.relpages, pg_class.reltuples, pg_class.relfrozenxid, and pg_class.relminmxid. relfrozenxid is advanced to vacrel.NewRelfrozenXid — the minimum unfrozen XID actually seen during this vacuum pass. Advancing relfrozenxid allows vac_truncate_clog() to discard old CLOG pages, which frees disk space in pg_xact/.

Parallel vacuum

When params->nworkers >= 0 and the table has at least two indexes whose combined size clears the parallel cutoff, vacuum forks parallel workers using the standard ParallelContext / DSM machinery in vacuumparallel.c. The parallelism unit is one index per worker — phase I (heap scan) and phase III (heap reap) always run in the leader. Only the index passes (phase II ambulkdelete and the final amvacuumcleanup) fan out.

The PVShared struct, allocated in the DSM segment by parallel_vacuum_init(), is the coordination record between leader and workers:

// PVShared — src/backend/commands/vacuumparallel.c
typedef struct PVShared
{
    Oid          relid;
    int          elevel;
    int64        queryid;

    double       reltuples;
    bool         estimated_count;
    int          maintenance_work_mem_worker;
    int          ring_nbuffers;

    pg_atomic_uint32 cost_balance;     /* shared vacuum cost balance */
    pg_atomic_uint32 active_nworkers;  /* workers currently delaying */
    pg_atomic_uint32 idx;              /* next index to claim */

    dsa_handle   dead_items_dsa_handle;
    dsa_pointer  dead_items_handle;    /* shared TidStore */
    VacDeadItemsInfo dead_items_info;
} PVShared;

Three fields carry the design. dead_items_handle / dead_items_dsa_handle point at the shared TidStore in DSA, so every worker reads the same set of dead TIDs the leader’s heap scan produced — the dead-item buffer is not copied per worker. idx is an atomic claim counter: each worker (and the leader, which joins as a worker) does pg_atomic_fetch_add_u32(&shared->idx, 1) to grab the next index to process, a lock-free work-stealing queue over the index array. cost_balance / active_nworkers implement shared cost-based delay so the whole worker pool, not each worker independently, respects vacuum_cost_limit.

The fan-out itself happens in parallel_vacuum_process_all_indexes(), called once per index-vacuum round and once for cleanup. It computes how many workers the current phase wants, launches them, then has the leader join the same work-stealing loop:

// parallel_vacuum_process_all_indexes — src/backend/commands/vacuumparallel.c
/* The leader process will participate */
nworkers--;
nworkers = Min(nworkers, pvs->pcxt->nworkers);
/* ... mark each index's parallel_workers_can_process ... */
pg_atomic_write_u32(&(pvs->shared->idx), 0);
if (nworkers > 0)
{
    pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance);
    pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0);
    ReinitializeParallelWorkers(pvs->pcxt, nworkers);
    LaunchParallelWorkers(pvs->pcxt);
    /* ... enable shared cost balance for leader ... */
}
/* leader vacuums parallel-unsafe indexes itself, then joins the safe loop */
parallel_vacuum_process_unsafe_indexes(pvs);
parallel_vacuum_process_safe_indexes(pvs);

Each participant (worker via parallel_vacuum_main(), leader inline) runs the same claim loop in parallel_vacuum_process_safe_indexes():

// parallel_vacuum_process_safe_indexes — src/backend/commands/vacuumparallel.c
for (;;)
{
    int         idx = pg_atomic_fetch_add_u32(&(pvs->shared->idx), 1);
    if (idx >= pvs->nindexes)
        break;                       /* all indexes claimed */
    indstats = &(pvs->indstats[idx]);
    if (!indstats->parallel_workers_can_process)
        continue;                    /* leader handles unsafe ones */
    parallel_vacuum_process_one_index(pvs, pvs->indrels[idx], indstats);
}

Two index categories never go to workers: parallel-unsafe indexes (whose AM declares it cannot bulk-delete in a worker, e.g. via missing VACUUM_OPTION_PARALLEL_BULKDEL) and indexes too small to be worth a worker. parallel_vacuum_process_unsafe_indexes() handles those in the leader before it joins the safe loop, so the leader is never idle while workers churn.

Figure 3 — Parallel index-vacuum worker fan-out

flowchart TD
    A["lazy_vacuum_all_indexes<br/>ParallelVacuumIsActive?"] -- yes --> B["parallel_vacuum_process_all_indexes"]
    A -- no --> S["leader-only loop:<br/>lazy_vacuum_one_index per index"]
    B --> C["compute nworkers<br/>nworkers-- (leader participates)"]
    C --> D["LaunchParallelWorkers<br/>set shared cost_balance"]
    D --> E["leader: process_unsafe_indexes<br/>(parallel-unsafe + tiny indexes)"]
    D --> W1["worker 1<br/>parallel_vacuum_main"]
    D --> W2["worker 2<br/>parallel_vacuum_main"]
    D --> Wn["worker N<br/>parallel_vacuum_main"]
    E --> F["leader joins:<br/>process_safe_indexes"]
    W1 --> Q["atomic claim:<br/>idx = fetch_add(shared.idx, 1)"]
    W2 --> Q
    Wn --> Q
    F --> Q
    Q --> ONE["parallel_vacuum_process_one_index<br/>ambulkdelete / amvacuumcleanup"]
    ONE --> Q
    Q -- idx >= nindexes --> WAIT["WaitForParallelWorkersToFinish<br/>InstrAccumParallelQuery"]
    WAIT --> CARRY["carry shared cost balance<br/>back to leader heap scan"]

Figure 3 — Index fan-out. The shared TidStore (via dead_items_handle) is read by every participant; shared.idx is an atomic counter that workers and the leader fetch-and-add to claim indexes, giving lock-free work-stealing. The leader handles parallel-unsafe and tiny indexes itself, then joins the safe loop, so it is never idle. After the round, shared cost balance is folded back into the leader’s VacuumCostBalance before the heap scan resumes.

Source Walkthrough

Entry and parameter flow

ExecVacuum — src/backend/commands/vacuum.c — parses VacuumStmt, builds VacuumParams, calls vacuum().
vacuum — vacuum.c — expands relation list; per-relation loop.
vacuum_rel — vacuum.c — opens relation, acquires lock, calls table_relation_vacuum() → heap_vacuum_rel().
vacuum_get_cutoffs — vacuum.c — computes VacuumCutoffs: OldestXmin, OldestMxact, FreezeLimit, MultiXactCutoff.

Core heap vacuum

heap_vacuum_rel — vacuumlazy.c — allocates LVRelState, calls lazy_scan_heap, finalizes stats.
heap_vacuum_eager_scan_setup — vacuumlazy.c — computes eager-freeze parameters (eager_scan_remaining_successes, etc.).
lazy_scan_heap — vacuumlazy.c — outer loop over pages using ReadStream; triggers lazy_vacuum() when TID store fills.
heap_vac_scan_next_block — vacuumlazy.c — ReadStream callback; consults VM to skip all-visible / all-frozen pages.
find_next_unskippable_block — vacuumlazy.c — advances vacrel->next_unskippable_block past skippable pages.
lazy_scan_prune — vacuumlazy.c — per-page: calls heap_page_prune_and_freeze, collects LP_DEAD offsets.
lazy_scan_noprune — vacuumlazy.c — per-page without cleanup lock: counts existing LP_DEAD items without modifying them.
heap_page_prune_and_freeze — pruneheap.c — prunes HOT chains, freezes tuples, returns PruneFreezeResult.

Phases II and III

lazy_vacuum — vacuumlazy.c — decides bypass vs. full phases II+III.
lazy_vacuum_all_indexes — vacuumlazy.c — calls ambulkdelete on each index (possibly via parallel workers).
lazy_vacuum_one_index — vacuumlazy.c — calls index_bulk_delete for a single index.
lazy_vacuum_heap_rel — vacuumlazy.c — second heap pass over LP_DEAD pages; calls lazy_vacuum_heap_page.
lazy_vacuum_heap_page — vacuumlazy.c — marks LP_DEAD offsets LP_UNUSED; updates FSM and VM.

Wraparound and cleanup

lazy_check_wraparound_failsafe — vacuumlazy.c — engages failsafe if relfrozenxid is dangerously old.
vacuum_xid_failsafe_check — vacuum.c — computes XID distance and returns true if failsafe threshold exceeded.
lazy_cleanup_all_indexes — vacuumlazy.c — calls ambulkcleanup on each index after the final vacuum pass.
lazy_truncate_heap — vacuumlazy.c — truncates trailing empty pages.
vac_update_relstats — vacuum.c — updates pg_class rows.
vac_truncate_clog — vacuum.c — advances CLOG truncation horizon.

Parallel vacuum

parallel_vacuum_init — vacuumparallel.c — sets up DSM, forks workers.
PVShared — vacuumparallel.c — DSM struct shared by leader + workers.
vacuum_delay_point — vacuum.c — cost-based sleep; reads VacuumSharedCostBalance in parallel mode.

Position hints (as of 2026-06-05 / commit 273fe94)

Symbol	File	Line
`ExecVacuum`	`src/backend/commands/vacuum.c`	162
`vacuum_get_cutoffs`	`src/backend/commands/vacuum.c`	1116
`vacuum_xid_failsafe_check`	`src/backend/commands/vacuum.c`	1284
`vacuum_rel`	`src/backend/commands/vacuum.c`	2018
`VacuumParams`	`src/include/commands/vacuum.h`	217
`VacuumCutoffs`	`src/include/commands/vacuum.h`	253
`VacDeadItemsInfo`	`src/include/commands/vacuum.h`	292
`LVRelState`	`src/backend/access/heap/vacuumlazy.c`	259
`SKIP_PAGES_THRESHOLD`	`src/backend/access/heap/vacuumlazy.c`	209
`MAX_EAGER_FREEZE_SUCCESS_RATE`	`src/backend/access/heap/vacuumlazy.c`	241
`EAGER_SCAN_REGION_SIZE`	`src/backend/access/heap/vacuumlazy.c`	250
`BYPASS_THRESHOLD_PAGES`	`src/backend/access/heap/vacuumlazy.c`	187
`FAILSAFE_EVERY_PAGES`	`src/backend/access/heap/vacuumlazy.c`	193
`heap_vacuum_rel`	`src/backend/access/heap/vacuumlazy.c`	615
`heap_vacuum_eager_scan_setup`	`src/backend/access/heap/vacuumlazy.c`	488
`lazy_scan_heap`	`src/backend/access/heap/vacuumlazy.c`	1200
`lazy_scan_prune`	`src/backend/access/heap/vacuumlazy.c`	1944
`lazy_vacuum`	`src/backend/access/heap/vacuumlazy.c`	2450
`lazy_vacuum_all_indexes`	`src/backend/access/heap/vacuumlazy.c`	2575
`lazy_vacuum_heap_rel`	`src/backend/access/heap/vacuumlazy.c`	2720
`lazy_cleanup_all_indexes`	`src/backend/access/heap/vacuumlazy.c`	3003
`lazy_vacuum_one_index`	`src/backend/access/heap/vacuumlazy.c`	3071
`lazy_check_wraparound_failsafe`	`src/backend/access/heap/vacuumlazy.c`	2950
`heap_page_prune_and_freeze`	`src/backend/access/heap/pruneheap.c`	350
`PVShared`	`src/backend/commands/vacuumparallel.c`	57
`parallel_vacuum_init`	`src/backend/commands/vacuumparallel.c`	243
`parallel_vacuum_process_all_indexes`	`src/backend/commands/vacuumparallel.c`	611
`parallel_vacuum_process_safe_indexes`	`src/backend/commands/vacuumparallel.c`	774
`parallel_vacuum_process_unsafe_indexes`	`src/backend/commands/vacuumparallel.c`	828
`parallel_vacuum_process_one_index`	`src/backend/commands/vacuumparallel.c`	865
`parallel_vacuum_main`	`src/backend/commands/vacuumparallel.c`	989

Source verification (as of 2026-06-05)

Verified facts

LVRelState is allocated in heap_vacuum_rel() and freed on exit. Confirmed at line 617: vacrel = (LVRelState *) palloc0(sizeof(LVRelState)). It is a palloc0 (not in a longer-lived context), so all counters start at zero automatically.
Dead items are stored in a TidStore, not a flat array. As of REL_18, vacrel->dead_items is a TidStore *, replacing the old LVDeadItems flat array from earlier releases. The TidStore can spill to disk when maintenance_work_mem is exceeded, whereas the old design would trigger a flush at max_items count. This change landed in PG17. The position-hint table references the PG18 interface.
SKIP_PAGES_THRESHOLD = 32 is a compile-time constant. Defined at line 209 in vacuumlazy.c. It is not a GUC and cannot be changed at runtime. The comment at line 51 explains it exists to preserve kernel readahead benefit even on mostly-clean tables.
Eager freeze scanning is controlled by vacuum_max_eager_freeze_failure_rate. This GUC (default not 0) was added in PG18. It gates the entire eager scanning feature: when it is 0, heap_vacuum_eager_scan_setup() returns immediately without enabling eager scanning (verified at line 507–508).
Wraparound failsafe disables cost delay. At line 2990, VacuumCostActive = false and VacuumCostBalance = 0 are set unconditionally when the failsafe engages, allowing vacuum to run at full I/O speed. This is intentional: completing the freeze is more important than I/O fairness.
Parallel vacuum is used only for index phases, not for heap scan. Confirmed by inspection of vacuumparallel.c: parallel_vacuum_init() is called from lazy_vacuum_all_indexes() and lazy_cleanup_all_indexes(), not from lazy_scan_heap() or lazy_vacuum_heap_rel().
vacuum_rel() in vacuum.c advances relfrozenxid database-wide. After per-relation vacuuming, vac_update_datfrozenxid() is called (unless VACOPT_SKIP_DATABASE_STATS is set), which updates pg_database.datfrozenxid. This is the second level of freeze tracking (table level = pg_class.relfrozenxid, database level = pg_database.datfrozenxid) that vac_truncate_clog() uses to decide how much CLOG can be discarded.

Open questions

TidStore spill-to-disk path under maintenance_work_mem pressure. The code comment says TidStore can spill to disk (this is the stated rationale for the PG17 redesign from flat array to TidStore). The tidstore.c implementation was not fully traced in this verification pass. Investigation path: read src/backend/access/common/tidstore.c to confirm the spill path and whether it is exercised in the regression tests.
GlobalVisState *vistest vs. VacuumCutoffs.OldestXmin. Both express “what tuples are dead”, but vistest is a wrapper that factors in GlobalVisibility for each tuple on a page. The exact relationship between vacrel->vistest and vacrel->cutoffs.OldestXmin in the context of heap_page_prune_and_freeze() was not traced to the GlobalVisState struct definition. Investigation path: read src/include/access/heapam.h and GlobalVisTestIsRemovableXid().

Beyond PostgreSQL — Comparative Designs & Research Frontiers

InnoDB’s purge thread. InnoDB performs MVCC garbage collection via a dedicated background purge thread (and, since MySQL 5.7, a purge thread pool). The key difference from PostgreSQL is that InnoDB keeps old row versions in a separate undo log (rollback segments), not in the primary table file. Purge reclaims undo segments rather than marking in-place slots as LP_UNUSED. This means InnoDB’s reclamation does not require an index pass (the primary clustered index already has the new version), but the undo segments can grow large under long transactions. A comparative analysis of undo-based vs. in-place MVCC reclamation would be a natural companion to this document and postgres-heap-am.md.
VACUUM FULL vs. CLUSTER vs. pg_repack. PostgreSQL’s lazy VACUUM leaves bloat — freed LP_UNUSED slots within pages, and empty pages at the tail after truncation. VACUUM FULL / CLUSTER rewrites the relation but requires an exclusive lock. The pg_repack extension (out of scope for this core-only tree) performs an online repack using a trigger-maintained shadow table and a brief lock switchover; the technique is described in its source documentation.
Vampire vacuuming and long transactions. A long-running transaction holds its snapshot’s OldestXmin horizon fixed, preventing vacuum from advancing the dead-tuple horizon. This is the “vacuum being blocked by a long transaction” problem visible in pg_stat_activity. Monitoring and mitigation strategies (idle-in-transaction timeouts, old_snapshot_threshold GUC) are discussed in postgres-mvcc-snapshots.md.
XID wraparound history. The PG16 release introduced the integer4 age(relfrozenxid) monitoring function and an improved autovacuum urgency model; PG17 introduced the TidStore redesign. The full evolution of the freeze mechanism across major releases belongs in postgres-evolution-vacuum-visibility.md (planned; see coverage map).
PostgreSQL’s ARIES lineage. The WAL-based crash recovery that makes vacuum’s steal/no-force policy safe is documented in postgres-xlog-wal.md and anchored in the ARIES paper (dbms-papers/aries.md). Vacuum’s interaction with WAL is indirect: freezing writes WAL records (via heap_page_prune_and_freeze() calling XLogInsert), and LP_UNUSED transitions also generate WAL. Those log records are what allow a replayed recovery to stay consistent with the frozen/reaped state vacuum produced before the crash.

Sources

Raw sources consumed

None (this document was synthesized directly from the PostgreSQL source tree at REL_18_STABLE, commit 273fe94).

Source code paths

src/backend/commands/vacuum.c — entry, param parsing, cutoff computation, CLOG truncation, stats update.
src/backend/commands/vacuumparallel.c — parallel index vacuum setup, worker protocol, PVShared DSM layout.
src/backend/access/heap/vacuumlazy.c — heap_vacuum_rel, lazy_scan_heap, all phase-specific functions, wraparound failsafe, eager scanning.
src/include/commands/vacuum.h — VacuumParams, VacuumCutoffs, VacDeadItemsInfo, VacOptValue, VACOPT_* flags.
src/backend/access/heap/pruneheap.c — heap_page_prune_and_freeze(), HOT pruning, per-page freeze logic.
src/backend/access/heap/visibilitymap.c — VM bit set/clear operations called from lazy_vacuum_heap_page and lazy_scan_prune.
src/backend/storage/freespace/freespace.c — FSM update after heap reap.

Textbook anchors

Database Internals (Petrov, 2019), ch. 5 §“MVCC Versions and Cleanup” — MVCC garbage collection framing.
Database System Concepts, 7e (Silberschatz et al.), ch. 15 §“Snapshot Isolation” — theoretical framing of dead-version visibility.

postgres-heap-am.md — heap tuple layout, HOT update chains, LP_* states, visibility map bit mechanics.
postgres-mvcc-snapshots.md — OldestXmin derivation from procarray, snapshot horizons, long-transaction vacuum blocking.
postgres-xlog-wal.md — WAL records emitted by pruning and freezing; ARIES steal/no-force contract.
postgres-xact.md — transaction lifecycle, relfrozenxid semantics.
postgres-autovacuum.md (planned) — autovacuum daemon that drives heap_vacuum_rel() automatically.
postgres-xid-wraparound-freeze.md (planned) — deep dive on freeze mechanics and varsup.c XID allocation.
dbms-papers/aries.md — ARIES recovery paper underpinning WAL and the steal/no-force policy vacuum depends on.

PostgreSQL Vacuum — Dead-Tuple Reclamation, Freezing, and XID Wraparound Prevention

Theoretical Background

Common DBMS Design

Horizon computation via a global snapshot of active XIDs

Three-phase scan → index-delete → heap-reap loop

Cost-based throttling

Theory ↔ PostgreSQL mapping

PostgreSQL’s Approach

Two command variants: lazy VACUUM and VACUUM FULL

Entry point and parameter flow

VacuumCutoffs: the immutable XID horizons

LVRelState: per-relation vacuum working state

The three-phase loop in lazy_scan_heap

Phase I: lazy_scan_prune and page-level pruning

Page-skipping via the visibility map

Eager freeze scanning (PG18 addition)

Phase II: index bulk-delete

Phase III: heap LP_DEAD reap

Wraparound failsafe

Cost-based pacing

Relation truncation

Statistics and pg_class update

Parallel vacuum

Source Walkthrough

Entry and parameter flow

Core heap vacuum

Phases II and III

Wraparound and cleanup

Parallel vacuum

Position hints (as of 2026-06-05 / commit 273fe94)

Source verification (as of 2026-06-05)

Verified facts

Open questions

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Sources

Raw sources consumed

Source code paths

Textbook anchors

Related documents in this tree