PostgreSQL Vacuum — Dead-Tuple Reclamation, Freezing, and XID Wraparound Prevention
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”MVCC (Multi-Version Concurrency Control) achieves read-without-block by
keeping old row versions visible until no running transaction can see them
any more (Database Internals, Petrov, ch. 5 §“MVCC Versions and Cleanup”;
Database System Concepts, 7e, ch. 15 §“Snapshot Isolation”). The price is
that delete and update operations do not reclaim space immediately — they
stamp the old version as dead (set xmax) and leave the bytes on the page.
Without a background reclamation process, a table that receives a steady
stream of updates would grow without bound, and reads would scan ever-longer
chains of dead versions to find the live ones. The process that does the
reclaiming is conventionally called a garbage collector or vacuum.
The vacuum problem has two separable sub-problems.
Dead-tuple reclamation. A row version is dead when its deleting
transaction (xmax) has committed and no snapshot in any active transaction
can see the row any more. The oldest active snapshot sets the horizon: row
versions deleted before that horizon are safe to remove. The implementation
challenge is that “safe to remove” is a global predicate — the vacuum must
know about every active transaction, not just the one on its own backend —
and computing that global minimum must not itself block writers.
Transaction-ID (XID) wraparound prevention. PostgreSQL uses 32-bit XIDs.
The wraparound window is 2^31 transactions (~2.1 billion) ahead of the
current XID. A row inserted by a transaction that is more than 2^31 XIDs in
the past looks newer than the current transaction, making it invisible.
To prevent this, vacuum freezes old XIDs: it rewrites a tuple’s xmin/
xmax to the special value FrozenTransactionId (XID 2) or sets the
HEAP_XMIN_FROZEN hint bit, after which the tuple is visible to all future
transactions regardless of their XID. Without regular freezing, a database
will corrupt data when XID space wraps. This is the hard real-time obligation
that makes VACUUM non-optional even for read-only tables.
Both sub-problems interact with the visibility map (VM), a per-relation
two-bit-per-page bitmap maintained by the heap access method. A page whose
all-visible (AV) bit is set contains no dead tuples visible to any snapshot;
vacuum can skip it for dead-tuple reclamation. A page whose all-frozen (AF)
bit is set additionally contains no unfrozen XIDs; vacuum can skip it for
freezing. The VM is the primary mechanism that makes repeated vacuums cheap
on a stable table. See postgres-heap-am.md for the VM bit mechanics;
this document covers how vacuum drives and updates the VM.
Common DBMS Design
Section titled “Common DBMS Design”Most MVCC engines need some form of background reclamation. The patterns PostgreSQL follows are not unique to it; they appear in recognizable form in InnoDB’s purge thread, Oracle’s undo segment reuse, and SQL Server’s ghost-record cleanup.
Horizon computation via a global snapshot of active XIDs
Section titled “Horizon computation via a global snapshot of active XIDs”The reclaimer needs the oldest XID still visible to any running transaction.
In PostgreSQL this is OldestXmin, derived from the procarray (the shared
array of per-backend PGPROC entries holding each backend’s current XID and
snapshot horizon). No tuple deleted by a transaction older than OldestXmin
can be seen by any new snapshot. Computing OldestXmin is a snapshot of
the procarray — a read of shared memory under a spin-lock acquisition,
not a full transaction. Competing designs include a low-watermark counter
maintained by each backend on every transaction start/commit (avoids the
scan) or a global epoch mechanism (used by several NewSQL engines).
Three-phase scan → index-delete → heap-reap loop
Section titled “Three-phase scan → index-delete → heap-reap loop”Cleaning up dead tuples from indexes is more expensive than cleaning them from the heap, because each index has its own structure. The common pattern across nearly all MVCC storage engines is:
- Scan heap pages. Find dead items; record their TIDs (tuple identifiers) in a buffer.
- Bulk-delete from indexes. Walk every index once, deleting entries whose
TID is in the buffer. This is the expensive index
ambulkdeletepass. - Reap heap items. Return to the heap pages, mark the dead line pointers
LP_UNUSEDto free the slot for future inserts, and update the free-space map (FSM).
The buffer of dead TIDs is the coordination point between phases. When it
fills before the heap scan completes, the engine must flush early (run phases
2–3 on the partial batch, then resume phase 1). This creates a trade-off
between maintenance_work_mem (buffer size) and the number of index passes.
Cost-based throttling
Section titled “Cost-based throttling”Background reclamation competes with foreground I/O. Unconstrained vacuum on a large table saturates I/O and inflates query latency. The standard solution is cost-based rate limiting: count the number of buffer reads and writes the vacuum performs per unit time, and insert a sleep when the rate exceeds a threshold. This is a soft real-time knob — vacuum trades throughput for latency headroom, at the cost of taking longer to complete.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory concept | PostgreSQL entity |
|---|---|
| MVCC garbage collector | vacuum() / heap_vacuum_rel() |
| Global oldest-snapshot horizon | OldestXmin in VacuumCutoffs |
| XID freeze horizon | FreezeLimit in VacuumCutoffs |
| Dead-TID buffer | TidStore *dead_items in LVRelState |
| Visibility skip map | visibility map (VM) AV/AF bits |
| Phase 1 (heap scan) | lazy_scan_heap() |
| Phase 2 (index delete) | lazy_vacuum_all_indexes() / ambulkdelete |
| Phase 3 (heap reap) | lazy_vacuum_heap_rel() |
| Cost throttle sleep | vacuum_delay_point() |
| Wraparound protection | aggressive vacuum + lazy_check_wraparound_failsafe() |
| Parallel index vacuuming | ParallelVacuumState in vacuumparallel.c |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”Two command variants: lazy VACUUM and VACUUM FULL
Section titled “Two command variants: lazy VACUUM and VACUUM FULL”PostgreSQL distinguishes two fundamentally different operations sharing the
VACUUM command name.
Lazy VACUUM (the default, also called concurrent vacuum) runs in-place:
it reclaims dead tuple slots and marks them reusable without rewriting the
page or holding exclusive table locks. Readers and writers continue normally
during lazy VACUUM. This is what heap_vacuum_rel() in vacuumlazy.c
implements, and it is the subject of this document.
VACUUM FULL rewrites the entire relation into a new heap file (via the
CLUSTER code path), acquiring an exclusive AccessExclusiveLock on the
table. It reclaims all free space but blocks all other access. It is seldom
needed in practice — lazy VACUUM is sufficient for ongoing maintenance.
Entry point and parameter flow
Section titled “Entry point and parameter flow”The SQL command VACUUM enters via ExecVacuum in vacuum.c, which
parses options into a VacuumParams struct and calls vacuum(). vacuum()
expands the target relation list (or scans all vacuumable relations for a
plain VACUUM), then calls vacuum_rel() per relation. For the heap AM,
vacuum_rel() eventually calls table_relation_vacuum() which dispatches
to heap_vacuum_rel().
// VacuumParams — src/include/commands/vacuum.htypedef struct VacuumParams{ bits32 options; /* VACOPT_* bitmask */ int freeze_min_age; /* min XID age before freeze, -1 = default */ int freeze_table_age; /* whole-table scan age threshold */ int multixact_freeze_min_age; int multixact_freeze_table_age; bool is_wraparound; /* force for-wraparound vacuum */ int log_min_duration; VacOptValue index_cleanup; /* VACOPTVALUE_AUTO/ENABLED/DISABLED */ VacOptValue truncate; Oid toast_parent; double max_eager_freeze_failure_rate; int nworkers; /* 0 = auto, -1 = no parallel */} VacuumParams;VacuumCutoffs: the immutable XID horizons
Section titled “VacuumCutoffs: the immutable XID horizons”Before scanning a single page, vacuum_get_cutoffs() computes four XID
horizons that remain constant for the entire VACUUM run:
// VacuumCutoffs — src/include/commands/vacuum.hstruct VacuumCutoffs{ TransactionId relfrozenxid; /* current pg_class.relfrozenxid */ MultiXactId relminmxid;
TransactionId OldestXmin; /* tuples deleted before this are DEAD */ MultiXactId OldestMxact;
TransactionId FreezeLimit; /* XIDs older than this must be frozen */ MultiXactId MultiXactCutoff;};OldestXmin is computed by scanning the procarray for the oldest horizon
visible to any running transaction. Tuples whose xmax committed before
OldestXmin are dead to all current and future snapshots.
FreezeLimit is derived from vacuum_freeze_min_age (default 50 million):
any XID older than current_xid - vacuum_freeze_min_age must be frozen.
Freezing replaces the stored XID with FrozenTransactionId (XID 2) or sets
the HEAP_XMIN_FROZEN hint bit, making the tuple permanently visible.
LVRelState: per-relation vacuum working state
Section titled “LVRelState: per-relation vacuum working state”heap_vacuum_rel() allocates one LVRelState on the heap that lives for
the duration of the VACUUM on this relation. It is the single central
state object passed through all sub-functions:
// LVRelState (abridged) — src/backend/access/heap/vacuumlazy.ctypedef struct LVRelState{ Relation rel; Relation *indrels; int nindexes;
BufferAccessStrategy bstrategy; ParallelVacuumState *pvs; /* non-NULL if parallel */
bool aggressive; /* must scan every unfrozen tuple? */ bool skipwithvm; /* use VM to skip pages? */
struct VacuumCutoffs cutoffs; GlobalVisState *vistest;
TransactionId NewRelfrozenXid; /* lowest unfrozen XID seen so far */ MultiXactId NewRelminMxid;
TidStore *dead_items; /* TIDs of LP_DEAD line pointers */ VacDeadItemsInfo *dead_items_info;
BlockNumber rel_pages; BlockNumber scanned_pages; int64 tuples_deleted; int64 tuples_frozen; int64 lpdead_items; int64 live_tuples; /* ... more counters ... */} LVRelState;The aggressive flag controls whether the vacuum must visit every unfrozen
tuple to advance relfrozenxid. An aggressive vacuum is triggered when the
table’s relfrozenxid approaches vacuum_freeze_table_age (default 150
million transactions before current_xid). Without it, the table risks XID
wraparound.
The three-phase loop in lazy_scan_heap
Section titled “The three-phase loop in lazy_scan_heap”lazy_scan_heap() drives the entire work loop. It sets up a ReadStream
for the relation (PG18’s async I/O infrastructure) and iterates over blocks.
The block selection is guided by heap_vac_scan_next_block(), which consults
the VM and LVRelState skip-logic to decide what to read next.
// lazy_scan_heap — src/backend/access/heap/vacuumlazy.cstatic voidlazy_scan_heap(LVRelState *vacrel){ ReadStream *stream; /* ... */ stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, vacrel->bstrategy, vacrel->rel, MAIN_FORKNUM, heap_vac_scan_next_block, vacrel, sizeof(uint8));
while (true) { /* Check TID store full → flush early via lazy_vacuum() */ if (vacrel->dead_items_info->num_items > 0 && TidStoreMemoryUsage(vacrel->dead_items) > vacrel->dead_items_info->max_bytes) { lazy_vacuum(vacrel); /* phases 2+3 on current batch */ FreeSpaceMapVacuumRange(vacrel->rel, ...); }
buf = read_stream_next_buffer(stream, &per_buffer_data); if (!BufferIsValid(buf)) break;
/* Phase 1: prune + freeze this page */ /* ... cleanup lock, lazy_scan_prune() or lazy_scan_noprune() ... */ } /* Final phase 2+3 after heap scan completes */ lazy_vacuum(vacrel);}Figure 1 — The three-phase vacuum loop
flowchart TD
A[heap_vacuum_rel] --> B[vacuum_get_cutoffs]
B --> C[dead_items_alloc]
C --> D[lazy_scan_heap\nPhase I: heap scan]
D --> E{TID store full?}
E -- yes --> F[lazy_vacuum\nPhases II+III]
F --> D
E -- no --> G{more pages?}
G -- yes --> H[lazy_scan_prune / lazy_scan_noprune]
H --> G
G -- no --> I[lazy_vacuum\nfinal flush]
I --> J[lazy_truncate_heap?]
J --> K[update pg_class stats]
Figure 1 — Outer loop of heap_vacuum_rel / lazy_scan_heap. Phase I (heap scan) loops over pages with optional early flushes when the dead-items TID store fills. Phases II+III (index + heap vacuum) run on each full TID batch and once more after the heap scan completes.
Phase I: lazy_scan_prune and page-level pruning
Section titled “Phase I: lazy_scan_prune and page-level pruning”For each block that clears the visibility-map skip check, vacuum acquires a
cleanup lock (LockBufferForCleanup) — exclusive but released as soon as
page processing completes — and calls lazy_scan_prune():
// lazy_scan_prune — src/backend/access/heap/vacuumlazy.cstatic intlazy_scan_prune(LVRelState *vacrel, Buffer buf, BlockNumber blkno, Page page, Buffer vmbuffer, bool all_visible_according_to_vm, bool *has_lpdead_items, bool *vm_page_frozen){ PruneFreezeResult presult; int prune_options = HEAP_PAGE_PRUNE_FREEZE; if (vacrel->nindexes == 0) prune_options |= HEAP_PAGE_PRUNE_MARK_UNUSED_NOW;
heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options, &vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN, &vacrel->offnum, &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
if (presult.lpdead_items > 0) dead_items_add(vacrel, blkno, presult.deadoffsets, presult.lpdead_items); /* update per-relation counters ... */}heap_page_prune_and_freeze() (in pruneheap.c, line 350) does two
conceptually distinct things in one page pass under the cleanup lock:
-
Prune HOT chains. It walks all line pointers on the page, detects dead versions in HOT update chains (where
t_ctidchains through multiple versions of the same logical row), collapses the chain, and marks dead line pointersLP_DEAD. -
Freeze tuples. For every live tuple whose
xminorxmaxXID is older thancutoffs.FreezeLimit(or whose MultiXact is older thancutoffs.MultiXactCutoff), it replaces the stored XID with the frozen sentinel or sets theHEAP_XMIN_FROZENhint bit.
Dead line pointers are added to vacrel->dead_items (a TidStore, allocated
from maintenance_work_mem) so the index passes can remove the corresponding
index entries.
Page-skipping via the visibility map
Section titled “Page-skipping via the visibility map”heap_vac_scan_next_block() implements the skip logic. For a normal
(non-aggressive) vacuum:
- A page with the all-visible bit set and no forced scan can be skipped for dead-tuple collection.
- A page with both all-visible and all-frozen bits set can also be skipped for freezing.
For an aggressive vacuum the all-frozen bit is still honoured (fully
frozen pages need no freeze work), but the all-visible-only pages must still
be visited to check for unfrozen XIDs that need freezing to advance
relfrozenxid.
A small threshold SKIP_PAGES_THRESHOLD (32 pages) prevents vacuum from
skipping a run of clean pages that is too short to justify the loss of
kernel readahead.
Eager freeze scanning (PG18 addition)
Section titled “Eager freeze scanning (PG18 addition)”PG18 introduced eager freeze scanning: a normal vacuum may proactively visit all-visible-but-not-yet-all-frozen pages to freeze them, amortizing the work that would otherwise all fall on the next aggressive vacuum.
heap_vacuum_eager_scan_setup() computes per-relation eager-scan
parameters stored in LVRelState:
// heap_vacuum_eager_scan_setup — src/backend/access/heap/vacuumlazy.cstatic voidheap_vacuum_eager_scan_setup(LVRelState *vacrel, VacuumParams *params){ /* cap on successes: MAX_EAGER_FREEZE_SUCCESS_RATE (0.2) of * all-visible-but-not-all-frozen pages */ vacrel->eager_scan_remaining_successes = (BlockNumber) (MAX_EAGER_FREEZE_SUCCESS_RATE * (allvisible - allfrozen));
/* failure tolerance per EAGER_SCAN_REGION_SIZE (4096) block region */ vacrel->eager_scan_max_fails_per_region = params->max_eager_freeze_failure_rate * EAGER_SCAN_REGION_SIZE;}The success cap (MAX_EAGER_FREEZE_SUCCESS_RATE = 0.2, i.e. at most 20% of
eligible pages per vacuum) prevents a single normal vacuum from doing an
unlimited amount of extra freeze work. The per-region failure tolerance
prevents wasting I/O in table regions where all-visible pages turn out to
have no freezable XIDs yet (too recently written).
Phase II: index bulk-delete
Section titled “Phase II: index bulk-delete”lazy_vacuum_all_indexes() calls ambulkdelete() on every index, passing
vac_tid_reaped() as the callback. The callback returns true for any TID
present in vacrel->dead_items. Each index AM walks its own structure and
deletes the matching entries in a single pass.
If ParallelVacuumIsActive(vacrel) (i.e., pvs != NULL), the indexes are
distributed among worker processes via the ParallelVacuumState DSM segment
(in vacuumparallel.c). Each worker calls lazy_vacuum_one_index() for its
assigned index. The leader coordinates via shared-memory counters and
ParallelVacuumKey DSM keys.
Index vacuuming bypass optimization. If lpdead_item_pages is below
BYPASS_THRESHOLD_PAGES (2%) of rel_pages and the TID store fits in 32 MB,
lazy_vacuum() bypasses the index and heap vacuum phases entirely — skipping
an ambulkdelete pass for a negligible number of dead items avoids the cost
discontinuity of a full index scan for a nearly-clean table. The threshold is
expressed in pages with at least one LP_DEAD item, not in the raw dead-item
count, precisely because the page count is the proxy for how many VM bits stay
unset when vacuuming is skipped — and unset VM bits are what cost future
scans. The guard insists the optimization only fire when num_index_scans == 0 (a single batch), because once a relation has been large enough to force an
earlier index pass, the cheap-bypass assumption no longer holds:
// lazy_vacuum (bypass decision) — src/backend/access/heap/vacuumlazy.cthreshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;bypass = (vacrel->lpdead_item_pages < threshold && TidStoreMemoryUsage(vacrel->dead_items) < 32 * 1024 * 1024);/* ... */if (bypass) vacrel->do_index_vacuuming = false; /* but still do index cleanup */else if (lazy_vacuum_all_indexes(vacrel)) lazy_vacuum_heap_rel(vacrel); /* full phase II then phase III */else Assert(VacuumFailsafeActive); /* round aborted by failsafe */Note the asymmetry: a bypass still runs index cleanup (amvacuumcleanup)
later, because cleanup is cheap when no ambulkdelete ran and it is what lets
the index AM update its own statistics. Only the expensive bulk-delete pass is
skipped.
Dead-TID store sizing and the index-pass count. The size of
vacrel->dead_items is bounded by dead_items_info->max_bytes, derived from
maintenance_work_mem (or autovacuum_work_mem under autovacuum). When the
store reaches that ceiling mid-scan, lazy_scan_heap() must flush early — run
phases II + III on the partial batch — and then resume the heap scan. Each
early flush is one additional full sweep of every index. A table large enough
to overflow the store therefore pays (number of flushes + 1) index scans.
This is the single most important reason to give vacuum generous
maintenance_work_mem: it is not heap-scan memory, it is the buffer that keeps
the index-pass count at one. The PG17 TidStore redesign (radix-tree backed,
replacing the old flat ItemPointerData array) raised the effective dead-item
density per byte dramatically, so the same memory budget now tolerates far
more dead tuples before forcing a second index pass.
Phase III: heap LP_DEAD reap
Section titled “Phase III: heap LP_DEAD reap”After index entries are removed, lazy_vacuum_heap_rel() makes a second
pass over only the heap pages that contributed LP_DEAD items (tracked in the
TID store). It acquires a regular exclusive buffer lock (not a cleanup lock)
and calls lazy_vacuum_heap_page(), which marks the LP_DEAD line pointers
LP_UNUSED, freeing their slots. After each page, the freed space is
recorded in the FSM (FreeSpaceMapVacuum), and if the page is now fully
clean, the VM all-visible bit is set.
The reason this phase is order-dependent on phase II is correctness, not
mere efficiency. A LP_DEAD line pointer must not be turned into LP_UNUSED
— and thus made reusable by a future INSERT — until every index entry that
points at that TID has been removed. If the heap slot were recycled while a
stale index entry still pointed at it, an index scan could follow that entry
to a brand-new, logically-unrelated tuple. The dead-TID TidStore is what
ties the two passes together: phase II reads it to drive ambulkdelete, and
phase III iterates the same store to find exactly which heap pages and offsets
to reap. This is why lazy_vacuum() always runs phase II to completion (or
the failsafe path) before invoking lazy_vacuum_heap_rel().
Figure 2 — The per-batch phase pipeline (scan+prune → dead-TID store → index vacuum → heap reap → VM/freeze)
flowchart TD
P1["lazy_scan_prune<br/>heap_page_prune_and_freeze"] --> PR{"presult.lpdead_items > 0?"}
PR -- yes --> DS["dead_items_add<br/>TidStore *dead_items"]
PR -- no --> VM
DS --> VMSET{"page now all-visible?"}
VMSET -- yes, all-frozen --> VMF["visibilitymap_set<br/>ALL_VISIBLE | ALL_FROZEN"]
VMSET -- yes, unfrozen XIDs --> VMV["visibilitymap_set<br/>ALL_VISIBLE"]
VMSET -- no --> VM["NewRelfrozenXid / NewRelminMxid<br/>tracked per page"]
VMF --> VM
VMV --> VM
VM --> FULL{"TidStore over max_bytes<br/>or scan complete?"}
FULL -- no --> P1
FULL -- yes --> LV["lazy_vacuum"]
LV --> BYP{"bypass? lpdead_item_pages<br/>< BYPASS_THRESHOLD_PAGES<br/>and TidStore < 32MB"}
BYP -- yes --> RST["dead_items_reset<br/>skip index + heap vacuum"]
BYP -- no --> P2["lazy_vacuum_all_indexes<br/>Phase II: ambulkdelete per index"]
P2 --> OK{"full index round done?"}
OK -- no, failsafe --> FS["VacuumFailsafeActive<br/>do_index_vacuuming = false"]
OK -- yes --> P3["lazy_vacuum_heap_rel<br/>Phase III: LP_DEAD to LP_UNUSED"]
P3 --> RST
Figure 2 — One full pass through the phases on a TID batch. Phase I (lazy_scan_prune) prunes and freeze-stamps a page, feeds dead TIDs to the TidStore, and opportunistically sets VM all-visible / all-frozen bits and tracks NewRelfrozenXid. When the store fills (or the heap scan ends), lazy_vacuum() either bypasses cheaply, runs phase II (ambulkdelete) followed by phase III (lazy_vacuum_heap_rel), or trips the wraparound failsafe. The store is reset after every batch.
The VM/freeze update at the tail of lazy_scan_prune() is the point at which
a freshly-cleaned page earns its all-visible (and possibly all-frozen) status
for future skipping:
// lazy_scan_prune (VM-set tail) — src/backend/access/heap/vacuumlazy.c/* page became all-visible: set ALL_VISIBLE, plus ALL_FROZEN if applicable */PageSetAllVisible(page);MarkBufferDirty(buf);old_vmbits = visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr, vmbuffer, presult.vm_conflict_horizon, flags);if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0){ vacrel->vm_new_visible_pages++; if (presult.all_frozen) { vacrel->vm_new_visible_frozen_pages++; *vm_page_frozen = true; }}The presult.vm_conflict_horizon returned by heap_page_prune_and_freeze()
is the snapshot-conflict horizon that recovery uses: a hot-standby replica
replaying the VM-set record must cancel any query whose snapshot predates that
horizon, since the page is about to be declared all-visible. This is the
mechanism that keeps VM-driven index-only scans on a standby from returning
rows that the standby’s own readers should still be able to see.
// lazy_vacuum_heap_rel — src/backend/access/heap/vacuumlazy.cstatic voidlazy_vacuum_heap_rel(LVRelState *vacrel){ TidStoreIter *iter = TidStoreBeginIterate(vacrel->dead_items); stream = read_stream_begin_relation( READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING, vacrel->bstrategy, vacrel->rel, MAIN_FORKNUM, vacuum_reap_lp_read_stream_next, iter, sizeof(TidStoreIterResult));
while (true) { buf = read_stream_next_buffer(stream, (void **) &iter_result); if (!BufferIsValid(buf)) break; LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); lazy_vacuum_heap_page(vacrel, blkno, buf, offsets, num_offsets, vmbuffer); /* update FSM with freed space */ }}Wraparound failsafe
Section titled “Wraparound failsafe”lazy_check_wraparound_failsafe() is called every FAILSAFE_EVERY_PAGES
(~512K pages, i.e. 4 GB at 8 KB/page) during the scan. It calls
vacuum_xid_failsafe_check(), which computes the distance from the current
XID to the table’s relfrozenxid. If the distance exceeds
vacuum_failsafe_age (default 1.6 billion), the failsafe engages:
// lazy_check_wraparound_failsafe — src/backend/access/heap/vacuumlazy.cif (unlikely(vacuum_xid_failsafe_check(&vacrel->cutoffs))){ VacuumFailsafeActive = true; vacrel->bstrategy = NULL; /* use all shared_buffers */ vacrel->do_index_vacuuming = false; vacrel->do_index_cleanup = false; vacrel->do_rel_truncate = false; VacuumCostActive = false; /* disable cost delay */ ereport(WARNING, ...); return true;}The failsafe trades a possibly-incomplete index cleanup for guaranteed progress through the heap freeze scan. It is a last resort: the normal path is that autovacuum runs often enough to keep all tables well below the failsafe threshold.
Cost-based pacing
Section titled “Cost-based pacing”vacuum_delay_point(false) is called at the top of each page iteration.
It computes the current vacuum cost (buffer reads + dirtied buffers,
weighted by vacuum_cost_page_hit, _miss, and _dirty) and sleeps
for vacuum_cost_delay milliseconds when the running balance exceeds
vacuum_cost_limit. For parallel vacuum, workers report their cost via
parallel_vacuum_worker_delay_ns and the leader aggregates; the
VacuumSharedCostBalance and VacuumActiveNWorkers atomic counters
coordinate proportional delay across workers.
Relation truncation
Section titled “Relation truncation”After all three phases complete, should_attempt_truncation() decides
whether to try truncating empty pages from the end of the relation.
lazy_truncate_heap() acquires AccessExclusiveLock, confirms the
trailing pages are empty (using a backwards scan), and calls
smgrtruncate() to shrink the physical file. The truncation path requires
the exclusive lock and therefore can be blocked by concurrent readers; it
retries with a timeout (VACUUM_TRUNCATE_LOCK_TIMEOUT = 5000 ms) to
avoid indefinite blocking.
Statistics and pg_class update
Section titled “Statistics and pg_class update”At the end of heap_vacuum_rel(), vac_update_relstats() updates
pg_class.relpages, pg_class.reltuples, pg_class.relfrozenxid, and
pg_class.relminmxid. relfrozenxid is advanced to vacrel.NewRelfrozenXid
— the minimum unfrozen XID actually seen during this vacuum pass. Advancing
relfrozenxid allows vac_truncate_clog() to discard old CLOG pages, which
frees disk space in pg_xact/.
Parallel vacuum
Section titled “Parallel vacuum”When params->nworkers >= 0 and the table has at least two indexes whose
combined size clears the parallel cutoff, vacuum forks parallel workers using
the standard ParallelContext / DSM machinery in vacuumparallel.c. The
parallelism unit is one index per worker — phase I (heap scan) and phase
III (heap reap) always run in the leader. Only the index passes (phase II
ambulkdelete and the final amvacuumcleanup) fan out.
The PVShared struct, allocated in the DSM segment by parallel_vacuum_init(),
is the coordination record between leader and workers:
// PVShared — src/backend/commands/vacuumparallel.ctypedef struct PVShared{ Oid relid; int elevel; int64 queryid;
double reltuples; bool estimated_count; int maintenance_work_mem_worker; int ring_nbuffers;
pg_atomic_uint32 cost_balance; /* shared vacuum cost balance */ pg_atomic_uint32 active_nworkers; /* workers currently delaying */ pg_atomic_uint32 idx; /* next index to claim */
dsa_handle dead_items_dsa_handle; dsa_pointer dead_items_handle; /* shared TidStore */ VacDeadItemsInfo dead_items_info;} PVShared;Three fields carry the design. dead_items_handle / dead_items_dsa_handle
point at the shared TidStore in DSA, so every worker reads the same set
of dead TIDs the leader’s heap scan produced — the dead-item buffer is not
copied per worker. idx is an atomic claim counter: each worker (and the
leader, which joins as a worker) does pg_atomic_fetch_add_u32(&shared->idx, 1)
to grab the next index to process, a lock-free work-stealing queue over the
index array. cost_balance / active_nworkers implement shared cost-based
delay so the whole worker pool, not each worker independently, respects
vacuum_cost_limit.
The fan-out itself happens in parallel_vacuum_process_all_indexes(), called
once per index-vacuum round and once for cleanup. It computes how many workers
the current phase wants, launches them, then has the leader join the same
work-stealing loop:
// parallel_vacuum_process_all_indexes — src/backend/commands/vacuumparallel.c/* The leader process will participate */nworkers--;nworkers = Min(nworkers, pvs->pcxt->nworkers);/* ... mark each index's parallel_workers_can_process ... */pg_atomic_write_u32(&(pvs->shared->idx), 0);if (nworkers > 0){ pg_atomic_write_u32(&(pvs->shared->cost_balance), VacuumCostBalance); pg_atomic_write_u32(&(pvs->shared->active_nworkers), 0); ReinitializeParallelWorkers(pvs->pcxt, nworkers); LaunchParallelWorkers(pvs->pcxt); /* ... enable shared cost balance for leader ... */}/* leader vacuums parallel-unsafe indexes itself, then joins the safe loop */parallel_vacuum_process_unsafe_indexes(pvs);parallel_vacuum_process_safe_indexes(pvs);Each participant (worker via parallel_vacuum_main(), leader inline) runs the
same claim loop in parallel_vacuum_process_safe_indexes():
// parallel_vacuum_process_safe_indexes — src/backend/commands/vacuumparallel.cfor (;;){ int idx = pg_atomic_fetch_add_u32(&(pvs->shared->idx), 1); if (idx >= pvs->nindexes) break; /* all indexes claimed */ indstats = &(pvs->indstats[idx]); if (!indstats->parallel_workers_can_process) continue; /* leader handles unsafe ones */ parallel_vacuum_process_one_index(pvs, pvs->indrels[idx], indstats);}Two index categories never go to workers: parallel-unsafe indexes (whose
AM declares it cannot bulk-delete in a worker, e.g. via missing
VACUUM_OPTION_PARALLEL_BULKDEL) and indexes too small to be worth a worker.
parallel_vacuum_process_unsafe_indexes() handles those in the leader before
it joins the safe loop, so the leader is never idle while workers churn.
Figure 3 — Parallel index-vacuum worker fan-out
flowchart TD
A["lazy_vacuum_all_indexes<br/>ParallelVacuumIsActive?"] -- yes --> B["parallel_vacuum_process_all_indexes"]
A -- no --> S["leader-only loop:<br/>lazy_vacuum_one_index per index"]
B --> C["compute nworkers<br/>nworkers-- (leader participates)"]
C --> D["LaunchParallelWorkers<br/>set shared cost_balance"]
D --> E["leader: process_unsafe_indexes<br/>(parallel-unsafe + tiny indexes)"]
D --> W1["worker 1<br/>parallel_vacuum_main"]
D --> W2["worker 2<br/>parallel_vacuum_main"]
D --> Wn["worker N<br/>parallel_vacuum_main"]
E --> F["leader joins:<br/>process_safe_indexes"]
W1 --> Q["atomic claim:<br/>idx = fetch_add(shared.idx, 1)"]
W2 --> Q
Wn --> Q
F --> Q
Q --> ONE["parallel_vacuum_process_one_index<br/>ambulkdelete / amvacuumcleanup"]
ONE --> Q
Q -- idx >= nindexes --> WAIT["WaitForParallelWorkersToFinish<br/>InstrAccumParallelQuery"]
WAIT --> CARRY["carry shared cost balance<br/>back to leader heap scan"]
Figure 3 — Index fan-out. The shared TidStore (via dead_items_handle) is read by every participant; shared.idx is an atomic counter that workers and the leader fetch-and-add to claim indexes, giving lock-free work-stealing. The leader handles parallel-unsafe and tiny indexes itself, then joins the safe loop, so it is never idle. After the round, shared cost balance is folded back into the leader’s VacuumCostBalance before the heap scan resumes.
Source Walkthrough
Section titled “Source Walkthrough”Entry and parameter flow
Section titled “Entry and parameter flow”ExecVacuum—src/backend/commands/vacuum.c— parsesVacuumStmt, buildsVacuumParams, callsvacuum().vacuum—vacuum.c— expands relation list; per-relation loop.vacuum_rel—vacuum.c— opens relation, acquires lock, callstable_relation_vacuum()→heap_vacuum_rel().vacuum_get_cutoffs—vacuum.c— computesVacuumCutoffs:OldestXmin,OldestMxact,FreezeLimit,MultiXactCutoff.
Core heap vacuum
Section titled “Core heap vacuum”heap_vacuum_rel—vacuumlazy.c— allocatesLVRelState, callslazy_scan_heap, finalizes stats.heap_vacuum_eager_scan_setup—vacuumlazy.c— computes eager-freeze parameters (eager_scan_remaining_successes, etc.).lazy_scan_heap—vacuumlazy.c— outer loop over pages usingReadStream; triggerslazy_vacuum()when TID store fills.heap_vac_scan_next_block—vacuumlazy.c— ReadStream callback; consults VM to skip all-visible / all-frozen pages.find_next_unskippable_block—vacuumlazy.c— advancesvacrel->next_unskippable_blockpast skippable pages.lazy_scan_prune—vacuumlazy.c— per-page: callsheap_page_prune_and_freeze, collects LP_DEAD offsets.lazy_scan_noprune—vacuumlazy.c— per-page without cleanup lock: counts existing LP_DEAD items without modifying them.heap_page_prune_and_freeze—pruneheap.c— prunes HOT chains, freezes tuples, returnsPruneFreezeResult.
Phases II and III
Section titled “Phases II and III”lazy_vacuum—vacuumlazy.c— decides bypass vs. full phases II+III.lazy_vacuum_all_indexes—vacuumlazy.c— callsambulkdeleteon each index (possibly via parallel workers).lazy_vacuum_one_index—vacuumlazy.c— callsindex_bulk_deletefor a single index.lazy_vacuum_heap_rel—vacuumlazy.c— second heap pass over LP_DEAD pages; callslazy_vacuum_heap_page.lazy_vacuum_heap_page—vacuumlazy.c— marks LP_DEAD offsetsLP_UNUSED; updates FSM and VM.
Wraparound and cleanup
Section titled “Wraparound and cleanup”lazy_check_wraparound_failsafe—vacuumlazy.c— engages failsafe ifrelfrozenxidis dangerously old.vacuum_xid_failsafe_check—vacuum.c— computes XID distance and returns true if failsafe threshold exceeded.lazy_cleanup_all_indexes—vacuumlazy.c— callsambulkcleanupon each index after the final vacuum pass.lazy_truncate_heap—vacuumlazy.c— truncates trailing empty pages.vac_update_relstats—vacuum.c— updatespg_classrows.vac_truncate_clog—vacuum.c— advances CLOG truncation horizon.
Parallel vacuum
Section titled “Parallel vacuum”parallel_vacuum_init—vacuumparallel.c— sets up DSM, forks workers.PVShared—vacuumparallel.c— DSM struct shared by leader + workers.vacuum_delay_point—vacuum.c— cost-based sleep; readsVacuumSharedCostBalancein parallel mode.
Position hints (as of 2026-06-05 / commit 273fe94)
Section titled “Position hints (as of 2026-06-05 / commit 273fe94)”| Symbol | File | Line |
|---|---|---|
ExecVacuum | src/backend/commands/vacuum.c | 162 |
vacuum_get_cutoffs | src/backend/commands/vacuum.c | 1116 |
vacuum_xid_failsafe_check | src/backend/commands/vacuum.c | 1284 |
vacuum_rel | src/backend/commands/vacuum.c | 2018 |
VacuumParams | src/include/commands/vacuum.h | 217 |
VacuumCutoffs | src/include/commands/vacuum.h | 253 |
VacDeadItemsInfo | src/include/commands/vacuum.h | 292 |
LVRelState | src/backend/access/heap/vacuumlazy.c | 259 |
SKIP_PAGES_THRESHOLD | src/backend/access/heap/vacuumlazy.c | 209 |
MAX_EAGER_FREEZE_SUCCESS_RATE | src/backend/access/heap/vacuumlazy.c | 241 |
EAGER_SCAN_REGION_SIZE | src/backend/access/heap/vacuumlazy.c | 250 |
BYPASS_THRESHOLD_PAGES | src/backend/access/heap/vacuumlazy.c | 187 |
FAILSAFE_EVERY_PAGES | src/backend/access/heap/vacuumlazy.c | 193 |
heap_vacuum_rel | src/backend/access/heap/vacuumlazy.c | 615 |
heap_vacuum_eager_scan_setup | src/backend/access/heap/vacuumlazy.c | 488 |
lazy_scan_heap | src/backend/access/heap/vacuumlazy.c | 1200 |
lazy_scan_prune | src/backend/access/heap/vacuumlazy.c | 1944 |
lazy_vacuum | src/backend/access/heap/vacuumlazy.c | 2450 |
lazy_vacuum_all_indexes | src/backend/access/heap/vacuumlazy.c | 2575 |
lazy_vacuum_heap_rel | src/backend/access/heap/vacuumlazy.c | 2720 |
lazy_cleanup_all_indexes | src/backend/access/heap/vacuumlazy.c | 3003 |
lazy_vacuum_one_index | src/backend/access/heap/vacuumlazy.c | 3071 |
lazy_check_wraparound_failsafe | src/backend/access/heap/vacuumlazy.c | 2950 |
heap_page_prune_and_freeze | src/backend/access/heap/pruneheap.c | 350 |
PVShared | src/backend/commands/vacuumparallel.c | 57 |
parallel_vacuum_init | src/backend/commands/vacuumparallel.c | 243 |
parallel_vacuum_process_all_indexes | src/backend/commands/vacuumparallel.c | 611 |
parallel_vacuum_process_safe_indexes | src/backend/commands/vacuumparallel.c | 774 |
parallel_vacuum_process_unsafe_indexes | src/backend/commands/vacuumparallel.c | 828 |
parallel_vacuum_process_one_index | src/backend/commands/vacuumparallel.c | 865 |
parallel_vacuum_main | src/backend/commands/vacuumparallel.c | 989 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
LVRelStateis allocated inheap_vacuum_rel()and freed on exit. Confirmed at line 617:vacrel = (LVRelState *) palloc0(sizeof(LVRelState)). It is apalloc0(not in a longer-lived context), so all counters start at zero automatically. -
Dead items are stored in a
TidStore, not a flat array. As of REL_18,vacrel->dead_itemsis aTidStore *, replacing the oldLVDeadItemsflat array from earlier releases. TheTidStorecan spill to disk whenmaintenance_work_memis exceeded, whereas the old design would trigger a flush atmax_itemscount. This change landed in PG17. The position-hint table references the PG18 interface. -
SKIP_PAGES_THRESHOLD = 32is a compile-time constant. Defined at line 209 invacuumlazy.c. It is not a GUC and cannot be changed at runtime. The comment at line 51 explains it exists to preserve kernel readahead benefit even on mostly-clean tables. -
Eager freeze scanning is controlled by
vacuum_max_eager_freeze_failure_rate. This GUC (default not 0) was added in PG18. It gates the entire eager scanning feature: when it is 0,heap_vacuum_eager_scan_setup()returns immediately without enabling eager scanning (verified at line 507–508). -
Wraparound failsafe disables cost delay. At line 2990,
VacuumCostActive = falseandVacuumCostBalance = 0are set unconditionally when the failsafe engages, allowing vacuum to run at full I/O speed. This is intentional: completing the freeze is more important than I/O fairness. -
Parallel vacuum is used only for index phases, not for heap scan. Confirmed by inspection of
vacuumparallel.c:parallel_vacuum_init()is called fromlazy_vacuum_all_indexes()andlazy_cleanup_all_indexes(), not fromlazy_scan_heap()orlazy_vacuum_heap_rel(). -
vacuum_rel()invacuum.cadvancesrelfrozenxiddatabase-wide. After per-relation vacuuming,vac_update_datfrozenxid()is called (unlessVACOPT_SKIP_DATABASE_STATSis set), which updatespg_database.datfrozenxid. This is the second level of freeze tracking (table level =pg_class.relfrozenxid, database level =pg_database.datfrozenxid) thatvac_truncate_clog()uses to decide how much CLOG can be discarded.
Open questions
Section titled “Open questions”-
TidStorespill-to-disk path undermaintenance_work_mempressure. The code comment saysTidStorecan spill to disk (this is the stated rationale for the PG17 redesign from flat array to TidStore). Thetidstore.cimplementation was not fully traced in this verification pass. Investigation path: readsrc/backend/access/common/tidstore.cto confirm the spill path and whether it is exercised in the regression tests. -
GlobalVisState *vistestvs.VacuumCutoffs.OldestXmin. Both express “what tuples are dead”, butvistestis a wrapper that factors in GlobalVisibility for each tuple on a page. The exact relationship betweenvacrel->vistestandvacrel->cutoffs.OldestXminin the context ofheap_page_prune_and_freeze()was not traced to theGlobalVisStatestruct definition. Investigation path: readsrc/include/access/heapam.handGlobalVisTestIsRemovableXid().
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
InnoDB’s purge thread. InnoDB performs MVCC garbage collection via a dedicated background purge thread (and, since MySQL 5.7, a purge thread pool). The key difference from PostgreSQL is that InnoDB keeps old row versions in a separate undo log (rollback segments), not in the primary table file. Purge reclaims undo segments rather than marking in-place slots as LP_UNUSED. This means InnoDB’s reclamation does not require an index pass (the primary clustered index already has the new version), but the undo segments can grow large under long transactions. A comparative analysis of undo-based vs. in-place MVCC reclamation would be a natural companion to this document and
postgres-heap-am.md. -
VACUUM FULL vs. CLUSTER vs. pg_repack. PostgreSQL’s lazy VACUUM leaves bloat — freed LP_UNUSED slots within pages, and empty pages at the tail after truncation.
VACUUM FULL/CLUSTERrewrites the relation but requires an exclusive lock. Thepg_repackextension (out of scope for this core-only tree) performs an online repack using a trigger-maintained shadow table and a brief lock switchover; the technique is described in its source documentation. -
Vampire vacuuming and long transactions. A long-running transaction holds its snapshot’s
OldestXminhorizon fixed, preventing vacuum from advancing the dead-tuple horizon. This is the “vacuum being blocked by a long transaction” problem visible inpg_stat_activity. Monitoring and mitigation strategies (idle-in-transaction timeouts,old_snapshot_thresholdGUC) are discussed inpostgres-mvcc-snapshots.md. -
XID wraparound history. The PG16 release introduced the integer4
age(relfrozenxid)monitoring function and an improved autovacuum urgency model; PG17 introduced theTidStoreredesign. The full evolution of the freeze mechanism across major releases belongs inpostgres-evolution-vacuum-visibility.md(planned; see coverage map). -
PostgreSQL’s ARIES lineage. The WAL-based crash recovery that makes vacuum’s steal/no-force policy safe is documented in
postgres-xlog-wal.mdand anchored in the ARIES paper (dbms-papers/aries.md). Vacuum’s interaction with WAL is indirect: freezing writes WAL records (viaheap_page_prune_and_freeze()callingXLogInsert), and LP_UNUSED transitions also generate WAL. Those log records are what allow a replayed recovery to stay consistent with the frozen/reaped state vacuum produced before the crash.
Sources
Section titled “Sources”Raw sources consumed
Section titled “Raw sources consumed”None (this document was synthesized directly from the PostgreSQL source tree at REL_18_STABLE, commit 273fe94).
Source code paths
Section titled “Source code paths”src/backend/commands/vacuum.c— entry, param parsing, cutoff computation, CLOG truncation, stats update.src/backend/commands/vacuumparallel.c— parallel index vacuum setup, worker protocol,PVSharedDSM layout.src/backend/access/heap/vacuumlazy.c—heap_vacuum_rel,lazy_scan_heap, all phase-specific functions, wraparound failsafe, eager scanning.src/include/commands/vacuum.h—VacuumParams,VacuumCutoffs,VacDeadItemsInfo,VacOptValue,VACOPT_*flags.src/backend/access/heap/pruneheap.c—heap_page_prune_and_freeze(), HOT pruning, per-page freeze logic.src/backend/access/heap/visibilitymap.c— VM bit set/clear operations called fromlazy_vacuum_heap_pageandlazy_scan_prune.src/backend/storage/freespace/freespace.c— FSM update after heap reap.
Textbook anchors
Section titled “Textbook anchors”- Database Internals (Petrov, 2019), ch. 5 §“MVCC Versions and Cleanup” — MVCC garbage collection framing.
- Database System Concepts, 7e (Silberschatz et al.), ch. 15 §“Snapshot Isolation” — theoretical framing of dead-version visibility.
Related documents in this tree
Section titled “Related documents in this tree”postgres-heap-am.md— heap tuple layout, HOT update chains, LP_* states, visibility map bit mechanics.postgres-mvcc-snapshots.md—OldestXminderivation from procarray, snapshot horizons, long-transaction vacuum blocking.postgres-xlog-wal.md— WAL records emitted by pruning and freezing; ARIES steal/no-force contract.postgres-xact.md— transaction lifecycle,relfrozenxidsemantics.postgres-autovacuum.md(planned) — autovacuum daemon that drivesheap_vacuum_rel()automatically.postgres-xid-wraparound-freeze.md(planned) — deep dive on freeze mechanics and varsup.c XID allocation.dbms-papers/aries.md— ARIES recovery paper underpinning WAL and the steal/no-force policy vacuum depends on.