PostgreSQL Autovacuum — The Launcher, Workers, and Anti-Wraparound Scheduling
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Autovacuum is the policy layer above PostgreSQL’s MVCC garbage
collector. MVCC (multi-version concurrency control) buys
read-without-blocking by never overwriting a row in place: an UPDATE
or DELETE leaves the old tuple version on the heap page, visible to
transactions whose snapshot predates the change, and the new version
is appended. Database System Concepts (Silberschatz, 7e, §18.7
“Multiversion Schemes”) states the consequence plainly — multiversion
schemes “require that old versions of data items be deleted at some
point”, and the deletion “can be done only when no transaction that
can read the old version is still active.” That deletion is the
vacuum operation. The question this document answers is not how
vacuum reclaims a dead tuple (that is postgres-vacuum.md) but who
decides when to run it, on which table, in which database, and how
hard — the scheduling problem.
The scheduling problem has a hard deadline buried inside it that pure
garbage-collection theory does not surface: transaction-id
wraparound. PostgreSQL stamps every row version with a 32-bit
xmin/xmax and decides visibility by comparing transaction ids in a
modular (circular) space, where “older than” means “within the roughly two
billion ids behind in the ring.” If a table holds a live row whose inserting
transaction is never frozen, then after ~2 billion further
transactions that id wraps from “ancient past” to “distant future” and
the row silently becomes invisible — catastrophic, undetectable data
loss. Database Internals (Petrov, ch. 5, on MVCC and version
maintenance) frames vacuum as the maintenance task that bounds the
version space; PostgreSQL’s specific bound is the freeze: rewrite
an old tuple’s xmin to a frozen marker so it is unconditionally
visible and its original id can be reused. Freezing only happens
inside a vacuum. So vacuum is doing two unrelated jobs — space
reclamation and wraparound prevention — and the scheduler must serve
both, with the second being a correctness deadline, not an
optimization.
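The circular comparison described above can be sketched in a few lines of C. This is a minimal illustration rather than PostgreSQL's own code: the names `xid_precedes` and `xid_force_limit` are hypothetical, the sketch ignores the special treatment of reserved (non-normal) transaction ids, and it assumes the usual two's-complement result when an unsigned difference is converted to a signed 32-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if id1 is "older than" id2 in a circular
 * 32-bit id space.  The unsigned difference is reinterpreted as signed,
 * so id1 counts as older when it lies within the ~2^31 ids behind id2.
 */
static bool
xid_precedes(uint32_t id1, uint32_t id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}

/*
 * Hypothetical helper: the id below which an unfrozen horizon counts as
 * overdue, given the next id to be assigned and a maximum allowed age.
 * Unsigned subtraction wraps around the ring by construction.
 */
static uint32_t
xid_force_limit(uint32_t next_xid, uint32_t freeze_max_age)
{
    assert(freeze_max_age < UINT32_C(0x80000000));
    return next_xid - freeze_max_age;
}
```

With these definitions `xid_precedes(100, 200)` holds, and so does `xid_precedes(4294967000, 50)`: once the counter wraps, 50 is only a few hundred ids ahead of 4294967000, so the larger number is the older id.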
Three design tensions shape any automatic-vacuum scheduler, and they are the knobs PostgreSQL turns:
- When is a table “dirty enough” to vacuum? Vacuuming a table that has accrued three dead tuples wastes I/O; waiting until a high-churn table is 90% bloat wastes disk and slows scans. The standard answer is a threshold relative to table size: vacuum when dead tuples exceed base + scale × live_tuples. The constant base handles tiny tables (don’t thrash a 10-row table); the size-proportional term handles large ones (a 100M-row table can tolerate more absolute dead tuples before a vacuum pays off).
- How to share one machine across many tables and databases without starving anyone. A naive scheduler that always picks the dirtiest table starves small databases; one that round-robins databases ignores urgency. PostgreSQL splits the decision: a round-robin across databases for fairness, and a threshold-driven choice within a database for urgency, with wraparound risk overriding both.
- How hard to push. Vacuum is I/O-heavy and competes with foreground queries. The classic control is a cost-based delay: the vacuum accumulates a “cost” for every page it touches and sleeps when the cost crosses a limit, throttling itself. When several vacuums run at once, the aggregate throttle must stay bounded, so the budget is divided across the concurrent workers, which makes it a distributed rate-limiter.
The autovacuum subsystem is the embodiment of these three answers. It
is deliberately outside the vacuum mechanism: vacuum can always be
run by hand (VACUUM), and the same throttling code serves manual and
automatic runs alike. Autovacuum is the daemon that issues the command so
that nobody has to type it.
Common DBMS Design
The textbook gives the model (multiversion deletion, wraparound bound, threshold scheduling, rate-limited maintenance). This section names the engineering conventions that recur across production engines that bolt an automatic maintenance daemon onto an MVCC or deferred-cleanup storage layer: PostgreSQL, Oracle (its automatic segment/space advisors and SMON undo cleanup), SQL Server (ghost record cleanup + auto-stats), MySQL/InnoDB (purge threads), CUBRID (its dedicated vacuum workers). PostgreSQL’s specific choices in the next section read as one set of dials within this shared space.
A long-lived scheduler plus short-lived executors
Almost every engine separates the decision process from the work process. A single always-on scheduler (a daemon, a coordinator thread) holds the global picture (which objects are stale, which deadlines loom) and dispatches work to a pool of executors that do one unit of work and exit (or return to a pool). The split keeps the global state in one place (no N-way coordination over “who is vacuuming what”) and makes the executors disposable: an executor that crashes or is killed mid-vacuum leaves the scheduler’s bookkeeping intact, and the next dispatch simply re-picks the table. PostgreSQL’s launcher/worker pair is exactly this shape; InnoDB’s purge coordinator + purge workers is the same idea inside one process.
A bounded pool sized by a parameter
Maintenance must not be allowed to consume the whole machine, so the
executor pool is capped by a configuration parameter (max workers).
The scheduler tracks free vs. busy executors in a small fixed-size
shared structure — a free list and a running list — so “can I dispatch
another?” is a constant-time check. The cap is a ceiling, not a
target: the pool sits idle when nothing is dirty.
Statistics-driven thresholds, not a fixed timetable
Rather than “vacuum every table every hour,” the scheduler consults cumulative activity statistics (dead-tuple counts, insert counts, modification counts since the last maintenance) maintained by the running system as a side effect of DML. A table that nobody writes is never scheduled by this policy; a hot table is scheduled often. The threshold is a formula over those counters and the table’s size, with per-object overrides so a pathological table can be tuned without touching the global knobs.
A hard deadline that overrides the soft policy
Layered on top of the soft “dirty enough” policy is a forced path for the correctness deadline (wraparound, undo-space exhaustion, log-space pressure; the specifics vary by engine). When an object crosses the hard limit, the scheduler must service it regardless of how clean it looks and regardless of whether the operator disabled routine maintenance. This forced path typically also changes which object the scheduler picks first (most-endangered first) and whether the executor may be interrupted (usually it may not be, or only with difficulty).
A shared, divided throttle
To keep aggregate maintenance I/O bounded while several executors run, the rate limit is a shared quantity divided across the active executors. Each executor periodically reads the current divisor from shared memory and recomputes its personal limit, so adding or removing an executor re-balances the others without a central rendezvous. The division is the distributed form of the textbook’s single rate-limiter.
A side queue for ad-hoc maintenance requests
Beyond the statistics-driven schedule, other parts of the engine occasionally need a specific maintenance action (“summarize this index range now”). The convention is a small fixed-size work-item queue in the scheduler’s shared memory that any backend can post into and that the executors drain opportunistically, so one-off requests piggyback on the existing executor pool instead of spawning bespoke machinery.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL entity |
|---|---|
| Long-lived scheduler | AutoVacLauncherMain (the B_AUTOVAC_LAUNCHER process) |
| Short-lived executor | AutoVacWorkerMain (one B_AUTOVAC_WORKER per dispatch) |
| Bounded executor pool | autovacuum_worker_slots slots on the free list av_freeWorkers in AutoVacuumShmem |
| “Can I dispatch?” check | av_worker_available (free slots vs. reserved) |
| Cross-database fairness | DatabaseList round-robin built by rebuild_database_list |
| Within-database urgency | relation_needs_vacanalyze threshold equation |
| Activity statistics | PgStat_StatTabEntry (dead_tuples, ins_since_vacuum, mod_since_analyze) |
| Threshold formula | base + scale × reltuples, clamped by vac_max_thresh |
| Hard deadline | relfrozenxid/relminmxid vs. recentXid - freeze_max_age → force_vacuum |
| Most-endangered-first | do_start_worker picks oldest datfrozenxid when for_xid_wrap |
| Shared divided throttle | av_nworkersForBalance + AutoVacuumUpdateCostLimit |
| Ad-hoc request queue | av_workItems[NUM_WORKITEMS] + AutoVacuumRequestWork |
By the time the reader reaches av_nworkersForBalance in the next
section, they already know what kind of thing it is: the divisor of a
distributed rate-limiter.
PostgreSQL’s Approach
PostgreSQL implements the whole scheduler in one file,
src/backend/postmaster/autovacuum.c (~3,475 lines at REL_18), with a
tiny public header src/include/postmaster/autovacuum.h. The
architecture is a two-tier process model glued together by one
shared-memory struct and the postmaster’s fork mechanism. This section
walks the design: the shared state, the launcher’s scheduling loop, the
worker’s per-table decision, the cost-balancing protocol, the forced
anti-wraparound path, and the side work-item queue.
Two processes, one shared struct
The launcher never connects to a database and never vacuums anything.
It is a perpetual scheduler that decides which database deserves a
worker next, then asks the postmaster to fork one. Workers are
short-lived: each forked worker attaches to exactly one database, does
“an appropriate amount of work,” and exits. The two tiers share no
memory except AutoVacuumShmem, a single struct (plus a trailing
array of per-slot WorkerInfoData) sized at startup.
```c
// AutoVacuumShmemStruct, src/backend/postmaster/autovacuum.c
typedef struct
{
    sig_atomic_t        av_signal[AutoVacNumSignals];
    pid_t               av_launcherpid;
    dclist_head         av_freeWorkers;         /* WorkerInfo free list */
    dlist_head          av_runningWorkers;      /* WorkerInfo non-free queue */
    WorkerInfo          av_startingWorker;      /* one being started; cleared by the worker */
    AutoVacuumWorkItem  av_workItems[NUM_WORKITEMS];    /* NUM_WORKITEMS == 256 */
    pg_atomic_uint32    av_nworkersForBalance;  /* cost-balance divisor */
} AutoVacuumShmemStruct;
```

The struct is almost entirely protected by one LWLock,
AutovacuumLock. The exceptions are deliberate: av_signal is an
array of sig_atomic_t that remote processes set without locking
(so a backend can flag “rebalance needed” cheaply), and
av_nworkersForBalance is a pg_atomic_uint32 that workers read on a
hot path without taking the lock. Everything else — the worker free
list, the running list, the starting-worker pointer, the work-item
array — moves under AutovacuumLock.
A worker’s whereabouts live in one WorkerInfoData slot, and there
are exactly autovacuum_worker_slots of them, allocated in a flat
array after the fixed struct:
```c
// WorkerInfoData, src/backend/postmaster/autovacuum.c
typedef struct WorkerInfoData
{
    dlist_node      wi_links;       /* entry into free list or running list */
    Oid             wi_dboid;       /* database this worker works on */
    Oid             wi_tableoid;    /* table currently being vacuumed, if any */
    PGPROC         *wi_proc;        /* PGPROC of the running worker, NULL if not started */
    TimestampTz     wi_launchtime;
    pg_atomic_flag  wi_dobalance;   /* include this worker in balance calc? */
    bool            wi_sharedrel;
} WorkerInfoData;
```

The same slot threads onto two lists by its single wi_links node: it
sits on av_freeWorkers when idle and on av_runningWorkers when a
worker owns it. wi_tableoid and wi_sharedrel are the two fields a
worker publishes so other workers can see what it is currently
chewing on (protected by AutovacuumScheduleLock, not the main lock)
— that is how two workers in the same database avoid both grabbing the
same table.
The shared memory is sized and laid out at server start:
```c
// AutoVacuumShmemInit, src/backend/postmaster/autovacuum.c
AutoVacuumShmem = (AutoVacuumShmemStruct *)
    ShmemInitStruct("AutoVacuum Data", AutoVacuumShmemSize(), &found);
// ... condensed ...
worker = (WorkerInfo) ((char *) AutoVacuumShmem +
                       MAXALIGN(sizeof(AutoVacuumShmemStruct)));
for (i = 0; i < autovacuum_worker_slots; i++)
{
    dclist_push_head(&AutoVacuumShmem->av_freeWorkers, &worker[i].wi_links);
    pg_atomic_init_flag(&worker[i].wi_dobalance);
}
pg_atomic_init_u32(&AutoVacuumShmem->av_nworkersForBalance, 0);
```

Note the PG17→PG18 evolution baked in here: the pool is sized by
autovacuum_worker_slots (the count of physical slots reserved at
startup, fixed for the cluster’s life because shared memory cannot
grow), while autovacuum_max_workers is a runtime GUC that caps how
many of those slots autovacuum will actually use. The gap between the
two is a reserve, and av_worker_available enforces it:
```c
// av_worker_available, src/backend/postmaster/autovacuum.c
free_slots = dclist_count(&AutoVacuumShmem->av_freeWorkers);
reserved_slots = autovacuum_worker_slots - autovacuum_max_workers;
reserved_slots = Max(0, reserved_slots);
return free_slots > reserved_slots;
```

This is the PG18 change that lets an operator raise
autovacuum_max_workers with a reload (no restart) up to the
autovacuum_worker_slots ceiling — a frequent prior pain point.
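The reserve arithmetic is easy to check with concrete numbers. The following is a minimal sketch, not the server's code: `worker_available` is a hypothetical stand-in that takes the three quantities as plain integers instead of reading shared memory and GUCs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the check above: a worker may be launched
 * only while the number of free slots exceeds the reserve, where the
 * reserve is the part of the slot pool that max_workers does not cover.
 */
static bool
worker_available(int free_slots, int worker_slots, int max_workers)
{
    int reserved_slots = worker_slots - max_workers;

    assert(free_slots >= 0 && free_slots <= worker_slots);
    if (reserved_slots < 0)
        reserved_slots = 0;
    return free_slots > reserved_slots;
}
```

With 16 slots and a limit of 3 workers the reserve is 13, so a launch is allowed while 14 or more slots are free and refused once 3 workers are running (13 free). Raising the limit to 5 shrinks the reserve to 11, and the same 13 free slots permit another launch without any change to the slot array.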
The overall topology:
```mermaid
flowchart TB
    PM["postmaster<br/>(forks every process)"]
    LA["autovacuum launcher<br/>AutoVacLauncherMain<br/>perpetual scheduler, no DB"]
    subgraph SHM["AutoVacuumShmem (shared memory, AutovacuumLock)"]
        FREE["av_freeWorkers<br/>(free WorkerInfo slots)"]
        RUN["av_runningWorkers<br/>(busy WorkerInfo slots)"]
        START["av_startingWorker<br/>(handoff pointer)"]
        WI["av_workItems[256]<br/>(ad-hoc requests)"]
        NB["av_nworkersForBalance<br/>(atomic divisor)"]
    end
    W1["worker (db A)<br/>AutoVacWorkerMain"]
    W2["worker (db B)<br/>AutoVacWorkerMain"]
    BK["any backend<br/>(BRIN summarize)"]
    PM -->|fork| LA
    LA -->|"do_start_worker:<br/>pick db, fill startingWorker,<br/>signal PMSIGNAL_START_AUTOVAC_WORKER"| PM
    PM -->|fork| W1
    PM -->|fork| W2
    LA --- SHM
    W1 --- SHM
    W2 --- SHM
    BK -->|"AutoVacuumRequestWork"| WI
    W1 -->|"SIGUSR2 'I'm up / I finished'"| LA
```
Figure 1 — The two-tier process model. The launcher never touches a
database; it picks a target database, parks a WorkerInfo in
av_startingWorker, and signals the postmaster to fork the actual
worker. The forked worker claims the parked slot, moves it to the
running list, and signals the launcher back via SIGUSR2. All
coordination is through AutoVacuumShmem under AutovacuumLock. Any
backend can post a one-off request into av_workItems.
The launcher’s scheduling loop
After the standard auxiliary-process boilerplate (signal handlers,
InitProcess, a sigsetjmp error-recovery block stripped down from
PostgresMain), the launcher builds its database list once and enters
a sleep-then-maybe-launch loop:
```c
// AutoVacLauncherMain, src/backend/postmaster/autovacuum.c (condensed)
rebuild_database_list(InvalidOid);

while (!ShutdownRequestPending)
{
    struct timeval nap;

    launcher_determine_sleep(av_worker_available(), false, &nap);
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L),
                     WAIT_EVENT_AUTOVACUUM_MAIN);
    ResetLatch(MyLatch);
    ProcessAutoVacLauncherInterrupts();

    /* ... handle SIGUSR2: rebalance, or retry after a fork failure ... */

    current_time = GetCurrentTimestamp();
    LWLockAcquire(AutovacuumLock, LW_SHARED);
    can_launch = av_worker_available();
    /* ... if av_startingWorker still pending and not timed out, can_launch = false ... */
    LWLockRelease(AutovacuumLock);
    if (!can_launch)
        continue;

    if (dlist_is_empty(&DatabaseList))
        launch_worker(current_time);    /* bootstrap: nothing scheduled yet */
    else
    {
        avl_dbase *avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList);

        if (TimestampDifferenceExceeds(avdb->adl_next_worker, current_time, 0))
            launch_worker(current_time);    /* the due database */
    }
}
```

Two invariants make this loop correct. First, only one worker may be
“starting” at a time: if av_startingWorker is non-NULL the launcher
will not dispatch another, because the forked worker has not yet
claimed its slot and the launcher would otherwise double-book. If a
starting worker takes longer than Min(autovacuum_naptime, 60) seconds
the launcher reclaims its slot and logs a warning — a forked worker
that died before claiming would otherwise wedge the pipeline. Second,
the launcher sleeps until the next due database, computed from the
list, not a fixed tick.
The database round-robin
rebuild_database_list is the cross-database fairness mechanism. It
produces a doubly-linked list of avl_dbase entries, one per database
that has a pgstats entry, ordered so the database due furthest in the
future is at the head and the one due soonest is at the tail — which
is why the loop above reads dlist_tail_element to get the next
target. The “next worker” timestamps are spread evenly across one
autovacuum_naptime interval:
```c
// rebuild_database_list, src/backend/postmaster/autovacuum.c (condensed)
millis_increment = 1000.0 * autovacuum_naptime / nelems;
if (millis_increment <= MIN_AUTOVAC_SLEEPTIME)  /* MIN == 100.0 ms */
    millis_increment = MIN_AUTOVAC_SLEEPTIME * 1.1;
current_time = GetCurrentTimestamp();
for (i = 0; i < nelems; i++)
{
    db = &(dbary[i]);
    current_time = TimestampTzPlusMilliseconds(current_time, millis_increment);
    db->adl_next_worker = current_time;
    dlist_push_head(&DatabaseList, &db->adl_node);  /* later goes nearer head */
}
```

The structure is genuinely subtle: the function first scores databases
(new database = 0, then the existing list in order, then any
get_database_list() leftovers) into a temporary hash, sorts an array by
score, and rebuilds the list so the ordering of databases within the
naptime window is preserved across rebuilds. The effect is that with N
databases and a 60-second naptime, each database gets a worker roughly
every 60 seconds, evenly staggered, and the order is stable so no
database keeps jumping the queue. When a worker is actually launched
for a database, launch_worker pushes that database’s adl_next_worker
out by one full naptime and moves it to the head, so it goes to the
back of the effective queue.
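The staggering arithmetic can be tried in isolation. The sketch below is an assumption-laden model, not the launcher's code: `stagger_schedule` and `MIN_SLEEP_MS` are hypothetical names, timestamps are plain millisecond offsets from "now", and only the spacing rule quoted above (naptime divided by the number of databases, with a 100 ms floor bumped by 10%) is reproduced.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SLEEP_MS 100.0      /* mirrors the 100 ms floor quoted above */

/*
 * Hypothetical helper: fill next_due_ms[i] with the offset (in ms from
 * "now") at which the i-th database becomes due, spreading n databases
 * evenly over one naptime.  Returns the spacing that was used.
 */
static double
stagger_schedule(int naptime_sec, size_t n, double *next_due_ms)
{
    double increment;
    double t = 0.0;

    assert(n > 0 && naptime_sec > 0);
    increment = 1000.0 * naptime_sec / (double) n;
    if (increment <= MIN_SLEEP_MS)
        increment = MIN_SLEEP_MS * 1.1;

    for (size_t i = 0; i < n; i++)
    {
        t += increment;
        next_due_ms[i] = t;
    }
    return increment;
}
```

With a 60-second naptime and 4 databases the spacing is 15 seconds, so the databases come due at 15, 30, 45, and 60 seconds. With 1000 databases the computed 60 ms spacing falls under the floor and is replaced by roughly 110 ms, so in this model one full pass takes about 110 seconds rather than 60.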
```mermaid
flowchart LR
    subgraph DL["DatabaseList — ordered by adl_next_worker"]
        direction LR
        H["head:<br/>db due furthest out"]
        M["...staggered every<br/>naptime/N ms..."]
        T["tail:<br/>db due soonest"]
    end
    SLEEP["launcher_determine_sleep<br/>sleeps until tail's<br/>adl_next_worker"]
    PICK["pick tail database"]
    LW["launch_worker:<br/>next_worker += naptime,<br/>move to head"]
    T --> SLEEP
    SLEEP --> PICK
    PICK --> LW
    LW -->|"this db now at head<br/>(due furthest out)"| H
```
Figure 2 — The database round-robin. The list is kept sorted by
adl_next_worker so the soonest-due database is always at the tail.
The launcher sleeps exactly until that database is due, dispatches a
worker, then pushes that database’s next slot one naptime into the
future and moves it to the head. Over one autovacuum_naptime window
every database is visited once, evenly spaced. Anti-wraparound is the
exception that bypasses this ordering — see below.
The worker’s per-table decision
A forked worker (AutoVacWorkerMain) claims the parked WorkerInfo,
moves it to av_runningWorkers, connects to its assigned database, and
calls do_autovacuum. That function scans pg_class twice (main
tables and matviews first, then TOAST tables, because a TOAST table
inherits its parent’s reloptions), and for every relation calls the
heart of the policy — relation_needs_vacanalyze — to decide three
booleans: vacuum, analyze, force-for-wraparound.
The threshold equation is the textbook formula. Parameters come from the table’s reloptions if set, else the global GUCs:
```c
// relation_needs_vacanalyze, src/backend/postmaster/autovacuum.c (condensed)
vac_scale_factor = (relopts && relopts->vacuum_scale_factor >= 0)
    ? relopts->vacuum_scale_factor : autovacuum_vac_scale;
vac_base_thresh = (relopts && relopts->vacuum_threshold >= 0)
    ? relopts->vacuum_threshold : autovacuum_vac_thresh;
/* PG18: a hard ceiling on the computed vacuum threshold; -1 disables it */
vac_max_thresh = (relopts && relopts->vacuum_max_threshold >= -1)
    ? relopts->vacuum_max_threshold : autovacuum_vac_max_thresh;
// ... ins and analyze parameters resolved the same way ...

vactuples = tabentry->dead_tuples;
instuples = tabentry->ins_since_vacuum;
anltuples = tabentry->mod_since_analyze;

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
if (vac_max_thresh >= 0 && vacthresh > (float4) vac_max_thresh)
    vacthresh = (float4) vac_max_thresh;

vacinsthresh = (float4) vac_ins_base_thresh +
    vac_ins_scale_factor * reltuples * pcnt_unfrozen;
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;

*dovacuum = force_vacuum || (vactuples > vacthresh) ||
    (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
*doanalyze = (anltuples > anlthresh);
```

Three numbers feed the decision, all from cumulative statistics:
dead_tuples (the bloat metric, drives the classic vacuum threshold),
ins_since_vacuum (the insert metric, which since PG13 triggers vacuums of
insert-only tables so even an append-only table gets frozen eventually), and
mod_since_analyze (the planner-statistics staleness metric, drives
analyze). The insert path carries one PG18 refinement worth noting: the
insert threshold is scaled by pcnt_unfrozen, the fraction of the
table’s pages that are not already all-frozen (derived from
relallfrozen/relpages), so an insert-heavy table whose old pages are
already frozen is judged on the inserts into its still-active region —
the engine does not re-vacuum a table that is mostly settled.
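To see how the base term, the proportional term, and the ceiling interact, the dead-tuple branch of the decision can be modeled on its own. This is a simplified sketch under stated assumptions: `VacParams`, `vacuum_threshold`, and `needs_vacuum` are hypothetical names, the insert and analyze branches are left out, and the parameter values used in the note after the block are illustrative rather than read from any configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical parameter bundle; field names are illustrative only. */
typedef struct VacParams
{
    double base_thresh;     /* constant term */
    double scale_factor;    /* fraction of reltuples */
    double max_thresh;      /* ceiling on the threshold; negative disables */
} VacParams;

/* Threshold = base + scale * reltuples, clamped by the optional ceiling. */
static double
vacuum_threshold(const VacParams *p, double reltuples)
{
    double thresh;

    assert(reltuples >= 0.0);
    thresh = p->base_thresh + p->scale_factor * reltuples;
    if (p->max_thresh >= 0.0 && thresh > p->max_thresh)
        thresh = p->max_thresh;
    return thresh;
}

/* A table is due when forced or when dead tuples exceed the threshold. */
static bool
needs_vacuum(const VacParams *p, double reltuples, double dead_tuples,
             bool force_vacuum)
{
    return force_vacuum || dead_tuples > vacuum_threshold(p, reltuples);
}
```

Taking a base of 50, a scale factor of 0.2, and a ceiling of 100,000,000 as example values, a 10-row table is due after about 52 dead tuples, a 1,000,000-row table after about 200,050, and a 1,000,000,000-row table at the ceiling of 100,000,000 instead of the unclamped 200,000,050. A forced table is due regardless of its dead-tuple count.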
```mermaid
flowchart TD
    START["relation_needs_vacanalyze(rel)"]
    XID{"relfrozenxid older than<br/>recentXid - freeze_max_age?"}
    MXID{"relminmxid older than<br/>recentMulti - mxid_freeze_max_age?"}
    FORCE["force_vacuum = true<br/>wraparound = true"]
    ENABLED{"autovacuum enabled<br/>for this table?"}
    SKIP["dovacuum = false<br/>doanalyze = false<br/>(unless forced)"]
    STATS{"pgstats entry exists<br/>and autovacuum active?"}
    THRESH["dovacuum = forced OR dead>vacthresh<br/>OR ins>vacinsthresh<br/>doanalyze = mod>anlthresh"]
    ONLYFORCE["dovacuum = force_vacuum<br/>doanalyze = false"]
    START --> XID
    XID -->|yes| FORCE
    XID -->|no| MXID
    MXID -->|yes| FORCE
    MXID -->|no| ENABLED
    FORCE --> STATS
    ENABLED -->|"no, and not forced"| SKIP
    ENABLED -->|"yes, or forced"| STATS
    STATS -->|yes| THRESH
    STATS -->|no| ONLYFORCE
```
Figure 3 — The per-table decision tree. The wraparound check runs
first and sets force_vacuum; a forced table is vacuumed even if the
operator disabled autovacuum for it (the !av_enabled && !force_vacuum
early return only fires when not forced). Only after the forced path
is settled does the soft threshold equation run, and only for tables
with live statistics.
Claiming a table without colliding
Once do_autovacuum has its list of OIDs to process, it loops over
them, and before touching each table it must make sure no other worker
has already claimed the same relation. It holds AutovacuumScheduleLock while it
both checks the running workers and publishes its own claim:
```c
// do_autovacuum, src/backend/postmaster/autovacuum.c (condensed claim)
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
LWLockAcquire(AutovacuumLock, LW_SHARED);
dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
{
    WorkerInfo worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

    if (worker == MyWorkerInfo)
        continue;
    if (!worker->wi_sharedrel && worker->wi_dboid != MyDatabaseId)
        continue;
    if (worker->wi_tableoid == relid)
    {
        skipit = true;
        break;
    }
}
LWLockRelease(AutovacuumLock);
if (skipit)
{
    LWLockRelease(AutovacuumScheduleLock);
    continue;
}

/* claim it before releasing the schedule lock */
MyWorkerInfo->wi_tableoid = relid;
MyWorkerInfo->wi_sharedrel = isshared;
LWLockRelease(AutovacuumScheduleLock);

tab = table_recheck_autovac(relid, table_toast_map, pg_class_desc,
                            effective_multixact_freeze_max_age);
if (tab == NULL)
{
    /* someone else did it; release claim */
    continue;
}
```

The publish-then-recheck pattern is the standard “claim and verify”
under a small race window: a worker publishes wi_tableoid while
holding the schedule lock, then re-reads the statistics
(table_recheck_autovac) because another worker might have vacuumed
the table between the first-pass scan and now. If the recheck says the
table no longer needs work, the worker releases the claim and moves on.
Shared catalogs (relisshared) are visible to workers in any
database, so the collision check honors wi_sharedrel across database
boundaries.
Cost-delay balancing across workers
Each table the worker processes also updates the worker’s place in the
shared cost-balance scheme. Vacuum throttles itself by accumulating a
cost per page and sleeping when it crosses vacuum_cost_limit; with
multiple workers active, that limit is divided so the aggregate I/O
rate stays near the single-worker target. The divisor lives in shared
memory and is recomputed whenever the set of balancing workers changes:
```c
// autovac_recalculate_workers_for_balance, autovacuum.c (condensed)
dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
{
    WorkerInfo worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

    if (worker->wi_proc == NULL ||
        pg_atomic_unlocked_test_flag(&worker->wi_dobalance))
        continue;
    nworkers_for_balance++;
}
pg_atomic_write_u32(&AutoVacuumShmem->av_nworkersForBalance, nworkers_for_balance);
```

```c
// AutoVacuumUpdateCostLimit, autovacuum.c (condensed)
if (av_storage_param_cost_limit > 0)
    vacuum_cost_limit = av_storage_param_cost_limit;    /* per-table override: not balanced */
else
{
    vacuum_cost_limit = (autovacuum_vac_cost_limit > 0)
        ? autovacuum_vac_cost_limit : VacuumCostLimit;
    if (pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance))
        return;     /* this worker opted out of balancing */
    nworkers_for_balance = pg_atomic_read_u32(&AutoVacuumShmem->av_nworkersForBalance);
    vacuum_cost_limit = Max(vacuum_cost_limit / nworkers_for_balance, 1);
}
```

The wi_dobalance flag is the opt-out: a table with cost-related
reloptions (its own vacuum_cost_delay/vacuum_cost_limit) is not
folded into the shared budget — the operator asked for a specific rate
on that table, so it runs at that rate and is excluded from the
divisor. Every other worker reads av_nworkersForBalance (atomically,
no lock) on a regular basis through VacuumUpdateCosts and divides the
global limit by it. When a worker starts or finishes a table it signals
AutoVacRebalance, the launcher recomputes the divisor under the lock,
and the running workers pick up the new value on their next check —
the distributed rate-limiter rebalancing without a central rendezvous.
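The division rule is small enough to model as a pure function. The sketch below is illustrative only: `effective_cost_limit` is a hypothetical name, and the shared atomics and flags are replaced by plain arguments so the arithmetic can be checked by hand.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the per-worker limit.  table_limit > 0 stands
 * for a per-table storage parameter; do_balance says whether this worker
 * takes part in the shared budget.
 */
static int
effective_cost_limit(int table_limit, int global_limit, bool do_balance,
                     int nworkers_for_balance)
{
    int limit;

    if (table_limit > 0)
        return table_limit;         /* per-table override: not balanced */

    assert(global_limit > 0);
    if (!do_balance)
        return global_limit;        /* opted out of balancing */

    assert(nworkers_for_balance > 0);   /* the caller itself counts as one */
    limit = global_limit / nworkers_for_balance;
    return limit > 1 ? limit : 1;   /* never round down to zero */
}
```

With a global limit of 200, one balancing worker runs at 200 and three run at 66 each, for an aggregate of 198, slightly under the single-worker target because of integer division. A worker with a per-table limit of 1000 runs at 1000 no matter how many others are active, and the floor of 1 keeps a worker moving even when the divisor exceeds the limit.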
Forced anti-wraparound vacuums
The wraparound deadline pierces every layer. It changes which database the launcher picks, which tables the worker forces, and whether a table with autovacuum disabled can be skipped.
At the database level, do_start_worker computes a force limit and
scans all databases (from pg_database, not just those with stats),
preferring the most-endangered:
```c
// do_start_worker, src/backend/postmaster/autovacuum.c (condensed)
recentXid = ReadNextTransactionId();
xidForceLimit = recentXid - autovacuum_freeze_max_age;
if (xidForceLimit < FirstNormalTransactionId)
    xidForceLimit -= FirstNormalTransactionId;
recentMulti = ReadNextMultiXactId();
multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();

foreach(cell, dblist)
{
    avw_dbase *tmp = lfirst(cell);

    if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
    {
        if (avdb == NULL ||
            TransactionIdPrecedes(tmp->adw_frozenxid, avdb->adw_frozenxid))
            avdb = tmp;
        for_xid_wrap = true;    /* this db is at risk; ignore not-at-risk dbs from here */
        continue;
    }
    else if (for_xid_wrap)
        continue;
    else if (MultiXactIdPrecedes(tmp->adw_minmulti, multiForceLimit))
    {
        /* multixact risk */
    }
    // ... else fall through to "least-recently-autovacuumed" selection ...
}
```

Once any database is found in XID-wraparound danger, for_xid_wrap
latches true and every not-at-risk database is ignored for the rest of
the scan; among at-risk databases the one with the oldest
datfrozenxid wins. XID wraparound outranks MultiXact wraparound,
and both outrank the ordinary “least recently autovacuumed” choice.
This is the most-endangered-first override: a database that has not
been touched in days will still be picked ahead of a busy one if its
frozen-xid horizon is the closest to the wall.
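The XID half of that selection rule can be exercised standalone. What follows is a hedged sketch, not the launcher's code: `DbInfo`, `xid_older`, and `pick_most_endangered` are hypothetical names, the MultiXact and least-recently-autovacuumed branches are omitted, and the circular comparison uses the common signed-difference idiom without the special cases for reserved ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down database descriptor for this sketch. */
typedef struct DbInfo
{
    uint32_t frozenxid;     /* oldest unfrozen id in the database */
} DbInfo;

/* Circular "older than" on 32-bit ids (signed-difference idiom). */
static bool
xid_older(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Return the index of the at-risk database with the oldest frozenxid,
 * or -1 if none is older than the force limit.  In the -1 case a real
 * scheduler would fall back to its ordinary selection rule.
 */
static int
pick_most_endangered(const DbInfo *dbs, size_t n, uint32_t force_limit)
{
    int best = -1;

    assert(dbs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!xid_older(dbs[i].frozenxid, force_limit))
            continue;               /* not at risk */
        if (best < 0 || xid_older(dbs[i].frozenxid, dbs[best].frozenxid))
            best = (int) i;
    }
    return best;
}
```

Given horizons 5000, 1000, and 900000 with a force limit of 10000, the first two databases are at risk and the second one (horizon 1000) is chosen. The comparison stays correct across the wrap: with a force limit of 4294967196, a horizon of 50 counts as newer than the limit and is not at risk, while 4294967000 is.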
At the table level, the same comparison runs inside
relation_needs_vacanalyze and sets force_vacuum, which has three
consequences the soft path lacks:
```c
// relation_needs_vacanalyze, force path (condensed)
xidForceLimit = recentXid - freeze_max_age;
relfrozenxid = classForm->relfrozenxid;
force_vacuum = (TransactionIdIsNormal(relfrozenxid) &&
                TransactionIdPrecedes(relfrozenxid, xidForceLimit));
/* ... else check relminmxid against the multixact force limit ... */
*wraparound = force_vacuum;

if (!av_enabled && !force_vacuum)   /* disabled tables: skip only if NOT forced */
{
    *doanalyze = false;
    *dovacuum = false;
    return;
}
```

First, a forced vacuum runs even when av_enabled is false (the table
or the cluster has autovacuum turned off) — wraparound prevention is
not optional. Second, when the worker is processing such a table, the
config-reload check inside do_autovacuum’s loop deliberately does
not bail out if it sees autovacuum was just disabled, with the
comment “this might be a for-wraparound emergency worker.” Third, a
forced (anti-wraparound) vacuum is harder to cancel: it ignores the
usual signals that would let a conflicting lock request kill an
ordinary autovacuum, because letting a DDL statement repeatedly cancel
the only thing preventing data loss would be a foot-gun. (The cancel
behavior itself lives in the lock manager and vacuum.c; see
postgres-xid-wraparound-freeze.md.)
The side work-item queue
Finally, there is the small fixed queue that lets any backend request a specific
maintenance action. The only producer in core at REL_18 is BRIN index
summarization (AVW_BRINSummarizeRange), posted by brin_summarize_*:
```c
// AutoVacuumRequestWork, src/backend/postmaster/autovacuum.c (condensed)
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_WORKITEMS; i++)     /* NUM_WORKITEMS == 256 */
{
    AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];

    if (workitem->avw_used)
        continue;
    workitem->avw_used = true;
    workitem->avw_active = false;
    workitem->avw_type = type;
    workitem->avw_database = MyDatabaseId;
    workitem->avw_relation = relationId;
    workitem->avw_blockNumber = blkno;
    result = true;
    break;
}
LWLockRelease(AutovacuumLock);
return result;
```

The queue is a flat array of 256 slots; a full queue silently drops the
request (returns false). A worker, after finishing its table list,
drains the items belonging to its database via perform_work_item,
marking each avw_active while it runs so a second worker does not
double-process it. This is the engine’s “piggyback ad-hoc maintenance
on the existing worker pool” convention: no bespoke process, just a
mailbox the workers check on their way out.
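The mailbox idea can be shown end to end in a single-process model. This sketch makes several simplifying assumptions: `WorkItem`, `request_work`, and `drain_work` are hypothetical names, the array lives in ordinary static memory rather than shared memory, and all locking is omitted, so it shows the slot bookkeeping only and not the concurrency control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ITEMS 256               /* same size as the queue above */

/* Hypothetical, simplified work item; locking is omitted in this sketch. */
typedef struct WorkItem
{
    bool     used;      /* slot holds a request */
    bool     active;    /* a worker is currently processing it */
    uint32_t database;
    uint32_t relation;
} WorkItem;

static WorkItem items[NUM_ITEMS];

/* Post a request into the first free slot; false means the queue is full. */
static bool
request_work(uint32_t database, uint32_t relation)
{
    assert(database != 0);          /* 0 means "no database" in this sketch */
    for (int i = 0; i < NUM_ITEMS; i++)
    {
        if (items[i].used)
            continue;
        items[i].used = true;
        items[i].active = false;
        items[i].database = database;
        items[i].relation = relation;
        return true;
    }
    return false;
}

/* Drain every idle item that belongs to my_database; returns the count. */
static int
drain_work(uint32_t my_database)
{
    int done = 0;

    for (int i = 0; i < NUM_ITEMS; i++)
    {
        if (!items[i].used || items[i].active ||
            items[i].database != my_database)
            continue;
        items[i].active = true;     /* claim: others skip it while set */
        /* ... the actual maintenance action would run here ... */
        items[i].active = false;
        items[i].used = false;      /* free the slot for reuse */
        done++;
    }
    return done;
}
```

A caller that gets `false` back has to cope with the request not being queued. In this model a worker for database 1 drains only the items posted for database 1 and leaves the rest for whichever worker next visits their database.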
Source Walkthrough
All symbols are in src/backend/postmaster/autovacuum.c unless noted;
the public surface is in src/include/postmaster/autovacuum.h.
Shared state and lifecycle
- AutoVacuumShmemStruct (struct) — the one shared struct: signal array, launcher pid, worker free/running lists, starting-worker handoff pointer, work-item array, balance divisor.
- WorkerInfoData/WorkerInfo — one per slot; threads onto the free or running list via wi_links; publishes wi_tableoid/wi_sharedrel for collision avoidance.
- avl_dbase (struct) — a launcher-side database-list entry (adl_datid, adl_next_worker, adl_score).
- avw_dbase (struct) — a worker-side database descriptor with adw_frozenxid/adw_minmulti for the wraparound choice.
- AutoVacuumWorkItem (struct) + NUM_WORKITEMS (== 256) — the ad-hoc request queue element and its array size.
- AutoVacuumShmemSize/AutoVacuumShmemInit — size and initialize the segment; seed the free list with autovacuum_worker_slots slots.
- autovac_init — postmaster-time sanity check (warns if track_counts is off).
Launcher
- AutoVacLauncherMain — the scheduler entry point and main loop.
- ProcessAutoVacLauncherInterrupts — handles SIGHUP (reload + rebuild list), shutdown, barriers.
- AutoVacLauncherShutdown — clean exit, clears av_launcherpid.
- launcher_determine_sleep — compute nap time until the next due database, clamped to [MIN_AUTOVAC_SLEEPTIME, MAX_AUTOVAC_SLEEPTIME].
- rebuild_database_list — build the round-robin list, evenly spaced over autovacuum_naptime, ordered by adl_next_worker.
- get_database_list — seqscan pg_database (the launcher’s only transaction).
- db_comparator — qsort comparator on adl_score.
- do_start_worker — choose the target database (wraparound-first, else least-recently-autovacuumed), park a WorkerInfo in av_startingWorker, signal the postmaster.
- launch_worker — wrapper that calls do_start_worker then bumps the chosen database’s adl_next_worker by one naptime.
- av_worker_available — free slots vs. reserved (worker_slots - max_workers).
- avl_sigusr2_handler/AutoVacWorkerFailed — worker-up/finished and fork-failure signaling.
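The database choice made by do_start_worker (wraparound-first, else least-recently-autovacuumed) can be sketched as one pass over the list. This is a simplified model under stated assumptions: DbEntry, id_precedes, and choose_database are hypothetical names, the circular comparison ignores the special reserved ids that the real TransactionIdPrecedes handles, and the real loop also skips databases without statistics or vacuumed within the last naptime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for avw_dbase plus its stats entry. */
typedef struct DbEntry
{
    uint32_t frozenxid;         /* like adw_frozenxid */
    uint32_t minmulti;          /* like adw_minmulti */
    int64_t  last_autovac;      /* smaller means vacuumed longer ago */
} DbEntry;

/* Circular "a is older than b" on 32-bit ids (special ids ignored). */
static bool
id_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Return the index of the database to vacuum next, or -1 for an empty list. */
static int
choose_database(const DbEntry *dbs, int ndbs,
                uint32_t xid_limit, uint32_t multi_limit)
{
    int  best = -1;
    bool for_xid_wrap = false;
    bool for_multi_wrap = false;

    assert(dbs != NULL || ndbs == 0);

    for (int i = 0; i < ndbs; i++)
    {
        const DbEntry *db = &dbs[i];

        if (id_precedes(db->frozenxid, xid_limit))
        {
            /* XID risk: keep the oldest frozenxid seen so far. */
            if (best < 0 || id_precedes(db->frozenxid, dbs[best].frozenxid))
                best = i;
            for_xid_wrap = true;
            continue;
        }
        else if (for_xid_wrap)
            continue;           /* an XID-endangered database already won */
        else if (id_precedes(db->minmulti, multi_limit))
        {
            /* MultiXact risk: keep the oldest minmulti seen so far. */
            if (best < 0 || id_precedes(db->minmulti, dbs[best].minmulti))
                best = i;
            for_multi_wrap = true;
            continue;
        }
        else if (for_multi_wrap)
            continue;

        /* Ordinary case: least recently autovacuumed wins. */
        if (best < 0 || db->last_autovac < dbs[best].last_autovac)
            best = i;
    }
    return best;
}
```

The latch flags are what make the ranking strict: once any endangered database has been seen, later healthy databases are never compared on last_autovac at all.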
Worker
- AutoVacWorkerMain — worker entry point; claims the parked slot, connects to the database, calls do_autovacuum.
- FreeWorkerInfo — on_shmem_exit callback returning the slot to the free list and waking the launcher.
- do_autovacuum — two-pass pg_class scan, orphan-temp-table cleanup, the per-table claim/recheck/vacuum loop, work-item drain.
- extract_autovac_opts — pull AutoVacOpts out of a pg_class reloptions tuple.
- relation_needs_vacanalyze — the threshold + freeze-age decision; outputs dovacuum/doanalyze/wraparound.
- recheck_relation_needs_vacanalyze/table_recheck_autovac — re-evaluate under fresh stats after claiming, building an autovac_table work descriptor or NULL.
- autovacuum_do_vac_analyze — hand the autovac_table to the shared vacuum() entry point.
- perform_work_item/autovac_report_workitem — drain and report a queued AutoVacuumWorkItem.
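The threshold half of relation_needs_vacanalyze is plain arithmetic: base + scale × reltuples, optionally clamped by a maximum. A minimal sketch follows, assuming the inputs have already been resolved from reloptions and GUCs; vacuum_threshold is a hypothetical helper name, and it uses double where the source works in single precision.

```c
#include <assert.h>

/*
 * Compute the dead-tuple count above which a table qualifies for vacuum.
 * A negative max_thresh means "no ceiling".
 */
static double
vacuum_threshold(double base, double scale, double reltuples, double max_thresh)
{
    double thresh;

    assert(base >= 0 && scale >= 0);

    /* A table that has never been vacuumed reports a negative row count. */
    if (reltuples < 0)
        reltuples = 0;

    thresh = base + scale * reltuples;

    /* The optional ceiling stops the threshold from scaling without bound. */
    if (max_thresh >= 0 && thresh > max_thresh)
        thresh = max_thresh;

    return thresh;
}
```

With a base of 50 and a scale of 0.25, a 1,000-row table qualifies above 300 dead tuples, while a billion-row table with a ceiling of 100,000,000 qualifies at the ceiling instead of at 250,000,050.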
Cost balancing
- VacuumUpdateCosts — recompute vacuum_cost_delay/vacuum_cost_limit for this worker (or a manual VACUUM); called at vacuum setup and after reloads.
- AutoVacuumUpdateCostLimit — divide the global limit by av_nworkersForBalance (unless opted out via wi_dobalance or a per-table cost reloption).
- autovac_recalculate_workers_for_balance — recount balancing workers and write av_nworkersForBalance.
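The division performed by AutoVacuumUpdateCostLimit can be sketched in a few lines. This is an illustration only: balanced_cost_limit is a hypothetical name, the dobalance argument stands in for the worker’s wi_dobalance flag, and base_limit stands for whichever limit applies to the worker before balancing.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Share a cost limit evenly among the workers that take part in balancing.
 * A worker that opted out keeps its own limit untouched.
 */
static int
balanced_cost_limit(int base_limit, int nworkers_for_balance, bool dobalance)
{
    int share;

    if (!dobalance)
        return base_limit;

    /* The calling worker counts itself, so the divisor is at least 1. */
    assert(nworkers_for_balance > 0);

    share = base_limit / nworkers_for_balance;

    /* Never let integer division round the limit down to zero. */
    return share > 1 ? share : 1;
}
```

Three balancing workers sharing a limit of 200 each get 66; the floor of 1 matters only when the worker count exceeds the limit.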
Public surface and requests
- AutoVacuumingActive — is the daemon configured on?
- AutoVacuumRequestWork — post an AutoVacuumWorkItem (returns false if the queue is full).
- check_autovacuum_work_mem/check_av_worker_gucs — GUC check hooks.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| avl_dbase (struct) | postmaster/autovacuum.c | 171 |
| avw_dbase (struct) | postmaster/autovacuum.c | 180 |
| WorkerInfoData (struct) | postmaster/autovacuum.c | 231 |
| AutoVacuumWorkItem (struct) | postmaster/autovacuum.c | 263 |
| NUM_WORKITEMS (== 256) | postmaster/autovacuum.c | 273 |
| MIN_AUTOVAC_SLEEPTIME (100.0 ms) | postmaster/autovacuum.c | 139 |
| MAX_AUTOVAC_SLEEPTIME (300 s) | postmaster/autovacuum.c | 140 |
| AutoVacuumShmemStruct (struct) | postmaster/autovacuum.c | 293 |
| AutoVacLauncherMain | postmaster/autovacuum.c | 368 |
| ProcessAutoVacLauncherInterrupts | postmaster/autovacuum.c | 747 |
| AutoVacLauncherShutdown | postmaster/autovacuum.c | 792 |
| launcher_determine_sleep | postmaster/autovacuum.c | 809 |
| rebuild_database_list | postmaster/autovacuum.c | 893 |
| db_comparator | postmaster/autovacuum.c | 1072 |
| do_start_worker | postmaster/autovacuum.c | 1090 |
| launch_worker | postmaster/autovacuum.c | 1302 |
| AutoVacWorkerFailed | postmaster/autovacuum.c | 1354 |
| avl_sigusr2_handler | postmaster/autovacuum.c | 1361 |
| AutoVacWorkerMain | postmaster/autovacuum.c | 1376 |
| FreeWorkerInfo | postmaster/autovacuum.c | 1606 |
| VacuumUpdateCosts | postmaster/autovacuum.c | 1654 |
| AutoVacuumUpdateCostLimit | postmaster/autovacuum.c | 1723 |
| autovac_recalculate_workers_for_balance | postmaster/autovacuum.c | 1769 |
| get_database_list | postmaster/autovacuum.c | 1809 |
| do_autovacuum | postmaster/autovacuum.c | 1885 |
| perform_work_item | postmaster/autovacuum.c | 2605 |
| extract_autovac_opts | postmaster/autovacuum.c | 2719 |
| table_recheck_autovac | postmaster/autovacuum.c | 2749 |
| recheck_relation_needs_vacanalyze | postmaster/autovacuum.c | 2900 |
| relation_needs_vacanalyze | postmaster/autovacuum.c | 2967 |
| autovacuum_do_vac_analyze | postmaster/autovacuum.c | 3173 |
| autovac_report_workitem | postmaster/autovacuum.c | 3248 |
| AutoVacuumingActive | postmaster/autovacuum.c | 3288 |
| AutoVacuumRequestWork | postmaster/autovacuum.c | 3300 |
| autovac_init | postmaster/autovacuum.c | 3342 |
| AutoVacuumShmemSize | postmaster/autovacuum.c | 3359 |
| AutoVacuumShmemInit | postmaster/autovacuum.c | 3378 |
| av_worker_available | postmaster/autovacuum.c | 3449 |
| AutoVacuumWorkItemType (enum) | include/postmaster/autovacuum.h | 23 |
Source verification (as of 2026-06-05)
Verified facts

- The worker pool is sized by autovacuum_worker_slots at startup and capped at runtime by autovacuum_max_workers. Verified in AutoVacuumShmemInit (the free list is seeded with autovacuum_worker_slots entries) and av_worker_available (which subtracts autovacuum_max_workers from autovacuum_worker_slots to compute a reserve) on 2026-06-05. This two-parameter split is the PG17→PG18 change that lets an operator raise autovacuum_max_workers with a reload instead of a restart; shared memory still cannot grow, so autovacuum_worker_slots is the immutable ceiling.
- The launcher dispatches at most one worker at a time and waits for it to claim its slot. Verified in AutoVacLauncherMain — when av_startingWorker is non-NULL the launcher sets can_launch = false and will reclaim the slot only after Min(autovacuum_naptime, 60) seconds, logging “autovacuum worker took too long to start; canceled.” The handshake is: launcher parks the slot, signals the postmaster, the forked worker claims the slot in AutoVacWorkerMain and clears av_startingWorker, then signals the launcher via SIGUSR2.
- The vacuum threshold is base + scale × reltuples, optionally clamped by a maximum. Verified in relation_needs_vacanalyze: vacthresh = vac_base_thresh + vac_scale_factor * reltuples, then if (vac_max_thresh >= 0 && vacthresh > vac_max_thresh) vacthresh = vac_max_thresh. The autovacuum_vacuum_max_threshold ceiling is a PG18 addition (default 100,000,000) so a very large table’s threshold stops scaling without bound; -1 disables the clamp. The three driving counters are dead_tuples, ins_since_vacuum, and mod_since_analyze from PgStat_StatTabEntry.
- Anti-wraparound vacuum runs even when autovacuum is disabled for a table. Verified in relation_needs_vacanalyze: the early-return guard is if (!av_enabled && !force_vacuum), so a forced table is never skipped. Independently confirmed in do_autovacuum’s per-table loop, where the config-reload handler explicitly refuses to bail out on a newly-disabled autovacuum, with the in-source comment that the worker “might be a for-wraparound emergency worker.”
- XID wraparound outranks MultiXact wraparound, which outranks the ordinary least-recently-vacuumed choice, at database-selection time. Verified in do_start_worker: the loop latches for_xid_wrap on the first XID-endangered database and thereafter skips every not-at-risk database with continue; only if no XID risk is found does the MultiXactIdPrecedes branch run; only if neither fires does the last_autovac_time comparison choose. Among endangered databases the oldest adw_frozenxid (resp. adw_minmulti) wins.
- The per-worker cost limit is the global limit divided by av_nworkersForBalance. Verified in AutoVacuumUpdateCostLimit: vacuum_cost_limit = Max(vacuum_cost_limit / nworkers_for_balance, 1). The divisor is recomputed in autovac_recalculate_workers_for_balance by counting running workers whose wi_dobalance flag is set, and it is read atomically (pg_atomic_read_u32) on the worker’s hot path without taking AutovacuumLock. A worker with per-table cost reloptions clears wi_dobalance and is excluded from both the divisor and the division.
- The ad-hoc work-item queue is a flat 256-slot array; a full queue silently drops the request. Verified in AutoVacuumRequestWork (NUM_WORKITEMS == 256; returns false if no free slot is found) and the av_workItems[NUM_WORKITEMS] field of AutoVacuumShmemStruct. The only in-core producer at REL_18 is AVW_BRINSummarizeRange — the AutoVacuumWorkItemType enum in autovacuum.h has exactly that one member.
- The launcher uses a transaction in exactly one place, only to read pg_database. Verified by the header comment on get_database_list (“this is the only function in which the autovacuum launcher uses a transaction”) and by AutoVacLauncherMain calling InitPostgres(NULL, InvalidOid, ...) — it attaches to no specific database.
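The slot arithmetic behind the first verified fact can be sketched as a pure function. This is a minimal model, not the source: worker_available is a hypothetical name, and the three counts are passed as arguments where the real av_worker_available reads the free list in shared memory and the two GUCs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A worker may start only while the free-slot count exceeds the reserve,
 * where the reserve is the part of the pool above the runtime cap.
 */
static bool
worker_available(int free_slots, int worker_slots, int max_workers)
{
    int reserved;

    assert(free_slots >= 0 && free_slots <= worker_slots);

    reserved = worker_slots - max_workers;
    if (reserved < 0)
        reserved = 0;           /* cap above pool size: nothing is held back */

    return free_slots > reserved;
}
```

With 16 slots and a cap of 3, the launcher can start workers while 14, 15, or 16 slots are free and stops at 13, which is exactly three running workers.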
Open questions
- The rebuild_database_list initial hash size is the literal 20, flagged /* magic number here FIXME */ in source. Whether this ever matters for clusters with thousands of databases (the hash simply grows past 20) or is purely cosmetic is unverified. Investigation path: measure rebuild_database_list cost on a cluster with 10k+ databases and check whether the dynahash resize shows up; trace the FIXME through git blame for any prior discussion.
- The fork-failure retry has no cap. AutoVacLauncherMain’s handling of AutoVacForkFailed sleeps 1 second and re-signals the postmaster indefinitely, with an in-source XXX questioning whether a retry limit makes sense. Under sustained fork failure (e.g., process-table exhaustion) the launcher will spin on this path. Whether that is benign or a real availability concern is unverified. Investigation path: reproduce by capping the OS process table and observe launcher log volume and CPU.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- InnoDB purge coordinator + purge threads (MySQL, innodb_purge_threads) — InnoDB’s deferred cleanup of delete-marked records and old undo-log versions is driven by a purge coordinator dispatching to purge worker threads inside one process, the in-process analogue of PostgreSQL’s launcher/worker fork model. A comparison would weigh process isolation (PG: a crashed worker cannot corrupt the scheduler) against thread-pool latency (InnoDB: no fork cost per unit of work).
- Oracle SMON + automatic maintenance tasks — Oracle’s undo-segment cleanup and its automatic optimizer-statistics gathering are split across SMON and the autotask scheduler windows. Oracle’s use of maintenance windows (time-of-day budgets) instead of PostgreSQL’s continuous statistics-threshold dispatch is the interesting contrast: a calendar policy versus a load-reactive one.
- CUBRID dedicated vacuum workers — CUBRID also separates a vacuum master/coordinator from vacuum workers, but drives them from the log (MVCC version cleanup follows the transaction log) rather than from per-table dead-tuple statistics. A side-by-side would clarify what PostgreSQL trades by polling pg_class + cumulative stats versus CUBRID’s log-driven discovery of reclaimable versions. See the CUBRID vacuum analysis in knowledge/code-analysis/cubrid/cubrid-vacuum.md.
- The 64-bit XID proposal — the long-running PostgreSQL community effort to widen transaction ids to 64 bits would eliminate the wraparound deadline that forces half of autovacuum’s complexity (the forced path, the most-endangered-first database choice, the uncancellable emergency vacuum). Tracking the design discussion would show how much of do_start_worker and relation_needs_vacanalyze would simplify if the freeze deadline became a space-management optimization rather than a correctness deadline.
- Adjacent PostgreSQL docs — the mechanism this scheduler invokes is in postgres-vacuum.md (heap pruning, index cleanup, the cost-delay accounting itself); the freeze semantics and the wraparound math are in postgres-xid-wraparound-freeze.md; the fork mechanism and PMSIGNAL_START_AUTOVAC_WORKER handshake are in postgres-postmaster.md. The statistics this scheduler reads (PgStat_StatTabEntry) are produced by the cumulative stats system (postgres-overview-monitoring-stats.md).
Sources
Raw materials consumed: none. This document was synthesized
directly from the REL_18 source tree; sources: is empty.
Textbook chapters:
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7th ed.),
  §18.7 “Multiversion Schemes” — the requirement that old versions be
  deleted once no transaction can read them; the scheduling of that
  deletion is what autovacuum automates. Captured in
  knowledge/research/dbms-general/database-system-concepts.md.
- Database Internals (Alex Petrov, 2019), ch. 5 — MVCC and version
  maintenance as the bounding of the version space; the freeze is
  PostgreSQL’s specific version-space bound. Captured in
  knowledge/research/dbms-general/database-internals.md.
Source code (REL_18_STABLE, commit 273fe94, as of 2026-06-05):
- src/backend/postmaster/autovacuum.c — the entire subsystem: launcher, workers, scheduling, thresholds, cost balancing, work-item queue, shared memory.
- src/include/postmaster/autovacuum.h — public surface (AutoVacuumWorkItemType, the GUC externs, the launcher/worker entry points, the shmem functions).
Adjacent curated docs (cross-references, not duplicated here):
- knowledge/code-analysis/postgres/postgres-vacuum.md — the vacuum mechanism this scheduler invokes.
- knowledge/code-analysis/postgres/postgres-xid-wraparound-freeze.md — freeze semantics and the wraparound deadline math.
- knowledge/code-analysis/postgres/postgres-postmaster.md — the fork model and worker-start signaling.