PostgreSQL Autovacuum — The Launcher, Workers, and Anti-Wraparound Scheduling

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

Autovacuum is the policy layer above PostgreSQL’s MVCC garbage collector. MVCC (multi-version concurrency control) buys read-without-blocking by never overwriting a row in place: an UPDATE or DELETE leaves the old tuple version on the heap page, visible to transactions whose snapshot predates the change, and the new version is appended. Database System Concepts (Silberschatz, 7e, §18.7 “Multiversion Schemes”) states the consequence plainly — multiversion schemes “require that old versions of data items be deleted at some point”, and the deletion “can be done only when no transaction that can read the old version is still active.” That deletion is the vacuum operation. The question this document answers is not how vacuum reclaims a dead tuple (that is postgres-vacuum.md) but who decides when to run it, on which table, in which database, and how hard — the scheduling problem.

The scheduling problem has a hard deadline buried inside it that pure garbage-collection theory does not surface: transaction-id wraparound. PostgreSQL stamps every row version with a 32-bit xmin/xmax and decides visibility by comparing transaction ids in a modular (circular) space, where an id counts as “older” if it lies within the roughly two billion ids behind the current one on the ring. If a table holds a live row whose inserting transaction id is never frozen, then after ~2 billion further transactions that id wraps from “ancient past” to “distant future” and the row silently becomes invisible: catastrophic data loss that no query would flag. Database Internals (Petrov, ch. 5, on MVCC and version maintenance) frames vacuum as the maintenance task that bounds the version space; PostgreSQL’s specific bound is the freeze: mark an old tuple’s xmin as frozen so it is unconditionally visible and its original id can be reused. Freezing only happens inside a vacuum. So vacuum is doing two unrelated jobs, space reclamation and wraparound prevention, and the scheduler must serve both, with the second being a correctness deadline, not an optimization.
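The circular comparison can be sketched in a few lines. This is a minimal illustration, assuming only normal transaction ids (3 and above); the name sketch_xid_precedes and the macro are hypothetical stand-ins, not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest "normal" transaction id; lower values are reserved. */
#define SKETCH_FIRST_NORMAL_XID 3u

/*
 * Returns true if id a is logically older than id b on the 32-bit ring.
 * Hypothetical helper for illustration; it handles only normal ids.
 */
bool
sketch_xid_precedes(uint32_t a, uint32_t b)
{
    assert(a >= SKETCH_FIRST_NORMAL_XID && b >= SKETCH_FIRST_NORMAL_XID);

    /*
     * The unsigned subtraction wraps around; reading the result as a
     * signed value (two's complement assumed) tells which half of the
     * ring a falls in relative to b.
     */
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}
```

Under this sketch, id 100 precedes id 200, and an id just below 2^32 precedes a small id that was assigned after the counter wrapped. Once the current id is more than 2^31 ahead of 100, the answer flips and 100 looks like the future, which is the hazard freezing removes.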

Three design tensions shape any automatic-vacuum scheduler, and they are the knobs PostgreSQL turns:

When is a table “dirty enough” to vacuum? Vacuuming a table that has accrued three dead tuples wastes I/O; waiting until a high-churn table is 90% bloat wastes disk and slows scans. The standard answer is a threshold relative to table size: vacuum when dead tuples exceed base + scale × live_tuples. The constant base handles tiny tables (don’t thrash a 10-row table); the size-proportional term handles large ones (a 100M-row table can tolerate more absolute dead tuples before a vacuum pays off).
How to share one machine across many tables and databases without starving anyone. A naive scheduler that always picks the dirtiest table starves small databases; one that round-robins databases ignores urgency. PostgreSQL splits the decision: a round-robin across databases for fairness, and a threshold-driven choice within a database for urgency, with wraparound risk overriding both.
How hard to push. Vacuum is I/O-heavy and competes with foreground queries. The classic control is a cost-based delay: the vacuum accumulates a “cost” for every page it touches and sleeps when the cost crosses a limit, throttling itself. When several vacuums run at once, the aggregate throttle must stay bounded, so the budget is divided across the concurrent workers — a distributed rate-limiter.
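The size-relative trigger from the first tension can be written out as a short sketch. The function names are illustrative, not PostgreSQL identifiers, and double arithmetic is an assumption made for readability.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Dead-tuple trigger: base + scale * live_tuples.
 * Names are illustrative, not PostgreSQL identifiers.
 */
double
sketch_vacuum_threshold(double base, double scale, double live_tuples)
{
    assert(live_tuples >= 0.0);
    return base + scale * live_tuples;
}

/* A table is "dirty enough" once its dead tuples exceed the threshold. */
bool
sketch_table_is_dirty(double dead_tuples, double base, double scale,
                      double live_tuples)
{
    return dead_tuples > sketch_vacuum_threshold(base, scale, live_tuples);
}
```

With a base of 50 and a scale of 0.2 (PostgreSQL's shipped defaults for autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor), a 10-row table is left alone until it has 53 dead tuples, while a 100M-row table tolerates about 20 million.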

The autovacuum subsystem is the embodiment of these three answers. It is deliberately outside the vacuum mechanism: vacuum can always be run by hand (VACUUM), and the same throttling code serves manual and automatic runs alike. Autovacuum is the daemon that ensures nobody has to type the command.

Common DBMS Design

The textbook gives the model (multiversion deletion, wraparound bound, threshold scheduling, rate-limited maintenance). This section names the engineering conventions that recur across production engines that bolt an automatic maintenance daemon onto an MVCC or deferred-cleanup storage layer — PostgreSQL, Oracle (its automatic segment/space advisors and SMON undo cleanup), SQL Server (ghost record cleanup + auto-stats), MySQL/InnoDB (purge threads), CUBRID (its dedicated vacuum workers). PostgreSQL’s specific choices in the next section read as one set of dials within this shared space.

A long-lived scheduler plus short-lived executors

Almost every engine separates the decision process from the work process. A single always-on scheduler (a daemon, a coordinator thread) holds the global picture — which objects are stale, which deadlines loom — and dispatches work to a pool of executors that do one unit of work and exit (or return to a pool). The split keeps the global state in one place (no N-way coordination over “who is vacuuming what”) and makes the executors disposable: an executor that crashes or is killed mid-vacuum leaves the scheduler’s bookkeeping intact, and the next dispatch simply re-picks the table. PostgreSQL’s launcher/worker pair is exactly this shape; InnoDB’s purge coordinator + purge workers is the same idea inside one process.

A bounded pool sized by a parameter

Maintenance must not be allowed to consume the whole machine, so the executor pool is capped by a configuration parameter (max workers). The scheduler tracks free vs. busy executors in a small fixed-size shared structure — a free list and a running list — so “can I dispatch another?” is a constant-time check. The cap is a ceiling, not a target: the pool sits idle when nothing is dirty.

Statistics-driven thresholds, not a fixed timetable

Rather than “vacuum every table every hour,” the scheduler consults cumulative activity statistics — dead-tuple counts, insert counts, modification counts since the last maintenance — maintained by the running system as a side effect of DML. A table that nobody writes is never scheduled; a hot table is scheduled often. The threshold is a formula over those counters and the table’s size, with per-object overrides so a pathological table can be tuned without touching the global knobs.

A hard deadline that overrides the soft policy

Layered on top of the soft “dirty enough” policy is a forced path for the correctness deadline (wraparound, undo-space exhaustion, log-space pressure — engine-specific). When an object crosses the hard limit, the scheduler must service it regardless of how clean it looks and regardless of whether the operator disabled routine maintenance. This forced path typically also changes which object the scheduler picks first (most-endangered first) and whether the executor may be interrupted (it may not, easily).

A shared, divided throttle

To keep aggregate maintenance I/O bounded while several executors run, the rate limit is a shared quantity divided across the active executors. Each executor periodically reads the current divisor from shared memory and recomputes its personal limit, so adding or removing an executor re-balances the others without a central rendezvous. The division is the distributed form of the textbook’s single rate-limiter.

A side queue for ad-hoc maintenance requests

Beyond the statistics-driven schedule, other parts of the engine occasionally need a specific maintenance action (“summarize this index range now”). The convention is a small fixed-size work-item queue in the scheduler’s shared memory that any backend can post into and that the executors drain opportunistically, so one-off requests piggyback on the existing executor pool instead of spawning bespoke machinery.

Theory ↔ PostgreSQL mapping

Theory / convention	PostgreSQL entity
Long-lived scheduler	`AutoVacLauncherMain` (the `B_AUTOVAC_LAUNCHER` process)
Short-lived executor	`AutoVacWorkerMain` (one `B_AUTOVAC_WORKER` per dispatch)
Bounded executor pool	`autovacuum_worker_slots` free list `av_freeWorkers` in `AutoVacuumShmem`
“Can I dispatch?” check	`av_worker_available` (free slots vs. reserved)
Cross-database fairness	`DatabaseList` round-robin built by `rebuild_database_list`
Within-database urgency	`relation_needs_vacanalyze` threshold equation
Activity statistics	`PgStat_StatTabEntry` (`dead_tuples`, `ins_since_vacuum`, `mod_since_analyze`)
Threshold formula	`base + scale × reltuples`, clamped by `vac_max_thresh`
Hard deadline	`relfrozenxid`/`relminmxid` vs. `recentXid - freeze_max_age` → `force_vacuum`
Most-endangered-first	`do_start_worker` picks oldest `datfrozenxid` when `for_xid_wrap`
Shared divided throttle	`av_nworkersForBalance` + `AutoVacuumUpdateCostLimit`
Ad-hoc request queue	`av_workItems[NUM_WORKITEMS]` + `AutoVacuumRequestWork`

By the time the reader reaches av_nworkersForBalance in the next section, they already know what kind of thing it is: the divisor of a distributed rate-limiter.

PostgreSQL’s Approach

PostgreSQL implements the whole scheduler in one file, src/backend/postmaster/autovacuum.c (~3,475 lines at REL_18), with a tiny public header src/include/postmaster/autovacuum.h. The architecture is a two-tier process model glued together by one shared-memory struct and the postmaster’s fork mechanism. This section walks the design: the shared state, the launcher’s scheduling loop, the worker’s per-table decision, the cost-balancing protocol, the forced anti-wraparound path, and the side work-item queue.

Two processes, one shared struct

The launcher never connects to a database and never vacuums anything. It is a perpetual scheduler that decides which database deserves a worker next, then asks the postmaster to fork one. Workers are short-lived: each forked worker attaches to exactly one database, does “an appropriate amount of work,” and exits. The two tiers share no memory except AutoVacuumShmem, a single struct (plus a trailing array of per-slot WorkerInfoData) sized at startup.

// AutoVacuumShmemStruct — src/backend/postmaster/autovacuum.c
typedef struct
{
    sig_atomic_t av_signal[AutoVacNumSignals];
    pid_t       av_launcherpid;
    dclist_head av_freeWorkers;     /* WorkerInfo free list */
    dlist_head  av_runningWorkers;  /* WorkerInfo non-free queue */
    WorkerInfo  av_startingWorker;  /* one being started; cleared by the worker */
    AutoVacuumWorkItem av_workItems[NUM_WORKITEMS];  /* NUM_WORKITEMS == 256 */
    pg_atomic_uint32 av_nworkersForBalance;          /* cost-balance divisor */
} AutoVacuumShmemStruct;

The struct is almost entirely protected by one LWLock, AutovacuumLock. The exceptions are deliberate: av_signal is an array of sig_atomic_t that remote processes set without locking (so a backend can flag “rebalance needed” cheaply), and av_nworkersForBalance is a pg_atomic_uint32 that workers read on a hot path without taking the lock. Everything else — the worker free list, the running list, the starting-worker pointer, the work-item array — moves under AutovacuumLock.

A worker’s whereabouts live in one WorkerInfoData slot, and there are exactly autovacuum_worker_slots of them, allocated in a flat array after the fixed struct:

// WorkerInfoData — src/backend/postmaster/autovacuum.c
typedef struct WorkerInfoData
{
    dlist_node  wi_links;       /* entry into free list or running list */
    Oid         wi_dboid;       /* database this worker works on */
    Oid         wi_tableoid;    /* table currently being vacuumed, if any */
    PGPROC     *wi_proc;        /* PGPROC of the running worker, NULL if not started */
    TimestampTz wi_launchtime;
    pg_atomic_flag wi_dobalance;/* include this worker in balance calc? */
    bool        wi_sharedrel;
} WorkerInfoData;

Each slot threads onto one of two lists through its single wi_links node: it sits on av_freeWorkers when idle and on av_runningWorkers when a worker owns it. wi_tableoid and wi_sharedrel are the two fields a worker publishes so other workers can see what it is currently chewing on (protected by AutovacuumScheduleLock, not the main lock); that is how two workers in the same database avoid both grabbing the same table.

The shared memory is sized and laid out at server start:

// AutoVacuumShmemInit — src/backend/postmaster/autovacuum.c
AutoVacuumShmem = (AutoVacuumShmemStruct *)
    ShmemInitStruct("AutoVacuum Data", AutoVacuumShmemSize(), &found);
// ... condensed ...
worker = (WorkerInfo) ((char *) AutoVacuumShmem +
                       MAXALIGN(sizeof(AutoVacuumShmemStruct)));
for (i = 0; i < autovacuum_worker_slots; i++)
{
    dclist_push_head(&AutoVacuumShmem->av_freeWorkers, &worker[i].wi_links);
    pg_atomic_init_flag(&worker[i].wi_dobalance);
}
pg_atomic_init_u32(&AutoVacuumShmem->av_nworkersForBalance, 0);

Note the PG17→PG18 evolution baked in here: the pool is sized by autovacuum_worker_slots (the count of physical slots reserved at startup, fixed for the cluster’s life because shared memory cannot grow), while autovacuum_max_workers is a runtime GUC that caps how many of those slots autovacuum will actually use. The gap between the two is a reserve, and av_worker_available enforces it:

// av_worker_available — src/backend/postmaster/autovacuum.c
free_slots = dclist_count(&AutoVacuumShmem->av_freeWorkers);
reserved_slots = autovacuum_worker_slots - autovacuum_max_workers;
reserved_slots = Max(0, reserved_slots);
return free_slots > reserved_slots;

This is the PG18 change that lets an operator raise autovacuum_max_workers with a reload (no restart) up to the autovacuum_worker_slots ceiling — a frequent prior pain point.
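The reserve arithmetic is easy to check with plain integers. The following is a standalone sketch that takes the three quantities as parameters instead of reading shared memory and GUCs; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the "can I dispatch?" check: a worker may start only while
 * the number of free slots exceeds the reserve (slots minus max workers).
 */
bool
sketch_worker_available(int free_slots, int worker_slots, int max_workers)
{
    assert(free_slots >= 0 && free_slots <= worker_slots);

    int reserved_slots = worker_slots - max_workers;

    if (reserved_slots < 0)
        reserved_slots = 0;     /* max_workers above the slot count: no reserve */
    return free_slots > reserved_slots;
}
```

With 16 slots and a cap of 3 workers, the reserve is 13: a launch is allowed while 14 or more slots are free and refused once three workers are running. Raising the cap to 5 shrinks the reserve to 11, so the same 13 free slots allow another launch.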

The overall topology:

flowchart TB
    PM["postmaster<br/>(forks every process)"]
    LA["autovacuum launcher<br/>AutoVacLauncherMain<br/>perpetual scheduler, no DB"]
    subgraph SHM["AutoVacuumShmem (shared memory, AutovacuumLock)"]
      FREE["av_freeWorkers<br/>(free WorkerInfo slots)"]
      RUN["av_runningWorkers<br/>(busy WorkerInfo slots)"]
      START["av_startingWorker<br/>(handoff pointer)"]
      WI["av_workItems[256]<br/>(ad-hoc requests)"]
      NB["av_nworkersForBalance<br/>(atomic divisor)"]
    end
    W1["worker (db A)<br/>AutoVacWorkerMain"]
    W2["worker (db B)<br/>AutoVacWorkerMain"]
    BK["any backend<br/>(BRIN summarize)"]

    PM -->|fork| LA
    LA -->|"do_start_worker:<br/>pick db, fill startingWorker,<br/>signal PMSIGNAL_START_AUTOVAC_WORKER"| PM
    PM -->|fork| W1
    PM -->|fork| W2
    LA --- SHM
    W1 --- SHM
    W2 --- SHM
    BK -->|"AutoVacuumRequestWork"| WI
    W1 -->|"SIGUSR2 'I'm up / I finished'"| LA

Figure 1 — The two-tier process model. The launcher never touches a database; it picks a target database, parks a WorkerInfo in av_startingWorker, and signals the postmaster to fork the actual worker. The forked worker claims the parked slot, moves it to the running list, and signals the launcher back via SIGUSR2. All coordination is through AutoVacuumShmem under AutovacuumLock. Any backend can post a one-off request into av_workItems.

The launcher’s scheduling loop

After the standard auxiliary-process boilerplate (signal handlers, InitProcess, a sigsetjmp error-recovery block stripped down from PostgresMain), the launcher builds its database list once and enters a sleep-then-maybe-launch loop:

// AutoVacLauncherMain — src/backend/postmaster/autovacuum.c (condensed)
rebuild_database_list(InvalidOid);

while (!ShutdownRequestPending)
{
    struct timeval nap;

    launcher_determine_sleep(av_worker_available(), false, &nap);
    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L),
                     WAIT_EVENT_AUTOVACUUM_MAIN);
    ResetLatch(MyLatch);
    ProcessAutoVacLauncherInterrupts();

    /* ... handle SIGUSR2: rebalance, or retry after a fork failure ... */

    current_time = GetCurrentTimestamp();
    LWLockAcquire(AutovacuumLock, LW_SHARED);
    can_launch = av_worker_available();
    /* ... if av_startingWorker still pending and not timed out, can_launch = false ... */
    LWLockRelease(AutovacuumLock);
    if (!can_launch)
        continue;

    if (dlist_is_empty(&DatabaseList))
        launch_worker(current_time);           /* bootstrap: nothing scheduled yet */
    else
    {
        avl_dbase *avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList);
        if (TimestampDifferenceExceeds(avdb->adl_next_worker, current_time, 0))
            launch_worker(current_time);        /* the due database */
    }
}

Two invariants make this loop correct. First, only one worker may be “starting” at a time: if av_startingWorker is non-NULL the launcher will not dispatch another, because the forked worker has not yet claimed its slot and the launcher would otherwise double-book. If a starting worker takes longer than Min(autovacuum_naptime, 60) seconds the launcher reclaims its slot and logs a warning — a forked worker that died before claiming would otherwise wedge the pipeline. Second, the launcher sleeps until the next due database, computed from the list, not a fixed tick.
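The first invariant's timeout can be sketched as follows. Timestamps are plain milliseconds here, an assumption made to keep the example self-contained, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Grace period for a forked worker to claim its slot, in milliseconds. */
int64_t
sketch_start_timeout_ms(int naptime_s)
{
    assert(naptime_s > 0);

    int capped_s = naptime_s < 60 ? naptime_s : 60;   /* Min(naptime, 60) */

    return (int64_t) capped_s * 1000;
}

/* True once the starting worker has been pending for the whole grace period. */
bool
sketch_start_timed_out(int64_t launch_ms, int64_t now_ms, int naptime_s)
{
    return now_ms - launch_ms >= sketch_start_timeout_ms(naptime_s);
}
```

A 5-minute naptime still yields a 60-second grace period, while a 10-second naptime shortens it to 10 seconds, so a dead starting worker never blocks dispatch for longer than one minute.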

The database round-robin

rebuild_database_list is the cross-database fairness mechanism. It produces a doubly-linked list of avl_dbase entries, one per database that has a pgstats entry, ordered so the database due furthest in the future is at the head and the one due soonest is at the tail — which is why the loop above reads dlist_tail_element to get the next target. The “next worker” timestamps are spread evenly across one autovacuum_naptime interval:

// rebuild_database_list — src/backend/postmaster/autovacuum.c (condensed)
millis_increment = 1000.0 * autovacuum_naptime / nelems;
if (millis_increment <= MIN_AUTOVAC_SLEEPTIME)        /* MIN == 100.0 ms */
    millis_increment = MIN_AUTOVAC_SLEEPTIME * 1.1;
current_time = GetCurrentTimestamp();
for (i = 0; i < nelems; i++)
{
    db = &(dbary[i]);
    current_time = TimestampTzPlusMilliseconds(current_time, millis_increment);
    db->adl_next_worker = current_time;
    dlist_push_head(&DatabaseList, &db->adl_node);    /* later goes nearer head */
}

The structure is genuinely subtle: the function first scores databases (new database = 0, then the existing list in order, then any get_database_list() leftovers) into a temporary hash, sorts an array by score, and rebuilds the list so the ordering of databases within the naptime window is preserved across rebuilds. The effect is that with N databases and a 60-second naptime, each database gets a worker roughly every 60 seconds, evenly staggered, and the order is stable so no database keeps jumping the queue. When a worker is actually launched for a database, launch_worker sets that database’s adl_next_worker to the current time plus one full naptime and moves it to the head, so it goes to the back of the effective queue.

flowchart LR
    subgraph DL["DatabaseList — ordered by adl_next_worker"]
      direction LR
      H["head:<br/>db due furthest out"]
      M["...staggered every<br/>naptime/N ms..."]
      T["tail:<br/>db due soonest"]
    end
    SLEEP["launcher_determine_sleep<br/>sleeps until tail's<br/>adl_next_worker"]
    PICK["pick tail database"]
    LW["launch_worker:<br/>next_worker += naptime,<br/>move to head"]

    T --> SLEEP
    SLEEP --> PICK
    PICK --> LW
    LW -->|"this db now at head<br/>(due furthest out)"| H

Figure 2 — The database round-robin. The list is kept sorted by adl_next_worker so the soonest-due database is always at the tail. The launcher sleeps exactly until that database is due, dispatches a worker, then pushes that database’s next slot one naptime into the future and moves it to the head. Over one autovacuum_naptime window every database is visited once, evenly spaced. Anti-wraparound is the exception that bypasses this ordering — see below.
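The stagger arithmetic can be checked in isolation. This sketch uses floating-point milliseconds relative to "now" rather than real timestamps, and the function names are hypothetical.

```c
#include <assert.h>

#define SKETCH_MIN_SLEEP_MS 100.0

/* Spacing between consecutive databases within one naptime window. */
double
sketch_stagger_increment_ms(int naptime_s, int ndatabases)
{
    assert(naptime_s > 0 && ndatabases > 0);

    double inc = 1000.0 * naptime_s / ndatabases;

    /* Too many databases: never schedule closer than the minimum sleep. */
    if (inc <= SKETCH_MIN_SLEEP_MS)
        inc = SKETCH_MIN_SLEEP_MS * 1.1;
    return inc;
}

/* Offset from "now" at which the database at sorted position is due. */
double
sketch_due_offset_ms(int naptime_s, int ndatabases, int position)
{
    assert(position >= 0 && position < ndatabases);
    return (position + 1) * sketch_stagger_increment_ms(naptime_s, ndatabases);
}
```

With four databases and a 60-second naptime the slots fall 15 seconds apart, and the last one is due exactly one naptime from now. With a thousand databases the raw spacing of 60 ms is below the floor, so the spacing becomes about 110 ms and a full pass takes longer than one naptime.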

The worker’s per-table decision

A forked worker (AutoVacWorkerMain) claims the parked WorkerInfo, moves it to av_runningWorkers, connects to its assigned database, and calls do_autovacuum. That function scans pg_class twice (main tables and matviews first, then TOAST tables, because a TOAST table inherits its parent’s reloptions), and for every relation calls the heart of the policy — relation_needs_vacanalyze — to decide three booleans: vacuum, analyze, force-for-wraparound.

The threshold equation is the textbook formula. Parameters come from the table’s reloptions if set, else the global GUCs:

// relation_needs_vacanalyze — src/backend/postmaster/autovacuum.c (condensed)
vac_scale_factor = (relopts && relopts->vacuum_scale_factor >= 0)
    ? relopts->vacuum_scale_factor : autovacuum_vac_scale;
vac_base_thresh  = (relopts && relopts->vacuum_threshold >= 0)
    ? relopts->vacuum_threshold : autovacuum_vac_thresh;
/* PG18: a hard ceiling on the computed vacuum threshold; -1 disables it */
vac_max_thresh   = (relopts && relopts->vacuum_max_threshold >= -1)
    ? relopts->vacuum_max_threshold : autovacuum_vac_max_thresh;
// ... ins and analyze parameters resolved the same way ...

vactuples = tabentry->dead_tuples;
instuples = tabentry->ins_since_vacuum;
anltuples = tabentry->mod_since_analyze;

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
if (vac_max_thresh >= 0 && vacthresh > (float4) vac_max_thresh)
    vacthresh = (float4) vac_max_thresh;

vacinsthresh = (float4) vac_ins_base_thresh +
               vac_ins_scale_factor * reltuples * pcnt_unfrozen;
anlthresh    = (float4) anl_base_thresh + anl_scale_factor * reltuples;

*dovacuum  = force_vacuum || (vactuples > vacthresh) ||
             (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
*doanalyze = (anltuples > anlthresh);

Three numbers feed the decision, all from cumulative statistics: dead_tuples (the bloat metric, drives the classic vacuum threshold), ins_since_vacuum (the insert metric, PG13+ insert-only-table vacuuming so even an append-only table gets frozen eventually), and mod_since_analyze (the planner-statistics staleness metric, drives analyze). The insert path carries one PG18 refinement worth noting: the insert threshold is scaled by pcnt_unfrozen, the fraction of the table’s pages that are not already all-frozen (derived from relallfrozen/relpages), so an insert-heavy table whose old pages are already frozen is judged on the inserts into its still-active region — the engine does not re-vacuum a table that is mostly settled.

flowchart TD
    START["relation_needs_vacanalyze(rel)"]
    XID{"relfrozenxid older than<br/>recentXid - freeze_max_age?"}
    MXID{"relminmxid older than<br/>recentMulti - mxid_freeze_max_age?"}
    FORCE["force_vacuum = true<br/>wraparound = true"]
    ENABLED{"autovacuum enabled<br/>for this table?"}
    SKIP["dovacuum = false<br/>doanalyze = false<br/>(unless forced)"]
    STATS{"pgstats entry exists<br/>and autovacuum active?"}
    THRESH["dovacuum = forced OR dead>vacthresh<br/>OR ins>vacinsthresh<br/>doanalyze = mod>anlthresh"]
    ONLYFORCE["dovacuum = force_vacuum<br/>doanalyze = false"]

    START --> XID
    XID -->|yes| FORCE
    XID -->|no| MXID
    MXID -->|yes| FORCE
    MXID -->|no| ENABLED
    FORCE --> STATS
    ENABLED -->|"no, and not forced"| SKIP
    ENABLED -->|"yes, or forced"| STATS
    STATS -->|yes| THRESH
    STATS -->|no| ONLYFORCE

Figure 3 — The per-table decision tree. The wraparound check runs first and sets force_vacuum; a forced table is vacuumed even if the operator disabled autovacuum for it (the !av_enabled && !force_vacuum early return only fires when not forced). Only after the forced path is settled does the soft threshold equation run, and only for tables with live statistics.
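The two PG18 refinements of the soft path, the threshold ceiling and the pcnt_unfrozen scaling, can be tried with concrete numbers. This sketch uses double instead of float4 and hypothetical function names; the inputs in the note below are illustrative.

```c
#include <assert.h>

/* Classic threshold with the optional ceiling; a negative cap disables it. */
double
sketch_capped_vacuum_threshold(double base, double scale,
                               double reltuples, double max_thresh)
{
    assert(reltuples >= 0.0);

    double thresh = base + scale * reltuples;

    if (max_thresh >= 0.0 && thresh > max_thresh)
        thresh = max_thresh;
    return thresh;
}

/* Insert threshold, scaled by the fraction of pages not yet all-frozen. */
double
sketch_insert_threshold(double ins_base, double ins_scale,
                        double reltuples, double pcnt_unfrozen)
{
    assert(pcnt_unfrozen >= 0.0 && pcnt_unfrozen <= 1.0);
    return ins_base + ins_scale * reltuples * pcnt_unfrozen;
}
```

For a billion-row table with base 50 and scale 0.2, the uncapped trigger is about 200 million dead tuples; a cap of 100 million halves that. For a million-row table with an insert base of 1000 and scale 0.2, the insert trigger drops from about 201,000 to about 21,000 once 90% of the pages are already frozen.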

Claiming a table without colliding

Once do_autovacuum has its list of OIDs to process, it loops over them, and before touching each table it must make sure no other worker has already claimed the same relation. It holds AutovacuumScheduleLock while it both checks the running workers and publishes its own claim:

// do_autovacuum — src/backend/postmaster/autovacuum.c (condensed claim)
LWLockAcquire(AutovacuumScheduleLock, LW_EXCLUSIVE);
LWLockAcquire(AutovacuumLock, LW_SHARED);
dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
{
    WorkerInfo worker = dlist_container(WorkerInfoData, wi_links, iter.cur);
    if (worker == MyWorkerInfo) continue;
    if (!worker->wi_sharedrel && worker->wi_dboid != MyDatabaseId) continue;
    if (worker->wi_tableoid == relid) { skipit = true; break; }
}
LWLockRelease(AutovacuumLock);
if (skipit) { LWLockRelease(AutovacuumScheduleLock); continue; }

/* claim it before releasing the schedule lock */
MyWorkerInfo->wi_tableoid = relid;
MyWorkerInfo->wi_sharedrel = isshared;
LWLockRelease(AutovacuumScheduleLock);

tab = table_recheck_autovac(relid, table_toast_map, pg_class_desc,
                            effective_multixact_freeze_max_age);
if (tab == NULL) { /* someone else did it; release claim */ continue; }

The publish-then-recheck pattern is the standard “claim and verify” under a small race window: a worker publishes wi_tableoid while holding the schedule lock, then re-reads the statistics (table_recheck_autovac) because another worker might have vacuumed the table between the first-pass scan and now. If the recheck says the table no longer needs work, the worker releases the claim and moves on. Shared catalogs (relisshared) are visible to workers in any database, so the collision check honors wi_sharedrel across database boundaries.
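The collision rule, including the shared-catalog case, can be exercised with a small standalone model. Locking is deliberately left out, so this sketch only shows the check-then-publish logic that the real code performs under the schedule lock; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchWorker
{
    uint32_t dboid;      /* database the worker is attached to */
    uint32_t tableoid;   /* table being processed, 0 if none */
    bool     sharedrel;  /* is that table a shared catalog? */
} SketchWorker;

/* Does any other worker already hold a claim on relid? */
bool
sketch_claimed_by_other(const SketchWorker *workers, size_t nworkers,
                        size_t me, uint32_t relid)
{
    assert(workers != NULL && me < nworkers);

    for (size_t i = 0; i < nworkers; i++)
    {
        if (i == me)
            continue;           /* ignore myself */
        /* ignore other databases unless they hold a shared relation */
        if (!workers[i].sharedrel && workers[i].dboid != workers[me].dboid)
            continue;
        if (workers[i].tableoid == relid)
            return true;
    }
    return false;
}

/* Publish a claim if nobody else holds one; returns whether it succeeded. */
bool
sketch_try_claim(SketchWorker *workers, size_t nworkers, size_t me,
                 uint32_t relid, bool isshared)
{
    if (sketch_claimed_by_other(workers, nworkers, me, relid))
        return false;
    workers[me].tableoid = relid;
    workers[me].sharedrel = isshared;
    return true;
}
```

Two workers in the same database cannot both claim OID 16384, but a worker in a different database can, because the same OID there names a different table. A claim on a shared catalog blocks workers in every database.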

Cost-delay balancing across workers

Each table the worker processes also updates the worker’s place in the shared cost-balance scheme. Vacuum throttles itself by accumulating a cost per page and sleeping when it crosses vacuum_cost_limit; with multiple workers active, that limit is divided so the aggregate I/O rate stays near the single-worker target. The divisor lives in shared memory and is recomputed whenever the set of balancing workers changes:

// autovac_recalculate_workers_for_balance — autovacuum.c (condensed)
dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
{
    WorkerInfo worker = dlist_container(WorkerInfoData, wi_links, iter.cur);
    if (worker->wi_proc == NULL ||
        pg_atomic_unlocked_test_flag(&worker->wi_dobalance))
        continue;
    nworkers_for_balance++;
}
pg_atomic_write_u32(&AutoVacuumShmem->av_nworkersForBalance, nworkers_for_balance);

// AutoVacuumUpdateCostLimit — autovacuum.c (condensed)
if (av_storage_param_cost_limit > 0)
    vacuum_cost_limit = av_storage_param_cost_limit;   /* per-table override: not balanced */
else
{
    vacuum_cost_limit = (autovacuum_vac_cost_limit > 0)
        ? autovacuum_vac_cost_limit : VacuumCostLimit;
    if (pg_atomic_unlocked_test_flag(&MyWorkerInfo->wi_dobalance))
        return;                                        /* this worker opted out of balancing */
    nworkers_for_balance = pg_atomic_read_u32(&AutoVacuumShmem->av_nworkersForBalance);
    vacuum_cost_limit = Max(vacuum_cost_limit / nworkers_for_balance, 1);
}

The wi_dobalance flag is the opt-out: a table with cost-related reloptions (its own vacuum_cost_delay/vacuum_cost_limit) is not folded into the shared budget — the operator asked for a specific rate on that table, so it runs at that rate and is excluded from the divisor. Every other worker reads av_nworkersForBalance (atomically, no lock) on a regular basis through VacuumUpdateCosts and divides the global limit by it. When a worker starts or finishes a table it signals AutoVacRebalance, the launcher recomputes the divisor under the lock, and the running workers pick up the new value on their next check — the distributed rate-limiter rebalancing without a central rendezvous.
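The division rule can be tried with plain integers. This sketch passes the shared values in as parameters and replaces the atomic flag with a bool; the function name and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Per-worker cost limit. A positive table_cost_limit is a per-table
 * override and is never divided; otherwise the base limit is split
 * across the workers that take part in balancing.
 */
int
sketch_balanced_cost_limit(int table_cost_limit, int base_cost_limit,
                           bool dobalance, int nworkers_for_balance)
{
    if (table_cost_limit > 0)
        return table_cost_limit;
    if (!dobalance)
        return base_cost_limit;         /* opted out of balancing */

    assert(base_cost_limit > 0 && nworkers_for_balance > 0);

    int share = base_cost_limit / nworkers_for_balance;

    return share > 1 ? share : 1;       /* never throttle down to zero */
}
```

With a base limit of 200, one balancing worker keeps all 200, three workers get 66 each, and a table with its own limit of 1000 runs at 1000 no matter how many workers are active.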

Forced anti-wraparound vacuums

The wraparound deadline pierces every layer. It changes which database the launcher picks, which tables the worker forces, and whether a disabled autovacuum can be skipped.

At the database level, do_start_worker computes a force limit and scans all databases (from pg_database, not just those with stats), preferring the most-endangered:

// do_start_worker — src/backend/postmaster/autovacuum.c (condensed)
recentXid = ReadNextTransactionId();
xidForceLimit = recentXid - autovacuum_freeze_max_age;
if (xidForceLimit < FirstNormalTransactionId)
    xidForceLimit -= FirstNormalTransactionId;
recentMulti = ReadNextMultiXactId();
multiForceLimit = recentMulti - MultiXactMemberFreezeThreshold();

foreach(cell, dblist)
{
    avw_dbase *tmp = lfirst(cell);
    if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
    {
        if (avdb == NULL || TransactionIdPrecedes(tmp->adw_frozenxid, avdb->adw_frozenxid))
            avdb = tmp;
        for_xid_wrap = true;     /* this db is at risk; ignore not-at-risk dbs from here */
        continue;
    }
    else if (for_xid_wrap) continue;
    else if (MultiXactIdPrecedes(tmp->adw_minmulti, multiForceLimit)) { /* multixact risk */ }
    // ... else fall through to "least-recently-autovacuumed" selection ...
}

Once any database is found in XID-wraparound danger, for_xid_wrap latches true and every not-at-risk database is ignored for the rest of the scan; among at-risk databases the one with the oldest datfrozenxid wins. XID wraparound outranks MultiXact wraparound, and both outrank the ordinary “least recently autovacuumed” choice. This is the most-endangered-first override: a database that has not been touched in days will still be picked ahead of a busy one if its frozen-xid horizon is the closest to the wall.
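The XID half of this selection can be modeled on a plain array. The sketch below covers only the most-endangered-first rule; the MultiXact check and the least-recently-autovacuumed fallback are omitted, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FIRST_NORMAL 3u

typedef struct SketchDb
{
    uint32_t oid;
    uint32_t frozenxid;     /* stands in for datfrozenxid */
} SketchDb;

/* Circular "older than" test on 32-bit ids (normal ids only). */
bool
sketch_db_xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Returns the index of the at-risk database with the oldest frozenxid,
 * or -1 when no database is past the force limit.
 */
int
sketch_pick_wraparound_db(const SketchDb *dbs, int ndbs,
                          uint32_t next_xid, uint32_t freeze_max_age)
{
    assert(dbs != NULL && ndbs >= 0);

    uint32_t force_limit = next_xid - freeze_max_age;

    if (force_limit < SKETCH_FIRST_NORMAL)
        force_limit -= SKETCH_FIRST_NORMAL;   /* keep the limit a normal id */

    int best = -1;

    for (int i = 0; i < ndbs; i++)
    {
        if (!sketch_db_xid_precedes(dbs[i].frozenxid, force_limit))
            continue;                         /* not at risk */
        if (best < 0 ||
            sketch_db_xid_precedes(dbs[i].frozenxid, dbs[best].frozenxid))
            best = i;
    }
    return best;
}
```

With the next id at one billion and a freeze age of 200 million, the force limit is 800 million: databases frozen up to 700 million and 500 million are both at risk, and the one at 500 million wins. The comparison also holds across the wrap, where the limit itself is a large pre-wrap id.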

At the table level, the same comparison runs inside relation_needs_vacanalyze and sets force_vacuum, which has three consequences the soft path lacks:

// relation_needs_vacanalyze — force path (condensed)
xidForceLimit = recentXid - freeze_max_age;
relfrozenxid = classForm->relfrozenxid;
force_vacuum = (TransactionIdIsNormal(relfrozenxid) &&
                TransactionIdPrecedes(relfrozenxid, xidForceLimit));
/* ... else check relminmxid against the multixact force limit ... */
*wraparound = force_vacuum;

if (!av_enabled && !force_vacuum)     /* disabled tables: skip only if NOT forced */
{
    *doanalyze = false; *dovacuum = false; return;
}

First, a forced vacuum runs even when av_enabled is false (the table or the cluster has autovacuum turned off) — wraparound prevention is not optional. Second, when the worker is processing such a table, the config-reload check inside do_autovacuum’s loop deliberately does not bail out if it sees autovacuum was just disabled, with the comment “this might be a for-wraparound emergency worker.” Third, a forced (anti-wraparound) vacuum is harder to cancel: it ignores the usual signals that would let a conflicting lock request kill an ordinary autovacuum, because letting a DDL statement repeatedly cancel the only thing preventing data loss would be a foot-gun. (The cancel behavior itself lives in the lock manager and vacuum.c; see postgres-xid-wraparound-freeze.md.)

The side work-item queue

Finally, the small fixed queue that lets any backend request a specific maintenance action. The only producer in core at REL_18 is BRIN index summarization (AVW_BRINSummarizeRange), posted from the BRIN insert path (brininsert) when the index has autosummarize enabled:

// AutoVacuumRequestWork — src/backend/postmaster/autovacuum.c (condensed)
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
for (i = 0; i < NUM_WORKITEMS; i++)         /* NUM_WORKITEMS == 256 */
{
    AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
    if (workitem->avw_used) continue;
    workitem->avw_used = true;
    workitem->avw_active = false;
    workitem->avw_type = type;
    workitem->avw_database = MyDatabaseId;
    workitem->avw_relation = relationId;
    workitem->avw_blockNumber = blkno;
    result = true;
    break;
}
LWLockRelease(AutovacuumLock);
return result;

The queue is a flat array of 256 slots; a full queue silently drops the request (returns false). A worker, after finishing its table list, drains the items belonging to its database via perform_work_item, marking each avw_active while it runs so a second worker does not double-process it. This is the engine’s “piggyback ad-hoc maintenance on the existing worker pool” convention: no bespoke process, just a mailbox the workers check on their way out.
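The mailbox behavior, a post that fails when the queue is full and a drain that frees slots per database, can be modeled without shared memory or locks. This is a single-threaded sketch with hypothetical names; the actual work a drained item would trigger is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_WORKITEMS 256

typedef struct SketchWorkItem
{
    bool     used;       /* slot holds a request */
    bool     active;     /* a worker is processing it */
    uint32_t database;
    uint32_t relation;
    uint32_t block;
} SketchWorkItem;

typedef struct SketchQueue
{
    SketchWorkItem items[SKETCH_NUM_WORKITEMS];
} SketchQueue;

/* Post a request into the first unused slot; false means the queue is full. */
bool
sketch_request_work(SketchQueue *q, uint32_t db, uint32_t rel, uint32_t blk)
{
    assert(q != NULL);

    for (int i = 0; i < SKETCH_NUM_WORKITEMS; i++)
    {
        SketchWorkItem *item = &q->items[i];

        if (item->used)
            continue;
        item->used = true;
        item->active = false;
        item->database = db;
        item->relation = rel;
        item->block = blk;
        return true;
    }
    return false;
}

/* Process and free every pending item for one database; returns the count. */
int
sketch_drain_for_db(SketchQueue *q, uint32_t db)
{
    assert(q != NULL);

    int done = 0;

    for (int i = 0; i < SKETCH_NUM_WORKITEMS; i++)
    {
        SketchWorkItem *item = &q->items[i];

        if (!item->used || item->active || item->database != db)
            continue;
        item->active = true;    /* claim: other workers now skip it */
        /* ... the requested maintenance action would run here ... */
        item->active = false;
        item->used = false;     /* slot is free for the next request */
        done++;
    }
    return done;
}
```

Filling all 256 slots makes the next post fail; draining one database's items frees their slots, after which posting succeeds again.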

Source Walkthrough

All symbols are in src/backend/postmaster/autovacuum.c unless noted; the public surface is in src/include/postmaster/autovacuum.h.

Shared state and lifecycle

AutoVacuumShmemStruct (struct) — the one shared struct: signal array, launcher pid, worker free/running lists, starting-worker handoff pointer, work-item array, balance divisor.
WorkerInfoData / WorkerInfo — one per slot; threads onto the free or running list via wi_links; publishes wi_tableoid/wi_sharedrel for collision avoidance.
avl_dbase (struct) — a launcher-side database-list entry (adl_datid, adl_next_worker, adl_score).
avw_dbase (struct) — a worker-side database descriptor with adw_frozenxid/adw_minmulti for the wraparound choice.
AutoVacuumWorkItem (struct) + NUM_WORKITEMS (== 256) — the ad-hoc request queue element and its array size.
AutoVacuumShmemSize / AutoVacuumShmemInit — size and initialize the segment; seed the free list with autovacuum_worker_slots slots.
autovac_init — postmaster-time sanity check (warns if track_counts is off).

Launcher

AutoVacLauncherMain — the scheduler entry point and main loop.
ProcessAutoVacLauncherInterrupts — handles SIGHUP (reload + rebuild list), shutdown, barriers.
AutoVacLauncherShutdown — clean exit, clears av_launcherpid.
launcher_determine_sleep — compute nap time until the next due database, clamped to [MIN_AUTOVAC_SLEEPTIME, MAX_AUTOVAC_SLEEPTIME].
rebuild_database_list — build the round-robin list, evenly spaced over autovacuum_naptime, ordered by adl_next_worker.
get_database_list — seqscan pg_database (the launcher’s only transaction).
db_comparator — qsort comparator on adl_score.
do_start_worker — choose the target database (wraparound-first, else least-recently-autovacuumed), park a WorkerInfo in av_startingWorker, signal the postmaster.
launch_worker — wrapper that calls do_start_worker then bumps the chosen database’s adl_next_worker by one naptime.
av_worker_available — free slots vs. reserved (worker_slots - max_workers).
avl_sigusr2_handler / AutoVacWorkerFailed — worker-up/finished and fork-failure signaling.
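
Two of the smaller launcher helpers listed above reduce to simple arithmetic, sketched here under simplifying assumptions. The nap clamp is expressed in whole milliseconds rather than the source's time representation, and the function names `clamp_nap_ms` and `worker_available` are hypothetical; only the bounds (100 ms, 300 s) and the `worker_slots - max_workers` reserve come from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NAP_MS 100L        /* mirrors MIN_AUTOVAC_SLEEPTIME (100.0 ms) */
#define MAX_NAP_MS 300000L     /* mirrors MAX_AUTOVAC_SLEEPTIME (300 s) */

/* Clamp a computed nap time into [MIN_NAP_MS, MAX_NAP_MS]. */
static long
clamp_nap_ms(long nap_ms)
{
    if (nap_ms < MIN_NAP_MS)
        return MIN_NAP_MS;
    if (nap_ms > MAX_NAP_MS)
        return MAX_NAP_MS;
    return nap_ms;
}

/* A worker may start only if the free list holds more slots than the
 * reserve, where the reserve is the part of the fixed pool that the
 * runtime cap currently forbids using. */
static bool
worker_available(int free_slots, int worker_slots, int max_workers)
{
    int reserved;

    assert(free_slots >= 0 && worker_slots >= 0 && max_workers >= 0);

    reserved = worker_slots - max_workers;
    if (reserved < 0)
        reserved = 0;           /* cap above pool size: nothing is reserved */

    return free_slots > reserved;
}
```

For example, with 16 slots and a cap of 3, the reserve is 13, so a fourth worker is refused once only 13 slots remain free. Raising the cap shrinks the reserve without touching the slot array, which is why the cap can change on reload while the pool size cannot.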

Worker

AutoVacWorkerMain — worker entry point; claims the parked slot, connects to the database, calls do_autovacuum.
FreeWorkerInfo — on_shmem_exit callback returning the slot to the free list and waking the launcher.
do_autovacuum — two-pass pg_class scan, orphan-temp-table cleanup, the per-table claim/recheck/vacuum loop, work-item drain.
extract_autovac_opts — pull AutoVacOpts out of a pg_class reloptions tuple.
relation_needs_vacanalyze — the threshold + freeze-age decision; outputs dovacuum/doanalyze/wraparound.
recheck_relation_needs_vacanalyze / table_recheck_autovac — re-evaluate under fresh stats after claiming, building an autovac_table work descriptor or NULL.
autovacuum_do_vac_analyze — hand the autovac_table to the shared vacuum() entry point.
perform_work_item / autovac_report_workitem — drain and report a queued AutoVacuumWorkItem.
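
The dead-tuple part of the decision made by relation_needs_vacanalyze has the form base + scale × reltuples with an optional maximum. The sketch below shows only that part, assuming a negative maximum means "no clamp"; the insert-driven and analyze thresholds are left out, and `vacuum_threshold` and `wants_vacuum` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold of dead tuples above which a table qualifies for vacuum.
 * A negative max_thresh disables the clamp. */
static double
vacuum_threshold(double base_thresh, double scale_factor,
                 double max_thresh, double reltuples)
{
    double thresh;

    /* assumption of this sketch: the caller passes a known row estimate */
    assert(reltuples >= 0.0);

    thresh = base_thresh + scale_factor * reltuples;
    if (max_thresh >= 0.0 && thresh > max_thresh)
        thresh = max_thresh;
    return thresh;
}

/* A forced (anti-wraparound) vacuum bypasses the threshold entirely. */
static bool
wants_vacuum(bool force_vacuum, double dead_tuples, double thresh)
{
    return force_vacuum || dead_tuples > thresh;
}
```

With a base of 50 and a scale factor of 0.25, a table of 1,000 rows qualifies once it has more than 300 dead tuples, while a table of 10^12 rows would need 250,000,000,050 without a clamp and 100,000,000 with the clamp set to that value.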

Cost balancing

VacuumUpdateCosts — recompute vacuum_cost_delay/vacuum_cost_limit for this worker (or a manual VACUUM); called at vacuum setup and after reloads.
AutoVacuumUpdateCostLimit — divide the global limit by av_nworkersForBalance (unless opted out via wi_dobalance or a per-table cost reloption).
autovac_recalculate_workers_for_balance — recount balancing workers and write av_nworkersForBalance.
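
The balancing arithmetic named in the list above can be sketched in a few lines. This is a simplified model: the shared divisor is a plain int rather than an atomic, and `count_balancing_workers` and `balanced_cost_limit` are hypothetical names, with `do_balance` standing in for the per-worker wi_dobalance flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Recount step: how many running workers take part in balancing. */
static int
count_balancing_workers(const bool *do_balance, size_t nworkers)
{
    int count = 0;

    for (size_t i = 0; i < nworkers; i++)
        if (do_balance[i])
            count++;
    return count;
}

/* Per-worker limit: an opted-out worker keeps the full limit; everyone
 * else gets an equal share, never less than 1. */
static int
balanced_cost_limit(int base_limit, bool do_balance, int nworkers_for_balance)
{
    int share;

    if (!do_balance)
        return base_limit;

    /* a balancing worker always counts itself, so the divisor is >= 1 */
    assert(nworkers_for_balance >= 1);

    share = base_limit / nworkers_for_balance;
    return share > 1 ? share : 1;
}
```

With a limit of 200 and three balancing workers, each receives 66; a fourth worker that opted out keeps 200 and does not enlarge the divisor. The floor of 1 keeps a worker from being throttled to a zero budget when many workers share a small limit.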

Public surface and requests

AutoVacuumingActive — is the daemon configured on?
AutoVacuumRequestWork — post an AutoVacuumWorkItem (returns false if the queue is full).
check_autovacuum_work_mem / check_av_worker_gucs — GUC check hooks.

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`MIN_AUTOVAC_SLEEPTIME` (100.0 ms)	`postmaster/autovacuum.c`	139
`MAX_AUTOVAC_SLEEPTIME` (300 s)	`postmaster/autovacuum.c`	140
`avl_dbase` (struct)	`postmaster/autovacuum.c`	171
`avw_dbase` (struct)	`postmaster/autovacuum.c`	180
`WorkerInfoData` (struct)	`postmaster/autovacuum.c`	231
`AutoVacuumWorkItem` (struct)	`postmaster/autovacuum.c`	263
`NUM_WORKITEMS` (== 256)	`postmaster/autovacuum.c`	273
`AutoVacuumShmemStruct` (struct)	`postmaster/autovacuum.c`	293
`AutoVacLauncherMain`	`postmaster/autovacuum.c`	368
`ProcessAutoVacLauncherInterrupts`	`postmaster/autovacuum.c`	747
`AutoVacLauncherShutdown`	`postmaster/autovacuum.c`	792
`launcher_determine_sleep`	`postmaster/autovacuum.c`	809
`rebuild_database_list`	`postmaster/autovacuum.c`	893
`db_comparator`	`postmaster/autovacuum.c`	1072
`do_start_worker`	`postmaster/autovacuum.c`	1090
`launch_worker`	`postmaster/autovacuum.c`	1302
`AutoVacWorkerFailed`	`postmaster/autovacuum.c`	1354
`avl_sigusr2_handler`	`postmaster/autovacuum.c`	1361
`AutoVacWorkerMain`	`postmaster/autovacuum.c`	1376
`FreeWorkerInfo`	`postmaster/autovacuum.c`	1606
`VacuumUpdateCosts`	`postmaster/autovacuum.c`	1654
`AutoVacuumUpdateCostLimit`	`postmaster/autovacuum.c`	1723
`autovac_recalculate_workers_for_balance`	`postmaster/autovacuum.c`	1769
`get_database_list`	`postmaster/autovacuum.c`	1809
`do_autovacuum`	`postmaster/autovacuum.c`	1885
`perform_work_item`	`postmaster/autovacuum.c`	2605
`extract_autovac_opts`	`postmaster/autovacuum.c`	2719
`table_recheck_autovac`	`postmaster/autovacuum.c`	2749
`recheck_relation_needs_vacanalyze`	`postmaster/autovacuum.c`	2900
`relation_needs_vacanalyze`	`postmaster/autovacuum.c`	2967
`autovacuum_do_vac_analyze`	`postmaster/autovacuum.c`	3173
`autovac_report_workitem`	`postmaster/autovacuum.c`	3248
`AutoVacuumingActive`	`postmaster/autovacuum.c`	3288
`AutoVacuumRequestWork`	`postmaster/autovacuum.c`	3300
`autovac_init`	`postmaster/autovacuum.c`	3342
`AutoVacuumShmemSize`	`postmaster/autovacuum.c`	3359
`AutoVacuumShmemInit`	`postmaster/autovacuum.c`	3378
`av_worker_available`	`postmaster/autovacuum.c`	3449
`AutoVacuumWorkItemType` (enum)	`include/postmaster/autovacuum.h`	23

Source verification (as of 2026-06-05)

Verified facts

The worker pool is sized by autovacuum_worker_slots at startup and capped at runtime by autovacuum_max_workers. Verified in AutoVacuumShmemInit (the free list is seeded with autovacuum_worker_slots entries) and av_worker_available (which subtracts autovacuum_max_workers from autovacuum_worker_slots to compute a reserve) on 2026-06-05. This two-parameter split is the PG17→PG18 change that lets an operator raise autovacuum_max_workers with a reload instead of a restart; shared memory still cannot grow, so autovacuum_worker_slots is the immutable ceiling.
The launcher dispatches at most one worker at a time and waits for it to claim its slot. Verified in AutoVacLauncherMain — when av_startingWorker is non-NULL the launcher sets can_launch = false and will reclaim the slot only after Min(autovacuum_naptime, 60) seconds, logging “autovacuum worker took too long to start; canceled.” The handshake is: launcher parks the slot, signals the postmaster, the forked worker claims the slot in AutoVacWorkerMain and clears av_startingWorker, then signals the launcher via SIGUSR2.
The vacuum threshold is base + scale × reltuples, optionally clamped by a maximum. Verified in relation_needs_vacanalyze: vacthresh = vac_base_thresh + vac_scale_factor * reltuples, then if (vac_max_thresh >= 0 && vacthresh > vac_max_thresh) vacthresh = vac_max_thresh. The autovacuum_vacuum_max_threshold ceiling is a PG18 addition (default 100,000,000) so a very large table’s threshold stops scaling without bound; -1 disables the clamp. The three driving counters are dead_tuples, ins_since_vacuum, and mod_since_analyze from PgStat_StatTabEntry.
Anti-wraparound vacuum runs even when autovacuum is disabled for a table. Verified in relation_needs_vacanalyze: the early-return guard is if (!av_enabled && !force_vacuum), so a forced table is never skipped. Independently confirmed in do_autovacuum’s per-table loop, where the config-reload handler explicitly refuses to bail out on a newly-disabled autovacuum, with the in-source comment that the worker “might be a for-wraparound emergency worker.”
XID wraparound outranks MultiXact wraparound, which outranks the ordinary least-recently-vacuumed choice, at database-selection time. Verified in do_start_worker: the loop latches for_xid_wrap on the first XID-endangered database and thereafter continues past every not-at-risk database; only if no XID risk is found does the MultiXactIdPrecedes branch run; only if neither fires does the last_autovac_time comparison choose. Among endangered databases the oldest adw_frozenxid (resp. adw_minmulti) wins.
The per-worker cost limit is the global limit divided by av_nworkersForBalance. Verified in AutoVacuumUpdateCostLimit: vacuum_cost_limit = Max(vacuum_cost_limit / nworkers_for_balance, 1). The divisor is recomputed in autovac_recalculate_workers_for_balance by counting running workers whose wi_dobalance flag is set, and it is read atomically (pg_atomic_read_u32) on the worker’s hot path without taking AutovacuumLock. A worker with per-table cost reloptions clears wi_dobalance and is excluded from both the divisor and the division.
The ad-hoc work-item queue is a flat 256-slot array; a full queue silently drops the request. Verified in AutoVacuumRequestWork (NUM_WORKITEMS == 256; returns false if no free slot is found) and the av_workItems[NUM_WORKITEMS] field of AutoVacuumShmemStruct. The only in-core producer at REL_18 is AVW_BRINSummarizeRange — the AutoVacuumWorkItemType enum in autovacuum.h has exactly that one member.
The launcher uses a transaction in exactly one place, only to read pg_database. Verified by the header comment on get_database_list (“this is the only function in which the autovacuum launcher uses a transaction”) and by AutoVacLauncherMain calling InitPostgres(NULL, InvalidOid, ...), which attaches it to no specific database.
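
The database-selection precedence verified in the list above (XID risk, then MultiXact risk, then least recently autovacuumed) can be sketched as a ranking. This is a restructured illustration, not the shape of the loop in do_start_worker: it assumes ages are precomputed as plain integers where larger means older, it ignores the skips for databases without stats or processed recently, and `DbCandidate`, `RiskClass`, `classify`, and `choose_database` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DbCandidate
{
    uint32_t datid;
    uint32_t xid_age;               /* age of the oldest unfrozen xid */
    uint32_t multi_age;             /* age of the oldest multixact */
    int64_t  last_autovac_time;     /* smaller = vacuumed longer ago */
} DbCandidate;

typedef enum RiskClass { RISK_NONE = 0, RISK_MULTI = 1, RISK_XID = 2 } RiskClass;

static RiskClass
classify(const DbCandidate *db, uint32_t xid_limit, uint32_t multi_limit)
{
    if (db->xid_age > xid_limit)
        return RISK_XID;
    if (db->multi_age > multi_limit)
        return RISK_MULTI;
    return RISK_NONE;
}

/* Within one class: the oldest id wins among endangered databases,
 * otherwise the least recently autovacuumed database wins. */
static bool
better_within_class(const DbCandidate *a, const DbCandidate *b, RiskClass cls)
{
    switch (cls)
    {
        case RISK_XID:
            return a->xid_age > b->xid_age;
        case RISK_MULTI:
            return a->multi_age > b->multi_age;
        default:
            return a->last_autovac_time < b->last_autovac_time;
    }
}

/* Returns the chosen database, or NULL when the list is empty. */
static const DbCandidate *
choose_database(const DbCandidate *dbs, size_t n,
                uint32_t xid_limit, uint32_t multi_limit)
{
    const DbCandidate *best = NULL;
    RiskClass best_cls = RISK_NONE;

    assert(dbs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        RiskClass cls = classify(&dbs[i], xid_limit, multi_limit);

        /* a higher class always displaces a lower one */
        if (best == NULL || cls > best_cls ||
            (cls == best_cls && better_within_class(&dbs[i], best, cls)))
        {
            best = &dbs[i];
            best_cls = cls;
        }
    }
    return best;
}
```

Ranking by class first reproduces the effect of the latching flags: once any XID-endangered database has been seen, no database of a lower class can be chosen, regardless of its position in the list.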

Open questions

The rebuild_database_list initial hash size is the literal 20, flagged /* magic number here FIXME */ in source. Whether this ever matters for clusters with thousands of databases (the hash simply grows past 20) or is purely cosmetic is unverified. Investigation path: measure rebuild_database_list cost on a cluster with 10k+ databases and check whether the dynahash resize shows up; trace the FIXME through git blame for any prior discussion.
The fork-failure retry has no cap. AutoVacLauncherMain’s handling of AutoVacForkFailed sleeps 1 second and re-signals the postmaster indefinitely, with an in-source XXX questioning whether a retry limit makes sense. Under sustained fork failure (e.g., process-table exhaustion) the launcher will keep retrying on this path about once per second. Whether that is benign or a real availability concern is unverified. Investigation path: reproduce by capping the OS process table and observe launcher log volume and CPU.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

InnoDB purge coordinator + purge threads (MySQL, innodb_purge_threads) — InnoDB’s deferred cleanup of delete-marked records and old undo-log versions is driven by a purge coordinator dispatching to purge worker threads inside one process, the in-process analogue of PostgreSQL’s launcher/worker fork model. A comparison would weigh process isolation (PG: a crashed worker cannot corrupt the scheduler) against thread-pool latency (InnoDB: no fork cost per unit of work).
Oracle SMON + automatic maintenance tasks — Oracle’s undo-segment cleanup and its automatic optimizer-statistics gathering are split across SMON and the autotask scheduler windows. Oracle’s use of maintenance windows (time-of-day budgets) instead of PostgreSQL’s continuous statistics-threshold dispatch is the interesting contrast: a calendar policy versus a load-reactive one.
CUBRID dedicated vacuum workers — CUBRID also separates a vacuum master/coordinator from vacuum workers, but drives them from the log (MVCC version cleanup follows the transaction log) rather than from per-table dead-tuple statistics. A side-by-side would clarify what PostgreSQL trades by polling pg_class + cumulative stats versus CUBRID’s log-driven discovery of reclaimable versions. See the CUBRID vacuum analysis in knowledge/code-analysis/cubrid/cubrid-vacuum.md.
The 64-bit XID proposal — the long-running PostgreSQL community effort to widen transaction ids to 64 bits would eliminate the wraparound deadline that forces half of autovacuum’s complexity (the forced path, the most-endangered-first database choice, the harder-to-cancel emergency vacuum). Tracking the design discussion would show how much of do_start_worker and relation_needs_vacanalyze would simplify if the freeze deadline became a space-management optimization rather than a correctness deadline.
Adjacent PostgreSQL docs — the mechanism this scheduler invokes is in postgres-vacuum.md (heap pruning, index cleanup, the cost-delay accounting itself); the freeze semantics and the wraparound math are in postgres-xid-wraparound-freeze.md; the fork mechanism and PMSIGNAL_START_AUTOVAC_WORKER handshake are in postgres-postmaster.md. The statistics this scheduler reads (PgStat_StatTabEntry) are produced by the cumulative stats system (postgres-overview-monitoring-stats.md).

Sources

Raw materials consumed: none. This document was synthesized directly from the REL_18 source tree, so the sources: list is empty.

Textbook chapters:

Database System Concepts (Silberschatz, Korth, Sudarshan, 7th ed.), §18.7 “Multiversion Schemes” — the requirement that old versions be deleted once no transaction can read them; the scheduling of that deletion is what autovacuum automates. Captured in knowledge/research/dbms-general/database-system-concepts.md.
Database Internals (Alex Petrov, 2019), ch. 5 — MVCC and version maintenance as the bounding of the version space; the freeze is PostgreSQL’s specific version-space bound. Captured in knowledge/research/dbms-general/database-internals.md.

Source code (REL_18_STABLE, commit 273fe94, as of 2026-06-05):

src/backend/postmaster/autovacuum.c — the entire subsystem: launcher, workers, scheduling, thresholds, cost balancing, work-item queue, shared memory.
src/include/postmaster/autovacuum.h — public surface (AutoVacuumWorkItemType, the GUC externs, the launcher/worker entry points, the shmem functions).

Adjacent curated docs (cross-references, not duplicated here):

knowledge/code-analysis/postgres/postgres-vacuum.md — the vacuum mechanism this scheduler invokes.
knowledge/code-analysis/postgres/postgres-xid-wraparound-freeze.md — freeze semantics and the wraparound deadline math.
knowledge/code-analysis/postgres/postgres-postmaster.md — the fork model and worker-start signaling.