PostgreSQL Wait Events and Progress Reporting
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every production database engine has to answer a deceptively simple
operational question: “what is each process doing right now, and if it is
not making progress, what is it blocked on?” This is the introspection
or online observability problem. It is distinct from the cumulative
statistics problem (how many index scans has this table seen since the
last reset — covered in the sibling doc postgres-cumulative-stats.md).
Introspection is about an instantaneous, per-process snapshot; cumulative
stats are about monotonically growing counters aggregated over time.
Three properties define the design space for a per-process introspection facility:
- Write cost on the hot path. The instrumentation is poked at the exact moment a backend is about to do something interesting: acquire a lock, issue an I/O, enter the main loop. If publishing “I am now waiting on X” costs a lock acquisition or a system call, the instrumentation perturbs the very thing it measures (a probe effect). The design must make the common write a single un-contended, un-synchronised store.
- Read consistency without blocking the writer. A monitoring query (SELECT * FROM pg_stat_activity) reads thousands of other backends’ state. It must never take a lock that the observed backend also needs, or it would turn a read-only diagnostic into a source of contention or, worse, a deadlock. The classic answer is a seqlock: the writer bumps a counter before and after each mutation; the reader copies the record, re-reads the counter, and retries if it changed. Writes stay cheap; reads pay the cost of occasional retries.
- Vocabulary management. “Waiting on X” needs a controlled vocabulary of what X can be, partitioned into classes (lock wait, I/O wait, IPC wait, idle-in-main-loop, …) so an operator can reason at the right altitude. As the engine grows, that vocabulary grows to hundreds of entries; hand-maintaining the enum, the C name-lookup, and the user-facing documentation in three places invites drift, so mature engines generate all three from one declarative table.
A fourth concern is progress reporting for long-running maintenance commands (VACUUM, CREATE INDEX, CLUSTER, a base backup). Unlike a wait event — which is a single categorical value — progress is a small vector of integers (phase, blocks scanned, tuples processed, total to do) whose meaning is command-specific. The engine needs a generic transport for that vector and a per-command convention layered on top.
Database System Concepts (Silberschatz, Korth, Sudarshan) frames the DBA’s monitoring loop as observing “the state of currently executing transactions and the resources they hold or await,” and notes that lock waits and I/O waits dominate the diagnosis of stalled OLTP workloads. Architecture of a Database System (Hellerstein, Stonebraker, Hamilton), in its “Process Model” discussion, observes that a multi-process engine must publish per-worker state into shared memory precisely because no single process has a global view — the monitoring process must read the workers’ self-reported state rather than interrogate them. Database Internals (Petrov) connects this to the lock-free / wait-free literature: single-writer/multi-reader records published with memory barriers are the standard mechanism for low-overhead telemetry in concurrent systems.
PostgreSQL’s answer assembles all four pieces: a 4-byte wait_event_info
word for the categorical wait, a table-driven codegen
(wait_event_names.txt) for the vocabulary, a per-backend
PgBackendStatus slot guarded by a st_changecount seqlock for the rest
of the activity snapshot, and a 20-element st_progress_param[] vector
inside that same slot for command progress.
Common DBMS Design
This section names the recurring engineering patterns that engines adopt for per-process introspection, so PostgreSQL’s specifics read as choices within a shared space.
Single-writer shared-memory status slot per worker
Every multi-process (or multi-threaded) engine reserves one fixed slot of
shared memory per worker, indexed by a stable worker id. The worker is the
sole writer of its slot; any other process is a reader. This
single-writer invariant is what makes cheap publication possible — there
is never write-write contention on a slot, only the comparatively rare
read-during-write race, which the seqlock handles. PostgreSQL sizes the
array at MaxBackends + NUM_AUXILIARY_PROCS and indexes it by
ProcNumber.
The seqlock (changecount) protocol
A reader cannot take a lock the writer needs, so the canonical lock-free publication uses an even/odd version counter:
- Writer: count++ (now odd) → write barrier → mutate fields → write barrier → count++ (now even).
- Reader: read count (must be even) → read barrier → copy fields → read barrier → re-read count; if unchanged and even, the copy is consistent; otherwise retry.
The odd value signals “mutation in flight.” The memory barriers stop the CPU or compiler from reordering the counter bumps around the field writes. This is exactly Linux’s seqlock applied to a per-backend record.
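The writer and reader steps above can be sketched with C11 atomics. This is a minimal illustration, not PostgreSQL's code: DemoSlot, its two payload fields, and the demo_* function names are all hypothetical, and the payload is declared atomic with relaxed access only so that the sketch stays free of data races under the C11 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical two-field status record; not PostgreSQL's PgBackendStatus. */
typedef struct DemoSlot
{
    atomic_uint changecount;    /* odd while a write is in flight */
    atomic_long blocks_done;
    atomic_long blocks_total;
} DemoSlot;

/* Writer side: bump to odd, mutate, bump back to even. */
static void
demo_slot_write(DemoSlot *slot, long done, long total)
{
    unsigned c = atomic_load_explicit(&slot->changecount, memory_order_relaxed);

    assert((c & 1) == 0);       /* single writer: previous write has finished */
    atomic_store_explicit(&slot->changecount, c + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* the first "write barrier" */
    atomic_store_explicit(&slot->blocks_done, done, memory_order_relaxed);
    atomic_store_explicit(&slot->blocks_total, total, memory_order_relaxed);
    /* The release store orders the field writes before the even counter. */
    atomic_store_explicit(&slot->changecount, c + 2, memory_order_release);
}

/* One read attempt; returns false if the copy may be torn. */
static bool
demo_slot_try_read(DemoSlot *slot, long *done, long *total)
{
    unsigned before = atomic_load_explicit(&slot->changecount, memory_order_acquire);

    *done = atomic_load_explicit(&slot->blocks_done, memory_order_relaxed);
    *total = atomic_load_explicit(&slot->blocks_total, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* the second "read barrier" */

    unsigned after = atomic_load_explicit(&slot->changecount, memory_order_relaxed);

    return before == after && (before & 1) == 0;
}

/* Reader side: retry until a consistent copy is observed. */
static void
demo_slot_read(DemoSlot *slot, long *done, long *total)
{
    while (!demo_slot_try_read(slot, done, total))
        ;                       /* a real reader would also check for interrupts */
}
```

The writer never waits on a reader: a reader that overlaps a write simply discards its copy and tries again.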
A separate “free” instrumentation word for the hottest signal
The very hottest signal, “I just started waiting / I just stopped
waiting”, is poked so often (every lock, every buffer I/O) that even the
seqlock’s barriers are too expensive. Engines special-case it as a single
machine-word store that is atomic by virtue of alignment, needing no
counter and no barrier, accepting that a reader may observe a momentarily
stale value. PostgreSQL splits the wait event out of PgBackendStatus
entirely: it lives as MyProc->wait_event_info, written by a bare
*(volatile uint32 *) store.
Class-tagged categorical codes
Wait reasons are encoded as an integer whose high bits name a class (lock, I/O, IPC, timeout, client, activity) and whose low bits name the specific event within the class. A reader masks off the class to render the type column and the event column separately. This keeps the wire representation a single integer while preserving a two-level taxonomy.
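As a small sketch of the encoding, the helpers below pack a class and an event id into one 32-bit word and split them apart again. The bit widths (class in the top byte, id in the low 16 bits), the DEMO_* constants, and the function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: class in the top byte, event id in the low 16 bits. */
#define DEMO_CLASS_MASK 0xFF000000u
#define DEMO_ID_MASK    0x0000FFFFu
#define DEMO_CLASS_IO   0x0A000000u     /* hypothetical class constant */

/* Combine a class constant and a per-class event id into one word. */
static uint32_t
demo_wait_pack(uint32_t class_bits, uint16_t event_id)
{
    assert((class_bits & ~DEMO_CLASS_MASK) == 0);   /* class must fit the top byte */
    return class_bits | event_id;
}

/* Recover the class (for the "type" column). */
static uint32_t
demo_wait_class(uint32_t info)
{
    return info & DEMO_CLASS_MASK;
}

/* Recover the event id within the class (for the "event" column). */
static uint16_t
demo_wait_id(uint32_t info)
{
    return (uint16_t) (info & DEMO_ID_MASK);
}
```

Packing DEMO_CLASS_IO with id 7 gives 0x0A000007, and the two accessors return the original class and id.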
Table-driven vocabulary codegen
Rather than hand-write the enum, the integer-to-name function, and the
documentation table, mature engines keep one declarative list and generate
all artifacts. This guarantees the pg_stat_activity name, the C enum
symbol, and the manual entry never drift.
Generic progress vector with per-command semantics
Progress for maintenance commands is a fixed-width array of integers published into the same per-worker slot. The transport is generic; a per-command header file assigns meaning to each slot index (slot 0 = phase, slot 1 = heap blocks total, …) and a SQL view maps the raw integers to friendly columns.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Per-worker status slot | BackendStatusArray[ProcNumber], type PgBackendStatus (backend_status.c) |
| Sole-writer pointer to own slot | MyBEEntry |
| Seqlock version counter | st_changecount + PGSTAT_BEGIN/END_WRITE_ACTIVITY macros |
| Reader retry loop | pgstat_begin_read_activity / pgstat_read_activity_complete |
| Free hot-path word | MyProc->wait_event_info, written by pgstat_report_wait_start |
| Class bits / event bits | WAIT_EVENT_CLASS_MASK (0xFF000000) / WAIT_EVENT_ID_MASK (0x0000FFFF) |
| Wait class table | wait_classes.h (PG_WAIT_LWLOCK … PG_WAIT_INJECTIONPOINT) |
| Vocabulary source | wait_event_names.txt |
| Vocabulary codegen | generate-wait_event_types.pl → wait_event_types.h, pgstat_wait_event.c |
| Integer → type string | pgstat_get_wait_event_type |
| Integer → event string | pgstat_get_wait_event |
| Current-activity string | st_activity_raw (raw, possibly mid-multibyte truncated) → pgstat_clip_activity |
| Per-session snapshot copy | pgstat_read_current_status → localBackendStatusTable |
| Progress command tag | st_progress_command (ProgressCommandType) |
| Progress vector | st_progress_param[PGSTAT_NUM_PROGRESS_PARAM] (20 slots) |
| Parallel-worker progress relay | PqMsg_Progress message handled in parallel.c |
PostgreSQL’s Approach
PostgreSQL splits a backend’s “what am I doing” state across two shared-memory homes, on purpose:
- The wait event lives in MyProc->wait_event_info (a PGPROC field), written by an inline single-word store with no changecount and no lock. This is the hottest signal, so it is made as close to free as a memory store can be.
- Everything else (session state such as active / idle / idle in transaction, the current query text, the application name, the query and plan identifiers, and the progress vector) lives in PgBackendStatus, the per-backend slot guarded by the st_changecount seqlock.
pg_stat_activity is a join of the two: its wait_event_type /
wait_event columns come from MyProc->wait_event_info decoded by
wait_event.c, while state, query, application_name,
query_id, etc. come from PgBackendStatus decoded by
backend_status.c. The progress views (pg_stat_progress_vacuum,
pg_stat_progress_create_index, …) read st_progress_command plus
st_progress_param[] from that same PgBackendStatus slot.
flowchart TB
subgraph Backend["Observed backend (sole writer)"]
proc["MyProc->wait_event_info<br/>(single uint32, lock-free store)"]
be["MyBEEntry = &BackendStatusArray[ProcNumber]<br/>PgBackendStatus slot"]
be --> st["st_state / st_activity_raw<br/>st_query_id / st_plan_id"]
be --> prog["st_progress_command<br/>st_progress_param[20]"]
end
subgraph Monitor["Monitoring backend (reader)"]
view["pg_stat_activity /<br/>pg_stat_progress_*"]
end
proc -. "pgstat_get_wait_event_type / _event" .-> view
be -. "pgstat_read_current_status<br/>(changecount retry loop)" .-> view
classDef w fill:#eef,stroke:#446;
class proc,be,st,prog w;
Figure 1 — The two homes of per-backend live state. The wait event is a
free single-word store in PGPROC; the rest is the seqlock-guarded
PgBackendStatus slot. A monitoring backend decodes the first directly
and snapshots the second through the changecount retry loop.
The wait event word: class byte + event id
A wait event is a single uint32. The top byte is the class; the low two
bytes are the event id within that class. The class constants live in
wait_classes.h:

```c
// wait class constants: include/utils/wait_classes.h
#define PG_WAIT_LWLOCK          0x01000000U
#define PG_WAIT_LOCK            0x03000000U
#define PG_WAIT_BUFFERPIN       0x04000000U
#define PG_WAIT_ACTIVITY        0x05000000U
#define PG_WAIT_CLIENT          0x06000000U
#define PG_WAIT_EXTENSION       0x07000000U
#define PG_WAIT_IPC             0x08000000U
#define PG_WAIT_TIMEOUT         0x09000000U
#define PG_WAIT_IO              0x0A000000U
#define PG_WAIT_INJECTIONPOINT  0x0B000000U
```

wait_event.c masks the word with two constants to split class from id:

```c
// class/id masks: utils/activity/wait_event.c
#define WAIT_EVENT_CLASS_MASK   0xFF000000
#define WAIT_EVENT_ID_MASK      0x0000FFFF
```

So 0x0A000007 reads as class PG_WAIT_IO (0x0A000000), event id 7. Bits 16 to 23 are covered by neither mask.
The lock-free store that publishes it is a textbook example of the
“free instrumentation word” pattern — note that it does not check
pgstat_track_activities, because the check would cost more than the store
it guards:
```c
// pgstat_report_wait_start: include/utils/wait_event.h
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
    /*
     * Since this is a four-byte field which is always read and written as
     * four-bytes, updates are atomic.
     */
    *(volatile uint32 *) my_wait_event_info = wait_event_info;
}
```

my_wait_event_info initially points at a process-local variable
(local_my_wait_event_info) so the store is safe even before MyProc
exists; pgstat_set_wait_event_storage later redirects it into shared
memory. pgstat_report_wait_end simply stores 0, and a zero word is the
sentinel meaning “not waiting.”
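The redirectable-pointer arrangement described above can be sketched in a few lines. All demo_* names are hypothetical stand-ins for the PostgreSQL symbols, and the "shared" word is just a caller-supplied variable rather than real shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Process-local fallback, valid before any shared slot exists. */
static uint32_t demo_local_word;

/* Current publish target; starts out pointing at the local fallback. */
static volatile uint32_t *demo_wait_word = &demo_local_word;

/* Publish "now waiting on info": a single aligned 32-bit store. */
static inline void
demo_wait_start(uint32_t info)
{
    *demo_wait_word = info;
}

/* Publish "not waiting": zero is the sentinel. */
static inline void
demo_wait_end(void)
{
    *demo_wait_word = 0;
}

/* Redirect publication into a shared word (for example, a per-process slot). */
static void
demo_set_storage(uint32_t *shared_word)
{
    assert(shared_word != NULL);
    demo_wait_word = shared_word;
}

/* Point back at the local fallback, for example at process exit. */
static void
demo_reset_storage(void)
{
    demo_wait_word = &demo_local_word;
}
```

Because the pointer always refers to valid storage, the hot-path store needs no "is shared memory ready yet?" branch.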
Decoding the word: type and event strings
Two functions decode the word for pg_stat_activity.
pgstat_get_wait_event_type masks off the class and returns the type
column string:

```c
// pgstat_get_wait_event_type: utils/activity/wait_event.c (condensed)
const char *
pgstat_get_wait_event_type(uint32 wait_event_info)
{
    uint32 classId;

    if (wait_event_info == 0)
        return NULL;            /* not waiting */

    classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    switch (classId)
    {
        case PG_WAIT_LWLOCK: return "LWLock";
        case PG_WAIT_LOCK:   return "Lock";
        case PG_WAIT_IO:     return "IO";
        case PG_WAIT_IPC:    return "IPC";
        /* ... Activity, Client, Timeout, BufferPin, Extension ... */
        default:             return "???";
    }
}
```

pgstat_get_wait_event then dispatches on the class to a per-class
name-lookup. Note that LWLock and Lock are handled by their own code
(GetLWLockIdentifier, GetLockNameFromTagType; see
postgres-lwlock-spinlock.md), while the generated classes route through
the codegen’d pgstat_get_wait_* helpers:

```c
// pgstat_get_wait_event: utils/activity/wait_event.c (condensed)
const char *
pgstat_get_wait_event(uint32 wait_event_info)
{
    uint32 classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    uint16 eventId = wait_event_info & WAIT_EVENT_ID_MASK;

    switch (classId)
    {
        case PG_WAIT_LWLOCK:
            return GetLWLockIdentifier(classId, eventId);   /* own code */
        case PG_WAIT_LOCK:
            return GetLockNameFromTagType(eventId);         /* own code */
        case PG_WAIT_EXTENSION:
        case PG_WAIT_INJECTIONPOINT:
            return GetWaitEventCustomIdentifier(wait_event_info);
        case PG_WAIT_IO:
            return pgstat_get_wait_io((WaitEventIO) wait_event_info);
        /* ... IPC, Activity, Client, Timeout, BufferPin ... */
    }
}
```

The pgstat_get_wait_io, pgstat_get_wait_ipc, etc. helpers are not
hand-written. They are generated, as the final #include at the bottom
of wait_event.c admits:

```c
// tail of utils/activity/wait_event.c
#include "utils/pgstat_wait_event.c"
```

Custom wait events for extensions
Extensions register their own wait events (class PG_WAIT_EXTENSION) by
name; the registry lives in two shared hash tables (by-info and by-name)
plus a spinlock-guarded counter. WaitEventExtensionNew is the public
entry point, delegating to WaitEventCustomNew:

```c
// WaitEventCustomNew: utils/activity/wait_event.c (condensed)
static uint32
WaitEventCustomNew(uint32 classId, const char *wait_event_name)
{
    /* fast path: name already registered? return its id */
    LWLockAcquire(WaitEventCustomLock, LW_SHARED);
    entry_by_name = hash_search(WaitEventCustomHashByName, wait_event_name,
                                HASH_FIND, &found);
    LWLockRelease(WaitEventCustomLock);
    if (found)
        return entry_by_name->wait_event_info;

    /* slow path: take exclusive, recheck, allocate a fresh event id */
    LWLockAcquire(WaitEventCustomLock, LW_EXCLUSIVE);
    /* ... recheck ... */
    SpinLockAcquire(&WaitEventCustomCounter->mutex);
    eventId = WaitEventCustomCounter->nextId++;
    SpinLockRelease(&WaitEventCustomCounter->mutex);

    wait_event_info = classId | eventId;    /* fold class into the id */
    /* register in both hash directions, then release the LWLock */
}
```

This is the only part of the wait-event machinery that needs real locking, because it mutates a shared registry rather than a single backend’s own word. The double-checked locking (shared probe, then exclusive recheck) is a deliberate optimisation for the common “already registered” case.
The vocabulary codegen: wait_event_names.txt
The controlled vocabulary of built-in wait events is a single
tab-separated file, wait_event_names.txt. Each line gives the enum/event
name and a documentation sentence, grouped under Section: ClassName - WaitEvent<Class> headers:

```text
# wait_event_names.txt (excerpt): class headers + entries
Section: ClassName - WaitEventIO
AIO_IO_COMPLETION   "Waiting for another process to complete IO."
BUFFILE_READ        "Waiting for a read from a buffered file."
CONTROL_FILE_SYNC   "Waiting for the pg_control file to reach durable storage."

Section: ClassName - WaitEventIPC
APPEND_READY        "Waiting for subplan nodes of an Append plan node to be ready."
BACKEND_TERMINATION "Waiting for the termination of another backend."
```

generate-wait_event_types.pl reads this file at build time and emits
three artifacts from it: wait_event_types.h (the per-class C enums),
pgstat_wait_event.c (the pgstat_get_wait_<class> lookups #included at
the tail of wait_event.c), and the SGML documentation table. The script
turns each SCREAMING_SNAKE token into both a WAIT_EVENT_ enum symbol
and a CamelCase display string:

```perl
# generate-wait_event_types.pl (condensed): name + enum derivation
my $waiteventenumname = "WAIT_EVENT_$waiteventname";    # WAIT_EVENT_BUFFILE_READ
# CamelCase the display name (LWLock/Lock classes are left verbatim)
my @waiteventparts = split("_", $waiteventname);
foreach my $waiteventpart (@waiteventparts)
{
    $waiteventdescription .= substr($waiteventpart, 0, 1)
      . lc(substr($waiteventpart, 1));    # BUFFILE + READ gives "BuffileRead"
}
```

The enum’s first member in each class is anchored to the class constant, so the low bits become the event id automatically:
```perl
# generate-wait_event_types.pl (condensed): enum base = class constant
printf $h "typedef enum\n{\n";
$pg_wait_class = "PG_WAIT_" . $lastuc;                  # e.g. PG_WAIT_IO
printf $h "\t%s = %s", $wev->[0], $pg_wait_class;       # first = PG_WAIT_IO
# subsequent members just ", NEXT_NAME"; C auto-increments the id
```

So the generated WaitEventIO enum starts at PG_WAIT_IO (0x0A000000)
and each subsequent event is +1, which is exactly why masking with
WAIT_EVENT_ID_MASK recovers the per-class ordinal. Four classes are
excluded from C generation (WaitEventExtension,
WaitEventInjectionPoint, WaitEventLWLock, WaitEventLock) because
their names are resolved dynamically (extension/injection-point registry)
or by their own subsystems (LWLock, Lock); the script emits only SGML docs
for those.
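To make the generated shape concrete, here is a hand-written sketch of what an enum and its name lookup could look like for the three IO events in the excerpt. The DEMO_* names are hypothetical, and because only three events are listed, the resulting ids are illustrative and will not match a real build.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PG_WAIT_IO 0x0A000000U     /* class constant for the IO class */
#define DEMO_ID_MASK    0x0000FFFFU

/* First member pinned to the class constant; C numbers the rest +1 each. */
typedef enum
{
    DEMO_WAIT_EVENT_AIO_IO_COMPLETION = DEMO_PG_WAIT_IO,
    DEMO_WAIT_EVENT_BUFFILE_READ,
    DEMO_WAIT_EVENT_CONTROL_FILE_SYNC
} DemoWaitEventIO;

/* Masking off the class leaves the ordinal within the class. */
static_assert((DEMO_WAIT_EVENT_BUFFILE_READ & DEMO_ID_MASK) == 1,
              "second member has event id 1");

/* Shape of a generated name lookup: one case per enum member. */
static const char *
demo_get_wait_io(DemoWaitEventIO w)
{
    switch (w)
    {
        case DEMO_WAIT_EVENT_AIO_IO_COMPLETION:
            return "AioIoCompletion";
        case DEMO_WAIT_EVENT_BUFFILE_READ:
            return "BuffileRead";
        case DEMO_WAIT_EVENT_CONTROL_FILE_SYNC:
            return "ControlFileSync";
    }
    return "unknown wait event";
}
```

Because both the enum and the switch come from the same list, adding a line to the list is the only step needed to add an event.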
flowchart LR
txt["wait_event_names.txt<br/>(declarative table)"]
pl["generate-wait_event_types.pl<br/>(build step)"]
h["wait_event_types.h<br/>(per-class enums)"]
c["pgstat_wait_event.c<br/>(name lookups)"]
sgml["wait_event_types.sgml<br/>(docs table)"]
txt --> pl
pl --> h
pl --> c
pl --> sgml
h -. "WaitEventIO enum" .-> wec["wait_event.c<br/>pgstat_get_wait_event"]
c -. "#include at tail" .-> wec
classDef g fill:#efe,stroke:#484;
class txt,pl,h,c,sgml g;
Figure 2 — One declarative table feeds the enum, the C name lookup, and
the documentation. The single source eliminates drift between the
pg_stat_activity string, the C enum symbol, and the manual.
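The display-name derivation performed by the generator (first character of each underscore-separated part kept, the rest lowercased) can be mirrored in C as a rough sketch. The function name demo_camel_case is hypothetical; the real derivation happens in the Perl script at build time.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert "BUFFILE_READ" to "BuffileRead": keep the first character of each
 * underscore-separated part, lowercase the rest, and drop the underscores.
 * Output is truncated to fit outsize and is always NUL-terminated.
 */
static void
demo_camel_case(const char *token, char *out, size_t outsize)
{
    size_t      n = 0;
    int         at_part_start = 1;

    assert(outsize > 0);
    for (const char *p = token; *p != '\0' && n + 1 < outsize; p++)
    {
        if (*p == '_')
        {
            at_part_start = 1;  /* next character begins a new part */
            continue;
        }
        out[n++] = at_part_start ? *p : (char) tolower((unsigned char) *p);
        at_part_start = 0;
    }
    out[n] = '\0';
}
```

Note that part boundaries come only from underscores, so BUFFILE_READ yields "BuffileRead" with a lowercase f.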
The PgBackendStatus slot and the changecount seqlock
The rest of a backend’s live state is the PgBackendStatus struct. The
array is allocated once at postmaster startup, one slot per possible
ProcNumber, with the variable-length strings (st_appname,
st_clienthostname, st_activity_raw) carved out of separate shared
buffers and pointed into:

```c
// PgBackendStatus core fields: include/utils/backend_status.h (condensed)
typedef struct PgBackendStatus
{
    int         st_changecount;     /* seqlock version: odd = write in flight */
    int         st_procpid;         /* slot valid iff st_procpid > 0 */
    BackendType st_backendType;
    TimestampTz st_proc_start_timestamp;
    TimestampTz st_xact_start_timestamp;
    TimestampTz st_activity_start_timestamp;
    TimestampTz st_state_start_timestamp;
    Oid         st_databaseid;
    Oid         st_userid;
    BackendState st_state;          /* STATE_RUNNING / STATE_IDLE / ... */
    char       *st_appname;         /* into BackendAppnameBuffer */
    char       *st_activity_raw;    /* into BackendActivityBuffer; may be mid-mb-truncated */
    ProgressCommandType st_progress_command;
    Oid         st_progress_command_target;
    int64       st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];   /* 20 slots */
    int64       st_query_id;
    int64       st_plan_id;
} PgBackendStatus;
```

A backend reaches its own slot through MyBEEntry, set up by
pgstat_beinit (MyBEEntry = &BackendStatusArray[MyProcNumber]). Every
mutation is bracketed by the seqlock macros. The macros also open and
close a critical section: any error raised between them is promoted to
PANIC, because no unwind path would restore st_changecount to even. The
bracketed region must therefore be short, straight-line, and allocation-free:

```c
// changecount seqlock macros: include/utils/backend_status.h
#define PGSTAT_BEGIN_WRITE_ACTIVITY(beentry) \
    do { \
        START_CRIT_SECTION(); \
        (beentry)->st_changecount++; \
        pg_write_barrier(); \
    } while (0)

#define PGSTAT_END_WRITE_ACTIVITY(beentry) \
    do { \
        pg_write_barrier(); \
        (beentry)->st_changecount++; \
        Assert(((beentry)->st_changecount & 1) == 0); \
        END_CRIT_SECTION(); \
    } while (0)
```

pgstat_report_activity is the canonical writer; the backend calls it
from tcop/postgres.c on every state transition. It does all the
expensive work (timestamp fetch, string length) before entering the
critical section, then performs only stores inside it:
```c
// pgstat_report_activity: utils/activity/backend_status.c (condensed)
void
pgstat_report_activity(BackendState state, const char *cmd_str)
{
    volatile PgBackendStatus *beentry = MyBEEntry;
    /* ... handle track_activities disabled: one final DISABLED update ... */

    /* fetch everything BEFORE the critical section */
    start_timestamp = GetCurrentStatementStartTimestamp();
    if (cmd_str != NULL)
        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
    current_timestamp = GetCurrentTimestamp();
    /* ... accumulate conn active/idle time on state change ... */

    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    beentry->st_state = state;
    beentry->st_state_start_timestamp = current_timestamp;
    if (state == STATE_RUNNING)
    {
        beentry->st_query_id = INT64CONST(0);   /* reset; set later at parse analysis */
        beentry->st_plan_id = INT64CONST(0);
    }
    if (cmd_str != NULL)
    {
        memcpy(beentry->st_activity_raw, cmd_str, len);
        beentry->st_activity_raw[len] = '\0';
        beentry->st_activity_start_timestamp = start_timestamp;
    }
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}
```

Note that st_activity_raw is stored raw, possibly truncated in the middle
of a multi-byte character. Writes are far more frequent than reads, so
the cost of multi-byte-safe clipping is deferred to the reader via
pgstat_clip_activity.
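The read-time clipping idea can be illustrated for the UTF-8 case only. This is a simplified sketch under that assumption: PostgreSQL supports many server encodings and uses its own encoding-aware routines, and the demo_* names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 sequence, judged from its lead byte. */
static size_t
demo_utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: treat as one byte */
}

/*
 * Return the longest prefix length <= limit that does not end in the middle
 * of a multi-byte character. s holds len bytes of UTF-8 text.
 */
static size_t
demo_utf8_clip_len(const char *s, size_t len, size_t limit)
{
    size_t      pos = 0;

    assert(s != NULL);
    if (limit > len)
        limit = len;
    while (pos < limit)
    {
        size_t      n = demo_utf8_seq_len((unsigned char) s[pos]);

        if (pos + n > limit)
            break;              /* next character would be cut in half */
        pos += n;
    }
    return pos;
}
```

For the three-byte string "a" followed by U+00E9 (bytes 0xC3 0xA9), a limit of 2 clips to length 1 rather than leaving a dangling 0xC3.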
Reading the array: the snapshot copy
A monitoring backend does not read other slots field-by-field on demand;
pgstat_read_current_status copies the whole array into local memory once
per transaction, honouring the reader side of the seqlock. The per-entry
copy loop retries until it observes an even, unchanged st_changecount:

```c
// pgstat_read_current_status: utils/activity/backend_status.c (condensed)
for (;;)
{
    int before_changecount, after_changecount;

    pgstat_begin_read_activity(beentry, before_changecount);
    localentry->backendStatus.st_procpid = beentry->st_procpid;
    if (localentry->backendStatus.st_procpid > 0)
    {
        memcpy(&localentry->backendStatus,
               unvolatize(PgBackendStatus *, beentry),
               sizeof(PgBackendStatus));
        strcpy(localappname, beentry->st_appname);
        localentry->backendStatus.st_appname = localappname;    /* re-point at local copy */
        strcpy(localactivity, beentry->st_activity_raw);
        localentry->backendStatus.st_activity_raw = localactivity;
    }
    pgstat_end_read_activity(beentry, after_changecount);

    if (pgstat_read_activity_complete(before_changecount, after_changecount))
        break;
    CHECK_FOR_INTERRUPTS();     /* don't spin forever on a stuck writer */
}
```

Because the out-of-line strings are pointers into shared buffers, the copy
also re-points the local struct’s pointers at the local string copies.
strcpy is safe against concurrent writes here only because each shared
buffer is always NUL-terminated. The deadlock detector takes a different
path: pgstat_get_backend_current_activity reads a single slot directly
(no full snapshot) because it already knows the target is blocked and
stable.
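The pointer re-pointing step is easy to get wrong, so here is a stripped-down sketch of it. The seqlock retry loop is deliberately left out, and DemoStatus, DemoLocalStatus, and demo_snapshot are hypothetical names rather than PostgreSQL symbols.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NAMELEN 64

typedef struct DemoStatus
{
    int         pid;
    char       *appname;        /* points into a separate shared string buffer */
} DemoStatus;

typedef struct DemoLocalStatus
{
    DemoStatus  status;
    char        appname_copy[DEMO_NAMELEN];     /* private storage for the string */
} DemoLocalStatus;

/* Copy one slot into local memory so later reads never touch shared state. */
static void
demo_snapshot(const DemoStatus *shared, DemoLocalStatus *local)
{
    assert(shared->appname != NULL);
    memcpy(&local->status, shared, sizeof(DemoStatus));
    /* The struct copy duplicated the pointer, not the bytes it points at. */
    snprintf(local->appname_copy, sizeof(local->appname_copy), "%s",
             shared->appname);
    local->status.appname = local->appname_copy;    /* re-point at the local copy */
}
```

Without the final assignment, the local struct would keep pointing into the shared buffer and would silently change when the observed process next updates its name.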
Command progress reporting
Progress reporting reuses the same PgBackendStatus slot but a different
set of fields: a command tag and a 20-integer vector. The command tag is a
small enum, and the vector width is fixed:

```c
typedef enum ProgressCommandType
{
    PROGRESS_COMMAND_INVALID,
    PROGRESS_COMMAND_VACUUM,
    PROGRESS_COMMAND_ANALYZE,
    PROGRESS_COMMAND_CLUSTER,
    PROGRESS_COMMAND_CREATE_INDEX,
    PROGRESS_COMMAND_BASEBACKUP,
    PROGRESS_COMMAND_COPY,
} ProgressCommandType;

#define PGSTAT_NUM_PROGRESS_PARAM 20
```

A command “opens” progress with pgstat_progress_start_command(cmdtype, relid), which sets the tag and the target relation and zeroes the vector,
all inside the seqlock:

```c
// pgstat_progress_start_command: utils/activity/backend_progress.c
void
pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
{
    volatile PgBackendStatus *beentry = MyBEEntry;

    if (!beentry || !pgstat_track_activities)
        return;

    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    beentry->st_progress_command = cmdtype;
    beentry->st_progress_command_target = relid;
    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}
```

Each command then pokes individual slots as it advances. The meaning of
each index is a per-command convention defined in commands/progress.h
(e.g. for VACUUM, slot 0 is the phase, slot 1 is total heap blocks).
pgstat_progress_update_param writes one slot;
pgstat_progress_update_multi_param writes several atomically (one
seqlock bracket, so a reader never sees a half-updated vector):

```c
// pgstat_progress_update_multi_param: utils/activity/backend_progress.c (condensed)
void
pgstat_progress_update_multi_param(int nparam, const int *index, const int64 *val)
{
    volatile PgBackendStatus *beentry = MyBEEntry;

    if (!beentry || !pgstat_track_activities || nparam == 0)
        return;

    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    for (int i = 0; i < nparam; ++i)
    {
        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
        beentry->st_progress_param[index[i]] = val[i];
    }
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}
```

pgstat_progress_end_command resets the tag to
PROGRESS_COMMAND_INVALID, which is the signal the progress views use to
decide a backend is no longer running that command.
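The start / update / end lifecycle can be sketched as a self-contained miniature. The DEMO_* slot indices and demo_* functions are hypothetical, and the changecount here is a plain integer with no memory barriers, so the sketch shows only the bracketing discipline, not a concurrency-safe implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_PROGRESS_PARAM 20

typedef enum { DEMO_CMD_INVALID = 0, DEMO_CMD_VACUUM } DemoCommand;

/* Hypothetical per-command slot convention. */
#define DEMO_VACUUM_PHASE        0
#define DEMO_VACUUM_TOTAL_BLOCKS 1

typedef struct DemoProgress
{
    int         changecount;    /* even = stable, odd = write in flight */
    DemoCommand command;
    int64_t     param[DEMO_NUM_PROGRESS_PARAM];
} DemoProgress;

static void
demo_begin_write(DemoProgress *p)
{
    p->changecount++;
}

static void
demo_end_write(DemoProgress *p)
{
    p->changecount++;
    assert((p->changecount & 1) == 0);
}

/* Open progress: set the tag and zero the whole vector in one bracket. */
static void
demo_progress_start(DemoProgress *p, DemoCommand cmd)
{
    demo_begin_write(p);
    p->command = cmd;
    memset(p->param, 0, sizeof(p->param));
    demo_end_write(p);
}

/* Write several slots under one bracket so readers see all or none. */
static void
demo_progress_update_multi(DemoProgress *p, int nparam,
                           const int *index, const int64_t *val)
{
    demo_begin_write(p);
    for (int i = 0; i < nparam; i++)
    {
        assert(index[i] >= 0 && index[i] < DEMO_NUM_PROGRESS_PARAM);
        p->param[index[i]] = val[i];
    }
    demo_end_write(p);
}

/* Close progress: resetting the tag is what hides the row from a view. */
static void
demo_progress_end(DemoProgress *p)
{
    demo_begin_write(p);
    p->command = DEMO_CMD_INVALID;
    demo_end_write(p);
}
```

A caller would typically set the phase and the total together with one multi-update, then bump a "done" slot as work proceeds.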
Progress across parallel workers
A parallel CREATE INDEX or VACUUM spreads work across worker processes, but
the progress vector that the user reads lives only in the leader’s
PgBackendStatus. A worker cannot write the leader’s slot (single-writer
invariant), so it sends the leader a PqMsg_Progress message over the
parallel-worker message queue. The variant
pgstat_progress_parallel_incr_param chooses the path:

```c
// pgstat_progress_parallel_incr_param: utils/activity/backend_progress.c
void
pgstat_progress_parallel_incr_param(int index, int64 incr)
{
    if (IsParallelWorker())
    {
        static StringInfoData progress_message;

        initStringInfo(&progress_message);
        pq_beginmessage(&progress_message, PqMsg_Progress);
        pq_sendint32(&progress_message, index);
        pq_sendint64(&progress_message, incr);
        pq_endmessage(&progress_message);
    }
    else
        pgstat_progress_incr_param(index, incr);    /* leader: write own slot directly */
}
```

The leader picks the message up in its parallel-message handler
(HandleParallelMessage in access/transam/parallel.c) and applies the
increment to its own slot, so the single-writer invariant is preserved:
only the leader ever writes the leader’s vector.

```c
// HandleParallelMessage: access/transam/parallel.c (condensed)
case PqMsg_Progress:
    {
        int     index = pq_getmsgint(msg, 4);
        int64   incr = pq_getmsgint64(msg);

        pq_getmsgend(msg);
        pgstat_progress_incr_param(index, incr);    /* leader updates its own slot */
        break;
    }
```

This is why only the incremental progress API has a parallel variant:
relaying an absolute “set slot N to V” would race between workers, but
“add incr to slot N” composes cleanly when funnelled through the single
leader.
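The claim that increments compose regardless of arrival order can be checked with a small sketch. DemoProgressMsg and demo_leader_apply are hypothetical; they model only the (index, increment) payload and the leader-side application, not the message queue itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_PARAM 20

/* Hypothetical form of one worker-to-leader progress message. */
typedef struct DemoProgressMsg
{
    int32_t     index;          /* which slot */
    int64_t     incr;           /* how much to add */
} DemoProgressMsg;

/* Leader side: the only code that ever writes the leader's vector. */
static void
demo_leader_apply(int64_t *params, const DemoProgressMsg *msgs, size_t nmsgs)
{
    for (size_t i = 0; i < nmsgs; i++)
    {
        assert(msgs[i].index >= 0 && msgs[i].index < DEMO_NUM_PARAM);
        params[msgs[i].index] += msgs[i].incr;
    }
}
```

Addition is commutative, so any interleaving of worker messages produces the same final totals; an absolute "set" message would instead leave whichever value arrived last.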
Source Walkthrough
This section lists the stable symbols grouped by call-flow. Line numbers are deferred to the position-hint table at the end; anchor on the symbol names, which survive reformatting.
Wait-event encoding and decoding (wait_event.c, wait_event.h, wait_classes.h)
- PG_WAIT_LWLOCK … PG_WAIT_INJECTIONPOINT (wait_classes.h): the ten class constants, each occupying the top byte of the 4-byte word.
- WAIT_EVENT_CLASS_MASK / WAIT_EVENT_ID_MASK (wait_event.c): 0xFF000000 / 0x0000FFFF; split the word into class and event id.
- pgstat_report_wait_start / pgstat_report_wait_end (wait_event.h, inline): the lock-free single-word publish; the hottest instrumentation path in the server. Writes through my_wait_event_info.
- my_wait_event_info / local_my_wait_event_info (wait_event.c): the redirectable pointer; local before shared memory, then pointed into PGPROC by pgstat_set_wait_event_storage.
- pgstat_set_wait_event_storage / pgstat_reset_wait_event_storage (wait_event.c): redirect / un-redirect the publish target at backend start / shutdown.
- pgstat_get_wait_event_type (wait_event.c): class byte → "LWLock" / "IO" / … (the wait_event_type column).
- pgstat_get_wait_event (wait_event.c): full word → event name; dispatches to subsystem code for LWLock/Lock and to the codegen’d pgstat_get_wait_* for the rest.
Custom (extension) wait-event registry (wait_event.c)
- WaitEventExtensionNew / WaitEventInjectionPointNew: public registration entry points for the two custom classes.
- WaitEventCustomNew: double-checked-locking allocate-or-find; the only locking path in the wait-event subsystem.
- WaitEventCustomCounterData / WaitEventCustomCounter: spinlock-guarded nextId counter in shared memory.
- WaitEventCustomHashByInfo / WaitEventCustomHashByName: the two shared hash tables (info→name and name→info).
- GetWaitEventCustomIdentifier: reverse lookup used by pgstat_get_wait_event for the custom classes.
- GetWaitEventCustomNames: enumerate registered names for a class (backs pg_get_wait_events()).
- WaitEventCustomShmemInit / WaitEventCustomShmemSize: shmem setup for the registry.
Vocabulary codegen
- wait_event_names.txt: declarative table; one event per line under Section: ClassName - WaitEvent<Class> headers.
- generate-wait_event_types.pl: build-time generator; emits wait_event_types.h, pgstat_wait_event.c, and the SGML docs. Skips C generation for WaitEventExtension, WaitEventInjectionPoint, WaitEventLWLock, WaitEventLock.
- WAIT_EVENT_* enum members / pgstat_get_wait_<class>: generated; the latter are #included at the tail of wait_event.c via utils/pgstat_wait_event.c.
Backend-status slot lifecycle (backend_status.c, backend_status.h)
- PgBackendStatus (backend_status.h): the per-backend slot struct.
- st_changecount + PGSTAT_BEGIN_WRITE_ACTIVITY / PGSTAT_END_WRITE_ACTIVITY: the writer-side seqlock (a critical section: errors inside → PANIC).
- pgstat_begin_read_activity / pgstat_end_read_activity / pgstat_read_activity_complete: the reader-side seqlock.
- BackendStatusArray / MyBEEntry: the shared array and the backend’s pointer to its own slot.
- BackendStatusShmemInit / BackendStatusShmemSize: allocate the array plus the appname / hostname / activity string buffers.
- pgstat_beinit: set MyBEEntry, register the shutdown hook.
- pgstat_bestart_initial / pgstat_bestart_security / pgstat_bestart_final: three-phase fill of the slot at backend start (STATE_STARTING → security details → STATE_UNDEFINED + appname).
- pgstat_beshutdown_hook: zero st_procpid to mark the slot free.
Activity reporting and readback (backend_status.c)
- pgstat_report_activity: the main writer (state + query text), called from tcop/postgres.c.
- pgstat_report_appname / pgstat_report_query_id / pgstat_report_plan_id / pgstat_report_xact_timestamp: targeted single-field writers.
- pgstat_read_current_status: snapshot the whole array into localBackendStatusTable (the seqlock reader loop).
- pgstat_get_beentry_by_proc_number / pgstat_get_local_beentry_by_index / pgstat_fetch_stat_numbackends: accessors over the local snapshot used by the pg_stat_get_activity SRF.
- pgstat_get_backend_current_activity: single-slot direct read used by the deadlock detector.
- pgstat_get_crashed_backend_activity: postmaster-side, deliberately unsynchronised read of a possibly-corrupt slot for crash reporting.
- pgstat_clip_activity: multi-byte-safe truncation of st_activity_raw at read time.
Progress reporting (backend_progress.c, backend_progress.h)
- `ProgressCommandType` + `PGSTAT_NUM_PROGRESS_PARAM` (`backend_progress.h`): the command tag enum and the fixed 20-slot vector width.
- `pgstat_progress_start_command`: set tag + target, zero the vector.
- `pgstat_progress_update_param` / `pgstat_progress_incr_param` / `pgstat_progress_update_multi_param`: write one / increment one / atomically write several slots.
- `pgstat_progress_parallel_incr_param`: worker sends `PqMsg_Progress`; leader increments directly.
- `pgstat_progress_end_command`: reset the tag to `PROGRESS_COMMAND_INVALID`.
- `HandleParallelMessage` (`parallel.c`): leader-side `PqMsg_Progress` handler that funnels worker increments into the leader’s own slot.
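The shape of the progress API above (a command tag, a target, and a fixed vector of counters) can be modelled in a few lines. The sketch below is a simplified stand-in under stated assumptions: every `Demo*` type and `demo_*` function is hypothetical, the change-count bracket that protects the real writes is omitted, and the parallel relay is reduced to a plain struct handed to a leader-side handler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_PROGRESS_PARAM 20  /* mirrors the fixed 20-slot width */

typedef enum DemoProgressCommand
{
    DEMO_PROGRESS_COMMAND_INVALID = 0,
    DEMO_PROGRESS_COMMAND_VACUUM
} DemoProgressCommand;

typedef struct DemoProgress
{
    DemoProgressCommand command;
    uint32_t            target;     /* e.g. the relation being processed */
    int64_t             param[DEMO_NUM_PROGRESS_PARAM];
} DemoProgress;

/* What a worker would send to the leader instead of writing directly. */
typedef struct DemoProgressMsg
{
    int     index;
    int64_t incr;
} DemoProgressMsg;

static void
demo_progress_start(DemoProgress *p, DemoProgressCommand cmd, uint32_t target)
{
    p->command = cmd;
    p->target = target;
    memset(p->param, 0, sizeof(p->param));  /* zero the vector */
}

static void
demo_progress_update(DemoProgress *p, int index, int64_t val)
{
    assert(index >= 0 && index < DEMO_NUM_PROGRESS_PARAM);
    p->param[index] = val;
}

static void
demo_progress_incr(DemoProgress *p, int index, int64_t incr)
{
    assert(index >= 0 && index < DEMO_NUM_PROGRESS_PARAM);
    p->param[index] += incr;
}

/* Leader-side handler: fold a worker's increment into the leader's slot. */
static void
demo_handle_progress_msg(DemoProgress *leader, DemoProgressMsg msg)
{
    demo_progress_incr(leader, msg.index, msg.incr);
}

static void
demo_progress_end(DemoProgress *p)
{
    p->command = DEMO_PROGRESS_COMMAND_INVALID;  /* only the tag is reset */
}
```

Routing worker updates through the leader keeps each slot single-writer, and restricting the relay to increments means messages from different workers can be applied in any order without losing counts.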
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| `PG_WAIT_LWLOCK` … `PG_WAIT_INJECTIONPOINT` | `src/include/utils/wait_classes.h` | 18–27 |
| `WAIT_EVENT_CLASS_MASK` / `WAIT_EVENT_ID_MASK` | `src/backend/utils/activity/wait_event.c` | 42–43 |
| `WaitEventCustomShmemInit` | `src/backend/utils/activity/wait_event.c` | 119 |
| `WaitEventExtensionNew` | `src/backend/utils/activity/wait_event.c` | 163 |
| `WaitEventCustomNew` | `src/backend/utils/activity/wait_event.c` | 175 |
| `GetWaitEventCustomIdentifier` | `src/backend/utils/activity/wait_event.c` | 276 |
| `GetWaitEventCustomNames` | `src/backend/utils/activity/wait_event.c` | 306 |
| `pgstat_set_wait_event_storage` | `src/backend/utils/activity/wait_event.c` | 349 |
| `pgstat_get_wait_event_type` | `src/backend/utils/activity/wait_event.c` | 373 |
| `pgstat_get_wait_event` | `src/backend/utils/activity/wait_event.c` | 431 |
| `#include "utils/pgstat_wait_event.c"` | `src/backend/utils/activity/wait_event.c` | 506 |
| `pgstat_report_wait_start` / `_end` (inline) | `src/include/utils/wait_event.h` | 68–88 |
| name + enum derivation | `src/backend/utils/activity/generate-wait_event_types.pl` | 106–130 |
| enum base = class constant | `src/backend/utils/activity/generate-wait_event_types.pl` | 205–212 |
| `PgBackendStatus` struct | `src/include/utils/backend_status.h` | 98–177 |
| `PGSTAT_BEGIN/END_WRITE_ACTIVITY` | `src/include/utils/backend_status.h` | 209–222 |
| `pgstat_begin_read_activity` / `_complete` | `src/include/utils/backend_status.h` | 224–238 |
| `BackendStatusShmemInit` | `src/backend/utils/activity/backend_status.c` | 114 |
| `pgstat_beinit` | `src/backend/utils/activity/backend_status.c` | 245 |
| `pgstat_bestart_initial` | `src/backend/utils/activity/backend_status.c` | 270 |
| `pgstat_beshutdown_hook` | `src/backend/utils/activity/backend_status.c` | 509 |
| `pgstat_report_activity` | `src/backend/utils/activity/backend_status.c` | 572 |
| `pgstat_read_current_status` | `src/backend/utils/activity/backend_status.c` | 820 |
| `pgstat_get_backend_current_activity` | `src/backend/utils/activity/backend_status.c` | 996 |
| `pgstat_clip_activity` | `src/backend/utils/activity/backend_status.c` | 1315 |
| `ProgressCommandType` / `PGSTAT_NUM_PROGRESS_PARAM` | `src/include/utils/backend_progress.h` | 22–33 |
| `pgstat_progress_start_command` | `src/backend/utils/activity/backend_progress.c` | 27 |
| `pgstat_progress_update_param` | `src/backend/utils/activity/backend_progress.c` | 48 |
| `pgstat_progress_incr_param` | `src/backend/utils/activity/backend_progress.c` | 69 |
| `pgstat_progress_parallel_incr_param` | `src/backend/utils/activity/backend_progress.c` | 91 |
| `pgstat_progress_update_multi_param` | `src/backend/utils/activity/backend_progress.c` | 121 |
| `pgstat_progress_end_command` | `src/backend/utils/activity/backend_progress.c` | 150 |
| `PqMsg_Progress` handler | `src/backend/access/transam/parallel.c` | 1222 |
Source verification (as of 2026-06-05)
All symbols and excerpts above were read from
/data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94. Spot
checks worth re-running after a pull:
- Class constants and masks. `grep -n "0x0.000000U" wait_classes.h` should list exactly the ten `PG_WAIT_*` classes; `WAIT_EVENT_CLASS_MASK` (`0xFF000000`) and `WAIT_EVENT_ID_MASK` (`0x0000FFFF`) are defined at the top of `wait_event.c`. If a new class is added it gets a new top-byte value and a new `Section:` in `wait_event_names.txt`.
- The lock-free publish has no track-activities guard. `pgstat_report_wait_start` in `wait_event.h` is a bare `*(volatile uint32 *) my_wait_event_info = wait_event_info;`, and the file’s header comment explicitly states the `pgstat_track_activities` check was removed because it cost more than it saved. Confirm the guard is still absent.
- The seqlock is a critical section. `PGSTAT_BEGIN_WRITE_ACTIVITY` opens with `START_CRIT_SECTION()` and `PGSTAT_END_WRITE_ACTIVITY` closes with `END_CRIT_SECTION()`; the header comment warns that any error between them promotes to PANIC. This is why all writers fetch timestamps and string lengths before the bracket.
- Progress vector width is 20. `PGSTAT_NUM_PROGRESS_PARAM` is `20` in `backend_progress.h`; the per-command slot conventions live in `commands/progress.h` (e.g. `PROGRESS_VACUUM_*`), out of scope here.
- Parallel progress is increment-only. Only `pgstat_progress_parallel_incr_param` exists (no parallel set / multi variant), and the leader’s `parallel.c` handler calls `pgstat_progress_incr_param`. Confirm both ends still use the increment form.
- Codegen excludes four classes. In `generate-wait_event_types.pl`, the `next if (...)` guard skips `WaitEventExtension`, `WaitEventInjectionPoint`, `WaitEventLWLock`, `WaitEventLock` from C generation; they get SGML only.
- Build-artifact note. `wait_event_types.h` and `pgstat_wait_event.c` are generated, so they are not present in a clean checkout. The excerpts here are reconstructed from `generate-wait_event_types.pl` + the `#include` at the tail of `wait_event.c`, which is authoritative for how the generated names plug in.
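The first two spot checks concern the layout of the 32-bit wait word and the bare store that publishes it. The following sketch shows how the two masks split a word into class and event ID, and what an unguarded publish looks like. The mask values are the ones quoted above; the two class constants are assumed to match `wait_classes.h`, and all `DEMO_*` / `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WAIT_LWLOCK     0x01000000U    /* assumed class value */
#define DEMO_WAIT_IO         0x0A000000U    /* assumed class value */
#define DEMO_WAIT_CLASS_MASK 0xFF000000U
#define DEMO_WAIT_ID_MASK    0x0000FFFFU

/* Starts out pointing at a private word; a real backend would later
 * redirect the pointer into shared memory so that others can read it. */
static uint32_t  demo_local_wait_word;
static uint32_t *demo_my_wait_event_info = &demo_local_wait_word;

/* Compose a wait word from a class constant and a per-class event ID. */
static inline uint32_t
demo_make_wait_event(uint32_t classid, uint16_t eventid)
{
    assert((classid & ~DEMO_WAIT_CLASS_MASK) == 0);  /* top byte only */
    return classid | eventid;
}

/* Publish: one volatile store, with no lock, barrier, or guard. */
static inline void
demo_report_wait_start(uint32_t wait_event_info)
{
    *(volatile uint32_t *) demo_my_wait_event_info = wait_event_info;
}

/* Clear: zero means "not waiting". */
static inline void
demo_report_wait_end(void)
{
    *(volatile uint32_t *) demo_my_wait_event_info = 0;
}

static inline uint32_t
demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & DEMO_WAIT_CLASS_MASK;
}

static inline uint16_t
demo_wait_id(uint32_t wait_event_info)
{
    return (uint16_t) (wait_event_info & DEMO_WAIT_ID_MASK);
}
```

Because class and ID travel in a single aligned word, a reader can never observe a class from one wait paired with an ID from another, which is what lets the publish skip the change-count protocol entirely.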
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Oracle: the canonical wait-interface
PostgreSQL’s wait events are a deliberate echo of Oracle’s Wait
Interface, the model that made “wait event” industry vocabulary. Oracle
publishes per-session wait state into the SGA (`v$session_wait`,
`v$session_event`, `v$system_event`) and, crucially, times each wait:
`time_waited` accumulates wait time per event (also exposed at microsecond
resolution as `time_waited_micro`), which feeds the
“DB time” / “Active Session History” methodology. PostgreSQL’s core
`pg_stat_activity` is comparatively spartan: it shows the current wait
event but does not accumulate per-event wait time in core. The standard
way to recover Oracle-style ASH on PostgreSQL is sampling: an extension
(`pg_wait_sampling`, or one of the commercial samplers that sit alongside
`pg_stat_statements`) periodically reads the `wait_event_info` word in
every backend’s `PGPROC` and histograms it. The lock-free single-word
design is what makes such a high-frequency sampler cheap: a sampler can
read every backend’s wait word without ever taking a lock the backend needs.
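The sampling approach described above amounts to a loop over per-backend wait words plus a histogram. A minimal sketch of one sampling pass follows, assuming the words are visible as a plain array; `DemoWaitHistogram` and `demo_sample_once` are hypothetical names and do not reflect how any particular extension is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 256    /* one bucket per possible top-byte value */

typedef struct DemoWaitHistogram
{
    uint64_t samples;                       /* backends observed so far */
    uint64_t by_class[DEMO_NUM_CLASSES];    /* bucket 0 = not waiting */
} DemoWaitHistogram;

/*
 * One sampling pass: read every backend's wait word with a single load
 * and count it under its class (the top byte of the word). No lock is
 * taken, so the observed backends are never delayed by the sampler.
 */
static void
demo_sample_once(const volatile uint32_t *wait_words, size_t nbackends,
                 DemoWaitHistogram *hist)
{
    assert(wait_words != NULL && hist != NULL);

    for (size_t i = 0; i < nbackends; i++)
    {
        uint32_t word = wait_words[i];

        hist->by_class[word >> 24]++;
        hist->samples++;
    }
}
```

Calling `demo_sample_once` on a timer and dividing each bucket by `samples` gives an estimate of the fraction of backend time spent in each wait class, which is the core of an ASH-style report.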
MySQL / InnoDB: Performance Schema instruments
MySQL’s Performance Schema takes the opposite tradeoff: instead of one
categorical word, it instruments thousands of named points
(`wait/synch/mutex/...`, `wait/io/file/...`, `stage/sql/...`) with
per-instrument timers and aggregations, all configurable at runtime. The
“stage” instruments are MySQL’s analogue of PostgreSQL command progress:
`SELECT ... FROM performance_schema.events_stages_current` shows the
current execution stage. The cost model differs sharply: Performance Schema
maintains timed, aggregated history in memory whenever its instruments are
enabled (with a documented overhead), whereas PostgreSQL keeps the hot path
to a single un-timed store and pushes aggregation out to optional samplers.
This is the classic split between always-on rich telemetry and a cheap raw
signal with external aggregation.
SQL Server: wait statistics as a first-class tuning surface
SQL Server exposes cumulative wait statistics directly
(`sys.dm_os_wait_stats`) alongside each request’s current wait
(`sys.dm_exec_requests.wait_type`), and the wait taxonomy
(`PAGEIOLATCH_*`, `LCK_M_*`, `CXPACKET`, …) is the primary entry
point of its performance-tuning methodology. Like Oracle, the engine
accumulates wait time per type in core. PostgreSQL’s choice not to time
waits in core is a conscious overhead decision; the community has
repeatedly debated adding in-core wait timing and has so far preferred
sampling extensions to avoid taxing the hot path.
The seqlock and lock-free telemetry
The `st_changecount` protocol is a single-writer/multi-reader seqlock,
the same primitive Linux uses for gettimeofday-class fast reads. The
research lineage runs through the lock-free / wait-free literature: a
single-writer record with paired memory barriers is among the cheapest ways
to publish a multi-field record for concurrent readers without a lock. The
deliberate asymmetry (cheap writes, retrying reads) suits telemetry well,
because writers are on the hot path and readers are rare
monitoring queries. PostgreSQL pushes this further by splitting out the
hottest field (the wait word) into a barrier-free single store, accepting
momentary reader staleness in exchange for a near-zero write cost. The
relevant entries in `dbms-general/dbms-papers.md` situate this under the
broader concurrent-data-structure work on versioned / optimistic reads.
Frontiers
- In-core wait timing / sampling. Whether to accumulate per-event wait time in core (Oracle/SQL Server style) or keep relying on external samplers remains an open community tradeoff; the hot-path cost of the former is the sticking point.
- eBPF / USDT probes. PostgreSQL ships USDT tracepoints (e.g. `TRACE_POSTGRESQL_STATEMENT_STATUS` fires inside `pgstat_report_activity`), enabling out-of-process tracing that sidesteps the shared-memory path entirely: a frontier for low-overhead, dynamically-attached observability.
- Richer progress models. Today’s progress vector is 20 flat integers with per-command conventions; extending it to structured / nested phase reporting (for, say, parallel plans) without breaking the single-writer invariant is an active design space, as the `PqMsg_Progress` increment relay hints.
- Custom wait events for extensions. The `WaitEventExtensionNew` registry (added so extensions stop colliding on the single `Extension` event) is a recent generalisation; injection-point wait events reuse the same machinery for deterministic testing.
Sources
- PostgreSQL source, REL_18_STABLE @ 273fe94 (`/data/hgryoo/references/postgres`):
  - `src/backend/utils/activity/wait_event.c`: encode/decode, custom registry.
  - `src/backend/utils/activity/backend_status.c`: `PgBackendStatus` lifecycle, activity reporting, snapshot readback.
  - `src/backend/utils/activity/backend_progress.c`: progress API and the parallel relay.
  - `src/backend/utils/activity/wait_event_names.txt`: the declarative vocabulary.
  - `src/backend/utils/activity/generate-wait_event_types.pl`: the codegen.
  - `src/include/utils/wait_event.h`, `wait_classes.h`, `backend_status.h`, `backend_progress.h`: the inline publish, class constants, slot struct + seqlock macros, progress enum.
  - `src/backend/access/transam/parallel.c`: leader-side `PqMsg_Progress` handler.
- Cross-references (this KB):
  - `postgres-cumulative-stats.md`: the other stats system (monotonic counters, `pgstat.c` shared hash); contrast with the live per-backend snapshot here.
  - `postgres-lwlock-spinlock.md`: `GetLWLockIdentifier`, the spinlock behind `WaitEventCustomCounter`, and the LWLock wait class that `pgstat_get_wait_event` defers to.
  - `postgres-wire-protocol.md`: `pq_beginmessage` / `PqMsg_*` framing reused by the parallel progress relay.
- Textbook anchors (`knowledge/research/dbms-general/`):
  - Database System Concepts (Silberschatz, Korth, Sudarshan): DBA monitoring of executing transactions, lock/I-O wait diagnosis.
  - Architecture of a Database System (Hellerstein, Stonebraker, Hamilton): process model; per-worker state published to shared memory.
  - Database Internals (Petrov): lock-free single-writer telemetry, versioned optimistic reads.
- Paper bibliography: `dbms-general/dbms-papers.md` (relevant entries on seqlocks / concurrent versioned reads); see also Oracle Wait Interface and MySQL Performance Schema documentation for the comparative designs.