PostgreSQL pg_ctl / pg_controldata — Server Lifecycle Control and Cluster State Inspection
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source Verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every production database system faces a bootstrapping problem: the tool that starts and stops the server cannot use the server itself. Before the postmaster is running there is no SQL connection to open, no catalog to query, no session to receive commands. The control interface must be out-of-band, operating purely on the filesystem and OS signals.
Two complementary needs drive the design of such an interface.
Lifecycle control. An operator must be able to start, stop, restart, and reconfigure the server without writing a shell script that duplicates knowledge embedded in the server itself (where the binary lives, what options it accepts, what PID it chose). The control tool must discover that state autonomously and issue the right OS signal at the right time. PostgreSQL, like most server-class systems, distinguishes three stop modes:
- Smart: refuse new connections, wait for all existing sessions to terminate naturally, then shut down.
- Fast: disconnect active sessions immediately, flush WAL, write a shutdown checkpoint.
- Immediate: kill immediately without checkpoint — analogous to pulling the power; leaves crash recovery as the next startup’s job.
State inspection without a live connection. A crash, a failed upgrade, or a misconfiguration can leave the server in a state where it cannot accept connections. Administrators need a way to read the server’s last known state — the DBState, the checkpoint LSN, the timeline — from a file on disk. This requires a stable, checksummed, human-readable on-disk structure that survives the crash and can be decoded by a standalone tool.
The control file, present in many server-class databases under
names like pg_control, CURRENT, or control01.ctl, is the standard
answer to the second need. The recovery discussion in Database Internals
(Petrov, ch. 5 “Transaction Processing and Recovery”) explains why: a
crash-recovery path needs a known starting point before it can replay
anything. The control file supplies it: the REDO start point, the timeline,
and the confirmation that the previous shutdown was clean. An atomic write
(always within one disk sector, protected by a CRC) is the correctness
invariant: a torn write must be detectable so recovery does not proceed from
a corrupt baseline.
Common DBMS Design
The PID file as the server’s business card
A running server must leave a file, commonly named postmaster.pid or
mysql.pid, that records its PID, start time, and status. This file
plays three roles simultaneously:
- Mutual exclusion. A second server startup attempt can read the file, check that the PID is still alive, and refuse to start, preventing two servers from sharing the same data directory.
- Signal routing. A control tool sends the stop or reload signal to the PID it reads from the file, not to a hardcoded value. The PID is the only stable cross-process handle on Unix.
- Readiness advertisement. A structured pidfile — with a status field in a known line position — lets the control tool poll for startup completion without implementing a separate health-check protocol.
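The first two roles can be sketched in a few lines of C. This is a minimal illustration, not pg_ctl's code: the helper names `read_pid_from_file` and `pid_is_alive` are hypothetical, and the only assumption about the file is that its first line holds the PID.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: parse the PID from the first line of a pidfile.
 * Returns 0 if the file is missing or does not start with a positive integer. */
pid_t read_pid_from_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    long  pid = 0;

    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0)
        pid = 0;
    fclose(fp);
    return (pid_t) pid;
}

/* Probe with signal 0: nothing is delivered, only the existence and
 * permission checks run. EPERM means the process exists but belongs to
 * another user, so it still counts as alive. */
int pid_is_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;
}
```

A second startup attempt would refuse to proceed when `pid_is_alive(read_pid_from_file(path))` is true; a control tool would pass the same PID to `kill()` with a real signal. A PID alone can be recycled by the OS, which is one reason a real pidfile also records the start time.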
The control file: an atomic-write state record
Every DBMS with crash recovery needs a small record (one sector, ≤ 512 bytes) that captures the recovery entry point. The canonical fields are:
| Field group | Purpose |
|---|---|
| Version identifiers | Detect cross-version or cross-architecture incompatibility before any catalog read |
| System identifier | Uniquely identify this cluster instance; reject WAL from a different cluster |
| DBState / lifecycle flags | Distinguish clean shutdown from in-recovery from crash |
| Checkpoint LSN + copy | Give WAL replay its REDO start point |
| Timeline IDs | Detect and track timeline switches (standby promotion, PITR) |
| WAL-level parameters | Confirm that archiving/replication settings match the WAL that was written |
| Compile-time constants | Detect block-size or alignment mismatches before first page read |
The atomic-write constraint (the active payload must fit in one disk
sector) is stated explicitly in PostgreSQL (PG_CONTROL_MAX_SAFE_SIZE = 512).
Oracle (control file header block) and MySQL/InnoDB (ibdata1 system
tablespace header) face the same torn-write hazard for their equivalent
records. Violating the constraint introduces a window where a partial write
could leave an inconsistent recovery baseline.
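To make the torn-write check concrete, here is a minimal sketch of a checksummed record. `DemoControlRecord` and its helpers are hypothetical and far smaller than a real control file; the CRC-32C routine is a plain bitwise version chosen for brevity, not the table-driven or hardware-assisted code a server would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical miniature control record. The checksum is the last field so
 * it can cover every byte that precedes it. */
typedef struct DemoControlRecord
{
    uint64_t system_id;
    uint32_t format_version;
    uint32_t state;
    uint64_t checkpoint_lsn;
    uint32_t crc;
} DemoControlRecord;

/* Writer side: compute the checksum over everything before the crc field. */
void demo_record_seal(DemoControlRecord *rec)
{
    rec->crc = crc32c(rec, offsetof(DemoControlRecord, crc));
}

/* Reader side: a mismatch means the record was torn or corrupted. */
int demo_record_is_valid(const DemoControlRecord *rec)
{
    return rec->crc == crc32c(rec, offsetof(DemoControlRecord, crc));
}
```

A reader that finds `demo_record_is_valid()` false must not trust any field in the record, including the checkpoint location.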
Signal semantics for stop modes
The three stop modes map to three Unix signals in PostgreSQL (and similar mappings appear in other servers):
| Mode | Signal | Postmaster behavior |
|---|---|---|
| Smart | SIGTERM | No new connections; wait for idle |
| Fast | SIGINT | Terminate backends; checkpoint; exit |
| Immediate | SIGQUIT | Kill all children; exit without checkpoint |
A control tool that understands this mapping does not need the server to expose a management API; the OS signal mechanism is the API.
Promote: from standby to primary
Standby promotion is a state transition that cannot safely be signaled
purely by sending a Unix signal to the postmaster. The postmaster must know
the signal is a promote request, not a generic wakeup. The design pattern
is a sentinel file in $PGDATA: the control tool creates the file, sends
SIGUSR1, and the postmaster checks for the file on receipt. The file-based
approach is atomic on POSIX filesystems (the open(O_CREAT) syscall is the
serialization point) and survives a brief race between file creation and
signal delivery.
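The sentinel-plus-signal pattern can be sketched as follows. This is an illustration under stated assumptions, not the postmaster's implementation: the function names are hypothetical, and the receiver is reduced to a flag set in the handler and a check made later from the main loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Set asynchronously by the handler; sig_atomic_t keeps the store safe. */
static volatile sig_atomic_t wakeup_pending = 0;

static void handle_sigusr1(int signo)
{
    (void) signo;
    wakeup_pending = 1;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on failure. */
int install_wakeup_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Sender side: create the (empty) sentinel first, then signal. */
int request_promote(pid_t server_pid, const char *sentinel_path)
{
    FILE *fp = fopen(sentinel_path, "w");

    if (fp == NULL || fclose(fp) != 0)
        return -1;
    if (kill(server_pid, SIGUSR1) != 0)
    {
        unlink(sentinel_path);      /* do not leave a stale request behind */
        return -1;
    }
    return 0;
}

/* Receiver side: called from the main loop, never from the handler.
 * Returns 1 if a promote request was found and consumed. */
int consume_promote_request(const char *sentinel_path)
{
    if (!wakeup_pending)
        return 0;
    wakeup_pending = 0;
    if (access(sentinel_path, F_OK) != 0)
        return 0;                   /* woken for some other reason */
    unlink(sentinel_path);
    return 1;
}
```

Because the file is created before the signal is sent, a receiver that wakes up always finds the sentinel already in place; a bare SIGUSR1 without the file is treated as an unrelated wakeup.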
PostgreSQL’s Approach
pg_ctl: a thin orchestration shell
pg_ctl is a standalone C binary (src/bin/pg_ctl/pg_ctl.c). It does not
link against any PostgreSQL backend library; its only backend dependencies are
the headers src/include/catalog/pg_control.h (for ControlFileData and DBState)
and src/include/utils/pidfile.h (for the postmaster.pid line positions).
Everything else (process launch, signal delivery, wait loops) is done with
POSIX system calls.
The command set is encoded in the CtlCommand enum:

    // CtlCommand — src/bin/pg_ctl/pg_ctl.c
    typedef enum
    {
        NO_COMMAND = 0,
        INIT_COMMAND,
        START_COMMAND,
        STOP_COMMAND,
        RESTART_COMMAND,
        RELOAD_COMMAND,
        STATUS_COMMAND,
        PROMOTE_COMMAND,
        LOGROTATE_COMMAND,
        KILL_COMMAND,
        REGISTER_COMMAND,       /* Windows service only */
        UNREGISTER_COMMAND,     /* Windows service only */
        RUN_AS_SERVICE_COMMAND,
    } CtlCommand;

Starting the server. do_start() calls start_postmaster(), which
forks a child, runs /bin/sh -c "exec postgres -D … < /dev/null" in it,
and returns the child’s PID to the parent. Because the shell command begins
with exec, that PID becomes the postmaster’s own PID. The parent then calls
wait_for_postmaster_start(), which polls postmaster.pid at 10 Hz for
up to wait_seconds (default 60 s):

    // start_postmaster — src/bin/pg_ctl/pg_ctl.c
    pm_pid = fork();
    if (pm_pid == 0)
    {
        /* child: detach session, exec postgres via shell */
        setsid();
        cmd = psprintf("exec \"%s\" %s%s < \"%s\" 2>&1",
                       exec_path, pgdata_opt, post_opts, DEVNULL);
        execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
        exit(1);                /* exec failed */
    }
    return pm_pid;              /* parent returns shell PID */

The wait loop reads line 8 of postmaster.pid
(LOCK_FILE_LINE_PM_STATUS = 8) and checks for PM_STATUS_READY or
PM_STATUS_STANDBY:

    // wait_for_postmaster_start — src/bin/pg_ctl/pg_ctl.c
    char *pmstatus = optlines[LOCK_FILE_LINE_PM_STATUS - 1];
    if (strcmp(pmstatus, PM_STATUS_READY) == 0 ||
        strcmp(pmstatus, PM_STATUS_STANDBY) == 0)
        return POSTMASTER_READY;

If the postmaster dies before setting the ready status, pg_ctl calls
get_control_dbstate(), which reads pg_control directly, to
distinguish DB_SHUTDOWNED_IN_RECOVERY from a genuine startup failure.
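The status-line check described above can be approximated by a small standalone reader. The sketch below is hypothetical: `read_file_line` and `pidfile_says_ready` are illustrative names, the line number is hard-coded to mirror LOCK_FILE_LINE_PM_STATUS, and the comparison is a prefix match so that trailing padding in the status word does not matter. It also assumes every line fits in the caller's buffer.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_STATUS_LINE 8          /* mirrors LOCK_FILE_LINE_PM_STATUS */

/* Copy line `lineno` (1-based) of `path` into `buf`, without the newline.
 * Returns 0 on success, -1 if the file is missing or too short, which is
 * the normal case while the server is still writing its pidfile. */
int read_file_line(const char *path, int lineno, char *buf, size_t buflen)
{
    FILE *fp = fopen(path, "r");
    int   current = 0;
    int   found = -1;

    if (fp == NULL)
        return -1;
    while (fgets(buf, (int) buflen, fp) != NULL)
    {
        if (++current == lineno)
        {
            buf[strcspn(buf, "\r\n")] = '\0';
            found = 0;
            break;
        }
    }
    fclose(fp);
    return found;
}

/* Returns 1 when the status line starts with "ready" or "standby". */
int pidfile_says_ready(const char *path)
{
    char line[256];

    if (read_file_line(path, DEMO_STATUS_LINE, line, sizeof(line)) != 0)
        return 0;
    return strncmp(line, "ready", 5) == 0 ||
           strncmp(line, "standby", 7) == 0;
}
```

A short or missing file is reported as "not ready" rather than as an error, which lets a caller keep polling while the server is still starting.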
Stopping the server. do_stop() reads the PID from postmaster.pid,
sends the mode-appropriate signal (SIGTERM / SIGINT / SIGQUIT), and polls
for the PID file to disappear:

    // do_stop — src/bin/pg_ctl/pg_ctl.c
    ShutdownMode      Signal sent
    SMART_MODE        SIGTERM
    FAST_MODE         SIGINT (default)
    IMMEDIATE_MODE    SIGQUIT

The shutdown_mode defaults to FAST_MODE. The signal is the OS handle;
pg_ctl never calls any in-process function of the server. The mapping from
the textual --mode argument to the global sig variable is centralized in
set_mode(), which is the single source of truth for the stop-mode → signal
correspondence the table above describes:

    // set_mode — src/bin/pg_ctl/pg_ctl.c
    if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
    {
        shutdown_mode = SMART_MODE;
        sig = SIGTERM;
    }
    else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
    {
        shutdown_mode = FAST_MODE;
        sig = SIGINT;
    }
    else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
    {
        shutdown_mode = IMMEDIATE_MODE;
        sig = SIGQUIT;
    }

sig is a file-scope global initialized to SIGINT (the fast-mode default),
so a pg_ctl stop with no --mode flag never enters set_mode() and falls
through with FAST_MODE semantics already in place. The kill -s command
reuses the same sig global through a parallel set_sig() parser, which is
why do_kill() can forward an arbitrary signal without any mode logic of its
own.
Promoting a standby. do_promote() first guards against promoting a
non-standby by calling get_control_dbstate() and asserting
DB_IN_ARCHIVE_RECOVERY. It then writes an empty $PGDATA/promote file
and sends SIGUSR1:

    // do_promote — src/bin/pg_ctl/pg_ctl.c
    if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
    {
        write_stderr(_("%s: cannot promote server; "
                       "server is not in standby mode\n"),
                     progname);
        exit(1);
    }
    snprintf(promote_file, MAXPGPATH, "%s/promote", pg_data);
    if ((prmfile = fopen(promote_file, "w")) == NULL) { ... exit(1); }
    if (fclose(prmfile)) { ... exit(1); }
    sig = SIGUSR1;
    if (kill(pid, sig) != 0)
    {
        write_stderr(_("%s: could not send promote signal (PID: %d): %m\n"),
                     progname, (int) pid);
        if (unlink(promote_file) != 0)  /* best-effort cleanup on failure */
            write_stderr(...);
        exit(1);
    }

Two correctness details are worth noting. First, the fopen/fclose pair
(rather than a single creat) is used so that a failure to flush the (empty)
file is caught as a distinct error from a failure to create it. Second, if
kill() fails after the file already exists, do_promote() unlinks the
sentinel so a later restart does not silently auto-promote on a stale
promote file. After the signal lands, wait_for_postmaster_promote() polls
get_control_dbstate() until it observes DB_IN_PRODUCTION, bailing out
early if the PID file vanishes or the postmaster dies mid-promotion:

    // wait_for_postmaster_promote — src/bin/pg_ctl/pg_ctl.c
    for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
    {
        if ((pid = get_pgpid(false)) == 0)
            return false;           /* pid file is gone */
        if (kill(pid, 0) != 0)
            return false;           /* postmaster died */

        state = get_control_dbstate();
        if (state == DB_IN_PRODUCTION)
            return true;            /* successful promotion */

        if (cnt % WAITS_PER_SEC == 0)
            print_msg(".");
        pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
    }
    return false;                   /* timeout reached */

This is the same 10 Hz poll cadence (WAITS_PER_SEC) used by the start and
stop wait loops, but the readiness signal is the DBState read from
pg_control rather than a line in postmaster.pid. Promotion has no
dedicated pidfile status line, so the control file is the only authoritative
witness that the transition to primary has completed.
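All three wait loops share one shape: test a condition, sleep a tenth of a second, give up after a fixed number of ticks. A generic version might look like the sketch below. `poll_until` and `countdown_done` are hypothetical names, and `nanosleep` stands in for pg_usleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define DEMO_WAITS_PER_SEC 10       /* mirrors WAITS_PER_SEC */

typedef int (*poll_predicate)(void *arg);

/* Poll `done(arg)` ten times per second for at most `wait_seconds`.
 * Returns 1 as soon as the predicate holds, 0 on timeout. */
int poll_until(poll_predicate done, void *arg, int wait_seconds)
{
    const struct timespec tick = { 0, 1000000000L / DEMO_WAITS_PER_SEC };

    for (int cnt = 0; cnt < wait_seconds * DEMO_WAITS_PER_SEC; cnt++)
    {
        if (done(arg))
            return 1;
        nanosleep(&tick, NULL);
    }
    return 0;
}

/* Example predicate: reports success once *remaining has counted down to 0. */
int countdown_done(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0)
    {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

In this framing the start, stop, and promote waits differ only in the predicate: a pidfile status line, the absence of the pidfile, or the DBState in the control file.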
Reload. do_reload() sends SIGHUP, which the postmaster relays to
all backends, causing them to reread postgresql.conf. No wait loop is
needed because SIGHUP handling is asynchronous and non-disruptive.
get_control_dbstate: the bridge between pg_ctl and pg_control.
This static helper, called from the wait loops and the promote guard, reads
pg_control through the shared get_controlfile() utility:

    // get_control_dbstate — src/bin/pg_ctl/pg_ctl.c
    static DBState
    get_control_dbstate(void)
    {
        bool crc_ok;
        ControlFileData *ctl = get_controlfile(pg_data, &crc_ok);

        if (!crc_ok)
        {
            write_stderr("control file appears to be corrupt\n");
            exit(1);
        }
        DBState ret = ctl->state;
        pfree(ctl);
        return ret;
    }

ControlFileData and the pg_control format
ControlFileData (src/include/catalog/pg_control.h) is the on-disk
layout of $PGDATA/global/pg_control. At REL_18_STABLE,
PG_CONTROL_VERSION = 1800. The struct is deliberately kept under
PG_CONTROL_MAX_SAFE_SIZE = 512 bytes (one disk sector) so that
every write is atomic:

    // ControlFileData — src/include/catalog/pg_control.h
    typedef struct ControlFileData
    {
        uint64      system_identifier;   /* unique cluster ID (set at initdb) */
        uint32      pg_control_version;  /* PG_CONTROL_VERSION = 1800 in PG18 */
        uint32      catalog_version_no;  /* catversion.h; changes on catalog changes */
        DBState     state;               /* current lifecycle state */
        pg_time_t   time;                /* timestamp of last pg_control update */
        XLogRecPtr  checkPoint;          /* LSN of last checkpoint record */
        CheckPoint  checkPointCopy;      /* full body of that checkpoint record */
        XLogRecPtr  unloggedLSN;         /* fake LSN counter for unlogged relations */
        XLogRecPtr  minRecoveryPoint;    /* must replay at least to here */
        TimeLineID  minRecoveryPointTLI;
        XLogRecPtr  backupStartPoint;    /* set during online backup */
        XLogRecPtr  backupEndPoint;
        bool        backupEndRequired;
        int         wal_level;
        bool        wal_log_hints;
        int         MaxConnections;
        int         max_worker_processes;
        int         max_wal_senders;
        int         max_prepared_xacts;
        int         max_locks_per_xact;
        bool        track_commit_timestamp;
        uint32      maxAlign;
        double      floatFormat;         /* = 1234567.0; architecture check */
        uint32      blcksz;              /* data block size */
        uint32      relseg_size;         /* blocks per large-relation segment */
        uint32      xlog_blcksz;         /* WAL block size */
        uint32      xlog_seg_size;       /* WAL segment size */
        uint32      nameDataLen;         /* NAMEDATALEN */
        uint32      indexMaxKeys;
        uint32      toast_max_chunk_size;
        uint32      loblksize;
        bool        float8ByVal;
        uint32      data_checksum_version;
        bool        default_char_signedness;  /* new in PG18 */
        char        mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];
        pg_crc32c   crc;                 /* MUST BE LAST */
    } ControlFileData;

DBState is the lifecycle state machine for the cluster:

    // DBState — src/include/catalog/pg_control.h
    typedef enum DBState
    {
        DB_STARTUP = 0,
        DB_SHUTDOWNED,
        DB_SHUTDOWNED_IN_RECOVERY,
        DB_SHUTDOWNING,
        DB_IN_CRASH_RECOVERY,
        DB_IN_ARCHIVE_RECOVERY,
        DB_IN_PRODUCTION,
    } DBState;

DB_SHUTDOWNED is the only state from which a clean, non-recovery startup
proceeds. Any other state triggers WAL replay on next startup. pg_ctl
relies on DBState in two places: the promote guard (must see
DB_IN_ARCHIVE_RECOVERY) and the startup-wait fallback (treats
DB_SHUTDOWNED_IN_RECOVERY as a non-error exit).
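The two rules in this paragraph reduce to simple predicates over the state value. The sketch below uses a local copy of the enum with a `DEMO_` prefix so it compiles on its own; the helper names are hypothetical, and the labels follow the wording pg_controldata prints as best I recall, so treat them as illustrative.

```c
#include <string.h>

/* Local copy for illustration; the authoritative definition lives in
 * src/include/catalog/pg_control.h. */
typedef enum DemoDBState
{
    DEMO_DB_STARTUP = 0,
    DEMO_DB_SHUTDOWNED,
    DEMO_DB_SHUTDOWNED_IN_RECOVERY,
    DEMO_DB_SHUTDOWNING,
    DEMO_DB_IN_CRASH_RECOVERY,
    DEMO_DB_IN_ARCHIVE_RECOVERY,
    DEMO_DB_IN_PRODUCTION
} DemoDBState;

/* Only a completed clean shutdown lets the next startup skip WAL replay. */
int demo_needs_wal_replay(DemoDBState state)
{
    return state != DEMO_DB_SHUTDOWNED;
}

/* The precondition the promote guard checks before creating the sentinel. */
int demo_can_promote(DemoDBState state)
{
    return state == DEMO_DB_IN_ARCHIVE_RECOVERY;
}

/* Human-readable label for a state value. */
const char *demo_state_label(DemoDBState state)
{
    switch (state)
    {
        case DEMO_DB_STARTUP:                return "starting up";
        case DEMO_DB_SHUTDOWNED:             return "shut down";
        case DEMO_DB_SHUTDOWNED_IN_RECOVERY: return "shut down in recovery";
        case DEMO_DB_SHUTDOWNING:            return "shutting down";
        case DEMO_DB_IN_CRASH_RECOVERY:      return "in crash recovery";
        case DEMO_DB_IN_ARCHIVE_RECOVERY:    return "in archive recovery";
        case DEMO_DB_IN_PRODUCTION:          return "in production";
    }
    return "unrecognized status code";
}
```

Keeping the fallback `return` outside the switch means a value read from a damaged file still yields a printable label.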
The CheckPoint struct embedded in checkPointCopy carries:

    // CheckPoint — src/include/catalog/pg_control.h
    typedef struct CheckPoint
    {
        XLogRecPtr        redo;             /* REDO start LSN */
        TimeLineID        ThisTimeLineID;
        TimeLineID        PrevTimeLineID;   /* non-zero if this record begins a new TL */
        bool              fullPageWrites;
        int               wal_level;
        FullTransactionId nextXid;
        Oid               nextOid;
        MultiXactId       nextMulti;
        MultiXactOffset   nextMultiOffset;
        TransactionId     oldestXid;
        Oid               oldestXidDB;
        MultiXactId       oldestMulti;
        Oid               oldestMultiDB;
        pg_time_t         time;
        TransactionId     oldestCommitTsXid;
        TransactionId     newestCommitTsXid;
        TransactionId     oldestActiveXid;
    } CheckPoint;

The read / write path through controldata_utils.c
Both backend and frontend code share src/common/controldata_utils.c.
get_controlfile() builds the path $PGDATA/global/pg_control, opens
it with O_RDONLY, reads exactly sizeof(ControlFileData) bytes, and
verifies the CRC32c. In frontend (tool) mode a CRC mismatch triggers up to
10 retries with 10 ms sleeps, a guard against reading a partial write from
a running server:

    // get_controlfile_by_exact_path — src/common/controldata_utils.c
    retry:
        fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0);
        r = read(fd, ControlFile, sizeof(ControlFileData));
        close(fd);
        INIT_CRC32C(crc);
        COMP_CRC32C(crc, ControlFile, offsetof(ControlFileData, crc));
        FIN_CRC32C(crc);
        *crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
        if (!*crc_ok_p && retries < 10)
        {
            retries++;
            pg_usleep(10000);
            goto retry;
        }

update_controlfile() zero-pads the buffer to PG_CONTROL_FILE_SIZE
(8192 bytes, the physical file size, kept constant across format changes
so that an old binary can detect a version mismatch as a wrong-version
error rather than a short read), recalculates the CRC, and writes with
O_WRONLY. In backend mode the caller holds ControlFileLock before
calling this function.
pg_controldata: read-only field printer
pg_controldata (src/bin/pg_controldata/pg_controldata.c) is a minimal
program that calls get_controlfile() and prints every field with
printf(). It carries no logic beyond field formatting. Notable points:
- It sets #define FRONTEND 1 but uses #include "postgres.h" (not postgres_fe.h) because it needs the WAL-internal types (xlog_internal.h, transam.h) that only the backend header exposes.
- The default_char_signedness field (new in PG18) is printed as signed/unsigned and encodes the platform’s default char signedness at initdb time, which matters when cross-compiling or migrating between ARM (unsigned) and x86 (signed) systems.
- data_checksum_version is 0 when page checksums are disabled; any nonzero value indicates the checksum algorithm version in use.

    // main (pg_controldata) — src/bin/pg_controldata/pg_controldata.c
    ControlFile = get_controlfile(DataDir, &crc_ok);
    if (!crc_ok)
        pg_log_warning("calculated CRC checksum does not match value stored in control file");

    printf("pg_control version number: %u\n", ControlFile->pg_control_version);
    printf("Database cluster state: %s\n", dbState(ControlFile->state));
    printf("Latest checkpoint location: %X/%X\n",
           LSN_FORMAT_ARGS(ControlFile->checkPoint));
    // ... (all ~40 fields)
    printf("Default char data signedness: %s\n",
           ControlFile->default_char_signedness ? "signed" : "unsigned");
    printf("Mock authentication nonce: %s\n", mock_auth_nonce_str);

Lifecycle flow diagram
flowchart TD
A["pg_ctl start<br/>do_start()"] --> B["start_postmaster()<br/>fork + exec postgres"]
B --> C["poll postmaster.pid<br/>wait_for_postmaster_start()"]
C --> D{"LOCK_FILE_LINE_PM_STATUS<br/>== PM_STATUS_READY?"}
D -- yes --> E["exit 0: server started"]
D -- postmaster died --> F["get_control_dbstate()<br/>read pg_control"]
F --> G{"DBState?"}
G -- DB_SHUTDOWNED_IN_RECOVERY --> H["exit 0: shutdown in recovery"]
G -- other --> I["exit 1: startup failed"]
D -- timeout --> J["exit 1: server did not start in time"]
K["pg_ctl stop<br/>do_stop()"] --> L["get_pgpid()<br/>read postmaster.pid"]
L --> M["kill pid sig<br/>SIGTERM/SIGINT/SIGQUIT"]
M --> N["poll: postmaster.pid gone?<br/>wait_for_postmaster_stop()"]
N -- gone --> O["exit 0: server stopped"]
N -- timeout --> P["exit 1: server does not shut down"]
Q["pg_ctl promote<br/>do_promote()"] --> R["get_control_dbstate()"]
R --> S{"DB_IN_ARCHIVE_RECOVERY?"}
S -- no --> T["exit 1: not a standby"]
S -- yes --> U["create promote file<br/>send SIGUSR1"]
U --> V["poll get_control_dbstate()<br/>wait_for_postmaster_promote()"]
V --> W{"DB_IN_PRODUCTION?"}
W -- yes --> X["exit 0: server promoted"]
W -- timeout --> Y["exit 1: promote timed out"]
DBState transition diagram
flowchart LR
S0["DB_STARTUP"] --> S6["DB_IN_PRODUCTION"]
S0 --> S4["DB_IN_CRASH_RECOVERY"]
S0 --> S5["DB_IN_ARCHIVE_RECOVERY"]
S6 --> S3["DB_SHUTDOWNING"]
S3 --> S1["DB_SHUTDOWNED"]
S5 --> S6
    S4 --> S6
S5 --> S2["DB_SHUTDOWNED_IN_RECOVERY"]
Source Walkthrough
pg_ctl.c — function inventory
| Symbol | Role |
|---|---|
| CtlCommand (enum) | Command discriminator |
| ShutdownMode (enum) | SMART_MODE, FAST_MODE, IMMEDIATE_MODE |
| WaitPMResult (enum) | Start-wait outcome |
| main | Parse options, build file paths, dispatch ctl_command |
| do_init | Fork initdb |
| do_start | Fork postgres; wait via wait_for_postmaster_start |
| do_stop | Send SIGTERM/INT/QUIT; wait via wait_for_postmaster_stop |
| do_restart | do_stop then do_start |
| do_reload | Send SIGHUP |
| do_promote | Create promote sentinel file; send SIGUSR1; wait |
| do_logrotate | Create logrotate sentinel file; send SIGUSR1 |
| do_status | Print PID from postmaster.pid and the saved command line from postmaster.opts |
| do_kill | Send arbitrary signal to a given PID |
| start_postmaster | fork + exec /bin/sh -c "exec postgres …" |
| wait_for_postmaster_start | Poll postmaster.pid line 8 at 10 Hz |
| wait_for_postmaster_stop | Poll postmaster.pid absence at 10 Hz |
| wait_for_postmaster_promote | Poll get_control_dbstate() at 10 Hz |
| get_pgpid | Read PID from postmaster.pid line 1 |
| get_control_dbstate | Read pg_control via get_controlfile(); return state |
| read_post_opts | Read saved options from postmaster.opts (used in restart) |
| postmaster_is_alive | kill(pid, 0) liveness check |
| trap_sigint_during_startup | Forward SIGINT to postmaster during start wait |
| set_mode | Parse --mode to ShutdownMode; set global sig (SIGTERM/SIGINT/SIGQUIT) |
| set_sig | Parse kill -s signal name to global sig |
| adjust_data_dir | Handle -D pointing at config-only directory |
pg_controldata.c — function inventory
| Symbol | Role |
|---|---|
| main | Parse -D; call get_controlfile(); print all fields |
| dbState | Map DBState enum to human-readable string |
| wal_level_str | Map WalLevel enum to string |
controldata_utils.c — shared read/write path
| Symbol | Role |
|---|---|
| get_controlfile | Build path $PGDATA/global/pg_control; delegate to get_controlfile_by_exact_path |
| get_controlfile_by_exact_path | Open, read, CRC-verify; retry up to 10× in frontend mode |
| update_controlfile | Recompute CRC, zero-pad to 8192 B, write; do_sync controls fsync |
Key structs and constants
| Symbol | Header | Note |
|---|---|---|
| ControlFileData | catalog/pg_control.h | On-disk pg_control layout; ≤ 512 B active payload |
| CheckPoint | catalog/pg_control.h | Embedded in ControlFileData.checkPointCopy |
| DBState | catalog/pg_control.h | 7-value lifecycle enum |
| PG_CONTROL_VERSION | catalog/pg_control.h | 1800 at REL_18_STABLE |
| PG_CONTROL_MAX_SAFE_SIZE | catalog/pg_control.h | 512: one-sector atomic-write limit |
| PG_CONTROL_FILE_SIZE | catalog/pg_control.h | 8192: physical file size, version-mismatch probe |
| LOCK_FILE_LINE_PM_STATUS | utils/pidfile.h | Line 8 in postmaster.pid |
| PM_STATUS_READY | utils/pidfile.h | "ready " (readiness sentinel) |
| PM_STATUS_STANDBY | utils/pidfile.h | "standby " (hot-standby ready) |
Source Verification (as of 2026-06-05)
Position hints for REL_18_STABLE commit 273fe94. Symbols are the stable anchor; line numbers decay as the tree evolves.
| Symbol | File | Approx. line |
|---|---|---|
| CtlCommand enum | src/bin/pg_ctl/pg_ctl.c | 53 |
| ShutdownMode enum | src/bin/pg_ctl/pg_ctl.c | 37 |
| WaitPMResult enum | src/bin/pg_ctl/pg_ctl.c | 44 |
| main | src/bin/pg_ctl/pg_ctl.c | 2202 |
| do_start | src/bin/pg_ctl/pg_ctl.c | 931 |
| do_stop | src/bin/pg_ctl/pg_ctl.c | 1027 |
| do_restart | src/bin/pg_ctl/pg_ctl.c | 1085 |
| do_reload | src/bin/pg_ctl/pg_ctl.c | 1149 |
| do_promote | src/bin/pg_ctl/pg_ctl.c | 1186 |
| do_logrotate | src/bin/pg_ctl/pg_ctl.c | 1267 |
| do_status | src/bin/pg_ctl/pg_ctl.c | 1348 |
| do_kill | src/bin/pg_ctl/pg_ctl.c | 1405 |
| start_postmaster | src/bin/pg_ctl/pg_ctl.c | 439 |
| wait_for_postmaster_start | src/bin/pg_ctl/pg_ctl.c | 593 |
| wait_for_postmaster_stop | src/bin/pg_ctl/pg_ctl.c | 717 |
| wait_for_postmaster_promote | src/bin/pg_ctl/pg_ctl.c | 754 |
| get_pgpid | src/bin/pg_ctl/pg_ctl.c | 246 |
| get_control_dbstate | src/bin/pg_ctl/pg_ctl.c | 2183 |
| postmaster_is_alive | src/bin/pg_ctl/pg_ctl.c | 1324 |
| trap_sigint_during_startup | src/bin/pg_ctl/pg_ctl.c | 857 |
| read_post_opts | src/bin/pg_ctl/pg_ctl.c | 802 |
| set_mode | src/bin/pg_ctl/pg_ctl.c | 2047 |
| set_sig | src/bin/pg_ctl/pg_ctl.c | 2075 |
| dbState | src/bin/pg_controldata/pg_controldata.c | 49 |
| wal_level_str | src/bin/pg_controldata/pg_controldata.c | 73 |
| main (pg_controldata) | src/bin/pg_controldata/pg_controldata.c | 88 |
| get_controlfile | src/common/controldata_utils.c | 52 |
| get_controlfile_by_exact_path | src/common/controldata_utils.c | 68 |
| update_controlfile | src/common/controldata_utils.c | 189 |
| ControlFileData struct | src/include/catalog/pg_control.h | 104 |
| CheckPoint struct | src/include/catalog/pg_control.h | 35 |
| DBState enum | src/include/catalog/pg_control.h | 89 |
| PG_CONTROL_VERSION | src/include/catalog/pg_control.h | 25 |
| PG_CONTROL_MAX_SAFE_SIZE | src/include/catalog/pg_control.h | 247 |
| PG_CONTROL_FILE_SIZE | src/include/catalog/pg_control.h | 256 |
| LOCK_FILE_LINE_PM_STATUS | src/include/utils/pidfile.h | 44 |
| PM_STATUS_READY | src/include/utils/pidfile.h | 53 |
| PM_STATUS_STANDBY | src/include/utils/pidfile.h | 54 |
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Other databases’ control utilities
MySQL / MariaDB split the role across two tools: mysqld_safe is a
wrapper script that launches mysqld and restarts it if it exits abnormally,
while mysqladmin shutdown asks the running server to stop (sending SIGTERM
to mysqld has the same effect). The InnoDB system tablespace header (ibdata1, page 0)
serves the control-file role: it stores the LSN of the last checkpoint and
the tablespace ID, protected by a page checksum. Unlike PostgreSQL, MySQL
stores this inside the tablespace itself rather than in a separate file,
a design choice that couples the control record to the storage engine.
Oracle uses two or more control files mirrored to separate mount points.
The Oracle control file is much larger (megabytes, not 512 bytes) because
it also stores the RMAN backup catalog and archived-log history. The
mirroring is a high-availability measure absent in PostgreSQL (which relies
on the single pg_control surviving on the same filesystem).
SQLite stores its database state in a 100-byte header at offset 0 of the database file itself, about as minimal as a control record can be. The file change counter at offset 24 is a version stamp rather than an integrity check: a reader that sees a different counter value knows its cached pages are stale and must be re-read. No separate control file exists; the database is the control file.
Research context
The design of pg_control reflects two classical crash-recovery insights.
First, ARIES (Mohan et al., 1992) established the principle that the
recovery manager must be able to find the REDO start point in a structure
that survives crashes — the “master record” in ARIES terminology, which
maps directly to checkPointCopy.redo in ControlFileData. Second,
the atomicity of the control-file write is a special case of the
write-ahead logging invariant: before any data change is considered
durable, the metadata record naming its location must be durable. Writing
pg_control with a CRC and ensuring it fits in one sector is how
PostgreSQL guarantees this without a second WAL record.
The system_identifier field (a 64-bit value generated at initdb) is
PostgreSQL’s guard against mixing data from unrelated clusters: WAL and
replication connections carry the identifier, and a server refuses WAL whose
identifier differs from the one in its own pg_control. A standby is a
physical copy of its primary and therefore shares the primary’s
system_identifier, so this check cannot tell a promoted standby from its old
primary; that divergence is tracked by the timeline IDs instead.
The mock_authentication_nonce (32 random bytes) was added in PG10 together
with SCRAM authentication. When a client names a role that does not exist,
the server derives a mock salt from this cluster-unique value and carries the
SASL exchange through to the end, so the reply does not reveal whether the
role exists. It is stored in pg_control because it must survive server
restarts and be available before any catalog access, exactly the kind of
stable, pre-catalog state that pg_control is designed to hold.
Sources
Primary source files (REL_18_STABLE, commit 273fe94):
- src/bin/pg_ctl/pg_ctl.c
- src/bin/pg_controldata/pg_controldata.c
- src/include/catalog/pg_control.h
- src/common/controldata_utils.c
- src/include/utils/pidfile.h
Cross-references within this KB:
- postgres-postmaster.md: the postmaster process that pg_ctl launches
- postgres-xlog-wal.md: WAL mechanics; checkpoint LSN interpretation
- postgres-checkpoint.md: checkpoint writes that update pg_control
- postgres-recovery-redo.md: startup reads pg_control to begin REDO
- postgres-backup-basebackup.md: backupStartPoint / backupEndRequired fields
- postgres-pg-dump-restore.md: logical backup counterpart (no pg_control dependency)
Research and textbooks:
- Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM TODS 17(1), 1992 — origin of the REDO-start “master record”
- Petrov, Database Internals (O’Reilly, 2019), ch. 5 “Transaction Processing and Recovery”: write-ahead logging, checkpoints, and recovery bootstrap
- Stonebraker & Rowe, “The Design of POSTGRES,” ACM SIGMOD 1986 — original process-model rationale