PostgreSQL pg_ctl / pg_controldata — Server Lifecycle Control and Cluster State Inspection

Every production database system faces a bootstrapping problem: the tool that starts and stops the server cannot use the server itself. Before the postmaster is running there is no SQL connection to open, no catalog to query, no session to receive commands. The control interface must be out-of-band — operating purely on the filesystem and OS signals.

Two complementary needs drive the design of such an interface.

Lifecycle control. An operator must be able to start, stop, restart, and reconfigure the server without writing a shell script that duplicates knowledge embedded in the server itself (where the binary lives, what options it accepts, what PID it chose). The control tool must discover that state autonomously and issue the right OS signal at the right time. PostgreSQL, like many server-class databases, distinguishes three stop modes:

  • Smart: refuse new connections, wait for all existing sessions to terminate naturally, then shut down.
  • Fast: disconnect active sessions immediately, flush WAL, write a shutdown checkpoint.
  • Immediate: kill immediately without checkpoint — analogous to pulling the power; leaves crash recovery as the next startup’s job.

State inspection without a live connection. A crash, a failed upgrade, or a misconfiguration can leave the server in a state where it cannot accept connections. Administrators need a way to read the server’s last known state — the DBState, the checkpoint LSN, the timeline — from a file on disk. This requires a stable, checksummed, human-readable on-disk structure that survives the crash and can be decoded by a standalone tool.

The control file, present in many server-class databases under names like pg_control, CURRENT, or control01.ctl, is the standard answer to the second need. Database Internals (Petrov, ch. 5 “Transaction Processing and Recovery”) covers the recovery machinery that depends on it: the control file is the first thing a crash-recovery path reads, because it gives the REDO start point, the timeline, and the confirmation that the previous shutdown was clean. An atomic write (always within one disk sector, protected by a CRC) is the correctness invariant: a torn write must be detectable so recovery does not proceed from a corrupt baseline.

The PID file as the server’s business card

A running server must leave a file — commonly named postmaster.pid or mysql.pid — that records its PID, start time, and status. This file plays three roles simultaneously:

  1. Mutual exclusion. A second server startup attempt can read the file, check that the PID is still alive, and refuse to start, preventing two servers from sharing the same data directory.
  2. Signal routing. A control tool sends the stop or reload signal to the PID it reads from the file, not to a hardcoded value. The PID is the only stable cross-process handle on Unix.
  3. Readiness advertisement. A structured pidfile — with a status field in a known line position — lets the control tool poll for startup completion without implementing a separate health-check protocol.
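The three roles can be sketched against an in-memory copy of such a file. This is a minimal illustration, not pg_ctl's own code: the `demo_*` names are hypothetical, and the line positions (PID on line 1, status on line 8) are assumed from the layout this article describes later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical line positions, modeled on the layout described in the text. */
#define DEMO_PIDFILE_LINE_PID    1
#define DEMO_PIDFILE_LINE_STATUS 8

/*
 * Copy the 1-based line `lineno` of `text` into `out` (newline excluded).
 * Returns 0 on success, -1 if the text has fewer lines than requested.
 */
int demo_pidfile_line(const char *text, int lineno, char *out, size_t outlen)
{
    const char *p = text;

    for (int i = 1; i < lineno; i++)
    {
        p = strchr(p, '\n');
        if (p == NULL)
            return -1;
        p++;                    /* step past the newline */
    }
    if (*p == '\0')
        return -1;

    size_t len = strcspn(p, "\n");
    if (len >= outlen)
        len = outlen - 1;       /* truncate rather than overflow */
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}

/* Role 2 (signal routing): parse the PID; -1 if missing or malformed. */
long demo_pidfile_pid(const char *text)
{
    char buf[32];
    char *end;

    if (demo_pidfile_line(text, DEMO_PIDFILE_LINE_PID, buf, sizeof buf) != 0)
        return -1;
    errno = 0;
    long pid = strtol(buf, &end, 10);
    if (errno != 0 || end == buf || *end != '\0' || pid <= 0)
        return -1;
    return pid;
}

/* Role 1 (mutual exclusion): signal 0 sends nothing, it only probes the PID. */
int demo_pid_is_alive(long pid)
{
    return kill((pid_t) pid, 0) == 0 || errno == EPERM;
}

/* Role 3 (readiness): a string comparison on a fixed line. */
int demo_pidfile_is_ready(const char *text)
{
    char status[32];

    if (demo_pidfile_line(text, DEMO_PIDFILE_LINE_STATUS, status, sizeof status) != 0)
        return 0;
    return strncmp(status, "ready", 5) == 0;
}
```

A liveness probe alone cannot rule out PID reuse by an unrelated process, which is one reason a real pidfile also records the start time and data directory.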

The control file: an atomic-write state record

Every DBMS with crash-recovery needs a small (one sector, ≤ 512 bytes) record that captures the recovery entry point. The canonical fields are:

| Field group | Purpose |
| --- | --- |
| Version identifiers | Detect cross-version or cross-architecture incompatibility before any catalog read |
| System identifier | Uniquely identify this cluster instance; reject WAL from a different cluster |
| DBState / lifecycle flags | Distinguish clean shutdown from in-recovery from crash |
| Checkpoint LSN + copy | Give WAL replay its REDO start point |
| Timeline IDs | Detect and track timeline switches (standby promotion, PITR) |
| WAL-level parameters | Confirm that archiving/replication settings match the WAL that was written |
| Compile-time constants | Detect block-size or alignment mismatches before first page read |

The atomic-write constraint — the active payload must fit in one disk sector — is shared by PostgreSQL (PG_CONTROL_MAX_SAFE_SIZE = 512), Oracle (control file header block), and MySQL/InnoDB (ibdata1 system tablespace header). Violating it introduces a window where a partial write could leave an inconsistent recovery baseline.
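One way to keep the one-sector constraint from eroding as fields are added is a compile-time check. The sketch below uses a hypothetical `DemoControlRecord` with an illustrative field set; the 512-byte limit is the sector size assumed in the text, and C11 `_Static_assert` stands in for whatever mechanism a given codebase uses.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SAFE_SIZE 512  /* assumed sector size, as in the text */

/* Hypothetical control record; the field set is illustrative only. */
typedef struct DemoControlRecord
{
    uint64_t system_identifier;
    uint32_t format_version;
    uint32_t state;
    uint64_t checkpoint_lsn;
    uint32_t timeline_id;
    uint32_t crc;               /* kept last so it can cover all prior bytes */
} DemoControlRecord;

/* Compilation fails if the record ever grows past one sector. */
_Static_assert(sizeof(DemoControlRecord) <= DEMO_MAX_SAFE_SIZE,
               "control record must fit in one disk sector");

/* The checksum must close the struct so the covered prefix is contiguous. */
_Static_assert(offsetof(DemoControlRecord, crc) + sizeof(uint32_t)
               == sizeof(DemoControlRecord),
               "crc must be the last field with no trailing padding");
```

Failing the build is cheaper than discovering at recovery time that a write straddled two sectors.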

The three stop modes map to three Unix signals in PostgreSQL (and similar mappings appear in other servers):

| Mode | Signal | Postmaster behavior |
| --- | --- | --- |
| Smart | SIGTERM | No new connections; wait for idle |
| Fast | SIGINT | Terminate backends; checkpoint; exit |
| Immediate | SIGQUIT | Kill all children; exit without checkpoint |

A control tool that understands this mapping does not need the server to expose a management API; the OS signal mechanism is the API.

Standby promotion is a state transition that cannot safely be signaled purely by sending a Unix signal to the postmaster. The postmaster must know the signal is a promote request, not a generic wakeup. The design pattern is a sentinel file in $PGDATA: the control tool creates the file, sends SIGUSR1, and the postmaster checks for the file on receipt. The file-based approach is atomic on POSIX filesystems (the open(O_CREAT) syscall is the serialization point) and survives a brief race between file creation and signal delivery.
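The sentinel-file handshake can be sketched from both ends. The helper names below are hypothetical and the receiver side is only an illustration of the pattern, not the postmaster's actual handler: the requester creates an empty file before signalling, and the receiver treats a successful unlink as proof that a request was pending.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Requester side: create the (empty) sentinel before sending the signal.
 * Returns 0 on success, -1 on failure.
 */
int demo_create_sentinel(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    return close(fd);
}

/*
 * Receiver side, run after the wakeup signal arrives. The signal says
 * "look"; the file says what for. unlink() both tests for the file and
 * consumes it, so a request is acted on at most once.
 * Returns 1 if a request was pending, 0 otherwise.
 */
int demo_consume_sentinel(const char *path)
{
    return unlink(path) == 0;
}
```

Because the file is created before the signal is sent, a signal that arrives early or is coalesced with another wakeup still finds the request on disk.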

pg_ctl is a standalone C binary (src/bin/pg_ctl/pg_ctl.c). It does not link against any PostgreSQL backend library; its only backend dependencies are two headers: src/include/catalog/pg_control.h (for ControlFileData and DBState) and src/include/utils/pidfile.h (for the postmaster.pid line positions). Everything else (process launch, signal delivery, wait loops) is done with POSIX syscalls.

The command set is encoded in the CtlCommand enum:

// CtlCommand — src/bin/pg_ctl/pg_ctl.c
typedef enum
{
    NO_COMMAND = 0,
    INIT_COMMAND,
    START_COMMAND,
    STOP_COMMAND,
    RESTART_COMMAND,
    RELOAD_COMMAND,
    STATUS_COMMAND,
    PROMOTE_COMMAND,
    LOGROTATE_COMMAND,
    KILL_COMMAND,
    REGISTER_COMMAND,       /* Windows service only */
    UNREGISTER_COMMAND,     /* Windows service only */
    RUN_AS_SERVICE_COMMAND,
} CtlCommand;

Starting the server. do_start() calls start_postmaster(), which forks a child and runs /bin/sh -c "exec postgres -D … < /dev/null" in it. Because the shell immediately execs postgres, the child PID returned to the parent is also the postmaster’s PID. The parent then calls wait_for_postmaster_start(), which polls postmaster.pid at 10 Hz for up to wait_seconds (default 60 s):

// start_postmaster — src/bin/pg_ctl/pg_ctl.c
pm_pid = fork();
if (pm_pid == 0) {
    /* child: detach session, exec postgres via shell */
    setsid();
    cmd = psprintf("exec \"%s\" %s%s < \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts, DEVNULL);
    execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
    exit(1); /* exec failed */
}
return pm_pid; /* parent returns shell PID */

The wait loop reads line 8 of postmaster.pid (LOCK_FILE_LINE_PM_STATUS = 8) and checks for PM_STATUS_READY or PM_STATUS_STANDBY:

// wait_for_postmaster_start — src/bin/pg_ctl/pg_ctl.c
char *pmstatus = optlines[LOCK_FILE_LINE_PM_STATUS - 1];
if (strcmp(pmstatus, PM_STATUS_READY) == 0 ||
    strcmp(pmstatus, PM_STATUS_STANDBY) == 0)
    return POSTMASTER_READY;

If the postmaster dies before setting the ready status, pg_ctl calls get_control_dbstate() — which reads pg_control directly — to distinguish DB_SHUTDOWNED_IN_RECOVERY from a genuine startup failure.

Stopping the server. do_stop() reads the PID from postmaster.pid, sends the mode-appropriate signal (SIGTERM / SIGINT / SIGQUIT), and polls for the PID file to disappear:

In do_stop (src/bin/pg_ctl/pg_ctl.c):

| ShutdownMode | Signal sent |
| --- | --- |
| SMART_MODE | SIGTERM |
| FAST_MODE | SIGINT (default) |
| IMMEDIATE_MODE | SIGQUIT |

The shutdown_mode defaults to FAST_MODE. The signal is the OS handle; pg_ctl never calls any in-process function of the server. The mapping from the textual --mode argument to the global sig variable is centralized in set_mode(), which is the single source of truth for the stop-mode → signal correspondence the table above describes:

// set_mode — src/bin/pg_ctl/pg_ctl.c
if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
{
    shutdown_mode = SMART_MODE;
    sig = SIGTERM;
}
else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
{
    shutdown_mode = FAST_MODE;
    sig = SIGINT;
}
else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
{
    shutdown_mode = IMMEDIATE_MODE;
    sig = SIGQUIT;
}

sig is a file-scope global initialized to SIGINT (the fast-mode default), so a pg_ctl stop with no --mode flag never enters set_mode() and runs with FAST_MODE semantics already in place. The kill command (pg_ctl kill SIGNALNAME PID) reuses the same sig global through a parallel set_sig() parser, which is why do_kill() can forward an arbitrary signal without any mode logic of its own.
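A name-to-signal parser in the spirit of set_sig() can be sketched as a small lookup table. This is an illustration rather than the pg_ctl source: `demo_parse_signal` is a hypothetical name, and the set of accepted names here is an assumption chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Name-to-number table; the set of names is an illustrative choice. */
static const struct
{
    const char *name;
    int         signo;
} demo_signals[] = {
    {"HUP", SIGHUP},
    {"INT", SIGINT},
    {"QUIT", SIGQUIT},
    {"TERM", SIGTERM},
    {"USR1", SIGUSR1},
    {"USR2", SIGUSR2},
};

/* Returns the signal number for `name`, or -1 if it is not recognized. */
int demo_parse_signal(const char *name)
{
    for (size_t i = 0; i < sizeof demo_signals / sizeof demo_signals[0]; i++)
    {
        if (strcmp(name, demo_signals[i].name) == 0)
            return demo_signals[i].signo;
    }
    return -1;
}
```

Keeping the mode parser and the signal-name parser separate but writing to one shared variable means the code that finally calls kill() needs no knowledge of which command line produced the signal.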

Promoting a standby. do_promote() first guards against promoting a non-standby by calling get_control_dbstate() and asserting DB_IN_ARCHIVE_RECOVERY. It then writes an empty $PGDATA/promote file and sends SIGUSR1:

// do_promote — src/bin/pg_ctl/pg_ctl.c
if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
{
    write_stderr(_("%s: cannot promote server; "
                   "server is not in standby mode\n"), progname);
    exit(1);
}
snprintf(promote_file, MAXPGPATH, "%s/promote", pg_data);
if ((prmfile = fopen(promote_file, "w")) == NULL) { ... exit(1); }
if (fclose(prmfile)) { ... exit(1); }
sig = SIGUSR1;
if (kill(pid, sig) != 0)
{
    write_stderr(_("%s: could not send promote signal (PID: %d): %m\n"),
                 progname, (int) pid);
    if (unlink(promote_file) != 0) /* best-effort cleanup on failure */
        write_stderr(...);
    exit(1);
}

Two correctness details are worth noting. First, the fopen/fclose pair — not a single creat — is used so that a failure to flush the (empty) file is caught as a distinct error from a failure to create it. Second, if kill() fails after the file already exists, do_promote() unlinks the sentinel so a later restart does not silently auto-promote on a stale promote file. After the signal lands, wait_for_postmaster_promote() polls get_control_dbstate() until it observes DB_IN_PRODUCTION, bailing out early if the PID file vanishes or the postmaster dies mid-promotion:

// wait_for_postmaster_promote — src/bin/pg_ctl/pg_ctl.c
for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
{
    if ((pid = get_pgpid(false)) == 0)
        return false; /* pid file is gone */
    if (kill(pid, 0) != 0)
        return false; /* postmaster died */
    state = get_control_dbstate();
    if (state == DB_IN_PRODUCTION)
        return true; /* successful promotion */
    if (cnt % WAITS_PER_SEC == 0)
        print_msg(".");
    pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
}
return false; /* timeout reached */

This is the same 10 Hz poll cadence (WAITS_PER_SEC) used by the start and stop wait loops, but the readiness signal is the DBState read from pg_control rather than a line in postmaster.pid — promotion has no dedicated pidfile status line, so the control file is the only authoritative witness that the transition to primary has completed.
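The shared cadence can be factored into a generic bounded poll, sketched below. The names `demo_wait_until` and `demo_check_fn` are hypothetical; pg_ctl itself writes each loop out by hand, so this only illustrates the pattern of a fixed attempt budget with a fixed sleep between probes.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

#define DEMO_WAITS_PER_SEC 10   /* 10 Hz, matching the cadence in the text */

typedef bool (*demo_check_fn)(void *ctx);

/*
 * Probe `check` up to wait_seconds * DEMO_WAITS_PER_SEC times, sleeping
 * one tick after each unsuccessful probe. Returns true as soon as a probe
 * succeeds, false once the attempt budget is exhausted.
 */
bool demo_wait_until(demo_check_fn check, void *ctx, int wait_seconds)
{
    const struct timespec tick = {0, 1000000000L / DEMO_WAITS_PER_SEC};

    for (int cnt = 0; cnt < wait_seconds * DEMO_WAITS_PER_SEC; cnt++)
    {
        if (check(ctx))
            return true;
        nanosleep(&tick, NULL);
    }
    return false;
}

/* Example predicate: counts its calls and succeeds on the third probe. */
bool demo_third_time_lucky(void *ctx)
{
    int *calls = ctx;

    return ++(*calls) >= 3;
}
```

The predicate is where the loops differ: a pidfile line for start, pidfile absence for stop, and the DBState from the control file for promote.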

Reload. do_reload() sends SIGHUP, which the postmaster relays to all backends, causing them to reread postgresql.conf. No wait loop is needed because SIGHUP handling is asynchronous and non-disruptive.

get_control_dbstate: the bridge between pg_ctl and pg_control. This static helper, called from the wait loops and the promote guard, reads pg_control through the shared get_controlfile() utility:

// get_control_dbstate — src/bin/pg_ctl/pg_ctl.c
static DBState
get_control_dbstate(void)
{
    bool crc_ok;
    ControlFileData *ctl = get_controlfile(pg_data, &crc_ok);
    if (!crc_ok) { write_stderr("control file appears to be corrupt\n"); exit(1); }
    DBState ret = ctl->state;
    pfree(ctl);
    return ret;
}

ControlFileData (src/include/catalog/pg_control.h) is the on-disk layout of $PGDATA/global/pg_control. At REL_18_STABLE, PG_CONTROL_VERSION = 1800. The struct is deliberately kept under PG_CONTROL_MAX_SAFE_SIZE = 512 bytes — one disk sector — so that every write is atomic:

// ControlFileData — src/include/catalog/pg_control.h
typedef struct ControlFileData
{
    uint64 system_identifier; /* unique cluster ID (set at initdb) */
    uint32 pg_control_version; /* PG_CONTROL_VERSION = 1800 in PG18 */
    uint32 catalog_version_no; /* catversion.h; changes on catalog changes */
    DBState state; /* current lifecycle state */
    pg_time_t time; /* timestamp of last pg_control update */
    XLogRecPtr checkPoint; /* LSN of last checkpoint record */
    CheckPoint checkPointCopy; /* full body of that checkpoint record */
    XLogRecPtr unloggedLSN; /* fake LSN counter for unlogged relations */
    XLogRecPtr minRecoveryPoint; /* must replay at least to here */
    TimeLineID minRecoveryPointTLI;
    XLogRecPtr backupStartPoint; /* set during online backup */
    XLogRecPtr backupEndPoint;
    bool backupEndRequired;
    int wal_level;
    bool wal_log_hints;
    int MaxConnections;
    int max_worker_processes;
    int max_wal_senders;
    int max_prepared_xacts;
    int max_locks_per_xact;
    bool track_commit_timestamp;
    uint32 maxAlign;
    double floatFormat; /* = 1234567.0; architecture check */
    uint32 blcksz; /* data block size */
    uint32 relseg_size; /* blocks per large-relation segment */
    uint32 xlog_blcksz; /* WAL block size */
    uint32 xlog_seg_size; /* WAL segment size */
    uint32 nameDataLen; /* NAMEDATALEN */
    uint32 indexMaxKeys;
    uint32 toast_max_chunk_size;
    uint32 loblksize;
    bool float8ByVal;
    uint32 data_checksum_version;
    bool default_char_signedness; /* new in PG18 */
    char mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];
    pg_crc32c crc; /* MUST BE LAST */
} ControlFileData;

DBState is the lifecycle state machine for the cluster:

// DBState — src/include/catalog/pg_control.h
typedef enum DBState
{
    DB_STARTUP = 0,
    DB_SHUTDOWNED,
    DB_SHUTDOWNED_IN_RECOVERY,
    DB_SHUTDOWNING,
    DB_IN_CRASH_RECOVERY,
    DB_IN_ARCHIVE_RECOVERY,
    DB_IN_PRODUCTION,
} DBState;

DB_SHUTDOWNED is the only state from which a clean, non-recovery startup proceeds. Any other state triggers WAL replay on next startup. pg_ctl relies on DBState in two places: the promote guard (must see DB_IN_ARCHIVE_RECOVERY) and the startup-wait fallback (treats DB_SHUTDOWNED_IN_RECOVERY as a non-error exit).
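The decisions described above reduce to three small predicates over the state value. The sketch below uses a local `DemoDBState` mirror of the enum so that it compiles on its own; the function names are hypothetical and the rules are taken directly from the preceding paragraph.

```c
#include <stdbool.h>

/* Local mirror of the seven states; prefixed to mark it as a demo type. */
typedef enum DemoDBState
{
    DEMO_DB_STARTUP = 0,
    DEMO_DB_SHUTDOWNED,
    DEMO_DB_SHUTDOWNED_IN_RECOVERY,
    DEMO_DB_SHUTDOWNING,
    DEMO_DB_IN_CRASH_RECOVERY,
    DEMO_DB_IN_ARCHIVE_RECOVERY,
    DEMO_DB_IN_PRODUCTION,
} DemoDBState;

/* Promote guard: only a server in archive recovery qualifies. */
bool demo_can_promote(DemoDBState state)
{
    return state == DEMO_DB_IN_ARCHIVE_RECOVERY;
}

/* Startup-wait fallback: an early postmaster exit that is not a failure. */
bool demo_early_exit_is_ok(DemoDBState state)
{
    return state == DEMO_DB_SHUTDOWNED_IN_RECOVERY;
}

/* Per the text, every state except a clean shutdown implies WAL replay. */
bool demo_next_start_replays_wal(DemoDBState state)
{
    return state != DEMO_DB_SHUTDOWNED;
}
```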

The CheckPoint struct embedded in checkPointCopy carries:

// CheckPoint — src/include/catalog/pg_control.h
typedef struct CheckPoint
{
    XLogRecPtr redo; /* REDO start LSN */
    TimeLineID ThisTimeLineID;
    TimeLineID PrevTimeLineID; /* previous TLI if this record begins a new timeline; otherwise equals ThisTimeLineID */
    bool fullPageWrites;
    int wal_level;
    FullTransactionId nextXid;
    Oid nextOid;
    MultiXactId nextMulti;
    MultiXactOffset nextMultiOffset;
    TransactionId oldestXid;
    Oid oldestXidDB;
    MultiXactId oldestMulti;
    Oid oldestMultiDB;
    pg_time_t time;
    TransactionId oldestCommitTsXid;
    TransactionId newestCommitTsXid;
    TransactionId oldestActiveXid;
} CheckPoint;

The read / write path through controldata_utils.c

Both backend and frontend code share src/common/controldata_utils.c. get_controlfile() builds the path $PGDATA/global/pg_control, opens it with O_RDONLY, reads exactly sizeof(ControlFileData) bytes, and verifies the CRC32c. In frontend (tool) mode a CRC mismatch triggers up to 10 retries with 10 ms sleeps — a guard against reading a partial write from a running server:

// get_controlfile_by_exact_path — src/common/controldata_utils.c
retry:
fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0);
r = read(fd, ControlFile, sizeof(ControlFileData));
close(fd);
INIT_CRC32C(crc);
COMP_CRC32C(crc, ControlFile, offsetof(ControlFileData, crc));
FIN_CRC32C(crc);
*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
if (!*crc_ok_p && retries < 10) { retries++; pg_usleep(10000); goto retry; }

update_controlfile() zero-pads the buffer to PG_CONTROL_FILE_SIZE (8192 bytes — the physical file size, kept constant across format changes so that an old binary can detect a version mismatch as a wrong-version error rather than a short read), recalculates the CRC, and writes with O_WRONLY. In backend mode the caller holds ControlFileLock before calling this function.
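The seal-then-pad write and the recompute-and-compare read can be sketched end to end. Everything named `demo_*` or `Demo*` below is hypothetical, and the bitwise CRC-32C routine is a simple stand-in written for this example rather than PostgreSQL's optimized implementation; only the shape (checksum over the bytes before the trailing crc field, fixed 8192-byte zero-padded image) follows the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t demo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical record; no padding between members on common ABIs. */
typedef struct DemoRecord
{
    uint64_t checkpoint_lsn;
    uint32_t state;
    uint32_t crc;               /* last field: covers every byte before it */
} DemoRecord;

#define DEMO_FILE_SIZE 8192     /* fixed on-disk size, as in the text */

/* Writer side: seal the record, then lay it out in a zero-padded image. */
void demo_prepare_write(DemoRecord *rec, unsigned char buf[DEMO_FILE_SIZE])
{
    rec->crc = demo_crc32c(rec, offsetof(DemoRecord, crc));
    memset(buf, 0, DEMO_FILE_SIZE);
    memcpy(buf, rec, sizeof *rec);
}

/* Reader side: recompute over the same prefix and compare. */
bool demo_verify(const unsigned char buf[DEMO_FILE_SIZE], DemoRecord *out)
{
    memcpy(out, buf, sizeof *out);
    return demo_crc32c(out, offsetof(DemoRecord, crc)) == out->crc;
}
```

A single flipped bit anywhere in the covered prefix changes the recomputed CRC, which is what lets a reader tell a torn or concurrent write from a valid image and decide to retry.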

pg_controldata (src/bin/pg_controldata/pg_controldata.c) is a minimal program that calls get_controlfile() and prints every field with printf(). It carries no logic beyond field formatting. Notable points:

  • It has #define FRONTEND 1 but includes "postgres.h" (not postgres_fe.h) because it needs the WAL-internal types (xlog_internal.h, transam.h) that only the backend header exposes.
  • The default_char_signedness field (new in PG18) is printed as signed / unsigned and encodes the platform’s default char signedness at initdb time — relevant when cross-compiling or migrating between ARM (unsigned) and x86 (signed) systems.
  • data_checksum_version is 0 when page checksums are disabled; any nonzero value indicates the checksum algorithm version in use.

// main (pg_controldata): src/bin/pg_controldata/pg_controldata.c
ControlFile = get_controlfile(DataDir, &crc_ok);
if (!crc_ok)
    pg_log_warning("calculated CRC checksum does not match value stored in control file");
printf("pg_control version number: %u\n", ControlFile->pg_control_version);
printf("Database cluster state: %s\n", dbState(ControlFile->state));
printf("Latest checkpoint location: %X/%X\n", LSN_FORMAT_ARGS(ControlFile->checkPoint));
// ... (all ~40 fields)
printf("Default char data signedness: %s\n",
       ControlFile->default_char_signedness ? "signed" : "unsigned");
printf("Mock authentication nonce: %s\n", mock_auth_nonce_str);

Control flow of pg_ctl start, stop, and promote (Mermaid source):

flowchart TD
    A["pg_ctl start<br/>do_start()"] --> B["start_postmaster()<br/>fork + exec postgres"]
    B --> C["poll postmaster.pid<br/>wait_for_postmaster_start()"]
    C --> D{"LOCK_FILE_LINE_PM_STATUS<br/>== PM_STATUS_READY?"}
    D -- yes --> E["exit 0: server started"]
    D -- postmaster died --> F["get_control_dbstate()<br/>read pg_control"]
    F --> G{"DBState?"}
    G -- DB_SHUTDOWNED_IN_RECOVERY --> H["exit 0: shutdown in recovery"]
    G -- other --> I["exit 1: startup failed"]
    D -- timeout --> J["exit 1: server did not start in time"]

    K["pg_ctl stop<br/>do_stop()"] --> L["get_pgpid()<br/>read postmaster.pid"]
    L --> M["kill pid sig<br/>SIGTERM/SIGINT/SIGQUIT"]
    M --> N["poll: postmaster.pid gone?<br/>wait_for_postmaster_stop()"]
    N -- gone --> O["exit 0: server stopped"]
    N -- timeout --> P["exit 1: server does not shut down"]

    Q["pg_ctl promote<br/>do_promote()"] --> R["get_control_dbstate()"]
    R --> S{"DB_IN_ARCHIVE_RECOVERY?"}
    S -- no --> T["exit 1: not a standby"]
    S -- yes --> U["create promote file<br/>send SIGUSR1"]
    U --> V["poll get_control_dbstate()<br/>wait_for_postmaster_promote()"]
    V --> W{"DB_IN_PRODUCTION?"}
    W -- yes --> X["exit 0: server promoted"]
    W -- timeout --> Y["exit 1: promote timed out"]

DBState transitions (Mermaid source):

flowchart LR
    S0["DB_STARTUP"] --> S6["DB_IN_PRODUCTION"]
    S0 --> S4["DB_IN_CRASH_RECOVERY"]
    S0 --> S5["DB_IN_ARCHIVE_RECOVERY"]
    S6 --> S3["DB_SHUTDOWNING"]
    S3 --> S1["DB_SHUTDOWNED"]
    S5 --> S6
    S4 --> S6
    S5 --> S2["DB_SHUTDOWNED_IN_RECOVERY"]

Symbols in src/bin/pg_ctl/pg_ctl.c:

| Symbol | Role |
| --- | --- |
| CtlCommand (enum) | Command discriminator |
| ShutdownMode (enum) | SMART_MODE, FAST_MODE, IMMEDIATE_MODE |
| WaitPMResult (enum) | Start-wait outcome |
| main | Parse options, build file paths, dispatch ctl_command |
| do_init | Run initdb |
| do_start | Fork postgres; wait via wait_for_postmaster_start |
| do_stop | Send SIGTERM/INT/QUIT; wait via wait_for_postmaster_stop |
| do_restart | do_stop then do_start |
| do_reload | Send SIGHUP |
| do_promote | Create promote sentinel file; send SIGUSR1; wait |
| do_logrotate | Create logrotate sentinel file; send SIGUSR1 |
| do_status | Print the PID from postmaster.pid and the saved command line from postmaster.opts |
| do_kill | Send arbitrary signal to a given PID |
| start_postmaster | fork + exec /bin/sh -c "exec postgres …" |
| wait_for_postmaster_start | Poll postmaster.pid line 8 at 10 Hz |
| wait_for_postmaster_stop | Poll postmaster.pid absence at 10 Hz |
| wait_for_postmaster_promote | Poll get_control_dbstate() at 10 Hz |
| get_pgpid | Read PID from postmaster.pid line 1 |
| get_control_dbstate | Read pg_control via get_controlfile(); return state |
| read_post_opts | Read saved options from postmaster.opts (used in restart) |
| postmaster_is_alive | kill(pid, 0) liveness check |
| trap_sigint_during_startup | Forward SIGINT to postmaster during start wait |
| set_mode | Parse --mode to ShutdownMode; set global sig (SIGTERM/SIGINT/SIGQUIT) |
| set_sig | Parse the signal name given to pg_ctl kill into global sig |
| adjust_data_dir | Handle -D pointing at config-only directory |

Symbols in src/bin/pg_controldata/pg_controldata.c:

| Symbol | Role |
| --- | --- |
| main | Parse -D; call get_controlfile(); print all fields |
| dbState | Map DBState enum to human-readable string |
| wal_level_str | Map WalLevel enum to string |

controldata_utils.c — shared read/write path


| Symbol | Role |
| --- | --- |
| get_controlfile | Build path $PGDATA/global/pg_control; delegate to get_controlfile_by_exact_path |
| get_controlfile_by_exact_path | Open, read, CRC-verify; retry up to 10× in frontend mode |
| update_controlfile | Recompute CRC, zero-pad to 8192 B, write; do_sync controls fsync |

Key types and constants:

| Symbol | Header | Note |
| --- | --- | --- |
| ControlFileData | catalog/pg_control.h | On-disk pg_control layout; ≤ 512 B active payload |
| CheckPoint | catalog/pg_control.h | Embedded in ControlFileData.checkPointCopy |
| DBState | catalog/pg_control.h | 7-value lifecycle enum |
| PG_CONTROL_VERSION | catalog/pg_control.h | 1800 at REL_18_STABLE |
| PG_CONTROL_MAX_SAFE_SIZE | catalog/pg_control.h | 512; one-sector atomic-write limit |
| PG_CONTROL_FILE_SIZE | catalog/pg_control.h | 8192; physical file size, version-mismatch probe |
| LOCK_FILE_LINE_PM_STATUS | utils/pidfile.h | Line 8 in postmaster.pid |
| PM_STATUS_READY | utils/pidfile.h | "ready" padded with spaces to 8 characters; readiness sentinel |
| PM_STATUS_STANDBY | utils/pidfile.h | "standby" padded with a space to 8 characters; hot-standby ready |

Position hints for REL_18_STABLE commit 273fe94. Symbols are the stable anchor; line numbers decay as the tree evolves.

| Symbol | File | Approx. line |
| --- | --- | --- |
| CtlCommand enum | src/bin/pg_ctl/pg_ctl.c | 53 |
| ShutdownMode enum | src/bin/pg_ctl/pg_ctl.c | 37 |
| WaitPMResult enum | src/bin/pg_ctl/pg_ctl.c | 44 |
| main | src/bin/pg_ctl/pg_ctl.c | 2202 |
| do_start | src/bin/pg_ctl/pg_ctl.c | 931 |
| do_stop | src/bin/pg_ctl/pg_ctl.c | 1027 |
| do_restart | src/bin/pg_ctl/pg_ctl.c | 1085 |
| do_reload | src/bin/pg_ctl/pg_ctl.c | 1149 |
| do_promote | src/bin/pg_ctl/pg_ctl.c | 1186 |
| do_logrotate | src/bin/pg_ctl/pg_ctl.c | 1267 |
| do_status | src/bin/pg_ctl/pg_ctl.c | 1348 |
| do_kill | src/bin/pg_ctl/pg_ctl.c | 1405 |
| start_postmaster | src/bin/pg_ctl/pg_ctl.c | 439 |
| wait_for_postmaster_start | src/bin/pg_ctl/pg_ctl.c | 593 |
| wait_for_postmaster_stop | src/bin/pg_ctl/pg_ctl.c | 717 |
| wait_for_postmaster_promote | src/bin/pg_ctl/pg_ctl.c | 754 |
| get_pgpid | src/bin/pg_ctl/pg_ctl.c | 246 |
| get_control_dbstate | src/bin/pg_ctl/pg_ctl.c | 2183 |
| postmaster_is_alive | src/bin/pg_ctl/pg_ctl.c | 1324 |
| trap_sigint_during_startup | src/bin/pg_ctl/pg_ctl.c | 857 |
| read_post_opts | src/bin/pg_ctl/pg_ctl.c | 802 |
| set_mode | src/bin/pg_ctl/pg_ctl.c | 2047 |
| set_sig | src/bin/pg_ctl/pg_ctl.c | 2075 |
| dbState | src/bin/pg_controldata/pg_controldata.c | 49 |
| wal_level_str | src/bin/pg_controldata/pg_controldata.c | 73 |
| main (pg_controldata) | src/bin/pg_controldata/pg_controldata.c | 88 |
| get_controlfile | src/common/controldata_utils.c | 52 |
| get_controlfile_by_exact_path | src/common/controldata_utils.c | 68 |
| update_controlfile | src/common/controldata_utils.c | 189 |
| ControlFileData struct | src/include/catalog/pg_control.h | 104 |
| CheckPoint struct | src/include/catalog/pg_control.h | 35 |
| DBState enum | src/include/catalog/pg_control.h | 89 |
| PG_CONTROL_VERSION | src/include/catalog/pg_control.h | 25 |
| PG_CONTROL_MAX_SAFE_SIZE | src/include/catalog/pg_control.h | 247 |
| PG_CONTROL_FILE_SIZE | src/include/catalog/pg_control.h | 256 |
| LOCK_FILE_LINE_PM_STATUS | src/include/utils/pidfile.h | 44 |
| PM_STATUS_READY | src/include/utils/pidfile.h | 53 |
| PM_STATUS_STANDBY | src/include/utils/pidfile.h | 54 |

Beyond PostgreSQL — Comparative Designs & Research Frontiers

MySQL / MariaDB use mysqladmin / mysqld_safe in a similar role: a wrapper script that forks mysqld, monitors the PID file, and sends SIGTERM on shutdown. The InnoDB system tablespace header (ibdata1, page 0) serves the control-file role: it stores the LSN of the last checkpoint and the tablespace ID, protected by a page checksum. Unlike PostgreSQL, MySQL stores this inside the tablespace itself rather than in a separate file, a design choice that couples the control record to the storage engine.

Oracle uses two or more control files mirrored to separate mount points. The Oracle control file is much larger (megabytes, not 512 bytes) because it also stores the RMAN backup catalog and archived-log history. The mirroring is a high-availability measure absent in PostgreSQL (which relies on the single pg_control surviving on the same filesystem).

SQLite stores its database state in a 100-byte header at offset 0 of the database file itself, about as minimal as a control record can be. The 4-byte “file change counter” at offset 24 plays a loosely comparable validation role, although it detects staleness rather than corruption: a reader that sees a changed counter knows its cached pages are out of date and must be re-read. No separate control file exists; the database is the control file.
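Reading that counter takes only a few lines. The sketch below assumes the layout stated above (100-byte header, counter at offset 24) plus the big-endian integer encoding and the 16-byte magic string that begin a SQLite 3 file; the `demo_*` names are hypothetical and no SQLite library calls are involved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SQLITE_HEADER_SIZE       100
#define DEMO_SQLITE_CHANGE_CTR_OFFSET 24

/* The header stores multi-byte integers big-endian. */
uint32_t demo_read_be32(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Returns 1 and stores the change counter if `hdr` starts with the SQLite
 * magic string, 0 otherwise.
 */
int demo_sqlite_change_counter(const unsigned char hdr[DEMO_SQLITE_HEADER_SIZE],
                               uint32_t *counter)
{
    /* 15 visible characters plus the terminating NUL make 16 bytes. */
    if (memcmp(hdr, "SQLite format 3", 16) != 0)
        return 0;
    *counter = demo_read_be32(hdr + DEMO_SQLITE_CHANGE_CTR_OFFSET);
    return 1;
}
```

A cache layer would remember the last counter it saw and drop its cached pages whenever a fresh read returns a different value.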

The design of pg_control reflects two classical crash-recovery insights. First, ARIES (Mohan et al., 1992) established the principle that the recovery manager must be able to find the REDO start point in a structure that survives crashes — the “master record” in ARIES terminology, which maps directly to checkPointCopy.redo in ControlFileData. Second, the atomicity of the control-file write is a special case of the write-ahead logging invariant: before any data change is considered durable, the metadata record naming its location must be durable. Writing pg_control with a CRC and ensuring it fits in one sector is how PostgreSQL guarantees this without a second WAL record.

The system_identifier field (a 64-bit value generated at initdb) guards against mixing data from unrelated clusters: WAL and replication connections carry the identifier, so a server rejects WAL or a streaming peer whose identifier differs from the one in its own pg_control. A standby built from a base backup shares its primary’s identifier, so this check does not by itself resolve split-brain after a promotion; that divergence is tracked by the timeline IDs also recorded in pg_control.

The mock_authentication_nonce (32 random bytes) was added in PG10 alongside SCRAM authentication. When a client tries to authenticate as a role that does not exist, the server still carries out a mock exchange whose salt is derived from this cluster-unique value, so repeated attempts see consistent responses and cannot tell missing roles from real ones. It is stored in pg_control because it must survive server restarts and be available before any catalog access, exactly the kind of stable, pre-catalog state that pg_control is designed to hold.

Primary source files (REL_18_STABLE, commit 273fe94):

  • src/bin/pg_ctl/pg_ctl.c
  • src/bin/pg_controldata/pg_controldata.c
  • src/include/catalog/pg_control.h
  • src/common/controldata_utils.c
  • src/include/utils/pidfile.h

Cross-references within this KB:

  • postgres-postmaster.md — the postmaster process that pg_ctl launches
  • postgres-xlog-wal.md — WAL mechanics; checkpoint LSN interpretation
  • postgres-checkpoint.md — checkpoint writes that update pg_control
  • postgres-recovery-redo.md — startup reads pg_control to begin REDO
  • postgres-backup-basebackup.md: backupStartPoint / backupEndRequired fields
  • postgres-pg-dump-restore.md — logical backup counterpart (no pg_control dependency)

Research and textbooks:

  • Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM TODS 17(1), 1992 — origin of the REDO-start “master record”
  • Petrov, Database Internals (O’Reilly, 2019), ch. 5 “Transaction Processing and Recovery” (recovery, checkpointing, and write-ahead logging background)
  • Stonebraker & Rowe, “The Design of POSTGRES,” ACM SIGMOD 1986 — original process-model rationale