PostgreSQL Architecture Overview — One Binary, One Shared-Memory Machine, and the WAL Spine
Contents:
- Who this is for
- The thesis: a shared-memory machine, not a process tree
- Axis 1 — Process model
- Axis 2 — Shared-memory and IPC substrate
- Axis 3 — The WAL durability spine
- Axis 4 — Storage and pluggable access methods
- Axis 5 — Query pipeline
- Axis 6 — Catalog and cache layer
- Axis 7 — Extensibility surface
- Where to start reading
- Subcategory map
Who this is for
This is the front door for the PostgreSQL code-analysis tree. It is written for someone who has read zero lines of PostgreSQL source and wants the high-level shape (what processes run, what they share, how one query travels through the engine, and what keeps durability and concurrency honest) before deciding which of the ~95 per-module detail docs to open. Every section is deliberately a router, not a duplicate: it names the boundary and points at the doc that owns the mechanism. If a section feels thin, that is by design.
A second audience is the engineer who has read a few detail docs and now
wants to fit them together. The cross-cutting questions — “how does a
SELECT get from the wire to a heap page and back?”, “what does a snapshot
actually read?”, “why does flushing a dirty buffer wait on the WAL?” — each
cut across several subsystems. The diagrams here name the seams those
questions cross, so you can pick the right pair of detail docs to read in
tandem.
A third audience is anyone asking why PostgreSQL looks the way it does.
Each of its defining choices is a deliberate trade-off with an
object-relational, academic lineage (the Berkeley POSTGRES project): a
single forked binary instead of threads or separate daemons, a fixed
shared-memory segment sized once at boot, a no-overwrite MVCC heap that
needs vacuum, a WAL that is the universal event log, and an extensibility
surface (pluggable access methods, resource managers, hooks, foreign data
wrappers, background workers) baked into the core. The detail docs treat
each in depth; postgres-design-philosophy.md (planned) will collect the
rationale, and the postgres-evolution-*.md docs trace how the big
subsystems changed release by release.
Version note. This tree is written against the cloned source at
REL_18_STABLE (commit 273fe94, PostgreSQL 18.x). Symbol names are the anchor; the per-doc “Source verification” sections pin line numbers to this commit.
The thesis: a shared-memory machine, not a process tree
If you have read another DBMS’s internals, drop one assumption at the door: in PostgreSQL the processes are not the architecture. The architecture is the shared memory they coordinate through, and the processes are interchangeable inhabitants of it.
Concretely:
- PostgreSQL is one executable. There are no separate server/broker/agent
binaries. A single postmaster process fork()s every other process from
itself (postmaster_child_launch → fork_process, in launch_backend.c). On
Windows, where there is no fork(), the child re-executes the same binary
and reattaches. That is the EXEC_BACKEND path (it can also be enabled on
other platforms for testing), and it is a portability fallback, not a
second architecture.
- All those children attach to one shared-memory segment whose total size
is computed once, at postmaster startup, by CalculateShmemSize and
created by CreateSharedMemoryAndSemaphores (ipci.c). The segment does not
grow on demand; an extension that wants shared memory must reserve it up
front through the shmem_request_hook / RequestAddinShmemSpace path before
the segment is sized.
- The state lives in that segment, not in the processes: the per-backend
PGPROC slots and the global ProcGlobal (proc.h), the procarray that backs
every snapshot (procarray.c), the shared buffer pool, the heavyweight
lock table, the predicate-lock (SSI) structures, and the
shared-invalidation queue. A backend is mostly a thread of control with a
private memory-context tree that operates on shared state under LWLock
protection.
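Get this picture right and the rest of the tree falls into place: the buffer manager, lock manager, procarray, and WAL insert path all assume “I am one of many processes touching a shared structure under a short-term lock.” (In PostgreSQL that lock is an LWLock or a spinlock; the name Latch belongs to a different primitive, the process wake-up mechanism covered in Axis 2.) The seven axes below are different views of that one machine.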
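The sized-once rule described above can be pictured with a small model. This is a minimal sketch, not PostgreSQL code: ShmemPlan, shmem_plan_request, and shmem_plan_finalize are hypothetical names standing in for the accumulate-then-create sequence that the real CalculateShmemSize / CreateSharedMemoryAndSemaphores path performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are PostgreSQL symbols. */
typedef struct ShmemPlan
{
    size_t total_bytes;   /* running sum of every request */
    bool   sized;         /* true once the segment has been "created" */
} ShmemPlan;

/* A subsystem or extension asks for space; allowed only before sizing. */
bool
shmem_plan_request(ShmemPlan *plan, size_t bytes)
{
    assert(plan != NULL);
    if (plan->sized)
        return false;     /* too late: the segment never grows */
    plan->total_bytes += bytes;
    return true;
}

/* The postmaster-side step: freeze the total and return it. */
size_t
shmem_plan_finalize(ShmemPlan *plan)
{
    assert(plan != NULL);
    plan->sized = true;
    return plan->total_bytes;
}
```

Requests of 4096 and 1024 bytes finalize to 5120 bytes; a request made after finalization is refused, which is the position an extension is in if it asks for shared memory after the segment has been sized.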
Axis 1 — Process model
PostgreSQL’s process roles are an enum, not a set of binaries. The
BackendType in miscadmin.h enumerates every kind of process the
postmaster can fork; if you add one you also register it in the
child_process_kinds array in launch_backend.c.
// BackendType (src/include/miscadmin.h)
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,              // a client session
    B_DEAD_END_BACKEND,
    B_AUTOVAC_LAUNCHER,
    B_AUTOVAC_WORKER,
    B_BG_WORKER,            // extension / parallel / logical-apply workers
    B_WAL_SENDER,
    B_SLOTSYNC_WORKER,
    B_STANDALONE_BACKEND,   // single-user mode (initdb, --single)

    /* auxiliary procs: one of each (except IO workers) */
    B_ARCHIVER,
    B_BG_WRITER,
    B_CHECKPOINTER,
    B_IO_WORKER,
    B_STARTUP,
    B_WAL_RECEIVER,
    B_WAL_SUMMARIZER,
    B_WAL_WRITER,

    B_LOGGER,               // NOT attached to shared memory
} BackendType;

Three things to read off this enum:
- Client sessions are B_BACKEND. One forked backend per connection. There
is no connection pooling in core; that is left to PgBouncer / pgpool
outside the engine. A backend runs the entire parse→plan→execute pipeline
(Axis 5) for its session.
- The auxiliary processes are the engine’s background machinery. The
B_STARTUP process performs crash/archive recovery (Axis 3); the
B_CHECKPOINTER, B_BG_WRITER, and B_WAL_WRITER keep dirty buffers and WAL
moving to disk; B_AUTOVAC_LAUNCHER schedules B_AUTOVAC_WORKERs against the
no-overwrite heap; B_WAL_SENDER / B_WAL_RECEIVER carry replication; newer
roles (B_IO_WORKER for the PG18 async-I/O subsystem, B_SLOTSYNC_WORKER for
failover slots) show the engine still growing roles into this same
fork-and-attach model.
- B_LOGGER is the exception that proves the rule: it is explicitly not
attached to shared memory and has no PGPROC, because it only drains a pipe
of log messages. Everything else lives in the shared-memory machine.
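The three observations above can be written down as predicates over the enum. This is a sketch under stated assumptions: the enum is copied from the excerpt above, while is_auxiliary_process and attaches_to_shared_memory are hypothetical helpers that encode only what the excerpt's comments say, not a check against the real child registry in launch_backend.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Same members, same order, as the BackendType excerpt above. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,
    B_DEAD_END_BACKEND,
    B_AUTOVAC_LAUNCHER,
    B_AUTOVAC_WORKER,
    B_BG_WORKER,
    B_WAL_SENDER,
    B_SLOTSYNC_WORKER,
    B_STANDALONE_BACKEND,
    B_ARCHIVER,
    B_BG_WRITER,
    B_CHECKPOINTER,
    B_IO_WORKER,
    B_STARTUP,
    B_WAL_RECEIVER,
    B_WAL_SUMMARIZER,
    B_WAL_WRITER,
    B_LOGGER
} BackendType;

/* Hypothetical helper: the auxiliary block runs from B_ARCHIVER
 * through B_WAL_WRITER, so a range check is enough. */
bool
is_auxiliary_process(BackendType t)
{
    return t >= B_ARCHIVER && t <= B_WAL_WRITER;
}

/* Hypothetical helper: per the excerpt, only the logger is
 * explicitly kept out of shared memory. */
bool
attaches_to_shared_memory(BackendType t)
{
    assert(t >= B_INVALID && t <= B_LOGGER);
    return t != B_INVALID && t != B_LOGGER;
}
```

The range check works only because the auxiliary roles are contiguous in the enum, which is also why adding a role means placing it in the right block.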
Detail docs: postgres-postmaster.md (the fork model and child registry),
postgres-backend-lifecycle.md (what a B_BACKEND does from InitPostgres
to the message loop), postgres-aux-processes.md (checkpointer, bgwriter,
walwriter, startup, syslogger), postgres-background-workers.md
(B_BG_WORKER framework), postgres-autovacuum.md.
Axis 2 — Shared-memory and IPC substrate
This is the substrate. The CUBRID analysis has no equivalent axis, because CUBRID’s processes own state and pass it over sockets; PostgreSQL’s processes share state in one segment and coordinate access with locks and signals. Everything in the other six axes ultimately reads or writes a structure that lives here.
flowchart TB
subgraph BOOT["postmaster startup (once)"]
CALC["CalculateShmemSize()<br/>sums every subsystem's request"]
CREATE["CreateSharedMemoryAndSemaphores()<br/>maps the fixed segment"]
HOOK["shmem_request_hook<br/>(extensions reserve here;<br/>shmem_startup_hook initializes after creation)"]
end
CALC --> CREATE
HOOK --> CALC
subgraph SEG["the shared segment"]
PGPROC["PGPROC[] + ProcGlobal<br/>per-process slots, wait queues"]
PROCARR["procarray<br/>running XIDs -> snapshots"]
BUFPOOL["buffer pool<br/>(BufferDesc[] + blocks)"]
LOCKTBL["lock table (LOCK/PROCLOCK hash)"]
PREDLOCK["predicate locks (SSI)"]
SINVAL["sinval message queue"]
DSMREG["dsm / dsa registry<br/>(dynamic + parallel segments)"]
end
CREATE --> SEG
subgraph COORD["coordination primitives"]
LWLOCK["LWLocks<br/>(short, protect shared structs)"]
SPIN["spinlocks / atomics"]
HWLOCK["heavyweight locks<br/>(SQL-visible, deadlock-detected)"]
LATCH["latches / signals<br/>(procsignal, SetLatch)"]
end
SEG --- COORD
Two distinctions are architecturally load-bearing and a frequent source of confusion:
- Static vs dynamic shared memory. The main segment is fixed at boot.
But parallel query and some extensions need shared memory after boot,
so PostgreSQL adds DSM (dynamic shared memory segments) and DSA
(a dynamic allocator over them), plus shm_mq (a single-reader/
single-writer message queue) for the parallel-worker tuple stream. See
postgres-shared-memory-ipc.md.
- Lightweight vs heavyweight locks. LWLocks are short-lived,
two-mode (shared/exclusive) locks (latches, in textbook terms) that
protect in-memory shared structures (a buffer header, a hash partition);
they are not deadlock-detected and not SQL-visible. Heavyweight locks are
the SQL-level lock table (AccessShareLock … AccessExclusiveLock on
relations, tuples, transactions) with a full waits-for deadlock detector.
These are two different subsystems and the tree keeps them in two docs:
postgres-lwlock-spinlock.md and postgres-lock-manager.md. Conflict
tracking for SERIALIZABLE transactions (SSI) is a third mechanism again,
predicate locks, in postgres-ssi-predicate-locking.md.
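To make the heavyweight side concrete, the eight table-level modes and their conflicts fit in one bitmask table. This is a minimal sketch: the mode numbering and the conflict sets follow the table-level lock conflict matrix in the PostgreSQL documentation, while LOCKBIT, toy_conflicts, and lock_modes_conflict are hypothetical names, and nothing here models queuing or deadlock detection.

```c
#include <assert.h>
#include <stdbool.h>

/* The eight table-level lock modes, weakest to strongest. */
enum
{
    AccessShareLock = 1,
    RowShareLock,
    RowExclusiveLock,
    ShareUpdateExclusiveLock,
    ShareLock,
    ShareRowExclusiveLock,
    ExclusiveLock,
    AccessExclusiveLock
};

#define LOCKBIT(mode) (1u << (mode))

/* toy_conflicts[m] = bitmask of modes that cannot coexist with m. */
static const unsigned toy_conflicts[] = {
    [AccessShareLock] = LOCKBIT(AccessExclusiveLock),
    [RowShareLock] = LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    [RowExclusiveLock] = LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
        LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    [ShareUpdateExclusiveLock] = LOCKBIT(ShareUpdateExclusiveLock) |
        LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
        LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
    [ShareLock] = LOCKBIT(RowExclusiveLock) | LOCKBIT(ShareUpdateExclusiveLock) |
        LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
        LOCKBIT(AccessExclusiveLock),
    [ShareRowExclusiveLock] = LOCKBIT(RowExclusiveLock) |
        LOCKBIT(ShareUpdateExclusiveLock) | LOCKBIT(ShareLock) |
        LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
        LOCKBIT(AccessExclusiveLock),
    [ExclusiveLock] = LOCKBIT(RowShareLock) | LOCKBIT(RowExclusiveLock) |
        LOCKBIT(ShareUpdateExclusiveLock) | LOCKBIT(ShareLock) |
        LOCKBIT(ShareRowExclusiveLock) | LOCKBIT(ExclusiveLock) |
        LOCKBIT(AccessExclusiveLock),
    [AccessExclusiveLock] = LOCKBIT(AccessShareLock) | LOCKBIT(RowShareLock) |
        LOCKBIT(RowExclusiveLock) | LOCKBIT(ShareUpdateExclusiveLock) |
        LOCKBIT(ShareLock) | LOCKBIT(ShareRowExclusiveLock) |
        LOCKBIT(ExclusiveLock) | LOCKBIT(AccessExclusiveLock),
};

/* True if a request for `requested` must wait behind a holder of `held`. */
bool
lock_modes_conflict(int held, int requested)
{
    assert(held >= AccessShareLock && held <= AccessExclusiveLock);
    assert(requested >= AccessShareLock && requested <= AccessExclusiveLock);
    return (toy_conflicts[requested] & LOCKBIT(held)) != 0;
}
```

Under this table a plain SELECT (AccessShareLock) and an INSERT (RowExclusiveLock) on the same relation do not conflict, while two ShareUpdateExclusiveLock holders (for example two VACUUMs) do. An LWLock, by contrast, needs no table at all: shared is compatible only with shared.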
Detail docs: postgres-shared-memory-ipc.md, postgres-procarray.md,
postgres-lwlock-spinlock.md, postgres-lock-manager.md,
postgres-ssi-predicate-locking.md, postgres-latch-signals.md,
postgres-cache-invalidation.md (the sinval loop).
Axis 3 — The WAL durability spine
The write-ahead log is PostgreSQL’s single durable event log, and almost every durability or distribution feature is a consumer of it. This axis transfers cleanly from any ARIES-lineage engine: log before you write, replay forward on recovery. What makes it the spine in PostgreSQL is how many subsystems read the same stream.
flowchart LR
subgraph PRODUCE["WAL production"]
INS["XLogInsert()<br/>per-rmgr records"]
RMGR["resource managers<br/>(Heap, Btree, Xact, CLOG, ...)"]
BUFW["buffer manager<br/>(WAL before flush: the rule)"]
end
RMGR --> INS
BUFW -- "LSN gate" --> INS
INS --> WALFILES["WAL segments<br/>(pg_wal/)"]
subgraph CONSUME["WAL consumers"]
REDO["startup process<br/>crash / archive recovery (redo)"]
PHYS["walsender -> walreceiver<br/>physical streaming replication"]
LOGIC["logical decoding<br/>reorderbuffer -> pgoutput"]
ARCH["archiver + wal summarizer<br/>(PITR, incremental backup)"]
BASE["basebackup / pg_rewind"]
end
WALFILES --> REDO
WALFILES --> PHYS
WALFILES --> LOGIC
WALFILES --> ARCH
WALFILES --> BASE
The resource manager (rmgr) table (rmgrlist.h) is what makes the WAL
extensible and self-describing: each record carries an rmgr id, and the
rmgr supplies redo, desc, identify, optional masking, and — crucially
for logical replication — a decode callback. The built-in set covers
XLOG, Transaction, Storage, CLOG, MultiXact, Heap/Heap2,
Btree, Hash, Gin, Gist, SPGist, BRIN, Sequence, CommitTs,
ReplicationOrigin, Generic, LogicalMessage, Database, Tablespace,
RelMap, and Standby, among others; extensions can RegisterCustomRmgr
(Axis 7).
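The dispatch that such a table enables can be sketched in a few lines. This is a toy model, not the real rmgr machinery: WalRecord, Rmgr, rmgr_table, replay, and the TOY_RM_* ids are hypothetical, and the "redo" callbacks just add a number to a counter so the effect is observable.

```c
#include <assert.h>
#include <stddef.h>

/* Toy WAL record: an owner id plus a trivial payload. */
typedef struct WalRecord
{
    unsigned rmid;    /* which resource manager owns this record */
    int      delta;   /* payload: amount to add to that manager's state */
} WalRecord;

/* Toy rmgr entry: a name and a redo callback. */
typedef struct Rmgr
{
    const char *name;
    void      (*redo) (const WalRecord *rec);
} Rmgr;

int heap_state;       /* stand-ins for state that redo rebuilds */
int btree_state;

static void toy_heap_redo(const WalRecord *rec)  { heap_state += rec->delta; }
static void toy_btree_redo(const WalRecord *rec) { btree_state += rec->delta; }

enum { TOY_RM_HEAP, TOY_RM_BTREE, TOY_RM_COUNT };

static const Rmgr rmgr_table[TOY_RM_COUNT] = {
    [TOY_RM_HEAP]  = {"Heap",  toy_heap_redo},
    [TOY_RM_BTREE] = {"Btree", toy_btree_redo},
};

/* Replay knows nothing about record contents; it only dispatches by id. */
void
replay(const WalRecord *records, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        assert(records[i].rmid < TOY_RM_COUNT);
        rmgr_table[records[i].rmid].redo(&records[i]);
    }
}
```

The point of the shape is that the replay loop never changes when a new record type appears; only the table grows, which is what a custom rmgr relies on.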
// rmgr table entries (src/include/access/rmgrlist.h)
// PG_RMGR(id, name, redo, desc, identify, startup, cleanup, mask, decode)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)

The WAL-before-flush rule is the seam between this axis and Axis 4: the
buffer manager may not write a dirty page to disk until the WAL record
describing that change is durable (the page’s LSN gates the flush). That
rule is why a read-only SELECT normally never waits on the log, yet can
stall on a WAL flush when it has to evict a dirty buffer. A COMMIT waits on
the log for a separate reason (its commit record must be durable before the
client is told, unless synchronous_commit is off), and the checkpointer,
background writer, and walwriter exist largely to move both kinds of
flushing out of the foreground path.
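The LSN gate can be shown with a toy buffer and a toy log. This is a minimal sketch under loud assumptions: Lsn, Page, modify_page, wal_flush, and flush_page are hypothetical names, LSNs are plain counters, and no real I/O happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* toy log sequence number */

Lsn wal_insert_lsn  = 0;         /* end of WAL generated so far */
Lsn wal_flushed_lsn = 0;         /* end of WAL known to be durable */
int wal_flush_calls = 0;         /* how many times a flush did real work */

typedef struct Page
{
    Lsn  lsn;                    /* LSN of the last record that touched it */
    bool dirty;
} Page;

/* Modify a page: log first, then stamp the page with the record's LSN. */
void
modify_page(Page *page)
{
    wal_insert_lsn += 1;         /* "insert" one WAL record */
    page->lsn = wal_insert_lsn;
    page->dirty = true;
}

/* Make the log durable up to at least `upto`. */
void
wal_flush(Lsn upto)
{
    if (upto > wal_flushed_lsn)
    {
        wal_flushed_lsn = upto;
        wal_flush_calls++;
    }
}

/* Write a dirty page out: the log must reach page->lsn first. */
void
flush_page(Page *page)
{
    if (!page->dirty)
        return;
    wal_flush(page->lsn);        /* the LSN gate */
    assert(wal_flushed_lsn >= page->lsn);
    page->dirty = false;         /* the data-file write would happen here */
}
```

If page A is modified at LSN 1 and page B at LSN 2, flushing B first forces the log to LSN 2; flushing A afterwards finds the log already past its LSN and costs no log flush at all. That is the effect a background WAL flusher aims for on behalf of foreground backends.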
Detail docs: postgres-xlog-wal.md, postgres-wal-records-rmgr.md,
postgres-recovery-redo.md, postgres-checkpoint.md,
postgres-xact.md, postgres-two-phase-commit.md,
postgres-clog-commit-ts.md, postgres-slru.md. MVCC visibility
(postgres-mvcc-snapshots.md) reads the commit state this axis records;
vacuum (postgres-vacuum.md) reclaims what MVCC leaves behind.
Axis 4 — Storage and pluggable access methods
Below the executor, PostgreSQL is a strict layering from OS files up to
on-page tuples, with a key twist absent from CUBRID: the access method is
pluggable. A table is reached through a TableAmRoutine and an index
through an IndexAmRoutine; heap is just the default table AM.
// the table AM indirection (src/include/access/tableam.h)
typedef struct TableAmRoutine
{
    NodeTag type;
    TableScanDesc (*scan_begin) (Relation rel, ...);
    // tuple_insert / tuple_update / tuple_delete / index_fetch_tuple ...
} TableAmRoutine;

The structural highlights a reader must carry forward:
- The heap is no-overwrite (MVCC). An UPDATE writes a new tuple version
and leaves the old one for snapshot readers; deletes are logical. This is
why PostgreSQL needs vacuum and the visibility map, and it is the deepest
design difference from in-place engines. Owned by postgres-heap-am.md
(with README.HOT for heap-only-tuple pruning).
- The buffer manager is the choke point between memory and disk; the
WAL-before-flush rule (Axis 3) lives at its flush path. See
postgres-buffer-manager.md.
- smgr maps a relation to file segments (1 GB chunks); the page layout
(slotted page, ItemId line pointers) is shared by every AM. See
postgres-smgr-md.md, postgres-page-layout.md.
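The 1 GB segment mapping is plain arithmetic. The sketch below assumes the default build constants (8 kB blocks, so 131072 blocks per 1 GB segment); TOY_BLCKSZ, TOY_RELSEG_SIZE, SegmentAddr, and locate_block are hypothetical names, not the md.c implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults: 8 kB blocks, 1 GB segments. */
#define TOY_BLCKSZ      8192u
#define TOY_RELSEG_SIZE 131072u   /* blocks per segment = 1 GB / 8 kB */

typedef struct SegmentAddr
{
    uint32_t segno;    /* 0 is the base file, n is the file with suffix ".n" */
    uint64_t offset;   /* byte offset inside that segment file */
} SegmentAddr;

/* Map a relation block number to (segment file, byte offset). */
SegmentAddr
locate_block(uint32_t blkno)
{
    SegmentAddr addr;

    addr.segno = blkno / TOY_RELSEG_SIZE;
    addr.offset = (uint64_t) (blkno % TOY_RELSEG_SIZE) * TOY_BLCKSZ;

    /* The whole block must fit inside one segment file. */
    assert(addr.offset + TOY_BLCKSZ <= (uint64_t) TOY_RELSEG_SIZE * TOY_BLCKSZ);
    return addr;
}
```

Block 131071 is the last block of segment 0 (offset 1073733632), block 131072 is the first block of segment 1 (offset 0), and block 300000 lands in segment 2 at offset 310116352.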
Detail docs: postgres-table-am.md, postgres-heap-am.md,
postgres-index-am.md, postgres-nbtree.md, postgres-gin.md,
postgres-gist.md, postgres-spgist.md, postgres-brin.md,
postgres-hash-index.md, postgres-buffer-manager.md,
postgres-smgr-md.md, postgres-page-layout.md, postgres-toast.md,
postgres-visibility-map.md, postgres-free-space-map.md,
postgres-aio.md (PG18 async I/O).
Axis 5 — Query pipeline
A query travels a textbook pipeline; the per-stage entry points in
tcop/postgres.c are the spine of the read path.
flowchart LR
SQL["SQL text"] --> PARSE["pg_parse_query<br/>(gram.y / scan.l)<br/>raw parse tree"]
PARSE --> ANALYZE["parse analysis<br/>(parse_analyze_fixedparams and siblings)<br/>Query tree"]
ANALYZE --> REWRITE["rewrite<br/>(pg_rewrite_query, via pg_analyze_and_rewrite_*;<br/>rule system, views, RLS)"]
REWRITE --> PLAN["pg_plan_queries<br/>(planner: paths -> Plan)"]
PLAN --> PORTAL["PortalStart / PortalRun<br/>(executor)"]
PORTAL --> EXEC["ExecutorRun<br/>node tree pull model"]
EXEC --> AM["table / index AMs (Axis 4)"]
What a reader should hold onto:
- Two stages of plan generation. The planner first builds candidate
Paths and compares them with a cost model (add_path discards dominated
candidates), then turns the winning path into an executable Plan tree.
Join ordering uses dynamic programming, with GEQO as a genetic fallback
once the number of FROM items reaches geqo_threshold (default 12). See
postgres-planner-overview.md, postgres-path-generation.md,
postgres-join-ordering.md, postgres-cost-model.md,
postgres-plan-creation.md.
- The executor is a demand-pull tree of plan nodes (ExecProcNode), with
expression evaluation compiled to a linear sequence of steps for a
threaded interpreter (computed goto where the compiler supports it, a
switch otherwise; optionally JIT-compiled). See postgres-executor.md,
postgres-expression-eval.md, the per-node docs (postgres-scan-nodes.md,
postgres-join-nodes.md, postgres-agg-sort-nodes.md), postgres-jit.md.
- Parallelism is a fork of this same tree. A Gather node launches
background workers that run a copy of the sub-plan and stream tuples back
over shm_mq (Axis 2). See postgres-parallel-query.md.
- Prepared statements and portals cache plans across executions. See
postgres-portals-prepared.md.
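The demand-pull model is easiest to see in miniature. The sketch below is a toy, not executor code: PlanNode, scan_next, filter_next, and limit_next are hypothetical, tuples are single ints, and one struct carries the fields of all three node kinds for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy plan node: each node exposes one "give me the next tuple" call. */
typedef struct PlanNode PlanNode;
struct PlanNode
{
    bool      (*next) (PlanNode *self, int *out);  /* false = no more tuples */
    PlanNode   *child;      /* input node, if any */
    const int  *rows;       /* scan: source array */
    size_t      nrows;      /* scan: array length */
    size_t      pos;        /* scan: cursor */
    int         threshold;  /* filter: keep values > threshold */
    size_t      limit;      /* limit: max tuples to return */
    size_t      emitted;    /* limit: tuples returned so far */
};

/* Leaf: hand out array elements one at a time. */
bool
scan_next(PlanNode *self, int *out)
{
    if (self->pos >= self->nrows)
        return false;
    *out = self->rows[self->pos++];
    return true;
}

/* Filter: keep pulling from the child until a row qualifies. */
bool
filter_next(PlanNode *self, int *out)
{
    int v;

    while (self->child->next(self->child, &v))
    {
        if (v > self->threshold)
        {
            *out = v;
            return true;
        }
    }
    return false;
}

/* Limit: stop asking the child once enough rows were returned. */
bool
limit_next(PlanNode *self, int *out)
{
    if (self->emitted >= self->limit)
        return false;
    if (!self->child->next(self->child, out))
        return false;
    self->emitted++;
    return true;
}
```

With a scan over {1, 9, 3, 7, 8, 2}, a filter of "greater than 5", and a limit of 2, the root returns 9 and 7 and the scan stops after four rows: nothing below the limit node runs once the limit is satisfied, which is the practical payoff of pulling rather than pushing.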
Detail docs also: postgres-parser.md, postgres-analyze-transform.md,
postgres-rewriter.md, postgres-node-trees.md,
postgres-extended-statistics.md, postgres-tuplesort.md.
Axis 6 — Catalog and cache layer
PostgreSQL is famously catalog-driven: types, operators, functions, access
methods, and index strategies are all rows in pg_* system tables, which
is what makes the engine extensible at runtime. Because every backend reads
the catalog constantly, backend-local caches sit in front of it (the
relcache, and the catcache with its syscache lookup layer), and a
shared-invalidation loop keeps them coherent across processes.
flowchart TB
CAT["system catalogs<br/>pg_class, pg_attribute, pg_proc,<br/>pg_type, pg_am, pg_index, ..."]
CAT --> RELCACHE["relcache<br/>(Relation descriptors)"]
CAT --> CATCACHE["catcache / syscache<br/>(tuple lookups)"]
RELCACHE --> BACKEND["backend executes against<br/>cached metadata"]
CATCACHE --> BACKEND
DDL["a DDL commit<br/>(Axis 7 ProcessUtility)"] --> INVAL["CacheInvalidate*"]
INVAL --> SINV["sinval queue (shared memory, Axis 2)"]
SINV -. "every backend drains + invalidates" .-> RELCACHE
SINV -.-> CATCACHE
The seam to remember: a DDL or catalog mutation in one backend does not
directly touch another backend’s caches. It queues shared-invalidation
messages at commit; each backend drains the queue at well-defined points
(AcceptInvalidationMessages, for example at transaction start and when it
locks a relation) and drops the stale cache entries. This sinval loop ties
Axis 6 back to the shared-memory substrate of Axis 2.
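The queue-and-drain shape of that loop can be modeled in a few lines. This is a simplified sketch: SinvalQueue, BackendCache, queue_invalidation, and accept_invalidations are hypothetical names, the queue is a small append-only array rather than a ring buffer, and the handling of a backend that falls too far behind is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_MSGS   16   /* toy capacity; no wraparound handling */
#define TOY_CACHE_SIZE 8

/* The "shared" part: a list of invalidated object ids. */
typedef struct SinvalQueue
{
    int msgs[TOY_MAX_MSGS];
    int nmsgs;
} SinvalQueue;

/* The per-backend part: a private cache plus a read position. */
typedef struct BackendCache
{
    bool valid[TOY_CACHE_SIZE];  /* valid[id]: cached entry is usable */
    int  next_msg;               /* first queue message not yet processed */
} BackendCache;

/* Writer side: queue a message; nobody's cache is touched here. */
void
queue_invalidation(SinvalQueue *q, int object_id)
{
    assert(q->nmsgs < TOY_MAX_MSGS);
    assert(object_id >= 0 && object_id < TOY_CACHE_SIZE);
    q->msgs[q->nmsgs++] = object_id;
}

/* Reader side: catch up from this backend's own position.
 * Returns the number of messages processed. */
int
accept_invalidations(const SinvalQueue *q, BackendCache *cache)
{
    int processed = 0;

    while (cache->next_msg < q->nmsgs)
    {
        cache->valid[q->msgs[cache->next_msg++]] = false;
        processed++;
    }
    return processed;
}
```

The design choice worth noticing is that the writer never reaches into another backend: a stale entry stays usable in a backend's private cache until that backend itself drains the queue, so correctness depends on draining at the right points, not on the writer.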
Detail docs: postgres-system-catalogs.md, postgres-relcache.md,
postgres-catcache-syscache.md, postgres-cache-invalidation.md,
postgres-dependency-tracking.md, postgres-namespace-search-path.md.
Axis 7 — Extensibility surface
This axis has no first-class equivalent in the CUBRID tree, yet it is the reason the PostgreSQL codebase is shaped the way it is. Extensibility is not a bolt-on; the core is built around indirection points that third-party code plugs into without patching the engine.
flowchart TB
CORE["PostgreSQL core"]
CORE --> TAMX["table / index AMs<br/>(TableAmRoutine, IndexAmRoutine; Axis 4)"]
CORE --> RMGRX["custom WAL resource managers<br/>(RegisterCustomRmgr; Axis 3)"]
CORE --> HOOKX["hook globals<br/>(planner_hook, ExecutorStart_hook,<br/>ProcessUtility_hook, shmem_*_hook)"]
CORE --> FDWX["foreign data wrappers<br/>(FdwRoutine)"]
CORE --> BGWX["background workers<br/>(RegisterBackgroundWorker; Axis 1/2)"]
CORE --> PLX["procedural languages + SPI<br/>(PL/pgSQL handler, SPI_*)"]
CORE --> CSX["custom scan providers<br/>(CustomScanMethods)"]
CORE --> EXTX["extensions<br/>(CREATE EXTENSION packaging)"]
Each plug point is a struct-of-callbacks or a function-pointer global that
the core checks at a fixed spot. An extension ships a shared library, fills
the struct, and registers it: contrib/postgres_fdw (an FDW),
pg_stat_statements (planner, executor, and utility hooks plus shared
memory), and the in-tree index AMs are all instances of the same pattern.
This is why “is X in core or contrib?” is a recurring scope question
(recorded in postgres-coverage.md): the mechanism is in core; many
implementations ship as contrib or third-party.
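The function-pointer-global variant of that pattern, including the usual save-and-chain courtesy between extensions, can be sketched as follows. Every name here (toy_planner_hook, toy_standard_planner, ext_planner, ext_init) is a hypothetical stand-in, and the "planner" just returns its input so the control flow is the only thing on show.

```c
#include <assert.h>
#include <stddef.h>

/* Toy hook type and the global plug point the core checks. */
typedef int (*toy_planner_hook_type) (int query_cost);

toy_planner_hook_type toy_planner_hook = NULL;

/* The built-in behaviour. */
int
toy_standard_planner(int query_cost)
{
    return query_cost;
}

/* Core call site: use the hook if one is installed, else the default. */
int
toy_planner(int query_cost)
{
    if (toy_planner_hook)
        return toy_planner_hook(query_cost);
    return toy_standard_planner(query_cost);
}

/* ---- what an extension would do ---- */
static toy_planner_hook_type prev_planner_hook = NULL;
int ext_calls = 0;

static int
ext_planner(int query_cost)
{
    ext_calls++;                       /* the extension's own work */
    if (prev_planner_hook)             /* chain to whoever was there first */
        return prev_planner_hook(query_cost);
    return toy_standard_planner(query_cost);
}

/* Module load: remember the previous hook, then install ours. */
void
ext_init(void)
{
    prev_planner_hook = toy_planner_hook;
    toy_planner_hook = ext_planner;
}
```

Before ext_init runs, toy_planner(10) goes straight to the default; afterwards the same call passes through ext_planner first and still returns 10. Saving the previous pointer is what lets several extensions share one hook without knowing about each other.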
Detail docs: postgres-fdw.md, postgres-extensions.md,
postgres-hooks.md, postgres-custom-scan.md, postgres-plpgsql.md,
postgres-spi.md; and Axis-1/2/3/4 docs for AMs, rmgrs, and bgworkers.
Where to start reading
A suggested reading order for someone going deep, most cross-referenced docs first:
- The machine: this overview, then postgres-shared-memory-ipc.md,
postgres-postmaster.md, postgres-backend-lifecycle.md. Understand the
fork-and-attach model before anything else.
- The storage floor: postgres-page-layout.md, postgres-buffer-manager.md,
postgres-table-am.md, postgres-heap-am.md, postgres-nbtree.md.
- The durability spine: postgres-xlog-wal.md, postgres-xact.md,
postgres-mvcc-snapshots.md (+ postgres-procarray.md),
postgres-recovery-redo.md, postgres-vacuum.md.
- Concurrency: postgres-lock-manager.md, postgres-lwlock-spinlock.md,
postgres-ssi-predicate-locking.md.
- The query path: postgres-parser.md, postgres-planner-overview.md,
postgres-executor.md.
- The catalog: postgres-system-catalogs.md, postgres-relcache.md,
postgres-catcache-syscache.md.
- Then fan out by interest into replication, DDL, extensibility,
monitoring, i18n/text, and the postgres-evolution-*.md arcs.
Subcategory map
The per-module docs are grouped into thirteen subcategories, each with its
own section-overview router (postgres-overview-<subcat>.md). The mapping
of subcategory to the PostgreSQL source tree, and the per-module backlog,
lives in postgres-coverage.md.
| Subcategory | What it covers | Section overview |
|---|---|---|
| storage-engine | pages, buffers, smgr, table/index AMs, TOAST, FSM/VM, checksums | postgres-overview-storage-engine.md |
| txn-recovery | MVCC + snapshots + procarray, WAL, clog/SLRU, 2PC, recovery, vacuum, checkpoint | postgres-overview-txn-recovery.md |
| query-processing | parse → analyze → rewrite → plan → execute, stats, JIT, parallel | postgres-overview-query-processing.md |
| server-architecture | postmaster, backend lifecycle, IPC, locks (LW + heavy + SSI), aux procs | postgres-overview-server-architecture.md |
| monitoring-stats | cumulative statistics, wait events, progress reporting | postgres-overview-monitoring-stats.md |
| system-catalog | catalog layout, relcache/catcache, invalidation, dependency, namespace | postgres-overview-system-catalog.md |
| ddl-schema | DDL execution, ALTER, partitioning, constraints, triggers, COPY, RLS | postgres-overview-ddl-schema.md |
| replication-ha | physical + logical replication, slots, archiving, backup, incremental | postgres-overview-replication-ha.md |
| client-protocol | FE/BE wire protocol, authentication, TLS/GSSAPI | postgres-overview-client-protocol.md |
| extensibility | FDW, extensions, hooks, custom scan, PL/pgSQL, SPI | postgres-overview-extensibility.md |
| base-infra | memory contexts, elog, fmgr, datatypes, GUC, sort, dynahash | postgres-overview-base-infra.md |
| i18n-text | full-text search, collation providers, encoding | postgres-overview-i18n-text.md |
| utilities | initdb/genbki, pg_dump, pg_upgrade, basebackup, combinebackup, waldump, psql | postgres-overview-utilities.md |
Cross-cutting historical arcs are captured separately in the
postgres-evolution-*.md docs (replication, vacuum/visibility,
partitioning, parallel query, statistics, pluggable storage, async I/O).