
PostgreSQL Architecture Overview — One Binary, One Shared-Memory Machine, and the WAL Spine


This is the front door for the PostgreSQL code-analysis tree. It is written for someone who has read zero lines of PostgreSQL source and wants the high-level shape — what processes run, what they share, how one query travels through the engine, and what keeps durability and concurrency honest — before deciding which of the ~95 per-module detail docs to open. Every section is deliberately a router, not a duplicate: it names the boundary and points at the doc that owns the mechanism. If a section feels thin, that is by design.

A second audience is the engineer who has read a few detail docs and now wants to fit them together. The cross-cutting questions — “how does a SELECT get from the wire to a heap page and back?”, “what does a snapshot actually read?”, “why does flushing a dirty buffer wait on the WAL?” — each cut across several subsystems. The diagrams here name the seams those questions cross, so you can pick the right pair of detail docs to read in tandem.

A third audience is anyone asking why PostgreSQL looks the way it does. Its defining choices — a single forked binary instead of threads or separate daemons, a fixed shared-memory segment sized once at boot, a no-overwrite MVCC heap that needs vacuum, a WAL that is the universal event log, and an extensibility surface (pluggable access methods, resource managers, hooks, foreign data wrappers, background workers) baked into the core — each is a deliberate trade-off with an OODB-and-academic lineage (the Berkeley POSTGRES project). The detail docs treat each in depth; postgres-design-philosophy.md (planned) will collect the rationale, and the postgres-evolution-*.md docs trace how the big subsystems changed release by release.

Version note. This tree is written against the cloned source at REL_18_STABLE (commit 273fe94, PostgreSQL 18.x). Symbol names are the anchor; the per-doc “Source verification” sections pin line numbers to this commit.

The thesis: a shared-memory machine, not a process tree


If you have read another DBMS’s internals, drop one assumption at the door: in PostgreSQL the processes are not the architecture. The architecture is the shared memory they coordinate through, and the processes are interchangeable inhabitants of it.

Concretely:

  • PostgreSQL is one executable. There are no separate server/broker/agent binaries. A single postmaster process fork()s every other process from itself (postmaster_child_launch, which calls fork_process; see launch_backend.c). On Windows, where there is no fork(), the child re-executes the same binary and reattaches. That is the EXEC_BACKEND path, which can also be enabled on other platforms for testing, but it is a portability fallback, not a second architecture.
  • All those children attach to one shared-memory segment whose total size is computed once, at postmaster startup, by CalculateShmemSize and created by CreateSharedMemoryAndSemaphores (ipci.c). The segment does not grow on demand; an extension that wants shared memory must reserve it up front through the shmem_request_hook / RequestAddinShmemSpace path before the segment is sized.
  • The state lives in that segment, not in the processes: the per-backend PGPROC slots and the global ProcGlobal (proc.h), the procarray that backs every snapshot (procarray.c), the shared buffer pool, the heavyweight lock table, the predicate-lock (SSI) structures, and the shared-invalidation queue. A backend is mostly a thread of control with a private memory-context tree that operates on shared state under LWLock protection.
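
The fork-and-attach idea in the bullets above can be sketched in a few lines of POSIX C. This is an illustration only, not PostgreSQL code: `DemoSegment` and `demo_run_children` are hypothetical names, and an anonymous `mmap` region stands in for the real segment. The point it shows is that the mapping is created before any `fork()`, so every child works on the same structure while keeping its own private memory.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the shared segment: sized once, never grown. */
typedef struct DemoSegment
{
    atomic_int  attached;   /* children that touched the segment */
    atomic_long work_done;  /* units of work summed across children */
} DemoSegment;

/*
 * Map the segment, fork nchildren processes that each add `units` to it,
 * reap them, and return the total the parent reads back (-1 on failure).
 */
long
demo_run_children(int nchildren, long units)
{
    assert(nchildren >= 0);

    /* The mapping exists before any fork, so every child inherits it. */
    DemoSegment *seg = mmap(NULL, sizeof(DemoSegment), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED)
        return -1;
    atomic_init(&seg->attached, 0);
    atomic_init(&seg->work_done, 0);

    int forked = 0;
    for (; forked < nchildren; forked++)
    {
        pid_t pid = fork();

        if (pid < 0)
            break;
        if (pid == 0)
        {
            /* Child: private stack and heap, but the same shared struct. */
            atomic_fetch_add(&seg->attached, 1);
            atomic_fetch_add(&seg->work_done, units);
            _exit(0);
        }
    }

    /* Reap exactly the children that were started. */
    for (int i = 0; i < forked; i++)
        wait(NULL);

    long total = (forked == nchildren) ? atomic_load(&seg->work_done) : -1;

    munmap(seg, sizeof(DemoSegment));
    return total;
}
```

Calling `demo_run_children(4, 10)` forks four children and the parent reads back their combined work from the shared region. The sketch uses atomics where the real engine would take an LWLock or spinlock around a larger structure.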

Figure 1 — One shared-memory segment, many role-typed processes

Get this picture right and the rest of the tree falls into place: the buffer manager, lock manager, procarray, and WAL insert path all assume “I am one of many processes touching a shared structure under a latch.” The seven axes below are different views of that one machine.

PostgreSQL’s process roles are an enum, not a set of binaries. The BackendType in miscadmin.h enumerates every kind of process the postmaster can fork; if you add one you also register it in the child_process_kinds array in launch_backend.c.

// BackendType (src/include/miscadmin.h), abridged
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,                  // a client session
    B_DEAD_END_BACKEND,
    B_AUTOVAC_LAUNCHER, B_AUTOVAC_WORKER,
    B_BG_WORKER,                // extension / parallel / logical-apply workers
    B_WAL_SENDER, B_SLOTSYNC_WORKER,
    B_STANDALONE_BACKEND,       // single-user mode (initdb, --single)
    /* auxiliary procs: one of each (except IO workers) */
    B_ARCHIVER, B_BG_WRITER, B_CHECKPOINTER, B_IO_WORKER,
    B_STARTUP, B_WAL_RECEIVER, B_WAL_SUMMARIZER, B_WAL_WRITER,
    B_LOGGER,                   // NOT attached to shared memory
} BackendType;

Three things to read off this enum:

  1. Client sessions are B_BACKEND. One forked backend per connection. No connection pooling in core — that is left to PgBouncer / pgpool outside the engine. A backend runs the entire parse→plan→execute pipeline (Axis 5) for its session.
  2. The auxiliary processes, together with a few backend-like workers, are the engine’s background machinery. The B_STARTUP process performs crash/archive recovery (Axis 3); B_CHECKPOINTER, B_BG_WRITER, and B_WAL_WRITER keep dirty buffers and WAL moving to disk; B_AUTOVAC_LAUNCHER schedules B_AUTOVAC_WORKERs against the no-overwrite heap; B_WAL_SENDER and B_WAL_RECEIVER carry replication. Newer roles (B_IO_WORKER for the PG18 async-I/O subsystem, B_SLOTSYNC_WORKER for failover slots) show the engine still growing roles into this same fork-and-attach model.
  3. B_LOGGER is the exception that proves the rule — it is explicitly not attached to shared memory and has no PGPROC, because it only drains a pipe of log messages. Everything else lives in the shared-memory machine.
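
The "roles are an enum plus a registry" idea can be sketched as a table indexed by the enum. This is a hedged illustration loosely modelled on the description of child_process_kinds above; `DemoRole`, `DemoRoleInfo`, and every `demo_*` function are hypothetical names, and the real table's fields and signatures may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, abridged role enum; the real list is BackendType. */
typedef enum DemoRole
{
    DEMO_INVALID = 0,
    DEMO_BACKEND,
    DEMO_CHECKPOINTER,
    DEMO_LOGGER,
    DEMO_ROLE_COUNT         /* array size, not a role */
} DemoRole;

/* One row per role: display name, entry point, and whether it attaches. */
typedef struct DemoRoleInfo
{
    const char *name;
    int       (*main_fn) (void);
    bool        shmem_attach;
} DemoRoleInfo;

static int demo_backend_main(void) { return 1; }
static int demo_checkpointer_main(void) { return 2; }
static int demo_logger_main(void) { return 3; }

/* Designated initializers keep the table aligned with the enum order. */
static const DemoRoleInfo demo_roles[DEMO_ROLE_COUNT] = {
    [DEMO_INVALID]      = {"invalid", NULL, false},
    [DEMO_BACKEND]      = {"client backend", demo_backend_main, true},
    [DEMO_CHECKPOINTER] = {"checkpointer", demo_checkpointer_main, true},
    [DEMO_LOGGER]       = {"logger", demo_logger_main, false},
};

/* Look up the registry row for a role. */
const DemoRoleInfo *
demo_role_info(DemoRole role)
{
    assert((int) role >= 0 && (int) role < DEMO_ROLE_COUNT);
    return &demo_roles[role];
}

/* Run the role's entry point in-process; -1 means "nothing to launch". */
int
demo_role_launch(DemoRole role)
{
    const DemoRoleInfo *info = demo_role_info(role);

    return info->main_fn ? info->main_fn() : -1;
}
```

Adding a role in this sketch means adding one enum member and one table row, which mirrors the two-step registration the text describes. The logger row is the only one with `shmem_attach` set to false.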

Detail docs: postgres-postmaster.md (the fork model and child registry), postgres-backend-lifecycle.md (what a B_BACKEND does from InitPostgres to the message loop), postgres-aux-processes.md (checkpointer, bgwriter, walwriter, startup, syslogger), postgres-background-workers.md (B_BG_WORKER framework), postgres-autovacuum.md.

Axis 2 — Shared-memory and IPC substrate


This is the spine. The CUBRID analysis has no equivalent axis, because CUBRID’s processes own state and pass it over sockets; PostgreSQL’s processes share state in one segment and coordinate access with locks and signals. Everything in the other six axes ultimately reads or writes a structure that lives here.

flowchart TB
  subgraph BOOT["postmaster startup (once)"]
    CALC["CalculateShmemSize()<br/>sums every subsystem's request"]
    CREATE["CreateSharedMemoryAndSemaphores()<br/>maps the fixed segment"]
    HOOK["shmem_request_hook / shmem_startup_hook<br/>(extensions reserve here)"]
  end
  CALC --> CREATE
  HOOK --> CALC

  subgraph SEG["the shared segment"]
    PGPROC["PGPROC[] + ProcGlobal<br/>per-process slots, wait queues"]
    PROCARR["procarray<br/>running XIDs -> snapshots"]
    BUFPOOL["buffer pool<br/>(BufferDesc[] + blocks)"]
    LOCKTBL["lock table (LOCK/PROCLOCK hash)"]
    PREDLOCK["predicate locks (SSI)"]
    SINVAL["sinval message queue"]
    DSMREG["dsm / dsa registry<br/>(dynamic + parallel segments)"]
  end
  CREATE --> SEG

  subgraph COORD["coordination primitives"]
    LWLOCK["LWLocks<br/>(short, protect shared structs)"]
    SPIN["spinlocks / atomics"]
    HWLOCK["heavyweight locks<br/>(SQL-visible, deadlock-detected)"]
    LATCH["latches / signals<br/>(procsignal, SetLatch)"]
  end
  SEG --- COORD

Two distinctions are architecturally load-bearing and a frequent source of confusion:

  • Static vs dynamic shared memory. The main segment is fixed at boot. But parallel query and some extensions need shared memory after boot, so PostgreSQL adds DSM (dynamic shared memory segments) and DSA (a dynamic allocator over them) plus shm_mq (a single-reader/single-writer message queue) for the parallel-worker tuple stream. See postgres-shared-memory-ipc.md.
  • Lightweight vs heavyweight locks. LWLocks are short-lived, two-mode (shared/exclusive) latches that protect in-memory shared structures (a buffer header, a hash partition); they are not deadlock-detected and not SQL-visible. Heavyweight locks are the SQL-level lock table (modes AccessShareLock through AccessExclusiveLock, on relations, tuples, transactions) with a full waits-for deadlock detector. These are two different subsystems and the tree keeps them in two docs: postgres-lwlock-spinlock.md and postgres-lock-manager.md. Serializable-isolation conflicts (SERIALIZABLE, implemented as SSI) are a third mechanism again, predicate locks, covered in postgres-ssi-predicate-locking.md.
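
The two-mode behaviour of a lightweight lock can be sketched with a single atomic word. This is a simplified illustration, not the real LWLock: `DemoLWLock` and the `demo_*` functions are hypothetical, the bit layout is an assumption, and the sketch only try-acquires, whereas a real lightweight lock also queues and sleeps waiters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: one exclusive flag, low bits count shared holders. */
#define DEMO_EXCLUSIVE ((uint32_t) 1 << 24)

typedef struct DemoLWLock
{
    _Atomic uint32_t state;
} DemoLWLock;

void
demo_lock_init(DemoLWLock *l)
{
    atomic_init(&l->state, 0);
}

/* Shared mode succeeds unless an exclusive holder is present. */
bool
demo_try_shared(DemoLWLock *l)
{
    uint32_t old = atomic_load(&l->state);

    while ((old & DEMO_EXCLUSIVE) == 0)
    {
        /* On failure the CAS reloads `old`, and the loop re-checks the flag. */
        if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
            return true;
    }
    return false;
}

/* Exclusive mode succeeds only when nobody holds the lock at all. */
bool
demo_try_exclusive(DemoLWLock *l)
{
    uint32_t expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected, DEMO_EXCLUSIVE);
}

void
demo_release_shared(DemoLWLock *l)
{
    uint32_t old = atomic_fetch_sub(&l->state, 1);

    assert((old & (DEMO_EXCLUSIVE - 1)) > 0);   /* there was a shared holder */
    (void) old;
}

void
demo_release_exclusive(DemoLWLock *l)
{
    uint32_t old = atomic_fetch_and(&l->state, ~DEMO_EXCLUSIVE);

    assert(old & DEMO_EXCLUSIVE);               /* it was held exclusively */
    (void) old;
}
```

Nothing here detects deadlocks or is visible from SQL, which is exactly the contrast with the heavyweight lock table drawn above.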

Detail docs: postgres-shared-memory-ipc.md, postgres-procarray.md, postgres-lwlock-spinlock.md, postgres-lock-manager.md, postgres-ssi-predicate-locking.md, postgres-latch-signals.md, postgres-cache-invalidation.md (the sinval loop).

The write-ahead log is PostgreSQL’s single durable event log, and almost every durability or distribution feature is a consumer of it. This axis transfers cleanly from any ARIES-lineage engine: log before you write, replay forward on recovery. What makes it the spine in PostgreSQL is how many subsystems read the same stream.

flowchart LR
  subgraph PRODUCE["WAL production"]
    INS["XLogInsert()<br/>per-rmgr records"]
    RMGR["resource managers<br/>(Heap, Btree, Xact, CLOG, ...)"]
    BUFW["buffer manager<br/>(WAL before flush: the rule)"]
  end
  RMGR --> INS
  BUFW -- "LSN gate" --> INS
  INS --> WALFILES["WAL segments<br/>(pg_wal/)"]

  subgraph CONSUME["WAL consumers"]
    REDO["startup process<br/>crash / archive recovery (redo)"]
    PHYS["walsender -> walreceiver<br/>physical streaming replication"]
    LOGIC["logical decoding<br/>reorderbuffer -> pgoutput"]
    ARCH["archiver + wal summarizer<br/>(PITR, incremental backup)"]
    BASE["basebackup / pg_rewind"]
  end
  WALFILES --> REDO
  WALFILES --> PHYS
  WALFILES --> LOGIC
  WALFILES --> ARCH
  WALFILES --> BASE

The resource manager (rmgr) table (rmgrlist.h) is what makes the WAL extensible and self-describing: each record carries an rmgr id, and the rmgr supplies redo, desc, identify, optional masking, and — crucially for logical replication — a decode callback. The built-in set covers XLOG, Transaction, Storage, CLOG, MultiXact, Heap/Heap2, Btree, Hash, Gin, Gist, SPGist, BRIN, Sequence, CommitTs, ReplicationOrigin, Generic, LogicalMessage, Database, Tablespace, RelMap, and Standby, among others; extensions can RegisterCustomRmgr (Axis 7).

// rmgr table entry (src/include/access/rmgrlist.h)
// PG_RMGR(id, name, redo, desc, identify, startup, cleanup, mask, decode)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify,
        NULL, NULL, heap_mask, heap_decode)
PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify,
        btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
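
A list of PG_RMGR-style entries lends itself to the X-macro technique: one list, expanded once into an enum of ids and once into a dispatch table. The sketch below shows that technique with hypothetical names (`DEMO_RMGR`, `DemoRecord`, `demo_replay`) and only a redo callback; it is an assumption-laden miniature, not the real macro plumbing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record; real redo callbacks receive richer reader state. */
typedef struct DemoRecord
{
    int rmid;       /* which resource manager owns this record */
    int payload;    /* stand-in for the record body */
} DemoRecord;

int demo_heap_total = 0;
int demo_btree_total = 0;

static void demo_heap_redo(const DemoRecord *r)  { demo_heap_total += r->payload; }
static void demo_btree_redo(const DemoRecord *r) { demo_btree_total += r->payload; }

/* The single source of truth: one line per resource manager. */
#define DEMO_RMGR_LIST \
    DEMO_RMGR(DEMO_RM_HEAP_ID, "Heap", demo_heap_redo) \
    DEMO_RMGR(DEMO_RM_BTREE_ID, "Btree", demo_btree_redo)

/* Expansion 1: turn the list into an enum of ids. */
#define DEMO_RMGR(id, name, redo) id,
typedef enum DemoRmgrId { DEMO_RMGR_LIST DEMO_RM_COUNT } DemoRmgrId;
#undef DEMO_RMGR

typedef struct DemoRmgr
{
    const char *name;
    void      (*redo) (const DemoRecord *record);
} DemoRmgr;

/* Expansion 2: turn the same list into the dispatch table. */
#define DEMO_RMGR(id, name, redo) [id] = {name, redo},
static const DemoRmgr demo_rmgrs[DEMO_RM_COUNT] = { DEMO_RMGR_LIST };
#undef DEMO_RMGR

const char *
demo_rmgr_name(int rmid)
{
    assert(rmid >= 0 && rmid < DEMO_RM_COUNT);
    return demo_rmgrs[rmid].name;
}

/* Replay records in log order; returns how many were applied. */
size_t
demo_replay(const DemoRecord *records, size_t n)
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++)
    {
        int rmid = records[i].rmid;

        if (rmid < 0 || rmid >= DEMO_RM_COUNT)
            break;              /* unknown rmgr id: stop rather than guess */
        demo_rmgrs[rmid].redo(&records[i]);
        applied++;
    }
    return applied;
}
```

Because the record carries its rmgr id, the replay loop needs no knowledge of heap or B-tree internals; that is the sense in which the log is self-describing.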

The WAL-before-flush rule is the seam between this axis and Axis 4: the buffer manager may not write a dirty page to disk until the WAL record describing that change is durable (the page’s LSN gates the flush). That single rule is why a SELECT ordinarily does not wait on the log (it can, if it has to evict a dirty page whose WAL is not yet flushed) while a COMMIT does, and why the checkpointer and walwriter exist.
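
The LSN gate can be sketched as a tiny state machine. All names here (`DemoWal`, `DemoPage`, `demo_flush_page`) are hypothetical, LSNs are plain counters, and "flushing" only updates fields; the sketch shows the ordering constraint, not the real I/O path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;

typedef struct DemoWal
{
    DemoLsn insert_lsn;   /* end of the last record appended in memory */
    DemoLsn flushed_lsn;  /* everything up to here is durable */
    int     flush_calls;  /* counts simulated log flushes */
} DemoWal;

typedef struct DemoPage
{
    DemoLsn page_lsn;     /* LSN of the last record that changed the page */
    bool    dirty;
} DemoPage;

/* Log a change first, then stamp the page with that record's LSN. */
DemoLsn
demo_log_change(DemoWal *wal, DemoPage *page)
{
    wal->insert_lsn++;
    page->page_lsn = wal->insert_lsn;
    page->dirty = true;
    return page->page_lsn;
}

/* Make WAL durable up to `upto`; a no-op when it already is. */
void
demo_wal_flush(DemoWal *wal, DemoLsn upto)
{
    assert(upto <= wal->insert_lsn);
    if (upto <= wal->flushed_lsn)
        return;
    wal->flushed_lsn = upto;
    wal->flush_calls++;
}

/* The gate: the data page may reach disk only after its WAL has. */
void
demo_flush_page(DemoWal *wal, DemoPage *page)
{
    if (!page->dirty)
        return;
    demo_wal_flush(wal, page->page_lsn);
    assert(wal->flushed_lsn >= page->page_lsn);
    page->dirty = false;    /* stand-in for the actual page write */
}
```

If something else (a commit, or a background WAL flush) has already made the log durable past the page's LSN, `demo_flush_page` pays no extra log flush. That is one way to read the role of a background WAL writer in this picture.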

Detail docs: postgres-xlog-wal.md, postgres-wal-records-rmgr.md, postgres-recovery-redo.md, postgres-checkpoint.md, postgres-xact.md, postgres-two-phase-commit.md, postgres-clog-commit-ts.md, postgres-slru.md. MVCC visibility (postgres-mvcc-snapshots.md) reads the commit state this axis records; vacuum (postgres-vacuum.md) reclaims what MVCC leaves behind.

Axis 4 — Storage and pluggable access methods


Below the executor, PostgreSQL is a strict layering from OS files up to on-page tuples, with a key twist absent from CUBRID: the access method is pluggable. A table is reached through a TableAmRoutine and an index through an IndexAmRoutine; heap is just the default table AM.

Figure 2 — Storage and pluggable access methods, from the executor down to data files

// the table AM indirection (src/include/access/tableam.h), abridged
typedef struct TableAmRoutine
{
    NodeTag       type;
    TableScanDesc (*scan_begin) (Relation rel, ...);
    // tuple_insert / tuple_update / tuple_delete / index_fetch_tuple ...
} TableAmRoutine;
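
To see why a struct of callbacks makes the table storage pluggable, here is a much-reduced sketch with two toy access methods behind one routine type. `DemoTableAm`, `DemoRelation`, and both AMs are hypothetical; the real callback set and signatures are far larger.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ROWS 8

struct DemoRelation;

/* Hypothetical, much-reduced callback table. */
typedef struct DemoTableAm
{
    const char *name;
    int       (*tuple_insert) (struct DemoRelation *rel, int value);
    int       (*row_count) (const struct DemoRelation *rel);
} DemoTableAm;

typedef struct DemoRelation
{
    const DemoTableAm *am;          /* chosen when the relation is created */
    int                rows[DEMO_MAX_ROWS];
    int                nrows;
} DemoRelation;

/* AM 1: append rows to an in-memory array. */
static int
array_insert(DemoRelation *rel, int value)
{
    if (rel->nrows >= DEMO_MAX_ROWS)
        return -1;
    rel->rows[rel->nrows++] = value;
    return 0;
}

static int array_count(const DemoRelation *rel) { return rel->nrows; }

/* AM 2: accept and discard every row. */
static int discard_insert(DemoRelation *rel, int value) { (void) rel; (void) value; return 0; }
static int discard_count(const DemoRelation *rel) { (void) rel; return 0; }

const DemoTableAm demo_array_am = {"array", array_insert, array_count};
const DemoTableAm demo_discard_am = {"discard", discard_insert, discard_count};

/* Callers never name an AM; they go through the relation's routine. */
int
demo_table_insert(DemoRelation *rel, int value)
{
    assert(rel->am != NULL);
    return rel->am->tuple_insert(rel, value);
}

int
demo_table_count(const DemoRelation *rel)
{
    assert(rel->am != NULL);
    return rel->am->row_count(rel);
}
```

The two `demo_table_*` wrappers play the part of the executor-facing API: the same call stores three rows in one relation and none in the other, depending only on the routine pointer.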

The structural highlights a reader must carry forward:

  • The heap is no-overwrite (MVCC). An UPDATE writes a new tuple version and leaves the old one for snapshot readers; deletes are logical. This is why PostgreSQL needs vacuum and the visibility map, and it is the deepest design difference from in-place engines. Owned by postgres-heap-am.md (with README.HOT for heap-only-tuple pruning).
  • The buffer manager is the choke point between memory and disk; the WAL-before-flush rule (Axis 3) lives at its flush path. postgres-buffer-manager.md.
  • smgr maps a relation to file segments (1 GB chunks); the page layout (slotted page, ItemId line pointers) is shared by every AM. postgres-smgr-md.md, postgres-page-layout.md.
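
The no-overwrite behaviour, and why it creates work for vacuum, can be sketched for a single logical row. This is a deliberately crude model with hypothetical names: a snapshot is assumed to see exactly the transaction ids below a threshold, every transaction is assumed committed, and there are no in-progress lists, aborts, or hint bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_VERSIONS 8
typedef uint32_t DemoXid;

typedef struct DemoVersion
{
    DemoXid xmin;   /* transaction that created this version */
    DemoXid xmax;   /* transaction that superseded it, 0 if none */
    int     value;
} DemoVersion;

typedef struct DemoHeap
{
    DemoVersion v[DEMO_MAX_VERSIONS];
    int         n;
} DemoHeap;

/* ASSUMPTION: a snapshot sees exactly the xids below `snap`, all committed. */
static bool
demo_xid_visible(DemoXid xid, DemoXid snap)
{
    return xid != 0 && xid < snap;
}

void
demo_insert(DemoHeap *h, DemoXid xid, int value)
{
    assert(h->n < DEMO_MAX_VERSIONS);
    h->v[h->n++] = (DemoVersion) {xid, 0, value};
}

/* UPDATE: stamp the live version's xmax, then append a new version. */
void
demo_update(DemoHeap *h, DemoXid xid, int value)
{
    for (int i = 0; i < h->n; i++)
        if (h->v[i].xmax == 0)
            h->v[i].xmax = xid;
    demo_insert(h, xid, value);
}

/* Read: the version whose creator is visible and whose replacer is not. */
bool
demo_read(const DemoHeap *h, DemoXid snap, int *out)
{
    for (int i = 0; i < h->n; i++)
    {
        const DemoVersion *t = &h->v[i];

        if (demo_xid_visible(t->xmin, snap) && !demo_xid_visible(t->xmax, snap))
        {
            *out = t->value;
            return true;
        }
    }
    return false;
}

/* Vacuum: drop versions already replaced as seen by the oldest snapshot. */
int
demo_vacuum(DemoHeap *h, DemoXid oldest_snap)
{
    int kept = 0;

    for (int i = 0; i < h->n; i++)
        if (!demo_xid_visible(h->v[i].xmax, oldest_snap))
            h->v[kept++] = h->v[i];

    int removed = h->n - kept;

    h->n = kept;
    return removed;
}
```

An older snapshot keeps reading the old version after the update, so the old version cannot be removed until no snapshot that old remains. Vacuum is the step that reclaims it afterwards.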

Detail docs: postgres-table-am.md, postgres-heap-am.md, postgres-index-am.md, postgres-nbtree.md, postgres-gin.md, postgres-gist.md, postgres-spgist.md, postgres-brin.md, postgres-hash-index.md, postgres-buffer-manager.md, postgres-smgr-md.md, postgres-page-layout.md, postgres-toast.md, postgres-visibility-map.md, postgres-free-space-map.md, postgres-aio.md (PG18 async I/O).

A query travels a textbook pipeline; the per-stage entry points in tcop/postgres.c are the spine of the read path.

flowchart LR
  SQL["SQL text"] --> PARSE["pg_parse_query<br/>(gram.y / scan.l)<br/>raw parse tree"]
  PARSE --> ANALYZE["parse analysis<br/>(parse_analyze_fixedparams)<br/>Query tree"]
  ANALYZE --> REWRITE["pg_rewrite_query<br/>(rule system, views, RLS)"]
  REWRITE --> PLAN["pg_plan_queries<br/>(planner: paths -> Plan)"]
  PLAN --> PORTAL["PortalStart / PortalRun<br/>(executor)"]
  PORTAL --> EXEC["ExecutorRun<br/>node tree pull model"]
  EXEC --> AM["table / index AMs (Axis 4)"]

What a reader should hold onto:

  • Two stages of plan generation. The planner first builds candidate Paths and costs each one with the cost model, then turns the cheapest path into an executable Plan tree. Join ordering uses dynamic programming, with GEQO as a genetic fallback past a threshold. postgres-planner-overview.md, postgres-path-generation.md, postgres-join-ordering.md, postgres-cost-model.md, postgres-plan-creation.md.
  • The executor is a demand-pull tree of plan nodes (ExecProcNode). Expressions are compiled into a flat array of steps that an interpreter dispatches (computed-goto threading where the compiler supports it, a switch otherwise), and they can optionally be JIT-compiled. postgres-executor.md, postgres-expression-eval.md, the per-node docs (postgres-scan-nodes.md, postgres-join-nodes.md, postgres-agg-sort-nodes.md), postgres-jit.md.
  • Parallelism is a fork of this same tree. A Gather node launches background workers that run a copy of the sub-plan and stream tuples back over shm_mq (Axis 2). postgres-parallel-query.md.
  • Prepared statements and portals cache plans across executions. postgres-portals-prepared.md.
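
The demand-pull model mentioned above can be sketched as nodes that each expose one "next row" function and call their child only when asked. The node types and names below (`DemoNode`, `demo_scan`, `demo_filter_even`, `demo_limit`) are hypothetical, and rows are plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoNode DemoNode;

/* Every node exposes the same "give me the next row" entry point. */
struct DemoNode
{
    bool      (*next) (DemoNode *self, int *row);
    DemoNode   *child;      /* input node, NULL for a leaf */
    const int  *rows;       /* scan: source array */
    size_t      nrows;      /* scan: array length */
    size_t      pos;        /* scan: cursor; limit: rows emitted so far */
    size_t      limit;      /* limit: maximum rows to emit */
};

static bool
scan_next(DemoNode *self, int *row)
{
    if (self->pos >= self->nrows)
        return false;
    *row = self->rows[self->pos++];
    return true;
}

/* Filter: keep pulling from the child until a row qualifies (even values). */
static bool
filter_even_next(DemoNode *self, int *row)
{
    while (self->child->next(self->child, row))
        if (*row % 2 == 0)
            return true;
    return false;
}

/* Limit: stop asking the child once enough rows were returned. */
static bool
limit_next(DemoNode *self, int *row)
{
    if (self->pos >= self->limit)
        return false;
    if (!self->child->next(self->child, row))
        return false;
    self->pos++;
    return true;
}

DemoNode
demo_scan(const int *rows, size_t n)
{
    return (DemoNode) {scan_next, NULL, rows, n, 0, 0};
}

DemoNode
demo_filter_even(DemoNode *child)
{
    assert(child != NULL);
    return (DemoNode) {filter_even_next, child, NULL, 0, 0, 0};
}

DemoNode
demo_limit(DemoNode *child, size_t limit)
{
    assert(child != NULL);
    return (DemoNode) {limit_next, child, NULL, 0, 0, limit};
}

/* Drive the plan from the root; returns the number of rows produced. */
size_t
demo_run(DemoNode *root, int *out, size_t cap)
{
    size_t n = 0;
    int    row;

    while (n < cap && root->next(root, &row))
        out[n++] = row;
    return n;
}
```

With a limit of two over a six-row scan, the scan is only advanced as far as needed to find two even rows. Nothing below the root runs unless the node above it asks.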

Detail docs also: postgres-parser.md, postgres-analyze-transform.md, postgres-rewriter.md, postgres-node-trees.md, postgres-extended-statistics.md, postgres-tuplesort.md.

PostgreSQL is famously catalog-driven: types, operators, functions, access methods, and index strategies are all rows in pg_* system tables, which is what makes the engine extensible at runtime. Because every backend reads the catalog constantly, three caches sit in front of it, and a shared-invalidation loop keeps them coherent across processes.

flowchart TB
  CAT["system catalogs<br/>pg_class, pg_attribute, pg_proc,<br/>pg_type, pg_am, pg_index, ..."]
  CAT --> RELCACHE["relcache<br/>(Relation descriptors)"]
  CAT --> CATCACHE["catcache / syscache<br/>(tuple lookups)"]
  RELCACHE --> BACKEND["backend executes against<br/>cached metadata"]
  CATCACHE --> BACKEND
  DDL["a DDL commit<br/>(Axis 7 ProcessUtility)"] --> INVAL["CacheInvalidate*"]
  INVAL --> SINV["sinval queue (shared memory, Axis 2)"]
  SINV -. "every backend drains + invalidates" .-> RELCACHE
  SINV -.-> CATCACHE

The seam to remember: a DDL or catalog mutation in one backend does not directly touch another backend’s caches. It queues shared-invalidation messages; each backend drains the queue at well-defined points and drops the stale cache entries. This sinval loop ties Axis 6 back to the shared-memory substrate of Axis 2.
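
The drain-at-safe-points behaviour can be sketched with a shared message list and a per-backend read position. This is a simplification with hypothetical names: the queue here is a bounded append-only array, whereas a production queue has to handle wraparound and backends that fall too far behind.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_QUEUE_CAP 16
#define DEMO_CACHE_SLOTS 8

/* Shared side: an append-only queue of "object id N changed" messages. */
typedef struct DemoSinvalQueue
{
    int msgs[DEMO_QUEUE_CAP];
    int count;
} DemoSinvalQueue;

/* Private side: one backend's cache plus how far it has read the queue. */
typedef struct DemoBackendCache
{
    bool valid[DEMO_CACHE_SLOTS];
    int  next_msg;
} DemoBackendCache;

/* Called by the backend that changed the catalog. */
void
demo_send_inval(DemoSinvalQueue *q, int object_id)
{
    assert(q->count < DEMO_QUEUE_CAP);
    assert(object_id >= 0 && object_id < DEMO_CACHE_SLOTS);
    q->msgs[q->count++] = object_id;
}

/* Called by each backend at its own safe points; returns entries dropped. */
int
demo_accept_invals(const DemoSinvalQueue *q, DemoBackendCache *c)
{
    int dropped = 0;

    while (c->next_msg < q->count)
    {
        int id = q->msgs[c->next_msg++];

        if (c->valid[id])
        {
            c->valid[id] = false;   /* stale: reload on next use */
            dropped++;
        }
    }
    return dropped;
}
```

Sending a message changes nothing in any other backend's cache. Each cache entry stays valid until its owner calls `demo_accept_invals`, which is the indirection the paragraph above describes.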

Detail docs: postgres-system-catalogs.md, postgres-relcache.md, postgres-catcache-syscache.md, postgres-cache-invalidation.md, postgres-dependency-tracking.md, postgres-namespace-search-path.md.

This axis has no first-class equivalent in the CUBRID tree, yet it is the reason the PostgreSQL codebase is shaped the way it is. Extensibility is not a bolt-on; the core is built around indirection points that third-party code plugs into without patching the engine.

flowchart TB
  CORE["PostgreSQL core"]
  CORE --> TAMX["table / index AMs<br/>(TableAmRoutine, IndexAmRoutine — Axis 4)"]
  CORE --> RMGRX["custom WAL resource managers<br/>(RegisterCustomRmgr — Axis 3)"]
  CORE --> HOOKX["hook globals<br/>(planner_hook, ExecutorStart_hook,<br/>ProcessUtility_hook, shmem_*_hook)"]
  CORE --> FDWX["foreign data wrappers<br/>(FdwRoutine)"]
  CORE --> BGWX["background workers<br/>(RegisterBackgroundWorker — Axis 1/2)"]
  CORE --> PLX["procedural languages + SPI<br/>(PL/pgSQL handler, SPI_*)"]
  CORE --> CSX["custom scan providers<br/>(CustomScanMethods)"]
  CORE --> EXTX["extensions<br/>(CREATE EXTENSION packaging)"]

Each plug point is a struct-of-callbacks or a function-pointer global that the core checks at a fixed spot. An extension ships a shared library, fills the struct, and registers it: contrib/postgres_fdw (an FDW), pg_stat_statements (planner, executor, and utility hooks plus shared memory), and the in-tree index AMs are all instances of the same pattern. This is why “is X in core or contrib?” is a recurring scope question (recorded in postgres-coverage.md): the mechanism is in core; many implementations ship as contrib or third-party.
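
The function-pointer-global flavour of plug point can be sketched as follows. Everything here is hypothetical (`demo_plan_hook`, `demo_standard_plan`, `demo_ext_init`), and the "plan" is just an integer cost. The sketch shows the save-previous-then-install convention that lets several extensions stack on one hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: given a query id, return an estimated cost. */
typedef int (*demo_plan_hook_type) (int query_id);

/* Core side: a global the engine checks at one fixed call site. */
demo_plan_hook_type demo_plan_hook = NULL;

int
demo_standard_plan(int query_id)
{
    return query_id * 10;
}

int
demo_plan(int query_id)
{
    return demo_plan_hook ? demo_plan_hook(query_id)
                          : demo_standard_plan(query_id);
}

/* Extension side: remember the previous hook so hooks can stack. */
static demo_plan_hook_type prev_plan_hook = NULL;
int demo_ext_calls = 0;

static int
demo_ext_plan(int query_id)
{
    demo_ext_calls++;

    /* Delegate to whoever was installed before, else to the default. */
    int cost = prev_plan_hook ? prev_plan_hook(query_id)
                              : demo_standard_plan(query_id);

    return cost + 1;
}

/* Stand-in for an extension's load-time initializer. */
void
demo_ext_init(void)
{
    assert(demo_plan_hook != demo_ext_plan);    /* install only once */
    prev_plan_hook = demo_plan_hook;
    demo_plan_hook = demo_ext_plan;
}
```

Before `demo_ext_init()` runs, `demo_plan` takes the default path; afterwards every call goes through the extension first. The core function never learns which extension, if any, is installed.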

Detail docs: postgres-fdw.md, postgres-extensions.md, postgres-hooks.md, postgres-custom-scan.md, postgres-plpgsql.md, postgres-spi.md; and Axis-1/2/3/4 docs for AMs, rmgrs, and bgworkers.

A suggested reading order for someone going deep, starting with the most cross-referenced docs:

  1. The machine: this overview, then postgres-shared-memory-ipc.md, postgres-postmaster.md, postgres-backend-lifecycle.md. Understand the fork-and-attach model before anything else.
  2. The storage floor: postgres-page-layout.md, postgres-buffer-manager.md, postgres-table-am.md, postgres-heap-am.md, postgres-nbtree.md.
  3. The durability spine: postgres-xlog-wal.md, postgres-xact.md, postgres-mvcc-snapshots.md (+ postgres-procarray.md), postgres-recovery-redo.md, postgres-vacuum.md.
  4. Concurrency: postgres-lock-manager.md, postgres-lwlock-spinlock.md, postgres-ssi-predicate-locking.md.
  5. The query path: postgres-parser.md, postgres-planner-overview.md, postgres-executor.md.
  6. The catalog: postgres-system-catalogs.md, postgres-relcache.md, postgres-catcache-syscache.md.
  7. Then fan out by interest into replication, DDL, extensibility, monitoring, i18n/text, and the postgres-evolution-*.md arcs.

The per-module docs are grouped into thirteen subcategories, each with its own section-overview router (postgres-overview-<subcat>.md). The mapping of subcategory to the PostgreSQL source tree, and the per-module backlog, lives in postgres-coverage.md.

| Subcategory | What it covers | Section overview |
| --- | --- | --- |
| storage-engine | pages, buffers, smgr, table/index AMs, TOAST, FSM/VM, checksums | postgres-overview-storage-engine.md |
| txn-recovery | MVCC + snapshots + procarray, WAL, clog/SLRU, 2PC, recovery, vacuum, checkpoint | postgres-overview-txn-recovery.md |
| query-processing | parse → analyze → rewrite → plan → execute, stats, JIT, parallel | postgres-overview-query-processing.md |
| server-architecture | postmaster, backend lifecycle, IPC, locks (LW + heavy + SSI), aux procs | postgres-overview-server-architecture.md |
| monitoring-stats | cumulative statistics, wait events, progress reporting | postgres-overview-monitoring-stats.md |
| system-catalog | catalog layout, relcache/catcache, invalidation, dependency, namespace | postgres-overview-system-catalog.md |
| ddl-schema | DDL execution, ALTER, partitioning, constraints, triggers, COPY, RLS | postgres-overview-ddl-schema.md |
| replication-ha | physical + logical replication, slots, archiving, backup, incremental | postgres-overview-replication-ha.md |
| client-protocol | FE/BE wire protocol, authentication, TLS/GSSAPI | postgres-overview-client-protocol.md |
| extensibility | FDW, extensions, hooks, custom scan, PL/pgSQL, SPI | postgres-overview-extensibility.md |
| base-infra | memory contexts, elog, fmgr, datatypes, GUC, sort, dynahash | postgres-overview-base-infra.md |
| i18n-text | full-text search, collation providers, encoding | postgres-overview-i18n-text.md |
| utilities | initdb/genbki, pg_dump, pg_upgrade, basebackup, combinebackup, waldump, psql | postgres-overview-utilities.md |

Cross-cutting historical arcs are captured separately in the postgres-evolution-*.md docs (replication, vacuum/visibility, partitioning, parallel query, statistics, pluggable storage, async I/O).