Skip to content

Code Analysis

Open-source DBMS internals read at the source level — currently the CUBRID and PostgreSQL codebases, grouped by area the same way for each project. Pick a project to jump to its full analysis tree:

CUBRID 113 docs

A source-level read of the CUBRID codebase — storage, MVCC, lock manager, recovery, the query pipeline, replication/HA, and the PL family.

Start here: architecture overview

PostgreSQL 140 docs

A module-by-module read of the PostgreSQL backend (REL_18) — the shared-memory process model, the WAL durability spine, storage and pluggable access methods, the planner/executor pipeline, the catalog/cache layer, replication, the extensibility surface, plus version-evolution docs.

Start here: architecture overview

A source-level read of the CUBRID codebase — storage, MVCC, lock manager, recovery, the query pipeline, replication/HA, and the PL family.

Start here: architecture overview

Jump to: Overview & Reading Paths · Base / Infrastructure · Storage Engine · Transaction & Recovery · Query Processing · DDL & Schema · Replication & HA · Server Architecture · System Catalog · Procedural Language · Internationalization & Specialty · Utilities

Overview & Reading Paths (18)

  • CUBRID Architecture Overview — Process Model, Layered Stack, and the Map Into the Detail DocsFront-door router for the CUBRID code-analysis tree — names the four long-lived processes (`cub_master`, `cub_server`, `cub_pl`, `cub_broker`+`cub_cas`) and their IPC, the layered storage stack (disk → page buffer + DWB → heap/B+Tree/extendible-hash → catalog → class-object → workspace), the query pipeline (parser → semantic check → rewrite → optimizer → XASL → executor → scan manager → access methods → list-file), the concurrency/logging/recovery axis (MVCC + lock + transaction + log + prior-list + checkpoint + recovery + DWB), the distribution layer (heartbeat + HA replication + CDC + 2PC + flashback + backup), the PL family (`pl_server` JVM, JavaSP, PL/CSQL), and the cross-cutting infrastructure (boot, sessions, thread pools, network protocol, broker, errors, parameters, monitoring, DBI/CCI, SA/CS) — with one diagram per axis and direct cross-refs into ~70 detail docs across eight subcategories.
  • CUBRID Code-Analysis Coverage — Map of What's Documented and What's OpenOverview-level coverage map for the cubrid code-analysis tree — groups the existing docs by subsystem, names the still-open gaps at the same grouping, and records intentional non-goals. Not a per-doc catalog (see README.md) and not an architecture map (see cubrid-architecture-overview.md); this is the answer to 'what's left to write?'
  • CUBRID Design Philosophy — Why the Codebase Looks the Way It DoesThe thirteen architectural decisions that explain the shape of the CUBRID codebase — OODB heritage from UniSQL, ARIES-by-the-book recovery, MVCC paired with a lock manager, Volcano executor, Selinger-style optimizer, lock-free prior list, double-write buffer, separate JVM for PL, broker process pool, local-only HA liveness, page-based storage, SA/CS dual-build utilities, and the deliberate non-goals — each traced to its historical and academic origin.
  • CUBRID Base / Infrastructure — Section OverviewRouter for the base/infra subcategory — the `src/base/` substrate every layer composes with. Two families: custom memory allocators (AREA slab pool for fixed-size objects, per-thread Lea-heap private allocator with C++ STL wrapper) and the lock-free primitives spine (legacy C plus modern C++ generations, sharing one transactional reclamation table). Storage, query, recovery, and PL all sit on top of this.
  • CUBRID DDL & Schema — Section OverviewRouter for the DDL & schema layer — DDL pipelines stage through SM_TEMPLATE, write the on-disk catalog (catcls system classes), rebuild the in-memory SM_CLASS graph, and feed back into authorization, triggers, and statistics that gate every later DML and plan compile.
  • CUBRID Internationalization — Section OverviewRouter for the i18n bucket — two cross-cutting primitives (charset+collation, timezone) shared by every string operator, every comparison, and every date-arithmetic call. Both compile external standards data (LDML for collation, IANA tzdata for timezone) into per-platform shared libraries that the server `dlopen`s at startup, then surface results through small per-record encoded IDs.
  • CUBRID Procedural Language — Section OverviewOne external JVM (cub_pl) hosts two language frontends — JavaSP and PL/CSQL — that ride a single transport, share one catalog, and dispatch through one executor.
  • CUBRID Query Processing — Section OverviewRouter for the largest CUBRID code-analysis subcategory — the parse → execute pipeline organised into a front-end (parser, semantic check, query rewrite), a middle-end (cost-based optimizer, XASL generator, XASL cache), and a back-end (executor, scan manager, list-file) plus runtime helpers (predicate evaluator, scalar functions, external sort, post-processing, hash join, runtime memoization) and specialised features (partition, cursor, serial, parallel query); links nineteen detail docs without duplicating their content.
  • CUBRID Replication & HA — Section OverviewRouter for the replication-ha subcategory: primary/standby replication via logical-log streaming on top of the WAL, leader election by local liveness scoring (no global consensus), and CDC piggy-backing on the same supplemental log stream.
  • CUBRID Server Architecture — Section OverviewProcess-level shape of CUBRID's server tier — JDBC/CCI clients reach a `cub_broker` listener, are routed to a forked `cub_cas` worker, which proxies CSS-framed NRP traffic to `cub_server`, where per-thread workers driven by either the legacy worker pool or the CBRD-26177 NG redesign land on a per-client `SESSION_STATE`, run through the locator OID workspace into storage, and share cross-cutting sysparam, error, and monitoring infrastructure.
  • CUBRID Storage Engine — Section OverviewRouter for the storage-engine subcategory — names the layered stack from OS files (volumes / sectors / pages) up through the page buffer and double-write buffer, the on-page record organisations (heap, B+Tree, extendible hash) plus their overflow chains, the out-of-band path (LOB on the host filesystem), and the page-level encryption layer; explains how the 9 detail docs in `subcategory: storage-engine` fit together and in what order to read them. The AREA slab pool, formerly listed here, has moved to `cubrid-overview-base-infra.md` since it is a memory-allocator concern that any layer can use, not a storage layer.
  • CUBRID System Catalog — Section OverviewRouter for the system-catalog bucket — the engine's SQL-visible self-description surface. Two complementary docs: cubrid-system-catalog-classes covers the *static* surface (the data-driven framework that defines and installs _db_class, _db_attribute, _db_index, ... and the 22 system views layered on top) and cubrid-show-commands covers the *dynamic* surface (SHOW commands rewritten to virtual scans over server runtime state — threads, page buffer, log header, transaction tables, locks).
  • CUBRID Transaction & Recovery — Section OverviewRouter for the txn-recovery subcategory — names how CUBRID realises ACID through MVCC + lock manager for isolation, log manager + prior list + checkpoint + recovery manager (+ DWB cross-section) for atomicity and durability, with 2PC, flashback, and backup-restore extending the same machinery to distributed commit, time travel, and point-in-time recovery — and points at the eleven detail docs that explain each piece.
  • CUBRID Reading Path — How a Stored Procedure Call Executes End-to-End (JavaSP / PL/CSQL with Embedded SQL Callback)End-to-end synthesis of one `CALL my_sp(...)` — a JavaSP whose body issues an embedded `SELECT` via JDBC — from JDBC `CallableStatement.execute` through cub_broker → cub_cas → cub_server's CALL-statement compile → cubpl::executor::request_invoke_command shipping `SP_CODE_INVOKE` to cub_pl → ListenerThread/ExecuteThread dispatch → reflective TargetMethod.invoke → user code's CUBRIDServerSidePreparedStatement embedded SELECT → JVM packing METHOD_CALLBACK_QUERY_PREPARE inside an SP_CODE_INTERNAL_JDBC envelope → server's response_callback_command → callback_prepare/_execute/_fetch invoking the normal compile-and-execute pipeline recursively under METHOD_MAX_RECURSION_DEPTH = 15 — and back. Threads ~15 detail docs (broker, network-protocol, server-session, transaction, parser, semantic-check, optimizer, xasl-generator, query-executor, list-file, mvcc, pl-javasp, pl-plcsql, pl-server-bridge, scan-manager) into one trip and is the natural fourth member of the rpath family alongside cubrid-rpath-select.md / cubrid-rpath-write.md / cubrid-rpath-recovery.md.
  • CUBRID Reading Path — How a Server Restart RecoversReading-path through a `cub_server` cold-start: process boot opens volumes and dispatches to recovery, DWB heals torn pages, the log header points at the most-recent checkpoint, ARIES analysis/redo/undo restore committed work and erase losers, vacuum and HA catch up, then the network listener begins accepting client traffic.
  • CUBRID Reading Path — How a SELECT Executes End-to-EndEnd-to-end synthesis of a single SELECT — JDBC → broker → cub_cas → cub_server → parser → semantic-check → rewrite → optimizer → XASL generator → XASL cache → executor → scan-manager → heap/B+Tree → predicate evaluator → MVCC visibility → list-file → cursor → broker → JDBC, threading roughly twenty detail docs into one trip.
  • CUBRID Reading Path — How a Write Commits End-to-EndINSERT INTO t VALUES (...) followed by COMMIT — parse, locator force fan-in, heap slotted-page write with MVCC stamp, btree key||OID insert, locator constraint and FK checks, BEFORE/AFTER triggers, prior-list WAL append, optional repl record, X-locks via locator, log_commit_local force-flush, log-flush daemon fsync, eventual dirty-page flush via DWB, and vacuum reclaiming dead versions.
  • CUBRID KO Translation Status — Per-Document Phase BoardPer-document phase board for the 107 KO mirrors under `knowledge/ko/code-analysis/cubrid/`. Derived state — each mirror's phase tag (`p1`/`p2`/`p3`/`p4`) in its frontmatter is the SSOT. The general framework (4-phase model, per-doc phase tag, promote/demote rules, model routing) lives in `knowledge/methodology/korean-translation.md`; this doc is cubrid's applied instance, holding only project-specific gates and the aggregated status view.

Base / Infrastructure (8)

  • CUBRID AREA Allocator — Slab-Style Pool Allocator With Free Lists for Same-Size ObjectsHow CUBRID's AREA module slabs-out same-size objects (DB_VALUE, TP_DOMAIN, OBJ_TEMPLATE, DB_OBJLIST, set objects, …) by chaining 256-block BLOCKSET arrays of fixed-cell blocks, fronts each block with a lock-free bitmap, and short-cuts the common case through a single hint pointer.
  • CUBRID Lock-Free Bitmap — Chunked Atomic Allocator for Per-Thread Indexes and Slot PoolsHow CUBRID allocates and recycles small integer slots concurrently — a chunked array of `std::atomic<unsigned int>` words, two chunking styles (one-chunk full-usage and list-of-chunks usage-bounded), CAS bit-flip, and a round-robin start hint that under SERVER_MODE bumps atomically per `get_entry`.
  • CUBRID Lock-Free Circular Queue — Bounded MPMC Ring with Per-Slot Block FlagsHow CUBRID hands work between threads on hot paths — a bounded multi-producer multi-consumer ring buffer with two cursor atomics and a per-slot block-flag word, used for vacuum log-block dispatch, page-buffer victim hand-off, and CDC log-info forwarding.
  • CUBRID Lock-Free Freelist — Typed Node Pool with Back-Buffer Block AllocatorHow CUBRID recycles typed lock-free nodes between operations — a typed `freelist<T>` with a single available stack, a one-block back-buffer that swaps in lazily so concurrent claimers do not race to allocate, an `on_reclaim` hook the payload type implements, and a clearly-documented ABA window in the pop path bounded by the back-buffer time.
  • CUBRID Lock-Free Hash Map — Legacy C, Modern C++, the Bridge, and the ConsumersHow CUBRID implements its main concurrent associative table — a Harris–Michael chained hash with optional per-entry mutex, in two parallel implementations (legacy C `lf_hash_*` and modern C++ `lockfree::hashmap<K,T>`) bridged by `cubthread::lockfree_hashmap<K,T>` whose `m_type ∈ {OLD, NEW}` is decided at init by `PRM_ID_ENABLE_NEW_LFHASH`, with `lf_entry_descriptor` as the shared type that lets one entry layout drive both code paths.
  • CUBRID Lock-Free Primitives — Overview, Two Generations, and Reclamation SpineMap of CUBRID's lock-free primitives — the legacy C `lock_free.{h,c}` family and the modern C++ `lockfree::*` namespace — anchored on a single transactional reclamation spine that every other lock-free structure in the engine sits on.
  • CUBRID Lock-Free Transactional Reclamation — System, Table, Descriptor, and Address MarkerHow CUBRID reclaims retired nodes from lock-free data structures safely — a per-data-structure transaction id, per-thread descriptors that bracket reads, and a periodic minimum-active-id scan that tells the freelist when a retired node is no longer reachable from any live reader.
  • CUBRID Private Allocator — Per-Thread Lea Heap, C++ STL Allocator Wrapper, and Build-Mode RoutingPer-thread Lea-heap arena (Doug Lea's `dlmalloc` vendored under `customheaps`) instantiated once per `THREAD_ENTRY`, fronted by `db_private_alloc / _free / _realloc` macros that route SERVER_MODE allocations to the thread's heap, CS_MODE to the client workspace, and SA_MODE through a `PRIVATE_MALLOC_HEADER`-tagged dispatch that remembers whether the block came from the Lea heap or the workspace. C++ STL wrapper `cubmem::private_allocator<T>` lets STL containers participate; `private_unique_ptr<T>` and `PRIVATE_BLOCK_ALLOCATOR` are the convenience layers; `switch_to_global_allocator_and_call` is the escape hatch for cross-thread or process-global allocations.

Storage Engine (13)

  • CUBRID B+Tree — Layout, Latch-Coupling, and Unique-Key SuffixingHow CUBRID lays out a B+Tree index — slotted-page nodes with key||OID concatenation, separate non-leaf and leaf records, and overflow OID pages — and how insert / delete / scan walk it under latch-coupling discipline with unique-constraint enforcement at the OID-suffix level.
  • CUBRID B+Tree — Code-Level Deep DiveCode-level deep dive into CUBRID's B+Tree — node and record layout, latch-coupling traversal, insert with unique-key OID suffixing, split and merge, delete and vacuum interaction, range scan, unique / PK-FK constraint enforcement, recovery, and bulk online load, with annotated source excerpts.
  • CUBRID Disk Manager and File Manager — Volumes, Sectors, Files, Page Allocation, and ExtensionHow CUBRID layers a four-level hierarchy — OS file as a volume, 64-page sector as the disk-manager allocation unit, logical file as a sector bundle, page as the I/O unit — under everything else; how the disk cache splits permanent and temporary purposes to drive a two-step sector reservation and adaptive volume extension; and how the file manager turns reserved sectors into pages via three extensible-data tables (Partial / Full / User).
  • CUBRID File & Disk Manager — Code-Level Deep DiveCode-level deep dive into CUBRID's file and disk managers — volume and sector layout, sector reservation and volume extension, the file-header and extensible-data architecture, page allocation and deallocation, numerable files and the user-page table, recovery logging, and temporary-file handling, with annotated source excerpts.
  • CUBRID Double Write Buffer — Torn-Page Protection Between Page Buffer and Data FilesHow CUBRID protects against torn writes by staging every dirty data page first into a sequential, fixed-size DWB volume — fsync'd before the home write — so that a crash mid-flush always leaves either a clean home page or a clean DWB copy that recovery can use to replace it.
  • CUBRID Extendible Hash — Disk-Resident Hash File With Doubling Directory and Local DepthHow CUBRID realizes Fagin et al.'s extendible hashing on top of the page buffer — an EHID-rooted directory file whose pointer count doubles when a bucket overflows, slotted bucket pages with binary search and per-bucket local depth, system-op-bracketed splits/merges, RVEH_* WAL records for redo and logical undo, and a small set of internal callers (class-name → OID, catalog → repr-id, UPDATE/DELETE OID dedup).
  • CUBRID Heap Manager — Slotted Pages, Record Layout, Operations, MVCC, and CachesHow CUBRID stores variable-length records in slotted heap pages, how INSERT / UPDATE / DELETE / READ flow through the nine record types, how MVCC versioning lives inside the record header, and which caches keep the hot paths short.
  • CUBRID Heap Manager — Code-Level Deep DiveCode-level deep dive into CUBRID's heap manager — slotted-page mechanics, heap record layout and forwarding, insert / update / delete operation contexts, MVCC version chains in the heap, best-space and class-representation caches, heap scan, and vacuum interaction, with annotated source excerpts.
  • CUBRID LOB — External Storage, Locator Lifecycle, and Transactional CleanupHow CUBRID stores BLOB/CLOB data as files outside the data volume, names them with locator URIs, tracks per-transaction state in a red-black tree on the TDES, and reconciles file-system reality with transaction commit / rollback through a single dispatch point.
  • CUBRID Overflow File — Heap Big-Record and B+Tree Overflow-OID Page ChainsHow CUBRID spills oversized records out of slotted heap pages and oversized OID lists out of B+Tree leaves into chained overflow pages, with two distinct on-page formats sharing one underlying file abstraction (`FILE_MULTIPAGE_OBJECT_HEAP` / `FILE_BTREE_OVERFLOW_KEY` / per-tree OID overflow) and the WAL discipline that keeps the chains crash-safe.
  • CUBRID Page Buffer Manager — BCB, Three-Zone LRU, Private Quotas, Direct Victim Handoff, and Custom LatchesHow CUBRID maps disk pages to memory via BCBs (Buffer Control Blocks), evicts under a three-zone LRU split into per-thread private and shared lists with adjustable quotas, hands off victims directly to sleeping waiters via lock-free queues, and protects each BCB with a custom read/write/flush latch.
  • CUBRID Page Buffer Manager — Code-Level Deep DiveCode-level deep dive into CUBRID's page buffer manager — the BCB and its custom latch, the hash/fix path, the three-zone LRU lists with private quotas, victim selection and direct handoff, dirty flush and WAL coupling, and page promotion/demotion, with annotated source excerpts.
  • CUBRID TDE — Transparent Page-Level Encryption With Master-Key-Wrapped DEKHow CUBRID realizes transparent data encryption — a two-level key hierarchy (master key wraps three per-database data keys), AES-256-CTR or ARIA-256-CTR with per-page nonces (LSA for permanent pages, atomic counter for temp, logical pageid for log), encrypt-on-flush hooks in the page buffer and log page buffer, decrypt-on-read hooks at the same boundaries, a separate ``<db>_keys`` master-key file held outside the database, and a per-file tablespace-style TDE flag that propagates down to each page's ``pflag`` bits.

Transaction & Recovery (16)

Query Processing (20)

  • CUBRID Cursor — Client-Side Fetch Handle Over a Server List-File With Holdability and Scroll StateHow CUBRID realises an ANSI-style fetch cursor as a client-side `CURSOR_ID` that locks onto a server-side `QFILE_LIST_ID`, paging tuples one network-page at a time across a `qfile_get_list_file_page` round-trip, decoding length-prefixed packed rows into `DB_VALUE`s, prefetching dereferenced object identifiers in vector form, and surviving COMMIT through the session-scoped holdable-cursor list when the broker requests `RESULT_HOLDABLE`.
  • CUBRID External Sort — Run Generation, Multi-Way Merge, and the Sort SubstrateHow CUBRID performs disk-backed sorting through a two-phase replacement-selection-style run generator (`sort_inphase_sort`) and a balanced k-way merge (`sort_exphase_merge`) over `FILE_TEMP` runs, exposing a single callback-driven entry point (`sort_listfile`) used by ORDER BY / sort GROUP BY / DISTINCT, B+Tree bulk load, and parallel index build.
  • CUBRID Hash Join — Build/Probe Pattern, Hash-Scan Primitives, and Spill BehaviourHow CUBRID realises hash join as a Build/Probe driver in `query_hash_join.c` that reuses the `HASH_LIST_SCAN` primitive of `query_hash_scan.c`, picks one of three table layouts (in-memory `mht_hls`, hybrid memory-index-plus-file-tuples, or extendible `FHS` hash file) based on the `max_hash_list_scan_size` budget, escalates to grace-style equi-hash partitioning when the build side spills, and is admitted by the optimizer through a deliberately conservative `qo_examine_hash_join` gate keyed on the `USE_HASH` hint.
  • CUBRID JSON_TABLE — Table Function Turning JSON Documents Into Virtual RowsOne C++ scanner whose cursor stack walks a parser-built tree of `cubxasl::json_table::node` objects, expanding the input JSON document with `db_json_iterator_*` per NESTED PATH and emitting rows at the leaves — a SCAN_TYPE that lets CUBRID promote a JSONPath plus a column spec into a row source for the executor's `scan_next_scan` loop.
  • CUBRID List-File — Spillable Tuple-Stream Inter-Operator Pipe and Materialisation SubstrateHow CUBRID realises every materialised tuple stream — sub-query result, sort output, hash-build side, group-by accumulator, and final query result — as a single `QFILE_LIST_ID` linked-page abstraction backed by a per-query `QMGR_TEMP_FILE` membuf-then-`FILE_TEMP` substrate, and how the executor and scan layer pipe data through it via a uniform open / add / scan / close contract.
  • CUBRID Parallel Query — Intra-Query Parallelism Across Heap Scan, Hash Join, and Query ExecuteOne global parallel-query worker pool, a `compute_parallel_degree()` policy keyed on page count, a `worker_manager` reservation handle, and three operator-specific orchestrators — `parallel_heap_scan::manager` (block-range partitioning of heap sectors with mergeable-list / xasl-snapshot / buildvalue result handlers), `parallel_query::hash_join::{build_partitions,execute_partitions}` (shared partition fan-out then per-partition build+probe), and `parallel_query_execute::query_executor` (uncorrelated-aptr fan-out for `BUILDLIST_PROC` / `BUILDVALUE_PROC` / `UNION_PROC` / `HASHJOIN_PROC` / `MERGELIST_PROC`) — sitting on top of the `cubthread::worker_pool` named "parallel-query" and threaded through XASL via `px_executor`, `m_px_orig_thread_entry`, and the `S_PARALLEL_HEAP_SCAN` switch arm.
  • CUBRID Parser — Flex/Bison Pipeline, PT_NODE Tree, and the Parser Memory ModelHow CUBRID turns SQL text into a `PT_NODE` parse tree — a Flex lexer driven by a single-buffer `YY_INPUT`, a GLR Bison grammar that builds the tree through reduce-action calls to `parser_new_node`, a polymorphic-tagged-union `PT_NODE` whose per-type child layout is encoded in three function-pointer arrays (`pt_apply_f`, `pt_init_f`, `pt_print_f`), and a per-`PARSER_CONTEXT` block allocator that lets the whole tree be freed in one pass.
  • CUBRID Partitioning — Range/Hash/List Strategies, Partition Pruning, and Per-Partition ExecutionHow CUBRID partitions a logical table into a master class plus N child classes, encodes the per-partition rule (range bounds, hash modulus, list values) on the master class via SM_PARTITION, and uses a server-side PRUNING_CONTEXT to eliminate partitions at optimize time, route each insert/update record to its target partition heap, and dispatch a per-partition scan list at execute time.
  • CUBRID Post-Processing — Aggregation, Window/Analytic Functions, and Sort vs Hash GROUP BYHow CUBRID's query executor turns a sorted (or hash-accumulated) list file into grouped, aggregated, and window-framed output through `qexec_groupby` and `qexec_execute_analytic`, choosing between sort-based and hash-based GROUP BY at runtime, and falling back to external sort when the hash table outgrows `max_agg_hash_size`.
  • CUBRID Query Evaluator — PRED_EXPR Walking, regu_variable Fetch, and the Row-Level Filter EngineHow CUBRID turns each pulled tuple into a keep/skip verdict — `eval_pred` walks a `PRED_EXPR` tree of `T_PRED` boolean nodes and `T_EVAL_TERM` leaves under three-valued logic (`V_TRUE` / `V_FALSE` / `V_UNKNOWN` / `V_ERROR`), every leaf calls `fetch_peek_dbval` which dispatches on `REGU_VARIABLE::type` (constant, attribute fetch, list-file position, arithmetic expression, function call, host variable, OID, list-id) into a path-specific resolver, and `eval_fnc` pre-compiles a fast single-shape predicate to bypass the recursion when possible.
  • CUBRID Query Executor — XASL Interpretation, Iterator Model, and Heap/Index Scan OperatorsHow CUBRID interprets a serialized XASL plan tree as a Volcano-style operator tree — `qexec_execute_mainblock_internal` dispatches by `xasl->type`, drives a uniform open/next/close loop over `SCAN_ID` operators (heap, index, list, set, value, JSON, dblink, parallel-heap), and pushes results into per-XASL list files that downstream operators read back as plain list scans.
  • CUBRID Query Optimizer — Query Graph, Cost Model, Join Enumeration, and Compiled PlanHow CUBRID lowers a semantically-checked PT_NODE into a `QO_ENV` query graph of `QO_NODE`/`QO_SEGMENT`/`QO_TERM`, runs partial-then-total dynamic-programming join enumeration over a 2^N `join_info` vector with a System R-style fixed-cpu/io + variable-cpu/io cost model, and finalises the surviving `QO_PLAN` tree into the XASL access-spec tree shipped to the server.
  • CUBRID Query Rewrite — Pre-Optimization Tree Transformations and the LIMIT-Clause Case StudyHow CUBRID lowers the LIMIT clause into INST_NUM/ORDERBY_NUM/GROUPBY_NUM predicates during semantic checking, then re-rewrites surviving LIMITs in mq_rewrite, and how that lowering interacts with CNF conversion, predicate reduction, view inlining, subquery flattening, auto-parameterization, and the plan-generation-time multi-range LIMIT optimization.
  • CUBRID Runtime Memoization — Subquery Cache, Filter-Predicate Cache, and Per-Query Memoize HelpersHow CUBRID avoids redundant per-row work through three independent caches sharing one playbook — DB_VALUE-array hash key, fail-on-full memory budget, hit-ratio guard — but operating at three different lifecycle scopes: per-XASL `sq_cache` for uncorrelated scalar-subquery results, server-wide per-BTID `fpcache` for deserialised function-index predicates, and per-XASL `memoize::storage` for nested-loop-join inner-side tuple sets.
  • CUBRID Scalar Functions — Arithmetic, String, Numeric, JSON, Regex, and Cryptographic Operator PrimitivesHow CUBRID's scalar function library — `arithmetic.c`, `numeric_opfunc.c`, `string_opfunc.c`, `query_opfunc.c`, `crypt_opfunc.c`, and the `string_regex_*` family — implements the operator-primitive layer underneath the regu-variable evaluator, with `fetch_peek_arith` driving a per-`OPERATOR_TYPE` switch into `qdata_*_dbval` arithmetic dispatchers (which fan out by `DB_TYPE` into the per-pair `qdata_add_int_to_dbval`-style variants), `qdata_evaluate_function` driving a per-`FUNC_CODE` switch into JSON / regex / list / generic handlers, BCD arithmetic on `DB_NUMERIC` via `numeric_db_value_{add,sub,mul,div}` walking byte-wise binary digits in `numeric_{add,sub,mul,long_div}`, collation-aware string ops (`db_string_substring`, `db_string_lower`, `db_string_like`, `db_string_concatenate`) honoring `INTL_CODESET`, and a regex façade (`cubregex::compile / search / count / instr / replace / substr`) that routes to either RE2 or `std::regex` per the `regexp_engine` system parameter.
  • CUBRID Scan Manager — SCAN_ID Dispatch, Open/Next/Close Protocol, and the Access-Method CatalogueOne polymorphic SCAN_ID handle, a switch-driven open/start/next/end/close protocol, and a per-`SCAN_TYPE` dispatch into heap, B+Tree, list-file, set, value, JSON-table, dblink, show, parallel-heap, and method scans — the access-method catalogue that the executor's `scan_next_scan` loop sits directly on top of.
  • CUBRID Semantic Check — Name Resolution, Type Checking, Constant Folding, and Statement-Specific ValidationHow CUBRID's `pt_check_with_info` driver turns a freshly parsed `PT_NODE` tree into an analyzed, type-checked, constant-folded, CNF-normalized intermediate form by chaining four passes — name resolution, where-clause aggregate check, host-variable replacement, and a statement-aware semantic_check_local that internally calls `pt_semantic_type` for type evaluation and constant folding — and finally `pt_cnf` to push the predicate into conjunctive normal form before the optimizer runs.
  • CUBRID SERIAL — Sequence/Auto-Increment Subsystem With Catalogged State and Cached ValuesHow CUBRID stores every sequence as a row in the `_db_serial` system class, advances it under an exclusive object lock with optional client-side caching, and re-uses the same machinery to drive AUTO_INCREMENT columns through synthesized `<class>_ai_<attr>` serials.
  • CUBRID XASL Cache — Plan Cache Keyed by SHA-1 of SQL Hash Text with RT Recompile and Per-Class InvalidationHow CUBRID short-circuits parse → semantic-check → optimize → XASL-generate on the second execute by keying a server-wide latch-free hashmap on a SHA-1 hash of the rewritten SQL plus a per-entry `time_stored`, refcounting the entries with a single 32-bit `cache_flag`, watching cardinality drift through a recompile-threshold (RT) check, and invalidating dependent entries from a per-class OID list whenever DDL or schema-altering operations fire `xcache_remove_by_oid`.
  • CUBRID XASL Generator — Compiling the Optimized Plan Tree to a Server-Side Execution TreeHow CUBRID compiles a name-resolved, type-checked PT_NODE plus the optimizer's QO_PLAN tree into the procedural XASL_NODE tree the server actually executes — covering the recursive `gen_outer`/`gen_inner` walk, the `aptr/dptr/scan_ptr` slots that hide subqueries and joins inside one node, the REGU_VARIABLE / ACCESS_SPEC / OUTPTR_LIST sub-IRs, and the `xts_*` offset-table serialization that ships the whole tree to the server.

DDL & Schema (6)

Replication & HA (3)

  • CUBRID CDC — Streaming DML and DDL Through the WALHow CUBRID streams DML and DDL changes downstream — the modern `cdc_*` API that walks `LOG_SUPPLEMENTAL_INFO` records forward through `log_reader`, alongside the legacy HA `la_*` log applier that replays log archives onto a slave.
  • CUBRID HA Replication — Logical-Log Based Master/Slave Replication via copylogdb and applylogdbHow CUBRID's master engine emits auxiliary `LOG_REPLICATION_DATA` / `LOG_REPLICATION_STATEMENT` records alongside its physiological WAL during DML, and how a separate `copylogdb` daemon ships log volumes to a slave host where `applylogdb` (`la_apply_log_file`) walks them forward and dispatches per-record-type back into the storage layer for serialised, transactionally consistent replay.
  • CUBRID Heartbeat — Cluster Liveness, Failover and FailbackHow CUBRID's `cub_master` peers gossip liveness over UDP, score themselves into a single elected master per node's local view, and turn the resulting state transitions (slave→to-be-master→master, master→slave) into failover and failback through a job-queue FSM driving four worker threads.

Server Architecture (14)

  • CUBRID Boot — Server Startup, First-Time Creation, Restart-Recovery Dispatch, and Client ConnectHow `cub_server` brings every subsystem online — first-time `createdb` formats volumes and bootstraps the root-class catalog, restart hands off to `log_recovery`'s three-pass replay, and the client side wires `boot_restart_client` to `xboot_register_client` over the network.
  • CUBRID Broker — CAS Process Pool, Connection Routing, and the Client-Facing Front-EndHow CUBRID's `cub_broker` parent forks a fixed pool of `cub_cas` worker processes, exposes a single TCP listener, hands accepted client sockets to an idle CAS through a Unix-domain rendezvous channel using SCM_RIGHTS file-descriptor passing, and lets each CAS proxy the client's CSS-framed traffic upstream to `cub_server` — all coordinated through a single SysV shared-memory segment that also carries job queues, ACL state, monitoring counters, and the broker administration interface.
  • CUBRID DBI and CCI — Client API Surface, Statement Lifecycle, and Wire-Driver FaçadeHow CUBRID layers a single client-side `db_*` C API on top of `boot_cl` and `network_cl` — `db_open_buffer` / `db_compile_statement_local` / `db_execute_statement` / `db_query_first_tuple` / `db_query_get_tuple_value` / `db_close_session` walk every statement through a four-stage FSM (Initial → Compiled → Prepared → Executed) inside a `DB_SESSION`, and how the broker's CAS process wraps that same surface in a `T_SRV_HANDLE`-keyed wire driver (`ux_database_connect`, `ux_prepare`, `ux_execute`, `ux_fetch`, `ux_end_tran`) dispatched by a flat `server_fn_table` so JDBC, CCI, ODBC, Python, and PHP all reach the engine through the same `db_*` core.
  • CUBRID Error Management — Per-Thread Error Context, Stack, Message Catalog, and Wire PropagationHow CUBRID reports failures from any subsystem — the global `ER_*` error-code enum (`error_code.h`), the per-thread `cuberr::context` that owns a base `er_message` plus a `std::stack<er_message>` for nested errors, the `er_set` family that formats arguments through a printf-style spec compiled by `er_study_fmt`, the localised `cubrid.msg` / `csql.msg` / `utils.msg` catalogs loaded by `msgcat_init` in NetBSD/FreeBSD `nl_catd` format, the `cubrid_*.err` log file with size-based rotation and a `_latest` symlink, and the wire format (`er_get_area_error` / `er_set_area_error`) that flattens an error to three `OR_INT` fields plus the message string for client-server propagation.
  • CUBRID loaddb — Bulk Loader, Direct-Path Heap+B+Tree Insert, and Post-Load Statistics RebuildHow CUBRID's loaddb utility tokenises a CUBRID-format object file, splits it into batches, ships each batch to a server-side worker pool that holds a Bulk-Update lock and writes through `locator_multi_insert_force` directly into heap pages, then closes the load with a class-by-class statistics rebuild.
  • CUBRID Locator — OID Workspace, Bulk Fetch/Flush, and the Server-Side Insert/Update/Delete BridgeHow CUBRID translates between in-memory objects and on-disk OIDs — a client-side workspace that batches dirty objects into LC_COPYAREA buffers and a server-side `locator_*_force` family that fans out into heap, btree, lock, log, FK, and replication paths through one canonical entry point.
  • CUBRID cub_master Process — Daemon Lifecycle, Connection Registry, Request Dispatch, and the Auto-Restart Server MonitorEnd-to-end analysis of `cub_master` — the long-lived daemon that owns the per-host CUBRID service registry. Covers the boot sequence (`master_util_config_startup` → `css_does_master_exist` duplicate check → `css_daemon_start` fork → `css_master_init` socket binding + signal handlers → optional `hb_master_init` for HA → optional `server_monitor` instantiation when `auto_restart_server = on`); the `select()` loop that multiplexes the listening socket with every registered child connection in `css_Master_socket_anchor`; the `process_master_request` opcode taxonomy (~30 opcodes split across status / shutdown / HA-process registration / HA-info-query families) implementing what `commdb` and `cubrid commdb` send; the `server_monitor` C++ subsystem that runs a producer-consumer job queue (REGISTER / UNREGISTER / REVIVE / CONFIRM_REVIVE / SHUTDOWN) on a dedicated `std::thread` and uses `m_server_entry_map` to track every registered cub_server's PID + argv so it can re-fork on abnormal exit. Distinct from `cubrid-heartbeat.md` which covers the HA-replication subsystem layered on top of master.
  • CUBRID Monitoring — Perfmon Counters, Statistics Aggregation, and Per-Subsystem MonitorsHow CUBRID instruments hot paths with two layered counter systems — a C++ template-based `cubmonitor` library that registers groups of statistics and supports per-transaction sheets, and the older C `perf_monitor`/`pstat_Metadata` array used by SHOW STATS and statdump — plus per-subsystem monitors such as the per-vacuum-worker overflow-page threshold tracker that keep their own non-counter state.
  • CUBRID Network Protocol — Connection Accept, NRP Dispatch, and Server-Side Request HandlersHow CUBRID frames every server entry point as one NET_SERVER_* opcode dispatched through a static table of `(action_attribute, handler)` records — connections accepted by `cub_master`, handed to `cub_server` workers via `master::connector` over a Unix-domain socket, then driven by an epoll-based `cubconn::connection::worker` that reads CSS-framed packets and delegates to symmetric `or_pack_*` / `or_unpack_*` request marshalling on both sides of the wire.
  • CUBRID SA vs CS Runtime — Standalone (linked-in server) vs Client-Server (over the wire) ModesHow CUBRID compiles the same source tree three times — `cub_server` (SERVER_MODE), `libcubridsa` (SA_MODE), `libcubridcs` (CS_MODE) — so admin utilities can either embed the entire engine in-process and operate on the on-disk database directly, or talk over CSS to a separately running daemon, with the choice driven by per-utility classification (SA_ONLY / CS_ONLY / SA_CS) and a runtime `dlopen` of either `libcubridsa.so` or `libcubridcs.so`.
  • CUBRID Server Session — Per-Client State, Prepared-Statement Registry, and TDES BindingHow CUBRID maintains a per-client server-side state container — the `SESSION_STATE` — keyed by an integer session id in a lock-free hash, cached on the connection entry for O(1) request lookup, and bound to the per-thread TDES so that every server request lands on its rightful transaction descriptor, prepared-statement cache, and parameter set.
  • CUBRID System Parameters — Tunable Registry, Conf/Env/URL Parsing, and Per-Session ScopingHow CUBRID's `prm_Def[]` registry, the `cubrid.conf` INI parser with section selection, environment-variable overrides, the `db_set_system_parameters` SQL path, and the per-session `SESSION_PARAM` array combine into one ordered resolution flow that every other subsystem reads through `prm_get_*_value`.
  • CUBRID Thread Manager NG — Connection/Worker Pool Redesign for High-Concurrency (CBRD-26177)Guava-version redesign of CUBRID's connection/worker pool — bounded epoll-driven connection workers, a coordinator brokering rebalancing and auto-scaling, send/recv budgets, per-worker context freelists, and atomic-free statistics — replacing the legacy thread-per-connection plus max_clients-task-worker layout described in cubrid-thread-worker-pool.md.
  • CUBRID Thread and Worker Pool — Workers, Daemons, Lock-Free Primitives, and Critical SectionsHow CUBRID structures every server thread of execution — the per-thread `cubthread::entry` context, the `worker_pool` template (cores → workers → task queue) that runs queries / vacuum / loaddb / parallel-redo, the `daemon` + `looper` pattern that drives every periodic background flush and detect, the lock-free hashmap shared by lock manager and page buffer, and the heavyweight `csect` RW primitive with its per-thread tracker.

System Catalog (2)

  • CUBRID SHOW Commands — System-Introspection Virtual Scans Over Server Runtime StateHow CUBRID exposes server-internal state — volume headers, log headers, heap and B+Tree capacity, critical sections, threads, page buffer, transaction tables, timezones — as queryable virtual tables by rewriting `SHOW <name>` into `SELECT * FROM (PT_SHOWSTMT)`, dispatching the resulting access spec through `S_SHOWSTMT_SCAN`, and synthesising tuples on demand from per-`SHOWSTMT_TYPE` start/next/end function pointers.
  • CUBRID System Catalog Classes — Data-Driven Definition, Bootstrap Install, and System View Query SpecsHow CUBRID defines and bootstraps the SQL-visible system class family — `_db_class`, `_db_attribute`, `_db_index`, `_db_auth`, ... — through a data-driven framework. Each class is a `cubschema::system_catalog_definition` (attributes + constraints + grants + optional row initializer) registered into a global vector at `catcls_init`; a single `system_catalog_builder` walks the vector at `catcls_install`, issuing `db_create_class` -> `smt_add_attribute` -> `sm_update_class` -> constraints/grants for every class, then a second pass builds the system views from their query-spec strings. Separated from the disk-layer `catalog_manager` and from the in-memory `schema_manager`, this module owns the *meta-schema of the SQL-visible system catalog itself*.

Procedural Language (3)

  • CUBRID PL/JavaSP — Java Stored Procedures, JDBC Bridge, and the PL/CSQL-Sibling External PL EngineHow CUBRID runs Java and PL/CSQL stored procedures through a separate JVM process (cub_pl) that shares catalog rows and transport infrastructure with PL/CSQL while JavaSP alone owns the reflective dispatch on user JARs, the classloader hierarchy, and the security sandbox.
  • CUBRID PL/CSQL — Oracle-Compatible Procedural SQL Compiled to Java in the PL Family RuntimeHow CUBRID's PL/CSQL — the Oracle-dialect half of the PL family alongside JavaSP — is parsed by an ANTLR 4 grammar inside the shared `pl_server` JVM, lowered to a CUBRID-specific Java AST (DeclProgram / StmtBlock / ExprBinaryOp / loopOpt), translated to Java source by an emitter visitor, compiled by an in-process `javax.tools.JavaCompiler`, packaged as a JAR (Base64), and returned to the C-side `compile_handler` so `sp_add_stored_procedure_code` can persist it next to the same catalog rows JavaSP uses.
  • CUBRID PL Server Bridge — The Mid-Execution Callback Channel That Both PL Runtimes Ride OnThird sibling in the PL family — the shared mid-execution callback channel that both JavaSP and PL/CSQL ride on top of. Two physically distinct paths share the same opcode taxonomy and packed structures: Path A is the legacy `cub_server`→CAS channel for C-method scans (`SCAN_TYPE_METHOD` driven by `cubscan::method::scanner`, dispatched on the CAS side by `cubmethod::callback_handler::callback_dispatch`); Path B is the modern `cub_pl`→`cub_server` channel for JavaSP and PL/CSQL invocations (`SP_CODE_INTERNAL_JDBC` ferrying the same `METHOD_CALLBACK_*` opcodes, dispatched server-side by `cubpl::executor::response_callback_command`). Covers the request taxonomy (`METHOD_REQUEST_INVOKE` / `_CALLBACK` / `_END` / `_ARG_PREPARE` / `_COMPILE` / `_SQL_SEMANTICS` / `_GLOBAL_SEMANTICS`), the response taxonomy (~18 `METHOD_CALLBACK_*` opcodes), the compile-time semantic-check round-trips that PL/CSQL uses to validate embedded SQL, the recursion guard (`METHOD_MAX_RECURSION_DEPTH = 15`), and the `tran_begin/end_libcas_function` bracketing that scopes a callback nest under the parent transaction.

Internationalization & Specialty (2)

  • CUBRID Charset and Collation — Codeset Conversion, Locale-Aware Comparison, and Multi-Encoding SupportHow CUBRID encodes text in four codesets (binary, ISO-8859-1, EUC-KR, UTF-8), compiles LDML locale rules into shared libraries of UCA weights, and dispatches collation-aware comparison through a function-pointer LANG_COLLATION vtable consumed by B+Tree, sort, and string operators.
  • CUBRID Timezone — IANA Data Compilation, tz_id Resolution, and DATETIMETZ/TIMESTAMPTZ ConversionHow CUBRID compiles raw IANA tzdata files into a generated `timezones.c` and shared library `libcubrid_timezones.so`, loads the TZ_DATA blob at runtime via `dlsym`, packs a (zone, gmt-offset-rule, ds-rule) triple into a 32-bit `TZ_ID` with two reserved high bits distinguishing zone IDs from raw offsets, resolves wall-clock to UTC through `tz_datetime_utc_conv` walking the zone's offset-rule list and a daylight-saving ruleset while honouring the `LOCAL_STD` / `LOCAL_WALL` / `UTC` "AT" qualifier and overlap intervals, and exposes `DATETIMETZ`, `TIMESTAMPTZ`, `DATETIMELTZ`, and `TIMESTAMPLTZ` to the SQL layer through `tz_create_datetimetz`, `tz_conv_tz_datetime_w_region`, and `tz_explain_tz_id`.

Utilities (8)

  • CUBRID checksumdb — HA Replica vs Master Row-Checksum Verifier with Chunked, Replication-Replayed ComparisonEnd-to-end analysis of `checksumdb` — the HA-cluster integrity verifier that detects silent divergence between a master and its slaves by chunking each table along the primary key, computing per-chunk checksums on the master, replicating those checksums through the same WAL stream that ferries the data, and letting the slave recompute and compare. Covers the on-disk artefacts (`db_ha_apply_info_chksum_*` result table + `_schema` table) and the chunked walk (lower-bound iteration via the PK ordering, `--chunk-size` rows per chunk, `chksum_get_next_lower_bound` driving the cursor); the schema-checksum side that compares serialised class definitions; the include/exclude list filtering; the resume-from-prior-run mode; the report formatter (`chksum_report_summary` / `_diff` / `_schema_diff`); the SHARED lock acquired during chunk computation that allows reads but blocks writes long enough to make the chunk consistent.
  • CUBRID compactdb — Offline Database Compaction and Page Defragmentation UtilityHow CUBRID's compactdb utility complements the online MVCC vacuum by walking each class heap, NULL-ing dangling OID references, reclaiming empty heap pages, dropping obsolete catalog representations, and defragmenting heap files — driven from the client side, scoped per class, and run in three numbered passes against an unmounted-but-restartable database.
  • CUBRID csql — Interactive SQL Client, Two-Binary Launcher Split, Session-Command Prefix, and Single-Line Execution ModeEnd-to-end analysis of the `csql` interactive SQL client — the two-binary launcher pattern (`csql_launcher.c` parses argv, then `dlopen`s `cubridsa` or `cubridcs` and `dlsym`s into the `csql` entry depending on `--SA-mode` / `--CS-mode`); the CSQL_ARGUMENT option table and the validation matrix that rejects invalid combinations (`-p` / `-q` / `--loaddb-output`, `--write-on-standby` without `--sysadm`, `--skip-vacuum` only in SA); the `start_csql` read-execute-print loop with single-line-execution detection via `csql_walk_statement` / `csql_is_statement_complete` for in-block (string / comment / identifier) tracking; the session-command (`;`-prefixed) dispatch table with 47 entries split across file, edit, command, environment, help, and history families plus the `CMD_CHECK_CONNECT` flag that gates DDL-class commands when the session is offline; the readline / `stifle_history` integration and the `.hist` set of histogram commands; the DDL audit logging hookup (`logddl_*`) and the four output styles (column, line-output, plain, query, loaddb).
  • CUBRID cubrid Admin CLI — Verb Dispatcher, SA/CS-Routed Library Loading, and the Service · Server · Broker · Heartbeat FamilyEnd-to-end analysis of the unified `cubrid` admin CLI — the two-axis verb taxonomy (service-family verbs `service|server|broker|manager|heartbeat|pl|gateway` × command-family verbs `start|stop|restart|status|reload|on|off|acl|reset|info|deregister|list|getid|test|copylogdb|applylogdb|replication`, plus the database-admin verbs `createdb|backupdb|loaddb|unloaddb|...|memmon`) backed by `us_Service_map` + `us_Command_map` + `ua_Utility_Map` tables; the bitmask cross-check (`MASK_SERVICE`/`MASK_SERVER`/`MASK_BROKER`/`MASK_HEARTBEAT`/`MASK_PL`/`MASK_GATEWAY`) that rejects nonsensical (verb, command) pairs at parse time; the SA_ONLY/CS_ONLY/SA_CS routing that picks `libcubridsa` vs `libcubridcs` per verb; the `dlsym` of a verb-specific entry function with the standard `UTIL_FUNCTION_ARG` signature; the legacy compatibility shim in `util_front.c` that translates old short-arg invocations (`createdb -p 1000 ...`, `loaddb -u dba ...`) into modern `cubrid <verb> --long-arg ...` and `execvp`s into the unified entry.
  • CUBRID loadjava — JAR · .class Installer for the JavaSP Classloader TreeEnd-to-end analysis of `loadjava` — the small standalone utility that installs a `.class` or `.jar` into the database's per-database Java classloader root so JavaSP can find it. The whole utility is one C++17 file using `<filesystem>`. Covers the path resolution: `$CUBRID_DATABASES/<db>/java/<package>/<file>` for the dynamic (default) tree that the `cub_pl` JVM's `ContextClassLoader` watches, or `$CUBRID_DATABASES/<db>/java_static/<package>/<file>` (with `--jni`) for the static tree that is loaded once at JVM start; the `--package` flag with regex-validated dot notation that becomes the directory hierarchy under the install root; the `--overwrite` / `-y` flag for non-interactive overwrite; the deliberate `fs::remove` before copy that updates the directory's mtime so the JVM's classloader-manager picks up the change without restart (CBRD-24695); the lack of any database-server connection — the install is purely filesystem-side, with the JVM-side classloader manager doing the discovery.
  • CUBRID migrate — One-Shot 9.1→9.2 In-Place Format Upgrader for Volume Headers, Active Log Codeset, and Collation SyncEnd-to-end analysis of `migrate` — the version-locked one-shot in-place format upgrader from CUBRID 9.1 to 9.2 disk format. Covers the four-phase sequence: (1) per-volume header rewrite that converts the v9.1 disk-var-header layout to v9.2 (with an undo journal so a mid-migration crash can be rolled back via `undo_fix_volume_header`); (2) active-log codeset patch (`fix_codeset_in_active_log`) that retags the log header's codeset field consistently with the catalog; (3) `db_restart` boot, `synccoll_force` to refresh the catalog's collation rows from the new locale library, with strict per-collation compatibility checking (codeset match, checksum match unless contractions present); (4) `file_update_used_pages_of_vol_header` to refresh used-pages statistics. The lockstep undo discipline and the explicit `rel_disk_compatible() != V9_2_LEVEL` guard make migrate the canonical reference for how CUBRID handles disk-format upgrades.
  • CUBRID unloaddb — Schema and Data Export, Four-File Output Layout, and the Per-Class Multi-Thread · Multi-Process DriverEnd-to-end analysis of `unloaddb` — the schema-and-data exporter that is the natural inverse of `loaddb`. Covers the four-file output layout (`<prefix>_schema`, `<prefix>_trigger`, `<prefix>_indexes`, `<prefix>_objects`) emitted in a fixed order so a downstream `loaddb` can replay schema → triggers → data → indexes; the `extract_context` carrier struct that threads user / auth / storage-order / split-files / dba-bypass settings into every emitter; the `extract_classes_to_file` → `emit_schema(EXTRACT_CLASS)` → `emit_schema(EXTRACT_VCLASS)` ordering that resolves table-before-view dependencies; the `extract_objects` driver with `--thread-count` (per-class concurrent fetch, ≤127) and `--mt-process N/M` (mutually-exclusive multi-process partitioning across ≤36 processes by class); the `--datafile-per-class` mode that writes one data file per class instead of one big bag; the `--latest-image` MVCC snapshot toggle; the `--keep-storage-order` vs default `FOLLOW_ATTRIBUTE_ORDER` axis; the `--as-dba` group bypass for cross-owner extracts; the `--split-schema-files` and `--input-class-file` selective-extract path.
  • CUBRID Utilities Miscellany — commdb, gencat, generate_timezone, daemon, cubrid_version, pl Bootstrap HelperOmnibus coverage of the small `src/executables/` utilities that don't warrant a dedicated doc each. Six tools span three categories: (1) **runtime probes / helpers** — `commdb` (the standalone `cub_master` RPC client predating `cubrid commdb`; sends MASTER_REQ_* opcodes), `pl` (cub_pl JVM bootstrap helper used by `cubrid pl ping/start/stop/restart/status` reading `$CUBRID/var/pl_<db>.info`), `daemon` (tiny `fork`/`setsid` double-fork wrapper used by other binaries that don't already detach themselves); (2) **build-time generators** — `gencat` (NetBSD-derived POSIX message-catalog compiler that produces the binary `.cat` files for `MSGCAT_CATALOG_*` lookups), `generate_timezone` (62-line wrapper around `timezone_compile_data` that turns IANA `tzdata` into the C source the `cubrid_timezones` library is built from); (3) **trivia** — `cubrid_version` (37-line program that prints `PRODUCT_STRING` and exits).

A module-by-module read of the PostgreSQL backend (REL_18) — the shared-memory process model, the WAL durability spine, storage and pluggable access methods, the planner/executor pipeline, the catalog/cache layer, replication, the extensibility surface, plus version-evolution docs.

Start here: architecture overview

Jump to: Overview & Reading Paths · Base / Infrastructure · Storage Engine · Transaction & Recovery · Query Processing · DDL & Schema · Replication & HA · Server Architecture · System Catalog · Monitoring & Statistics · Client Protocol · Extensibility · Internationalization & Text · Utilities · Version Evolution

Overview & Reading Paths (16)

  • PostgreSQL Architecture Overview — One Binary, One Shared-Memory Machine, and the WAL SpineFront door for the PostgreSQL code-analysis tree. PostgreSQL is one binary: a single postmaster forks a family of role-typed children (the BackendType enum, ~18 roles) that all attach to one shared-memory segment whose size is fixed at startup — so the shared structures (PGPROC/ProcGlobal, the procarray, the buffer pool, the lock table, sinval) ARE the architecture, and the processes are interchangeable inhabitants of that machine. The WAL is the single durable event log that recovery, physical and logical replication, archiving, and backup all consume. This doc lays out seven architectural axes — process model, shared-memory/IPC substrate, the WAL durability spine, storage and pluggable access methods, the query pipeline, the catalog+cache layer, and the extensibility surface — one diagram per axis, each forward-referencing the per-module detail docs that own the mechanism.
  • PostgreSQL Code-Analysis Coverage MapTree-wide coverage tracker for the PostgreSQL code-analysis tree, written against REL_18_STABLE (commit 273fe94, PostgreSQL 18.x). Lists every module doc by subcategory with its source anchor, cross-reference priority, and status — the campaign is now COMPLETE: 117 module docs + 15 overview/coverage/status docs + 7 evolution docs, all written EN+KO+meta; 13 PG-relevant research papers ingested into knowledge/research/dbms-papers/. Records the doc-type legend (module / overview / detail / evolution), the in-tree-README + dbms-papers anchor strategy, and the explicit non-goals: contrib/ entirely, PG19-only features, non-libpq client drivers, and the test frameworks. Not a per-doc catalog (see README.md) and not an architecture map (see postgres-architecture-overview.md); this is the answer to 'what's planned, what's written, and what's out of scope?'
  • PostgreSQL Base Infrastructure — Section OverviewRouter for the base-infra subcategory: the substrate every other layer in PostgreSQL is built on. Seven foundational facilities, layered roughly bottom-up — the hierarchical memory-context allocators (AllocSet / slab / generation / bump under one MemoryContext virtual table), the elog/ereport error machinery that unwinds with setjmp/longjmp, dynahash (the in-process hash table behind syscache and the shared lock table), the resource-owner tree that guarantees buffer pins and locks are released on error, the function manager (fmgr) that turns a pg_proc row into a callable, the datatype library (utils/adt) that supplies every type's I/O and operators, and the GUC parameter system. Memory contexts and elog are the true bottom; everything above — executor, planner, storage, catalog — allocates in a context, throws through elog, and registers cleanup with a resource owner. This doc names the boundaries and the read order; the mechanism lives in the seven module docs it routes into.
  • PostgreSQL Client Protocol — Section OverviewRouter for the client-protocol subcategory: the backend half of the PostgreSQL frontend/backend (FE/BE) wire protocol. The layering is a single linear handshake that every connection climbs once — raw socket, then an optional encryption layer (TLS or GSSAPI) negotiated by a request packet before the startup packet, then the startup packet that selects an HBA rule, then authentication (the SASL/SCRAM challenge-response is the modern default), and only then the steady-state message loop in PostgresMain where Query and the Parse/Bind/Execute extended-query messages are read off the wire by pqcomm and decoded by pqformat. Three module docs own the three bands of that stack: postgres-wire-protocol (the byte framing and message loop), postgres-authentication (the ClientAuthentication dispatch and SCRAM exchange), and postgres-tls-gssapi (the transport-security layer). This section ends where a parsed message enters the query pipeline and where authentication hands the session to backend startup.
  • PostgreSQL DDL & Schema — Section OverviewRouter for the DDL & schema subcategory. The unifying fact: almost everything here is reached through ProcessUtility — the dispatch point for statements the planner does not optimize. standard_ProcessUtility handles the simple cases inline and routes the schema-mutating ones (CREATE/ALTER TABLE, indexes, constraints, triggers, partitions, policies) through ProcessUtilitySlow, which calls into the per-command modules under commands/, the catalog writers under catalog/ (heap.c, index.c, partition.c), and the partition-bound machinery under partitioning/. Two members break the ProcessUtility frame and the section says so explicitly: row-level security is a query-rewrite-time mechanism (get_row_security_policies, called from rewriteHandler.c, not from ProcessUtility), and partition tuple routing (ExecFindPartition) runs in the executor at DML time. COPY is utility-dispatched bulk I/O. Materialized views straddle DDL (the relation) and the query pipeline (the populating SELECT). This doc names those seams and hands the mechanism to the ten module docs.
  • PostgreSQL Extensibility — Section OverviewRouter for the extensibility subcategory. PostgreSQL's defining shape is that third-party code plugs into the engine at fixed indirection points without patching it: every plug point is a struct-of-callbacks or a function-pointer global the core consults at a known spot. This subcategory routes the user-facing plug points — foreign data wrappers (FdwRoutine), the CREATE EXTENSION packaging system, the hook globals (planner/executor/ProcessUtility/shmem), procedural languages dispatched through pg_language plus their SPI re-entry path, and custom scan providers (CustomScanMethods). The lower-level plug points — table/index AMs, custom WAL rmgrs, background workers — are the same pattern but are owned by storage-engine, txn-recovery, and server-architecture respectively; this overview names that boundary and hands off. Mechanism lives in core; most concrete implementations ship as contrib or third-party (out of scope here).
  • PostgreSQL Internationalization & Text Search — Section OverviewRouter for the i18n-text subcategory: the three text-handling layers that sit beside the byte representation of a string. (1) The collation provider abstraction — every locale-sensitive comparison, case-fold, and pattern match goes through a single pg_locale_t handle whose collate vtable dispatches to libc, ICU, or the PG17-new builtin provider, so str_tolower / varstr_cmp never branch on provider. (2) Server encodings — utils/mb owns the per-database server encoding, the conversion-function matrix between encodings, and multibyte-aware string ops, while the shared encoding-name and wchar tables live in src/common. (3) Full-text search — tsearch turns a document into a normalized tsvector via a configuration of parser plus dictionaries, matches it against a tsquery, and (via GIN's extractValue/extractQuery) is the canonical consumer of the GIN access method. This doc names the three module docs, their seams to base-infra (datatypes, fmgr) and storage-engine (GIN), and defers all mechanism.
  • PostgreSQL Monitoring & Statistics — Section OverviewRouter for the monitoring-stats subcategory: the introspection plane that lets a backend report what it is doing and the engine accumulate what has happened. Three pillars share one home (utils/activity/). (1) The PG15 cumulative statistics system — a shared-memory subsystem that replaced the old single stats-collector process and its UDP packets: each backend accumulates pending counts locally, then flushes them into a fixed-shmem block (for fixed-numbered kinds like checkpointer/bgwriter/WAL/IO) or a DSA-backed dshash keyed by PgStat_HashKey (for variable-numbered kinds like per-relation/per-function), with the checkpointer serializing the table to disk at shutdown and the startup process reloading it (or discarding after a crash). (2) Wait events + backend status — a near-free per-PGPROC report of what each backend is doing right now, with the wait-event taxonomy code-generated from wait_event_names.txt. (3) Progress reporting — a parallel-safe per-command progress array. This is the live view (status/wait/progress) versus the accumulated view (cumulative counters), and it defers all server-process plumbing to server-architecture and all SLRU/IO substrate to txn-recovery and storage-engine.
  • PostgreSQL Query Processing — Section OverviewRouter for the query-processing subcategory: the read-path pipeline that turns SQL text into answered rows. A backend runs five stages in order — parse (gram.y/scan.l → raw parse tree), analyze (parse_analyze → Query tree), rewrite (the rule system expands views and applies RLS), plan (the cost-based optimizer builds candidate Paths cheapest-first, picks a winner with dynamic-programming join ordering and a GEQO genetic fallback, then materializes it into an executable Plan tree), and execute (a demand-pull tree of plan nodes driven by ExecProcNode, with expression evaluation through a switch-threaded interpreter or optional LLVM JIT). Around that spine sit parallel query (a Gather node forks workers that run a copy of the sub-plan over shm_mq), prepared statements and portals (cached plans across executions), planner statistics (per-column plus extended multi-column statistics), and the Node infrastructure every stage's trees are built from. Hands off downward to storage-engine for the access methods scans actually call, and shares the optimizer's selectivity inputs with system-catalog.
  • PostgreSQL Replication & HA — Section OverviewRouter for the replication-ha subcategory. Every doc here consumes the same WAL stream the txn-recovery subcategory produces: walsender ships WAL to a walreceiver for physical streaming; replication slots pin the WAL (and a catalog snapshot) a consumer still needs; logical decoding turns that physical WAL into row changes via reorderbuffer + snapbuild and emits them through the pgoutput plugin to apply workers; synchronous replication makes COMMIT wait on standby acks; the archiver and WAL summarizer drain WAL to archives and block-modification summaries; and basebackup plus incremental backup take a physical snapshot of the cluster anchored to a WAL position. This section owns the WAL *consumers*; it hands the WAL *producer* and recovery/redo back to txn-recovery, and the FE/BE replication-command transport to client-protocol.
  • PostgreSQL Server Architecture — Section OverviewSection-overview router for the server-architecture subcategory: the fork-and-attach machine itself. A single postmaster sizes one shared-memory segment at boot and forks role-typed children (the BackendType family); each backend runs InitProcess → InitPostgres → the message loop. On top of that substrate sit the IPC primitives (the static shared segment plus dynamic dsm/dsa/shm_mq) and three distinct lock layers that lmgr/README itself enumerates — LWLocks/spinlocks for short in-memory critical sections, the heavyweight regular lock table with a waits-for deadlock detector, and SIReadLock predicate locks for SERIALIZABLE (SSI). Latches and signals (procsignal, SetLatch) carry wakeups, and a set of auxiliary processes plus the background-worker framework round out the inhabitants. This router names the boundaries and hands each mechanism off to its module doc; it covers WHO the processes are and HOW they coordinate, deferring WHAT they do (WAL, MVCC, query) to the adjacent sections.
  • PostgreSQL Storage Engine — Section OverviewRouter for the storage-engine subcategory — PostgreSQL's layered stack from OS files up: smgr/md maps a relation to 1 GB file segments, the shared buffer pool is the single choke point between memory and disk (with PG18 async I/O underneath), every block is a slotted page of ItemId line pointers, and above that sit the pluggable access methods. The structural highlight is the indirection: a table is reached through a TableAmRoutine (heap is the default, no-overwrite MVCC) and an index through an IndexAmRoutine (nbtree, gin, gist, spgist, brin, hash), so the heap's on-page records, HOT pruning, TOAST, the visibility map, and the free-space map are all heap-AM concerns layered on the shared page/buffer floor.
  • PostgreSQL System Catalog & Caches — Section OverviewRouter for the system-catalog subcategory. PostgreSQL is a catalog-driven engine: every type, operator, function, access method, and relation is a row in a pg_* system table, so the hot path is not querying those tables but caching them. Three caches stack in front of the catalog — the catcache (individual tuple lookups, the syscache being its typed dispatch layer), the syscache/lsyscache convenience accessors over it, and the relcache (whole-relation descriptors assembled from several catalogs) — and because each backend holds private copies, a fourth mechanism keeps them coherent: the sinval shared-invalidation loop, where a catalog mutation in one backend queues messages in shared memory that every other backend drains and acts on at AcceptInvalidationMessages points. Layered around this core are dependency tracking (pg_depend / pg_shdepend, the DROP ... CASCADE graph) and the namespace / search_path resolver that turns an unqualified name into a catalog OID. This doc names the layering and hands off mechanism to six module docs.
  • PostgreSQL Transactions & Recovery — Section OverviewRouter for the transactions-and-recovery subcategory. PostgreSQL's transaction story is two interlocking machines layered on the shared-memory substrate. The first is concurrency: a no-overwrite MVCC heap whose snapshots are computed from the procarray (the live-XID census in shared memory), with commit state recorded in the clog over the SLRU page cache and visibility resolved by heapam_visibility. The second is durability: the WAL is the single redo log; xact drives the commit protocol, rmgr-tagged records are replayed by the startup process on crash/archive recovery, the checkpointer bounds replay, and 2PC persists prepared state. Closing the loop, vacuum/freeze and autovacuum reclaim the dead tuple versions MVCC leaves behind and hold off XID wraparound. This overview names those layers, the order to read the fourteen module docs, and the seams it shares with storage-engine, server-architecture, and replication-ha.
  • PostgreSQL Utilities — Section OverviewRouter for the utilities subcategory: the standalone frontend/backend tools that ship in src/bin plus the build-time catalog codegen they depend on. The unifying fact is that none of these is part of the running server — they are separate executables that act on a cluster from the outside, and they cluster into four families by which on-disk contract they touch. (1) Cluster genesis: genbki.pl turns the pg_*.h/pg_*.dat catalog sources into postgres.bki at build time, and initdb drives a standalone backend in bootstrap mode (bootstrap.c) to replay that .bki and stamp the control file. (2) Logical export/migration over the FE/BE protocol: pg_dump/pg_restore/pg_dumpall reconstruct schema+data as SQL or an archive, and pg_upgrade orchestrates a schema-only dump plus a physical relfilenode swap. (3) Physical backup/recovery tools that consume the WAL and data files directly: pg_basebackup, pg_combinebackup (incremental reconstruction), and pg_rewind. (4) Inspection + control tools: pg_waldump decodes WAL by reusing the backend rmgr desc callbacks, psql is the interactive client, and pg_ctl/pg_controldata wrap postmaster lifecycle and read pg_control. Each tool gets one detail doc; this overview asserts the families and the contracts, and defers mechanism.
  • PostgreSQL KO Translation Status — Per-Document Phase BoardPhase board for the KO mirrors under `knowledge/ko/code-analysis/postgres/`. The general 5-phase framework (P0 literal mirror → P1 S1 machine cleanup → P2 structural rewrite → P3 natural Korean prose → P4 reviewed) and the marker convention live in `knowledge/code-analysis/cubrid/cubrid-translation-status.md` (the single source of truth for the model) and `knowledge/methodology/korean-translation.md`; this doc is PostgreSQL's applied instance. The PostgreSQL KO mirrors were authored as native parallel compositions (sonnet, single-pass) with `kb-ai-tell-check` run to clear S1 findings — so they enter the board at P2, not via the P0→P1 literal-then-clean path cubrid took. P3 (dedicated prose polish) and P4 (native review) remain pending; per-doc frontmatter phase tags are not yet stamped (a tracked follow-up).

Base / Infrastructure (7)

  • PostgreSQL Datatype Library — varlena, numeric, datetime, jsonb, and arraysEvery PostgreSQL SQL type is a row in pg_type whose behavior is a bundle of C functions registered in pg_proc; the abstract-datatype (ADT) library under src/backend/utils/adt implements those functions. The contract is the four-function I/O quartet — in (cstring to internal), out (internal to cstring), recv (binary wire to internal), send (internal to binary wire) — plus comparison/hashing/sortsupport for indexable types, all called through fmgr's V1 PG_FUNCTION_ARGS convention. Variable-length types are varlena: a value begins with a length-bearing header whose low bits flag a 4-byte aligned header (up to 1 GB), a 1-byte unaligned short header (<=126 B, saves padding on tiny strings), an in-line compressed datum, or a TOAST pointer to out-of-line storage; consumers detoast/unpack lazily via VARDATA_ANY/VARSIZE_ANY and pg_detoast_datum_packed. text/bytea are the canonical varlena types: in/out just copy bytes around a header, and ordering goes through varstr_cmp which fast-paths C-locale memcmp and otherwise calls the collation provider, with a SortSupport abbreviated-key path for sorts. numeric is an arbitrary-precision decimal stored as base-10000 NBASE digits with a packed short/long/special header (sign, weight, display scale); arithmetic deserializes to a NumericVar digit array, runs schoolbook add/mul with guard digits, and re-packs via make_result. date/timestamp are fixed-length integer types whose I/O routes through the shared datetime tokenizer (ParseDateTime/DecodeDateTime) and the Julian-day kernel date2j/j2date. jsonb is a tree of JEntry-tagged nodes: a container header counts children, a parallel JEntry array stores per-child type + length-or-offset (an offset every JB_OFFSET_STRIDE entries to keep random access O(1) while staying TOAST-compressible), and the variable-length payloads follow; values are built in an in-memory JsonbValue tree then serialized depth-first by convertJsonbValue, and compared structurally by compareJsonbContainers. arrays are varlena ArrayType headers carrying ndim, an optional null bitmap, element-type OID, dims/lbounds, then packed element data accessed through the ARR_* macros and array_get_element. This doc covers the I/O-quartet contract, the varlena header zoo, numeric digit arithmetic, datetime Julian kernel, jsonb binary layout, and array internals; it defers fmgr dispatch to postgres-fmgr.md and out-of-line storage/compression to postgres-toast.md.
  • PostgreSQL Dynamic Hash Tables (dynahash) — Larson Linear Hashing, Partitioned Shared Hashes, and the simplehash Templatedynahash is PostgreSQL's general-purpose chained hash table, the same code serving both backend-local tables (palloc'd, auto-expanding) and named shared-memory tables (fixed-directory, discoverable across processes). Its on-disk-free data structure is a Larson-style dynamically-expansible hash: a top-level directory of fixed-size segments of bucket-header pointers, with calc_bucket() splitting exactly one bucket per inserted element so the table grows smoothly without a stop-the-world rehash. Each entry is a HASHELEMENT header (link + cached hashvalue) immediately followed by the caller's MAXALIGN'd key-then-data payload, so pointers to entries are stable for the entry's lifetime — a property the chained design guarantees and that open-addressing cannot. The single-lock convention (shared lock to read, exclusive to write) becomes a bottleneck for the hottest shared tables, so HASH_PARTITION mode forbids on-the-fly bucket splits, shards nentries/freeList into NUM_FREELISTS spinlock-guarded slices, and lets callers use the low-order hash bits as an independently-lockable partition number — this is exactly how the buffer manager's SharedBufHash and the lock manager's LOCK/PROCLOCK tables get their concurrency. hash_seq_search walks the directory bucket-by-bucket and registers itself so a concurrent insert in the same backend won't split a bucket mid-scan. For performance- and cache-sensitive local hot paths, simplehash.h is a separate, macro-templated, open-addressing (robin-hood) hash that the compiler specializes to one element type, trading dynahash's stable pointers and partitioning for inlined comparisons and no pointer chasing. This doc covers hash_create/hash_search/hash_seq_search, the directory/segment/bucket geometry, partitioned locking, expand_table, and the simplehash robin-hood insert; it defers shared-memory allocation to postgres-shared-memory-ipc.md, the buffer table to postgres-buffer-manager.md, and the lock table to postgres-lock-manager.md.
  • PostgreSQL Error Handling — ereport/elog, PG_TRY/sigsetjmp, and the ErrorContext Escape HatchPostgreSQL routes every diagnostic message through a two-phase ereport/elog pipeline: errstart() allocates a slot on a depth-5 ErrorData stack and decides whether the message is worth processing; errfinish() formats, dispatches, and — for ERROR — invokes siglongjmp through PG_exception_stack to the nearest PG_TRY handler. FATAL exits via proc_exit(); PANIC calls abort(). A dedicated ErrorContext memory context is pre-reserved so out-of-memory errors can always be reported. ErrorContextCallback hooks allow call-stack narration. errsave/ereturn provide a non-longjmp soft-error path for input-validation code. Assert() is a compile-time-optional macro that calls ExceptionalCondition() — bypassing elog entirely — to avoid recursion during assertion failure.
  • PostgreSQL Function Manager (fmgr) — The V1 Call Convention, Lookup Pipeline, and Extension ABIHow PostgreSQL routes every SQL function call through a two-phase lookup (FmgrInfo) + invocation (FunctionCallInfo) pipeline, dispatching built-ins via a code-generated OID index, C extensions via a dlopen cache, and all other languages via their call handler; how the V1 ABI is declared with PG_FUNCTION_INFO_V1 and the Pg_magic_struct ABI check enforces module compatibility.
  • PostgreSQL GUC Parameters — config_generic, PGC Contexts, SET/SHOW, Check/Assign Hooks, and SIGHUP ReloadPostgreSQL's run-time configuration system (the "GUC" — Grand Unified Configuration) is a single table-driven mechanism that handles every tunable, from planner toggles to WAL sizing. Each parameter is a statically-initialized config_bool/int/real/string/enum record whose first member is a shared config_generic header carrying the parameter's name, its GucContext (PGC_INTERNAL / PGC_POSTMASTER / PGC_SIGHUP / PGC_SU_BACKEND / PGC_BACKEND / PGC_SUSET / PGC_USERSET — a privilege-and-timing ladder governing who may change it and when), a flags word, and the run-time state (current source, RESET value, and a stack of saved values). build_guc_variables() folds the five typed arrays into one case-insensitive dynahash keyed on name, and find_option() resolves every lookup through it, transparently minting placeholder records for custom (dotted) names. The write path is one funnel — set_config_with_handle(), reached from SQL SET via ExecSetVariableStmt and from the config file via ProcessConfigFile — which enforces the context ladder, runs parse_and_validate_value() (range checks plus a per-variable check_hook that can canonicalize the value and stash an opaque extra), pushes the prior value onto a transactional stack for SET LOCAL / rollback via push_old_value()/AtEOXact_GUC(), commits the new value into *variable, fires the assign_hook side effect, and flags GUC_REPORT variables for a ParameterStatus message. SHOW reads back through ShowGUCOption(), applying any show_hook and unit formatting. postgresql.conf changes take effect on SIGHUP: the signal sets ConfigReloadPending, and each process calls ProcessConfigFile(PGC_SIGHUP), which re-parses the file (plus postgresql.auto.conf written by ALTER SYSTEM), diffs it against current state, and re-applies — silently refusing PGC_POSTMASTER changes that need a restart. This doc covers the record layout, the context ladder, the lookup hash, the set/show funnel, the hook protocol, and the reload path; it defers backend GUC inheritance and parallel-worker state copy to postgres-backend-lifecycle.md and the SIGHUP plumbing per process type to postgres-postmaster.md.
  • PostgreSQL Memory Contexts — Hierarchical Region Allocation and longjmp-Safe CleanupPostgreSQL allocates almost all backend memory from a forest of MemoryContexts rooted at TopMemoryContext; palloc() draws from the CurrentMemoryContext, and the whole point is that resetting or deleting a context frees every chunk under it in one call — which is how error handling unwinds, since the elog/longjmp path just deletes the transaction's contexts. A context is an abstract MemoryContextData header plus a method vtable; four concrete allocators back it — AllocSet (general, power-of-2 freelists), Slab (fixed-size chunks), Generation (FIFO/lifespan groups, no chunk reuse), and Bump (header-less, no pfree) — each tuned to a different allocation pattern, all sharing a 1-byte MemoryContextMethodID encoded in the chunk header so pfree() can find the right free routine from the pointer alone.
  • PostgreSQL ResourceOwner — Hierarchical Resource Tracking and Error-Safe ReleaseA ResourceOwner is PostgreSQL's RAII-without-destructors mechanism: a per-transaction / per-subtransaction / per-portal object that remembers the buffer pins, lmgr locks, relcache/catcache references, open files, snapshots, JIT contexts, DSM segments, and AIO handles acquired while it was CurrentResourceOwner, so that all of them can be reliably freed at transaction end — including the abort path, where the executor's own data structures are presumed untrustworthy. ResourceOwners form a forest mirroring the transaction/portal nesting (subxact owner is a child of its parent's owner; a portal owner is a child of CurTransactionResourceOwner); releasing a parent recurses to children first. Each tracked resource kind registers a ResourceOwnerDesc giving it a release phase (BEFORE_LOCKS, LOCKS, AFTER_LOCKS) and a priority within the phase, and ResourceOwnerRelease is driven three times by xact.c — once per phase — so that backend-visible resources (pinned buffers) are dropped before the locks that other backends wait on, and backend-internal caches are dropped after. The storage layer is a 32-slot fixed array that spills into an open-addressing hash, plus a separate 15-entry lossy lock cache that lets subxact commit bulk-reassign locks to the parent. The whole machine plugs into PG_TRY/sigsetjmp error unwinding: longjmp out of any C frame lands in AbortTransaction, which calls ResourceOwnerRelease on the top owner and frees everything the failed frames forgot. This doc covers the ResourceOwnerData struct, Remember/Forget, the array→hash spill, the three-phase sorted release, the lock cache, and the xact.c/portalmem.c wiring; it defers MemoryContext (the sibling allocator) to postgres-memory-contexts.md, the elog/PG_TRY machinery to postgres-error-handling.md, and buffer-pin internals to postgres-buffer-manager.md.

Storage Engine (20)

  • PostgreSQL Asynchronous I/O — The PG18 AIO Subsystem, io_uring, and Read StreamsPG18 adds a first-class asynchronous I/O subsystem so the server can issue reads before it needs the data, overlapping storage latency with useful work. The core abstraction is the AIO handle (PgAioHandle): a backend acquires a handle with pgaio_io_acquire(), lower layers (bufmgr -> smgr -> md -> fd) register completion callbacks and define the operation via pgaio_io_start_readv()/writev(), and the handle is staged then submitted through a pluggable IoMethodOps vtable. Three methods exist: sync (no real AIO, used for debugging and as a fallback), worker (the default, dispatching IOs to B_IO_WORKER processes over a shared-memory submission queue), and io_uring (Linux 5.1+, one ring per backend in shared memory so any backend can drain completions). Handles live in shared memory and move through an eight-state machine (IDLE -> HANDED_OUT -> DEFINED -> STAGED -> SUBMITTED -> COMPLETED_IO -> COMPLETED_SHARED -> COMPLETED_LOCAL); a wait reference (PgAioWaitRef) carries a generation counter so any process can wait for an IO even after the handle is reused. Completion callbacks are identified by small integer IDs (not pointers, because EXEC_BACKEND ASLR forbids shared-memory function pointers) and run in critical sections; errors are encoded compactly into a PgAioResult and re-raised in the issuing backend via a backend-local PgAioReturn. The read_stream.c helper is the main consumer: a callback emits a stream of block numbers, the stream merges neighbours into vectored reads up to io_combine_limit, runs up to max_ios concurrent reads, and adapts its look-ahead distance to recent hit/miss history. This doc covers the handle lifecycle, the method vtable and the three implementations, batch submission and deadlock avoidance, the callback/result/target model, and read streams; it defers buffer-pool mechanics to postgres-buffer-manager.md and segment/fd translation to postgres-smgr-md.md.
  • PostgreSQL BRIN — Block Range Indexes and the Range MapBRIN (Block Range Index) is PostgreSQL's lossy, block-oriented index for very large naturally-clustered tables: instead of one index entry per heap tuple, it stores one tiny summary tuple per *page range* (BRIN_DEFAULT_PAGES_PER_RANGE = 128 heap pages), recording for the minmax opclass the per-column min/max, plus hasnulls/allnulls flags, of every tuple in that range. A fixed-arithmetic translation layer — the *range map* or revmap, living in the index blocks right after the metapage — maps any heap block to the TID of its summary tuple via HEAPBLK_TO_REVMAP_BLK / HEAPBLK_TO_REVMAP_INDEX, so there is no tree to descend. BRIN cannot support amgettuple; it only implements amgetbitmap (bringetbitmap), walking the revmap in ascending page-range order, calling the opclass consistent function per range, and emitting *every page* of a matching (or unsummarized, or placeholder) range into a lossy TIDBitmap that the BitmapHeapScan recheck node then filters. Inserts (brininsert) only narrow an *existing* summary in place via brin_doupdate (same-page overwrite when it fits, otherwise move-and-repoint-revmap); new ranges stay unsummarized until VACUUM, brin_summarize_new_values(), or autosummarize fires summarize_range, which inserts a placeholder tuple, scans the heap range with 'any visible' semantics, then unions the placeholder back so concurrent inserts are not lost. brin_desummarize_range reverses it. This doc covers the per-range summary tuple, the revmap arithmetic and physical extension (with page evacuation), the lossy bitmap scan, summarization/desummarization/autosummarize, and the minmax vs inclusion opclass contract (opcinfo/addValue/consistent/union); it defers the generic AM dispatch to postgres-index-am.md and the inclusion-style geometric opclass internals to postgres-gist.md.
  • PostgreSQL Buffer Manager — Shared Pool, Clock-Sweep Eviction, and the WAL-Before-Flush RuleHow PostgreSQL caches disk pages in a fixed-size shared pool: a BufferDesc array whose state packs refcount+usagecount+flags into one atomic word, a partitioned buffer-mapping hash keyed on (spc,db,rel,fork,block), pin/unpin reference counting separated from shared/exclusive content locks, a clock-sweep victim selection (StrategyGetBuffer/GetVictimBuffer) that decrements usage counts as the hand sweeps, BAS_* ring buffers that confine bulk scans to a few buffers, and FlushBuffer enforcing the WAL rule — XLogFlush up to the page LSN before the page is written — which makes steal/no-force recovery sound.
  • PostgreSQL Data Checksums — Page Integrity VerificationPostgreSQL data checksums are a cluster-wide, opt-in defense against silent on-disk corruption: when initdb -k (or pg_checksums) sets pg_control's data_checksum_version to 1, every heap/index/FSM/VM page carries a 16-bit checksum in its pd_checksum header field. The checksum is computed by pg_checksum_page() in checksum_impl.h, a deliberately SIMD-friendly algorithm built on the FNV-1a hash: the 8 KB page is treated as a 32-column array of uint32 values, 32 partial FNV sums (each seeded from a distinct random offset basis) are advanced in parallel with the mixing primitive hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17), two extra zero rounds finish the avalanche, the 32 partials are xor-folded to a uint32, the block number is mixed in to catch transposed pages, and the result is reduced to a 1..65535 uint16 (never zero). On write, PageSetChecksumCopy() (shared buffers, under shared lock so a private copy is taken to dodge concurrent hint-bit writes) or PageSetChecksumInplace() (local buffers, bulk_write, hash overflow pages) stamps pd_checksum just before smgrwrite. On read, PageIsVerified() recomputes and compares; a mismatch raises a WARNING/ERROR (ERRCODE_DATA_CORRUPTED, 'page verification failed'), bumps pg_stat_database.checksum_failures, and — unless ignore_checksum_failure or zero_damaged_pages is set — refuses the page. The checksum is NOT in the WAL-protected image and is recomputed on every flush, so it detects corruption introduced *below* PostgreSQL (storage, kernel, RAM bit-flips on a clean page) rather than logic bugs. This doc covers the FNV algorithm, the write/read call sites, the cluster-wide enable in pg_control, and the failure-reporting path; it defers page-header layout to postgres-page-layout.md and buffer eviction/flush mechanics to postgres-buffer-manager.md.
  • PostgreSQL Free Space Map — Tracking Reusable Space per PageThe Free Space Map (FSM) is a separate relation fork that lets an inserter find a page with enough free space in O(log n) page reads instead of scanning the whole relation. It quantizes per-page free space into one byte (256 categories of BLCKSZ/256 each) so the whole map stays tiny, and arranges those bytes as a two-level tree: a binary max-heap *within* each FSM page (non-leaf node = max of its two children) plus a fanout tree *across* FSM pages (a leaf in an upper page mirrors the root of a lower page). fsm_search_avail walks the in-page tree by 'expanding a search triangle' to the right of a round-robin start slot (fp_next_slot) so concurrent backends spread their inserts; fsm_set_avail bubbles a new value up to the root. The map is deliberately NOT WAL-logged — writes use MarkBufferDirtyHint, reads use RBM_ZERO_ON_ERROR, and a battery of self-correcting checks (rebuild-on-corruption, fsm_does_block_exist guard, VACUUM's FreeSpaceMapVacuum upper-page refresh) repair the lossy approximation. The heap calls GetPageWithFreeSpace / RecordAndGetPageWithFreeSpace from hio.c to place inserts; indexfsm.c reuses the same machinery with a binary free/in-use encoding to recycle whole index pages. This doc covers the byte encoding, the in-page binary tree, the cross-page fanout addressing, the search/update/vacuum workhorses, and the heap insert path; it defers heap insertion mechanics to postgres-heap-am.md, fork I/O to postgres-smgr-md.md, and the VACUUM driver to postgres-vacuum.md.
  • PostgreSQL GIN — Generalized Inverted Index, Posting Trees, and the Fast-Update ListGIN is PostgreSQL's inverted index: it indexes the *keys* extracted from composite items (array elements, tsvector lexemes, jsonb paths) rather than the items themselves, and is optimized for the case where one key value occurs in many heap rows. The on-disk structure is a Lehman-Yao B-tree of key entries (the 'entry tree') where each leaf tuple holds either an inline compressed 'posting list' of heap TIDs or, once that list outgrows a page, a downlink to a 'posting tree' — a second B-tree keyed by ItemPointer. Because the same key appears in many rows, retail insertion is expensive, so GIN offers an opt-in 'fast update' path: gininsert appends new entries to an unordered singly-linked 'pending list' anchored on the metapage, and a later ginInsertCleanup merges them in bulk into the entry tree. A scan therefore reads the pending list first (linearly) and then the main index. Queries drive the opclass's extractQuery + consistent functions: keyGetItem merges per-entry TID streams in TID order, calling the consistent function to decide whether a heap row matches; partial-match scans collect a bitmap of all keys in a range. There is no delete in the entry tree (the key vocabulary is assumed stable), so VACUUM only strips dead TIDs from posting lists/trees and reclaims empty posting-tree pages under a cleanup lock. This doc covers the entry tree, posting lists vs posting trees, the fast-update pending list and ginInsertCleanup, multi-key extraction, partial-match scans, the consistent-function scan engine, and GIN vacuum; it defers the AM dispatch contract to postgres-index-am.md, the L&Y algorithm itself to postgres-nbtree.md, and text-search opclasses to postgres-full-text-search.md.
  • PostgreSQL GiST — Generalized Search Trees and Opclass Support FunctionsGiST (Generalized Search Tree) is PostgreSQL's framework index AM: the access method supplies a balanced, right-linked, height-balanced tree skeleton and crash-safe concurrency, while an opclass plugs in the *meaning* of a key through a small set of support functions — consistent (does this key/predicate overlap the query?), union (the bounding predicate of a set of children), penalty (cost of widening a key to cover a new entry), picksplit (how to partition an overfull page), plus compress/decompress, equal, distance (for k-NN ORDER BY), and fetch (for index-only scans). gisthandler wires these into an IndexAmRoutine; initGISTstate caches the FmgrInfo for each. Inserts descend via gistdoinsert along the path of least penalty (gistchoose/gistpenalty), adjusting downlinks on the way down so the tree is self-consistent after each step; a full page splits in gistplacetopage using the opclass picksplit (gistSplitByKey), and like nbtree the split is decomposed into WAL-atomic steps — the new right page is linked and flagged F_FOLLOW_RIGHT, then the downlink is posted to the parent and the page's NSN (a special-purpose LSN) is set. Searches (gistScanPage) keep only one page locked at a time and use the NSN-vs-parent-LSN comparison to detect a concurrent split and chase the right-link, exactly the nbtree recovery idiom adapted to a key whose ordering is opclass-defined. Builds use a sortsupport-driven bottom-up sorted build when available, else a cache-aware buffering build (Arge et al.) that turns random I/O into breadth-first batches. This doc covers the support-function contract, NSN + right-link concurrency, the penalty-driven insert and picksplit-driven split, ordered/k-NN search via a pairing heap, and the three build strategies; it defers the AM dispatch contract to postgres-index-am.md, the space-partitioned cousin to postgres-spgist.md, and block-range summaries to postgres-brin.md.
  • PostgreSQL Hash Index — Linear Hashing, Buckets, and Overflow PagesPostgreSQL's hash AM is a disk-resident linear hash table (Seltzer & Yigit 1991): a key's 32-bit hash code is masked into a bucket number that the table grows one bucket at a time, never doubling its file in a single step. Each bucket is a chain — one permanently-placed primary page plus zero or more recyclable overflow pages — doubly linked through the page special space. A metapage (block 0) holds the live bucket count (hashm_maxbucket), the highmask/lowmask pair that defines the current and next address space, and the hashm_spares[] splitpoint array that lets BUCKET_TO_BLKNO compute a bucket's physical block without moving any page. Growth happens by incremental split: when an insert pushes tuples-per-bucket past the fill factor, _hash_expandtable picks bucket (new_bucket & lowmask), takes cleanup locks on old and new primary pages, and _hash_splitbucket re-partitions the old bucket's tuples, marking moved tuples INDEX_MOVED_BY_SPLIT so concurrent scans don't double-count. Concurrency is page-content locks plus pins plus cleanup locks: a scan pins the primary bucket page for its whole lifetime (so a split or squeeze can't reorganize underneath it) and copies a whole page's matches into backend-local storage so it holds no index lock while touching the heap. Each backend caches the metapage in its relcache and validates the cache against hasho_prevblkno (which on a primary page stores maxbucket-as-of-last-split) to detect a stale mapping after a concurrent split. Every mutation — insert, overflow add, split allocate/move/complete, squeeze, vacuum — is a single atomic WAL action whose intermediate states leave a valid, searchable index. This doc covers the metapage and addressing, _hash_first/_hash_readpage scans, _hash_doinsert, _hash_expandtable/_hash_splitbucket, overflow add/free/squeeze in hashovfl.c, and bucket-level locking; it defers the AM dispatch contract to postgres-index-am.md, B-tree concurrency to postgres-nbtree.md, and buffer pins/cleanup locks to postgres-buffer-manager.md.
  • PostgreSQL Heap Access Method — No-Overwrite MVCC Tuples, Slotted Pages, and HOTPostgreSQL's default table AM stores every row version in place on a slotted heap page: an INSERT writes a tuple stamped with xmin, an UPDATE/DELETE never overwrites it but stamps xmax and (for UPDATE) writes a new version whose ItemId the old tuple's t_ctid links forward to. Visibility is decided per-tuple by reading xmin/xmax against a snapshot, so dead versions accumulate until HOT pruning (single-page) or vacuum reclaims them. HOT lets an UPDATE that touches no indexed column place the new version on the same page with no new index entry, chaining it off the root line pointer so the index still finds it.
  • PostgreSQL Index Access Method — IndexAmRoutine and the Generic Index LayerThe Index Access Method (Index AM) API defines the contract between PostgreSQL's generic index layer and individual index implementations (btree, hash, GiST, GIN, SP-GiST, BRIN). IndexAmRoutine is the function-pointer table stored on each index relation's rd_indam field; indexam.c provides the stable index_ wrappers that dispatcher code calls; genam.c supplies scan-descriptor allocation and system-catalog scan helpers. The interface predates TableAmRoutine and shapes it: index_beginscan opens both the AM scan and a table_index_fetch_begin session on the heap side, index_getnext_tid drives amgettuple, and index_fetch_heap calls table_index_fetch_tuple to resolve each TID to a visible heap tuple.
  • PostgreSQL Large Objects — pg_largeobject Chunked Storage and the lo APIPostgreSQL's large object (LO) facility stores a byte stream of up to 4 TB outside any one row by chopping it into fixed LOBLKSIZE = BLCKSZ/4 = 2 KB chunks, each chunk a tuple of the pg_largeobject system catalog keyed (loid, pageno); a separate pg_largeobject_metadata row records ownership and ACL so the object can exist with size zero and have permissions checked. The inversion API in inv_api.c (inv_create/open/read/write/lseek/tell/truncate/drop) presents a POSIX-file abstraction over that catalog: a read is a btree index scan on (loid, pageno >= start) that streams chunks in order and synthesizes zero bytes for sparse holes, while a write does a read-modify-write of each 2 KB chunk via heap_modify_tuple + CatalogTupleUpdate, inserting brand-new chunk tuples past the current end. Because every chunk is an ordinary MVCC heap tuple, large objects are fully transactional and crash-safe for free — a write is rolled back by transaction abort, vacuumed by autovacuum, and WAL-logged like any catalog change — at the cost of MVCC bloat (each rewrite leaves a dead tuple) and no streaming over the wire. The server-side be_lo_* fmgr wrappers in be-fsstubs.c add a per-transaction file-descriptor table (the cookies array in a private fscxt memory context), snapshot registration so a read-mode LO sees a stable view, lo_import/lo_export to/from server files, and the lo_get/lo_put/lo_from_bytea whole-object helpers. This doc covers LOBLKSIZE chunking, the two catalogs, inv_* read-modify-write and sparse-hole semantics, the FD/snapshot/subxact lifecycle, and transactional guarantees; it defers heap tuple mechanics to postgres-heap-am.md and in-row varlena/TOAST compression to postgres-toast.md.
  • PostgreSQL B-Tree (nbtree) — Lehman-Yao Concurrency, Splits, and Index-Tuple DeletionPostgreSQL's default index AM is a Lehman-Yao B-tree: every page carries a right-link to its sibling and a high key that upper-bounds its keyspace, so a reader holding only one page lock at a time can detect a concurrent split (search key > high key) and recover by following the right-link instead of locking the whole root-to-leaf path. Inserts descend with latch-coupling (lock child, release parent), and a full page splits in a single atomic WAL action that writes the new right page, fixes the right sibling's left-link, and flags the left page INCOMPLETE_SPLIT until a second action posts the downlink to the parent — so a crash mid-split leaves a tree that searches still traverse via the right-link. Index tuples are removed lazily (LP_DEAD hint-marking by scans, simple/bottom-up deletion, and VACUUM's btbulkdelete), and empty pages are deleted in a two-stage half-dead protocol, then deferred into the FSM only once no snapshot can still hold a reference. Deduplication folds equal non-pivot tuples into posting-list tuples as a last line of defense before a split. This doc covers the L&Y invariants, _bt_search/_bt_moveright descent, _bt_insertonpg/_bt_split, page deletion, and the bthandler/IndexAmRoutine surface; it defers the AM dispatch contract to postgres-index-am.md, page latching to postgres-buffer-manager.md, and SQL-level locks to postgres-lock-manager.md.
  • PostgreSQL Page Layout — Slotted Page Format, ItemId Indirection, and Checksum MechanicsHow PostgreSQL lays out a fixed-size disk page: PageHeaderData with pd_lsn/pd_lower/pd_upper/pd_special offsets, a growing ItemId (line-pointer) array and a shrinking tuple area meeting in a free-space gap, ItemIdData's three-field bit-packed slot encoding four LP_* states (UNUSED/NORMAL/REDIRECT/DEAD), the 6-byte ItemPointerData TID that crosses the page boundary into the heap, and the 16-bit fold-and-XOR checksum computed per block number to catch storage corruption.
  • PostgreSQL Sequences — Relation-Backed Generators, Caching, and WALA PostgreSQL sequence is not a special in-memory counter but an ordinary relation with relkind RELKIND_SEQUENCE that holds exactly one heap tuple of FormData_pg_sequence_data {last_value, log_cnt, is_called} on block 0, while its immutable parameters (start, increment, min, max, cache, cycle, typid) live in the pg_sequence catalog row reached through the SEQRELID syscache. nextval_internal is the hot path: it consults a per-backend SeqTable hash entry (keyed by relid) that caches a [last, cached] window of already-allocated values so most calls return without touching the buffer at all, and only when the window is exhausted does it ex-lock block 0, walk the increment/min/max/cycle arithmetic for up to seqcache values, and write the tuple back. To avoid one WAL record per value it pre-logs SEQ_LOG_VALS (32) extra increments: a single XLOG_SEQ_LOG record (xl_seq_rec + the full tuple, REGBUF_WILL_INIT) publishes the *future* last_value and resets log_cnt, so the next 32-ish fetches advance only the in-buffer log_cnt counter with no further WAL — at the documented cost that a crash silently skips the unissued pre-logged values, which is why sequences guarantee uniqueness and monotonicity but never gap-freedom. The first nextval after a checkpoint is force-logged (PageGetLSN(page) <= GetRedoRecPtr()) so redo from the checkpoint cannot lose the advance. nextval/setval are deliberately non-transactional (an aborted xact does not give the number back) and forbidden in parallel mode and on read-only transactions for non-temp sequences; ALTER SEQUENCE and ResetSequence, by contrast, ARE transactional because they rewrite the whole relation via RelationSetNewRelfilenumber under AccessExclusiveLock. The sequence 'access method' (sequence_open/sequence_close in access/sequence/sequence.c) is a thin relation_open wrapper that only asserts relkind. Redo is reinit-from-scratch: seq_redo rebuilds block 0 in palloc'd workspace and memcpy's it in so a hot-standby reader never sees a torn page. This doc covers the one-row relation layout, the SeqTable cache, nextval's WAL/log_cnt batching, SET/RESET semantics, and the RM_SEQ_ID resource manager; it defers heap tuple mechanics to postgres-heap-am.md, WAL insertion/redo to postgres-xlog-wal.md, and the SEQRELID catalog cache to postgres-catcache-syscache.md.
  • PostgreSQL smgr & md — The Storage Manager Switch and the Magnetic-Disk DriverHow PostgreSQL routes every page-level I/O through a per-backend SMgrRelation handle table (smgr.c) and the single concrete driver beneath it (md.c), which maps a logical relation+fork+block triple to POSIX file segments, manages the segment vector per fork, and feeds fsync obligations to the checkpointer via the sync subsystem.
  • PostgreSQL SP-GiST — Space-Partitioned Trees on DiskSP-GiST (space-partitioned GiST) is PostgreSQL's framework for mapping pointer-linked, unbalanced, space-partitioning trees — quadtrees, k-d trees, and radix tries — onto fixed-size disk pages with high fanout. Logically the tree is a set of inner tuples (each an optional prefix plus an array of label,pointer nodes) and leaf tuples (a leaf datum, a heap TID, and a nextOffset chain link); the core requires that any one inner tuple and any one leaf-tuple chain fit on a single page, so a search touches only a few pages even when it crosses many logical nodes. Five opclass methods drive everything: config declares the prefix/label/leaf types and whether long values can be suffixed; choose decides per-insert whether to descend (MatchNode), grow the node array (AddNode), or break a too-specific prefix into prefix+postfix (SplitTuple); picksplit turns an overflowing leaf chain into one inner tuple fanning out to several leaf lists; and inner_consistent / leaf_consistent prune the search at inner tuples and test final leaf datums. Concurrency uses conditional latch-coupling plus triple-parity page placement to bound deadlocks, and redirect tuples keep in-flight scans correct after a chain moves. VACUUM does a single sequential pass, converting expired redirects to placeholders and chasing a pending list to catch tuples that PickSplit moved onto already-scanned pages. This doc walks spgdoinsert, spgWalk/spgInnerTest, the opclass surface, and spgvacuum; it defers GiST's lossy-key model to postgres-gist.md and the AM dispatch contract to postgres-index-am.md.
  • PostgreSQL Table Access Method — Pluggable Storage Dispatch via TableAmRoutinePostgreSQL 12 introduced the Table Access Method API — a function-pointer table (TableAmRoutine) stored on each relation's relcache entry — so that executor code calling table_tuple_insert, table_beginscan, and table_scan_getnextslot dispatches to whichever storage implementation the relation was created with. The only in-tree AM at REL_18 is heap (heapam_methods). The contract covers six functional areas: slot callbacks, sequential scan lifecycle, index-fetch, DML (insert/update/delete/lock), DDL/vacuum/analyze, and miscellaneous sizing. TM_Result is the shared DML outcome vocabulary; ScanOptions flags carry scan-type and behavior hints to scan_begin.
  • PostgreSQL Table Sampling — The TSM API, SYSTEM and BERNOULLIPostgreSQL implements the SQL:2003 TABLESAMPLE clause through a small pluggable interface, the Tablesample Method (TSM) API. A method is a handler function returning a TsmRoutine struct of callbacks: SampleScanGetSampleSize feeds the planner page/tuple estimates, BeginSampleScan validates the percentage argument and turns it into an integer cutoff, NextSampleBlock chooses which heap blocks to read, and NextSampleTuple chooses which offsets within a block to return. The executor node nodeSamplescan.c drives these callbacks via the table AM's scan_sample_next_block / scan_sample_next_tuple, so the same TSM works over any table AM. Two methods ship in core: SYSTEM samples at block granularity (it accepts or rejects whole pages, so it is cheap — it reads only a fraction of pages — but clustered, since all rows on an accepted page are returned), while BERNOULLI samples at row granularity (it must read every page but flips an independent coin per tuple, giving a true simple-random row sample). Both achieve history-independent, repeatable sampling by hashing the candidate identifier (block number for SYSTEM; block+offset for BERNOULLI) together with a seed and comparing the 32-bit hash against cutoff = round((2^32) * percent/100); a REPEATABLE(seed) clause makes the seed deterministic across queries, otherwise a per-scan random seed is drawn. This doc covers the TsmRoutine contract, the SYSTEM vs BERNOULLI cutoff-hash mechanism, repeatability, the planner size/cost glue and the executor/heap-AM driver loop; it defers the general SeqScan/IndexScan node machinery to postgres-scan-nodes.md and the index AM surface to postgres-index-am.md.
  • PostgreSQL TOAST — Oversized Attribute Storage, Compression, and DetoastingTOAST (The Oversized-Attribute Storage Technique) is PostgreSQL's mechanism for storing variable-length attributes that exceed a page boundary. Attributes larger than ~2 KB trigger a four-round toaster that first tries inline PGLZ or LZ4 compression and, if still too large, slices the datum into TOAST_MAX_CHUNK_SIZE chunks stored in a per-relation pg_toast_<OID> heap. The inline datum is replaced by an 18-byte varatt_external pointer. Detoasting is on-demand and transparent: detoast_attr fetches and reassembles chunks via an index scan on (valueid, chunkidx). Per-column storage strategy (PLAIN/MAIN/EXTENDED/EXTERNAL) governs whether compression and out-of-line storage are attempted.
  • PostgreSQL Visibility Map — The Two-Bit Per-Page Bitmap That Drives Vacuum Skipping and Index-Only ScansHow PostgreSQL's visibility map stores two bits per heap page (all-visible and all-frozen) in a dedicated fork, lets VACUUM skip clean pages and index-only scans skip heap fetches, and keeps both bits crash-safe through coordinated WAL logging with the heap page's PD_ALL_VISIBLE flag.

Transaction & Recovery (14)

  • PostgreSQL Autovacuum — The Launcher, Workers, and Anti-Wraparound SchedulingHow PostgreSQL schedules maintenance without a global timetable: a single launcher process keeps a per-database round-robin list in shared memory and signals the postmaster to fork short-lived per-database workers; each worker scans pg_class, decides per table whether dead tuples or inserts crossed a threshold + scale-factor equation (or whether relfrozenxid/relminmxid crossed a freeze age, forcing an anti-wraparound vacuum that even a disabled autovacuum cannot skip), claims tables through a schedule lock, and divides a shared I/O cost budget across the live workers by writing a balanced cost limit into shared memory.
  • PostgreSQL Checkpoint — The Durability Anchor: redo-pointer fixation, buffer flush, and WAL truncationPostgreSQL checkpoints are performed exclusively by the checkpointer auxiliary process (B_CHECKPOINTER), which runs CreateCheckPoint for normal operation and CreateRestartPoint during recovery. An online checkpoint first writes XLOG_CHECKPOINT_REDO to pin the redo pointer, then flushes all dirty shared buffers and SLRUs via CheckPointGuts, writes the XLOG_CHECKPOINT_ONLINE record, updates pg_control (ControlFileData), and recycles old WAL segments. The CheckpointerShmemStruct request queue lets backends forward fsync requests and wait for checkpoint completion via ckpt_started/ckpt_done condition variables. Checkpoint pacing is governed by checkpoint_completion_target, CheckpointWriteDelay, and IsCheckpointOnSchedule.
  • PostgreSQL CLOG & Commit Timestamp — Transaction Status Bitmaps, Subtransaction Parentage, and the commit_ts SLRUpg_xact (CLOG) stores a 2-bit commit/abort/sub-committed status for every XID in SLRU pages, using a group-update optimisation to batch concurrent commits under a single bank lock. pg_subtrans records each subtransaction's immediate parent XID — volatile across crashes, sufficient only for the span of open transactions. pg_commit_ts, gated by track_commit_timestamp, stores an 8+2-byte (TimestampTz, RepOriginId) entry per XID in a third SLRU. All three are simple clients built on top of the postgres-slru substrate.
  • PostgreSQL MultiXact — Multiple Lockers and Updaters on One TupleA heap tuple has exactly one xmax field, but row-level locking lets many transactions hold compatible locks on the same tuple at once (FOR KEY SHARE, FOR SHARE) and a locker can coexist with a committed updater. PostgreSQL squeezes this many-to-one relationship into the single xmax by replacing the bare XID with a MultiXactId (MXID) — a small integer that names a stored array of (xid, status) members — and flagging the tuple HEAP_XMAX_IS_MULTI. MXIDs are backed by two parallel SLRUs: pg_multixact/offsets maps an MXID to the starting position of its member array, and pg_multixact/members holds the variable-length (xid, status) arrays packed in 5-word groups (4 flag bytes + 4 xids). MultiXactIdCreate builds a two-member multi; MultiXactIdExpand reads an existing multi, drops dead lockers, and re-creates a larger one (multis are immutable — you never mutate the member set, you allocate a new MXID). Six member statuses (ForKeyShare, ForShare, ForNoKeyUpdate, ForUpdate, NoKeyUpdate, Update) encode lock strength plus whether the member updated. Like XIDs, MXIDs are a 32-bit space that wraps, so a tuple carrying an old MXID must be frozen by VACUUM before relminmxid falls within the wraparound horizon — FreezeMultiXactId rewrites or drops the xmax — and a second, often tighter constraint guards the members address space (its own 32-bit offset counter). This doc covers the MXID lifecycle, the offsets/members SLRU layout, create/expand/read, member-space wraparound, and multixact freezing; it defers the SLRU buffer machinery to postgres-slru.md, the heavyweight tuple lock table to postgres-lock-manager.md, and the XID side of freezing to postgres-xid-wraparound-freeze.md.
  • PostgreSQL MVCC Snapshots — Visibility, the procarray Census, and Isolation LevelsA PostgreSQL MVCC snapshot is a SnapshotData struct — xmin, xmax, and an xip[] array of in-progress XIDs — built by GetSnapshotData taking a one-pass census of the procarray under ProcArrayLock; HeapTupleSatisfiesMVCC then decides per-tuple visibility by comparing the tuple's xmin/xmax against that snapshot via XidInMVCCSnapshot, caching its verdict in the tuple's infomask hint bits. READ COMMITTED takes a fresh snapshot per statement; REPEATABLE READ reuses one snapshot for the whole transaction. The oldest live xmin across all backends is the vacuum horizon that gates dead-version reclamation.
  • PostgreSQL procarray — The Live-XID Census Behind Every SnapshotHow PostgreSQL keeps a shared-memory census of every backend's running XID — a sorted PGPROC index array plus dense ProcGlobal mirror arrays (xids/subxidStates/statusFlags) — and uses it to build each snapshot's xmin/xmax/xip in one ProcArrayLock-shared pass, to compute the four vacuum xmin horizons via ComputeXidHorizons, to clear committed XIDs in groups under contention (ProcArrayGroupClearXid), and to emulate primary transactions on a hot standby through the sorted KnownAssignedXids array.
  • PostgreSQL Recovery & Redo — Crash Recovery, PITR, and Hot StandbyPostgreSQL recovery has three modes (crash, archive/PITR, standby) unified in one startup-process loop: InitWalRecovery reads pg_control and signal files to select the mode; PerformWalRecovery iterates ReadRecord/ApplyWalRecord until the recovery target or end-of-WAL; FinishWalRecovery seals the WAL end-point for normal operation. ApplyWalRecord dispatches each decoded record to its rmgr's rm_redo callback, enforcing the idempotency invariant (skip if PageLSN >= record-LSN), tracking the lastReplayedEndRecPtr watermark in XLogRecoveryCtlData, and checking for timeline switches. The WAL prefetcher (XLogPrefetcher / LsnReadQueue) lookaheads decoded records and issues async kernel read hints for referenced buffer pages, hiding I/O latency behind replay CPU work. Timeline management (readTimeLineHistory / writeTimeLineHistory / tliSwitchPoint) gives each promotion its own integer ID, and a .history file records the fork point; PITR and standby both scan the history chain to find the correct WAL branch. Hot standby opens query connections once CheckRecoveryConsistency signals PMSIGNAL_RECOVERY_CONSISTENT, and tracks in-flight primary XIDs via RecordKnownAssignedTransactionIds.
  • PostgreSQL SLRU — Simple LRU Buffering for Wrap-Around-Able MetadataHow PostgreSQL's SLRU substrate allocates a fixed shared-memory pool of BLCKSZ page buffers organized into fixed-size banks, drives LRU eviction within each bank, multiplexes bank-level and per-buffer LWLocks for concurrent access, and provides a WAL-before-data write path, truncation, and a scannable on-disk segment layout — serving clog, subtrans, commit_ts, multixact, notify, and serializable isolation as plug-in clients.
  • PostgreSQL Two-Phase Commit — Global Transactions, Dummy PGPROCs, and WAL-First State PersistencePREPARE TRANSACTION reserves a GID and a dummy PGPROC in the shared TwoPhaseState array, assembles a serialized state blob (subxact XIDs, pending relation drops, cache-invalidation messages, 2PC rmgr records) into a WAL record via StartPrepare/EndPrepare, and leaves the XID visible in the ProcArray under the dummy proc until a separate backend calls COMMIT PREPARED or ROLLBACK PREPARED via FinishPreparedTransaction; state migrates lazily from WAL to pg_twophase files at checkpoint time, and RecoverPreparedTransactions restores all prepared xacts from that combined WAL+file store at end-of-recovery.
  • PostgreSQL Vacuum — Dead-Tuple Reclamation, Freezing, and XID Wraparound PreventionHow PostgreSQL reclaims dead heap tuple space and prevents transaction-ID wraparound: a three-phase loop (heap scan → index bulk-delete → heap LP_DEAD reap) driven by lazy_scan_heap, LVRelState tracking per-relation cutoffs (OldestXmin, FreezeLimit, MultiXactCutoff), VacuumCutoffs established once per VACUUM, cost-based pacing via vacuum_delay_point, visibility-map-guided page skipping, aggressive vs. normal mode distinguishing freeze-required scans from ordinary dead-tuple collection, a wraparound failsafe that disables index vacuuming when relfrozenxid is dangerously old, and parallel index vacuuming via ParallelVacuumState in DSM.
  • PostgreSQL WAL Records & Resource Managers — The rmgr Table and Record AnatomyPostgreSQL's WAL is a single physical byte-stream that multiplexes the redo logs of every subsystem, and the resource-manager (rmgr) table is the dispatch layer that demultiplexes it. Each XLogRecord carries an 8-bit xl_rmid that indexes RmgrTable[], a static array built by #include-ing rmgrlist.h through the PG_RMGR X-macro; each entry holds up to eight callbacks (redo, desc, identify, startup, cleanup, mask, decode) and the 24-entry REL_18 set runs XLOG, Transaction, Storage, CLOG, Database, Tablespace, MultiXact, RelMap, Standby, Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist, BRIN, CommitTs, ReplicationOrigin, Generic, and LogicalMessage. A record is a fixed 24-byte header (xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, xl_crc) followed by self-describing chunks: per-block headers that name a (relfilelocator, fork, block) and may carry a full-page image with the all-zero 'hole' stripped, plus a main-data blob. The high 4 bits of xl_info are the rmgr's private opcode; redo routines switch on it and use XLogReadBufferForRedo to re-pin each registered buffer, replaying only when the page LSN is older than the record. generic_xlog lets extensions WAL-log arbitrary byte deltas of standard pages without a custom rmgr by diffing before/after images into fragments. rmgrdesc routines (desc/identify) drive pg_waldump's human-readable output. This doc covers the rmgr dispatch table, XLogRecord/block-reference anatomy, FPIs, the generic_xlog delta machinery, redo-time buffer access, and custom rmgr registration; it defers WAL insertion/flush internals to postgres-xlog-wal.md and the recovery driver loop to postgres-recovery-redo.md.
  • PostgreSQL Transaction Management — The Commit State Machine, Subtransactions, and 2PC HooksA PostgreSQL backend runs every statement through a two-level state machine: a low-level TransState (DEFAULT/START/INPROGRESS/COMMIT/ABORT/PREPARE) that tracks what the engine is doing, and a high-level TBlockState (TBLOCK_*) that tracks what the client's BEGIN/COMMIT/SAVEPOINT block wants. XIDs are lazy — GetCurrentTransactionId calls AssignTransactionId only when a write first needs one, so read-only transactions never consume an XID; GetNewTransactionId hands out the next FullTransactionId under XidGenLock and stores it into the ProcArray before releasing the lock. CommitTransaction is a fixed order of operations whose durable instant is RecordTransactionCommit: it writes one XLOG_XACT_COMMIT record, flushes (unless synchronous_commit=off), then marks pg_xact via TransactionIdCommitTree, then clears the ProcArray; abort is the mirror image but never has to flush because a missing record is read as 'aborted'. Subtransactions (savepoints) are a stack of TransactionStateData nodes; a subxact's committed children are merged up into the parent's childXids array and only the top-level commit record durably commits the whole tree atomically. Two-phase prepare reuses the same record-writing helpers but is its own document.
  • PostgreSQL XID Wraparound and Freeze — Transaction ID Allocation, Limit Enforcement, and Tuple FreezingHow PostgreSQL prevents XID wraparound data loss: the 32-bit XID modular arithmetic and the 2^31-window comparator, TransamVariablesData four-limit ladder (xidVacLimit / xidWarnLimit / xidStopLimit / xidWrapLimit) updated by SetTransactionIdLimit, GetNewTransactionId allocating under XidGenLock with CLOG extension before counter advance, vacuum_get_cutoffs deriving FreezeLimit from vacuum_freeze_min_age, and heap_prepare_freeze_tuple building per-tuple HeapTupleFreeze plans that heap_freeze_execute_prepared applies to stamp HEAP_XMIN_FROZEN in t_infomask — making tuples permanently visible and allowing relfrozenxid to advance so CLOG can be truncated.
  • PostgreSQL Write-Ahead Log — Record Insertion, LSNs, and the Durability SpinePostgreSQL's WAL is one append-only byte stream, and an LSN is literally a byte offset into it. A backend builds a record with XLogBeginInsert / XLogRegister* / XLogInsert: the assembler (XLogRecordAssemble) packs an rmgr-tagged header, registered data chunks, and optionally a full-page image into an XLogRecData chain; XLogInsertRecord then runs a two-step insert — a spinlock-guarded byte reservation (ReserveXLogInsertLocation bumps CurrBytePos) that hands back the LSN, followed by a lock-free copy into the WAL buffers under one of NUM_XLOGINSERT_LOCKS=8 insertion LWLocks. Durability is separated from insertion: XLogFlush/XLogWrite push bytes to the kernel and issue_xlog_fsync forces them down, gated by the three running watermarks Insert<=Write<=Flush. Two rules make this the engine's spine — (1) the page's LSN gates the buffer manager (a dirty page cannot be written until WAL is flushed to that LSN), and (2) COMMIT durability is exactly XLogFlush(commit-LSN). Full-page writes defeat torn-page hazards by stashing a whole page image on its first change after a checkpoint. Defers redo to recovery-redo, the rmgr catalog to wal-records-rmgr.

Query Processing (20)

  • PostgreSQL Aggregation, Sort, and Window ExecutorsPostgreSQL's blocking/streaming pipeline operators all live inside the same demand-pull PlanState tree but differ in how much input they must buffer before they can return a tuple. Agg (nodeAgg.c) runs four strategies — AGG_PLAIN, AGG_SORTED, AGG_HASHED, AGG_MIXED — chosen by the planner: sorted grouping streams one group at a time off pre-ordered input via group-boundary comparisons, hashed grouping builds per-grouping-set TupleHashTables and, when the table exceeds hash_mem, enters spill mode (hash_agg_enter_spill_mode) writing overflowing tuples to partitioned logical tapes for recursive reprocessing; grouping sets are handled by a chain of phases that share one ordered pass, with AGG_MIXED interleaving a hash phase and sort phases. Sort (nodeSort.c) is a pure blocking operator that drains its child into tuplesort.c on the first ExecSort call (datum sort for a single column, heap sort otherwise) then streams sorted tuples back. IncrementalSort (nodeIncrementalSort.c) exploits an already-sorted prefix: it sorts one presorted-prefix group at a time in a four-state machine (INCSORT_LOADFULLSORT / READFULLSORT / LOADPREFIXSORT / READPREFIXSORT), letting it return rows early and bound memory — a major LIMIT win. WindowAgg (nodeWindowAgg.c) buffers each partition in a tuplestore and evaluates window functions over per-row frames via the WindowObject API, using the aggregate inverse-transition function to slide moving-frame aggregates incrementally instead of recomputing them. Sort ordering and the tuplesort engine itself are owned by postgres-tuplesort.md; the transition-function expression machinery by postgres-expression-eval.md; the PlanState/slot framework by postgres-executor.md.
  • PostgreSQL Parse Analysis — Transforming the Raw Tree into a QueryParse analysis is the second half of the parser: the grammar (gram.y) produces a purely syntactic RawStmt that names no catalog objects, and transformStmt walks that tree to produce a semantically-resolved Query — every table looked up in the catalog and assigned a range-table index, every column reference resolved to a Var carrying (varno, varattno, type, collation), every clause type-checked. The whole pass threads a single mutable ParseState (pstate) whose p_rtable accumulates RangeTblEntry records, whose p_namespace tracks which RTEs are name-visible at the current point, and whose p_next_resno hands out target-list column positions. transformSelectStmt is the spine for SELECT: it runs the FROM clause first (transformFromClause builds RTEs and the joinlist), then the target list (transformTargetList, with star-expansion and FigureColname auto-naming), then WHERE/HAVING (coerced to boolean), then ORDER BY before GROUP BY and DISTINCT because the latter two reuse the sort tlist, then LIMIT, windows, and a final collation-assignment and aggregate-legality pass. The output Query is read-only input to the rewriter and planner; pstate is freed immediately after.
  • PostgreSQL Cost Model — Page/CPU Costs and Path ComparisonPostgreSQL's cost model is the scalar scoring function the planner uses to compare physical access paths: every Path carries a (startup_cost, total_cost) pair in abstract units derived from five tunable GUCs — seq_page_cost (1.0), random_page_cost (4.0), cpu_tuple_cost (0.01), cpu_index_tuple_cost (0.005), cpu_operator_cost (0.0025). The cost_* family in costsize.c computes those pairs: cost_seqscan charges spc_seq_page_cost per page plus (cpu_tuple_cost + qual eval) per tuple; cost_index defers index-internal cost to the AM's amcostestimate then charges heap fetches via the Mackert-Lohman index_pages_fetched formula, interpolating between a perfectly-correlated (mostly sequential) and uncorrelated (all-random) bound by correlation-squared; cost_tuplesort models in-memory quicksort as comparison_cost * N * log2(N) and external merge sort with extra page I/O; the joins use a two-phase initial_cost_* / final_cost_* split so cheap lower bounds prune the search before the expensive qual-eval pass runs. startup_cost is the price to return the FIRST tuple (a sort must read all input before emitting; a seqscan can emit immediately), total_cost the price to drain the path — the two together let add_path keep a path that wins on either axis. Row counts come from the selectivity machinery (clauselist_selectivity in selfuncs) via set_baserel_size_estimates; cost_qual_eval turns pg_proc.procost into per-tuple CPU charges. The model is a direct descendant of Selinger's System R access-path selection.
  • PostgreSQL Executor — The Demand-Pull Plan-Node Tree and Tuple FlowThe PostgreSQL executor is a demand-pull (Volcano-style) iterator over a PlanState tree that mirrors the planner's read-only Plan tree one node for one node. ExecutorStart builds the state tree and a per-query EState; ExecutorRun pulls tuples by calling ExecProcNode on the root, which recurses down child nodes until a leaf scan returns the next TupleTableSlot, or an empty slot at end-of-data; ExecutorFinish drains ModifyTable nodes and fires AFTER triggers; ExecutorEnd tears the state tree and the per-query memory context down in one shot. The unit of dataflow is the TupleTableSlot — a uniform tuple handle with virtual / heaptuple / minimal / buffer-heap backings selected by a TupleTableSlotOps vtable, so any node can consume any other node's output without caring how the tuple is physically stored. Memory hygiene rests on two contexts: the per-query context that holds the whole state tree, and per-tuple ExprContexts reset once per tuple to bound intra-query allocation.
  • PostgreSQL Expression Evaluation — Flat Step Array, Dispatch Loop, and Fast PathsPostgreSQL compiles every expression tree into a flat ExprEvalStep array during executor startup, then dispatches steps at runtime via a switch loop or GCC computed-goto table inside ExecInterpExpr. ExecInitExpr walks the Expr tree once, emitting typed opcodes (EEOP_*) into the steps array while wiring argument slots directly into FunctionCallInfo structs to avoid runtime copies. ExecReadyExpr tries JIT first and falls back to ExecReadyInterpretedExpr, which installs fast-path evalfuncs for the most common single-step patterns — constant, simple Var fetch, and assign-Var — before committing the full interpreter or direct-threaded dispatch. The ExprContext provides per-tuple memory lifetime: callers switch into ecxt_per_tuple_memory before calling ExecEvalExpr and reset it once the tuple is done, bounding all intra-expression allocations.
  • PostgreSQL Statistics — ANALYZE, pg_statistic, and Extended StatisticsPostgreSQL's cardinality estimates rest on a sampled statistics pipeline. ANALYZE draws a fixed-size reservoir sample (300 * statistics_target rows via a two-stage block-sample + Vitter algorithm in acquire_sample_rows), and per column a type-specific compute_stats routine condenses the sample into a pg_statistic row: stanullfrac, stawidth, stadistinct (the Haas-Stokes Duj1 estimator n*d / (n - f1 + f1*n/N)), an MCV list (STATISTIC_KIND_MCV), an equi-depth histogram of the non-MCV residue (STATISTIC_KIND_HISTOGRAM), and a physical/logical correlation (STATISTIC_KIND_CORRELATION). The planner reads these back in selfuncs.c: var_eq_const probes the MCV list for an exact frequency or falls back to (1 - sumcommon - nullfrac) / otherdistinct; scalarineqsel binary-searches the histogram and interpolates within the bracketing bin. Because per-column stats assume column independence, multi-clause estimates compound multiplicatively and undershoot on correlated columns. Extended statistics (CREATE STATISTICS, stored in pg_statistic_ext_data) fix this for declared column groups via three kinds: functional dependencies (degree of validity measured by sort-and-group in dependency_degree), multivariate n-distinct (Duj1 over column combinations), and multivariate MCV lists. The estimator combines correlated clauses with P(a,b) = f * Min(P(a),P(b)) + (1-f) * P(a) * P(b) instead of the independence product.
  • PostgreSQL JIT — LLVM Compilation of Expressions and Tuple DeformingHow PostgreSQL turns the interpreted expression-evaluation and tuple-deforming hot paths into native code at query time: the provider-independent jit.c wrapper that loads the llvmjit shared library on demand, the per-opcode LLVM IR generation in llvmjit_expr.c that mirrors execExprInterp.c step-for-step, the tupledesc-specialized deform function built in llvmjit_deform.c, bitcode-based inlining of operator bodies, the cost thresholds (jit_above_cost / jit_inline_above_cost / jit_optimize_above_cost) the planner uses to set per-query JIT flags, and the lazy compile-on-first-call discipline that batches emission for a whole query.
  • PostgreSQL Join Executors — Nested Loop, Merge Join, and Hash JoinPostgreSQL implements the three classical binary join algorithms as demand-pull PlanState nodes that each return one joined TupleTableSlot per ExecProcNode call. Nested loop (ExecNestLoop) scans the inner subtree once per outer tuple; when the inner is a parameterized index scan it pushes outer column values into PARAM_EXEC slots and ExecReScans the inner so the rescan is an index probe, not a rescan-from-disk. Merge join (ExecMergeJoin) is an 11-state machine over two pre-sorted inputs that advances whichever side is behind, marks the first inner tuple of an equal group with ExecMarkPos, and ExecRestrPos back to that mark to re-join duplicate outer keys. Hash join (ExecHashJoinImpl) is a 6-state hybrid-hash machine: the inner Hash child is built into an in-memory chained hash table by MultiExecHash / ExecHashTableInsert, the outer is probed bucket-by-bucket via ExecScanHashBucket, and when the build exceeds the hash_mem budget ExecHashIncreaseNumBatches doubles the batch count and spills non-current-batch tuples to per-batch BufFiles, processed one batch at a time. All three share the JoinState base (joinqual vs. otherqual, single_match, null-fill slots for outer joins) and project results through ExecProject into a virtual result slot.
  • PostgreSQL Join Ordering — Dynamic-Programming Search and the GEQO FallbackPostgreSQL chooses a join order by exhaustive bottom-up dynamic programming, the same System R idea Selinger described in 1979. make_rel_from_joinlist counts the jointree items (levels_needed) and, below geqo_threshold (default 12), hands off to standard_join_search, which fills root->join_rel_level[k] level by level: level 1 holds the base/initial rels, and join_search_one_level(level) builds every feasible k-item joinrel from lower levels — left/right-sided joins of a (level-1)-rel against an initial rel via make_rels_by_clause_joins, plus bushy joins of a k-rel against a (level-k)-rel, with a clauseless-Cartesian last-ditch pass. make_join_rel checks join_is_legal, finds-or-builds the RelOptInfo (so the same set of relids is one node no matter which pair produced it — the DP memo), and calls populate_joinrel_with_paths -> add_paths_to_joinrel, which enumerates sort-merge, nestloop, and hash physical paths and keeps only the non-dominated ones via add_path. At or above geqo_threshold the search space (which grows super-exponentially) is too large to enumerate, so geqo() runs a genetic algorithm over a pool of join-order tours, scoring each with geqo_eval -> gimme_tree (a clump-merging heuristic that reuses make_join_rel), and returns the cheapest tour found. The principle: the join order is the dominant cost decision, path enumeration is delegated, and only the search strategy switches at the threshold.
  • PostgreSQL Node System — NodeTag, copyObject, and the gen_node_support CodegenEvery parse tree, plan tree, and executor state tree in PostgreSQL is built from tagged C structs called Nodes. The contract is one field: the first member of every node struct is a NodeTag enum value, so any node pointer can be cast to Node* and dispatched on its tag. makeNode(T) palloc0's a struct of the right size and stamps T_T into that first field; nodeTag/IsA/castNode read it back. Single inheritance is by convention — a struct 'derives' from Plan by making Plan its first field, which recursively keeps NodeTag first. List is a tagged, expansible array (not a cons-cell list) with four flavours (pointer/int/Oid/Xid) sharing one ListCell union. The bulk of node housekeeping — copyObject() deep copy, equal() structural compare, outNode() serialize, nodeRead() deserialize — is not hand-written: gen_node_support.pl parses the node header files at build time, assigns the NodeTag numbers (a frozen ABI in stable branches), and emits copyfuncs/equalfuncs/outfuncs/readfuncs from the struct field types, steered by pg_node_attr() annotations. The hand-written copyfuncs.c/equalfuncs.c hold only the macros, the dispatch entry points (copyObjectImpl, equal), the List special-case, and a handful of custom_copy_equal nodes like Const. This is the plumbing that lets the planner copy a Query, the executor ship a Plan to a parallel worker, and the rewriter compare expression subtrees — all generically, by tag.
  • PostgreSQL Parallel Query — Gather, Workers, and the DSM PlanPostgreSQL parallelizes a query by inserting a Gather (or order-preserving GatherMerge) node above a partial plan: the partial subtree below it is parallel-aware so that N copies running concurrently together produce the full result without duplication. On first execution the Gather node calls ExecInitParallelPlan, which serializes the plan with nodeToString into a dynamic-shared-memory (DSM) segment keyed by a shm_toc table of contents, sizes the segment by walking the PlanState tree twice (estimate then init), and lays out per-worker tuple queues, a DSA area, parameter and instrumentation slots. LaunchParallelWorkers asks the postmaster to fork background workers that run ParallelWorkerMain: each worker attaches the DSM, restores leader state (GUCs, snapshots, the transaction XID set, combo CIDs, libraries, user IDs) so visibility checks match the leader, then runs its own executor copy via ParallelQueryMain, writing MinimalTuples into its shm_mq tuple queue. The leader's gather_readnext drains those queues round-robin while optionally also running the plan locally (parallel_leader_participation); GatherMerge instead binary-heap-merges the per-worker streams to preserve sort order. Parallel mode is strictly read-only, enforced by EnterParallelMode, and errors propagate back over a dedicated error shm_mq via PROCSIG_PARALLEL_MESSAGE caught at the leader's next CHECK_FOR_INTERRUPTS.
  • PostgreSQL Parser — Lexing, LALR(1) Grammar, and Semantic AnalysisHow PostgreSQL turns a SQL string into a Query tree: a no-backtrack Flex lexer (scan.l) emits tokens through a one-token-lookahead filter (base_yylex) into a ~19 700-line Bison LALR(1) grammar (gram.y) that builds a catalog-free raw parse tree (RawStmt/List of nodes), then a second phase (analyze.c + parse_*.c) drives a ParseState namespace walker over each statement to resolve names, assign OIDs, coerce types, and produce the Query node the planner consumes.
  • PostgreSQL Path Generation — Scan and Join Path Builders, add_path Pruning, and PathkeysPath generation is the part of the PostgreSQL optimizer that turns a RelOptInfo (one per base or join relation) into a populated pathlist: for each base rel, make_one_rel drives set_base_rel_pathlists, which dispatches to per-RTE-type builders (set_plain_rel_pathlist, set_append_rel_pathlist, set_foreign_pathlist, set_subquery_pathlist, …); set_plain_rel_pathlist calls create_tidscan_paths, create_seqscan_path, and create_index_paths; create_index_paths iterates rel->indexlist, matching restriction clauses, join clauses, and EquivalenceClass-derived clauses to each index, then emits IndexPath, BitmapHeapPath, and parameterized variants; add_path enforces the dominance pruning rule — a new path is kept only if it is not dominated by any existing path on cost, pathkeys (sort order), parameterization, row count, and parallel-safety; set_cheapest then pins the cheapest_startup_path, cheapest_total_path, and cheapest_parameterized_paths; pathkeys.c implements canonical sort-order tracking through EquivalenceClass-backed PathKey lists so that index orderings, merge-join inputs, and ORDER BY requirements are all expressed in a unified vocabulary.
  • PostgreSQL Plan Creation — From Winning Path to Executable Plan TreePlan creation is the optimizer's final hand-off: it turns the single cheapest Path chosen by the join-order search into a fully-formed, executable Plan tree, then rewrites every variable reference so the executor can dereference columns by slot offset instead of by catalog identity. create_plan_recurse walks the Path tree top-down, dispatching on pathtype to a create_<node>_plan routine that allocates the matching Plan node, builds its target list (build_path_tlist, or the wider physical tlist when the parent will not project), orders and strips its quals (order_qual_clauses, extract_actual_clauses), and copies cost/width estimates (copy_generic_path_info); join routines recurse into both children and convert lateral/outer Vars into NestLoopParams via replace_nestloop_params. The tlists and quals at this stage are still in parser numbering — Vars carry (varno, varattno) keyed to the range table. set_plan_references then makes a second top-down pass: it flattens every subquery's range table into one global finalrtable, offsets each scan node's Vars by rtoffset, and for every join/upper node rewrites child-output Vars to (INNER_VAR | OUTER_VAR, resno) so the executor reads columns positionally from the inner/outer TupleTableSlot via an indexed_tlist hash. Correlated and uncorrelated sub-SELECTs are handled separately in subselect.c: build_subplan decides between an initPlan (run once, result stashed in a PARAM_EXEC slot) and a per-tuple SubPlan, and SS_attach_initplans hangs the query level's initPlans on the topmost node. The output is a Plan tree whose every expression is resolved to a slot offset, ready for ExecInitNode.
  • PostgreSQL Planner — From Query Tree to Plan: Paths, Cost, and the System-R LineageThe PostgreSQL planner is a cost-based, System-R-lineage optimizer that turns a rewritten Query tree into an executable Plan tree in three nested phases: subquery_planner preprocesses one Query level (sublink/subquery pull-up, qual canonicalization, const-folding), grouping_planner layers the upper operations (set ops, grouping/aggregation, window functions, DISTINCT, ORDER BY, LIMIT), and query_planner/make_one_rel does the scan-and-join core. The defining data-structure split is RelOptInfo (one per relation, base or join) holding a pathlist of Path nodes (every plausible way to produce that relation), versus the Plan (the single executable tree). Paths are generated bottom-up — set_base_rel_pathlists builds scan paths for each base rel, then dynamic-programming join search builds joinrel paths level by level — and set_cheapest keeps only the cheapest path per cost/sort-order/parameterization niche. create_plan then walks the one winning Path tree top-down and emits the matching Plan. Join-order enumeration, the cost formulas, the per-AM path builders, and preprocessing each have their own docs; this is the spine and the map.
  • PostgreSQL Portals, Prepared Statements, and the Plan CacheA portal is PostgreSQL's execution container — the live state of one runnable SQL command, wrapping a PlannedStmt list, a CachedPlan reference, a result tupdesc, and a cursor position. PortalStart classifies the portal into one of five PortalStrategy modes (ONE_SELECT, ONE_RETURNING, ONE_MOD_WITH, UTIL_SELECT, MULTI_QUERY) via ChoosePortalStrategy, sets up the snapshot and (for ONE_SELECT) calls ExecutorStart; PortalRun then either drives the executor incrementally through PortalRunSelect (so FETCH n can suspend mid-query) or runs the command to completion into a tuplestore; PortalDrop tears the container down, shutting the executor and dropping the cached-plan refcount. Prepared statements (PREPARE / the protocol Parse message) are parse-analyzed once into a CachedPlanSource stored in a per-backend hash table; each EXECUTE binds parameters and calls GetCachedPlan, which is where the plancache chooses between a parameter-independent generic plan (planned once, reused) and a per-parameter custom plan (replanned every time). The heuristic in choose_custom_plan: always custom for the first 5 executions, then compare the cached generic_cost against the running average custom cost (custom cost includes a 1000 * cpu_operator_cost-per-relation planning charge) and keep generic only if it is cheaper. Cache invalidation via sinval messages flips is_valid, forcing RevalidateCachedQuery to re-analyze/replan on next use.
  • PostgreSQL Planner Prep — Subquery Pull-Up, Qual Canonicalization, and Set OpsHow PostgreSQL's planner prep phase rewrites a parse tree into a flatter, more optimizable shape before path generation: pulling ANY/EXISTS SubLinks up into semi/anti-joins and flattening simple subqueries and UNION ALL members into the parent jointree (prepjointree.c), canonicalizing WHERE/CHECK quals by pushing NOTs down via DeMorgan and applying the inverse OR distributive law without forcing CNF (prepqual.c), expanding and junk-augmenting the target list for INSERT/UPDATE/DELETE/MERGE and row marks (preptlist.c), and planning UNION/INTERSECT/EXCEPT as an Append/SetOp tree over independently planned leaf subqueries (prepunion.c).
  • PostgreSQL Rule System & Query Rewriter — Views, DO INSTEAD, and RLSThe PostgreSQL query rewriter is the stage between parse analysis and planning that applies the rule system: it turns one parse-analyzed Query into zero, one, or many Query trees by substituting stored rule actions. QueryRewrite is the single entry point and runs in three steps — RewriteQuery fires the non-SELECT (INSERT/UPDATE/DELETE) rules and auto-updatable-view rewriting, producing a flat list of product queries; fireRIRrules then walks every range-table entry of each result and expands views in place via ApplyRetrieveRule, which replaces a view RTE with the view's stored ON SELECT _RETURN Query as a sub-select; and step 3 decides which query carries the command tag. Views are not a storage object but a relation with an ON SELECT DO INSTEAD rule, defined through DefineQueryRewrite. INSERT/UPDATE/DELETE rules are matched by matchLocks and fired by fireRules, which honors INSTEAD vs. ALSO and conditional quals by emitting a negated-qual 'default' copy of the original. Row-level security is the last seam in fireRIRrules: get_row_security_policies returns securityQuals and withCheckOptions that are prepended to each relation RTE as security-barrier quals. SEARCH/CYCLE clause expansion on recursive CTEs is also folded into fireRIRrules. The rewriter outputs a list of plannable Query trees; recursion is bounded by an explicit rewrite_events stack and activeRIRs OID list.
  • PostgreSQL Scan Nodes — Sequential, Index, Bitmap, and TID ScansThe PostgreSQL executor provides six leaf scan node types — SeqScan, IndexScan, IndexOnlyScan, BitmapHeapScan, TidScan, and TidRangeScan — each implementing the Volcano iterator interface via a shared ExecScan/ExecScanExtended loop. SeqScan delegates entirely to the table AM; IndexScan fetches one heap tuple per index entry; IndexOnlyScan avoids the heap by reading data directly from the index using the visibility map as a fallback gate; BitmapHeapScan decouples the index pass from the heap pass via a TIDBitmap; TidScan and TidRangeScan target exact or range ctid predicates. Every node type carries parallel DSM support via the standard Estimate/InitializeDSM/ReInitializeDSM/InitializeWorker protocol.
  • PostgreSQL tuplesort — In-Memory Quicksort, External Merge Sort, Logical Tapes, and TuplestorePostgreSQL's tuplesort.c is a single module that does two opposite things depending on how much data fits in work_mem. If everything fits, it accumulates SortTuples in one array and runs a specialized quicksort (state TSS_SORTEDINMEM); if it overflows, it switches once to external mode (TSS_BUILDRUNS), quicksorting each work_mem-sized batch into a sorted run written to a logical tape, then merging the runs with a binary min-heap. The runs are not stored in separate OS files: logtape.c packs every logical tape into one temp file as a chain of BLCKSZ blocks and recycles each block back onto a min-heap freelist the instant it is read during the merge, so peak disk usage stays near 1x the data instead of the 2x a naive tape-per-file scheme would need. The merge order M (number of input tapes per pass) is chosen by tuplesort_merge_order from work_mem; if the run count exceeds M, the sort does multiple balanced merge passes. Three optimizations cut comparison cost: abbreviated keys (a fixed-size sortable proxy for the leading column, with a runtime abort if it is not discriminating), bounded sort (LIMIT n turns the sort into a fixed-size max-heap of n elements that discards everything smaller, never spilling), and datum-specialized qsort variants. Parallel sort has each worker produce exactly one sorted run on its own tape; the leader imports the worker tapes and runs a single final merge. The companion tuplestore.c is the no-sort sibling: it stores a tuple sequence with multiple independent read pointers (for cursors, Materialize, recursive CTEs), spilling to a BufFile past work_mem but never reordering. This doc covers the in-memory vs external decision, run generation, the logical-tape set and block recycling, the merge heap, abbreviated keys, bounded sort, parallel sort, and tuplestore; it defers the Sort/Agg plan nodes to postgres-agg-sort-nodes.md and executor plumbing to postgres-executor.md.

DDL & Schema (10)

  • PostgreSQL ALTER TABLE — The Multi-Pass Machinery and Table RewritesALTER TABLE in PostgreSQL is a three-phase, multi-pass machine implemented almost entirely in tablecmds.c. AlterTable() takes a single lock (the strongest required by any subcommand, computed up front by AlterTableGetLockLevel without touching the table) and hands control to ATController, which runs Phase 1 (ATPrepCmd: permission/relkind checks, inheritance recursion, build the per-table work queue of AlteredTableInfo), Phase 2 (ATRewriteCatalogs: replay each table's subcommands ordered into 13 fixed passes from AT_PASS_DROP through AT_PASS_MISC so DROP precedes ADD and ALTER TYPE precedes index re-add), and Phase 3 (ATRewriteTables: a single scan per table that verifies new constraints and, if tab->rewrite is nonzero, recopies every tuple into a new relfilenode via make_new_heap + ATRewriteTable + finish_heap_swap). The central design tension is in-place catalog-only changes versus full heap rewrites: ADD COLUMN with a constant default uses the pg_attribute fast-default (attmissingval) to avoid a rewrite; ALTER COLUMN TYPE consults ATColumnChangeRequiresRewrite to skip the rewrite when the cast is binary-coercible; DROP COLUMN is a pure catalog logical-delete (attisdropped) that never touches heap pages. The rewrite reason is recorded as a bitmask (AT_REWRITE_DEFAULT_VAL, AT_REWRITE_COLUMN_REWRITE, AT_REWRITE_ALTER_PERSISTENCE, AT_REWRITE_ACCESS_METHOD).
  • PostgreSQL Constraints — CHECK, NOT NULL, Unique, Primary, and Foreign KeysEvery PostgreSQL integrity constraint is one row in the pg_constraint catalog, written by CreateConstraintEntry, whose contype column ('c' CHECK, 'n' NOT NULL, 'p' PRIMARY KEY, 'u' UNIQUE, 'f' FOREIGN KEY, 'x' EXCLUSION) selects the enforcement machinery. CHECK and NOT NULL are expression predicates evaluated in-line by the executor on every modified row (StoreRelCheck / StoreRelNotNull store the conbin tree; ExecConstraints fires it). UNIQUE and PRIMARY KEY are not evaluated at all — they delegate to a backing unique B-tree index (conindid), so a duplicate is rejected by the index insert. FOREIGN KEY is the outlier: it is implemented as a set of system AFTER-ROW triggers (createForeignKeyCheckTriggers on the referencing side, createForeignKeyActionTriggers on the referenced side) that run SPI queries — RI_FKey_check looks up the parent row with SELECT ... FOR KEY SHARE, ri_restrict / RI_FKey_cascade_* enforce ON DELETE/UPDATE actions. ri_triggers.c caches a RI_ConstraintInfo per FK and a prepared SPI plan per query shape. NOT VALID skips the back-scan at creation (convalidated=false); a later VALIDATE CONSTRAINT runs RI_Initial_Check (a single anti-join LEFT JOIN ... IS NULL query) or re-scans the heap. Deferrable constraints set condeferrable/condeferred and ride the AFTER-trigger deferred-event queue, fired at SET CONSTRAINTS or commit. Domains carry CHECK and NOT NULL constraints too (typecmds.c, domainAddCheckConstraint).
  • PostgreSQL COPY — Bulk Load/Unload, Formats, and Multi-InsertCOPY is PostgreSQL's bulk load/unload command. DoCopy() in copy.c parses the option list (ProcessCopyOptions), checks role privileges, and dispatches to BeginCopyFrom/CopyFrom (load) or BeginCopyTo/DoCopyTo (unload). Three built-in formats — text, CSV, binary — are each a CopyFromRoutine / CopyToRoutine vtable of start/infunc/onerow/end callbacks selected by CopyFromGetRoutine. On load, copyfromparse.c runs a staged pipeline (raw_buf -> input_buf -> line_buf -> attribute_buf): CopyLoadRawBuf reads bytes, CopyConvertBuf transcodes, CopyReadLineText splits lines honoring quote/escape, and CopyReadAttributesText/CSV de-escape fields, which NextCopyFrom feeds to per-type input functions via InputFunctionCallSafe. CopyFrom batches tuples through a CopyMultiInsertInfo of CopyMultiInsertBuffers (MAX_BUFFERED_TUPLES=1000, MAX_BUFFERED_BYTES=65535) flushed via table_multi_insert, falling back to CIM_SINGLE when triggers, volatile defaults, or non-batching FDWs forbid batching. COPY FREEZE sets TABLE_INSERT_FROZEN to write pre-frozen tuples when the relation was created/truncated in the current subxact. ON_ERROR IGNORE traps soft errors through an ErrorSaveContext, counting num_errors, bounded by REJECT_LIMIT and reported per LOG_VERBOSITY. COPY TO scans the table (or runs a query through a DestCopyOut receiver) and emits rows via CopyAttributeOutText/CSV or binary send functions.
  • PostgreSQL DDL Execution — ProcessUtility Dispatch, the Simple/Slow Split, and the CREATE TABLE PathHow PostgreSQL dispatches utility (non-optimizable) statements: ProcessUtility checks the hook, standard_ProcessUtility handles simple statements inline and forwards the catalog-heavy set to ProcessUtilitySlow, which wraps every command in event-trigger fences; CREATE TABLE is the canonical path — transformCreateStmt expands LIKE/inheritance, DefineRelation resolves namespace and access method, calls heap_create_with_catalog to write pg_class/pg_attribute rows and allocate the physical file, then CommandCounterIncrement makes the new tuple visible within the same transaction so AddRelationNewConstraints can transform raw DEFAULT expressions against it.
  • PostgreSQL Event Triggers — DDL Hooks and Command CollectionEvent triggers are user-defined functions that fire on DDL and session events rather than on table DML. PostgreSQL defines five event types — ddl_command_start, ddl_command_end, sql_drop, table_rewrite, and login — each fired from a hand-placed call site rather than from a generic dispatcher. The DDL events fire from inside ProcessUtilitySlow: EventTriggerBeginCompleteQuery sets up a per-command EventTriggerQueryState, EventTriggerDDLCommandStart fires before execution, and EventTriggerSQLDrop + EventTriggerDDLCommandEnd fire after. table_rewrite fires from ATRewriteTables in tablecmds.c; login fires from PostgresMain after authentication. Each firing point calls EventTriggerCommonSetup, which consults the syscache-backed event-trigger cache (EventCacheLookup), filters candidate triggers by session replication role and command tag, and hands an EventTriggerData node to EventTriggerInvoke. Two side channels accumulate data for the trigger function to read: a SQLDropList of dropped objects (consumed by pg_event_trigger_dropped_objects) and a commandList of CollectedCommand records (consumed by pg_event_trigger_ddl_commands), the latter populated by EventTriggerCollect* calls sprinkled through the DDL execution path.
  • PostgreSQL Index Creation — CREATE INDEX, the Build, and CONCURRENTLYCREATE INDEX in PostgreSQL is two cooperating layers: indexcmds.c::DefineIndex does the SQL-level work — lock the table, resolve the access method and operator classes, compute the IndexInfo, choose names — and catalog/index.c::index_create makes the catalog rows (pg_class, pg_attribute, pg_index, dependencies) and, unless told to skip, calls index_build. index_build is AM-agnostic: it invokes the access method's ambuild callback (e.g. btbuild), which scans the heap via table_index_build_scan and bulk-loads a fresh, dense index, then index_update_stats writes reltuples back. A plain build holds ShareLock — readers proceed, writers block. CREATE INDEX CONCURRENTLY trades that one blocking build for a multi-transaction, ShareUpdateExclusiveLock protocol: insert a not-ready/not-valid catalog entry and commit so writers start maintaining HOT-safety; WaitForLockers; build against a snapshot; mark indisready and commit; WaitForLockers again; take a reference snapshot, validate_index to merge-join the heap against the index's existing TIDs and insert what is missing; drop the snapshot, WaitForOlderSnapshots; finally set indisvalid. REINDEX reuses index_build with isreindex=true under AccessExclusiveLock (or REINDEX CONCURRENTLY builds a shadow index and swaps it). The whole mechanism rests on the three pg_index liveness bits — indislive, indisready, indisvalid — flipped in order by index_set_state_flags.
  • PostgreSQL Materialized Views — Storage and REFRESH (incl. CONCURRENTLY)A PostgreSQL materialized view is a RELKIND_MATVIEW relation that owns a real heap (relfilenode) plus a SELECT-INSTEAD-OF rewrite rule carrying its defining query. CREATE MATERIALIZED VIEW always creates the relation WITH NO DATA first (createas.c), then reuses the REFRESH code path to populate it. REFRESH builds a fresh transient heap by running the stored query through a DR_transientrel DestReceiver (frozen bulk insert), then either swaps relfilenodes wholesale via refresh_by_heap_swap -> finish_heap_swap (the default, under AccessExclusiveLock) or, for REFRESH ... CONCURRENTLY, computes a row-level diff via refresh_by_match_merge: ANALYZE the temp table, FULL JOIN old-vs-new keyed on a usable UNIQUE index to build a 'diff' temp table of (tid, newdata), then set-based DELETE + INSERT against the live matview under only ExclusiveLock. The pg_class.relispopulated flag (exposed by RelationIsPopulated / RelationIsScannable) gates whether a scan is allowed; an unpopulated matview raises 'has not been populated'. CONCURRENTLY requires a populated matview and at least one usable unique index.
  • PostgreSQL Declarative Partitioning — Bounds, Tuple Routing, and PruningPostgreSQL declarative partitioning turns one logical table into a hierarchy of physical leaf tables, splitting the work across three layers. Catalog/relcache builds a PartitionDesc per partitioned relation — an ordered oids[] array plus a PartitionBoundInfo that canonicalizes the LIST/RANGE/HASH bounds into a sorted datums[] array and an indexes[] map, so any point lookup is a binary search (partition_list_bsearch / partition_range_datum_bsearch) or a modulo (compute_partition_hash_value). On INSERT, execPartition.c routes each tuple top-down: ExecFindPartition extracts the partition key with FormPartitionKeyDatum and calls get_partition_for_tuple, which binary-searches the boundinfo and short-circuits via a last-found cache once it hits the same partition PARTITION_CACHED_FIND_THRESHOLD (16) times in a row. On SELECT, partprune.c eliminates partitions twice: plan-time pruning (prune_append_rel_partitions) compiles WHERE clauses into PartitionPruneStepOp / PartitionPruneStepCombine steps and runs get_matching_partitions before the plan is finalized; runtime pruning (ExecDoInitialPruning / ExecFindMatchingSubPlans) re-runs the same steps once PARAM_EXEC values from nested loops or generic plans are known, dropping Append/MergeAppend subplans that cannot match. Partitionwise join and aggregate (enable_partitionwise_join / _aggregate) push a join or GROUP BY below the Append by matching the two inputs' bounds with partition_bounds_merge, turning one big operator into N small ones.
  • PostgreSQL Row-Level Security — Policies and Rewrite-Time QualsRow-Level Security (RLS) is PostgreSQL's per-row access-control layer. A policy is a row in pg_policy that names a relation, a command class (polcmd: SELECT/INSERT/UPDATE/DELETE or '*'), a set of roles, a permissive/restrictive flag, and up to two expressions: a USING qual (polqual, which rows are visible/lockable) and a WITH CHECK qual (polwithcheck, which new rows may be written). The flag pg_class.relrowsecurity arms RLS on the table; relforcerowsecurity makes it apply even to the table owner. At query-rewrite time, fireRIRrules() calls get_row_security_policies() for each relation RTE; check_enable_rls() decides RLS_NONE / RLS_NONE_ENV / RLS_ENABLED based on the relation flags, BYPASSRLS, ownership, the row_security GUC, and the RI-check context. add_security_quals() OR-combines permissive USING quals and AND-combines restrictive ones, prepending them to rte->securityQuals (security-barrier quals); add_with_check_options() builds WithCheckOption nodes the executor enforces on written rows. With RLS armed but no matching permissive policy, a single always-false Const yields the implicit default-deny. Because the decision depends on role and GUC, get_row_security_policies sets hasRowSecurity so the plan cache replans when the environment changes.
  • PostgreSQL Triggers — Definition, Firing Points, and the After-Trigger QueueA PostgreSQL trigger is a pg_trigger catalog row binding a function to a (timing, level, event) tuple; RelationBuildTriggers materializes those rows into a per-relation TriggerDesc cached in the relcache, and SetTriggerFlags precomputes boolean summary flags so the executor can skip the trigger array entirely when nothing matches. The executor calls one of a fixed family of firing-point hooks — ExecBR/ExecAR/ExecIR for row-level and ExecBS/ExecAS for statement-level, one per BEFORE/AFTER/INSTEAD-OF timing and per INSERT/UPDATE/DELETE event — at the exact point in nodeModifyTable where the corresponding tuple operation happens. BEFORE-row triggers run synchronously and may rewrite or veto the tuple; INSTEAD-OF triggers (views only) replace the operation; AFTER triggers do almost no work at fire time — ExecAR* merely calls AfterTriggerSaveEvent, which appends a compact AfterTriggerEventData record (just a flags word plus one or two CTIDs) into a chunked AfterTriggerEventList held in the per-query level of the global afterTriggers state. At AfterTriggerEndQuery the immediate-mode events are marked and fired in queue order via afterTriggerInvokeEvents (which re-fetches each tuple by CTID under SnapshotAny); deferrable events are migrated to the transaction-global list and fired at AfterTriggerFireDeferred just before commit. Transition tables (REFERENCING OLD/NEW TABLE) are captured into tuplestores in MakeTransitionCaptureState as rows flow, decoupled from the CTID-based event records and never deferrable.

Replication & HA (10)

  • PostgreSQL WAL Archiving & Summarization — PITR and the Incremental-Backup SubstrateContinuous WAL archiving and WAL summarization are two background processes that turn PostgreSQL's append-only WAL stream into the durable substrate for point-in-time recovery (PITR) and incremental backup. The archiver process (PgArchiverMain) waits on a latch, scans pg_wal/archive_status for .ready notification files left by the WAL machinery, prioritizes them in a max-heap (timeline-history files first, then oldest segment), and ships each segment by calling an archive module's archive_file_cb — either the built-in shell_archive (archive_command) or a loaded archive_library. A successful copy renames NNN.ready to NNN.done; a checkpoint later recycles the segment. On the recovery side, xlogarchive.c runs restore_command via RestoreArchivedFile to pull segments back out of the archive, and the .ready/.done status protocol (XLogArchiveNotify/CheckDone/IsBusy) decides when a segment may be recycled. The PG17 WAL summarizer (WalSummarizerMain) is a separate process that, when summarize_wal is on, continuously reads WAL with an xlogreader, records which relation-fork blocks were modified into a BlockRefTable, and emits TLI-LSN-LSN .summary files into pg_wal/summaries. Incremental backup (pg_basebackup --incremental) consumes those summaries to copy only changed blocks since a reference backup. The summarizer handles checkpoint-aligned file boundaries, timeline switches, wal_level=minimal fast-forward skipping, and time-based summary cleanup.
  • PostgreSQL Base Backup — BASE_BACKUP, Backup Label, and the Streaming ProtocolA PostgreSQL base backup is a physical, byte-for-byte snapshot of the data directory taken while the cluster keeps running — the relational equivalent of a fuzzy archival dump. The replication-protocol command BASE_BACKUP (dispatched to SendBaseBackup) brackets the copy between do_pg_backup_start and do_pg_backup_stop: start increments runningBackups under the WAL insert locks, forces full-page writes, and runs a checkpoint so the redo point is pinned; stop records the end LSN and waits for the WAL covering the backup interval to be archived. Between the brackets, perform_base_backup walks $PGDATA with sendDir/sendFile, emitting each file as a tar member through a bbsink — a stacked vtable of sinks (copystream -> progress -> throttle -> compression -> target) that ultimately frames every chunk as a CopyData message over the wire. The first member of base.tar is backup_label (the recovery breadcrumb naming START WAL LOCATION and CHECKPOINT LOCATION), followed by tablespace_map; pg_control is sent last so it reflects a consistent post-walk state. The restored copy is NOT directly usable: it is a torn, inconsistent image that only becomes consistent after crash-style WAL replay from the checkpoint redo point up to the backup end LSN, which is why backup_label and the WAL between start and stop are mandatory. Incremental backup (PG17+) reuses the same machinery: the client UPLOAD_MANIFESTs a prior backup, GetFileBackupMethod consults the WAL summaries' block-reference table to decide per-file whether to send it fully or as a sparse INCREMENTAL.* file carrying only changed blocks.
  • PostgreSQL Incremental Backup — WAL Summaries, the Manifest, and pg_combinebackupPostgreSQL 17 added incremental physical backup. The client uploads the prior backup's manifest with UPLOAD_MANIFEST, then issues BASE_BACKUP ... INCREMENTAL. On the server, CreateIncrementalBackupInfo parses the manifest incrementally, PrepareForIncrementalBackup matches the manifest's WAL ranges against this server's timeline history, waits for the WAL summarizer to catch up to the backup start LSN, then loads the relevant WAL summary files into one in-memory BlockRefTable. For every relation file, GetFileBackupMethod consults that table: a relation with no recorded changes is sent as a tiny INCREMENTAL.* stub (header only), a heavily-changed relation is sent fully, and a partially-changed relation is sent as an incremental file carrying just the modified blocks plus a header (INCREMENTAL_MAGIC, block count, truncation_block_length, the relative block-number array). sendFile in basebackup.c writes that header and streams only the listed blocks. The backup is unusable until pg_combinebackup — a frontend tool — walks the backup chain newest-to-oldest, builds a per-block sourcemap in reconstruct_from_incremental_file, and writes a full synthetic data directory. The free-space map is always sent fully because it is not WAL-logged.
  • PostgreSQL Logical Decoding — Reorder Buffer, Snapshot Builder, and Output PluginsLogical decoding turns the physical WAL — a byte stream of block-level redo records — back into a per-transaction, commit-ordered stream of logical row changes. decode.c is the rmgr dispatch: LogicalDecodingProcessRecord routes each XLogReadRecord()ed record to a per-resource-manager rm_decode handler (xact_decode, heap_decode, heap2_decode, …) that parses the physical record into a ReorderBufferChange and hands it to the reorder buffer keyed by xid. reorderbuffer.c reassembles the interleaved, subtransaction-split change streams of all concurrent transactions into one ordered stream per top-level transaction, holding changes in slab/generation memory contexts until the commit record arrives, spilling the largest transaction to disk (or streaming it) when logical_decoding_work_mem is exceeded, and replaying via a k-way binary-heap merge over the per-(sub)transaction LSN-ordered change lists. snapbuild.c builds a SNAPSHOT_HISTORIC_MVCC catalog snapshot purely from the WAL — tracking only catalog-modifying committed xids between [xmin, xmax) and resolving cmin/cmax from XLOG_HEAP2_NEW_CID records — advancing through a four-state machine (START → BUILDING_SNAPSHOT → FULL_SNAPSHOT → CONSISTENT) so that decoded heap tuples can be interpreted against the catalog as it looked when the WAL was written. logical.c wires it all together: it loads the output plugin's _PG_output_plugin_init, validates the begin/change/commit callbacks, and invokes them through error-context wrappers (begin_cb_wrapper, change_cb_wrapper, commit_cb_wrapper) that the reorder buffer calls during replay.
  • PostgreSQL Logical Replication — Subscriber Apply, Launcher, and Table SyncOn the subscriber side, logical replication is driven by three cooperating process types. The logical replication launcher (ApplyLauncherMain) is a postmaster-registered background worker that polls pg_subscription once per cycle and launches one leader apply worker per enabled subscription via logicalrep_worker_launch, recording a per-subscription last-start time in a DSA-backed hash so a crashing worker cannot hot-loop faster than wal_retrieve_retry_interval. Each leader apply worker (ApplyWorkerMain -> SetupApplyOrSyncWorker -> run_apply_worker) connects to the publisher with walrcv_connect, sets up a replication origin to remember its confirmed stream position, then runs LogicalRepApplyLoop: receive a CopyData buffer over the walreceiver libpq shim, peek the first byte, and on 'w' dispatch the decoded logical-replication message through apply_dispatch to one of the apply_handle_* routines (begin/commit, insert/update/delete, truncate, relation/type, origin, and the stream_* / prepare variants for in-progress and two-phase transactions). apply_handle_insert/update/delete translate the row into a local TupleTableSlot via the LogicalRepRelMapEntry relation cache (logicalrep_rel_open), run the change through the executor with session_replication_role = replica, and report INSERT/UPDATE/DELETE conflicts (CT_*) via ReportApplyConflict when the local tuple's origin differs or the target row is missing/duplicate. Feedback (write/flush/apply LSNs) flows back upstream through send_feedback, gated on get_flush_position so the subscriber never acknowledges a transaction it has not durably committed. Initial data copy is handled by per-table tablesync workers (TablesyncWorkerMain -> LogicalRepSyncTableStart -> copy_table) that march each pg_subscription_rel row through the state machine INIT -> DATASYNC -> FINISHEDCOPY -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY, handing the table back to the leader apply worker once it has caught up to the sync point.
  • PostgreSQL pg_rewind — Timeline Divergence Detection and Data Directory Resyncpg_rewind resynchronises a diverged PostgreSQL data directory by finding the last common WAL checkpoint between a stale target and an authoritative source, reading the target's post-divergence WAL to discover every data page the target modified, then fetching only those pages plus any non-relation files that differ — leaving the rest untouched. The tool exposes two source backends through the rewind_source vtable: init_libpq_source (live server via libpq) and init_local_source (local directory). File-level decisions are encoded in a per-path hash table (filehash, file_entry_t) that accumulates source/target metadata and a dirty-page bitmap (datapagemap_t); decide_file_actions() collapses the hash into a sorted filemap_t whose action ordering guarantees CREATE before COPY, COPY before REMOVE. A backup_label is written at the last common checkpoint so WAL replay from the source closes the remaining gap on next startup.
  • PostgreSQL pgoutput — The Built-in Logical Replication Output Pluginpgoutput is PostgreSQL's in-core logical decoding output plugin — the component that serializes the transaction-ordered change stream produced by logical decoding into the wire-level logical replication protocol that a subscriber's apply worker consumes. It registers a table of OutputPluginCallbacks in _PG_output_plugin_init (begin/change/truncate/message/commit, plus the two-phase and streaming variants) and translates each callback into a typed protocol message via the logicalrep_write_* writers in proto.c: 'B' BEGIN, 'R' RELATION, 'Y' TYPE, 'I'/'U'/'D' INSERT/UPDATE/DELETE, 'T' TRUNCATE, 'C' COMMIT, and the 'S'/'E'/'A'/'c' STREAM_START/STOP/ABORT/COMMIT frames for in-progress transactions. A per-relation RelationSyncCache (RelationSyncEntry) memoizes, for each table, which publications publish it, which pubactions are enabled, the resolved publish_as_relid for partition-root publishing, the column-list bitmap, and one ExprState per action for row filters. pgoutput defers BEGIN until the first surviving change (skipping empty transactions), sends each relation's schema lazily exactly once (schema_sent, tracked per top-xid for streamed transactions), evaluates row filters and column lists per change, and tags every streamed-mode message with its xid so the subscriber can discard aborted subtransactions.
  • PostgreSQL Replication Slots — Retaining WAL and catalog_xminA replication slot is a durable, named bookmark in shared memory that pins the two resources a downstream consumer still needs: the oldest WAL segment (via restart_lsn) and, for logical slots, the oldest system-catalog row versions (via catalog_xmin). The fixed-size ReplicationSlotCtl->replication_slots[] array holds max_replication_slots ReplicationSlot structs; each is guarded by a per-slot spinlock for field updates and by the cluster-wide ReplicationSlotControlLock for the in_use bit. Persistent slots are crash-safe — each lives as a pg_replslot/<name>/state file written atomically (tmp + fsync + rename) at every checkpoint and reloaded by StartupReplicationSlots before redo. Retention is enforced indirectly: ReplicationSlotsComputeRequiredLSN aggregates the minimum restart_lsn and hands it to the xlog module so checkpoints will not recycle still-needed segments, while ReplicationSlotsComputeRequiredXmin publishes the minimum catalog_xmin into the ProcArray so vacuum will not prune catalog tuples a logical decoder still needs. Slots advance forward only (pg_replication_slot_advance), can be invalidated when they fall too far behind (max_slot_wal_keep_size, horizon conflicts, idle timeout), and — for logical failover slots — are copied to standbys by slot sync so a consumer survives a failover.
  • PostgreSQL Synchronous Replication — synchronous_commit and the Wait QueueSynchronous replication makes a committing backend block after its WAL is locally flushed until a configured set of standbys acknowledge the same LSN. The whole mechanism lives on the primary: a committing backend calls SyncRepWaitForLSN(XactLastRecEnd, true) from RecordTransactionCommit, after which it registers MyProc on one of three LSN-ordered wait queues in shared memory (WalSndCtl->SyncRepQueue[mode], indexed by SYNC_REP_WAIT_WRITE/FLUSH/APPLY chosen from synchronous_commit) and sleeps on its own latch. The synchronous_commit GUC selects the durability level (off/local/remote_write/on=remote_flush/remote_apply) via assign_synchronous_commit setting SyncRepWaitMode; synchronous_standby_names selects which and how many standbys count, parsed into a SyncRepConfigData with method FIRST (priority) or ANY (quorum) and num_sync. When a walsender receives a standby reply it calls SyncRepReleaseWaiters, which computes the synced write/flush/apply positions across the candidate standbys (oldest for priority, Nth-latest for quorum via SyncRepGetSyncRecPtr), advances WalSndCtl->lsn[mode], and wakes every queued backend whose waitLSN has been reached through SyncRepWakeQueue. Standbys are unaware they are synchronous; all durability policy is primary-side.
  • PostgreSQL WAL Sender/Receiver — Streaming Replication TransportThe walsender process (primary) and walreceiver process (standby) are the two endpoints of PostgreSQL's streaming replication transport. A walsender is a backend variant that speaks a small replication-command dialect: IDENTIFY_SYSTEM/TIMELINE_HISTORY for handshake, then START_REPLICATION to enter COPY-Both mode and stream WAL segments from sentPtr up to the current flush pointer. The send loop (WalSndLoop) drains WAL via XLogSendPhysical or XLogSendLogical, interleaves keepalives, and processes standby reply messages (write/flush/apply LSNs plus hot-standby feedback xmin). The walreceiver on the standby side runs WalReceiverMain: it loads libpqwalreceiver dynamically, calls IDENTIFY_SYSTEM, streams WAL into pg_wal via XLogWalRcvWrite and XLogWalRcvFlush, updates WalRcvData->flushedUpto, and wakes the startup process. State on both sides is tracked in shared memory: WalSnd[] array (walsender) and WalRcvData (walreceiver).

Server Architecture (9)

  • PostgreSQL Auxiliary Processes — bgwriter, walwriter, checkpointer, startup, and sysloggerPostgreSQL's five infrastructure processes — bgwriter, walwriter, checkpointer, startup, and syslogger — share a common initialization path via AuxiliaryProcessMainCommon and run independent event loops. bgwriter proactively flushes dirty shared buffers; walwriter batches WAL fsync for async-commit guarantees; checkpointer owns all checkpoint logic and coordinates with backends via CheckpointerShmem; startup drives WAL recovery then exits; syslogger captures stderr via a pipe and rotates log files. All five except syslogger are relaunched automatically by LaunchMissingBackgroundProcesses on every ServerLoop iteration.
  • PostgreSQL Backend Lifecycle — Fork, Initialize, Command Loop, and ExitHow a PostgreSQL backend process comes to life, initializes shared-memory membership and catalog access, runs the per-session command loop (ReadCommand → exec_simple_query / extended-query messages → start_xact_command / finish_xact_command), and exits: the four-phase arc from postmaster fork through PostgresMain's sigsetjmp error-recovery loop to proc_exit callbacks.
  • PostgreSQL Background Workers — The Dynamic Worker FrameworkHow PostgreSQL turns the multi-process postmaster into an extensible task framework: the BackgroundWorker registration struct, the split between static (RegisterBackgroundWorker, shared_preload_libraries only) and dynamic (RegisterDynamicBackgroundWorker, from any backend) registration, the lockless BackgroundWorkerSlot array in shared memory that lets the postmaster fork workers without ever taking a lock, the generation-numbered handle protocol for start/stop/terminate, the crash-and-restart lifecycle keyed on exit code, and how parallel query, logical replication apply, and third-party extensions all build on the same five-symbol API plus DSM/DSA for shared state.
  • PostgreSQL Latches, Wait Event Sets, and Inter-Process SignalsHow a PostgreSQL process sleeps and is woken: the Latch (a boolean + a wakeup) that replaces sleep/poll-the-flag loops, the WaitEventSet abstraction over epoll/kqueue/poll that multiplexes latch, socket, postmaster-death and timeout readiness in one race-free wait, the self-pipe / signalfd / SIGURG trick that lets a signal handler interrupt a blocking poll, the SIGUSR1-multiplexed ProcSignal mechanism (per-ProcNumber shared-memory slots, signal reasons, and global barriers) for inter-backend messaging, and the InterruptPending / CHECK_FOR_INTERRUPTS / ProcessInterrupts machinery that turns an asynchronous signal into a synchronous, safe cancellation or shutdown.
  • PostgreSQL Lock Manager — The Heavyweight Lock Table, Lock Modes, and Deadlock DetectionHow PostgreSQL records SQL-visible locks in a partitioned shared-memory hash of LOCK/PROCLOCK structures fronted by a per-backend LOCALLOCK cache and a fast path for weak relation locks, arbitrates the 8 lock modes via a static conflict table, and breaks deadlocks by searching a waits-for graph with soft-edge queue rearrangement.
  • PostgreSQL LWLock and Spinlock — Atomic State Word, Wait-Queue Protocol, and Platform TASHow PostgreSQL protects shared-memory data structures with two tiers of in-process locks: spinlocks (slock_t / TAS) for microsecond-scale mutual exclusion and LWLocks for read/write shared-memory access, where the LWLock state is a single atomic uint32 word encoding exclusive sentinel, shared-holder count, and three flag bits, acquisition uses a two-attempt queue protocol to close the enqueue-then-miss race, and release atomically decrements the word and conditionally wakes the proclist waiter queue via PGSemaphoreUnlock.
  • PostgreSQL Postmaster — Cluster Supervisor, Process Lifecycle, and Crash RecoveryThe postmaster is PostgreSQL's cluster supervisor: it creates shared memory once at startup, then forks a family of role-typed children (BackendType) to service connections and run background tasks. Its ServerLoop blocks on a WaitEventSet; incoming connections wake BackendStartup, SIGCHLD wakes process_pm_child_exit and PostmasterStateMachine. A child crash sets FatalError, sends SIGTERM/SIGQUIT to siblings, and drives a deterministic 11-state PMState machine through shutdown-and-restart. The PMChild pool in pmchild.c tracks every live child; CleanupBackend releases slots and triggers the state machine after each exit.
  • PostgreSQL Shared Memory & IPC — Static Segment, Dynamic Shared Memory, and the shm_mq Message LayerHow PostgreSQL processes share state: a single fixed-size SysV/mmap segment sized at postmaster startup (CalculateShmemSize / CreateSharedMemoryAndSemaphores), bump-allocated via ShmemAlloc into a shmem-index hash, extended at runtime by dynamic shared memory segments (dsm_create / dsm_attach / dsm_detach) whose platform dispatch lives in dsm_impl.c, with a lock-free ring-buffer message queue (shm_mq) layered on DSM for parallel-query tuple transport, and a dynamic slab allocator (dsa) built on top for variable-size shared data structures.
  • PostgreSQL SSI and Predicate Locking — Serializable Snapshot Isolation, SIREAD Locks, and the rw-conflict GraphPostgreSQL implements true SERIALIZABLE isolation via Serializable Snapshot Isolation (SSI): snapshot isolation runs unchanged but SIREAD predicate locks track every read, and rw-conflict edges are recorded in per-transaction SERIALIZABLEXACT structs; when a dangerous structure (Tin->Tpivot->Tout, both edges rw-conflicts with Tout committing first) is detected in OnConflict_CheckForSerializationFailure, one transaction is aborted with SQLSTATE 40001; lock granularity escalates from tuple to page to relation to cap shared-memory use, and old committed transaction state spills to an SLRU-backed serial log.

System Catalog (6)

  • PostgreSQL Cache Invalidation — sinval Message Queue, inval.c Dispatcher, and Transactional DeferralHow PostgreSQL keeps per-backend catcache and relcache entries coherent across concurrent DDL: heap_update and heap_delete call CacheInvalidateHeapTuple to register messages that are deferred until command-end (via CommandEndInvalidationMessages) and broadcast to the shared sinval ring buffer only at transaction commit (via AtEOXact_Inval); SIInsertDataEntries writes into a 4096-slot circular buffer in shared memory; each backend drains that buffer on transaction start via AcceptInvalidationMessages, which dispatches each message to LocalExecuteInvalidationMessage, and falls back to a full cache reset if the buffer overflowed; CacheRegisterSyscacheCallback and CacheRegisterRelcacheCallback let subsystems attach invalidation hooks.
  • PostgreSQL CatCache & SysCache — Per-Backend Catalog Tuple Cache, Negative Entries, and the sinval Invalidation LoopHow PostgreSQL caches system-catalog tuples in each backend: a per-process hash table of CatCTup entries (positive and negative) bucketed by key hash, a CatCList layer for partial-key multi-row searches, a two-level API where syscache.c wraps catcache.c with named integer IDs derived from MAKE_SYSCACHE declarations in pg_*.h headers, lsyscache.c as the convenience facade used everywhere in the planner and executor, and the CatCacheInvalidate/sinval loop that marks entries dead and triggers registered callbacks when any backend commits a catalog change.
  • PostgreSQL Dependency Tracking — pg_depend, CASCADE, and Deletion OrderingHow PostgreSQL records and enforces inter-object dependencies: the pg_depend (local) and pg_shdepend (cross-database) catalogs, the four user-visible dependency kinds (normal/auto/internal/extension) plus partition and auto-extension variants, recordDependencyOn and its expression-walking wrappers, and the DROP ... CASCADE engine — findDependentObjects' two-phase reverse-then-forward traversal that turns the dependency graph into a safe, deterministic deletion order, reportDependentObjects' RESTRICT gate, and the per-object deleteOneObject/doDeletion dispatch.
  • PostgreSQL Schemas & search_path — Name Resolution and the Namespace CacheHow PostgreSQL turns an unqualified SQL name into an object OID: the pg_namespace catalog that gives schemas their identity, the search_path GUC and its derivation into an ordered list of namespace OIDs (activeSearchPath), the implicit prepending of pg_temp and pg_catalog, RangeVarGetRelidExtended's lookup-then-lock retry loop, the per-object-class visibility scans (relations, types, functions, operators), the per-session temp namespace lifecycle, and the two-layer caching — the simplehash search-path cache plus the activePathGeneration counter — kept coherent by syscache invalidation callbacks on pg_namespace and pg_authid.
  • PostgreSQL Relation Cache — RelationData, Bootstrap Nailing, and Sinval-Driven RebuildHow PostgreSQL caches per-relation metadata in a backend-private OID hash: RelationData holds the pg_class row, tuple descriptor, rule/trigger/RLS sub-structures, index access info, and physical address; RelationBuildDesc assembles an entry from system catalogs; seven catalogs are nailed at startup via formrdesc or pg_internal.init to break the bootstrap recursion; sinval messages drive RelationFlushRelation, which either clears a zero-refcount entry or rebuilds an open one via a swap-in-place dance that preserves pointer stability for callers.
  • PostgreSQL System Catalogs — Schema Definition, Bootstrap Mechanics, and the Catalog Write PathHow PostgreSQL defines and bootstraps its system catalogs: CATALOG()/BKI_* macros in pg_*.h headers that feed genbki.pl to produce the postgres.bki bootstrap script; the OID-range classification that separates pinned bootstrap objects (OID < 12000) from normal objects; the distinction between per-database catalogs (pg_class, pg_attribute, …) and cluster-wide shared catalogs (pg_authid, pg_database, …); the mapped-relation mechanism (relmapper.c / pg_filenode.map) that lets bootstrap catalogs have a stable relfilenode without needing to write to pg_class; and the catalog write path — heap_create_with_catalog, AddNewRelationTuple, AddNewAttributeTuples, CatalogTupleInsert — that every DDL statement threads through.

Monitoring & Statistics (2)

  • PostgreSQL Cumulative Statistics — The PG15 Shared-Memory Stats SubsystemPostgreSQL's cumulative statistics system accumulates per-object counters (relations, functions, IO, WAL, SLRU, database, replication slots, subscriptions, backends) that drive the pg_stat_* views and autovacuum. PG15 replaced the legacy single stats-collector process — which received counts as UDP datagrams and owned the only copy of the numbers — with a shared-memory subsystem. Variable-numbered stats (per-relation, per-function) live in a DSA-backed dshash keyed by PgStat_HashKey {kind, dboid, objid}; the dshash entry (PgStatShared_HashEntry) holds only a refcount, a generation counter, a dropped flag, and a dsa_pointer to the separately allocated, variable-size stats body. Fixed-numbered kinds (checkpointer, bgwriter, archiver, IO, WAL, SLRU) live in a plain shared-memory control block. Each backend buffers updates in process-local pending entries reached through a local hashtable (pgStatEntryRefHash) of reference-counted PgStat_EntryRef handles, so the hot path never touches shared memory; pgstat_report_stat() flushes the pending list under each entry's dedicated LWLock at most once per second (forced at commit). Lifetime is reference-counted: a DROP marks the entry dropped and decrements the sentinel refcount, but the body is freed only when the last backend releases its reference, with a generation counter detecting reuse-during-release races and a gc_request_count epoch prompting backends to garbage-collect stale local refs. Drops are transactional: pgstat_drop_transactional() defers the real drop to commit and writes it into the commit/abort WAL record so replicas and crash recovery stay consistent. The checkpointer serializes the table to pg_stat/pgstat.stat at shutdown; the startup process reloads it, or discards everything after a crash.
  • PostgreSQL Wait Events and Progress ReportingHow a PostgreSQL backend advertises what it is doing right now: the single 4-byte wait_event_info word (class byte + event id) written lock-free by pgstat_report_wait_start, the table-driven wait_event_names.txt codegen that builds the per-class enums and name lookups, the per-backend PgBackendStatus slot behind pg_stat_activity (st_state, st_activity_raw, query/plan id) guarded by the st_changecount seqlock, and the command-progress channel (st_progress_command + a 20-slot st_progress_param array) that backs pg_stat_progress_vacuum / create_index and is fed across parallel workers via a PqMsg_Progress message.

Client Protocol (3)

  • PostgreSQL Authentication — pg_hba, SCRAM-SHA-256, and SASLHow a PostgreSQL backend authenticates a client after the startup packet: pg_hba.conf-driven method selection in ClientAuthentication, the AUTH_REQ_* request/response codes on top of the wire protocol, the SCRAM-SHA-256 SASL exchange (auth-sasl.c driver + auth-scram.c mechanism) with channel binding, the md5 and plaintext password verifiers in crypt.c, the mock-authentication anti-enumeration trick, and the ClientAuthentication_hook extension point.
  • PostgreSQL Transport Security — TLS (OpenSSL) and GSSAPI EncryptionHow PostgreSQL's backend secures the byte stream beneath the FE/BE protocol: the secure_read/secure_write indirection that dispatches between raw sockets, OpenSSL TLS, and GSSAPI; the sslmode/SSLRequest and GSSENCRequest single-byte negotiation that precedes authentication; the OpenSSL handshake via SSL_accept over a custom BIO that wraps secure_raw_read/secure_raw_write; client-certificate extraction (CN/DN, embedded-NULL rejection) and channel-binding hashes; and the length-prefixed gss_wrap/gss_unwrap packet framing of GSSAPI transport encryption.
  • PostgreSQL Wire Protocol — FE/BE Framing, Startup Handshake, and the Simple/Extended Query LoopsHow PostgreSQL's backend speaks to clients: the length-prefixed binary framing of protocol version 3, the SSL/GSSAPI negotiation and startup-packet handshake that precedes authentication, the simple-Query and extended-Query (Parse/Bind/Execute) message loops handled by PostgresMain, and the pqcomm/pqformat send-buffer pair that decouples message construction from socket writes.

Extensibility (6)

  • PostgreSQL CustomScan — The Provider API for Pluggable Plan NodesPostgreSQL lets an extension inject an entirely new physical operator into a plan tree without forking the backend, via the CustomScan provider API. The mechanism is three small method-vtable structs threaded through the three phases of query processing: CustomPathMethods carries one callback (PlanCustomPath) so a path injected during planning can later lower itself to a plan node; CustomScanMethods carries one callback (CreateCustomScanState) so a CustomScan plan node can instantiate its executor state; and CustomExecMethods carries the runtime vtable (BeginCustomScan / ExecCustomScan / EndCustomScan / ReScanCustomScan required, plus optional mark-restore, parallel-DSM, shutdown, and EXPLAIN hooks). A provider gets control by setting the global set_rel_pathlist_hook or set_join_pathlist_hook and calling add_path() with a CustomPath; the core planner is otherwise oblivious. Because plan trees are copied and serialized, the methods field is a pointer to a static vtable that is never copied — instead the provider registers the vtable by a string name (RegisterCustomScanMethods) into a process-local hash, and copyfuncs/outfuncs/readfuncs re-resolve it by name (GetCustomScanMethods). The same name-keyed registry generalizes to the extensible-node framework (RegisterExtensibleNodeMethods / T_ExtensibleNode), letting a provider stash arbitrary private node types in custom_private and have copyObject/nodeToString/stringToNode round-trip them through provider-supplied nodeCopy/nodeEqual/nodeOut/nodeRead callbacks. nodeCustom.c is the thin executor shim that dispatches each ScanState lifecycle event to the provider vtable, defaulting a virtual slot, opening the scan relation when scanrelid>0, and erroring out cleanly when an optional capability (MarkPos) is invoked but unimplemented. This doc covers the three method structs, the planner-hook injection point, create_customscan_plan lowering, the nodeCustom executor shim, and the extensible-node registry; it defers the surrounding executor framework to postgres-executor.md, path generation to postgres-planner-overview.md, and the FDW sibling mechanism to postgres-fdw.md.
  • PostgreSQL Extensions — CREATE EXTENSION, Control Files, and Versioned ScriptsAn extension is a named, versioned bundle of SQL objects that PostgreSQL tracks as a single dependable unit. On disk it is a control file (extname.control, parsed by the postgresql.conf parser) plus a directory of versioned SQL scripts named extname--version.sql and extname--from--to.sql. CREATE EXTENSION (CreateExtension -> CreateExtensionInternal) reads the primary control file, picks a version, computes an install/upgrade script sequence by running Dijkstra's shortest-path over the version graph (find_install_path / find_update_path), inserts one pg_extension catalog row (InsertExtensionTuple), then runs each script under execute_extension_script with creating_extension = true so that every object the script creates auto-records a DEPENDENCY_EXTENSION pg_depend edge back to the extension. ALTER EXTENSION UPDATE (ExecAlterExtensionStmt -> ApplyExtensionUpdates) walks the same graph from the installed version. Relocatable extensions support ALTER EXTENSION SET SCHEMA (AlterExtensionNamespace) and @extschema@ substitution; trusted extensions (extension_is_trusted) let a non-superuser with database CREATE privilege install them by transiently switching to the bootstrap superuser. DROP EXTENSION is almost free: RemoveExtensionById deletes the catalog row and the dependency machinery cascades to every recorded member object.
  • PostgreSQL Foreign Data Wrappers — The In-Core FDW MechanismA foreign data wrapper (FDW) is an extension that makes an external data source queryable as a regular table. PostgreSQL implements SQL/MED entirely in the core: the catalog tier (pg_foreign_data_wrapper, pg_foreign_server, pg_user_mapping, pg_foreign_table) records the wrapper, its connection endpoints, and per-user credentials; the DDL tier in foreigncmds.c (CreateForeignDataWrapper, CreateForeignServer, CreateUserMapping, CreateForeignTable, ImportForeignSchema) maintains those catalogs; and the dispatch tier in foreign.c (GetFdwRoutine, GetFdwRoutineForRelation) calls each wrapper's C handler function, which returns an FdwRoutine struct of callback pointers. The planner reaches the FDW through three mandatory callbacks — GetForeignRelSize for size estimates, GetForeignPaths to enumerate access paths, GetForeignPlan to build the ForeignScan plan node — driven from set_foreign_size / set_foreign_pathlist / create_foreignscan_plan. The executor drives the scan through BeginForeignScan / IterateForeignScan / EndForeignScan from nodeForeignscan.c, and writes through the optional modify callbacks (PlanForeignModify, BeginForeignModify, ExecForeignInsert/Update/Delete, EndForeignModify) from nodeModifyTable.c. Optional callbacks add remote join/aggregate push-down, direct modify, ANALYZE, IMPORT FOREIGN SCHEMA, TRUNCATE, parallelism, and asynchronous execution. postgres_fdw is a contrib module and out of scope here; this doc covers only the in-core machinery every wrapper plugs into.
  • PostgreSQL Hooks — The Function-Pointer Extension PointsA PostgreSQL hook is a global function-pointer variable, NULL by default, that the core wraps around a stable entry point so a loadable module can intercept it without forking the server. The pattern is uniform: each hookable operation Foo() splits into a thin dispatcher Foo() that tests `if (Foo_hook) (*Foo_hook)(...) else standard_Foo(...)`, plus a standard_Foo() carrying the real logic. A shared library's _PG_init() (run by load_external_function/internal_load_library when the .so is loaded, after the PG_MODULE_MAGIC ABI check) installs a hook by the save-and-chain convention: `prev = the_hook; the_hook = my_fn;` and the module's function calls `prev ? prev(...) : standard_Foo(...)` so multiple modules stack. Query-path hooks (planner_hook, ExecutorStart/Run/Finish/End_hook, ExecutorCheckPerms_hook, ProcessUtility_hook) are installable from any preload timing. The two shared-memory hooks are phase-gated: shmem_request_hook fires inside process_shmem_requests() (only valid caller of RequestAddinShmemSpace), and shmem_startup_hook fires at the tail of CreateSharedMemoryAndSemaphores() once the segment exists. ClientAuthentication_hook fires after the auth verdict in ClientAuthentication(). All hook variables are PGDLLIMPORT-exported so out-of-tree modules can bind them.
  • PostgreSQL PL/pgSQL — Handler, Compilation to a PLpgSQL_function AST, and the Tree-Walking Statement ExecutorPL/pgSQL is PostgreSQL's default procedural language, implemented as a loadable handler that plugs into the function-manager / validator contract: CREATE FUNCTION ... LANGUAGE plpgsql stores the body text in pg_proc.prosrc, and at first call plpgsql_call_handler compiles it. Compilation (pl_comp.c) is a one-shot Bison parse that walks the source into a PLpgSQL_function — a tree of PLpgSQL_stmt nodes plus a flat datums[] array of variables (PLpgSQL_var/row/rec/recfield); the compiled function is cached on FmgrInfo.fn_extra via funccache.c so subsequent calls skip parsing. Crucially PL/pgSQL never plans SQL itself: every embedded query becomes a PLpgSQL_expr that is handed to SPI (SPI_prepare_extended / SPI_execute_plan_with_paramlist) with a parser-setup hook that resolves identifiers to PL/pgSQL variables via PARAM_EXTERN params; local variable values are pushed to the executor lazily through a ParamListInfo callback (plpgsql_param_eval_var). Execution (pl_exec.c) is a recursive tree-walk: plpgsql_exec_function copies the datums, sets up an error-context callback and an eval ExprContext, then exec_stmt_block / exec_stmts dispatch on cmd_type to exec_stmt_* handlers that return PLPGSQL_RC_OK/EXIT/RETURN/CONTINUE. Scalar expressions that are a bare table-less SELECT of one column take a 'simple expression' fast path (exec_eval_simple_expr) that bypasses the SPI executor and runs the ExprState directly against the eval econtext, with cached-plan revalidation per transaction. BEGIN ... EXCEPTION blocks are implemented as internal subtransactions: exec_stmt_block wraps the body in BeginInternalSubTransaction + PG_TRY, and on error rolls the subxact back, copies the ErrorData, and scans the handler list for a SQLSTATE-condition match. This doc covers the handler/validator/inline trio, pl_comp's AST build and namespace model, the datum/variable model, the exec_stmt_* dispatch loop, the SPI-backed and simple-expression SQL paths, and exception subtransactions; it defers the SPI API surface to postgres-spi.md and the V1 function-call ABI to postgres-fmgr.md.
  • PostgreSQL SPI — The Server Programming Interface: How PLs and Extensions Run SQL From Inside the BackendSPI (Server Programming Interface) is PostgreSQL's in-backend C API for running SQL from code that is *already* executing inside a backend — every procedural-language handler (PL/pgSQL, PL/Python, PL/Perl), most data-modifying extensions, and many utility commands reach the parser/planner/executor through it rather than through the wire protocol. SPI's core abstraction is a *connection stack*: SPI_connect pushes an `_SPI_connection` frame that owns two memory contexts (procCxt for results that outlive one statement, execCxt that is reset after every call) and snapshots three API globals (SPI_processed, SPI_tuptable, SPI_result) so a nested call cannot clobber its caller's results. SPI_execute runs a one-shot plan (parse→rewrite→plan→execute every time); SPI_prepare builds a reusable CachedPlanSource that SPI_keepplan promotes into CacheMemoryContext for cross-call reuse, and SPI_execute_plan re-validates and runs it. Result rows land in an SPITupleTable — a palloc'd array of HeapTuples in its own child context — filled by the DestSPI receiver (spi_dest_startup + spi_printtup). Cursors (SPI_cursor_open) hand the plan to the Portal machinery so a PL can fetch incrementally. The subtle core is snapshot + transaction management: _SPI_execute_plan implements four distinct snapshot regimes (read-only vs read-write × caller-supplied vs SPI-managed) and, in non-atomic contexts (procedures, DO blocks), SPI_commit/SPI_rollback can terminate and restart the surrounding transaction while holding pinned portals. This doc covers the connection stack, the prepare/execute/cursor paths, SPITupleTable, the memory-context + snapshot model, and AtEOXact cleanup; it defers PL/pgSQL's own state machine to postgres-plpgsql.md, the executor proper to postgres-executor.md, and portal/prepared-statement lifecycle to postgres-portals-prepared.md.

Internationalization & Text (3)

  • PostgreSQL Collation Providers — libc, ICU, and the Builtin ProviderHow PostgreSQL abstracts collation behavior behind a single pg_locale_t handle dispatched by a one-character collprovider code: the libc provider (newlocale/strcoll_l/strxfrm_l, lc_collate/lc_ctype fixed at CREATE DATABASE), the ICU provider (UCollator, BCP-47 language tags, the only provider that supports nondeterministic collations), and the builtin provider (C / C.UTF-8 / PG_UNICODE_FAST, version-stable Unicode tables compiled into the server). Covers deterministic vs. nondeterministic equality, the collversion drift-detection mechanism that warns when the OS collation library changes underneath an index, and the per-backend collation cache.
  • PostgreSQL Character Set Encoding — Server/Client Encodings and ConversionHow PostgreSQL represents and converts text between character set encodings: the per-database fixed server encoding versus the per-session client encoding, the conversion framework that wires client↔server bytes through pg_do_encoding_conversion and the pg_conversion-registered procs (with UTF-8 as the universal pivot via UtfToLocal/LocalToUtf radix trees), the per-encoding pg_wchar_table of mblen/verify/char↔wchar function pointers shared with frontend tools in src/common, UTF-8 length decoding and the shift-based DFA validator, and the verify-on-ingress / trust-on-egress discipline that governs how identifiers and string literals enter the server encoding.
  • PostgreSQL Full-Text Search — tsvector, tsquery, Dictionaries, and GINHow PostgreSQL turns documents into searchable, ranked text: the text parser that splits raw input into typed tokens, the per-token-type chain of configurable dictionaries (simple/stopword, synonym, ispell, thesaurus) that normalizes tokens into lexemes, the tsvector (sorted lexeme + position store) and tsquery (operator tree with AND/OR/NOT/phrase) datatypes built by to_tsvector / to_tsquery, the ternary TS_execute matcher behind the @@ operator with its phrase-distance logic, the ts_rank / ts_rank_cd cover-density ranking functions, and the GIN inverted-index integration (gin_extract_tsvector / gin_extract_tsquery / gin_tsquery_consistent) that makes @@ index-accelerated with lossy recheck.

Utilities (6)

  • PostgreSQL Cluster Bootstrap — initdb, the Bootstrap Backend, and genbkiHow a PostgreSQL data directory is born: genbki.pl turns the pg_*.h catalog headers and pg_*.dat data files into a flat postgres.bki script at build time; initdb lays out the on-disk directory tree, writes PG_VERSION and config files, and runs a single-process backend in --boot mode whose tiny BKI interpreter (create/open/insert/declare index/build indices) materialises the bootstrap catalogs heap-by-heap; then a standalone non-bootstrap backend replays system_constraints.sql, system_views.sql, pg_*.sql and CREATE DATABASE to clone template1 into template0 and postgres.
  • pg_basebackup — Driving BASE_BACKUP, Parallel WAL Streaming, Output Formats, and the pg_receivewal / pg_recvlogical FamilyHow pg_basebackup builds and issues a BASE_BACKUP replication command over a libpq replication connection, forks a second connection to stream WAL in parallel through ReceiveXlogStream, and shapes the COPY-streamed archive into plain or tar output with optional client- or server-side gzip/lz4/zstd compression via the astreamer pipeline; plus how its siblings pg_receivewal (physical WAL archiving with FindStreamingStart resume logic) and pg_recvlogical (logical decoding consumer) reuse the shared streamutil connection and slot helpers.
  • PostgreSQL pg_ctl / pg_controldata — Server Lifecycle Control and Cluster State Inspectionpg_ctl is the command-line tool that starts, stops, restarts, reloads, promotes, and monitors a PostgreSQL server by forking the postgres binary and sending signals via the postmaster.pid file. pg_controldata is a read-only inspector of the binary control file ($PGDATA/global/pg_control), which holds ControlFileData — a 512-byte atomic-write structure carrying the cluster's DBState, last checkpoint LSN, timeline IDs, WAL-level settings, and compile-time constants. Together the two tools expose the entire server lifecycle and cluster health without requiring a live SQL connection.
  • PostgreSQL pg_dump / pg_restore — Catalog-Driven Logical BackupHow PostgreSQL produces a logical (SQL-reconstructable) backup: pg_dump reads the live catalog over a regular SQL connection inside a serializable snapshot, materializes every object into an in-memory DumpableObject graph keyed by CatalogId and DumpId, topologically sorts that graph so no statement forward-references an undefined object, and emits the result through a pluggable archive format (plain / custom / directory / tar). The archive's table of contents (TOC) carries each object's DDL, COPY data hook, and dependency list, which pg_restore replays — optionally in parallel across worker processes that honor the dependency edges and the pre-data / data / post-data section split. pg_dumpall wraps pg_dump per database and adds cluster-global objects (roles, role memberships, tablespaces) that pg_dump cannot see.
  • PostgreSQL pg_upgrade — In-Place Major-Version UpgradeHow pg_upgrade migrates a PostgreSQL cluster across major versions without a full dump-reload: it starts both old and new postmasters in binary-upgrade mode, dumps only the schema via pg_dump, restores that schema into the new cluster, then transfers user relation files by copy, hard-link, clone, copy_file_range, or directory-swap — preserving OIDs and relfilenumbers so toast pointers and enum values remain valid — after a battery of pre-flight compatibility checks ensure the clusters can share on-disk data.
  • PostgreSQL pg_waldump — WAL Decoding and Inspection Utilitypg_waldump is a read-only, frontend-mode binary that decodes WAL segments from disk into human-readable text. It wires three callbacks — WALDumpReadPage, WALDumpOpenSegment, WALDumpCloseSegment — into an XLogReaderState and drives XLogReadRecord in a tight loop, optionally filtering by rmgr, xid, relation, block, fork, or full-page-write presence. For each matching record it calls the rmgr's rm_desc / rm_identify callbacks to produce the display line; with --stats it accumulates XLogStats instead, printing a per-rmgr (and optionally per-record-type) count/size table at the end. --follow makes the loop poll after end-of-WAL; --save-fullpage extracts and decompresses full-page images to disk. The tool shares its XLogReader infrastructure with the server but runs entirely outside it, with no shared memory and no buffer manager. Defers WAL record anatomy and rmgr dispatch to postgres-wal-records-rmgr.md; defers WAL insertion and LSN mechanics to postgres-xlog-wal.md.

Version Evolution (7)

  • PostgreSQL I/O — From Synchronous Buffered Reads to Asynchronous I/OPostgreSQL's read path evolved from one-block-at-a-time synchronous buffered reads (a backend that missed shared_buffers blocked in pread until the kernel returned the page) toward overlapping storage latency with useful work. The arc: 8.3-era ReadBuffer/pread (sync) -> 9.0 PrefetchBuffer via posix_fadvise(WILLNEED), first wired into bitmap heap scans behind effective_io_concurrency -> 15 WAL prefetch during recovery (xlogprefetcher.c issues fadvise hints for blocks an upcoming redo record will touch, gated by recovery_prefetch + maintenance_io_concurrency) -> 17 the read_stream abstraction (read_stream.c), a producer/consumer helper that turns a block-number callback into coalesced vectored reads with adaptive look-ahead, retrofitted onto seq scan, ANALYZE, and VACUUM -> 18 the first-class AIO subsystem (aio.c) with a pluggable IoMethodOps vtable (sync / worker / io_uring), PgAioHandle state machines in shared memory, B_IO_WORKER processes, and read_stream re-platformed onto true async completion instead of fadvise hints. The throughline: PostgreSQL never owned the I/O concurrency primitive until PG18 — before that it borrowed the kernel's readahead via advisory hints; PG18 finally issues, tracks, and completes real asynchronous reads itself.
  • PostgreSQL Parallel Query — From No Parallelism to Parallel Joins, Aggregates, and MaintenancePostgreSQL had no intra-query parallelism until 9.6, when it learned to fork background workers under a Gather node that drains per-worker tuple queues over dynamic shared memory; PG10 made the partial subtree below Gather rich — parallel hash/merge/nestloop joins, partial+finalize aggregation, and an order-preserving GatherMerge; PG11 pushed parallelism into DDL with parallel CREATE INDEX (parallel btree sort) and into the planner with a parallel-aware Append that hands partitions out to workers; PG13 extended it to maintenance with parallel VACUUM that vacuums indexes in parallel; PG14–18 are incremental — finer parallel-safe markings, leader-participation tuning, and parallelizing more node and utility paths — converging on the current Gather/GatherMerge + ParallelContext design documented in postgres-parallel-query.md.
  • PostgreSQL Partitioning — From Table Inheritance to Declarative PartitioningPostgreSQL partitioning evolved from a do-it-yourself pattern — child tables bolted onto a parent with table inheritance, CHECK constraints, INSERT triggers, and constraint_exclusion pruning the parent's children one CHECK-implication proof at a time — into first-class declarative partitioning: PG10 added PARTITION BY syntax, a catalog-backed PartitionBoundInfo, and C-level tuple routing; PG11 added hash partitioning, a DEFAULT partition, partitionwise join/aggregate, UPDATE-driven row movement, and runtime pruning of Append subplans via PARAM_EXEC values; PG12 rewrote plan-time pruning to be far faster and added ATTACH/DETACH ... CONCURRENTLY plus foreign-key and index inheritance; PG13–16 polished it with BEFORE-row triggers, logical-replication publication of partition roots, and a long list of planner refinements. The arc is a migration of the partition-resolution logic out of user-written SQL (triggers + CHECK proofs) and into binary-searchable catalog metadata read by dedicated partitioning/ C modules.
  • PostgreSQL Storage — From a Hardwired Heap to the Pluggable Table Access Method APIFor most of its history PostgreSQL had exactly one table storage model — the no-overwrite MVCC heap — wired directly into the executor, the planner, and the DDL layer. PostgreSQL 12 broke that coupling by introducing the Table Access Method API: TableAmRoutine, a ~40-slot function-pointer vtable stored on each relation's relcache entry, with heap recast as just one access method (heapam_methods) registered through pg_am. The same release generalized the tuple slot into the TupleTableSlotOps abstraction so the executor could carry tuples in any AM-specific in-memory form. PG15 added ALTER TABLE ... SET ACCESS METHOD to rewrite a table into a different AM, and PG17 extended SET ACCESS METHOD to materialized views and made default_table_access_method govern them. At REL_18 heap remains the only in-tree AM, but the API has become the extension point for an ecosystem of columnar and analytic AMs (Citus columnar, Hydra, ParadeDB) and for the stalled zheap in-place engine.
  • PostgreSQL Replication — From WAL File Shipping to Logical Pub/SubPostgreSQL replication evolved from coarse file-based log shipping (8.x warm standby) to byte-level streaming with read-only standbys (9.0 hot standby), then to cascading and synchronous topologies (9.1-9.2), to logical decoding and replication slots that turn the physical WAL into a logical change stream (9.4), to a fully built-in publish/subscribe logical replication system (10), and finally to selective and resilient replication — row filters and column lists (15), two-phase decoding (14) and two-phase subscriptions (15), and failover slots with standby slot synchronization (17).
  • PostgreSQL Statistics — From a Collector Process to Extended Stats and Shared-Memory Cumulative StatsPostgreSQL's two statistics subsystems — the planner statistics that feed cardinality estimation and the cumulative activity counters that feed the pg_stat_* views and autovacuum — evolved on separate tracks toward the same destination: richer data, cheaper access, fewer single points of failure. Planner stats began as per-column rows in pg_statistic that assume column independence; PG10 added CREATE STATISTICS with functional dependencies and multivariate n-distinct to capture cross-column correlation, PG12 added multivariate MCV lists, and PG14 extended the machinery to arbitrary expressions. Cumulative stats began as a single UDP-fed collector process that owned the only in-memory copy of every counter — lossy under load, a throughput ceiling, and forcing each pg_stat read to deserialize a whole file; PG15 replaced it with a DSA/dshash shared-memory subsystem where backends buffer pending deltas locally and flush under per-entry LWLocks. PG16 added pg_stat_io as a new fixed stat kind for byte/op-level I/O accounting, and PG18 added per-backend I/O stats and folded WAL and byte-level columns into pg_stat_io. This doc traces the arc release by release; the mechanism for the current design lives in the cumulative-stats and extended-statistics module docs.
  • PostgreSQL Vacuum & Visibility — From Plain Lazy Vacuum to HOT, Freeze Maps, and ParallelismThe arc of PostgreSQL's dead-tuple reclamation: a single-threaded full-table lazy vacuum learned to avoid creating garbage in the first place (HOT, 8.3), then to skip clean pages (visibility map, 8.4), then to skip already-frozen pages so anti-wraparound scans stop re-reading static data (all-frozen bit / freeze map, 9.6), then to fan index cleanup out across workers (parallel index vacuum, 13), then to refuse to let wraparound win by bypassing index vacuuming under pressure (failsafe, 14), and finally to stop wasting memory on a flat dead-TID array by switching to a compressed radix store (TidStore, 17).

Other (1)