
PostgreSQL Storage Engine — Section Overview


The storage-engine section documents how PostgreSQL turns a stream of bytes on disk into the rows and index entries the executor reads, and back. It is the band that sits below the query pipeline (the executor calls into it through access-method routines) and above the operating system (it ends at read()/write() on segment files). Concretely it spans src/backend/storage/{smgr,buffer,page,freespace,large_object,aio} and src/backend/access/{table,heap,common,index,nbtree,gin,gist,spgist,brin,hash,tablesample}.

Read bottom-up, the stack is five strata:

  1. smgr / md — the storage-manager indirection that maps a relation’s logical blocks onto physical file segments (md, the “magnetic disk” manager, is the layer that splits a relation into 1 GB files).
  2. buffer pool — the shared, fixed-size page cache that every read and write funnels through. It is the single choke point between memory and disk and the place the WAL-before-flush rule is enforced. Underneath it, the PG18 async-I/O subsystem (storage/aio/) turns block reads into issued-and-awaited operations via a read-stream API.
  3. page layout — the slotted-page format (PageHeader, ItemId line pointers, ItemPointer/TID addressing) shared by every access method, plus the data-checksum that guards each page.
  4. pluggable access methods — the structural highlight. A table is reached through a TableAmRoutine and an index through an IndexAmRoutine; the executor never hard-codes “heap”. heap is merely the default table AM; nbtree/gin/gist/spgist/brin/hash are the in-core index AMs.
  5. on-page records and their satellites — the heap AM’s no-overwrite tuple format and HOT pruning, TOAST for over-sized attributes, the visibility map and free-space map that accelerate scans and inserts, and the sequence and large-object stores that ride on the heap.
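
To make stratum 3 concrete, here is a minimal Python sketch of a slotted page. It is a toy model, not PostgreSQL code: the class and method names are hypothetical, the 24-byte header and 4-byte line pointer are assumed sizes for a default build, and alignment padding and the special-space region are ignored.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192    # default block size (assumed; it is a build-time option)
HEADER_SIZE = 24    # assumed size of the fixed page header
ITEMID_SIZE = 4     # assumed size of one line pointer


@dataclass
class SlottedPage:
    """Toy slotted page: line pointers grow forward, tuple data grows backward."""
    lower: int = HEADER_SIZE    # end of the line-pointer array
    upper: int = PAGE_SIZE      # start of the tuple area
    slots: list = field(default_factory=list)   # (offset, length) per line pointer
    data: bytearray = field(default_factory=lambda: bytearray(PAGE_SIZE))

    def free_space(self) -> int:
        return self.upper - self.lower

    def add_item(self, payload: bytes) -> int:
        """Store payload and return its 1-based offset number."""
        if len(payload) + ITEMID_SIZE > self.free_space():
            raise ValueError("page full")
        self.upper -= len(payload)
        self.data[self.upper:self.upper + len(payload)] = payload
        self.slots.append((self.upper, len(payload)))
        self.lower += ITEMID_SIZE
        return len(self.slots)

    def get_item(self, offnum: int) -> bytes:
        off, length = self.slots[offnum - 1]
        return bytes(self.data[off:off + length])


page = SlottedPage()
tid = (0, page.add_item(b"first row"))   # TID = (block number, offset number)
page.add_item(b"second row")
```

The point of the indirection is that a TID names a line pointer, not a byte offset, so tuple data can be moved within the page without invalidating references to it.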

The sharp boundaries. This section deliberately stops at two seams and hands off:

  • It does not own MVCC semantics, WAL, vacuum, or recovery. The heap is no-overwrite — an UPDATE writes a new tuple version and leaves the old one — but why a version is visible, how its commit state is recorded, what WAL record describes the change, and who reclaims the dead versions all belong to the txn-recovery subcategory (postgres-overview-txn-recovery.md). This section describes the on-page realisation (where the xmin/xmax/infomask bits live, how HOT chains are pruned, how the visibility map’s two bits per page are set); it defers the protocol to txn-recovery. The visibility-map and free-space-map docs live here because they are physical heap satellites; the vacuum that drives them lives in txn-recovery.
  • It does not own concurrency primitives. Pages are protected by buffer content-locks and pins, but the LWLock/pin machinery itself, and the heavyweight relation locks an AM acquires, belong to server-architecture (postgres-overview-server-architecture.md). This section names the latch discipline; it does not re-describe the lock manager.
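
The no-overwrite behaviour named in the first bullet can be sketched in a few lines of Python. This is an illustrative model under stated assumptions: `TupleVersion` and `heap_update` are hypothetical names, `next_tid` stands in for the forward link that the real tuple header keeps in its `t_ctid` field, and the infomask bits and all visibility rules are left out because they belong to txn-recovery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    xmin: int                        # transaction that created this version
    payload: str
    xmax: int = 0                    # transaction that superseded it; 0 = none
    next_tid: Optional[int] = None   # slot of the newer version, if any


def heap_update(page: list, old_slot: int, new_payload: str, xid: int) -> int:
    """Append a new version and stamp the old one; nothing is overwritten."""
    page.append(TupleVersion(xmin=xid, payload=new_payload))
    new_slot = len(page) - 1
    old = page[old_slot]
    old.xmax = xid
    old.next_tid = new_slot
    return new_slot


page = [TupleVersion(xmin=100, payload="balance=10")]
new_slot = heap_update(page, 0, "balance=25", xid=101)
```

After the update both versions sit on the page. Deciding which one a given snapshot may see, and when the old one can be reclaimed, is exactly the part this section defers.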

Figure 1 — The storage-engine stack, from the executor down to data files

Figure 1 — The storage-engine layered stack. The executor enters through the table/index AM routines (never a hard-coded heap); table sampling is a thin AM-aware wrapper over the table scan. The heap AM owns its on-page tuple format and a constellation of satellites (TOAST, visibility map, free-space map, sequences, large objects). Every AM — heap and all six index AMs — addresses pages through one shared slotted-page format guarded by the optional data-checksum, and every page read or written passes through the shared buffer pool, which sits on the PG18 async-I/O layer and the smgr/md relation-to-segment mapping before reaching OS files.

Three structural facts in the diagram are load-bearing:

  • The AM indirection is the spine of this section. tableam.h defines a TableAmRoutine struct-of-callbacks (scan_begin, tuple_insert, tuple_update, index_fetch_tuple, …) and amapi.h defines an IndexAmRoutine; the executor and genam/indexam dispatch through these pointers. heap fills in a TableAmRoutine and is wired as the default, but it is not privileged in the call path — this is what makes pluggable storage and the in-core index AMs all instances of one pattern. The postgres-table-am.md and postgres-index-am.md docs own the two routers; every concrete AM doc is “what this AM puts behind the callbacks.”

  • One page format underlies every AM. Whether the bytes hold a heap tuple or a B-tree internal node, the block is a slotted page: a fixed PageHeader, a forward-growing ItemId (line-pointer) array, and a backward-growing tuple area, addressed by ItemPointer (block number + offset = TID). postgres-page-layout.md owns this once; the per-AM docs describe only what they store in the slots. Data checksums (postgres-data-checksums.md) are computed over that page image at the buffer-flush boundary.

  • The buffer pool is the universal choke point. No AM touches a file directly; it pins a buffer, takes a content lock, mutates the page, marks it dirty, and unpins. The buffer manager (postgres-buffer-manager.md) is where the WAL-before-flush rule is enforced — a dirty page may not be written until the WAL record describing the change is durable (the seam to the txn-recovery subcategory). Below it, PG18’s async I/O (postgres-aio.md) and the read-stream API turn synchronous block reads into batched, issued-and-awaited operations; smgr/md (postgres-smgr-md.md) maps logical block numbers onto 1 GB file segments.
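
The pin, mutate, mark-dirty, unpin sequence and the WAL-before-flush rule from the last bullet can be modelled as follows. This is a sketch with hypothetical classes (`Wal`, `Buffer`) and integer LSNs; content locks are only noted in a comment, and the "disk" is a plain dictionary.

```python
class Wal:
    """Toy write-ahead log: LSNs are just increasing integers."""

    def __init__(self):
        self.insert_lsn = 0    # last record appended in memory
        self.flushed_lsn = 0   # last record known to be durable

    def append(self) -> int:
        self.insert_lsn += 1
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))


class Buffer:
    def __init__(self):
        self.pins = 0
        self.dirty = False
        self.page_lsn = 0      # LSN of the last WAL record that touched the page
        self.page = {}


def modify(buf, wal, key, value):
    buf.pins += 1                      # pin: the buffer cannot be evicted
    try:
        # a real backend would hold the exclusive content lock here
        buf.page[key] = value
        buf.page_lsn = wal.append()    # the WAL record describing the change
        buf.dirty = True
    finally:
        buf.pins -= 1                  # unpin


def write_out(buf, wal, disk):
    """Write a dirty buffer, forcing WAL to disk first if needed."""
    if not buf.dirty:
        return
    if wal.flushed_lsn < buf.page_lsn:
        wal.flush(buf.page_lsn)        # WAL-before-flush
    disk.update(buf.page)
    buf.dirty = False


wal, buf, disk = Wal(), Buffer(), {}
modify(buf, wal, "k", "v")
flushed_before = wal.flushed_lsn
write_out(buf, wal, disk)
```

Stamping the page with the LSN of its latest change is what lets the writer check the rule locally: it only needs to compare one number against the log's flushed position.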

The 20 docs in this section are not equally central, and alphabetical order buries the spine. Read the most cross-referenced docs first: establish the shared floor and the AM indirection before any individual AM.

  1. postgres-page-layout.md — Read first. It defines the vocabulary (PageHeader, ItemId, ItemPointer/TID, slotted layout) that every other doc in the section reuses, and it has no upward dependency. The line pointer and the page-checksum idea reappear in every AM.
  2. postgres-buffer-manager.md — Read second. This is where most of the storage-engine engineering effort lives: BufferDesc array, the clock-sweep victim selection, pinning and content-locks, and the WAL-before-flush enforcement that ties the section to txn-recovery. Every layer above touches the buffer pool.
  3. postgres-smgr-md.md — Read third; the disk-facing complement of the buffer manager. Short and self-contained: how a relation’s blocks map to base/<db>/<relfilenode> files segmented at 1 GB, and how the smgr indirection allows non-md storage managers.
  4. postgres-table-am.md — Read fourth. The TableAmRoutine router. Understanding this struct is what lets every later “heap does X” claim be read as “the default AM does X.”
  5. postgres-heap-am.md — Read fifth. The default table AM: the no-overwrite tuple format (HeapTupleHeader, xmin/xmax/infomask), the insert/update/delete paths, and HOT pruning (README.HOT). This is the densest doc; budget time.
  6. postgres-index-am.md — Read sixth. The IndexAmRoutine router and the genam/indexam generic scan loop, so the six concrete index-AM docs below can each be read as “what this AM puts behind the callbacks.”
  7. postgres-nbtree.md — Read seventh. The B-tree is the default and by far the most-used index AM; it is also the cleanest illustration of the index-AM contract (Lehman-Yao high keys and right-links, page splits, and a first-class README). Read it before the specialised AMs.
  8. postgres-visibility-map.md + postgres-free-space-map.md — Read together after the heap AM; they are the heap’s two physical satellites (the VM gates index-only scans and records which pages are all-frozen; the FSM steers inserts).
  9. postgres-toast.md — Read when over-sized attributes matter; how a wide value is compressed and/or pushed to an out-of-line TOAST relation and detoasted on read.
  10. postgres-aio.md — Read when the I/O path matters; the PG18 async-I/O subsystem and read-stream API beneath the buffer manager. Greenfield code with its own README, largely orthogonal to the AM layer above.
  11. postgres-gin.md, postgres-gist.md, postgres-spgist.md, postgres-brin.md, postgres-hash-index.md — Skim per need. Each is a specialised index AM behind the same IndexAmRoutine; read the one your workload uses.
  12. postgres-table-sampling.md, postgres-sequences.md, postgres-large-objects.md, postgres-data-checksums.md — Skim per need. Each is a focused side-channel that composes on the floor without changing it.

A reader who works through 1–7 in order has the full “how a row or index entry gets read and written” mental model. The remaining docs each fill in a satellite or a specialised AM that the core path either piggybacks on or dispatches to.
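
As a small worked example of the block-to-segment mapping described in item 3, the sketch below assumes the default 8 KB block size and 1 GB segments. The function name `locate_block` and the relfilenode value `16384` are hypothetical.

```python
BLCKSZ = 8192                       # default block size in bytes (assumed)
RELSEG_SIZE = (1 << 30) // BLCKSZ   # blocks per 1 GB segment


def locate_block(relfilenode: str, blockno: int):
    """Return (segment file name, byte offset within that file)."""
    segno, block_in_seg = divmod(blockno, RELSEG_SIZE)
    # The first segment has no suffix; later ones are named <relfilenode>.<segno>.
    filename = relfilenode if segno == 0 else f"{relfilenode}.{segno}"
    return filename, block_in_seg * BLCKSZ


first = locate_block("16384", 0)
last_in_seg0 = locate_block("16384", RELSEG_SIZE - 1)
first_in_seg1 = locate_block("16384", RELSEG_SIZE)
```

With these defaults a segment holds 131072 blocks, so block 131072 is the first block of the file with the `.1` suffix.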

These are forward references; some module docs are not yet written. The summaries are predictive — what each doc will cover.

| Doc | One-line summary |
| --- | --- |
| postgres-page-layout.md | Slotted-page format shared by every AM: PageHeader, forward-growing ItemId line-pointer array, backward-growing tuple area, ItemPointer/TID addressing, free-space and special-space regions. |
| postgres-buffer-manager.md | Shared fixed-size buffer pool: BufferDesc array + buffer hash table, clock-sweep victim selection, pin/content-lock discipline, the WAL-before-flush rule, and ring buffers for bulk scans. |
| postgres-smgr-md.md | Storage-manager indirection mapping a relation’s logical blocks to physical files; the md manager segments a relation into 1 GB files and tracks the fork files (main/fsm/vm/init). |
| postgres-aio.md | PG18 async-I/O subsystem: pluggable methods (io_uring, worker, sync), the read_stream look-ahead API, and how block reads become issued-and-awaited operations under the buffer manager. |
| postgres-table-am.md | The TableAmRoutine struct-of-callbacks that decouples the executor from heap; scan/insert/update/delete/index-fetch slots, slot-based tuple access, and how heap registers as the default. |
| postgres-heap-am.md | The default table AM: no-overwrite HeapTupleHeader (xmin/xmax/infomask), insert/update/delete paths via hio, HOT chains and on-page pruning (pruneheap.c, README.HOT). |
| postgres-toast.md | The Oversized-Attribute Storage Technique: per-attribute compression (pglz / LZ4), out-of-line storage in a companion TOAST relation chunked into rows, and the detoast read path. |
| postgres-visibility-map.md | Two bits per heap page (all-visible, all-frozen) in the _vm fork; gates index-only scans and lets vacuum skip pages; how the bits are set and cleared. |
| postgres-free-space-map.md | Per-relation FSM fork: a binary tree of free-space fractions over pages so inserts find a page with room; maintained by vacuum and by failed insert attempts. |
| postgres-nbtree.md | The default index AM: Lehman-Yao B+-tree with high keys and right-links, page-split and deletion protocols, multi-column and INCLUDE indexes, deduplication, and the first-class README. |
| postgres-gin.md | Generalized Inverted Index: posting lists/trees over many keys per heap tuple (arrays, jsonb, full-text), the pending-list fast-insert path, and bulk build. |
| postgres-gist.md | Generalized Search Tree: a balanced tree parameterised by an opclass (consistent/union/penalty/picksplit) for geometric, range, and nearest-neighbour search. |
| postgres-spgist.md | Space-Partitioned GiST: unbalanced, opclass-defined partitioning (quadtree, radix tree) for non-rectangular and text-prefix data, with inner-tuple node descent. |
| postgres-brin.md | Block Range Index: per-block-range summaries (min/max, bloom) instead of per-tuple entries; tiny indexes for naturally-clustered large tables, with a summarisation/desummarisation lifecycle. |
| postgres-hash-index.md | Hash index AM: bucket pages addressed by a hash of the key, overflow-page chains, the split/expand protocol, and WAL-logging (crash-safe since PG10). |
| postgres-index-am.md | The IndexAmRoutine router and the genam/indexam generic scan machinery (index_beginscan, index_getnext_tid, amcheck); how AMs advertise capabilities and the index-only-scan path. |
| postgres-table-sampling.md | The TsmRoutine tablesample interface and the two built-in methods (SYSTEM block-level, BERNOULLI row-level) layered as an AM-aware wrapper over the table scan. |
| postgres-data-checksums.md | Per-page checksum (checksum.c / checksum_impl.h) verified on read and stamped on flush, the cluster-level enable flag, and the pg_checksums offline tool. |
| postgres-large-objects.md | The pg_largeobject store: a value chunked into 2 KB rows in a system table, addressed by OID, with the server-side inv_api and lo_* libpq/SQL interface. |
| postgres-sequences.md | Sequences as single-row heap relations with a per-backend cache; nextval/setval semantics, WAL logging of the increment, and why they are non-transactional. |
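
The "two bits per heap page" layout of the visibility map can be illustrated with simple bit arithmetic. This is a simplified sketch: the `VisibilityMap` class is hypothetical, the map is treated as one flat byte array (the real fork is itself stored in pages with headers), and the flag values 0x01 and 0x02 are an assumption meant to mirror the all-visible and all-frozen bits.

```python
ALL_VISIBLE = 0x01    # assumed flag value
ALL_FROZEN = 0x02     # assumed flag value
BITS_PER_HEAPBLOCK = 2
HEAPBLOCKS_PER_BYTE = 8 // BITS_PER_HEAPBLOCK   # 4 heap pages per map byte


class VisibilityMap:
    def __init__(self, nblocks: int):
        nbytes = (nblocks + HEAPBLOCKS_PER_BYTE - 1) // HEAPBLOCKS_PER_BYTE
        self.bits = bytearray(nbytes)

    def _locate(self, heap_blk: int):
        """Return (byte index, bit shift) for a heap block's two bits."""
        byte = heap_blk // HEAPBLOCKS_PER_BYTE
        shift = BITS_PER_HEAPBLOCK * (heap_blk % HEAPBLOCKS_PER_BYTE)
        return byte, shift

    def set(self, heap_blk: int, flags: int) -> None:
        byte, shift = self._locate(heap_blk)
        self.bits[byte] |= flags << shift

    def clear(self, heap_blk: int) -> None:
        byte, shift = self._locate(heap_blk)
        self.bits[byte] &= ~(0x03 << shift) & 0xFF

    def get(self, heap_blk: int) -> int:
        byte, shift = self._locate(heap_blk)
        return (self.bits[byte] >> shift) & 0x03


vm = VisibilityMap(nblocks=10)
vm.set(5, ALL_VISIBLE)
vm.set(6, ALL_VISIBLE | ALL_FROZEN)
```

An index-only scan would consult `get()` for the heap block behind each TID and skip the heap fetch only when the all-visible bit is set; any modification of the heap page must call something like `clear()` first.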

The storage engine is the floor of the data path, but not the floor of the system — three neighbouring sections own the boundaries it touches.

  • above (Query Processing). The executor is the storage engine’s only upward caller: scan, index-scan, index-only-scan, bitmap-heap-scan, and sample-scan nodes all drive the table/index AM routines described here. The AM indirection is precisely the seam — the executor speaks TableScanDesc/IndexScanDesc, never heap internals. See postgres-overview-query-processing.md.
  • below / beside (Transaction & Recovery). This is the section’s most important boundary and the one it most deliberately defers. The heap is no-overwrite, so it produces the dead tuples and the visibility bits, but the WAL that protects every page (the WAL-before-flush rule enforced in the buffer manager), the MVCC snapshot logic that decides which version is visible, the commit-state records, and the vacuum that reclaims dead versions and maintains the visibility and free-space maps all live in txn-recovery. The storage engine is the substrate WAL protects and vacuum reclaims; txn-recovery is the protocol. See postgres-overview-txn-recovery.md.
  • beside (Server Architecture). Buffer pins and content-locks, the LWLocks that protect buffer headers and AM internal pages, and the heavyweight relation locks an AM acquires before scanning all belong to the concurrency machinery owned by server-architecture. This section assumes those primitives and names the latch discipline; it does not re-describe the lock manager. See postgres-overview-server-architecture.md.
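
The AM seam named in the first bullet is a struct-of-callbacks pattern. The Python sketch below models it with a hypothetical `TableAm` record and two toy storage backends; none of these names correspond to real PostgreSQL symbols, and the real routine has many more callbacks.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class TableAm:
    """Toy stand-in for a struct of callbacks; names are illustrative."""
    name: str
    scan: Callable[[object], Iterator[tuple]]
    insert: Callable[[object, tuple], None]


# Backend 1: rows kept in an append-only list.
def list_scan(rel):
    return iter(rel)


def list_insert(rel, row):
    rel.append(row)


# Backend 2: rows kept in a dict keyed by the first column.
def dict_scan(rel):
    return iter(rel.values())


def dict_insert(rel, row):
    rel[row[0]] = row


HEAP_LIKE = TableAm("heap_like", list_scan, list_insert)
KEYED = TableAm("keyed", dict_scan, dict_insert)


def load(am: TableAm, rel, rows):
    for row in rows:
        am.insert(rel, row)
    return rel


def seq_scan(am: TableAm, rel):
    """'Executor' node: knows only the callback table, not the storage."""
    return sorted(am.scan(rel))


rows = [(2, "b"), (1, "a")]
from_list = seq_scan(HEAP_LIKE, load(HEAP_LIKE, [], rows))
from_dict = seq_scan(KEYED, load(KEYED, {}, rows))
```

`seq_scan` and `load` never branch on which backend they were given, which is the property that keeps the default AM from being privileged in the call path.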

For the whole-system frame — why PostgreSQL is one binary on one shared-memory segment, and where this section’s buffer pool and AMs sit among the seven architectural axes — see postgres-architecture-overview.md (Axis 4 is this section).