PostgreSQL Storage Engine — Section Overview
What this section covers
The storage-engine section documents how PostgreSQL turns a stream of
bytes on disk into the rows and index entries the executor reads, and back.
It is the band that sits below the query pipeline (the executor calls into
it through access-method routines) and above the operating system (it ends
at read()/write() on segment files). Concretely it spans
src/backend/storage/{smgr,buffer,page,freespace,large_object,aio} and
src/backend/access/{table,heap,common,index,nbtree,gin,gist,spgist,brin,hash,tablesample}.
Read bottom-up, the stack is five strata:
- smgr / md: the storage-manager indirection that maps a relation’s logical blocks onto physical file segments (md, the "magnetic disk" layer, is what splits a relation into 1 GB files).
- buffer pool: the shared, fixed-size page cache that every read and write funnels through. It is the single choke point between memory and disk and the place the WAL-before-flush rule is enforced. Underneath it, the PG18 async-I/O subsystem (storage/aio/) turns block reads into issued-and-awaited operations via a read-stream API.
- page layout: the slotted-page format (PageHeader, ItemId line pointers, ItemPointer/TID addressing) shared by every access method, plus the optional data checksum that guards each page.
- pluggable access methods: the structural highlight. A table is reached through a TableAmRoutine and an index through an IndexAmRoutine; the executor never hard-codes “heap”. heap is merely the default table AM; nbtree/gin/gist/spgist/brin/hash are the in-core index AMs.
- on-page records and their satellites: the heap AM’s no-overwrite tuple format and HOT pruning, TOAST for over-sized attributes, the visibility map and free-space map that accelerate scans and inserts, and the sequence and large-object stores that ride on the heap.
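The block-to-segment mapping in the bottom stratum is plain integer arithmetic. The following is a minimal sketch, assuming the default build constants (8 KB blocks, 131072 blocks per segment); the function name `locate_block` and the relfilenode value 16384 are illustrative only and are not PostgreSQL identifiers.

```python
BLCKSZ = 8192          # bytes per block in a default build
RELSEG_SIZE = 131072   # blocks per segment file in a default build (131072 * 8192 B = 1 GB)


def locate_block(relfilenode, blocknum):
    """Map a logical block number to (segment file name, byte offset in that file)."""
    segno, block_in_seg = divmod(blocknum, RELSEG_SIZE)
    # The first segment carries no suffix; later segments are named <relfilenode>.<segno>.
    name = str(relfilenode) if segno == 0 else f"{relfilenode}.{segno}"
    return name, block_in_seg * BLCKSZ


# Illustrative lookups: the last block of segment 0, then the first block of segment 1.
for blk in (0, 131071, 131072, 300000):
    print(blk, locate_block(16384, blk))
```

Everything above this layer deals only in logical block numbers; the segment split is invisible to the buffer pool and the access methods.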
The sharp boundaries. This section deliberately stops at two seams and hands off:
- It does not own MVCC semantics, WAL, vacuum, or recovery. The heap is no-overwrite (an UPDATE writes a new tuple version and leaves the old one in place), but why a version is visible, how its commit state is recorded, what WAL record describes the change, and who reclaims the dead versions all belong to the txn-recovery subcategory (postgres-overview-txn-recovery.md). This section describes the on-page realisation (where the xmin/xmax/infomask bits live, how HOT chains are pruned, how the visibility map’s two bits per page are set); it defers the protocol to txn-recovery. The visibility-map and free-space-map docs live here because they are physical heap satellites; the vacuum that drives them lives in txn-recovery.
- It does not own concurrency primitives. Pages are protected by buffer content-locks and pins, but the LWLock/pin machinery itself, and the heavyweight relation locks an AM acquires, belong to server-architecture (postgres-overview-server-architecture.md). This section names the latch discipline; it does not re-describe the lock manager.
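To make the "on-page realisation" side of the first boundary concrete, here is a toy model of a no-overwrite update. It is a sketch under loud assumptions: `ToyHeapPage`, `TupleVersion`, and `next_slot` are hypothetical names, the real tuple header carries far more state, and nothing here decides visibility (that protocol belongs to txn-recovery).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TupleVersion:
    xmin: int                          # transaction that created this version
    xmax: Optional[int]                # transaction that superseded it, None while live
    data: str
    next_slot: Optional[int] = None    # toy stand-in for the link to the newer version


class ToyHeapPage:
    """Hypothetical page holding tuple versions; not the real heap format."""

    def __init__(self):
        self.slots: List[TupleVersion] = []

    def insert(self, xid, data):
        self.slots.append(TupleVersion(xmin=xid, xmax=None, data=data))
        return len(self.slots) - 1

    def update(self, slot, xid, data):
        old = self.slots[slot]
        new_slot = self.insert(xid, data)   # the new version is written alongside
        old.xmax = xid                      # the old version is stamped, never overwritten
        old.next_slot = new_slot
        return new_slot


page = ToyHeapPage()
first = page.insert(100, "v1")
second = page.update(first, 101, "v2")
print(page.slots)
```

After the update both versions are physically present; which one a given snapshot sees, and when the old one may be reclaimed, is deliberately outside this sketch.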
The layered stack
Figure 1: The storage-engine layered stack. The executor enters through the table/index AM routines (never a hard-coded heap); table sampling is a thin AM-aware wrapper over the table scan. The heap AM owns its on-page tuple format and a constellation of satellites (TOAST, visibility map, free-space map, sequences, large objects). Every AM (heap and all six index AMs) addresses pages through one shared slotted-page format guarded by the optional data-checksum, and every page read or written passes through the shared buffer pool, which sits on the PG18 async-I/O layer and the smgr/md relation-to-segment mapping before reaching OS files.
Three structural facts the diagram makes load-bearing:
- The AM indirection is the spine of this section. tableam.h defines a TableAmRoutine struct-of-callbacks (scan_begin, tuple_insert, tuple_update, index_fetch_tuple, …) and amapi.h defines an IndexAmRoutine; the executor and genam/indexam dispatch through these pointers. heap fills in a TableAmRoutine and is wired as the default, but it is not privileged in the call path; this is what makes pluggable storage and the in-core index AMs all instances of one pattern. The postgres-table-am.md and postgres-index-am.md docs own the two routers; every concrete AM doc is “what this AM puts behind the callbacks.”
- One page format underlies every AM. Whether the bytes hold a heap tuple or a B-tree internal node, the block is a slotted page: a fixed PageHeader, a forward-growing ItemId (line-pointer) array, and a backward-growing tuple area, addressed by ItemPointer (block number + offset = TID). postgres-page-layout.md owns this once; the per-AM docs describe only what they store in the slots. Data checksums (postgres-data-checksums.md) are computed over that page image at the buffer-flush boundary.
- The buffer pool is the universal choke point. No AM touches a file directly; it pins a buffer, takes a content lock, mutates the page, marks it dirty, and unpins. The buffer manager (postgres-buffer-manager.md) is where the WAL-before-flush rule is enforced: a dirty page may not be written until the WAL record describing the change is durable (the seam to the txn-recovery subcategory). Below it, PG18’s async I/O (postgres-aio.md) and the read-stream API turn synchronous block reads into batched, issued-and-awaited operations; smgr/md (postgres-smgr-md.md) maps logical block numbers onto 1 GB file segments.
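The first of these facts, dispatch through a struct of callbacks, can be sketched in a few lines. This is an analogy and not PostgreSQL's API: `ToyTableAm`, the "sorted" AM, the registry dictionaries, and the `executor_*` helpers are all hypothetical, and the callback signatures are drastically simplified compared with the real TableAmRoutine.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[int, ...]


@dataclass(frozen=True)
class ToyTableAm:
    """Hypothetical two-slot analogue of a struct of callbacks."""
    scan_begin: Callable[[str], Iterator[Row]]
    tuple_insert: Callable[[str, Row], None]


# Toy "heap" AM: rows are kept in arrival order.
_heap_rels: Dict[str, List[Row]] = {}

def heap_scan_begin(rel):
    return iter(_heap_rels.get(rel, []))

def heap_tuple_insert(rel, row):
    _heap_rels.setdefault(rel, []).append(row)


# A second, made-up AM that keeps rows sorted, to show that the caller does not care.
_sorted_rels: Dict[str, List[Row]] = {}

def sorted_scan_begin(rel):
    return iter(_sorted_rels.get(rel, []))

def sorted_tuple_insert(rel, row):
    bisect.insort(_sorted_rels.setdefault(rel, []), row)


AM_REGISTRY = {
    "heap": ToyTableAm(heap_scan_begin, heap_tuple_insert),
    "sorted": ToyTableAm(sorted_scan_begin, sorted_tuple_insert),
}
REL_TO_AM = {"t_heap": "heap", "t_sorted": "sorted"}


def executor_insert(rel, row):
    # The caller only follows the relation's callback table; it never names an AM.
    AM_REGISTRY[REL_TO_AM[rel]].tuple_insert(rel, row)

def executor_seqscan(rel):
    return list(AM_REGISTRY[REL_TO_AM[rel]].scan_begin(rel))


for r in [(3,), (1,), (2,)]:
    executor_insert("t_heap", r)
    executor_insert("t_sorted", r)
```

The two `executor_*` functions are identical for both relations; only the table of function pointers differs, which is the sense in which heap is "not privileged in the call path".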
Reading order
The 20 docs in this section are not equally central, and alphabetical order buries the spine. Read cross-referenced-first: establish the shared floor and the AM indirection before any individual AM.
1. postgres-page-layout.md: Read first. It defines the vocabulary (PageHeader, ItemId, ItemPointer/TID, slotted layout) that every other doc in the section reuses, and it has no upward dependency. The line pointer and the page-checksum idea reappear in every AM.
2. postgres-buffer-manager.md: Read second. This is where most of the storage-engine engineering effort lives: the BufferDesc array, the clock-sweep victim selection, pinning and content-locks, and the WAL-before-flush enforcement that ties the section to txn-recovery. Every layer above touches the buffer pool.
3. postgres-smgr-md.md: Read third; the disk-facing complement of the buffer manager. Short and self-contained: how a relation’s blocks map to base/<db>/<relfilenode> files segmented at 1 GB, and how the smgr indirection allows non-md storage managers.
4. postgres-table-am.md: Read fourth. The TableAmRoutine router. Understanding this struct is what lets every later “heap does X” claim be read as “the default AM does X.”
5. postgres-heap-am.md: Read fifth. The default table AM: the no-overwrite tuple format (HeapTupleHeader, xmin/xmax/infomask), the insert/update/delete paths, and HOT pruning (README.HOT). This is the densest doc; budget time.
6. postgres-index-am.md: Read sixth. The IndexAmRoutine router and the genam/indexam generic scan loop, so the six concrete index-AM docs below can each be read as “what this AM puts behind the callbacks.”
7. postgres-nbtree.md: Read seventh. The B-tree is the default and by far the most-used index AM; it is also the cleanest illustration of the index-AM contract (Lehman-Yao high-key, page splits, the README is first-class). Read it before the specialised AMs.
8. postgres-visibility-map.md + postgres-free-space-map.md: Read together after the heap AM; they are the heap’s two physical satellites (the VM gates index-only scans and freeze; the FSM steers inserts).
9. postgres-toast.md: Read when over-sized attributes matter; how a wide value is compressed and/or pushed to an out-of-line TOAST relation and detoasted on read.
10. postgres-aio.md: Read when the I/O path matters; the PG18 async-I/O subsystem and read-stream API beneath the buffer manager. Greenfield code with its own README, largely orthogonal to the AM layer above.
11. postgres-gin.md, postgres-gist.md, postgres-spgist.md, postgres-brin.md, postgres-hash-index.md: Skim per need. Each is a specialised index AM behind the same IndexAmRoutine; read the one your workload uses.
12. postgres-table-sampling.md, postgres-sequences.md, postgres-large-objects.md, postgres-data-checksums.md: Skim per need. Each is a focused side-channel that composes on the floor without changing it.
A reader who works through 1–7 in order has the full “how a row or index entry gets read and written” mental model. The remaining docs each fill in a satellite or a specialised AM that the core path either piggybacks on or dispatches to.
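As a preview of the vocabulary the first doc in that order establishes, the slotted layout can be modelled in a few lines. This is a toy sketch, not the on-disk format: it assumes an 8 KB page with a 24-byte header and 4-byte line pointers, keeps the line pointers in a Python list instead of encoding them into the buffer, and ignores tuple alignment and the special-space region. `ToySlottedPage` and its methods are hypothetical names.

```python
PAGE_SIZE = 8192     # default block size
HEADER_SIZE = 24     # assumed size of the fixed page header
ITEMID_SIZE = 4      # assumed size of one line pointer


class ToySlottedPage:
    """Line pointers grow forward from the header; tuple bytes grow backward from the end."""

    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        self.lower = HEADER_SIZE    # end of the line-pointer array
        self.upper = PAGE_SIZE      # start of the tuple area
        self.items = []             # one (offset, length) pair per line pointer

    def free_space(self):
        return self.upper - self.lower

    def add_item(self, payload):
        if len(payload) + ITEMID_SIZE > self.free_space():
            raise ValueError("page full")
        self.upper -= len(payload)
        self.buf[self.upper:self.upper + len(payload)] = payload
        self.items.append((self.upper, len(payload)))
        self.lower += ITEMID_SIZE
        return len(self.items)      # 1-based offset number, the second half of a TID

    def get_item(self, offnum):
        off, length = self.items[offnum - 1]
        return bytes(self.buf[off:off + length])


page = ToySlottedPage()
tid = (0, page.add_item(b"alpha"))   # (block number, offset number)
page.add_item(b"beta")
print(tid, page.get_item(tid[1]), page.free_space())
```

The indirection through the line pointer is the point: a tuple can move within the page without its TID changing, because the TID names the slot and not the byte offset.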
Detail-doc summaries
These are forward references; some module docs are not yet written. The summaries are predictive: they state what each doc will cover.
| Doc | One-line summary |
|---|---|
| postgres-page-layout.md | Slotted-page format shared by every AM: PageHeader, forward-growing ItemId line-pointer array, backward-growing tuple area, ItemPointer/TID addressing, free-space and special-space regions. |
| postgres-buffer-manager.md | Shared fixed-size buffer pool: BufferDesc array + buffer hash table, clock-sweep victim selection, pin/content-lock discipline, the WAL-before-flush rule, and ring buffers for bulk scans. |
| postgres-smgr-md.md | Storage-manager indirection mapping a relation’s logical blocks to physical files; the md manager segments a relation into 1 GB files and tracks the fork files (main/fsm/vm/init). |
| postgres-aio.md | PG18 async-I/O subsystem: pluggable methods (io_uring, worker, sync), the read_stream look-ahead API, and how block reads become issued-and-awaited operations under the buffer manager. |
| postgres-table-am.md | The TableAmRoutine struct-of-callbacks that decouples the executor from heap; scan/insert/update/delete/index-fetch slots, slot-based tuple access, and how heap registers as the default. |
| postgres-heap-am.md | The default table AM: no-overwrite HeapTupleHeader (xmin/xmax/infomask), insert/update/delete paths via hio, HOT chains and on-page pruning (pruneheap.c, README.HOT). |
| postgres-toast.md | The Oversized-Attribute Storage Technique: per-attribute compression (pglz / LZ4), out-of-line storage in a companion TOAST relation chunked into rows, and the detoast read path. |
| postgres-visibility-map.md | Two bits per heap page (all-visible, all-frozen) in the _vm fork; gates index-only scans and lets vacuum skip pages; how the bits are set and cleared. |
| postgres-free-space-map.md | Per-relation FSM fork: a binary tree of free-space fractions over pages so inserts find a page with room; maintained by vacuum and by failed insert attempts. |
| postgres-nbtree.md | The default index AM: Lehman-Yao B+-tree with high keys and right-links, page-split and deletion protocols, multi-column and INCLUDE indexes, deduplication, and the first-class README. |
| postgres-gin.md | Generalized Inverted Index: posting lists/trees over many keys per heap tuple (arrays, jsonb, full-text), the pending-list fast-insert path, and bulk build. |
| postgres-gist.md | Generalized Search Tree: a balanced tree parameterised by an opclass (consistent/union/penalty/picksplit) for geometric, range, and nearest-neighbour search. |
| postgres-spgist.md | Space-Partitioned GiST: unbalanced, opclass-defined partitioning (quadtree, radix tree) for non-rectangular and text-prefix data, with inner-tuple node descent. |
| postgres-brin.md | Block Range Index: per-block-range summaries (min/max, bloom) instead of per-tuple entries; tiny indexes for naturally-clustered large tables, with a summarisation/desummarisation lifecycle. |
| postgres-hash-index.md | Hash index AM: bucket pages addressed by a hash of the key, overflow-page chains, the split/expand protocol, and WAL-logging (crash-safe since PG10). |
| postgres-index-am.md | The IndexAmRoutine router and the genam/indexam generic scan machinery (index_beginscan, index_getnext_tid, amcheck); how AMs advertise capabilities and the index-only-scan path. |
| postgres-table-sampling.md | The TsmRoutine tablesample interface and the two built-in methods (SYSTEM block-level, BERNOULLI row-level) layered as an AM-aware wrapper over the table scan. |
| postgres-data-checksums.md | Per-page checksum (checksum.c / checksum_impl.h) verified on read and stamped on flush, the cluster-level enable flag, and the pg_checksums offline tool. |
| postgres-large-objects.md | The pg_largeobject store: a value chunked into 2 KB rows in a system table, addressed by OID, with the server-side inv_api and lo_* libpq/SQL interface. |
| postgres-sequences.md | Sequences as single-row heap relations with a per-backend cache; nextval/setval semantics, WAL logging of the increment, and why they are non-transactional. |
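The visibility-map row above ("two bits per heap page") implies a simple addressing scheme, sketched below. The constants assume a default 8 KB block with a 24-byte page header, so one map page covers 32672 heap pages (roughly 255 MB of heap); the function names are illustrative, and `vm_bytes` stands for the data area of a single map page only.

```python
BLCKSZ = 8192
PAGE_HEADER_SIZE = 24                                   # assumed header size on a map page
BITS_PER_HEAPBLOCK = 2
HEAPBLOCKS_PER_BYTE = 8 // BITS_PER_HEAPBLOCK           # 4 heap pages per map byte
MAPSIZE = BLCKSZ - PAGE_HEADER_SIZE                     # usable bytes per map page
HEAPBLOCKS_PER_PAGE = MAPSIZE * HEAPBLOCKS_PER_BYTE     # heap pages covered by one map page

ALL_VISIBLE = 0x01
ALL_FROZEN = 0x02


def vm_location(heap_blk):
    """Return (map block, byte within that block's data area, bit shift within the byte)."""
    map_block = heap_blk // HEAPBLOCKS_PER_PAGE
    map_byte = (heap_blk % HEAPBLOCKS_PER_PAGE) // HEAPBLOCKS_PER_BYTE
    shift = (heap_blk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK
    return map_block, map_byte, shift


def vm_set(vm_bytes, heap_blk, flags):
    _, byte, shift = vm_location(heap_blk)
    vm_bytes[byte] |= (flags & 0x03) << shift


def vm_clear(vm_bytes, heap_blk):
    _, byte, shift = vm_location(heap_blk)
    vm_bytes[byte] &= ~(0x03 << shift) & 0xFF


def vm_get(vm_bytes, heap_blk):
    _, byte, shift = vm_location(heap_blk)
    return (vm_bytes[byte] >> shift) & 0x03


vm = bytearray(MAPSIZE)
vm_set(vm, 5, ALL_VISIBLE)
print(vm_location(5), vm_get(vm, 5))
```

Because the map is this dense, checking whether a heap page is all-visible costs one byte lookup, which is what makes it cheap enough to consult on every index-only-scan fetch.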
Adjacent sections
The storage engine is the floor of the data path, but not the floor of the system: three neighbouring sections own the boundaries it touches.
- above (Query Processing). The executor is the storage engine’s only upward caller: scan, index-scan, index-only-scan, bitmap-heap-scan, and sample-scan nodes all drive the table/index AM routines described here. The AM indirection is precisely the seam: the executor speaks TableScanDesc/IndexScanDesc, never heap internals. See postgres-overview-query-processing.md.
- below / beside (Transaction & Recovery). This is the section’s most important boundary and the one it most deliberately defers. The heap is no-overwrite, so it produces the dead tuples and the visibility bits, but the WAL that protects every page (the WAL-before-flush rule enforced in the buffer manager), the MVCC snapshot logic that decides which version is visible, the commit-state records, and the vacuum that reclaims dead versions and maintains the visibility and free-space maps all live in txn-recovery. The storage engine is the substrate WAL protects and vacuum reclaims; txn-recovery is the protocol. See postgres-overview-txn-recovery.md.
- beside (Server Architecture). Buffer pins and content-locks, the LWLocks that protect buffer headers and AM internal pages, and the heavyweight relation locks an AM acquires before scanning all belong to the concurrency machinery owned by server-architecture. This section assumes those primitives and names the latch discipline; it does not re-describe the lock manager. See postgres-overview-server-architecture.md.
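The WAL-before-flush seam named in the second boundary can be illustrated with a toy model. Everything here is hypothetical (`ToyWal`, `ToyBuffer`, `modify_page`, `flush_buffer`); locking is reduced to a comment, LSNs are small integers, and the real buffer manager and WAL code are far more involved. The sketch only shows the ordering constraint: the log must be durable up to the page's LSN before the page image is written.

```python
class ToyWal:
    """Hypothetical log: records get increasing LSNs; flush() makes a prefix durable."""

    def __init__(self):
        self.next_lsn = 1
        self.flushed_lsn = 0

    def insert(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto):
        self.flushed_lsn = max(self.flushed_lsn, upto)


class ToyBuffer:
    def __init__(self):
        self.pin_count = 0
        self.dirty = False
        self.page_lsn = 0
        self.data = ""


def modify_page(buf, wal, new_data):
    buf.pin_count += 1                  # pin: the buffer cannot be evicted while in use
    # (an exclusive content lock would be taken here)
    buf.data = new_data                 # mutate the page
    buf.dirty = True                    # mark it dirty
    buf.page_lsn = wal.insert(("set", new_data))   # log the change, stamp the page with its LSN
    # (the content lock would be released here)
    buf.pin_count -= 1                  # unpin


def flush_buffer(buf, wal, disk):
    if not buf.dirty:
        return
    if wal.flushed_lsn < buf.page_lsn:
        wal.flush(buf.page_lsn)         # the log must reach disk before the page does
    disk.append((buf.page_lsn, buf.data))
    buf.dirty = False


wal, buf, disk = ToyWal(), ToyBuffer(), []
modify_page(buf, wal, "row v1")
flush_buffer(buf, wal, disk)
print(wal.flushed_lsn, disk)
```

In this model the storage side only stores and compares an LSN; what the record contains and how it is replayed stays on the txn-recovery side of the boundary.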
For the whole-system frame — why PostgreSQL is one binary on one
shared-memory segment, and where this section’s buffer pool and AMs sit among
the seven architectural axes — see postgres-architecture-overview.md
(Axis 4 is this section).