Skip to content

PostgreSQL Heap Access Method — No-Overwrite MVCC Tuples, Slotted Pages, and HOT

Contents:

A heap file in the textbook sense is an unordered collection of records: a relation’s rows live in pages, addressed by a page number plus a slot, with no ordering guarantee (Database System Concepts, 7e, ch. 13 “Data Storage Structures”, §“File Organization”; Database Internals, Petrov, ch. 3 “File Formats”). The two questions every heap implementation must answer are: how is a variable-length record laid out within a fixed-size page, and what happens to the old bytes when a record is updated or deleted.

The layout answer is almost universal: the slotted page (also “slot directory” page). A page header sits at the front; an array of small line pointers (slots) grows downward from just after the header; the actual records grow upward from the end of the page; the free space is the hole in the middle. A record’s address is (page, slot index), and the slot — not the record’s byte offset — is the stable handle, so the record can be moved within the page during compaction without invalidating references that go through the slot (Database Internals, ch. 3, §“Slotted Pages”; Database System Concepts, §“Organization of Records in Files”). This indirection is the structural foundation everything else in this document builds on.

The update/delete answer is the design fork that defines the engine. There are two families:

  1. Update-in-place. The new value overwrites the old bytes; the prior image, if kept at all, goes to a separate undo area so that concurrent readers and rollback can reconstruct it. Oracle and MySQL/InnoDB are the canonical in-place engines.
  2. No-overwrite (append/versioned). The old version is left untouched; a new version is written elsewhere and the two are linked. Readers choose a version by timestamp/transaction visibility. This is the lineage of the Berkeley POSTGRES storage system.

PostgreSQL is the no-overwrite engine made famous. Its design traces directly to Stonebraker’s The Design of the POSTGRES Storage System (VLDB 1987), whose thesis was that storage should never overwrite committed data: an update writes a new tuple and leaves the prior one as historical state, originally so that “time-travel” queries could read the database as of any past instant, and the log could be eliminated in favor of forcing data at commit. Modern PostgreSQL kept the no-overwrite heap but pairs it with a conventional WAL (see postgres-xlog-wal.md) rather than the paper’s force-at-commit scheme; the time-travel feature was dropped, but the no-overwrite tuple is exactly the mechanism that makes multiversion concurrency control (MVCC) free on the read path.

MVCC is the concurrency-control family (Database Internals, ch. 5 “Transaction Processing and Recovery”, §“Multiversion Concurrency Control”) in which “reads can continue accessing older values until the new ones are committed” — coordination is pushed down to visibility rather than mutual exclusion. The dominant isolation level on top of it is snapshot isolation (SI): a transaction reads against a logical snapshot, sees the version of each row that was committed as of that snapshot, and ignores versions written by transactions still in flight. A no-overwrite heap is the natural substrate for SI because every version a snapshot might need is physically still present; visibility is a predicate over the version’s creator/destroyer transaction IDs.

Three implementation choices follow, and they organize the rest of this document:

  1. What metadata each version carries so visibility can be decided. In PostgreSQL: xmin (inserting xact), xmax (deleting/locking xact), and a word of hint/flag bits.
  2. How versions of one row are chained. In PostgreSQL: the old tuple’s t_ctid points forward to the new tuple’s slot on (usually) the same page.
  3. How dead versions are reclaimed. In PostgreSQL: single-page HOT pruning for the cheap case, and full vacuum (deferred to postgres-vacuum.md) for the index-coordinated case.

Snapshot construction — how the set of in-flight XIDs is captured and how XidInMVCCSnapshot decides membership — is the subject of postgres-mvcc-snapshots.md and postgres-procarray.md. This document treats the snapshot as a given oracle and focuses on the tuple: its header, its place on the page, and the insert/update/delete/prune operations that write and reclaim it.

The textbook gives the model; this section names the engineering conventions that recur across heap implementations, so that PostgreSQL’s specific symbols in the next section read as one set of dials within a shared space rather than as inventions.

Every MVCC engine stamps a version with who created it and who destroyed it, plus enough flag state to short-circuit the common visibility cases. The minimum is a (created_by, deleted_by) pair of transaction identifiers. PostgreSQL stores this pair inline on the heap tuple (xmin/xmax). Oracle and InnoDB store only a pointer (an undo locator / rollback pointer) on the live row and keep the prior image in undo; the “deleted_by” question is answered by walking undo. The inline choice makes PostgreSQL’s read path a local test on the tuple header, at the cost of carrying that header on every version and needing a vacuum to remove dead ones.

A second universal convention is hint bits: a cache, stored alongside the version, of “the inserting transaction is known committed / aborted” and the same for the deleter. Resolving a transaction’s commit status from the commit log (CLOG/pg_xact) is comparatively expensive, so the first reader to learn the outcome records it on the tuple, and subsequent readers skip the lookup. The hint bit is derived, not authoritative — it can always be recomputed from the commit log — which is why it can be set lazily and without WAL.

The slotted page is near-universal, but the slot gains extra duty in a versioned heap. Beyond “offset + length of this record”, the slot can encode a small state machine: a normal slot points at a tuple; a redirect slot points at another slot (no tuple of its own); a dead slot points at nothing but cannot be reused yet because an index still references it; an unused slot is free for the taking. This four-state slot is what lets an engine reclaim the tuple bytes of a dead version while keeping the slot alive as a stable index target — the key trick behind single-page reclamation.

When a row is updated, the engine must let a reader who holds a reference to the old version find the new one. The convention is a forward link from old to new — a “next version” pointer in the old version’s header. In an in-place engine this is implicit (the row simply changed); in a no-overwrite engine the link is explicit and physical. Keeping the chain within one page where possible is the optimization that makes following it free of extra I/O.

In a no-overwrite heap, the naïve cost of an UPDATE is one new heap tuple plus one new entry in every index, even if no indexed column changed — because the new tuple has a new physical address (TID) and indexes are keyed to TIDs. Every serious no-overwrite engine has some answer to this. PostgreSQL’s answer is Heap-Only Tuples (HOT): if the update changes no indexed column and the new version fits on the same page, no index entry is made and the new version is reachable only by following the on-page chain from the existing index entry.

Reclamation: cheap single-page vs. coordinated full

Section titled “Reclamation: cheap single-page vs. coordinated full”

Dead versions accumulate, so reclamation is mandatory, and it splits in two:

  • Single-page reclamation can drop a dead version’s bytes without touching indexes — but only if no index points at that specific version. HOT chains are engineered precisely so the intermediate versions have no index entries, making them single-page-reclaimable.
  • Coordinated reclamation (full vacuum) is needed when an index does point at the dead version; it must scan and clean the indexes before the heap slot can be freed. This is the expensive, table-wide path.
Theory / conventionPostgreSQL name
Slotted page record layoutheap page (PageHeaderData + ItemIdData[] + tuples) — postgres-page-layout.md
Indirection slot (4-state)ItemIdData.lp_flags: LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD
Version “created by”HeapTupleHeaderData.t_choice.t_heap.t_xmin
Version “deleted/locked by”HeapTupleHeaderData.t_choice.t_heap.t_xmax
Visibility hint bitst_infomask: HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, …
Forward version pointerHeapTupleHeaderData.t_ctid (→ slot of next version, or self)
Index-free same-page updateHOT update — HEAP_HOT_UPDATED (old) + HEAP_ONLY_TUPLE (new)
Single-page reclamationheap_page_prune_optheap_page_prune_and_freeze
Coordinated reclamationvacuum — postgres-vacuum.md
Snapshot membership oracleXidInMVCCSnapshot (in postgres-mvcc-snapshots.md)
Pluggable table interfaceTableAmRoutine (heap = heapam_methods) — postgres-table-am.md

PostgreSQL’s heap is the default implementation of the table access method. The executor never calls heap_insert directly; it calls table_tuple_insert, which dispatches through a TableAmRoutine function- pointer table. For an ordinary table that table is heapam_methods, and its slots point at thin wrappers (heapam_tuple_insert, heapam_tuple_update, …) that unwrap the executor’s TupleTableSlot into a HeapTuple and call the real heap functions in heapam.c. The AM boundary itself is owned by postgres-table-am.md; here we care about what the heap does once dispatched.

flowchart TB
  EXEC["executor / ExecInsert, ExecUpdate, ExecDelete"] --> TAM["table_tuple_insert / _update / _delete\n(tableam.h inline wrappers)"]
  TAM --> RT["heapam_methods\n(const TableAmRoutine)"]
  RT --> H1["heapam_tuple_insert"]
  RT --> H2["heapam_tuple_update"]
  RT --> H3["heapam_tuple_delete"]
  RT --> H4["heapam_index_fetch_tuple"]
  H1 --> HI["heap_insert (heapam.c)"]
  H2 --> HU["heap_update (heapam.c)"]
  H3 --> HD["heap_delete (heapam.c)"]
  H4 --> HS["heap_hot_search_buffer (heapam.c)"]

Figure 1 — The heap as a table AM. The executor speaks only the TableAmRoutine vocabulary; heapam_methods binds those slots to wrappers in heapam_handler.c, which call the operative functions in heapam.c. Swapping heapam_methods for another routine (a columnar or in-memory AM) is the pluggable-storage extension point — see postgres-table-am.md.

The routine is a plain static const struct of function pointers:

// heapam_methods — src/backend/access/heap/heapam_handler.c (condensed)
static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
.scan_begin = heap_beginscan,
.scan_getnextslot = heap_getnextslot,
.index_fetch_tuple = heapam_index_fetch_tuple,
.tuple_insert = heapam_tuple_insert,
.multi_insert = heap_multi_insert,
.tuple_delete = heapam_tuple_delete,
.tuple_update = heapam_tuple_update,
.tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot,
.relation_vacuum = heap_vacuum_rel,
/* ... ~40 callbacks total ... */
};

The wrappers are deliberately thin. heapam_tuple_insert is essentially “pull a HeapTuple out of the slot, call heap_insert, copy the resulting TID back into the slot”:

// heapam_tuple_insert — src/backend/access/heap/heapam_handler.c
static void
heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
int options, BulkInsertState bistate)
{
bool shouldFree = true;
HeapTuple tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
slot->tts_tableOid = RelationGetRelid(relation);
tuple->t_tableOid = slot->tts_tableOid;
heap_insert(relation, tuple, cid, options, bistate);
ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree)
pfree(tuple);
}

Every on-disk heap tuple begins with a fixed 23-byte header, followed by an optional null bitmap, alignment padding, and the user data. The header is where MVCC lives:

// HeapTupleFields + HeapTupleHeaderData — src/include/access/htup_details.h (condensed)
typedef struct HeapTupleFields
{
TransactionId t_xmin; /* inserting xact ID */
TransactionId t_xmax; /* deleting or locking xact ID */
union {
CommandId t_cid; /* inserting or deleting command ID, or both */
TransactionId t_xvac; /* old-style VACUUM FULL xact ID */
} t_field3;
} HeapTupleFields;
struct HeapTupleHeaderData
{
union {
HeapTupleFields t_heap;
DatumTupleFields t_datum; /* overlay for in-memory composite Datums */
} t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
* speculative insertion token) */
uint16 t_infomask2; /* number of attributes + various flags */
uint16 t_infomask; /* various flag bits, see below */
uint8 t_hoff; /* sizeof header incl. bitmap, padding */
/* ^ - 23 bytes - ^ */
bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */
};

Three fields carry the entire MVCC story:

  • t_xmin — the XID of the transaction that inserted this version. A reader’s snapshot decides whether that transaction’s effects are visible.
  • t_xmax — the XID of the transaction that deleted or updated-away this version (or merely locked it; the HEAP_XMAX_LOCK_ONLY flag distinguishes a locker from a deleter). Zero/invalid means the version is live, the latest of its row.
  • t_ctid — normally points to the tuple’s own slot. After an UPDATE, the old tuple’s t_ctid is rewritten to point at the new version’s slot — the physical forward chain. (It is also overloaded for a speculative-insert token and a “moved to another partition” marker.)

The t_infomask flag word holds the hint bits and lock state; the t_infomask2 word holds the attribute count plus the two HOT flags:

// selected flags — src/include/access/htup_details.h
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */
#define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker */
/* ... in t_infomask2 ... */
#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE 0x8000 /* this is a heap-only tuple */

HEAP_XMIN_FROZEN is the combination used by freezing (see postgres-vacuum.md): a frozen tuple’s xmin is treated as committed-and- visible-to-everyone, so its real XID can be recycled across the 32-bit wraparound boundary. The frozen state is both the committed and invalid bits set — a value that would otherwise be contradictory, repurposed as a sentinel.

A heap tuple is never addressed by its byte offset; it is addressed by a line pointer (ItemIdData), a 4-byte slot in the array that grows downward from the page header. The slot is a bit-packed triple:

// ItemIdData — src/include/storage/itemid.h
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of line pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
#define LP_UNUSED 0 /* unused (should always have lp_len=0) */
#define LP_NORMAL 1 /* used (should always have lp_len>0) */
#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
#define LP_DEAD 3 /* dead, may or may not have storage */

The two-bit lp_flags is the four-state indirection slot from §“Common DBMS Design”, and it is the linchpin of HOT. A LP_REDIRECT slot has no tuple of its own (lp_len == 0); its lp_off holds the offset number of another line pointer. A LP_DEAD slot points at nothing but cannot be recycled while an index still references it. The detailed page geometry (pd_lower, pd_upper, pd_special, PageAddItem) is owned by postgres-page-layout.md; what matters here is that the TID (block, offset) an index stores is a slot index, not a byte address, so the tuple bytes can be moved or removed while the slot remains a valid index target.

flowchart TB
  subgraph PAGE["heap page (slotted)"]
    direction TB
    HDR["PageHeaderData\n(pd_lower, pd_upper, pd_prune_xid, ...)"]
    subgraph LP["line pointer array (grows down)"]
      direction LR
      L1["ItemId 1\nLP_NORMAL\noff=upper-A, len"]
      L2["ItemId 2\nLP_NORMAL"]
      L3["ItemId 3\nLP_DEAD\nlen may be 0"]
    end
    FREE["free space\n(pd_lower .. pd_upper)"]
    subgraph TUP["tuples (grow up)"]
      direction LR
      T1["tuple A\n(header + data)"]
      T2["tuple B"]
    end
  end
  L1 --> T1
  L2 --> T2

Figure 2 — Slotted heap page. The header is fixed at the front; line pointers grow downward; tuples grow upward; the hole between pd_lower and pd_upper is free space. An index entry stores (block, offset-number-of-line-pointer), so line pointer 3 can be LP_DEAD (its tuple bytes already gone) yet still be a legal index target until vacuum clears the index entry. Full geometry: postgres-page-layout.md.

An insert is the simple case: stamp the header, find a page with room, write the tuple, log it. heap_prepare_insert fills the header — note it clears all visibility bits, sets xmin to the current XID, marks xmax invalid, and sets cmin:

// heap_prepare_insert — src/backend/access/heap/heapam.c (condensed)
tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
HeapTupleHeaderSetXmin(tup->t_data, xid);
if (options & HEAP_INSERT_FROZEN)
HeapTupleHeaderSetXminFrozen(tup->t_data);
HeapTupleHeaderSetCmin(tup->t_data, cid);
HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */
tup->t_tableOid = RelationGetRelid(relation);

heap_insert then picks the target buffer via RelationGetBufferForTuple (which consults the free-space map — postgres-free-space-map.md), runs the SSI conflict check, and places the tuple inside a critical section:

// heap_insert — src/backend/access/heap/heapam.c (condensed)
heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ...);
CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
START_CRIT_SECTION();
RelationPutHeapTuple(relation, buffer, heaptup,
(options & HEAP_INSERT_SPECULATIVE) != 0);
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, ItemPointerGetBlockNumber(&heaptup->t_self),
vmbuffer, VISIBILITYMAP_VALID_BITS);
}
MarkBufferDirty(buffer);
/* ... XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT) ... */
END_CRIT_SECTION();

RelationPutHeapTuple does the slot allocation and, crucially, sets the new tuple’s t_ctid to point at itself — a freshly inserted version is the latest of its row:

// RelationPutHeapTuple — src/backend/access/heap/hio.c (condensed)
offnum = PageAddItem(pageHeader, (Item) tuple->t_data,
tuple->t_len, InvalidOffsetNumber, false, true);
ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
if (!token)
{
ItemId itemId = PageGetItemId(pageHeader, offnum);
HeapTupleHeader item = (HeapTupleHeader) PageGetItem(pageHeader, itemId);
item->t_ctid = tuple->t_self; /* point at self */
}

The order is load-bearing: the page is dirtied and WAL is emitted inside START_CRIT_SECTION()/END_CRIT_SECTION(), and the WAL-before-flush rule (see postgres-xlog-wal.md) guarantees the insert record is durable before the dirty page can reach disk. Clearing the visibility-map bit if the page was all-visible is the seam to postgres-visibility-map.md.

A delete never removes the tuple. It stamps xmax with the deleting XID, records the command id, and sets the page’s prune hint. The tuple stays on the page, fully formed, until pruning or vacuum reclaims it:

// heap_delete — src/backend/access/heap/heapam.c (condensed, the commit path)
PageSetPrunable(page, xid); /* this page now has a deletable tuple */
/* store transaction information of xact deleting the tuple */
tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tp.t_data->t_infomask |= new_infomask;
tp.t_data->t_infomask2 |= new_infomask2;
HeapTupleHeaderClearHotUpdated(tp.t_data);
HeapTupleHeaderSetXmax(tp.t_data, new_xmax);
HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);
/* Make sure there is no forward chain link in t_ctid */
tp.t_data->t_ctid = tp.t_self;

Before reaching this point, heap_delete resolved concurrency: it called HeapTupleSatisfiesUpdate to classify the tuple (TM_Ok, TM_BeingModified, TM_Updated, TM_Deleted, …), and if another transaction held xmax it waited on that transaction (or its MultiXact) before re-checking. The PageSetPrunable(page, xid) call records, in the page header’s pd_prune_xid, the oldest XID whose commit would make a tuple on this page prunable — the hint that later lets heap_page_prune_opt skip pages with nothing to do. Note t_ctid is explicitly reset to self: a plain delete leaves no forward link.

UPDATE — heap_update and the HOT decision

Section titled “UPDATE — heap_update and the HOT decision”

An update is a delete-plus-insert that links the two: the old tuple gets xmax set (as in delete), a new tuple is written, and the old tuple’s t_ctid is pointed at the new one. The interesting decision is whether the update can be HOT.

heap_update first computes which columns actually changed (HeapDetermineColumnsInfomodified_attrs) and fetches the relation’s “interesting” index-column bitmaps. After it has secured a page for the new tuple, it asks: did the new tuple land on the same page, and did the update touch no HOT-blocking indexed column?

// heap_update — src/backend/access/heap/heapam.c (the HOT decision, condensed)
if (newbuf == buffer)
{
/* same page: maybe a HOT update */
if (!bms_overlap(modified_attrs, hot_attrs))
{
use_hot_update = true;
/* still must propagate to summarizing (e.g. BRIN) indexes */
if (bms_overlap(modified_attrs, sum_attrs))
summarized_update = true;
}
}
else
{
/* new tuple went to a different page: cold update */
PageSetFull(page); /* hint: old page wants prune/defrag */
}

If HOT applies, the old tuple is flagged HEAP_HOT_UPDATED and the new tuple HEAP_ONLY_TUPLE; otherwise both flags are cleared. Then the new tuple is placed and the forward link is written:

// heap_update — src/backend/access/heap/heapam.c (condensed)
if (use_hot_update)
{
HeapTupleSetHotUpdated(&oldtup); /* old: HEAP_HOT_UPDATED */
HeapTupleSetHeapOnly(heaptup); /* new: HEAP_ONLY_TUPLE */
HeapTupleSetHeapOnly(newtup);
}
else
{
HeapTupleClearHotUpdated(&oldtup);
HeapTupleClearHeapOnly(heaptup);
HeapTupleClearHeapOnly(newtup);
}
RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new version */
/* old tuple gets xmax + forward link */
HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
oldtup.t_data->t_infomask |= infomask_old_tuple;
HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);
oldtup.t_data->t_ctid = heaptup->t_self; /* OLD points to NEW */

The semantic payoff: a HEAP_ONLY_TUPLE has no index entry. The only way to reach it is to start at an index entry that points to the root of the chain and walk forward via t_ctid while the HEAP_HOT_UPDATED flag is set. That is exactly what an index scan does (next subsection). Because the chain is constrained to a single page, the walk costs no extra buffer fetch.

flowchart LR
  IDX["index entry\n(key, TID=block:1)"] --> LP1
  subgraph PAGE["one heap page"]
    direction LR
    LP1["lp 1 (LP_NORMAL)\nroot tuple v1\nHEAP_HOT_UPDATED\nt_ctid -> lp 2"]
    LP2["lp 2 (LP_NORMAL)\ntuple v2\nHEAP_ONLY_TUPLE, HEAP_HOT_UPDATED\nt_ctid -> lp 3"]
    LP3["lp 3 (LP_NORMAL)\ntuple v3 (live)\nHEAP_ONLY_TUPLE\nt_ctid -> self"]
    LP1 --> LP2 --> LP3
  end

Figure 3 — A HOT chain. The index points only at line pointer 1 (the root). Versions v2 and v3 are HEAP_ONLY_TUPLEs with no index entries; an index scan arriving at lp 1 follows t_ctid forward while HEAP_HOT_UPDATED is set, finding the one version visible to its snapshot. All three versions sit on the same page, so the walk is free of extra I/O. (Mechanism per access/heap/README.HOT.)

The condition for HOT is strict (README.HOT, §“Update Chains With a Single Index Entry”): the update must change no column referenced by any non-summarizing index (including columns used only in a functional-index expression or a partial-index predicate), and the new version must fit on the old version’s page. Summarizing indexes (BRIN) hold no per-tuple TID, so a HOT update may still proceed but must propagate the new values to them (summarized_update). Equality is tested bitwise, not with the datatype’s equality operator, because the engine cannot know which of possibly several notions of equality the indexes care about.

Index scan over a HOT chain — heap_hot_search_buffer

Section titled “Index scan over a HOT chain — heap_hot_search_buffer”

When an index scan resolves a TID to a heap tuple, it does not blindly trust that the TID still names the version it wants. It calls heap_hot_search_buffer, which starts at the (possibly redirected) root line pointer and walks the chain, returning the first member visible to the snapshot:

// heap_hot_search_buffer — src/backend/access/heap/heapam.c (condensed)
for (;;)
{
lp = PageGetItemId(page, offnum);
if (!ItemIdIsNormal(lp))
{
/* a redirect is only legal at the chain start */
if (ItemIdIsRedirected(lp) && at_chain_start)
{
offnum = ItemIdGetRedirect(lp);
at_chain_start = false;
continue;
}
break; /* unused/dead → end of chain */
}
heapTuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
/* ... set t_len, t_self ... */
/* xmin of this link must equal xmax of the previous, else chain broke */
if (TransactionIdIsValid(prev_xmax) &&
!TransactionIdEquals(prev_xmax, HeapTupleHeaderGetXmin(heapTuple->t_data)))
break;
if (!skip)
{
valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
HeapCheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot);
if (valid)
{
ItemPointerSetOffsetNumber(tid, offnum);
PredicateLockTID(relation, &heapTuple->t_self, snapshot, ...);
return true;
}
}
/* not visible: if HOT-updated, advance to t_ctid; else end of chain */
/* ... prev_xmax = HeapTupleHeaderGetUpdateXid(...); offnum = ctid offset ... */
}

Two correctness rails appear here. First, the xmin/xmax match check: when following a forward link, the next tuple’s xmin must equal the prior tuple’s xmax, otherwise the slot was recycled by vacuum for an unrelated tuple and the chain has actually ended (README.HOT, §“Abort Cases”). Second, under an MVCC snapshot at most one chain member is visible, so the caller (heapam_index_fetch_tuple) sets call_again = !IsMVCCSnapshot(snapshot) — non-MVCC snapshots (e.g. SnapshotDirty) may see several and must be re-driven.

Visibility — reading the header against a snapshot

Section titled “Visibility — reading the header against a snapshot”

heap_hot_search_buffer delegates the actual yes/no to HeapTupleSatisfiesVisibility, which for an ordinary query routes to HeapTupleSatisfiesMVCC. This is where the tuple header meets the snapshot. The logic is a two-stage test — first decide whether xmin is visible (the version exists for me), then whether xmax is visible (the version was deleted before my snapshot):

// HeapTupleSatisfiesMVCC — src/backend/access/heap/heapam_visibility.c (heavily condensed)
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
return false; /* inserter aborted: never visible */
else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
{
if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
return false; /* my own insert, but later command */
/* ... then fall through to xmax checks for my own delete ... */
}
else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false; /* inserter still in-flight at snapshot */
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, HeapTupleHeaderGetRawXmin(tuple));
else
{
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId);
return false; /* inserter aborted */
}
}
else /* xmin already hinted committed */
{
if (!HeapTupleHeaderXminFrozen(tuple) &&
XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false; /* committed, but after my snapshot */
}
/* by here the inserter is visible; now is it deleted-and-visible-as-deleted? */
if (tuple->t_infomask & HEAP_XMAX_INVALID) return true; /* never deleted */
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)) return true; /* only locked */
/* ... resolve xmax (incl. MultiXact), test XidInMVCCSnapshot / DidCommit ... */

The decision tree, distilled:

flowchart TD
  A["tuple header"] --> B{"xmin committed\nand visible to snapshot?"}
  B -- "no: in-flight or after snapshot" --> R1["NOT VISIBLE"]
  B -- "no: inserter aborted" --> R1
  B -- "yes" --> C{"xmax set?\n(HEAP_XMAX_INVALID clear)"}
  C -- "no" --> R2["VISIBLE\n(live, never deleted)"]
  C -- "yes" --> D{"xmax is lock-only?\n(HEAP_XMAX_LOCK_ONLY)"}
  D -- "yes" --> R2
  D -- "no" --> E{"deleter committed\nand visible to snapshot?"}
  E -- "no: deleter in-flight or aborted" --> R3["VISIBLE\n(deletion not yet effective)"]
  E -- "yes" --> R4["NOT VISIBLE\n(deleted before snapshot)"]

Figure 4 — HeapTupleSatisfiesMVCC decision tree. A version is visible iff its inserter is committed-and-before-my-snapshot and it is not deleted by a transaction that is also committed-and-before-my-snapshot. XidInMVCCSnapshot (owned by postgres-mvcc-snapshots.md) answers the “in my snapshot’s in-flight set?” question; everything else here is a local read of the tuple header.

The SetHintBits calls are the lazy hint-bit cache from §“Common DBMS Design”: the first reader to resolve xmin/xmax against the commit log writes HEAP_XMIN_COMMITTED / HEAP_XMIN_INVALID / HEAP_XMAX_COMMITTED onto the tuple so the next reader skips the CLOG lookup. The bits are advisory — they are only ever set to a value the commit log already justifies — so they need no WAL and can be written under a share lock.

Dead versions are reclaimed in two regimes. The cheap one is HOT pruning, triggered opportunistically when a page is accessed. heap_page_prune_opt gates on the page’s pd_prune_xid hint and a free-space heuristic, then takes a conditional cleanup lock and calls the workhorse heap_page_prune_and_freeze:

// heap_page_prune_opt — src/backend/access/heap/pruneheap.c (condensed)
prune_xid = ((PageHeader) page)->pd_prune_xid;
if (!TransactionIdIsValid(prune_xid))
return; /* nothing was ever deleted here */
vistest = GlobalVisTestFor(relation);
if (!GlobalVisTestIsRemovableXid(vistest, prune_xid))
return; /* nothing removable yet */
minfree = Max(RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR),
BLCKSZ / 10);
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
if (!ConditionalLockBufferForCleanup(buffer))
return; /* don't block; try again later */
/* recheck under lock, then: */
heap_page_prune_and_freeze(relation, buffer, vistest, 0,
NULL, &presult, PRUNE_ON_ACCESS, &dummy_off_loc, NULL, NULL);
}

Two design notes. First, pruning is opportunistic and non-blocking: it needs a cleanup lock (a stronger lock than exclusive: no other backend may hold even a pin, because pruning may move tuples that others have pointers into), and it uses the conditional variant so a page under heavy concurrent access is simply skipped rather than waited on. The worst-case consequence is only that an UPDATE must place its new version on a different page for lack of reclaimed space. Second, because pruning happens during retrieval of a page that is nearly full (< 10% free), SELECT, UPDATE, and DELETE can all trigger it, but a plain INSERT ... VALUES, which retrieves no row, usually does not (README.HOT, §“When can/should we prune or defragment?”).

The chain-collapsing logic lives in heap_prune_chain, which walks each HOT chain classifying members via the cached HeapTupleSatisfiesVacuum result and deciding which slots become redirects, dead, or unused:

// heap_prune_chain — src/backend/access/heap/pruneheap.c (control flow, condensed)
for (;;)
{
lp = PageGetItemId(page, offnum);
if (ItemIdIsRedirected(lp)) { /* jump to redirect target at chain start */ }
htup = (HeapTupleHeader) PageGetItem(page, lp);
/* chain integrity: this link's xmin must equal prior link's xmax */
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;
chainitems[nchain++] = offnum;
switch (htsv_get_valid_status(prstate->htsv[offnum]))
{
case HEAPTUPLE_DEAD: ndeadchain = nchain; break; /* removable */
case HEAPTUPLE_RECENTLY_DEAD: break; /* keep, may follow */
case HEAPTUPLE_LIVE:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS: goto process_chain; /* stop here */
}
if (!HeapTupleHeaderIsHotUpdated(htup)) goto process_chain; /* end of chain */
offnum = ItemPointerGetOffsetNumber(&htup->t_ctid); /* advance */
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}

The outcome of pruning a chain follows README.HOT exactly:

flowchart LR
  subgraph BEFORE["before prune"]
    direction TB
    B1["lp 1 (LP_NORMAL) root\nv1 DEAD\nctid -> 2"]
    B2["lp 2 (LP_NORMAL)\nv2 live\nHEAP_ONLY_TUPLE"]
    B1 --> B2
    BI["index -> lp 1"] -.-> B1
  end
  subgraph AFTER["after prune"]
    direction TB
    A1["lp 1 (LP_REDIRECT)\nlp_off = 2\nno tuple"]
    A2["lp 2 (LP_NORMAL)\nv2 live"]
    A1 --> A2
    AI["index -> lp 1"] -.-> A1
  end
  BEFORE --> AFTER

Figure 5 — Pruning collapses a HOT chain. The root tuple v1 is dead, so its bytes are reclaimed and line pointer 1 is converted to LP_REDIRECT pointing at line pointer 2 — the index entry, which still names lp 1, now reaches the live v2 through the redirect. When the last live member of a chain dies, the root becomes LP_DEAD instead (it cannot be freed while the index entry persists); the next vacuum removes the index entry and then the dead slot. (Mechanism per access/heap/README.HOT, §“Pruning”.)

This is the precise boundary with vacuum (postgres-vacuum.md): pruning reclaims tuple bytes and shortens chains using only page-local manipulation and never touches indexes. It can convert a root into a redirect or a dead line pointer, but it cannot free a dead line pointer itself, because an index entry still names it. Freeing the dead line pointer — and the index entry that pins it — is vacuum’s job. Pruning is the line pointer’s demotion; vacuum is its burial.

Anchor on symbol names, not line numbers. The PostgreSQL source moves; a function/struct/macro name is the stable handle. Use git grep -n '<symbol>' src/backend/access/heap/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

  • struct HeapTupleHeaderData (in htup_details.h) — the on-disk tuple header: t_choice union (t_heap with t_xmin/t_xmax/t_cid, or the t_datum overlay for in-memory composites), t_ctid, t_infomask2, t_infomask, t_hoff, t_bits[].
  • HeapTupleHeaderGetXmin / …SetXmin / …GetRawXmax / …SetXmax (in htup_details.h) — the canonical accessors; …GetXmin returns FrozenTransactionId for a frozen tuple.
  • HeapTupleHeaderGetUpdateXid (in htup_details.h) — resolves xmax through a MultiXact when HEAP_XMAX_IS_MULTI is set.
  • HeapTupleHeaderIsHotUpdated / …IsHeapOnly and their Set/Clear forms — the HOT flag accessors over t_infomask2.
  • HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMIN_FROZEN, HEAP_XMAX_COMMITTED, HEAP_XMAX_INVALID, HEAP_XMAX_IS_MULTI, HEAP_XMAX_LOCK_ONLY (in htup_details.h) — t_infomask visibility bits.
  • HEAP_HOT_UPDATED, HEAP_ONLY_TUPLE (in htup_details.h) — t_infomask2 HOT flags.
  • struct ItemIdData and LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD (in storage/itemid.h) — the four-state line pointer.

Table AM binding (src/include/access/tableam.h, heapam_handler.c)

Section titled “Table AM binding (src/include/access/tableam.h, heapam_handler.c)”
  • struct TableAmRoutine (in tableam.h) — the function-pointer interface; tuple_insert, tuple_update, tuple_delete, index_fetch_tuple, scan_begin, etc.
  • heapam_methods (in heapam_handler.c) — the heap’s filled-in routine.
  • GetHeapamTableAmRoutine / heap_tableam_handler (in heapam_handler.c) — return the routine (the latter is the SQL-visible am_handler).
  • heapam_tuple_insert / heapam_tuple_update / heapam_tuple_delete / heapam_index_fetch_tuple (in heapam_handler.c) — the thin wrappers that unwrap a TupleTableSlot and call the heapam.c functions.

Heap operations (src/backend/access/heap/heapam.c)

Section titled “Heap operations (src/backend/access/heap/heapam.c)”
  • heap_insert — stamp header, place tuple, WAL; the no-op-on-abort prune comment lives here.
  • heap_prepare_insert — set xmin/cmin, clear xmax, optional freeze, TOAST if oversized.
  • heap_multi_insert — batched insert (one WAL record, one lock cycle per page).
  • heap_delete — concurrency resolution via HeapTupleSatisfiesUpdate, then set xmax + PageSetPrunable, leave tuple in place.
  • heap_update — compute modified_attrs (HeapDetermineColumnsInfo), decide use_hot_update, place new tuple, set old xmax and t_ctid forward link.
  • heap_hot_search_buffer — walk a HOT chain from the root, returning the member visible to the snapshot; enforces the xmin/xmax chain-integrity check.

Visibility (src/backend/access/heap/heapam_visibility.c)

Section titled “Visibility (src/backend/access/heap/heapam_visibility.c)”
  • HeapTupleSatisfiesVisibility — dispatcher by snapshot type.
  • HeapTupleSatisfiesMVCC — the SI visibility decision over xmin/xmax.
  • HeapTupleSatisfiesUpdate — the writer-side classification used by heap_delete/heap_update (TM_Ok, TM_BeingModified, …).
  • HeapTupleSatisfiesVacuum / HeapTupleSatisfiesVacuumHorizon — the prune/vacuum classification (HEAPTUPLE_DEAD, …_RECENTLY_DEAD, …).
  • SetHintBits — lazily caches a resolved commit outcome onto the tuple.
  • XidInMVCCSnapshot — snapshot membership test; the snapshot machinery is in postgres-mvcc-snapshots.md.

Page placement and pruning (hio.c, pruneheap.c)

Section titled “Page placement and pruning (hio.c, pruneheap.c)”
  • RelationPutHeapTuple (in hio.c) — PageAddItem + set t_self and the self-referential t_ctid.
  • RelationGetBufferForTuple (in hio.c) — find/extend a page with room (consults the FSM).
  • heap_page_prune_opt (in pruneheap.c) — opportunistic on-access prune entry; gates on pd_prune_xid and free-space heuristic, conditional cleanup lock.
  • heap_page_prune_and_freeze (in pruneheap.c) — the workhorse, shared by on-access pruning and vacuum.
  • heap_prune_chain (in pruneheap.c) — classify and collapse one HOT chain.
  • heap_page_prune_execute (in pruneheap.c) — apply the redirect/dead/unused decisions to the page.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
struct HeapTupleFieldshtup_details.h122
struct HeapTupleHeaderDatahtup_details.h153
#define HEAP_XMIN_COMMITTEDhtup_details.h204
#define HEAP_HOT_UPDATEDhtup_details.h295
#define HEAP_ONLY_TUPLEhtup_details.h296
HeapTupleHeaderGetUpdateXidhtup_details.h402
HeapTupleHeaderIsHotUpdatedhtup_details.h539
struct ItemIdDatastorage/itemid.h25
typedef struct TableAmRoutineaccess/tableam.h288
heapam_index_fetch_tupleheapam_handler.c115
heapam_tuple_insertheapam_handler.c244
heapam_tuple_deleteheapam_handler.c303
heapam_tuple_updateheapam_handler.c317
heapam_methodsheapam_handler.c2616
GetHeapamTableAmRoutineheapam_handler.c2676
heap_hot_search_bufferheapam.c1717
heap_insertheapam.c2080
heap_prepare_insertheapam.c2271
heap_multi_insertheapam.c2351
heap_deleteheapam.c2773
heap_updateheapam.c3242
HeapDetermineColumnsInfoheapam.c4396
SetHintBitsheapam_visibility.c114
HeapTupleSatisfiesUpdateheapam_visibility.c458
HeapTupleSatisfiesMVCCheapam_visibility.c960
HeapTupleSatisfiesVacuumheapam_visibility.c1171
HeapTupleSatisfiesVisibilityheapam_visibility.c1776
RelationPutHeapTuplehio.c35
RelationGetBufferForTuplehio.c502
heap_page_prune_optpruneheap.c193
heap_page_prune_and_freezepruneheap.c350
heap_prune_chainpruneheap.c999
heap_page_prune_executepruneheap.c1561

Each entry is a fact about the current source at commit 273fe94 (REL_18_STABLE) — readable without any external materials. The trailing note shows how it was checked. Open questions follow as recorded gaps.

  • The heap tuple header is 23 bytes before the null bitmap, and xmin/xmax are always physically stored while cmin/cmax/xvac share one 4-byte field. Verified by reading struct HeapTupleHeaderData and HeapTupleFields in htup_details.h (the t_field3 union and the /* ^ - 23 bytes - ^ */ marker). The comment block above the struct documents the five-virtual-fields- in-three-physical-fields scheme and the combo-CID fallback (combocid.c).

  • HEAP_XMIN_FROZEN is the bitwise OR of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID. Verified in htup_details.h: #define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID). A combination that is otherwise contradictory is repurposed as the frozen sentinel; the freeze mechanism itself is in postgres-vacuum.md.

  • A freshly inserted tuple’s t_ctid points at itself; an UPDATE rewrites the old tuple’s t_ctid to the new version’s slot. Verified in RelationPutHeapTuple (hio.c, item->t_ctid = tuple->t_self) and in heap_update (oldtup.t_data->t_ctid = heaptup->t_self). heap_delete explicitly resets t_ctid to self (tp.t_data->t_ctid = tp.t_self), so a plain delete leaves no forward link.

  • HOT is chosen iff the new tuple lands on the same page AND the update overlaps no HOT-blocking indexed column. Verified in heap_update: the if (newbuf == buffer) guard wrapping if (!bms_overlap(modified_attrs, hot_attrs)) use_hot_update = true;. Summarizing-index overlap sets summarized_update but does not block HOT.

  • HOT equality is tested bitwise, not via datatype equality operators. Stated in README.HOT (§“Update Chains…”, “We insist on bitwise equality rather than using datatype-specific equality routines”) and realized by HeapDetermineColumnsInfo comparing attribute images in heapam.c.

  • Following a forward chain link requires that the next tuple’s xmin equal the prior tuple’s xmax, else the chain has ended. Verified in both heap_hot_search_buffer and heap_prune_chain (the TransactionIdEquals(..., priorXmax) / prev_xmax guards). README.HOT §“Abort Cases” gives the race this defends against: a recycled slot.

  • On-access pruning needs a conditional cleanup lock and silently skips the page if it cannot get one. Verified in heap_page_prune_opt: if (!ConditionalLockBufferForCleanup(buffer)) return;. The free-space gate is Max(fillfactor target, BLCKSZ/10) and the prune-needed hint is the page header’s pd_prune_xid.

  • Pruning never touches indexes; it can demote a root line pointer to LP_REDIRECT or LP_DEAD but cannot free a dead line pointer. Verified by reading heap_prune_chain / heap_page_prune_execute (they manipulate only the page’s line pointers and tuple storage) and confirmed by README.HOT §“Pruning” and §“VACUUM”. Line-pointer burial is vacuum’s job (postgres-vacuum.md).

  • The heap is reached only through the heapam_methods TableAmRoutine. Verified that heap_tableam_handler returns &heapam_methods and that the routine’s tuple_insert/tuple_update/tuple_delete/index_fetch_tuple slots point at the heapam_* wrappers in heapam_handler.c. The AM contract is owned by postgres-table-am.md.

  1. Worst-case HOT chain length and line-pointer bloat. README.HOT caps line pointers per page at MaxHeapTuplesPerPage “arbitrarily” to bound bloat, and notes the truncate-to-line-pointer design “probably needs further work” for statistics. Under a pathological same-page update workload, how long do chains actually get before a cold update is forced, and how much line-pointer bloat persists between vacuums? Investigation path: instrument heap_prune_chain nchain and correlate with n_tup_hot_upd pgstats.

  2. pd_prune_xid precision vs. wasted prune attempts. heap_page_prune_opt reads page free space without a lock as a heuristic (“in the worst case we could get a bogus answer”). How often does this produce a cleanup-lock acquisition that then finds nothing removable? Investigation path: count ConditionalLockBufferForCleanup successes that yield zero ndeleted in heap_page_prune_and_freeze.

  3. MultiXact interaction on the delete/lock path. heap_delete branches heavily on HEAP_XMAX_IS_MULTI (waiting on MultiXactIdWait), and xmax may name a MultiXact rather than a plain XID. The full MultiXact lifecycle (allocation, member storage, the HEAP_XMAX_LOCK_ONLY interplay) is only sketched here. Investigation path: cross-reference the planned postgres-xid-wraparound-freeze.md / multixact doc and README.tuplock.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

  • In-place + undo (Oracle, MySQL/InnoDB). These engines overwrite the live row and keep the prior image in an undo segment, with a rollback pointer on the row. The trade is symmetric to PostgreSQL’s: a compact heap and no vacuum-of-the-table, paid for with undo-segment management and read amplification when reconstructing old versions. A measured side-by-side of PostgreSQL HOT/prune cost vs. InnoDB undo cost would quantify the no-overwrite tax. (Stonebraker 1987 is the no-overwrite thesis; Database Internals ch. 5 covers both families.)

  • CUBRID’s out-of-place MVCC. CUBRID stamps mvcc_ins_id/mvcc_del_id on the record header like PostgreSQL stamps xmin/xmax, but chains old versions backward into the log via prev_version_lsa rather than forward on the heap page. The contrast with PostgreSQL’s same-page forward HOT chain is instructive: CUBRID keeps the heap compact at the cost of a log indirection to read an old version, while PostgreSQL keeps old versions on-page at the cost of bloat and vacuum. See knowledge/code-analysis/cubrid/cubrid-mvcc.md.

  • The undo-log for PostgreSQL: zheap. The (incomplete) zheap project aimed to give PostgreSQL an in-place storage AM with undo, precisely to escape bloat and vacuum. It is the clearest demonstration that the pluggable TableAmRoutine was designed to host a fundamentally different storage model. A doc tracing what zheap would change (and why it stalled) belongs next to postgres-table-am.md.

  • In-memory MVCC version layout (HyPer, Hekaton, Cicada). Cache-aware in-memory engines redesign the version chain (newest-to-oldest deltas, per-tuple version vectors) and often drop the central registry. Wu et al., An Empirical Evaluation of In-Memory MVCC (VLDB 2017) surveys the space and is the right anchor for distinguishing costs intrinsic to MVCC from those intrinsic to a disk-resident, no-overwrite MVCC like PostgreSQL’s.

  • Time travel, the original goal. Stonebraker’s 1987 storage paper kept all versions specifically to support AS OF historical queries and to eliminate the WAL. PostgreSQL dropped both but kept the no-overwrite heap. The modern re-emergence of temporal/time-travel features (SQL:2011 system-versioned tables) makes a “what PostgreSQL gave up and what it would cost to get back” a worthwhile historical-arc note.

  • src/backend/access/heap/README.HOT — the authoritative HOT specification: single-index-entry update chains, redirect/dead line pointers, pruning vs. defragmentation, abort-case chain integrity, CREATE INDEX [CONCURRENTLY] interaction, and the glossary used throughout this doc.

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database System Concepts (Silberschatz et al., 7e), ch. 13 “Data Storage Structures” — file organization, slotted pages, record layout.
  • Database Internals (Petrov), ch. 3 “File Formats” (slotted pages) and ch. 5 “Transaction Processing and Recovery”, §“Multiversion Concurrency Control” / §“Isolation Levels”.

Papers (under knowledge/research/dbms-papers/ and the bibliography)

Section titled “Papers (under knowledge/research/dbms-papers/ and the bibliography)”
  • Stonebraker, The Design of the POSTGRES Storage System (VLDB 1987) — the no-overwrite storage thesis that the PostgreSQL heap descends from. See .omc/plans/postgres-paper-bibliography.md §2 (acquisition queue).

PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)

Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”
  • src/backend/access/heap/heapam.c
  • src/backend/access/heap/heapam_handler.c
  • src/backend/access/heap/heapam_visibility.c
  • src/backend/access/heap/hio.c
  • src/backend/access/heap/pruneheap.c
  • src/include/access/htup_details.h
  • src/include/access/tableam.h
  • src/include/storage/itemid.h
  • postgres-table-am.md — the TableAmRoutine contract the heap implements.
  • postgres-mvcc-snapshots.md — snapshot construction and XidInMVCCSnapshot.
  • postgres-visibility-map.md — the all-visible bit cleared on insert/delete.
  • postgres-vacuum.md — coordinated reclamation, freezing, dead-line-pointer burial.
  • postgres-page-layout.md — full heap page geometry (PageHeaderData, PageAddItem).