PostgreSQL Heap Access Method — No-Overwrite MVCC Tuples, Slotted Pages, and HOT
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A heap file in the textbook sense is an unordered collection of records: a relation’s rows live in pages, addressed by a page number plus a slot, with no ordering guarantee (Database System Concepts, 7e, ch. 13 “Data Storage Structures”, §“File Organization”; Database Internals, Petrov, ch. 3 “File Formats”). The two questions every heap implementation must answer are: how is a variable-length record laid out within a fixed-size page, and what happens to the old bytes when a record is updated or deleted.
The layout answer is almost universal: the slotted page (also “slot
directory” page). A page header sits at the front; an array of small line
pointers (slots) grows downward from just after the header; the actual records
grow upward from the end of the page; the free space is the hole in the middle.
A record’s address is (page, slot index), and the slot — not the record’s
byte offset — is the stable handle, so the record can be moved within the page
during compaction without invalidating references that go through the slot
(Database Internals, ch. 3, §“Slotted Pages”; Database System Concepts,
§“Organization of Records in Files”). This indirection is the structural
foundation everything else in this document builds on.
The update/delete answer is the design fork that defines the engine. There are two families:
- Update-in-place. The new value overwrites the old bytes; the prior image, if kept at all, goes to a separate undo area so that concurrent readers and rollback can reconstruct it. Oracle and MySQL/InnoDB are the canonical in-place engines.
- No-overwrite (append/versioned). The old version is left untouched; a new version is written elsewhere and the two are linked. Readers choose a version by timestamp/transaction visibility. This is the lineage of the Berkeley POSTGRES storage system.
PostgreSQL is the no-overwrite engine made famous. Its design traces directly
to Stonebraker’s The Design of the POSTGRES Storage System (VLDB 1987), whose
thesis was that storage should never overwrite committed data: an update
writes a new tuple and leaves the prior one as historical state, originally so
that “time-travel” queries could read the database as of any past instant, and
the log could be eliminated in favor of forcing data at commit. Modern
PostgreSQL kept the no-overwrite heap but pairs it with a conventional WAL
(see postgres-xlog-wal.md) rather than the paper’s force-at-commit scheme;
the time-travel feature was dropped, but the no-overwrite tuple is exactly the
mechanism that makes multiversion concurrency control (MVCC) free on the
read path.
MVCC is the concurrency-control family (Database Internals, ch. 5 “Transaction Processing and Recovery”, §“Multiversion Concurrency Control”) in which “reads can continue accessing older values until the new ones are committed” — coordination is pushed down to visibility rather than mutual exclusion. The dominant isolation level on top of it is snapshot isolation (SI): a transaction reads against a logical snapshot, sees the version of each row that was committed as of that snapshot, and ignores versions written by transactions still in flight. A no-overwrite heap is the natural substrate for SI because every version a snapshot might need is physically still present; visibility is a predicate over the version’s creator/destroyer transaction IDs.
Three implementation choices follow, and they organize the rest of this document:
- What metadata each version carries so visibility can be decided. In
PostgreSQL:
xmin(inserting xact),xmax(deleting/locking xact), and a word of hint/flag bits. - How versions of one row are chained. In PostgreSQL: the old tuple’s
t_ctidpoints forward to the new tuple’s slot on (usually) the same page. - How dead versions are reclaimed. In PostgreSQL: single-page HOT
pruning for the cheap case, and full vacuum (deferred to
postgres-vacuum.md) for the index-coordinated case.
Snapshot construction — how the set of in-flight XIDs is captured and how
XidInMVCCSnapshot decides membership — is the subject of
postgres-mvcc-snapshots.md and postgres-procarray.md. This document treats
the snapshot as a given oracle and focuses on the tuple: its header, its place
on the page, and the insert/update/delete/prune operations that write and
reclaim it.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the model; this section names the engineering conventions that recur across heap implementations, so that PostgreSQL’s specific symbols in the next section read as one set of dials within a shared space rather than as inventions.
Per-version visibility metadata
Section titled “Per-version visibility metadata”Every MVCC engine stamps a version with who created it and who destroyed
it, plus enough flag state to short-circuit the common visibility cases. The
minimum is a (created_by, deleted_by) pair of transaction identifiers.
PostgreSQL stores this pair inline on the heap tuple (xmin/xmax).
Oracle and InnoDB store only a pointer (an undo locator / rollback pointer) on
the live row and keep the prior image in undo; the “deleted_by” question is
answered by walking undo. The inline choice makes PostgreSQL’s read path a
local test on the tuple header, at the cost of carrying that header on every
version and needing a vacuum to remove dead ones.
A second universal convention is hint bits: a cache, stored alongside the
version, of “the inserting transaction is known committed / aborted” and the
same for the deleter. Resolving a transaction’s commit status from the commit
log (CLOG/pg_xact) is comparatively expensive, so the first reader to learn
the outcome records it on the tuple, and subsequent readers skip the lookup.
The hint bit is derived, not authoritative — it can always be recomputed from
the commit log — which is why it can be set lazily and without WAL.
Slotted page with an indirection slot
Section titled “Slotted page with an indirection slot”The slotted page is near-universal, but the slot gains extra duty in a versioned heap. Beyond “offset + length of this record”, the slot can encode a small state machine: a normal slot points at a tuple; a redirect slot points at another slot (no tuple of its own); a dead slot points at nothing but cannot be reused yet because an index still references it; an unused slot is free for the taking. This four-state slot is what lets an engine reclaim the tuple bytes of a dead version while keeping the slot alive as a stable index target — the key trick behind single-page reclamation.
Forward version chain on the page
Section titled “Forward version chain on the page”When a row is updated, the engine must let a reader who holds a reference to the old version find the new one. The convention is a forward link from old to new — a “next version” pointer in the old version’s header. In an in-place engine this is implicit (the row simply changed); in a no-overwrite engine the link is explicit and physical. Keeping the chain within one page where possible is the optimization that makes following it free of extra I/O.
Avoiding redundant index work on update
Section titled “Avoiding redundant index work on update”In a no-overwrite heap, the naïve cost of an UPDATE is one new heap tuple plus one new entry in every index, even if no indexed column changed — because the new tuple has a new physical address (TID) and indexes are keyed to TIDs. Every serious no-overwrite engine has some answer to this. PostgreSQL’s answer is Heap-Only Tuples (HOT): if the update changes no indexed column and the new version fits on the same page, no index entry is made and the new version is reachable only by following the on-page chain from the existing index entry.
Reclamation: cheap single-page vs. coordinated full
Section titled “Reclamation: cheap single-page vs. coordinated full”Dead versions accumulate, so reclamation is mandatory, and it splits in two:
- Single-page reclamation can drop a dead version’s bytes without touching indexes — but only if no index points at that specific version. HOT chains are engineered precisely so the intermediate versions have no index entries, making them single-page-reclaimable.
- Coordinated reclamation (full vacuum) is needed when an index does point at the dead version; it must scan and clean the indexes before the heap slot can be freed. This is the expensive, table-wide path.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name |
|---|---|
| Slotted page record layout | heap page (PageHeaderData + ItemIdData[] + tuples) — postgres-page-layout.md |
| Indirection slot (4-state) | ItemIdData.lp_flags: LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD |
| Version “created by” | HeapTupleHeaderData.t_choice.t_heap.t_xmin |
| Version “deleted/locked by” | HeapTupleHeaderData.t_choice.t_heap.t_xmax |
| Visibility hint bits | t_infomask: HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, … |
| Forward version pointer | HeapTupleHeaderData.t_ctid (→ slot of next version, or self) |
| Index-free same-page update | HOT update — HEAP_HOT_UPDATED (old) + HEAP_ONLY_TUPLE (new) |
| Single-page reclamation | heap_page_prune_opt → heap_page_prune_and_freeze |
| Coordinated reclamation | vacuum — postgres-vacuum.md |
| Snapshot membership oracle | XidInMVCCSnapshot (in postgres-mvcc-snapshots.md) |
| Pluggable table interface | TableAmRoutine (heap = heapam_methods) — postgres-table-am.md |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s heap is the default implementation of the table access method.
The executor never calls heap_insert directly; it calls
table_tuple_insert, which dispatches through a TableAmRoutine function-
pointer table. For an ordinary table that table is heapam_methods, and its
slots point at thin wrappers (heapam_tuple_insert, heapam_tuple_update, …)
that unwrap the executor’s TupleTableSlot into a HeapTuple and call the
real heap functions in heapam.c. The AM boundary itself is owned by
postgres-table-am.md; here we care about what the heap does once dispatched.
flowchart TB EXEC["executor / ExecInsert, ExecUpdate, ExecDelete"] --> TAM["table_tuple_insert / _update / _delete\n(tableam.h inline wrappers)"] TAM --> RT["heapam_methods\n(const TableAmRoutine)"] RT --> H1["heapam_tuple_insert"] RT --> H2["heapam_tuple_update"] RT --> H3["heapam_tuple_delete"] RT --> H4["heapam_index_fetch_tuple"] H1 --> HI["heap_insert (heapam.c)"] H2 --> HU["heap_update (heapam.c)"] H3 --> HD["heap_delete (heapam.c)"] H4 --> HS["heap_hot_search_buffer (heapam.c)"]
Figure 1 — The heap as a table AM. The executor speaks only the
TableAmRoutine vocabulary; heapam_methods binds those slots to wrappers in
heapam_handler.c, which call the operative functions in heapam.c. Swapping
heapam_methods for another routine (a columnar or in-memory AM) is the
pluggable-storage extension point — see postgres-table-am.md.
The routine is a plain static const struct of function pointers:
// heapam_methods — src/backend/access/heap/heapam_handler.c (condensed)static const TableAmRoutine heapam_methods = { .type = T_TableAmRoutine, .slot_callbacks = heapam_slot_callbacks, .scan_begin = heap_beginscan, .scan_getnextslot = heap_getnextslot, .index_fetch_tuple = heapam_index_fetch_tuple, .tuple_insert = heapam_tuple_insert, .multi_insert = heap_multi_insert, .tuple_delete = heapam_tuple_delete, .tuple_update = heapam_tuple_update, .tuple_satisfies_snapshot = heapam_tuple_satisfies_snapshot, .relation_vacuum = heap_vacuum_rel, /* ... ~40 callbacks total ... */};The wrappers are deliberately thin. heapam_tuple_insert is essentially “pull
a HeapTuple out of the slot, call heap_insert, copy the resulting TID back
into the slot”:
// heapam_tuple_insert — src/backend/access/heap/heapam_handler.cstatic voidheapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid, int options, BulkInsertState bistate){ bool shouldFree = true; HeapTuple tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
slot->tts_tableOid = RelationGetRelid(relation); tuple->t_tableOid = slot->tts_tableOid;
heap_insert(relation, tuple, cid, options, bistate); ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
if (shouldFree) pfree(tuple);}The tuple header — HeapTupleHeaderData
Section titled “The tuple header — HeapTupleHeaderData”Every on-disk heap tuple begins with a fixed 23-byte header, followed by an optional null bitmap, alignment padding, and the user data. The header is where MVCC lives:
// HeapTupleFields + HeapTupleHeaderData — src/include/access/htup_details.h (condensed)typedef struct HeapTupleFields{ TransactionId t_xmin; /* inserting xact ID */ TransactionId t_xmax; /* deleting or locking xact ID */ union { CommandId t_cid; /* inserting or deleting command ID, or both */ TransactionId t_xvac; /* old-style VACUUM FULL xact ID */ } t_field3;} HeapTupleFields;
struct HeapTupleHeaderData{ union { HeapTupleFields t_heap; DatumTupleFields t_datum; /* overlay for in-memory composite Datums */ } t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a * speculative insertion token) */ uint16 t_infomask2; /* number of attributes + various flags */ uint16 t_infomask; /* various flag bits, see below */ uint8 t_hoff; /* sizeof header incl. bitmap, padding */ /* ^ - 23 bytes - ^ */ bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */};Three fields carry the entire MVCC story:
t_xmin— the XID of the transaction that inserted this version. A reader’s snapshot decides whether that transaction’s effects are visible.t_xmax— the XID of the transaction that deleted or updated-away this version (or merely locked it; theHEAP_XMAX_LOCK_ONLYflag distinguishes a locker from a deleter). Zero/invalid means the version is live, the latest of its row.t_ctid— normally points to the tuple’s own slot. After an UPDATE, the old tuple’st_ctidis rewritten to point at the new version’s slot — the physical forward chain. (It is also overloaded for a speculative-insert token and a “moved to another partition” marker.)
The t_infomask flag word holds the hint bits and lock state; the
t_infomask2 word holds the attribute count plus the two HOT flags:
// selected flags — src/include/access/htup_details.h#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */#define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker *//* ... in t_infomask2 ... */#define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */#define HEAP_ONLY_TUPLE 0x8000 /* this is a heap-only tuple */HEAP_XMIN_FROZEN is the combination used by freezing (see
postgres-vacuum.md): a frozen tuple’s xmin is treated as committed-and-
visible-to-everyone, so its real XID can be recycled across the 32-bit
wraparound boundary. The frozen state is both the committed and invalid bits
set — a value that would otherwise be contradictory, repurposed as a sentinel.
The slotted heap page and line pointers
Section titled “The slotted heap page and line pointers”A heap tuple is never addressed by its byte offset; it is addressed by a
line pointer (ItemIdData), a 4-byte slot in the array that grows
downward from the page header. The slot is a bit-packed triple:
// ItemIdData — src/include/storage/itemid.htypedef struct ItemIdData{ unsigned lp_off:15, /* offset to tuple (from start of page) */ lp_flags:2, /* state of line pointer, see below */ lp_len:15; /* byte length of tuple */} ItemIdData;
#define LP_UNUSED 0 /* unused (should always have lp_len=0) */#define LP_NORMAL 1 /* used (should always have lp_len>0) */#define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */#define LP_DEAD 3 /* dead, may or may not have storage */The two-bit lp_flags is the four-state indirection slot from §“Common DBMS
Design”, and it is the linchpin of HOT. A LP_REDIRECT slot has no tuple of
its own (lp_len == 0); its lp_off holds the offset number of another line
pointer. A LP_DEAD slot points at nothing but cannot be recycled while an
index still references it. The detailed page geometry (pd_lower, pd_upper,
pd_special, PageAddItem) is owned by postgres-page-layout.md; what matters
here is that the TID (block, offset) an index stores is a slot index, not a
byte address, so the tuple bytes can be moved or removed while the slot
remains a valid index target.
flowchart TB
subgraph PAGE["heap page (slotted)"]
direction TB
HDR["PageHeaderData\n(pd_lower, pd_upper, pd_prune_xid, ...)"]
subgraph LP["line pointer array (grows down)"]
direction LR
L1["ItemId 1\nLP_NORMAL\noff=upper-A, len"]
L2["ItemId 2\nLP_NORMAL"]
L3["ItemId 3\nLP_DEAD\nlen may be 0"]
end
FREE["free space\n(pd_lower .. pd_upper)"]
subgraph TUP["tuples (grow up)"]
direction LR
T1["tuple A\n(header + data)"]
T2["tuple B"]
end
end
L1 --> T1
L2 --> T2
Figure 2 — Slotted heap page. The header is fixed at the front; line pointers
grow downward; tuples grow upward; the hole between pd_lower and pd_upper
is free space. An index entry stores (block, offset-number-of-line-pointer),
so line pointer 3 can be LP_DEAD (its tuple bytes already gone) yet still be
a legal index target until vacuum clears the index entry. Full geometry:
postgres-page-layout.md.
INSERT — heap_insert
Section titled “INSERT — heap_insert”An insert is the simple case: stamp the header, find a page with room, write
the tuple, log it. heap_prepare_insert fills the header — note it clears all
visibility bits, sets xmin to the current XID, marks xmax invalid, and sets
cmin:
// heap_prepare_insert — src/backend/access/heap/heapam.c (condensed)tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);tup->t_data->t_infomask |= HEAP_XMAX_INVALID;HeapTupleHeaderSetXmin(tup->t_data, xid);if (options & HEAP_INSERT_FROZEN) HeapTupleHeaderSetXminFrozen(tup->t_data);HeapTupleHeaderSetCmin(tup->t_data, cid);HeapTupleHeaderSetXmax(tup->t_data, 0); /* for cleanliness */tup->t_tableOid = RelationGetRelid(relation);heap_insert then picks the target buffer via RelationGetBufferForTuple
(which consults the free-space map — postgres-free-space-map.md), runs the
SSI conflict check, and places the tuple inside a critical section:
// heap_insert — src/backend/access/heap/heapam.c (condensed)heaptup = heap_prepare_insert(relation, tup, xid, cid, options);buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ...);CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
START_CRIT_SECTION();RelationPutHeapTuple(relation, buffer, heaptup, (options & HEAP_INSERT_SPECULATIVE) != 0);if (PageIsAllVisible(BufferGetPage(buffer))){ all_visible_cleared = true; PageClearAllVisible(BufferGetPage(buffer)); visibilitymap_clear(relation, ItemPointerGetBlockNumber(&heaptup->t_self), vmbuffer, VISIBILITYMAP_VALID_BITS);}MarkBufferDirty(buffer);/* ... XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT) ... */END_CRIT_SECTION();RelationPutHeapTuple does the slot allocation and, crucially, sets the new
tuple’s t_ctid to point at itself — a freshly inserted version is the
latest of its row:
// RelationPutHeapTuple — src/backend/access/heap/hio.c (condensed)offnum = PageAddItem(pageHeader, (Item) tuple->t_data, tuple->t_len, InvalidOffsetNumber, false, true);ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);if (!token){ ItemId itemId = PageGetItemId(pageHeader, offnum); HeapTupleHeader item = (HeapTupleHeader) PageGetItem(pageHeader, itemId); item->t_ctid = tuple->t_self; /* point at self */}The order is load-bearing: the page is dirtied and WAL is emitted inside
START_CRIT_SECTION()/END_CRIT_SECTION(), and the WAL-before-flush rule
(see postgres-xlog-wal.md) guarantees the insert record is durable before the
dirty page can reach disk. Clearing the visibility-map bit if the page was
all-visible is the seam to postgres-visibility-map.md.
DELETE — heap_delete
Section titled “DELETE — heap_delete”A delete never removes the tuple. It stamps xmax with the deleting XID,
records the command id, and sets the page’s prune hint. The tuple stays on the
page, fully formed, until pruning or vacuum reclaims it:
// heap_delete — src/backend/access/heap/heapam.c (condensed, the commit path)PageSetPrunable(page, xid); /* this page now has a deletable tuple */
/* store transaction information of xact deleting the tuple */tp.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);tp.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;tp.t_data->t_infomask |= new_infomask;tp.t_data->t_infomask2 |= new_infomask2;HeapTupleHeaderClearHotUpdated(tp.t_data);HeapTupleHeaderSetXmax(tp.t_data, new_xmax);HeapTupleHeaderSetCmax(tp.t_data, cid, iscombo);/* Make sure there is no forward chain link in t_ctid */tp.t_data->t_ctid = tp.t_self;Before reaching this point, heap_delete resolved concurrency: it called
HeapTupleSatisfiesUpdate to classify the tuple (TM_Ok, TM_BeingModified,
TM_Updated, TM_Deleted, …), and if another transaction held xmax it
waited on that transaction (or its MultiXact) before re-checking. The
PageSetPrunable(page, xid) call records, in the page header’s pd_prune_xid,
the oldest XID whose commit would make a tuple on this page prunable — the hint
that later lets heap_page_prune_opt skip pages with nothing to do. Note
t_ctid is explicitly reset to self: a plain delete leaves no forward link.
UPDATE — heap_update and the HOT decision
Section titled “UPDATE — heap_update and the HOT decision”An update is a delete-plus-insert that links the two: the old tuple gets xmax
set (as in delete), a new tuple is written, and the old tuple’s t_ctid is
pointed at the new one. The interesting decision is whether the update can be
HOT.
heap_update first computes which columns actually changed
(HeapDetermineColumnsInfo → modified_attrs) and fetches the relation’s
“interesting” index-column bitmaps. After it has secured a page for the new
tuple, it asks: did the new tuple land on the same page, and did the update
touch no HOT-blocking indexed column?
// heap_update — src/backend/access/heap/heapam.c (the HOT decision, condensed)if (newbuf == buffer){ /* same page: maybe a HOT update */ if (!bms_overlap(modified_attrs, hot_attrs)) { use_hot_update = true; /* still must propagate to summarizing (e.g. BRIN) indexes */ if (bms_overlap(modified_attrs, sum_attrs)) summarized_update = true; }}else{ /* new tuple went to a different page: cold update */ PageSetFull(page); /* hint: old page wants prune/defrag */}If HOT applies, the old tuple is flagged HEAP_HOT_UPDATED and the new tuple
HEAP_ONLY_TUPLE; otherwise both flags are cleared. Then the new tuple is
placed and the forward link is written:
// heap_update — src/backend/access/heap/heapam.c (condensed)if (use_hot_update){ HeapTupleSetHotUpdated(&oldtup); /* old: HEAP_HOT_UPDATED */ HeapTupleSetHeapOnly(heaptup); /* new: HEAP_ONLY_TUPLE */ HeapTupleSetHeapOnly(newtup);}else{ HeapTupleClearHotUpdated(&oldtup); HeapTupleClearHeapOnly(heaptup); HeapTupleClearHeapOnly(newtup);}RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new version */
/* old tuple gets xmax + forward link */HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);oldtup.t_data->t_infomask |= infomask_old_tuple;HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo);oldtup.t_data->t_ctid = heaptup->t_self; /* OLD points to NEW */The semantic payoff: a HEAP_ONLY_TUPLE has no index entry. The only way
to reach it is to start at an index entry that points to the root of the
chain and walk forward via t_ctid while the HEAP_HOT_UPDATED flag is set.
That is exactly what an index scan does (next subsection). Because the chain is
constrained to a single page, the walk costs no extra buffer fetch.
flowchart LR
IDX["index entry\n(key, TID=block:1)"] --> LP1
subgraph PAGE["one heap page"]
direction LR
LP1["lp 1 (LP_NORMAL)\nroot tuple v1\nHEAP_HOT_UPDATED\nt_ctid -> lp 2"]
LP2["lp 2 (LP_NORMAL)\ntuple v2\nHEAP_ONLY_TUPLE, HEAP_HOT_UPDATED\nt_ctid -> lp 3"]
LP3["lp 3 (LP_NORMAL)\ntuple v3 (live)\nHEAP_ONLY_TUPLE\nt_ctid -> self"]
LP1 --> LP2 --> LP3
end
Figure 3 — A HOT chain. The index points only at line pointer 1 (the root).
Versions v2 and v3 are HEAP_ONLY_TUPLEs with no index entries; an index scan
arriving at lp 1 follows t_ctid forward while HEAP_HOT_UPDATED is set,
finding the one version visible to its snapshot. All three versions sit on the
same page, so the walk is free of extra I/O. (Mechanism per
access/heap/README.HOT.)
The condition for HOT is strict (README.HOT, §“Update Chains With a Single
Index Entry”): the update must change no column referenced by any
non-summarizing index (including columns used only in a functional-index
expression or a partial-index predicate), and the new version must fit on the
old version’s page. Summarizing indexes (BRIN) hold no per-tuple TID, so a HOT
update may still proceed but must propagate the new values to them
(summarized_update). Equality is tested bitwise, not with the datatype’s
equality operator, because the engine cannot know which of possibly several
notions of equality the indexes care about.
Index scan over a HOT chain — heap_hot_search_buffer
Section titled “Index scan over a HOT chain — heap_hot_search_buffer”When an index scan resolves a TID to a heap tuple, it does not blindly trust
that the TID still names the version it wants. It calls
heap_hot_search_buffer, which starts at the (possibly redirected) root line
pointer and walks the chain, returning the first member visible to the
snapshot:
// heap_hot_search_buffer — src/backend/access/heap/heapam.c (condensed)for (;;){ lp = PageGetItemId(page, offnum);
if (!ItemIdIsNormal(lp)) { /* a redirect is only legal at the chain start */ if (ItemIdIsRedirected(lp) && at_chain_start) { offnum = ItemIdGetRedirect(lp); at_chain_start = false; continue; } break; /* unused/dead → end of chain */ }
heapTuple->t_data = (HeapTupleHeader) PageGetItem(page, lp); /* ... set t_len, t_self ... */
/* xmin of this link must equal xmax of the previous, else chain broke */ if (TransactionIdIsValid(prev_xmax) && !TransactionIdEquals(prev_xmax, HeapTupleHeaderGetXmin(heapTuple->t_data))) break;
if (!skip) { valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer); HeapCheckForSerializableConflictOut(valid, relation, heapTuple, buffer, snapshot); if (valid) { ItemPointerSetOffsetNumber(tid, offnum); PredicateLockTID(relation, &heapTuple->t_self, snapshot, ...); return true; } } /* not visible: if HOT-updated, advance to t_ctid; else end of chain */ /* ... prev_xmax = HeapTupleHeaderGetUpdateXid(...); offnum = ctid offset ... */}Two correctness rails appear here. First, the xmin/xmax match check: when
following a forward link, the next tuple’s xmin must equal the prior tuple’s
xmax, otherwise the slot was recycled by vacuum for an unrelated tuple and the
chain has actually ended (README.HOT, §“Abort Cases”). Second, under an MVCC
snapshot at most one chain member is visible, so the caller
(heapam_index_fetch_tuple) sets call_again = !IsMVCCSnapshot(snapshot) —
non-MVCC snapshots (e.g. SnapshotDirty) may see several and must be re-driven.
Visibility — reading the header against a snapshot
Section titled “Visibility — reading the header against a snapshot”heap_hot_search_buffer delegates the actual yes/no to
HeapTupleSatisfiesVisibility, which for an ordinary query routes to
HeapTupleSatisfiesMVCC. This is where the tuple header meets the snapshot.
The logic is a two-stage test — first decide whether xmin is visible (the
version exists for me), then whether xmax is visible (the version was deleted
before my snapshot):
// HeapTupleSatisfiesMVCC — src/backend/access/heap/heapam_visibility.c (heavily condensed)if (!HeapTupleHeaderXminCommitted(tuple)){ if (HeapTupleHeaderXminInvalid(tuple)) return false; /* inserter aborted: never visible */ else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple))) { if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid) return false; /* my own insert, but later command */ /* ... then fall through to xmax checks for my own delete ... */ } else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false; /* inserter still in-flight at snapshot */ else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, HeapTupleHeaderGetRawXmin(tuple)); else { SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, InvalidTransactionId); return false; /* inserter aborted */ }}else /* xmin already hinted committed */{ if (!HeapTupleHeaderXminFrozen(tuple) && XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false; /* committed, but after my snapshot */}
/* by here the inserter is visible; now is it deleted-and-visible-as-deleted? */if (tuple->t_infomask & HEAP_XMAX_INVALID) return true; /* never deleted */if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)) return true; /* only locked *//* ... resolve xmax (incl. MultiXact), test XidInMVCCSnapshot / DidCommit ... */The decision tree, distilled:
flowchart TD
A["tuple header"] --> B{"xmin committed\nand visible to snapshot?"}
B -- "no: in-flight or after snapshot" --> R1["NOT VISIBLE"]
B -- "no: inserter aborted" --> R1
B -- "yes" --> C{"xmax set?\n(HEAP_XMAX_INVALID clear)"}
C -- "no" --> R2["VISIBLE\n(live, never deleted)"]
C -- "yes" --> D{"xmax is lock-only?\n(HEAP_XMAX_LOCK_ONLY)"}
D -- "yes" --> R2
D -- "no" --> E{"deleter committed\nand visible to snapshot?"}
E -- "no: deleter in-flight or aborted" --> R3["VISIBLE\n(deletion not yet effective)"]
E -- "yes" --> R4["NOT VISIBLE\n(deleted before snapshot)"]
Figure 4 — HeapTupleSatisfiesMVCC decision tree. A version is visible iff its
inserter is committed-and-before-my-snapshot and it is not deleted by a
transaction that is also committed-and-before-my-snapshot. XidInMVCCSnapshot
(owned by postgres-mvcc-snapshots.md) answers the “in my snapshot’s in-flight
set?” question; everything else here is a local read of the tuple header.
The SetHintBits calls are the lazy hint-bit cache from §“Common DBMS Design”:
the first reader to resolve xmin/xmax against the commit log writes
HEAP_XMIN_COMMITTED / HEAP_XMIN_INVALID / HEAP_XMAX_COMMITTED onto the
tuple so the next reader skips the CLOG lookup. The bits are advisory — they are
only ever set to a value the commit log already justifies — so they need no WAL
and can be written under a share lock.
Pruning — single-page reclamation
Section titled “Pruning — single-page reclamation”Dead versions are reclaimed in two regimes. The cheap one is HOT pruning,
triggered opportunistically when a page is accessed. heap_page_prune_opt
gates on the page’s pd_prune_xid hint and a free-space heuristic, then takes a
conditional cleanup lock and calls the workhorse
heap_page_prune_and_freeze:
// heap_page_prune_opt — src/backend/access/heap/pruneheap.c (condensed)prune_xid = ((PageHeader) page)->pd_prune_xid;if (!TransactionIdIsValid(prune_xid)) return; /* nothing was ever deleted here */
vistest = GlobalVisTestFor(relation);if (!GlobalVisTestIsRemovableXid(vistest, prune_xid)) return; /* nothing removable yet */
minfree = Max(RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR), BLCKSZ / 10);if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree){ if (!ConditionalLockBufferForCleanup(buffer)) return; /* don't block; try again later */ /* recheck under lock, then: */ heap_page_prune_and_freeze(relation, buffer, vistest, 0, NULL, &presult, PRUNE_ON_ACCESS, &dummy_off_loc, NULL, NULL);}Two design notes. First, pruning is opportunistic and non-blocking: it needs
a cleanup lock (a stronger lock than exclusive: no other backend may hold even
a pin, because pruning may move tuples that others have pointers into), and it
uses the conditional variant so a page under heavy concurrent access is simply
skipped rather than waited on. The worst-case consequence is only that an UPDATE
must place its new version on a different page for lack of reclaimed space.
Second, because pruning happens during retrieval of a page that is nearly full
(< 10% free), SELECT, UPDATE, and DELETE can all trigger it, but a plain
INSERT ... VALUES, which retrieves no row, usually does not (README.HOT,
§“When can/should we prune or defragment?”).
The chain-collapsing logic lives in heap_prune_chain, which walks each HOT
chain classifying members via the cached HeapTupleSatisfiesVacuum result and
deciding which slots become redirects, dead, or unused:
// heap_prune_chain — src/backend/access/heap/pruneheap.c (control flow, condensed)for (;;){ lp = PageGetItemId(page, offnum); if (ItemIdIsRedirected(lp)) { /* jump to redirect target at chain start */ } htup = (HeapTupleHeader) PageGetItem(page, lp);
/* chain integrity: this link's xmin must equal prior link's xmax */ if (TransactionIdIsValid(priorXmax) && !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax)) break;
chainitems[nchain++] = offnum; switch (htsv_get_valid_status(prstate->htsv[offnum])) { case HEAPTUPLE_DEAD: ndeadchain = nchain; break; /* removable */ case HEAPTUPLE_RECENTLY_DEAD: break; /* keep, may follow */ case HEAPTUPLE_LIVE: case HEAPTUPLE_INSERT_IN_PROGRESS: case HEAPTUPLE_DELETE_IN_PROGRESS: goto process_chain; /* stop here */ } if (!HeapTupleHeaderIsHotUpdated(htup)) goto process_chain; /* end of chain */ offnum = ItemPointerGetOffsetNumber(&htup->t_ctid); /* advance */ priorXmax = HeapTupleHeaderGetUpdateXid(htup);}The outcome of pruning a chain follows README.HOT exactly:
flowchart LR
subgraph BEFORE["before prune"]
direction TB
B1["lp 1 (LP_NORMAL) root\nv1 DEAD\nctid -> 2"]
B2["lp 2 (LP_NORMAL)\nv2 live\nHEAP_ONLY_TUPLE"]
B1 --> B2
BI["index -> lp 1"] -.-> B1
end
subgraph AFTER["after prune"]
direction TB
A1["lp 1 (LP_REDIRECT)\nlp_off = 2\nno tuple"]
A2["lp 2 (LP_NORMAL)\nv2 live"]
A1 --> A2
AI["index -> lp 1"] -.-> A1
end
BEFORE --> AFTER
Figure 5 — Pruning collapses a HOT chain. The root tuple v1 is dead, so its
bytes are reclaimed and line pointer 1 is converted to LP_REDIRECT pointing at
line pointer 2 — the index entry, which still names lp 1, now reaches the live
v2 through the redirect. When the last live member of a chain dies, the root
becomes LP_DEAD instead (it cannot be freed while the index entry persists);
the next vacuum removes the index entry and then the dead slot. (Mechanism per
access/heap/README.HOT, §“Pruning”.)
This is the precise boundary with vacuum (postgres-vacuum.md): pruning
reclaims tuple bytes and shortens chains using only page-local manipulation
and never touches indexes. It can convert a root into a redirect or a dead line
pointer, but it cannot free a dead line pointer itself, because an index entry
still names it. Freeing the dead line pointer — and the index entry that pins it
— is vacuum’s job. Pruning is the line pointer’s demotion; vacuum is its burial.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. The PostgreSQL source moves; a function/struct/macro name is the stable handle. Use
git grep -n '<symbol>' src/backend/access/heap/to locate the current position. The line numbers in the position-hint table were observed at commit273fe94(REL_18_STABLE) and are quick hints only.
Tuple and page layout (src/include/)
Section titled “Tuple and page layout (src/include/)”struct HeapTupleHeaderData(inhtup_details.h) — the on-disk tuple header:t_choiceunion (t_heapwitht_xmin/t_xmax/t_cid, or thet_datumoverlay for in-memory composites),t_ctid,t_infomask2,t_infomask,t_hoff,t_bits[].HeapTupleHeaderGetXmin/…SetXmin/…GetRawXmax/…SetXmax(inhtup_details.h) — the canonical accessors;…GetXminreturnsFrozenTransactionIdfor a frozen tuple.HeapTupleHeaderGetUpdateXid(inhtup_details.h) — resolvesxmaxthrough a MultiXact whenHEAP_XMAX_IS_MULTIis set.HeapTupleHeaderIsHotUpdated/…IsHeapOnlyand theirSet/Clearforms — the HOT flag accessors overt_infomask2.HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMIN_FROZEN,HEAP_XMAX_COMMITTED,HEAP_XMAX_INVALID,HEAP_XMAX_IS_MULTI,HEAP_XMAX_LOCK_ONLY(inhtup_details.h) —t_infomaskvisibility bits.HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE(inhtup_details.h) —t_infomask2HOT flags.struct ItemIdDataandLP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD(instorage/itemid.h) — the four-state line pointer.
Table AM binding (src/include/access/tableam.h, heapam_handler.c)
Section titled “Table AM binding (src/include/access/tableam.h, heapam_handler.c)”struct TableAmRoutine(intableam.h) — the function-pointer interface;tuple_insert,tuple_update,tuple_delete,index_fetch_tuple,scan_begin, etc.heapam_methods(inheapam_handler.c) — the heap’s filled-in routine.GetHeapamTableAmRoutine/heap_tableam_handler(inheapam_handler.c) — return the routine (the latter is the SQL-visibleam_handler).heapam_tuple_insert/heapam_tuple_update/heapam_tuple_delete/heapam_index_fetch_tuple(inheapam_handler.c) — the thin wrappers that unwrap aTupleTableSlotand call theheapam.cfunctions.
Heap operations (src/backend/access/heap/heapam.c)
Section titled “Heap operations (src/backend/access/heap/heapam.c)”heap_insert— stamp header, place tuple, WAL; the no-op-on-abort prune comment lives here.heap_prepare_insert— setxmin/cmin, clearxmax, optional freeze, TOAST if oversized.heap_multi_insert— batched insert (one WAL record, one lock cycle per page).heap_delete— concurrency resolution viaHeapTupleSatisfiesUpdate, then setxmax+PageSetPrunable, leave tuple in place.heap_update— computemodified_attrs(HeapDetermineColumnsInfo), decideuse_hot_update, place new tuple, set oldxmaxandt_ctidforward link.heap_hot_search_buffer— walk a HOT chain from the root, returning the member visible to the snapshot; enforces the xmin/xmax chain-integrity check.
Visibility (src/backend/access/heap/heapam_visibility.c)
Section titled “Visibility (src/backend/access/heap/heapam_visibility.c)”HeapTupleSatisfiesVisibility— dispatcher by snapshot type.HeapTupleSatisfiesMVCC— the SI visibility decision overxmin/xmax.HeapTupleSatisfiesUpdate— the writer-side classification used byheap_delete/heap_update(TM_Ok,TM_BeingModified, …).HeapTupleSatisfiesVacuum/HeapTupleSatisfiesVacuumHorizon— the prune/vacuum classification (HEAPTUPLE_DEAD,…_RECENTLY_DEAD, …).SetHintBits— lazily caches a resolved commit outcome onto the tuple.XidInMVCCSnapshot— snapshot membership test; the snapshot machinery is inpostgres-mvcc-snapshots.md.
Page placement and pruning (hio.c, pruneheap.c)
Section titled “Page placement and pruning (hio.c, pruneheap.c)”RelationPutHeapTuple(inhio.c) —PageAddItem+ sett_selfand the self-referentialt_ctid.RelationGetBufferForTuple(inhio.c) — find/extend a page with room (consults the FSM).heap_page_prune_opt(inpruneheap.c) — opportunistic on-access prune entry; gates onpd_prune_xidand free-space heuristic, conditional cleanup lock.heap_page_prune_and_freeze(inpruneheap.c) — the workhorse, shared by on-access pruning and vacuum.heap_prune_chain(inpruneheap.c) — classify and collapse one HOT chain.heap_page_prune_execute(inpruneheap.c) — apply the redirect/dead/unused decisions to the page.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
struct HeapTupleFields | htup_details.h | 122 |
struct HeapTupleHeaderData | htup_details.h | 153 |
#define HEAP_XMIN_COMMITTED | htup_details.h | 204 |
#define HEAP_HOT_UPDATED | htup_details.h | 295 |
#define HEAP_ONLY_TUPLE | htup_details.h | 296 |
HeapTupleHeaderGetUpdateXid | htup_details.h | 402 |
HeapTupleHeaderIsHotUpdated | htup_details.h | 539 |
struct ItemIdData | storage/itemid.h | 25 |
typedef struct TableAmRoutine | access/tableam.h | 288 |
heapam_index_fetch_tuple | heapam_handler.c | 115 |
heapam_tuple_insert | heapam_handler.c | 244 |
heapam_tuple_delete | heapam_handler.c | 303 |
heapam_tuple_update | heapam_handler.c | 317 |
heapam_methods | heapam_handler.c | 2616 |
GetHeapamTableAmRoutine | heapam_handler.c | 2676 |
heap_hot_search_buffer | heapam.c | 1717 |
heap_insert | heapam.c | 2080 |
heap_prepare_insert | heapam.c | 2271 |
heap_multi_insert | heapam.c | 2351 |
heap_delete | heapam.c | 2773 |
heap_update | heapam.c | 3242 |
HeapDetermineColumnsInfo | heapam.c | 4396 |
SetHintBits | heapam_visibility.c | 114 |
HeapTupleSatisfiesUpdate | heapam_visibility.c | 458 |
HeapTupleSatisfiesMVCC | heapam_visibility.c | 960 |
HeapTupleSatisfiesVacuum | heapam_visibility.c | 1171 |
HeapTupleSatisfiesVisibility | heapam_visibility.c | 1776 |
RelationPutHeapTuple | hio.c | 35 |
RelationGetBufferForTuple | hio.c | 502 |
heap_page_prune_opt | pruneheap.c | 193 |
heap_page_prune_and_freeze | pruneheap.c | 350 |
heap_prune_chain | pruneheap.c | 999 |
heap_page_prune_execute | pruneheap.c | 1561 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry is a fact about the current source at commit
273fe94(REL_18_STABLE) — readable without any external materials. The trailing note shows how it was checked. Open questions follow as recorded gaps.
Verified facts
Section titled “Verified facts”-
The heap tuple header is 23 bytes before the null bitmap, and
xmin/xmaxare always physically stored whilecmin/cmax/xvacshare one 4-byte field. Verified by readingstruct HeapTupleHeaderDataandHeapTupleFieldsinhtup_details.h(thet_field3union and the/* ^ - 23 bytes - ^ */marker). The comment block above the struct documents the five-virtual-fields- in-three-physical-fields scheme and the combo-CID fallback (combocid.c). -
HEAP_XMIN_FROZENis the bitwise OR ofHEAP_XMIN_COMMITTEDandHEAP_XMIN_INVALID. Verified inhtup_details.h:#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID). A combination that is otherwise contradictory is repurposed as the frozen sentinel; the freeze mechanism itself is inpostgres-vacuum.md. -
A freshly inserted tuple’s
t_ctidpoints at itself; an UPDATE rewrites the old tuple’st_ctidto the new version’s slot. Verified inRelationPutHeapTuple(hio.c,item->t_ctid = tuple->t_self) and inheap_update(oldtup.t_data->t_ctid = heaptup->t_self).heap_deleteexplicitly resetst_ctidto self (tp.t_data->t_ctid = tp.t_self), so a plain delete leaves no forward link. -
HOT is chosen iff the new tuple lands on the same page AND the update overlaps no HOT-blocking indexed column. Verified in
heap_update: theif (newbuf == buffer)guard wrappingif (!bms_overlap(modified_attrs, hot_attrs)) use_hot_update = true;. Summarizing-index overlap setssummarized_updatebut does not block HOT. -
HOT equality is tested bitwise, not via datatype equality operators. Stated in
README.HOT(§“Update Chains…”, “We insist on bitwise equality rather than using datatype-specific equality routines”) and realized byHeapDetermineColumnsInfocomparing attribute images inheapam.c. -
Following a forward chain link requires that the next tuple’s
xminequal the prior tuple’sxmax, else the chain has ended. Verified in bothheap_hot_search_bufferandheap_prune_chain(theTransactionIdEquals(..., priorXmax)/prev_xmaxguards).README.HOT§“Abort Cases” gives the race this defends against: a recycled slot. -
On-access pruning needs a conditional cleanup lock and silently skips the page if it cannot get one. Verified in
heap_page_prune_opt:if (!ConditionalLockBufferForCleanup(buffer)) return;. The free-space gate isMax(fillfactor target, BLCKSZ/10)and the prune-needed hint is the page header’spd_prune_xid. -
Pruning never touches indexes; it can demote a root line pointer to
LP_REDIRECTorLP_DEADbut cannot free a dead line pointer. Verified by readingheap_prune_chain/heap_page_prune_execute(they manipulate only the page’s line pointers and tuple storage) and confirmed byREADME.HOT§“Pruning” and §“VACUUM”. Line-pointer burial is vacuum’s job (postgres-vacuum.md). -
The heap is reached only through the
heapam_methodsTableAmRoutine. Verified thatheap_tableam_handlerreturns&heapam_methodsand that the routine’stuple_insert/tuple_update/tuple_delete/index_fetch_tupleslots point at theheapam_*wrappers inheapam_handler.c. The AM contract is owned bypostgres-table-am.md.
Open questions
Section titled “Open questions”-
Worst-case HOT chain length and line-pointer bloat.
README.HOTcaps line pointers per page atMaxHeapTuplesPerPage“arbitrarily” to bound bloat, and notes the truncate-to-line-pointer design “probably needs further work” for statistics. Under a pathological same-page update workload, how long do chains actually get before a cold update is forced, and how much line-pointer bloat persists between vacuums? Investigation path: instrumentheap_prune_chainnchainand correlate withn_tup_hot_updpgstats. -
pd_prune_xidprecision vs. wasted prune attempts.heap_page_prune_optreads page free space without a lock as a heuristic (“in the worst case we could get a bogus answer”). How often does this produce a cleanup-lock acquisition that then finds nothing removable? Investigation path: countConditionalLockBufferForCleanupsuccesses that yield zerondeletedinheap_page_prune_and_freeze. -
MultiXact interaction on the delete/lock path.
heap_deletebranches heavily onHEAP_XMAX_IS_MULTI(waiting onMultiXactIdWait), andxmaxmay name a MultiXact rather than a plain XID. The full MultiXact lifecycle (allocation, member storage, theHEAP_XMAX_LOCK_ONLYinterplay) is only sketched here. Investigation path: cross-reference the plannedpostgres-xid-wraparound-freeze.md/ multixact doc andREADME.tuplock.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.
-
In-place + undo (Oracle, MySQL/InnoDB). These engines overwrite the live row and keep the prior image in an undo segment, with a rollback pointer on the row. The trade is symmetric to PostgreSQL’s: a compact heap and no vacuum-of-the-table, paid for with undo-segment management and read amplification when reconstructing old versions. A measured side-by-side of PostgreSQL HOT/prune cost vs. InnoDB undo cost would quantify the no-overwrite tax. (Stonebraker 1987 is the no-overwrite thesis; Database Internals ch. 5 covers both families.)
-
CUBRID’s out-of-place MVCC. CUBRID stamps
mvcc_ins_id/mvcc_del_idon the record header like PostgreSQL stampsxmin/xmax, but chains old versions backward into the log viaprev_version_lsarather than forward on the heap page. The contrast with PostgreSQL’s same-page forward HOT chain is instructive: CUBRID keeps the heap compact at the cost of a log indirection to read an old version, while PostgreSQL keeps old versions on-page at the cost of bloat and vacuum. Seeknowledge/code-analysis/cubrid/cubrid-mvcc.md. -
The undo-log for PostgreSQL: zheap. The (incomplete) zheap project aimed to give PostgreSQL an in-place storage AM with undo, precisely to escape bloat and vacuum. It is the clearest demonstration that the pluggable
TableAmRoutinewas designed to host a fundamentally different storage model. A doc tracing what zheap would change (and why it stalled) belongs next topostgres-table-am.md. -
In-memory MVCC version layout (HyPer, Hekaton, Cicada). Cache-aware in-memory engines redesign the version chain (newest-to-oldest deltas, per-tuple version vectors) and often drop the central registry. Wu et al., An Empirical Evaluation of In-Memory MVCC (VLDB 2017) surveys the space and is the right anchor for distinguishing costs intrinsic to MVCC from those intrinsic to a disk-resident, no-overwrite MVCC like PostgreSQL’s.
-
Time travel, the original goal. Stonebraker’s 1987 storage paper kept all versions specifically to support
AS OFhistorical queries and to eliminate the WAL. PostgreSQL dropped both but kept the no-overwrite heap. The modern re-emergence of temporal/time-travel features (SQL:2011 system-versioned tables) makes a “what PostgreSQL gave up and what it would cost to get back” a worthwhile historical-arc note.
Sources
Section titled “Sources”In-tree README
Section titled “In-tree README”src/backend/access/heap/README.HOT— the authoritative HOT specification: single-index-entry update chains, redirect/dead line pointers, pruning vs. defragmentation, abort-case chain integrity, CREATE INDEX [CONCURRENTLY] interaction, and the glossary used throughout this doc.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz et al., 7e), ch. 13 “Data Storage Structures” — file organization, slotted pages, record layout.
- Database Internals (Petrov), ch. 3 “File Formats” (slotted pages) and ch. 5 “Transaction Processing and Recovery”, §“Multiversion Concurrency Control” / §“Isolation Levels”.
Papers (under knowledge/research/dbms-papers/ and the bibliography)
Section titled “Papers (under knowledge/research/dbms-papers/ and the bibliography)”- Stonebraker, The Design of the POSTGRES Storage System (VLDB 1987) — the
no-overwrite storage thesis that the PostgreSQL heap descends from. See
.omc/plans/postgres-paper-bibliography.md§2 (acquisition queue).
PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”src/backend/access/heap/heapam.csrc/backend/access/heap/heapam_handler.csrc/backend/access/heap/heapam_visibility.csrc/backend/access/heap/hio.csrc/backend/access/heap/pruneheap.csrc/include/access/htup_details.hsrc/include/access/tableam.hsrc/include/storage/itemid.h
Cross-references (sibling docs)
Section titled “Cross-references (sibling docs)”postgres-table-am.md— theTableAmRoutinecontract the heap implements.postgres-mvcc-snapshots.md— snapshot construction andXidInMVCCSnapshot.postgres-visibility-map.md— the all-visible bit cleared on insert/delete.postgres-vacuum.md— coordinated reclamation, freezing, dead-line-pointer burial.postgres-page-layout.md— full heap page geometry (PageHeaderData,PageAddItem).