PostgreSQL MultiXact — Multiple Lockers and Updaters on One Tuple
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Row-level locking is one of the oldest features of a relational engine, and the textbook treatment makes it look simple: a transaction that wants to read a row “for update” takes a lock on it, a transaction that wants to modify a row takes a stronger lock, and a lock manager arbitrates conflicts using a compatibility matrix. Database System Concepts (Silberschatz 7e, ch. 18 “Concurrency Control”) frames the two canonical lock modes, shared (S) and exclusive (X), and the rule that any number of transactions may hold S simultaneously while X is incompatible with everything. Two-phase locking (2PL) builds serializability on top of that matrix: a transaction acquires locks in a growing phase and releases them in a shrinking phase. The lock manager itself is “just” a hash table keyed by lockable object, with each entry holding a queue of granted and waiting requests.
The hard part — the part the textbook chapter waves at — is where the lock
state lives. A pure lock manager keeps every lock in a shared-memory hash
table, which is fine when the number of held locks is bounded by the number
of active transactions. But MVCC engines want a different trick: instead of
keeping a row lock in volatile memory that vanishes on crash, they want the
lock to be durable and self-describing, recorded on the row itself, so
that a reader arriving later can tell — without consulting any lock table —
whether the row is locked, by whom, and how strongly. PostgreSQL does exactly
this for the strongest lock (an update/delete writes the updater’s XID into
the tuple’s xmax), and the whole point of the MultiXact subsystem is to
extend that durable, on-row representation to the case where more than one
transaction has a stake in the row at the same time.
Consider the conflict matrix for tuple locks. PostgreSQL has four
LockTupleMode strengths, in increasing order: KeyShare, Share,
NoKeyExclusive, Exclusive. The two share modes are compatible with each
other and with themselves — SELECT ... FOR KEY SHARE is what a foreign-key
check takes on the referenced row, and many concurrent child-row inserts can
hold key-share on the same parent simultaneously. So a single parent row can
legitimately be “locked” by ten transactions at once. A bare xmax field
holds exactly one XID. The representational problem is therefore concrete and
unavoidable: how do you record N concurrent lockers in a one-XID slot?
The answer the literature suggests is indirection: store an identifier in the row, and let that identifier name a side structure that holds the real list. This is the same move a filesystem makes when a directory entry points at an inode rather than inlining the file. The cost is an extra lookup and a garbage-collection problem (when can the side structure be reclaimed?), and the benefit is that the fixed-width slot can represent an unbounded set. The MultiXact subsystem is PostgreSQL’s instantiation of that idea, specialized to the lifetime and crash-durability constraints of a transactional store: a MultiXactId is the identifier, the members SLRU is the side structure, and VACUUM-driven freezing is the garbage collector.
There is a second theoretical wrinkle that makes MultiXact more than a simple
locker list. Locking and updating are not mutually exclusive over time. A
transaction can take FOR KEY SHARE on a row, and then another transaction
can UPDATE it (a no-key update does not conflict with key-share). Now the
row is simultaneously locked by one transaction and updated by another —
and the updater’s XID must end up in xmax (that is how the next reader
follows the update chain), yet the locker’s XID must also be preserved (the
update must not silently drop the foreign-key lock). One xmax slot, two
semantically distinct claims. So the side structure must record, per member,
not just who but what kind of claim: a lock (and how strong) versus an
update (key-touching or not). That per-member status is what turns a flat
locker list into the six-valued MultiXactStatus enum at the heart of this
module.
Common DBMS Design
Engines diverge sharply on how, or whether, they make row locks durable.
Lock-table-only designs (DB2, SQL Server, classic 2PL). The lock manager is a shared-memory hash table; a row lock is a transient entry that lives only as long as the lock is held and disappears on transaction end or crash. There is no on-row footprint at all, so there is no representational limit on concurrent lockers (the table simply holds more entries) and no garbage-collection problem (entries are freed at unlock). The cost is memory pressure and lock escalation: when too many row locks accumulate, the engine escalates to a coarser page- or table-level lock, trading concurrency for bounded memory. There is nothing like a MultiXactId because the lock state never needs to fit in the row.
On-row single-locker designs. Several MVCC engines record the most recent
writer’s transaction id in a per-row slot (Oracle’s interested-transaction
list, InnoDB’s DB_TRX_ID) and resolve visibility/locking by consulting that
id plus undo information. Oracle’s ITL (Interested Transaction List) is
the closest cousin to MultiXact: each data block header carries a small array
of ITL slots, and a transaction that locks or modifies a row claims a slot
and points the row at it. Multiple concurrent lockers are accommodated by
multiple ITL slots in the block, and the array can grow (INITRANS /
MAXTRANS) at the cost of block space. The crucial difference: Oracle’s
locker list lives in the data block (one list per block, shared by all
its rows), whereas PostgreSQL’s lives in a global side SLRU (one list per
MultiXactId, referenced by any row in any table). Oracle’s design localizes
the list with the data (good cache behavior, bounded by block size); the
PostgreSQL design decouples list size from page space (a multi can have many
members without consuming heap-page room) at the cost of an SLRU lookup and a
separate wraparound domain.
The PostgreSQL position. PostgreSQL is unusual in that its row lock is
both on-row and indirected. The common case — a single updater, or a
single exclusive locker — needs no MultiXact at all: the tuple’s xmax holds
the bare XID and the HEAP_XMAX_IS_MULTI bit is clear. Only when the row
genuinely accrues multiple stakeholders, or a locker must coexist with an
updater, does the heap allocate a MultiXactId, write that into xmax, and
set the multi bit. This keeps the overwhelmingly common single-locker path
free of SLRU traffic, and confines the expensive machinery (SLRU pages,
member arrays, a second wraparound counter) to the genuinely-concurrent
minority of tuples.
```mermaid
flowchart TD
    A["heap_lock_tuple / heap_update<br/>wants to claim a tuple"] --> B{"xmax already<br/>set and valid?"}
    B -- "no" --> C["write bare XID into xmax<br/>HEAP_XMAX_IS_MULTI clear"]
    B -- "yes, bare XID" --> D{"new claim compatible<br/>with existing one?"}
    D -- "incompatible" --> E["wait on existing XID, then retry"]
    D -- "compatible" --> F["MultiXactIdCreate(oldXID, newXID)<br/>allocate a 2-member multi"]
    B -- "yes, already a multi" --> G["MultiXactIdExpand(multi, newXID)<br/>drop dead members, add new one"]
    F --> H["write MXID into xmax<br/>set HEAP_XMAX_IS_MULTI"]
    G --> H
    H --> I["members live in pg_multixact SLRUs<br/>until VACUUM freezes the tuple"]
```
The diagram above is the policy layer, owned by heapam.c. Everything its
boxes delegate (allocating the MXID, packing the member array into SLRU
pages, reading it back, and reclaiming it) is the mechanism layer owned by
multixact.c, and that mechanism is the subject of this document.
PostgreSQL’s Approach
What a MultiXactId is, and what it is not
The file header of multixact.c is unusually candid about the history:
```c
// multixact.c: file header comment
/*
 * The pg_multixact manager is a pg_xact-like manager that stores an array of
 * MultiXactMember for each MultiXactId.  It is a fundamental part of the
 * shared-row-lock implementation.  Each MultiXactMember is comprised of a
 * TransactionId and a set of flag bits.  The name is a bit historical:
 * originally, a MultiXactId consisted of more than one TransactionId (except
 * in rare corner cases), hence "multi".  Nowadays, however, it's perfectly
 * legitimate to have MultiXactIds that only include a single Xid.
 */
```
So a MultiXactId is not “a set of transactions that share a lock” in the
naive sense. It is a name for a stored, immutable array of
MultiXactMember records. A single-member multi is legal and common (it
arises whenever a tuple needs the richer per-member status that a bare xmax
cannot express — e.g. a key-share locker on an already-updated row). The flag
bits — the member status — are opaque to multixact.c; the module just
stores and retrieves the arrays, and heapam.c interprets the statuses as
lock modes.
The MXID is a 32-bit value living in the same numeric neighborhood as an XID but in a separate address space with its own counter and its own wraparound. The header reserves the low values:
```c
// multixact.h: reserved MXID values
#define InvalidMultiXactId  ((MultiXactId) 0)
#define FirstMultiXactId    ((MultiXactId) 1)
#define MaxMultiXactId      ((MultiXactId) 0xFFFFFFFF)

#define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
#define MaxMultiXactOffset  ((MultiXactOffset) 0xFFFFFFFF)
```
The six member statuses
The member status encodes both lock strength and whether this member updated the tuple, in a single enum whose numeric ordering matters:
```c
// multixact.h: MultiXactStatus
typedef enum
{
    MultiXactStatusForKeyShare = 0x00,
    MultiXactStatusForShare = 0x01,
    MultiXactStatusForNoKeyUpdate = 0x02,
    MultiXactStatusForUpdate = 0x03,
    /* an update that doesn't touch "key" columns */
    MultiXactStatusNoKeyUpdate = 0x04,
    /* other updates, and delete */
    MultiXactStatusUpdate = 0x05,
} MultiXactStatus;

#define MaxMultiXactStatus MultiXactStatusUpdate

/* does a status value correspond to a tuple update? */
#define ISUPDATE_from_mxstatus(status) \
    ((status) > MultiXactStatusForUpdate)
```
The first four are lock-only statuses (SELECT ... FOR KEY SHARE / SHARE / NO KEY UPDATE / UPDATE); the last two mark a member that actually performed
an update or delete. The ISUPDATE_from_mxstatus macro relies on the ordering
— anything strictly greater than ForUpdate (0x03) is an updater. This single
predicate is load-bearing throughout the module: it is how MultiXactIdExpand
decides which dead members to keep, and how FreezeMultiXactId distinguishes
a locker (droppable once not running) from a committed updater (must be
preserved as the new xmax). The crucial invariant, checked at create time,
is that a multi has at most one updating member — many lockers may share
a row, but only one transaction can have updated it.
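The ordering predicate and the single-updater invariant can be sketched in a few lines. This is a minimal standalone illustration, not the PostgreSQL code: every `Sketch*` / `sketch_*` name is hypothetical, and only the numeric status values are taken from the enum quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the six status values quoted above. */
typedef enum
{
    SketchForKeyShare = 0x00,
    SketchForShare = 0x01,
    SketchForNoKeyUpdate = 0x02,
    SketchForUpdate = 0x03,
    SketchNoKeyUpdate = 0x04,
    SketchUpdate = 0x05
} SketchStatus;

/* Hypothetical stand-in for a (xid, status) member record. */
typedef struct
{
    uint32_t     xid;
    SketchStatus status;
} SketchMember;

/* Same ordering trick as ISUPDATE_from_mxstatus: updaters sort above
 * every lock-only status. */
bool sketch_is_update(SketchStatus status)
{
    return status > SketchForUpdate;
}

/* Returns true when the array contains at most one updating member. */
bool sketch_has_single_updater(const SketchMember *members, size_t nmembers)
{
    bool has_update = false;

    for (size_t i = 0; i < nmembers; i++)
    {
        if (!sketch_is_update(members[i].status))
            continue;
        if (has_update)
            return false;       /* a second updater breaks the invariant */
        has_update = true;
    }
    return true;
}
```

A key-share locker plus one no-key updater passes the check; two updating members in the same array fail it.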
heapam.c maps a SQL-level LockTupleMode to a MultiXactStatus through a
static table, which is the bridge between the lock manager’s vocabulary and
the multixact member vocabulary:
```c
// heapam.c: tupleLockExtraInfo (lock mode -> hwlock + member statuses)
static const struct
{
    LOCKMODE    hwlock;         /* heavyweight lock for the in-memory tuple lock */
    int         lockstatus;     /* member status when the claim is a lock */
    int         updstatus;      /* member status when the claim is an update */
}
tupleLockExtraInfo[MaxLockTupleMode + 1] =
{
    { AccessShareLock, MultiXactStatusForKeyShare, -1 },                        /* KeyShare */
    { RowShareLock, MultiXactStatusForShare, -1 },                              /* Share */
    { ExclusiveLock, MultiXactStatusForNoKeyUpdate, MultiXactStatusNoKeyUpdate }, /* NoKeyExclusive */
    { AccessExclusiveLock, MultiXactStatusForUpdate, MultiXactStatusUpdate },   /* Exclusive */
};
```
Each row gives the heavyweight LOCKMODE used for the in-memory tuple lock
(see postgres-lock-manager.md), the member status to record when the claim is
a lock, and the status to record when it is an update. The -1 entries
encode that key-share and share locks can never themselves be the updating
member.
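A lookup over such a table might look like the sketch below. It is an illustration only: the `SketchLockMode` enum, the table, and `sketch_status_for` are hypothetical names, the heavyweight-lock column is left out, and the caller is assumed to treat -1 as "this mode cannot be an update".

```c
#include <stdbool.h>

/* Hypothetical stand-in for the four tuple lock strengths, weakest first. */
typedef enum
{
    SketchLockKeyShare,
    SketchLockShare,
    SketchLockNoKeyExclusive,
    SketchLockExclusive
} SketchLockMode;

typedef struct
{
    int lockstatus;     /* member status when the claim is a lock */
    int updstatus;      /* member status when the claim is an update, or -1 */
} SketchLockInfo;

/* Status values follow the MultiXactStatus numbering quoted earlier. */
static const SketchLockInfo sketch_lock_info[] = {
    {0x00, -1},         /* KeyShare: never an updater */
    {0x01, -1},         /* Share: never an updater */
    {0x02, 0x04},       /* NoKeyExclusive: ForNoKeyUpdate / NoKeyUpdate */
    {0x03, 0x05}        /* Exclusive: ForUpdate / Update */
};

/* Returns the member status for a claim, or -1 if the combination is invalid. */
int sketch_status_for(SketchLockMode mode, bool is_update)
{
    return is_update ? sketch_lock_info[mode].updstatus
                     : sketch_lock_info[mode].lockstatus;
}
```

Asking for an update status under `SketchLockKeyShare` yields -1, which a caller would turn into an error rather than store in a member array.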
Two SLRUs: offsets and members
The representational core is two parallel SLRU areas. The “offsets” SLRU is a
flat array indexed by MXID, where each four-byte slot holds the starting
offset of that multi’s member array in the “members” SLRU. The “members”
SLRU holds the variable-length (xid, status) arrays themselves. Both areas
are ordinary SLRUs, the same paged mechanism clog and commit_ts use (see
postgres-slru.md and postgres-clog-commit-ts.md); the two-area split is what
multixact adds on top of it, and it is explained in the
header:
```c
// multixact.c: file header comment
/*
 * We use two SLRU areas, one for storing the offsets at which the data
 * starts for each MultiXactId in the other one.  This trick allows us to
 * store variable length arrays of TransactionIds.
 */
```
The offsets layout is a trivial division because each entry is fixed-width:
```c
// MultiXactIdToOffsetPage / Entry: multixact.c
#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

static inline int64
MultiXactIdToOffsetPage(MultiXactId multi)
{
    return multi / MULTIXACT_OFFSETS_PER_PAGE;
}

static inline int
MultiXactIdToOffsetEntry(MultiXactId multi)
{
    return multi % MULTIXACT_OFFSETS_PER_PAGE;
}
```
The members layout is subtler because each member needs a TransactionId
(4 bytes) plus a flag byte for its status. To avoid alignment waste, members
are packed into groups of four: four flag bytes followed by four xids — a
20-byte, 5-word group:
```c
// multixact.c: members layout comment + group constants
/*
 * we store four bytes of flags, and then the
 * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and
 * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups
 * per page.
 */

#define MULTIXACT_FLAGBYTES_PER_GROUP 4
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP \
    (MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
#define MULTIXACT_MEMBERGROUP_SIZE \
    (sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
    (MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
```
Resolving a member offset to a physical byte position therefore involves
locating the group on its page, skipping the four flag bytes, and indexing the
xid within the group — MXOffsetToMemberPage, MXOffsetToFlagsOffset, and
MXOffsetToMemberOffset do this arithmetic. The “offset” here is a member
index in a global 32-bit address space (the MultiXactOffset type), distinct
from the MXID space — which is why members have their own wraparound
concern, discussed below.
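The arithmetic can be worked through for a concrete offset. The sketch below assumes 8 kB pages and 4-byte xids, and uses hypothetical `SKETCH_*` / `sketch_*` names rather than the real macros; it only reproduces the group layout described above (four flag bytes, then four xids).

```c
#include <stdint.h>

/* Assumed page size: 8 kB, as in the quoted source comment. */
#define SKETCH_BLCKSZ 8192u
#define SKETCH_MEMBERS_PER_GROUP 4u
#define SKETCH_FLAGBYTES_PER_GROUP 4u
/* 4 xids of 4 bytes each plus 4 flag bytes: 20 bytes per group. */
#define SKETCH_GROUP_SIZE (4u * SKETCH_MEMBERS_PER_GROUP + SKETCH_FLAGBYTES_PER_GROUP)
/* 8192 / 20 = 409 whole groups per page. */
#define SKETCH_GROUPS_PER_PAGE (SKETCH_BLCKSZ / SKETCH_GROUP_SIZE)
/* 409 * 4 = 1636 members per page. */
#define SKETCH_MEMBERS_PER_PAGE (SKETCH_GROUPS_PER_PAGE * SKETCH_MEMBERS_PER_GROUP)

/* Page of the members area that holds this member offset. */
uint32_t sketch_member_page(uint32_t offset)
{
    return offset / SKETCH_MEMBERS_PER_PAGE;
}

/* Byte position, within the page, of the group's four flag bytes. */
uint32_t sketch_flags_offset(uint32_t offset)
{
    return ((offset / SKETCH_MEMBERS_PER_GROUP) % SKETCH_GROUPS_PER_PAGE)
           * SKETCH_GROUP_SIZE;
}

/* Byte position, within the page, of this member's 4-byte xid. */
uint32_t sketch_xid_offset(uint32_t offset)
{
    return sketch_flags_offset(offset) + SKETCH_FLAGBYTES_PER_GROUP
           + (offset % SKETCH_MEMBERS_PER_GROUP) * 4u;
}
```

For member offset 9020 this gives page 5, flag bytes at byte 4200 of that page, and the xid at byte 4204; offset 9021 shares the group and has its xid at byte 4208.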
```mermaid
flowchart LR
    X["tuple.xmax = MXID 4711<br/>HEAP_XMAX_IS_MULTI set"] --> O["offsets SLRU<br/>slot[4711] = member offset 9020"]
    O --> M["members SLRU<br/>offset 9020: (xid=812, ForKeyShare)<br/>offset 9021: (xid=915, NoKeyUpdate)"]
    O2["offsets SLRU<br/>slot[4712] = member offset 9022"] -.->|"length = 9022 - 9020 = 2"| M
    X --> O2
```
The length of multi 4711’s member array is not stored explicitly; it is
computed as the difference between slot[4711] and slot[4712] in the offsets
SLRU. This is why RecordNewMultiXact must set both the current multi’s
offset and eagerly initialize the next slot, and why GetMultiXactIdMembers
has careful corner-case handling for “the latest multi has no successor yet.”
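The length rule can be illustrated with a plain array standing in for the offsets SLRU. This is a simplified sketch under stated assumptions: `sketch_multi_length` and its parameters are hypothetical, the array is indexed directly by MXID, and neither MXID wraparound nor the reserved offset 0 is handled.

```c
#include <stdint.h>

/*
 * offsets[m] holds the starting member offset of multi m.
 * next_multi and next_offset stand in for the allocator's counters:
 * the next MXID to hand out and the next free member offset.
 */
uint32_t sketch_multi_length(const uint32_t *offsets, uint32_t multi,
                             uint32_t next_multi, uint32_t next_offset)
{
    uint32_t start = offsets[multi];
    uint32_t end;

    if (multi + 1 == next_multi)
        end = next_offset;          /* latest multi: no successor slot yet */
    else
        end = offsets[multi + 1];   /* successor's start marks our end */

    return end - start;             /* unsigned subtraction */
}
```

With slots `{unused, 9020, 9022}`, the allocator at MXID 3 and next free offset 9025, multi 1 has two members and multi 2 (the latest) has three.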
Per-backend horizons keep the SLRU from being truncated underfoot
Because the members/offsets SLRUs are continuously truncated by VACUUM, a
backend that is about to read an old multi must publish a horizon first. Each
backend has two shared-memory slots: OldestMemberMXactId[k] (the oldest
multi this backend’s transaction could become a member of) and
OldestVisibleMXactId[k] (the oldest multi it might inspect). Setting the
former before taking any shared lock, and the latter before reading any
member array, is what guarantees the data is not truncated away mid-read — the
global minimum of these slots is the OldestMulti cutoff VACUUM respects.
This is the multixact analogue of the procarray xmin horizon (see
postgres-procarray.md).
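Computing a global minimum over wrapping 32-bit identifiers needs a modular comparison rather than a plain `<`. The sketch below assumes the usual signed-difference test for "precedes" and treats 0 as an unset slot; the function names and the flat `slots` array are hypothetical simplifications of the per-backend arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Modular "a is older than b": true when a is less than half the
 * 32-bit space behind b. */
bool sketch_mxid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Returns the oldest published horizon among the slots, or next_mxid
 * when no backend has published one. A slot value of 0 means "unset".
 */
uint32_t sketch_oldest_horizon(const uint32_t *slots, size_t nslots,
                               uint32_t next_mxid)
{
    uint32_t oldest = next_mxid;

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i] != 0 && sketch_mxid_precedes(slots[i], oldest))
            oldest = slots[i];
    }
    return oldest;
}
```

The modular test matters near the wrap point: with the counter at 10, a slot holding 0xFFFFFFF0 is older than a slot holding 5, even though it is numerically larger.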
Source Walkthrough
Creating a multi: MultiXactIdCreate and MultiXactIdExpand
The two entry points the heap calls correspond to the two situations in the
policy diagram. MultiXactIdCreate turns two single XIDs into a two-member
multi — used when a bare-xmax tuple gains a second, compatible stakeholder:
```c
// MultiXactIdCreate: multixact.c
MultiXactId
MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
                  TransactionId xid2, MultiXactStatus status2)
{
    MultiXactId newMulti;
    MultiXactMember members[2];

    Assert(!TransactionIdEquals(xid1, xid2) || (status1 != status2));
    /* MultiXactIdSetOldestMember() must have been called already. */
    Assert(MultiXactIdIsValid(*MyOldestMemberMXactIdSlot()));

    members[0].xid = xid1;
    members[0].status = status1;
    members[1].xid = xid2;
    members[1].status = status2;

    newMulti = MultiXactIdCreateFromMembers(2, members);
    return newMulti;
}
```
The first Assert enforces the rule that the two claims must differ: the same
XID with the same status would be a redundant member. The second Assert is
the truncation-safety contract: MultiXactIdSetOldestMember() must have
published this backend’s horizon before its XID can be folded into a multi,
so VACUUM cannot truncate a range this backend is about to join.
MultiXactIdExpand handles the “already a multi” branch. Critically, it does
not mutate the existing multi — it reads the old member array, filters it,
appends the new member, and allocates a fresh MXID:
```c
// MultiXactIdExpand: multixact.c (condensed)
MultiXactId
MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
{
    nmembers = GetMultiXactIdMembers(multi, &members, false, false);
    if (nmembers < 0)
    {
        /* all members gone; just make a singleton */
        member.xid = xid;
        member.status = status;
        return MultiXactIdCreateFromMembers(1, &member);
    }

    /* already a member with the same status? return multi unchanged */
    for (i = 0; i < nmembers; i++)
        if (TransactionIdEquals(members[i].xid, xid) &&
            members[i].status == status)
            return multi;

    /* keep running members, and committed updaters; drop the rest */
    for (i = 0, j = 0; i < nmembers; i++)
    {
        if (TransactionIdIsInProgress(members[i].xid) ||
            (ISUPDATE_from_mxstatus(members[i].status) &&
             TransactionIdDidCommit(members[i].xid)))
        {
            newMembers[j].xid = members[i].xid;
            newMembers[j++].status = members[i].status;
        }
    }
    newMembers[j].xid = xid;
    newMembers[j++].status = status;
    return MultiXactIdCreateFromMembers(j, newMembers);
}
```
The comment in the source spells out why immutability is mandatory: mutating
a multi in place would race against a transaction that is waiting on that
multi to finish. By always minting a new MXID, a waiter that captured the old
MXID sees a stable, never-growing member set — “a false result [of
MultiXactIdIsRunning] is certain not to change, because it is not legal to
add members to an existing MultiXactId.” The filtering loop is also not just
an optimization: dropping dead lockers is a correctness requirement for
freezing, because it bounds how long a member XID can keep an MXID alive.
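The filtering rule can be isolated from the transaction machinery. In the sketch below each member carries its transaction state as a plain field, which is an assumption made for illustration (the real code asks the transaction subsystem); all `Sketch*` / `sketch_*` names are hypothetical, and statuses above 0x03 are taken to be updaters, as in the enum quoted earlier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    SKETCH_IN_PROGRESS,
    SKETCH_COMMITTED,
    SKETCH_ABORTED
} SketchXactState;

typedef struct
{
    uint32_t        xid;
    int             status;     /* 0x00..0x05, updaters are > 0x03 */
    SketchXactState state;      /* stand-in for a transaction-status lookup */
} SketchMember;

/*
 * Copies the survivors of old[] into out[] and appends the new member.
 * out[] must have room for n + 1 entries. Returns the new member count.
 * The old array is never modified, matching the immutability rule.
 */
size_t sketch_expand(const SketchMember *old, size_t n,
                     SketchMember add, SketchMember *out)
{
    size_t j = 0;

    for (size_t i = 0; i < n; i++)
    {
        bool running = old[i].state == SKETCH_IN_PROGRESS;
        bool committed_updater = old[i].status > 0x03 &&
                                 old[i].state == SKETCH_COMMITTED;

        if (running || committed_updater)
            out[j++] = old[i];  /* finished lockers and aborted updaters drop out */
    }
    out[j++] = add;
    return j;
}
```

Given a finished locker, a running locker, a committed updater, and an aborted updater, expanding with one new locker yields three members: the running locker, the committed updater, and the newcomer.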
Assigning the MXID: MultiXactIdCreateFromMembers and GetNewMultiXactId
MultiXactIdCreateFromMembers first checks the backend-local cache (most
multis the backend reads are ones it created), enforces the single-updater
invariant, then allocates and durably records the multi:
```c
// MultiXactIdCreateFromMembers: multixact.c (condensed)
multi = mXactCacheGetBySet(nmembers, members);
if (MultiXactIdIsValid(multi))
    return multi;               /* re-use cached identical multi */

/* Verify that there is a single update Xid among the given members. */
for (i = 0; i < nmembers; i++)
    if (ISUPDATE_from_mxstatus(members[i].status))
    {
        if (has_update)
            elog(ERROR, "new multixact has more than one updating member: %s", ...);
        has_update = true;
    }

multi = GetNewMultiXactId(nmembers, &offset);   /* enters crit section */

/* WAL the create, then write the SLRU entries */
xlrec.mid = multi;
xlrec.moff = offset;
xlrec.nmembers = nmembers;
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfMultiXactCreate);
XLogRegisterData(members, nmembers * sizeof(MultiXactMember));
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
RecordNewMultiXact(multi, offset, nmembers, members);
END_CRIT_SECTION();
mXactCachePut(multi, nmembers, members);
```
GetNewMultiXactId is where both the MXID counter and the member-offset
counter advance under MultiXactGenLock, and where both wraparound guards
fire. Note the careful ordering — file extension (which can fail) happens
before the critical section that bumps the counters, exactly as
GetNewTransactionId does:
```c
// GetNewMultiXactId: multixact.c (condensed)
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
if (MultiXactState->nextMXact < FirstMultiXactId)
    MultiXactState->nextMXact = FirstMultiXactId;
result = MultiXactState->nextMXact;

/* MXID wraparound guard: vac/warn/stop limits, like GetNewTransactionId */
if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit))
{
    ... if past multiStopLimit: ereport(ERROR, "... to avoid wraparound data loss");
    ... if past multiWarnLimit: ereport(WARNING, "... must be vacuumed before ...");
    ... SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
}

/* Reserve offsets-file room for the *next* MXID's start offset */
ExtendMultiXactOffset(result + 1);

nextOffset = MultiXactState->nextOffset;
if (nextOffset == 0)
{
    *offset = 1;
    nmembers++;                 /* never hand out offset 0 */
}
else
    *offset = nextOffset;

/* MEMBERS-space wraparound guard (a *separate* 32-bit domain) */
if (MultiXactState->oldestOffsetKnown &&
    MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset, nmembers))
    ereport(ERROR, (errmsg("multixact \"members\" limit exceeded"), ...));

ExtendMultiXactMember(nextOffset, nmembers);

START_CRIT_SECTION();
(MultiXactState->nextMXact)++;
MultiXactState->nextOffset += nmembers;
LWLockRelease(MultiXactGenLock);
```
The two wraparound checks here are the crux of the two address spaces. The
first (multiVacLimit / multiStopLimit) protects the MXID counter; the
second (offsetStopLimit via MultiXactOffsetWouldWrap) protects the
members offset counter. A workload with many large multis can exhaust the
member space long before the MXID space. Separately, offset zero is reserved
to mean “unset” in the offsets SLRU, which is why nextOffset == 0 triggers
the *offset = 1 skip.
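The way the two counters advance together, including the skip over offset 0, can be shown with a toy allocator. This sketch is deliberately stripped down: `SketchGen` and `sketch_alloc` are hypothetical, and locking, SLRU extension, WAL, and both wraparound guards are omitted.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two shared counters. */
typedef struct
{
    uint32_t next_mxid;     /* next MXID to hand out */
    uint32_t next_offset;   /* next free member offset */
} SketchGen;

/*
 * Hands out one MXID and a run of nmembers member offsets.
 * Returns the MXID and stores the starting offset in *offset.
 */
uint32_t sketch_alloc(SketchGen *gen, uint32_t nmembers, uint32_t *offset)
{
    uint32_t result;

    if (gen->next_mxid < 1)
        gen->next_mxid = 1;         /* MXID 0 is the invalid value */
    result = gen->next_mxid;

    if (gen->next_offset == 0)
    {
        *offset = 1;                /* never hand out offset 0 */
        nmembers++;                 /* account for the skipped slot */
    }
    else
        *offset = gen->next_offset;

    gen->next_mxid++;
    gen->next_offset += nmembers;   /* wraps modulo 2^32 */
    return result;
}
```

Starting from zeroed counters, a two-member request gets MXID 1 at offset 1 and leaves the offset counter at 3. When the offset counter later wraps to 0, the next request again starts at offset 1.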
Reading a multi back: GetMultiXactIdMembers
Reading is the inverse: cache lookup, publish the visible horizon, validate the MXID against the known range, then read the start offset and the next multi’s start offset to compute the array length:
```c
// GetMultiXactIdMembers: multixact.c (condensed)
length = mXactCacheGetById(multi, members);
if (length >= 0)
    return length;              /* cache hit */

MultiXactIdSetOldestVisible();  /* pin the truncation horizon */

/* lock-only multis older than our visible horizon cannot be running */
if (isLockOnly &&
    MultiXactIdPrecedes(multi, *MyOldestVisibleMXactIdSlot()))
{
    *members = NULL;
    return -1;
}

LWLockAcquire(MultiXactGenLock, LW_SHARED);
oldestMXact = MultiXactState->oldestMultiXactId;
nextMXact = MultiXactState->nextMXact;
nextOffset = MultiXactState->nextOffset;
LWLockRelease(MultiXactGenLock);

if (MultiXactIdPrecedes(multi, oldestMXact))
    ereport(ERROR, "MultiXactId %u does no longer exist -- apparent wraparound");
if (!MultiXactIdPrecedes(multi, nextMXact))
    ereport(ERROR, "MultiXactId %u has not been created yet -- apparent wraparound");

/* read offsets[multi] and offsets[multi+1]; length is the difference */
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offset = ((MultiXactOffset *) ...page_buffer[slotno])[entryno];
...
if (nextMXact == multi + 1)
    length = nextOffset - offset;   /* corner case 1 */
else
    length = nextMXOffset - offset;
```
The two ereport(ERROR, ...) calls are the runtime detectors of MXID
wraparound: a multi older than oldestMultiXactId “does no longer exist,” and
one not older than nextMXact “has not been created yet.” Both indicate the
counter has lapped the live range — the exact disaster the wraparound guards
in GetNewMultiXactId exist to prevent. The member loop that follows reads
each (xid, status) group, skipping any zero XID (the reserved offset-0
ambiguity), and returns the filtered array.
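The two range checks amount to asking whether an MXID lies in the half-open live window between the oldest and next values, using modular comparison. A minimal sketch, assuming the usual signed-difference test for "precedes"; the enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    SKETCH_MULTI_OK,
    SKETCH_MULTI_TOO_OLD,       /* before the oldest live multi */
    SKETCH_MULTI_NOT_YET        /* at or past the next multi to be assigned */
} SketchRange;

/* Modular "a is older than b" over a wrapping 32-bit counter. */
bool sketch_mxid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Classifies multi against the live window [oldest, next). */
SketchRange sketch_check_range(uint32_t multi, uint32_t oldest, uint32_t next)
{
    if (sketch_mxid_precedes(multi, oldest))
        return SKETCH_MULTI_TOO_OLD;
    if (!sketch_mxid_precedes(multi, next))
        return SKETCH_MULTI_NOT_YET;
    return SKETCH_MULTI_OK;
}
```

The window may straddle the wrap point: with oldest at 0xFFFFFF00 and next at 0x100, MXID 5 is live while 0xFFFFFE00 is too old.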
Member-space wraparound arithmetic: MultiXactOffsetWouldWrap
Because the member offset is a 32-bit counter that wraps at 0xFFFFFFFF,
“is X far enough from the boundary?” cannot be a simple comparison: the
addition can itself wrap. The helper works in unsigned arithmetic, detects
overflow of the addition explicitly, and skips the reserved offset 0:
```c
// MultiXactOffsetWouldWrap: multixact.c
static bool
MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
                         uint32 distance)
{
    MultiXactOffset finish;

    finish = start + distance;
    if (finish < start)
        finish++;               /* skip the reserved offset 0 on overflow */

    if (start < boundary)
        return finish >= boundary || finish < start;
    else
        return finish >= boundary && finish < start;
}
```
The member freeze threshold builds on the MULTIXACT_MEMBER_SAFE_THRESHOLD
(half the offset space) and MULTIXACT_MEMBER_DANGER_THRESHOLD (three-quarters)
constants to make autovacuum more aggressive as the member space fills,
independent of the MXID age:
```c
// MultiXactMemberFreezeThreshold: multixact.c (condensed)
if (!ReadMultiXactCounts(&multixacts, &members))
    return 0;                   /* unknown utilization: assume the worst */
if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
    return autovacuum_multixact_freeze_max_age;     /* plenty of room */

fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
    (MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
victim_multixacts = multixacts * fraction;
result = multixacts - victim_multixacts;
return Min(result, autovacuum_multixact_freeze_max_age);
```
This is why a member-space crunch can force aggressive freezing even when no
table is anywhere near autovacuum_multixact_freeze_max_age in MXID terms:
the two domains have independent pressure, and autovacuum reacts to whichever
is tighter.
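The interpolation is easier to see with numbers plugged in. The sketch below is a standalone approximation of the condensed logic: `sketch_member_freeze_threshold` and the `SKETCH_*` constants are hypothetical names, the counts are passed in rather than read from shared state, `max_age` is assumed non-negative, and a fraction at or above 1.0 is mapped straight to 0 so the arithmetic stays within 32 bits.

```c
#include <stdint.h>

#define SKETCH_MAX_OFFSET 0xFFFFFFFFu
/* Half of the member offset space. */
#define SKETCH_SAFE_THRESHOLD (SKETCH_MAX_OFFSET / 2)
/* Three quarters of the member offset space. */
#define SKETCH_DANGER_THRESHOLD (SKETCH_MAX_OFFSET - SKETCH_MAX_OFFSET / 4)

/*
 * multixacts: number of live MXIDs; members: number of live member slots;
 * max_age: the configured freeze max age. Returns the effective freeze age.
 */
int sketch_member_freeze_threshold(uint32_t multixacts, uint32_t members,
                                   int max_age)
{
    double   fraction;
    uint32_t victims;
    uint32_t result;

    if (members <= SKETCH_SAFE_THRESHOLD)
        return max_age;         /* plenty of room: normal behaviour */

    /* How far members sits between the safe and danger thresholds. */
    fraction = (double) (members - SKETCH_SAFE_THRESHOLD) /
               (double) (SKETCH_DANGER_THRESHOLD - SKETCH_SAFE_THRESHOLD);
    if (fraction >= 1.0)
        return 0;               /* at or past danger: freeze everything */

    victims = (uint32_t) ((double) multixacts * fraction);
    result = multixacts - victims;

    /* Never less aggressive than the configured max age. */
    return result < (uint32_t) max_age ? (int) result : max_age;
}
```

With 1000 live multis and member usage halfway between the two thresholds, roughly half of the multis become freeze candidates, so the effective age drops to about 500 regardless of the configured 400 million.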
Setting the wraparound limits: SetMultiXactIdLimit
The vacuum/warn/stop/wrap limits are derived from the oldest datminmxid
across all databases, mirroring SetTransactionIdLimit:
```c
// SetMultiXactIdLimit: multixact.c (condensed)
multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);    /* "half the space" */
multiStopLimit = multiWrapLimit - 3000000;                      /* refuse new MXIDs */
multiWarnLimit = multiWrapLimit - 40000000;                     /* loud warnings */
multiVacLimit = oldest_datminmxid + autovacuum_multixact_freeze_max_age;

LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->oldestMultiXactId = oldest_datminmxid;
MultiXactState->multiVacLimit = multiVacLimit;
MultiXactState->multiWarnLimit = multiWarnLimit;
MultiXactState->multiStopLimit = multiStopLimit;
MultiXactState->multiWrapLimit = multiWrapLimit;
LWLockRelease(MultiXactGenLock);

/* Members have their *own* limits, computed separately */
needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);
```
The “half the space” comment in the source is deliberately a fiction (multis
wrap differently from XIDs), but it gives a comfortable buffer. On workloads
with many large multis, the member-space limits (SetOffsetVacuumLimit) can be
the ones that fire first.
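Plugging numbers into the four formulas shows how far apart the limits sit. This sketch follows the condensed arithmetic only; `SketchLimits` and `sketch_multixact_limits` are hypothetical, and the adjustments that keep a limit from landing on the invalid MXID 0 are left out.

```c
#include <stdint.h>

typedef struct
{
    uint32_t wrap_limit;    /* assumed point of wraparound */
    uint32_t stop_limit;    /* new MXIDs are refused from here */
    uint32_t warn_limit;    /* warnings start here */
    uint32_t vac_limit;     /* autovacuum is forced from here */
} SketchLimits;

/* All additions and subtractions wrap modulo 2^32, as the counter does. */
SketchLimits sketch_multixact_limits(uint32_t oldest_datminmxid,
                                     uint32_t freeze_max_age)
{
    SketchLimits l;

    l.wrap_limit = oldest_datminmxid + (0xFFFFFFFFu >> 1);
    l.stop_limit = l.wrap_limit - 3000000u;
    l.warn_limit = l.wrap_limit - 40000000u;
    l.vac_limit = oldest_datminmxid + freeze_max_age;
    return l;
}
```

For an oldest `datminmxid` of 1 and a freeze max age of 400 million, forced vacuuming starts at MXID 400,000,001, warnings at 2,107,483,648, refusal at 2,144,483,648, and the assumed wrap point is 2,147,483,648.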
Freezing a multi: FreezeMultiXactId (heapam.c)
VACUUM cannot just leave an old MXID in a tuple forever: eventually the MXID
or its members fall before the freeze cutoffs and must be removed. The heap’s
FreezeMultiXactId is the multixact garbage collector. It returns one of four
dispositions via the FRM_* flags:
```c
// heapam.c: FRM_* freeze dispositions
#define FRM_NOOP                0x0001  /* keep the multi as-is */
#define FRM_INVALIDATE_XMAX     0x0002  /* drop xmax entirely */
#define FRM_RETURN_IS_XID       0x0004  /* replace multi with a single XID */
#define FRM_RETURN_IS_MULTI     0x0008  /* replace with a new, smaller multi */
```
The decision tree, condensed:
```c
// FreezeMultiXactId: heapam.c (condensed)
if (!MultiXactIdIsValid(multi) ||
    HEAP_LOCKED_UPGRADED(t_infomask))
{
    *flags |= FRM_INVALIDATE_XMAX;
    return InvalidTransactionId;
}

if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
    ereport(ERROR, "found multixact %u from before relminmxid %u", ...);

if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
{
    /* this old multi cannot have running members; verify, then resolve */
    if (MultiXactIdIsRunning(multi, HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
        ereport(ERROR, "multixact %u ... found to be still running", ...);
    if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
    {
        *flags |= FRM_INVALIDATE_XMAX;
        return InvalidTransactionId;
    }                           /* lockers only */
    update_xact = MultiXactIdGetUpdateXid(multi, t_infomask);
    ... if updater aborted: FRM_INVALIDATE_XMAX
    ... else: *flags |= FRM_RETURN_IS_XID; return update_xact;
}

/* multi is >= OldestMxact: maybe keep it. Walk members. */
nmembers = GetMultiXactIdMembers(multi, &members, false,
                                 HEAP_XMAX_IS_LOCKED_ONLY(...));
need_replace = false;
for (i = 0; i < nmembers; i++)
    if (TransactionIdPrecedes(members[i].xid, cutoffs->FreezeLimit))
    {
        need_replace = true;
        break;
    }
if (!need_replace)
    need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
if (!need_replace)
{
    *flags |= FRM_NOOP;
    return multi;               /* keep it */
}

/* second pass: keep only running lockers + the live updater, build new multi */
```
The structure makes the lock/update distinction concrete. A multi older than
OldestMxact that is lock-only is simply dropped (FRM_INVALIDATE_XMAX)
— its lockers are gone, the locks meant nothing once their holders ended. A
multi that contains a committed updater must keep that updater’s XID as
the new xmax (FRM_RETURN_IS_XID), because the update chain still depends on
it. A multi that still has some live members but also some below the
freeze cutoff gets rebuilt smaller (FRM_RETURN_IS_MULTI) — keeping only
running lockers and the live updater — which is the freezing-time counterpart
of MultiXactIdExpand’s filtering. The ereport(ERROR, ...) for a multi
“from before relminmxid” is the corruption tripwire: it fires when a tuple
still carries a multi older than the relation’s recorded horizon, which should
be impossible if freezing has kept pace with relminmxid. (The XID side of
these cutoffs, namely relfrozenxid, OldestXmin, and FreezeLimit, is owned by
postgres-xid-wraparound-freeze.md.)
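The first pass of the decision tree can be restated as a pure function over a handful of facts. This is a sketch of the condensed logic above, not of heapam.c: the struct, enum, and function names are hypothetical, the facts are supplied by the caller instead of being looked up, the error checks are omitted, and the second pass is represented only by a "needs rebuild" result.

```c
#include <stdbool.h>

typedef enum
{
    SKETCH_NOOP,                /* keep the multi as-is */
    SKETCH_INVALIDATE_XMAX,     /* drop xmax entirely */
    SKETCH_RETURN_IS_XID,       /* replace the multi with the updater's XID */
    SKETCH_NEEDS_REBUILD        /* second pass decides the final outcome */
} SketchDisposition;

/* Facts the caller is assumed to have established about one xmax multi. */
typedef struct
{
    bool valid;                         /* multi is a valid MXID */
    bool before_oldest_mxact;           /* multi precedes OldestMxact */
    bool lock_only;                     /* no updating member */
    bool updater_committed;             /* meaningful only if !lock_only */
    bool member_before_freeze_limit;    /* some member XID is below FreezeLimit */
    bool before_multixact_cutoff;       /* multi precedes MultiXactCutoff */
} SketchMultiFacts;

SketchDisposition sketch_freeze_disposition(SketchMultiFacts f)
{
    if (!f.valid)
        return SKETCH_INVALIDATE_XMAX;

    if (f.before_oldest_mxact)
    {
        /* No member can still be running: resolve the multi away. */
        if (f.lock_only)
            return SKETCH_INVALIDATE_XMAX;
        return f.updater_committed ? SKETCH_RETURN_IS_XID
                                   : SKETCH_INVALIDATE_XMAX;
    }

    /* Recent multi: keep it unless a cutoff forces a replacement. */
    if (!f.member_before_freeze_limit && !f.before_multixact_cutoff)
        return SKETCH_NOOP;
    return SKETCH_NEEDS_REBUILD;
}
```

An old lock-only multi is invalidated, an old multi with a committed updater collapses to that updater's XID, and a recent multi is kept unless one of the two cutoffs forces the rebuild pass.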
Truncation: TruncateMultiXact and the two-SLRU dance
When the global oldest multi advances, both SLRUs are truncated. The members
SLRU cannot be truncated by page number alone (it can be nearly full across
the whole range at once), so the to-be-deleted member range is derived from
the offsets SLRU — find_multixact_start(newOldestMulti) gives the member
offset boundary, and PerformMembersTruncation deletes whole segments below
it while PerformOffsetsTruncation truncates the offsets SLRU by page. The
page-precedes callbacks (MultiXactOffsetPagePrecedes,
MultiXactMemberPagePrecedes) teach SimpleLruTruncate how “older” is defined
in each wrapping address space. Truncation is WAL-logged
(XLOG_MULTIXACT_TRUNCATE_ID) so standbys replay it via multixact_redo.
```mermaid
flowchart TD
    V["VACUUM advances datminmxid<br/>vac_truncate_clog calls TruncateMultiXact"] --> S["find_multixact_start(newOldestMulti)<br/>read offsets SLRU -> newOldestOffset"]
    S --> M["PerformMembersTruncation<br/>SlruDeleteSegment for member segments<br/>below newOldestOffset"]
    S --> O["PerformOffsetsTruncation<br/>SimpleLruTruncate(offsets, page of newOldest-1)"]
    M --> W["WriteMTruncateXlogRec<br/>XLOG_MULTIXACT_TRUNCATE_ID"]
    O --> W
    W --> R["standby: multixact_redo replays truncation"]
```
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| MultiXactStatus (enum) | src/include/access/multixact.h | 37 |
| ISUPDATE_from_mxstatus | src/include/access/multixact.h | 52 |
| MultiXactMember (struct) | src/include/access/multixact.h | 56 |
| FirstMultiXactId / MaxMultiXactId | src/include/access/multixact.h | 25 |
| MULTIXACT_OFFSETS_PER_PAGE | src/backend/access/transam/multixact.c | 110 |
| MultiXactIdToOffsetPage | src/backend/access/transam/multixact.c | 113 |
| MULTIXACT_MEMBERGROUP_SIZE | src/backend/access/transam/multixact.c | 152 |
| MXOffsetToMemberOffset | src/backend/access/transam/multixact.c | 206 |
| MULTIXACT_MEMBER_SAFE_THRESHOLD | src/backend/access/transam/multixact.c | 216 |
| MultiXactStateData (struct) | src/backend/access/transam/multixact.c | 242 |
| MultiXactIdCreate | src/backend/access/transam/multixact.c | 478 |
| MultiXactIdExpand | src/backend/access/transam/multixact.c | 531 |
| MultiXactIdIsRunning | src/backend/access/transam/multixact.c | 643 |
| MultiXactIdSetOldestMember | src/backend/access/transam/multixact.c | 717 |
| MultiXactIdSetOldestVisible | src/backend/access/transam/multixact.c | 774 |
| MultiXactIdCreateFromMembers | src/backend/access/transam/multixact.c | 859 |
| RecordNewMultiXact | src/backend/access/transam/multixact.c | 960 |
| GetNewMultiXactId | src/backend/access/transam/multixact.c | 1201 |
| GetMultiXactIdMembers | src/backend/access/transam/multixact.c | 1470 |
| mxstatus_to_string | src/backend/access/transam/multixact.c | 1904 |
| SetMultiXactIdLimit | src/backend/access/transam/multixact.c | 2530 |
| ExtendMultiXactOffset | src/backend/access/transam/multixact.c | 2721 |
| ExtendMultiXactMember | src/backend/access/transam/multixact.c | 2753 |
| MultiXactOffsetWouldWrap | src/backend/access/transam/multixact.c | 3012 |
| find_multixact_start | src/backend/access/transam/multixact.c | 3060 |
| MultiXactMemberFreezeThreshold | src/backend/access/transam/multixact.c | 3150 |
| PerformMembersTruncation | src/backend/access/transam/multixact.c | 3219 |
| TruncateMultiXact | src/backend/access/transam/multixact.c | 3274 |
| MultiXactOffsetPagePrecedes | src/backend/access/transam/multixact.c | 3466 |
| MultiXactIdPrecedes | src/backend/access/transam/multixact.c | 3506 |
| multixact_redo | src/backend/access/transam/multixact.c | 3583 |
| tupleLockExtraInfo | src/backend/access/heap/heapam.c | 132 |
| get_mxact_status_for_lock | src/backend/access/heap/heapam.c | 4527 |
| FreezeMultiXactId | src/backend/access/heap/heapam.c | 6713 |
| MultiXactIdGetUpdateXid | src/backend/access/heap/heapam.c | 7536 |
| HEAP_XMAX_IS_MULTI | src/include/access/htup_details.h | 209 |
Source verification (as of 2026-06-05)
Verified facts
- Two SLRUs, offsets→members indirection. Confirmed in the file header comment and the MultiXactIdToOffsetPage / MXOffsetToMemberPage macro families. The offsets SLRU is a fixed-width MultiXactOffset[] indexed by MXID; the members SLRU stores variable-length arrays located by those offsets. Member-array length is computed as the difference between consecutive offsets, not stored; verified in GetMultiXactIdMembers (“corner case 1” handling for the latest multi).
- 5-word (20-byte) member groups, 409 per 8 kB page. The packing of 4 flag bytes + 4 xids is in the members layout comment and the MULTIXACT_MEMBERGROUP_SIZE / MULTIXACT_MEMBERGROUPS_PER_PAGE macros. With BLCKSZ=8192, 8192 / 20 = 409 groups, wasting 12 bytes/page, as the comment states.
- Six member statuses, ordering load-bearing. MultiXactStatus has exactly the six values shown (ForKeyShare=0 … Update=5), and ISUPDATE_from_mxstatus keys on status > MultiXactStatusForUpdate (0x03). Verified in multixact.h.
- Multis are immutable; expand creates a new MXID. MultiXactIdExpand calls MultiXactIdCreateFromMembers for every non-trivial outcome and never writes back into the passed multi. The source comment explains the race with waiters. Verified.
- Single-updater invariant. MultiXactIdCreateFromMembers raises elog(ERROR, "new multixact has more than one updating member") if two members satisfy ISUPDATE_from_mxstatus. Verified.
- Two independent wraparound domains. GetNewMultiXactId advances both nextMXact and nextOffset under MultiXactGenLock, and runs two distinct guard blocks: the MXID guard (multiVacLimit / multiStopLimit) and the member guard (offsetStopLimit via MultiXactOffsetWouldWrap). Verified.
- Offset 0 is reserved. GetNewMultiXactId does if (nextOffset == 0) { *offset = 1; nmembers++; }, and GetMultiXactIdMembers ignores a zero member read. Verified in both functions.
- Lock-mode → status mapping lives in heapam. tupleLockExtraInfo maps the four LockTupleMode strengths to (hwlock, lockstatus, updstatus), with -1 updstatus for the two share modes. get_mxact_status_for_lock is the accessor. Verified.
- FreezeMultiXactId returns four FRM_* dispositions. FRM_NOOP, FRM_INVALIDATE_XMAX, FRM_RETURN_IS_XID, FRM_RETURN_IS_MULTI are defined at heapam.c:6660–6663; the decision tree distinguishes lock-only (invalidate) from committed-updater (return XID). Verified.
- Truncation derives member range from offsets SLRU. TruncateMultiXact → find_multixact_start → PerformMembersTruncation (segment deletes) + PerformOffsetsTruncation (SimpleLruTruncate), all WAL-logged via WriteMTruncateXlogRec (XLOG_MULTIXACT_TRUNCATE_ID). Verified.
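The member-group arithmetic and the status ordering in the list above can be checked with a short sketch. The `SK_`/`sk_` names are hypothetical mirrors of the macros named in the list, written from the layout described here (4 flag bytes followed by 4 four-byte xids per group, 8 kB pages) and not copied from multixact.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 8 kB pages, 4 flag bytes followed by 4 four-byte xids per group. */
#define SK_BLCKSZ              8192u
#define SK_FLAGBYTES_PER_GROUP 4u
#define SK_MEMBERS_PER_GROUP   4u
#define SK_XID_SIZE            4u
#define SK_GROUP_SIZE       (SK_FLAGBYTES_PER_GROUP + SK_MEMBERS_PER_GROUP * SK_XID_SIZE)
#define SK_GROUPS_PER_PAGE  (SK_BLCKSZ / SK_GROUP_SIZE)
#define SK_MEMBERS_PER_PAGE (SK_GROUPS_PER_PAGE * SK_MEMBERS_PER_GROUP)

/* The six member statuses, in the order the text describes. */
typedef enum {
    SkForKeyShare = 0,
    SkForShare = 1,
    SkForNoKeyUpdate = 2,
    SkForUpdate = 3,
    SkNoKeyUpdate = 4,
    SkUpdate = 5
} SkStatus;

/* A member is an updater when its status sorts above the strongest lock. */
static bool
sk_is_update(SkStatus s)
{
    assert((int) s >= 0 && (int) s <= 5);
    return s > SkForUpdate;
}

/* Members-SLRU page that holds a given member offset. */
static uint32_t
sk_member_page(uint32_t offset)
{
    return offset / SK_MEMBERS_PER_PAGE;
}

/* Byte position, within its page, of the group holding this member. */
static uint32_t
sk_group_byte_offset(uint32_t offset)
{
    return ((offset / SK_MEMBERS_PER_GROUP) % SK_GROUPS_PER_PAGE) * SK_GROUP_SIZE;
}

/* Byte position, within its page, of this member's xid (after the flag bytes). */
static uint32_t
sk_xid_byte_offset(uint32_t offset)
{
    return sk_group_byte_offset(offset) + SK_FLAGBYTES_PER_GROUP
        + (offset % SK_MEMBERS_PER_GROUP) * SK_XID_SIZE;
}
```

With these constants the last member on a page (offset 1635) has its xid at bytes 8176 to 8179, so the group area ends at byte 8180 and leaves the 12 unused bytes the layout comment mentions.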
Open questions
- The backend-local MXactCache comment carries an explicit FIXME noting the per-transaction lifetime is “plain wrong now that multixact’s may contain update Xids”: the cache can outlive the relevance of an entry whose other members persist. This is a known, tolerated imperfection, not a bug with an observable correctness impact (the cache is consulted only as a fast path; misses fall through to the SLRU).
- The “half the space” wrap-limit comment in SetMultiXactIdLimit is an acknowledged approximation; the real first-to-fire constraint on multi-heavy workloads is the member-space limit from SetOffsetVacuumLimit, not the MXID limit. The exact interaction between the two limit sets under extreme member pressure is subtle and best understood by reading SetOffsetVacuumLimit alongside MultiXactMemberFreezeThreshold.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Oracle’s Interested Transaction List (ITL) vs. the members SLRU. The ITL is the closest mainstream cousin to MultiXact: both record a set of transactions interested in a row, and both let a locker coexist with an updater. The instructive difference is placement. Oracle keeps the list in the data block header (one ITL array per block, sized by INITRANS/MAXTRANS), so concurrent lockers compete for finite, in-block slots, and when the ITL is exhausted sessions stall in ITL waits (enq: TX - allocate ITL entry) that can surface as ORA-00060 deadlocks. PostgreSQL keeps the list in a global members SLRU, decoupling list size from heap-page space (a multi can grow without consuming row room) at the cost of an SLRU lookup, a second wraparound domain, and a VACUUM-time garbage-collection obligation (FreezeMultiXactId) that Oracle’s block-local, undo-reclaimed scheme does not have. A side-by-side of ITL slot contention against members-offset exhaustion would frame the core trade: locality and bounded space (Oracle) versus unbounded per-row stakeholders and amortized SLRU traffic (PostgreSQL).
- Lock-table-only engines (DB2, SQL Server) and lock escalation. In a classic 2PL engine the row lock is a transient shared-memory hash entry with no on-row footprint, so there is no representational ceiling on concurrent lockers and no freezing problem, but accumulated row locks trigger escalation to page/table granularity, trading concurrency for bounded memory. PostgreSQL’s MultiXact is the dual design: it never escalates (the durable on-row MXID has no memory-pressure cliff) but it does pay a wraparound and truncation tax that an in-memory lock table sidesteps entirely. The comparison is a clean illustration of “where does lock state live” being the load-bearing design axis the Theoretical Background opened with. See postgres-lock-manager.md for PostgreSQL’s own in-memory heavyweight lock table, which the tuple-lock path uses transiently in addition to the durable MXID.
- Hekaton / in-memory MVCC and the absence of durable row locks. Microsoft’s Hekaton (Larson et al., “High-Performance Concurrency Control Mechanisms for Main-Memory Databases,” VLDB 2011) resolves write-write and lock conflicts through optimistic, version-chain validation with begin/end timestamps stamped directly on versions. There is no on-row locker set at all, because conflicts are detected at validation time rather than recorded durably. MultiXact occupies the opposite corner of the design space: pessimistic, durable, on-row encoding of who holds what, justified by PostgreSQL’s disk-oriented, crash-recoverable heap where a reader arriving after a crash must reconstruct lock state from the tuple alone. Positioning the six-valued MultiXactStatus against Hekaton’s timestamp-only versions sharpens what durability of the lock itself (not just the data) costs.
- The two-counter wraparound problem as a research artifact. The member-offset space is a second 32-bit wraparound domain layered on the MXID domain, and historically (the 9.3-era SLRU bugs that motivated SetOffsetVacuumLimit and MultiXactMemberFreezeThreshold) it has been the one that bites first under FK-heavy workloads. The natural research frontier here is a focused study of how autovacuum_multixact_freeze_max_age interacts with the member-fraction back-pressure curve, and of whether a 64-bit member offset (mirroring the 64-bit-XID proposals tracked in postgres-xid-wraparound-freeze.md) would retire the whole second domain.
- Generalizing per-member status beyond lock/update. The MultiXactMember flag byte currently encodes exactly six statuses. Nothing in the SLRU layout forbids richer per-member metadata (e.g., distinguishing predicate-lock intent, or carrying a sub-transaction id), and the immutable-multi discipline means any such extension is purely additive at create time. Whether the member vocabulary should grow, versus pushing richer intent into the separate predicate-lock machinery (postgres-ssi-predicate-locking.md), is an open design question the current single-updater invariant keeps narrowly scoped.
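To make the "member-fraction back-pressure curve" from the two-counter item concrete, here is a minimal sketch of one plausible shape for it: the effective freeze age stays at the configured value while member usage is low, then shrinks linearly to zero as usage approaches a danger level. The thresholds (half and three quarters of the 32-bit offset space) and every `sk_`/`SK_` name are assumptions made for illustration. The actual constants and rounding should be read from MultiXactMemberFreezeThreshold in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed thresholds over a 32-bit member-offset space; illustrative only. */
#define SK_MAX_OFFSET       UINT32_C(0xFFFFFFFF)
#define SK_SAFE_THRESHOLD   (SK_MAX_OFFSET / 2)
#define SK_DANGER_THRESHOLD (SK_MAX_OFFSET - SK_MAX_OFFSET / 4)

/*
 * Effective multixact freeze max age under member-space pressure.
 * multixacts: number of multis currently in existence
 * members:    number of member slots currently in use
 */
static uint32_t
sk_effective_freeze_max_age(uint32_t multixacts, uint32_t members,
                            uint32_t configured_max_age)
{
    assert(SK_DANGER_THRESHOLD > SK_SAFE_THRESHOLD);

    /* Low member usage: the configured setting applies unchanged. */
    if (members <= SK_SAFE_THRESHOLD)
        return configured_max_age;

    /* Fraction of the safe-to-danger band already consumed (may exceed 1). */
    double fraction = (double) (members - SK_SAFE_THRESHOLD)
        / (double) (SK_DANGER_THRESHOLD - SK_SAFE_THRESHOLD);
    double victims = (double) multixacts * fraction;

    /* At or past the danger level, ask for everything to be frozen. */
    if (victims >= (double) multixacts)
        return 0;

    uint32_t result = multixacts - (uint32_t) victims;

    /* Never exceed the configured setting. */
    return result < configured_max_age ? result : configured_max_age;
}
```

Under these assumptions a cluster with 1000 live multis and member usage halfway through the band would be asked to freeze roughly the oldest half, which is the kind of interaction with autovacuum_multixact_freeze_max_age the item proposes studying.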
Sources
In-tree source files (REL_18_STABLE, commit 273fe94)
- src/backend/access/transam/multixact.c: the subsystem proper, covering the offsets/members SLRU layout macros (MultiXactIdToOffsetPage, MXOffsetToMemberPage, the 5-word member-group constants), the create path (MultiXactIdCreate, MultiXactIdExpand, MultiXactIdCreateFromMembers, GetNewMultiXactId, RecordNewMultiXact), the read path (GetMultiXactIdMembers), the per-backend horizons (MultiXactIdSetOldestMember / SetOldestVisible), the dual wraparound guards (MultiXactOffsetWouldWrap, SetMultiXactIdLimit, SetOffsetVacuumLimit, MultiXactMemberFreezeThreshold), truncation (TruncateMultiXact, find_multixact_start, PerformMembersTruncation, PerformOffsetsTruncation), and WAL replay (multixact_redo).
- src/include/access/multixact.h: MultiXactStatus (the six member statuses), ISUPDATE_from_mxstatus, MultiXactMember, the reserved MXID values (InvalidMultiXactId, FirstMultiXactId, MaxMultiXactId, MaxMultiXactOffset).
- src/backend/access/heap/heapam.c: the policy layer that owns the lock-mode vocabulary and the freeze collector, namely tupleLockExtraInfo (LockTupleMode → member status), get_mxact_status_for_lock, MultiXactIdGetUpdateXid, and FreezeMultiXactId with its four FRM_* dispositions.
- src/include/access/htup_details.h: HEAP_XMAX_IS_MULTI and the HEAP_XMAX_IS_LOCKED_ONLY / HEAP_LOCKED_UPGRADED infomask predicates that gate whether an xmax is read as a bare XID or an MXID.
- src/backend/access/heap/README.tuplock: the design narrative for tuple locking that motivates why a multi is needed at all (compatible share locks, locker-plus-updater coexistence, the infomask encoding).
Papers and textbook chapters
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 18 “Concurrency Control”: shared/exclusive lock modes, the compatibility matrix, and 2PL, the textbook frame the Theoretical Background builds on (knowledge/research/dbms-general/).
- Database Internals (Petrov 2019), ch. 5 “Transaction Processing and Recovery”: lock-manager structure and where lock state lives, the axis the Common DBMS Design section turns on.
- Larson, P.-Å. et al. (2011). “High-Performance Concurrency Control Mechanisms for Main-Memory Databases.” PVLDB 5(4):298-309. The Hekaton optimistic-MVCC contrast in Beyond PostgreSQL (timestamp-on-version vs. durable on-row locker set).
- Oracle Database Concepts: the Interested Transaction List (ITL), INITRANS/MAXTRANS, and block-level row-lock storage, the comparative anchor for the members-SLRU-vs-block-header trade.
Sibling docs (cross-references — mechanism owned there, not duplicated here)
- postgres-slru.md: the SimpleLruReadPage / SimpleLruTruncate / SlruDeleteSegment buffer machinery that the offsets and members SLRUs are instances of; this doc treats the SLRU as a given.
- postgres-lock-manager.md: the in-memory heavyweight lock table and the LOCKMODE vocabulary (AccessShareLock … AccessExclusiveLock) that tupleLockExtraInfo’s first column feeds; the tuple-lock path takes one of these transiently alongside the durable MXID.
- postgres-xid-wraparound-freeze.md: the XID side of freezing (relfrozenxid, OldestXmin, FreezeLimit, vac_update_datfrozenxid) that runs in lockstep with the MXID side (relminmxid, OldestMxact, MultiXactCutoff) inside the same VACUUM pass.
- postgres-heap-am.md / postgres-mvcc-snapshots.md: heap_lock_tuple, heap_update, and the xmax/infomask visibility rules that decide when an MXID is allocated and how a later reader interprets it.
- postgres-procarray.md: the xmin-horizon analogue to the OldestMemberMXactId / OldestVisibleMXactId per-backend horizons that keep the SLRUs from being truncated underfoot.
- postgres-clog-commit-ts.md: the other SLRUs, which lack the two-area split, whose page-arithmetic and truncation patterns parallel multixact’s, referenced when the offsets/members split is introduced.