PostgreSQL MultiXact — Multiple Lockers and Updaters on One Tuple

Row-level locking is one of the oldest features of a relational engine, and the textbook treatment makes it look simple: a transaction that wants to read a row “for update” takes a lock on it, a transaction that wants to modify a row takes a stronger lock, and a lock manager arbitrates conflicts using a compatibility matrix. Database System Concepts (Silberschatz 7e, ch. 18 “Concurrency Control”) frames the two canonical lock modes — shared (S) and exclusive (X) — and the rule that any number of transactions may hold S simultaneously while X is incompatible with everything. Two-phase locking (2PL) builds serializability on top of that matrix: a transaction acquires locks in a growing phase and releases them in a shrinking phase. The lock manager itself is “just” a hash table keyed by lockable object, with each entry holding a queue of granted and waiting requests.

The hard part — the part the textbook chapter waves at — is where the lock state lives. A pure lock manager keeps every lock in a shared-memory hash table, which is fine when the number of held locks is bounded by the number of active transactions. But MVCC engines want a different trick: instead of keeping a row lock in volatile memory that vanishes on crash, they want the lock to be durable and self-describing, recorded on the row itself, so that a reader arriving later can tell — without consulting any lock table — whether the row is locked, by whom, and how strongly. PostgreSQL does exactly this for the strongest lock (an update/delete writes the updater’s XID into the tuple’s xmax), and the whole point of the MultiXact subsystem is to extend that durable, on-row representation to the case where more than one transaction has a stake in the row at the same time.

Consider the conflict matrix for tuple locks. PostgreSQL has four LockTupleMode strengths, in increasing order: KeyShare, Share, NoKeyExclusive, Exclusive. The two share modes are compatible with each other and with themselves. SELECT ... FOR KEY SHARE is what a foreign-key check takes on the referenced row, and many concurrent child-row inserts can hold key-share on the same parent simultaneously. So a single parent row can legitimately be “locked” by ten transactions at once. A bare xmax field holds exactly one XID. The representational problem is therefore concrete and unavoidable: how do you record N concurrent lockers in a one-XID slot?
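
The conflict matrix can be written down as a small lookup table. The following is a minimal sketch, not PostgreSQL's own code: the type and function names are hypothetical, and the table follows the row-level lock conflict rules described here (key-share conflicts only with exclusive, share conflicts with both exclusive modes).

```c
#include <assert.h>
#include <stdbool.h>

/* Tuple lock strengths, weakest first (hypothetical names for this sketch). */
typedef enum
{
    LOCK_KEY_SHARE = 0,
    LOCK_SHARE,
    LOCK_NO_KEY_EXCLUSIVE,
    LOCK_EXCLUSIVE
} TupleLockStrength;

/* conflicts[held][requested] is true when the request must wait. */
static const bool conflicts[4][4] = {
    /* held \ requested:    KeyShare Share  NoKeyExcl Excl */
    /* KeyShare       */  { false,   false, false,    true },
    /* Share          */  { false,   false, true,     true },
    /* NoKeyExclusive */  { false,   true,  true,     true },
    /* Exclusive      */  { true,    true,  true,     true },
};

static bool
tuple_locks_conflict(TupleLockStrength held, TupleLockStrength requested)
{
    assert((int) held >= 0 && (int) held <= 3);
    assert((int) requested >= 0 && (int) requested <= 3);
    return conflicts[held][requested];
}
```

Under this table, `tuple_locks_conflict(LOCK_KEY_SHARE, LOCK_KEY_SHARE)` is false, which is exactly the case that lets many foreign-key checks hold the same parent row at once.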

The answer the literature suggests is indirection: store an identifier in the row, and let that identifier name a side structure that holds the real list. This is the same move a filesystem makes when a directory entry points at an inode rather than inlining the file. The cost is an extra lookup and a garbage-collection problem (when can the side structure be reclaimed?), and the benefit is that the fixed-width slot can represent an unbounded set. The MultiXact subsystem is PostgreSQL’s instantiation of that idea, specialized to the lifetime and crash-durability constraints of a transactional store: a MultiXactId is the identifier, the members SLRU is the side structure, and VACUUM-driven freezing is the garbage collector.

There is a second theoretical wrinkle that makes MultiXact more than a simple locker list. Locking and updating are not mutually exclusive over time. A transaction can take FOR KEY SHARE on a row, and then another transaction can UPDATE it (a no-key update does not conflict with key-share). Now the row is simultaneously locked by one transaction and updated by another — and the updater’s XID must end up in xmax (that is how the next reader follows the update chain), yet the locker’s XID must also be preserved (the update must not silently drop the foreign-key lock). One xmax slot, two semantically distinct claims. So the side structure must record, per member, not just who but what kind of claim: a lock (and how strong) versus an update (key-touching or not). That per-member status is what turns a flat locker list into the six-valued MultiXactStatus enum at the heart of this module.

Engines diverge sharply on how, or whether, they make row locks durable.

Lock-table-only designs (DB2, SQL Server, classic 2PL). The lock manager is a shared-memory hash table; a row lock is a transient entry that lives only as long as the lock is held and disappears on transaction end or crash. There is no on-row footprint at all, so there is no representational limit on concurrent lockers (the table simply holds more entries) and no garbage-collection problem (entries are freed at unlock). The cost is memory pressure and lock escalation: when too many row locks accumulate, the engine escalates to a coarser page- or table-level lock, trading concurrency for bounded memory. There is nothing like a MultiXactId because the lock state never needs to fit in the row.

On-row single-locker designs. Several MVCC engines record the most recent writer’s transaction id in a per-row slot (Oracle’s interested-transaction list, InnoDB’s DB_TRX_ID) and resolve visibility/locking by consulting that id plus undo information. Oracle’s ITL (Interested Transaction List) is the closest cousin to MultiXact: each data block header carries a small array of ITL slots, and a transaction that locks or modifies a row claims a slot and points the row at it. Multiple concurrent lockers are accommodated by multiple ITL slots in the block, and the array can grow (INITRANS / MAXTRANS) at the cost of block space. The crucial difference: Oracle’s locker list lives in the data block (one list per block, shared by all its rows), whereas PostgreSQL’s lives in a global side SLRU (one list per MultiXactId, referenced by any row in any table). Oracle’s design localizes the list with the data (good cache behavior, bounded by block size); the PostgreSQL design decouples list size from page space (a multi can have many members without consuming heap-page room) at the cost of an SLRU lookup and a separate wraparound domain.

The PostgreSQL position. PostgreSQL is unusual in that its row lock is both on-row and indirected. The common case — a single updater, or a single exclusive locker — needs no MultiXact at all: the tuple’s xmax holds the bare XID and the HEAP_XMAX_IS_MULTI bit is clear. Only when the row genuinely accrues multiple stakeholders, or a locker must coexist with an updater, does the heap allocate a MultiXactId, write that into xmax, and set the multi bit. This keeps the overwhelmingly common single-locker path free of SLRU traffic, and confines the expensive machinery (SLRU pages, member arrays, a second wraparound counter) to the genuinely-concurrent minority of tuples.

flowchart TD
  A["heap_lock_tuple / heap_update<br/>wants to claim a tuple"] --> B{"xmax already<br/>set and valid?"}
  B -- "no" --> C["write bare XID into xmax<br/>HEAP_XMAX_IS_MULTI clear"]
  B -- "yes, bare XID" --> D{"new claim compatible<br/>with existing one?"}
  D -- "incompatible" --> E["wait on existing XID, then retry"]
  D -- "compatible" --> F["MultiXactIdCreate(oldXID, newXID)<br/>allocate a 2-member multi"]
  B -- "yes, already a multi" --> G["MultiXactIdExpand(multi, newXID)<br/>drop dead members, add new one"]
  F --> H["write MXID into xmax<br/>set HEAP_XMAX_IS_MULTI"]
  G --> H
  H --> I["members live in pg_multixact SLRUs<br/>until VACUUM freezes the tuple"]

The diagram above is the policy layer, owned by heapam.c. Everything the diagram delegates to (allocating the MXID, packing the member array into SLRU pages, reading it back, and reclaiming it) is the mechanism layer owned by multixact.c, and that mechanism is the subject of this document.

The file header of multixact.c is unusually candid about the history:

// multixact.c — file header comment
* The pg_multixact manager is a pg_xact-like manager that stores an array of
* MultiXactMember for each MultiXactId. It is a fundamental part of the
* shared-row-lock implementation. Each MultiXactMember is comprised of a
* TransactionId and a set of flag bits. The name is a bit historical:
* originally, a MultiXactId consisted of more than one TransactionId (except
* in rare corner cases), hence "multi". Nowadays, however, it's perfectly
* legitimate to have MultiXactIds that only include a single Xid.

So a MultiXactId is not “a set of transactions that share a lock” in the naive sense. It is a name for a stored, immutable array of MultiXactMember records. A single-member multi is legal and common (it arises whenever a tuple needs the richer per-member status that a bare xmax cannot express — e.g. a key-share locker on an already-updated row). The flag bits — the member status — are opaque to multixact.c; the module just stores and retrieves the arrays, and heapam.c interprets the statuses as lock modes.

The MXID is a 32-bit value living in the same numeric neighborhood as an XID but in a separate address space with its own counter and its own wraparound. The header reserves the low values:

// multixact.h — reserved MXID values
#define InvalidMultiXactId ((MultiXactId) 0)
#define FirstMultiXactId ((MultiXactId) 1)
#define MaxMultiXactId ((MultiXactId) 0xFFFFFFFF)
#define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)

The member status encodes both lock strength and whether this member updated the tuple, in a single enum whose numeric ordering matters:

// multixact.h — MultiXactStatus
typedef enum
{
    MultiXactStatusForKeyShare = 0x00,
    MultiXactStatusForShare = 0x01,
    MultiXactStatusForNoKeyUpdate = 0x02,
    MultiXactStatusForUpdate = 0x03,
    /* an update that doesn't touch "key" columns */
    MultiXactStatusNoKeyUpdate = 0x04,
    /* other updates, and delete */
    MultiXactStatusUpdate = 0x05,
} MultiXactStatus;
#define MaxMultiXactStatus MultiXactStatusUpdate
/* does a status value correspond to a tuple update? */
#define ISUPDATE_from_mxstatus(status) \
    ((status) > MultiXactStatusForUpdate)

The first four are lock-only statuses (SELECT ... FOR KEY SHARE / SHARE / NO KEY UPDATE / UPDATE); the last two mark a member that actually performed an update or delete. The ISUPDATE_from_mxstatus macro relies on the ordering — anything strictly greater than ForUpdate (0x03) is an updater. This single predicate is load-bearing throughout the module: it is how MultiXactIdExpand decides which dead members to keep, and how FreezeMultiXactId distinguishes a locker (droppable once not running) from a committed updater (must be preserved as the new xmax). The crucial invariant, checked at create time, is that a multi has at most one updating member — many lockers may share a row, but only one transaction can have updated it.

heapam.c maps a SQL-level LockTupleMode to a MultiXactStatus through a static table, which is the bridge between the lock manager’s vocabulary and the multixact member vocabulary:

// heapam.c — tupleLockExtraInfo (lock mode -> hwlock + member statuses)
tupleLockExtraInfo[MaxLockTupleMode + 1] =
{
{ AccessShareLock, MultiXactStatusForKeyShare, -1 }, /* KeyShare */
{ RowShareLock, MultiXactStatusForShare, -1 }, /* Share */
{ ExclusiveLock, MultiXactStatusForNoKeyUpdate, MultiXactStatusNoKeyUpdate }, /* NoKeyExclusive */
{ AccessExclusiveLock, MultiXactStatusForUpdate, MultiXactStatusUpdate }, /* Exclusive */
};

Each row gives the heavyweight LOCKMODE used for the in-memory tuple lock (see postgres-lock-manager.md), the member status to record when the claim is a lock, and the status to record when it is an update. The -1 entries encode that key-share and share locks can never themselves be the updating member.
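
As a rough illustration of how such a table is consulted, the sketch below maps a lock mode plus an "is this an update?" flag to a member status. All names are hypothetical stand-ins, the heavyweight lock column is omitted, and returning -1 for an impossible combination is an assumption of this sketch (a real caller would treat that case as an error).

```c
#include <assert.h>
#include <stdbool.h>

/* Member statuses, numbered as in the enum shown earlier (hypothetical names). */
enum
{
    ST_FOR_KEY_SHARE = 0,
    ST_FOR_SHARE = 1,
    ST_FOR_NO_KEY_UPDATE = 2,
    ST_FOR_UPDATE = 3,
    ST_NO_KEY_UPDATE = 4,
    ST_UPDATE = 5
};

typedef struct
{
    int lock_status;            /* status when the claim is only a lock */
    int update_status;          /* status when the claim is an update, or -1 */
} StatusPair;

/* Indexed by lock mode: 0=KeyShare, 1=Share, 2=NoKeyExclusive, 3=Exclusive. */
static const StatusPair status_for_mode[4] = {
    { ST_FOR_KEY_SHARE,     -1 },
    { ST_FOR_SHARE,         -1 },
    { ST_FOR_NO_KEY_UPDATE, ST_NO_KEY_UPDATE },
    { ST_FOR_UPDATE,        ST_UPDATE },
};

static int
member_status_for(int mode, bool is_update)
{
    assert(mode >= 0 && mode < 4);
    return is_update ? status_for_mode[mode].update_status
                     : status_for_mode[mode].lock_status;
}

/* Same ordering trick as ISUPDATE_from_mxstatus. */
static bool
status_is_update(int status)
{
    return status > ST_FOR_UPDATE;
}
```

Every non-negative value in the update column satisfies `status_is_update`, and no value in the lock column does, which is the property the enum ordering is designed to give.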

The representational core is two parallel SLRU areas. The “offsets” SLRU is a flat array indexed by MXID, where each four-byte slot holds the starting offset of that multi’s member array in the “members” SLRU. The “members” SLRU holds the variable-length (xid, status) arrays themselves. Both areas are built on the same SLRU machinery that clog and commit_ts use (see postgres-slru.md and postgres-clog-commit-ts.md); the two-area split itself is explained in the header:

// multixact.c — file header comment
* We use two SLRU areas, one for storing the offsets at which the data
* starts for each MultiXactId in the other one. This trick allows us to
* store variable length arrays of TransactionIds.

The offsets layout is a trivial division because each entry is fixed-width:

// MultiXactIdToOffsetPage / Entry — multixact.c
#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
static inline int64
MultiXactIdToOffsetPage(MultiXactId multi)
{
    return multi / MULTIXACT_OFFSETS_PER_PAGE;
}
static inline int
MultiXactIdToOffsetEntry(MultiXactId multi)
{
    return multi % MULTIXACT_OFFSETS_PER_PAGE;
}
}
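
To make the division concrete, here is a small standalone sketch of the same arithmetic. It assumes the default 8 kB block size and uses hypothetical `SKETCH_`-prefixed names in place of the PostgreSQL macros.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192u     /* assumption: default 8 kB block size */
#define SKETCH_OFFSETS_PER_PAGE (SKETCH_BLCKSZ / sizeof(uint32_t))  /* 2048 */

/* Which offsets-SLRU page holds the slot for this MXID? */
static int64_t
offset_page(uint32_t multi)
{
    return (int64_t) (multi / SKETCH_OFFSETS_PER_PAGE);
}

/* Which four-byte slot within that page? */
static int
offset_entry(uint32_t multi)
{
    int entry = (int) (multi % SKETCH_OFFSETS_PER_PAGE);

    assert(entry >= 0 && (uint32_t) entry < SKETCH_OFFSETS_PER_PAGE);
    return entry;
}
```

With these assumptions, MXID 4711 lands on page 2 (4711 / 2048) at entry 615 (4711 - 2 * 2048).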

The members layout is subtler because each member needs a TransactionId (4 bytes) plus a flag byte for its status. To avoid alignment waste, members are packed into groups of four: four flag bytes followed by four xids — a 20-byte, 5-word group:

// multixact.c — members layout comment + group constants
* we store four bytes of flags, and then the
* corresponding 4 Xids. Each such 5-word (20-byte) set we call a "group", and
* are stored as a whole in pages. Thus, with 8kB BLCKSZ, we keep 409 groups
* per page.
#define MULTIXACT_FLAGBYTES_PER_GROUP 4
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP \
(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
#define MULTIXACT_MEMBERGROUP_SIZE \
(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)
#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)
#define MULTIXACT_MEMBERS_PER_PAGE \
(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)

Resolving a member offset to a physical byte position therefore involves locating the group on its page, skipping the four flag bytes, and indexing the xid within the group — MXOffsetToMemberPage, MXOffsetToFlagsOffset, and MXOffsetToMemberOffset do this arithmetic. The “offset” here is a member index in a global 32-bit address space (the MultiXactOffset type), distinct from the MXID space — which is why members have their own wraparound concern, discussed below.
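
The group arithmetic can be sketched as below. This is an illustration under the assumption of an 8 kB block and one flag byte per member; the `SK_` names and function names are hypothetical, not the PostgreSQL macros themselves.

```c
#include <assert.h>
#include <stdint.h>

#define SK_BLCKSZ              8192u    /* assumption: 8 kB block */
#define SK_FLAGBYTES_PER_GROUP 4u
#define SK_MEMBERS_PER_GROUP   4u       /* one flag byte per member */
#define SK_GROUP_SIZE          (4u * SK_MEMBERS_PER_GROUP + SK_FLAGBYTES_PER_GROUP) /* 20 */
#define SK_GROUPS_PER_PAGE     (SK_BLCKSZ / SK_GROUP_SIZE)                          /* 409 */
#define SK_MEMBERS_PER_PAGE    (SK_GROUPS_PER_PAGE * SK_MEMBERS_PER_GROUP)          /* 1636 */

/* Page of the members SLRU that holds this member offset. */
static uint32_t
member_page(uint32_t off)
{
    return off / SK_MEMBERS_PER_PAGE;
}

/* Byte position, within the page, of the group's four flag bytes. */
static uint32_t
flags_byte_offset(uint32_t off)
{
    return ((off / SK_MEMBERS_PER_GROUP) % SK_GROUPS_PER_PAGE) * SK_GROUP_SIZE;
}

/* Byte position, within the page, of this member's 4-byte xid. */
static uint32_t
xid_byte_offset(uint32_t off)
{
    uint32_t pos = flags_byte_offset(off) + SK_FLAGBYTES_PER_GROUP +
                   (off % SK_MEMBERS_PER_GROUP) * 4u;

    assert(pos + 4u <= SK_BLCKSZ);      /* the xid must fit inside the page */
    return pos;
}
```

For member offset 9020 this gives page 5, a flag-byte position of 4200 (group 210 of the page), and an xid position of 4204; offset 9021 shares the group and its xid sits four bytes later at 4208.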

flowchart LR
  X["tuple.xmax = MXID 4711<br/>HEAP_XMAX_IS_MULTI set"] --> O["offsets SLRU<br/>slot[4711] = member offset 9020"]
  O --> M["members SLRU<br/>offset 9020: (xid=812, ForKeyShare)<br/>offset 9021: (xid=915, NoKeyUpdate)"]
  O2["offsets SLRU<br/>slot[4712] = member offset 9022"] -.->|"length = 9022 - 9020 = 2"| M
  X --> O2

The length of multi 4711’s member array is not stored explicitly; it is computed as the difference between slot[4711] and slot[4712] in the offsets SLRU. This is why RecordNewMultiXact must set both the current multi’s offset and eagerly initialize the next slot, and why GetMultiXactIdMembers has careful corner-case handling for “the latest multi has no successor yet.”
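
The length computation can be modeled with a plain array standing in for the offsets SLRU. This is a simplified sketch: the function name and parameters are hypothetical, and it ignores the reserved offset 0 and the "successor not yet written" retry that the real reader handles.

```c
#include <assert.h>
#include <stdint.h>

/*
 * offsets[m] is the starting member offset of multi m.  next_multi and
 * next_offset model the shared counters for the next MXID and next member
 * offset to be handed out.
 */
static uint32_t
multi_member_count(const uint32_t *offsets, uint32_t multi,
                   uint32_t next_multi, uint32_t next_offset)
{
    uint32_t start;
    uint32_t end;

    assert(multi < next_multi);         /* multi must already exist */

    start = offsets[multi];
    /* The latest multi has no successor slot; use the global counter. */
    end = (multi + 1 == next_multi) ? next_offset : offsets[multi + 1];

    return end - start;                 /* unsigned, so modulo 2^32 */
}
```

With offsets `{0, 1, 3, 6}`, `next_multi = 4`, and `next_offset = 10`, multi 1 has 2 members, multi 2 has 3, and multi 3 (the latest) has 10 - 6 = 4.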

Per-backend horizons keep the SLRU from being truncated underfoot

Because the members/offsets SLRUs are continuously truncated by VACUUM, a backend that is about to read an old multi must publish a horizon first. Each backend has two shared-memory slots: OldestMemberMXactId[k] (the oldest multi this backend’s transaction could become a member of) and OldestVisibleMXactId[k] (the oldest multi it might inspect). Setting the former before taking any shared lock, and the latter before reading any member array, is what guarantees the data is not truncated away mid-read — the global minimum of these slots is the OldestMulti cutoff VACUUM respects. This is the multixact analogue of the procarray xmin horizon (see postgres-procarray.md).
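
A minimal sketch of how a global cutoff could be derived from such per-backend slots follows. The function names are hypothetical; the wraparound-aware comparison is the usual signed-difference idiom for 32-bit counters, and treating 0 as "nothing published" matches InvalidMultiXactId.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_MULTI 0u        /* slot value meaning "nothing published" */

/* Wraparound-aware "a is older than b" on a 32-bit circular counter. */
static bool
multi_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Scan every published slot and return the oldest one.  If no backend has
 * published anything, the next-to-be-assigned MXID is the horizon.
 */
static uint32_t
oldest_published_multi(const uint32_t *slots, size_t nslots, uint32_t next_multi)
{
    uint32_t oldest = next_multi;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i] != INVALID_MULTI && multi_precedes(slots[i], oldest))
            oldest = slots[i];
    }
    return oldest;
}
```

A caller would pass both the member and visible slot arrays through the same scan; anything older than the result is safe to truncate in this model.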

Creating a multi: MultiXactIdCreate and MultiXactIdExpand

The two entry points the heap calls correspond to the two situations in the policy diagram. MultiXactIdCreate turns two single XIDs into a two-member multi — used when a bare-xmax tuple gains a second, compatible stakeholder:

// MultiXactIdCreate — multixact.c
MultiXactId
MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
                  TransactionId xid2, MultiXactStatus status2)
{
    MultiXactId newMulti;
    MultiXactMember members[2];
    Assert(!TransactionIdEquals(xid1, xid2) || (status1 != status2));
    /* MultiXactIdSetOldestMember() must have been called already. */
    Assert(MultiXactIdIsValid(*MyOldestMemberMXactIdSlot()));
    members[0].xid = xid1; members[0].status = status1;
    members[1].xid = xid2; members[1].status = status2;
    newMulti = MultiXactIdCreateFromMembers(2, members);
    return newMulti;
}

The first Assert enforces the rule that the two claims must differ: the same XID with the same status would be a redundant member. The second Assert is the truncation-safety contract: MultiXactIdSetOldestMember() must have published this backend’s horizon before its XID can be folded into a multi, so VACUUM cannot truncate a range this backend is about to join.

MultiXactIdExpand handles the “already a multi” branch. Critically, it does not mutate the existing multi — it reads the old member array, filters it, appends the new member, and allocates a fresh MXID:

// MultiXactIdExpand — multixact.c (condensed)
MultiXactId
MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
{
    nmembers = GetMultiXactIdMembers(multi, &members, false, false);
    if (nmembers < 0)
    {
        /* all members gone; just make a singleton */
        member.xid = xid; member.status = status;
        return MultiXactIdCreateFromMembers(1, &member);
    }
    /* already a member with the same status? return multi unchanged */
    for (i = 0; i < nmembers; i++)
        if (TransactionIdEquals(members[i].xid, xid) &&
            members[i].status == status)
            return multi;
    /* keep running members, and committed updaters; drop the rest */
    for (i = 0, j = 0; i < nmembers; i++)
    {
        if (TransactionIdIsInProgress(members[i].xid) ||
            (ISUPDATE_from_mxstatus(members[i].status) &&
             TransactionIdDidCommit(members[i].xid)))
        {
            newMembers[j].xid = members[i].xid;
            newMembers[j++].status = members[i].status;
        }
    }
    newMembers[j].xid = xid;
    newMembers[j++].status = status;
    return MultiXactIdCreateFromMembers(j, newMembers);
}

The comment in the source spells out why immutability is mandatory: mutating a multi in place would race against a transaction that is waiting on that multi to finish. By always minting a new MXID, a waiter that captured the old MXID sees a stable, never-growing member set — “a false result [of MultiXactIdIsRunning] is certain not to change, because it is not legal to add members to an existing MultiXactId.” The filtering loop is also not just an optimization: dropping dead lockers is a correctness requirement for freezing, because it bounds how long a member XID can keep an MXID alive.

Assigning the MXID: MultiXactIdCreateFromMembers and GetNewMultiXactId

MultiXactIdCreateFromMembers first checks the backend-local cache (most multis the backend reads are ones it created), enforces the single-updater invariant, then allocates and durably records the multi:

// MultiXactIdCreateFromMembers — multixact.c (condensed)
multi = mXactCacheGetBySet(nmembers, members);
if (MultiXactIdIsValid(multi))
return multi; /* re-use cached identical multi */
/* Verify that there is a single update Xid among the given members. */
for (i = 0; i < nmembers; i++)
if (ISUPDATE_from_mxstatus(members[i].status))
{
if (has_update)
elog(ERROR, "new multixact has more than one updating member: %s", ...);
has_update = true;
}
multi = GetNewMultiXactId(nmembers, &offset); /* enters crit section */
/* WAL the create, then write the SLRU entries */
xlrec.mid = multi; xlrec.moff = offset; xlrec.nmembers = nmembers;
XLogBeginInsert();
XLogRegisterData(&xlrec, SizeOfMultiXactCreate);
XLogRegisterData(members, nmembers * sizeof(MultiXactMember));
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
RecordNewMultiXact(multi, offset, nmembers, members);
END_CRIT_SECTION();
mXactCachePut(multi, nmembers, members);

GetNewMultiXactId is where both the MXID counter and the member-offset counter advance under MultiXactGenLock, and where both wraparound guards fire. Note the careful ordering — file extension (which can fail) happens before the critical section that bumps the counters, exactly as GetNewTransactionId does:

// GetNewMultiXactId — multixact.c (condensed)
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
if (MultiXactState->nextMXact < FirstMultiXactId)
MultiXactState->nextMXact = FirstMultiXactId;
result = MultiXactState->nextMXact;
/* MXID wraparound guard: vac/warn/stop limits, like GetNewTransactionId */
if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit))
{
... if past multiStopLimit: ereport(ERROR, "... to avoid wraparound data loss");
... if past multiWarnLimit: ereport(WARNING, "... must be vacuumed before ...");
... SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
}
/* Reserve offsets-file room for the *next* MXID's start offset */
ExtendMultiXactOffset(result + 1);
nextOffset = MultiXactState->nextOffset;
if (nextOffset == 0) { *offset = 1; nmembers++; } /* never hand out offset 0 */
else *offset = nextOffset;
/* MEMBERS-space wraparound guard (a *separate* 32-bit domain) */
if (MultiXactState->oldestOffsetKnown &&
MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset, nmembers))
ereport(ERROR, (errmsg("multixact \"members\" limit exceeded"), ...));
ExtendMultiXactMember(nextOffset, nmembers);
START_CRIT_SECTION();
(MultiXactState->nextMXact)++;
MultiXactState->nextOffset += nmembers;
LWLockRelease(MultiXactGenLock);

The two wraparound checks here are the crux of the two address spaces. The first (multiVacLimit / multiStopLimit) protects the MXID counter; the second (offsetStopLimit via MultiXactOffsetWouldWrap) protects the members offset counter. A workload with many large multis can exhaust the member space long before the MXID space. Separately, offset zero is reserved as “unset” in the offsets SLRU, which is why nextOffset == 0 triggers the *offset = 1 skip.

Reading a multi back: GetMultiXactIdMembers

Reading is the inverse: cache lookup, publish the visible horizon, validate the MXID against the known range, then read the start offset and the next multi’s start offset to compute the array length:

// GetMultiXactIdMembers — multixact.c (condensed)
length = mXactCacheGetById(multi, members);
if (length >= 0)
return length; /* cache hit */
MultiXactIdSetOldestVisible(); /* pin the truncation horizon */
/* lock-only multis older than our visible horizon cannot be running */
if (isLockOnly && MultiXactIdPrecedes(multi, *MyOldestVisibleMXactIdSlot()))
{ *members = NULL; return -1; }
LWLockAcquire(MultiXactGenLock, LW_SHARED);
oldestMXact = MultiXactState->oldestMultiXactId;
nextMXact = MultiXactState->nextMXact;
nextOffset = MultiXactState->nextOffset;
LWLockRelease(MultiXactGenLock);
if (MultiXactIdPrecedes(multi, oldestMXact))
ereport(ERROR, "MultiXactId %u does no longer exist -- apparent wraparound");
if (!MultiXactIdPrecedes(multi, nextMXact))
ereport(ERROR, "MultiXactId %u has not been created yet -- apparent wraparound");
/* read offsets[multi] and offsets[multi+1]; length is the difference */
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offset = ((MultiXactOffset *) ...page_buffer[slotno])[entryno];
...
if (nextMXact == multi + 1) length = nextOffset - offset; /* corner case 1 */
else length = nextMXOffset - offset;

The two ereport(ERROR, ...) calls are the runtime detectors of MXID wraparound: a multi older than oldestMultiXactId “does no longer exist,” and one not older than nextMXact “has not been created yet.” Both indicate the counter has lapped the live range — the exact disaster the wraparound guards in GetNewMultiXactId exist to prevent. The member loop that follows reads each (xid, status) group, skipping any zero XID (the reserved offset-0 ambiguity), and returns the filtered array.

Member-space wraparound arithmetic: MultiXactOffsetWouldWrap

Because the member offset is a 32-bit counter that wraps at 0xFFFFFFFF, “is X far enough from the boundary?” cannot be a simple comparison: the addition can itself wrap. The helper works on unsigned values, splitting on whether start lies below or above the boundary, with an explicit skip of the reserved offset 0:

// MultiXactOffsetWouldWrap — multixact.c
static bool
MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
                         uint32 distance)
{
    MultiXactOffset finish;
    finish = start + distance;
    if (finish < start)
        finish++; /* skip the reserved offset 0 on overflow */
    if (start < boundary)
        return finish >= boundary || finish < start;
    else
        return finish >= boundary && finish < start;
}
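
A few worked cases make the two branches easier to follow. The sketch below restates the helper with `<stdint.h>` types so it compiles on its own; the function names and the numbers in the examples are illustrative, not taken from a real cluster.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Restatement of the helper above, using portable fixed-width types. */
static bool
offset_would_wrap(uint32_t boundary, uint32_t start, uint32_t distance)
{
    uint32_t finish = start + distance;     /* wraps modulo 2^32 */

    if (finish < start)
        finish++;                           /* step over reserved offset 0 */

    if (start < boundary)
        return finish >= boundary || finish < start;
    else
        return finish >= boundary && finish < start;
}

static void
run_worked_examples(void)
{
    /* start below boundary, range ends before it: safe */
    assert(!offset_would_wrap(1000, 100, 10));
    /* start below boundary, range crosses it: would wrap */
    assert(offset_would_wrap(1000, 990, 20));
    /* start above boundary, no numeric overflow: safe */
    assert(!offset_would_wrap(1000, 2000, 10));
    /* start near 2^32, range overflows and lands past the boundary */
    assert(offset_would_wrap(5, 0xFFFFFFF0u, 0x20));
    /* same start, but the overflowed range stops short of the boundary */
    assert(!offset_would_wrap(5, 0xFFFFFFF0u, 0x10));
}
```

In the fourth case `finish` overflows to 0x10, is bumped to 0x11 to skip offset 0, and then satisfies both `finish >= boundary` and `finish < start`, which is the "overflowed and ran into the boundary" condition.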

The member freeze threshold uses two constants, MULTIXACT_MEMBER_SAFE_THRESHOLD (half the offset space) and MULTIXACT_MEMBER_DANGER_THRESHOLD (three-quarters), to make autovacuum more aggressive as the member space fills, independent of the MXID age:

// MultiXactMemberFreezeThreshold — multixact.c (condensed)
if (!ReadMultiXactCounts(&multixacts, &members))
return 0; /* unknown utilization: assume the worst */
if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
return autovacuum_multixact_freeze_max_age; /* plenty of room */
fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
victim_multixacts = multixacts * fraction;
result = multixacts - victim_multixacts;
return Min(result, autovacuum_multixact_freeze_max_age);

This is why a member-space crunch can force aggressive freezing even when no table is anywhere near autovacuum_multixact_freeze_max_age in MXID terms: the two domains have independent pressure, and autovacuum reacts to whichever is tighter.
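
The scaling can be tried out with a standalone model. This is a sketch under stated assumptions: the counts are passed in as parameters rather than read from shared state, the names are hypothetical, and the early return when the fraction reaches 1.0 is a guard added here to keep the arithmetic in range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_OFFSET 0xFFFFFFFFu
#define SK_SAFE       (SK_MAX_OFFSET / 2)                   /* half the space */
#define SK_DANGER     (SK_MAX_OFFSET - SK_MAX_OFFSET / 4)   /* three-quarters */

static int
member_freeze_threshold(bool counts_known, uint32_t multixacts,
                        uint32_t members, int freeze_max_age)
{
    double   fraction;
    uint32_t victims;
    uint32_t result;

    assert(freeze_max_age >= 0);

    if (!counts_known)
        return 0;                       /* unknown utilization: assume the worst */
    if (members <= SK_SAFE)
        return freeze_max_age;          /* plenty of room */

    fraction = (double) (members - SK_SAFE) / (double) (SK_DANGER - SK_SAFE);
    if (fraction >= 1.0)
        return 0;                       /* sketch-only guard: at or past danger */

    victims = (uint32_t) ((double) multixacts * fraction);
    result = multixacts - victims;
    return result < (uint32_t) freeze_max_age ? (int) result : freeze_max_age;
}
```

At the safe threshold the function returns the configured maximum age unchanged; at the danger threshold it returns 0, meaning every multi is a freeze candidate; in between, the returned age shrinks linearly with member-space usage.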

Setting the wraparound limits: SetMultiXactIdLimit

The vacuum/warn/stop/wrap limits are derived from the oldest datminmxid across all databases, mirroring SetTransactionIdLimit:

// SetMultiXactIdLimit — multixact.c (condensed)
multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1); /* "half the space" */
multiStopLimit = multiWrapLimit - 3000000; /* refuse new MXIDs */
multiWarnLimit = multiWrapLimit - 40000000; /* loud warnings */
multiVacLimit = oldest_datminmxid + autovacuum_multixact_freeze_max_age;
LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
MultiXactState->oldestMultiXactId = oldest_datminmxid;
MultiXactState->multiVacLimit = multiVacLimit;
MultiXactState->multiWarnLimit = multiWarnLimit;
MultiXactState->multiStopLimit = multiStopLimit;
MultiXactState->multiWrapLimit = multiWrapLimit;
LWLockRelease(MultiXactGenLock);
/* Members have their *own* limits, computed separately */
needs_offset_vacuum = SetOffsetVacuumLimit(is_startup);

The “half the space” comment in the source is deliberately a fiction, since multis wrap differently from XIDs, but it gives a comfortable buffer. On multi-heavy workloads the member-space limits (SetOffsetVacuumLimit) can be the ones that fire first.
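
The limit arithmetic is easy to check with concrete numbers. The sketch below is a simplified model: the struct and function names are hypothetical, and it deliberately leaves out any adjustment for a limit that happens to land on the reserved value 0.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
    uint32_t vac;               /* start forcing autovacuum */
    uint32_t warn;              /* start emitting warnings */
    uint32_t stop;              /* refuse to assign new MXIDs */
    uint32_t wrap;              /* nominal wraparound point */
} MultiLimits;

static MultiLimits
compute_multi_limits(uint32_t oldest_datminmxid, uint32_t freeze_max_age)
{
    MultiLimits l;

    assert(oldest_datminmxid != 0);     /* 0 is the invalid MXID */

    /* All additions and subtractions are modulo 2^32 on purpose. */
    l.wrap = oldest_datminmxid + (0xFFFFFFFFu >> 1);
    l.stop = l.wrap - 3000000u;
    l.warn = l.wrap - 40000000u;
    l.vac = oldest_datminmxid + freeze_max_age;
    return l;
}
```

For a fresh cluster with `oldest_datminmxid = 1` and a freeze max age of 400 million, this yields `vac = 400000001`, `warn = 2107483648`, `stop = 2144483648`, and `wrap = 2147483648`, so the limits are crossed in the order vac, warn, stop, wrap as the counter advances.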

Freezing a multi: FreezeMultiXactId (heapam.c)

VACUUM cannot just leave an old MXID in a tuple forever: eventually the MXID or its members fall before the freeze cutoffs and must be removed. The heap’s FreezeMultiXactId is the multixact garbage collector. It reports its decision through the FRM_* flags; the four dispositions discussed here are:

// heapam.c — FRM_* freeze dispositions
#define FRM_NOOP 0x0001 /* keep the multi as-is */
#define FRM_INVALIDATE_XMAX 0x0002 /* drop xmax entirely */
#define FRM_RETURN_IS_XID 0x0004 /* replace multi with a single XID */
#define FRM_RETURN_IS_MULTI 0x0008 /* replace with a new, smaller multi */

The decision tree, condensed:

// FreezeMultiXactId — heapam.c (condensed)
if (!MultiXactIdIsValid(multi) || HEAP_LOCKED_UPGRADED(t_infomask))
{ *flags |= FRM_INVALIDATE_XMAX; return InvalidTransactionId; }
if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
ereport(ERROR, "found multixact %u from before relminmxid %u", ...);
if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
{
/* this old multi cannot have running members; verify, then resolve */
if (MultiXactIdIsRunning(multi, HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
ereport(ERROR, "multixact %u ... found to be still running", ...);
if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
{ *flags |= FRM_INVALIDATE_XMAX; return InvalidTransactionId; } /* lockers only */
update_xact = MultiXactIdGetUpdateXid(multi, t_infomask);
... if updater aborted: FRM_INVALIDATE_XMAX
... else: *flags |= FRM_RETURN_IS_XID; return update_xact;
}
/* multi is >= OldestMxact: maybe keep it. Walk members. */
nmembers = GetMultiXactIdMembers(multi, &members, false, HEAP_XMAX_IS_LOCKED_ONLY(...));
need_replace = false;
for (i = 0; i < nmembers; i++)
if (TransactionIdPrecedes(members[i].xid, cutoffs->FreezeLimit)) { need_replace = true; break; }
if (!need_replace)
need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
if (!need_replace) { *flags |= FRM_NOOP; return multi; } /* keep it */
/* second pass: keep only running lockers + the live updater, build new multi */

The structure makes the lock/update distinction concrete. A multi older than OldestMxact that is lock-only is simply dropped (FRM_INVALIDATE_XMAX): its lockers are gone, and the locks meant nothing once their holders ended. A multi that contains a committed updater must keep that updater’s XID as the new xmax (FRM_RETURN_IS_XID), because the update chain still depends on it. A multi that still has some live members but also some below the freeze cutoff gets rebuilt smaller (FRM_RETURN_IS_MULTI), keeping only running lockers and the live updater, which is the freezing-time counterpart of MultiXactIdExpand’s filtering. The ereport(ERROR, ...) for a multi “from before relminmxid” is the corruption tripwire: it fires when a tuple still carries a multi that freezing should already have removed. (The XID-side of these cutoffs, relfrozenxid, OldestXmin, and FreezeLimit, is owned by postgres-xid-wraparound-freeze.md.)

Truncation: TruncateMultiXact and the two-SLRU dance

When the global oldest multi advances, both SLRUs are truncated. The members SLRU cannot be truncated by page number alone (it can be nearly full across the whole range at once), so the to-be-deleted member range is derived from the offsets SLRU: find_multixact_start(newOldestMulti) gives the member offset boundary, and PerformMembersTruncation deletes whole segments below it while PerformOffsetsTruncation truncates the offsets SLRU by page. The page-precedes callbacks (MultiXactOffsetPagePrecedes, MultiXactMemberPagePrecedes) teach SimpleLruTruncate how “older” is defined in each wrapping address space. Truncation is WAL-logged (XLOG_MULTIXACT_TRUNCATE_ID) so standbys replay it via multixact_redo.

flowchart TD
  V["VACUUM advances datminmxid<br/>vac_truncate_clog calls TruncateMultiXact"] --> S["find_multixact_start(newOldestMulti)<br/>read offsets SLRU -> newOldestOffset"]
  S --> M["PerformMembersTruncation<br/>SlruDeleteSegment for member segments<br/>below newOldestOffset"]
  S --> O["PerformOffsetsTruncation<br/>SimpleLruTruncate(offsets, page of newOldest-1)"]
  M --> W["WriteMTruncateXlogRec<br/>XLOG_MULTIXACT_TRUNCATE_ID"]
  O --> W
  W --> R["standby: multixact_redo replays truncation"]

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| MultiXactStatus (enum) | src/include/access/multixact.h | 37 |
| ISUPDATE_from_mxstatus | src/include/access/multixact.h | 52 |
| MultiXactMember (struct) | src/include/access/multixact.h | 56 |
| FirstMultiXactId / MaxMultiXactId | src/include/access/multixact.h | 25 |
| MULTIXACT_OFFSETS_PER_PAGE | src/backend/access/transam/multixact.c | 110 |
| MultiXactIdToOffsetPage | src/backend/access/transam/multixact.c | 113 |
| MULTIXACT_MEMBERGROUP_SIZE | src/backend/access/transam/multixact.c | 152 |
| MXOffsetToMemberOffset | src/backend/access/transam/multixact.c | 206 |
| MULTIXACT_MEMBER_SAFE_THRESHOLD | src/backend/access/transam/multixact.c | 216 |
| MultiXactStateData (struct) | src/backend/access/transam/multixact.c | 242 |
| MultiXactIdCreate | src/backend/access/transam/multixact.c | 478 |
| MultiXactIdExpand | src/backend/access/transam/multixact.c | 531 |
| MultiXactIdIsRunning | src/backend/access/transam/multixact.c | 643 |
| MultiXactIdSetOldestMember | src/backend/access/transam/multixact.c | 717 |
| MultiXactIdSetOldestVisible | src/backend/access/transam/multixact.c | 774 |
| MultiXactIdCreateFromMembers | src/backend/access/transam/multixact.c | 859 |
| RecordNewMultiXact | src/backend/access/transam/multixact.c | 960 |
| GetNewMultiXactId | src/backend/access/transam/multixact.c | 1201 |
| GetMultiXactIdMembers | src/backend/access/transam/multixact.c | 1470 |
| mxstatus_to_string | src/backend/access/transam/multixact.c | 1904 |
| SetMultiXactIdLimit | src/backend/access/transam/multixact.c | 2530 |
| ExtendMultiXactOffset | src/backend/access/transam/multixact.c | 2721 |
| ExtendMultiXactMember | src/backend/access/transam/multixact.c | 2753 |
| MultiXactOffsetWouldWrap | src/backend/access/transam/multixact.c | 3012 |
| find_multixact_start | src/backend/access/transam/multixact.c | 3060 |
| MultiXactMemberFreezeThreshold | src/backend/access/transam/multixact.c | 3150 |
| PerformMembersTruncation | src/backend/access/transam/multixact.c | 3219 |
| TruncateMultiXact | src/backend/access/transam/multixact.c | 3274 |
| MultiXactOffsetPagePrecedes | src/backend/access/transam/multixact.c | 3466 |
| MultiXactIdPrecedes | src/backend/access/transam/multixact.c | 3506 |
| multixact_redo | src/backend/access/transam/multixact.c | 3583 |
| tupleLockExtraInfo | src/backend/access/heap/heapam.c | 132 |
| get_mxact_status_for_lock | src/backend/access/heap/heapam.c | 4527 |
| FreezeMultiXactId | src/backend/access/heap/heapam.c | 6713 |
| MultiXactIdGetUpdateXid | src/backend/access/heap/heapam.c | 7536 |
| HEAP_XMAX_IS_MULTI | src/include/access/htup_details.h | 209 |

  • Two SLRUs, offsets→members indirection. Confirmed in the file header comment and the MultiXactIdToOffsetPage / MXOffsetToMemberPage macro families. The offsets SLRU is a fixed-width MultiXactOffset[] indexed by MXID; the members SLRU stores variable-length arrays located by those offsets. Member-array length is computed as the difference between consecutive offsets, not stored — verified in GetMultiXactIdMembers (“corner case 1” handling for the latest multi).
  • 5-word (20-byte) member groups, 409 per 8 kB page. The packing of 4 flag bytes + 4 xids is in the members layout comment and the MULTIXACT_MEMBERGROUP_SIZE / MULTIXACT_MEMBERGROUPS_PER_PAGE macros. With BLCKSZ=8192, 8192 / 20 = 409 groups, wasting 12 bytes/page, as the comment states.
  • Six member statuses, ordering load-bearing. MultiXactStatus has exactly the six values shown (ForKeyShare=0 … Update=5), and ISUPDATE_from_mxstatus keys on status > MultiXactStatusForUpdate (0x03). Verified in multixact.h.
  • Multis are immutable; expand creates a new MXID. MultiXactIdExpand calls MultiXactIdCreateFromMembers for every non-trivial outcome and never writes back into the passed multi. The source comment explains the race with waiters. Verified.
  • Single-updater invariant. MultiXactIdCreateFromMembers elog(ERROR, "new multixact has more than one updating member") if two members satisfy ISUPDATE_from_mxstatus. Verified.
  • Two independent wraparound domains. GetNewMultiXactId advances both nextMXact and nextOffset under MultiXactGenLock, and runs two distinct guard blocks — the MXID guard (multiVacLimit/multiStopLimit) and the member guard (offsetStopLimit via MultiXactOffsetWouldWrap). Verified.
  • Offset 0 is reserved. GetNewMultiXactId does if (nextOffset == 0) { *offset = 1; nmembers++; }, and GetMultiXactIdMembers ignores a zero member read. Verified in both functions.
  • Lock-mode → status mapping lives in heapam. tupleLockExtraInfo maps the four LockTupleMode strengths to (hwlock, lockstatus, updstatus), with -1 updstatus for the two share modes. get_mxact_status_for_lock is the accessor. Verified.
  • FreezeMultiXactId returns four FRM_* dispositions. FRM_NOOP, FRM_INVALIDATE_XMAX, FRM_RETURN_IS_XID, and FRM_RETURN_IS_MULTI are defined at heapam.c:6660–6663; the decision tree distinguishes lock-only (invalidate) from committed-updater (return XID). Verified.
  • Truncation derives member range from offsets SLRU. TruncateMultiXact → find_multixact_start → PerformMembersTruncation (segment deletes) + PerformOffsetsTruncation (SimpleLruTruncate), all WAL-logged via WriteMTruncateXlogRec (XLOG_MULTIXACT_TRUNCATE_ID). Verified.
  • The backend-local MXactCache comment carries an explicit FIXME noting the per-transaction lifetime is “plain wrong now that multixact’s may contain update Xids” — the cache can outlive the relevance of an entry whose other members persist. This is a known, tolerated imperfection, not a bug with an observable correctness impact (the cache is consulted only as a fast path; misses fall through to the SLRU).
  • The “half the space” wrap-limit comment in SetMultiXactIdLimit is an acknowledged approximation; the real first-to-fire constraint on multi-heavy workloads is the member-space limit from SetOffsetVacuumLimit, not the MXID limit. The exact interaction between the two limit sets under extreme member pressure is subtle and best understood by reading SetOffsetVacuumLimit alongside MultiXactMemberFreezeThreshold.
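
The member-group arithmetic and the status ordering from the list above can be checked with a short sketch. It is a model under stated assumptions (BLCKSZ of 8192, 4-byte XIDs, four members per group), with hypothetical names throughout; the real macros and enum live in multixact.c and multixact.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout constants, assuming 8 kB blocks and 4-byte transaction IDs. */
enum {
    BLOCK_SIZE       = 8192,
    FLAG_BYTES       = 4,    /* one flag byte per member in the group */
    XIDS_PER_GROUP   = 4,
    XID_BYTES        = 4,
    GROUP_SIZE       = FLAG_BYTES + XIDS_PER_GROUP * XID_BYTES,   /* 20 */
    GROUPS_PER_PAGE  = BLOCK_SIZE / GROUP_SIZE,                   /* 409 */
    MEMBERS_PER_PAGE = GROUPS_PER_PAGE * XIDS_PER_GROUP,          /* 1636 */
    WASTED_BYTES     = BLOCK_SIZE - GROUPS_PER_PAGE * GROUP_SIZE  /* 12 */
};

/* Page of the members SLRU that holds a given member offset. */
static inline uint32_t member_page(uint32_t offset)
{
    return offset / MEMBERS_PER_PAGE;
}

/* Byte position, within its page, of the flag word for the member's group. */
static inline uint32_t flags_byte_offset(uint32_t offset)
{
    return ((offset / XIDS_PER_GROUP) % GROUPS_PER_PAGE) * GROUP_SIZE;
}

/* Byte position, within its page, of the member's XID slot. */
static inline uint32_t xid_byte_offset(uint32_t offset)
{
    return flags_byte_offset(offset) + FLAG_BYTES
           + (offset % XIDS_PER_GROUP) * XID_BYTES;
}

/* The six statuses in their load-bearing order (names are illustrative). */
typedef enum {
    ST_FOR_KEY_SHARE = 0, ST_FOR_SHARE = 1, ST_FOR_NO_KEY_UPDATE = 2,
    ST_FOR_UPDATE = 3, ST_NO_KEY_UPDATE = 4, ST_UPDATE = 5
} MemberStatus;

/* Anything ordered above the strongest lock status counts as an update. */
static inline bool is_update(MemberStatus s)
{
    return s > ST_FOR_UPDATE;
}

/* Single-updater invariant: a member set may hold at most one updater. */
static inline bool at_most_one_updater(const MemberStatus *m, size_t n)
{
    size_t updaters = 0;
    for (size_t i = 0; i < n; i++)
        if (is_update(m[i]))
            updaters++;
    return updaters <= 1;
}
```

Because `is_update` is a single comparison, the numeric order of the statuses is part of the contract: inserting a new lock-only status after the update statuses would silently misclassify it.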

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • Oracle’s Interested Transaction List (ITL) vs. the members SLRU. The ITL is the closest mainstream cousin to MultiXact: both record a set of transactions interested in a row, and both let a locker coexist with an updater. The instructive difference is placement. Oracle keeps the list in the data block header (one ITL array per block, sized by INITRANS / MAXTRANS), so concurrent lockers compete for finite, in-block slots, and when the ITL is exhausted sessions queue on ITL-allocation waits that can surface as ORA-00060 deadlocks. PostgreSQL keeps the list in a global members SLRU, decoupling list size from heap-page space (a multi can grow without consuming row room) at the cost of an SLRU lookup, a second wraparound domain, and a VACUUM-time garbage-collection obligation (FreezeMultiXactId) that Oracle’s block-local, undo-reclaimed scheme does not have. A side-by-side of ITL slot contention against members-offset exhaustion would frame the core trade: locality and bounded space (Oracle) versus unbounded per-row stakeholders and amortized SLRU traffic (PostgreSQL).

  • Lock-table-only engines (DB2, SQL Server) and lock escalation. In a classic 2PL engine the row lock is a transient shared-memory hash entry with no on-row footprint, so there is no representational ceiling on concurrent lockers and no freezing problem — but accumulated row locks trigger escalation to page/table granularity, trading concurrency for bounded memory. PostgreSQL’s MultiXact is the dual design: it never escalates (the durable on-row MXID has no memory-pressure cliff) but it does pay a wraparound and truncation tax that an in-memory lock table sidesteps entirely. The comparison is a clean illustration of “where does lock state live” being the load-bearing design axis the Theoretical Background opened with. See postgres-lock-manager.md for PostgreSQL’s own in-memory heavyweight lock table, which the tuple-lock path uses transiently in addition to the durable MXID.

  • Hekaton / in-memory MVCC and the absence of durable row locks. Microsoft’s Hekaton (Larson et al., “High-Performance Concurrency Control Mechanisms for Main-Memory Databases,” VLDB 2011) resolves write-write and lock conflicts through optimistic, version-chain validation with begin/end timestamps stamped directly on versions — there is no on-row locker set at all, because conflicts are detected at validation time rather than recorded durably. MultiXact occupies the opposite corner of the design space: pessimistic, durable, on-row encoding of who holds what, justified by PostgreSQL’s disk-oriented, crash-recoverable heap where a reader arriving after a crash must reconstruct lock state from the tuple alone. Positioning the six-valued MultiXactStatus against Hekaton’s timestamp-only versions sharpens what durability of the lock itself (not just the data) costs.

  • The two-counter wraparound problem as a research artifact. The member-offset space is a second 32-bit wraparound domain layered on the MXID domain, and historically (the 9.3-era SLRU bugs that motivated SetOffsetVacuumLimit and MultiXactMemberFreezeThreshold) it has been the one that bites first under FK-heavy workloads. A focused study of how autovacuum_multixact_freeze_max_age interacts with the member-fraction back-pressure curve — and whether a 64-bit member offset (mirroring the 64-bit-XID proposals tracked in postgres-xid-wraparound-freeze.md) would retire the whole second domain — is the natural research frontier here.

  • Generalizing per-member status beyond lock/update. The MultiXactMember flag byte currently encodes exactly six statuses. Nothing in the SLRU layout forbids richer per-member metadata (e.g., distinguishing predicate-lock intent, or carrying a sub-transaction id), and the immutable-multi discipline means any such extension is purely additive at create time. Whether the member vocabulary should grow — versus pushing richer intent into the separate predicate-lock machinery (postgres-ssi-predicate-locking.md) — is an open design question the current single-updater invariant keeps narrowly scoped.

In-tree source files (REL_18_STABLE, commit 273fe94)

  • src/backend/access/transam/multixact.c — the subsystem proper: the offsets/members SLRU layout macros (MultiXactIdToOffsetPage, MXOffsetToMemberPage, the 5-word member-group constants), the create path (MultiXactIdCreate, MultiXactIdExpand, MultiXactIdCreateFromMembers, GetNewMultiXactId, RecordNewMultiXact), the read path (GetMultiXactIdMembers), the per-backend horizons (MultiXactIdSetOldestMember / SetOldestVisible), the dual wraparound guards (MultiXactOffsetWouldWrap, SetMultiXactIdLimit, SetOffsetVacuumLimit, MultiXactMemberFreezeThreshold), truncation (TruncateMultiXact, find_multixact_start, PerformMembersTruncation, PerformOffsetsTruncation), and WAL replay (multixact_redo).
  • src/include/access/multixact.h: MultiXactStatus (the six member statuses), ISUPDATE_from_mxstatus, MultiXactMember, the reserved MXID values (InvalidMultiXactId, FirstMultiXactId, MaxMultiXactId, MaxMultiXactOffset).
  • src/backend/access/heap/heapam.c — the policy layer that owns the lock-mode vocabulary and the freeze collector: tupleLockExtraInfo (LockTupleMode → member status), get_mxact_status_for_lock, MultiXactIdGetUpdateXid, and FreezeMultiXactId with its four FRM_* dispositions.
  • src/include/access/htup_details.h: HEAP_XMAX_IS_MULTI and the HEAP_XMAX_IS_LOCKED_ONLY / HEAP_LOCKED_UPGRADED infomask predicates that gate whether an xmax is read as a bare XID or an MXID.
  • src/backend/access/heap/README.tuplock — the design narrative for tuple locking that motivates why a multi is needed at all (compatible share locks, locker-plus-updater coexistence, the infomask encoding).
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 18 “Concurrency Control” — shared/exclusive lock modes, the compatibility matrix, and 2PL, the textbook frame the Theoretical Background builds on (knowledge/research/dbms-general/).
  • Database Internals (Petrov 2019), ch. 5 “Transaction Processing and Recovery” — lock-manager structure and where lock state lives, the axis the Common DBMS Design section turns on.
  • Larson, P.-Å. et al. (2011). “High-Performance Concurrency Control Mechanisms for Main-Memory Databases.” VLDB 4(5):298-309. The Hekaton optimistic-MVCC contrast in Beyond PostgreSQL (timestamp-on-version vs. durable on-row locker set).
  • Oracle Database Concepts — the Interested Transaction List (ITL), INITRANS/MAXTRANS, and block-level row-lock storage, the comparative anchor for the members-SLRU-vs-block-header trade.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

  • postgres-slru.md — the SimpleLruReadPage / SimpleLruTruncate / SlruDeleteSegment buffer machinery that the offsets and members SLRUs are instances of; this doc treats the SLRU as a given.
  • postgres-lock-manager.md: the in-memory heavyweight lock table and the LOCKMODE vocabulary (AccessShareLock through AccessExclusiveLock) that tupleLockExtraInfo’s first column feeds; the tuple-lock path takes one of these transiently alongside the durable MXID.
  • postgres-xid-wraparound-freeze.md — the XID side of freezing (relfrozenxid, OldestXmin, FreezeLimit, vac_update_datfrozenxid) that runs in lockstep with the MXID side (relminmxid, OldestMxact, MultiXactCutoff) inside the same VACUUM pass.
  • postgres-heap-am.md / postgres-mvcc-snapshots.md: heap_lock_tuple, heap_update, and the xmax/infomask visibility rules that decide when an MXID is allocated and how a later reader interprets it.
  • postgres-procarray.md — the xmin-horizon analogue to the OldestMemberMXactId / OldestVisibleMXactId per-backend horizons that keep the SLRUs from being truncated underfoot.
  • postgres-clog-commit-ts.md: the other SLRUs, which use a single storage area rather than multixact’s two; their page arithmetic and truncation patterns parallel multixact’s and are referenced when the offsets/members split is introduced.