PostgreSQL SSI and Predicate Locking — Serializable Snapshot Isolation, SIREAD Locks, and the rw-conflict Graph
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Snapshot isolation (SI) gives each transaction a consistent view of the database as of a point in time, so readers never block writers and writers never block readers. Database System Concepts (Silberschatz, 7e, §19.6) calls SI an optimistic scheme: because it is based on MVCC, the usual read/write contention disappears. However, SI does not guarantee full serializability. The canonical counter-example is the write skew anomaly: two concurrent transactions each read overlapping data and each write to a disjoint part, yet the combined result is inconsistent with any serial execution.
The theoretical fix was published in Making Snapshot Isolation Serializable (Fekete et al., ACM TODS 2005, [2] in README-SSI). The key observation is that every SI anomaly contains a dangerous structure in the dependency graph: two adjacent rw-conflict edges forming the shape
Tin ──rw──► Tpivot ──rw──► Toutwhere Tout must commit before Tpivot and Tin. No other dependency
type (wr- or ww-) is needed; rw-conflicts alone are sufficient to detect
all anomalies without false-negatives. An algorithm that watches for this
pattern and aborts one transaction when it appears can run SI underneath
and still deliver serializability — Serializable Snapshot Isolation
(SSI).
Cahill, Röhm, and Fekete (SIGMOD 2008, [1] in README-SSI) gave the first practical SSI algorithm and demonstrated that its overhead over plain SI is small. PostgreSQL adopted SSI in version 9.1. The implementation deviates from the paper in several ways driven by PostgreSQL’s existing architecture (see the “Innovations” section of README-SSI), but the core graph-monitoring idea is unchanged.
From the multi-version serialization graph to the dangerous structure
Section titled “From the multi-version serialization graph to the dangerous structure”To make the “dangerous structure” claim precise it helps to recall the multi-version serialization graph (MVSG) that the SSI literature builds on. Nodes are committed transactions; edges encode the apparent execution order. README-SSI distinguishes three edge types, following Fekete et al.:
- wr-dependency —
T1 ─wr─► T2:T2reads a version thatT1wrote.T1must have committed beforeT2took its snapshot, so the edge is always consistent with commit order. - ww-dependency —
T1 ─ww─► T2:T2overwrites a versionT1wrote. AgainT1committed first. - rw-conflict (rw-antidependency) —
T1 ─rw─► T2:T1reads a version thatT2overwrites (orT2’s write would have been visible toT1’s read predicate had it landed first). Critically, this edge can point backwards in time:T1appears to execute first because it saw a state prior toT2’s write, even ifT1actually committed afterT2.
A schedule is serializable iff its MVSG is acyclic. Under snapshot
isolation the only edges that can run counter to real commit order are the
rw-antidependencies, so they are the only edges capable of closing a cycle
that did not exist under a serial schedule. This is the engine behind
write skew: two transactions each read the other’s “before” state and
write disjoint rows, producing the cycle T1 ─rw─► T2 ─rw─► T1.
Fekete, Liarokapis, O’Neil, O’Neil & Shasha (Making Snapshot Isolation
Serializable, ACM TODS 30:2, 2005 — [2] in README-SSI) proved the
sharpest structural result: every cycle in the MVSG of an SI schedule
contains two rw-antidependency edges that are consecutive — i.e. a
Tpivot with one rw-edge in and one rw-edge out. The transaction in the
middle is the pivot. Cahill et al.’s SIGMOD 2008 contribution was to turn
this theorem into a cheap runtime test: instead of materializing the whole
graph and searching for cycles (expensive), track only rw-antidependencies
incrementally and fire whenever a pivot acquires both an inbound and an
outbound rw-edge. Because not every such pivot lies on an actual cycle, the
test admits false positives but never false negatives — it is safe
(serializable) and conservative (may over-abort). README-SSI states this
tradeoff explicitly: “The number of false positives is low in practice, so
this represents an acceptable tradeoff for keeping the detection overhead
low.”
The commit-order refinement (Theorem 2.1)
Section titled “The commit-order refinement (Theorem 2.1)”The bare pivot test would abort too eagerly. Fekete et al.’s Theorem 2.1
adds the timing constraint PostgreSQL exploits: in any dangerous structure
Tin ─rw─► Tpivot ─rw─► Tout that lies on a real cycle, Tout must be
the first of the three to commit. PostgreSQL therefore refuses to abort
anyone until the out-side partner has actually committed (the
SXACT_FLAG_DOOMED/commit-time check in
OnConflict_CheckForSerializationFailure). This single refinement removes
a large class of false positives at essentially zero cost, because the
commit order is already tracked via SerCommitSeqNo.
Berenson et al. and why “ANSI SERIALIZABLE” is not enough
Section titled “Berenson et al. and why “ANSI SERIALIZABLE” is not enough”The reason PostgreSQL needed SSI at all traces to Berenson, Bernstein,
Gray, Melton, O’Neil & O’Neil, A Critique of ANSI SQL Isolation Levels
(SIGMOD 1995). That paper showed the ANSI SQL-92 isolation definitions —
phrased as “phenomena” P1/P2/P3 (dirty read, non-repeatable read, phantom)
— are ambiguous and, crucially, do not characterize snapshot isolation.
SI forbids all three ANSI phenomena yet still permits write skew (A5B) and
read-only-transaction anomalies, so a vendor could legally label SI as
“SERIALIZABLE” while violating true serializability. PostgreSQL did exactly
this before 9.1: requesting SERIALIZABLE silently gave snapshot
isolation. SSI closed that gap; REPEATABLE READ now retains the legacy SI
behavior (which is itself stronger than the ANSI definition of REPEATABLE
READ — PostgreSQL has no dirty reads and no phantoms at that level), while
SERIALIZABLE adds the rw-antidependency monitor on top.
The read-only-transaction optimization (an original contribution)
Section titled “The read-only-transaction optimization (an original contribution)”README-SSI’s SSI-algorithm section records a refinement that is not in
the Cahill/Fekete papers: if Tin is read-only, an anomaly is possible
only if Tout committed before Tin took its snapshot. The proof
(reproduced in README-SSI) observes that the edge entering a read-only
Tin on the cycle cannot be an rw- or ww-edge — a read-only transaction
writes nothing — so it must be a wr-dependency, which forces the
predecessor to have committed before Tin’s snapshot. Combined with the
“Tout commits first” rule, Tout must precede Tin’s snapshot. This is
what makes the DEFERRABLE READ ONLY safe snapshot sound (see the
PostgreSQL section) and lets ordinary read-only transactions skip much of
the conflict bookkeeping. Ports & Grittner, Serializable Snapshot
Isolation in PostgreSQL (VLDB 2012), is the authoritative write-up of
these engineering refinements and the measured overheads; it is the paper
to cite for the PostgreSQL-specific deviations.
Predicate locking is a prerequisite. To detect an rw-conflict when the read happens before the write, the system must record what each transaction has read — not just which tuples, but the “predicate” covering the set of tuples the scan would return. A later write that falls inside that predicate is then a potential rw-conflict. Classical predicate locks are hard to implement exactly; PostgreSQL uses physical-object locks (relation, page, tuple) as conservative approximations, accepting some false-positive rw-conflicts in exchange for simplicity. Hellerstein, Stonebraker & Hamilton’s Architecture of a Database System (§6.5.3, “Next-Key Locking: Physical Surrogates for Logical Properties” — [3] in README-SSI) is the canonical reference for this physical-surrogate trick; the SIREAD lock is its non-blocking cousin.
Common DBMS Design
Section titled “Common DBMS Design”SSI as described in the Cahill papers is a general framework, not a PostgreSQL-specific design. Several engineering conventions appear in any MVCC engine that implements it.
A per-transaction conflict record
Section titled “A per-transaction conflict record”Each serializable transaction needs bookkeeping beyond a snapshot: a list of its outgoing rw-conflicts (transactions it read from, whose later writes created a conflict) and its incoming rw-conflicts (transactions that read data it later wrote). A dangerous structure exists when a transaction has both a conflict in and a conflict out where the out-side transaction has already committed.
Lock granularity escalation
Section titled “Lock granularity escalation”Tracking every individual tuple read in RAM is not feasible at scale. The standard approach (used in PostgreSQL, IBM DB2’s next-key locking, and the paper prototypes) is a multi-granularity lock hierarchy: try to lock the tuple, but if too many tuple locks accumulate, promote them to a single page lock; if page locks proliferate, promote to a relation lock. Coarser locks cause more false-positive conflicts but guarantee bounded memory use.
SIREAD locks vs. heavyweight locks
Section titled “SIREAD locks vs. heavyweight locks”Predicate locks for SSI are fundamentally different from heavyweight (S2PL-style) locks: they never block. A SIREAD lock is closer to a flag — it records that a transaction accessed an object, so that a later writer on the same object can detect an rw-conflict. No queue, no wait, no deadlock. The structures and code paths are necessarily separate from the regular lock manager.
Committed-transaction state lifetime
Section titled “Committed-transaction state lifetime”A transaction’s conflict information must outlive the transaction itself:
a committed transaction may still be part of a dangerous structure visible
to concurrent serializable transactions that have not yet committed. Every
engine needs a policy for when to reclaim this state. The standard
approach is to keep state for all transactions whose xmin is no older
than the oldest live serializable transaction’s snapshot.
DEFERRABLE READ ONLY optimization
Section titled “DEFERRABLE READ ONLY optimization”A read-only transaction can never cause a write-skew anomaly on its own.
If it also starts after all concurrent read-write transactions have
committed without outgoing conflicts, it is on a safe snapshot and
can opt out of predicate lock tracking entirely. Delaying the start of a
READ ONLY DEFERRABLE transaction until such a safe snapshot is available
costs latency but eliminates SSI overhead for that transaction.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s SSI implementation lives almost entirely in
storage/lmgr/predicate.c (~4 700 lines). The README-SSI in that
directory is the authoritative design document; what follows distils the
key architectural choices.
Shared-memory layout
Section titled “Shared-memory layout”Six shared-memory objects underpin SSI:
flowchart TD PX["PredXactList\n(PredXact)\nactiveList + availableList\nSxactGlobalXmin, WritableSxactCount\nLastSxactCommitSeqNo"] SX["SERIALIZABLEXACT[]\n(pool allocated)\nvxid, topXid, xmin, flags\noutConflicts, inConflicts\npredicateLocks"] SXH["SerializableXidHash\nxid -> SERIALIZABLEXACT*"] PLT["PredicateLockTargetHash\nPREDICATELOCKTARGETTAG -> PREDICATELOCKTARGET\n(partitioned by target hash)"] PL["PredicateLockHash\n(PREDICATELOCKTARGET*, SERIALIZABLEXACT*) -> PREDICATELOCK"] SLRU["SerialSlruCtl\nold committed sxact state\nxid -> SerCommitSeqNo"] PX -- "pool" --> SX SX -- "lookup" --> SXH SX -- "lock list" --> PL PLT -- "anchor" --> PL PX -- "overflow" --> SLRU
PredXactList is a single shared struct (PredXact) holding global
counters and two free/active lists of SERIALIZABLEXACT slots.
SerializableXidHash maps a transaction’s top-level XID to its
SERIALIZABLEXACT so that writers can locate a reader’s record quickly.
The predicate lock tables are partitioned (like the heavyweight lock
table) to reduce contention — NUM_PREDICATELOCK_PARTITIONS partitions,
each guarded by its own LWLock.
SERIALIZABLEXACT — the per-transaction SSI record
Section titled “SERIALIZABLEXACT — the per-transaction SSI record”// SERIALIZABLEXACT — src/include/storage/predicate_internals.htypedef struct SERIALIZABLEXACT{ VirtualTransactionId vxid; SerCommitSeqNo prepareSeqNo; SerCommitSeqNo commitSeqNo; union { SerCommitSeqNo earliestOutConflictCommit; SerCommitSeqNo lastCommitBeforeSnapshot; } SeqNo; dlist_head outConflicts; /* rw-conflicts where this txn is the reader */ dlist_head inConflicts; /* rw-conflicts where this txn is the writer */ dlist_head predicateLocks; /* SIREAD locks held */ dlist_head possibleUnsafeConflicts; /* for READ ONLY safe-snapshot tracking */ TransactionId topXid; TransactionId finishedBefore; TransactionId xmin; uint32 flags; /* SXACT_FLAG_* bitmask */ int pid; int pgprocno;} SERIALIZABLEXACT;Key flags are SXACT_FLAG_COMMITTED, SXACT_FLAG_DOOMED (will abort
at next check), SXACT_FLAG_READ_ONLY, SXACT_FLAG_RO_SAFE (eligible
for early lock release), and SXACT_FLAG_CONFLICT_OUT (has a conflict
out to a transaction that committed ahead of it).
SIREAD lock structures
Section titled “SIREAD lock structures”// PREDICATELOCKTARGET — src/include/storage/predicate_internals.htypedef struct PREDICATELOCKTARGET { PREDICATELOCKTARGETTAG tag; /* db + rel + blkno + offset; field4 invalid => page; field3+4 invalid => relation */ dlist_head predicateLocks;} PREDICATELOCKTARGET;
// PREDICATELOCK — src/include/storage/predicate_internals.htypedef struct PREDICATELOCK { PREDICATELOCKTAG tag; /* (myTarget*, myXact*) */ dlist_node targetLink; dlist_node xactLink; SerCommitSeqNo commitSeqNo; /* set when owner commits; used for summarization */} PREDICATELOCK;Each backend also maintains a private LocalPredicateLockHash
(LOCALPREDICATELOCK) that tracks lock counts per granularity level
without touching shared memory. This hash drives granularity
promotion: when a transaction’s tuple-level lock count for a page
exceeds max_predicate_locks_per_page, CheckAndPromotePredicateLockRequest
upgrades them to a single page-level lock.
SIREAD lock acquisition at three granularities
Section titled “SIREAD lock acquisition at three granularities”The three public entry points are thin wrappers that build a
PREDICATELOCKTARGETTAG at the requested granularity and funnel into the
common PredicateLockAcquire. They differ only in which SET_*TAG macro
they call and in the early-out guards. PredicateLockTID carries the most
logic, because a tuple lock is redundant if this transaction already wrote
the tuple, or if a relation lock already covers it:
// PredicateLockTID — src/backend/storage/lmgr/predicate.cvoidPredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot, TransactionId tuple_xid){ PREDICATELOCKTARGETTAG tag;
if (!SerializationNeededForRead(relation, snapshot)) return;
/* If we wrote it; we already have a write lock. */ if (relation->rd_index == NULL) if (TransactionIdIsCurrentTransactionId(tuple_xid)) return;
/* Quick (non-definitive) check for a relation-level lock first. */ SET_PREDICATELOCKTARGETTAG_RELATION(tag, relation->rd_locator.dbOid, relation->rd_id); if (PredicateLockExists(&tag)) return;
SET_PREDICATELOCKTARGETTAG_TUPLE(tag, relation->rd_locator.dbOid, relation->rd_id, ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid)); PredicateLockAcquire(&tag);}PredicateLockAcquire is where the SIREAD lock actually lands. Note the
two short-circuits — already-held and covered-by-coarser — that make
acquisition idempotent and implement the README rule “an attempt to lock a
tuple when the same transaction already holds a lock on the page or
relation will be ignored”:
// PredicateLockAcquire — src/backend/storage/lmgr/predicate.cstatic voidPredicateLockAcquire(const PREDICATELOCKTARGETTAG *targettag){ uint32 targettaghash; bool found; LOCALPREDICATELOCK *locallock;
/* Do we have the lock already, or a covering lock? */ if (PredicateLockExists(targettag)) return; if (CoarserLockCovers(targettag)) return;
targettaghash = PredicateLockTargetTagHashCode(targettag);
/* Record in the per-backend local table (drives promotion). */ locallock = (LOCALPREDICATELOCK *) hash_search_with_hash_value(LocalPredicateLockHash, targettag, targettaghash, HASH_ENTER, &found); locallock->held = true; if (!found) locallock->childLocks = 0;
/* Actually create the lock in shared memory. */ CreatePredicateLock(targettag, targettaghash, MySerializableXact);
/* Promote to coarser granularity, or clean up finer locks. */ if (CheckAndPromotePredicateLockRequest(targettag)) ; /* promoted: parent acquire deleted this lock and its children */ else if (GET_PREDICATELOCKTARGETTAG_TYPE(*targettag) != PREDLOCKTAG_TUPLE) DeleteChildTargetLocks(targettag);}Granularity promotion in detail
Section titled “Granularity promotion in detail”CheckAndPromotePredicateLockRequest walks the parent chain (tuple → page
→ relation) in the backend-private hash, incrementing each ancestor’s
childLocks counter, and promotes to the coarsest ancestor whose child
count has crossed MaxPredicateChildLocks (derived from
max_predicate_locks_per_page / max_predicate_locks_per_relation):
// CheckAndPromotePredicateLockRequest — src/backend/storage/lmgr/predicate.cstatic boolCheckAndPromotePredicateLockRequest(const PREDICATELOCKTARGETTAG *reqtag){ PREDICATELOCKTARGETTAG targettag, nexttag, promotiontag; LOCALPREDICATELOCK *parentlock; bool found, promote = false;
targettag = *reqtag;
/* check parents iteratively */ while (GetParentPredicateLockTag(&targettag, &nexttag)) { targettag = nexttag; parentlock = (LOCALPREDICATELOCK *) hash_search(LocalPredicateLockHash, &targettag, HASH_ENTER, &found); if (!found) { parentlock->held = false; parentlock->childLocks = 1; } else parentlock->childLocks++;
if (parentlock->childLocks > MaxPredicateChildLocks(&targettag)) { promotiontag = targettag; /* keep climbing for child counts */ promote = true; } }
if (promote) { PredicateLockAcquire(&promotiontag); /* acquire coarsest eligible */ return true; } return false;}Transaction lifecycle
Section titled “Transaction lifecycle”-
Snapshot acquisition —
GetSerializableTransactionSnapshotallocates aSERIALIZABLEXACT, links it intoPredXact->activeList, and initializesoutConflicts,inConflicts, andpredicateLocks. If the transaction is read-only and no writable serializable transactions are active (WritableSxactCount == 0), it is immediately given a safe snapshot and never enters the tracking machinery. -
Read — SIREAD lock acquisition — the table AM calls
PredicateLockRelation,PredicateLockPage, orPredicateLockTID. These call through toPredicateLockAcquire, which inserts aPREDICATELOCKin the shared hash and aLOCALPREDICATELOCKin the backend-private hash. If a coarser lock already covers the target, the acquisition is a no-op. -
Write — conflict-out detection — when the table AM reads a tuple version written by another concurrent transaction, it calls
CheckForSerializableConflictOut. This looks up the writer’sSERIALIZABLEXACTand, if both transactions are concurrent, callsFlagRWConflict(MySerializableXact, writer_sxact)— recording a conflict whereMySerializableXactis the reader (out-side). -
Write — conflict-in detection — when the table AM writes a tuple, it calls
CheckForSerializableConflictIn, which scans thePredicateLockTargetHashfor all SIREAD locks on that target. For each lock held by a concurrent serializable transaction, it callsFlagRWConflict(reader_sxact, MySerializableXact). -
Dangerous-structure check —
FlagRWConflictcallsOnConflict_CheckForSerializationFailurebefore recording the edge (full excerpt below in the dedicated subsection). -
Pre-commit check —
PreCommit_CheckForSerializationFailureruns at commit time to catch the case where this transaction isTout(committing first) and its in-side conflict partner has an in-conflict that would complete the dangerous structure. -
Lock release and summarization —
ReleasePredicateLocksruns at commit or rollback. On commit,PREDICATELOCKentries are not freed immediately; they are transferred toOldCommittedSxact(a shared dummySERIALIZABLEXACT) and theircommitSeqNois set. When memory pressure builds,SummarizeOldestCommittedSxactpushes the oldest committed transaction’s conflict state to the SLRU-backed serial log (SerialSlruCtl) and reclaims its RAM.CheckForSerializableConflictOutchecks the serial log (viaSerialGetMinConflictCommitSeqNo) when a writer’sSERIALIZABLEXACTis no longer in the hash.
rw-conflict graph and dangerous structure
Section titled “rw-conflict graph and dangerous structure”flowchart LR Tin["Tin\n(reader)"] Tpivot["Tpivot\n(reader+writer)"] Tout["Tout\n(writer, commits first)"] Tin -- "rw-conflict out" --> Tpivot Tpivot -- "rw-conflict out" --> Tout
Tout must commit before Tpivot and Tin for the structure to be
dangerous (Theorem 2.1 of Fekete et al.). PostgreSQL exploits two
additional optimizations from README-SSI:
- Tout-commits-first guard: PostgreSQL never aborts a transaction until the out-side conflict partner has committed. This eliminates premature aborts.
- Read-only optimization: if
Tinis read-only, an anomaly is only possible ifToutcommitted beforeTinacquired its snapshot. Read-only transactions that started after all concurrent writable serializable transactions have committed cleanly can opt out entirely (theSXACT_FLAG_RO_SAFEpath inGetSerializableTransactionSnapshotInt).
Recording an edge: SetRWConflict
Section titled “Recording an edge: SetRWConflict”An edge is a pooled RWConflictData taken from RWConflictPool and
double-linked into the reader’s outConflicts and the writer’s
inConflicts. There is no separate adjacency matrix — the graph is these
two intrusive lists per SERIALIZABLEXACT. Exhaustion of the pool is a
hard error, not a silent miss, because a lost edge would mean a missed
anomaly:
// SetRWConflict — src/backend/storage/lmgr/predicate.cstatic voidSetRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer){ RWConflict conflict;
Assert(reader != writer); Assert(!RWConflictExists(reader, writer));
if (dlist_is_empty(&RWConflictPool->availableList)) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("not enough elements in RWConflictPool to record a read/write conflict"), errhint("You might need to run fewer transactions at a time or increase \"max_connections\".")));
conflict = dlist_head_element(RWConflictData, outLink, &RWConflictPool->availableList); dlist_delete(&conflict->outLink);
conflict->sxactOut = reader; /* edge tail = reader */ conflict->sxactIn = writer; /* edge head = writer */ dlist_push_tail(&reader->outConflicts, &conflict->outLink); dlist_push_tail(&writer->inConflicts, &conflict->inLink);}Conflict-out: the reader sees a concurrent writer
Section titled “Conflict-out: the reader sees a concurrent writer”CheckForSerializableConflictOut is called from the table AM when a scan
reads a tuple version whose xmin/xmax belongs to another transaction.
It looks the writer up by top-level XID — first in SerializableXidHash,
then, if the writer has already been summarized out to the serial log, via
SerialGetMinConflictCommitSeqNo. The read-only short-circuit implements
the original optimization from README-SSI:
// CheckForSerializableConflictOut — src/backend/storage/lmgr/predicate.cvoidCheckForSerializableConflictOut(Relation relation, TransactionId xid, Snapshot snapshot){ /* ... lookup SERIALIZABLEXACT *sxact for top-level xid ... */
/* If this is read-only and the writer committed without an earlier * out-conflict, the reader simply appears to run first: no conflict. */ if (SxactIsReadOnly(MySerializableXact) && SxactIsCommitted(sxact) && !SxactHasSummaryConflictOut(sxact) && (!SxactHasConflictOut(sxact) || MySerializableXact->SeqNo.lastCommitBeforeSnapshot < sxact->SeqNo.earliestOutConflictCommit)) { LWLockRelease(SerializableXactHashLock); return; }
if (!XidIsConcurrent(xid)) { /* write was already in our snapshot */ LWLockRelease(SerializableXactHashLock); return; } if (RWConflictExists(MySerializableXact, sxact)) { /* no duplicate edges */ LWLockRelease(SerializableXactHashLock); return; } /* ... FlagRWConflict(MySerializableXact, sxact) — we are the reader ... */}Conflict-in: the writer probes existing SIREAD locks
Section titled “Conflict-in: the writer probes existing SIREAD locks”CheckForSerializableConflictIn is the mirror image, called when a tuple
is written. It probes the predicate-lock hash from finest to coarsest
granularity (tuple, then page, then relation) — the order matters, because
a concurrent granularity promotion always acquires the coarser lock
before releasing the finer ones, so checking fine→coarse can never miss:
// CheckForSerializableConflictIn — src/backend/storage/lmgr/predicate.cvoidCheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno){ PREDICATELOCKTARGETTAG targettag;
if (!SerializationNeededForWrite(relation)) return; if (SxactIsDoomed(MySerializableXact)) ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot, during conflict in */ ...));
MyXactDidWrite = true;
if (tid != NULL) { /* finest: tuple */ SET_PREDICATELOCKTARGETTAG_TUPLE(targettag, relation->rd_locator.dbOid, relation->rd_id, ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid)); CheckTargetForConflictsIn(&targettag); } if (blkno != InvalidBlockNumber) { /* page */ SET_PREDICATELOCKTARGETTAG_PAGE(targettag, relation->rd_locator.dbOid, relation->rd_id, blkno); CheckTargetForConflictsIn(&targettag); } SET_PREDICATELOCKTARGETTAG_RELATION(targettag, relation->rd_locator.dbOid, relation->rd_id); /* coarsest: relation */ CheckTargetForConflictsIn(&targettag);}CheckTargetForConflictsIn does the per-target inner loop: for every
SIREAD lock on the target held by a still-relevant concurrent transaction
(not doomed; either uncommitted or committed recently enough that our
snapshot xmin predates its finishedBefore), it records the inbound edge
via FlagRWConflict(reader, me). It also drops our own SIREAD lock on a
tuple we are now writing — the README’s “write lock subsumes the SIREAD
lock” optimization:
// CheckTargetForConflictsIn — src/backend/storage/lmgr/predicate.cdlist_foreach_modify(iter, &target->predicateLocks){ PREDICATELOCK *predlock = dlist_container(PREDICATELOCK, targetLink, iter.cur); SERIALIZABLEXACT *sxact = predlock->tag.myXact;
if (sxact == MySerializableXact) { /* our own SIREAD lock on a tuple we're writing — defer removal */ if (!IsSubTransaction() && GET_PREDICATELOCKTARGETTAG_OFFSET(*targettag)) { mypredlock = predlock; mypredlocktag = predlock->tag; } } else if (!SxactIsDoomed(sxact) && (!SxactIsCommitted(sxact) || TransactionIdPrecedes(GetTransactionSnapshot()->xmin, sxact->finishedBefore)) && !RWConflictExists(sxact, MySerializableXact)) { /* upgrade to exclusive, re-check, then flag the inbound edge */ FlagRWConflict(sxact, MySerializableXact); }}Flagging and the dangerous-structure detector
Section titled “Flagging and the dangerous-structure detector”FlagRWConflict is the single choke point: it first asks
OnConflict_CheckForSerializationFailure whether adding this edge completes
a pivot, then either records a real edge or, when one party has already been
summarized into OldCommittedSxact, sets a summary-conflict flag instead of
allocating an RWConflictData:
// FlagRWConflict — src/backend/storage/lmgr/predicate.cstatic voidFlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer){ Assert(reader != writer);
/* First, see if this conflict causes failure. */ OnConflict_CheckForSerializationFailure(reader, writer);
/* Actually do the conflict flagging. */ if (reader == OldCommittedSxact) writer->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN; else if (writer == OldCommittedSxact) reader->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT; else SetRWConflict(reader, writer);}The detector checks the three ways the new reader ─rw─► writer edge can
close a dangerous structure. The third case — where the reader becomes
the pivot (T0 ─rw─► reader ─rw─► writer) — is the subtle one; it walks the
reader’s existing inConflicts looking for a T0 that has not safely
committed ahead of the writer:
// OnConflict_CheckForSerializationFailure — src/backend/storage/lmgr/predicate.cstatic voidOnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer){ bool failure = false; Assert(LWLockHeldByMe(SerializableXactHashLock));
/* Case 1: writer already committed and has a conflict out (R->W->T2). */ if (SxactIsCommitted(writer) && (SxactHasConflictOut(writer) || SxactHasSummaryConflictOut(writer))) failure = true;
/* Case 2: writer is the pivot — walk its outConflicts for a T2 that * committed/prepared first (R->W->T2). */ if (!failure && SxactHasSummaryConflictOut(writer)) failure = true; else if (!failure) { dlist_foreach(iter, &writer->outConflicts) { SERIALIZABLEXACT *t2 = dlist_container(RWConflictData, outLink, iter.cur)->sxactIn; if (SxactIsPrepared(t2) && (!SxactIsCommitted(reader) || t2->prepareSeqNo <= reader->commitSeqNo) && (!SxactIsCommitted(writer) || t2->prepareSeqNo <= writer->commitSeqNo) && (!SxactIsReadOnly(reader) || t2->prepareSeqNo <= reader->SeqNo.lastCommitBeforeSnapshot)) { failure = true; break; } } }
/* Case 3: reader is the pivot — walk its inConflicts for a T0 * (T0->R->W), writer prepared, reader not read-only. */ if (!failure && SxactIsPrepared(writer) && !SxactIsReadOnly(reader)) { if (SxactHasSummaryConflictIn(reader)) failure = true; else dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->inConflicts) { const SERIALIZABLEXACT *t0 = dlist_container(RWConflictData, inLink, iter.cur)->sxactOut; if (!SxactIsDoomed(t0) && (!SxactIsCommitted(t0) || t0->commitSeqNo >= writer->prepareSeqNo) && (!SxactIsReadOnly(t0) || t0->SeqNo.lastCommitBeforeSnapshot >= writer->prepareSeqNo)) { failure = true; break; } } }
if (failure) { if (MySerializableXact == writer) ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot during write */ ...)); else if (SxactIsPrepared(writer)) ereport(ERROR, ...); /* writer prepared: we must be the reader */ else writer->flags |= SXACT_FLAG_DOOMED; /* defer abort to writer's commit */ }}The SXACT_FLAG_DOOMED branch is the heart of the “Tout commits first”
discipline: rather than aborting immediately when the pattern appears,
PostgreSQL marks the pivot doomed and lets the actual abort fire later (in
CheckForSerializableConflictIn/Out or PreCommit_*), by which time it
is certain the out-side partner has committed first.
flowchart TD
READ["table AM reads a tuple<br/>written by another xact"] --> COUT["CheckForSerializableConflictOut"]
WRITE["table AM writes a tuple"] --> CIN["CheckForSerializableConflictIn"]
COUT --> COUTQ{"writer concurrent<br/>and no dup edge?"}
COUTQ -->|"yes"| FLAG["FlagRWConflict<br/>(reader=me, writer)"]
CIN --> CTGT["CheckTargetForConflictsIn<br/>tuple -> page -> relation"]
CTGT --> CINQ{"SIREAD lock by live<br/>concurrent xact?"}
CINQ -->|"yes"| FLAG2["FlagRWConflict<br/>(reader, writer=me)"]
FLAG --> OCF["OnConflict_CheckForSerializationFailure"]
FLAG2 --> OCF
OCF --> PIVOT{"dangerous structure?<br/>Tin rw Tpivot rw Tout<br/>Tout committed first"}
PIVOT -->|"no"| REC["SetRWConflict<br/>record edge in pool"]
PIVOT -->|"yes, I am writer"| ERR["ereport 40001<br/>serialization_failure"]
PIVOT -->|"yes, writer prepared"| ERR
PIVOT -->|"yes, otherwise"| DOOM["writer.flags |= SXACT_FLAG_DOOMED<br/>abort deferred to commit"]
Pre-commit: catching the Tout role
Section titled “Pre-commit: catching the Tout role”A committing transaction might be the Tout of a structure whose pivot and
Tin have not yet committed — a case the incremental edge check could not
have fired, because at edge-recording time neither partner had committed.
PreCommit_CheckForSerializationFailure does the final sweep, dooming the
pivot (the near conflict) rather than the committing transaction so that
the committer makes progress and a retry of the victim can succeed:
// PreCommit_CheckForSerializationFailure — src/backend/storage/lmgr/predicate.cvoidPreCommit_CheckForSerializationFailure(void){ dlist_iter near_iter; if (MySerializableXact == InvalidSerializableXact) return; Assert(IsolationIsSerializable()); LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
if (SxactIsDoomed(MySerializableXact) && !SxactIsPartiallyReleased(MySerializableXact)) { LWLockRelease(SerializableXactHashLock); ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot during commit */ ...)); }
/* For each in-edge (nearConflict->sxactOut = the pivot), look for a * far edge that closes T_far -> pivot -> me, with the far partner * still able to be the apparent-first transaction. */ dlist_foreach(near_iter, &MySerializableXact->inConflicts) { RWConflict nearConflict = dlist_container(RWConflictData, inLink, near_iter.cur); if (!SxactIsCommitted(nearConflict->sxactOut) && !SxactIsDoomed(nearConflict->sxactOut)) { dlist_foreach(far_iter, &nearConflict->sxactOut->inConflicts) { RWConflict farConflict = dlist_container(RWConflictData, inLink, far_iter.cur); if (farConflict->sxactOut == MySerializableXact || (!SxactIsCommitted(farConflict->sxactOut) && !SxactIsReadOnly(farConflict->sxactOut) && !SxactIsDoomed(farConflict->sxactOut))) { if (SxactIsPrepared(nearConflict->sxactOut)) { LWLockRelease(SerializableXactHashLock); ereport(ERROR, ...); /* pivot already prepared: commit suicide */ } nearConflict->sxactOut->flags |= SXACT_FLAG_DOOMED; /* doom the pivot */ break; } } } }
MySerializableXact->prepareSeqNo = ++(PredXact->LastSxactCommitSeqNo); MySerializableXact->flags |= SXACT_FLAG_PREPARED; LWLockRelease(SerializableXactHashLock);}Granularity escalation
Section titled “Granularity escalation”flowchart TD T["PredicateLockTID\n(tuple)"] --> P["PredicateLockPage\n(page)"] P --> R["PredicateLockRelation\n(relation)"] T -->|"childLocks > max_predicate_locks_per_page"| CAP["CheckAndPromotePredicateLockRequest\ndelete child locks,\ncreate coarser lock"] P -->|"childLocks > max_predicate_locks_per_relation"| CAP
GUC knobs max_predicate_locks_per_xact, max_predicate_locks_per_relation,
and max_predicate_locks_per_page control when promotion fires.
PredicateLockShmemSize sizes the shared hash tables from these values.
Safe snapshots and DEFERRABLE READ ONLY
Section titled “Safe snapshots and DEFERRABLE READ ONLY”There are two distinct “opt-out” paths for read-only transactions, both grounded in the read-only theorem from the Theoretical Background.
The immediate opt-out happens inside
GetSerializableTransactionSnapshotInt: if the transaction is READ ONLY
and no writable serializable transaction is currently active
(WritableSxactCount == 0), it can never join a dangerous structure, so it
releases its SERIALIZABLEXACT slot and runs as plain SI with zero SSI
bookkeeping. A second check fires after registering possible conflicts —
if every concurrent writable transaction turned out to be doomed, the list
of possibleUnsafeConflicts is empty and the same opt-out applies:
// GetSerializableTransactionSnapshotInt — src/backend/storage/lmgr/predicate.cif (XactReadOnly){ sxact->flags |= SXACT_FLAG_READ_ONLY;
/* Register all concurrent r/w transactions as possible conflicts; if * they all commit without an outgoing conflict to an earlier txn, this * snapshot is safe and we can run without tracking predicate locks. */ dlist_foreach(iter, &PredXact->activeList) { othersxact = dlist_container(SERIALIZABLEXACT, xactLink, iter.cur); if (!SxactIsCommitted(othersxact) && !SxactIsDoomed(othersxact) && !SxactIsReadOnly(othersxact)) SetPossibleUnsafeConflict(sxact, othersxact); }
if (dlist_is_empty(&sxact->possibleUnsafeConflicts)) { ReleasePredXact(sxact); /* every writer was doomed */ LWLockRelease(SerializableXactHashLock); return snapshot; /* immediate opt-out */ }}The deferred path is GetSafeSnapshot, used for READ ONLY DEFERRABLE.
Instead of running with tracking, it waits — sleeping on
WAIT_EVENT_SAFE_SNAPSHOT — until the concurrent writers either drain
(possibleUnsafeConflicts empties, meaning a safe snapshot is now
available) or one of them flags this transaction RO_UNSAFE, in which case
it retries with a fresh snapshot. The cost is latency; the payoff is a
read-only transaction that is provably anomaly-free and tracks nothing:
// GetSafeSnapshot — src/backend/storage/lmgr/predicate.cstatic SnapshotGetSafeSnapshot(Snapshot origSnapshot){ Snapshot snapshot; Assert(XactReadOnly && XactDeferrable);
while (true) { snapshot = GetSerializableTransactionSnapshotInt(origSnapshot, NULL, InvalidPid); if (MySerializableXact == InvalidSerializableXact) return snapshot; /* no concurrent r/w xacts; already safe */
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE); MySerializableXact->flags |= SXACT_FLAG_DEFERRABLE_WAITING; while (!(dlist_is_empty(&MySerializableXact->possibleUnsafeConflicts) || SxactIsROUnsafe(MySerializableXact))) { LWLockRelease(SerializableXactHashLock); ProcWaitForSignal(WAIT_EVENT_SAFE_SNAPSHOT); LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE); } MySerializableXact->flags &= ~SXACT_FLAG_DEFERRABLE_WAITING;
if (!SxactIsROUnsafe(MySerializableXact)) { LWLockRelease(SerializableXactHashLock); break; /* success: snapshot is safe */ } LWLockRelease(SerializableXactHashLock); ReleasePredicateLocks(false, false); /* unsafe; retry with a new snapshot */ }
Assert(SxactIsROSafe(MySerializableXact)); ReleasePredicateLocks(false, true); return snapshot;}flowchart TD
START["READ ONLY DEFERRABLE<br/>GetSafeSnapshot"] --> SNAP["GetSerializableTransactionSnapshotInt"]
SNAP --> NOXACT{"MySerializableXact<br/>== Invalid?"}
NOXACT -->|"yes: no concurrent r/w"| SAFE["return snapshot<br/>(SXACT_FLAG_RO_SAFE)"]
NOXACT -->|"no"| WAIT["set DEFERRABLE_WAITING<br/>ProcWaitForSignal"]
WAIT --> COND{"possibleUnsafeConflicts<br/>empty?"}
COND -->|"yes"| SAFE
COND -->|"flagged RO_UNSAFE"| RETRY["ReleasePredicateLocks<br/>retry new snapshot"]
RETRY --> SNAP
### Index AM integration
B-tree, GIN, GiST, and hash index scans each call a different variant
of `PredicateLock*` to cover the "gaps" between index entries — the
ranges that would be affected by a later insert. B-tree and GIN lock
leaf pages; GiST locks pages at every level (because splits can ripple
up); hash locks the primary bucket page. For index AMs without predicate
lock support, the entire index relation is locked. Page splits and
combines in all AMs call `PredicateLockPageSplit` /
`PredicateLockPageCombine` to transfer locks to new pages so no rw-
conflicts are silently lost.
### Locking hierarchy
The LWLocks that guard SSI state must be acquired in a fixed order to
prevent latch-ordering deadlocks (from the predicate.c file header):
SerializableFinishedListLock SerializablePredicateListLock perXactPredicateListLock (parallel query only) PredicateLockHashPartitionLock(hashcode) SerializableXactHashLock SerialControlLock SLRU per-bank locks
### Two-phase commit
`AtPrepare_PredicateLocks` serializes a transaction's predicate lock stateinto the 2PC state file as `TwoPhasePredicateRecord` entries (one`TWOPHASEPREDICATERECORD_XACT` plus one `TWOPHASEPREDICATERECORD_LOCK` perlock). `predicatelock_twophase_recover` reconstructs the state on recovery.The conflict lists are summarized via flags (`SXACT_FLAG_SUMMARY_CONFLICT_IN`/ `SXACT_FLAG_SUMMARY_CONFLICT_OUT`) rather than pointers, since theoriginating `SERIALIZABLEXACT` objects will not exist after restart.
## Source Walkthrough
### Shared-memory init
| Symbol | Role ||---|---|| `PredicateLockShmemInit` | allocates all SSI shared-memory objects at startup || `PredicateLockShmemSize` | computes required shared memory from GUCs || `PredXact` (global `PredXactList`) | root pointer to the active/available sxact pool || `SerializableXidHash` | XID → `SERIALIZABLEXACT*` lookup || `PredicateLockTargetHash` | `PREDICATELOCKTARGETTAG` → `PREDICATELOCKTARGET*` || `PredicateLockHash` | `(target*, sxact*)` → `PREDICATELOCK*` || `SerialSlruCtl` | SLRU buffer for old committed transaction state || `OldCommittedSxact` | shared dummy sxact that accumulates summarized locks |
### Transaction lifecycle
| Symbol | Role ||---|---|| `GetSerializableTransactionSnapshot` | public entry; calls `GetSerializableTransactionSnapshotInt` || `GetSerializableTransactionSnapshotInt` | allocates `SERIALIZABLEXACT`, populates `possibleUnsafeConflicts` for READ ONLY; immediate opt-out when `WritableSxactCount == 0` || `GetSafeSnapshot` | `READ ONLY DEFERRABLE` path; waits on `WAIT_EVENT_SAFE_SNAPSHOT` for a safe snapshot || `SetPossibleUnsafeConflict` | links a RO transaction to a concurrent writer in `possibleUnsafeConflicts` || `RegisterPredicateLockingXid` | assigns `topXid` to `MySerializableXact` on first write || `SerializationNeededForRead` | inline gate: skip if not serializable, not MVCC snapshot, or RO-safe || `SerializationNeededForWrite` | inline gate for write paths || `ReleasePredicateLocks` | commit/rollback cleanup; transfers locks to `OldCommittedSxact` || `SummarizeOldestCommittedSxact` | pushes oldest finished sxact to SLRU, reclaims RAM |
### SIREAD lock acquisition
| Symbol | Role ||---|---|| `PredicateLockRelation` | acquires a relation-granularity SIREAD lock || `PredicateLockPage` | acquires a page-granularity SIREAD lock || `PredicateLockTID` | acquires a tuple-granularity SIREAD lock || `PredicateLockAcquire` | core insertion into `PredicateLockTargetHash` + `PredicateLockHash` || `CoarserLockCovers` | checks whether a coarser lock already subsumes the requested target || `CheckAndPromotePredicateLockRequest` | granularity escalation decision using `LocalPredicateLockHash` || `DeleteChildTargetLocks` | removes finer-grained locks after promotion || `LocalPredicateLockHash` | per-backend private hash (`LOCALPREDICATELOCK`) for promotion heuristics |
### Conflict detection
| Symbol | Role ||---|---|| `CheckForSerializableConflictOut` | reader path: writer's xid → lookup sxact → `FlagRWConflict(me, writer)` || `CheckForSerializableConflictIn` | writer path: scan predicate locks on target → `FlagRWConflict(reader, me)` || `CheckTargetForConflictsIn` | inner loop of `CheckForSerializableConflictIn` for one target tag || `CheckTableForSerializableConflictIn` | called by CLUSTER/REINDEX to conflict-check whole table || `FlagRWConflict` | calls failure check, then `SetRWConflict` || `SetRWConflict` | allocates `RWConflictData` from pool, links into `outConflicts`/`inConflicts` || `OnConflict_CheckForSerializationFailure` | dangerous-structure detector; sets `SXACT_FLAG_DOOMED` or raises ERROR || `PreCommit_CheckForSerializationFailure` | commit-time check for Tout role |
### Serial log (SLRU overflow)
| Symbol | Role ||---|---|| `SerialControlData` | `headPage`, `headXid`, `tailXid` for the SLRU window || `SerialAdd` | writes a committed sxact's `minConflictCommitSeqNo` to SLRU || `SerialGetMinConflictCommitSeqNo` | reads the serial log for an old xid's conflict info || `SerialSetActiveSerXmin` | advances the SLRU tail (reclaims old pages) |
### Key structs and macros
| Symbol | Role ||---|---|| `SERIALIZABLEXACT` | per-serializable-transaction SSI bookkeeping || `PredXactListData` | global counters + sxact pool lists || `RWConflictData` | one directed rw-conflict edge (reader → writer) || `PREDICATELOCKTARGET` | one lockable physical object (relation/page/tuple) || `PREDICATELOCK` | one (target, sxact) lock binding || `LOCALPREDICATELOCK` | per-backend private lock count for promotion heuristics || `SerCommitSeqNo` | monotone `uint64` ordering commits for conflict age tests || `SxactIsCommitted`, `SxactIsDoomed`, `SxactIsROSafe`, … | `SXACT_FLAG_*` test macros || `TargetTagIsCoveredBy` | macro: true when one tag is at a coarser granularity covering another || `PredicateLockHashPartitionLock` | maps a target hash code to its partition LWLock |
## Source verification (as of 2026-06-05)
Branch `REL_18_STABLE`, commit `273fe94`.
| Symbol | File | Line ||---|---|---|| `SERIALIZABLEXACT` | `src/include/storage/predicate_internals.h` | 58 || `PredXactListData` | `src/include/storage/predicate_internals.h` | 144 || `RWConflictData` | `src/include/storage/predicate_internals.h` | 193 || `PREDICATELOCKTARGETTAG` | `src/include/storage/predicate_internals.h` | 267 || `PREDICATELOCKTARGET` | `src/include/storage/predicate_internals.h` | 284 || `PREDICATELOCK` | `src/include/storage/predicate_internals.h` | 317 || `LOCALPREDICATELOCK` | `src/include/storage/predicate_internals.h` | 347 || `PredicateLockTargetType` | `src/include/storage/predicate_internals.h` | 361 || `SerialControlData` | `src/backend/storage/lmgr/predicate.c` | 345 || `SerializationNeededForRead` | `src/backend/storage/lmgr/predicate.c` | 516 || `SerializationNeededForWrite` | `src/backend/storage/lmgr/predicate.c` | 560 || `SetRWConflict` | `src/backend/storage/lmgr/predicate.c` | 643 || `SetPossibleUnsafeConflict` | `src/backend/storage/lmgr/predicate.c` | 666 || `PredicateLockShmemInit` | `src/backend/storage/lmgr/predicate.c` | 1145 || `PredicateLockShmemSize` | `src/backend/storage/lmgr/predicate.c` | 1357 || `GetSafeSnapshot` | `src/backend/storage/lmgr/predicate.c` | 1558 || `GetSerializableTransactionSnapshot` | `src/backend/storage/lmgr/predicate.c` | 1682 || `GetSerializableTransactionSnapshotInt` | `src/backend/storage/lmgr/predicate.c` | 1764 || `RegisterPredicateLockingXid` | `src/backend/storage/lmgr/predicate.c` | 1959 || `CheckAndPromotePredicateLockRequest` | `src/backend/storage/lmgr/predicate.c` | 2326 || `PredicateLockAcquire` | `src/backend/storage/lmgr/predicate.c` | 2517 || `PredicateLockRelation` | `src/backend/storage/lmgr/predicate.c` | 2576 || `PredicateLockPage` | `src/backend/storage/lmgr/predicate.c` | 2599 || `PredicateLockTID` | `src/backend/storage/lmgr/predicate.c` | 2621 || `PredicateLockPageSplit` | `src/backend/storage/lmgr/predicate.c` | 3144 || `PredicateLockPageCombine` | `src/backend/storage/lmgr/predicate.c` | 3229 || `ReleasePredicateLocks` | `src/backend/storage/lmgr/predicate.c` | 3312 || `CheckForSerializableConflictOut` | `src/backend/storage/lmgr/predicate.c` | 4023 || `CheckTargetForConflictsIn` | `src/backend/storage/lmgr/predicate.c` | 4166 || `CheckForSerializableConflictIn` | `src/backend/storage/lmgr/predicate.c` | 4336 || `CheckTableForSerializableConflictIn` | `src/backend/storage/lmgr/predicate.c` | 4419 || `FlagRWConflict` | `src/backend/storage/lmgr/predicate.c` | 4501 || `OnConflict_CheckForSerializationFailure` | `src/backend/storage/lmgr/predicate.c` | 4536 || `PreCommit_CheckForSerializationFailure` | `src/backend/storage/lmgr/predicate.c` | 4703 |
## Beyond PostgreSQL — Comparative Designs & Research Frontiers
**S2PL vs. SSI.** The classic alternative is Strict Two-Phase Locking(S2PL): every read acquires a shared lock, every write acquires anexclusive lock, all held until commit. S2PL is trivially serializable butblocks reads against concurrent writers. SSI avoids blocking readsentirely — at the cost of occasionally aborting a transaction when adangerous structure is detected. For read-heavy workloads SSI wins onthroughput; for write-heavy, high-conflict workloads S2PL may bepreferable because fewer transactions are aborted after doing work.
**CockroachDB and Spanner.** CockroachDB uses a variant of SSI adapted fordistributed transactions, adding a *write-intent* mechanism forwrite-write conflicts and a *timestamp cache* that plays the role ofpredicate locks. Google Spanner uses strict two-phase locking withexternal-consistency timestamps; it does not implement SSI.
**MySQL/InnoDB** implements SERIALIZABLE as S2PL (shared locks on allreads), not SSI. It also uses next-key locking for gap coverage, whichis a physical implementation of predicate locking for B-tree ranges.InnoDB's `trx_sys` carries per-transaction conflict state analogous to`SERIALIZABLEXACT`, but without the dangerous-structure check.
**Research: reducing false positives.** The PostgreSQL implementationtracks only rw-conflicts and performs a simple pattern match. Moreprecise implementations (e.g., port of the Cahill prototype to acommercial engine) maintain a richer graph and can sometimes prove that adangerous structure is not embedded in a true cycle, reducing falseaborts. The tradeoff is higher bookkeeping cost per conflict edge.
**Research: write skew vs. lost update.** A separate class of anomalies —the lost update — is prevented by PostgreSQL's first-updater-winsmechanism in the heap AM (`HeapTupleUpdated` path in `heap_update`), notby SSI. SSI targets write-skew and its generalizations; lost updates arehandled outside the predicate locking layer. See`postgres-mvcc-snapshots.md` for the interaction between snapshotisolation and these anomaly types.
**The lineage of the theory.** The four papers cited here form a chain.Berenson et al. (SIGMOD 1995) showed ANSI's phenomena-based definitionsfail to pin down SI and that SI permits anomalies (A5B write skew) thestandard's "SERIALIZABLE" label implies are excluded — establishing theproblem. Fekete et al. (TODS 2005) supplied the structural theorem: everySI-schedule cycle contains two consecutive rw-antidependencies with apivot, and `Tout` commits first (Theorem 2.1) — establishing what to watchfor. Cahill, Röhm & Fekete (SIGMOD 2008) turned the theorem into anincremental, low-overhead runtime monitor with a small false-positive rate— establishing how to watch cheaply. Ports & Grittner (VLDB 2012)documented the PostgreSQL realization and its original refinements: theSLRU-backed serial log for bounded RAM, the multi-granularity SIREADlocks, subtransaction handling on top-level XIDs, the safe-snapshot /DEFERRABLE path, and the victim-selection rule that dooms the pivot so aretry can succeed. README-SSI's "Innovations" section is the in-treecounterpart to that VLDB paper.
**False positives, quantified.** README-SSI is candid that the monitor isconservative: "not every dangerous structure is embedded in an actualcycle." Two design choices keep the false-positive rate low without a fullcycle search: (1) the Theorem-2.1 commit-order guard never aborts until theout-side partner commits; (2) the rw-conflict graph is kept as explicit`RWConflictData` edge lists rather than the paper-prototype'sconflictIn/conflictOut booleans, so the detector can distinguish "has *a*conflict out" from "has a conflict out to *this specific* earliertransaction." Coarse predicate-lock granularity is the other false-positivesource — a relation-level SIREAD lock conflicts with any write to thetable — and is tunable via the `max_predicate_locks_per_*` GUCs and viaplan choice (an index scan may lock only a few leaf pages where a seq scanlocks the whole relation).
**SSI in other engines.** Beyond CockroachDB (timestamp cache as apredicate-lock surrogate) and Spanner (S2PL + TrueTime, no SSI), theacademic line continued with *serializable* variants that reduce aborts:PSSI (Precisely Serializable Snapshot Isolation, Revilak et al.) tracks thereal dependency graph to eliminate the rw-only false positives PostgreSQLaccepts, at higher bookkeeping cost; and various OCC-based MVCC engines(Hekaton, Silo) validate read sets at commit rather than tracking edgesincrementally. The PostgreSQL choice — incremental rw-edge tracking with aconservative pivot test — trades some precision for a constant-factorper-conflict cost and a bounded-RAM story via SLRU spill, which suits adisk-based engine with long-lived transactions better than the in-RAMread-set validation of a main-memory OCC system.
**The `knowledge/research/dbms-papers/` anchors for this doc** are:`occ.md` (optimistic concurrency control theory, a precursor framework forSSI analysis), `scalable-lock-manager.md` (lock-manager scalability, thecontrast point for SIREAD's non-blocking design), and the primaryCahill/Fekete/Ports-Grittner papers listed in README-SSI [1][2].
## Sources
- `src/backend/storage/lmgr/predicate.c` — full SSI implementation- `src/include/storage/predicate_internals.h` — shared-memory struct definitions- `src/include/storage/predicate.h` — public interface declarations- `src/backend/storage/lmgr/README-SSI` — in-tree design document; the primary reference- Cahill, Röhm, Fekete: *Serializable isolation for snapshot databases*, SIGMOD 2008 ([1] in README-SSI)- Fekete, Liarokapis, O'Neil, O'Neil, Shasha: *Making Snapshot Isolation Serializable*, ACM TODS 30:2, 2005 ([2] in README-SSI; Theorem 2.1)- Ports, Grittner: *Serializable Snapshot Isolation in PostgreSQL*, VLDB 2012 (the PostgreSQL realization and its refinements)- Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil: *A Critique of ANSI SQL Isolation Levels*, SIGMOD 1995 (why SI is not SERIALIZABLE; A5B write skew)- Hellerstein, Stonebraker, Hamilton: *Architecture of a Database System*, §6.5.3 next-key locking ([3] in README-SSI)- `knowledge/research/dbms-papers/occ.md`- `knowledge/research/dbms-papers/scalable-lock-manager.md` — lock-manager scalability; contrast for the non-blocking SIREAD design- `knowledge/code-analysis/postgres/postgres-lock-manager.md` — heavyweight lock manager (contrasted here)- `knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md` — snapshot machinery that SSI hooks into- `knowledge/code-analysis/postgres/postgres-xact.md` — transaction lifecycle that calls the SSI entry points- `knowledge/code-analysis/postgres/postgres-slru.md` — SLRU subsystem used by the serial log