Skip to content

PostgreSQL SSI and Predicate Locking — Serializable Snapshot Isolation, SIREAD Locks, and the rw-conflict Graph

Contents:

Snapshot isolation (SI) gives each transaction a consistent view of the database as of a point in time, so readers never block writers and writers never block readers. Database System Concepts (Silberschatz, 7e, §19.6) calls SI an optimistic scheme: because it is based on MVCC, the usual read/write contention disappears. However, SI does not guarantee full serializability. The canonical counter-example is the write skew anomaly: two concurrent transactions each read overlapping data and each write to a disjoint part, yet the combined result is inconsistent with any serial execution.

The theoretical fix was published in Making Snapshot Isolation Serializable (Fekete et al., ACM TODS 2005, [2] in README-SSI). The key observation is that every SI anomaly contains a dangerous structure in the dependency graph: two adjacent rw-conflict edges forming the shape

Tin ──rw──► Tpivot ──rw──► Tout

where Tout must commit before Tpivot and Tin. No other dependency type (wr- or ww-) is needed; rw-conflicts alone are sufficient to detect all anomalies without false-negatives. An algorithm that watches for this pattern and aborts one transaction when it appears can run SI underneath and still deliver serializability — Serializable Snapshot Isolation (SSI).

Cahill, Röhm, and Fekete (SIGMOD 2008, [1] in README-SSI) gave the first practical SSI algorithm and demonstrated that its overhead over plain SI is small. PostgreSQL adopted SSI in version 9.1. The implementation deviates from the paper in several ways driven by PostgreSQL’s existing architecture (see the “Innovations” section of README-SSI), but the core graph-monitoring idea is unchanged.

From the multi-version serialization graph to the dangerous structure

Section titled “From the multi-version serialization graph to the dangerous structure”

To make the “dangerous structure” claim precise it helps to recall the multi-version serialization graph (MVSG) that the SSI literature builds on. Nodes are committed transactions; edges encode the apparent execution order. README-SSI distinguishes three edge types, following Fekete et al.:

  • wr-dependencyT1 ─wr─► T2: T2 reads a version that T1 wrote. T1 must have committed before T2 took its snapshot, so the edge is always consistent with commit order.
  • ww-dependencyT1 ─ww─► T2: T2 overwrites a version T1 wrote. Again T1 committed first.
  • rw-conflict (rw-antidependency)T1 ─rw─► T2: T1 reads a version that T2 overwrites (or T2’s write would have been visible to T1’s read predicate had it landed first). Critically, this edge can point backwards in time: T1 appears to execute first because it saw a state prior to T2’s write, even if T1 actually committed after T2.

A schedule is serializable iff its MVSG is acyclic. Under snapshot isolation the only edges that can run counter to real commit order are the rw-antidependencies, so they are the only edges capable of closing a cycle that did not exist under a serial schedule. This is the engine behind write skew: two transactions each read the other’s “before” state and write disjoint rows, producing the cycle T1 ─rw─► T2 ─rw─► T1.

Fekete, Liarokapis, O’Neil, O’Neil & Shasha (Making Snapshot Isolation Serializable, ACM TODS 30:2, 2005 — [2] in README-SSI) proved the sharpest structural result: every cycle in the MVSG of an SI schedule contains two rw-antidependency edges that are consecutive — i.e. a Tpivot with one rw-edge in and one rw-edge out. The transaction in the middle is the pivot. Cahill et al.’s SIGMOD 2008 contribution was to turn this theorem into a cheap runtime test: instead of materializing the whole graph and searching for cycles (expensive), track only rw-antidependencies incrementally and fire whenever a pivot acquires both an inbound and an outbound rw-edge. Because not every such pivot lies on an actual cycle, the test admits false positives but never false negatives — it is safe (serializable) and conservative (may over-abort). README-SSI states this tradeoff explicitly: “The number of false positives is low in practice, so this represents an acceptable tradeoff for keeping the detection overhead low.”

The bare pivot test would abort too eagerly. Fekete et al.’s Theorem 2.1 adds the timing constraint PostgreSQL exploits: in any dangerous structure Tin ─rw─► Tpivot ─rw─► Tout that lies on a real cycle, Tout must be the first of the three to commit. PostgreSQL therefore refuses to abort anyone until the out-side partner has actually committed (the SXACT_FLAG_DOOMED/commit-time check in OnConflict_CheckForSerializationFailure). This single refinement removes a large class of false positives at essentially zero cost, because the commit order is already tracked via SerCommitSeqNo.

Berenson et al. and why “ANSI SERIALIZABLE” is not enough

Section titled “Berenson et al. and why “ANSI SERIALIZABLE” is not enough”

The reason PostgreSQL needed SSI at all traces to Berenson, Bernstein, Gray, Melton, O’Neil & O’Neil, A Critique of ANSI SQL Isolation Levels (SIGMOD 1995). That paper showed the ANSI SQL-92 isolation definitions — phrased as “phenomena” P1/P2/P3 (dirty read, non-repeatable read, phantom) — are ambiguous and, crucially, do not characterize snapshot isolation. SI forbids all three ANSI phenomena yet still permits write skew (A5B) and read-only-transaction anomalies, so a vendor could legally label SI as “SERIALIZABLE” while violating true serializability. PostgreSQL did exactly this before 9.1: requesting SERIALIZABLE silently gave snapshot isolation. SSI closed that gap; REPEATABLE READ now retains the legacy SI behavior (which is itself stronger than the ANSI definition of REPEATABLE READ — PostgreSQL has no dirty reads and no phantoms at that level), while SERIALIZABLE adds the rw-antidependency monitor on top.

The read-only-transaction optimization (an original contribution)

Section titled “The read-only-transaction optimization (an original contribution)”

README-SSI’s SSI-algorithm section records a refinement that is not in the Cahill/Fekete papers: if Tin is read-only, an anomaly is possible only if Tout committed before Tin took its snapshot. The proof (reproduced in README-SSI) observes that the edge entering a read-only Tin on the cycle cannot be an rw- or ww-edge — a read-only transaction writes nothing — so it must be a wr-dependency, which forces the predecessor to have committed before Tin’s snapshot. Combined with the “Tout commits first” rule, Tout must precede Tin’s snapshot. This is what makes the DEFERRABLE READ ONLY safe snapshot sound (see the PostgreSQL section) and lets ordinary read-only transactions skip much of the conflict bookkeeping. Ports & Grittner, Serializable Snapshot Isolation in PostgreSQL (VLDB 2012), is the authoritative write-up of these engineering refinements and the measured overheads; it is the paper to cite for the PostgreSQL-specific deviations.

Predicate locking is a prerequisite. To detect an rw-conflict when the read happens before the write, the system must record what each transaction has read — not just which tuples, but the “predicate” covering the set of tuples the scan would return. A later write that falls inside that predicate is then a potential rw-conflict. Classical predicate locks are hard to implement exactly; PostgreSQL uses physical-object locks (relation, page, tuple) as conservative approximations, accepting some false-positive rw-conflicts in exchange for simplicity. Hellerstein, Stonebraker & Hamilton’s Architecture of a Database System (§6.5.3, “Next-Key Locking: Physical Surrogates for Logical Properties” — [3] in README-SSI) is the canonical reference for this physical-surrogate trick; the SIREAD lock is its non-blocking cousin.

SSI as described in the Cahill papers is a general framework, not a PostgreSQL-specific design. Several engineering conventions appear in any MVCC engine that implements it.

Each serializable transaction needs bookkeeping beyond a snapshot: a list of its outgoing rw-conflicts (transactions it read from, whose later writes created a conflict) and its incoming rw-conflicts (transactions that read data it later wrote). A dangerous structure exists when a transaction has both a conflict in and a conflict out where the out-side transaction has already committed.

Tracking every individual tuple read in RAM is not feasible at scale. The standard approach (used in PostgreSQL, IBM DB2’s next-key locking, and the paper prototypes) is a multi-granularity lock hierarchy: try to lock the tuple, but if too many tuple locks accumulate, promote them to a single page lock; if page locks proliferate, promote to a relation lock. Coarser locks cause more false-positive conflicts but guarantee bounded memory use.

Predicate locks for SSI are fundamentally different from heavyweight (S2PL-style) locks: they never block. A SIREAD lock is closer to a flag — it records that a transaction accessed an object, so that a later writer on the same object can detect an rw-conflict. No queue, no wait, no deadlock. The structures and code paths are necessarily separate from the regular lock manager.

A transaction’s conflict information must outlive the transaction itself: a committed transaction may still be part of a dangerous structure visible to concurrent serializable transactions that have not yet committed. Every engine needs a policy for when to reclaim this state. The standard approach is to keep state for all transactions whose xmin is no older than the oldest live serializable transaction’s snapshot.

A read-only transaction can never cause a write-skew anomaly on its own. If it also starts after all concurrent read-write transactions have committed without outgoing conflicts, it is on a safe snapshot and can opt out of predicate lock tracking entirely. Delaying the start of a READ ONLY DEFERRABLE transaction until such a safe snapshot is available costs latency but eliminates SSI overhead for that transaction.

PostgreSQL’s SSI implementation lives almost entirely in storage/lmgr/predicate.c (~4 700 lines). The README-SSI in that directory is the authoritative design document; what follows distils the key architectural choices.

Six shared-memory objects underpin SSI:

flowchart TD
  PX["PredXactList\n(PredXact)\nactiveList + availableList\nSxactGlobalXmin, WritableSxactCount\nLastSxactCommitSeqNo"]
  SX["SERIALIZABLEXACT[]\n(pool allocated)\nvxid, topXid, xmin, flags\noutConflicts, inConflicts\npredicateLocks"]
  SXH["SerializableXidHash\nxid -> SERIALIZABLEXACT*"]
  PLT["PredicateLockTargetHash\nPREDICATELOCKTARGETTAG -> PREDICATELOCKTARGET\n(partitioned by target hash)"]
  PL["PredicateLockHash\n(PREDICATELOCKTARGET*, SERIALIZABLEXACT*) -> PREDICATELOCK"]
  SLRU["SerialSlruCtl\nold committed sxact state\nxid -> SerCommitSeqNo"]
  PX -- "pool" --> SX
  SX -- "lookup" --> SXH
  SX -- "lock list" --> PL
  PLT -- "anchor" --> PL
  PX -- "overflow" --> SLRU

PredXactList is a single shared struct (PredXact) holding global counters and two free/active lists of SERIALIZABLEXACT slots. SerializableXidHash maps a transaction’s top-level XID to its SERIALIZABLEXACT so that writers can locate a reader’s record quickly. The predicate lock tables are partitioned (like the heavyweight lock table) to reduce contention — NUM_PREDICATELOCK_PARTITIONS partitions, each guarded by its own LWLock.

SERIALIZABLEXACT — the per-transaction SSI record

Section titled “SERIALIZABLEXACT — the per-transaction SSI record”
// SERIALIZABLEXACT — src/include/storage/predicate_internals.h
typedef struct SERIALIZABLEXACT
{
VirtualTransactionId vxid;
SerCommitSeqNo prepareSeqNo;
SerCommitSeqNo commitSeqNo;
union {
SerCommitSeqNo earliestOutConflictCommit;
SerCommitSeqNo lastCommitBeforeSnapshot;
} SeqNo;
dlist_head outConflicts; /* rw-conflicts where this txn is the reader */
dlist_head inConflicts; /* rw-conflicts where this txn is the writer */
dlist_head predicateLocks; /* SIREAD locks held */
dlist_head possibleUnsafeConflicts; /* for READ ONLY safe-snapshot tracking */
TransactionId topXid;
TransactionId finishedBefore;
TransactionId xmin;
uint32 flags; /* SXACT_FLAG_* bitmask */
int pid;
int pgprocno;
} SERIALIZABLEXACT;

Key flags are SXACT_FLAG_COMMITTED, SXACT_FLAG_DOOMED (will abort at next check), SXACT_FLAG_READ_ONLY, SXACT_FLAG_RO_SAFE (eligible for early lock release), and SXACT_FLAG_CONFLICT_OUT (has a conflict out to a transaction that committed ahead of it).

// PREDICATELOCKTARGET — src/include/storage/predicate_internals.h
typedef struct PREDICATELOCKTARGET {
PREDICATELOCKTARGETTAG tag; /* db + rel + blkno + offset; field4 invalid => page; field3+4 invalid => relation */
dlist_head predicateLocks;
} PREDICATELOCKTARGET;
// PREDICATELOCK — src/include/storage/predicate_internals.h
typedef struct PREDICATELOCK {
PREDICATELOCKTAG tag; /* (myTarget*, myXact*) */
dlist_node targetLink;
dlist_node xactLink;
SerCommitSeqNo commitSeqNo; /* set when owner commits; used for summarization */
} PREDICATELOCK;

Each backend also maintains a private LocalPredicateLockHash (LOCALPREDICATELOCK) that tracks lock counts per granularity level without touching shared memory. This hash drives granularity promotion: when a transaction’s tuple-level lock count for a page exceeds max_predicate_locks_per_page, CheckAndPromotePredicateLockRequest upgrades them to a single page-level lock.

SIREAD lock acquisition at three granularities

Section titled “SIREAD lock acquisition at three granularities”

The three public entry points are thin wrappers that build a PREDICATELOCKTARGETTAG at the requested granularity and funnel into the common PredicateLockAcquire. They differ only in which SET_*TAG macro they call and in the early-out guards. PredicateLockTID carries the most logic, because a tuple lock is redundant if this transaction already wrote the tuple, or if a relation lock already covers it:

// PredicateLockTID — src/backend/storage/lmgr/predicate.c
void
PredicateLockTID(Relation relation, ItemPointer tid, Snapshot snapshot,
TransactionId tuple_xid)
{
PREDICATELOCKTARGETTAG tag;
if (!SerializationNeededForRead(relation, snapshot))
return;
/* If we wrote it; we already have a write lock. */
if (relation->rd_index == NULL)
if (TransactionIdIsCurrentTransactionId(tuple_xid))
return;
/* Quick (non-definitive) check for a relation-level lock first. */
SET_PREDICATELOCKTARGETTAG_RELATION(tag, relation->rd_locator.dbOid,
relation->rd_id);
if (PredicateLockExists(&tag))
return;
SET_PREDICATELOCKTARGETTAG_TUPLE(tag, relation->rd_locator.dbOid,
relation->rd_id,
ItemPointerGetBlockNumber(tid),
ItemPointerGetOffsetNumber(tid));
PredicateLockAcquire(&tag);
}

PredicateLockAcquire is where the SIREAD lock actually lands. Note the two short-circuits — already-held and covered-by-coarser — that make acquisition idempotent and implement the README rule “an attempt to lock a tuple when the same transaction already holds a lock on the page or relation will be ignored”:

// PredicateLockAcquire — src/backend/storage/lmgr/predicate.c
static void
PredicateLockAcquire(const PREDICATELOCKTARGETTAG *targettag)
{
uint32 targettaghash;
bool found;
LOCALPREDICATELOCK *locallock;
/* Do we have the lock already, or a covering lock? */
if (PredicateLockExists(targettag))
return;
if (CoarserLockCovers(targettag))
return;
targettaghash = PredicateLockTargetTagHashCode(targettag);
/* Record in the per-backend local table (drives promotion). */
locallock = (LOCALPREDICATELOCK *)
hash_search_with_hash_value(LocalPredicateLockHash, targettag,
targettaghash, HASH_ENTER, &found);
locallock->held = true;
if (!found)
locallock->childLocks = 0;
/* Actually create the lock in shared memory. */
CreatePredicateLock(targettag, targettaghash, MySerializableXact);
/* Promote to coarser granularity, or clean up finer locks. */
if (CheckAndPromotePredicateLockRequest(targettag))
; /* promoted: parent acquire deleted this lock and its children */
else if (GET_PREDICATELOCKTARGETTAG_TYPE(*targettag) != PREDLOCKTAG_TUPLE)
DeleteChildTargetLocks(targettag);
}

CheckAndPromotePredicateLockRequest walks the parent chain (tuple → page → relation) in the backend-private hash, incrementing each ancestor’s childLocks counter, and promotes to the coarsest ancestor whose child count has crossed MaxPredicateChildLocks (derived from max_predicate_locks_per_page / max_predicate_locks_per_relation):

// CheckAndPromotePredicateLockRequest — src/backend/storage/lmgr/predicate.c
static bool
CheckAndPromotePredicateLockRequest(const PREDICATELOCKTARGETTAG *reqtag)
{
PREDICATELOCKTARGETTAG targettag, nexttag, promotiontag;
LOCALPREDICATELOCK *parentlock;
bool found, promote = false;
targettag = *reqtag;
/* check parents iteratively */
while (GetParentPredicateLockTag(&targettag, &nexttag))
{
targettag = nexttag;
parentlock = (LOCALPREDICATELOCK *)
hash_search(LocalPredicateLockHash, &targettag, HASH_ENTER, &found);
if (!found) { parentlock->held = false; parentlock->childLocks = 1; }
else parentlock->childLocks++;
if (parentlock->childLocks > MaxPredicateChildLocks(&targettag))
{
promotiontag = targettag; /* keep climbing for child counts */
promote = true;
}
}
if (promote)
{
PredicateLockAcquire(&promotiontag); /* acquire coarsest eligible */
return true;
}
return false;
}
  1. Snapshot acquisitionGetSerializableTransactionSnapshot allocates a SERIALIZABLEXACT, links it into PredXact->activeList, and initializes outConflicts, inConflicts, and predicateLocks. If the transaction is read-only and no writable serializable transactions are active (WritableSxactCount == 0), it is immediately given a safe snapshot and never enters the tracking machinery.

  2. Read — SIREAD lock acquisition — the table AM calls PredicateLockRelation, PredicateLockPage, or PredicateLockTID. These call through to PredicateLockAcquire, which inserts a PREDICATELOCK in the shared hash and a LOCALPREDICATELOCK in the backend-private hash. If a coarser lock already covers the target, the acquisition is a no-op.

  3. Write — conflict-out detection — when the table AM reads a tuple version written by another concurrent transaction, it calls CheckForSerializableConflictOut. This looks up the writer’s SERIALIZABLEXACT and, if both transactions are concurrent, calls FlagRWConflict(MySerializableXact, writer_sxact) — recording a conflict where MySerializableXact is the reader (out-side).

  4. Write — conflict-in detection — when the table AM writes a tuple, it calls CheckForSerializableConflictIn, which scans the PredicateLockTargetHash for all SIREAD locks on that target. For each lock held by a concurrent serializable transaction, it calls FlagRWConflict(reader_sxact, MySerializableXact).

  5. Dangerous-structure checkFlagRWConflict calls OnConflict_CheckForSerializationFailure before recording the edge (full excerpt below in the dedicated subsection).

  6. Pre-commit checkPreCommit_CheckForSerializationFailure runs at commit time to catch the case where this transaction is Tout (committing first) and its in-side conflict partner has an in-conflict that would complete the dangerous structure.

  7. Lock release and summarizationReleasePredicateLocks runs at commit or rollback. On commit, PREDICATELOCK entries are not freed immediately; they are transferred to OldCommittedSxact (a shared dummy SERIALIZABLEXACT) and their commitSeqNo is set. When memory pressure builds, SummarizeOldestCommittedSxact pushes the oldest committed transaction’s conflict state to the SLRU-backed serial log (SerialSlruCtl) and reclaims its RAM. CheckForSerializableConflictOut checks the serial log (via SerialGetMinConflictCommitSeqNo) when a writer’s SERIALIZABLEXACT is no longer in the hash.

flowchart LR
  Tin["Tin\n(reader)"]
  Tpivot["Tpivot\n(reader+writer)"]
  Tout["Tout\n(writer, commits first)"]
  Tin -- "rw-conflict out" --> Tpivot
  Tpivot -- "rw-conflict out" --> Tout

Tout must commit before Tpivot and Tin for the structure to be dangerous (Theorem 2.1 of Fekete et al.). PostgreSQL exploits two additional optimizations from README-SSI:

  • Tout-commits-first guard: PostgreSQL never aborts a transaction until the out-side conflict partner has committed. This eliminates premature aborts.
  • Read-only optimization: if Tin is read-only, an anomaly is only possible if Tout committed before Tin acquired its snapshot. Read-only transactions that started after all concurrent writable serializable transactions have committed cleanly can opt out entirely (the SXACT_FLAG_RO_SAFE path in GetSerializableTransactionSnapshotInt).

An edge is a pooled RWConflictData taken from RWConflictPool and double-linked into the reader’s outConflicts and the writer’s inConflicts. There is no separate adjacency matrix — the graph is these two intrusive lists per SERIALIZABLEXACT. Exhaustion of the pool is a hard error, not a silent miss, because a lost edge would mean a missed anomaly:

// SetRWConflict — src/backend/storage/lmgr/predicate.c
static void
SetRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer)
{
RWConflict conflict;
Assert(reader != writer);
Assert(!RWConflictExists(reader, writer));
if (dlist_is_empty(&RWConflictPool->availableList))
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("not enough elements in RWConflictPool to record a read/write conflict"),
errhint("You might need to run fewer transactions at a time or increase \"max_connections\".")));
conflict = dlist_head_element(RWConflictData, outLink, &RWConflictPool->availableList);
dlist_delete(&conflict->outLink);
conflict->sxactOut = reader; /* edge tail = reader */
conflict->sxactIn = writer; /* edge head = writer */
dlist_push_tail(&reader->outConflicts, &conflict->outLink);
dlist_push_tail(&writer->inConflicts, &conflict->inLink);
}

Conflict-out: the reader sees a concurrent writer

Section titled “Conflict-out: the reader sees a concurrent writer”

CheckForSerializableConflictOut is called from the table AM when a scan reads a tuple version whose xmin/xmax belongs to another transaction. It looks the writer up by top-level XID — first in SerializableXidHash, then, if the writer has already been summarized out to the serial log, via SerialGetMinConflictCommitSeqNo. The read-only short-circuit implements the original optimization from README-SSI:

// CheckForSerializableConflictOut — src/backend/storage/lmgr/predicate.c
void
CheckForSerializableConflictOut(Relation relation, TransactionId xid, Snapshot snapshot)
{
/* ... lookup SERIALIZABLEXACT *sxact for top-level xid ... */
/* If this is read-only and the writer committed without an earlier
* out-conflict, the reader simply appears to run first: no conflict. */
if (SxactIsReadOnly(MySerializableXact)
&& SxactIsCommitted(sxact)
&& !SxactHasSummaryConflictOut(sxact)
&& (!SxactHasConflictOut(sxact)
|| MySerializableXact->SeqNo.lastCommitBeforeSnapshot < sxact->SeqNo.earliestOutConflictCommit))
{
LWLockRelease(SerializableXactHashLock);
return;
}
if (!XidIsConcurrent(xid)) { /* write was already in our snapshot */
LWLockRelease(SerializableXactHashLock);
return;
}
if (RWConflictExists(MySerializableXact, sxact)) { /* no duplicate edges */
LWLockRelease(SerializableXactHashLock);
return;
}
/* ... FlagRWConflict(MySerializableXact, sxact) — we are the reader ... */
}

Conflict-in: the writer probes existing SIREAD locks

Section titled “Conflict-in: the writer probes existing SIREAD locks”

CheckForSerializableConflictIn is the mirror image, called when a tuple is written. It probes the predicate-lock hash from finest to coarsest granularity (tuple, then page, then relation) — the order matters, because a concurrent granularity promotion always acquires the coarser lock before releasing the finer ones, so checking fine→coarse can never miss:

// CheckForSerializableConflictIn — src/backend/storage/lmgr/predicate.c
void
CheckForSerializableConflictIn(Relation relation, ItemPointer tid, BlockNumber blkno)
{
PREDICATELOCKTARGETTAG targettag;
if (!SerializationNeededForWrite(relation))
return;
if (SxactIsDoomed(MySerializableXact))
ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot, during conflict in */ ...));
MyXactDidWrite = true;
if (tid != NULL) { /* finest: tuple */
SET_PREDICATELOCKTARGETTAG_TUPLE(targettag, relation->rd_locator.dbOid,
relation->rd_id, ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid));
CheckTargetForConflictsIn(&targettag);
}
if (blkno != InvalidBlockNumber) { /* page */
SET_PREDICATELOCKTARGETTAG_PAGE(targettag, relation->rd_locator.dbOid,
relation->rd_id, blkno);
CheckTargetForConflictsIn(&targettag);
}
SET_PREDICATELOCKTARGETTAG_RELATION(targettag, relation->rd_locator.dbOid,
relation->rd_id); /* coarsest: relation */
CheckTargetForConflictsIn(&targettag);
}

CheckTargetForConflictsIn does the per-target inner loop: for every SIREAD lock on the target held by a still-relevant concurrent transaction (not doomed; either uncommitted or committed recently enough that our snapshot xmin predates its finishedBefore), it records the inbound edge via FlagRWConflict(reader, me). It also drops our own SIREAD lock on a tuple we are now writing — the README’s “write lock subsumes the SIREAD lock” optimization:

// CheckTargetForConflictsIn — src/backend/storage/lmgr/predicate.c
dlist_foreach_modify(iter, &target->predicateLocks)
{
PREDICATELOCK *predlock = dlist_container(PREDICATELOCK, targetLink, iter.cur);
SERIALIZABLEXACT *sxact = predlock->tag.myXact;
if (sxact == MySerializableXact) {
/* our own SIREAD lock on a tuple we're writing — defer removal */
if (!IsSubTransaction() && GET_PREDICATELOCKTARGETTAG_OFFSET(*targettag))
{ mypredlock = predlock; mypredlocktag = predlock->tag; }
}
else if (!SxactIsDoomed(sxact)
&& (!SxactIsCommitted(sxact)
|| TransactionIdPrecedes(GetTransactionSnapshot()->xmin, sxact->finishedBefore))
&& !RWConflictExists(sxact, MySerializableXact))
{
/* upgrade to exclusive, re-check, then flag the inbound edge */
FlagRWConflict(sxact, MySerializableXact);
}
}

Flagging and the dangerous-structure detector

Section titled “Flagging and the dangerous-structure detector”

FlagRWConflict is the single choke point: it first asks OnConflict_CheckForSerializationFailure whether adding this edge completes a pivot, then either records a real edge or, when one party has already been summarized into OldCommittedSxact, sets a summary-conflict flag instead of allocating an RWConflictData:

// FlagRWConflict — src/backend/storage/lmgr/predicate.c
static void
FlagRWConflict(SERIALIZABLEXACT *reader, SERIALIZABLEXACT *writer)
{
Assert(reader != writer);
/* First, see if this conflict causes failure. */
OnConflict_CheckForSerializationFailure(reader, writer);
/* Actually do the conflict flagging. */
if (reader == OldCommittedSxact)
writer->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
else if (writer == OldCommittedSxact)
reader->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
else
SetRWConflict(reader, writer);
}

The detector checks the three ways the new reader ─rw─► writer edge can close a dangerous structure. The third case — where the reader becomes the pivot (T0 ─rw─► reader ─rw─► writer) — is the subtle one; it walks the reader’s existing inConflicts looking for a T0 that has not safely committed ahead of the writer:

// OnConflict_CheckForSerializationFailure — src/backend/storage/lmgr/predicate.c
static void
OnConflict_CheckForSerializationFailure(const SERIALIZABLEXACT *reader,
SERIALIZABLEXACT *writer)
{
bool failure = false;
Assert(LWLockHeldByMe(SerializableXactHashLock));
/* Case 1: writer already committed and has a conflict out (R->W->T2). */
if (SxactIsCommitted(writer)
&& (SxactHasConflictOut(writer) || SxactHasSummaryConflictOut(writer)))
failure = true;
/* Case 2: writer is the pivot — walk its outConflicts for a T2 that
* committed/prepared first (R->W->T2). */
if (!failure && SxactHasSummaryConflictOut(writer))
failure = true;
else if (!failure) {
dlist_foreach(iter, &writer->outConflicts) {
SERIALIZABLEXACT *t2 = dlist_container(RWConflictData, outLink, iter.cur)->sxactIn;
if (SxactIsPrepared(t2)
&& (!SxactIsCommitted(reader) || t2->prepareSeqNo <= reader->commitSeqNo)
&& (!SxactIsCommitted(writer) || t2->prepareSeqNo <= writer->commitSeqNo)
&& (!SxactIsReadOnly(reader) || t2->prepareSeqNo <= reader->SeqNo.lastCommitBeforeSnapshot))
{ failure = true; break; }
}
}
/* Case 3: reader is the pivot — walk its inConflicts for a T0
* (T0->R->W), writer prepared, reader not read-only. */
if (!failure && SxactIsPrepared(writer) && !SxactIsReadOnly(reader)) {
if (SxactHasSummaryConflictIn(reader)) failure = true;
else dlist_foreach(iter, &unconstify(SERIALIZABLEXACT *, reader)->inConflicts) {
const SERIALIZABLEXACT *t0 = dlist_container(RWConflictData, inLink, iter.cur)->sxactOut;
if (!SxactIsDoomed(t0)
&& (!SxactIsCommitted(t0) || t0->commitSeqNo >= writer->prepareSeqNo)
&& (!SxactIsReadOnly(t0) || t0->SeqNo.lastCommitBeforeSnapshot >= writer->prepareSeqNo))
{ failure = true; break; }
}
}
if (failure) {
if (MySerializableXact == writer)
ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot during write */ ...));
else if (SxactIsPrepared(writer))
ereport(ERROR, ...); /* writer prepared: we must be the reader */
else
writer->flags |= SXACT_FLAG_DOOMED; /* defer abort to writer's commit */
}
}

The SXACT_FLAG_DOOMED branch is the heart of the “Tout commits first” discipline: rather than aborting immediately when the pattern appears, PostgreSQL marks the pivot doomed and lets the actual abort fire later (in CheckForSerializableConflictIn/Out or PreCommit_*), by which time it is certain the out-side partner has committed first.

flowchart TD
  READ["table AM reads a tuple<br/>written by another xact"] --> COUT["CheckForSerializableConflictOut"]
  WRITE["table AM writes a tuple"] --> CIN["CheckForSerializableConflictIn"]
  COUT --> COUTQ{"writer concurrent<br/>and no dup edge?"}
  COUTQ -->|"yes"| FLAG["FlagRWConflict<br/>(reader=me, writer)"]
  CIN --> CTGT["CheckTargetForConflictsIn<br/>tuple -> page -> relation"]
  CTGT --> CINQ{"SIREAD lock by live<br/>concurrent xact?"}
  CINQ -->|"yes"| FLAG2["FlagRWConflict<br/>(reader, writer=me)"]
  FLAG --> OCF["OnConflict_CheckForSerializationFailure"]
  FLAG2 --> OCF
  OCF --> PIVOT{"dangerous structure?<br/>Tin rw Tpivot rw Tout<br/>Tout committed first"}
  PIVOT -->|"no"| REC["SetRWConflict<br/>record edge in pool"]
  PIVOT -->|"yes, I am writer"| ERR["ereport 40001<br/>serialization_failure"]
  PIVOT -->|"yes, writer prepared"| ERR
  PIVOT -->|"yes, otherwise"| DOOM["writer.flags |= SXACT_FLAG_DOOMED<br/>abort deferred to commit"]

A committing transaction might be the Tout of a structure whose pivot and Tin have not yet committed — a case the incremental edge check could not have fired, because at edge-recording time neither partner had committed. PreCommit_CheckForSerializationFailure does the final sweep, dooming the pivot (the near conflict) rather than the committing transaction so that the committer makes progress and a retry of the victim can succeed:

// PreCommit_CheckForSerializationFailure — src/backend/storage/lmgr/predicate.c
void
PreCommit_CheckForSerializationFailure(void)
{
dlist_iter near_iter;
if (MySerializableXact == InvalidSerializableXact)
return;
Assert(IsolationIsSerializable());
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
if (SxactIsDoomed(MySerializableXact) && !SxactIsPartiallyReleased(MySerializableXact)) {
LWLockRelease(SerializableXactHashLock);
ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), /* pivot during commit */ ...));
}
/* For each in-edge (nearConflict->sxactOut = the pivot), look for a
* far edge that closes T_far -> pivot -> me, with the far partner
* still able to be the apparent-first transaction. */
dlist_foreach(near_iter, &MySerializableXact->inConflicts) {
RWConflict nearConflict = dlist_container(RWConflictData, inLink, near_iter.cur);
if (!SxactIsCommitted(nearConflict->sxactOut) && !SxactIsDoomed(nearConflict->sxactOut)) {
dlist_foreach(far_iter, &nearConflict->sxactOut->inConflicts) {
RWConflict farConflict = dlist_container(RWConflictData, inLink, far_iter.cur);
if (farConflict->sxactOut == MySerializableXact
|| (!SxactIsCommitted(farConflict->sxactOut)
&& !SxactIsReadOnly(farConflict->sxactOut)
&& !SxactIsDoomed(farConflict->sxactOut))) {
if (SxactIsPrepared(nearConflict->sxactOut)) {
LWLockRelease(SerializableXactHashLock);
ereport(ERROR, ...); /* pivot already prepared: commit suicide */
}
nearConflict->sxactOut->flags |= SXACT_FLAG_DOOMED; /* doom the pivot */
break;
}
}
}
}
MySerializableXact->prepareSeqNo = ++(PredXact->LastSxactCommitSeqNo);
MySerializableXact->flags |= SXACT_FLAG_PREPARED;
LWLockRelease(SerializableXactHashLock);
}
flowchart TD
  T["PredicateLockTID\n(tuple)"] --> P["PredicateLockPage\n(page)"]
  P --> R["PredicateLockRelation\n(relation)"]
  T -->|"childLocks > max_predicate_locks_per_page"| CAP["CheckAndPromotePredicateLockRequest\ndelete child locks,\ncreate coarser lock"]
  P -->|"childLocks > max_predicate_locks_per_relation"| CAP

GUC knobs max_predicate_locks_per_xact, max_predicate_locks_per_relation, and max_predicate_locks_per_page control when promotion fires. PredicateLockShmemSize sizes the shared hash tables from these values.

There are two distinct “opt-out” paths for read-only transactions, both grounded in the read-only theorem from the Theoretical Background.

The immediate opt-out happens inside GetSerializableTransactionSnapshotInt: if the transaction is READ ONLY and no writable serializable transaction is currently active (WritableSxactCount == 0), it can never join a dangerous structure, so it releases its SERIALIZABLEXACT slot and runs as plain SI with zero SSI bookkeeping. A second check fires after registering possible conflicts — if every concurrent writable transaction turned out to be doomed, the list of possibleUnsafeConflicts is empty and the same opt-out applies:

// GetSerializableTransactionSnapshotInt — src/backend/storage/lmgr/predicate.c
if (XactReadOnly)
{
sxact->flags |= SXACT_FLAG_READ_ONLY;
/* Register all concurrent r/w transactions as possible conflicts; if
* they all commit without an outgoing conflict to an earlier txn, this
* snapshot is safe and we can run without tracking predicate locks. */
dlist_foreach(iter, &PredXact->activeList)
{
othersxact = dlist_container(SERIALIZABLEXACT, xactLink, iter.cur);
if (!SxactIsCommitted(othersxact)
&& !SxactIsDoomed(othersxact)
&& !SxactIsReadOnly(othersxact))
SetPossibleUnsafeConflict(sxact, othersxact);
}
if (dlist_is_empty(&sxact->possibleUnsafeConflicts))
{
ReleasePredXact(sxact); /* every writer was doomed */
LWLockRelease(SerializableXactHashLock);
return snapshot; /* immediate opt-out */
}
}

The deferred path is GetSafeSnapshot, used for READ ONLY DEFERRABLE. Instead of running with tracking, it waits — sleeping on WAIT_EVENT_SAFE_SNAPSHOT — until the concurrent writers either drain (possibleUnsafeConflicts empties, meaning a safe snapshot is now available) or one of them flags this transaction RO_UNSAFE, in which case it retries with a fresh snapshot. The cost is latency; the payoff is a read-only transaction that is provably anomaly-free and tracks nothing:

// GetSafeSnapshot — src/backend/storage/lmgr/predicate.c
static Snapshot
GetSafeSnapshot(Snapshot origSnapshot)
{
Snapshot snapshot;
Assert(XactReadOnly && XactDeferrable);
while (true)
{
snapshot = GetSerializableTransactionSnapshotInt(origSnapshot, NULL, InvalidPid);
if (MySerializableXact == InvalidSerializableXact)
return snapshot; /* no concurrent r/w xacts; already safe */
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
MySerializableXact->flags |= SXACT_FLAG_DEFERRABLE_WAITING;
while (!(dlist_is_empty(&MySerializableXact->possibleUnsafeConflicts)
|| SxactIsROUnsafe(MySerializableXact)))
{
LWLockRelease(SerializableXactHashLock);
ProcWaitForSignal(WAIT_EVENT_SAFE_SNAPSHOT);
LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
}
MySerializableXact->flags &= ~SXACT_FLAG_DEFERRABLE_WAITING;
if (!SxactIsROUnsafe(MySerializableXact)) {
LWLockRelease(SerializableXactHashLock);
break; /* success: snapshot is safe */
}
LWLockRelease(SerializableXactHashLock);
ReleasePredicateLocks(false, false); /* unsafe; retry with a new snapshot */
}
Assert(SxactIsROSafe(MySerializableXact));
ReleasePredicateLocks(false, true);
return snapshot;
}
flowchart TD
  START["READ ONLY DEFERRABLE<br/>GetSafeSnapshot"] --> SNAP["GetSerializableTransactionSnapshotInt"]
  SNAP --> NOXACT{"MySerializableXact<br/>== Invalid?"}
  NOXACT -->|"yes: no concurrent r/w"| SAFE["return snapshot<br/>(SXACT_FLAG_RO_SAFE)"]
  NOXACT -->|"no"| WAIT["set DEFERRABLE_WAITING<br/>ProcWaitForSignal"]
  WAIT --> COND{"possibleUnsafeConflicts<br/>empty?"}
  COND -->|"yes"| SAFE
  COND -->|"flagged RO_UNSAFE"| RETRY["ReleasePredicateLocks<br/>retry new snapshot"]
  RETRY --> SNAP

### Index AM integration

B-tree, GIN, GiST, and hash index scans each call a different variant
of `PredicateLock*` to cover the "gaps" between index entries — the
ranges that would be affected by a later insert. B-tree and GIN lock
leaf pages; GiST locks pages at every level (because splits can ripple
up); hash locks the primary bucket page. For index AMs without predicate
lock support, the entire index relation is locked. Page splits and
combines in all AMs call `PredicateLockPageSplit` /
`PredicateLockPageCombine` to transfer locks to new pages so no rw-
conflicts are silently lost.

### Locking hierarchy

The LWLocks that guard SSI state must be acquired in a fixed order to
prevent latch-ordering deadlocks (from the predicate.c file header):

SerializableFinishedListLock SerializablePredicateListLock perXactPredicateListLock (parallel query only) PredicateLockHashPartitionLock(hashcode) SerializableXactHashLock SerialControlLock SLRU per-bank locks

### Two-phase commit
`AtPrepare_PredicateLocks` serializes a transaction's predicate lock state
into the 2PC state file as `TwoPhasePredicateRecord` entries (one
`TWOPHASEPREDICATERECORD_XACT` plus one `TWOPHASEPREDICATERECORD_LOCK` per
lock). `predicatelock_twophase_recover` reconstructs the state on recovery.
The conflict lists are summarized via flags (`SXACT_FLAG_SUMMARY_CONFLICT_IN`
/ `SXACT_FLAG_SUMMARY_CONFLICT_OUT`) rather than pointers, since the
originating `SERIALIZABLEXACT` objects will not exist after restart.
## Source Walkthrough
### Shared-memory init
| Symbol | Role |
|---|---|
| `PredicateLockShmemInit` | allocates all SSI shared-memory objects at startup |
| `PredicateLockShmemSize` | computes required shared memory from GUCs |
| `PredXact` (global `PredXactList`) | root pointer to the active/available sxact pool |
| `SerializableXidHash` | XID → `SERIALIZABLEXACT*` lookup |
| `PredicateLockTargetHash` | `PREDICATELOCKTARGETTAG` → `PREDICATELOCKTARGET*` |
| `PredicateLockHash` | `(target*, sxact*)` → `PREDICATELOCK*` |
| `SerialSlruCtl` | SLRU buffer for old committed transaction state |
| `OldCommittedSxact` | shared dummy sxact that accumulates summarized locks |
### Transaction lifecycle
| Symbol | Role |
|---|---|
| `GetSerializableTransactionSnapshot` | public entry; calls `GetSerializableTransactionSnapshotInt` |
| `GetSerializableTransactionSnapshotInt` | allocates `SERIALIZABLEXACT`, populates `possibleUnsafeConflicts` for READ ONLY; immediate opt-out when `WritableSxactCount == 0` |
| `GetSafeSnapshot` | `READ ONLY DEFERRABLE` path; waits on `WAIT_EVENT_SAFE_SNAPSHOT` for a safe snapshot |
| `SetPossibleUnsafeConflict` | links a RO transaction to a concurrent writer in `possibleUnsafeConflicts` |
| `RegisterPredicateLockingXid` | assigns `topXid` to `MySerializableXact` on first write |
| `SerializationNeededForRead` | inline gate: skip if not serializable, not MVCC snapshot, or RO-safe |
| `SerializationNeededForWrite` | inline gate for write paths |
| `ReleasePredicateLocks` | commit/rollback cleanup; transfers locks to `OldCommittedSxact` |
| `SummarizeOldestCommittedSxact` | pushes oldest finished sxact to SLRU, reclaims RAM |
### SIREAD lock acquisition
| Symbol | Role |
|---|---|
| `PredicateLockRelation` | acquires a relation-granularity SIREAD lock |
| `PredicateLockPage` | acquires a page-granularity SIREAD lock |
| `PredicateLockTID` | acquires a tuple-granularity SIREAD lock |
| `PredicateLockAcquire` | core insertion into `PredicateLockTargetHash` + `PredicateLockHash` |
| `CoarserLockCovers` | checks whether a coarser lock already subsumes the requested target |
| `CheckAndPromotePredicateLockRequest` | granularity escalation decision using `LocalPredicateLockHash` |
| `DeleteChildTargetLocks` | removes finer-grained locks after promotion |
| `LocalPredicateLockHash` | per-backend private hash (`LOCALPREDICATELOCK`) for promotion heuristics |
### Conflict detection
| Symbol | Role |
|---|---|
| `CheckForSerializableConflictOut` | reader path: writer's xid → lookup sxact → `FlagRWConflict(me, writer)` |
| `CheckForSerializableConflictIn` | writer path: scan predicate locks on target → `FlagRWConflict(reader, me)` |
| `CheckTargetForConflictsIn` | inner loop of `CheckForSerializableConflictIn` for one target tag |
| `CheckTableForSerializableConflictIn` | called by CLUSTER/REINDEX to conflict-check whole table |
| `FlagRWConflict` | calls failure check, then `SetRWConflict` |
| `SetRWConflict` | allocates `RWConflictData` from pool, links into `outConflicts`/`inConflicts` |
| `OnConflict_CheckForSerializationFailure` | dangerous-structure detector; sets `SXACT_FLAG_DOOMED` or raises ERROR |
| `PreCommit_CheckForSerializationFailure` | commit-time check for Tout role |
### Serial log (SLRU overflow)
| Symbol | Role |
|---|---|
| `SerialControlData` | `headPage`, `headXid`, `tailXid` for the SLRU window |
| `SerialAdd` | writes a committed sxact's `minConflictCommitSeqNo` to SLRU |
| `SerialGetMinConflictCommitSeqNo` | reads the serial log for an old xid's conflict info |
| `SerialSetActiveSerXmin` | advances the SLRU tail (reclaims old pages) |
### Key structs and macros
| Symbol | Role |
|---|---|
| `SERIALIZABLEXACT` | per-serializable-transaction SSI bookkeeping |
| `PredXactListData` | global counters + sxact pool lists |
| `RWConflictData` | one directed rw-conflict edge (reader → writer) |
| `PREDICATELOCKTARGET` | one lockable physical object (relation/page/tuple) |
| `PREDICATELOCK` | one (target, sxact) lock binding |
| `LOCALPREDICATELOCK` | per-backend private lock count for promotion heuristics |
| `SerCommitSeqNo` | monotone `uint64` ordering commits for conflict age tests |
| `SxactIsCommitted`, `SxactIsDoomed`, `SxactIsROSafe`, … | `SXACT_FLAG_*` test macros |
| `TargetTagIsCoveredBy` | macro: true when one tag is at a coarser granularity covering another |
| `PredicateLockHashPartitionLock` | maps a target hash code to its partition LWLock |
## Source verification (as of 2026-06-05)
Branch `REL_18_STABLE`, commit `273fe94`.
| Symbol | File | Line |
|---|---|---|
| `SERIALIZABLEXACT` | `src/include/storage/predicate_internals.h` | 58 |
| `PredXactListData` | `src/include/storage/predicate_internals.h` | 144 |
| `RWConflictData` | `src/include/storage/predicate_internals.h` | 193 |
| `PREDICATELOCKTARGETTAG` | `src/include/storage/predicate_internals.h` | 267 |
| `PREDICATELOCKTARGET` | `src/include/storage/predicate_internals.h` | 284 |
| `PREDICATELOCK` | `src/include/storage/predicate_internals.h` | 317 |
| `LOCALPREDICATELOCK` | `src/include/storage/predicate_internals.h` | 347 |
| `PredicateLockTargetType` | `src/include/storage/predicate_internals.h` | 361 |
| `SerialControlData` | `src/backend/storage/lmgr/predicate.c` | 345 |
| `SerializationNeededForRead` | `src/backend/storage/lmgr/predicate.c` | 516 |
| `SerializationNeededForWrite` | `src/backend/storage/lmgr/predicate.c` | 560 |
| `SetRWConflict` | `src/backend/storage/lmgr/predicate.c` | 643 |
| `SetPossibleUnsafeConflict` | `src/backend/storage/lmgr/predicate.c` | 666 |
| `PredicateLockShmemInit` | `src/backend/storage/lmgr/predicate.c` | 1145 |
| `PredicateLockShmemSize` | `src/backend/storage/lmgr/predicate.c` | 1357 |
| `GetSafeSnapshot` | `src/backend/storage/lmgr/predicate.c` | 1558 |
| `GetSerializableTransactionSnapshot` | `src/backend/storage/lmgr/predicate.c` | 1682 |
| `GetSerializableTransactionSnapshotInt` | `src/backend/storage/lmgr/predicate.c` | 1764 |
| `RegisterPredicateLockingXid` | `src/backend/storage/lmgr/predicate.c` | 1959 |
| `CheckAndPromotePredicateLockRequest` | `src/backend/storage/lmgr/predicate.c` | 2326 |
| `PredicateLockAcquire` | `src/backend/storage/lmgr/predicate.c` | 2517 |
| `PredicateLockRelation` | `src/backend/storage/lmgr/predicate.c` | 2576 |
| `PredicateLockPage` | `src/backend/storage/lmgr/predicate.c` | 2599 |
| `PredicateLockTID` | `src/backend/storage/lmgr/predicate.c` | 2621 |
| `PredicateLockPageSplit` | `src/backend/storage/lmgr/predicate.c` | 3144 |
| `PredicateLockPageCombine` | `src/backend/storage/lmgr/predicate.c` | 3229 |
| `ReleasePredicateLocks` | `src/backend/storage/lmgr/predicate.c` | 3312 |
| `CheckForSerializableConflictOut` | `src/backend/storage/lmgr/predicate.c` | 4023 |
| `CheckTargetForConflictsIn` | `src/backend/storage/lmgr/predicate.c` | 4166 |
| `CheckForSerializableConflictIn` | `src/backend/storage/lmgr/predicate.c` | 4336 |
| `CheckTableForSerializableConflictIn` | `src/backend/storage/lmgr/predicate.c` | 4419 |
| `FlagRWConflict` | `src/backend/storage/lmgr/predicate.c` | 4501 |
| `OnConflict_CheckForSerializationFailure` | `src/backend/storage/lmgr/predicate.c` | 4536 |
| `PreCommit_CheckForSerializationFailure` | `src/backend/storage/lmgr/predicate.c` | 4703 |
## Beyond PostgreSQL — Comparative Designs & Research Frontiers
**S2PL vs. SSI.** The classic alternative is Strict Two-Phase Locking
(S2PL): every read acquires a shared lock, every write acquires an
exclusive lock, all held until commit. S2PL is trivially serializable but
blocks reads against concurrent writers. SSI avoids blocking reads
entirely — at the cost of occasionally aborting a transaction when a
dangerous structure is detected. For read-heavy workloads SSI wins on
throughput; for write-heavy, high-conflict workloads S2PL may be
preferable because fewer transactions are aborted after doing work.
**CockroachDB and Spanner.** CockroachDB uses a variant of SSI adapted for
distributed transactions, adding a *write-intent* mechanism for
write-write conflicts and a *timestamp cache* that plays the role of
predicate locks. Google Spanner uses strict two-phase locking with
external-consistency timestamps; it does not implement SSI.
**MySQL/InnoDB** implements SERIALIZABLE as S2PL (shared locks on all
reads), not SSI. It also uses next-key locking for gap coverage, which
is a physical implementation of predicate locking for B-tree ranges.
InnoDB's `trx_sys` carries per-transaction conflict state analogous to
`SERIALIZABLEXACT`, but without the dangerous-structure check.
**Research: reducing false positives.** The PostgreSQL implementation
tracks only rw-conflicts and performs a simple pattern match. More
precise implementations (e.g., port of the Cahill prototype to a
commercial engine) maintain a richer graph and can sometimes prove that a
dangerous structure is not embedded in a true cycle, reducing false
aborts. The tradeoff is higher bookkeeping cost per conflict edge.
**Research: write skew vs. lost update.** A separate class of anomalies —
the lost update — is prevented by PostgreSQL's first-updater-wins
mechanism in the heap AM (`HeapTupleUpdated` path in `heap_update`), not
by SSI. SSI targets write-skew and its generalizations; lost updates are
handled outside the predicate locking layer. See
`postgres-mvcc-snapshots.md` for the interaction between snapshot
isolation and these anomaly types.
**The lineage of the theory.** The four papers cited here form a chain.
Berenson et al. (SIGMOD 1995) showed ANSI's phenomena-based definitions
fail to pin down SI and that SI permits anomalies (A5B write skew) the
standard's "SERIALIZABLE" label implies are excluded — establishing the
problem. Fekete et al. (TODS 2005) supplied the structural theorem: every
SI-schedule cycle contains two consecutive rw-antidependencies with a
pivot, and `Tout` commits first (Theorem 2.1) — establishing what to watch
for. Cahill, Röhm & Fekete (SIGMOD 2008) turned the theorem into an
incremental, low-overhead runtime monitor with a small false-positive rate
— establishing how to watch cheaply. Ports & Grittner (VLDB 2012)
documented the PostgreSQL realization and its original refinements: the
SLRU-backed serial log for bounded RAM, the multi-granularity SIREAD
locks, subtransaction handling on top-level XIDs, the safe-snapshot /
DEFERRABLE path, and the victim-selection rule that dooms the pivot so a
retry can succeed. README-SSI's "Innovations" section is the in-tree
counterpart to that VLDB paper.
**False positives, quantified.** README-SSI is candid that the monitor is
conservative: "not every dangerous structure is embedded in an actual
cycle." Two design choices keep the false-positive rate low without a full
cycle search: (1) the Theorem-2.1 commit-order guard never aborts until the
out-side partner commits; (2) the rw-conflict graph is kept as explicit
`RWConflictData` edge lists rather than the paper-prototype's
conflictIn/conflictOut booleans, so the detector can distinguish "has *a*
conflict out" from "has a conflict out to *this specific* earlier
transaction." Coarse predicate-lock granularity is the other false-positive
source — a relation-level SIREAD lock conflicts with any write to the
table — and is tunable via the `max_predicate_locks_per_*` GUCs and via
plan choice (an index scan may lock only a few leaf pages where a seq scan
locks the whole relation).
**SSI in other engines.** Beyond CockroachDB (timestamp cache as a
predicate-lock surrogate) and Spanner (S2PL + TrueTime, no SSI), the
academic line continued with *serializable* variants that reduce aborts:
PSSI (Precisely Serializable Snapshot Isolation, Revilak et al.) tracks the
real dependency graph to eliminate the rw-only false positives PostgreSQL
accepts, at higher bookkeeping cost; and various OCC-based MVCC engines
(Hekaton, Silo) validate read sets at commit rather than tracking edges
incrementally. The PostgreSQL choice — incremental rw-edge tracking with a
conservative pivot test — trades some precision for a constant-factor
per-conflict cost and a bounded-RAM story via SLRU spill, which suits a
disk-based engine with long-lived transactions better than the in-RAM
read-set validation of a main-memory OCC system.
**The `knowledge/research/dbms-papers/` anchors for this doc** are:
`occ.md` (optimistic concurrency control theory, a precursor framework for
SSI analysis), `scalable-lock-manager.md` (lock-manager scalability, the
contrast point for SIREAD's non-blocking design), and the primary
Cahill/Fekete/Ports-Grittner papers listed in README-SSI [1][2].
## Sources
- `src/backend/storage/lmgr/predicate.c` — full SSI implementation
- `src/include/storage/predicate_internals.h` — shared-memory struct definitions
- `src/include/storage/predicate.h` — public interface declarations
- `src/backend/storage/lmgr/README-SSI` — in-tree design document; the primary reference
- Cahill, Röhm, Fekete: *Serializable isolation for snapshot databases*, SIGMOD 2008 ([1] in README-SSI)
- Fekete, Liarokapis, O'Neil, O'Neil, Shasha: *Making Snapshot Isolation Serializable*, ACM TODS 30:2, 2005 ([2] in README-SSI; Theorem 2.1)
- Ports, Grittner: *Serializable Snapshot Isolation in PostgreSQL*, VLDB 2012 (the PostgreSQL realization and its refinements)
- Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil: *A Critique of ANSI SQL Isolation Levels*, SIGMOD 1995 (why SI is not SERIALIZABLE; A5B write skew)
- Hellerstein, Stonebraker, Hamilton: *Architecture of a Database System*, §6.5.3 next-key locking ([3] in README-SSI)
- `knowledge/research/dbms-papers/occ.md`
- `knowledge/research/dbms-papers/scalable-lock-manager.md` — lock-manager scalability; contrast for the non-blocking SIREAD design
- `knowledge/code-analysis/postgres/postgres-lock-manager.md` — heavyweight lock manager (contrasted here)
- `knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md` — snapshot machinery that SSI hooks into
- `knowledge/code-analysis/postgres/postgres-xact.md` — transaction lifecycle that calls the SSI entry points
- `knowledge/code-analysis/postgres/postgres-slru.md` — SLRU subsystem used by the serial log