PostgreSQL Cache Invalidation — sinval Message Queue, inval.c Dispatcher, and Transactional Deferral
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A database engine that caches metadata (catalog tuples, relation descriptors, parsed plan trees) gains orders of magnitude in lookup speed relative to going to disk on every access. The cost is coherence: a backend with a stale cached entry will see the old schema for a table whose columns have just been renamed by another session. Every multi-process engine must solve the problem of propagating metadata changes to all caches that may hold a copy of the affected entry.
Database System Concepts (Silberschatz, 7e, ch. 25 §“Buffer Management”) identifies cache invalidation as a classical distributed consistency problem reduced by two architectural choices that nearly every engine makes: (1) caches are per-process rather than shared, so the number of parties that must be notified is bounded by the session count; (2) invalidation is lazy rather than eager — the committing session broadcasts what changed and each receiving session discards its stale copy at the next safe moment rather than synchronously. This avoids the cost of blocking DDL on all live sessions.
Database Internals (Petrov, ch. 6 §“System Catalogs”) notes that the transition from eager to lazy invalidation introduces a correctness constraint: a receiving backend must not return a stale cached entry to its own query even before it has processed messages from other sessions. This is the command-boundary deferral problem: within a transaction, a backend’s own catalog changes must also flow through a local invalidation path so the backend sees its own DDL immediately.
Two design axes determine the shape of every lazy invalidation system:
- Broadcast medium. Is the channel a per-backend signal (push), a shared ring buffer that all backends poll (pull), or a combination? A pure push system delivers instantly but requires O(N) signals per commit, where N is the session count. A pure pull system needs no signaling, but a slow backend drifts arbitrarily far behind, potentially blocking garbage collection of old messages. Real systems use a ring buffer with a catchup signal sent to slow backends.
- Granularity of the invalidation message. Does the message name a specific tuple (by hash key) in a specific cache, or does it name the whole cache? Fine-grained messages allow selective eviction; coarse messages (reset the whole cache) are cheaper to generate and simpler to process when the buffer overflows. Real systems support both: per-tuple messages for normal DDL, whole-cache resets for overflow recovery.
PostgreSQL uses a fixed-size shared ring buffer (4096 slots, sinval)
combined with a catchup interrupt mechanism and a whole-cache reset
fallback for overflow. The invalidation dispatcher (inval.c) collects
pending messages during a transaction and defers broadcasting them until
commit, while processing them locally at command boundaries.
Common DBMS Design
The three-tier flow: generate → defer → broadcast
Nearly every lazy-invalidation system in a multi-process RDBMS follows the same three-tier architecture:
Tier 1 — generation. The code that mutates catalog tuples (heap insert, update, delete) calls an invalidation registration function immediately. The message is not sent yet; it is accumulated in a per-transaction buffer. This keeps invalidation generation collocated with the mutation site without coupling mutation latency to cross-process signaling.
Tier 2 — command-local application. At each command boundary
(after CommandCounterIncrement()), pending messages from the current
command are applied to the local caches only. The reason: the mutating
backend’s own next command must see the new catalog state. The messages
remain in the per-transaction buffer for later broadcast; they are not
sent to other backends yet.
Tier 3 — broadcast at commit. At transaction commit, all accumulated messages are pushed into the shared medium (ring buffer, log, or broadcast channel). Other backends drain this medium at their next transaction start. Abort discards Tier 1 messages (never sent) and reverses the Tier 2 local applications.
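The three tiers can be sketched as a small per-transaction buffer. This is a minimal illustration, not PostgreSQL code: every `Toy*` / `toy_*` name is hypothetical, the buffer is a fixed array, and "broadcast" and "local apply" are just callbacks supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

/* Hypothetical toy types; none of these names come from PostgreSQL. */
typedef struct { int cache_id; unsigned key_hash; } ToyMsg;

typedef struct {
    ToyMsg msgs[TOY_MAX_PENDING];
    size_t applied;   /* msgs[0..applied) were already applied locally (tier 2) */
    size_t count;     /* msgs[0..count) have been registered (tier 1) */
} ToyTxnBuffer;

typedef void (*ToySink)(const ToyMsg *msg, void *arg);

/* Tier 1: record the message next to the mutation; nothing is sent. */
static void toy_register(ToyTxnBuffer *buf, int cache_id, unsigned key_hash)
{
    assert(buf->count < TOY_MAX_PENDING);
    buf->msgs[buf->count++] = (ToyMsg){ cache_id, key_hash };
}

/* Tier 2: at a command boundary, apply the not-yet-applied tail locally. */
static void toy_command_end(ToyTxnBuffer *buf, ToySink local_apply, void *arg)
{
    for (; buf->applied < buf->count; buf->applied++)
        local_apply(&buf->msgs[buf->applied], arg);
}

/*
 * Tier 3: commit broadcasts every registered message; abort re-flushes
 * locally only the prefix that tier 2 had applied and drops the rest.
 */
static void toy_transaction_end(ToyTxnBuffer *buf, bool is_commit,
                                ToySink broadcast, ToySink local_apply,
                                void *arg)
{
    if (is_commit)
        for (size_t i = 0; i < buf->count; i++)
            broadcast(&buf->msgs[i], arg);
    else
        for (size_t i = 0; i < buf->applied; i++)
            local_apply(&buf->msgs[i], arg);
    buf->applied = buf->count = 0;
}

/* Example sink: counts how many messages it receives. */
static void toy_count_sink(const ToyMsg *msg, void *arg)
{
    (void) msg;
    ++*(int *) arg;
}
```

A caller would invoke `toy_register` at each mutation site, `toy_command_end` between commands, and `toy_transaction_end` exactly once per transaction. The point of the split is that mutation latency never depends on cross-process signaling.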
Catcache vs. relcache invalidation ordering
Most engines that have both a tuple-level catalog cache (catcache) and an assembled relation descriptor cache (relcache) must process catcache invalidations before relcache invalidations. The reason: building or rebuilding a relcache entry requires reading catalog tuples; if the catcache still holds the old tuple at the moment the relcache is being rebuilt, the rebuilt relcache entry is born stale. The ordering rule is therefore: flush catcache first, then flush relcache.
Shared ring buffer with overflow reset
A fixed-size ring buffer in shared memory is the standard choice for the broadcast medium in process-local-cache engines. The invariants:
- maxMsgNum is the producer cursor (next write slot).
- minMsgNum is the global consumer floor (no reader needs anything before this).
- Each backend has its own nextMsgNum cursor.
- If a backend’s nextMsgNum falls more than MAXNUMMESSAGES slots behind maxMsgNum, some messages it has not yet read have been overwritten. The engine sets a reset flag for that backend; when the backend next checks for messages it discards all its cached state rather than processing individual messages.
The reset is the correctness safety net: it is always safe to discard the entire cache, because the next cache miss will reload from the source of truth (system catalogs). The cost is a storm of reload activity after the reset. Catchup interrupts are therefore sent to slow backends to prevent reset from becoming the common case.
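The invariants above can be illustrated with a deliberately small ring. This is a sketch under simplifying assumptions: all `Toy*` names are hypothetical, there is no locking, no global consumer floor, no catchup signal, and message numbers are assumed never to wrap around `int`.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SLOTS 8   /* stands in for MAXNUMMESSAGES */

typedef struct {
    int slots[TOY_RING_SLOTS];
    int max_msg_num;       /* producer cursor: next message number to write */
} ToyRing;

typedef struct {
    int  next_msg_num;     /* next message number this reader will consume */
    bool reset;            /* set once unread messages were overwritten */
} ToyReader;

/* Write one message and flag every reader that has fallen too far behind. */
static void toy_ring_write(ToyRing *ring, ToyReader *readers, int nreaders,
                           int payload)
{
    ring->slots[ring->max_msg_num % TOY_RING_SLOTS] = payload;
    ring->max_msg_num++;
    for (int i = 0; i < nreaders; i++)
        if (ring->max_msg_num - readers[i].next_msg_num > TOY_RING_SLOTS)
            readers[i].reset = true;
}

/*
 * Copy up to outsize pending messages into out[].
 * Returns the number copied, or -1 if the reader missed messages and must
 * discard its whole cache; in that case its cursor jumps to the head.
 */
static int toy_ring_read(const ToyRing *ring, ToyReader *reader,
                         int *out, int outsize)
{
    assert(outsize > 0);
    if (reader->reset) {
        reader->next_msg_num = ring->max_msg_num;
        reader->reset = false;
        return -1;
    }
    int n = 0;
    while (n < outsize && reader->next_msg_num < ring->max_msg_num)
        out[n++] = ring->slots[reader->next_msg_num++ % TOY_RING_SLOTS];
    return n;
}
```

The `-1` return is what makes overflow safe: the reader never tries to interpret overwritten slots, it simply starts over with empty caches.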
Registered callbacks
Every subsystem that maintains a higher-level cache derived from catcache or relcache (plan cache, partition descriptor cache, event trigger cache, etc.) needs to know when its inputs are invalidated. The standard pattern is a per-cache callback registry: the subsystem registers a function pointer at startup; the invalidation dispatcher calls it when the relevant catcache or relcache entry is flushed. This decouples the dispatcher from knowledge of higher-level caches.
Theory ↔ PostgreSQL mapping
| Theory concept | PostgreSQL entity |
|---|---|
| Invalidation message (per-tuple, catcache) | SharedInvalidationMessage with id ≥ 0 (catcache ID) |
| Invalidation message (whole-catalog) | SharedInvalidationMessage with id == SHAREDINVALCATALOG_ID (-1) |
| Invalidation message (relcache) | SharedInvalidationMessage with id == SHAREDINVALRELCACHE_ID (-2) |
| Invalidation message (smgr / relmap / snapshot / relsync) | IDs -3 to -6 |
| Shared ring buffer | SISeg.buffer[MAXNUMMESSAGES] in sinvaladt.c |
| Producer cursor | SISeg.maxMsgNum |
| Consumer floor | SISeg.minMsgNum |
| Per-backend cursor | ProcState.nextMsgNum |
| Reset flag | ProcState.resetState |
| Catchup signal | PROCSIG_CATCHUP_INTERRUPT via ProcState.signaled |
| Per-transaction accumulation buffer | TransInvalidationInfo chain in inval.c |
| Command-local application | CommandEndInvalidationMessages() |
| Commit broadcast | AtEOXact_Inval(isCommit=true) → SendSharedInvalidMessages |
| Abort rollback | AtEOXact_Inval(isCommit=false) → LocalExecuteInvalidationMessage |
| Subsystem callback | syscache_callback_list[] / relcache_callback_list[] |
PostgreSQL’s Approach
Message types
sinval.h defines six negative IDs for special message types and
reserves non-negative values for catcache IDs:
```c
// SharedInvalidationMessage: src/include/storage/sinval.h
#define SHAREDINVALCATALOG_ID   (-1)   /* whole catalog flush */
#define SHAREDINVALRELCACHE_ID  (-2)   /* relcache entry */
#define SHAREDINVALSMGR_ID      (-3)   /* smgr file reference */
#define SHAREDINVALRELMAP_ID    (-4)   /* relation mapper */
#define SHAREDINVALSNAPSHOT_ID  (-5)   /* catalog snapshot */
#define SHAREDINVALRELSYNC_ID   (-6)   /* logical decoding relsync */
```
A SharedInvalidationMessage is a union with one member per type. For
catcache messages (id ≥ 0) the relevant fields are the cache ID,
the hash value of the invalidated key, and the database OID. For relcache
messages the relevant fields are the database OID and the relation OID
(InvalidOid means flush the entire relcache).
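To make the "union with one member per type" layout concrete, here is a reduced sketch covering only three of the message kinds. The `Toy*` names and the exact field widths are assumptions made for illustration; the real definitions live in sinval.h and contain more variants.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyOid;

#define TOY_CATALOG_ID   (-1)
#define TOY_RELCACHE_ID  (-2)

/* Each variant starts with the same one-byte id, so the tag is always readable. */
typedef struct { int8_t id; ToyOid dbId; uint32_t hashValue; } ToyCatcacheMsg; /* id >= 0 */
typedef struct { int8_t id; ToyOid dbId; ToyOid catId; }       ToyCatalogMsg;
typedef struct { int8_t id; ToyOid dbId; ToyOid relId; }       ToyRelcacheMsg;

typedef union {
    int8_t         id;    /* discriminator shared by all variants */
    ToyCatcacheMsg cc;
    ToyCatalogMsg  cat;
    ToyRelcacheMsg rc;
} ToyInvalMsg;

typedef enum {
    TOY_KIND_CATCACHE, TOY_KIND_CATALOG, TOY_KIND_RELCACHE, TOY_KIND_OTHER
} ToyKind;

/* Classify a message the way a dispatcher would: non-negative id = catcache. */
static ToyKind toy_classify(const ToyInvalMsg *msg)
{
    if (msg->id >= 0)
        return TOY_KIND_CATCACHE;
    if (msg->id == TOY_CATALOG_ID)
        return TOY_KIND_CATALOG;
    if (msg->id == TOY_RELCACHE_ID)
        return TOY_KIND_RELCACHE;
    return TOY_KIND_OTHER;
}

static ToyInvalMsg toy_make_catcache_msg(int8_t cache_id, ToyOid dbId,
                                         uint32_t hashValue)
{
    ToyInvalMsg msg = {0};
    assert(cache_id >= 0);          /* negative ids are reserved */
    msg.cc.id = cache_id;
    msg.cc.dbId = dbId;
    msg.cc.hashValue = hashValue;
    return msg;
}

static ToyInvalMsg toy_make_relcache_msg(ToyOid dbId, ToyOid relId)
{
    ToyInvalMsg msg = {0};
    msg.rc.id = TOY_RELCACHE_ID;
    msg.rc.dbId = dbId;
    msg.rc.relId = relId;           /* 0 would mean "all relations" in this sketch */
    return msg;
}
```

Because every variant has the same size class and a common leading tag, messages can be stored in a flat array and dispatched with a single comparison on `id`.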
The inval.c accumulation structures
inval.c maintains two parallel accumulation arrays allocated in
TopTransactionContext (one for catcache messages and one for relcache
messages) plus a linked list of TransInvalidationInfo structs for
subtransaction nesting:
```c
// TransInvalidationInfo: src/backend/utils/cache/inval.c
typedef struct TransInvalidationInfo
{
    struct InvalidationInfo ii;                  /* base: CurrentCmdInvalidMsgs + RelcacheInitFileInval */
    InvalidationMsgsGroup PriorCmdInvalidMsgs;   /* cmds already processed */
    struct TransInvalidationInfo *parent;
    int my_level;                                /* subtransaction nesting depth */
} TransInvalidationInfo;

// InvalidationMsgsGroup: indexes into the two flat arrays
typedef struct InvalidationMsgsGroup
{
    int firstmsg[2];   /* [0]=CatCacheMsgs, [1]=RelCacheMsgs */
    int nextmsg[2];
} InvalidationMsgsGroup;
```
The flat arrays are never copied; the InvalidationMsgsGroup structs
hold indexes into them. Appending a subtransaction’s messages to its
parent is therefore an O(1) index-range adjustment, not a data copy.
For inplace updates (non-transactional catalog mutations, e.g., system
catalog pg_class.reltuples write-back), a separate inplaceInvalInfo
path exists. It uses the same message arrays but bypasses the
command/transaction boundary logic, sending messages immediately during
the WAL insertion critical section.
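The O(1) append of one index-range group onto another can be sketched with a single array instead of two. The `ToyGroup` type and function names are hypothetical; the sketch assumes, as the flat-array design implies, that the source range begins exactly where the destination range ends.

```c
#include <assert.h>

/* Half-open index range [firstmsg, nextmsg) into one shared flat array. */
typedef struct { int firstmsg; int nextmsg; } ToyGroup;

static int toy_group_count(const ToyGroup *g)
{
    return g->nextmsg - g->firstmsg;
}

/*
 * Move all of src's messages into dest without copying any message data.
 * Only valid when src immediately follows dest in the array.
 */
static void toy_group_append(ToyGroup *dest, ToyGroup *src)
{
    assert(dest->nextmsg == src->firstmsg);
    dest->nextmsg = src->nextmsg;                   /* extend dest over src's range */
    src->firstmsg = src->nextmsg = dest->nextmsg;   /* src is now empty, after dest */
}
```

The cost is two integer assignments regardless of how many messages the source group holds, which is why nesting subtransactions deeply does not multiply copying work.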
Registering a catalog change
When heap_update or heap_delete touches a tuple in a system catalog,
it calls CacheInvalidateHeapTuple:
```c
// CacheInvalidateHeapTuple: src/backend/utils/cache/inval.c
void
CacheInvalidateHeapTuple(Relation relation, HeapTuple tuple, HeapTuple newtuple)
{
    CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
                                   PrepareInvalidationState);
}
```
The common path:
- Short-circuits on non-catalog relations (IsCatalogRelation check).
- Calls PrepareToInvalidateCacheTuple in catcache.c to determine which catcache IDs are affected and registers a SharedInvalidationMessage per affected cache via RegisterCatcacheInvalidation.
- For tuples in pg_class, pg_attribute, pg_index, or pg_constraint (foreign keys), also calls RegisterRelcacheInvalidation to enqueue a relcache flush for the owning relation.
- If the relation is in the relcache init file, sets RelcacheInitFileInval = true so the file is deleted at commit.
```c
// CacheInvalidateHeapTupleCommon (condensed): src/backend/utils/cache/inval.c
static void
CacheInvalidateHeapTupleCommon(Relation relation, HeapTuple tuple, HeapTuple newtuple,
                               InvalidationInfo *(*prepare_callback)(void))
{
    if (!IsCatalogRelation(relation))
        return;
    if (IsToastRelation(relation))
        return;

    info = prepare_callback();   /* PrepareInvalidationState or PrepareInplaceInvalidationState */

    tupleRelId = RelationGetRelid(relation);
    if (RelationInvalidatesSnapshotsOnly(tupleRelId))
        RegisterSnapshotInvalidation(info, databaseId, tupleRelId);
    else
        PrepareToInvalidateCacheTuple(relation, tuple, newtuple,
                                      RegisterCatcacheInvalidation, (void *) info);

    /* relcache flush for pg_class / pg_attribute / pg_index / pg_constraint */
    if (tupleRelId == RelationRelationId) { relationId = ...; }
    else if (tupleRelId == AttributeRelationId) { relationId = ...; }
    else if (tupleRelId == IndexRelationId) { relationId = ...; }
    else if (tupleRelId == ConstraintRelationId) { /* FK only */ }
    else
        return;

    RegisterRelcacheInvalidation(info, databaseId, relationId);
}
```
Command boundary: local flush
At each CommandCounterIncrement(), PostgreSQL calls
CommandEndInvalidationMessages(). This processes all messages from the
current command against the local caches only — no cross-process
signaling yet:
```c
// CommandEndInvalidationMessages: src/backend/utils/cache/inval.c
void
CommandEndInvalidationMessages(void)
{
    if (transInvalInfo == NULL)
        return;

    ProcessInvalidationMessages(&transInvalInfo->ii.CurrentCmdInvalidMsgs,
                                LocalExecuteInvalidationMessage);

    /* WAL-log per-command invalidations for logical decoding */
    if (XLogLogicalInfoActive())
        LogLogicalInvalidations();

    AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                               &transInvalInfo->ii.CurrentCmdInvalidMsgs);
}
```
After this call, the messages move from CurrentCmdInvalidMsgs to
PriorCmdInvalidMsgs. The local cache is up to date for the next command
in the same transaction.
Transaction end: broadcast or rollback
At transaction commit, AtEOXact_Inval(isCommit=true):
- If RelcacheInitFileInval is set, calls RelationCacheInitFilePreInvalidate() (deletes the init file).
- Appends CurrentCmdInvalidMsgs to PriorCmdInvalidMsgs.
- Calls ProcessInvalidationMessagesMulti with SendSharedInvalidMessages as the processor, which pushes all accumulated messages into the sinval buffer via SIInsertDataEntries.
- If RelcacheInitFileInval is set, calls RelationCacheInitFilePostInvalidate().
```c
// AtEOXact_Inval (isCommit path, condensed): src/backend/utils/cache/inval.c
void
AtEOXact_Inval(bool isCommit)
{
    // ... NULL check ...
    if (isCommit)
    {
        if (transInvalInfo->ii.RelcacheInitFileInval)
            RelationCacheInitFilePreInvalidate();

        AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                                   &transInvalInfo->ii.CurrentCmdInvalidMsgs);

        ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
                                         SendSharedInvalidMessages);

        if (transInvalInfo->ii.RelcacheInitFileInval)
            RelationCacheInitFilePostInvalidate();
    }
    else   /* abort */
    {
        ProcessInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                                    LocalExecuteInvalidationMessage);
    }
    transInvalInfo = NULL;
}
```
On abort, the local caches must be rolled back (the local changes from
PriorCmdInvalidMsgs — already applied at command boundaries — must be
undone by re-flushing). The CurrentCmdInvalidMsgs (not yet applied
locally) are simply discarded.
The sinval ring buffer (sinvaladt.c)
SISeg is the shared-memory segment housing the ring buffer and all
per-backend state:
```c
// SISeg: src/backend/storage/ipc/sinvaladt.c
typedef struct SISeg
{
    int     minMsgNum;       /* oldest unread message */
    int     maxMsgNum;       /* next slot to write */
    int     nextThreshold;   /* fullness trigger for SICleanupQueue */
    slock_t msgnumLock;      /* spinlock protecting maxMsgNum */

    SharedInvalidationMessage buffer[MAXNUMMESSAGES];   /* 4096 slots */

    int       numProcs;
    int      *pgprocnos;
    ProcState procState[FLEXIBLE_ARRAY_MEMBER];   /* one per backend slot */
} SISeg;

// ProcState: per-backend cursor and flags
typedef struct ProcState
{
    pid_t procPid;       /* 0 = inactive */
    int   nextMsgNum;    /* next message to read */
    bool  resetState;    /* missed messages; must reset entire cache */
    bool  signaled;      /* catchup interrupt already sent */
    bool  hasMessages;   /* unread messages present */
    bool  sendOnly;      /* Startup process: send only, never receive */
    // ... nextLXID
} ProcState;
```
MAXNUMMESSAGES = 4096. When a backend’s nextMsgNum would need to read
a slot that has already been overwritten (maxMsgNum - nextMsgNum > MAXNUMMESSAGES), resetState is set to true.
Writing (SIInsertDataEntries):
```c
// SIInsertDataEntries (condensed): src/backend/storage/ipc/sinvaladt.c
void
SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
{
    while (n > 0)
    {
        int nthistime = Min(n, WRITE_QUANTUM);   /* 64 */
        n -= nthistime;

        LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);

        /* Clean/reset if full */
        for (;;)
        {
            numMsgs = segP->maxMsgNum - segP->minMsgNum;
            if (numMsgs + nthistime > MAXNUMMESSAGES ||
                numMsgs >= segP->nextThreshold)
                SICleanupQueue(true, nthistime);
            else
                break;
        }

        max = segP->maxMsgNum;
        while (nthistime-- > 0)
            segP->buffer[max++ % MAXNUMMESSAGES] = *data++;

        SpinLockAcquire(&segP->msgnumLock);
        segP->maxMsgNum = max;   /* memory barrier via spinlock */
        SpinLockRelease(&segP->msgnumLock);

        for (i = 0; i < segP->numProcs; i++)
            segP->procState[segP->pgprocnos[i]].hasMessages = true;

        LWLockRelease(SInvalWriteLock);
    }
}
```
SInvalWriteLock serializes producers. The spinlock on maxMsgNum provides
a memory barrier: messages are guaranteed to be visible in buffer[] before
maxMsgNum is advanced.
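The outer loop of the writer splits a large batch into chunks so that the exclusive lock is never held for more than one chunk. The chunking arithmetic alone can be sketched as follows; `toy_count_batches` is a hypothetical helper, and the value 64 is taken from the WRITE_QUANTUM comment in the condensed listing.

```c
#include <assert.h>

#define TOY_WRITE_QUANTUM 64   /* chunk size, per the WRITE_QUANTUM comment */

/*
 * Return how many chunks (and therefore how many separate lock
 * acquisitions) a writer needs to insert n messages.
 */
static int toy_count_batches(int n)
{
    assert(n >= 0);
    int batches = 0;
    while (n > 0) {
        int nthistime = n < TOY_WRITE_QUANTUM ? n : TOY_WRITE_QUANTUM;
        n -= nthistime;
        batches++;   /* one lock acquire/release per chunk */
    }
    return batches;
}
```

Under these assumptions a commit carrying 200 messages would take the write lock four times (three full chunks of 64 and one of 8), letting other producers interleave between chunks.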
Reading (SIGetDataEntries):
```c
// SIGetDataEntries (condensed): src/backend/storage/ipc/sinvaladt.c
int
SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
{
    if (!stateP->hasMessages)
        return 0;   /* fast path: nothing pending */

    LWLockAcquire(SInvalReadLock, LW_SHARED);
    stateP->hasMessages = false;

    SpinLockAcquire(&segP->msgnumLock);
    max = segP->maxMsgNum;
    SpinLockRelease(&segP->msgnumLock);

    if (stateP->resetState)
    {
        stateP->nextMsgNum = max;
        stateP->resetState = false;
        LWLockRelease(SInvalReadLock);
        return -1;   /* -1 = reset signal */
    }

    n = 0;
    while (n < datasize && stateP->nextMsgNum < max)
        data[n++] = segP->buffer[stateP->nextMsgNum++ % MAXNUMMESSAGES];

    // ... reset hasMessages if partial read ...
    LWLockRelease(SInvalReadLock);
    return n;
}
```
Multiple backends can call SIGetDataEntries in parallel under a shared
SInvalReadLock, because each backend modifies only its own ProcState
fields. The lock is not held in the conventional read-only sense; it is
held to authorize mutation of self-only state, providing
the memory barrier needed to see messages written under SInvalWriteLock.
Receiving messages (AcceptInvalidationMessages)
Each backend calls AcceptInvalidationMessages() at the start of each
transaction (in StartTransaction) and at other checkpoints (e.g., after
acquiring a lock). It calls ReceiveSharedInvalidMessages, which loops
calling SIGetDataEntries until the queue is drained:
```c
// AcceptInvalidationMessages: src/backend/utils/cache/inval.c
void
AcceptInvalidationMessages(void)
{
    ReceiveSharedInvalidMessages(LocalExecuteInvalidationMessage,
                                 InvalidateSystemCaches);
    // ... optional debug_discard_caches path
}
```
LocalExecuteInvalidationMessage dispatches each message by type:
```c
// LocalExecuteInvalidationMessage (condensed): src/backend/utils/cache/inval.c
void
LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
{
    if (msg->id >= 0)   /* catcache tuple */
    {
        InvalidateCatalogSnapshot();
        SysCacheInvalidate(msg->cc.id, msg->cc.hashValue);
        CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
    }
    else if (msg->id == SHAREDINVALCATALOG_ID)   /* whole catalog */
    {
        InvalidateCatalogSnapshot();
        CatalogCacheFlushCatalog(msg->cat.catId);
    }
    else if (msg->id == SHAREDINVALRELCACHE_ID)   /* relcache entry */
    {
        RelationCacheInvalidateEntry(msg->rc.relId);   /* or full flush */
        /* call relcache_callback_list entries */
    }
    else if (msg->id == SHAREDINVALSMGR_ID)
    {
        smgrreleaserellocator(...);
    }
    else if (msg->id == SHAREDINVALRELMAP_ID)
    {
        RelationMapInvalidate(...);
    }
    else if (msg->id == SHAREDINVALSNAPSHOT_ID)
    {
        InvalidateCatalogSnapshot();
    }
    else if (msg->id == SHAREDINVALRELSYNC_ID)
    {
        CallRelSyncCallbacks(...);
    }
}
```
If SIGetDataEntries returns -1 (reset), InvalidateSystemCaches() is
called instead: it wipes all catcache and relcache entries, then fires all
registered syscache and relcache callbacks.
Registered callbacks
Subsystems that cache derived state register callbacks to be notified on invalidation events:
```c
// CacheRegisterSyscacheCallback: src/backend/utils/cache/inval.c
void
CacheRegisterSyscacheCallback(int cacheid, SyscacheCallbackFunction func, Datum arg)
{
    // adds to syscache_callback_list[], linked by syscache_callback_links[id]
}

// CacheRegisterRelcacheCallback: src/backend/utils/cache/inval.c
void
CacheRegisterRelcacheCallback(RelcacheCallbackFunction func, Datum arg)
{
    // adds to relcache_callback_list[]
}
```
Up to MAX_SYSCACHE_CALLBACKS = 64 syscache callbacks and
MAX_RELCACHE_CALLBACKS = 10 relcache callbacks are supported. Callbacks
are chained per-cache-ID via syscache_callback_links[] for O(1) dispatch.
Callers include the plan cache, the partition descriptor cache, the event
trigger cache, and logical decoding subsystems.
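One way to get per-cache chaining inside a single fixed array is to keep a "head" index per cache ID and a "next" link in each entry. The sketch below illustrates that idea only; the `Toy*` names, the array sizes, and the 1-based link encoding are assumptions for the example and are not claimed to match inval.c line for line.

```c
#include <assert.h>

#define TOY_NUM_CACHES     4
#define TOY_MAX_CALLBACKS  8

typedef void (*ToyCallback)(void *arg, int cache_id, unsigned hash);

typedef struct {
    int         cache_id;
    int         link;      /* 1-based index of next entry for same cache; 0 = end */
    ToyCallback fn;
    void       *arg;
} ToyEntry;

typedef struct {
    ToyEntry list[TOY_MAX_CALLBACKS];
    int      count;
    int      head[TOY_NUM_CACHES];   /* 1-based index of first entry; 0 = none */
} ToyRegistry;

/* Append a callback and hook it onto the end of its cache's chain. */
static void toy_register_callback(ToyRegistry *r, int cache_id,
                                  ToyCallback fn, void *arg)
{
    assert(cache_id >= 0 && cache_id < TOY_NUM_CACHES);
    assert(r->count < TOY_MAX_CALLBACKS);   /* fixed capacity, no growth */

    if (r->head[cache_id] == 0)
        r->head[cache_id] = r->count + 1;
    else {
        int i = r->head[cache_id] - 1;
        while (r->list[i].link > 0)
            i = r->list[i].link - 1;
        r->list[i].link = r->count + 1;
    }
    r->list[r->count] = (ToyEntry){ cache_id, 0, fn, arg };
    r->count++;
}

/* Fire only the callbacks chained under cache_id, in registration order. */
static void toy_call_callbacks(const ToyRegistry *r, int cache_id, unsigned hash)
{
    assert(cache_id >= 0 && cache_id < TOY_NUM_CACHES);
    for (int i = r->head[cache_id] - 1; i >= 0; i = r->list[i].link - 1)
        r->list[i].fn(r->list[i].arg, cache_id, hash);
}

/* Example callback: bump a counter passed through arg. */
static void toy_bump(void *arg, int cache_id, unsigned hash)
{
    (void) cache_id;
    (void) hash;
    ++*(int *) arg;
}
```

Dispatch starts from the head index in constant time and visits only the entries registered for that cache, so unrelated subsystems pay nothing when a different cache is invalidated.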
Inplace-update path
For catalog changes that are not transactional, such as updating
pg_class.reltuples during ANALYZE, an inplace update path exists that
bypasses the transaction/subtransaction stack. CacheInvalidateHeapTupleInplace
queues messages into a separate inplaceInvalInfo structure;
AtInplace_Inval() sends them directly to the sinval buffer during the
WAL insertion critical section, so the update is immediately visible to
other backends. PreInplace_Inval() handles relcache init file deletion
before the critical section.
Flow diagram
```mermaid
flowchart TD
A["heap_update / heap_delete\n(catalog relation)"] -->|CacheInvalidateHeapTuple| B["CacheInvalidateHeapTupleCommon\ninval.c"]
B -->|catcache msg| C["CurrentCmdInvalidMsgs\n(catcache array)"]
B -->|relcache msg| D["CurrentCmdInvalidMsgs\n(relcache array)"]
C --> E["CommandEndInvalidationMessages\n(CommandCounterIncrement)"]
D --> E
E -->|LocalExecuteInvalidationMessage| F["SysCacheInvalidate +<br/>RelationCacheInvalidateEntry\n(local caches)"]
E -->|move to| G["PriorCmdInvalidMsgs"]
G -->|commit: AtEOXact_Inval| H["SendSharedInvalidMessages\n→ SIInsertDataEntries"]
G -->|abort: AtEOXact_Inval| I["LocalExecuteInvalidationMessage\n(undo prior-cmd changes)"]
H --> J["SISeg.buffer\n4096-slot ring, shared memory"]
J -->|AcceptInvalidationMessages\nnext transaction start| K["SIGetDataEntries"]
K -->|n > 0| L["LocalExecuteInvalidationMessage\nper message"]
K -->|returns -1 reset| M["InvalidateSystemCaches\nfull wipe + all callbacks"]
L --> N["SysCacheInvalidate\nRelationCacheInvalidateEntry\nCallbacks..."]
```
Figure 1 — Cache invalidation flow from catalog mutation to cross-backend delivery. Left side: the generating backend. Right side: the receiving backend. The sinval ring buffer is the only shared structure.
End-to-end fan-out: one mutation, every backend
Figure 1 traces the data structures inside a single backend. Figure 2
takes the complementary view: it follows one catalog mutation as it
fans out to every live backend through the shared sinval queue. The
key asymmetry is that the producer runs the path once (registration
→ accumulate → broadcast at commit), while the consumer path runs
N times, independently, in each backend that drains the ring buffer
at its own next AcceptInvalidationMessages checkpoint. Each consumer
ends at the same terminal action: dropping the stale catcache and
relcache entries so the next lookup reloads from the catalog.
Note the per-message database guard inside LocalExecuteInvalidationMessage:
catcache, relcache, relmap, and snapshot messages all short-circuit
unless msg->*.dbId == MyDatabaseId || dbId == InvalidOid, so a backend
connected to a different database ignores another database’s DDL even
though the message physically passed through the shared ring it polls.
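The database guard reduces to a single predicate. The sketch below uses hypothetical `Toy*` names and assumes, as in PostgreSQL, that the invalid OID is represented by zero and is used to mark messages that apply to every database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyOid;
#define TOY_INVALID_OID ((ToyOid) 0)   /* "applies to all databases" marker */

/* A message is relevant if it targets this database or all databases. */
static bool toy_message_applies(ToyOid msg_db, ToyOid my_db)
{
    return msg_db == my_db || msg_db == TOY_INVALID_OID;
}

/* Count how many messages in a drained batch this backend must act on. */
static int toy_count_applicable(const ToyOid *msg_dbs, int n, ToyOid my_db)
{
    assert(n >= 0);
    int applicable = 0;
    for (int i = 0; i < n; i++)
        if (toy_message_applies(msg_dbs[i], my_db))
            applicable++;
    return applicable;
}
```

Every backend still pays the cost of reading each message out of the ring; the guard only avoids the cache lookups and callback invocations for messages that cannot affect it.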
```mermaid
flowchart TD
subgraph PROD["Producer backend (runs once)"]
A["heap_update / heap_delete\non a catalog relation"] -->|CacheInvalidateHeapTuple| B["CacheInvalidateHeapTupleCommon"]
B -->|PrepareToInvalidateCacheTuple| C["RegisterCatcacheInvalidation\n(catcache msg: id, hashValue, dbId)"]
B -->|pg_class / pg_attribute /<br/>pg_index / pg_constraint| D["RegisterRelcacheInvalidation\n(relcache msg: dbId, relId)"]
C --> E["CurrentCmdInvalidMsgs\n(per-transaction accumulation)"]
D --> E
E -->|CommandEndInvalidationMessages\nmove to PriorCmdInvalidMsgs| F["AtEOXact_Inval\nisCommit = true"]
F -->|SendSharedInvalidMessages| G["SIInsertDataEntries\nunder SInvalWriteLock"]
end
G --> Q["SISeg.buffer\n4096-slot ring in shared memory\nmaxMsgNum advanced; hasMessages=true for all backends"]
Q --> H1["Backend 1\nAcceptInvalidationMessages"]
Q --> H2["Backend 2\nAcceptInvalidationMessages"]
Q --> H3["Backend N\nAcceptInvalidationMessages"]
subgraph CONS["Consumer backend (runs in each of N backends)"]
H1 -->|ReceiveSharedInvalidMessages| I["SIGetDataEntries"]
I -->|n >= 0: per message| J["LocalExecuteInvalidationMessage\ndbId guard: MyDatabaseId or InvalidOid"]
I -->|returns -1: overflow| K["InvalidateSystemCaches\nfull catcache + relcache wipe"]
J -->|id >= 0| L["SysCacheInvalidate +\nCallSyscacheCallbacks\n(catcache entry drop)"]
J -->|SHAREDINVALRELCACHE_ID| M["RelationCacheInvalidateEntry +\nrelcache_callback_list\n(relcache entry drop)"]
end
```
Figure 2 — End-to-end fan-out of a single catalog mutation. The producer path executes once and terminates by writing the ring buffer under SInvalWriteLock. Every live backend independently drains the same ring at its next AcceptInvalidationMessages checkpoint; each ends by dropping the affected catcache and relcache entries (or, on overflow, wiping all caches). The dbId guard in LocalExecuteInvalidationMessage lets backends on other databases skip irrelevant messages.
WAL integration for logical decoding
When wal_level = logical, CommandEndInvalidationMessages also calls
LogLogicalInvalidations() to write per-command invalidation messages into
the WAL stream. This allows logical decoding backends (WAL subscribers) to
replay catalog changes and maintain their own catcache/relcache without
access to the live sinval queue.
The ProcessCommittedInvalidationMessages function is the redo-time analog
of AtEOXact_Inval: it is called by xact_redo_commit() on the standby to
broadcast the invalidation messages embedded in the commit WAL record to the
standby’s sinval queue.
Source Walkthrough
Generation side (inval.c)
- CacheInvalidateHeapTuple: entry point called by heap DML on catalog relations; routes to CacheInvalidateHeapTupleCommon with transactional state prepare.
- CacheInvalidateHeapTupleInplace: inplace-update variant; bypasses transaction stack.
- CacheInvalidateCatalog: registers a whole-catalog flush (used by VACUUM FULL on a catalog).
- CacheInvalidateRelcache: registers a relcache flush for a specific relation when no tuple-level trigger fires (e.g., DROP INDEX).
- CacheInvalidateRelcacheAll: broadcasts InvalidOid to flush all relcache entries cluster-wide.
- PrepareInvalidationState: allocates / reuses a TransInvalidationInfo for the current (sub)transaction nesting level.
Command / transaction boundary (inval.c)
- CommandEndInvalidationMessages: applies CurrentCmdInvalidMsgs locally, WAL-logs if wal_level=logical, moves to PriorCmdInvalidMsgs.
- AtEOXact_Inval: commit: broadcasts PriorCmdInvalidMsgs via sinval; abort: applies them locally (undo).
- AtEOSubXact_Inval: subtransaction commit: bubbles messages to parent; subtransaction abort: applies locally.
- AtInplace_Inval / PreInplace_Inval: inplace-update broadcast during WAL critical section.
- PostPrepare_Inval: PREPARE path: behaves like abort (undo local changes; the transaction’s broadcast will arrive via ProcessCommittedInvalidationMessages when it ultimately commits).
Reception side
- AcceptInvalidationMessages (inval.c): outer entry; calls ReceiveSharedInvalidMessages with LocalExecuteInvalidationMessage and InvalidateSystemCaches as the two dispatch callbacks.
- LocalExecuteInvalidationMessage (inval.c): per-message dispatch: routes to SysCacheInvalidate, RelationCacheInvalidateEntry, smgr, relmap, snapshot, or relsync handlers.
- InvalidateSystemCaches / InvalidateSystemCachesExtended (inval.c): full reset: wipes catcache + relcache and fires all callbacks.
- ReceiveSharedInvalidMessages (sinval.c): thin wrapper: loops SIGetDataEntries until the queue is drained; if a catchup interrupt (PROCSIG_CATCHUP_INTERRUPT) was pending, it then calls SICleanupQueue so the queue floor can advance.
- SIGetDataEntries (sinvaladt.c): reads backend’s pending messages from the ring buffer; returns -1 on reset.
- SIInsertDataEntries (sinvaladt.c): writes up to WRITE_QUANTUM=64 messages per lock hold; calls SICleanupQueue if buffer is filling.
Callback registration
- CacheRegisterSyscacheCallback (inval.c): registers a hook for a specific syscache ID; linked list per cache via syscache_callback_links[].
- CacheRegisterRelcacheCallback (inval.c): registers a hook for any relcache flush.
- CacheRegisterRelSyncCallback (inval.c): logical decoding relsync hook.
Position hints (as of 2026-06-05, commit 273fe94)
| Symbol | File | Line |
|---|---|---|
| CacheInvalidateHeapTuple | src/backend/utils/cache/inval.c | 1571 |
| CacheInvalidateHeapTupleCommon | src/backend/utils/cache/inval.c | 1436 |
| CacheInvalidateHeapTupleInplace | src/backend/utils/cache/inval.c | 1593 |
| CacheInvalidateCatalog | src/backend/utils/cache/inval.c | 1612 |
| CacheInvalidateRelcache | src/backend/utils/cache/inval.c | 1635 |
| CommandEndInvalidationMessages | src/backend/utils/cache/inval.c | 1409 |
| AtEOXact_Inval | src/backend/utils/cache/inval.c | 1199 |
| AtEOSubXact_Inval | src/backend/utils/cache/inval.c | 1310 |
| AtInplace_Inval | src/backend/utils/cache/inval.c | 1263 |
| AcceptInvalidationMessages | src/backend/utils/cache/inval.c | 930 |
| LocalExecuteInvalidationMessage | src/backend/utils/cache/inval.c | 823 |
| InvalidateSystemCaches | src/backend/utils/cache/inval.c | 916 |
| PrepareInvalidationState | src/backend/utils/cache/inval.c | 682 |
| RegisterCatcacheInvalidation | src/backend/utils/cache/inval.c | 604 |
| RegisterRelcacheInvalidation | src/backend/utils/cache/inval.c | 632 |
| RegisterSnapshotInvalidation | src/backend/utils/cache/inval.c | 672 |
| CacheRegisterSyscacheCallback | src/backend/utils/cache/inval.c | 1816 |
| CacheRegisterRelcacheCallback | src/backend/utils/cache/inval.c | 1858 |
| xactGetCommittedInvalidationMessages | src/backend/utils/cache/inval.c | 1012 |
| ProcessCommittedInvalidationMessages | src/backend/utils/cache/inval.c | 1135 |
| TransInvalidationInfo (struct) | src/backend/utils/cache/inval.c | 241 |
| SISeg (struct) | src/backend/storage/ipc/sinvaladt.c | 165 |
| ProcState (struct) | src/backend/storage/ipc/sinvaladt.c | 136 |
| SIInsertDataEntries | src/backend/storage/ipc/sinvaladt.c | 370 |
| SIGetDataEntries | src/backend/storage/ipc/sinvaladt.c | 473 |
| SharedInvalBackendInit | src/backend/storage/ipc/sinvaladt.c | 272 |
| SICleanupQueue | src/backend/storage/ipc/sinvaladt.c | ~560 |
| SendSharedInvalidMessages | src/backend/storage/ipc/sinval.c | 47 |
| ReceiveSharedInvalidMessages | src/backend/storage/ipc/sinval.c | 69 |
Source verification (as of 2026-06-05)
Verified facts
- MAXNUMMESSAGES = 4096 is a compile-time constant. Verified in sinvaladt.c line 129 at commit 273fe94. It is not a GUC, and there is no runtime mechanism to resize the ring buffer without recompiling.
- Catcache messages are ordered before relcache messages within each message group. Verified by reading AtEOXact_Inval: it calls ProcessInvalidationMessagesMulti, which processes the CatCacheMsgs subgroup first and then RelCacheMsgs, matching the design principle that catcache must be clean before relcache is rebuilt.
- SIGetDataEntries runs under a shared SInvalReadLock. Confirmed in sinvaladt.c. Multiple backends can drain the queue and advance their own ProcState.nextMsgNum concurrently without contention on the read path, because each modifies only its own per-backend state. The shared lock is used unconventionally to authorize self-modification, not to serialize reads.
- The hasMessages flag provides a fast-path check before acquiring any lock. Verified: SIGetDataEntries returns 0 immediately if stateP->hasMessages is false. The flag is set by the producer while it holds SInvalWriteLock, after advancing maxMsgNum, giving a memory-barrier ordering guarantee.
- The inplace-update path (AtInplace_Inval) calls SendSharedInvalidMessages inside a critical section (CritSectionCount > 0). Confirmed by the Assert(CritSectionCount > 0) in AtInplace_Inval. This means the sinval buffer write happens in the same critical section as the WAL record for the inplace heap change, so any error raised between the two is promoted to PANIC rather than leaving them out of step.
- The debug_discard_caches GUC exists (default 0) for stress testing. Verified: AcceptInvalidationMessages contains a DISCARD_CACHES_ENABLED block that calls InvalidateSystemCachesExtended(true) recursively up to debug_discard_caches levels. Available only if compiled with DISCARD_CACHES_ENABLED, which is defined automatically in assert-enabled builds (--enable-cassert). The debug_discard_caches = 1 mode was formerly CLOBBER_CACHE_ALWAYS.
- The RelcacheInitFileInval flag is per-transaction, not per-message. It is a single boolean at TransInvalidationInfo.ii.RelcacheInitFileInval. Any RegisterRelcacheInvalidation call for a relation for which RelationIdIsInInitFile returns true sets it; the init file is deleted only once at commit, not per message.
- MAX_SYSCACHE_CALLBACKS = 64, MAX_RELCACHE_CALLBACKS = 10. Both are fixed arrays (no dynamic growth). If a subsystem attempts to register more, elog(FATAL) fires. As of REL_18_STABLE, no regression test exercises the limit, but the counts are well below the cap in a standard build.
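The ring-buffer facts above (fixed capacity, a per-backend nextMsgNum, the hasMessages fast path, and reset on overflow) can be illustrated with a small single-process model. This is a minimal sketch under stated assumptions, not the sinvaladt.c implementation: it omits all locking, shrinks the ring to 8 slots, and every Toy*/toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAXNUMMESSAGES 8    /* stand-in for MAXNUMMESSAGES (4096) */

typedef struct ToyMsg { int id; } ToyMsg;

typedef struct ToyReader
{
    int  nextMsgNum;    /* next message number this reader will consume */
    bool hasMessages;   /* hint set by the writer; checked before any work */
    bool resetState;    /* reader lagged too far and must flush its caches */
} ToyReader;

typedef struct ToyQueue
{
    int    minMsgNum;   /* oldest message number still in the ring */
    int    maxMsgNum;   /* one past the newest message number */
    ToyMsg buffer[TOY_MAXNUMMESSAGES];
} ToyQueue;

void toy_init(ToyQueue *q, ToyReader *readers, int nreaders)
{
    q->minMsgNum = q->maxMsgNum = 0;
    for (int i = 0; i < nreaders; i++)
        readers[i] = (ToyReader){ 0, false, false };
}

/* Writer: append one message; if the ring is full, give up on laggards. */
void toy_insert(ToyQueue *q, ToyReader *readers, int nreaders, ToyMsg msg)
{
    assert(q->maxMsgNum - q->minMsgNum <= TOY_MAXNUMMESSAGES);
    if (q->maxMsgNum - q->minMsgNum == TOY_MAXNUMMESSAGES)
    {
        int lowbound = q->maxMsgNum - TOY_MAXNUMMESSAGES + 1;

        for (int i = 0; i < nreaders; i++)
            if (readers[i].nextMsgNum < lowbound)
                readers[i].resetState = true;   /* its slot is about to be reused */
        q->minMsgNum = lowbound;
    }
    q->buffer[q->maxMsgNum % TOY_MAXNUMMESSAGES] = msg;
    q->maxMsgNum++;
    for (int i = 0; i < nreaders; i++)
        readers[i].hasMessages = true;          /* set only after maxMsgNum moved */
}

/* Reader: returns messages copied, 0 if none, -1 if the reader was reset. */
int toy_get(ToyQueue *q, ToyReader *r, ToyMsg *out, int outsize)
{
    int n = 0;

    if (!r->hasMessages)
        return 0;                               /* fast path: nothing to do */
    r->hasMessages = false;
    if (r->resetState)
    {
        r->nextMsgNum = q->maxMsgNum;           /* skip everything queued so far */
        r->resetState = false;
        return -1;                              /* caller discards all caches */
    }
    while (n < outsize && r->nextMsgNum < q->maxMsgNum)
        out[n++] = q->buffer[r->nextMsgNum++ % TOY_MAXNUMMESSAGES];
    if (r->nextMsgNum < q->maxMsgNum)
        r->hasMessages = true;                  /* more remain for the next call */
    return n;
}
```

A caller would loop on toy_get until it returns 0, treating -1 as the signal to invalidate everything it has cached. Each reader touches only its own ToyReader, which is the property that lets the real read path run under a shared lock.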
Open questions
- SICleanupQueue reset heuristic. SIG_THRESHOLD = MAXNUMMESSAGES / 2 (2048) determines when a catchup interrupt is sent to a lagging backend. The comment says “the furthest-back backend might be stuck,” but there is no timeout-based reset: a backend stuck in a state where it cannot call AcceptInvalidationMessages will eventually have its resetState flag set and, once it does process the queue again, will log a message and invalidate its entire cache. The impact on a production system of a single stuck backend filling the 4096-slot buffer is worth measuring. Investigation path: instrument SICleanupQueue calls and expose the counts through pg_stat_activity or a custom extension.
- sendOnly semantics for the Startup process. ProcState.sendOnly = true is set for the Startup process during recovery. The code comment says it “fires inval messages to allow query backends to see schema changes” but “doesn’t maintain a relcache.” How catalog changes during recovery (e.g., from committed transactions being replayed) flow through the sinval path, and whether any receiving backend actually rebuilds its relcache correctly during hot-standby reads, is not fully traced in this doc.
- WAL-logged invalidations and logical decoding. LogLogicalInvalidations, called from CommandEndInvalidationMessages, writes per-command invalidation messages into WAL. The format and the exact consumer path through xlogreader and decode.c on the decoding side are not traced here. The interaction with the reorderbuffer snapshots that logical decoding maintains deserves a dedicated analysis.
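The thresholds in the first open question can be made concrete with a little arithmetic. The sketch below assumes the cleanup pass compares each backend's read position against two bounds derived from maxMsgNum: a signal bound SIG_THRESHOLD messages back, and a reset bound near the full ring size. The names LagAction, classify_lag, and pick_backend_to_signal are hypothetical, the minFree parameter (slots the writer wants freed) is an assumption, and message-number wraparound is ignored.

```c
#include <assert.h>

#define MAXNUMMESSAGES 4096
#define SIG_THRESHOLD  (MAXNUMMESSAGES / 2)   /* 2048 */

typedef enum { LAG_OK, LAG_SIGNAL, LAG_RESET } LagAction;

/* Decide what a cleanup pass would do with one backend's read position. */
LagAction classify_lag(int maxMsgNum, int nextMsgNum, int minFree)
{
    int minsig;
    int lowbound;

    assert(minFree >= 0 && minFree <= MAXNUMMESSAGES);
    minsig = maxMsgNum - SIG_THRESHOLD;               /* behind this: nudge it */
    lowbound = maxMsgNum - MAXNUMMESSAGES + minFree;  /* behind this: give up  */

    if (nextMsgNum < lowbound)
        return LAG_RESET;
    if (nextMsgNum < minsig)
        return LAG_SIGNAL;
    return LAG_OK;
}

/*
 * Pick the single furthest-behind backend that qualifies for a catchup
 * signal, skipping backends that are already past the reset bound.
 * Returns its index, or -1 if nobody needs a signal.
 */
int pick_backend_to_signal(int maxMsgNum, const int *nextMsgNum,
                           int nbackends, int minFree)
{
    int victim = -1;

    for (int i = 0; i < nbackends; i++)
    {
        if (classify_lag(maxMsgNum, nextMsgNum[i], minFree) != LAG_SIGNAL)
            continue;
        if (victim < 0 || nextMsgNum[i] < nextMsgNum[victim])
            victim = i;
    }
    return victim;
}
```

With maxMsgNum = 5000 and minFree = 1, the signal bound is 2952 and the reset bound is 905, so a backend sitting at message 1000 is nudged while one at message 100 is reset. Nothing in this arithmetic depends on elapsed time, which is the gap the open question points at.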
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Oracle’s global shared cache invalidation. Oracle’s caches live in shared memory (the SGA), so cache invalidation does not require per-process propagation. When a DDL commits, Oracle invalidates child cursors in the shared cursor cache by marking them invalid in the library cache; the next execution rebinds. The per-process sinval ring buffer pattern is an artifact of PostgreSQL’s per-process architecture; a shared-cache engine has a different problem (cursor invalidation rather than metadata propagation).
- MySQL/InnoDB table definition cache (TDC) and MDL. MySQL uses a centralized table definition cache with metadata locks (MDL) that block DDL until all sessions using a table have released their references. This is an eager rather than lazy model: DDL waits for the cache to be clean rather than sending invalidation after commit. The tradeoff is no stale-read risk but higher DDL latency under concurrency.
- Lock-based cache coherence (distributed DBMS). In distributed engines (CockroachDB, Spanner, YugabyteDB), schema changes propagate via a lease or version mechanism: each backend caches the schema at a given lease version; DDL bumps the version and waits for all backends to observe the new version before proceeding. The PostgreSQL sinval model is a single-node analog (“version bumped at commit, all backends eventually drain the queue”) without the distributed coordination.
- Lock-free sinval. The current sinval design uses SInvalReadLock (LWLock shared) and SInvalWriteLock (LWLock exclusive) plus a spinlock on maxMsgNum. With atomic reads and writes and hardware memory barriers it may be possible to eliminate the LWLock for the reader path entirely, reducing contention at very high session counts. This is an open optimization direction noted in source comments.
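The lock-free direction in the last item can be sketched with C11 atomics. This is a hypothetical design for illustration only, not code from PostgreSQL: the LfQueue type and lf_* functions are invented names, payloads are reduced to ints, writers are assumed to be serialized by some external exclusive lock, and every atomic operation uses the default sequentially consistent ordering, which the validate-after-read step relies on.

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SIZE 4     /* tiny stand-in for the real ring size */

typedef struct LfQueue
{
    atomic_int minMsgNum;          /* oldest message number still valid */
    atomic_int maxMsgNum;          /* one past the newest published message */
    atomic_int slot[RING_SIZE];    /* message payloads, reduced to ints */
} LfQueue;

void lf_init(LfQueue *q)
{
    atomic_init(&q->minMsgNum, 0);
    atomic_init(&q->maxMsgNum, 0);
    for (int i = 0; i < RING_SIZE; i++)
        atomic_init(&q->slot[i], 0);
}

/* Writer side. Assumes writers are serialized externally (not shown). */
void lf_publish(LfQueue *q, int payload)
{
    int max = atomic_load(&q->maxMsgNum);

    /* Retire the oldest message before its slot is reused. */
    if (max - atomic_load(&q->minMsgNum) >= RING_SIZE)
        atomic_store(&q->minMsgNum, max - RING_SIZE + 1);

    atomic_store(&q->slot[max % RING_SIZE], payload);
    atomic_store(&q->maxMsgNum, max + 1);       /* publish the new message */
}

/*
 * Reader side, no lock taken. Returns the number of messages copied, or -1
 * if the reader fell behind and must discard everything it has cached.
 */
int lf_drain(LfQueue *q, int *nextMsgNum, int *out, int outsize)
{
    int max = atomic_load(&q->maxMsgNum);
    int n = 0;

    while (n < outsize && *nextMsgNum < max)
    {
        int payload = atomic_load(&q->slot[*nextMsgNum % RING_SIZE]);

        /* Validate after reading: was this slot retired in the meantime? */
        if (*nextMsgNum < atomic_load(&q->minMsgNum))
        {
            *nextMsgNum = max;
            return -1;
        }
        out[n++] = payload;
        (*nextMsgNum)++;
    }
    return n;
}
```

Because the writer advances minMsgNum before overwriting a slot, a reader that observes an overwritten payload is guaranteed (under sequential consistency) to also observe the advanced minMsgNum and reset itself. A production design would have to weaken these orderings with care and measure whether the result actually beats a shared LWLock.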
Sources
Raw files
None. Synthesized directly from the source tree.
Textbook chapters
- Database System Concepts (Silberschatz et al., 7e): ch. 11 §“System Catalog”, ch. 25 §“Buffer Management”.
- Database Internals (Alex Petrov): ch. 6 §“System Catalogs and Metadata”.
Source code paths (REL_18_STABLE, commit 273fe94)
- src/backend/utils/cache/inval.c: invalidation dispatcher and accumulator.
- src/backend/storage/ipc/sinvaladt.c: shared ring buffer implementation.
- src/backend/storage/ipc/sinval.c: thin wrappers over sinvaladt.
- src/include/storage/sinval.h: SharedInvalidationMessage union and ID constants.
- src/include/utils/inval.h: public API declarations.
- src/backend/utils/cache/catcache.c: PrepareToInvalidateCacheTuple, SysCacheInvalidate.
- src/backend/utils/cache/relcache.c: RelationCacheInvalidateEntry, RelationCacheInitFilePreInvalidate.
Cross-references in this knowledge base
- postgres-catcache-syscache.md: how catcache entries are invalidated when SysCacheInvalidate is called; negative entries; callback chain.
- postgres-relcache.md: how RelationCacheInvalidateEntry rebuilds or flushes a relcache entry; the init file; bootstrap nailing.
- postgres-xlog-wal.md: WAL record format; LogLogicalInvalidations and how inval messages embed in commit records.
- postgres-xact.md: transaction lifecycle; where AtEOXact_Inval and CommandEndInvalidationMessages are called in StartTransaction / CommitTransaction / CommandCounterIncrement.
- postgres-shared-memory-ipc.md: SISeg allocation; LWLock and spinlock primitives used by sinvaladt.