PostgreSQL Relation Cache — RelationData, Bootstrap Nailing, and Sinval-Driven Rebuild
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every query the executor runs needs metadata about the relations it touches: column names and types, physical file location, applicable indexes, rewrite rules, triggers, and row-security policies. In a simple interpreter that metadata could be fetched from the catalog on every plan node invocation, but that would serialize every tuple access behind catalog scans. The standard engineering answer is a relation descriptor cache: a per-backend in-memory store keyed on relation OID that maps an OID to a fully-assembled metadata record, so the catalog-scan cost is paid once per relation per session unless an invalidation forces a rebuild.
Database System Concepts (Silberschatz, ch. 11 §“System Catalog”) identifies the core tension: the catalog is itself a set of relations, so building the cache from the catalog requires the cache to already exist for the catalog’s own relations. Every DBMS must break this bootstrap recursion somehow. The approaches are:
- Hard-wired stubs. For the handful of catalogs needed to read any catalog row at all, the engine ships their descriptors as compile-time constants (array initializers generated from the schema at build time) and installs them directly, bypassing the normal catalog-scan path.
- Init files. To avoid re-paying the catalog-scan cost on every backend startup, the engine serialises the assembled descriptors to disk after the first complete build and reads them back on subsequent startups.
- Invalidation signals. Because the cached metadata can become stale when DDL changes a relation’s schema, the cache must listen for signals from the catalog-invalidation mechanism and either flush or rebuild the affected entry.
Two design choices shape every implementation:
- Granularity of a cache entry. Does each entry correspond to one catalog row (pg_class) or to the fully-assembled aggregate of all the sub-catalogs that describe a relation (pg_attribute, pg_index, pg_rewrite, pg_trigger, …)? A fully-assembled entry is more expensive to build but eliminates per-field catalog lookups at query time.
- Rebuild strategy for open entries. When an invalidation arrives for a relation that a backend currently has open, can the entry be rebuilt in place or must the backend wait? In-place rebuild requires that pointer stability for the sub-structures (tuple descriptor, rule trees) be maintained across the rebuild, because callers may hold raw pointers into them.
PostgreSQL chooses the fully-assembled aggregate entry and an in-place swap-and-rebuild discipline, for the reasons examined in §“PostgreSQL’s Approach”.
Common DBMS Design
The textbook gives the model; this section names the engineering conventions that virtually every multi-user RDBMS adopts.
OID-keyed per-backend hash
The descriptor cache is a process-local hash table keyed on the relation’s OID. OIDs are stable across the life of a database cluster (they are never recycled for existing objects), so the key never changes under a cached entry. The table is process-local because catalog metadata is read through normal MVCC snapshots; two backends at different transaction isolation levels may legitimately see different versions of a relation’s schema, so a shared cache would need snapshot-aware eviction logic that is more expensive than per-process tables.
A fully-assembled aggregate descriptor
Rather than caching a single system-catalog row, the cache assembles one in-memory record per relation that includes:
- the pg_class row (relation name, OID, type OID, persistence, etc.)
- the tuple descriptor (one FormData_pg_attribute per column)
- rule lock (parsed rewrite rules from pg_rewrite)
- trigger descriptor (from pg_trigger)
- RLS descriptor (from pg_policy)
- index list (from pg_index)
- index AM routine pointers (for index relations)
- physical file location
Building one aggregate record is expensive, but it means query execution never reads system catalogs for per-column metadata or for rule/trigger applicability — it reads the cache entry.
Nailed vs. ordinary entries
A small subset of catalogs is needed to read any catalog at all (the relation and attribute catalogs, and their critical indexes). For these, the cache must be self-bootstrapping: they must be present before the code that builds cache entries from catalog scans can run. The convention is to nail these entries: mark them as never evictable and pre-install them from hard-wired compile-time data. All other entries are built on demand and can be evicted (flushed) on invalidation.
Reference counting with a resource-owner safety net
Cache entries must not be freed while a backend is using them, but the cache also should not keep entries alive forever. The standard pattern is a reference count: open (pin) increments it; close (unpin) decrements it; entries with refcount zero can be evicted. A resource-owner mechanism ties open references to transaction scope so that a cache reference is automatically released if the transaction aborts before an explicit close.
Invalidation-driven rebuild
When DDL changes a relation, the system sends a shared-invalidation
(sinval) message to all backends. On receipt, a backend that has the
affected OID in cache must either drop the entry (if refcount is zero)
or mark it invalid and rebuild it at next access. Rebuilding an entry
that is currently open (refcount > 0) requires the rebuild to not
move the existing RelationData struct or its sub-pointers, because
callers may hold direct C-pointer references to sub-structures.
Init-file pre-loading
Assembling all nailed entries from scratch on every backend startup is
slow. The first backend to complete the process serialises the nailed
entries to a binary file (pg_internal.init). Subsequent backends
load this file rather than scanning catalogs, skipping the bootstrap
scan entirely. The file is published atomically (written under a
temporary name, then renamed into place) and is removed whenever a
sinval event could have changed any entry stored in it.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Per-backend OID-keyed cache | RelationIdCache (HTAB * in relcache.c) |
| Per-entry aggregate descriptor | RelationData in rel.h (Relation is a pointer to it) |
| pg_class row sub-field | RelationData.rd_rel (Form_pg_class) |
| Tuple descriptor sub-field | RelationData.rd_att (TupleDesc) |
| Rule lock sub-field | RelationData.rd_rules (RuleLock *) |
| Physical file address | RelationData.rd_locator (RelFileLocator) |
| Nailed entry | rd_isnailed = true, rd_refcnt starts at 1 |
| Hard-wired stub builder | formrdesc() |
| Init-file load/write | load_relcache_init_file() / write_relcache_init_file() |
| Reference count | rd_refcnt, managed by RelationIncrementReferenceCount / RelationDecrementReferenceCount |
| Resource-owner integration | ResourceOwnerRememberRelationRef / relref_resowner_desc |
| Sinval-driven flush/rebuild | RelationCacheInvalidateEntry → RelationFlushRelation → RelationRebuildRelation |
| Invalid-but-open marker | rd_isvalid = false |
| In-progress-build guard | in_progress_list[] in relcache.c |
PostgreSQL’s Approach
PostgreSQL’s relcache is a per-backend HTAB keyed on relation OID.
Each entry is a RelationData struct (Relation is a pointer to one) allocated
in CacheMemoryContext and holding the fully-assembled metadata for
one relation. The critical design properties are: (1) a three-phase
startup sequence that breaks the bootstrap recursion using
compile-time hard-wired stubs and a binary init file; (2) a reference
count plus resource-owner integration that prevents entries from being
freed while in use; (3) an invalidation path that either clears a
zero-refcount entry or performs a swap-in-place rebuild that preserves
pointer stability for open entries.
RelationData: the aggregate descriptor
RelationData (defined in src/include/utils/rel.h) is the central
data structure. Its fields fall into five groups:

    // struct RelationData, src/include/utils/rel.h (condensed)
    typedef struct RelationData
    {
        RelFileLocator rd_locator;      /* physical identity: spc/db/relNumber */
        SMgrRelation rd_smgr;           /* cached smgr file handle, or NULL */
        int         rd_refcnt;          /* reference count */
        ProcNumber  rd_backend;         /* owning backend for temp rels */
        bool        rd_islocaltemp;     /* this session's temp rel */
        bool        rd_isnailed;        /* nailed in cache (never evicted) */
        bool        rd_isvalid;         /* entry is valid (not stale) */

        /* subtransaction-tracking fields for new/modified rels */
        SubTransactionId rd_createSubid;
        SubTransactionId rd_newRelfilelocatorSubid;
        SubTransactionId rd_firstRelfilelocatorSubid;
        SubTransactionId rd_droppedSubid;

        /* catalog-derived sub-structures */
        Form_pg_class rd_rel;           /* pg_class row (fixed-width part) */
        TupleDesc   rd_att;             /* tuple descriptor (rd_rel->relnatts cols) */
        Oid         rd_id;              /* relation OID */
        LockInfoData rd_lockInfo;       /* lock manager identity */
        RuleLock   *rd_rules;           /* parsed rewrite rules, or NULL */
        MemoryContext rd_rulescxt;      /* private context for rd_rules */
        TriggerDesc *trigdesc;          /* trigger info, or NULL */
        struct RowSecurityDesc *rd_rsdesc;  /* RLS policies, or NULL */

        /* on-demand lazy fields (NULL until first request) */
        List       *rd_fkeylist;        /* FK cache */
        PartitionKey rd_partkey;        /* partition key */
        PartitionDesc rd_partdesc;      /* partition descriptor */
        List       *rd_indexlist;       /* OID list of indexes */
        Oid         rd_pkindex;         /* primary key index OID */
        Bitmapset  *rd_keyattr;         /* FK-usable cols */
        Bitmapset  *rd_hotblockingattr; /* HOT-blocking cols */

        /* index-relation-only fields */
        MemoryContext rd_indexcxt;      /* private context for index info */
        struct IndexAmRoutine *rd_indam;    /* index AM API struct */
        Oid        *rd_opfamily;        /* opfamily per index column */
        Oid        *rd_opcintype;       /* opclass input type per column */
        RegProcedure *rd_support;       /* support procedure OIDs */

        /* ... additional fields omitted ... */
    } RelationData;

The split between eager and lazy fields is deliberate.
rd_rel, rd_att, rd_rules, trigdesc, and rd_rsdesc are
loaded eagerly by RelationBuildDesc because most callers need them.
rd_fkeylist, rd_indexlist, rd_partkey, rd_partdesc, and the
attribute-bitmap fields are loaded on first request and guarded by
rd_fkeyvalid, rd_indexvalid, rd_partkey != NULL, and
rd_attrsvalid flags. This avoids recursion: RelationGetIndexList
triggers more catalog reads, which in turn need relcache entries. If
that were part of the eager build, the recursion would be deeper and
harder to break.
flowchart TD
OID["relation OID"] --> HASH["RelationIdCache (HTAB)\n keyed on rd_id"]
HASH -- "hit + rd_isvalid" --> RET["return Relation*\n(refcount++)"]
HASH -- "hit + !rd_isvalid" --> REBUILD["RelationRebuildRelation\n(swap-in-place)"]
REBUILD --> RET
HASH -- "miss" --> BUILD["RelationBuildDesc\n(catalog scans)"]
BUILD --> INSERT["RelationCacheInsert\nrd_isvalid = true"]
INSERT --> RET
Figure 1 — RelationIdGetRelation lookup path. A cache hit on a valid entry returns immediately after incrementing the refcount. A hit on an invalid entry triggers in-place rebuild. A miss calls RelationBuildDesc to assemble the entry from system catalogs, then inserts it.
RelationBuildDesc: assembling an entry from catalogs
RelationBuildDesc is called on a cache miss. It performs a sequence
of catalog reads, each of which may itself need relcache entries,
so builds can nest. Each build in flight is recorded on the
in_progress_list stack so that invalidation processing can flag it:
if an invalidation for R arrives while R is being built, the build
restarts from scratch (the goto retry loop) rather than returning a
stale entry:

    // RelationBuildDesc, utils/cache/relcache.c (condensed)
    in_progress_list[in_progress_offset].reloid = targetRelId;
    retry:
    in_progress_list[in_progress_offset].invalidated = false;

    pg_class_tuple = ScanPgRelation(targetRelId, true, false);
    if (!HeapTupleIsValid(pg_class_tuple))
        return NULL;                        /* deleted */
    relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);

    relation = AllocateRelationDesc(relp);  /* alloc + copy pg_class fixed part */
    RelationBuildTupleDesc(relation);       /* scans pg_attribute */

    /* lazy fields are NIL/NULL; loaded on demand */
    relation->rd_fkeylist = NIL;
    relation->rd_partkey = NULL;

    /* access method info: index or table AM */
    if (relkind == RELKIND_INDEX || ...)
        RelationInitIndexAccessInfo(relation);
    else if (RELKIND_HAS_TABLE_AM(relkind))
        RelationInitTableAccessMethod(relation);

    RelationParseRelOptions(relation, pg_class_tuple);  /* rd_options */

    if (relation->rd_rel->relhasrules)
        RelationBuildRuleLock(relation);    /* scans pg_rewrite */
    if (relation->rd_rel->relhastriggers)
        RelationBuildTriggers(relation);    /* scans pg_trigger */
    if (relation->rd_rel->relrowsecurity)
        RelationBuildRowSecurity(relation); /* scans pg_policy */

    RelationInitLockInfo(relation);
    RelationInitPhysicalAddr(relation);     /* rd_locator from relfilenode or mapper */

    if (in_progress_list[in_progress_offset].invalidated)
    {
        RelationDestroyRelation(relation, false);
        goto retry;                         /* restart if inval arrived mid-build */
    }

    if (insertIt)
        RelationCacheInsert(relation, true);
    relation->rd_isvalid = true;
    return relation;

The condensed sketch above flattens the real control flow. The actual
function shows two things the sketch hides: (1) the eager vs. lazy
boundary is set field-by-field immediately after AllocateRelationDesc
copies the pg_class fixed part, with every lazy field explicitly
zeroed and its rd_*valid guard cleared; and (2) the access-method
initialization is a relkind dispatch: index relations get
RelationInitIndexAccessInfo, table-AM relations (and sequences) get
RelationInitTableAccessMethod, and partitioned tables get nothing,
because their relam is only a setting that partitions can inherit:

    // RelationBuildDesc, utils/cache/relcache.c (condensed, real)
    relation = AllocateRelationDesc(relp);
    RelationGetRelid(relation) = relid;

    relation->rd_refcnt = 0;
    relation->rd_isnailed = false;      /* ordinary rels are evictable */
    relation->rd_createSubid = InvalidSubTransactionId;
    /* ... three more SubTransactionId fields zeroed ... */

    RelationBuildTupleDesc(relation);   /* scans pg_attribute → rd_att */

    /* foreign key data is not loaded till asked for */
    relation->rd_fkeylist = NIL;
    relation->rd_fkeyvalid = false;
    /* partitioning data is not loaded till asked for */
    relation->rd_partkey = NULL;
    relation->rd_partdesc = NULL;
    relation->rd_partcheckvalid = false;

    /* initialize access method information */
    if (relation->rd_rel->relkind == RELKIND_INDEX ||
        relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
        RelationInitIndexAccessInfo(relation);
    else if (RELKIND_HAS_TABLE_AM(relation->rd_rel->relkind) ||
             relation->rd_rel->relkind == RELKIND_SEQUENCE)
        RelationInitTableAccessMethod(relation);
    else if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        /* Do nothing: partitions inherit the AM. */ ;
    else
        Assert(relation->rd_rel->relam == InvalidOid);

    RelationParseRelOptions(relation, pg_class_tuple);  /* rd_options */

    if (relation->rd_rel->relhasrules)
        RelationBuildRuleLock(relation);    /* scans pg_rewrite */
    else
    {
        relation->rd_rules = NULL;
        relation->rd_rulescxt = NULL;
    }
    /* ... triggers (pg_trigger) and row security (pg_policy) likewise ... */

    RelationInitLockInfo(relation);     /* see lmgr.c */
    RelationInitPhysicalAddr(relation); /* rd_locator */
    relation->rd_smgr = NULL;           /* no open file yet */
    heap_freetuple(pg_class_tuple);

Note the rd_refcnt = 0 / rd_isnailed = false defaults: a freshly
built ordinary entry starts unpinned, and the caller
(RelationIdGetRelation) pins it afterward. Nailed entries differ:
formrdesc builds the catalog stubs directly, and load_critical_index
calls RelationBuildDesc and then nails the result, so both kinds end
up with rd_isnailed = true and rd_refcnt = 1.
RelationInitPhysicalAddr deserves attention. For most relations the
physical file number is read directly from pg_class.relfilenode. For
mapped relations, whose pg_class.relfilenode is zero because the file
number must be known before pg_class itself can be read, it consults
relmapper.c, which maintains a small separate file (pg_filenode.map)
whose contents are loaded at startup. The shared catalogs and the
nailed local catalogs (pg_class, pg_attribute, pg_proc, pg_type),
together with their indexes, are mapped relations.
The three-phase bootstrap sequence
PostgreSQL cannot scan pg_class without a relcache entry for
pg_class, which would require scanning pg_class. It breaks the
recursion in three phases:
Phase 1 — RelationCacheInitialize: Creates the empty
RelationIdCache hash table and the in_progress_list array.
No catalog access.
Phase 2 — RelationCacheInitializePhase2: Prepares for access
to shared catalogs (pg_database, pg_authid, pg_auth_members,
pg_shseclabel, pg_subscription). Tries to load
global/pg_internal.init. On failure, calls formrdesc five times
to install hard-wired stubs:

    // RelationCacheInitializePhase2, utils/cache/relcache.c (condensed)
    if (!load_relcache_init_file(true))
    {
        formrdesc("pg_database", DatabaseRelation_Rowtype_Id, true, ...);
        formrdesc("pg_authid", AuthIdRelation_Rowtype_Id, true, ...);
        formrdesc("pg_auth_members", AuthMemRelation_Rowtype_Id, true, ...);
        formrdesc("pg_shseclabel", SharedSecLabelRelation_Rowtype_Id, true, ...);
        formrdesc("pg_subscription", SubscriptionRelation_Rowtype_Id, true, ...);
    #define NUM_CRITICAL_SHARED_RELS 5
    }

Phase 3 (RelationCacheInitializePhase3): The critical
phase. Tries to load $PGDATA/base/<dboid>/pg_internal.init. On
failure, calls formrdesc four times for the local nailed catalogs
(pg_class, pg_attribute, pg_proc, pg_type), then loads the
seven critical indexes one by one:

    // RelationCacheInitializePhase3, utils/cache/relcache.c (condensed)
    if (!load_relcache_init_file(false))
    {
        formrdesc("pg_class", RelationRelation_Rowtype_Id, false, ...);
        formrdesc("pg_attribute", AttributeRelation_Rowtype_Id, false, ...);
        formrdesc("pg_proc", ProcedureRelation_Rowtype_Id, false, ...);
        formrdesc("pg_type", TypeRelation_Rowtype_Id, false, ...);
    #define NUM_CRITICAL_LOCAL_RELS 4
    }
    // ...
    if (!criticalRelcachesBuilt)
    {
        load_critical_index(ClassOidIndexId, RelationRelationId);
        load_critical_index(AttributeRelidNumIndexId, AttributeRelationId);
        load_critical_index(IndexRelidIndexId, IndexRelationId);
        load_critical_index(OpclassOidIndexId, OperatorClassRelationId);
        load_critical_index(AccessMethodProcedureIndexId,
                            AccessMethodProcedureRelationId);
        load_critical_index(RewriteRelRulenameIndexId, RewriteRelationId);
        load_critical_index(TriggerRelidNameIndexId, TriggerRelationId);
    #define NUM_CRITICAL_LOCAL_INDEXES 7
        criticalRelcachesBuilt = true;
    }

Setting criticalRelcachesBuilt = true is the key transition:
after this point, ScanPgRelation and RelationBuildTupleDesc may
use index scans rather than sequential heap scans. Before it, they
fall back to heap scans unconditionally. Phase 3 then scans the
entire cache and replaces any formrdesc stubs (identifiable by
rd_rel->relowner == InvalidOid) with real pg_class rows read via
SearchSysCache1.
formrdesc itself builds a minimal RelationData from compile-time
attribute arrays (Desc_pg_class, Desc_pg_attribute, etc., initialized
from schema macros generated by genbki.pl) and marks the entry nailed
(rd_isnailed = true, rd_refcnt = 1). The entry is marked
rd_isvalid = true even though its pg_class row is only a placeholder
(relowner, for example, is left as InvalidOid), so it can be used
immediately to scan the catalogs; Phase 3 later overwrites the
placeholder with the real row.
flowchart LR
P1["Phase 1\nRelationCacheInitialize\n(empty hash table)"]
P2["Phase 2\nRelationCacheInitializePhase2\nshared stubs or init file"]
P3["Phase 3\nRelationCacheInitializePhase3\nlocal stubs + critical indexes\ncriticalRelcachesBuilt = true\nfake → real pg_class rows"]
NORMAL["Normal operation\nRelationIdGetRelation on demand"]
P1 --> P2 --> P3 --> NORMAL
Figure 2 — Three-phase bootstrap. Phase 1 creates the empty hash. Phase 2 installs stubs for shared catalogs. Phase 3 installs local stubs, loads the 7 critical indexes, sets criticalRelcachesBuilt, then replaces stubs with real pg_class data. All three phases complete before the first user query runs.
The init file: pg_internal.init
write_relcache_init_file(bool shared) serialises the nailed
(critical) relcache entries, together with any other entries accepted
by RelationIdIsInInitFile, to a binary file at end-of-startup. The
file is written to a temporary name and renamed into place atomically,
so a concurrent reader never sees a partial file. The file starts with
a magic number (RELCACHE_INIT_FILEMAGIC = 0x573266) that acts as a
version identifier; on a mismatch the file is ignored and the entries
are rebuilt from scratch.
load_relcache_init_file reads the file back, reconstructs each
RelationData in CacheMemoryContext, recomputes lock info and
physical addresses (in case the file was copied from another database
by CREATE DATABASE), and counts nailed entries against
NUM_CRITICAL_LOCAL_RELS / NUM_CRITICAL_LOCAL_INDEXES as a sanity
check.
The write side is a length-prefixed binary dump. After the magic
number, it walks RelationIdCache and, for each entry in the correct
group (relisshared == shared) that belongs in the file, writes the
RelationData struct, the pg_class form, each attribute’s fixed
part, and the reloptions blob — then, for RELKIND_INDEX, the
pg_index tuple and the opfamily/opcintype/support vectors. The whole
thing is gated on having received zero sinval events since startup,
and the final step is the atomic temp-file rename under
RelCacheInitLock:

    // write_relcache_init_file, utils/cache/relcache.c (condensed, real)
    if (relcacheInvalsReceived != 0L)
        return;                         /* already stale; don't bother */

    /* ... snprintf temp + final names, AllocateFile(tempfilename) ... */
    magic = RELCACHE_INIT_FILEMAGIC;
    fwrite(&magic, 1, sizeof(magic), fp);

    hash_seq_init(&status, RelationIdCache);
    while ((idhentry = hash_seq_search(&status)) != NULL)
    {
        Relation    rel = idhentry->reldesc;
        Form_pg_class relform = rel->rd_rel;

        if (relform->relisshared != shared)     /* wrong group */
            continue;
        if (!shared && !RelationIdIsInInitFile(RelationGetRelid(rel)))
        {
            Assert(!rel->rd_isnailed);          /* nailed must be stored */
            continue;
        }
        write_item(rel, sizeof(RelationData), fp);
        write_item(relform, CLASS_TUPLE_SIZE, fp);
        for (i = 0; i < relform->relnatts; i++)
            write_item(TupleDescAttr(rel->rd_att, i),
                       ATTRIBUTE_FIXED_PART_SIZE, fp);
        write_item(rel->rd_options,
                   rel->rd_options ? VARSIZE(rel->rd_options) : 0, fp);
        if (rel->rd_rel->relkind == RELKIND_INDEX)
        {
            /* pg_index + opfamily/support */
        }
    }

    /* serialize against the unlink-and-SI path */
    LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
    AcceptInvalidationMessages();
    if (relcacheInvalsReceived == 0L)
    {
        if (rename(tempfilename, finalfilename) < 0)    /* atomic publish */
            unlink(tempfilename);
    }
    else
        unlink(tempfilename);           /* obsolete; drop it */
    LWLockRelease(RelCacheInitLock);

The read side reverses the dump but cannot trust it blindly. Two
structural checks (magic != RELCACHE_INIT_FILEMAGIC, and each
descriptor’s len != sizeof(RelationData)) jump to read_failed so a
layout change forces a clean rebuild. After reconstructing each entry it
recomputes lock and physical addressing, because the on-disk rd_locator /
rd_lockInfo may belong to a different database if the file was copied
by CREATE DATABASE. Finally it cross-checks the nailed-entry counts
against the compile-time constants before inserting anything:

    // load_relcache_init_file, utils/cache/relcache.c (condensed, real)
    if (fread(&magic, 1, sizeof(magic), fp) != sizeof(magic))
        goto read_failed;
    if (magic != RELCACHE_INIT_FILEMAGIC)
        goto read_failed;

    for (relno = 0;; relno++)
    {
        if ((nread = fread(&len, 1, sizeof(len), fp)) != sizeof(len))
        {
            if (nread == 0)
                break;                  /* clean EOF */
            goto read_failed;
        }
        if (len != sizeof(RelationData))
            goto read_failed;           /* layout drift */
        rel = rels[num_rels++] = (Relation) palloc(len);
        if (fread(rel, 1, len, fp) != len)
            goto read_failed;           /* truncated entry */
        /* ... read pg_class form, rebuild rd_att, AM info, nailed counts ... */

        rel->rd_smgr = NULL;
        rel->rd_refcnt = rel->rd_isnailed ? 1 : 0;
        rel->rd_indexvalid = false;
        rel->rd_indexlist = NIL;
        /* recompute: file may have been copied by CREATE DATABASE */
        RelationInitLockInfo(rel);
        RelationInitPhysicalAddr(rel);
    }

    if (!shared &&
        (nailed_rels != NUM_CRITICAL_LOCAL_RELS ||
         nailed_indexes != NUM_CRITICAL_LOCAL_INDEXES))
        goto read_failed;               /* critical set changed → rebuild */

    for (relno = 0; relno < num_rels; relno++)
        RelationCacheInsert(rels[relno], false);
    criticalRelcachesBuilt = true;      /* (criticalSharedRelcachesBuilt if shared) */
    return true;

    read_failed:
        /* leak the half-built rels (not in cache) and fall back to formrdesc */
        return false;

Note that a successful load_relcache_init_file is what sets
criticalRelcachesBuilt (or criticalSharedRelcachesBuilt), the same
flag that Phase 3 sets after load_critical_index. Either path reaches
the post-bootstrap state; the init file is just the fast lane.
flowchart TD
START["backend startup\nPhase 2 / Phase 3"] --> TRY["load_relcache_init_file(shared)"]
TRY --> OPEN{"AllocateFile ok?"}
OPEN -- "no" --> SLOW["formrdesc stubs +\nload_critical_index\n(slow bootstrap)"]
OPEN -- "yes" --> MAGIC{"magic + len checks?"}
MAGIC -- "fail" --> RF["read_failed\nreturn false"]
RF --> SLOW
MAGIC -- "ok" --> RECON["reconstruct entries\nRelationInitLockInfo\nRelationInitPhysicalAddr"]
RECON --> COUNT{"nailed counts ==\nNUM_CRITICAL_*?"}
COUNT -- "no" --> RF
COUNT -- "yes" --> INS["RelationCacheInsert all\ncriticalRelcachesBuilt = true"]
SLOW --> RUN["normal operation"]
INS --> RUN
RUN --> WRITE["write_relcache_init_file\nat end of startup\nif relcacheInvalsReceived == 0"]
WRITE --> RENAME["rename temp → pg_internal.init\nunder RelCacheInitLock"]
Figure 4 — Init-file fast lane vs. slow bootstrap. load_relcache_init_file short-circuits the formrdesc path when the file opens, passes the magic/length structural checks, and matches the expected nailed-entry counts; any failure routes through read_failed back to the slow bootstrap. The first backend to finish startup without seeing a sinval event writes the file back and renames it into place atomically.
The file is invalidated by RelationCacheInitFilePreInvalidate /
RelationCacheInitFilePostInvalidate (called from inval.c when a
committing transaction has changed a catalog entry stored in the
init file). Pre-invalidation acquires RelCacheInitLock and unlinks the
init file; post-invalidation releases the lock once the sinval
messages have been sent. A backend that loaded the file just before
the unlink may hold stale entries, but it will process those sinval
messages and correct them before the stale data is used.
Reference counting and resource ownership
Every caller of RelationIdGetRelation receives an entry with
rd_refcnt incremented. The caller is responsible for calling
RelationClose (which calls RelationDecrementReferenceCount) when
done. In practice, access goes through table_open / table_close
(or index_open / index_close), which reach RelationIdGetRelation
and RelationClose through relation_open and relation_close.
To prevent leaks on error paths, each open reference is registered
with CurrentResourceOwner via ResourceOwnerRememberRelationRef:

    // RelationIncrementReferenceCount, utils/cache/relcache.c (condensed)
    void
    RelationIncrementReferenceCount(Relation rel)
    {
        ResourceOwnerEnlarge(CurrentResourceOwner);
        rel->rd_refcnt += 1;
        if (!IsBootstrapProcessingMode())
            ResourceOwnerRememberRelationRef(CurrentResourceOwner, rel);
    }

The relref_resowner_desc descriptor registers ResOwnerReleaseRelation
as the cleanup callback, which drops any pin still registered when the
resource owner is released, for example when a transaction aborts
before the caller reaches its close call. This ensures that long-lived
processes (like the autovacuum worker) cannot accumulate leaked
relcache pins across transactions.
Invalidation: RelationFlushRelation and the swap-in-place rebuild
When a sinval message arrives for a relation OID,
RelationCacheInvalidateEntry increments relcacheInvalsReceived
and calls RelationFlushRelation. The flush decision tree is:
flowchart TD
F["RelationFlushRelation(rel)"]
F --> NEW{"rd_createSubid != 0\nor rd_firstRelfilelocatorSubid != 0?"}
NEW -- "yes (new-in-txn rel)" --> TXNSTATE{"IsTransactionState\nand not dropped?"}
TXNSTATE -- "yes" --> REBUILDNEW["bump refcnt\nRelationRebuildRelation\ndecrement refcnt"]
TXNSTATE -- "no" --> INVAL["RelationInvalidateRelation\nrd_isvalid = false"]
NEW -- "no (pre-existing rel)" --> REFZERO{"refcnt == 0?"}
REFZERO -- "yes" --> CLEAR["RelationClearRelation\n(destroy entry)"]
REFZERO -- "no + !IsTransactionState" --> INVAL
REFZERO -- "no + nailed + refcnt==1" --> INVAL
REFZERO -- "no + open" --> REBUILD["RelationRebuildRelation\n(swap-in-place)"]
Figure 3 — RelationFlushRelation decision tree. A zero-refcount pre-existing entry is simply cleared. An open entry is rebuilt in-place by building a new RelationData, swapping contents field-by-field (preserving rd_refcnt, rd_smgr, transaction sub-IDs, and optionally rd_att / rd_rules / rd_rsdesc if they are structurally equal), then destroying the temporary new entry.
The swap-in-place logic in RelationRebuildRelation is the most
complex part of the relcache. The full struct is swapped with
memcpy, then a dozen fields are swapped back to preserve
invariants:

    // RelationRebuildRelation, utils/cache/relcache.c (condensed)
    newrel = RelationBuildDesc(save_relid, false);  /* build into temp entry */

    keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
    keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
    keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
    keep_partkey = (relation->rd_partkey != NULL);  /* immutable once set */

    /* swap all fields at once */
    {
        RelationData tmp;

        memcpy(&tmp, newrel, ...);
        memcpy(newrel, relation, ...);
        memcpy(relation, &tmp, ...);
    }

    /* then swap back fields that must be preserved */
    SWAPFIELD(SMgrRelation, rd_smgr);   /* back-links from smgr level */
    SWAPFIELD(int, rd_refcnt);          /* callers hold this count */
    SWAPFIELD(SubTransactionId, rd_createSubid);
    /* ... other SubTransactionId fields ... */
    SWAPFIELD(Form_pg_class, rd_rel);
    memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE);  /* update content */

    if (keep_tupdesc)
        SWAPFIELD(TupleDesc, rd_att);   /* preserve pointer */
    if (keep_rules)
    {
        SWAPFIELD(RuleLock *, rd_rules);
        SWAPFIELD(MemoryContext, rd_rulescxt);
    }
    if (keep_policies)
        SWAPFIELD(RowSecurityDesc *, rd_rsdesc);
    if (keep_partkey)
        SWAPFIELD(PartitionKey, rd_partkey);
    /* ... partition desc context handling ... */

    RelationDestroyRelation(newrel, !keep_tupdesc);

Preserving rd_att when the tuple descriptor is structurally
unchanged is important: code that has the relation open may be holding
raw pointers to the descriptor (or to the rules or policies), and
replacing it would leave those pointers dangling. equalTupleDescs
performs a structural comparison to decide whether the preservation
is safe.
If RelationBuildDesc returns NULL during a rebuild, which can
happen while a historic (logical decoding) snapshot is active and the
relation is not visible in it, the function returns without
rebuilding (the entry remains invalid). Without a historic snapshot it
raises an error, because a relation should not be dropped while still
open.
Lazy computed fields: on-demand sub-structure loading
Several RelationData fields are populated only on first request.
RelationGetIndexList scans pg_index and caches the result in
rd_indexlist; RelationGetFKeyList scans pg_constraint and caches the
result in rd_fkeylist. Both copy the result into CacheMemoryContext
and only then set their rd_*valid guard.
A subtle concurrency issue arises in RelationGetIndexAttrBitmap:
during the loop that opens each index to collect attribute bitmaps, a
relcache flush may arrive and reset rd_indexlist. The function
handles this with a restart loop — after collecting all bitmaps it
calls RelationGetIndexList a second time and compares the result
to its earlier snapshot; if they differ, it frees the partially-built
bitmaps and restarts from scratch.
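The restart-on-change pattern can be sketched in isolation. This is a simplified illustration under stated assumptions, not the PostgreSQL function: ToyIndexList, toy_flush_hook, toy_index_attr_bitmap, and toy_add_index_at_step1 are hypothetical names, the attribute bitmap is a plain unsigned mask, and the hook stands in for invalidation processing that may happen while an index is opened.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_INDEXES 8

/* Hypothetical cached index list; a "flush" may change it at any time. */
typedef struct ToyIndexList
{
    int      n;                       /* number of indexes */
    int      oids[TOY_MAX_INDEXES];   /* index OIDs */
    unsigned cols[TOY_MAX_INDEXES];   /* column mask covered by each index */
} ToyIndexList;

/* Called once per "index open"; may modify the cached list. */
typedef void (*toy_flush_hook) (ToyIndexList *cache, int step);

static unsigned
toy_index_attr_bitmap(ToyIndexList *cache, toy_flush_hook hook, int *restarts)
{
    ToyIndexList snapshot;
    unsigned     bitmap;
    int          step = 0;
    int          i;

restart:
    snapshot = *cache;      /* the list this pass will walk */
    bitmap = 0;             /* discard any partially built result */
    assert(snapshot.n >= 0 && snapshot.n <= TOY_MAX_INDEXES);

    for (i = 0; i < snapshot.n; i++)
    {
        /* Opening an index can process invalidations; model that here. */
        if (hook != NULL)
            hook(cache, step++);
        bitmap |= snapshot.cols[i];
    }

    /* Re-read the list; if it changed during the walk, start over. */
    if (cache->n != snapshot.n ||
        memcmp(cache->oids, snapshot.oids,
               sizeof(int) * (size_t) snapshot.n) != 0)
    {
        (*restarts)++;
        goto restart;
    }
    return bitmap;
}

/* Example hook: a new index (OID 300) becomes visible at step 1. */
static void
toy_add_index_at_step1(ToyIndexList *cache, int step)
{
    if (step == 1 && cache->n < TOY_MAX_INDEXES)
    {
        cache->oids[cache->n] = 300;
        cache->cols[cache->n] = 0x8u;
        cache->n++;
    }
}
```

With a starting list of two indexes, the example hook makes the first pass stale, so the function restarts once and the final mask includes the column of the newly visible index.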
Subtransaction-tracking SubIDs
rd_createSubid, rd_newRelfilelocatorSubid,
rd_firstRelfilelocatorSubid, and rd_droppedSubid track the
subtransactions in which the relation was created, received a new
relfilelocator, or was dropped. These are critical for
RelationNeedsWAL(), which returns false for a permanent relation
whose storage was created in the current top transaction when
wal_level = minimal (WAL records are unnecessary for that storage).
At end-of-transaction, AtEOXact_RelationCache resets all four sub-IDs
to InvalidSubTransactionId (zero) and, if the relation was created
and then dropped in the same transaction, clears the entry.
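The decision and the end-of-transaction reset can be sketched as follows. This is an illustration with hypothetical names (ToyRelSubids, toy_needs_wal, toy_at_eoxact), not the PostgreSQL macro or function. The permanence check and the wal_level condition are assumptions about the shape of RelationNeedsWAL in rel.h and should be confirmed against the source.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int ToySubXid;              /* stand-in for SubTransactionId */
#define TOY_INVALID_SUBXID ((ToySubXid) 0)

/* Hypothetical subset of the relcache entry. */
typedef struct ToyRelSubids
{
    bool      permanent;                  /* not temp or unlogged */
    ToySubXid createSubid;                /* created in this transaction? */
    ToySubXid newRelfilelocatorSubid;     /* latest new storage assignment */
    ToySubXid firstRelfilelocatorSubid;   /* first new storage assignment */
    ToySubXid droppedSubid;               /* dropped in this transaction? */
} ToyRelSubids;

/*
 * Assumed rule: only permanent relations are WAL-logged, and WAL can be
 * skipped only under wal_level = minimal for storage that is new in the
 * current top transaction.
 */
static bool
toy_needs_wal(const ToyRelSubids *rel, bool wal_level_minimal)
{
    if (!rel->permanent)
        return false;
    if (!wal_level_minimal)
        return true;
    return rel->createSubid == TOY_INVALID_SUBXID &&
        rel->firstRelfilelocatorSubid == TOY_INVALID_SUBXID;
}

/*
 * End-of-transaction cleanup: reset every sub-ID. Returns true when the
 * entry was created and dropped in the same transaction, meaning the
 * caller should clear it.
 */
static bool
toy_at_eoxact(ToyRelSubids *rel)
{
    bool created_and_dropped =
        rel->createSubid != TOY_INVALID_SUBXID &&
        rel->droppedSubid != TOY_INVALID_SUBXID;

    rel->createSubid = TOY_INVALID_SUBXID;
    rel->newRelfilelocatorSubid = TOY_INVALID_SUBXID;
    rel->firstRelfilelocatorSubid = TOY_INVALID_SUBXID;
    rel->droppedSubid = TOY_INVALID_SUBXID;
    return created_and_dropped;
}
```

Once the sub-IDs are reset, the same relation reports that it needs WAL again, which matches the idea that the exemption lasts only for the creating transaction.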
Source Walkthrough
Anchor on symbol names, not line numbers. Use
git grep -n '<symbol>' src/backend/utils/cache/relcache.c to relocate; line numbers in the position-hint table are scoped to commit 273fe94.
Core data structures
- struct RelationData (rel.h): the aggregate descriptor; Relation is the pointer typedef for it. Sub-fields: rd_locator, rd_rel, rd_att, rd_rules, trigdesc, rd_rsdesc, rd_indexlist, rd_indam.
- struct relidcacheent / RelationIdCache (relcache.c): the per-backend HTAB * keyed on OID.
- struct InProgressEnt / in_progress_list (relcache.c): the recursion guard stack for RelationBuildDesc.
- eoxact_list[] / eoxact_list_overflowed (relcache.c): fast path for end-of-transaction cleanup.
- criticalRelcachesBuilt / criticalSharedRelcachesBuilt (relcache.c): flags that switch ScanPgRelation from heap scan to index scan.
Initialization
- RelationCacheInitialize: Phase 1, empty hash + in_progress_list.
- RelationCacheInitializePhase2: Phase 2, shared stubs or init file.
- RelationCacheInitializePhase3: Phase 3, local stubs, critical indexes, criticalRelcachesBuilt, replace stubs with real rows.
- formrdesc: build a hard-wired nailed entry from compile-time attrs.
- load_critical_index: call RelationBuildDesc for one critical index and nail it.
Entry build
- ScanPgRelation: heap or index scan of pg_class for one OID.
- AllocateRelationDesc: palloc a RelationData in CacheMemoryContext.
- RelationBuildTupleDesc: scan pg_attribute, pg_attrdef, pg_constraint to fill rd_att.
- RelationBuildDesc: top-level builder; orchestrates the catalog scans, handles in_progress_list restart, calls RelationCacheInsert.
- RelationInitPhysicalAddr: fill rd_locator from pg_class.relfilenode or from relmapper.c for mapped rels.
- RelationBuildRuleLock: scan pg_rewrite, build rd_rules.
- RelationInitIndexAccessInfo: for index rels, fill rd_indam, rd_opfamily, rd_support, etc.
- RelationBuildLocalRelation: build an entry for a newly-created relation before it has a pg_class row.
Lookup interface
- RelationIdGetRelation: public entry; hash lookup → rebuild if stale → RelationBuildDesc on miss. Increments refcount.
- RelationIncrementReferenceCount / RelationDecrementReferenceCount: refcount ± 1 with resource-owner registration.
- RelationClose: decrement refcount; calls RelationCloseCleanup.
Invalidation
- RelationCacheInvalidateEntry: sinval dispatch; look up OID, call RelationFlushRelation; also sets in_progress_list[].invalidated.
- RelationCacheInvalidate: bulk flush on SI buffer overflow.
- RelationFlushRelation: dispatch; clear (refcnt=0) or rebuild (open).
- RelationRebuildRelation: swap-in-place rebuild for open entries.
- RelationInvalidateRelation: mark rd_isvalid = false without rebuild.
- RelationForgetRelation: caller-reported drop; mark dropped or clear.
Init file
- load_relcache_init_file: deserialise pg_internal.init into cache.
- write_relcache_init_file: serialise nailed entries to temp file, then rename.
- RelationCacheInitFilePreInvalidate / RelationCacheInitFilePostInvalidate: atomic invalidation of the init file around sinval processing.
- RelationCacheInitFileRemove: remove all init files; called during server startup.
- RelationIdIsInInitFile: predicate; should this OID appear in the local init file?
On-demand sub-structure loaders
- RelationGetFKeyList: FK constraint list.
- RelationGetIndexList: index OID list + rd_pkindex / rd_replidindex.
- RelationGetIndexExpressions / RelationGetIndexPredicate: parsed index expression / predicate trees.
- RelationGetIndexAttrBitmap: column bitmaps for HOT/FK/PK/replica identity; includes the restart loop for concurrent index-list changes.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| struct RelationData | utils/rel.h | 55 |
| struct relidcacheent | utils/cache/relcache.c | 128 |
| RelationIdCache | utils/cache/relcache.c | 134 |
| criticalRelcachesBuilt | utils/cache/relcache.c | 140 |
| in_progress_list | utils/cache/relcache.c | 170 |
| eoxact_list | utils/cache/relcache.c | 185 |
| ScanPgRelation | utils/cache/relcache.c | 340 |
| AllocateRelationDesc | utils/cache/relcache.c | 413 |
| RelationParseRelOptions | utils/cache/relcache.c | 468 |
| RelationBuildTupleDesc | utils/cache/relcache.c | 525 |
| RelationBuildRuleLock | utils/cache/relcache.c | 752 |
| RelationBuildDesc | utils/cache/relcache.c | 1059 |
| RelationInitPhysicalAddr | utils/cache/relcache.c | 1339 |
| RelationInitIndexAccessInfo | utils/cache/relcache.c | 1445 |
| RelationInitTableAccessMethod | utils/cache/relcache.c | 1829 |
| formrdesc | utils/cache/relcache.c | 1894 |
| RelationIdGetRelation | utils/cache/relcache.c | 2099 |
| RelationIncrementReferenceCount | utils/cache/relcache.c | 2187 |
| RelationDecrementReferenceCount | utils/cache/relcache.c | 2200 |
| RelationClose | utils/cache/relcache.c | 2220 |
| RelationDestroyRelation | utils/cache/relcache.c | 2439 |
| RelationRebuildRelation | utils/cache/relcache.c | 2585 |
| RelationFlushRelation | utils/cache/relcache.c | 2827 |
| RelationCacheInvalidateEntry | utils/cache/relcache.c | 2938 |
| RelationCacheInvalidate | utils/cache/relcache.c | 2994 |
| AtEOXact_RelationCache | utils/cache/relcache.c | 3226 |
| RelationBuildLocalRelation | utils/cache/relcache.c | 3515 |
| RelationCacheInitialize | utils/cache/relcache.c | 4002 |
| RelationCacheInitializePhase2 | utils/cache/relcache.c | 4048 |
| RelationCacheInitializePhase3 | utils/cache/relcache.c | 4107 |
| RelationGetFKeyList | utils/cache/relcache.c | 4731 |
| RelationGetIndexList | utils/cache/relcache.c | 4836 |
| load_relcache_init_file | utils/cache/relcache.c | 6167 |
| write_relcache_init_file | utils/cache/relcache.c | 6585 |
| write_item | utils/cache/relcache.c | 6797 |
| RelationIdIsInInitFile | utils/cache/relcache.c | 6820 |
| RelationCacheInitFilePreInvalidate | utils/cache/relcache.c | 6860 |
| RelationCacheInitFilePostInvalidate | utils/cache/relcache.c | 6885 |
| RelationCacheInitFileRemove | utils/cache/relcache.c | 6900 |
Source verification (as of 2026-06-05)
Facts about the current source at commit
273fe94. Open questions follow as the curator’s recorded gaps.
Verified facts
- RelationIdCache is a backend-private HTAB * keyed on Oid with entrysize = sizeof(RelIdCacheEnt). Verified in RelationCacheInitialize (relcache.c): hash_create("Relcache by OID", INITRELCACHESIZE, &ctl, HASH_ELEM | HASH_BLOBS).
- Four nailed local catalogs and seven nailed local indexes are counted by NUM_CRITICAL_LOCAL_RELS = 4 and NUM_CRITICAL_LOCAL_INDEXES = 7, hard-coded as compile-time constants. Verified in RelationCacheInitializePhase3. Changing the nailed set requires updating these constants and the corresponding formrdesc / load_critical_index call lists.
- criticalRelcachesBuilt switches ScanPgRelation from heap scan to index scan. Verified: ScanPgRelation passes indexOK && criticalRelcachesBuilt as the indexOK argument to systable_beginscan. Until Phase 3 sets the flag, ScanPgRelation always reads pg_class with a heap scan.
- The in_progress_list restart (goto retry) fires when an invalidation message arrives for the OID being built. Verified in RelationBuildDesc: after the full build sequence, if (in_progress_list[offset].invalidated) { RelationDestroyRelation(...); goto retry; }. RelationCacheInvalidateEntry sets in_progress_list[i].invalidated = true for matching OIDs when the relation is not yet in the hash table.
- The swap-in-place rebuild in RelationRebuildRelation preserves rd_att when equalTupleDescs returns true, so that pointers into the descriptor held by code using the open entry stay valid. Verified: keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att) and the conditional SWAPFIELD(TupleDesc, rd_att) path.
- write_relcache_init_file aborts early if any sinval has been received since startup, via relcacheInvalsReceived != 0. Verified at the top of write_relcache_init_file. This prevents writing a file that is already stale.
- RELCACHE_INIT_FILEMAGIC = 0x573266 is a compile-time constant used as a version check on pg_internal.init. Verified: defined at line 93 of relcache.c; a magic mismatch causes load_relcache_init_file to jump to read_failed and return false.
- The rd_createSubid family of fields is reset to InvalidSubTransactionId at end-of-transaction by AtEOXact_RelationCache. Verified in AtEOXact_cleanup: all four sub-ID fields are zeroed before the potential RelationClearRelation call.
- RelationGetIndexAttrBitmap has an explicit restart-on-flush loop. Verified: after the main foreach loop, a second call to RelationGetIndexList is compared to the earlier snapshot; if they differ, the bitmaps are freed and the code jumps to the restart: label.
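The in_progress_list retry described in the verified facts can be sketched as a small state machine. This is a simplified illustration, not the RelationBuildDesc code: ToyInProgress, toy_invalidate, toy_scan_catalog, toy_build_desc, and the catalog-version counter are hypothetical, and a "build" is reduced to reading one integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical recursion-guard entry: one per build in progress. */
typedef struct ToyInProgress
{
    unsigned reloid;       /* OID being built */
    bool     invalidated;  /* set if an invalidation arrived mid-build */
} ToyInProgress;

#define TOY_MAX_IN_PROGRESS 4
static ToyInProgress toy_in_progress[TOY_MAX_IN_PROGRESS];
static int toy_in_progress_len = 0;

/* Hypothetical catalog state read by the builder. */
static int toy_catalog_version = 1;
static int toy_pending_ddl = 0;   /* DDL changes that will land mid-build */

/* Invalidation handler: flag any in-progress build of this OID. */
static void
toy_invalidate(unsigned reloid)
{
    int i;

    for (i = 0; i < toy_in_progress_len; i++)
        if (toy_in_progress[i].reloid == reloid)
            toy_in_progress[i].invalidated = true;
}

/* Read the catalog; a pending DDL change may commit right after the read. */
static int
toy_scan_catalog(unsigned reloid)
{
    int seen = toy_catalog_version;

    if (toy_pending_ddl > 0)
    {
        toy_pending_ddl--;
        toy_catalog_version++;
        toy_invalidate(reloid);
    }
    return seen;
}

/* Build a descriptor, retrying if it was invalidated while being built. */
static int
toy_build_desc(unsigned reloid, int *attempts)
{
    int offset = toy_in_progress_len++;
    int built;

    assert(offset < TOY_MAX_IN_PROGRESS);
    toy_in_progress[offset].reloid = reloid;

retry:
    toy_in_progress[offset].invalidated = false;
    (*attempts)++;
    built = toy_scan_catalog(reloid);

    if (toy_in_progress[offset].invalidated)
        goto retry;            /* result may be stale: discard and rebuild */

    toy_in_progress_len--;     /* pop the guard entry */
    return built;
}
```

When one DDL change lands during the first attempt, the sketch discards that attempt and returns the value read by the second one, so the caller never sees a descriptor built from the older catalog state.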
Open questions
- debug_discard_caches interaction with RelationBuildDesc. When debug_discard_caches > 0 is set, RelationBuildDesc uses a temporary memory context to recover transient data, and RelationCacheInvalidate is called at every invalidation opportunity to discard all entries. The exact sequence in which in_progress_list is managed under concurrent debug_discard_caches stress is not fully traced here. Investigation path: read the comment block in RelationBuildDesc around MAYBE_RECOVER_RELATION_BUILD_MEMORY and instrument with debug_discard_caches = 1.
- relmapper.c update atomicity. Mapped relations’ file numbers live in pg_filenode.map, managed by relmapper.c. When a mapped relation’s storage is rewritten (e.g., by VACUUM FULL on a catalog), the mapper file is updated with its own two-phase rename. How RelationInitPhysicalAddr in RelationRebuildRelation interacts with a concurrent mapper update is not traced here. Investigation path: read relmapper.c and the RelFileLocatorSkippingWAL path in RelationInitPhysicalAddr.
- Init-file correctness under CREATE DATABASE. The code comment in load_relcache_init_file notes that lock info and physical addresses must be recomputed “in case the pg_internal.init file was copied from some other database by CREATE DATABASE.” The exact fields that could diverge between a copied init file and the new database’s on-disk state are not exhaustively listed here. Investigation path: read the CREATE DATABASE tablespace-rewriting path in commands/dbcommands.c.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Shallow seed list for follow-up documents; not analysis.
- Catcache as peer layer. PostgreSQL’s catcache (utils/cache/catcache.c) caches individual tuples from system catalogs (one entry per catalog row), while the relcache caches aggregated descriptors. They share the sinval invalidation channel but serve different consumers. A side-by-side of their retention policies (catcache entries, like relcache entries, are dropped on invalidation rather than by a size-bounded eviction policy) would clarify the architectural split. See postgres-catcache-syscache.md.
- Sinval invalidation coupling. RelationCacheInvalidateEntry is one of several handlers registered with inval.c; the init-file invalidation via RelationCacheInitFilePreInvalidate is another. The full sinval pipeline (shared-memory ring buffer, catchup on reconnect, catcache reset) is the mechanism tying the relcache to DDL. See postgres-cache-invalidation.md.
- CUBRID’s relcache analog. CUBRID maintains a per-thread schema_manager that caches OR_CLASSREP (the physical schema) and SM_CLASS (the logical object-relational schema) separately. The split mirrors PostgreSQL’s rd_att (physical) vs. rd_rel (catalog row) split, but CUBRID’s catalog is object-relational, making the mapping more complex. A comparison of bootstrap strategies (both systems have a set of “critical” tables that need hard-wired descriptors) would isolate what is universal from what is schema-model-specific.
- Descriptor stability and pointer aliasing. PostgreSQL’s swap-in-place rebuild (SWAPFIELD) is an ad-hoc approach to pointer stability. Newer database research (e.g., Andy Pavlo’s group on live schema changes in OLAP systems) explores epoch-based approaches where old descriptors survive until all readers have left the epoch. Whether such an approach would simplify the keep_tupdesc pointer-stability handling is an open design question for a future postgres-relcache-evolution.md.
- The formrdesc / init-file pattern as a general bootstrap technique. The problem of a component that needs itself to initialize is a special case of the circular-dependency problem in system initialization. Architecture of a DB System (Hellerstein et al., 2007; dbms-papers/fntdb07-architecture.md) §“Catalog Manager” briefly notes that hard-wired stubs are the universal solution; a survey of how MySQL/InnoDB, Oracle, and SQL Server each break the same recursion would be a useful comparative note.
Sources
PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
- src/backend/utils/cache/relcache.c: all bootstrap, build, lookup, invalidation, and init-file logic.
- src/include/utils/rel.h: RelationData struct definition and all access macros.
- src/include/utils/relcache.h: public API declarations, RELCACHE_INIT_FILENAME.
- src/include/utils/relmapper.h: mapped-relation file-number interface.
Textbook chapters (under knowledge/research/dbms-general/)
- Database System Concepts (Silberschatz), Ch. 11 §“System Catalog”: the catalog structure that RelationBuildDesc reads.
- Database Internals (Petrov), Ch. 2 §“Memory-Mapped Files and Direct I/O”: caching and descriptor-lifetime context.
Papers (under knowledge/research/dbms-papers/)
- Architecture of a DB System (Hellerstein et al., 2007), fntdb07-architecture.md. §“Catalog Manager” frames the catalog-as-relation bootstrapping problem that formrdesc solves.
Cross-references (sibling module docs)
- postgres-catcache-syscache.md: catcache (individual catalog-row cache) that feeds into relcache builds.
- postgres-cache-invalidation.md: sinval pipeline that drives RelationCacheInvalidateEntry.
- postgres-system-catalogs.md: the pg_class, pg_attribute etc. tables that RelationBuildDesc reads.
- postgres-memory-contexts.md: CacheMemoryContext that hosts all relcache entries.
- postgres-architecture-overview.md: Axis 6 (Catalog + cache layer) and the sinval loop.