
PostgreSQL Relation Cache — RelationData, Bootstrap Nailing, and Sinval-Driven Rebuild


Every query the executor runs needs metadata about the relations it touches — column names and types, physical file location, applicable indexes, rewrite rules, triggers, and row-security policies. In a simple interpreter that metadata could be fetched from the catalog on every plan node invocation, but that would serialize every tuple access behind catalog scans. The standard engineering answer is a relation descriptor cache: a per-backend in-memory store keyed on relation OID that maps an OID to a fully-assembled metadata record, paying the catalog-scan cost at most once per relation per session.

Database System Concepts (Silberschatz, ch. 11 §“System Catalog”) identifies the core tension: the catalog is itself a set of relations, so building the cache from the catalog requires the cache to already exist for the catalog’s own relations. Every DBMS must break this bootstrap recursion somehow. The approaches are:

  1. Hard-wired stubs. For the handful of catalogs needed to read any catalog row at all, the engine ships their descriptors as compile-time constants (array initializers generated from the schema at build time) and installs them directly, bypassing the normal catalog-scan path.
  2. Init files. To avoid re-paying the catalog-scan cost on every backend startup, the engine serialises the assembled descriptors to disk after the first complete build and reads them back on subsequent startups.
  3. Invalidation signals. Because the cached metadata can become stale when DDL changes a relation’s schema, the cache must listen for signals from the catalog-invalidation mechanism and either flush or rebuild the affected entry.

Two design choices shape every implementation:

  • Granularity of a cache entry. Does each entry correspond to one catalog row (pg_class) or to the fully-assembled aggregate of all the sub-catalogs that describe a relation (pg_attribute, pg_index, pg_rewrite, pg_trigger, …)? A fully-assembled entry is more expensive to build but eliminates per-field catalog lookups at query time.
  • Rebuild strategy for open entries. When an invalidation arrives for a relation that a backend currently has open, can the entry be rebuilt in place or must the backend wait? In-place rebuild requires that pointer stability for the sub-structures (tuple descriptor, rule trees) be maintained across the rebuild, because callers may hold raw pointers into them.

PostgreSQL chooses the fully-assembled aggregate entry and an in-place swap-and-rebuild discipline, for the reasons examined in §“PostgreSQL’s Approach”.

The textbook gives the model; this section names the engineering conventions that virtually every multi-user RDBMS adopts.

The descriptor cache is a process-local hash table keyed on the relation’s OID. OIDs are stable across the life of a database cluster (they are never recycled for existing objects), so the key never changes under a cached entry. The table is process-local because catalog metadata is read through normal MVCC snapshots; two backends at different transaction isolation levels may legitimately see different versions of a relation’s schema, so a shared cache would need snapshot-aware eviction logic that is more expensive than per-process tables.
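As a rough illustration of the idea (not PostgreSQL's HTAB implementation), the following sketch keeps a small process-local table keyed on OID and pays a simulated catalog scan only on a miss. All `toy_*` names, the table size, and the fake column count are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

#define TOY_CACHE_SLOTS 64          /* hypothetical size; must be a power of two */

typedef struct ToyRelEntry
{
    Oid relid;                      /* 0 means "slot unused" */
    int natts;                      /* stand-in for the assembled metadata */
} ToyRelEntry;

static ToyRelEntry toy_cache[TOY_CACHE_SLOTS];
static int toy_catalog_scans = 0;   /* counts simulated catalog reads */

/* Pretend catalog scan: the expensive step the cache exists to avoid. */
static int
toy_scan_catalog(Oid relid)
{
    toy_catalog_scans++;
    return (int) (relid % 7) + 1;   /* arbitrary fake column count */
}

/* Look up relid; on a miss, build the entry once and remember it. */
static ToyRelEntry *
toy_cache_lookup(Oid relid)
{
    uint32_t slot = relid & (TOY_CACHE_SLOTS - 1);

    assert(relid != 0);
    for (int probes = 0; probes < TOY_CACHE_SLOTS; probes++)
    {
        ToyRelEntry *e = &toy_cache[slot];

        if (e->relid == relid)
            return e;               /* hit: no catalog access */
        if (e->relid == 0)
        {
            e->relid = relid;       /* miss: build into the free slot */
            e->natts = toy_scan_catalog(relid);
            return e;
        }
        slot = (slot + 1) & (TOY_CACHE_SLOTS - 1);  /* linear probing */
    }
    return NULL;                    /* table full */
}
```

Looking up the same OID twice returns the same entry and leaves `toy_catalog_scans` at 1, which is the "at most once per relation per session" property in miniature.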

Rather than caching a single system-catalog row, the cache assembles one in-memory record per relation that includes:

  • the pg_class row (relation name, OID, type OID, persistence, etc.)
  • the tuple descriptor (one FormData_pg_attribute per column)
  • rule lock (parsed rewrite rules from pg_rewrite)
  • trigger descriptor (from pg_trigger)
  • RLS descriptor (from pg_policy)
  • index list (from pg_index)
  • index AM routine pointers (for index relations)
  • physical file location

Building one aggregate record is expensive, but it means query execution never reads system catalogs for per-column metadata or for rule/trigger applicability — it reads the cache entry.

A small subset of catalogs is needed to read any catalog at all (the relation and attribute catalogs, and their critical indexes). For these, the cache must be self-bootstrapping: they must be present before the code that builds cache entries from catalog scans can run. The convention is to nail these entries — mark them as never evictable and pre-install them from hard-wired compile-time data. All other entries are built on demand and can be evicted (flushed) on invalidation.

Reference counting with a resource-owner safety net


Cache entries must not be freed while a backend is using them, but the cache also should not keep entries alive forever. The standard pattern is a reference count: open (pin) increments it; close (unpin) decrements it; entries with refcount zero can be evicted. A resource-owner mechanism ties open references to transaction scope so that a cache reference is automatically released if the transaction aborts before an explicit close.
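A minimal sketch of the pin/unpin pattern with an owner-based safety net might look like the following. The `ToyOwner` type and the `toy_*` functions are hypothetical stand-ins, not PostgreSQL's resource-owner API; the point is only that every pin is remembered so that leftovers can be released at transaction end.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PINS 16             /* hypothetical fixed capacity */

typedef struct ToyRel
{
    int refcnt;
} ToyRel;

/* Hypothetical owner: remembers every pin taken in the current transaction. */
typedef struct ToyOwner
{
    ToyRel *pins[TOY_MAX_PINS];
    int     npins;
} ToyOwner;

/* open: bump the count and record the pin with the owner */
static void
toy_pin(ToyOwner *owner, ToyRel *rel)
{
    assert(owner->npins < TOY_MAX_PINS);
    rel->refcnt++;
    owner->pins[owner->npins++] = rel;
}

/* close: forget one matching pin, then drop the count */
static void
toy_unpin(ToyOwner *owner, ToyRel *rel)
{
    for (int i = owner->npins - 1; i >= 0; i--)
    {
        if (owner->pins[i] == rel)
        {
            owner->pins[i] = owner->pins[--owner->npins];
            assert(rel->refcnt > 0);
            rel->refcnt--;
            return;
        }
    }
    assert(!"unpin without matching pin");
}

/* Safety net: release whatever is still pinned; returns how many leaked. */
static int
toy_owner_release_all(ToyOwner *owner)
{
    int leaked = owner->npins;

    while (owner->npins > 0)
        owner->pins[--owner->npins]->refcnt--;
    return leaked;
}
```

If a caller pins two relations and errors out before closing them, `toy_owner_release_all` returns 2 and both counts fall back to zero, so the entries become evictable again.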

When DDL changes a relation, the system sends a shared-invalidation (sinval) message to all backends. On receipt, a backend that has the affected OID in cache must either drop the entry (if refcount is zero) or mark it invalid and rebuild it at next access. Rebuilding an entry that is currently open (refcount > 0) requires the rebuild to not move the existing RelationData struct or its sub-pointers, because callers may hold direct C-pointer references to sub-structures.

Assembling all nailed entries from scratch on every backend startup is slow. The first backend to complete the process serialises the nailed entries to a binary file (pg_internal.init). Subsequent backends load this file rather than scanning catalogs, skipping the bootstrap scan entirely. The file is published atomically (written under a temporary name, then renamed into place) and is removed whenever a committed change could have altered any entry stored in it.

| Theory / convention | PostgreSQL name |
| --- | --- |
| Per-backend OID-keyed cache | RelationIdCache (HTAB *) in relcache.c |
| Per-entry aggregate descriptor | RelationData (typedef’d as Relation) in rel.h |
| pg_class row sub-field | RelationData.rd_rel (Form_pg_class) |
| Tuple descriptor sub-field | RelationData.rd_att (TupleDesc) |
| Rule lock sub-field | RelationData.rd_rules (RuleLock *) |
| Physical file address | RelationData.rd_locator (RelFileLocator) |
| Nailed entry | rd_isnailed = true, rd_refcnt starts at 1 |
| Hard-wired stub builder | formrdesc() |
| Init-file load/write | load_relcache_init_file() / write_relcache_init_file() |
| Reference count | rd_refcnt, managed by RelationIncrementReferenceCount / RelationDecrementReferenceCount |
| Resource-owner integration | ResourceOwnerRememberRelationRef / relref_resowner_desc |
| Sinval-driven flush/rebuild | RelationCacheInvalidateEntry → RelationFlushRelation → RelationRebuildRelation |
| Invalid-but-open marker | rd_isvalid = false |
| In-progress-build guard | in_progress_list[] in relcache.c |

PostgreSQL’s relcache is a per-backend HTAB keyed on relation OID. Each entry is a RelationData struct (typedef’d Relation) allocated in CacheMemoryContext and holding the fully-assembled metadata for one relation. The critical design properties are: (1) a three-phase startup sequence that breaks the bootstrap recursion using compile-time hard-wired stubs and a binary init file; (2) a reference count plus resource-owner integration that prevents entries from being freed while in use; (3) an invalidation path that either clears a zero-refcount entry or performs a swap-in-place rebuild that preserves pointer stability for open entries.

RelationData (defined in src/include/utils/rel.h) is the central data structure. Its fields fall into five groups:

// struct RelationData — src/include/utils/rel.h (condensed)
typedef struct RelationData
{
    RelFileLocator rd_locator;      /* physical identity: spc/db/relNumber */
    SMgrRelation rd_smgr;           /* cached smgr file handle, or NULL */
    int         rd_refcnt;          /* reference count */
    ProcNumber  rd_backend;         /* owning backend for temp rels */
    bool        rd_islocaltemp;     /* this session's temp rel */
    bool        rd_isnailed;        /* nailed in cache (never evicted) */
    bool        rd_isvalid;         /* entry is valid (not stale) */

    /* subtransaction-tracking fields for new/modified rels */
    SubTransactionId rd_createSubid;
    SubTransactionId rd_newRelfilelocatorSubid;
    SubTransactionId rd_firstRelfilelocatorSubid;
    SubTransactionId rd_droppedSubid;

    /* catalog-derived sub-structures */
    Form_pg_class rd_rel;           /* pg_class row (fixed-width part) */
    TupleDesc   rd_att;             /* tuple descriptor (rd_rel->relnatts cols) */
    Oid         rd_id;              /* relation OID */
    LockInfoData rd_lockInfo;       /* lock manager identity */
    RuleLock   *rd_rules;           /* parsed rewrite rules, or NULL */
    MemoryContext rd_rulescxt;      /* private context for rd_rules */
    TriggerDesc *trigdesc;          /* trigger info, or NULL */
    struct RowSecurityDesc *rd_rsdesc;  /* RLS policies, or NULL */

    /* on-demand lazy fields (NULL until first request) */
    List       *rd_fkeylist;        /* FK cache */
    PartitionKey rd_partkey;        /* partition key */
    PartitionDesc rd_partdesc;      /* partition descriptor */
    List       *rd_indexlist;       /* OID list of indexes */
    Oid         rd_pkindex;         /* primary key index OID */
    Bitmapset  *rd_keyattr;         /* FK-usable cols */
    Bitmapset  *rd_hotblockingattr; /* HOT-blocking cols */

    /* index-relation-only fields */
    MemoryContext rd_indexcxt;      /* private context for index info */
    struct IndexAmRoutine *rd_indam;    /* index AM API struct */
    Oid        *rd_opfamily;        /* opfamily per index column */
    Oid        *rd_opcintype;       /* opclass input type per column */
    RegProcedure *rd_support;       /* support procedure OIDs */
    /* ... additional fields omitted ... */
} RelationData;

The split between eager and lazy fields is deliberate. rd_rel, rd_att, rd_rules, trigdesc, and rd_rsdesc are loaded eagerly by RelationBuildDesc because most callers need them. rd_fkeylist, rd_indexlist, rd_partkey, rd_partdesc, and the attribute-bitmap fields are loaded on first request and guarded by rd_fkeyvalid, rd_indexvalid, rd_partkey != NULL, and rd_attrsvalid flags. This avoids recursion: RelationGetIndexList triggers more catalog reads, which in turn need relcache entries — if that were part of the eager build, the recursion would be deeper and harder to break.
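The guard-flag pattern for lazy fields can be sketched as follows. This is a simplified model, assuming a single cached integer in place of a real index list; `ToyRel`, `toy_scan_pg_index`, and the other `toy_*` names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyRel
{
    bool indexvalid;    /* guard flag, playing the role of rd_indexvalid */
    int  nindexes;      /* cached result; meaningful only if indexvalid */
} ToyRel;

static int toy_index_scans = 0;     /* counts simulated pg_index scans */

/* Stand-in for the expensive catalog scan. */
static int
toy_scan_pg_index(const ToyRel *rel)
{
    (void) rel;
    toy_index_scans++;
    return 3;                       /* arbitrary fake result */
}

/* Load on first request; later calls reuse the cached value. */
static int
toy_get_index_count(ToyRel *rel)
{
    assert(rel != NULL);
    if (!rel->indexvalid)
    {
        rel->nindexes = toy_scan_pg_index(rel);
        rel->indexvalid = true;     /* set only after the result is stored */
    }
    return rel->nindexes;
}

/* An invalidation only clears the guard; the next request reloads. */
static void
toy_invalidate_index_list(ToyRel *rel)
{
    rel->indexvalid = false;
}
```

Two consecutive calls to `toy_get_index_count` trigger one scan; after `toy_invalidate_index_list` the next call scans again.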

flowchart TD
  OID["relation OID"] --> HASH["RelationIdCache (HTAB)\n keyed on rd_id"]
  HASH -- "hit + rd_isvalid" --> RET["return Relation\n(refcount++)"]
  HASH -- "hit + !rd_isvalid" --> REBUILD["RelationRebuildRelation\n(swap-in-place)"]
  REBUILD --> RET
  HASH -- "miss" --> BUILD["RelationBuildDesc\n(catalog scans)"]
  BUILD --> INSERT["RelationCacheInsert\nrd_isvalid = true"]
  INSERT --> RET

Figure 1 — RelationIdGetRelation lookup path. A cache hit on a valid entry returns immediately after incrementing the refcount. A hit on an invalid entry triggers in-place rebuild. A miss calls RelationBuildDesc to assemble the entry from system catalogs, then inserts it.

RelationBuildDesc: assembling an entry from catalogs


RelationBuildDesc is called on a cache miss. It performs a sequence of catalog reads, each of which may itself need relcache entries, so builds can nest. The nesting bottoms out at the nailed entries; the in_progress_list stack handles a separate hazard. Each build in progress registers its OID there, and if an invalidation arrives for that OID while the build is running, the build restarts from scratch (the goto retry loop) rather than returning a stale entry:

// RelationBuildDesc — utils/cache/relcache.c (condensed)
in_progress_list[in_progress_offset].reloid = targetRelId;
retry:
in_progress_list[in_progress_offset].invalidated = false;
pg_class_tuple = ScanPgRelation(targetRelId, true, false);
if (!HeapTupleIsValid(pg_class_tuple)) { /* deleted */ return NULL; }
relation = AllocateRelationDesc(relp);  /* alloc + copy pg_class fixed part */
RelationBuildTupleDesc(relation);       /* scans pg_attribute */
/* lazy fields are NIL/NULL; loaded on demand */
relation->rd_fkeylist = NIL;
relation->rd_partkey = NULL;
/* access method info: index or table AM */
if (relkind == RELKIND_INDEX || ...)
    RelationInitIndexAccessInfo(relation);
else if (RELKIND_HAS_TABLE_AM(relkind))
    RelationInitTableAccessMethod(relation);
RelationParseRelOptions(relation, pg_class_tuple);  /* rd_options */
if (relation->rd_rel->relhasrules)
    RelationBuildRuleLock(relation);    /* scans pg_rewrite */
if (relation->rd_rel->relhastriggers)
    RelationBuildTriggers(relation);    /* scans pg_trigger */
if (relation->rd_rel->relrowsecurity)
    RelationBuildRowSecurity(relation); /* scans pg_policy */
RelationInitLockInfo(relation);
RelationInitPhysicalAddr(relation);     /* rd_locator from relfilenode or mapper */
if (in_progress_list[in_progress_offset].invalidated)
{
    RelationDestroyRelation(relation, false);
    goto retry;                         /* restart if inval arrived mid-build */
}
if (insertIt)
    RelationCacheInsert(relation, true);
relation->rd_isvalid = true;
return relation;

The condensed sketch above flattens the real control flow. The actual function shows two things the sketch hides: (1) the eager vs. lazy boundary is set field-by-field immediately after AllocateRelationDesc copies the pg_class fixed part, with every lazy field explicitly zeroed and its rd_*valid guard cleared; and (2) the access-method initialization is a relkind dispatch — index relations get RelationInitIndexAccessInfo, table-AM relations get RelationInitTableAccessMethod, and partitioned tables get nothing because partitions inherit their AM:

// RelationBuildDesc — utils/cache/relcache.c (condensed, real)
relation = AllocateRelationDesc(relp);
RelationGetRelid(relation) = relid;
relation->rd_refcnt = 0;
relation->rd_isnailed = false;          /* ordinary rels are evictable */
relation->rd_createSubid = InvalidSubTransactionId;
/* ... three more SubTransactionId fields zeroed ... */
RelationBuildTupleDesc(relation);       /* scans pg_attribute → rd_att */
/* foreign key data is not loaded till asked for */
relation->rd_fkeylist = NIL;
relation->rd_fkeyvalid = false;
/* partitioning data is not loaded till asked for */
relation->rd_partkey = NULL;
relation->rd_partdesc = NULL;
relation->rd_partcheckvalid = false;
/* initialize access method information */
if (relation->rd_rel->relkind == RELKIND_INDEX ||
    relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
    RelationInitIndexAccessInfo(relation);
else if (RELKIND_HAS_TABLE_AM(relation->rd_rel->relkind) ||
         relation->rd_rel->relkind == RELKIND_SEQUENCE)
    RelationInitTableAccessMethod(relation);
else if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    /* Do nothing: partitions inherit the AM. */ ;
else
    Assert(relation->rd_rel->relam == InvalidOid);
RelationParseRelOptions(relation, pg_class_tuple);  /* rd_options */
if (relation->rd_rel->relhasrules)
    RelationBuildRuleLock(relation);    /* scans pg_rewrite */
else
{
    relation->rd_rules = NULL;
    relation->rd_rulescxt = NULL;
}
/* ... triggers (pg_trigger) and row security (pg_policy) likewise ... */
RelationInitLockInfo(relation);         /* see lmgr.c */
RelationInitPhysicalAddr(relation);     /* rd_locator */
relation->rd_smgr = NULL;               /* no open file yet */
heap_freetuple(pg_class_tuple);

Note the rd_refcnt = 0 / rd_isnailed = false defaults: a freshly built ordinary entry starts unpinned, and the caller (RelationIdGetRelation) pins it afterward. Nailed entries take the opposite path — they are built by formrdesc / load_critical_index, not by the ordinary cache-miss path, and start at rd_refcnt = 1.

RelationInitPhysicalAddr deserves attention. For most relations the physical file number is read directly from pg_class.relfilenode. For mapped relations (the handful whose file numbers cannot be stored in pg_class itself, because they must be known before pg_class can be read) it consults relmapper.c, which maintains a small separate file (pg_filenode.map) whose contents are loaded at startup. Not every system catalog is mapped: the mapped set is the nailed catalogs (pg_class, pg_attribute, pg_proc, pg_type), the shared catalogs, and the indexes on both.
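The two-way lookup can be modelled in a few lines. This is a sketch under stated assumptions: the map is a hard-coded array rather than a file, the file numbers in it are made up, and `toy_map_lookup` / `toy_physical_filenumber` are hypothetical names. Only the OIDs 1259 (pg_class) and 1249 (pg_attribute) are real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

typedef struct ToyMapping
{
    Oid relid;          /* relation OID */
    Oid filenumber;     /* physical file number (made-up values below) */
} ToyMapping;

/* Stand-in for the contents of the relation map file. */
static const ToyMapping toy_map[] = {
    {1259, 16401},      /* pg_class */
    {1249, 16402},      /* pg_attribute */
};

/* Return the mapped file number, or 0 if the relation is not mapped. */
static Oid
toy_map_lookup(Oid relid)
{
    for (size_t i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
    {
        if (toy_map[i].relid == relid)
            return toy_map[i].filenumber;
    }
    return 0;
}

/*
 * A nonzero relfilenode is used directly; zero means "mapped relation",
 * so the separate map is consulted instead.  A result of 0 here would be
 * an error in a real system.
 */
static Oid
toy_physical_filenumber(Oid relid, Oid relfilenode)
{
    assert(relid != 0);
    if (relfilenode != 0)
        return relfilenode;
    return toy_map_lookup(relid);
}
```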

PostgreSQL cannot scan pg_class without a relcache entry for pg_class, which would require scanning pg_class. It breaks the recursion in three phases:

Phase 1 — RelationCacheInitialize: Creates the empty RelationIdCache hash table and the in_progress_list array. No catalog access.

Phase 2 — RelationCacheInitializePhase2: Prepares for access to shared catalogs (pg_database, pg_authid, pg_auth_members, pg_shseclabel, pg_subscription). Tries to load global/pg_internal.init. On failure, calls formrdesc five times to install hard-wired stubs:

// RelationCacheInitializePhase2 — utils/cache/relcache.c (condensed)
if (!load_relcache_init_file(true))
{
    formrdesc("pg_database", DatabaseRelation_Rowtype_Id, true, ...);
    formrdesc("pg_authid", AuthIdRelation_Rowtype_Id, true, ...);
    formrdesc("pg_auth_members", AuthMemRelation_Rowtype_Id, true, ...);
    formrdesc("pg_shseclabel", SharedSecLabelRelation_Rowtype_Id, true, ...);
    formrdesc("pg_subscription", SubscriptionRelation_Rowtype_Id, true, ...);
#define NUM_CRITICAL_SHARED_RELS 5
}

Phase 3 — RelationCacheInitializePhase3: The critical phase. Tries to load $PGDATA/base/<dboid>/pg_internal.init. On failure, calls formrdesc four times for the local nailed catalogs (pg_class, pg_attribute, pg_proc, pg_type), then loads the seven critical indexes one by one:

// RelationCacheInitializePhase3 — utils/cache/relcache.c (condensed)
if (!load_relcache_init_file(false))
{
    formrdesc("pg_class", RelationRelation_Rowtype_Id, false, ...);
    formrdesc("pg_attribute", AttributeRelation_Rowtype_Id, false, ...);
    formrdesc("pg_proc", ProcedureRelation_Rowtype_Id, false, ...);
    formrdesc("pg_type", TypeRelation_Rowtype_Id, false, ...);
#define NUM_CRITICAL_LOCAL_RELS 4
}
// ...
if (!criticalRelcachesBuilt)
{
    load_critical_index(ClassOidIndexId, RelationRelationId);
    load_critical_index(AttributeRelidNumIndexId, AttributeRelationId);
    load_critical_index(IndexRelidIndexId, IndexRelationId);
    load_critical_index(OpclassOidIndexId, OperatorClassRelationId);
    load_critical_index(AccessMethodProcedureIndexId, AccessMethodProcedureRelationId);
    load_critical_index(RewriteRelRulenameIndexId, RewriteRelationId);
    load_critical_index(TriggerRelidNameIndexId, TriggerRelationId);
#define NUM_CRITICAL_LOCAL_INDEXES 7
    criticalRelcachesBuilt = true;
}

Setting criticalRelcachesBuilt = true is the key transition: after this point, ScanPgRelation and RelationBuildTupleDesc may use index scans rather than sequential heap scans. Before it, they fall back to heap scans unconditionally. Phase 3 ends by scanning the entire cache and replacing any formrdesc stubs (identifiable by rd_rel->relowner == InvalidOid) with real pg_class rows read via SearchSysCache1.

formrdesc itself builds a minimal RelationData from compile-time attribute arrays generated by genbki.pl (Desc_pg_class, Desc_pg_attribute, etc.) and marks the entry nailed (rd_isnailed = true, rd_refcnt = 1). The entry is marked rd_isvalid = true even though its pg_class data is only a stub (relowner is still InvalidOid); the fix-up pass at the end of Phase 3 replaces the stub fields with the real pg_class row.

flowchart LR
  P1["Phase 1\nRelationCacheInitialize\n(empty hash table)"]
  P2["Phase 2\nRelationCacheInitializePhase2\nshared stubs or init file"]
  P3["Phase 3\nRelationCacheInitializePhase3\nlocal stubs + critical indexes\ncriticalRelcachesBuilt = true\nfake → real pg_class rows"]
  NORMAL["Normal operation\nRelationIdGetRelation on demand"]
  P1 --> P2 --> P3 --> NORMAL

Figure 2 — Three-phase bootstrap. Phase 1 creates the empty hash. Phase 2 installs stubs for shared catalogs. Phase 3 installs local stubs, loads the 7 critical indexes, sets criticalRelcachesBuilt, then replaces stubs with real pg_class data. All three phases complete before the first user query runs.
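The phase transitions above can be reduced to a toy model: stubs are installed with a placeholder owner, a flag flips once the critical indexes exist, and a final pass replaces the placeholders. The names `ToyEntry`, `toy_form_stub`, `toy_choose_scan`, and `toy_fixup_stubs` are hypothetical, and the owner value 10 in the usage note is just an example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define TOY_INVALID_OID ((Oid) 0)

typedef enum { TOY_HEAP_SCAN, TOY_INDEX_SCAN } ToyScanKind;

typedef struct ToyEntry
{
    Oid  relid;
    Oid  relowner;      /* TOY_INVALID_OID marks a hard-wired stub */
    bool nailed;
} ToyEntry;

static bool toy_critical_built = false;

/* Before the critical indexes are loaded, only heap scans are safe. */
static ToyScanKind
toy_choose_scan(void)
{
    return toy_critical_built ? TOY_INDEX_SCAN : TOY_HEAP_SCAN;
}

/* Install a stub: nailed, but with a placeholder owner. */
static void
toy_form_stub(ToyEntry *e, Oid relid)
{
    e->relid = relid;
    e->relowner = TOY_INVALID_OID;
    e->nailed = true;
}

/* Final pass: give every stub its real owner; returns how many were fixed. */
static int
toy_fixup_stubs(ToyEntry *entries, size_t n, Oid real_owner)
{
    int fixed = 0;

    assert(toy_critical_built);     /* fix-up needs working catalog access */
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].relowner == TOY_INVALID_OID)
        {
            entries[i].relowner = real_owner;
            fixed++;
        }
    }
    return fixed;
}
```

Calling `toy_fixup_stubs(entries, 2, 10)` on two freshly formed stubs fixes both; a second call finds nothing left to fix, which mirrors how the relowner test distinguishes stubs from real rows.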

write_relcache_init_file(bool shared) serialises the nailed (critical) relcache entries, plus any other entries that RelationIdIsInInitFile accepts, to a binary file at end-of-startup. The file is written to a temporary name and renamed into place atomically, so a concurrent reader never sees a partial file. The file starts with a magic number (RELCACHE_INIT_FILEMAGIC = 0x573266) that acts as a version identifier; a mismatch causes the file to be ignored and the cache to be rebuilt from scratch.

load_relcache_init_file reads the file back, reconstructs each RelationData in CacheMemoryContext, recomputes lock info and physical addresses (in case the file was copied from another database by CREATE DATABASE), and counts nailed entries against NUM_CRITICAL_LOCAL_RELS / NUM_CRITICAL_LOCAL_INDEXES as a sanity check.

The write side is a length-prefixed binary dump. After the magic number, it walks RelationIdCache and, for each entry in the correct group (relisshared == shared) that belongs in the file, writes the RelationData struct, the pg_class form, each attribute’s fixed part, and the reloptions blob — then, for RELKIND_INDEX, the pg_index tuple and the opfamily/opcintype/support vectors. The whole thing is gated on having received zero sinval events since startup, and the final step is the atomic temp-file rename under RelCacheInitLock:

// write_relcache_init_file — utils/cache/relcache.c (condensed, real)
if (relcacheInvalsReceived != 0L)
    return;                             /* already stale; don't bother */
/* ... snprintf temp + final names, AllocateFile(tempfilename) ... */
magic = RELCACHE_INIT_FILEMAGIC;
fwrite(&magic, 1, sizeof(magic), fp);
hash_seq_init(&status, RelationIdCache);
while ((idhentry = hash_seq_search(&status)) != NULL)
{
    Relation    rel = idhentry->reldesc;
    Form_pg_class relform = rel->rd_rel;

    if (relform->relisshared != shared) /* wrong group */
        continue;
    if (!shared && !RelationIdIsInInitFile(RelationGetRelid(rel)))
    {
        Assert(!rel->rd_isnailed);      /* nailed must be stored */
        continue;
    }
    write_item(rel, sizeof(RelationData), fp);
    write_item(relform, CLASS_TUPLE_SIZE, fp);
    for (i = 0; i < relform->relnatts; i++)
        write_item(TupleDescAttr(rel->rd_att, i), ATTRIBUTE_FIXED_PART_SIZE, fp);
    write_item(rel->rd_options, rel->rd_options ? VARSIZE(rel->rd_options) : 0, fp);
    if (rel->rd_rel->relkind == RELKIND_INDEX) { /* pg_index + opfamily/support */ }
}
/* serialize against the unlink-and-SI path */
LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);
AcceptInvalidationMessages();
if (relcacheInvalsReceived == 0L)
{
    if (rename(tempfilename, finalfilename) < 0)    /* atomic publish */
        unlink(tempfilename);
}
else
    unlink(tempfilename);               /* obsolete; drop it */
LWLockRelease(RelCacheInitLock);

The read side reverses the dump but cannot trust it blindly. Two structural checks (magic != RELCACHE_INIT_FILEMAGIC, and each descriptor’s len != sizeof(RelationData)) jump to read_failed so a layout change forces a clean rebuild. After reconstructing each entry it recomputes lock and physical addressing — the on-disk rd_locator / rd_lockInfo may belong to a different database if the file was copied by CREATE DATABASE. Finally it cross-checks the nailed-entry counts against the compile-time constants before inserting anything:

// load_relcache_init_file — utils/cache/relcache.c (condensed, real)
if (fread(&magic, 1, sizeof(magic), fp) != sizeof(magic)) goto read_failed;
if (magic != RELCACHE_INIT_FILEMAGIC) goto read_failed;
for (relno = 0;; relno++)
{
    if ((nread = fread(&len, 1, sizeof(len), fp)) != sizeof(len))
    {
        if (nread == 0) break;          /* clean EOF */
        goto read_failed;
    }
    if (len != sizeof(RelationData)) goto read_failed;  /* layout drift */
    rel = rels[num_rels++] = (Relation) palloc(len);
    fread(rel, 1, len, fp);
    /* ... read pg_class form, rebuild rd_att, AM info, nailed counts ... */
    rel->rd_smgr = NULL;
    rel->rd_refcnt = rel->rd_isnailed ? 1 : 0;
    rel->rd_indexvalid = false;
    rel->rd_indexlist = NIL;
    /* recompute: file may have been copied by CREATE DATABASE */
    RelationInitLockInfo(rel);
    RelationInitPhysicalAddr(rel);
}
if (!shared && (nailed_rels != NUM_CRITICAL_LOCAL_RELS ||
                nailed_indexes != NUM_CRITICAL_LOCAL_INDEXES))
    goto read_failed;                   /* critical set changed → rebuild */
for (relno = 0; relno < num_rels; relno++)
    RelationCacheInsert(rels[relno], false);
criticalRelcachesBuilt = true;          /* (criticalSharedRelcachesBuilt if shared) */
return true;
read_failed:
/* leak the half-built rels (not in cache) and fall back to formrdesc */
return false;

Note that a successful load_relcache_init_file is what sets criticalRelcachesBuilt (or criticalSharedRelcachesBuilt) — the same flag that Phase 3 sets after load_critical_index. Either path reaches the post-bootstrap state; the init file is just the fast lane.

flowchart TD
  START["backend startup\nPhase 2 / Phase 3"] --> TRY["load_relcache_init_file(shared)"]
  TRY --> OPEN{"AllocateFile ok?"}
  OPEN -- "no" --> SLOW["formrdesc stubs +\nload_critical_index\n(slow bootstrap)"]
  OPEN -- "yes" --> MAGIC{"magic + len checks?"}
  MAGIC -- "fail" --> RF["read_failed\nreturn false"]
  RF --> SLOW
  MAGIC -- "ok" --> RECON["reconstruct entries\nRelationInitLockInfo\nRelationInitPhysicalAddr"]
  RECON --> COUNT{"nailed counts ==\nNUM_CRITICAL_*?"}
  COUNT -- "no" --> RF
  COUNT -- "yes" --> INS["RelationCacheInsert all\ncriticalRelcachesBuilt = true"]
  SLOW --> RUN["normal operation"]
  INS --> RUN
  RUN --> WRITE["write_relcache_init_file\nat end of startup\nif relcacheInvalsReceived == 0"]
  WRITE --> RENAME["rename temp → pg_internal.init\nunder RelCacheInitLock"]

Figure 4 — Init-file fast lane vs. slow bootstrap. load_relcache_init_file short-circuits the formrdesc path when the file opens, passes the magic/length structural checks, and matches the expected nailed-entry counts; any failure routes through read_failed back to the slow bootstrap. The first backend to finish startup without seeing a sinval event writes the file back and renames it into place atomically.
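The write-then-rename and validate-on-load pattern can be tried in isolation with plain stdio. The sketch below is not the relcache file format: `ToyDesc`, the `toy_*` functions, and the file name used in the usage note are assumptions, and the magic value is reused only for flavour. It shows a magic header, length-prefixed records, atomic publication through `rename`, and a loader that rejects any structural mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MAGIC 0x573266u         /* reused from the text for illustration */

typedef struct ToyDesc
{
    uint32_t relid;
    int32_t  natts;
} ToyDesc;

/* Write all records to "<finalname>.tmp", then rename into place. */
static int
toy_write_init_file(const char *finalname, const ToyDesc *descs, size_t n)
{
    char     tmpname[256];
    uint32_t magic = TOY_MAGIC;
    FILE    *fp;
    int      ok;

    assert(descs != NULL || n == 0);
    snprintf(tmpname, sizeof(tmpname), "%s.tmp", finalname);
    fp = fopen(tmpname, "wb");
    if (fp == NULL)
        return -1;
    ok = fwrite(&magic, sizeof(magic), 1, fp) == 1;
    for (size_t i = 0; ok && i < n; i++)
    {
        size_t len = sizeof(ToyDesc);   /* length prefix for each record */

        ok = fwrite(&len, sizeof(len), 1, fp) == 1 &&
             fwrite(&descs[i], len, 1, fp) == 1;
    }
    if (fclose(fp) != 0)
        ok = 0;
    if (!ok || rename(tmpname, finalname) != 0)
    {
        remove(tmpname);                /* never leave a partial file behind */
        return -1;
    }
    return 0;
}

/* Returns the number of records loaded, or -1 if the file is unusable. */
static int
toy_load_init_file(const char *name, ToyDesc *out, size_t max)
{
    uint32_t magic;
    size_t   len;
    int      n = 0;
    FILE    *fp = fopen(name, "rb");

    if (fp == NULL)
        return -1;
    if (fread(&magic, sizeof(magic), 1, fp) != 1 || magic != TOY_MAGIC)
        goto read_failed;
    for (;;)
    {
        size_t got = fread(&len, 1, sizeof(len), fp);

        if (got == 0)
            break;                      /* clean EOF */
        if (got != sizeof(len) || len != sizeof(ToyDesc) || (size_t) n >= max)
            goto read_failed;           /* truncated, layout drift, or overflow */
        if (fread(&out[n], len, 1, fp) != 1)
            goto read_failed;
        n++;
    }
    fclose(fp);
    return n;

read_failed:
    fclose(fp);
    return -1;
}
```

Writing two records to a file such as `toy_internal.init` and loading them back returns 2; removing the file makes the loader return -1, which is the signal to take the slow path.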

The file is invalidated by RelationCacheInitFilePreInvalidate / RelationCacheInitFilePostInvalidate (called from inval.c by the backend that commits the change). Pre-invalidation acquires RelCacheInitLock and unlinks the init file; post-invalidation releases the lock once the sinval messages have been sent. A backend that loaded the file just before it was unlinked may hold stale entries, but it will process those sinval messages and correct them before the stale data is used.

Every caller of RelationIdGetRelation receives an entry with rd_refcnt incremented. The caller is responsible for calling RelationClose (which calls RelationDecrementReferenceCount) when done. In practice, access goes through table_open / table_close (or index_open / index_close), which call RelationIdGetRelation and RelationClose respectively.

To prevent leaks on error paths, each open reference is registered with CurrentResourceOwner via ResourceOwnerRememberRelationRef:

// RelationIncrementReferenceCount — utils/cache/relcache.c (condensed)
void
RelationIncrementReferenceCount(Relation rel)
{
    ResourceOwnerEnlarge(CurrentResourceOwner);
    rel->rd_refcnt += 1;
    if (!IsBootstrapProcessingMode())
        ResourceOwnerRememberRelationRef(CurrentResourceOwner, rel);
}

The relref_resowner_desc descriptor registers ResOwnerReleaseRelation as the cleanup callback, which decrements rd_refcnt at transaction end if the caller forgot to close the relation. This ensures that long-lived processes (like the autovacuum worker) cannot accumulate leaked relcache pins across transactions.

Invalidation: RelationFlushRelation and the swap-in-place rebuild


When a sinval message arrives for a relation OID, RelationCacheInvalidateEntry increments relcacheInvalsReceived and calls RelationFlushRelation. The flush decision tree is:

flowchart TD
  F["RelationFlushRelation(rel)"]
  F --> NEW{"rd_createSubid != 0\nor rd_firstRelfilelocatorSubid != 0?"}
  NEW -- "yes (new-in-txn rel)" --> TXNSTATE{"IsTransactionState\nand not dropped?"}
  TXNSTATE -- "yes" --> REBUILDNEW["bump refcnt\nRelationRebuildRelation\ndecrement refcnt"]
  TXNSTATE -- "no" --> INVAL["RelationInvalidateRelation\nrd_isvalid = false"]
  NEW -- "no (pre-existing rel)" --> REFZERO{"refcnt == 0?"}
  REFZERO -- "yes" --> CLEAR["RelationClearRelation\n(destroy entry)"]
  REFZERO -- "no + !IsTransactionState" --> INVAL
  REFZERO -- "no + nailed + refcnt==1" --> INVAL
  REFZERO -- "no + open" --> REBUILD["RelationRebuildRelation\n(swap-in-place)"]

Figure 3 — RelationFlushRelation decision tree. A zero-refcount pre-existing entry is simply cleared. An open entry is rebuilt in-place by building a new RelationData, swapping contents field-by-field (preserving rd_refcnt, rd_smgr, transaction sub-IDs, and optionally rd_att / rd_rules / rd_rsdesc if they are structurally equal), then destroying the temporary new entry.
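The decision tree can be written as a small pure function. This is a model of the branches shown in the diagram, not the real function: `ToyRelState` collapses the sub-ID fields into booleans, and `toy_flush_action` returns a label instead of acting on the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_CLEAR, TOY_INVALIDATE, TOY_REBUILD } ToyFlushAction;

typedef struct ToyRelState
{
    bool new_in_txn;    /* rd_createSubid or rd_firstRelfilelocatorSubid set */
    bool dropped;       /* rd_droppedSubid set */
    bool nailed;        /* rd_isnailed */
    int  refcnt;        /* rd_refcnt */
} ToyRelState;

static ToyFlushAction
toy_flush_action(const ToyRelState *rel, bool in_transaction)
{
    assert(rel != NULL);

    if (rel->new_in_txn)
    {
        /* never drop a new-in-transaction entry: its "new" status would be lost */
        return (in_transaction && !rel->dropped) ? TOY_REBUILD : TOY_INVALIDATE;
    }
    if (rel->refcnt == 0)
        return TOY_CLEAR;           /* nobody is using it: destroy */
    if (!in_transaction)
        return TOY_INVALIDATE;      /* no catalog access possible yet */
    if (rel->nailed && rel->refcnt == 1)
        return TOY_INVALIDATE;      /* nailed but otherwise unused */
    return TOY_REBUILD;             /* open: swap-in-place rebuild */
}
```

Returning a label keeps the branching testable on its own; a caller would dispatch on the result to the clear, invalidate, or rebuild routine.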

The swap-in-place logic in RelationRebuildRelation is the most complex part of the relcache. The full struct is swapped with memcpy, then a dozen fields are swapped back to preserve invariants:

// RelationRebuildRelation — utils/cache/relcache.c (condensed)
newrel = RelationBuildDesc(save_relid, false);  /* build into temp entry */
keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
keep_partkey = (relation->rd_partkey != NULL);  /* immutable once set */
/* swap all fields at once */
{
    RelationData tmp;

    memcpy(&tmp, newrel, ...);
    memcpy(newrel, relation, ...);
    memcpy(relation, &tmp, ...);
}
/* then swap back fields that must be preserved */
SWAPFIELD(SMgrRelation, rd_smgr);       /* back-links from smgr level */
SWAPFIELD(int, rd_refcnt);              /* callers hold this count */
SWAPFIELD(SubTransactionId, rd_createSubid);
/* ... other SubTransactionId fields ... */
SWAPFIELD(Form_pg_class, rd_rel);
memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE);    /* update content */
if (keep_tupdesc)
    SWAPFIELD(TupleDesc, rd_att);       /* preserve pointer */
if (keep_rules)
{
    SWAPFIELD(RuleLock*, rd_rules);
    SWAPFIELD(MemoryContext, rd_rulescxt);
}
if (keep_policies)
    SWAPFIELD(RowSecurityDesc*, rd_rsdesc);
if (keep_partkey)
    SWAPFIELD(PartitionKey, rd_partkey);
/* ... partition desc context handling ... */
RelationDestroyRelation(newrel, !keep_tupdesc);

Preserving rd_att when the tuple descriptor is structurally unchanged is important: code that is working with an open relcache entry may hold raw pointers to the TupleDesc, so moving the descriptor would invalidate those pointers. equalTupleDescs performs a structural comparison to decide whether the preservation is safe.
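A toy version of the swap-then-swap-back discipline makes the pointer-stability argument concrete. The struct below has only three fields, "structurally equal" is reduced to comparing a column count, and `TOY_SWAPFIELD` / `toy_rebuild_in_place` are hypothetical names rather than the real macro and function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyTupleDesc
{
    int natts;
} ToyTupleDesc;

typedef struct ToyRel
{
    int           refcnt;       /* must survive the rebuild */
    int           relpages;     /* plain data: take the fresh value */
    ToyTupleDesc *att;          /* keep the old pointer if unchanged */
} ToyRel;

#define TOY_SWAPFIELD(type, a, b, field) \
    do { type tmp_ = (a)->field; (a)->field = (b)->field; (b)->field = tmp_; } while (0)

/*
 * Rebuild *rel in place from the freshly built, heap-allocated *newrel.
 * Afterwards newrel holds everything that should be discarded, and is freed
 * together with the descriptor it ends up owning.
 */
static void
toy_rebuild_in_place(ToyRel *rel, ToyRel *newrel)
{
    bool   keep_att = (rel->att->natts == newrel->att->natts);
    ToyRel tmp;

    assert(rel != newrel);

    /* swap all fields at once */
    memcpy(&tmp, newrel, sizeof(ToyRel));
    memcpy(newrel, rel, sizeof(ToyRel));
    memcpy(rel, &tmp, sizeof(ToyRel));

    /* then swap back what callers depend on */
    TOY_SWAPFIELD(int, rel, newrel, refcnt);
    if (keep_att)
        TOY_SWAPFIELD(ToyTupleDesc *, rel, newrel, att);

    free(newrel->att);
    free(newrel);
}
```

After the call, `rel` sits at the same address with its old reference count, the new `relpages`, and (when the column count matched) the original `att` pointer, so a caller holding that pointer never notices the rebuild.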

If RelationBuildDesc returns NULL during a rebuild — which can happen for a relation visible in a historic decoding snapshot but not in the current one — and a historic snapshot is active, the function returns without rebuilding (the entry remains invalid); otherwise it ereports an error because a relation should not be dropped while still open.

Lazy computed fields: on-demand sub-structure loading


Several RelationData fields are populated only on first request. RelationGetIndexList scans pg_index and caches the result in rd_indexlist; RelationGetFKeyList scans pg_constraint; both copy their result into CacheMemoryContext and only then set the matching rd_*valid guard.

A subtle concurrency issue arises in RelationGetIndexAttrBitmap: during the loop that opens each index to collect attribute bitmaps, a relcache flush may arrive and reset rd_indexlist. The function handles this with a restart loop — after collecting all bitmaps it calls RelationGetIndexList a second time and compares the result to its earlier snapshot; if they differ, it frees the partially-built bitmaps and restarts from scratch.
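The restart loop follows a snapshot, compute, recheck shape. In the sketch below a version counter stands in for comparing two index lists, and a hook simulates a flush arriving in the middle of the first pass; `ToyRel`, `toy_flush_once`, and `toy_collect_bitmap` are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyRel
{
    int index_list_version;     /* bumped whenever the index list changes */
    int nindexes;
} ToyRel;

typedef void (*ToyHook) (ToyRel *rel, int attempt);

/* Simulated flush: during the first attempt only, the index list changes. */
static void
toy_flush_once(ToyRel *rel, int attempt)
{
    if (attempt == 1)
    {
        rel->index_list_version++;
        rel->nindexes = 3;
    }
}

/* Collect one bit per index; redo the work if the list changed meanwhile. */
static int
toy_collect_bitmap(ToyRel *rel, ToyHook during_scan, int *attempts_out)
{
    int attempts = 0;

    assert(rel != NULL && attempts_out != NULL);
    for (;;)
    {
        int seen_version = rel->index_list_version;     /* snapshot */
        int result = 0;

        attempts++;
        for (int i = 0; i < rel->nindexes; i++)
            result |= 1 << i;           /* stand-in for collecting columns */

        if (during_scan != NULL)
            during_scan(rel, attempts); /* a flush may land here */

        if (rel->index_list_version == seen_version)
        {
            *attempts_out = attempts;
            return result;              /* list unchanged: result is good */
        }
        /* list changed under us: discard the partial result and restart */
    }
}
```

Starting with two indexes and the simulated flush, the function needs two attempts and returns the bitmap for three indexes, because the first pass is thrown away.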

rd_createSubid, rd_newRelfilelocatorSubid, rd_firstRelfilelocatorSubid, and rd_droppedSubid track which subtransaction last mutated the relation’s file identity. These are critical for RelationNeedsWAL(), which, when wal_level = minimal, returns false for a relation whose storage was created in the current top transaction (so WAL records are unnecessary for that storage). At end-of-transaction AtEOXact_RelationCache resets all sub-IDs to zero and, if the relation was created and then dropped in the same transaction, clears the entry.
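The WAL decision that these sub-IDs feed can be paraphrased as a small predicate. This is a simplified restatement, assuming the wal_level setting is passed in as a boolean; `ToyRel` and `toy_relation_needs_wal` are hypothetical names, not the real macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SubTransactionId;
#define TOY_INVALID_SUBXID ((SubTransactionId) 0)

typedef struct ToyRel
{
    bool             permanent;                 /* not temp, not unlogged */
    SubTransactionId createSubid;               /* set if created in this xact */
    SubTransactionId firstRelfilelocatorSubid;  /* set if storage replaced in this xact */
} ToyRel;

static bool
toy_relation_needs_wal(const ToyRel *rel, bool wal_level_minimal)
{
    assert(rel != NULL);

    if (!rel->permanent)
        return false;       /* temp and unlogged relations are never WAL-logged */
    if (!wal_level_minimal)
        return true;        /* higher WAL levels always log permanent relations */

    /* wal_level = minimal: storage created in this transaction can skip WAL */
    return rel->createSubid == TOY_INVALID_SUBXID &&
           rel->firstRelfilelocatorSubid == TOY_INVALID_SUBXID;
}
```

Resetting the sub-IDs to zero at end of transaction, as described above, is what makes the same relation start needing WAL again in the next transaction.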

Anchor on symbol names, not line numbers. Use git grep -n '<symbol>' src/backend/utils/cache/relcache.c to relocate; line numbers in the position-hint table are scoped to commit 273fe94.

  • struct RelationData (rel.h) — the aggregate descriptor typedef’d as Relation. Sub-fields: rd_locator, rd_rel, rd_att, rd_rules, trigdesc, rd_rsdesc, rd_indexlist, rd_indam.
  • struct relidcacheent / RelationIdCache (relcache.c) — the per-backend HTAB * keyed on OID.
  • struct InProgressEnt / in_progress_list (relcache.c) — the recursion guard stack for RelationBuildDesc.
  • eoxact_list[] / eoxact_list_overflowed (relcache.c) — fast path for end-of-transaction cleanup.
  • criticalRelcachesBuilt / criticalSharedRelcachesBuilt (relcache.c) — flags that switch ScanPgRelation from heap scan to index scan.
  • RelationCacheInitialize — Phase 1: empty hash + in_progress_list.
  • RelationCacheInitializePhase2 — Phase 2: shared stubs or init file.
  • RelationCacheInitializePhase3 — Phase 3: local stubs, critical indexes, criticalRelcachesBuilt, replace stubs with real rows.
  • formrdesc — build a hard-wired nailed entry from compile-time attrs.
  • load_critical_index — call RelationBuildDesc for one critical index and nail it.
  • ScanPgRelation — heap or index scan of pg_class for one OID.
  • AllocateRelationDesc — palloc a RelationData in CacheMemoryContext.
  • RelationBuildTupleDesc — scan pg_attribute, pg_attrdef, pg_constraint to fill rd_att.
  • RelationBuildDesc — top-level builder: orchestrates the catalog scans, handles in_progress_list restart, calls RelationCacheInsert.
  • RelationInitPhysicalAddr — fill rd_locator from pg_class.relfilenode or from relmapper.c for mapped rels.
  • RelationBuildRuleLock — scan pg_rewrite, build rd_rules.
  • RelationInitIndexAccessInfo — for index rels: fill rd_indam, rd_opfamily, rd_support, etc.
  • RelationBuildLocalRelation — build an entry for a newly-created relation before it has a pg_class row.
  • RelationIdGetRelation — public entry: hash lookup → rebuild if stale → RelationBuildDesc on miss. Increments refcount.
  • RelationIncrementReferenceCount / RelationDecrementReferenceCount — refcount ± 1 with resource-owner registration.
  • RelationClose — decrement refcount; calls RelationCloseCleanup.
  • RelationCacheInvalidateEntry — sinval dispatch: look up OID, call RelationFlushRelation; also sets in_progress_list[].invalidated.
  • RelationCacheInvalidate — bulk flush on SI buffer overflow.
  • RelationFlushRelation — dispatch: clear (refcnt=0) or rebuild (open).
  • RelationRebuildRelation — swap-in-place rebuild for open entries.
  • RelationInvalidateRelation — mark rd_isvalid = false without rebuild.
  • RelationForgetRelation — caller-reported drop; mark dropped or clear.
  • load_relcache_init_file — deserialise pg_internal.init into cache.
  • write_relcache_init_file — serialise nailed entries to temp file, then rename.
  • RelationCacheInitFilePreInvalidate / RelationCacheInitFilePostInvalidate — atomic invalidation of the init file around sinval processing.
  • RelationCacheInitFileRemove — remove init files (used by initdb, pg_upgrade, etc.).
  • RelationIdIsInInitFile — predicate: should this OID appear in the local init file?
  • RelationGetFKeyList — FK constraint list.
  • RelationGetIndexList — index OID list + rd_pkindex / rd_replidindex.
  • RelationGetIndexExpressions / RelationGetIndexPredicate — parsed index expression / predicate trees.
  • RelationGetIndexAttrBitmap — column bitmaps for HOT/FK/PK/replica identity; includes the restart loop for concurrent index-list changes.
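The clear-or-rebuild dispatch listed for RelationFlushRelation and RelationRebuildRelation can be sketched as follows. This is a toy model under stated assumptions: DemoEntry, demo_build, and demo_flush are hypothetical, a single slot pointer stands in for the hash table, and the additional cases the real function distinguishes (such as nailed entries and transaction state) are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cache entry; natts stands in for the assembled metadata. */
typedef struct DemoEntry
{
    unsigned oid;
    int      refcnt;
    int      natts;
} DemoEntry;

/* Build a fresh entry from the (simulated) current catalog contents. */
static DemoEntry *
demo_build(unsigned oid, int natts)
{
    DemoEntry *e = malloc(sizeof(*e));

    assert(e != NULL);
    e->oid = oid;
    e->refcnt = 0;
    e->natts = natts;
    return e;
}

/*
 * Flush one entry. Returns true if the entry survived (rebuilt in place),
 * false if it was removed from the cache slot.
 */
static bool
demo_flush(DemoEntry **slot, int current_natts)
{
    DemoEntry *old = *slot;

    if (old->refcnt == 0)
    {
        /* Nobody holds a pointer: dropping the entry is safe. */
        free(old);
        *slot = NULL;
        return false;
    }

    /*
     * Entry is open: build a new descriptor, then move its contents into
     * the old address so that pointers held by callers stay valid.
     */
    DemoEntry *fresh = demo_build(old->oid, current_natts);
    int        keep_refcnt = old->refcnt;

    *old = *fresh;
    old->refcnt = keep_refcnt;
    free(fresh);
    return true;
}
```

The key property is that a holder's pointer is unchanged after a rebuild while the contents behind it are refreshed; only unreferenced entries are actually freed.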

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| struct RelationData | utils/rel.h | 55 |
| RelationIdCache | utils/cache/relcache.c | 134 |
| criticalRelcachesBuilt | utils/cache/relcache.c | 140 |
| in_progress_list | utils/cache/relcache.c | 170 |
| eoxact_list | utils/cache/relcache.c | 185 |
| ScanPgRelation | utils/cache/relcache.c | 340 |
| AllocateRelationDesc | utils/cache/relcache.c | 413 |
| RelationParseRelOptions | utils/cache/relcache.c | 468 |
| RelationBuildTupleDesc | utils/cache/relcache.c | 525 |
| RelationBuildRuleLock | utils/cache/relcache.c | 752 |
| RelationBuildDesc | utils/cache/relcache.c | 1059 |
| RelationInitPhysicalAddr | utils/cache/relcache.c | 1339 |
| RelationInitIndexAccessInfo | utils/cache/relcache.c | 1445 |
| RelationInitTableAccessMethod | utils/cache/relcache.c | 1829 |
| RelationDestroyRelation | utils/cache/relcache.c | 2439 |
| formrdesc | utils/cache/relcache.c | 1894 |
| RelationIdGetRelation | utils/cache/relcache.c | 2099 |
| RelationIncrementReferenceCount | utils/cache/relcache.c | 2187 |
| RelationDecrementReferenceCount | utils/cache/relcache.c | 2200 |
| RelationClose | utils/cache/relcache.c | 2220 |
| RelationRebuildRelation | utils/cache/relcache.c | 2585 |
| RelationFlushRelation | utils/cache/relcache.c | 2827 |
| RelationCacheInvalidateEntry | utils/cache/relcache.c | 2938 |
| RelationCacheInvalidate | utils/cache/relcache.c | 2994 |
| AtEOXact_RelationCache | utils/cache/relcache.c | 3226 |
| RelationBuildLocalRelation | utils/cache/relcache.c | 3515 |
| RelationCacheInitialize | utils/cache/relcache.c | 4002 |
| RelationCacheInitializePhase2 | utils/cache/relcache.c | 4048 |
| RelationCacheInitializePhase3 | utils/cache/relcache.c | 4107 |
| RelationGetFKeyList | utils/cache/relcache.c | 4731 |
| RelationGetIndexList | utils/cache/relcache.c | 4836 |
| load_relcache_init_file | utils/cache/relcache.c | 6167 |
| write_relcache_init_file | utils/cache/relcache.c | 6585 |
| write_item | utils/cache/relcache.c | 6797 |
| RelationIdIsInInitFile | utils/cache/relcache.c | 6820 |
| RelationCacheInitFilePreInvalidate | utils/cache/relcache.c | 6860 |
| RelationCacheInitFilePostInvalidate | utils/cache/relcache.c | 6885 |
| RelationCacheInitFileRemove | utils/cache/relcache.c | 6900 |
| struct relidcacheent | utils/cache/relcache.c | 128 |

Facts about the current source at commit 273fe94. Open questions follow as the curator’s recorded gaps.

  • RelationIdCache is a backend-private HTAB * keyed on Oid with entrysize = sizeof(RelIdCacheEnt). Verified in RelationCacheInitialize (relcache.c): hash_create("Relcache by OID", INITRELCACHESIZE, &ctl, HASH_ELEM | HASH_BLOBS).

  • Four nailed local catalogs and seven nailed critical local indexes are counted by NUM_CRITICAL_LOCAL_RELS = 4 and NUM_CRITICAL_LOCAL_INDEXES = 7, hard-coded as compile-time constants. Verified in RelationCacheInitializePhase3. Changing either set requires updating the matching constant and the corresponding formrdesc or load_critical_index call list.

  • criticalRelcachesBuilt switches ScanPgRelation from heap scan to index scan. Verified: ScanPgRelation passes indexOK && criticalRelcachesBuilt as the indexOK argument to systable_beginscan. Until the flag is set during Phase 3, ScanPgRelation reads pg_class with a heap scan regardless of what the caller requested.

  • The in_progress_list restart (goto retry) fires when an invalidation message arrives for the OID being built. Verified in RelationBuildDesc: after the full build sequence, if (in_progress_list[offset].invalidated) { RelationDestroyRelation(...); goto retry; }. RelationCacheInvalidateEntry sets in_progress_list[i].invalidated = true for matching OIDs when the relation is not yet in the hash table.

  • The swap-in-place rebuild in RelationRebuildRelation preserves rd_att when equalTupleDescs returns true, to avoid invalidating catcache entry pointers. Verified: keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att) and the conditional SWAPFIELD(TupleDesc, rd_att) path.

  • write_relcache_init_file aborts early if any sinval has been received since startup, via relcacheInvalsReceived != 0. Verified at the top of write_relcache_init_file. This prevents writing a file that is already stale.

  • RELCACHE_INIT_FILEMAGIC = 0x573266 is a compile-time constant used as a version check on pg_internal.init. Verified: defined at line 93 of relcache.c; a magic mismatch causes load_relcache_init_file to jump to read_failed and return false.

  • The rd_createSubid family of fields is reset to InvalidSubTransactionId at end-of-transaction by AtEOXact_RelationCache. Verified in AtEOXact_cleanup: all four sub-ID fields are zeroed before the potential RelationClearRelation call.

  • RelationGetIndexAttrBitmap has an explicit restart-on-flush loop. Verified: after the main foreach loop, a second call to RelationGetIndexList is compared to the earlier snapshot; if they differ, the bitmaps are freed and the code jumps to the restart: label.
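The in_progress_list restart described in the facts above can be illustrated with a small standalone model. All names here are hypothetical (DemoInProgress, demo_invalidate_entry, demo_build_desc, demo_flaky_read); the catalog read is a callback that returns an integer "version", and the invalidation arriving mid-build is simulated by the callback itself.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_IN_PROGRESS 8

/* Hypothetical record of a build that is currently under way. */
typedef struct DemoInProgress
{
    unsigned oid;
    bool     invalidated;
} DemoInProgress;

static DemoInProgress demo_in_progress[DEMO_MAX_IN_PROGRESS];
static int            demo_in_progress_len = 0;

/* Invalidation handler: flag any in-flight build of this OID. */
static void
demo_invalidate_entry(unsigned oid)
{
    for (int i = 0; i < demo_in_progress_len; i++)
        if (demo_in_progress[i].oid == oid)
            demo_in_progress[i].invalidated = true;
}

typedef int (*DemoCatalogRead) (unsigned oid, int attempt);

/* Build a descriptor value, retrying while invalidations arrive mid-build. */
static int
demo_build_desc(unsigned oid, DemoCatalogRead read_catalog, int *attempts)
{
    int slot = demo_in_progress_len++;
    int value;

    assert(slot < DEMO_MAX_IN_PROGRESS);
    demo_in_progress[slot].oid = oid;
    *attempts = 0;

    do
    {
        demo_in_progress[slot].invalidated = false;
        (*attempts)++;
        value = read_catalog(oid, *attempts);   /* may observe stale data */
    } while (demo_in_progress[slot].invalidated);   /* stale: discard, retry */

    demo_in_progress_len--;
    return value;
}

/* Simulated read: an invalidation lands during the first attempt only. */
static int
demo_flaky_read(unsigned oid, int attempt)
{
    if (attempt == 1)
        demo_invalidate_entry(oid);
    return attempt;   /* the "version" of catalog data that was read */
}
```

With demo_flaky_read the first attempt is discarded and the second is returned. The sketch uses a do/while loop where the source describes a goto retry, and it skips the destroy step for the partially built entry because the model allocates nothing.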

  1. debug_discard_caches interaction with RelationBuildDesc. When debug_discard_caches > 0, RelationBuildDesc builds in a temporary memory context so that transient allocations can be recovered, and RelationCacheInvalidate is called to discard all entries at every opportunity, whether or not a real invalidation occurred. The exact sequence in which in_progress_list is managed under debug_discard_caches stress is not fully traced here. Investigation path: read the comment block in RelationBuildDesc around MAYBE_RECOVER_RELATION_BUILD_MEMORY and instrument with debug_discard_caches = 1.

  2. relmapper.c update atomicity. Mapped relations’ file numbers live in pg_filenode.map, managed by relmapper.c. When a mapped relation’s storage is rewritten (e.g., by VACUUM FULL on a catalog), the mapper file is updated with its own two-phase rename. How RelationInitPhysicalAddr in RelationRebuildRelation interacts with a concurrent mapper update is not traced here. Investigation path: read relmapper.c and the RelFileLocatorSkippingWAL path in RelationInitPhysicalAddr.

  3. Init-file correctness under CREATE DATABASE. The code comment in load_relcache_init_file notes that lock info and physical addresses must be recomputed “in case the pg_internal.init file was copied from some other database by CREATE DATABASE.” The exact fields that could diverge between a copied init file and the new database’s on-disk state are not exhaustively listed here. Investigation path: read the CREATE DATABASE tablespace-rewriting path in commands/dbcommands.c.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


Shallow seed list for follow-up documents; not analysis.

  • Catcache as peer layer. PostgreSQL’s catcache (utils/cache/catcache.c) caches individual tuples from system catalogs (one entry per catalog row), while the relcache caches aggregated descriptors. They share the sinval invalidation channel but serve different consumers. A side-by-side of their eviction policies (catcache has a clock-based LRU per syscache; relcache entries live until sinval) would clarify the architectural split. See postgres-catcache-syscache.md.

  • Sinval invalidation coupling. RelationCacheInvalidateEntry is one of several handlers registered with inval.c; the init-file invalidation via RelationCacheInitFilePreInvalidate is another. The full sinval pipeline (shared-memory ring buffer, catchup on reconnect, catcache reset) is the mechanism tying the relcache to DDL. See postgres-cache-invalidation.md.

  • CUBRID’s relcache analog. CUBRID maintains a per-thread schema_manager that caches OR_CLASSREP (the physical schema) and SM_CLASS (the logical object-relational schema) separately. The split mirrors PostgreSQL’s rd_att (physical) vs. rd_rel (catalog row) split, but CUBRID’s catalog is object-relational, making the mapping more complex. A comparison of bootstrap strategies (both systems have a set of “critical” tables that need hard-wired descriptors) would isolate what is universal from what is schema-model-specific.

  • Descriptor stability and pointer aliasing. PostgreSQL’s swap-in-place rebuild (SWAPFIELD) is an ad-hoc approach to pointer stability. Newer database research (e.g., Andy Pavlo’s group on live schema changes in OLAP systems) explores epoch-based approaches where old descriptors survive until all readers have left the epoch. Whether such an approach would simplify the keep_tupdesc / catcache pointer coupling is an open design question for a future postgres-relcache-evolution.md.

  • The formrdesc / init-file pattern as a general bootstrap technique. The problem of a component that needs itself to initialize is a special case of the circular-dependency problem in system initialization. Architecture of a DB System (Hellerstein et al., 2007 — dbms-papers/fntdb07-architecture.md) §“Catalog Manager” briefly notes that hard-wired stubs are the universal solution; a survey of how MySQL/InnoDB, Oracle, and SQL Server each break the same recursion would be a useful comparative note.

PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)

  • src/backend/utils/cache/relcache.c — all bootstrap, build, lookup, invalidation, and init-file logic.
  • src/include/utils/rel.h — RelationData struct definition and all access macros.
  • src/include/utils/relcache.h — public API declarations, RELCACHE_INIT_FILENAME.
  • src/include/utils/relmapper.h — mapped-relation file-number interface.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database System Concepts (Silberschatz), Ch. 11 §“System Catalog” — the catalog structure that RelationBuildDesc reads.
  • Database Internals (Petrov), Ch. 2 §“Memory-Mapped Files and Direct I/O” — caching and descriptor-lifetime context.

Papers (under knowledge/research/dbms-papers/)

  • Architecture of a DB System (Hellerstein et al., 2007) — fntdb07-architecture.md. §“Catalog Manager” frames the catalog-as-relation bootstrapping problem that formrdesc solves.
  • postgres-catcache-syscache.md — catcache (individual catalog-row cache) that feeds into relcache builds.
  • postgres-cache-invalidation.md — sinval pipeline that drives RelationCacheInvalidateEntry.
  • postgres-system-catalogs.md — the pg_class, pg_attribute etc. tables that RelationBuildDesc reads.
  • postgres-memory-contexts.md — CacheMemoryContext that hosts all relcache entries.
  • postgres-architecture-overview.md — Axis 6 (Catalog + cache layer) and the sinval loop.