PostgreSQL Index Creation — CREATE INDEX, the Build, and CONCURRENTLY
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”An index is a redundant, ordered access path layered over a base relation: it duplicates one or more attributes (plus a row locator) into a structure that answers point and range lookups in sub-linear time. Database System Concepts (Silberschatz, 7e, ch. 14 “Indexing”) frames the whole subject around two families — ordered indices (§14.2), whose realization is the B+-tree (§14.3–14.4), and hash indices — and notes that “an index structure is associated with a particular search key” so that “a file may have several indices, on different search keys.” The SQL surface for creating one is tiny; §4.6 gives the entire grammar:
“We create an index with the
create indexcommand, which takes the formcreate index <index-name> on <relation-name> (<attribute-list>);” (DSC §4.6)
Behind that one statement sits the question this document is about: how does the engine turn an empty index structure into one that already contains an entry for every existing row? There are two textbook answers, and they map directly onto PostgreSQL’s two code paths.
1. Tuple-at-a-time insertion. Treat the build as N ordinary index
inserts: scan the heap, and for each live row call the structure’s
insert(key, tid). Correct and simple, but pessimal — every insert walks
the tree from the root, splits pages as it goes, and dirties interior nodes
out of order. The B+-tree ends up sparsely packed (each leaf roughly
half-full after random splits) and the build does O(N log N) random page
touches.
2. Bulk loading. DSC §14.4.1 and the “bulk loader” discussion (§14.x) describe the production technique: sort all (key, tid) pairs once, then append them into leaf pages in sorted order, packing each leaf to a high fill factor and building the interior levels bottom-up. The textbook’s verdict on why this matters:
“Most relational database products have special ‘bulk loader’ utilities to insert a large number of records efficiently … sorting the entries and inserting them in sorted order … results in a much more efficient index construction.” (DSC, bulk-loading discussion)
Bulk loading turns O(N log N) random I/O into one sort plus a sequential
write, and it produces a dense tree — better cache behaviour and fewer
pages for the lifetime of the index. PostgreSQL’s B-tree ambuild
(btbuild → _bt_load) is exactly a bulk loader: scan, tuplesort, then
write packed leaves left-to-right.
The second theoretical axis is concurrency of the build itself. A
naïve build must freeze the table — no writes — for the entire scan, sort,
and write, which on a large table can mean minutes or hours of blocked DML.
The literature on online / non-blocking index construction (the same
lineage that produced concurrent B-tree algorithms — Lehman & Yao 1981,
“Efficient Locking for Concurrent Operations on B-Trees”; and online
schema-change work) asks: can we build an index while the table keeps
accepting inserts, updates, and deletes? The difficulty is a moving
target. While we scan, other transactions add rows the scan never saw and
delete rows we already indexed. An online build must (a) capture a
consistent baseline, (b) ensure concurrent writers start maintaining the
new index before we finish, and (c) reconcile the gap — the rows that
changed between baseline and the moment writers joined in. PostgreSQL’s
CREATE INDEX CONCURRENTLY is a concrete protocol for exactly this, and
the bulk of this document traces it.
A third, smaller idea: an index is derived state. Nothing in it is
authoritative — every entry can be recomputed from the heap. That is what
makes REINDEX safe (throw the structure away and rebuild) and what
makes a half-built index from a crashed CREATE INDEX CONCURRENTLY merely
invalid rather than corrupting: the planner ignores it, and REINDEX or
DROP cleans it up.
A fourth idea worth pinning down before the source: the difference between
an index being populated and being trusted. A classical single-pass
build conflates these — the moment the build finishes, the index both
contains every row and may be read. An online build must split them. There
is a window where the index physically exists and is even being maintained
by concurrent writers, yet the query planner must still refuse to use it
because it is not complete with respect to any consistent view of the
table. The textbook’s discussion of indices assumes the populated == trusted
identity; the online-DDL literature is precisely the study of how to break
that identity safely and then restore it. PostgreSQL encodes the broken
identity as two separate flags (indisready for “being maintained” and
indisvalid for “safe to read”) with a deliberate gap between them, and
the entire CONCURRENTLY protocol is the careful sequencing of that gap.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the model — ordered structures, bulk loading, online
construction. This section names the engineering conventions that
production engines converge on, the patterns the textbook leaves implicit.
PostgreSQL’s specific choices in ## PostgreSQL's Approach are one set of
dials within this shared space.
-
Separate the SQL command from the catalog mechanism. The parser produces a statement node describing what index the user asked for; one layer validates and resolves it (does this access method exist? is this operator class compatible with this column type? what is the index’s name?), and a lower layer materializes catalog rows and storage. PostgreSQL splits this as
DefineIndex(command) overindex_create(catalog) — the same split you see in most engines between a “DDL executor” and a “catalog manager.” -
The build is access-method-polymorphic. B-tree, hash, GiST, GIN, BRIN, SP-GiST all build differently, but the orchestration — pick a userid, set up progress reporting, call the AM’s builder, write statistics back — is identical. Engines factor this into a single build driver that dispatches to a per-AM callback. PostgreSQL’s
index_buildcallsindexRelation->rd_indam->ambuild(...). -
Three liveness states, not one boolean. A “done” boolean cannot express the intermediate states an online build passes through. The conventional decomposition is live (the catalog row exists and should be maintained on drop/cleanup decisions), ready (new DML must insert into it), and valid (queries may read it). PostgreSQL stores exactly these as
pg_index.indislive,indisready,indisvalid. -
Lock strength is the policy knob for blocking vs. online. A blocking build takes a lock that excludes writers but admits readers; an online build takes a weaker lock that admits writers and pays for it with a multi-pass reconciliation. PostgreSQL uses
ShareLock(blocks writers) for plainCREATE INDEXandShareUpdateExclusiveLock(admits writers) forCONCURRENTLY. -
Online builds reconcile against a snapshot. Because the table moves under the build, the engine fixes a reference snapshot, builds/checks against it, and then waits out every transaction that could see a state older than that snapshot before trusting the index. This “wait for older snapshots” step is the hidden cost of
CONCURRENTLYand the reason it can hang behind a long-running transaction. -
Rebuild is build with a twist. Reindexing is the same build machine pointed at an existing index, with constraint re-checking optionally skipped (you trust the old definition) and HOT-chain bookkeeping adjusted (any broken chains predate the index, so its usability horizon need not move). PostgreSQL passes an
isreindexflag into the sharedindex_build. -
The build runs as the table owner, in a sandbox. Index columns can be arbitrary expressions over user-defined functions, and a partial index’s predicate can call user code too. Running that code as the invoking superuser (think
pg_dumprestoring an index) would be a privilege-escalation hole. The convention is to switch to the table owner’s userid under a security-restricted operation and to lock downsearch_pathfor the duration of the build. PostgreSQL does this in bothindex_buildandvalidate_indexviaSetUserIdAndSecContext(..., SECURITY_RESTRICTED_OPERATION)plusRestrictSearchPath(), restoring the original context afterward. -
Progress is observable. A multi-minute build that gives no feedback is operationally hostile. Engines expose build progress — phase, blocks scanned, tuples done. PostgreSQL threads
pgstat_progress_update_*calls through every phase (PROGRESS_CREATEIDX_PHASE_BUILD,_VALIDATE_IDXSCAN,_WAIT_1/2/3), surfaced inpg_stat_progress_create_index.
flowchart TD
subgraph cmd["Command layer — indexcmds.c"]
DI["DefineIndex<br/>lock table, resolve AM + opclass,<br/>ComputeIndexAttrs, choose name"]
end
subgraph cat["Catalog layer — index.c"]
IC["index_create<br/>pg_class / pg_attribute / pg_index rows,<br/>dependencies, constraint"]
IB["index_build<br/>userid + progress, then dispatch"]
AM["rd_indam->ambuild<br/>(btbuild, hashbuild, ...)<br/>scan heap + bulk load"]
IUS["index_update_stats<br/>write reltuples / relpages"]
end
DI --> IC
IC -->|"unless SKIP_BUILD"| IB
IB --> AM
AM --> IUS
DI -.->|"CONCURRENTLY: multi-txn protocol"| CC["index_concurrently_build<br/>+ validate_index + waits"]
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL implements the two-layer split precisely. DefineIndex (in
commands/indexcmds.c) is the command driver; index_create (in
catalog/index.c) is the catalog mechanism; index_build is the
AM-polymorphic build driver. Everything else — validate_index, the
index_concurrently_* family, reindex_index — orbits these three.
The command layer: DefineIndex
Section titled “The command layer: DefineIndex”DefineIndex first decides the lock and the build mode. Temporary
relations are forced non-concurrent (no other backend can see them, so a
strong lock is harmless and a non-concurrent drop is cheaper). The lock
strength follows directly:
// DefineIndex — src/backend/commands/indexcmds.clockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;rel = table_open(tableId, lockmode);ShareLock conflicts with RowExclusiveLock (held by INSERT/UPDATE/DELETE)
so a plain build blocks writers but not readers. ShareUpdateExclusiveLock
conflicts only with itself and stronger locks, so CONCURRENTLY lets DML
proceed throughout. Next it resolves the access method and gets its
routine vtable — the same IndexAmRoutine that postgres-index-am.md
covers — then checks the AM actually supports what the statement asked for:
// DefineIndex — src/backend/commands/indexcmds.caccessMethodForm = (Form_pg_am) GETSTRUCT(tuple);accessMethodId = accessMethodForm->oid;amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);// ...if (stmt->unique && !stmt->iswithoutoverlaps && !amRoutine->amcanunique) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("access method \"%s\" does not support unique indexes", accessMethodName)));It then builds an empty IndexInfo, resolves operator classes and
collations into parallel arrays via ComputeIndexAttrs, and (for a PRIMARY
KEY) runs extra checks. The IndexInfo is the in-memory description of the
index the build will consume — its key columns, expressions, predicate,
uniqueness, and the ii_Concurrent / ii_ReadyForInserts flags that
distinguish a concurrent build. With all of that resolved, it assembles the
flags word and calls index_create:
// DefineIndex — src/backend/commands/indexcmds.cflags = constr_flags = 0;if (stmt->isconstraint) flags |= INDEX_CREATE_ADD_CONSTRAINT;if (skip_build || concurrent || partitioned) flags |= INDEX_CREATE_SKIP_BUILD; // build deferred or neverif (concurrent) flags |= INDEX_CREATE_CONCURRENT;// ...indexRelationId = index_create(rel, indexRelationName, indexRelationId, parentIndexId, parentConstraintId, stmt->oldNumber, indexInfo, indexColNames, accessMethodId, tablespaceId, collationIds, opclassIds, opclassOptions, coloptions, NULL, reloptions, flags, constr_flags, allowSystemTableMods, !check_rights, &createdConstraintId);Note the crucial flag logic: INDEX_CREATE_SKIP_BUILD is set for
skip_build || concurrent || partitioned. A partitioned index has no
storage of its own; a concurrent build does the build later under its own
transaction discipline; so only a plain CREATE INDEX lets index_create
build inline. (Partition recursion — attaching or building matching child
indexes — is handled in DefineIndex and deferred here to
postgres-ddl-execution.md.)
The catalog layer: index_create
Section titled “The catalog layer: index_create”index_create is bookkeeping: it opens pg_class, derives the index’s
namespace/persistence/sharedness from the parent heap, validates parameters
(at least one key column; no user indexes on system catalogs unless
allow_system_table_mods), constructs the index’s TupleDescriptor,
inserts the pg_class row, fills pg_attribute via
InitializeAttributeOids + AppendAttributeTuples, writes the central
pg_index row via UpdateIndexRelation, records dependencies, optionally
creates the backing constraint, and finally — unless SKIP_BUILD —
calls index_build:
// index_create — src/backend/catalog/index.c (condensed, build tail)if ((flags & INDEX_CREATE_SKIP_BUILD) != 0){ // caller (or concurrent path) will build later; just init metapage // ... index_register / ambuildempty handling ...}else{ index_build(heapRelation, indexRelation, indexInfo, false, true);}The two trailing bools are isreindex (false here — fresh index) and
parallel (true — allow parallel workers if the AM and planner agree).
The single most important catalog write inside index_create is the
pg_index row, produced by UpdateIndexRelation. This is the row that
defines the index to the rest of the system: which heap columns it
covers (indkey), under which operator classes (indclass) and
collations (indcollation), with which per-column options (indoption,
e.g. DESC / NULLS FIRST), plus serialized pg_node_tree blobs for any
expression columns and the partial-index predicate. It also stamps the
three liveness bits at birth:
// UpdateIndexRelation — src/backend/catalog/index.c (column assembly, condensed)indkey = buildint2vector(NULL, indexInfo->ii_NumIndexAttrs);for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++) indkey->values[i] = indexInfo->ii_IndexAttrNumbers[i];indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexKeyAttrs);indclass = buildoidvector(opclassOids, indexInfo->ii_NumIndexKeyAttrs);indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexKeyAttrs);// ... indexprs / indpred serialized via nodeToString(make_ands_explicit(...)) ...// values[Anum_pg_index_indisvalid] = BoolGetDatum(isvalid); // f for CONCURRENTLY// values[Anum_pg_index_indisready] = BoolGetDatum(isready); // f for CONCURRENTLY// values[Anum_pg_index_indislive] = BoolGetDatum(true);For a plain CREATE INDEX, index_create passes isvalid = isready = true; for a concurrent build it passes both false, which is exactly the
not-ready/not-valid starting state the protocol below relies on. The index
expressions and predicate are stored as serialized node trees — partial
indexes (WHERE) and expression indexes (ON (lower(name))) live entirely
in these two pg_index columns, which is why the planner can reconstruct a
partial index’s applicability without re-parsing SQL (see
postgres-node-trees.md).
The build driver: index_build and ambuild
Section titled “The build driver: index_build and ambuild”index_build is where the AM polymorphism lives. It sets up the progress
report, switches to the table owner’s userid under a
SECURITY_RESTRICTED_OPERATION (index expressions may call user functions),
and then makes the one call that does the real work:
// index_build — src/backend/catalog/index.cAssert(PointerIsValid(indexRelation->rd_indam->ambuild));// ...stats = indexRelation->rd_indam->ambuild(heapRelation, indexRelation, indexInfo);Assert(PointerIsValid(stats));Before that call, index_build enters the sandbox described in the
conventions above — switch to the table owner, lock down
security-restricted operations, and confine search_path so a malicious
index expression cannot resolve a function it shouldn’t:
// index_build — src/backend/catalog/index.cGetUserIdAndSecContext(&save_userid, &save_sec_context);SetUserIdAndSecContext(heapRelation->rd_rel->relowner, save_sec_context | SECURITY_RESTRICTED_OPERATION);save_nestlevel = NewGUCNestLevel();RestrictSearchPath();ambuild is the per-AM callback (btbuild, hashbuild, gistbuild, …).
For B-tree it scans the heap via table_index_build_scan, feeds every live
(key, tid) into a tuplesort, then bulk-writes packed leaves — the
textbook bulk loader. It returns an IndexBuildResult with the heap and
index tuple counts, which index_build writes back into both pg_class
rows:
// index_build — src/backend/catalog/index.cindex_update_stats(heapRelation, true, stats->heap_tuples);index_update_stats(indexRelation, false, stats->index_tuples);CommandCounterIncrement();if (indexInfo->ii_ExclusionOps != NULL) IndexCheckExclusion(heapRelation, indexRelation, indexInfo);index_build also handles the broken HOT chain subtlety: if the scan
saw a HOT chain whose members have different index keys, the index cannot
be trusted until every transaction that could see the older versions is
gone, so it sets pg_index.indcheckxmin. Critically, this is only done
for non-concurrent, non-reindex builds — the comment in the source spells
out why the concurrent path doesn’t need it (it won’t set indisvalid
until those transactions are gone anyway) and why reindex must not (the
chains predate the index, and touching the tuple while reindexing pg_index
itself would be fatal):
// index_build — src/backend/catalog/index.cif (indexInfo->ii_BrokenHotChain && !isreindex && !indexInfo->ii_Concurrent){ // mark indcheckxmin = true on the pg_index row}Source Walkthrough
Section titled “Source Walkthrough”This section follows the three liveness bits — indislive, indisready,
indisvalid — through the four operations: the plain build, the concurrent
build, validation, and reindex. The bits are flipped only by
index_set_state_flags, whose state machine is the spine of everything
that follows.
index_set_state_flags — the liveness state machine
Section titled “index_set_state_flags — the liveness state machine”Every transition between the build phases is a single-row update of
pg_index performed by index_set_state_flags. The actions and their
asserted preconditions encode the entire lifecycle:
// index_set_state_flags — src/backend/catalog/index.cswitch (action){ case INDEX_CREATE_SET_READY: /* CREATE INDEX CONCURRENTLY phase 2->3 */ Assert(indexForm->indislive); Assert(!indexForm->indisready); Assert(!indexForm->indisvalid); indexForm->indisready = true; break; case INDEX_CREATE_SET_VALID: /* CREATE INDEX CONCURRENTLY final */ Assert(indexForm->indislive); Assert(indexForm->indisready); Assert(!indexForm->indisvalid); indexForm->indisvalid = true; break; case INDEX_DROP_CLEAR_VALID: /* DROP INDEX CONCURRENTLY */ indexForm->indisvalid = false; indexForm->indisclustered = false; indexForm->indisreplident = false; break; case INDEX_DROP_SET_DEAD: /* DROP INDEX CONCURRENTLY, final */ Assert(!indexForm->indisvalid); indexForm->indisready = false; indexForm->indislive = false; break;}A fresh CREATE INDEX CONCURRENTLY walks (live=t, ready=f, valid=f) →
SET_READY → (live, ready, valid=f) → SET_VALID → (live, ready, valid).
A plain CREATE INDEX inserts the row already at
(live, ready, valid)=t and never calls this function. The asymmetry — set
indisready before indisvalid, with a wait between each — is the heart
of the online protocol.
flowchart LR A["row inserted<br/>not-ready, not-valid"] -->|"WaitForLockers + build"| B["index_concurrently_build<br/>SET_READY"] B -->|"commit; WaitForLockers"| C["ref snapshot +<br/>validate_index"] C -->|"drop snap; WaitForOlderSnapshots"| D["SET_VALID<br/>queries may use it"] A2["plain CREATE INDEX<br/>row inserted ready+valid"] --> Z["build inline<br/>(ShareLock held)"]
Plain CREATE INDEX — the blocking path
Section titled “Plain CREATE INDEX — the blocking path”The non-concurrent path is short. index_create builds inline, and back in
DefineIndex there is nothing left to do but close the relation and end the
progress report:
// DefineIndex — src/backend/commands/indexcmds.c (non-concurrent tail)if (!concurrent){ /* Close the heap and we're done, in the non-concurrent case */ table_close(rel, NoLock); if (!OidIsValid(parentIndexId)) pgstat_progress_end_command(); else pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1); return address;}The ShareLock taken at table_open is held until the surrounding
transaction commits, so writers are blocked for the entire scan + sort +
write. That is the price of simplicity, and it is what CONCURRENTLY
exists to avoid.
CREATE INDEX CONCURRENTLY — the multi-transaction protocol
Section titled “CREATE INDEX CONCURRENTLY — the multi-transaction protocol”Everything past the if (!concurrent) early-return is the concurrent
protocol. It spans multiple transactions within one command, which is
why DefineIndex keeps almost nothing in memory across the commits — just
the relation OIDs and the lock tag. The first move is to escalate to a
session-level lock (survives the per-phase commits) and then commit the
transaction that created the catalog row, making the not-ready/not-valid
index visible:
// DefineIndex — src/backend/commands/indexcmds.c (enter concurrent path)heaprelid = rel->rd_lockInfo.lockRelId;SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);table_close(rel, NoLock);
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
PopActiveSnapshot();CommitTransactionCommand();StartTransactionCommand();
/* Tell concurrent index builds to ignore us, if index qualifies */if (safe_index) set_indexsafe_procflags();The comment above this block states the invariant precisely: making the catalog entries visible before the build “will prevent them from making incompatible HOT updates. The new index will be marked not indisready and not indisvalid, so that no one else tries to either insert into it or use it for queries.” Now Phase 2: wait for every transaction that could have the table open without the new index in its relcache list, then build against a fresh snapshot:
// DefineIndex — src/backend/commands/indexcmds.c (phase 2)WaitForLockers(heaplocktag, ShareLock, true);
/* Set ActiveSnapshot since functions in the indexes may need it */PushActiveSnapshot(GetTransactionSnapshot());
/* Perform concurrent build of index */index_concurrently_build(tableId, indexRelationId);
PopActiveSnapshot();CommitTransactionCommand();StartTransactionCommand();WaitForLockers uses actual lock acquisition (request ShareLock on
the heap’s lock tag and block until granted) rather than polling the
ProcArray — the source explains this is deliberate: it lets the deadlock
detector fire if one of the transactions we wait on is itself blocked
trying to lock our table. index_concurrently_build rebuilds the
IndexInfo (it was lost in the commit), runs index_build, and crucially
ends by promoting the index to ready:
// index_concurrently_build — src/backend/catalog/index.cindexInfo = BuildIndexInfo(indexRelation);Assert(!indexInfo->ii_ReadyForInserts);indexInfo->ii_Concurrent = true;indexInfo->ii_BrokenHotChain = false;
index_build(heapRel, indexRelation, indexInfo, false, true);// ... close rels, keep locks ...index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);After the commit that follows, indisready is globally visible: new
transactions now insert into the index. But the index still does not
contain rows that were deleted just before, or inserted just after, the
build’s snapshot. Phase 3 closes that gap. Wait again, take the
reference snapshot, and validate:
// DefineIndex — src/backend/commands/indexcmds.c (phase 3)WaitForLockers(heaplocktag, ShareLock, true);
snapshot = RegisterSnapshot(GetTransactionSnapshot());PushActiveSnapshot(snapshot);
/* Scan the index and the heap, insert any missing index entries. */validate_index(tableId, indexRelationId, snapshot);
limitXmin = snapshot->xmin;PopActiveSnapshot();UnregisterSnapshot(snapshot);The source flags the central hazard here: “There might still be snapshots
in use that treat some transaction as in-progress that our reference
snapshot treats as committed.” If such a transaction deleted rows, the
build will not index them, yet older snapshots still expect them — so we
must drop our reference snapshot (to avoid deadlocking against other
concurrent builds that would wait on it), commit, start a final
transaction, and wait out every transaction with a snapshot older than
limitXmin:
// DefineIndex — src/backend/commands/indexcmds.c (final wait + set valid)CommitTransactionCommand();StartTransactionCommand();if (safe_index) set_indexsafe_procflags();
Assert(MyProc->xmin == InvalidTransactionId); /* advertising no xmin */
WaitForOlderSnapshots(limitXmin, true);
PushActiveSnapshot(GetTransactionSnapshot());index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);PopActiveSnapshot();
CacheInvalidateRelcacheByRelid(heaprelid.relId);UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);WaitForOlderSnapshots is the step that can hang CONCURRENTLY for hours:
it calls GetCurrentVirtualXIDs(limitXmin, ...) to enumerate every backend
whose xmin precedes the reference snapshot and then VirtualXactLocks each
one to wait it out. Note the filter — it excludes autovacuum,
already-vacuuming, and other safe concurrent builds (the
PROC_IN_SAFE_IC flag set by set_indexsafe_procflags), so two
CONCURRENTLY commands do not deadlock on each other:
// WaitForOlderSnapshots — src/backend/commands/indexcmds.cold_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false, PROC_IS_AUTOVACUUM | PROC_IN_VACUUM | PROC_IN_SAFE_IC, &n_old_snapshots);// ... for each still-present old snapshot: VirtualXactLock(...) ...The final CacheInvalidateRelcacheByRelid on the parent table forces
replanning of cached plans so existing sessions notice the new index.
validate_index — the TID merge-join
Section titled “validate_index — the TID merge-join”validate_index is the reconciliation engine. It must find every heap
tuple visible in the reference snapshot that is not yet in the index, and
insert it — without re-inserting the (vast) majority that the build already
placed. The clever part, explained in the long comment above the function,
is that it does not index-scan: it uses an ambulkdelete callback that
collects TIDs (and never deletes), sorts them, then merge-joins the sorted
TID list against a heap scan:
// validate_index — src/backend/catalog/index.cstate.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator, InvalidOid, false, maintenance_work_mem, NULL, TUPLESORT_NONE);state.htups = state.itups = state.tups_inserted = 0;
/* gather every TID currently in the index via a no-op bulkdelete */(void) index_bulk_delete(&ivinfo, NULL, validate_index_callback, &state);
tuplesort_performsort(state.tuplesort);
/* "merge join" the sorted index TIDs against a heap scan */table_index_validate_scan(heapRelation, indexRelation, indexInfo, snapshot, &state);The callback simply encodes each item pointer as an int8 (a pass-by-value
sort is far faster than sorting raw ItemPointers) and counts it:
// validate_index_callback — src/backend/catalog/index.cstatic boolvalidate_index_callback(ItemPointer itemptr, void *opaque){ ValidateIndexState *state = (ValidateIndexState *) opaque; int64 encoded = itemptr_encode(itemptr);
tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false); state->itups += 1; return false; /* never delete */}table_index_validate_scan (the table-AM hook) walks the heap in physical
order and, for each visible tuple whose TID is not found in the sorted
index TID list, calls index_insert. Building a unique index this way is
subtle — the comment notes the AM must “recheck liveness of the
to-be-inserted tuple before it declares a uniqueness error,” because we may
race an in-progress update of the same logical row.
REINDEX — rebuild via the same machine
Section titled “REINDEX — rebuild via the same machine”reindex_index rebuilds one index in place. It takes AccessExclusiveLock
on the index (and ShareLock on the heap), transfers any predicate locks
up to the heap, builds a fresh IndexInfo, assigns a new
relfilenumber (so the old physical files are discarded on commit), and
calls the shared index_build with isreindex = true:
// reindex_index — src/backend/catalog/index.cTransferPredicateLocksToHeapRelation(iRel);indexInfo = BuildIndexInfo(iRel);// ... optionally clear ii_Unique / ii_ExclusionOps if skip_constraint_checks ...
SetReindexProcessing(heapId, indexId); /* hide target from planner */RelationSetNewRelfilenumber(iRel, persistence); /* new storage */index_build(heapRelation, iRel, indexInfo, true, true);ResetReindexProcessing();SetReindexProcessing registers the index OID in backend-local state so
that the build itself does not try to use the very index it is rebuilding
(important when reindexing a system catalog’s own index). After the build,
reindex_index opportunistically repairs a previously
invalid/not-ready/dead index — a leftover from a failed CREATE INDEX CONCURRENTLY or DROP INDEX CONCURRENTLY — by clearing those bits, which
is why “REINDEX” is the documented recovery for a stuck CONCURRENTLY.
REINDEX CONCURRENTLY (the ReindexRelationConcurrently path) is the
online analogue: rather than rebuild in place, it index_concurrently_create_copy
makes a shadow index, builds and validates it exactly like CREATE INDEX CONCURRENTLY, then index_concurrently_swap atomically swaps names,
dependencies, and constraints over to the new index while marking the old
one invalid — all under ShareUpdateExclusiveLock so DML never blocks. The
deep mechanics of the swap are a postgres-ddl-execution.md concern; the
point here is that it reuses the identical build/validate/wait primitives.
Worked trace — three commands, one machine
Section titled “Worked trace — three commands, one machine”It is worth stepping through the three commands end-to-end to see how they reuse the same primitives.
CREATE INDEX idx ON t (a); — one transaction. DefineIndex takes
ShareLock on t; resolves btree and its int4_ops opclass; computes
IndexInfo; calls index_create with flags = 0 (no SKIP_BUILD).
index_create writes a pg_index row already valid+ready, then calls
index_build(t, idx, info, isreindex=false, parallel=true), which dispatches
to btbuild — heap scan, tuplesort, packed-leaf write — and
index_update_stats records the tuple counts. The transaction commits;
the index is usable. Writers were blocked the whole time.
CREATE INDEX CONCURRENTLY idx ON t (a); — four transactions. T1: take
ShareUpdateExclusiveLock, index_create with SKIP_BUILD | CONCURRENT
writes a not-ready/not-valid row, escalate to a session lock, commit. T2:
WaitForLockers, index_concurrently_build (heap scan + btbuild against
a snapshot, then SET_READY), commit. T3: WaitForLockers, register the
reference snapshot, validate_index (collect index TIDs, sort, merge-join
the heap, insert the gap), commit. T4: WaitForOlderSnapshots(limitXmin),
SET_VALID, CacheInvalidateRelcacheByRelid, release session lock. (All
three intra-command boundaries are the explicit CommitTransactionCommand
/ StartTransactionCommand pairs inside DefineIndex; T4 itself is
committed by the surrounding utility framework.) DML proceeded throughout;
the cost was two heap scans and up to three waits.
REINDEX INDEX idx; — reindex_index takes AccessExclusiveLock on
idx, BuildIndexInfo from the catalog, RelationSetNewRelfilenumber
(new storage, old discarded on commit), then the same
index_build(heap, idx, info, isreindex=true, parallel=true) →
btbuild. Constraint re-checks are skipped if the caller trusts the old
definition; the broken-HOT-chain guard is bypassed because isreindex is
true. One transaction, index locked exclusively, but the rebuild is in
place rather than a shadow swap.
The throughline: btbuild (the ambuild callback) runs in all three; only
the locking and the surrounding transaction structure differ. That is the
payoff of the build being AM-polymorphic and the protocol living entirely
above it.
A useful way to read the three flows side by side is by what holds across
a commit boundary. The plain build never commits mid-command, so its
entire state — IndexInfo, open relations, the tuplesort — lives in one
memory context and dies at end-of-transaction. The concurrent build commits
three times inside DefineIndex (a fourth commit ends T4 in the utility
framework), so almost nothing survives across a boundary — it may keep only
OIDs and a session lock alive — which is why index_concurrently_build and
validate_index both re-BuildIndexInfo from the catalog rather than
receive it as an argument: the comment in validate_index notes its
caller’s copy “is long gone due to having been built in a previous
transaction.” REINDEX is single-transaction like the
plain build but, by swapping in a new relfilenumber first, gets crash
safety for free: if it aborts, the old storage is still referenced and the
new files are orphaned for cleanup. Three locking regimes, three
transaction shapes, one builder.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
DefineIndex | src/backend/commands/indexcmds.c | 542 |
lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock | src/backend/commands/indexcmds.c | 681 |
GetIndexAmRoutine (AM vtable lookup) | src/backend/commands/indexcmds.c | 864 |
ComputeIndexAttrs (call site) | src/backend/commands/indexcmds.c | 938 |
index_create (call site in DefineIndex) | src/backend/commands/indexcmds.c | 1247 |
LockRelationIdForSession (enter concurrent) | src/backend/commands/indexcmds.c | 1644 |
WaitForLockers (phase 2) | src/backend/commands/indexcmds.c | 1687 |
index_concurrently_build (call site) | src/backend/commands/indexcmds.c | 1711 |
WaitForLockers (phase 3) | src/backend/commands/indexcmds.c | 1734 |
validate_index (call site) | src/backend/commands/indexcmds.c | 1757 |
WaitForOlderSnapshots (call site) | src/backend/commands/indexcmds.c | 1797 |
index_set_state_flags(SET_VALID) (call site) | src/backend/commands/indexcmds.c | 1808 |
WaitForOlderSnapshots (definition) | src/backend/commands/indexcmds.c | 435 |
set_indexsafe_procflags | src/backend/commands/indexcmds.c | 4615 |
index_create | src/backend/catalog/index.c | 726 |
index_concurrently_build | src/backend/catalog/index.c | 1485 |
index_concurrently_swap | src/backend/catalog/index.c | 1552 |
index_build | src/backend/catalog/index.c | 3002 |
ambuild dispatch (rd_indam->ambuild) | src/backend/catalog/index.c | 3078 |
index_update_stats (call sites) | src/backend/catalog/index.c | 3155 |
validate_index | src/backend/catalog/index.c | 3350 |
validate_index_callback | src/backend/catalog/index.c | 3483 |
index_set_state_flags | src/backend/catalog/index.c | 3503 |
reindex_index | src/backend/catalog/index.c | 3608 |
ReindexRelationConcurrently | src/backend/commands/indexcmds.c | 3570 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against the source-of-truth tree at
/data/hgryoo/references/postgres, REL_18_STABLE, commit
273fe94852b3a7e34fd171e8abdf1481beb302fa.
- Symbols exist on REL_18.
DefineIndex,WaitForOlderSnapshots,set_indexsafe_procflagsconfirmed incommands/indexcmds.c;index_create,index_build,index_concurrently_build,index_concurrently_swap,validate_index,validate_index_callback,index_set_state_flags,reindex_indexconfirmed incatalog/index.c(function-definition grep, line numbers tabled above). - Lock strengths verified literally.
lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLockis the exact text at the cited line;reindex_indexopens the index withAccessExclusiveLockand the heap withShareLock. - The three liveness bits
indislive/indisready/indisvalidand the fourIndexStateFlagsActioncases (INDEX_CREATE_SET_READY,INDEX_CREATE_SET_VALID,INDEX_DROP_CLEAR_VALID,INDEX_DROP_SET_DEAD) are present verbatim inindex_set_state_flags. - Phase ordering verified by reading
DefineIndexstraight through: catalog commit →WaitForLockers→index_concurrently_build(SET_READY) → commit →WaitForLockers→ reference snapshot →validate_index→ drop snapshot → commit →WaitForOlderSnapshots→index_set_state_flags(SET_VALID). ThePROGRESS_CREATEIDX_PHASE_WAIT_1 / _2 / _3progress markers bracket exactly these waits. PROC_IN_SAFE_ICfilter inGetCurrentVirtualXIDs(insideWaitForOlderSnapshots) confirmed; it is set byset_indexsafe_procflags, so two safe concurrent builds ignore each other. This is a REL_18-valid mechanism (no relation to the removedB_DATACHECKSUMSWORKER_*worker states).index_buildsignatureindex_build(Relation, Relation, IndexInfo *, bool isreindex, bool parallel)confirmed; the broken-HOT-chain guardii_BrokenHotChain && !isreindex && !ii_Concurrentis present as quoted.- Caveat — partition recursion and constraint creation inside
index_create/DefineIndexare summarized, not exhaustively quoted; seepostgres-ddl-execution.md. The B-tree-specific bulk loader (btbuild→_bt_load) and the table-AM scan (table_index_build_scan) are named but their internals belong topostgres-nbtree.mdandpostgres-table-am.mdrespectively.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s index creation sits at one corner of a large design space. Naming the alternatives sharpens what its choices cost and buy.
The blocking/online trade in other engines. Most production engines
offer an online build, but with different lock and reconciliation
strategies. MySQL/InnoDB’s online DDL (since 5.6) builds the secondary
index while logging concurrent DML into an in-memory row log and replays
that log at the end under a brief exclusive lock — a log-and-replay
design, in contrast to PostgreSQL’s snapshot-and-revalidate. Oracle’s
CREATE INDEX ... ONLINE uses a journal table populated by triggers,
similarly replayed at the end. SQL Server’s WITH (ONLINE = ON) maintains
the new index from a versioned row store during the build. The common shape
is: capture concurrent changes, build the bulk, then reconcile.
PostgreSQL is unusual in reconciling not via a change log but via a second
full pass (validate_index) plus the snapshot-wait. The source itself
calls this “a brute-force strategy” and muses about storing new tuples in a
special area instead — but rejects it as “yet more locking issues.” The
practical consequence: PostgreSQL’s CONCURRENTLY does two table scans
and can be stalled indefinitely by a single long-running transaction
(because WaitForOlderSnapshots cannot complete), whereas a log-replay
design’s tail latency is bounded by the backlog, not by the oldest
snapshot.
Why two passes instead of a change log? The answer is MVCC. Because
PostgreSQL keeps old row versions in the heap (no separate undo/rollback
segment), a snapshot is a consistent, cheap-to-take description of “which
rows existed then.” The build indexes everything visible to snapshot S1;
new writers (now seeing indisready) maintain the index themselves;
validate_index then fills the gap for rows that were visible to a later
reference snapshot S2 but not yet present. The engine never has to
serialize a change log because the heap’s version chains already are the
log. This is the same MVCC property exploited by postgres-vacuum.md and
postgres-mvcc-snapshots.md.
Bulk loading and write-optimized structures. PostgreSQL’s ambuild
bulk loaders produce a dense, read-optimized B-tree. The research frontier
for write-heavy workloads is the LSM-tree (O’Neil et al. 1996;
Database System Concepts §14.8 covers LSM trees and buffer trees as
write-optimized indices). An LSM engine never does a one-shot CREATE INDEX
bulk load in the same sense — it accretes sorted runs and compacts them,
so “building an index” amortizes into the ongoing compaction machinery.
PostgreSQL has no LSM core; its closest write-optimization is BRIN (a
tiny, lossy summary index that builds almost for free) and GIN’s pending
list. The comparison matters because it explains why CREATE INDEX is a
discrete, expensive event in PostgreSQL but a non-event in an LSM store.
Concurrent B-tree foundations. That the table can keep mutating during
a CONCURRENTLY build at all depends on the underlying structure
supporting concurrent inserts safely — the Lehman & Yao (1981) right-link
B-link-tree algorithm that PostgreSQL’s nbtree implements (see
postgres-nbtree.md). Without lock-coupling-free concurrent descent, the
“writers keep inserting while we build/validate” premise would collapse.
The index-creation protocol is thus a client of the concurrency
guarantees the AM provides, not a provider of them.
Failure semantics as a feature. A crashed CREATE INDEX CONCURRENTLY
leaves an indisvalid = false index — ignored by the planner, still
present in the catalog. This is a deliberate consequence of the
“index is derived state” principle: the half-built artifact is harmless
because nothing trusts it until the final flag flip, and REINDEX or
DROP INDEX cleans it up. Engines that build under a single transaction
(plain CREATE INDEX) get all-or-nothing rollback for free; the online
protocol trades that atomicity for non-blocking behaviour and recovers the
safety with the validity flag instead.
Open research directions. Recent work explores learned indexes
(Kraska et al. 2018) that replace B-tree nodes with learned models, and
incremental / adaptive indexing (database cracking — Idreos et al. 2007)
that builds the index lazily as a side effect of queries rather than as an
explicit DDL event. Both invert PostgreSQL’s model: instead of a discrete,
snapshot-bounded build command, the index materializes gradually from the
query stream. Neither is in core PostgreSQL, but both illuminate the
assumption baked into DefineIndex — that index construction is a
one-time, operator-issued, point-in-time event.
Sources
Section titled “Sources”- Code (source of truth):
/data/hgryoo/references/postgres, REL_18_STABLE @273fe94852b3a7e34fd171e8abdf1481beb302fa.src/backend/commands/indexcmds.c—DefineIndex,WaitForOlderSnapshots,ReindexRelationConcurrently,set_indexsafe_procflags, AM/opclass resolution.src/backend/catalog/index.c—index_create,index_build,index_concurrently_build,index_concurrently_swap,validate_index,validate_index_callback,index_set_state_flags,reindex_index.src/include/catalog/index.h—IndexStateFlagsAction,INDEX_CREATE_*flags.src/backend/access/heap/README.HOT— broken-HOT-chain /indcheckxminrationale.
- Theory anchors: Silberschatz, Korth & Sudarshan, Database System
Concepts, 7e — ch. 14 “Indexing” (§14.2 Ordered Indices, §14.3–14.4
B+-Tree, bulk-loading discussion, §14.8 write-optimized / LSM indices),
§4.6 “Index Definition in SQL”. Captured at
knowledge/research/dbms-general/database-system-concepts.md. - Concurrency lineage: Lehman & Yao 1981, “Efficient Locking for
Concurrent Operations on B-Trees” (the nbtree algorithm enabling
concurrent writes during
CONCURRENTLY); bibliography at.omc/plans/postgres-paper-bibliography.md. - Cross-references (sibling docs, do not duplicate):
postgres-ddl-execution.md(utility dispatch, partition recursion, REINDEX CONCURRENTLY swap mechanics),postgres-nbtree.md(btbuild,_bt_loadbulk loader, B-link-tree concurrency),postgres-index-am.md(IndexAmRoutinevtable,ambuild/aminsertcallbacks),postgres-table-am.md(table_index_build_scan,table_index_validate_scan),postgres-mvcc-snapshots.mdandpostgres-procarray.md(snapshots,GetCurrentVirtualXIDs,VirtualXactLock),postgres-vacuum.md(index_bulk_delete),postgres-lock-manager.md(WaitForLockers, session locks),postgres-buffer-manager.mdandpostgres-smgr-md.md(the physical pages the bulk loader writes),postgres-system-catalogs.md(pg_index/pg_class/pg_attributerow shapes).