Skip to content

PostgreSQL Index Creation — CREATE INDEX, the Build, and CONCURRENTLY

Contents:

An index is a redundant, ordered access path layered over a base relation: it duplicates one or more attributes (plus a row locator) into a structure that answers point and range lookups in sub-linear time. Database System Concepts (Silberschatz, 7e, ch. 14 “Indexing”) frames the whole subject around two families — ordered indices (§14.2), whose realization is the B+-tree (§14.3–14.4), and hash indices — and notes that “an index structure is associated with a particular search key” so that “a file may have several indices, on different search keys.” The SQL surface for creating one is tiny; §4.6 gives the entire grammar:

“We create an index with the create index command, which takes the form create index <index-name> on <relation-name> (<attribute-list>);” (DSC §4.6)

Behind that one statement sits the question this document is about: how does the engine turn an empty index structure into one that already contains an entry for every existing row? There are two textbook answers, and they map directly onto PostgreSQL’s two code paths.

1. Tuple-at-a-time insertion. Treat the build as N ordinary index inserts: scan the heap, and for each live row call the structure’s insert(key, tid). Correct and simple, but pessimal — every insert walks the tree from the root, splits pages as it goes, and dirties interior nodes out of order. The B+-tree ends up sparsely packed (each leaf roughly half-full after random splits) and the build does O(N log N) random page touches.

2. Bulk loading. DSC §14.4.1 and the “bulk loader” discussion (§14.x) describe the production technique: sort all (key, tid) pairs once, then append them into leaf pages in sorted order, packing each leaf to a high fill factor and building the interior levels bottom-up. The textbook’s verdict on why this matters:

“Most relational database products have special ‘bulk loader’ utilities to insert a large number of records efficiently … sorting the entries and inserting them in sorted order … results in a much more efficient index construction.” (DSC, bulk-loading discussion)

Bulk loading turns O(N log N) random I/O into one sort plus a sequential write, and it produces a dense tree — better cache behaviour and fewer pages for the lifetime of the index. PostgreSQL’s B-tree ambuild (btbuild_bt_load) is exactly a bulk loader: scan, tuplesort, then write packed leaves left-to-right.

The second theoretical axis is concurrency of the build itself. A naïve build must freeze the table — no writes — for the entire scan, sort, and write, which on a large table can mean minutes or hours of blocked DML. The literature on online / non-blocking index construction (the same lineage that produced concurrent B-tree algorithms — Lehman & Yao 1981, “Efficient Locking for Concurrent Operations on B-Trees”; and online schema-change work) asks: can we build an index while the table keeps accepting inserts, updates, and deletes? The difficulty is a moving target. While we scan, other transactions add rows the scan never saw and delete rows we already indexed. An online build must (a) capture a consistent baseline, (b) ensure concurrent writers start maintaining the new index before we finish, and (c) reconcile the gap — the rows that changed between baseline and the moment writers joined in. PostgreSQL’s CREATE INDEX CONCURRENTLY is a concrete protocol for exactly this, and the bulk of this document traces it.

A third, smaller idea: an index is derived state. Nothing in it is authoritative — every entry can be recomputed from the heap. That is what makes REINDEX safe (throw the structure away and rebuild) and what makes a half-built index from a crashed CREATE INDEX CONCURRENTLY merely invalid rather than corrupting: the planner ignores it, and REINDEX or DROP cleans it up.

A fourth idea worth pinning down before the source: the difference between an index being populated and being trusted. A classical single-pass build conflates these — the moment the build finishes, the index both contains every row and may be read. An online build must split them. There is a window where the index physically exists and is even being maintained by concurrent writers, yet the query planner must still refuse to use it because it is not complete with respect to any consistent view of the table. The textbook’s discussion of indices assumes the populated == trusted identity; the online-DDL literature is precisely the study of how to break that identity safely and then restore it. PostgreSQL encodes the broken identity as two separate flags (indisready for “being maintained” and indisvalid for “safe to read”) with a deliberate gap between them, and the entire CONCURRENTLY protocol is the careful sequencing of that gap.

The textbook gives the model — ordered structures, bulk loading, online construction. This section names the engineering conventions that production engines converge on, the patterns the textbook leaves implicit. PostgreSQL’s specific choices in ## PostgreSQL's Approach are one set of dials within this shared space.

  1. Separate the SQL command from the catalog mechanism. The parser produces a statement node describing what index the user asked for; one layer validates and resolves it (does this access method exist? is this operator class compatible with this column type? what is the index’s name?), and a lower layer materializes catalog rows and storage. PostgreSQL splits this as DefineIndex (command) over index_create (catalog) — the same split you see in most engines between a “DDL executor” and a “catalog manager.”

  2. The build is access-method-polymorphic. B-tree, hash, GiST, GIN, BRIN, SP-GiST all build differently, but the orchestration — pick a userid, set up progress reporting, call the AM’s builder, write statistics back — is identical. Engines factor this into a single build driver that dispatches to a per-AM callback. PostgreSQL’s index_build calls indexRelation->rd_indam->ambuild(...).

  3. Three liveness states, not one boolean. A “done” boolean cannot express the intermediate states an online build passes through. The conventional decomposition is live (the catalog row exists and should be maintained on drop/cleanup decisions), ready (new DML must insert into it), and valid (queries may read it). PostgreSQL stores exactly these as pg_index.indislive, indisready, indisvalid.

  4. Lock strength is the policy knob for blocking vs. online. A blocking build takes a lock that excludes writers but admits readers; an online build takes a weaker lock that admits writers and pays for it with a multi-pass reconciliation. PostgreSQL uses ShareLock (blocks writers) for plain CREATE INDEX and ShareUpdateExclusiveLock (admits writers) for CONCURRENTLY.

  5. Online builds reconcile against a snapshot. Because the table moves under the build, the engine fixes a reference snapshot, builds/checks against it, and then waits out every transaction that could see a state older than that snapshot before trusting the index. This “wait for older snapshots” step is the hidden cost of CONCURRENTLY and the reason it can hang behind a long-running transaction.

  6. Rebuild is build with a twist. Reindexing is the same build machine pointed at an existing index, with constraint re-checking optionally skipped (you trust the old definition) and HOT-chain bookkeeping adjusted (any broken chains predate the index, so its usability horizon need not move). PostgreSQL passes an isreindex flag into the shared index_build.

  7. The build runs as the table owner, in a sandbox. Index columns can be arbitrary expressions over user-defined functions, and a partial index’s predicate can call user code too. Running that code as the invoking superuser (think pg_dump restoring an index) would be a privilege-escalation hole. The convention is to switch to the table owner’s userid under a security-restricted operation and to lock down search_path for the duration of the build. PostgreSQL does this in both index_build and validate_index via SetUserIdAndSecContext(..., SECURITY_RESTRICTED_OPERATION) plus RestrictSearchPath(), restoring the original context afterward.

  8. Progress is observable. A multi-minute build that gives no feedback is operationally hostile. Engines expose build progress — phase, blocks scanned, tuples done. PostgreSQL threads pgstat_progress_update_* calls through every phase (PROGRESS_CREATEIDX_PHASE_BUILD, _VALIDATE_IDXSCAN, _WAIT_1/2/3), surfaced in pg_stat_progress_create_index.

flowchart TD
  subgraph cmd["Command layer — indexcmds.c"]
    DI["DefineIndex<br/>lock table, resolve AM + opclass,<br/>ComputeIndexAttrs, choose name"]
  end
  subgraph cat["Catalog layer — index.c"]
    IC["index_create<br/>pg_class / pg_attribute / pg_index rows,<br/>dependencies, constraint"]
    IB["index_build<br/>userid + progress, then dispatch"]
    AM["rd_indam->ambuild<br/>(btbuild, hashbuild, ...)<br/>scan heap + bulk load"]
    IUS["index_update_stats<br/>write reltuples / relpages"]
  end
  DI --> IC
  IC -->|"unless SKIP_BUILD"| IB
  IB --> AM
  AM --> IUS
  DI -.->|"CONCURRENTLY: multi-txn protocol"| CC["index_concurrently_build<br/>+ validate_index + waits"]

PostgreSQL implements the two-layer split precisely. DefineIndex (in commands/indexcmds.c) is the command driver; index_create (in catalog/index.c) is the catalog mechanism; index_build is the AM-polymorphic build driver. Everything else — validate_index, the index_concurrently_* family, reindex_index — orbits these three.

DefineIndex first decides the lock and the build mode. Temporary relations are forced non-concurrent (no other backend can see them, so a strong lock is harmless and a non-concurrent drop is cheaper). The lock strength follows directly:

// DefineIndex — src/backend/commands/indexcmds.c
lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
rel = table_open(tableId, lockmode);

ShareLock conflicts with RowExclusiveLock (held by INSERT/UPDATE/DELETE) so a plain build blocks writers but not readers. ShareUpdateExclusiveLock conflicts only with itself and stronger locks, so CONCURRENTLY lets DML proceed throughout. Next it resolves the access method and gets its routine vtable — the same IndexAmRoutine that postgres-index-am.md covers — then checks the AM actually supports what the statement asked for:

// DefineIndex — src/backend/commands/indexcmds.c
accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
accessMethodId = accessMethodForm->oid;
amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
// ...
if (stmt->unique && !stmt->iswithoutoverlaps && !amRoutine->amcanunique)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("access method \"%s\" does not support unique indexes",
accessMethodName)));

It then builds an empty IndexInfo, resolves operator classes and collations into parallel arrays via ComputeIndexAttrs, and (for a PRIMARY KEY) runs extra checks. The IndexInfo is the in-memory description of the index the build will consume — its key columns, expressions, predicate, uniqueness, and the ii_Concurrent / ii_ReadyForInserts flags that distinguish a concurrent build. With all of that resolved, it assembles the flags word and calls index_create:

// DefineIndex — src/backend/commands/indexcmds.c
flags = constr_flags = 0;
if (stmt->isconstraint)
flags |= INDEX_CREATE_ADD_CONSTRAINT;
if (skip_build || concurrent || partitioned)
flags |= INDEX_CREATE_SKIP_BUILD; // build deferred or never
if (concurrent)
flags |= INDEX_CREATE_CONCURRENT;
// ...
indexRelationId =
index_create(rel, indexRelationName, indexRelationId, parentIndexId,
parentConstraintId, stmt->oldNumber, indexInfo,
indexColNames, accessMethodId, tablespaceId,
collationIds, opclassIds, opclassOptions, coloptions,
NULL, reloptions, flags, constr_flags,
allowSystemTableMods, !check_rights, &createdConstraintId);

Note the crucial flag logic: INDEX_CREATE_SKIP_BUILD is set for skip_build || concurrent || partitioned. A partitioned index has no storage of its own; a concurrent build does the build later under its own transaction discipline; so only a plain CREATE INDEX lets index_create build inline. (Partition recursion — attaching or building matching child indexes — is handled in DefineIndex and deferred here to postgres-ddl-execution.md.)

index_create is bookkeeping: it opens pg_class, derives the index’s namespace/persistence/sharedness from the parent heap, validates parameters (at least one key column; no user indexes on system catalogs unless allow_system_table_mods), constructs the index’s TupleDescriptor, inserts the pg_class row, fills pg_attribute via InitializeAttributeOids + AppendAttributeTuples, writes the central pg_index row via UpdateIndexRelation, records dependencies, optionally creates the backing constraint, and finally — unless SKIP_BUILD — calls index_build:

// index_create — src/backend/catalog/index.c (condensed, build tail)
if ((flags & INDEX_CREATE_SKIP_BUILD) != 0)
{
// caller (or concurrent path) will build later; just init metapage
// ... index_register / ambuildempty handling ...
}
else
{
index_build(heapRelation, indexRelation, indexInfo, false, true);
}

The two trailing bools are isreindex (false here — fresh index) and parallel (true — allow parallel workers if the AM and planner agree).

The single most important catalog write inside index_create is the pg_index row, produced by UpdateIndexRelation. This is the row that defines the index to the rest of the system: which heap columns it covers (indkey), under which operator classes (indclass) and collations (indcollation), with which per-column options (indoption, e.g. DESC / NULLS FIRST), plus serialized pg_node_tree blobs for any expression columns and the partial-index predicate. It also stamps the three liveness bits at birth:

// UpdateIndexRelation — src/backend/catalog/index.c (column assembly, condensed)
indkey = buildint2vector(NULL, indexInfo->ii_NumIndexAttrs);
for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
indkey->values[i] = indexInfo->ii_IndexAttrNumbers[i];
indcollation = buildoidvector(collationOids, indexInfo->ii_NumIndexKeyAttrs);
indclass = buildoidvector(opclassOids, indexInfo->ii_NumIndexKeyAttrs);
indoption = buildint2vector(coloptions, indexInfo->ii_NumIndexKeyAttrs);
// ... indexprs / indpred serialized via nodeToString(make_ands_explicit(...)) ...
// values[Anum_pg_index_indisvalid] = BoolGetDatum(isvalid); // f for CONCURRENTLY
// values[Anum_pg_index_indisready] = BoolGetDatum(isready); // f for CONCURRENTLY
// values[Anum_pg_index_indislive] = BoolGetDatum(true);

For a plain CREATE INDEX, index_create passes isvalid = isready = true; for a concurrent build it passes both false, which is exactly the not-ready/not-valid starting state the protocol below relies on. The index expressions and predicate are stored as serialized node trees — partial indexes (WHERE) and expression indexes (ON (lower(name))) live entirely in these two pg_index columns, which is why the planner can reconstruct a partial index’s applicability without re-parsing SQL (see postgres-node-trees.md).

index_build is where the AM polymorphism lives. It sets up the progress report, switches to the table owner’s userid under a SECURITY_RESTRICTED_OPERATION (index expressions may call user functions), and then makes the one call that does the real work:

// index_build — src/backend/catalog/index.c
Assert(PointerIsValid(indexRelation->rd_indam->ambuild));
// ...
stats = indexRelation->rd_indam->ambuild(heapRelation, indexRelation,
indexInfo);
Assert(PointerIsValid(stats));

Before that call, index_build enters the sandbox described in the conventions above — switch to the table owner, lock down security-restricted operations, and confine search_path so a malicious index expression cannot resolve a function it shouldn’t:

// index_build — src/backend/catalog/index.c
GetUserIdAndSecContext(&save_userid, &save_sec_context);
SetUserIdAndSecContext(heapRelation->rd_rel->relowner,
save_sec_context | SECURITY_RESTRICTED_OPERATION);
save_nestlevel = NewGUCNestLevel();
RestrictSearchPath();

ambuild is the per-AM callback (btbuild, hashbuild, gistbuild, …). For B-tree it scans the heap via table_index_build_scan, feeds every live (key, tid) into a tuplesort, then bulk-writes packed leaves — the textbook bulk loader. It returns an IndexBuildResult with the heap and index tuple counts, which index_build writes back into both pg_class rows:

// index_build — src/backend/catalog/index.c
index_update_stats(heapRelation, true, stats->heap_tuples);
index_update_stats(indexRelation, false, stats->index_tuples);
CommandCounterIncrement();
if (indexInfo->ii_ExclusionOps != NULL)
IndexCheckExclusion(heapRelation, indexRelation, indexInfo);

index_build also handles the broken HOT chain subtlety: if the scan saw a HOT chain whose members have different index keys, the index cannot be trusted until every transaction that could see the older versions is gone, so it sets pg_index.indcheckxmin. Critically, this is only done for non-concurrent, non-reindex builds — the comment in the source spells out why the concurrent path doesn’t need it (it won’t set indisvalid until those transactions are gone anyway) and why reindex must not (the chains predate the index, and touching the tuple while reindexing pg_index itself would be fatal):

// index_build — src/backend/catalog/index.c
if (indexInfo->ii_BrokenHotChain &&
!isreindex &&
!indexInfo->ii_Concurrent)
{
// mark indcheckxmin = true on the pg_index row
}

This section follows the three liveness bits — indislive, indisready, indisvalid — through the four operations: the plain build, the concurrent build, validation, and reindex. The bits are flipped only by index_set_state_flags, whose state machine is the spine of everything that follows.

index_set_state_flags — the liveness state machine

Section titled “index_set_state_flags — the liveness state machine”

Every transition between the build phases is a single-row update of pg_index performed by index_set_state_flags. The actions and their asserted preconditions encode the entire lifecycle:

// index_set_state_flags — src/backend/catalog/index.c
switch (action)
{
case INDEX_CREATE_SET_READY: /* CREATE INDEX CONCURRENTLY phase 2->3 */
Assert(indexForm->indislive);
Assert(!indexForm->indisready);
Assert(!indexForm->indisvalid);
indexForm->indisready = true;
break;
case INDEX_CREATE_SET_VALID: /* CREATE INDEX CONCURRENTLY final */
Assert(indexForm->indislive);
Assert(indexForm->indisready);
Assert(!indexForm->indisvalid);
indexForm->indisvalid = true;
break;
case INDEX_DROP_CLEAR_VALID: /* DROP INDEX CONCURRENTLY */
indexForm->indisvalid = false;
indexForm->indisclustered = false;
indexForm->indisreplident = false;
break;
case INDEX_DROP_SET_DEAD: /* DROP INDEX CONCURRENTLY, final */
Assert(!indexForm->indisvalid);
indexForm->indisready = false;
indexForm->indislive = false;
break;
}

A fresh CREATE INDEX CONCURRENTLY walks (live=t, ready=f, valid=f) → SET_READY → (live, ready, valid=f) → SET_VALID → (live, ready, valid). A plain CREATE INDEX inserts the row already at (live, ready, valid)=t and never calls this function. The asymmetry — set indisready before indisvalid, with a wait between each — is the heart of the online protocol.

flowchart LR
  A["row inserted<br/>not-ready, not-valid"] -->|"WaitForLockers + build"| B["index_concurrently_build<br/>SET_READY"]
  B -->|"commit; WaitForLockers"| C["ref snapshot +<br/>validate_index"]
  C -->|"drop snap; WaitForOlderSnapshots"| D["SET_VALID<br/>queries may use it"]
  A2["plain CREATE INDEX<br/>row inserted ready+valid"] --> Z["build inline<br/>(ShareLock held)"]

The non-concurrent path is short. index_create builds inline, and back in DefineIndex there is nothing left to do but close the relation and end the progress report:

// DefineIndex — src/backend/commands/indexcmds.c (non-concurrent tail)
if (!concurrent)
{
/* Close the heap and we're done, in the non-concurrent case */
table_close(rel, NoLock);
if (!OidIsValid(parentIndexId))
pgstat_progress_end_command();
else
pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1);
return address;
}

The ShareLock taken at table_open is held until the surrounding transaction commits, so writers are blocked for the entire scan + sort + write. That is the price of simplicity, and it is what CONCURRENTLY exists to avoid.

CREATE INDEX CONCURRENTLY — the multi-transaction protocol

Section titled “CREATE INDEX CONCURRENTLY — the multi-transaction protocol”

Everything past the if (!concurrent) early-return is the concurrent protocol. It spans multiple transactions within one command, which is why DefineIndex keeps almost nothing in memory across the commits — just the relation OIDs and the lock tag. The first move is to escalate to a session-level lock (survives the per-phase commits) and then commit the transaction that created the catalog row, making the not-ready/not-valid index visible:

// DefineIndex — src/backend/commands/indexcmds.c (enter concurrent path)
heaprelid = rel->rd_lockInfo.lockRelId;
SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
table_close(rel, NoLock);
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
PopActiveSnapshot();
CommitTransactionCommand();
StartTransactionCommand();
/* Tell concurrent index builds to ignore us, if index qualifies */
if (safe_index)
set_indexsafe_procflags();

The comment above this block states the invariant precisely: making the catalog entries visible before the build “will prevent them from making incompatible HOT updates. The new index will be marked not indisready and not indisvalid, so that no one else tries to either insert into it or use it for queries.” Now Phase 2: wait for every transaction that could have the table open without the new index in its relcache list, then build against a fresh snapshot:

// DefineIndex — src/backend/commands/indexcmds.c (phase 2)
WaitForLockers(heaplocktag, ShareLock, true);
/* Set ActiveSnapshot since functions in the indexes may need it */
PushActiveSnapshot(GetTransactionSnapshot());
/* Perform concurrent build of index */
index_concurrently_build(tableId, indexRelationId);
PopActiveSnapshot();
CommitTransactionCommand();
StartTransactionCommand();

WaitForLockers uses actual lock acquisition (request ShareLock on the heap’s lock tag and block until granted) rather than polling the ProcArray — the source explains this is deliberate: it lets the deadlock detector fire if one of the transactions we wait on is itself blocked trying to lock our table. index_concurrently_build rebuilds the IndexInfo (it was lost in the commit), runs index_build, and crucially ends by promoting the index to ready:

// index_concurrently_build — src/backend/catalog/index.c
indexInfo = BuildIndexInfo(indexRelation);
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false;
index_build(heapRel, indexRelation, indexInfo, false, true);
// ... close rels, keep locks ...
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);

After the commit that follows, indisready is globally visible: new transactions now insert into the index. But the index still does not contain rows that were deleted just before, or inserted just after, the build’s snapshot. Phase 3 closes that gap. Wait again, take the reference snapshot, and validate:

// DefineIndex — src/backend/commands/indexcmds.c (phase 3)
WaitForLockers(heaplocktag, ShareLock, true);
snapshot = RegisterSnapshot(GetTransactionSnapshot());
PushActiveSnapshot(snapshot);
/* Scan the index and the heap, insert any missing index entries. */
validate_index(tableId, indexRelationId, snapshot);
limitXmin = snapshot->xmin;
PopActiveSnapshot();
UnregisterSnapshot(snapshot);

The source flags the central hazard here: “There might still be snapshots in use that treat some transaction as in-progress that our reference snapshot treats as committed.” If such a transaction deleted rows, the build will not index them, yet older snapshots still expect them — so we must drop our reference snapshot (to avoid deadlocking against other concurrent builds that would wait on it), commit, start a final transaction, and wait out every transaction with a snapshot older than limitXmin:

// DefineIndex — src/backend/commands/indexcmds.c (final wait + set valid)
CommitTransactionCommand();
StartTransactionCommand();
if (safe_index)
set_indexsafe_procflags();
Assert(MyProc->xmin == InvalidTransactionId); /* advertising no xmin */
WaitForOlderSnapshots(limitXmin, true);
PushActiveSnapshot(GetTransactionSnapshot());
index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
PopActiveSnapshot();
CacheInvalidateRelcacheByRelid(heaprelid.relId);
UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

WaitForOlderSnapshots is the step that can hang CONCURRENTLY for hours: it calls GetCurrentVirtualXIDs(limitXmin, ...) to enumerate every backend whose xmin precedes the reference snapshot and then VirtualXactLocks each one to wait it out. Note the filter — it excludes autovacuum, already-vacuuming, and other safe concurrent builds (the PROC_IN_SAFE_IC flag set by set_indexsafe_procflags), so two CONCURRENTLY commands do not deadlock on each other:

// WaitForOlderSnapshots — src/backend/commands/indexcmds.c
old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
| PROC_IN_SAFE_IC,
&n_old_snapshots);
// ... for each still-present old snapshot: VirtualXactLock(...) ...

The final CacheInvalidateRelcacheByRelid on the parent table forces replanning of cached plans so existing sessions notice the new index.

validate_index is the reconciliation engine. It must find every heap tuple visible in the reference snapshot that is not yet in the index, and insert it — without re-inserting the (vast) majority that the build already placed. The clever part, explained in the long comment above the function, is that it does not index-scan: it uses an ambulkdelete callback that collects TIDs (and never deletes), sorts them, then merge-joins the sorted TID list against a heap scan:

// validate_index — src/backend/catalog/index.c
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
InvalidOid, false,
maintenance_work_mem,
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
/* gather every TID currently in the index via a no-op bulkdelete */
(void) index_bulk_delete(&ivinfo, NULL, validate_index_callback, &state);
tuplesort_performsort(state.tuplesort);
/* "merge join" the sorted index TIDs against a heap scan */
table_index_validate_scan(heapRelation, indexRelation, indexInfo,
snapshot, &state);

The callback simply encodes each item pointer as an int8 (a pass-by-value sort is far faster than sorting raw ItemPointers) and counts it:

// validate_index_callback — src/backend/catalog/index.c
static bool
validate_index_callback(ItemPointer itemptr, void *opaque)
{
ValidateIndexState *state = (ValidateIndexState *) opaque;
int64 encoded = itemptr_encode(itemptr);
tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
state->itups += 1;
return false; /* never delete */
}

table_index_validate_scan (the table-AM hook) walks the heap in physical order and, for each visible tuple whose TID is not found in the sorted index TID list, calls index_insert. Building a unique index this way is subtle — the comment notes the AM must “recheck liveness of the to-be-inserted tuple before it declares a uniqueness error,” because we may race an in-progress update of the same logical row.

reindex_index rebuilds one index in place. It takes AccessExclusiveLock on the index (and ShareLock on the heap), transfers any predicate locks up to the heap, builds a fresh IndexInfo, assigns a new relfilenumber (so the old physical files are discarded on commit), and calls the shared index_build with isreindex = true:

// reindex_index — src/backend/catalog/index.c
TransferPredicateLocksToHeapRelation(iRel);
indexInfo = BuildIndexInfo(iRel);
// ... optionally clear ii_Unique / ii_ExclusionOps if skip_constraint_checks ...
SetReindexProcessing(heapId, indexId); /* hide target from planner */
RelationSetNewRelfilenumber(iRel, persistence); /* new storage */
index_build(heapRelation, iRel, indexInfo, true, true);
ResetReindexProcessing();

SetReindexProcessing registers the index OID in backend-local state so that the build itself does not try to use the very index it is rebuilding (important when reindexing a system catalog’s own index). After the build, reindex_index opportunistically repairs a previously invalid/not-ready/dead index — a leftover from a failed CREATE INDEX CONCURRENTLY or DROP INDEX CONCURRENTLY — by clearing those bits, which is why “REINDEX” is the documented recovery for a stuck CONCURRENTLY.

REINDEX CONCURRENTLY (the ReindexRelationConcurrently path) is the online analogue: rather than rebuild in place, it index_concurrently_create_copy makes a shadow index, builds and validates it exactly like CREATE INDEX CONCURRENTLY, then index_concurrently_swap atomically swaps names, dependencies, and constraints over to the new index while marking the old one invalid — all under ShareUpdateExclusiveLock so DML never blocks. The deep mechanics of the swap are a postgres-ddl-execution.md concern; the point here is that it reuses the identical build/validate/wait primitives.

Worked trace — three commands, one machine

Section titled “Worked trace — three commands, one machine”

It is worth stepping through the three commands end-to-end to see how they reuse the same primitives.

CREATE INDEX idx ON t (a); — one transaction. DefineIndex takes ShareLock on t; resolves btree and its int4_ops opclass; computes IndexInfo; calls index_create with flags = 0 (no SKIP_BUILD). index_create writes a pg_index row already valid+ready, then calls index_build(t, idx, info, isreindex=false, parallel=true), which dispatches to btbuild — heap scan, tuplesort, packed-leaf write — and index_update_stats records the tuple counts. The transaction commits; the index is usable. Writers were blocked the whole time.

CREATE INDEX CONCURRENTLY idx ON t (a);four transactions. T1: take ShareUpdateExclusiveLock, index_create with SKIP_BUILD | CONCURRENT writes a not-ready/not-valid row, escalate to a session lock, commit. T2: WaitForLockers, index_concurrently_build (heap scan + btbuild against a snapshot, then SET_READY), commit. T3: WaitForLockers, register the reference snapshot, validate_index (collect index TIDs, sort, merge-join the heap, insert the gap), commit. T4: WaitForOlderSnapshots(limitXmin), SET_VALID, CacheInvalidateRelcacheByRelid, release session lock. (All three intra-command boundaries are the explicit CommitTransactionCommand / StartTransactionCommand pairs inside DefineIndex; T4 itself is committed by the surrounding utility framework.) DML proceeded throughout; the cost was two heap scans and up to three waits.

REINDEX INDEX idx;reindex_index takes AccessExclusiveLock on idx, BuildIndexInfo from the catalog, RelationSetNewRelfilenumber (new storage, old discarded on commit), then the same index_build(heap, idx, info, isreindex=true, parallel=true)btbuild. Constraint re-checks are skipped if the caller trusts the old definition; the broken-HOT-chain guard is bypassed because isreindex is true. One transaction, index locked exclusively, but the rebuild is in place rather than a shadow swap.

The throughline: btbuild (the ambuild callback) runs in all three; only the locking and the surrounding transaction structure differ. That is the payoff of the build being AM-polymorphic and the protocol living entirely above it.

A useful way to read the three flows side by side is by what holds across a commit boundary. The plain build never commits mid-command, so its entire state — IndexInfo, open relations, the tuplesort — lives in one memory context and dies at end-of-transaction. The concurrent build commits three times inside DefineIndex (a fourth commit ends T4 in the utility framework), so almost nothing survives across a boundary — it may keep only OIDs and a session lock alive — which is why index_concurrently_build and validate_index both re-BuildIndexInfo from the catalog rather than receive it as an argument: the comment in validate_index notes its caller’s copy “is long gone due to having been built in a previous transaction.” REINDEX is single-transaction like the plain build but, by swapping in a new relfilenumber first, gets crash safety for free: if it aborts, the old storage is still referenced and the new files are orphaned for cleanup. Three locking regimes, three transaction shapes, one builder.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
DefineIndexsrc/backend/commands/indexcmds.c542
lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLocksrc/backend/commands/indexcmds.c681
GetIndexAmRoutine (AM vtable lookup)src/backend/commands/indexcmds.c864
ComputeIndexAttrs (call site)src/backend/commands/indexcmds.c938
index_create (call site in DefineIndex)src/backend/commands/indexcmds.c1247
LockRelationIdForSession (enter concurrent)src/backend/commands/indexcmds.c1644
WaitForLockers (phase 2)src/backend/commands/indexcmds.c1687
index_concurrently_build (call site)src/backend/commands/indexcmds.c1711
WaitForLockers (phase 3)src/backend/commands/indexcmds.c1734
validate_index (call site)src/backend/commands/indexcmds.c1757
WaitForOlderSnapshots (call site)src/backend/commands/indexcmds.c1797
index_set_state_flags(SET_VALID) (call site)src/backend/commands/indexcmds.c1808
WaitForOlderSnapshots (definition)src/backend/commands/indexcmds.c435
set_indexsafe_procflagssrc/backend/commands/indexcmds.c4615
index_createsrc/backend/catalog/index.c726
index_concurrently_buildsrc/backend/catalog/index.c1485
index_concurrently_swapsrc/backend/catalog/index.c1552
index_buildsrc/backend/catalog/index.c3002
ambuild dispatch (rd_indam->ambuild)src/backend/catalog/index.c3078
index_update_stats (call sites)src/backend/catalog/index.c3155
validate_indexsrc/backend/catalog/index.c3350
validate_index_callbacksrc/backend/catalog/index.c3483
index_set_state_flagssrc/backend/catalog/index.c3503
reindex_indexsrc/backend/catalog/index.c3608
ReindexRelationConcurrentlysrc/backend/commands/indexcmds.c3570

Verified against the source-of-truth tree at /data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94852b3a7e34fd171e8abdf1481beb302fa.

  • Symbols exist on REL_18. DefineIndex, WaitForOlderSnapshots, set_indexsafe_procflags confirmed in commands/indexcmds.c; index_create, index_build, index_concurrently_build, index_concurrently_swap, validate_index, validate_index_callback, index_set_state_flags, reindex_index confirmed in catalog/index.c (function-definition grep, line numbers tabled above).
  • Lock strengths verified literally. lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock is the exact text at the cited line; reindex_index opens the index with AccessExclusiveLock and the heap with ShareLock.
  • The three liveness bits indislive / indisready / indisvalid and the four IndexStateFlagsAction cases (INDEX_CREATE_SET_READY, INDEX_CREATE_SET_VALID, INDEX_DROP_CLEAR_VALID, INDEX_DROP_SET_DEAD) are present verbatim in index_set_state_flags.
  • Phase ordering verified by reading DefineIndex straight through: catalog commit → WaitForLockersindex_concurrently_build (SET_READY) → commit → WaitForLockers → reference snapshot → validate_index → drop snapshot → commit → WaitForOlderSnapshotsindex_set_state_flags(SET_VALID). The PROGRESS_CREATEIDX_PHASE_WAIT_1 / _2 / _3 progress markers bracket exactly these waits.
  • PROC_IN_SAFE_IC filter in GetCurrentVirtualXIDs (inside WaitForOlderSnapshots) confirmed; it is set by set_indexsafe_procflags, so two safe concurrent builds ignore each other. This is a REL_18-valid mechanism (no relation to the removed B_DATACHECKSUMSWORKER_* worker states).
  • index_build signature index_build(Relation, Relation, IndexInfo *, bool isreindex, bool parallel) confirmed; the broken-HOT-chain guard ii_BrokenHotChain && !isreindex && !ii_Concurrent is present as quoted.
  • Caveat — partition recursion and constraint creation inside index_create / DefineIndex are summarized, not exhaustively quoted; see postgres-ddl-execution.md. The B-tree-specific bulk loader (btbuild_bt_load) and the table-AM scan (table_index_build_scan) are named but their internals belong to postgres-nbtree.md and postgres-table-am.md respectively.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

PostgreSQL’s index creation sits at one corner of a large design space. Naming the alternatives sharpens what its choices cost and buy.

The blocking/online trade in other engines. Most production engines offer an online build, but with different lock and reconciliation strategies. MySQL/InnoDB’s online DDL (since 5.6) builds the secondary index while logging concurrent DML into an in-memory row log and replays that log at the end under a brief exclusive lock — a log-and-replay design, in contrast to PostgreSQL’s snapshot-and-revalidate. Oracle’s CREATE INDEX ... ONLINE uses a journal table populated by triggers, similarly replayed at the end. SQL Server’s WITH (ONLINE = ON) maintains the new index from a versioned row store during the build. The common shape is: capture concurrent changes, build the bulk, then reconcile. PostgreSQL is unusual in reconciling not via a change log but via a second full pass (validate_index) plus the snapshot-wait. The source itself calls this “a brute-force strategy” and muses about storing new tuples in a special area instead — but rejects it as “yet more locking issues.” The practical consequence: PostgreSQL’s CONCURRENTLY does two table scans and can be stalled indefinitely by a single long-running transaction (because WaitForOlderSnapshots cannot complete), whereas a log-replay design’s tail latency is bounded by the backlog, not by the oldest snapshot.

Why two passes instead of a change log? The answer is MVCC. Because PostgreSQL keeps old row versions in the heap (no separate undo/rollback segment), a snapshot is a consistent, cheap-to-take description of “which rows existed then.” The build indexes everything visible to snapshot S1; new writers (now seeing indisready) maintain the index themselves; validate_index then fills the gap for rows that were visible to a later reference snapshot S2 but not yet present. The engine never has to serialize a change log because the heap’s version chains already are the log. This is the same MVCC property exploited by postgres-vacuum.md and postgres-mvcc-snapshots.md.

Bulk loading and write-optimized structures. PostgreSQL’s ambuild bulk loaders produce a dense, read-optimized B-tree. The research frontier for write-heavy workloads is the LSM-tree (O’Neil et al. 1996; Database System Concepts §14.8 covers LSM trees and buffer trees as write-optimized indices). An LSM engine never does a one-shot CREATE INDEX bulk load in the same sense — it accretes sorted runs and compacts them, so “building an index” amortizes into the ongoing compaction machinery. PostgreSQL has no LSM core; its closest write-optimization is BRIN (a tiny, lossy summary index that builds almost for free) and GIN’s pending list. The comparison matters because it explains why CREATE INDEX is a discrete, expensive event in PostgreSQL but a non-event in an LSM store.

Concurrent B-tree foundations. That the table can keep mutating during a CONCURRENTLY build at all depends on the underlying structure supporting concurrent inserts safely — the Lehman & Yao (1981) right-link B-link-tree algorithm that PostgreSQL’s nbtree implements (see postgres-nbtree.md). Without lock-coupling-free concurrent descent, the “writers keep inserting while we build/validate” premise would collapse. The index-creation protocol is thus a client of the concurrency guarantees the AM provides, not a provider of them.

Failure semantics as a feature. A crashed CREATE INDEX CONCURRENTLY leaves an indisvalid = false index — ignored by the planner, still present in the catalog. This is a deliberate consequence of the “index is derived state” principle: the half-built artifact is harmless because nothing trusts it until the final flag flip, and REINDEX or DROP INDEX cleans it up. Engines that build under a single transaction (plain CREATE INDEX) get all-or-nothing rollback for free; the online protocol trades that atomicity for non-blocking behaviour and recovers the safety with the validity flag instead.

Open research directions. Recent work explores learned indexes (Kraska et al. 2018) that replace B-tree nodes with learned models, and incremental / adaptive indexing (database cracking — Idreos et al. 2007) that builds the index lazily as a side effect of queries rather than as an explicit DDL event. Both invert PostgreSQL’s model: instead of a discrete, snapshot-bounded build command, the index materializes gradually from the query stream. Neither is in core PostgreSQL, but both illuminate the assumption baked into DefineIndex — that index construction is a one-time, operator-issued, point-in-time event.

  • Code (source of truth): /data/hgryoo/references/postgres, REL_18_STABLE @ 273fe94852b3a7e34fd171e8abdf1481beb302fa.
    • src/backend/commands/indexcmds.cDefineIndex, WaitForOlderSnapshots, ReindexRelationConcurrently, set_indexsafe_procflags, AM/opclass resolution.
    • src/backend/catalog/index.cindex_create, index_build, index_concurrently_build, index_concurrently_swap, validate_index, validate_index_callback, index_set_state_flags, reindex_index.
    • src/include/catalog/index.hIndexStateFlagsAction, INDEX_CREATE_* flags.
    • src/backend/access/heap/README.HOT — broken-HOT-chain / indcheckxmin rationale.
  • Theory anchors: Silberschatz, Korth & Sudarshan, Database System Concepts, 7e — ch. 14 “Indexing” (§14.2 Ordered Indices, §14.3–14.4 B+-Tree, bulk-loading discussion, §14.8 write-optimized / LSM indices), §4.6 “Index Definition in SQL”. Captured at knowledge/research/dbms-general/database-system-concepts.md.
  • Concurrency lineage: Lehman & Yao 1981, “Efficient Locking for Concurrent Operations on B-Trees” (the nbtree algorithm enabling concurrent writes during CONCURRENTLY); bibliography at .omc/plans/postgres-paper-bibliography.md.
  • Cross-references (sibling docs, do not duplicate): postgres-ddl-execution.md (utility dispatch, partition recursion, REINDEX CONCURRENTLY swap mechanics), postgres-nbtree.md (btbuild, _bt_load bulk loader, B-link-tree concurrency), postgres-index-am.md (IndexAmRoutine vtable, ambuild/aminsert callbacks), postgres-table-am.md (table_index_build_scan, table_index_validate_scan), postgres-mvcc-snapshots.md and postgres-procarray.md (snapshots, GetCurrentVirtualXIDs, VirtualXactLock), postgres-vacuum.md (index_bulk_delete), postgres-lock-manager.md (WaitForLockers, session locks), postgres-buffer-manager.md and postgres-smgr-md.md (the physical pages the bulk loader writes), postgres-system-catalogs.md (pg_index / pg_class / pg_attribute row shapes).