Skip to content

PostgreSQL Full-Text Search — tsvector, tsquery, Dictionaries, and GIN

Contents:

Full-text search is the problem of finding documents that contain given words, as opposed to documents whose column value equals a string. The two problems look superficially similar but pull a database engine in opposite directions. Equality is exact, order-preserving, and B-tree friendly: WHERE title = 'foo' is one descent of a balanced tree. “Contains the word running, in any inflection, ignoring case and stop words, and rank the matches by how prominently the term appears” is a different beast. It demands three machinery layers that ordinary scalar types do not have:

  1. Tokenization and normalization. Raw text is an unstructured byte stream. To search it you must first cut it into tokens (words, numbers, URLs, email addresses, host names) and then fold each token to a canonical lexeme — lowercasing, stripping inflectional suffixes (running → run), dropping high-frequency stop words (the, a, and) that carry no discriminating power, and optionally expanding synonyms or thesaurus phrases. This is the linguistics-aware step; it is inherently language-specific, which is why the i18n subsystem owns it.

  2. An invertible document representation. Once normalized, a document becomes a bag (or sequence) of lexemes. The canonical representation is the inverted index of classical information retrieval (Salton & McGill, Introduction to Modern Information Retrieval, 1983; Zobel & Moffat, “Inverted Files for Text Search Engines”, ACM Computing Surveys, 2006): a mapping from each distinct lexeme to the list of documents (and positions within them) where it occurs. The per-document half of this — the sorted set of a single document’s lexemes with their positions — is a first-class value in PostgreSQL: the tsvector.

  3. A query language and a ranking function. A search request is not a single value but a boolean expression over lexemes: run & (fast | quick) & !slow, possibly with phrase constraints (quick <-> brown means “quick immediately followed by brown”). Matching is a tree evaluation, not a comparison. And because many documents match, the engine must rank them — assign a relevance score so the best results surface first. Classical IR ranks by term frequency × inverse document frequency (tf-idf) and its descendants (Okapi BM25, Robertson & Spärck Jones). PostgreSQL ships a simpler, position-aware scheme based on term weights and cover density (the proximity-clustering idea of Clarke, Cormack & Tudhope, “Relevance Ranking for One- to Three-Term Queries”, Information Processing & Management, 2000).

Database System Concepts (Silberschatz et al.) treats information retrieval as a distinct chapter precisely because the relevance-ranked, keyword-over-unstructured-text model sits outside the relational exact-match core. Database Internals (Petrov) frames the inverted index as one member of the secondary-index family whose posting lists demand different storage and merge strategies than B-tree leaf chains. PostgreSQL’s answer threads all three layers into the type system: the linguistic layer is the configuration (regconfig), the document representation is the tsvector type, the query is the tsquery type, matching is the @@ operator, and the inverted index is a GIN index over tsvector.

The architectural insight that makes this fit a relational engine cleanly: PostgreSQL does not build a monolithic search engine bolted onto the side. It exposes tsvector and tsquery as ordinary user-visible datatypes with ordinary operators, makes to_tsvector(text) an ordinary (immutable, per-config) function you can put in an expression index, and reuses the generalized inverted index (GIN) access method — the same one that indexes array containment and JSONB — to accelerate @@. Full-text search is thus a composition of existing extensibility points (custom type, custom operator, custom GIN opclass, custom parser/dictionary plugins) rather than a special case in the core executor.

Almost every engine that offers text search converges on a similar pipeline; the variation is in where each stage lives and how pluggable it is. Naming the shared shape first makes PostgreSQL’s specific choices read as selections in a known design space.

The analyzer pipeline: tokenize → filter → normalize

Section titled “The analyzer pipeline: tokenize → filter → normalize”

Lucene/Elasticsearch call it an analyzer; Oracle Text calls it a lexer + stoplist + stemmer; PostgreSQL calls it a parser + dictionaries. The shape is identical: a tokenizer cuts the byte stream into typed tokens, then a filter chain transforms each token — lowercasing, stop-word removal, stemming, synonym expansion — emitting zero or more normalized terms. “Zero” matters: a stop word produces no term but must still consume a position so that phrase queries downstream measure distance correctly. Engines differ in whether the filter chain is per-token-type (PostgreSQL maps each token type to its own dictionary list) or global (Lucene applies one analyzer to the whole field).

The document vector and the inverted index are two views of one thing

Section titled “The document vector and the inverted index are two views of one thing”

Every engine stores, per document, a compact sorted structure of (lexeme → positions). PostgreSQL materializes this as the tsvector value stored in a table column; Lucene stores it only inside the index segment. Making the per-document vector a first-class column has consequences: you can SELECT it, debug it, store a precomputed to_tsvector(body) in a generated column, and the GIN index simply inverts that column. The inverted index maps each distinct lexeme to a posting list of row pointers (TIDs in PostgreSQL). A query term lookup is a single index probe returning a posting list; a boolean query intersects/unions posting lists.

Boolean query tree + relevance ranking are separate phases

Section titled “Boolean query tree + relevance ranking are separate phases”

Matching answers which documents qualify; ranking answers how good each is. These are deliberately decoupled: the index supplies a candidate set cheaply (often lossily — it may include false positives that a recheck removes), and ranking runs only on survivors, reading the full document vector to compute proximity- and frequency-aware scores. Decoupling lets the index be small (lexeme presence, not full position math) while ranking stays precise (it re-reads positions from the stored tsvector).

A crucial pattern: the index is allowed to return maybe-matches. Weight restrictions (run:A = “run, but only where it carries weight A”) and phrase distance (<->) cannot always be decided from posting-list membership alone, because a GIN posting list records which rows contain a lexeme, not at which weighted position. So the index returns TRUE or MAYBE, and a recheck flag tells the executor to re-evaluate @@ on the actual heap tuple. This ternary logic (yes / no / maybe) is the linchpin that lets a presence-only index serve position-and-weight-sensitive queries.

flowchart LR
  doc["raw document text"] --> P["text parser<br/>(tokenize)"]
  P --> T1["token: asciiword 'Running'"]
  P --> T2["token: blank ' '"]
  P --> T3["token: asciiword 'fast'"]
  T1 --> D["dictionary chain<br/>(lexize: lowercase,<br/>stopword, stem)"]
  T3 --> D
  D --> V["tsvector<br/>'fast':2 'run':1"]
  V --> G["GIN inverted index<br/>(lexeme to row TIDs)"]
  q["query text"] --> QP["to_tsquery<br/>(parse + lexize)"]
  QP --> Q["tsquery<br/>'run' & 'fast'"]
  Q --> G
  G --> C["candidate rows<br/>(TRUE / MAYBE)"]
  C --> R["@@ recheck +<br/>ts_rank scoring"]
  R --> out["ranked results"]

PostgreSQL’s full-text search is assembled from five cooperating pieces, each a separate compilation unit under src/backend/tsearch/ (the linguistic engine) and src/backend/utils/adt/ (the type operators):

  1. The text parser (wparser_def.c) — a hand-written finite-state machine that classifies the input into 23 token types (asciiword, word, numword, email, url, host, version, tag, …). The default parser is the one registered as pg_catalog.default.

  2. Dictionaries (dict.c, dict_simple.c, dict_synonym.c, dict_thesaurus.c, dict_ispell.c, plus spell.c) — each a plugin exposing an init and a lexize function. A dictionary maps a token to zero, one, or many output lexemes, or declines (returns NULL) to let the next dictionary in the chain try.

  3. The lexize driver (ts_parse.c) — parsetext() runs the parser, and for each token walks the configuration’s per-token-type dictionary list, calling each dictionary’s lexize until one accepts. The stateful LexizeData machine also supports multi-word dictionaries (thesaurus) that need to peek at following tokens.

  4. The datatypes and operators (to_tsany.c, tsvector_op.c, tsrank.c) — to_tsvector / to_tsquery build the tsvector / tsquery values; ts_match_vq implements @@; ts_rank / ts_rank_cd score matches.

  5. The GIN bridge (tsginidx.c) — gin_extract_tsvector, gin_extract_tsquery, and gin_tsquery_consistent connect @@ to the inverted index.

The tsvector: a sorted lexeme + position store

Section titled “The tsvector: a sorted lexeme + position store”

A tsvector is a varlena blob: a count, an array of WordEntry headers (offset + length + a haspos bit), then a packed string region holding the lexeme bytes and, optionally, position arrays. The header layout is fixed at 4 bytes per WordEntry:

// WordEntry — src/include/utils/tsvector.h
typedef struct
{
uint32 haspos:1,
len:11, /* MAX 2Kb */
pos:20; /* MAX 1Mb */
} WordEntry;

Positions are 16-bit, with the high 2 bits reserved for a weight (A/B/C/D) and the low 14 bits for the position ordinal — hence MAXENTRYPOS = (1<<14) and MAXNUMPOS = 256 positions per lexeme:

// position limits — src/include/utils/tsvector.h
#define MAXENTRYPOS (1<<14)
#define MAXNUMPOS (256)
#define LIMITPOS(x) ( ( (x) >= MAXENTRYPOS ) ? (MAXENTRYPOS-1) : (x) )

The lexeme array is kept sorted by (length, bytes) so that @@ matching and ranking can binary-search it. That sort and the dedup happen in make_tsvectoruniqueWORD, which collapses repeated lexemes and merges their position lists:

// make_tsvector — src/backend/tsearch/to_tsany.c
TSVector
make_tsvector(ParsedText *prs)
{
/* Merge duplicate words */
if (prs->curwords > 0)
prs->curwords = uniqueWORD(prs->words, prs->curwords);
/* ... compute lenstr, palloc0(totallen), SET_VARSIZE ... */
in->size = prs->curwords;
/* ... copy each lexeme's bytes + position array into the blob ... */
return in;
}

uniqueWORD first qsorts by the compareWORD comparator (which calls tsCompareString), then walks the run merging equal words and appending distinct positions, capped at MAXNUMPOS:

// uniqueWORD — src/backend/tsearch/to_tsany.c
qsort(a, l, sizeof(ParsedWord), compareWORD);
/* ... */
while (ptr - a < l)
{
if (!(ptr->len == res->len &&
strncmp(ptr->word, res->word, res->len) == 0))
{
/* Got a new word, so put it in result */
res++;
/* ... start a fresh position array ... */
}
else
{
/* same word: append position if within MAXNUMPOS and distinct */
if (res->pos.apos[0] < MAXNUMPOS - 1 && ...)
res->pos.apos[res->pos.apos[0] + 1] = LIMITPOS(ptr->pos.pos);
}
ptr++;
}

to_tsvector: drive the parser, lexize, materialize

Section titled “to_tsvector: drive the parser, lexize, materialize”

to_tsvector_byid is the workhorse. It estimates the word count, allocates a ParsedText, calls parsetext() to fill it (tokenize + lexize), then make_tsvector to pack the sorted blob:

// to_tsvector_byid — src/backend/tsearch/to_tsany.c
prs.lenwords = VARSIZE_ANY_EXHDR(in) / 6; /* estimate word count */
if (prs.lenwords < 2)
prs.lenwords = 2;
prs.curwords = 0;
prs.pos = 0;
prs.words = (ParsedWord *) palloc(sizeof(ParsedWord) * prs.lenwords);
parsetext(cfgId, &prs, VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));
out = make_tsvector(&prs);
PG_RETURN_TSVECTOR(out);

The zero-argument to_tsvector(text) simply resolves the current configuration (the default_text_search_config GUC) and forwards:

// to_tsvector — src/backend/tsearch/to_tsany.c
cfgId = getTSCurrentConfig(true);
PG_RETURN_DATUM(DirectFunctionCall2(to_tsvector_byid,
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));

A tsquery is a flattened postfix tree of QueryItems — either a QI_VAL operand (a lexeme, with optional prefix/weight flags) or a QI_OPR operator. The operators are OP_NOT, OP_AND, OP_OR, and OP_PHRASE (the <-> / <N> distance operator), the last being the highest-priority code:

// operator codes — src/include/tsearch/ts_type.h
#define OP_NOT 1
#define OP_AND 2
#define OP_OR 3
#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */

to_tsquery is more subtle than to_tsvector: each query word must also be run through the dictionary chain (so to_tsquery('Running') matches the run stored in the vector), but a single word may lexize into several variant lexemes. The callback pushval_morph handles this: words from the same morphological variant are AND-ed, different variants are OR-ed, and stop words removed mid-phrase leave placeholder positions so phrase distances stay correct:

// pushval_morph — src/backend/tsearch/to_tsany.c
parsetext(data->cfg_id, &prs, strval, lenval);
/* ... for each output position ... */
/* Push all words belonging to the same variant */
pushValue(state, prs.words[count].word, prs.words[count].len,
weight, ((prs.words[count].flags & TSL_PREFIX) || prefix));
if (cnt)
pushOperator(state, OP_AND, 0); /* same variant: AND */
/* ... */
if (cntvar)
pushOperator(state, OP_OR, 0); /* different variant: OR */

The several public entry points differ only in the qoperator they hand to pushval_morph and the parse flags they pass to parse_tsquery:

  • to_tsquery_byid — full operator syntax, OP_PHRASE for adjacent morphs.
  • plainto_tsquery_byidP_TSQ_PLAIN, OP_AND: treats the whole input as plain words AND-ed together.
  • phraseto_tsquery_byidP_TSQ_PLAIN, OP_PHRASE: words become a phrase.
  • websearch_to_tsquery_byidP_TSQ_WEB: Google-style syntax (quoted phrases, or, - negation).
// plainto_tsquery_byid — src/backend/tsearch/to_tsany.c
data.qoperator = OP_AND;
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
P_TSQ_PLAIN,
NULL);

Every conversion is parameterized by a text search configuration (regconfig), which bundles a parser with a per-token-type dictionary map. getTSCurrentConfig reads the session default; lookup_ts_config_cache caches the resolved TSConfigCacheEntry (its map field is the token-type → dictionary-list array consumed by LexizeExec). The thin get_current_ts_config SQL function just exposes the former:

// get_current_ts_config — src/backend/tsearch/to_tsany.c
Datum
get_current_ts_config(PG_FUNCTION_ARGS)
{
PG_RETURN_OID(getTSCurrentConfig(true));
}
flowchart TD
  txt["to_tsvector(cfg, text)"] --> pt["parsetext()"]
  pt --> cfg["lookup_ts_config_cache(cfgId)"]
  pt --> prs["lookup_ts_parser_cache(prsId)"]
  prs --> tok["prsstart / prstoken / prsend<br/>FSM emits typed tokens"]
  tok --> lz["LexizeAddLemm + LexizeExec"]
  lz --> map["cfg->map[token.type]<br/>dictionary list"]
  map --> dic["each dict->lexize until accept"]
  dic --> pw["ParsedWord[] (word, pos, flags)"]
  pw --> mv["make_tsvector: uniqueWORD + pack"]
  mv --> tsv["tsvector value"]

This section walks the call flow in execution order — parser → dictionaries → lexize driver → datatype build → match → rank → index — anchoring on stable symbol names. The position-hint table at the end pairs each symbol with its (file, line) as observed at the updated: revision.

The default parser is a table-driven finite-state machine. Token types are small integers whose human names live in tok_alias[]; type 0 (""/blank) is special and is skipped by the lexize driver. The 23 recognized types are:

// tok_alias — src/backend/tsearch/wparser_def.c
static const char *const tok_alias[] = {
"", "asciiword", "word", "numword",
"email", "url", "host", "sfloat",
"version", "hword_numpart", "hword_part", "hword_asciipart",
"blank", "tag", "protocol", "numhword",
"asciihword","hword", "url_path", "file",
"float", "int", "uint", "entity"
};

The FSM is expressed as arrays of TParserStateActionItem — for each state, a list of (character-class predicate, flags, action, next-state, emit-type) rows. For example, once inside an ASCII word the machine emits an ASCIIWORD token (A_BINGO) when it hits a non-word character or EOF:

// actionTPS_InAsciiWord — src/backend/tsearch/wparser_def.c
static const TParserStateActionItem actionTPS_InAsciiWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
/* ... transitions to hword/url/host/email states ... */
{NULL, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL}
};

The parser exposes three function-manager callbacks — prsstart, prstoken, prsend — invoked by the lexize driver below; it does not know about dictionaries or configurations.

2. Dictionaries: the lexize plugin contract (dict_simple.c, dict_synonym.c)

Section titled “2. Dictionaries: the lexize plugin contract (dict_simple.c, dict_synonym.c)”

A dictionary is a pair of FunctionCall-able procedures. init parses the CREATE TEXT SEARCH DICTIONARY options into an opaque state blob; lexize receives (dictData, lemmaText, len, DictSubState*) and returns a NULL-terminated TSLexeme array. The output array encodes three distinct outcomes:

  • return NULL — “I don’t recognize this token” → driver tries the next dictionary in the list.
  • return a zero-length (empty first lexeme) array — “this is a stop word” → the token is consumed, produces no lexeme, but advances position.
  • return one-or-more lexemes — accepted and normalized.

dsimple_lexize is the canonical illustration: lowercase, then either reject as stop word (empty array), accept (one lexeme), or decline (NULL) depending on the accept option:

// dsimple_lexize — src/backend/tsearch/dict_simple.c
txt = str_tolower(in, len, DEFAULT_COLLATION_OID);
if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
{
pfree(txt);
res = palloc0(sizeof(TSLexeme) * 2); /* empty array = stop word */
PG_RETURN_POINTER(res);
}
else if (d->accept)
{
res = palloc0(sizeof(TSLexeme) * 2);
res[0].lexeme = txt; /* one lexeme = accept */
PG_RETURN_POINTER(res);
}
else
PG_RETURN_POINTER(NULL); /* NULL = decline */

Each TSLexeme carries flag bits that steer the driver:

// TSLexeme flags — src/include/tsearch/ts_public.h
#define TSL_ADDPOS 0x01 /* this lexeme advances the position counter */
#define TSL_PREFIX 0x02 /* lexeme is a prefix pattern (foo:*) */
#define TSL_FILTER 0x04 /* re-feed lexeme to the rest of the chain */

TSL_FILTER is what makes filtering dictionaries (e.g. unaccent) composable: a filter dictionary rewrites the token and hands the result back to the same position in the dictionary list, so a later stemmer still runs. The synonym dictionary (dsynonym_lexize) instead does a direct table lookup and may set TSL_ADDPOS for multi-word replacements.

3. The lexize driver: parsetext and LexizeExec (ts_parse.c)

Section titled “3. The lexize driver: parsetext and LexizeExec (ts_parse.c)”

parsetext is the loop that ties parser to dictionaries. It looks up the config and parser caches, starts the parser, then repeatedly pulls a token (prstoken), feeds it to the lexize state machine, and drains every resulting lexeme into prs->words[], bumping prs->pos once per token and again per TSL_ADDPOS:

// parsetext — src/backend/tsearch/ts_parse.c
do {
type = DatumGetInt32(FunctionCall3(&(prsobj->prstoken),
PointerGetDatum(prsdata),
PointerGetDatum(&lemm),
PointerGetDatum(&lenlemm)));
/* ... IGNORE_LONGLEXEME guard for words >= MAXSTRLEN ... */
LexizeAddLemm(&ldata, type, lemm, lenlemm);
while ((norms = LexizeExec(&ldata, NULL)) != NULL)
{
prs->pos++; /* one position per token */
while (ptr->lexeme)
{
if (ptr->flags & TSL_ADDPOS)
prs->pos++;
prs->words[prs->curwords].word = ptr->lexeme;
prs->words[prs->curwords].pos.pos = LIMITPOS(prs->pos);
ptr++; prs->curwords++;
}
}
} while (type > 0);

LexizeExec is the heart of the chain logic. In its normal mode it walks the dictionary list cfg->map[type], calling each lexize. A token type with an empty map (or type 0) is skipped entirely:

// LexizeExec (normal mode) — src/backend/tsearch/ts_parse.c
map = ld->cfg->map + curVal->type;
if (curVal->type == 0 || curVal->type >= ld->cfg->lenmap || map->len == 0)
{
RemoveHead(ld); /* skip this token type */
continue;
}
for (i = ld->posDict; i < map->len; i++)
{
dict = lookup_ts_dictionary_cache(map->dictIds[i]);
res = (TSLexeme *) DatumGetPointer(FunctionCall4(&(dict->lexize), ...));
if (ld->dictState.getnext) { /* go multiword mode */ ... }
if (!res) continue; /* dict declined: next dict */
if (res->flags & TSL_FILTER) { /* rewrite & re-feed same chain */ ... }
RemoveHead(ld);
return res; /* accepted */
}

When a dictionary sets dictState.getnext (thesaurus-style multi-word matching), LexizeExec flips into multiword mode: it remembers the dictionary id, stashes a tentative result via setNewTmpRes, and recurses to feed following tokens to the same dictionary until it commits or backs out. The DictSubState struct is the channel for this handshake:

// DictSubState — src/include/tsearch/ts_public.h
typedef struct
{
bool isend; /* in: text end reached */
bool getnext; /* out: dict wants next lexeme */
void *private_state; /* dict's cross-call scratch */
} DictSubState;

ts_lexize(dictid, word) is the SQL-exposed single-token debugger that calls one dictionary’s lexize directly (handling the getnext two-call protocol) and returns the lexeme array as a text[]:

// ts_lexize — src/backend/tsearch/dict.c
res = (TSLexeme *) DatumGetPointer(FunctionCall4(&dict->lexize,
PointerGetDatum(dict->dictData),
PointerGetDatum(VARDATA_ANY(in)),
Int32GetDatum(VARSIZE_ANY_EXHDR(in)),
PointerGetDatum(&dstate)));
if (dstate.getnext) { dstate.isend = true; /* second call */ }

4. Matching: @@, TS_execute, and phrase distance (tsvector_op.c)

Section titled “4. Matching: @@, TS_execute, and phrase distance (tsvector_op.c)”

The @@ operator is ts_match_vq. It sets up a CHKVAL pointing into the tsvector’s sorted lexeme array and runs the ternary tree evaluator TS_execute with the checkcondition_str callback:

// ts_match_vq — src/backend/utils/adt/tsvector_op.c
chkval.arrb = ARRPTR(val);
chkval.arre = chkval.arrb + val->size;
chkval.values = STRPTR(val);
chkval.operand = GETOPERAND(query);
result = TS_execute(GETQUERY(query), &chkval,
TS_EXEC_EMPTY, checkcondition_str);
PG_RETURN_BOOL(result);

checkcondition_str answers “does this query operand occur in the vector?” by binary search over the sorted WordEntry array (the payoff for keeping the tsvector sorted), then checkclass_str checks any weight restriction and fills position data for phrase logic. Prefix operands (foo:*) scan the contiguous run of lexemes that share the prefix:

// checkcondition_str — src/backend/utils/adt/tsvector_op.c
while (StopLow < StopHigh)
{
StopMiddle = StopLow + (StopHigh - StopLow) / 2;
difference = tsCompareString(chkval->operand + val->distance, val->length,
chkval->values + StopMiddle->pos,
StopMiddle->len, false);
if (difference == 0) { res = checkclass_str(chkval, StopMiddle, val, data); break; }
else if (difference > 0) StopLow = StopMiddle + 1;
else StopHigh = StopMiddle;
}

TS_execute is a thin wrapper over TS_execute_recurse, which returns a ternary value (TS_YES / TS_NO / TS_MAYBE). MAYBE is what lets a position-blind caller (the GIN index) defer weight/phrase decisions to a recheck:

// TS_execute — src/backend/utils/adt/tsvector_op.c
bool
TS_execute(QueryItem *curitem, void *arg, uint32 flags,
TSExecuteCallback chkcond)
{
return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;
}

The OP_PHRASE case is special-cased to TS_phrase_execute, which walks the two operands’ position lists and keeps only matches whose distance equals the operator’s distance field — this is how quick <-> brown (distance 1) or a <3> b is enforced. TS_phrase_execute returns position sets so that nested phrase operators compose.

5. Ranking: ts_rank and cover density ts_rank_cd (tsrank.c)

Section titled “5. Ranking: ts_rank and cover density ts_rank_cd (tsrank.c)”

calc_rank is the standard ts_rank core. It dispatches on the top operator: AND/PHRASE queries use calc_rank_and (proximity-aware), everything else uses calc_rank_or (frequency-aware), then applies the selected normalization bits:

// calc_rank — src/backend/utils/adt/tsrank.c
res = (item->type == QI_OPR && (item->qoperator.oper == OP_AND ||
item->qoperator.oper == OP_PHRASE)) ?
calc_rank_and(w, t, q) :
calc_rank_or(w, t, q);
if (res < 0) res = 1e-20f;
if ((method & RANK_NORM_LOGLENGTH) && t->size > 0)
res /= log((double) (cnt_length(t) + 1)) / log(2.0);
if (method & RANK_NORM_LENGTH) { len = cnt_length(t); if (len > 0) res /= (float) len; }
if ((method & RANK_NORM_UNIQ) && t->size > 0) res /= (float) (t->size);
if (method & RANK_NORM_RDIVRPLUS1) res /= (res + 1);

calc_rank_and multiplies per-occurrence weights by a distance decay: the closer two query terms occur, the higher the contribution. The decay curve is word_distance, which falls off exponentially with separation and floors at 1e-30 beyond 100 lexemes apart:

// word_distance — src/backend/utils/adt/tsrank.c
static float4
word_distance(int32 w)
{
if (w > 100)
return 1e-30f;
return 1.0 / (1.005 + 0.05 * exp(((float4) w) / 1.5 - 2));
}
// calc_rank_and (core product) — src/backend/utils/adt/tsrank.c
dist = abs((int) WEP_GETPOS(post[l]) - (int) WEP_GETPOS(ct[p]));
if (dist || (dist == 0 && (pos[i] == POSNULL || pos[k] == POSNULL)))
{
if (!dist) dist = MAXENTRYPOS;
curw = sqrt(wpos(post[l]) * wpos(ct[p]) * word_distance(dist));
res = (res < 0) ? curw : 1.0 - (1.0 - res) * (1.0 - curw);
}

calc_rank_or is the disjunctive/frequency branch: it sums weighted occurrences with a 1/i^2 discount on the i-th occurrence (so additional hits add diminishing value, normalized against the pi^2/6 series limit):

// calc_rank_or — src/backend/utils/adt/tsrank.c
for (j = 0; j < dimt; j++)
{
resj = resj + wpos(post[j]) / ((j + 1) * (j + 1));
if (wpos(post[j]) > wjm) { wjm = wpos(post[j]); jm = j; }
}
res = res + (wjm + resj - wjm / ((jm + 1) * (jm + 1))) / 1.64493406685;

ts_rank_cd (cover density) is the proximity ranker. It builds a DocRepresentation (the query terms’ positions in the document) via get_docrep, then repeatedly calls Cover to find minimal covers — shortest spans of the document containing all query terms — and scores inversely to cover length, summing across all covers. Cover itself reuses the same TS_execute matcher against a QueryRepresentation:

// Cover — src/backend/utils/adt/tsrank.c
while (ptr - doc < len)
{
fillQueryRepresentationData(qr, ptr);
if (TS_execute(GETQUERY(qr->query), qr,
TS_EXEC_EMPTY, checkcondition_QueryOperand))
{
if (WEP_GETPOS(ptr->pos) > ext->q)
{ ext->q = WEP_GETPOS(ptr->pos); ext->end = ptr; found = true; }
break;
}
ptr++;
}

A GIN index on a tsvector column inverts it. gin_extract_tsvector returns one index key per lexeme — exactly the posting-list keys GIN stores against the row’s TID:

// gin_extract_tsvector — src/backend/utils/adt/tsginidx.c
*nentries = vector->size;
for (i = 0; i < vector->size; i++)
{
txt = cstring_to_text_with_len(STRPTR(vector) + we->pos, we->len);
entries[i] = PointerGetDatum(txt);
we++;
}

gin_extract_tsquery turns a query into the set of lexeme keys to look up, sets partialmatch for prefix operands, and chooses a search mode: a query with no required positive term (e.g. !foo) forces a full scan (GIN_SEARCH_MODE_ALL) because the index cannot satisfy a pure negation:

// gin_extract_tsquery — src/backend/utils/adt/tsginidx.c
if (tsquery_requires_match(item))
*searchMode = GIN_SEARCH_MODE_DEFAULT;
else
*searchMode = GIN_SEARCH_MODE_ALL;
/* ... emit one entry per QI_VAL, partialmatch[j] = val->prefix ... */

gin_tsquery_consistent is where lossy matching lives. GIN hands it a bool check[] array — one flag per query key, “is this lexeme present in the candidate row?” — and it runs TS_execute_ternary with the checkcondition_gin callback. Because the index knows presence but not weight or position, checkcondition_gin downgrades a present-and-weight- restricted match to GIN_MAYBE, and a TS_MAYBE result sets *recheck:

// gin_tsquery_consistent — src/backend/utils/adt/tsginidx.c
gcv.first_item = GETQUERY(query);
gcv.check = (GinTernaryValue *) check;
gcv.map_item_operand = (int *) (extra_data[0]);
switch (TS_execute_ternary(GETQUERY(query), &gcv,
TS_EXEC_PHRASE_NO_POS, checkcondition_gin))
{
case TS_NO: res = false; break;
case TS_YES: res = true; break;
case TS_MAYBE: res = true; *recheck = true; break;
}
// checkcondition_gin — src/backend/utils/adt/tsginidx.c
result = gcv->check[j];
if (result == GIN_TRUE)
{
if (val->weight != 0 || data != NULL) /* weight/position needed */
result = GIN_MAYBE; /* force a recheck */
}
return (TSTernaryValue) result;

When *recheck is set, the executor re-evaluates @@ on the heap tuple’s real tsvector (via ts_match_vq again), so phrase distance and weight filters are decided exactly even though the index probe was lossy.

flowchart TD
  q["WHERE body_tsv @@ to_tsquery('run & fast:A')"] --> ext["gin_extract_tsquery<br/>keys: run, fast"]
  ext --> probe["GIN posting-list probe<br/>per key to TID sets"]
  probe --> check["bool check[] per row<br/>(present? per key)"]
  check --> cons["gin_tsquery_consistent<br/>TS_execute_ternary"]
  cons --> yes["TS_YES to emit row"]
  cons --> maybe["TS_MAYBE (weight A unknown)<br/>set recheck"]
  maybe --> rc["heap fetch + ts_match_vq<br/>exact recheck"]
  yes --> rank["ts_rank / ts_rank_cd"]
  rc --> rank
  rank --> out["ordered results"]

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
to_tsvector_byidsrc/backend/tsearch/to_tsany.c243
to_tsvectorsrc/backend/tsearch/to_tsany.c270
make_tsvectorsrc/backend/tsearch/to_tsany.c165
uniqueWORDsrc/backend/tsearch/to_tsany.c77
pushval_morphsrc/backend/tsearch/to_tsany.c492
to_tsquery_byidsrc/backend/tsearch/to_tsany.c579
plainto_tsquery_byidsrc/backend/tsearch/to_tsany.c617
websearch_to_tsquery_byidsrc/backend/tsearch/to_tsany.c692
get_current_ts_configsrc/backend/tsearch/to_tsany.c47
parsetextsrc/backend/tsearch/ts_parse.c355
LexizeExecsrc/backend/tsearch/ts_parse.c173
hlparsetextsrc/backend/tsearch/ts_parse.c540
generateHeadlinesrc/backend/tsearch/ts_parse.c607
ts_lexizesrc/backend/tsearch/dict.c27
dsimple_lexizesrc/backend/tsearch/dict_simple.c76
dsynonym_lexizesrc/backend/tsearch/dict_synonym.c212
tok_aliassrc/backend/tsearch/wparser_def.c62
actionTPS_InAsciiWordsrc/backend/tsearch/wparser_def.c999
ts_match_vqsrc/backend/utils/adt/tsvector_op.c2187
checkcondition_strsrc/backend/utils/adt/tsvector_op.c1268
TS_executesrc/backend/utils/adt/tsvector_op.c1827
TS_phrase_executesrc/backend/utils/adt/tsvector_op.c1582
tsvector_bsearchsrc/backend/utils/adt/tsvector_op.c395
calc_ranksrc/backend/utils/adt/tsrank.c358
calc_rank_andsrc/backend/utils/adt/tsrank.c201
calc_rank_orsrc/backend/utils/adt/tsrank.c284
word_distancesrc/backend/utils/adt/tsrank.c45
calc_rank_cdsrc/backend/utils/adt/tsrank.c855
Coversrc/backend/utils/adt/tsrank.c651
get_docrepsrc/backend/utils/adt/tsrank.c732
gin_extract_tsvectorsrc/backend/utils/adt/tsginidx.c64
gin_extract_tsquerysrc/backend/utils/adt/tsginidx.c94
checkcondition_ginsrc/backend/utils/adt/tsginidx.c183
gin_tsquery_consistentsrc/backend/utils/adt/tsginidx.c214
getTSCurrentConfigsrc/backend/utils/cache/ts_cache.c556
lookup_ts_config_cachesrc/backend/utils/cache/ts_cache.c385

All symbols, code excerpts, and constants above were read directly from the REL_18 working tree at commit 273fe94852b (2026-06-05) under /data/hgryoo/references/postgres. Verification notes and caveats:

  • Commit / branch. git log -1 reports 273fe94852b 2026-06-05. All excerpts are condensed (marked with /* ... */) but otherwise verbatim; every leading // symbol — path comment names a function that exists at the cited line in the position-hint table.

  • Copyright banners read 2025. Every file header still says “Portions Copyright (c) 1996-2025”; this is the in-tree banner on the REL_18 branch and is not a contradiction of the 2026-dated commit — the banner year lags the actual commit date. Noted so a future reader does not mistake it for a stale checkout.

  • tsvector / tsquery on-disk layout. WordEntry is asserted to be 4 bytes (sizeof(WordEntry) == 4) by tsvectorsend/recv; the bitfield widths (len:11, pos:20, haspos:1) and the MAXENTRYPOS (1<<14) / MAXNUMPOS (256) constants are from src/include/utils/tsvector.h as read. The operator codes OP_NOT/AND/OR/PHRASE = 1/2/3/4 are from src/include/tsearch/ts_type.h.

  • Ternary matching. TS_execute returns bool (collapsing TS_MAYBE→true), while TS_execute_ternary (used by the GIN consistent function) preserves TS_MAYBE. This split is the mechanism behind the index recheck path and was confirmed in both tsvector_op.c and tsginidx.c.

  • Scope boundary. The headline subsystem (hlparsetext, generateHeadline, ts_headline) and the statistics path (ts_typanalyze.c, ts_selfuncs.c) are in-tree and named here for completeness, but their internals are out of scope for this doc; the full spell.c ispell engine and dict_thesaurus.c are likewise only sketched. The GIN access method itself (posting-tree layout, fast insert, page splits) is deferred to postgres-gin.md. The varlena / TOAST mechanics of how a large tsvector is stored are deferred to postgres-datatypes-adt.md and postgres-toast.md.

  • contrib/ is out of scope. pg_trgm, unaccent, and the dict_int / dict_xsyn example dictionaries live under contrib/ and are named only as examples; nothing in the walkthrough asserts facts about them. The Snowball stemmer dictionaries (snowball/) ship in the core tree but their per-language stemming tables are not examined here.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

PostgreSQL’s full-text search is a good-enough, in-engine IR subsystem deliberately scoped below a dedicated search engine. Placing it in the landscape of comparative designs clarifies both what it buys and what it forgoes.

Lucene / Elasticsearch — the segment-based inverted index

Section titled “Lucene / Elasticsearch — the segment-based inverted index”

Lucene is the reference point for “serious” text search. Its differences from PostgreSQL’s model are instructive rather than merely quantitative:

  • Where the vector lives. Lucene never materializes a per-document vector as a user-visible value; the term→postings map exists only inside immutable on-disk segments. PostgreSQL makes the tsvector a first-class column you can SELECT, store generated, and debug. The cost is duplication (the column and the GIN index both hold the lexemes); the benefit is transparency and reuse of the ordinary MVCC/heap machinery.
  • Scoring. Lucene defaults to BM25 (Robertson & Spärck Jones’ Okapi family), a saturating tf-idf with document-length normalization whose k1/b parameters are tunable per field. PostgreSQL’s ts_rank and ts_rank_cd (the cover-density scheme of Clarke, Cormack & Tudhope) are position/proximity rankers that do not use corpus-wide inverse document frequency — the @@ index path has no global term-frequency statistics to draw on, so a term’s rarity across the table does not raise its score. This is the single biggest semantic gap: PostgreSQL ranks within a document, not relative to the corpus.
  • Segment merges vs. GIN pending list. Lucene amortizes index updates by writing new segments and background-merging them. GIN amortizes via the fast-update pending list (a heap of un-merged entries flushed in bulk), covered in postgres-gin.md. Both trade read-time merge cost for write throughput; neither does in-place posting-list mutation.

Oracle Text, SQL Server, MySQL — in-engine peers

Section titled “Oracle Text, SQL Server, MySQL — in-engine peers”

Oracle Text (CONTEXT index, CONTAINS() operator) is architecturally the closest peer: a lexer + stoplist + stemmer pipeline feeding an inverted index, with SCORE() for relevance. SQL Server’s FULLTEXT indexes and CONTAINS/FREETEXT, and MySQL/InnoDB’s FULLTEXT with MATCH … AGAINST (which does expose a BM25-like natural-language mode), occupy the same niche: text search welded into the relational engine so it stays transactional and joins against ordinary columns. PostgreSQL’s distinctive move is the degree of composition — the whole subsystem is expressed in public extensibility points (a custom type pair, a custom operator, a GIN opclass, pluggable parser/dictionary FMGR procedures) rather than a private engine, so a user can register a new dictionary (CREATE TEXT SEARCH DICTIONARY) without touching core C.

Two structural limits define the research/extension frontier for the in-core design:

  1. No corpus statistics on the index path. Because gin_tsquery_consistent only sees a per-row bool check[] (presence), and ts_rank only sees one document’s tsvector, true tf-idf/BM25 ranking would require either a maintained per-lexeme document-frequency catalog or a different access method. pg_search/ParadeDB and the rum extension (a GIN cousin that stores positions and lexeme weights inside the posting lists) attack exactly this: rum lets distance- and rank-ordered scans avoid the heap recheck and avoid a separate sort, which the stock GIN path cannot.
  2. Lossy recheck is mandatory for weight/phrase queries. As Section 6 showed, checkcondition_gin downgrades any weight- or position-sensitive operand to GIN_MAYBE, forcing a heap fetch and a second ts_match_vq. For phrase-heavy or weight-filtered workloads this can dominate cost. The rum access method’s in-posting-list positions are the community answer; getting equivalent capability into core GIN remains open.

Vector / semantic search — an orthogonal axis

Section titled “Vector / semantic search — an orthogonal axis”

The post-2023 shift toward dense-vector semantic search (embeddings + approximate nearest neighbor, e.g. pgvector’s HNSW/IVFFlat indexes) is a different retrieval model, not a replacement: it ranks by embedding cosine similarity rather than lexeme overlap, and answers “semantically similar” rather than “contains these words.” Production systems increasingly run hybrid retrieval — lexical (tsvector/@@) and vector ANN combined via reciprocal-rank fusion. PostgreSQL can host both side by side (a tsvector GIN column and a vector HNSW column on the same table), which is itself an argument for the in-engine, compose-from-types philosophy: the two index worlds share transactions, joins, and WHERE filtering for free. pgvector internals are out of scope here and noted only as the complementary axis.

The design is a textbook instance of “embed an 80%-solution that composes with the relational core” rather than “bolt on a best-of-breed external engine.” For transactional, moderate-scale, language-aware containment + proximity ranking against rows you are already storing and joining, the tsvector/tsquery/GIN triad is hard to beat on operational simplicity. For corpus-relative relevance (BM25), faceting, distributed sharding, or billion-document scale, the comparative literature points toward Lucene-class engines or the rum/ParadeDB extensions — and, increasingly, toward hybrid lexical+vector pipelines.

PostgreSQL source tree (REL_18_STABLE, commit 273fe94852b, read 2026-06-05 under /data/hgryoo/references/postgres):

  • src/backend/tsearch/to_tsany.cto_tsvector_byid, make_tsvector, uniqueWORD, pushval_morph, the to_tsquery / plainto_tsquery / phraseto_tsquery / websearch_to_tsquery family, get_current_ts_config.
  • src/backend/tsearch/ts_parse.cparsetext, LexizeExec, LexizeAddLemm, the multiword DictSubState handshake, headline path (hlparsetext, generateHeadline, named only).
  • src/backend/tsearch/dict.cts_lexize SQL debugger and the dictionary FMGR plumbing.
  • src/backend/tsearch/dict_simple.c, dict_synonym.cdsimple_lexize, dsynonym_lexize (the lexize plugin contract: NULL / empty / lexemes, TSL_FILTER / TSL_ADDPOS).
  • src/backend/tsearch/wparser_def.c — default parser FSM: tok_alias, actionTPS_InAsciiWord, the 23 token types.
  • src/backend/utils/adt/tsvector_op.cts_match_vq (@@), checkcondition_str, TS_execute / TS_execute_recurse, TS_phrase_execute, tsvector_bsearch.
  • src/backend/utils/adt/tsrank.ccalc_rank, calc_rank_and, calc_rank_or, word_distance, calc_rank_cd, Cover, get_docrep.
  • src/backend/utils/adt/tsginidx.cgin_extract_tsvector, gin_extract_tsquery, checkcondition_gin, gin_tsquery_consistent.
  • src/backend/utils/cache/ts_cache.cgetTSCurrentConfig, lookup_ts_config_cache.
  • src/include/utils/tsvector.hWordEntry bitfield, MAXENTRYPOS, MAXNUMPOS, LIMITPOS. src/include/tsearch/ts_type.h — operator codes. src/include/tsearch/ts_public.hTSLexeme flags, DictSubState.

Sibling KB analyses (cross-referenced, not duplicated):

  • knowledge/code-analysis/postgres/postgres-gin.md — GIN access method internals: posting trees, the fast-update pending list, page splits.
  • knowledge/code-analysis/postgres/postgres-datatypes-adt.md and postgres-toast.md — varlena layout and TOAST storage of large tsvector values.
  • knowledge/code-analysis/postgres/postgres-collation-providers.md, postgres-encoding.md — the i18n-text neighbors that own case-folding/collation context used by dsimple_lexize.

Theory / IR literature:

  • Salton & McGill, Introduction to Modern Information Retrieval (1983) — inverted index, vector space model.
  • Zobel & Moffat, “Inverted Files for Text Search Engines”, ACM Computing Surveys 38(2), 2006 — posting-list storage and query evaluation.
  • Robertson & Spärck Jones — the Okapi BM25 relevance model (the corpus- relative scheme PostgreSQL’s ts_rank deliberately does not implement).
  • Clarke, Cormack & Tudhope, “Relevance Ranking for One- to Three-Term Queries”, Information Processing & Management 36(2), 2000 — the cover-density model behind ts_rank_cd.
  • Database System Concepts (Silberschatz, Korth, Sudarshan) — the information-retrieval chapter. Database Internals (Petrov) — inverted indexes as a secondary-index family. Both captured under knowledge/research/dbms-general/.