PostgreSQL Full-Text Search — tsvector, tsquery, Dictionaries, and GIN
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Full-text search is the problem of finding documents that contain
given words, as opposed to documents whose column value equals a string.
The two problems look superficially similar but pull a database engine in
opposite directions. Equality is exact, order-preserving, and B-tree
friendly: WHERE title = 'foo' is one descent of a balanced tree. “Contains
the word running, in any inflection, ignoring case and stop words, and
rank the matches by how prominently the term appears” is a different beast.
It demands three machinery layers that ordinary scalar types do not have:
-
Tokenization and normalization. Raw text is an unstructured byte stream. To search it you must first cut it into tokens (words, numbers, URLs, email addresses, host names) and then fold each token to a canonical lexeme — lowercasing, stripping inflectional suffixes (running → run), dropping high-frequency stop words (the, a, and) that carry no discriminating power, and optionally expanding synonyms or thesaurus phrases. This is the linguistics-aware step; it is inherently language-specific, which is why the i18n subsystem owns it.
-
An invertible document representation. Once normalized, a document becomes a bag (or sequence) of lexemes. The canonical representation is the inverted index of classical information retrieval (Salton & McGill, Introduction to Modern Information Retrieval, 1983; Zobel & Moffat, “Inverted Files for Text Search Engines”, ACM Computing Surveys, 2006): a mapping from each distinct lexeme to the list of documents (and positions within them) where it occurs. The per-document half of this — the sorted set of a single document’s lexemes with their positions — is a first-class value in PostgreSQL: the
tsvector. -
A query language and a ranking function. A search request is not a single value but a boolean expression over lexemes:
run & (fast | quick) & !slow, possibly with phrase constraints (quick <-> brownmeans “quick immediately followed by brown”). Matching is a tree evaluation, not a comparison. And because many documents match, the engine must rank them — assign a relevance score so the best results surface first. Classical IR ranks by term frequency × inverse document frequency (tf-idf) and its descendants (Okapi BM25, Robertson & Spärck Jones). PostgreSQL ships a simpler, position-aware scheme based on term weights and cover density (the proximity-clustering idea of Clarke, Cormack & Tudhope, “Relevance Ranking for One- to Three-Term Queries”, Information Processing & Management, 2000).
Database System Concepts (Silberschatz et al.) treats information
retrieval as a distinct chapter precisely because the relevance-ranked,
keyword-over-unstructured-text model sits outside the relational
exact-match core. Database Internals (Petrov) frames the inverted index
as one member of the secondary-index family whose posting lists demand
different storage and merge strategies than B-tree leaf chains. PostgreSQL’s
answer threads all three layers into the type system: the linguistic layer
is the configuration (regconfig), the document representation is the
tsvector type, the query is the tsquery type, matching is the @@
operator, and the inverted index is a GIN index over tsvector.
The architectural insight that makes this fit a relational engine cleanly:
PostgreSQL does not build a monolithic search engine bolted onto the
side. It exposes tsvector and tsquery as ordinary user-visible
datatypes with ordinary operators, makes to_tsvector(text) an ordinary
(immutable, per-config) function you can put in an expression index, and
reuses the generalized inverted index (GIN) access method — the same one
that indexes array containment and JSONB — to accelerate @@. Full-text
search is thus a composition of existing extensibility points (custom
type, custom operator, custom GIN opclass, custom parser/dictionary
plugins) rather than a special case in the core executor.
Common DBMS Design
Section titled “Common DBMS Design”Almost every engine that offers text search converges on a similar pipeline; the variation is in where each stage lives and how pluggable it is. Naming the shared shape first makes PostgreSQL’s specific choices read as selections in a known design space.
The analyzer pipeline: tokenize → filter → normalize
Section titled “The analyzer pipeline: tokenize → filter → normalize”Lucene/Elasticsearch call it an analyzer; Oracle Text calls it a lexer + stoplist + stemmer; PostgreSQL calls it a parser + dictionaries. The shape is identical: a tokenizer cuts the byte stream into typed tokens, then a filter chain transforms each token — lowercasing, stop-word removal, stemming, synonym expansion — emitting zero or more normalized terms. “Zero” matters: a stop word produces no term but must still consume a position so that phrase queries downstream measure distance correctly. Engines differ in whether the filter chain is per-token-type (PostgreSQL maps each token type to its own dictionary list) or global (Lucene applies one analyzer to the whole field).
The document vector and the inverted index are two views of one thing
Section titled “The document vector and the inverted index are two views of one thing”Every engine stores, per document, a compact sorted structure of
(lexeme → positions). PostgreSQL materializes this as the tsvector value
stored in a table column; Lucene stores it only inside the index segment.
Making the per-document vector a first-class column has consequences:
you can SELECT it, debug it, store a precomputed to_tsvector(body) in a
generated column, and the GIN index simply inverts that column. The
inverted index maps each distinct lexeme to a posting list of row pointers
(TIDs in PostgreSQL). A query term lookup is a single index probe returning
a posting list; a boolean query intersects/unions posting lists.
Boolean query tree + relevance ranking are separate phases
Section titled “Boolean query tree + relevance ranking are separate phases”Matching answers which documents qualify; ranking answers how good each
is. These are deliberately decoupled: the index supplies a candidate set
cheaply (often lossily — it may include false positives that a recheck
removes), and ranking runs only on survivors, reading the full document
vector to compute proximity- and frequency-aware scores. Decoupling lets
the index be small (lexeme presence, not full position math) while ranking
stays precise (it re-reads positions from the stored tsvector).
Lossy index + recheck
Section titled “Lossy index + recheck”A crucial pattern: the index is allowed to return maybe-matches. Weight
restrictions (run:A = “run, but only where it carries weight A”) and
phrase distance (<->) cannot always be decided from posting-list
membership alone, because a GIN posting list records which rows contain a
lexeme, not at which weighted position. So the index returns TRUE or
MAYBE, and a recheck flag tells the executor to re-evaluate @@ on the
actual heap tuple. This ternary logic (yes / no / maybe) is the linchpin
that lets a presence-only index serve position-and-weight-sensitive
queries.
flowchart LR doc["raw document text"] --> P["text parser<br/>(tokenize)"] P --> T1["token: asciiword 'Running'"] P --> T2["token: blank ' '"] P --> T3["token: asciiword 'fast'"] T1 --> D["dictionary chain<br/>(lexize: lowercase,<br/>stopword, stem)"] T3 --> D D --> V["tsvector<br/>'fast':2 'run':1"] V --> G["GIN inverted index<br/>(lexeme to row TIDs)"] q["query text"] --> QP["to_tsquery<br/>(parse + lexize)"] QP --> Q["tsquery<br/>'run' & 'fast'"] Q --> G G --> C["candidate rows<br/>(TRUE / MAYBE)"] C --> R["@@ recheck +<br/>ts_rank scoring"] R --> out["ranked results"]
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s full-text search is assembled from five cooperating pieces,
each a separate compilation unit under src/backend/tsearch/ (the
linguistic engine) and src/backend/utils/adt/ (the type operators):
-
The text parser (
wparser_def.c) — a hand-written finite-state machine that classifies the input into 23 token types (asciiword,word,numword,email,url,host,version,tag, …). The default parser is the one registered aspg_catalog.default. -
Dictionaries (
dict.c,dict_simple.c,dict_synonym.c,dict_thesaurus.c,dict_ispell.c, plusspell.c) — each a plugin exposing aninitand alexizefunction. A dictionary maps a token to zero, one, or many output lexemes, or declines (returns NULL) to let the next dictionary in the chain try. -
The lexize driver (
ts_parse.c) —parsetext()runs the parser, and for each token walks the configuration’s per-token-type dictionary list, calling each dictionary’slexizeuntil one accepts. The statefulLexizeDatamachine also supports multi-word dictionaries (thesaurus) that need to peek at following tokens. -
The datatypes and operators (
to_tsany.c,tsvector_op.c,tsrank.c) —to_tsvector/to_tsquerybuild thetsvector/tsqueryvalues;ts_match_vqimplements@@;ts_rank/ts_rank_cdscore matches. -
The GIN bridge (
tsginidx.c) —gin_extract_tsvector,gin_extract_tsquery, andgin_tsquery_consistentconnect@@to the inverted index.
The tsvector: a sorted lexeme + position store
Section titled “The tsvector: a sorted lexeme + position store”A tsvector is a varlena blob: a count, an array of WordEntry headers
(offset + length + a haspos bit), then a packed string region holding the
lexeme bytes and, optionally, position arrays. The header layout is fixed
at 4 bytes per WordEntry:
// WordEntry — src/include/utils/tsvector.htypedef struct{ uint32 haspos:1, len:11, /* MAX 2Kb */ pos:20; /* MAX 1Mb */} WordEntry;Positions are 16-bit, with the high 2 bits reserved for a weight
(A/B/C/D) and the low 14 bits for the position ordinal — hence
MAXENTRYPOS = (1<<14) and MAXNUMPOS = 256 positions per lexeme:
// position limits — src/include/utils/tsvector.h#define MAXENTRYPOS (1<<14)#define MAXNUMPOS (256)#define LIMITPOS(x) ( ( (x) >= MAXENTRYPOS ) ? (MAXENTRYPOS-1) : (x) )The lexeme array is kept sorted by (length, bytes) so that @@
matching and ranking can binary-search it. That sort and the dedup happen
in make_tsvector → uniqueWORD, which collapses repeated lexemes and
merges their position lists:
// make_tsvector — src/backend/tsearch/to_tsany.cTSVectormake_tsvector(ParsedText *prs){ /* Merge duplicate words */ if (prs->curwords > 0) prs->curwords = uniqueWORD(prs->words, prs->curwords); /* ... compute lenstr, palloc0(totallen), SET_VARSIZE ... */ in->size = prs->curwords; /* ... copy each lexeme's bytes + position array into the blob ... */ return in;}uniqueWORD first qsorts by the compareWORD comparator (which calls
tsCompareString), then walks the run merging equal words and appending
distinct positions, capped at MAXNUMPOS:
// uniqueWORD — src/backend/tsearch/to_tsany.cqsort(a, l, sizeof(ParsedWord), compareWORD);/* ... */while (ptr - a < l){ if (!(ptr->len == res->len && strncmp(ptr->word, res->word, res->len) == 0)) { /* Got a new word, so put it in result */ res++; /* ... start a fresh position array ... */ } else { /* same word: append position if within MAXNUMPOS and distinct */ if (res->pos.apos[0] < MAXNUMPOS - 1 && ...) res->pos.apos[res->pos.apos[0] + 1] = LIMITPOS(ptr->pos.pos); } ptr++;}to_tsvector: drive the parser, lexize, materialize
Section titled “to_tsvector: drive the parser, lexize, materialize”to_tsvector_byid is the workhorse. It estimates the word count, allocates
a ParsedText, calls parsetext() to fill it (tokenize + lexize), then
make_tsvector to pack the sorted blob:
// to_tsvector_byid — src/backend/tsearch/to_tsany.cprs.lenwords = VARSIZE_ANY_EXHDR(in) / 6; /* estimate word count */if (prs.lenwords < 2) prs.lenwords = 2;prs.curwords = 0;prs.pos = 0;prs.words = (ParsedWord *) palloc(sizeof(ParsedWord) * prs.lenwords);
parsetext(cfgId, &prs, VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));out = make_tsvector(&prs);PG_RETURN_TSVECTOR(out);The zero-argument to_tsvector(text) simply resolves the current
configuration (the default_text_search_config GUC) and forwards:
// to_tsvector — src/backend/tsearch/to_tsany.ccfgId = getTSCurrentConfig(true);PG_RETURN_DATUM(DirectFunctionCall2(to_tsvector_byid, ObjectIdGetDatum(cfgId), PointerGetDatum(in)));The tsquery: an operator tree
Section titled “The tsquery: an operator tree”A tsquery is a flattened postfix tree of QueryItems — either a
QI_VAL operand (a lexeme, with optional prefix/weight flags) or a
QI_OPR operator. The operators are OP_NOT, OP_AND, OP_OR, and
OP_PHRASE (the <-> / <N> distance operator), the last being the
highest-priority code:
// operator codes — src/include/tsearch/ts_type.h#define OP_NOT 1#define OP_AND 2#define OP_OR 3#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */to_tsquery is more subtle than to_tsvector: each query word must also
be run through the dictionary chain (so to_tsquery('Running') matches the
run stored in the vector), but a single word may lexize into several
variant lexemes. The callback pushval_morph handles this: words from the
same morphological variant are AND-ed, different variants are OR-ed, and
stop words removed mid-phrase leave placeholder positions so phrase
distances stay correct:
// pushval_morph — src/backend/tsearch/to_tsany.cparsetext(data->cfg_id, &prs, strval, lenval);/* ... for each output position ... *//* Push all words belonging to the same variant */pushValue(state, prs.words[count].word, prs.words[count].len, weight, ((prs.words[count].flags & TSL_PREFIX) || prefix));if (cnt) pushOperator(state, OP_AND, 0); /* same variant: AND *//* ... */if (cntvar) pushOperator(state, OP_OR, 0); /* different variant: OR */The several public entry points differ only in the qoperator they hand
to pushval_morph and the parse flags they pass to parse_tsquery:
to_tsquery_byid— full operator syntax,OP_PHRASEfor adjacent morphs.plainto_tsquery_byid—P_TSQ_PLAIN,OP_AND: treats the whole input as plain words AND-ed together.phraseto_tsquery_byid—P_TSQ_PLAIN,OP_PHRASE: words become a phrase.websearch_to_tsquery_byid—P_TSQ_WEB: Google-style syntax (quoted phrases,or,-negation).
// plainto_tsquery_byid — src/backend/tsearch/to_tsany.cdata.qoperator = OP_AND;query = parse_tsquery(text_to_cstring(in), pushval_morph, PointerGetDatum(&data), P_TSQ_PLAIN, NULL);Configuration resolution
Section titled “Configuration resolution”Every conversion is parameterized by a text search configuration
(regconfig), which bundles a parser with a per-token-type dictionary map.
getTSCurrentConfig reads the session default; lookup_ts_config_cache
caches the resolved TSConfigCacheEntry (its map field is the
token-type → dictionary-list array consumed by LexizeExec). The thin
get_current_ts_config SQL function just exposes the former:
// get_current_ts_config — src/backend/tsearch/to_tsany.cDatumget_current_ts_config(PG_FUNCTION_ARGS){ PG_RETURN_OID(getTSCurrentConfig(true));}flowchart TD txt["to_tsvector(cfg, text)"] --> pt["parsetext()"] pt --> cfg["lookup_ts_config_cache(cfgId)"] pt --> prs["lookup_ts_parser_cache(prsId)"] prs --> tok["prsstart / prstoken / prsend<br/>FSM emits typed tokens"] tok --> lz["LexizeAddLemm + LexizeExec"] lz --> map["cfg->map[token.type]<br/>dictionary list"] map --> dic["each dict->lexize until accept"] dic --> pw["ParsedWord[] (word, pos, flags)"] pw --> mv["make_tsvector: uniqueWORD + pack"] mv --> tsv["tsvector value"]
Source Walkthrough
Section titled “Source Walkthrough”This section walks the call flow in execution order — parser → dictionaries
→ lexize driver → datatype build → match → rank → index — anchoring on
stable symbol names. The position-hint table at the end pairs each symbol
with its (file, line) as observed at the updated: revision.
1. The text parser FSM (wparser_def.c)
Section titled “1. The text parser FSM (wparser_def.c)”The default parser is a table-driven finite-state machine. Token types are
small integers whose human names live in tok_alias[]; type 0
(""/blank) is special and is skipped by the lexize driver. The 23
recognized types are:
// tok_alias — src/backend/tsearch/wparser_def.cstatic const char *const tok_alias[] = { "", "asciiword", "word", "numword", "email", "url", "host", "sfloat", "version", "hword_numpart", "hword_part", "hword_asciipart", "blank", "tag", "protocol", "numhword", "asciihword","hword", "url_path", "file", "float", "int", "uint", "entity"};The FSM is expressed as arrays of TParserStateActionItem — for each state,
a list of (character-class predicate, flags, action, next-state, emit-type)
rows. For example, once inside an ASCII word the machine emits an
ASCIIWORD token (A_BINGO) when it hits a non-word character or EOF:
// actionTPS_InAsciiWord — src/backend/tsearch/wparser_def.cstatic const TParserStateActionItem actionTPS_InAsciiWord[] = { {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL}, /* ... transitions to hword/url/host/email states ... */ {NULL, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL}};The parser exposes three function-manager callbacks — prsstart,
prstoken, prsend — invoked by the lexize driver below; it does not know
about dictionaries or configurations.
2. Dictionaries: the lexize plugin contract (dict_simple.c, dict_synonym.c)
Section titled “2. Dictionaries: the lexize plugin contract (dict_simple.c, dict_synonym.c)”A dictionary is a pair of FunctionCall-able procedures. init parses the
CREATE TEXT SEARCH DICTIONARY options into an opaque state blob; lexize
receives (dictData, lemmaText, len, DictSubState*) and returns a
NULL-terminated TSLexeme array. The output array encodes three distinct
outcomes:
- return NULL — “I don’t recognize this token” → driver tries the next dictionary in the list.
- return a zero-length (empty first lexeme) array — “this is a stop word” → the token is consumed, produces no lexeme, but advances position.
- return one-or-more lexemes — accepted and normalized.
dsimple_lexize is the canonical illustration: lowercase, then either
reject as stop word (empty array), accept (one lexeme), or decline (NULL)
depending on the accept option:
// dsimple_lexize — src/backend/tsearch/dict_simple.ctxt = str_tolower(in, len, DEFAULT_COLLATION_OID);if (*txt == '\0' || searchstoplist(&(d->stoplist), txt)){ pfree(txt); res = palloc0(sizeof(TSLexeme) * 2); /* empty array = stop word */ PG_RETURN_POINTER(res);}else if (d->accept){ res = palloc0(sizeof(TSLexeme) * 2); res[0].lexeme = txt; /* one lexeme = accept */ PG_RETURN_POINTER(res);}else PG_RETURN_POINTER(NULL); /* NULL = decline */Each TSLexeme carries flag bits that steer the driver:
// TSLexeme flags — src/include/tsearch/ts_public.h#define TSL_ADDPOS 0x01 /* this lexeme advances the position counter */#define TSL_PREFIX 0x02 /* lexeme is a prefix pattern (foo:*) */#define TSL_FILTER 0x04 /* re-feed lexeme to the rest of the chain */TSL_FILTER is what makes filtering dictionaries (e.g. unaccent)
composable: a filter dictionary rewrites the token and hands the result
back to the same position in the dictionary list, so a later stemmer
still runs. The synonym dictionary (dsynonym_lexize) instead does a
direct table lookup and may set TSL_ADDPOS for multi-word replacements.
3. The lexize driver: parsetext and LexizeExec (ts_parse.c)
Section titled “3. The lexize driver: parsetext and LexizeExec (ts_parse.c)”parsetext is the loop that ties parser to dictionaries. It looks up the
config and parser caches, starts the parser, then repeatedly pulls a token
(prstoken), feeds it to the lexize state machine, and drains every
resulting lexeme into prs->words[], bumping prs->pos once per token and
again per TSL_ADDPOS:
// parsetext — src/backend/tsearch/ts_parse.cdo { type = DatumGetInt32(FunctionCall3(&(prsobj->prstoken), PointerGetDatum(prsdata), PointerGetDatum(&lemm), PointerGetDatum(&lenlemm))); /* ... IGNORE_LONGLEXEME guard for words >= MAXSTRLEN ... */ LexizeAddLemm(&ldata, type, lemm, lenlemm);
while ((norms = LexizeExec(&ldata, NULL)) != NULL) { prs->pos++; /* one position per token */ while (ptr->lexeme) { if (ptr->flags & TSL_ADDPOS) prs->pos++; prs->words[prs->curwords].word = ptr->lexeme; prs->words[prs->curwords].pos.pos = LIMITPOS(prs->pos); ptr++; prs->curwords++; } }} while (type > 0);LexizeExec is the heart of the chain logic. In its normal mode it walks
the dictionary list cfg->map[type], calling each lexize. A token type
with an empty map (or type 0) is skipped entirely:
// LexizeExec (normal mode) — src/backend/tsearch/ts_parse.cmap = ld->cfg->map + curVal->type;if (curVal->type == 0 || curVal->type >= ld->cfg->lenmap || map->len == 0){ RemoveHead(ld); /* skip this token type */ continue;}for (i = ld->posDict; i < map->len; i++){ dict = lookup_ts_dictionary_cache(map->dictIds[i]); res = (TSLexeme *) DatumGetPointer(FunctionCall4(&(dict->lexize), ...)); if (ld->dictState.getnext) { /* go multiword mode */ ... } if (!res) continue; /* dict declined: next dict */ if (res->flags & TSL_FILTER) { /* rewrite & re-feed same chain */ ... } RemoveHead(ld); return res; /* accepted */}When a dictionary sets dictState.getnext (thesaurus-style multi-word
matching), LexizeExec flips into multiword mode: it remembers the
dictionary id, stashes a tentative result via setNewTmpRes, and recurses
to feed following tokens to the same dictionary until it commits or backs
out. The DictSubState struct is the channel for this handshake:
// DictSubState — src/include/tsearch/ts_public.htypedef struct{ bool isend; /* in: text end reached */ bool getnext; /* out: dict wants next lexeme */ void *private_state; /* dict's cross-call scratch */} DictSubState;ts_lexize(dictid, word) is the SQL-exposed single-token debugger that
calls one dictionary’s lexize directly (handling the getnext
two-call protocol) and returns the lexeme array as a text[]:
// ts_lexize — src/backend/tsearch/dict.cres = (TSLexeme *) DatumGetPointer(FunctionCall4(&dict->lexize, PointerGetDatum(dict->dictData), PointerGetDatum(VARDATA_ANY(in)), Int32GetDatum(VARSIZE_ANY_EXHDR(in)), PointerGetDatum(&dstate)));if (dstate.getnext) { dstate.isend = true; /* second call */ }4. Matching: @@, TS_execute, and phrase distance (tsvector_op.c)
Section titled “4. Matching: @@, TS_execute, and phrase distance (tsvector_op.c)”The @@ operator is ts_match_vq. It sets up a CHKVAL pointing into the
tsvector’s sorted lexeme array and runs the ternary tree evaluator
TS_execute with the checkcondition_str callback:
// ts_match_vq — src/backend/utils/adt/tsvector_op.cchkval.arrb = ARRPTR(val);chkval.arre = chkval.arrb + val->size;chkval.values = STRPTR(val);chkval.operand = GETOPERAND(query);result = TS_execute(GETQUERY(query), &chkval, TS_EXEC_EMPTY, checkcondition_str);PG_RETURN_BOOL(result);checkcondition_str answers “does this query operand occur in the vector?”
by binary search over the sorted WordEntry array (the payoff for
keeping the tsvector sorted), then checkclass_str checks any weight
restriction and fills position data for phrase logic. Prefix operands
(foo:*) scan the contiguous run of lexemes that share the prefix:
// checkcondition_str — src/backend/utils/adt/tsvector_op.cwhile (StopLow < StopHigh){ StopMiddle = StopLow + (StopHigh - StopLow) / 2; difference = tsCompareString(chkval->operand + val->distance, val->length, chkval->values + StopMiddle->pos, StopMiddle->len, false); if (difference == 0) { res = checkclass_str(chkval, StopMiddle, val, data); break; } else if (difference > 0) StopLow = StopMiddle + 1; else StopHigh = StopMiddle;}TS_execute is a thin wrapper over TS_execute_recurse, which returns a
ternary value (TS_YES / TS_NO / TS_MAYBE). MAYBE is what lets a
position-blind caller (the GIN index) defer weight/phrase decisions to a
recheck:
// TS_execute — src/backend/utils/adt/tsvector_op.cboolTS_execute(QueryItem *curitem, void *arg, uint32 flags, TSExecuteCallback chkcond){ return TS_execute_recurse(curitem, arg, flags, chkcond) != TS_NO;}The OP_PHRASE case is special-cased to TS_phrase_execute, which walks
the two operands’ position lists and keeps only matches whose distance
equals the operator’s distance field — this is how quick <-> brown
(distance 1) or a <3> b is enforced. TS_phrase_execute returns position
sets so that nested phrase operators compose.
5. Ranking: ts_rank and cover density ts_rank_cd (tsrank.c)
Section titled “5. Ranking: ts_rank and cover density ts_rank_cd (tsrank.c)”calc_rank is the standard ts_rank core. It dispatches on the top
operator: AND/PHRASE queries use calc_rank_and (proximity-aware),
everything else uses calc_rank_or (frequency-aware), then applies the
selected normalization bits:
// calc_rank — src/backend/utils/adt/tsrank.cres = (item->type == QI_OPR && (item->qoperator.oper == OP_AND || item->qoperator.oper == OP_PHRASE)) ? calc_rank_and(w, t, q) : calc_rank_or(w, t, q);if (res < 0) res = 1e-20f;if ((method & RANK_NORM_LOGLENGTH) && t->size > 0) res /= log((double) (cnt_length(t) + 1)) / log(2.0);if (method & RANK_NORM_LENGTH) { len = cnt_length(t); if (len > 0) res /= (float) len; }if ((method & RANK_NORM_UNIQ) && t->size > 0) res /= (float) (t->size);if (method & RANK_NORM_RDIVRPLUS1) res /= (res + 1);calc_rank_and multiplies per-occurrence weights by a distance decay:
the closer two query terms occur, the higher the contribution. The decay
curve is word_distance, which falls off exponentially with separation and
floors at 1e-30 beyond 100 lexemes apart:
// word_distance — src/backend/utils/adt/tsrank.cstatic float4word_distance(int32 w){ if (w > 100) return 1e-30f; return 1.0 / (1.005 + 0.05 * exp(((float4) w) / 1.5 - 2));}// calc_rank_and (core product) — src/backend/utils/adt/tsrank.cdist = abs((int) WEP_GETPOS(post[l]) - (int) WEP_GETPOS(ct[p]));if (dist || (dist == 0 && (pos[i] == POSNULL || pos[k] == POSNULL))){ if (!dist) dist = MAXENTRYPOS; curw = sqrt(wpos(post[l]) * wpos(ct[p]) * word_distance(dist)); res = (res < 0) ? curw : 1.0 - (1.0 - res) * (1.0 - curw);}calc_rank_or is the disjunctive/frequency branch: it sums weighted
occurrences with a 1/i^2 discount on the i-th occurrence (so additional
hits add diminishing value, normalized against the pi^2/6 series limit):
// calc_rank_or — src/backend/utils/adt/tsrank.cfor (j = 0; j < dimt; j++){ resj = resj + wpos(post[j]) / ((j + 1) * (j + 1)); if (wpos(post[j]) > wjm) { wjm = wpos(post[j]); jm = j; }}res = res + (wjm + resj - wjm / ((jm + 1) * (jm + 1))) / 1.64493406685;ts_rank_cd (cover density) is the proximity ranker. It builds a
DocRepresentation (the query terms’ positions in the document) via
get_docrep, then repeatedly calls Cover to find minimal covers —
shortest spans of the document containing all query terms — and scores
inversely to cover length, summing across all covers. Cover itself reuses
the same TS_execute matcher against a QueryRepresentation:
// Cover — src/backend/utils/adt/tsrank.cwhile (ptr - doc < len){ fillQueryRepresentationData(qr, ptr); if (TS_execute(GETQUERY(qr->query), qr, TS_EXEC_EMPTY, checkcondition_QueryOperand)) { if (WEP_GETPOS(ptr->pos) > ext->q) { ext->q = WEP_GETPOS(ptr->pos); ext->end = ptr; found = true; } break; } ptr++;}6. The GIN bridge (tsginidx.c)
Section titled “6. The GIN bridge (tsginidx.c)”A GIN index on a tsvector column inverts it. gin_extract_tsvector
returns one index key per lexeme — exactly the posting-list keys GIN
stores against the row’s TID:
// gin_extract_tsvector — src/backend/utils/adt/tsginidx.c*nentries = vector->size;for (i = 0; i < vector->size; i++){ txt = cstring_to_text_with_len(STRPTR(vector) + we->pos, we->len); entries[i] = PointerGetDatum(txt); we++;}gin_extract_tsquery turns a query into the set of lexeme keys to look up,
sets partialmatch for prefix operands, and chooses a search mode: a query
with no required positive term (e.g. !foo) forces a full scan
(GIN_SEARCH_MODE_ALL) because the index cannot satisfy a pure negation:
// gin_extract_tsquery — src/backend/utils/adt/tsginidx.cif (tsquery_requires_match(item)) *searchMode = GIN_SEARCH_MODE_DEFAULT;else *searchMode = GIN_SEARCH_MODE_ALL;/* ... emit one entry per QI_VAL, partialmatch[j] = val->prefix ... */gin_tsquery_consistent is where lossy matching lives. GIN hands it a
bool check[] array — one flag per query key, “is this lexeme present in
the candidate row?” — and it runs TS_execute_ternary with the
checkcondition_gin callback. Because the index knows presence but not
weight or position, checkcondition_gin downgrades a present-and-weight-
restricted match to GIN_MAYBE, and a TS_MAYBE result sets *recheck:
// gin_tsquery_consistent — src/backend/utils/adt/tsginidx.cgcv.first_item = GETQUERY(query);gcv.check = (GinTernaryValue *) check;gcv.map_item_operand = (int *) (extra_data[0]);switch (TS_execute_ternary(GETQUERY(query), &gcv, TS_EXEC_PHRASE_NO_POS, checkcondition_gin)){ case TS_NO: res = false; break; case TS_YES: res = true; break; case TS_MAYBE: res = true; *recheck = true; break;}// checkcondition_gin — src/backend/utils/adt/tsginidx.cresult = gcv->check[j];if (result == GIN_TRUE){ if (val->weight != 0 || data != NULL) /* weight/position needed */ result = GIN_MAYBE; /* force a recheck */}return (TSTernaryValue) result;When *recheck is set, the executor re-evaluates @@ on the heap tuple’s
real tsvector (via ts_match_vq again), so phrase distance and weight
filters are decided exactly even though the index probe was lossy.
flowchart TD
q["WHERE body_tsv @@ to_tsquery('run & fast:A')"] --> ext["gin_extract_tsquery<br/>keys: run, fast"]
ext --> probe["GIN posting-list probe<br/>per key to TID sets"]
probe --> check["bool check[] per row<br/>(present? per key)"]
check --> cons["gin_tsquery_consistent<br/>TS_execute_ternary"]
cons --> yes["TS_YES to emit row"]
cons --> maybe["TS_MAYBE (weight A unknown)<br/>set recheck"]
maybe --> rc["heap fetch + ts_match_vq<br/>exact recheck"]
yes --> rank["ts_rank / ts_rank_cd"]
rc --> rank
rank --> out["ordered results"]
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
to_tsvector_byid | src/backend/tsearch/to_tsany.c | 243 |
to_tsvector | src/backend/tsearch/to_tsany.c | 270 |
make_tsvector | src/backend/tsearch/to_tsany.c | 165 |
uniqueWORD | src/backend/tsearch/to_tsany.c | 77 |
pushval_morph | src/backend/tsearch/to_tsany.c | 492 |
to_tsquery_byid | src/backend/tsearch/to_tsany.c | 579 |
plainto_tsquery_byid | src/backend/tsearch/to_tsany.c | 617 |
websearch_to_tsquery_byid | src/backend/tsearch/to_tsany.c | 692 |
get_current_ts_config | src/backend/tsearch/to_tsany.c | 47 |
parsetext | src/backend/tsearch/ts_parse.c | 355 |
LexizeExec | src/backend/tsearch/ts_parse.c | 173 |
hlparsetext | src/backend/tsearch/ts_parse.c | 540 |
generateHeadline | src/backend/tsearch/ts_parse.c | 607 |
ts_lexize | src/backend/tsearch/dict.c | 27 |
dsimple_lexize | src/backend/tsearch/dict_simple.c | 76 |
dsynonym_lexize | src/backend/tsearch/dict_synonym.c | 212 |
tok_alias | src/backend/tsearch/wparser_def.c | 62 |
actionTPS_InAsciiWord | src/backend/tsearch/wparser_def.c | 999 |
ts_match_vq | src/backend/utils/adt/tsvector_op.c | 2187 |
checkcondition_str | src/backend/utils/adt/tsvector_op.c | 1268 |
TS_execute | src/backend/utils/adt/tsvector_op.c | 1827 |
TS_phrase_execute | src/backend/utils/adt/tsvector_op.c | 1582 |
tsvector_bsearch | src/backend/utils/adt/tsvector_op.c | 395 |
calc_rank | src/backend/utils/adt/tsrank.c | 358 |
calc_rank_and | src/backend/utils/adt/tsrank.c | 201 |
calc_rank_or | src/backend/utils/adt/tsrank.c | 284 |
word_distance | src/backend/utils/adt/tsrank.c | 45 |
calc_rank_cd | src/backend/utils/adt/tsrank.c | 855 |
Cover | src/backend/utils/adt/tsrank.c | 651 |
get_docrep | src/backend/utils/adt/tsrank.c | 732 |
gin_extract_tsvector | src/backend/utils/adt/tsginidx.c | 64 |
gin_extract_tsquery | src/backend/utils/adt/tsginidx.c | 94 |
checkcondition_gin | src/backend/utils/adt/tsginidx.c | 183 |
gin_tsquery_consistent | src/backend/utils/adt/tsginidx.c | 214 |
getTSCurrentConfig | src/backend/utils/cache/ts_cache.c | 556 |
lookup_ts_config_cache | src/backend/utils/cache/ts_cache.c | 385 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”All symbols, code excerpts, and constants above were read directly from the
REL_18 working tree at commit 273fe94852b (2026-06-05) under
/data/hgryoo/references/postgres. Verification notes and caveats:
-
Commit / branch.
git log -1reports273fe94852b 2026-06-05. All excerpts are condensed (marked with/* ... */) but otherwise verbatim; every leading// symbol — pathcomment names a function that exists at the cited line in the position-hint table. -
Copyright banners read
2025. Every file header still says “Portions Copyright (c) 1996-2025”; this is the in-tree banner on the REL_18 branch and is not a contradiction of the 2026-dated commit — the banner year lags the actual commit date. Noted so a future reader does not mistake it for a stale checkout. -
tsvector/tsqueryon-disk layout.WordEntryis asserted to be 4 bytes (sizeof(WordEntry) == 4) bytsvectorsend/recv; the bitfield widths (len:11,pos:20,haspos:1) and theMAXENTRYPOS (1<<14)/MAXNUMPOS (256)constants are fromsrc/include/utils/tsvector.has read. The operator codesOP_NOT/AND/OR/PHRASE = 1/2/3/4are fromsrc/include/tsearch/ts_type.h. -
Ternary matching.
TS_executereturnsbool(collapsingTS_MAYBE→true), whileTS_execute_ternary(used by the GIN consistent function) preservesTS_MAYBE. This split is the mechanism behind the index recheck path and was confirmed in bothtsvector_op.candtsginidx.c. -
Scope boundary. The headline subsystem (
hlparsetext,generateHeadline,ts_headline) and the statistics path (ts_typanalyze.c,ts_selfuncs.c) are in-tree and named here for completeness, but their internals are out of scope for this doc; the fullspell.cispell engine anddict_thesaurus.care likewise only sketched. The GIN access method itself (posting-tree layout, fast insert, page splits) is deferred topostgres-gin.md. The varlena / TOAST mechanics of how a largetsvectoris stored are deferred topostgres-datatypes-adt.mdandpostgres-toast.md. -
contrib/is out of scope.pg_trgm,unaccent, and thedict_int/dict_xsynexample dictionaries live undercontrib/and are named only as examples; nothing in the walkthrough asserts facts about them. The Snowball stemmer dictionaries (snowball/) ship in the core tree but their per-language stemming tables are not examined here.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s full-text search is a good-enough, in-engine IR subsystem deliberately scoped below a dedicated search engine. Placing it in the landscape of comparative designs clarifies both what it buys and what it forgoes.
Lucene / Elasticsearch — the segment-based inverted index
Section titled “Lucene / Elasticsearch — the segment-based inverted index”Lucene is the reference point for “serious” text search. Its differences from PostgreSQL’s model are instructive rather than merely quantitative:
- Where the vector lives. Lucene never materializes a per-document
vector as a user-visible value; the term→postings map exists only inside
immutable on-disk segments. PostgreSQL makes the
tsvectora first-class column you canSELECT, store generated, and debug. The cost is duplication (the column and the GIN index both hold the lexemes); the benefit is transparency and reuse of the ordinary MVCC/heap machinery. - Scoring. Lucene defaults to BM25 (Robertson & Spärck Jones’
Okapi family), a saturating tf-idf with document-length normalization
whose
k1/bparameters are tunable per field. PostgreSQL’sts_rankandts_rank_cd(the cover-density scheme of Clarke, Cormack & Tudhope) are position/proximity rankers that do not use corpus-wide inverse document frequency — the@@index path has no global term-frequency statistics to draw on, so a term’s rarity across the table does not raise its score. This is the single biggest semantic gap: PostgreSQL ranks within a document, not relative to the corpus. - Segment merges vs. GIN pending list. Lucene amortizes index updates
by writing new segments and background-merging them. GIN amortizes via
the fast-update pending list (a heap of un-merged entries flushed in
bulk), covered in
postgres-gin.md. Both trade read-time merge cost for write throughput; neither does in-place posting-list mutation.
Oracle Text, SQL Server, MySQL — in-engine peers
Section titled “Oracle Text, SQL Server, MySQL — in-engine peers”Oracle Text (CONTEXT index, CONTAINS() operator) is architecturally the
closest peer: a lexer + stoplist + stemmer pipeline feeding an inverted
index, with SCORE() for relevance. SQL Server’s FULLTEXT indexes and
CONTAINS/FREETEXT, and MySQL/InnoDB’s FULLTEXT with MATCH … AGAINST
(which does expose a BM25-like natural-language mode), occupy the same
niche: text search welded into the relational engine so it stays
transactional and joins against ordinary columns. PostgreSQL’s distinctive
move is the degree of composition — the whole subsystem is expressed in
public extensibility points (a custom type pair, a custom operator, a GIN
opclass, pluggable parser/dictionary FMGR procedures) rather than a private
engine, so a user can register a new dictionary (CREATE TEXT SEARCH DICTIONARY) without touching core C.
The ranking and recheck frontier
Section titled “The ranking and recheck frontier”Two structural limits define the research/extension frontier for the in-core design:
- No corpus statistics on the index path. Because
gin_tsquery_consistentonly sees a per-rowbool check[](presence), andts_rankonly sees one document’stsvector, true tf-idf/BM25 ranking would require either a maintained per-lexeme document-frequency catalog or a different access method.pg_search/ParadeDB and therumextension (a GIN cousin that stores positions and lexeme weights inside the posting lists) attack exactly this:rumlets distance- and rank-ordered scans avoid the heap recheck and avoid a separate sort, which the stock GIN path cannot. - Lossy recheck is mandatory for weight/phrase queries. As Section 6
showed,
checkcondition_gindowngrades any weight- or position-sensitive operand toGIN_MAYBE, forcing a heap fetch and a secondts_match_vq. For phrase-heavy or weight-filtered workloads this can dominate cost. Therumaccess method’s in-posting-list positions are the community answer; getting equivalent capability into core GIN remains open.
Vector / semantic search — an orthogonal axis
Section titled “Vector / semantic search — an orthogonal axis”The post-2023 shift toward dense-vector semantic search (embeddings +
approximate nearest neighbor, e.g. pgvector’s HNSW/IVFFlat indexes) is a
different retrieval model, not a replacement: it ranks by embedding cosine
similarity rather than lexeme overlap, and answers “semantically similar”
rather than “contains these words.” Production systems increasingly run
hybrid retrieval — lexical (tsvector/@@) and vector ANN combined via
reciprocal-rank fusion. PostgreSQL can host both side by side (a tsvector
GIN column and a vector HNSW column on the same table), which is itself an
argument for the in-engine, compose-from-types philosophy: the two index
worlds share transactions, joins, and WHERE filtering for free. pgvector
internals are out of scope here and noted only as the complementary axis.
Where PostgreSQL sits
Section titled “Where PostgreSQL sits”The design is a textbook instance of “embed an 80%-solution that composes
with the relational core” rather than “bolt on a best-of-breed external
engine.” For transactional, moderate-scale, language-aware containment +
proximity ranking against rows you are already storing and joining, the
tsvector/tsquery/GIN triad is hard to beat on operational simplicity.
For corpus-relative relevance (BM25), faceting, distributed sharding, or
billion-document scale, the comparative literature points toward Lucene-class
engines or the rum/ParadeDB extensions — and, increasingly, toward hybrid
lexical+vector pipelines.
Sources
Section titled “Sources”PostgreSQL source tree (REL_18_STABLE, commit 273fe94852b, read
2026-06-05 under /data/hgryoo/references/postgres):
src/backend/tsearch/to_tsany.c—to_tsvector_byid,make_tsvector,uniqueWORD,pushval_morph, theto_tsquery/plainto_tsquery/phraseto_tsquery/websearch_to_tsqueryfamily,get_current_ts_config.src/backend/tsearch/ts_parse.c—parsetext,LexizeExec,LexizeAddLemm, the multiwordDictSubStatehandshake, headline path (hlparsetext,generateHeadline, named only).src/backend/tsearch/dict.c—ts_lexizeSQL debugger and the dictionary FMGR plumbing.src/backend/tsearch/dict_simple.c,dict_synonym.c—dsimple_lexize,dsynonym_lexize(the lexize plugin contract: NULL / empty / lexemes,TSL_FILTER/TSL_ADDPOS).src/backend/tsearch/wparser_def.c— default parser FSM:tok_alias,actionTPS_InAsciiWord, the 23 token types.src/backend/utils/adt/tsvector_op.c—ts_match_vq(@@),checkcondition_str,TS_execute/TS_execute_recurse,TS_phrase_execute,tsvector_bsearch.src/backend/utils/adt/tsrank.c—calc_rank,calc_rank_and,calc_rank_or,word_distance,calc_rank_cd,Cover,get_docrep.src/backend/utils/adt/tsginidx.c—gin_extract_tsvector,gin_extract_tsquery,checkcondition_gin,gin_tsquery_consistent.src/backend/utils/cache/ts_cache.c—getTSCurrentConfig,lookup_ts_config_cache.src/include/utils/tsvector.h—WordEntrybitfield,MAXENTRYPOS,MAXNUMPOS,LIMITPOS.src/include/tsearch/ts_type.h— operator codes.src/include/tsearch/ts_public.h—TSLexemeflags,DictSubState.
Sibling KB analyses (cross-referenced, not duplicated):
knowledge/code-analysis/postgres/postgres-gin.md— GIN access method internals: posting trees, the fast-update pending list, page splits.knowledge/code-analysis/postgres/postgres-datatypes-adt.mdandpostgres-toast.md— varlena layout and TOAST storage of largetsvectorvalues.knowledge/code-analysis/postgres/postgres-collation-providers.md,postgres-encoding.md— the i18n-text neighbors that own case-folding/collation context used bydsimple_lexize.
Theory / IR literature:
- Salton & McGill, Introduction to Modern Information Retrieval (1983) — inverted index, vector space model.
- Zobel & Moffat, “Inverted Files for Text Search Engines”, ACM Computing Surveys 38(2), 2006 — posting-list storage and query evaluation.
- Robertson & Spärck Jones — the Okapi BM25 relevance model (the corpus-
relative scheme PostgreSQL’s
ts_rankdeliberately does not implement). - Clarke, Cormack & Tudhope, “Relevance Ranking for One- to Three-Term
Queries”, Information Processing & Management 36(2), 2000 — the
cover-density model behind
ts_rank_cd. - Database System Concepts (Silberschatz, Korth, Sudarshan) — the
information-retrieval chapter. Database Internals (Petrov) — inverted
indexes as a secondary-index family. Both captured under
knowledge/research/dbms-general/.