PostgreSQL Internationalization & Text Search — Section Overview
Contents:
What this section covers
This subcategory is about what PostgreSQL does to a string beyond storing its bytes: how it orders and case-folds text under a locale, what encoding a string is in and how it converts between encodings, and how a document is turned into something searchable. Three module docs own these three layers:
- Collation providers: the abstraction behind every locale-sensitive operation. A `COLLATE` clause, a `lower()`, a `LIKE`/regex, an index on a `text` column: all of them resolve a collation OID to a single `pg_locale_t` handle and call through it. The handle hides which provider (`libc`, `ICU`, or the PG17-new `builtin` provider) actually does the comparison.
- Encoding (`utils/mb`): the per-database server encoding, the matrix of conversion functions between encodings (client↔server), and the multibyte-aware string primitives the rest of the backend calls.
- Full-text search (`tsearch`): the `tsvector`/`tsquery` types, the parser-plus-dictionary text-search configuration that normalizes a document into lexemes, and the operators that match a query against a vector.
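To make the handle-and-provider idea concrete, here is a minimal Python sketch of the dispatch pattern. It is a toy model, not PostgreSQL's C code: `LocaleHandle`, the `COLLATIONS` registry, the two collation names, and the provider labels are all hypothetical stand-ins. Only the shape mirrors the description above: resolve an identifier once to a handle, then call through it without knowing which provider is behind it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def _ascii_lower(s: str) -> str:
    # Lowercase A-Z only; every other character passes through unchanged.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


@dataclass(frozen=True)
class LocaleHandle:
    """Hypothetical stand-in for a resolved collation handle."""
    provider: str
    sort_key: Callable[[str], object]   # provider-specific ordering
    lower: Callable[[str], str]         # provider-specific case mapping


# Hypothetical registry: collation name -> handle (a real catalog is keyed by OID).
COLLATIONS: Dict[str, LocaleHandle] = {
    "toy_bytewise": LocaleHandle(
        "toy-provider-a", lambda s: s.encode("utf-8"), _ascii_lower
    ),
    "toy_caseless": LocaleHandle("toy-provider-b", str.casefold, str.lower),
}


def sort_with(collation: str, values: List[str]) -> List[str]:
    handle = COLLATIONS[collation]                # resolve once
    return sorted(values, key=handle.sort_key)    # then call through the handle


def lower_with(collation: str, value: str) -> str:
    return COLLATIONS[collation].lower(value)


words = ["b", "A", "a", "B"]
print(sort_with("toy_bytewise", words))   # ['A', 'B', 'a', 'b']
print(sort_with("toy_caseless", words))   # ['A', 'a', 'b', 'B']
```

The callers (`sort_with`, `lower_with`) never branch on the provider; swapping the entry in the registry changes ordering and case mapping together, which is the property the handle is meant to give.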
Sharp boundaries — what this section is NOT.
- It is not the datatype machinery. `text`/`varchar` storage, `TOAST` compression, the `varlena` header, and the `fmgr` calling convention that every string function uses belong to base-infra. This section assumes a string already exists as a `Datum` and asks what locale/encoding/search semantics apply to it.
- It is not the GIN access method itself. Full-text search is GIN’s most important client (`tsvector` ships GIN `extractValue`/`extractQuery` support functions and a default operator class), but the GIN index structure, posting lists, and the pending-list/fast-update path are owned by storage-engine (`postgres-gin.md`). This doc draws the seam; it does not cross it.
- It is not pattern-matching execution or the planner’s index selection. `LIKE`/`SIMILAR TO`/regex evaluation and selectivity live in query-processing; this section only owns the locale handle they consult for case and character classification.
- It is not parsing or normalization of identifiers and SQL text; that is the scanner/parser in query-processing. Encoding here is about the data path (`client_encoding` ↔ `server_encoding`), not the lexer.
Where a question crosses one of these lines, the relevant module doc names the neighbor and hands off.
The layering
The three modules are independent layers over the same raw bytes, not a single pipeline; each is consulted by a different caller in the backend. The diagram shows who calls into each, and the two outward seams (base-infra below, storage-engine’s GIN to the side).
flowchart TB
subgraph CALLERS["backend callers (other subcategories)"]
CMP["comparisons / sort / index build<br/>(varstr_cmp, str_tolower, LIKE, regex)"]
WIRE["client I/O<br/>(client_encoding ↔ server_encoding)"]
FTS["to_tsvector / to_tsquery / @@<br/>(text-search operators)"]
end
subgraph SEC["i18n-text subcategory"]
COLL["postgres-collation-providers.md<br/>pg_locale_t handle + collate vtable<br/>(libc / ICU / builtin)"]
ENC["postgres-encoding.md<br/>utils/mb: server encoding,<br/>conversion matrix, mb string ops"]
TS["postgres-full-text-search.md<br/>tsvector/tsquery, parser +<br/>dictionaries (text-search config)"]
end
CMP --> COLL
WIRE --> ENC
FTS --> TS
COLL -. "ctype ops need encoding<br/>(char ↔ wchar)" .-> ENC
TS -. "lexize / casefold<br/>uses locale + encoding" .-> COLL
TS -. "lexize uses encoding" .-> ENC
subgraph BASE["base-infra (below)"]
DT["text / varlena datatypes<br/>fmgr calling convention"]
end
COLL --> DT
ENC --> DT
TS --> DT
subgraph STORE["storage-engine (side seam)"]
GIN["postgres-gin.md<br/>GIN access method"]
end
TS -. "extractValue / extractQuery<br/>(tsvector opclass)" .-> GIN
Two things to read off the diagram. First, collation and encoding are
orthogonal but coupled: encoding tells you where one character ends and the
next begins (and supplies the char↔wchar conversion the libc ctype path
needs), while collation tells you how to order and case-fold those
characters — the builtin and ICU providers carry their own Unicode tables, so
they lean on encoding only for the byte/codepoint boundary, not for the
semantics. Second, full-text search sits on top of both: turning a
document into lexemes (dictionary lexize) consults the locale for case-folding
and the encoding for character boundaries, then the result is what GIN indexes.
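The split between the two layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the backend's code: `utf8_char_len` and `split_chars` are hypothetical helper names, the input is assumed to be valid UTF-8, and Python's own `str.isalpha` stands in for whatever classification a provider would supply. The first half is the encoding's job (byte boundaries and bytes-to-code-point conversion); the last line is the locale's job (deciding what a code point means).

```python
def utf8_char_len(first_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if first_byte >> 5 == 0b110:
        return 2                      # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


def split_chars(data: bytes) -> list:
    """Split valid UTF-8 bytes into one bytes object per character."""
    chars, i = [], 0
    while i < len(data):
        n = utf8_char_len(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


raw = "naïve €".encode("utf-8")

# Encoding layer: where does each character end, and which code point is it?
chars = split_chars(raw)
code_points = [ord(c.decode("utf-8")) for c in chars]

# Locale layer: classification works on code points, not on raw bytes.
is_alpha = [chr(cp).isalpha() for cp in code_points]

print([len(c) for c in chars])   # [1, 1, 2, 1, 1, 1, 3]
print(is_alpha)                  # [True, True, True, True, True, False, False]
```

Nothing in `split_chars` knows whether `ï` is a letter, and nothing in the classification step knows how many bytes it occupied; that is the sense in which the two layers are orthogonal but coupled.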
Reading order
Cross-referenced-first: read the abstraction the other two lean on, then the byte layer, then the consumer that sits on both.
1. `postgres-collation-providers.md`: start here. The `pg_locale_t` handle and the `collate_methods` vtable are referenced by encoding’s ctype path and by full-text’s case-folding, so the provider abstraction is the shared vocabulary. It also explains the three-provider split (libc / ICU / builtin) that a reader coming from PG≤16 will not expect.
2. `postgres-encoding.md`: the byte layer. Short and self-contained: server vs client encoding, the conversion-function matrix, and why the name/wchar tables live in `src/common` (shared with frontend tools).
3. `postgres-full-text-search.md`: read last. It is the consumer that ties the previous two together and reaches sideways into GIN (`postgres-gin.md` in storage-engine). Read that GIN doc alongside if you want the index half of the story.
Detail-doc summaries
Forward references: these module docs may not exist yet; the summaries are predictive.
| Module doc | One-line scope |
|---|---|
| `postgres-collation-providers.md` | The `pg_locale_t` handle and `collate_methods` vtable in `pg_locale.c`: how a collation OID resolves to a provider (`COLLPROVIDER_LIBC` `'c'` / `COLLPROVIDER_ICU` `'i'` / `COLLPROVIDER_BUILTIN` `'b'`), the `pg_strcoll`/`pg_strxfrm`/case-mapping entry points dispatched to `pg_locale_{libc,icu,builtin}.c`, deterministic vs nondeterministic collations, the `collate_is_c`/`ctype_is_c` fast paths, collation-version tracking, and `CREATE COLLATION` (`collationcmds.c`). |
| `postgres-encoding.md` | `utils/mb`: the per-database server encoding, `client_encoding` and the conversion-function matrix wired through `pg_do_encoding_conversion` (`mbutils.c`, `conv.c`, the `conversion_procs` table), the multibyte string primitives, and the split with `src/common` (`encnames.c` for encoding names, `wchar.c` for the per-encoding mblen/verify/char↔wchar tables shared with frontend tools). |
| `postgres-full-text-search.md` | `tsearch`: the `tsvector`/`tsquery` types and operators, the text-search configuration (a parser, `wparser_def.c`, plus a dictionary chain, `dict_*` and `spell.c`), how `to_tsvector`/`to_tsquery` lexize and normalize, ranking (`tsrank.c`), and the GIN integration (`tsginidx.c` `extractValue`/`extractQuery`, `tsgistidx.c` for GiST). |
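
The parser-plus-dictionary-chain idea in the last row can be sketched as a toy Python pipeline. Everything here is an assumption made for illustration: the function names, the five-word stop list, and the naive plural rule are invented and far cruder than any real text-search configuration. What the sketch does try to keep is the control flow: the parser emits positioned tokens, each token is offered to the dictionaries in order, the first dictionary that recognizes it decides the outcome, and an empty result discards the token as a stop word.

```python
import re

STOP_WORDS = {"a", "the", "of", "and", "is"}   # illustrative only


def parse(document: str):
    """Toy parser: yield (position, token) for each run of word characters."""
    for pos, match in enumerate(re.finditer(r"\w+", document), start=1):
        yield pos, match.group()


def stop_dict(token: str):
    """Return [] to discard a stop word, or None to pass the token on."""
    return [] if token in STOP_WORDS else None


def plural_dict(token: str):
    """Naive plural stripping; always recognizes the token."""
    if token.endswith("s") and len(token) > 3:
        return [token[:-1]]
    return [token]


DICTIONARY_CHAIN = [stop_dict, plural_dict]


def to_vector(document: str) -> dict:
    """Map each lexeme to the positions where it occurs."""
    vector = {}
    for pos, token in parse(document):
        token = token.lower()                  # case folding before lookup
        for dictionary in DICTIONARY_CHAIN:
            lexemes = dictionary(token)
            if lexemes is not None:            # first dictionary that answers wins
                for lexeme in lexemes:
                    vector.setdefault(lexeme, []).append(pos)
                break
    return vector


def matches(vector: dict, query: str) -> bool:
    """AND query: every normalized query term must appear in the vector."""
    return all(term in vector for term in to_vector(query))


doc = to_vector("The cats and the dog")
print(doc)                        # {'cat': [2], 'dog': [5]}
print(matches(doc, "cat dogs"))   # True
```

Note that the query goes through the same normalization as the document, which is why `dogs` finds `dog`; a query made only of stop words normalizes to nothing and, in this sketch, trivially matches.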
Adjacent sections
- `postgres-overview-base-infra.md`: the layer directly below. It owns the `text`/`varlena` datatypes, the `fmgr` calling convention, and the string functions in `utils/adt` that invoke the collation handle and the mb conversions this section describes. Read base-infra for “what is a `text` Datum”; read here for “what locale/encoding rules apply to it.”
- `postgres-overview-storage-engine.md`: the side seam. `postgres-gin.md` there owns the GIN access method that full-text search is the flagship client of; the `tsvector` operator class and `extract*` support functions documented here plug into GIN’s generic extract/consistent framework.
- `postgres-overview-query-processing.md`: borders this section on two edges. `LIKE`/regex evaluation and selectivity estimation consult the collation handle for case/character classification, and the scanner/parser handles the SQL-text encoding that this section’s data-path encoding sits beside.
- `postgres-overview-system-catalog.md`: `pg_collation`, `pg_conversion`, and `pg_ts_config`/`pg_ts_dict`/`pg_ts_parser` are the catalog rows that the three modules here read; that section owns the catalog layout and cache, this one owns the runtime behavior keyed by those rows.
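
The seam with the index layer can be pictured with a small Python model. It is a deliberately simplified sketch: `ToyInvertedIndex` and the two `extract_*` functions are hypothetical names, and a real inverted index has posting-list storage, a pending list, and a consistency check that this model leaves out. The point is only the division of labor: the type-specific functions decide which keys a value or a query contributes, and the generic index stores and intersects keys without knowing what they mean.

```python
from collections import defaultdict


def extract_value(lexemes):
    """Type-specific half: the keys to index for one stored value."""
    return set(lexemes)


def extract_query(terms):
    """Type-specific half: the keys a query needs to look up."""
    return set(terms)


class ToyInvertedIndex:
    """Generic half: knows only keys and row ids, not what the keys mean."""

    def __init__(self, extract_value, extract_query):
        self._extract_value = extract_value
        self._extract_query = extract_query
        self._postings = defaultdict(set)      # key -> row ids containing it

    def insert(self, row_id, value):
        for key in self._extract_value(value):
            self._postings[key].add(row_id)

    def search_all(self, query):
        """Row ids whose value contains every query key (an AND search)."""
        keys = self._extract_query(query)
        if not keys:
            return set()
        result = None
        for key in keys:
            rows = self._postings.get(key, set())
            result = set(rows) if result is None else result & rows
        return result


index = ToyInvertedIndex(extract_value, extract_query)
index.insert(1, ["cat", "dog"])
index.insert(2, ["cat", "bird"])
print(index.search_all(["cat", "dog"]))   # {1}
```

Passing different extract functions to the same `ToyInvertedIndex` would index a different datatype with no change to the index class, which is the kind of plug-in relationship the storage-engine bullet above describes.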