PostgreSQL Internationalization & Text Search — Section Overview

This subcategory is about what PostgreSQL does to a string beyond storing its bytes — how it orders and case-folds text under a locale, what encoding a string is in and how it converts between encodings, and how a document is turned into something searchable. Three module docs own these three layers:

  • Collation providers — the abstraction behind every locale-sensitive operation. A COLLATE clause, a lower(), a LIKE/regex, an index on a text column: all of them resolve a collation OID to a single pg_locale_t handle and call through it. The handle hides which provider — libc, ICU, or the PG17-new builtin provider — actually does the comparison.
  • Encoding (utils/mb) — the per-database server encoding, the matrix of conversion functions between encodings (client↔server), and the multibyte-aware string primitives the rest of the backend calls.
  • Full-text search (tsearch) — the tsvector/tsquery types, the parser-plus-dictionary text-search configuration that normalizes a document into lexemes, and the operators that match a query against a vector.
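The first bullet's idea, that callers resolve a collation OID to one handle and call through it without knowing which provider sits behind it, can be sketched in a few lines. This is an illustrative Python model, not PostgreSQL code: `LocaleHandle`, `COLLATIONS`, `resolve`, the OIDs 1 and 2, and the two toy providers ("codepoint" and "caseless") are all hypothetical names invented for the example, and neither toy provider reproduces real libc, ICU, or builtin behavior.

```python
import functools
from dataclasses import dataclass
from typing import Callable, Dict, List


def _three_way(a: str, b: str) -> int:
    """Return -1, 0, or 1, the shape a strcoll-style comparator has."""
    return (a > b) - (a < b)


def _ascii_lower(s: str) -> str:
    """Lowercase A-Z only; every other character passes through unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


@dataclass(frozen=True)
class LocaleHandle:
    """Hypothetical stand-in for a locale handle: a bundle of callbacks."""
    provider: str                       # label only; callers never branch on it
    strcoll: Callable[[str, str], int]  # three-way comparison
    lower: Callable[[str], str]         # case mapping


# Hypothetical registry keyed by made-up collation OIDs.
COLLATIONS: Dict[int, LocaleHandle] = {
    # Toy provider 1: code-point order, ASCII-only case mapping.
    1: LocaleHandle("codepoint", _three_way, _ascii_lower),
    # Toy provider 2: case-insensitive order, Unicode-aware case mapping.
    2: LocaleHandle(
        "caseless",
        lambda a, b: _three_way(a.casefold(), b.casefold()),
        str.lower,
    ),
}


def resolve(collation_oid: int) -> LocaleHandle:
    """Map a collation OID to its handle (a plain dict lookup in this sketch)."""
    return COLLATIONS[collation_oid]


def sort_text(values: List[str], collation_oid: int) -> List[str]:
    """A caller that sorts through the handle, unaware of the provider."""
    handle = resolve(collation_oid)
    return sorted(values, key=functools.cmp_to_key(handle.strcoll))


def lower_text(value: str, collation_oid: int) -> str:
    """A caller that case-maps through the same handle."""
    return resolve(collation_oid).lower(value)
```

`sort_text` and `lower_text` never inspect `handle.provider`; swapping the OID changes the result while the calling code stays the same, which is the property the handle abstraction is meant to provide.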

Sharp boundaries — what this section is NOT.

  • It is not the datatype machinery. text/varchar storage, TOAST compression, the varlena header, and the fmgr calling convention that every string function uses belong to base-infra. This section assumes a string already exists as a Datum and asks what locale/encoding/search semantics apply to it.
  • It is not the GIN access method itself. Full-text search is GIN’s most important client (tsvector ships GIN extractValue/extractQuery support functions and a default operator class), but the GIN index structure, posting lists, and the pending-list/fast-update path are owned by storage-engine (postgres-gin.md). This doc draws the seam; it does not cross it.
  • It is not pattern-matching execution or the planner’s index selection. LIKE/SIMILAR TO/regex evaluation and selectivity live in query-processing; this section only owns the locale handle they consult for case and character classification.
  • It is not parsing or normalization of identifiers and SQL text — that is the scanner/parser in query-processing. Encoding here is about the data path (client_encoding ↔ server_encoding), not the lexer.

Where a question crosses one of these lines, the relevant module doc names the neighbor and hands off.

The three modules are independent layers over the same raw bytes, not a single pipeline — each is consulted by a different caller in the backend. The diagram shows who calls into each, and the two outward seams (base-infra below, storage-engine’s GIN to the side).

flowchart TB
  subgraph CALLERS["backend callers (other subcategories)"]
    CMP["comparisons / sort / index build<br/>(varstr_cmp, str_tolower, LIKE, regex)"]
    WIRE["client I/O<br/>(client_encoding ↔ server_encoding)"]
    FTS["to_tsvector / to_tsquery / @@<br/>(text-search operators)"]
  end

  subgraph SEC["i18n-text subcategory"]
    COLL["postgres-collation-providers.md<br/>pg_locale_t handle + collate vtable<br/>(libc / ICU / builtin)"]
    ENC["postgres-encoding.md<br/>utils/mb: server encoding,<br/>conversion matrix, mb string ops"]
    TS["postgres-full-text-search.md<br/>tsvector/tsquery, parser +<br/>dictionaries (text-search config)"]
  end

  CMP --> COLL
  WIRE --> ENC
  FTS --> TS

  COLL -. "ctype ops need encoding<br/>(char ↔ wchar)" .-> ENC
  TS -. "lexize / casefold<br/>uses locale + encoding" .-> COLL
  TS -. "lexize uses encoding" .-> ENC

  subgraph BASE["base-infra (below)"]
    DT["text / varlena datatypes<br/>fmgr calling convention"]
  end
  COLL --> DT
  ENC --> DT
  TS --> DT

  subgraph STORE["storage-engine (side seam)"]
    GIN["postgres-gin.md<br/>GIN access method"]
  end
  TS -. "extractValue / extractQuery<br/>(tsvector opclass)" .-> GIN

Two things to read off the diagram. First, collation and encoding are orthogonal but coupled: encoding tells you where one character ends and the next begins (and supplies the char ↔ wchar conversion the libc ctype path needs), while collation tells you how to order and case-fold those characters. The builtin and ICU providers carry their own Unicode tables, so they lean on encoding only for the byte/codepoint boundary, not for the semantics. Second, full-text search sits on top of both: turning a document into lexemes (dictionary lexize) consults the locale for case-folding and the encoding for character boundaries, then the result is what GIN indexes.
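To make the "where one character ends and the next begins" half concrete, here is a minimal Python sketch of UTF-8 boundary finding from the lead byte, plus a one-line encoding conversion. The function names `utf8_char_len`, `split_utf8`, and `convert` are hypothetical and chosen for this example; the sketch relies on Python's built-in codecs rather than PostgreSQL's conversion functions, and it deliberately skips validation of continuation bytes.

```python
from typing import List


def utf8_char_len(lead_byte: int) -> int:
    """Length in bytes of a UTF-8 character, decided by its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead_byte & 0xE0 == 0xC0:    # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte & 0xF0 == 0xE0:    # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte & 0xF8 == 0xF0:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"byte 0x{lead_byte:02x} cannot start a UTF-8 character")


def split_utf8(data: bytes) -> List[bytes]:
    """Cut a byte string at character boundaries.

    Only lead bytes are inspected; continuation bytes are not verified,
    so this is a boundary finder, not a validator.
    """
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_len(data[i])
        if i + n > len(data):
            raise ValueError("truncated multibyte character")
        chars.append(data[i:i + n])
        i += n
    return chars


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via Python's codecs."""
    return data.decode(source).encode(target)
```

Nothing here says how the characters should be ordered or case-folded; that decision belongs to the collation layer, which is why the two can vary independently.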

Cross-referenced-first: read the abstraction the other two lean on, then the byte layer, then the consumer that sits on both.

  1. postgres-collation-providers.md — start here. The pg_locale_t handle and the collate_methods vtable are referenced by encoding’s ctype path and by full-text’s case-folding, so the provider abstraction is the shared vocabulary. It also explains the three-provider split (libc / ICU / builtin) that a reader coming from PG≤16 will not expect.
  2. postgres-encoding.md — the byte layer. Short and self-contained: server vs client encoding, the conversion-function matrix, and why the name/wchar tables live in src/common (shared with frontend tools).
  3. postgres-full-text-search.md — read last. It is the consumer that ties the previous two together and reaches sideways into GIN (postgres-gin.md in storage-engine). Read that GIN doc alongside if you want the index half of the story.

Forward references — these module docs may not exist yet; the summaries are predictive.

| Module doc | One-line scope |
| --- | --- |
| postgres-collation-providers.md | The pg_locale_t handle and collate_methods vtable in pg_locale.c: how a collation OID resolves to a provider (COLLPROVIDER_LIBC 'c' / COLLPROVIDER_ICU 'i' / COLLPROVIDER_BUILTIN 'b'), the pg_strcoll/pg_strxfrm/case-mapping entry points dispatched to pg_locale_{libc,icu,builtin}.c, deterministic vs nondeterministic collations, the collate_is_c/ctype_is_c fast paths, collation-version tracking, and CREATE COLLATION (collationcmds.c). |
| postgres-encoding.md | utils/mb: the per-database server encoding, client_encoding and the conversion-function matrix wired through pg_do_encoding_conversion (mbutils.c, conv.c, the conversion_procs table), the multibyte string primitives, and the split with src/common (encnames.c for encoding names, wchar.c for the per-encoding mblen/verify/char ↔ wchar tables shared with frontend tools). |
| postgres-full-text-search.md | tsearch: the tsvector/tsquery types and operators, the text-search configuration (a parser, wparser_def.c, plus a dictionary chain: dict_*, spell.c), how to_tsvector/to_tsquery lexize and normalize, ranking (tsrank.c), and the GIN integration (tsginidx.c extractValue/extractQuery, tsgistidx.c for GiST). |
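The full-text scope above (a parser plus a dictionary chain that turns a document into lexemes, then a match of query against vector) can be illustrated with a toy Python model. Every name here (`parse`, `stop_dictionary`, `plural_dictionary`, `lexize`, `to_vector`, `matches_all`, `CHAIN`) is hypothetical, the stop-word list is made up, and the plural-stripping rule is a crude placeholder for a real stemmer. The sketch assumes the usual dictionary contract: an empty result means "recognized, but a stop word", and `None` means "not recognized, try the next dictionary".

```python
import re
from typing import Callable, Dict, List, Optional

# A dictionary maps a token to lexemes, [] for a stop word, or None to pass.
Dictionary = Callable[[str], Optional[List[str]]]

STOP_WORDS = {"a", "is", "of", "the"}  # illustrative list only


def parse(document: str) -> List[str]:
    """Toy parser: split a document into word tokens."""
    return re.findall(r"\w+", document)


def stop_dictionary(token: str) -> Optional[List[str]]:
    """Recognize stop words (and discard them); pass on everything else."""
    return [] if token in STOP_WORDS else None


def plural_dictionary(token: str) -> Optional[List[str]]:
    """Crude normalizer: strip a trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s"):
        return [token[:-1]]
    return [token]


def lexize(token: str, chain: List[Dictionary]) -> List[str]:
    """Case-fold the token, then let the first dictionary that knows it decide."""
    folded = token.lower()
    for dictionary in chain:
        result = dictionary(folded)
        if result is not None:
            return result
    return []  # no dictionary recognized the token


def to_vector(document: str, chain: List[Dictionary]) -> Dict[str, List[int]]:
    """Build a lexeme -> positions map; stop words still consume a position."""
    vector: Dict[str, List[int]] = {}
    for position, token in enumerate(parse(document), start=1):
        for lexeme in lexize(token, chain):
            vector.setdefault(lexeme, []).append(position)
    return vector


def matches_all(vector: Dict[str, List[int]], query: str,
                chain: List[Dictionary]) -> bool:
    """AND-style match: every normalized query lexeme must be in the vector."""
    wanted = [lx for token in parse(query) for lx in lexize(token, chain)]
    return all(lexeme in vector for lexeme in wanted)


CHAIN: List[Dictionary] = [stop_dictionary, plural_dictionary]
```

The query goes through the same chain as the document, which is what lets a query for "cats" match a document that only contains "cat". Ranking, phrase operators, and the index side are outside this sketch.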
  • postgres-overview-base-infra.md — the layer directly below. It owns the text/varlena datatypes, the fmgr calling convention, and the string functions in utils/adt that invoke the collation handle and the mb conversions this section describes. Read base-infra for “what is a text Datum”; read here for “what locale/encoding rules apply to it.”
  • postgres-overview-storage-engine.md — the side seam. postgres-gin.md there owns the GIN access method that full-text search is the flagship client of; the tsvector operator class and extract* support functions documented here plug into GIN’s generic extract/consistent framework.
  • postgres-overview-query-processing.md — borders on two edges: LIKE/regex evaluation and selectivity estimation consult the collation handle for case/character classification, and the scanner/parser handles the SQL-text encoding that this section’s data-path encoding sits beside.
  • postgres-overview-system-catalog.md: pg_collation, pg_conversion, and pg_ts_config/pg_ts_dict/pg_ts_parser are the catalog rows that the three modules here read; that section owns the catalog layout and cache, this one owns the runtime behavior keyed by those rows.