PostgreSQL Internationalization & Text Search — Section Overview

Contents:

What this section covers
The layering
Reading order
Detail-doc summaries
Adjacent sections

What this section covers

This section covers what PostgreSQL does to a string beyond storing its bytes: how it orders and case-folds text under a locale, which encoding a string is in and how it is converted between encodings, and how a document is turned into something searchable. Three module docs own these three layers:

Collation providers: the abstraction behind every locale-sensitive operation. A COLLATE clause, a lower() call, a LIKE or regex match, an index on a text column: each resolves a collation OID to a single pg_locale_t handle and calls through it. The handle hides which provider (libc, ICU, or the builtin provider introduced in PostgreSQL 17) actually does the comparison.
Encoding (utils/mb): the per-database server encoding, the matrix of conversion functions between encodings (client↔server), and the multibyte-aware string primitives the rest of the backend calls.
Full-text search (tsearch): the tsvector/tsquery types, the text-search configuration (a parser plus dictionaries) that normalizes a document into lexemes, and the operators that match a query against a vector.
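
The handle-and-provider indirection in the first item can be pictured with a small Python model. This is only a sketch of the dispatch idea, not the backend's C code: the class names, the two toy providers, and the OID values below are all hypothetical and chosen for illustration.

```python
import functools
from dataclasses import dataclass
from typing import Callable, Dict, List


def _cmp(a, b) -> int:
    """Three-way comparison returning -1, 0, or 1."""
    return (a > b) - (a < b)


def _ascii_lower(s: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


@dataclass(frozen=True)
class CollateMethods:
    """Hypothetical per-provider method table (a stand-in for a vtable)."""
    strcoll: Callable[[str, str], int]
    lower: Callable[[str], str]


@dataclass(frozen=True)
class LocaleHandle:
    """Hypothetical handle: callers hold this and never look at the provider."""
    provider: str
    methods: CollateMethods


# Toy provider 1: order by encoded bytes, ASCII-only case mapping.
BYTEWISE = CollateMethods(
    strcoll=lambda a, b: _cmp(a.encode("utf-8"), b.encode("utf-8")),
    lower=_ascii_lower,
)

# Toy provider 2: order by case-folded text first, original text as tie-break.
FOLDED = CollateMethods(
    strcoll=lambda a, b: _cmp((a.casefold(), a), (b.casefold(), b)),
    lower=str.lower,
)

# Made-up OIDs standing in for catalog rows; not real PostgreSQL OIDs.
BYTEWISE_OID, FOLDED_OID = 1, 2
_COLLATIONS: Dict[int, LocaleHandle] = {
    BYTEWISE_OID: LocaleHandle("bytewise", BYTEWISE),
    FOLDED_OID: LocaleHandle("folded", FOLDED),
}


def locale_from_oid(collation_oid: int) -> LocaleHandle:
    """Resolve a collation OID to its handle (a dict lookup in this sketch)."""
    return _COLLATIONS[collation_oid]


def sort_text(values: List[str], collation_oid: int) -> List[str]:
    """Sort through the handle; the caller never names a provider."""
    handle = locale_from_oid(collation_oid)
    return sorted(values, key=functools.cmp_to_key(handle.methods.strcoll))


def lower_text(value: str, collation_oid: int) -> str:
    """Case-map through the same handle."""
    return locale_from_oid(collation_oid).methods.lower(value)
```

In this model, `sort_text(["b", "A", "a", "B"], BYTEWISE_OID)` yields `["A", "B", "a", "b"]`, while the same call with `FOLDED_OID` yields `["A", "a", "B", "b"]`. The calling code is identical in both cases; only the handle behind the OID differs.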

Sharp boundaries — what this section is NOT.

It is not the datatype machinery. text/varchar storage, TOAST compression, the varlena header, and the fmgr calling convention that every string function uses belong to base-infra. This section assumes a string already exists as a Datum and asks what locale/encoding/search semantics apply to it.
It is not the GIN access method itself. Full-text search is GIN’s most important client — tsvector ships GIN extractValue/extractQuery support functions and a default operator class — but the GIN index structure, posting lists, and the pending-list/fast-update path are owned by storage-engine (postgres-gin.md). This doc draws the seam; it does not cross it.
It is not pattern-matching execution or the planner’s index selection. LIKE/SIMILAR TO/regex evaluation and selectivity live in query-processing; this section only owns the locale handle they consult for case and character classification.
It is not parsing or normalization of identifiers and SQL text — that is the scanner/parser in query-processing. Encoding here is about the data path (client_encoding ↔ server_encoding), not the lexer.

Where a question crosses one of these lines, the relevant module doc names the neighbor and hands off.

The layering

The three modules are separate layers over the same raw bytes rather than stages of a single pipeline: each is entered by a different caller in the backend, although they do call into one another. The diagram shows who calls into each, the dotted dependencies between them, and the two outward seams (base-infra below, storage-engine’s GIN to the side).

flowchart TB
  subgraph CALLERS["backend callers (other subcategories)"]
    CMP["comparisons / sort / index build<br/>(varstr_cmp, str_tolower, LIKE, regex)"]
    WIRE["client I/O<br/>(client_encoding ↔ server_encoding)"]
    FTS["to_tsvector / to_tsquery / @@<br/>(text-search operators)"]
  end

  subgraph SEC["i18n-text subcategory"]
    COLL["postgres-collation-providers.md<br/>pg_locale_t handle + collate vtable<br/>(libc / ICU / builtin)"]
    ENC["postgres-encoding.md<br/>utils/mb: server encoding,<br/>conversion matrix, mb string ops"]
    TS["postgres-full-text-search.md<br/>tsvector/tsquery, parser +<br/>dictionaries (text-search config)"]
  end

  CMP --> COLL
  WIRE --> ENC
  FTS --> TS

  COLL -. "ctype ops need encoding<br/>(char ↔ wchar)" .-> ENC
  TS -. "lexize / casefold<br/>uses locale + encoding" .-> COLL
  TS -. "lexize uses encoding" .-> ENC

  subgraph BASE["base-infra (below)"]
    DT["text / varlena datatypes<br/>fmgr calling convention"]
  end
  COLL --> DT
  ENC --> DT
  TS --> DT

  subgraph STORE["storage-engine (side seam)"]
    GIN["postgres-gin.md<br/>GIN access method"]
  end
  TS -. "extractValue / extractQuery<br/>(tsvector opclass)" .-> GIN

Two things to read off the diagram. First, collation and encoding answer different questions but are coupled. Encoding tells you where one character ends and the next begins, and supplies the char↔wchar conversion that the libc ctype path needs. Collation tells you how to order and case-fold those characters. The builtin and ICU providers carry their own Unicode tables, so they lean on encoding only for the byte/codepoint boundary, not for the semantics. Second, full-text search sits on top of both: turning a document into lexemes (dictionary lexize) consults the locale for case-folding and the encoding for character boundaries, and the resulting lexemes are what GIN indexes.
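
The second observation can be made concrete with a toy pipeline in Python. It is a sketch under loud assumptions, not how tsearch is implemented: the stop-word list, the word-splitting rule, and every function name here are hypothetical, and Python's `casefold()` stands in for the locale-driven case-folding.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical stop-word list; a real configuration takes this from a dictionary.
STOP_WORDS = frozenset({"a", "the", "of", "and"})


def decode_document(raw: bytes, server_encoding: str) -> str:
    """Encoding step: the codec decides where each character begins and ends."""
    return raw.decode(server_encoding)


def parse_tokens(text: str) -> List[str]:
    """Parser step (toy rule): every run of word characters is one token."""
    return re.findall(r"\w+", text)


def lexize(token: str) -> Optional[str]:
    """Dictionary step: case-fold, then drop stop words by returning None."""
    lexeme = token.casefold()
    return None if lexeme in STOP_WORDS else lexeme


def to_vector(raw: bytes, server_encoding: str = "utf-8") -> Dict[str, List[int]]:
    """Build a lexeme -> positions map, the shape an inverted index consumes."""
    vector: Dict[str, List[int]] = defaultdict(list)
    tokens = parse_tokens(decode_document(raw, server_encoding))
    # Positions count every token, so dropped stop words still leave gaps.
    for position, token in enumerate(tokens, start=1):
        lexeme = lexize(token)
        if lexeme is not None:
            vector[lexeme].append(position)
    return dict(vector)


def matches_all(vector: Dict[str, List[int]], query: str) -> bool:
    """AND-match: every non-stop-word query lexeme must occur in the vector."""
    wanted = [lx for lx in map(lexize, parse_tokens(query)) if lx is not None]
    return bool(wanted) and all(lx in vector for lx in wanted)
```

For example, `to_vector("The Caf\u00e9 of the caf\u00e9".encode("utf-8"))` produces `{"café": [2, 5]}`, and `matches_all` on that vector with the query `"CAF\u00c9"` returns `True`. Decoding the same text from Latin-1 bytes with `server_encoding="latin-1"` gives an identical vector, which is the point of the layering: the encoding step settles character boundaries, the case-folding step settles equivalence, and only the resulting lexemes reach the index.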

Reading order

Most-cross-referenced first: start with the provider abstraction whose vocabulary the other two docs reuse, then the byte layer, then the consumer that sits on both.

postgres-collation-providers.md: start here. The pg_locale_t handle and the collate_methods vtable reappear in the ctype path that calls into encoding for char↔wchar conversion and in full-text search’s case-folding, so the provider abstraction is the shared vocabulary. It also explains the three-provider split (libc / ICU / builtin), which a reader coming from PostgreSQL 16 or earlier will not expect.
postgres-encoding.md: the byte layer. Short and self-contained: server vs client encoding, the conversion-function matrix, and why the name/wchar tables live in src/common (shared with frontend tools).
postgres-full-text-search.md: read last. It is the consumer that ties the previous two together and reaches sideways into GIN (postgres-gin.md in storage-engine). Read that GIN doc alongside it if you want the index half of the story.

Detail-doc summaries

Forward references — these module docs may not exist yet; the summaries are predictive.

| Module doc | One-line scope |
| --- | --- |
| `postgres-collation-providers.md` | The `pg_locale_t` handle and `collate_methods` vtable in `pg_locale.c`: how a collation OID resolves to a provider (`COLLPROVIDER_LIBC` `'c'` / `COLLPROVIDER_ICU` `'i'` / `COLLPROVIDER_BUILTIN` `'b'`), the `pg_strcoll`/`pg_strxfrm`/case-mapping entry points dispatched to `pg_locale_{libc,icu,builtin}.c`, deterministic vs nondeterministic collations, the `collate_is_c`/`ctype_is_c` fast paths, collation-version tracking, and `CREATE COLLATION` (`collationcmds.c`). |
| `postgres-encoding.md` | `utils/mb`: the per-database server encoding, `client_encoding` and the conversion-function matrix wired through `pg_do_encoding_conversion` (`mbutils.c`, `conv.c`, the conversion modules under `conversion_procs`), the multibyte string primitives, and the split with `src/common` (`encnames.c` for encoding names, `wchar.c` for the per-encoding `mblen`/verify/`char`↔`wchar` tables shared with frontend tools). |
| `postgres-full-text-search.md` | `tsearch`: the `tsvector`/`tsquery` types and operators, the text-search configuration (a parser, `wparser_def.c`, plus a dictionary chain, `dict_*` and `spell.c`), how `to_tsvector`/`to_tsquery` lexize and normalize, ranking (`tsrank.c`), and the index integration (`tsginidx.c` `extractValue`/`extractQuery` for GIN, `tsgistidx.c` for GiST). |

Adjacent sections

postgres-overview-base-infra.md: the layer directly below. It owns the text/varlena datatypes, the fmgr calling convention, and the string functions in utils/adt that invoke the collation handle and the mb conversions this section describes. Read base-infra for “what is a text Datum”; read here for “what locale/encoding rules apply to it.”
postgres-overview-storage-engine.md: the side seam. postgres-gin.md there owns the GIN access method, of which full-text search is the flagship client; the tsvector operator class and extract* support functions documented here plug into GIN’s generic extract/consistent framework.
postgres-overview-query-processing.md: shares two borders with this section. LIKE/regex evaluation and selectivity estimation consult the collation handle for case and character classification, and the scanner/parser handles the encoding of SQL text, which sits beside the data-path encoding covered here.
postgres-overview-system-catalog.md: pg_collation, pg_conversion, and pg_ts_config/pg_ts_dict/pg_ts_parser are the catalogs whose rows the three modules here read. That section owns the catalog layout and cache; this one owns the runtime behavior keyed by those rows.