PostgreSQL Collation Providers — libc, ICU, and the Builtin Provider

A relational database must impose a total order on the values of every sortable type, because ORDER BY, B-tree indexes, MIN/MAX, merge joins, and GROUP BY all rest on a comparator that, given two values, returns less / equal / greater. For numbers and dates the order is intrinsic. For character strings it is not: “is Z < a?” has no universal answer. ASCII says yes (uppercase sorts before lowercase); a German phone-book order says treat ä like ae; a Swedish order puts ä after z. The function that decides is a collation, and it is a cultural artifact, not a mathematical one.
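To make the "is Z < a?" question concrete, here is a minimal C sketch. `dictionary_cmp` is a hypothetical toy comparator (ASCII only, not a real collation): it orders letters case-insensitively first and uses the raw bytes only to break ties, whereas `byte_order_cmp` is plain `strcmp`.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Byte order: raw character values decide, so 'Z' (0x5A) sorts before 'a' (0x61). */
int byte_order_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

/*
 * Toy "dictionary" order for ASCII (hypothetical, for illustration only):
 * compare case-insensitively first; if the letters tie, fall back to byte
 * order so the result is still a total order.
 */
int dictionary_cmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *) a;
    const unsigned char *q = (const unsigned char *) b;

    assert(a != NULL && b != NULL);
    while (*p && *q)
    {
        int ca = tolower(*p);
        int cb = tolower(*q);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        p++;
        q++;
    }
    if (*p != *q)               /* one string is a prefix of the other */
        return *p ? 1 : -1;
    return strcmp(a, b);        /* same letters: break the tie by bytes */
}
```

The same pair of strings gets opposite verdicts from the two functions: `byte_order_cmp("Z", "a")` is negative while `dictionary_cmp("Z", "a")` is positive. Neither is wrong; they implement different conventions.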

Database System Concepts (Silberschatz et al.) introduces this under the SQL COLLATE clause: a collation is “a set of rules that determines how strings are compared and sorted,” and the standard lets a collation be attached to a column, an expression, or a query-level override. The textbook stresses two things the beginner overlooks. First, collation is separate from character encoding — encoding (UTF-8, LATIN1) decides which byte sequences are legal characters; collation decides how those characters sort. Second, collation interacts with equality, not only ordering: a case-insensitive collation must report 'A' = 'a' as true, which ripples into unique indexes, DISTINCT, and hash joins.

Three design knobs define the space a collation implementation chooses within:

  1. Where do the rules come from? A database can lean on the host operating system’s C library (strcoll(3)), bind a dedicated Unicode library (ICU/CLDR), or compile its own ordering tables into the binary. Each choice trades coverage (how many locales) against stability (does the order change when the host upgrades).

  2. Is the comparison deterministic? The Unicode Collation Algorithm (UCA, Unicode Technical Standard #10) defines a multi-level comparison: primary level ignores case and accents, secondary adds accents, tertiary adds case. A collation can declare that two strings comparing equal at the chosen level are equal (nondeterministic: 'café' = 'café' even though the byte strings differ, for example when one spells é as a single precomposed code point and the other as e plus a combining accent), or that ties are broken by a final byte comparison so only byte-identical strings are equal (deterministic). The choice changes the meaning of =, unique constraints, and whether abbreviated keys / deduplication are even sound.

  3. How is order versioned? Collation tables are data, and they change. CLDR ships a new release; glibc 2.28 famously reordered many locales. When the rules underneath a B-tree change, the index is silently corrupt — entries are no longer in the order the tree assumes. A mature engine records the collation version that built each index and warns when the live library disagrees.

PostgreSQL’s answer to knob 1 is a provider abstraction: every collation carries a one-character collprovider code, and a thin dispatch layer routes to one of three back ends. Knob 2 is a per-collation boolean, collisdeterministic; every provider accepts true, but only ICU accepts false. Knob 3 is a version string, collversion in pg_collation and datcollversion in pg_database, checked on first use.

Every engine that supports culturally-correct string ordering converges on the same handful of engineering patterns, regardless of which library it delegates to. Naming them here lets PostgreSQL’s specific symbols read as choices within a shared space.

A provider/strategy indirection over the comparator

No serious engine hard-codes one collation library. The comparator is reached through an indirection — a function-pointer table, a virtual method, or a tagged dispatch on a provider code — so the same call site (ORDER BY, index build) works whether the rules come from libc, ICU, or a built-in table. The indirection is resolved once per (collation, session) and cached, because resolving a locale handle (newlocale(3), ucol_open()) is expensive and the same collation is hit millions of times per query.

Locale handle distinct from the resolved comparator

The locale name ("en_US.utf8", "und-x-icu") is catalog data; the locale handle (locale_t, UCollator *) is a live OS/library object that cannot be stored in a catalog and must be reconstructed per backend. Engines keep the two strictly separate: the catalog row names the collation, and a per-process cache materializes the handle on demand.

Transform keys for sort-once, compare-many

strcoll-style comparison is costly: it re-parses both strings on every call. The standard optimization is a transform (strxfrm(3), ucol_getSortKey()) that converts a string once into a byte blob whose plain memcmp reproduces the collation order. Sorts and abbreviated-key optimizations use the transform; the catch is that not every library’s transform is trustworthy, so the engine guards it behind a capability flag.
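As a minimal sketch of the transform idea, the following uses only the standard C `strxfrm` / `strcoll` pair. The helper names (`make_sort_key`, `sign_of`, `key_order_matches_strcoll`) are hypothetical. libc keys are NUL-terminated strings compared with `strcmp`; ICU-style binary sort keys would be compared with `memcmp` instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a sort key for s under the current LC_COLLATE locale.
 * Returns a malloc'd NUL-terminated key (caller frees), or NULL if
 * allocation fails.
 */
char *make_sort_key(const char *s)
{
    size_t need;
    char *key;

    assert(s != NULL);
    /* With size 0 the destination may be NULL; the return value is the
     * key length, not counting the terminating NUL. */
    need = strxfrm(NULL, s, 0);
    key = malloc(need + 1);
    if (key != NULL)
        strxfrm(key, s, need + 1);
    return key;
}

/* Reduce any comparator result to -1, 0, or +1. */
int sign_of(int x)
{
    return (x > 0) - (x < 0);
}

/* Returns 1 if comparing the keys agrees with strcoll, 0 if not, -1 on OOM. */
int key_order_matches_strcoll(const char *a, const char *b)
{
    char *ka = make_sort_key(a);
    char *kb = make_sort_key(b);
    int ok = -1;

    if (ka != NULL && kb != NULL)
        ok = (sign_of(strcmp(ka, kb)) == sign_of(strcoll(a, b)));
    free(ka);
    free(kb);
    return ok;
}
```

A sort would call `make_sort_key` once per row and then compare keys cheaply. `key_order_matches_strcoll` is the kind of sanity check a capability flag stands in for: if it ever returned 0 on some platform, the transform could not be trusted there.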

Deterministic tie-break appended to the cultural order

Cultural orders are not total — many distinct byte strings tie at every UCA level. For a B-tree key or a unique index, ties are unacceptable. The near-universal fix: after the library returns “equal,” append a raw memcmp (then a length comparison) to force a total order. A collation that skips this tie-break is “nondeterministic” and must be excluded from optimizations that assume byte-equal-iff-equal (deduplication, abbreviated keys, hash equality by image).

Because collation tables drift, engines stamp each collation-dependent on-disk structure with the library version that built it and surface a mismatch as a warning plus an explicit “rebuild + refresh version” workflow. Without this, an OS upgrade turns every text index subtly wrong with no diagnostic.

| Theory / pattern | PostgreSQL name |
| --- | --- |
| Provider/strategy indirection | collprovider code + create_pg_locale() dispatch |
| Provider codes | COLLPROVIDER_LIBC 'c', _ICU 'i', _BUILTIN 'b', _DEFAULT 'd' |
| Resolved comparator handle | pg_locale_t (struct pg_locale_struct) |
| Comparator method table | struct collate_methods (strncoll, strnxfrm, …) |
| Locale name (catalog data) | collcollate/collctype (libc), colllocale (ICU/builtin) |
| Per-session handle cache | CollationCache + last_collation_cache_* |
| Transform key | pg_strnxfrm / strxfrm_is_safe flag |
| Deterministic tie-break | mylocale->deterministic branch in varstr_cmp |
| Version stamping | collversion column + get_collation_actual_version() |
| Fixed-at-create database locale | datcollate/datctype/datlocale in pg_database |

PostgreSQL routes all locale-sensitive string work through a single opaque handle, pg_locale_t, obtained from a collation OID. The handle’s first field is the provider code; every public entry point (pg_strcoll, pg_strnxfrm, pg_strlower, …) is a thin dispatcher that branches on that code and forwards to a provider-specific implementation living in pg_locale_libc.c, pg_locale_icu.c, or pg_locale_builtin.c.

// struct pg_locale_struct — src/include/utils/pg_locale.h
struct pg_locale_struct
{
    char provider;
    bool deterministic;
    bool collate_is_c;
    bool ctype_is_c;
    bool is_default;
    const struct collate_methods *collate; /* NULL if collate_is_c */
    union
    {
        struct
        {
            const char *locale;
            bool casemap_full;
        } builtin;
        locale_t lt; /* libc */
#ifdef USE_ICU
        struct
        {
            const char *locale;
            UCollator *ucol;
        } icu;
#endif
    } info;
};

The struct is a discriminated union. provider is the discriminant; the info union holds exactly one of: the builtin locale string, a libc locale_t, or an ICU UCollator *. Two booleans, collate_is_c and ctype_is_c, are hoisted out of the union because the engine optimizes the C/POSIX path aggressively (plain memcmp, ASCII case folding) and wants to test for it without a provider-specific call. deterministic is likewise hoisted because it gates correctness logic in dozens of call sites.

The comparison methods are themselves a pointer table, so even within one provider PostgreSQL can swap implementations (e.g., a Windows UTF-8 variant):

// struct collate_methods — src/include/utils/pg_locale.h
struct collate_methods
{
    int (*strncoll) (const char *arg1, ssize_t len1,
                     const char *arg2, ssize_t len2,
                     pg_locale_t locale); /* required */
    size_t (*strnxfrm) (char *dest, size_t destsize,
                        const char *src, ssize_t srclen,
                        pg_locale_t locale); /* required */
    size_t (*strnxfrm_prefix) (char *dest, size_t destsize,
                               const char *src, ssize_t srclen,
                               pg_locale_t locale); /* optional */
    bool strxfrm_is_safe;
};

A collation OID becomes a handle in create_pg_locale(), which reads the pg_collation row, branches on collprovider, and calls the matching constructor:

// create_pg_locale — src/backend/utils/adt/pg_locale.c
collform = (Form_pg_collation) GETSTRUCT(tp);
if (collform->collprovider == COLLPROVIDER_BUILTIN)
    result = create_pg_locale_builtin(collid, context);
else if (collform->collprovider == COLLPROVIDER_ICU)
    result = create_pg_locale_icu(collid, context);
else if (collform->collprovider == COLLPROVIDER_LIBC)
    result = create_pg_locale_libc(collid, context);
else
    PGLOCALE_SUPPORT_ERROR(collform->collprovider); /* shouldn't happen */

The provider codes are single characters, stored as a char column so the catalog stays compact:

// pg_collation.h — provider codes
#define COLLPROVIDER_DEFAULT 'd'
#define COLLPROVIDER_BUILTIN 'b'
#define COLLPROVIDER_ICU 'i'
#define COLLPROVIDER_LIBC 'c'

COLLPROVIDER_DEFAULT ('d') never appears on a real collation row that gets a handle — it is the sentinel for “use the database default,” and the dispatcher reaches it through DEFAULT_COLLATION_OID instead (see the cache section). The three concrete providers are libc, ICU, and builtin.

Every public string operation follows the same dispatch shape. pg_strlower is representative:

// pg_strlower — src/backend/utils/adt/pg_locale.c
size_t
pg_strlower(char *dst, size_t dstsize, const char *src, ssize_t srclen,
            pg_locale_t locale)
{
    if (locale->provider == COLLPROVIDER_BUILTIN)
        return strlower_builtin(dst, dstsize, src, srclen, locale);
#ifdef USE_ICU
    else if (locale->provider == COLLPROVIDER_ICU)
        return strlower_icu(dst, dstsize, src, srclen, locale);
#endif
    else if (locale->provider == COLLPROVIDER_LIBC)
        return strlower_libc(dst, dstsize, src, srclen, locale);
    else
        PGLOCALE_SUPPORT_ERROR(locale->provider);
    return 0; /* keep compiler quiet */
}

Note #ifdef USE_ICU: ICU is a build-time option. A server compiled without ICU still understands the 'i' code in the catalog (so pg_dump/restore of ICU collations does not crash) but errors cleanly if asked to actually use one. The comparison and transform entry points are even thinner — they bounce straight through the method table without a provider switch:

// pg_strncoll / pg_strnxfrm — src/backend/utils/adt/pg_locale.c
int
pg_strncoll(const char *arg1, ssize_t len1, const char *arg2, ssize_t len2,
            pg_locale_t locale)
{
    return locale->collate->strncoll(arg1, len1, arg2, len2, locale);
}

size_t
pg_strnxfrm(char *dest, size_t destsize, const char *src, ssize_t srclen,
            pg_locale_t locale)
{
    return locale->collate->strnxfrm(dest, destsize, src, srclen, locale);
}

That is the whole abstraction: provider selects the constructor; the constructor installs a collate_methods table; comparison and transform go through the table; case mapping goes through a per-function provider switch.
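The shape of that abstraction can be sketched in a few lines of standalone C. Everything here is a hypothetical toy (`toy_locale`, `toy_methods`, and the provider codes 'b' for byte order and 'f' for case-folded order are invented for illustration); it only mirrors the structure described above, not PostgreSQL's actual code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_locale;

/* Method table: one instance per comparator implementation. */
struct toy_methods
{
    int (*compare) (const char *a, const char *b, const struct toy_locale *loc);
};

/* Resolved handle: provider code plus the installed method table. */
struct toy_locale
{
    char provider;                      /* 'b' = bytes, 'f' = folded */
    const struct toy_methods *methods;  /* set by the constructor */
};

int compare_bytes(const char *a, const char *b, const struct toy_locale *loc)
{
    (void) loc;
    return strcmp(a, b);
}

int compare_folded(const char *a, const char *b, const struct toy_locale *loc)
{
    (void) loc;
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;           /* both strings ended together */
    }
}

const struct toy_methods bytes_methods = { compare_bytes };
const struct toy_methods folded_methods = { compare_folded };

/* Constructor: the only place that branches on the provider code.
 * Returns 0 on success, -1 for an unknown provider. */
int toy_locale_init(struct toy_locale *loc, char provider)
{
    assert(loc != NULL);
    loc->provider = provider;
    switch (provider)
    {
        case 'b': loc->methods = &bytes_methods;  return 0;
        case 'f': loc->methods = &folded_methods; return 0;
        default:  loc->methods = NULL;            return -1;
    }
}

/* Call site: no provider switch, just an indirect call through the table. */
int toy_compare(const struct toy_locale *loc, const char *a, const char *b)
{
    return loc->methods->compare(a, b, loc);
}
```

The point of the split is that `toy_compare` never changes when a provider is added; only the constructor learns the new code.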

flowchart TD
  A["caller: ORDER BY / index build / texteq<br/>has collation OID"] --> B["pg_newlocale_from_collation(collid)"]
  B --> C{"collid?"}
  C -->|"DEFAULT_COLLATION_OID"| D["return default_locale<br/>(set at init_database_collation)"]
  C -->|"C_COLLATION_OID"| E["return &c_locale<br/>(static, provider=LIBC, collate_is_c)"]
  C -->|"other"| F{"in CollationCache?"}
  F -->|"yes"| G["return cached pg_locale_t"]
  F -->|"no"| H["create_pg_locale(collid)"]
  H --> I{"collprovider"}
  I -->|"'b' BUILTIN"| J["create_pg_locale_builtin"]
  I -->|"'i' ICU"| K["create_pg_locale_icu"]
  I -->|"'c' LIBC"| L["create_pg_locale_libc"]
  J --> M["install collate_methods + info.builtin"]
  K --> M2["install collate_methods + info.icu.ucol"]
  L --> M3["install collate_methods + info.lt"]
  M --> N["cache + version check"]
  M2 --> N
  M3 --> N
  N --> G

Figure 1 — Resolving a collation OID to a pg_locale_t. The two fast paths (DEFAULT_COLLATION_OID, C_COLLATION_OID) short-circuit before any catalog access; everything else goes through the per-backend cache and, on a miss, the provider-dispatched constructor. The version check (Figure 3) runs inside create_pg_locale after the constructor returns.
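The resolution flow in Figure 1 can be imitated with a small standalone sketch. All names and OID values below (`toy_resolve`, `TOY_DEFAULT_OID`, `TOY_C_OID`, the fixed-size array cache) are hypothetical stand-ins; the real cache is a hash table, and construction involves catalog access rather than a counter.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DEFAULT_OID 100u    /* hypothetical stand-ins for the two */
#define TOY_C_OID       950u    /* special-cased collation OIDs */
#define TOY_CACHE_SIZE  8

struct toy_handle
{
    unsigned oid;
};

struct toy_handle toy_default_handle = { TOY_DEFAULT_OID };
struct toy_handle toy_c_handle = { TOY_C_OID };

struct toy_handle toy_cache[TOY_CACHE_SIZE];
size_t toy_cache_used = 0;
struct toy_handle *toy_last = NULL;     /* one-entry fast path */
int toy_constructions = 0;              /* counts "expensive" builds */

/* Resolve an OID to a handle; returns NULL only if the toy cache is full. */
struct toy_handle *toy_resolve(unsigned oid)
{
    size_t i;

    /* Fast paths: no cache lookup, no construction. */
    if (oid == TOY_DEFAULT_OID)
        return &toy_default_handle;
    if (oid == TOY_C_OID)
        return &toy_c_handle;

    /* Repeated-collation case: same OID as the previous call. */
    if (toy_last != NULL && toy_last->oid == oid)
        return toy_last;

    /* General cache lookup. */
    for (i = 0; i < toy_cache_used; i++)
    {
        if (toy_cache[i].oid == oid)
        {
            toy_last = &toy_cache[i];
            return toy_last;
        }
    }

    assert(toy_cache_used <= TOY_CACHE_SIZE);
    if (toy_cache_used == TOY_CACHE_SIZE)
        return NULL;            /* toy limit; a real cache would grow */

    /* Miss: "construct" the handle once and remember it. */
    toy_constructions++;
    toy_cache[toy_cache_used].oid = oid;
    toy_last = &toy_cache[toy_cache_used++];
    return toy_last;
}
```

Resolving the same OID repeatedly bumps `toy_constructions` only once, which is the property that makes an expensive handle constructor affordable on a hot comparison path.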

The libc provider: lc_collate / lc_ctype fixed at CREATE DATABASE

The libc provider is the historical default and the one tied to the LC_COLLATE / LC_CTYPE settings. The file-header comment in pg_locale.c states the central constraint:

// pg_locale.c — file header comment
/*----------
 * Here is how the locale stuff is handled: LC_COLLATE and LC_CTYPE
 * are fixed at CREATE DATABASE time, stored in pg_database, and cannot
 * be changed. Thus, the effects of strcoll(), strxfrm(), isupper(),
 * toupper(), etc. are always in the same fixed locale.
 * ...
 *----------
 */

LC_COLLATE and LC_CTYPE are not GUCs you can SET per session — they are baked into pg_database.datcollate / datctype when the database is created, because a mutable collation would make every existing B-tree index ambiguous. (By contrast LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME are runtime GUCs, handled separately in this same file by the check_locale_* / assign_locale_* hooks and the PGLC_localeconv / cache_locale_time caches — but those only affect formatting, never sort order.)

The libc constructor pulls the two locale strings from the catalog and builds an OS locale_t:

// create_pg_locale_libc — src/backend/utils/adt/pg_locale_libc.c
loc = make_libc_collator(collate, ctype);
result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
result->provider = COLLPROVIDER_LIBC;
result->deterministic = true; /* libc is always deterministic */
result->collate_is_c = (strcmp(collate, "C") == 0) ||
                       (strcmp(collate, "POSIX") == 0);
result->ctype_is_c = (strcmp(ctype, "C") == 0) ||
                     (strcmp(ctype, "POSIX") == 0);
result->info.lt = loc;
if (!result->collate_is_c)
{
#ifdef WIN32
    if (GetDatabaseEncoding() == PG_UTF8)
        result->collate = &collate_methods_libc_win32_utf8;
    else
#endif
        result->collate = &collate_methods_libc;
}

Two things stand out. First, deterministic is hard-wired true: libc has no nondeterministic mode. Second, for the C/POSIX locale make_libc_collator returns NULL and result->collate stays NULL: the engine never calls strcoll_l for C but uses raw memcmp, which is both faster and immune to OS collation drift. Under the hood the libc methods call strcoll_l / strxfrm_l, the locale-parameterized forms of strcoll / strxfrm that take an explicit locale_t instead of reading the process-wide locale.
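For readers unfamiliar with the locale-parameterized libc calls, here is a minimal sketch using the POSIX.1-2008 functions `newlocale`, `strcoll_l`, and `freelocale`. The wrapper name `compare_in_locale` is hypothetical, and the feature-test macro is an assumption about glibc-style headers (on macOS the `_l` declarations live in `<xlocale.h>` instead).

```c
#define _POSIX_C_SOURCE 200809L     /* expose newlocale / strcoll_l on glibc */
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Compare a and b under the named LC_COLLATE locale without touching the
 * process-wide locale. Stores the comparison result in *result and
 * returns 0, or returns -1 if the locale cannot be created.
 */
int compare_in_locale(const char *locale_name, const char *a, const char *b,
                      int *result)
{
    locale_t loc;

    assert(result != NULL);
    loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;              /* e.g. locale not installed on this host */
    *result = strcoll_l(a, b, loc);
    freelocale(loc);
    return 0;
}
```

Creating and freeing the `locale_t` on every call, as this sketch does, is exactly the cost a per-backend handle cache avoids; a long-lived caller would build the handle once and reuse it.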

The ICU provider: BCP-47 language tags and nondeterministic collations

ICU is the modern, OS-independent provider built on Unicode CLDR data. Its locale names are BCP-47 language tags (en-US, und for the root), canonicalized before they reach ucol_open():

// icu_language_tag — src/backend/utils/adt/pg_locale.c (condensed)
char *
icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
    /* ... grow buffer in a loop ... */
    uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
    /* ... on U_FAILURE, ereport(elevel) or return NULL ... */
    return langtag;
#else
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("ICU is not supported in this build")));
    return NULL;
#endif
}

Canonicalization (uloc_toLanguageTag) does “level-2 canonicalization”: it accepts POSIX-ish or .NET-ish locale spellings and produces one consistent language tag, so CREATE COLLATION ... (locale = 'en_US') canonicalizes to en-US, the same language tag that sits behind the predefined en-US-x-icu collation. icu_validate_locale() then makes a best-effort check that the language exists, and finally ucol_open() builds the live UCollator.

ICU is the only provider that supports nondeterministic collations, because it is the only one that can compare at a UCA level where distinct byte strings legitimately tie. DefineCollation enforces this:

// DefineCollation — src/backend/commands/collationcmds.c
/*
 * Nondeterministic collations are currently only supported with ICU
 * because that's the only case where it can actually make a
 * difference. ...
 */
if (!collisdeterministic && collprovider != COLLPROVIDER_ICU)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("nondeterministic collations not supported with this provider")));

The builtin provider: C, C.UTF-8, and PG_UNICODE_FAST

The builtin provider (added so PostgreSQL has a fast, version-stable Unicode option that does not depend on the host libc or a linked ICU) accepts exactly three locale names, validated and canonicalized in pg_locale.c:

// builtin_validate_locale — src/backend/utils/adt/pg_locale.c
if (strcmp(locale, "C") == 0)
    canonical_name = "C";
else if (strcmp(locale, "C.UTF-8") == 0 || strcmp(locale, "C.UTF8") == 0)
    canonical_name = "C.UTF-8";
else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
    canonical_name = "PG_UNICODE_FAST";
if (!canonical_name)
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("invalid locale name \"%s\" for builtin provider",
                    locale)));

Each builtin locale carries an encoding constraint, returned by builtin_locale_encoding: "C" is encoding-agnostic (-1), while "C.UTF-8" and "PG_UNICODE_FAST" require UTF-8. The ordering is pure code-point order (memcmp on UTF-8 bytes gives code-point order), so it is fast and does not change across PostgreSQL releases, because there is no external library to drift. PG_UNICODE_FAST differs from C.UTF-8 in its ctype behavior: it applies full Unicode case mapping (mappings that may change a string's length, such as ß uppercasing to SS) while keeping code-point sort order, whereas C.UTF-8 applies only Unicode simple case mapping (one code point to one code point) and plain C maps ASCII only. The casemap_full flag in the builtin member of pg_locale_struct carries that distinction.
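The claim that memcmp on UTF-8 bytes yields code-point order follows from how UTF-8 assigns lead bytes, and it can be checked with a short standalone sketch. The function names here (`utf8_encode`, `bytes_cmp`, `byte_order_matches_codepoint_order`) are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8. Returns the byte count (1-4), or 0 if
 * the value is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Byte-wise comparison: memcmp over the common prefix, then length. */
int bytes_cmp(const unsigned char *a, size_t alen,
              const unsigned char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    int r = memcmp(a, b, n);

    if (r != 0)
        return (r < 0) ? -1 : 1;
    if (alen != blen)
        return (alen < blen) ? -1 : 1;
    return 0;
}

/* Returns 1 when the byte order of the two encodings agrees with the
 * numeric order of the code points, 0 otherwise. */
int byte_order_matches_codepoint_order(uint32_t x, uint32_t y)
{
    unsigned char bx[4], by[4];
    size_t nx = utf8_encode(x, bx);
    size_t ny = utf8_encode(y, by);
    int expected = (x > y) - (x < y);

    assert(nx > 0 && ny > 0);   /* both must be valid scalar values */
    return bytes_cmp(bx, nx, by, ny) == expected;
}
```

The boundaries between 1-, 2-, 3-, and 4-byte encodings are the interesting cases: a longer encoding always starts with a larger lead byte, so the first byte alone settles the comparison.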

Deterministic vs. nondeterministic equality

Determinism is where collation stops being “just sorting” and starts changing the meaning of =. The comparison core, varstr_cmp, calls the provider and then breaks ties only when the collation is deterministic:

// varstr_cmp tie-break — src/backend/utils/adt/varlena.c (condensed)
result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
/* Break tie if necessary. */
if (result == 0 && mylocale->deterministic)
{
    result = memcmp(arg1, arg2, Min(len1, len2));
    if ((result == 0) && (len1 != len2))
        result = (len1 < len2) ? -1 : 1;
}

For a deterministic collation, two strings are equal iff they are byte-identical: the cultural comparison decides ordering, but any residual tie is resolved by raw bytes, so = reduces to memcmp. texteq exploits this — for a deterministic locale it skips strcoll entirely and does a bitwise compare:

// texteq — src/backend/utils/adt/varlena.c (condensed)
mylocale = pg_newlocale_from_collation(collid);
if (mylocale->deterministic)
{
    /* ... equality is pure length check + memcmp, no strcoll ... */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        /* equal */ ;
}
else
{
    /* must run the collation: 'café' may equal 'café' */
}

For a nondeterministic collation the tie-break is skipped: two strings that compare equal at the chosen UCA level are equal even if their bytes differ. That makes case-insensitive or accent-insensitive unique indexes possible, but it disables optimizations that assume byte-equal-iff-equal — PostgreSQL refuses deduplication, abbreviated keys are unsafe, and some substring operations error out ("nondeterministic collations are not supported for substring searches").
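The effect of the determinism flag on equality can be reproduced with a toy comparator. In this sketch `toy_provider_cmp` is a hypothetical stand-in for the provider (ASCII case-insensitive order rather than real UCA rules), and `toy_varstr_cmp` follows the same two-step shape as the excerpt above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the provider comparison: ASCII case-insensitive order.
 * Hypothetical; a real provider would apply its collation rules here. */
int toy_provider_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    size_t i;

    for (i = 0; i < n; i++)
    {
        int ca = tolower((unsigned char) a[i]);
        int cb = tolower((unsigned char) b[i]);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
    }
    if (alen != blen)
        return (alen < blen) ? -1 : 1;
    return 0;
}

/* Provider verdict first; raw bytes break ties only when deterministic. */
int toy_varstr_cmp(const char *a, size_t alen, const char *b, size_t blen,
                   bool deterministic)
{
    int result;

    assert(a != NULL && b != NULL);
    result = toy_provider_cmp(a, alen, b, blen);
    if (result == 0 && deterministic)
    {
        size_t n = (alen < blen) ? alen : blen;

        result = memcmp(a, b, n);
        if (result == 0 && alen != blen)
            result = (alen < blen) ? -1 : 1;
    }
    return result;
}
```

With `deterministic` false, "abc" and "ABC" compare equal, so a unique index over such a comparator would reject the second value; with it true, the same pair is ordered by bytes and both rows are allowed.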

flowchart TD
  A["text equality / comparison<br/>varstr_cmp"] --> B{"locale->deterministic?"}
  B -->|"true (libc, builtin,<br/>most ICU)"| C["run provider compare"]
  C --> D{"compares equal?"}
  D -->|"no"| E["return cultural order"]
  D -->|"yes (tie)"| F["memcmp tie-break<br/>then length"]
  F --> G["total order:<br/>equal iff byte-identical"]
  B -->|"false (ICU only)"| H["run provider compare<br/>at UCA level"]
  H --> I["equal means equal —<br/>no byte tie-break"]
  I --> J["disables: dedup,<br/>abbreviated keys,<br/>substring search"]

Figure 2 — The determinism branch. Deterministic collations (the default, all that libc and builtin support) append a memcmp tie-break so equality collapses to byte-identity, which keeps B-tree deduplication and abbreviated keys sound. Nondeterministic collations (ICU only) take the cultural verdict as final, enabling accent/case-insensitive uniqueness at the cost of those optimizations.

Sorting calls the comparator O(n log n) times, each re-parsing both strings. The classic optimization is to transform each string once into a byte blob whose memcmp reproduces the collation order, then sort the blobs. PostgreSQL exposes this as pg_strnxfrm, but not every provider’s transform is trustworthy on every platform, so the engine gates it behind a capability flag rather than assuming it works:

// pg_strxfrm_enabled — src/backend/utils/adt/pg_locale.c
bool
pg_strxfrm_enabled(pg_locale_t locale)
{
    /*
     * locale->collate->strnxfrm is still a required method, even if it may
     * have the wrong behavior, because the planner uses it for estimates in
     * some cases.
     */
    return locale->collate->strxfrm_is_safe;
}

The subtlety in the comment is load-bearing: strnxfrm is always present (the planner calls it for selectivity estimates, where a wrong answer only hurts plan quality, not correctness), but it is only used for actual sorting when strxfrm_is_safe is true. Historically some glibc versions produced strxfrm keys whose byte order disagreed with strcoll, so the flag lets PostgreSQL fall back to direct comparison on those platforms while still offering the transform where it is sound. The abbreviated-keys optimization in tuplesort (see postgres-tuplesort.md) is built on this transform.

A collation’s rules are external data that drifts: a new ICU/CLDR release, or the notorious glibc 2.28 reordering, can silently change the order of an existing B-tree index. PostgreSQL defends against this by recording the provider’s reported version string in pg_collation.collversion (and pg_database.datcollversion for the database default) at create time, then re-checking it every time the collation is first resolved in a backend.

get_collation_actual_version asks the live provider what version it reports now:

// get_collation_actual_version — src/backend/utils/adt/pg_locale.c
char *
get_collation_actual_version(char collprovider, const char *collcollate)
{
    char *collversion = NULL;

    if (collprovider == COLLPROVIDER_BUILTIN)
        collversion = get_collation_actual_version_builtin(collcollate);
#ifdef USE_ICU
    else if (collprovider == COLLPROVIDER_ICU)
        collversion = get_collation_actual_version_icu(collcollate);
#endif
    else if (collprovider == COLLPROVIDER_LIBC)
        collversion = get_collation_actual_version_libc(collcollate);
    return collversion;
}

create_pg_locale then compares the recorded version (read from the catalog row) against the actual version and warns on mismatch, with an errhint naming the exact remediation:

// create_pg_locale — src/backend/utils/adt/pg_locale.c (condensed)
collversionstr = TextDatumGetCString(datum); /* recorded */
actual_versionstr = get_collation_actual_version(collform->collprovider, ...);
if (strcmp(actual_versionstr, collversionstr) != 0)
    ereport(WARNING,
            (errmsg("collation \"%s\" has version mismatch", ...),
             errdetail("The collation in the database was created using version %s, "
                       "but the operating system provides version %s.",
                       collversionstr, actual_versionstr),
             errhint("Rebuild all objects affected by this collation and run "
                     "ALTER COLLATION %s REFRESH VERSION, ...")));

The mismatch is a warning, not an error — PostgreSQL cannot know whether the specific strings in your indexes are affected by the reordering, so it flags the risk and lets the DBA decide. The fix is two steps: REINDEX every affected index (rebuilding it under the new order), then ALTER COLLATION ... REFRESH VERSION to stamp the new version so the warning stops. AlterCollation implements the refresh by re-reading the actual version and writing it back into collversion:

// AlterCollation — src/backend/commands/collationcmds.c (condensed)
newversion = get_collation_actual_version(collForm->collprovider,
                                          TextDatumGetCString(datum));
if ((!oldversion && newversion) || (oldversion && !newversion))
    elog(ERROR, "invalid collation version change");
else if (oldversion && newversion && strcmp(newversion, oldversion) != 0)
{
    ereport(NOTICE, (errmsg("changing version from %s to %s",
                            oldversion, newversion)));
    /* ... write newversion into Anum_pg_collation_collversion ... */
}
else
    ereport(NOTICE, (errmsg("version has not changed")));

The default collation gets its version from pg_database, not pg_collation, so ALTER COLLATION on it is refused with a hint to use ALTER DATABASE ... REFRESH COLLATION VERSION instead. The SQL function pg_collation_actual_version(oid) exposes the live version to users for inspection, reading from pg_database for DEFAULT_COLLATION_OID and from pg_collation otherwise.
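The three-way decision in the refresh path (invalid change, changed, unchanged) is compact enough to isolate. The sketch below is a hypothetical restatement of that branch logic as a pure function; the enum and function names are invented for illustration, and a NULL version stands for "this provider reports no version".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum version_outcome
{
    VERSION_UNCHANGED,  /* nothing to write */
    VERSION_CHANGED,    /* store the new string */
    VERSION_INVALID     /* versioned <-> unversioned: refuse */
};

/* Classify a refresh given the recorded and the live version strings.
 * Either argument may be NULL when no version is available. */
enum version_outcome classify_version_refresh(const char *recorded,
                                               const char *actual)
{
    /* Exactly one side missing means the collation switched between
     * versioned and unversioned, which a refresh cannot express. */
    if ((recorded == NULL) != (actual == NULL))
        return VERSION_INVALID;
    if (recorded != NULL && strcmp(recorded, actual) != 0)
        return VERSION_CHANGED;
    return VERSION_UNCHANGED;
}
```

For example, `classify_version_refresh("2.27", "2.28")` yields `VERSION_CHANGED`, the case in which the new string is written back, while two NULLs count as unchanged.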

flowchart TD
  A["CREATE COLLATION / CREATE DATABASE"] --> B["get_collation_actual_version()"]
  B --> C["store version string in<br/>pg_collation.collversion /<br/>pg_database.datcollversion"]
  C --> D["index built under these rules"]
  D --> E["OS upgrade: glibc 2.28 /<br/>new ICU CLDR release"]
  E --> F["first use: create_pg_locale()<br/>recorded vs actual"]
  F --> G{"strcmp differs?"}
  G -->|"no"| H["silent — index trusted"]
  G -->|"yes"| I["WARNING: version mismatch<br/>errhint: REINDEX + REFRESH"]
  I --> J["DBA: REINDEX affected indexes"]
  J --> K["ALTER COLLATION ... REFRESH VERSION<br/>writes actual into collversion"]
  K --> H

Figure 3 — The collation version lifecycle. The version stamped at create time is the contract; a library upgrade can break it. The mismatch surfaces as a warning (PostgreSQL cannot prove which rows are affected), and the two-step REINDEX + REFRESH workflow restores the contract.

CREATE COLLATION: option parsing and provider selection

DefineCollation is where SQL syntax becomes catalog data. It parses the provider, locale / lc_collate / lc_ctype, deterministic, rules, and version options, then normalizes them per provider. The locale shorthand expands differently depending on provider — for libc it fills both lc_collate and lc_ctype; for ICU/builtin it fills colllocale:

// DefineCollation — src/backend/commands/collationcmds.c (condensed)
if (localeEl)
{
    if (collprovider == COLLPROVIDER_LIBC)
    {
        collcollate = defGetString(localeEl);
        collctype = defGetString(localeEl);
    }
    else
        colllocale = defGetString(localeEl);
}

The provider default is libc when none is given. ICU-only options (rules, nondeterminism) are rejected for other providers, ICU locales are canonicalized to a language tag (unless in binary upgrade, which preserves the original string), and finally CollationCreate writes the row. After creation the code calls pg_newlocale_from_collation(newoid) once as a smoke test — “check that the locales can be loaded” — so a typo’d locale fails at CREATE COLLATION time rather than at first query.

A fresh database does not enumerate every OS locale by hand; pg_import_system_collations() (invoked by initdb and exposed as an SQL function) does it. On non-Windows it runs locale -a, normalizes each name (stripping .utf8-style encoding tags and creating a short alias), and calls CollationCreate with COLLPROVIDER_LIBC for each valid one:

// create_collation_from_locale — src/backend/commands/collationcmds.c (condensed)
collid = CollationCreate(locale, nspid, GetUserId(),
                         COLLPROVIDER_LIBC, true, enc,
                         locale, locale, NULL, NULL,
                         get_collation_actual_version(COLLPROVIDER_LIBC, locale),
                         true, true);

When built with ICU, the same function loops over uloc_getAvailable() and creates <langtag>-x-icu collations (the root locale sneaks in at index -1). Each is created with its actual version already stamped, so freshly imported collations start life version-consistent.
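The name-normalization step for libc locales (dropping a ".utf8"-style encoding tag to form the short alias) can be sketched as below. `strip_encoding_tag` is a hypothetical helper written for this illustration; it assumes a tag is a '.' followed by ASCII letters, digits, or '-', and the real normalize_libc_locale_name may differ in detail.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Copy old into dst, dropping any ".encoding" tag (a '.' followed by
 * letters, digits, or '-'). dst must have room for strlen(old) + 1 bytes.
 * Returns true if anything was removed.
 */
bool strip_encoding_tag(char *dst, const char *old)
{
    bool changed = false;

    assert(dst != NULL && old != NULL);
    while (*old)
    {
        if (*old == '.')
        {
            old++;                              /* skip the dot */
            while (isalnum((unsigned char) *old) || *old == '-')
                old++;                          /* skip the tag itself */
            changed = true;
        }
        else
            *dst++ = *old++;
    }
    *dst = '\0';
    return changed;
}
```

Under these assumptions "en_US.utf8" becomes "en_US", and a modifier after the tag survives, so "de_DE.UTF-8@euro" becomes "de_DE@euro". A name without a tag is copied unchanged and the function returns false, which tells the caller no alias is needed.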

The collation-provider machinery spans the catalog (pg_collation, pg_database), the runtime resolver (pg_locale.c plus three provider files), and the DDL layer (collationcmds.c). The symbols below are grouped by call-flow.

  • struct pg_locale_struct / pg_locale_t (in pg_locale.h) — the resolved handle: provider discriminant, the deterministic / collate_is_c / ctype_is_c flags, the collate method pointer, and the info union over builtin / libc locale_t / ICU UCollator.
  • struct collate_methods (in pg_locale.h) — the per-collation method table: strncoll (required), strnxfrm (required), strnxfrm_prefix (optional), and the strxfrm_is_safe flag.
  • COLLPROVIDER_DEFAULT / _BUILTIN / _ICU / _LIBC (in pg_collation.h) — the single-char provider codes 'd' / 'b' / 'i' / 'c'.
  • Form_pg_collation fields collprovider, collisdeterministic, collencoding, collcollate, collctype, colllocale, collicurules, collversion — the catalog columns the resolver reads.
  • Form_pg_database fields datlocprovider, datcollate, datctype, datlocale, datcollversion — the database-default locale, fixed at CREATE DATABASE.
  • pg_newlocale_from_collation (in pg_locale.c) — the public entry point; short-circuits DEFAULT_COLLATION_OID → default_locale and C_COLLATION_OID → &c_locale, then consults the cache.
  • c_locale (in pg_locale.c) — the static pg_locale_struct for C/POSIX, usable without catalog access.
  • collation_cache_entry / CollationCache / last_collation_cache_oid / last_collation_cache_locale (in pg_locale.c) — the per-backend simplehash cache plus a one-entry fast path for the repeated-collation case.
  • create_pg_locale (in pg_locale.c) — reads the catalog row, dispatches on collprovider to the provider constructor, then runs the version check.
  • init_database_collation (in pg_locale.c) — builds default_locale from pg_database at backend startup, dispatching on datlocprovider.
  • create_pg_locale_libc / make_libc_collator (in pg_locale_libc.c) — builds a locale_t via newlocale; sets deterministic = true and the collate_is_c / ctype_is_c flags; leaves collate NULL for C/POSIX.
  • create_pg_locale_icu (in pg_locale_icu.c) — opens a UCollator; the only constructor that can set deterministic = false.
  • create_pg_locale_builtin (in pg_locale_builtin.c) — installs the compiled-in Unicode tables for C / C.UTF-8 / PG_UNICODE_FAST.
  • pg_strcoll / pg_strncoll (in pg_locale.c) — comparison via locale->collate->strncoll.
  • pg_strnxfrm / pg_strxfrm / pg_strxfrm_enabled / pg_strnxfrm_prefix / pg_strxfrm_prefix_enabled (in pg_locale.c) — transform keys and their safety/availability gates.
  • pg_strlower / pg_strtitle / pg_strupper / pg_strfold (in pg_locale.c) — case mapping, each a provider switch (note pg_strfold falls back to strlower_libc for the libc provider).
  • varstr_cmp / text_cmp / texteq (in varlena.c) — the determinism tie-break and the deterministic fast path for equality.
  • builtin_validate_locale / builtin_locale_encoding (in pg_locale.c) — canonicalize and encoding-check the three builtin locale names.
  • icu_language_tag / icu_validate_locale (in pg_locale.c) — BCP-47 canonicalization and best-effort validation for ICU locales.
  • get_collation_actual_version (in pg_locale.c) — asks the live provider for its current version string.
  • DefineCollation (in collationcmds.c) — CREATE COLLATION: parse options, per-provider normalization, write the row, smoke-test the locale.
  • AlterCollation (in collationcmds.c) — ALTER COLLATION … REFRESH VERSION; refuses the default collation.
  • pg_collation_actual_version (in collationcmds.c) — SQL-exposed live version, reading pg_database or pg_collation.
  • pg_import_system_collations / create_collation_from_locale / normalize_libc_locale_name (in collationcmds.c) — bulk import of OS locales (libc via locale -a, ICU via uloc_getAvailable).
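The comparison path named in the list above (C fast path, provider comparator, determinism tie-break) can be sketched in a few lines. This is a minimal illustration, not the varstr_cmp source: sketch_locale, sketch_varstr_cmp, bytewise_cmp, and toy_ci_strcoll are hypothetical names, the strings are NUL-terminated rather than length-counted, and the toy comparator stands in for a real provider.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the pg_locale_struct fields used here. */
typedef struct sketch_locale
{
    bool deterministic;   /* when true, break comparator ties bytewise */
    bool collate_is_c;    /* C/POSIX: bypass the provider entirely */
    /* Provider comparator; NULL when collate_is_c (no method table). */
    int (*strcoll_fn)(const char *a, const char *b);
} sketch_locale;

/* Bytewise order; on a common prefix the shorter string sorts first. */
static int
bytewise_cmp(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    int r = memcmp(a, b, la < lb ? la : lb);

    if (r == 0 && la != lb)
        r = (la < lb) ? -1 : 1;
    return r;
}

/* Toy provider comparator (an assumption): ASCII case-insensitive order. */
static int
toy_ci_strcoll(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

static int
sketch_varstr_cmp(const sketch_locale *loc, const char *a, const char *b)
{
    int r;

    assert(loc != NULL);
    if (loc->collate_is_c)
        return bytewise_cmp(a, b);      /* never touches the provider */

    assert(loc->strcoll_fn != NULL);
    r = loc->strcoll_fn(a, b);

    /* Deterministic collations admit equality only for identical bytes. */
    if (r == 0 && loc->deterministic)
        r = bytewise_cmp(a, b);
    return r;
}
```

With the toy comparator, "abc" and "ABC" compare equal when deterministic is false; with deterministic set to true the byte tie-break orders "ABC" first. The collate_is_c path gives the plain byte order in which "Z" sorts before "a".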

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| struct collate_methods | src/include/utils/pg_locale.h | 73 |
| struct pg_locale_struct | src/include/utils/pg_locale.h | 115 |
| COLLPROVIDER_DEFAULT | src/include/catalog/pg_collation.h | 70 |
| COLLPROVIDER_BUILTIN | src/include/catalog/pg_collation.h | 71 |
| COLLPROVIDER_ICU | src/include/catalog/pg_collation.h | 72 |
| COLLPROVIDER_LIBC | src/include/catalog/pg_collation.h | 73 |
| c_locale | src/backend/utils/adt/pg_locale.c | 137 |
| collation_cache_entry | src/backend/utils/adt/pg_locale.c | 146 |
| last_collation_cache_oid | src/backend/utils/adt/pg_locale.c | 176 |
| check_locale | src/backend/utils/adt/pg_locale.c | 301 |
| create_pg_locale | src/backend/utils/adt/pg_locale.c | 1075 |
| init_database_collation | src/backend/utils/adt/pg_locale.c | 1154 |
| pg_newlocale_from_collation | src/backend/utils/adt/pg_locale.c | 1196 |
| get_collation_actual_version | src/backend/utils/adt/pg_locale.c | 1254 |
| pg_strlower | src/backend/utils/adt/pg_locale.c | 1270 |
| pg_strcoll | src/backend/utils/adt/pg_locale.c | 1353 |
| pg_strncoll | src/backend/utils/adt/pg_locale.c | 1373 |
| pg_strxfrm_enabled | src/backend/utils/adt/pg_locale.c | 1387 |
| pg_strnxfrm | src/backend/utils/adt/pg_locale.c | 1428 |
| builtin_locale_encoding | src/backend/utils/adt/pg_locale.c | 1486 |
| builtin_validate_locale | src/backend/utils/adt/pg_locale.c | 1510 |
| icu_language_tag | src/backend/utils/adt/pg_locale.c | 1550 |
| icu_validate_locale | src/backend/utils/adt/pg_locale.c | 1608 |
| create_pg_locale_libc | src/backend/utils/adt/pg_locale_libc.c | 421 |
| make_libc_collator | src/backend/utils/adt/pg_locale_libc.c | 488 |
| varstr_cmp (tie-break) | src/backend/utils/adt/varlena.c | 1698 |
| texteq | src/backend/utils/adt/varlena.c | 1738 |
| DefineCollation | src/backend/commands/collationcmds.c | 53 |
| AlterCollation | src/backend/commands/collationcmds.c | 424 |
| pg_collation_actual_version | src/backend/commands/collationcmds.c | 507 |
| normalize_libc_locale_name | src/backend/commands/collationcmds.c | 596 |
| create_collation_from_locale | src/backend/commands/collationcmds.c | 696 |
| pg_import_system_collations | src/backend/commands/collationcmds.c | 836 |

  • There are exactly four provider codes and three concrete providers. Verified in pg_collation.h: COLLPROVIDER_DEFAULT 'd', COLLPROVIDER_BUILTIN 'b', COLLPROVIDER_ICU 'i', COLLPROVIDER_LIBC 'c'. 'd' is a sentinel (“use the database default”) and never reaches a provider constructor: create_pg_locale dispatches only on the other three and reports an error through PGLOCALE_SUPPORT_ERROR for any other code.

  • Nondeterministic collations are ICU-only, enforced at CREATE time. Verified in DefineCollation (collationcmds.c): if (!collisdeterministic && collprovider != COLLPROVIDER_ICU) ereport(ERROR, ...). The libc constructor hard-wires result->deterministic = true; the builtin constructor does likewise. Only create_pg_locale_icu can produce a non-deterministic handle.

  • The C/POSIX path never calls strcoll and never carries a method table. Verified in create_pg_locale_libc and make_libc_collator (pg_locale_libc.c): for "C"/"POSIX", make_libc_collator returns NULL, collate_is_c is set true, and result->collate is left NULL. Comparison falls back to memcmp, which is why the C locale is immune to OS collation drift.

  • The collation version mismatch is a WARNING, not an ERROR. Verified in create_pg_locale (pg_locale.c): the strcmp(actual, recorded) != 0 branch calls ereport(WARNING, ...) with an errhint naming REINDEX and ALTER COLLATION ... REFRESH VERSION. PostgreSQL cannot prove which rows a reordering affects, so it flags the risk rather than blocking access.

  • LC_COLLATE and LC_CTYPE are fixed at CREATE DATABASE, not GUCs. Verified by the pg_locale.c file-header comment and by the libc constructor reading datcollate/datctype from pg_database. The runtime locale GUCs lc_messages, lc_monetary, lc_numeric, and lc_time (backed by the locale_messages, locale_monetary, locale_numeric, and locale_time variables in this same file) affect only formatting (PGLC_localeconv, cache_locale_time), never sort order.

  • strnxfrm is always present but only used when strxfrm_is_safe. Verified in pg_strxfrm_enabled (pg_locale.c) and its comment: the method pointer is required because the planner calls it for estimates, but actual sorting consults the strxfrm_is_safe flag first. This flag is the hook through which a broken strxfrm in some glibc versions is disabled.

  • The builtin provider accepts exactly three locale names. Verified in builtin_validate_locale (pg_locale.c): C, C.UTF-8 (alias C.UTF8), and PG_UNICODE_FAST; anything else errors. C is encoding-agnostic (-1); the other two require UTF-8 per builtin_locale_encoding.

  • The per-backend collation cache has a one-entry fast path. Verified in pg_newlocale_from_collation (pg_locale.c): last_collation_cache_oid / last_collation_cache_locale short-circuit the simplehash lookup when the same collation is requested consecutively — the common case inside one query.
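The one-entry fast path in the last item can be pictured as follows. This is a hedged sketch of the pattern only: sketch_oid, sketch_locale, slow_lookup, and sketch_locale_from_collation are hypothetical names, a linear scan over a small array stands in for the simplehash table, and the counter exists purely to make the short-circuit observable.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int sketch_oid;            /* stand-in for Oid; 0 acts as InvalidOid */
typedef struct sketch_locale { sketch_oid collid; } sketch_locale;

#define SKETCH_TABLE_SIZE 8

/* Stand-in for the per-backend hash table of built locale handles. */
static sketch_locale sketch_table[SKETCH_TABLE_SIZE];
static int sketch_table_used = 0;
static int sketch_slow_lookups = 0;         /* counts trips through the slow path */

/* One-entry cache: the most recently returned collation and its handle. */
static sketch_oid last_oid = 0;
static sketch_locale *last_locale = NULL;

/* Slow path: find the entry, creating it on first use. */
static sketch_locale *
slow_lookup(sketch_oid collid)
{
    sketch_slow_lookups++;
    for (int i = 0; i < sketch_table_used; i++)
        if (sketch_table[i].collid == collid)
            return &sketch_table[i];

    assert(sketch_table_used < SKETCH_TABLE_SIZE);
    sketch_table[sketch_table_used].collid = collid;
    return &sketch_table[sketch_table_used++];
}

static sketch_locale *
sketch_locale_from_collation(sketch_oid collid)
{
    assert(collid != 0);

    /* Fast path: same collation as the previous call, skip the table. */
    if (collid == last_oid)
        return last_locale;

    last_locale = slow_lookup(collid);
    last_oid = collid;
    return last_locale;
}
```

Consecutive requests for the same collation cost one integer comparison. Alternating between two collations defeats the fast path and goes through the table lookup on every call.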

  1. What exactly distinguishes PG_UNICODE_FAST from C.UTF-8 at runtime? Both are UTF-8 builtin locales with code-point sort order. The difference is in ctype/case mapping (casemap_full in the builtin info union: full Unicode case mapping vs. simple, one-to-one Unicode case mapping), but the precise table sources and the list of case-mapping functions that branch on casemap_full were not traced here. Investigation path: read pg_locale_builtin.c and the strlower_builtin / strupper_builtin / strtitle_builtin / strfold_builtin implementations, plus the generated Unicode tables under src/common/unicode/.

  2. How does strxfrm_is_safe get set to false in practice on REL_18? The flag exists and is honored, but which providers/platforms actually clear it (and whether any do by default on a modern glibc) was not confirmed from the constructors read here. Investigation path: grep strxfrm_is_safe across pg_locale_libc.c / pg_locale_icu.c / pg_locale_builtin.c and trace the collate_methods table initializers.
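As general Unicode background for question 1: full case mapping can change the length of a string, while simple (one-to-one) mapping cannot. The standard example is U+00DF (ß), which has no simple uppercase but maps to "SS" under the full rules in SpecialCasing. The sketch below shows that difference on a tiny hand-written table. simple_upper and sketch_upper are hypothetical helpers, and nothing here is taken from PostgreSQL's generated tables or from the code paths that branch on casemap_full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-to-one ("simple") uppercase for a tiny, hand-picked subset of Unicode. */
static uint32_t
simple_upper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    if (cp == 0x00E4)               /* U+00E4 (a with diaeresis) -> U+00C4 */
        return 0x00C4;
    return cp;                      /* includes U+00DF: no simple uppercase */
}

/*
 * Uppercase n code points into out (capacity cap) and return the output
 * length. With full = true, U+00DF expands to the two code points "SS".
 */
static size_t
sketch_upper(const uint32_t *in, size_t n, bool full,
             uint32_t *out, size_t cap)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (full && in[i] == 0x00DF)
        {
            assert(len + 2 <= cap);
            out[len++] = 'S';
            out[len++] = 'S';
        }
        else
        {
            assert(len + 1 <= cap);
            out[len++] = simple_upper(in[i]);
        }
    }
    return len;
}
```

Applied to the six code points of "straße", the simple variant returns six code points with U+00DF unchanged, and the full variant returns seven. Any caller sizing an output buffer therefore has to allow for growth under full mapping.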

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • Unicode Collation Algorithm (UTS #10) and CLDR. The multi-level primary/secondary/tertiary comparison PostgreSQL’s ICU provider relies on is specified in Unicode Technical Standard #10; the locale-specific tailorings live in CLDR. A focused note mapping UCA levels onto PostgreSQL’s deterministic/nondeterministic distinction and the rules (tailoring) option would make the ICU provider’s behavior concrete.

  • The glibc 2.28 collation break. glibc 2.28, released in 2018, changed the sort order of many locales. B-tree indexes built under the old order became silently inconsistent on upgraded hosts, which motivated much of PostgreSQL’s version-tracking machinery and the push toward ICU and the builtin provider. A short case study of that incident would ground the collversion mechanism in the failure it exists to catch.

  • ICU vs. libc stability tradeoff. ICU pins collation behavior to a linked library version independent of the host OS, but introduces its own versioning surface. SQL Server (Windows collations + UCA), Oracle (NLS_SORT / linguistic collations), and MySQL (per-character-set collations, utf8mb4_0900_ai_ci) each pick a different point on the coverage/stability curve. A side-by-side comparison would clarify what PostgreSQL’s three-provider design buys.

  • Encoding vs. collation separation. This doc deliberately defers the encoding side; see postgres-encoding.md for pg_wchar, server/client encoding conversion, and how collencoding constrains which collations are usable in a given database. The two subsystems meet at pg_get_encoding_from_locale and check_encoding_locale_matches.

  • Where collations are consumed. The comparator and transform keys surface in sorting (postgres-tuplesort.md, abbreviated keys), B-tree ordering (postgres-nbtree.md, deduplication disabled for nondeterministic collations), and the text type functions (postgres-datatypes-adt.md). A follow-up could trace one ORDER BY end-to-end from parse-time collation assignment to the strncoll call.
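To make the UCA item above concrete, here is a toy multi-level comparison. It is a sketch under strong assumptions: the weights are invented for a handful of code points (real weights come from DUCET and CLDR), every code point maps to exactly one collation element (no contractions or expansions), and toy_weights, toy_weights_for, level_weight, and toy_uca_cmp are hypothetical names unrelated to ICU's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct toy_weights
{
    int primary;    /* base letter */
    int secondary;  /* accent */
    int tertiary;   /* case */
} toy_weights;

/* Invented weights for ASCII letters plus U+00E9 and U+00C9 (e with acute). */
static toy_weights
toy_weights_for(uint32_t cp)
{
    /* Fallback: every other code point gets its own primary weight. */
    toy_weights w = { (int) cp + 1000, 0, 0 };

    if (cp >= 'A' && cp <= 'Z')
    {
        w.primary = (int) (cp - 'A') + 1;
        w.tertiary = 1;             /* uppercase sorts after lowercase */
    }
    else if (cp >= 'a' && cp <= 'z')
        w.primary = (int) (cp - 'a') + 1;
    else if (cp == 0x00E9 || cp == 0x00C9)
    {
        w.primary = 'e' - 'a' + 1;  /* same base letter as 'e' */
        w.secondary = 1;            /* accented sorts after unaccented */
        w.tertiary = (cp == 0x00C9) ? 1 : 0;
    }
    return w;
}

static int
level_weight(toy_weights w, int level)
{
    return level == 1 ? w.primary : level == 2 ? w.secondary : w.tertiary;
}

/* strength: 1 = primary only, 2 = plus accents, 3 = plus case. */
static int
toy_uca_cmp(const uint32_t *a, size_t na,
            const uint32_t *b, size_t nb, int strength)
{
    size_t n = na < nb ? na : nb;

    assert(strength >= 1 && strength <= 3);
    for (int level = 1; level <= strength; level++)
    {
        /* A whole level is compared before the next one is consulted. */
        for (size_t i = 0; i < n; i++)
        {
            int wa = level_weight(toy_weights_for(a[i]), level);
            int wb = level_weight(toy_weights_for(b[i]), level);

            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
        if (na != nb)
            return na < nb ? -1 : 1;
    }
    return 0;   /* equal at the requested strength */
}
```

A result of 0 at the requested strength is what a nondeterministic collation reports as equality. A deterministic collation would add a byte comparison on top of that result, so only byte-identical strings compare equal.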

  • None. This document was synthesized directly from the REL_18 source tree (sources: [] in frontmatter).
  • src/backend/utils/adt/pg_locale.c — provider dispatch, handle cache, version check, builtin/ICU validation, runtime locale GUC hooks.
  • src/backend/commands/collationcmds.c — CREATE / ALTER COLLATION, version refresh, system-collation import.
  • src/include/utils/pg_locale.h — pg_locale_t, collate_methods, pg_locale_struct, the public string-operation prototypes.
  • src/include/catalog/pg_collation.h — COLLPROVIDER_* codes, the pg_collation catalog form.
  • src/include/catalog/pg_database.h — datlocprovider / datcollate / datctype / datlocale / datcollversion.
  • src/backend/utils/adt/pg_locale_libc.c — the libc provider constructor and make_libc_collator.
  • src/backend/utils/adt/varlena.c — varstr_cmp determinism tie-break, texteq deterministic fast path.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7th ed.) — the SQL COLLATE clause, collation as comparison-and-sort rules, and the encoding-vs-collation distinction. Captured at knowledge/research/dbms-general/database-system-concepts.md.
  • Unicode Technical Standard #10 (Unicode Collation Algorithm) and the CLDR data set — the basis for the ICU provider’s multi-level comparison and tailoring rules.
  • postgres-encoding.md — character encoding and the encoding/collation boundary.
  • postgres-datatypes-adt.md — the text type functions that consume these comparators.
  • postgres-tuplesort.md — abbreviated keys built on the transform machinery.
  • postgres-nbtree.md — B-tree ordering and deduplication’s interaction with determinism.