CUBRID Internationalization — Section Overview

What this section covers

This section covers the internationalisation primitives: horizontal capabilities that every string operator, every comparison, and every date-arithmetic call eventually passes through. There are exactly two, codeset-plus-collation and timezone. Both are infrastructure; they disappear into the rest of the engine, surfacing only as opaque per-record encoded IDs (a LANG_COLLATION index, a 32-bit TZ_ID) that the storage and query layers carry around without inspection.

This section used to be a wider catch-all (i18n-specialty) holding self-contained features that did not fit storage / query / server-architecture. Those have been redistributed to their natural homes:

JSON_TABLE — moved to Query Processing. It is a SCAN_TYPE in the executor’s scan_manager registry, sitting next to heap, list, and B+Tree scans. See cubrid-json-table.md.
SHOW commands — moved to a new System Catalog section alongside cubrid-catalog-manager. SHOW exposes server-internal runtime state through the same uniform SQL surface that the static catalog uses for schema. See cubrid-overview-system-catalog.md.
compactdb — moved to Utilities, alongside csql, cub-admin, loaddb, unloaddb. It is an offline SA-mode tool. See cubrid-utilities-misc.md for the utilities cluster.

What remains here is the textbook concept of i18n — locale-aware text and time, the two layers of “the database speaks more than one human convention”.

The two primitives

The two subsystems are independent of one another (charset code does not call into timezone code, or vice versa), but they share a single architectural pattern, which is why their docs sit together:

Compile external standards data into a CUBRID-built shared library. dlopen it at server boot. Pack the per-record state into a small fixed-width ID. Resolve the ID through the loaded library on the read path.

flowchart LR
    subgraph build["Build / install time"]
        LDML["LDML locale rules<br/>(per-locale XML)"]
        IANA["IANA tzdata<br/>(zone1970.tab,<br/>africa, asia, ...)"]
        GENL["genlocale binary"]
        MKTZ["make_tz binary"]
        CL["libcubrid_collations.so<br/>(UCA weight tables,<br/>per-codeset comparators)"]
        TZ["libcubrid_timezones.so<br/>(zone, offset-rule,<br/>DS-rule arrays)"]
        LDML --> GENL --> CL
        IANA --> MKTZ --> TZ
    end
    subgraph runtime["cub_server runtime"]
        BOOT["boot_sr → lang_init / tz_load_library"]
        LANG["LANG_COLLATION vtable<br/>fastcmp / strmatch / next_alpha_char"]
        TZID["TZ_ID 32 bits<br/>(zone, offset-rule, DS-rule)"]
        BOOT -->|dlopen| CL
        BOOT -->|dlopen| TZ
        CL --> LANG
        TZ --> TZID
    end
    LANG -.feeds.-> btree["B+Tree key compare<br/>· sort · hash · LIKE / =<br/>· every string scalar"]
    TZID -.feeds.-> dt["DATETIMETZ / TIMESTAMPTZ<br/>· tz_create_datetimetz<br/>· tz_explain_tz_id<br/>· every CAST · every date scalar"]

Figure 1 — i18n shared pattern. Both primitives compile external standards (LDML via genlocale, IANA tzdata via make_tz) into per-platform shared libraries that boot_sr dlopens at startup; charset-collation surfaces as a LANG_COLLATION vtable feeding B+Tree and string operators; timezone surfaces as a 32-bit TZ_ID feeding all date-arithmetic and CAST paths.

cubrid-charset-collation.md — text. Four codesets (binary, ISO-8859-1, EUC-KR, UTF-8); LDML locale rules compiled by genlocale into UCA weight tables shipped as a per-platform shared library; comparison dispatched through a function-pointer LANG_COLLATION vtable consumed by B+Tree, sort, hash, and every string scalar.
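The function-pointer dispatch described above can be pictured with a minimal C sketch. Everything prefixed `demo_` is hypothetical: the real LANG_COLLATION struct, its comparator signature, and its registry are documented in cubrid-charset-collation.md and will differ. The sketch only shows the shape of the idea, a small collation ID indexing a table of comparators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparator signature; not the real fastcmp prototype. */
typedef int (*demo_cmp_fn)(const unsigned char *a, size_t a_len,
                           const unsigned char *b, size_t b_len);

/* Hypothetical stand-in for a collation entry: a name plus one comparator. */
typedef struct {
    const char *name;
    demo_cmp_fn fastcmp;
} demo_collation;

/* Order two lengths; used to break ties when one string is a prefix of the other. */
static int demo_len_order(size_t a_len, size_t b_len)
{
    return (a_len > b_len) - (a_len < b_len);
}

/* Byte-wise comparison, normalised to -1 / 0 / 1. */
static int demo_cmp_binary(const unsigned char *a, size_t a_len,
                           const unsigned char *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    int r = memcmp(a, b, n);
    if (r != 0) {
        return (r > 0) - (r < 0);
    }
    return demo_len_order(a_len, b_len);
}

/* ASCII case-insensitive comparison, standing in for a locale-aware comparator. */
static int demo_cmp_ascii_ci(const unsigned char *a, size_t a_len,
                             const unsigned char *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    for (size_t i = 0; i < n; i++) {
        int ca = tolower(a[i]);
        int cb = tolower(b[i]);
        if (ca != cb) {
            return (ca > cb) - (ca < cb);
        }
    }
    return demo_len_order(a_len, b_len);
}

/* The per-record state is just an index into this table. */
static const demo_collation demo_collations[] = {
    { "demo_binary",   demo_cmp_binary },
    { "demo_ascii_ci", demo_cmp_ascii_ci },
};

#define DEMO_COLL_COUNT (sizeof demo_collations / sizeof demo_collations[0])

/* Callers (index, sort, hash) never branch on the collation; they only dispatch. */
int demo_compare(unsigned coll_id, const char *a, const char *b)
{
    assert(coll_id < DEMO_COLL_COUNT);
    const demo_collation *c = &demo_collations[coll_id];
    return c->fastcmp((const unsigned char *)a, strlen(a),
                      (const unsigned char *)b, strlen(b));
}
```

With this layout, `demo_compare(0, "Apple", "apple")` orders by raw bytes while `demo_compare(1, "Apple", "apple")` treats the two as equal, which is the kind of difference an index or sort has to respect without knowing which collation is in play.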

cubrid-timezone.md — time. Raw IANA tzdata files compiled by make_tz into a generated timezones.c and a shared library libcubrid_timezones.so; a 32-bit TZ_ID packs (zone, gmt-offset-rule, ds-rule) or a raw signed offset; tz_datetime_utc_conv resolves wall-clock to UTC honouring LOCAL_STD / LOCAL_WALL / UTC “AT” qualifiers and spring-forward / fall-back overlaps.
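To make the "zone triple or raw offset" packing concrete, here is a small C sketch of one way a 32-bit ID could hold either form. The bit positions, field widths, flag bit, and offset bias are all assumptions chosen for illustration, and every `demo_` name is hypothetical; the actual TZ_ID layout is the one described in cubrid-timezone.md.

```c
#include <assert.h>
#include <stdint.h>

/* All widths and positions below are illustrative assumptions, not the real TZ_ID layout. */
typedef uint32_t demo_tz_id;

#define DEMO_OFFSET_FLAG   0x80000000u /* bit 31: 1 = raw offset, 0 = zone triple */
#define DEMO_ZONE_SHIFT    16u
#define DEMO_ZONE_MASK     0x7FFFu     /* 15 bits for the zone index */
#define DEMO_OFFRULE_SHIFT 8u
#define DEMO_RULE_MASK     0xFFu       /* 8 bits for each rule index */
#define DEMO_OFFSET_BIAS   86400       /* keeps the stored offset non-negative */

/* Pack (zone, offset-rule, ds-rule) into one ID. */
demo_tz_id demo_tz_pack_zone(uint32_t zone, uint32_t offset_rule, uint32_t ds_rule)
{
    assert(zone <= DEMO_ZONE_MASK);
    assert(offset_rule <= DEMO_RULE_MASK && ds_rule <= DEMO_RULE_MASK);
    return (zone << DEMO_ZONE_SHIFT) | (offset_rule << DEMO_OFFRULE_SHIFT) | ds_rule;
}

/* Pack a signed UTC offset in seconds, stored biased so it fits an unsigned field. */
demo_tz_id demo_tz_pack_offset(int32_t seconds)
{
    assert(seconds >= -DEMO_OFFSET_BIAS && seconds <= DEMO_OFFSET_BIAS);
    return DEMO_OFFSET_FLAG | (uint32_t)(seconds + DEMO_OFFSET_BIAS);
}

int demo_tz_is_offset(demo_tz_id id)
{
    return (id & DEMO_OFFSET_FLAG) != 0;
}

uint32_t demo_tz_zone(demo_tz_id id)
{
    return (id >> DEMO_ZONE_SHIFT) & DEMO_ZONE_MASK;
}

uint32_t demo_tz_offset_rule(demo_tz_id id)
{
    return (id >> DEMO_OFFRULE_SHIFT) & DEMO_RULE_MASK;
}

uint32_t demo_tz_ds_rule(demo_tz_id id)
{
    return id & DEMO_RULE_MASK;
}

/* Recover the signed offset; only meaningful for IDs built by demo_tz_pack_offset. */
int32_t demo_tz_offset_seconds(demo_tz_id id)
{
    assert(demo_tz_is_offset(id));
    return (int32_t)(id & ~DEMO_OFFSET_FLAG) - DEMO_OFFSET_BIAS;
}
```

The point of the fixed-width form is that storage can copy the ID without decoding it, and only the timezone code needs to know which fields mean what.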

Reading order

The two i18n primitives are independent. Pick by what you are working on:

If you care about strings, identifiers, comparison, indexing, sorting, joining, or hashing — read cubrid-charset-collation.md first. Almost every other doc in the repository eventually mentions INTL_CODESET, LANG_COLLATION, or one of the lang_*cmp* comparators. The charset-collation doc is the only place where the per-codeset comparator family, the LDML / UCA pipeline, and the LANG_GET_BINARY_COLLATION macro are explained from scratch.

If you care about dates, timestamps, sessions, or anything client-server about “local time” — read cubrid-timezone.md first. The packing of TZ_ID, the AT-time qualifier semantics, and the connection-time session region are explained nowhere else.

Read both back-to-back if you want to see the architectural pattern. “Compile external standards into a dlopen-ed .so, pack a small ID, surface through SQL” is identical in both subsystems. Reading them as a pair makes the pattern obvious — and primes you to recognise it elsewhere (e.g. PL bridge libraries).

Cross-cutting concerns

Both primitives are deeply load-bearing. They touch most other sections of the knowledge base, and a comprehensive read of any of those sections is incomplete without an awareness of what these two layers are doing on its behalf.

Charset-collation feeds B+Tree and every string operator. Every comparator on the hot path inside btree.c ultimately resolves through a LANG_COLLATION.fastcmp / LANG_COLLATION.strmatch function pointer. The same holds for LIKE, =, <, ORDER BY, GROUP BY, sort-merge join keys, and hash-join hashing. The collation surface is therefore inseparable from B+Tree (cubrid-btree.md), external sort (cubrid-external-sort.md), hash join (cubrid-hash-join.md), and string scalar functions (cubrid-scalar-functions.md).
Timezone feeds DATETIMETZ / TIMESTAMPTZ in scalar functions and at session boundaries. Anything operating on DB_DATETIMETZ or DB_TIMESTAMPTZ — tz_create_datetimetz, tz_conv_tz_datetime_w_region, tz_explain_tz_id, every CAST, every date-arithmetic operator — walks the TZ_DATA blob from libcubrid_timezones.so. The session-level tz_Region_session / session_tz_region runtime variables tie this into the connection lifecycle covered by cubrid-network-protocol.md, cubrid-server-session.md, and cubrid-boot.md.
Both surface through boot_sr at the same point in the topological boot order. lang_init and tz_load_library run in the same boot phase, after sysparams but before page buffer / log / lock — because every later subsystem may need to compare strings or interpret timestamps. See cubrid-boot.md for the full ordering.
Both surface through SHOW commands. SHOW LOCALES and SHOW TIMEZONES expose the in-memory state of the loaded shared libraries (locale name, charset, codeset, contraction count, DS rule count, etc.) through the virtual-scan path described in cubrid-overview-system-catalog.md. This is the standard introspection surface for “what is the engine actually using?”.
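The wall-clock-to-UTC resolution mentioned for the timezone primitive has two awkward cases, a skipped hour at spring-forward and a repeated hour at fall-back. The C sketch below illustrates them with a hypothetical single zone that has one DST interval. The constants, the `demo_` names, and the tie-breaking policy (earlier instant on overlap, shift forward on gap) are assumptions for illustration and are not claimed to match what tz_datetime_utc_conv does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zone: standard time outside [DST_START, DST_END) in UTC, DST inside. */
#define DEMO_STD_OFFSET    3600
#define DEMO_DST_OFFSET    7200
#define DEMO_DST_START_UTC 1000000
#define DEMO_DST_END_UTC   2000000

typedef enum { DEMO_UNIQUE, DEMO_GAP, DEMO_OVERLAP } demo_wall_kind;

typedef struct {
    int64_t utc;         /* resolved UTC instant, in seconds */
    demo_wall_kind kind; /* how the wall-clock value mapped onto UTC */
} demo_resolved;

/* Offset in effect at a given UTC instant. */
static int64_t demo_offset_at(int64_t utc)
{
    return (utc >= DEMO_DST_START_UTC && utc < DEMO_DST_END_UTC)
               ? DEMO_DST_OFFSET
               : DEMO_STD_OFFSET;
}

/* Try both offsets; a candidate is valid if that offset is really in effect there. */
demo_resolved demo_wall_to_utc(int64_t wall)
{
    int64_t as_std = wall - DEMO_STD_OFFSET;
    int64_t as_dst = wall - DEMO_DST_OFFSET;
    int std_ok = demo_offset_at(as_std) == DEMO_STD_OFFSET;
    int dst_ok = demo_offset_at(as_dst) == DEMO_DST_OFFSET;
    demo_resolved r;

    if (std_ok && dst_ok) {
        /* Fall-back: the wall time occurs twice; assumed policy picks the earlier instant. */
        r.kind = DEMO_OVERLAP;
        r.utc = as_dst;
    } else if (std_ok) {
        r.kind = DEMO_UNIQUE;
        r.utc = as_std;
    } else if (dst_ok) {
        r.kind = DEMO_UNIQUE;
        r.utc = as_dst;
    } else {
        /* Spring-forward: the wall time never occurs; assumed policy shifts it forward
           by interpreting it with the pre-transition (standard) offset. */
        assert(demo_offset_at(as_std) == DEMO_DST_OFFSET);
        r.kind = DEMO_GAP;
        r.utc = as_std;
    }
    return r;
}
```

A real implementation walks per-zone rule tables rather than two constants, but the three outcomes (unique, gap, overlap) are the cases any such conversion has to classify.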

Detail-doc summaries

| Doc | One-line summary |
| --- | --- |
| `cubrid-charset-collation.md` | Four-codeset text model (binary, ISO-8859-1, EUC-KR, UTF-8) plus locale-aware comparison via the `LANG_COLLATION` vtable; LDML + UCA weights compiled by `genlocale` into a per-platform shared library that the server `dlopen`s at startup. |
| `cubrid-timezone.md` | IANA tzdata compiled into `libcubrid_timezones.so`; 32-bit `TZ_ID` packs `(zone, offset-rule, ds-rule)` or a raw signed offset; `tz_datetime_utc_conv` walks per-zone offset and DS rules honouring `s` / `w` / `u` AT-time qualifiers and overlap intervals. |

Adjacent sections

System Catalog. Both primitives surface through SHOW commands: SHOW COLLATION, SHOW LOCALES, SHOW TIMEZONES, and SHOW FULL TIMEZONES all read the in-memory .so state through the virtual-scan registry. See cubrid-overview-system-catalog.md.
Storage Engine. Charset-collation drives every B+Tree comparator (cubrid-btree.md) and surfaces through external sort (cubrid-external-sort.md). Timezone is touched only indirectly here — DATETIMETZ values stored in heaps carry TZ_IDs but storage code never inspects them.
Query Processing. Both primitives surface in scalar functions (cubrid-scalar-functions.md), the optimiser’s collation-aware index choice (cubrid-query-optimizer.md), and the executor’s hash and sort comparators (cubrid-hash-join.md, cubrid-external-sort.md).
Server Architecture. Both libraries are dlopen-ed at boot (cubrid-boot.md). Timezone is also a session-level concept managed at connect time through cubrid-network-protocol.md and cubrid-server-session.md. Library paths and locale settings both come from system parameters (cubrid-system-parameters.md).