CUBRID Semantic Check — Name Resolution, Type Checking, Constant Folding, and Statement-Specific Validation
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
Section titled “Theoretical Background”The semantic-check phase of a SQL compiler sits between the parser and the optimizer. The parser proves the query is grammatically well-formed; the optimizer proves a chosen plan is cheap. Between them, semantic analysis proves the query is meaningful against the catalog: every name resolves to something, every operator’s arguments have types it accepts, and every transformation that removes degrees of freedom for the optimizer (constant folding, predicate normalization) happens here so the optimizer sees a single canonical form.
Three textbook subproblems live underneath this single pass.
Symbol resolution and scoping. The Dragon Book frames it (Aho et al.,
ch. 6 §“Symbol Tables”, ch. 7 §“Scope”) as a tree-walk that maintains a
stack of scopes; an inner scope shadows an outer one, and a free name is
bound to the closest scope where the name is declared. SQL’s scope rules
are largely flat — a SELECT introduces a scope spanning its FROM,
WHERE, GROUP BY, HAVING, and ORDER BY — but two recursive cases
push them into the same scope-stack discipline as a programming language:
subqueries (each SELECT opens a new scope; a free reference inside
walks outward to find a FROM-list provider) and derived tables / CTEs
(an inner SELECT produces an aliased relation that the outer scope must
see after the inner has been resolved).
The data structure is the same as Wirth-style nested-block compilers: a
linked list of scopes, head is innermost. CUBRID’s SCOPES struct
(name_resolution.c:78) is exactly this list — a next pointer to the
outer scope, a list of PT_SPEC nodes that the scope makes visible, and
a correlation_level counter that tells the resolver “I had to walk N
scopes outward to bind this name”, which the optimizer later uses to
decide whether a subquery is correlated.
Type systems with implicit conversion. Database System Concepts
(Silberschatz et al., ch. 4 §“Built-in Data Types”, ch. 5 §“Functions and
Procedures”) formalizes a SQL type system as a partial order of types
with explicit and implicit conversion rules. Every operator has a
signature — a tuple of input types and a return type — and the
type-check pass either picks the unique matching signature or chooses one
by promoting the actual arguments through implicit casts. The two
canonical promotions are integer ↔ floating-point (preserves value, may
lose precision) and numeric ↔ string (succeeds only if the string is
parseable as a number; otherwise a runtime error). Engines vary on
overload selection — exact match preferred, otherwise minimum-cost
promotion, otherwise reject — and on whether the rules are
operator-driven (each operator has hand-written rules) or
signature-table-driven (a single matcher reads a per-operator table).
CUBRID uses both: operator-driven rules for arithmetic and a few special
cases, plus a signature-table matcher (pt_apply_expressions_definition)
for the long tail.
Constant folding as compile-time evaluation. The Dragon Book chapter
9 §“Loop-Invariant Computations” introduces folding as a peephole
optimization: if every operand of an expression is a literal, evaluate
it now and replace the expression with its result. In SQL the prize is
twofold — the executor avoids per-tuple work, and the optimizer sees a
simpler predicate that may unlock further rewrites (e.g.,
WHERE 1=1 AND foo > 5 becoming WHERE foo > 5 after folding the first
conjunct). The hazard is that folding must respect side-effect-bearing
operators (RAND, NOW, BENCHMARK) and not-yet-bound host
variables. The compiler must classify operators by purity and not fold
expressions that contain anything impure or any unresolved host variable.
CNF for predicate normalization. Quine’s Methods of Logic and the
classical resolution-theorem-proving literature (Robinson 1965; Nilsson’s
Principles of Artificial Intelligence, the “resolution1” chapter cited
by the deck) define CNF as the conjunction-of-disjunctions form: every
formula (A ∨ B) ∧ (C ∨ ¬D ∨ E) ∧ …. Three rewrite steps reach it
mechanically — eliminate implication and equivalence, push negation
inward via De Morgan, distribute disjunction over conjunction. A query
optimizer wants CNF for two reasons: (a) conjunct-by-conjunct
selectivity estimation is sound only when conjuncts are independent —
each (...) clause becomes a single predicate that the cost model
multiplies; (b) predicate pushdown and index selection treat each
conjunct as a candidate for a separate index access path, and a CNF
predicate is already a flat list of such candidates. A non-CNF predicate
would force the optimizer to enumerate all factorizations.
These four ideas — scope stack, type promotion, constant folding, predicate normalization — appear over and over in the rest of this document. They are the load-bearing concepts that the CUBRID source expresses in C.
Common DBMS Design
Section titled “Common DBMS Design”Every relational engine — PostgreSQL, Oracle, SQL Server, MySQL, CUBRID — adopts roughly the same five-stage shape between parser and optimizer. The names differ; the work does not.
Stage 1: parse-tree walk that resolves spec / FROM-list relations
Section titled “Stage 1: parse-tree walk that resolves spec / FROM-list relations”The first walk binds the FROM-list itself. PostgreSQL’s transformFromClause
(in parser/parse_clause.c) takes each RangeVar (a parser node naming a
relation) and resolves it against the catalog, returning a RangeTblEntry
that the rest of the pass can index by ordinal. Oracle’s kkqs /
kkpamqry does the same role — turn each table reference into a slot in
the query block’s range table. CUBRID’s parallel is pt_flat_spec_pre:
it walks the parse tree, and for each PT_SPEC (CUBRID’s term for a
FROM-list element, including subselects, derived tables, and CTEs) it
expands the spec into a flat_entity_list of physical class references
and attaches the workspace’s DB_OBJECT for each.
The data structure shape is identical across engines: each FROM-list
entry gets a stable identifier (PostgreSQL: 1-based ordinal in the range
table; CUBRID: the address of the PT_SPEC cast to UINTPTR,
re-purposed as info.spec.id). Every later name reference will carry
that identifier as its “I belong to this relation” backpointer.
Stage 2: name binding under a scope stack
Section titled “Stage 2: name binding under a scope stack”The second walk binds every other name in the tree to one of the FROM-list entries from stage 1, walking a stack of scopes that grows as the walker descends into subqueries and shrinks as it pops out. Every engine implements this as a recursive descent over the parse tree carrying a mutable scope-list argument.
PostgreSQL’s transformExpr chain calls colNameToVar against the
current ParseState’s p_namespace. Oracle does the same against the
parsed query block’s range table. CUBRID’s pt_bind_names /
pt_bind_name_or_path_in_scope walks SCOPES *scopes (a linked list,
innermost first) and resolves names against the specs of each scope in
turn. The first match wins; an ambiguity (same column name in two specs
in the same scope) errors out as MSGCAT_SEMANTIC_AMBIGUOUS_REF_TO.
The classical correctness invariant — “no name resolves to a relation
that was introduced after the current scope was opened” — is enforced
by the order in which the walker pushes scopes. CUBRID pushes a new
SCOPES frame at every PT_SELECT, PT_UPDATE, PT_DELETE, etc., and
unwinds on the post-walk; this is the same discipline as Wirth-style
nested blocks.
Stage 3: type checking with overload resolution
Section titled “Stage 3: type checking with overload resolution”After every name has a type (because every PT_SPEC has been mapped to
a catalog object whose attributes are typed), the third walk computes a
type for every internal node — PT_EXPR, PT_FUNCTION, PT_VALUE,
PT_HOST_VAR. The classical algorithm is bottom-up: leaves carry their
declared type, an internal node picks an overload of its operator that
accepts the children’s types (inserting implicit casts where the
language allows), and the parent receives the overload’s return type.
PostgreSQL’s coerce_to_target_type and func_select_candidate and
Oracle’s kkmpc cost-rank candidate signatures by exact match versus
implicit conversion. CUBRID’s pt_apply_expressions_definition
(type_checking.c:5778) reads a per-operator EXPRESSION_DEFINITION
table of overloads, scores each by pt_are_equivalent_types plus
pt_are_unmatchable_types, and picks the highest-scoring signature
(falling back to the first if no signature matches all three argument
slots). Picked signatures cause pt_coerce_expression_argument to wrap
mismatched arguments in PT_CAST expressions.
Stage 4: constant folding
Section titled “Stage 4: constant folding”Once types are known, every expression whose arguments are all PT_VALUE
can be evaluated at compile time. PostgreSQL’s eval_const_expressions,
Oracle’s kkecf, MySQL’s Item::const_item propagate folding through
the tree.
CUBRID does this in pt_fold_const_expr (for PT_EXPR) and
pt_fold_const_function (for PT_FUNCTION). The classification of
“foldable” is operator-driven: PT_RAND, PT_BENCHMARK, anything
involving an unresolved PT_HOST_VAR is excluded by an early-return
check. Folded subtrees are replaced in place by the resulting
PT_VALUE.
Stage 5: predicate normalization (CNF)
Section titled “Stage 5: predicate normalization (CNF)”Every engine that does cost-based optimization wants the WHERE-clause
predicate in CNF, because the conjunct-by-conjunct selectivity model
multiplies independent conjuncts and the index-scan rewriter looks for
single-attribute conjuncts to push into index access methods.
PostgreSQL’s canonicalize_qual does exactly this. CUBRID has a
dedicated cnf.c whose entry point pt_do_cnf walks the tree and calls
pt_cnf on each WHERE and HAVING predicate; pt_cnf itself first
applies De Morgan’s law to push negation inward (via pt_and_or_form),
then distributes disjunction over conjunction (via pt_transform_cnf_pre
/ _post). The result is a linked list of PT_EXPR nodes joined by
implicit AND on next; each conjunct can carry one OR-list joined by
or_next, allowing CNF to be represented as a flat 2-D structure
without a tree.
Theory ↔ CUBRID mapping
Section titled “Theory ↔ CUBRID mapping”| Theoretical concept | CUBRID name |
|---|---|
| Symbol-table per-scope frame | SCOPES { next, specs, correlation_level, location } (name_resolution.c:78) |
| Whole-walk symbol-table state | PT_BIND_NAMES_ARG { scopes, spec_frames, sc_info } (name_resolution.c:88) |
| FROM-list entry | PT_SPEC carrying flat_entity_list, entity_name, as_attr_list, id, location, natural, join_type, on_cond (parse_tree.h) |
| Bound-to-spec back-pointer on a name | PT_NAME::info.name.spec_id (parse_tree.h) |
Resolved relation (db_object) cache on a name | PT_NAME::info.name.db_object (parse_tree.h) |
| Per-statement type-check / fold context | SEMANTIC_CHK_INFO { top_node, attrdefs, system_class, donot_fold, ... } (semantic_check.h) |
| Operator-overload table (signature-driven) | EXPRESSION_DEFINITION { op, overloads_count, overloads[MAX_OVERLOADS] } (type_checking.c) |
| Implicit cast insertion | pt_coerce_expression_argument wrapping in PT_CAST with target TP_DOMAIN |
| ”Maybe” type for unresolved host vars | PT_TYPE_MAYBE + expected_domain field on PT_HOST_VAR |
| Per-operator late-binding decision | pt_is_op_hv_late_bind (type_checking.c:20260) |
| Constant-fold gate flag | SEMANTIC_CHK_INFO::donot_fold and BENCHMARK early return in pt_fold_constants_pre |
| CNF “this conjunct already normalized” flag | PT_EXPR_INFO_CNF_DONE set inside pt_cnf |
| Conjunct-list / disjunct-list link fields | PT_NODE::next for AND, PT_NODE::or_next for OR (used as a 2-D linked list) |
CUBRID’s Approach
Section titled “CUBRID’s Approach”The semantic-check pass enters at pt_compile (in compile.c),
which is called once per top-level statement returned by the parser.
pt_compile is a thin wrapper that calls pt_semantic_check, which
in turn delegates to pt_check_with_info. All real work lives there.
// pt_compile — src/parser/compile.cPT_NODE *pt_compile (PARSER_CONTEXT * parser, PT_NODE * volatile statement){ PT_NODE *next;
PT_SET_JMP_ENV (parser);
if (statement) { next = statement->next; statement->next = NULL;
statement = pt_semantic_check (parser, statement);
/* restore link */ if (statement) { statement->next = next; } }
PT_CLEAR_JMP_ENV (parser);
return statement;}Note statement->next = NULL — the wrapper detaches the current
statement from a multi-statement list before checking it, so each
statement is checked in isolation. The setjmp/longjmp envelope is
the parser’s panic-mode error recovery; a deep semantic error can
unwind out of nested calls without the boilerplate of propagating
return codes.
The pipeline
Section titled “The pipeline”flowchart LR PARSE["parser_main\n(bison/flex)"] --> PT["PT_NODE tree\n(unanalyzed)"] PT --> COMP["pt_compile"] COMP --> SC["pt_semantic_check"] SC --> CWI["pt_check_with_info\n(per-statement switch)"] CWI --> RN["1) pt_resolve_names\nname resolution"] RN --> CW["2) pt_check_where\naggregate-in-WHERE check"] CW --> RH["3) pt_check_and_replace_hostvar\nresolve host-var values"] RH --> SCL["4) parser_walk_tree\n(post=pt_semantic_check_local)\nstatement-specific\nrewrites + checks\n· pt_semantic_type"] SCL --> EI["pt_expand_isnull_preds"] EI --> CNF["pt_do_cnf / pt_cnf\n(predicate normalization)"] CNF --> XASL["xasl_generation"] SCL -. internal call .-> PST["pt_semantic_type"] PST --> EVTP["pt_eval_type_pre\n(top-down)"] PST --> EVTPOST["pt_eval_type\n(bottom-up)\n → pt_eval_expr_type\n → pt_eval_function_type\n → pt_where_type"] PST --> FOLD["pt_fold_constants_pre\n· pt_fold_constants_post\n → pt_fold_const_expr\n → pt_fold_const_function"]
The flowchart hides one subtle detail: the type-check and constant-fold
walk inside pt_semantic_type is invoked recursively by the
statement-specific rewrites, because some rewrites (e.g., wrapping a
LIMIT clause as an INST_NUM() predicate) introduce new expressions
whose types need to be re-derived. The top-level pt_check_with_info
relies on the post-order callback pt_semantic_check_local to invoke
pt_semantic_type once per PT_SELECT / PT_UNION / etc. The CNF
pass at the end is run after all rewriting and folding settle.
Stage 1 — Name resolution
Section titled “Stage 1 — Name resolution”Inside pt_check_with_info, for any DML or query node, the first call
is to pt_resolve_names.
// pt_resolve_names — src/parser/name_resolution.c (condensed)PT_NODE *pt_resolve_names (PARSER_CONTEXT * parser, PT_NODE * statement, SEMANTIC_CHK_INFO * sc_info){ PT_BIND_NAMES_ARG bind_arg; PT_FLAT_SPEC_INFO info;
bind_arg.scopes = NULL; bind_arg.spec_frames = NULL; bind_arg.sc_info = sc_info;
/* Replace each Entity Spec with an Equivalent flat list */ info.spec_parent = NULL; info.for_update = false; statement = parser_walk_tree (parser, statement, pt_flat_spec_pre, &info, pt_continue_walk, NULL);
/* Mark PT_NAME nodes that are inside GROUP BY / HAVING. */ statement = parser_walk_tree (parser, statement, pt_mark_group_having_pt_name, NULL, NULL, NULL);
/* The main name-binding walk. */ statement = parser_walk_tree (parser, statement, pt_bind_names, &bind_arg, pt_bind_names_post, &bind_arg);
/* Resolve alias references in GROUP BY / HAVING. */ statement = parser_walk_tree (parser, statement, pt_resolve_group_having_alias, NULL, NULL, NULL);
/* Convert NATURAL JOIN into INNER/OUTER JOIN by synthesizing the equi-predicates. */ statement = parser_walk_tree (parser, statement, NULL, NULL, pt_resolve_natural_join, NULL);
/* FOR UPDATE: flag the affected specs. */ if (statement && statement->node_type == PT_SELECT && PT_SELECT_INFO_IS_FLAGED (statement, PT_SELECT_INFO_FOR_UPDATE)) { // ... condensed: flag each spec with PT_SPEC_FLAG_FOR_UPDATE_CLAUSE ... }
return statement;}Five sub-passes execute in this order:
-
Spec expansion (
pt_flat_spec_pre): for eachPT_SPEC, expandentity_nameinto aflat_entity_listthat includes inherited classes (CUBRID is an ORDBMS, soSELECT FROM Parentmay need to touch every subclass) and the workspace’sDB_OBJECT. TheALL ... EXCEPTsyntax can subtract specific subclasses; the resulting list is exact. -
GROUP BY / HAVING marking (
pt_mark_group_having_pt_name): tag everyPT_NAMEinside aGROUP BYorHAVINGclause withCPTR_PT_NAME_IN_GROUP_HAVINGso that, in the next pass, the resolver knows whether to prefer SELECT-list aliases over FROM-list columns when both could match. -
Name binding (
pt_bind_names+pt_bind_names_post): the main recursive walk that pushes / pops scope frames and resolves everyPT_NAMEagainst the scope stack viapt_bind_name_or_path_in_scope. -
Alias resolution (
pt_resolve_group_having_alias): forGROUP BYandHAVINGexpressions, replace anyPT_NAMEwhose string matches a SELECT-list alias with a copy of the SELECT-list node. This is the “GROUP BY alias_name” feature — alias resolution is post-binding because it must let the binder fail to find the name in the FROM list first. -
NATURAL JOIN rewriting (
pt_resolve_natural_join): walk the tree, find anyPT_SPECwhoseinfo.spec.naturalis true, and for each common attribute of its left- and right-hand specs synthesize an equi-predicatelhs.col = rhs.coland graft it onto the right-hand spec’son_cond. After this rewrite the planner only sees a regular INNER JOIN with an explicit ON condition.
Spec expansion (pt_flat_spec_pre)
Section titled “Spec expansion (pt_flat_spec_pre)”The reason CUBRID expands every spec into a list of physical relations
rather than carrying the original entity_name is its ORDBMS heritage:
SELECT * FROM Parent must produce rows from every subclass unless
explicit ALL / EXCEPT clauses say otherwise. The flat_entity_list
materializes that decision at semantic-check time, so the binder and
the optimizer can both treat each list element as a regular relation
reference.
For each PT_SPEC, the expansion attaches three pieces of state:
id— the address of thePT_SPECnode itself, cast toUINTPTR. This is the back-reference key that every laterPT_NAME::info.name.spec_idwill carry.db_object— the workspace pointer fetched inpt_class_pre_fetch; every relation referenced by a query is locked and pre-fetched before semantic check runs, so the workspace is already populated.flat_entity_list— a fresh linked list ofPT_NAMEnodes, one per physical class the query reaches. Each carries its owndb_objectso attribute lookup needs only the list element.
Name binding and the scope stack
Section titled “Name binding and the scope stack”The scope stack is a singly-linked list in declaration-order-reversed — innermost first. Each frame holds the relations that frame introduces.
// SCOPES — src/parser/name_resolution.c:78typedef struct scopes SCOPES;struct scopes{ SCOPES *next; /* next outermost scope */ PT_NODE *specs; /* list of PT_SPEC nodes */ unsigned short correlation_level; /* how far up the stack was a name found? */ short location; /* for outer join */};
typedef struct pt_bind_names_arg PT_BIND_NAMES_ARG;struct pt_bind_names_arg{ SCOPES *scopes; PT_EXTRA_SPECS_FRAME *spec_frames; SEMANTIC_CHK_INFO *sc_info;};The correlation_level is the answer to “if the binder had to walk
outward N scopes to find this name, what was N?” It is the optimizer’s
hint that a subquery is correlated — non-zero level means the inner
SELECT’s evaluation depends on a row from an outer relation, which
forbids subquery-flattening transformations. The location is a
companion to outer-join rewriting: every PT_NAME under an outer-join’s
ON condition gets the same nonzero location so the optimizer knows
the predicate cannot be hoisted out of the join.
flowchart TD
subgraph OUTER["Outer SELECT scope"]
OS["specs: history h, olympic o"]
end
subgraph INNER["Inner subquery scope (correlated)"]
IS["specs: prize p"]
IS --> ISN["Resolving p.year:\n match in IS — level 0"]
IS --> ISO["Resolving o.host_year:\n no match in IS\n walk to OS — level 1\n → mark subquery correlated"]
end
INNER -- "next" --> OUTER
ISO -. "spec_id ← &outer_PT_SPEC\nresolved ← 'o'" .-> OS
The walker pushes frames whenever it descends into a query node and
pops on the way out. Pushing happens lazily inside pt_bind_scope,
which gathers the PT_SPEC nodes the new scope makes visible, builds a
SCOPES *new with new->next = bind_arg->scopes, and reassigns
bind_arg->scopes = new. On the post-walk callback the scope is
unlinked. The two-phase callback signature
(pt_bind_names, pt_bind_names_post) of parser_walk_tree is what
makes this push/pop reliable.
Resolving * in the SELECT list
Section titled “Resolving * in the SELECT list”SELECT * FROM code — the * is parsed as a PT_VALUE of a special
“star” kind, not as a PT_NAME. The binder’s job at the SELECT-list
level is to expand it. pt_resolve_star does the expansion: locate the
PT_SPEC (or all specs, if the * is bare) that the star refers to,
fetch each spec’s flat_entity_list’s DB_ATTRIBUTE set, and emit one
fresh PT_NAME per attribute, linked by next. Each fresh PT_NAME
then re-enters the regular binder and gets its spec_id and
data_type set as if it had been written by the user.
flowchart LR
STAR["PT_VALUE ('*')\nin SELECT list"] --> EXP["pt_resolve_star"]
EXP --> NL["fresh PT_NAME list:\n s_name → f_name → ..."]
NL --> BIND["pt_bind_name_or_path_in_scope\n(re-enters binder)"]
BIND --> DONE["each PT_NAME has\n spec_id, resolved,\n data_type filled"]
Derived tables, subqueries, CTEs
Section titled “Derived tables, subqueries, CTEs”Subqueries (a PT_SELECT inside another PT_SELECT) and derived
tables (a PT_SELECT directly under a FROM-list PT_SPEC) push a new
scope. The order in which scopes resolve matters: a derived table must
be resolved first, before its alias is offered to the outer scope, so
that the outer scope can look up as_attr_list columns by their bound
types. pt_bind_scope enforces this: when the binder hits a PT_SPEC
whose derived_table is non-NULL, it descends into the derived table
recursively, calls pt_semantic_type to type its select list, then
threads the inner select-list types back into the spec’s
as_attr_list. After this, the outer scope sees the derived table as
a regular relation.
CTEs follow the same shape via pt_bind_names_in_with_clause /
pt_bind_names_in_cte: the CTE’s body is bound as if it were a
derived table; the CTE’s name and column list become a PT_SPEC that
the WITH-clause’s query body sees in scope.
Oracle-style outer join
Section titled “Oracle-style outer join”Oracle outer-join syntax — WHERE a.col(+) = b.col — pushes the
outerness into the predicate rather than the join. The CUBRID parser
detects the (+) and flags the corresponding PT_EXPR with
PT_EXPR_INFO_RIGHT_OUTER or PT_EXPR_INFO_LEFT_OUTER. At
name-resolution time, pt_check_Oracle_outerjoin (called inside
pt_bind_names) finds these predicates, identifies the lhs and rhs
specs, infers the join order, and rewrites the WHERE-clause predicate
into an equivalent ANSI-style join: it sets the affected PT_SPEC’s
join_type and moves the predicate from where to on_cond of the
right-hand spec. The rest of the pipeline now sees a regular ANSI
LEFT/RIGHT OUTER JOIN.
If the lhs and rhs specs are reversed in the FROM list relative to the
predicate (FROM x, y WHERE y.i = x.i(+)), the rewriter swaps the
join type:
// pt_check_Oracle_outerjoin (excerpt) — src/parser/name_resolution.cif (lhs_spec && rhs_spec && lhs_spec->info.spec.id != rhs_spec->info.spec.id) { /* found edge: set join type and spec */ if (lhs_location < rhs_location) /* left-to-right in FROM order */ { p_spec = lhs_spec; spec = rhs_spec; } else /* SELECT ... FROM x, y WHERE y.i = x.i(+) */ { join_type = (join_type == PT_JOIN_LEFT_OUTER) ? PT_JOIN_RIGHT_OUTER : PT_JOIN_LEFT_OUTER; p_spec = rhs_spec; spec = lhs_spec; } }A trailing detail: a meaningless outer-join predicate (SELECT ... FROM x, y ON y.i(+) = 1)
collapses to PT_JOIN_NONE, and the predicate is moved back to the
WHERE clause — the rewriter explicitly recognizes that the (+) was a
syntactic tic in this position and not a real join request.
NATURAL JOIN
Section titled “NATURAL JOIN”NATURAL JOIN is sugar for “INNER JOIN with equi-predicates on every
matching column name”. The parser carries it as a PT_SPEC with
info.spec.natural == true, leaving the join-type bit
PT_JOIN_NATURAL unused (it exists in the enum for completeness but
the natural flag is what the rewriter reads). pt_resolve_natural_join
walks the tree, and for each natural-join PT_SPEC builds two attribute
lists — one from the lhs, one from the rhs — by reading each spec’s
flat_entity_list (or the derived-table’s select-list / CTE’s
attribute list, depending on the spec’s kind). Common attribute names
are converted into lhs.col = rhs.col PT_EXPR nodes which are
appended to the rhs spec’s on_cond:
-- beforeSELECT DISTINCT h.host_year, o.host_nationFROM history h NATURAL JOIN olympic oWHERE o.host_year > 1950;
-- after pt_resolve_natural_joinSELECT DISTINCT h.host_year, o.host_nationFROM history h INNER JOIN olympic o ON h.host_year = o.host_yearWHERE o.host_year > 1950;The rewriter additionally flips the spec’s info.spec.natural to
false so that subsequent passes treat it as a regular inner join.
Stage 2 — Aggregate-in-WHERE check (pt_check_where)
Section titled “Stage 2 — Aggregate-in-WHERE check (pt_check_where)”A small but mandatory check: aggregate (SUM, AVG, …) and analytic
(ROW_NUMBER, RANK, …) functions are forbidden in WHERE clauses;
they must move to HAVING.
// pt_check_where — src/parser/semantic_check.c:17503PT_NODE *pt_check_where (PARSER_CONTEXT * parser, PT_NODE * node){ PT_NODE *function = NULL;
/* check if exists aggregate/analytic functions in where clause */ function = pt_find_aggregate_analytic_in_where (parser, node); if (function != NULL) { if (pt_is_aggregate_function (parser, function)) { PT_ERRORm (parser, function, MSGCAT_SET_PARSER_SEMANTIC, MSGCAT_SEMANTIC_INVALID_AGGREGATE); } else { PT_ERRORm (parser, function, MSGCAT_SET_PARSER_SEMANTIC, MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS); } }
return node;}The message MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS is reused for
analytics-in-WHERE because the underlying classification function
treats both as “function nodes that semantically belong elsewhere”.
Stage 3 — Host-variable replacement (pt_check_and_replace_hostvar)
Section titled “Stage 3 — Host-variable replacement (pt_check_and_replace_hostvar)”A host variable is the ? (or :name) parameter that the client binds
at execute time. After parsing it is a PT_HOST_VAR of PT_TYPE_MAYBE.
The walk replaces those whose value the parser has already bound (e.g.,
literal placeholders in dynamic SQL the client supplied at compile
time) with PT_VALUE nodes carrying the bound value, leaving genuinely
free host vars as PT_HOST_VAR for late binding. The step also sets a
cannot_prepare flag on the statement when it discovers a host var
referencing an OBJECT (parameter) — those statements must be re-checked
at execute time and so cannot be prepared once.
Stage 4 — Statement-specific rewrites + type checking
Section titled “Stage 4 — Statement-specific rewrites + type checking”pt_semantic_check_local is the largest single function in
semantic_check.c. It is the post-order callback of a parser_walk_tree
in pt_check_with_info:
// pt_check_with_info (excerpt) — src/parser/semantic_check.cnode = parser_walk_tree (parser, node, NULL, NULL, pt_semantic_check_local, sc_info_ptr);Because it is post-order only, it sees subqueries before their containing query — important for type propagation upward.
The function dispatches on node->node_type. For PT_SELECT it does:
-
SELECT-list checks:
pt_check_into_clause(theINTO :varcount must match the select-list count;INTOis forbidden inside a subquery).MSGCAT_SEMANTIC_NOT_SINGLE_COLreproduces the parser’s multi-column-subquery check at the semantic level.WITH INCREMENT FOR colrewrites into a hiddenincr(col)selection. -
GROUP BY checks:
MSGCAT_SEMANTIC_SORT_SPEC_RANGE_ERRfor a position-style GROUP BY beyond the select-list length;MSGCAT_SEMANTIC_NO_GROUPBY_ALLOWEDfor a host-var GROUP BY (the position would change at execute time);MSGCAT_SEMANTIC_CANNOT_USE_GROUPBYNUM_WITH_ROLLUPfor the rareWITH ROLLUP HAVING GROUPBY_NUM()combination. -
Aggregate / analytic argument checks: every aggregate / analytic function except a small whitelist (
COUNT(*),ROW_NUMBER(),RANK(),DENSE_RANK(),CUME_DIST(),PERCENT_RANK()) must have anarg_list.MEDIANcannot be used withOVER (ORDER BY ...).PARTITION BY NULLis removed and replaced byORDER BY <const>to keep the analytic well-formed. -
ORDER BY checks: strip
ORDER BYif the query is a single-row aggregation that won’t benefit; range-check positional ORDER BY. -
Hierarchical query checks (CONNECT BY): forbid
CONNECT_BY_ISLEAFandCONNECT_BY_ISCYCLEoutside the prior expr context; analyzeLEVELto decide whether a cycle check is needed. -
SHOW statement rewrite: replace the placeholder select-list with the real meta-data list for the SHOW kind (e.g.,
SHOW TABLES→ SELECT against a system catalog). -
Derived-query checks: for each
PT_SPECcarrying a derived table, verify thatas_attr_listhas unique names (MSGCAT_SEMANTIC_AMBIGUOUS_REF_TO) and that hidden columns get syntheticha_<n>placeholder names so the count matches. -
CAST well-formedness:
pt_check_cast_opverifies the source type can be coerced to the target type; otherwise emitsMSGCAT_SEMANTIC_CANT_COERCE_TO(target is in the coercion table but marked NO) orMSGCAT_SEMANTIC_COERCE_UNSUPPORTED(target not in the table at all). -
LIMIT rewriting — see below.
-
pt_semantic_typeinvocation: finally, after all rewrites,pt_semantic_check_localcallspt_semantic_typeon the subtree to type-check and constant-fold. This is the “type checking + constant folding” pass.
LIMIT → numbering expression
Section titled “LIMIT → numbering expression”LIMIT is a syntactic shorthand for “stop after N rows”, but the optimizer has no notion of “stop” — every clause must turn into a predicate. The rewriter chooses the predicate based on the surrounding context:
| Context | Rewrite |
|---|---|
| LIMIT with WHERE only | append inst_num() <= ? to WHERE |
| LIMIT with ORDER BY | append orderby_num() <= ? (the ordering is preserved before the cut) |
| LIMIT with GROUP BY | append groupby_num() BETWEEN 1 AND ? to HAVING |
| LIMIT with DISTINCT only | treat as ORDER BY case (distinct is order-equivalent for the purpose of cutting) |
| LIMIT inside aggregate | wrap the aggregate select as a derived table, apply LIMIT outside |
-- example 4SELECT * FROM code LIMIT 3;=> SELECT code.s_name, code.f_name FROM code code WHERE (inst_num() <= ?:0);
-- example 1 — LIMIT pushes through to ORDERBY_NUM inside the analyticSELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY math) AS pdisc FROM scores LIMIT 5;=> ... WITHIN GROUP (ORDER BY math FOR ORDERBY_NUM() BETWEEN 1 AND 5) ...The rewrite is performed before pt_semantic_type is called, so the
type checker sees a regular WHERE/HAVING with the number-expression in
place. This is why pt_eval_type_pre’s LIMIT handling is also the
spot that replaces LIMIT with the appropriate xxx_NUM() — the rewrite
needs to know the surrounding clause kind, and eval_type_pre is
top-down so that knowledge is available when the LIMIT subtree is
visited.
pt_semantic_type — the type-check + fold engine
Section titled “pt_semantic_type — the type-check + fold engine”// pt_semantic_type — src/parser/type_checking.cPT_NODE *pt_semantic_type (PARSER_CONTEXT * parser, PT_NODE * tree, SEMANTIC_CHK_INFO * sc_info_ptr){ /* ... condensed: derive sc_info_ptr->has_dblink from tree ... */
/* do type checking */ tree = parser_walk_tree (parser, tree, pt_eval_type_pre, sc_info_ptr, pt_eval_type, sc_info_ptr); if (pt_has_error (parser)) { return NULL; }
/* Parsing static sql is only for semantic check. Any kind of execution should be avoided */ if (!parser->flag.is_parsing_static_sql) { PT_NODE *spec_list = NULL; /* do constant folding */ tree = parser_walk_tree (parser, tree, pt_fold_constants_pre, &spec_list, pt_fold_constants_post, sc_info_ptr); if (pt_has_error (parser)) { tree = NULL; } }
return tree;}Two parser_walk_tree calls in sequence:
-
Type checking — pre-callback
pt_eval_type_pre(top-down) sets things up that the children need (LIMIT rewrite, recursive-expr handling, derived-table outer-join flag), and post-callbackpt_eval_type(bottom-up) computes each node’s type.pt_eval_typedispatches by node kind:PT_EXPR→pt_eval_expr_type,PT_FUNCTION→pt_eval_function_type,PT_CREATE_INDEX / PT_DELETE / PT_UPDATE→pt_where_typeon their search conditions. -
Constant folding — pre-callback
pt_fold_constants_predecides whether to descend (it stops descent onBENCHMARKto keep the benchmark target unfoldable), post-callbackpt_fold_constants_postchecks eachPT_EXPR/PT_FUNCTIONto see if every argument is aPT_VALUEand, if so, evaluates and replaces.
The static-SQL guard is for prepared-statement-only parsing where the statement is being checked but never executed; folding could rely on volatile state that does not exist in that mode.
Type checking for PT_EXPR — pt_eval_expr_type
Section titled “Type checking for PT_EXPR — pt_eval_expr_type”The function receives a parent PT_EXPR whose children have already
been typed (because pt_eval_type is the post-order callback). It
extracts up to three argument slots (arg1, arg2, arg3) plus their
host-var siblings if any (arg1_hv, arg2_hv, arg3_hv):
// pt_eval_expr_type — src/parser/type_checking.c (head, condensed)static PT_NODE *pt_eval_expr_type (PARSER_CONTEXT * parser, PT_NODE * node){ PT_OP_TYPE op = node->info.expr.op; PT_NODE *arg1 = NULL, *arg2 = NULL, *arg3 = NULL; PT_NODE *arg1_hv = NULL, *arg2_hv = NULL, *arg3_hv = NULL; PT_TYPE_ENUM arg1_type = PT_TYPE_NONE, arg2_type = PT_TYPE_NONE; PT_TYPE_ENUM arg3_type = PT_TYPE_NONE, common_type = PT_TYPE_NONE; /* ... condensed: enumeration-comparison fast path ... */
arg1 = node->info.expr.arg1; if (arg1) { arg1_type = arg1->type_enum; if (arg1->node_type == PT_HOST_VAR && arg1->type_enum == PT_TYPE_MAYBE) arg1_hv = arg1; /* ... condensed: unwrap unary-minus / PRIOR / CONNECT_BY_ROOT around a host var ... */ } /* ... arg2, arg3 similarly ... */}The body that follows is one of the longest switches in the codebase. At a high level it performs six tasks:
-
Operator-driven rule for arithmetic and a handful of others. A small set of operators —
PT_PLUS,PT_MINUS, thePT_BETWEEN_*family,PT_LIKE,PT_TO_CHAR,PT_FROM_TZ,PT_NEW_TIME, thePT_IS_IN/PT_IS_NOT_INpair — have hand-written rules. The arithmetic rule, for instance, encodes the date/time × integer matrix from the deck (slide 32):TIMESTAMP - TIMESTAMPproducesBIGINT(seconds),INT + DATEproducesDATE, etc. The rule lives directly in the switch arm, which then setscommon_typeand bypasses the signature table. -
Signature-driven matching for the long tail (
pt_apply_expressions_definition). For everything else, a precomputedEXPRESSION_DEFINITION(returned bypt_get_expression_definition (op, &def)) hasoverloads_countoverloads, each an(arg1_type, arg2_type, arg3_type, return_type)tuple. The matcher scores each overload by counting per-argumentpt_are_equivalent_typesmatches; the highest-scoring overload wins. If no overload’s argument types are even unmatchable-free, the matcher falls back to overload 0 (which propagates an error downstream whenpt_coerce_expression_argumentcannot insert the needed cast).// pt_apply_expressions_definition (scoring loop, condensed) — src/parser/type_checking.c:5778best_match = 0;matches = -1;for (i = 0; i < def.overloads_count; i++){int match_cnt = 0;if (pt_are_unmatchable_types (def.overloads[i].arg1_type, arg1_type)) { match_cnt = -1; continue; }if (pt_are_equivalent_types (def.overloads[i].arg1_type, arg1_type)) match_cnt++;if (pt_are_unmatchable_types (def.overloads[i].arg2_type, arg2_type)) { match_cnt = -1; continue; }if (pt_are_equivalent_types (def.overloads[i].arg2_type, arg2_type)) match_cnt++;if (pt_are_unmatchable_types (def.overloads[i].arg3_type, arg3_type)) { match_cnt = -1; continue; }if (pt_are_equivalent_types (def.overloads[i].arg3_type, arg3_type)) match_cnt++;if (match_cnt == 3) { best_match = i; break; } /* perfect — short-circuit *//* ... condensed: track running best_match by match_cnt > matches ... */} -
Collation-compatibility check (
pt_check_expr_collation). For string-typed arguments, both arguments must agree on collation or one must be coercible. CUBRID’s collation system distinguishes “explicit” (user-writtenCOLLATE), “implicit” (column’s default), and “coercible” (literal); the rule prefers the most-explicit collation and errors if two arguments are explicitly tagged with incompatible collations. -
Late-binding host vars (
pt_is_op_hv_late_bind). Some operators — range comparisons, arithmetic on a?argument — can postpone the host var’s exact type until execution. The pass wraps the host var in aPT_CASTwith targetexpected_domainset to aDB_TYPE_VARIABLEdomain, then clears theexpected_domainso the XASL generator knows to treat the cast as late-binding rather than early. -
Type self-evidence. A handful of operators have a return type that is fixed regardless of argument type (
PT_RAND→ numeric,PT_YEAR→ integer,PT_EXTRACT→ integer,PT_COALESCE→ common-type). After the signature match the type is overridden from the operator’s contract. -
Argument-type back-propagation for typed host vars. When the matcher has settled on a return type and the host var is now constrained,
pt_eval_expr_typerecords the inferred domain intoexpected_domainso that subsequent passes (and ultimately the executor) coerce the actual host-var value at bind time.
The function returns the modified node. By the time it returns,
node->type_enum is set, node->data_type is allocated (if the type
needs precision/collation), and any required PT_CAST wrappers have
been inserted around children.
Recursive expressions
Section titled “Recursive expressions”A handful of operators — GREATEST, LEAST, COALESCE (left-recursive)
and CASE, DECODE (right-recursive) — are parsed as a chain of
identical PT_EXPR nodes. Each chain element has only two interesting
arguments (the immediate operand and the recursion-link to the next
chain element); the chain is parsed bottom-up but must be type-checked
as one expression so the final common type propagates to every link.
pt_eval_type_pre’s recursive-expression handling spots the chain on
the way down, walks to the bottom, and from there walks back up
calling eval_recursive_expr_type repeatedly to compute the
common type once. The result recursive_type is then carried in
expr.recursive_type so that the per-link pt_apply_expressions_definition
can use it instead of asking each link to type-check independently.
flowchart TD CASE1["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE2"] CASE2["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE3"] CASE3["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE4"] CASE4["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: const"] CASE1 --> CASE2 --> CASE3 --> CASE4 PRE["eval_type_pre walks down\nto CASE4, computes\nrecursive_type from\nleaf vals, propagates\nupward into all\nrecursive_type fields"] POST["eval_type then runs\nbottom-up; each link\nreads its own recursive_type\nrather than inferring"]
Type checking for PT_FUNCTION
Section titled “Type checking for PT_FUNCTION”pt_eval_function_type is the parallel of pt_eval_expr_type for
function nodes (which can have arbitrary arity). CUBRID has two
implementations side-by-side:
pt_eval_function_type_old— the long-standing C implementation for built-in aggregates, analytics, and most string / numeric functions.pt_eval_function_type_new— a C++17 replacement built around afunc_type::Nodeclass infunc_type.cpp, currently active for JSON functions, theBENCHMARKfamily, and the newREGEXP_*family. New functions are expected to land in the C++ path.
Both paths read a per-function signature list and pick the
best-matching overload by argument type, same scoring discipline as
pt_apply_expressions_definition.
pt_where_type — predicate cut-off
Section titled “pt_where_type — predicate cut-off”After child types are known, the WHERE clause itself gets a final
sweep to cut off provably-true / provably-false conjuncts. Because the
parser delivers the WHERE clause as a CNF-shaped list (next for AND,
or_next for OR), the cut-off logic walks the list once:
// pt_where_type (excerpt) — src/parser/type_checking.c:6495PT_NODE *pt_where_type (PARSER_CONTEXT * parser, PT_NODE * where){ PT_NODE *cnf_node, *dnf_node, *cnf_prev, *dnf_prev; bool cut_off; short location;
/* traverse CNF list and keep track the pointer to previous node */ cnf_prev = NULL; while ((cnf_node = ((cnf_prev) ? cnf_prev->next : where))) { // ... condensed: extract location, validate logical type ...
if (cnf_node->or_next == NULL) { if (cnf_node->node_type == PT_VALUE && cnf_node->type_enum == PT_TYPE_LOGICAL && cnf_node->info.value.data_value.i == 1) { cut_off = true; /* ... AND TRUE AND ... → drop the conjunct */ } else if (cnf_node->node_type == PT_VALUE && cnf_node->type_enum == PT_TYPE_LOGICAL && cnf_node->info.value.data_value.i == 0) { if (cnf_node == where && cnf_node->next == NULL) return where; goto always_false; /* ... AND FALSE AND ... → whole WHERE is false */ } } else { /* DNF inner pass: drop FALSEs from OR-list, short-circuit on first TRUE */ // ... condensed ... }
// ... condensed: advance cnf_prev or splice cnf_node out ... } return where;}The two reductions are:
A ∧ TRUE ∧ B→A ∧ B(drop the conjunct).A ∧ FALSE ∧ B→FALSE(replace the whole WHERE with a single falsePT_VALUE).
The DNF inner loop does the dual:
D ∨ TRUE ∨ E→TRUE(short-circuit the disjunct as a single true).D ∨ FALSE ∨ E→D ∨ E(drop the FALSE alternative).
The same predicate-cut-off logic runs against ON conditions, HAVING clauses, START WITH / CONNECT BY clauses, and ORDERBY-FOR clauses.
Constant folding
Section titled “Constant folding”pt_fold_constants_post is the per-node folder. For a PT_EXPR it
delegates to pt_fold_const_expr, which checks that every argument is
a PT_VALUE, then evaluates the operator at compile time using the
same machinery the executor would use (pt_evaluate_db_value_expr)
and replaces the entire PT_EXPR with the resulting PT_VALUE.
PT_FUNCTION follows the same pattern via pt_fold_const_function.
The pre-callback pt_fold_constants_pre only intervenes for
BENCHMARK: it returns PT_LIST_WALK to stop descent into the
benchmark’s argument, ensuring that the benchmark’s target expression
is evaluated by the executor (where the benchmark can time it) rather
than collapsed at compile time.
There is an additional gate at the SEMANTIC_CHK_INFO level:
donot_fold is a per-pass flag that the caller of pt_check_with_info
can set to suppress folding entirely. Statement re-checking after a
view-transformation rewrite uses this so that folding doesn’t erase
the structure the rewriter just produced.
Type infrastructure — PT_DATA_TYPE and TP_DOMAIN
Section titled “Type infrastructure — PT_DATA_TYPE and TP_DOMAIN”CUBRID’s parse tree has its own type representation (PT_TYPE_ENUM +
PT_DATA_TYPE), separate from the engine’s DB_TYPE + TP_DOMAIN.
The two coexist because the parse tree needs to carry parameterized
metadata (precision, scale, collation, codeset, enumeration values) in
a form aligned with parser nodes, while the engine’s domain cache
needs to be canonical to support fast comparisons:
| Layer | Header | Purpose |
|---|---|---|
| Parse-tree types | parser/parse_tree.h | PT_TYPE_ENUM, PT_DATA_TYPE_INFO — parser-side type metadata |
| Engine domains | object/object_domain.h | TP_DOMAIN — built-in + user-defined type descriptor, cached |
| Primitive types | object/object_primitive.h | PR_TYPE — per-primitive serialization / comparison / size |
| On-disk layout | object/object_representation.h | OR_* — physical byte layout for storage |
Every PT_DATA_TYPE node materialized in the parse tree is mapped to
a TP_DOMAIN via pt_data_type_to_db_domain. Repeated TP_DOMAIN
allocations are deduped through the tp_Domains[] cache (built-in
domains by DB_TYPE, with collection-domain and parameterized-domain
chains hanging off each entry). The key invariant: any two TP_DOMAIN *
pointers that represent the same parameterized type are identical
pointers after the cache lookup, so equality is == rather than a
deep compare.
Stage 5 — CNF normalization (pt_do_cnf / pt_cnf)
Section titled “Stage 5 — CNF normalization (pt_do_cnf / pt_cnf)”After all rewriting and folding settle, pt_do_cnf walks the tree and
calls pt_cnf on each WHERE and HAVING predicate.
// pt_do_cnf — src/parser/cnf.c:1190PT_NODE *pt_do_cnf (PARSER_CONTEXT * parser, PT_NODE * node, void *arg, int *continue_walk){ PT_NODE *list, *spec;
if (node->node_type != PT_SELECT) return node;
list = node->info.query.q.select.where; if (list) { for (; list; list = list->next) PT_EXPR_INFO_CLEAR_FLAG (list, PT_EXPR_INFO_CNF_DONE);
node->info.query.q.select.where = pt_cnf (parser, node->info.query.q.select.where);
/* annotate each conjunct with the spec(s) it touches — used by predicate pushdown */ for (spec = node->info.query.q.select.from; spec; spec = spec->next) pt_tag_terms_with_specs (parser, node->info.query.q.select.where, spec, spec->info.spec.id); }
list = node->info.query.q.select.having; if (list) { for (; list; list = list->next) PT_EXPR_INFO_CLEAR_FLAG (list, PT_EXPR_INFO_CNF_DONE); node->info.query.q.select.having = pt_cnf (parser, node->info.query.q.select.having); }
return node;}The PT_EXPR_INFO_CNF_DONE clear-then-set protocol exists because
pt_do_cnf may be invoked more than once on the same subtree across
the compile (view transform can re-feed predicates that were already
CNF; the second pass should treat them as fresh).
pt_cnf itself is the textbook three-step transform:
// pt_cnf — src/parser/cnf.c:941PT_NODE *pt_cnf (PARSER_CONTEXT * parser, PT_NODE * node){ PT_NODE *list = NULL, *cnf, *next, *last = NULL; CNF_MODE mode;
do { next = node->next; node->next = NULL;
if (node->node_type == PT_VALUE || PT_EXPR_INFO_IS_FLAGED (node, PT_EXPR_INFO_CNF_DONE)) { /* already in CNF (or a folded constant) — splice in unchanged */ cnf = node; // ... condensed: append to (list, last) ... } else { /* AND/OR form: push NOT inward via De Morgan */ node = pt_and_or_form (parser, node);
/* if too many disjuncts, do CNF in OR-tree-pruning mode */ mode = (count_and_or (parser, node) > 100) ? TRANSFORM_CNF_OR_COMPACT : TRANSFORM_CNF_AND_OR;
/* distribute disjunction over conjunction */ cnf = parser_walk_tree (parser, node, pt_transform_cnf_pre, &mode, pt_transform_cnf_post, &mode); // ... condensed: append cnf to (list, last) ...
/* tag every conjunct as done */ for (last = cnf; last->next; last = last->next) PT_EXPR_INFO_SET_FLAG (last, PT_EXPR_INFO_CNF_DONE); PT_EXPR_INFO_SET_FLAG (last, PT_EXPR_INFO_CNF_DONE); }
node = next; } while (next);
list = parser_walk_tree (parser, list, NULL, NULL, pt_tag_start_of_cnf_post, NULL); return list;}The three rewrite steps inside pt_cnf are:
- AND/OR form (
pt_and_or_form) — replace implication and equivalence with their AND/OR equivalents, push every NOT inward using De Morgan until NOT only appears in front of a leaf predicate. - Distribute (
pt_transform_cnf_*) — apply the distributive lawA ∨ (B ∧ C) ≡ (A ∨ B) ∧ (A ∨ C)repeatedly until the formula is a flat conjunction of disjunctions. Themodeswitch (TRANSFORM_CNF_OR_COMPACTover 100 OR-children) is the exponential-blow-up guard: distributing OR over AND can multiply the number of conjuncts by the cross-product of disjunct sizes, so for very wide formulas the compact mode preserves the high-level OR rather than blowing it out. - Tag — set
PT_EXPR_INFO_CNF_DONEon every conjunct so a subsequentpt_cnfinvocation does not re-distribute.
The post-pass pt_tag_start_of_cnf_post marks the first conjunct of
each CNF list with a different flag so that downstream code can detect
the head of a CNF list separately from its members.
The pt_tag_terms_with_specs annotation happens after CNF: each
conjunct is tagged with the bitset of PT_SPEC ids it touches. The
optimizer reads this to decide which conjuncts are pushdown candidates
to which join input.
flowchart LR RAW["WHERE (a OR b) AND NOT (c AND d)"] RAW --> AOF["pt_and_or_form\n→ (a OR b) AND (NOT c OR NOT d)"] AOF --> DIST["pt_transform_cnf_∗\n→ (a OR b) AND (NOT c OR NOT d)\n (already CNF — flat)"] DIST --> TAG["each conjunct gets\nPT_EXPR_INFO_CNF_DONE\nplus spec-id tag"] TAG --> OPT["xasl_generation\n· optimizer reads each\nconjunct for pushdown"]
Statement-class coverage
Section titled “Statement-class coverage”pt_check_with_info’s switch handles every statement kind the parser
can produce; the heavy lifting differs:
| Statement kind | What pt_check_with_info does |
|---|---|
PT_SELECT, PT_UNION, PT_INTERSECTION, PT_DIFFERENCE | Full pipeline: resolve names, check WHERE, replace host vars, semantic_check_local (which calls semantic_type), expand IS NULL, CNF |
PT_INSERT, PT_UPDATE, PT_DELETE, PT_MERGE | Full pipeline plus DML-specific assignments / target validation |
PT_CREATE_INDEX, PT_ALTER_INDEX, PT_DROP_INDEX | pt_resolve_names + pt_check_create_index / pt_check_function_index_expr / pt_check_filter_index_expr; no host-var phase |
PT_CREATE_ENTITY, PT_ALTER, PT_DROP, etc. | DDL-specific checkers (uniqueness of attr names, partition-range validation, FK validity, etc.) |
PT_EVALUATE, PT_DO, PT_SET_* | Full pipeline — these are SQL-level expressions / commands and need typing |
PT_HOST_VAR, PT_VALUE, PT_NAME, PT_EXPR, PT_FUNCTION | Bare expressions — full pipeline |
The DDL paths skip the host-var pass because MSGCAT_SEMANTIC_HOSTVAR_IN_DDL
forbids host vars in DDL statements; the parser sets
parser->host_var_count if any were encountered, and the DDL arm
errors out before semantic checking.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. The line-number table at the end is a position hint scoped to this doc’s
updated:date.
Top-level driver (src/parser/)
Section titled “Top-level driver (src/parser/)”pt_compile(compile.c) — single-statement entry; sets the longjmp envelope, callspt_semantic_check.pt_class_pre_fetch(compile.c) — runs beforept_semantic_checkfrom the caller’s perspective; pre-locks every referenced relation. Not part of the semantic pass per se but the precondition forpt_flat_spec_preto find every relation in the workspace.pt_semantic_check(semantic_check.c) — wrapper forpt_check_with_info.pt_check_with_info(semantic_check.c) — the statement-kind switch; the scaffolding under which all four sub-passes run.
Name resolution (name_resolution.c)
Section titled “Name resolution (name_resolution.c)”SCOPES,PT_BIND_NAMES_ARG(name_resolution.c:78,87) — scope-stack frame and walk-state argument.pt_resolve_names— five-step driver.pt_flat_spec_pre— expand eachPT_SPECinto aflat_entity_list(handles inheritance andALL ... EXCEPT).pt_mark_group_having_pt_name— tag GROUP BY / HAVINGPT_NAMEs.pt_bind_names,pt_bind_names_post— main name-binding walk (push / pop scopes; handle PT_DOT_).pt_bind_scope— push a newSCOPESframe; if the spec is a derived table or CTE, recurse into it before exposing it.pt_bind_name_or_path_in_scope— resolve a singlePT_NAME(orPT_DOT_path) against the current scope stack.pt_get_resolution,pt_find_entity_in_scopes,pt_find_outer_entity_in_scopes— the walker primitives behind the binder.pt_resolve_star,pt_resolve_star_reserved_names—*expansion in the SELECT list.pt_resolve_correlation— bumpcorrelation_levelwhen a name walks outward; the optimizer reads this.pt_resolve_group_having_alias,_pt_name,_pt_expr,_pt_sort_spec,_internal— alias resolution in GROUP BY / HAVING.pt_resolve_natural_join,pt_resolve_natural_join_internal— NATURAL JOIN → INNER JOIN ON … rewrite.pt_check_Oracle_outerjoin,pt_clear_Oracle_outerjoin_spec_id— Oracle(+)syntax to ANSI rewrite.pt_bind_names_in_with_clause,pt_bind_names_in_cte— CTE binding.pt_bind_names_merge_insert,pt_bind_names_merge_update— MERGE-specific binding.pt_make_flat_name_list,pt_make_subclass_list— list builders for the flat entity list.
Aggregate-in-WHERE check (semantic_check.c)
Section titled “Aggregate-in-WHERE check (semantic_check.c)”pt_check_where— walks the predicate tree looking for an aggregate / analytic function and emitsMSGCAT_SEMANTIC_INVALID_AGGREGATE/MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS.pt_find_aggregate_analytic_in_where,pt_is_aggregate_function— predicate helpers.
Host-variable replacement (semantic_check.c)
Section titled “Host-variable replacement (semantic_check.c)”pt_check_and_replace_hostvar—parser_walk_treecallback.SET_HOST_VARIABLES_IF_INTERNAL_STATEMENT(macro) — inherits host vars from a parent parser context, used when the statement was generated internally (view transform / trigger).
Statement-specific checks (semantic_check.c)
Section titled “Statement-specific checks (semantic_check.c)”pt_semantic_check_local— the post-order callback that drives per-statement semantic checks and rewrites; also the call site ofpt_semantic_type.pt_check_into_clause,pt_check_into_clause_for_static_sql— INTO arity and INTO-in-subquery checks.pt_check_create_index,pt_check_function_index_expr,pt_check_filter_index_expr— DDL index-expression checks.pt_check_cast_op— CAST type / collation feasibility.pt_check_unique_attr— duplicate-column check.pt_check_range_partition_strict_increasing,pt_coerce_partition_value_with_data_type— partition-range validation.pt_expand_isnull_preds,pt_expand_isnull_preds_helper— expandcol IS NULLinto the disjunction of class-discriminating predicates needed for inheritance-aware nullability; runs afterpt_semantic_check_localand before CNF (so CNF sees the expansion).pt_mark_union_leaf_nodes— propagate UNION leaf flags upward.
Type checking and constant folding (type_checking.c)
Section titled “Type checking and constant folding (type_checking.c)”pt_semantic_type— driver; calls the type-check walk and the fold walk.pt_eval_type_pre— top-down callback (LIMIT rewrites, recursive- expr handling, derived-table outer-join flag, sort-spec subquery flag, search-condition cleanup).pt_eval_type— bottom-up callback (dispatches per node type topt_eval_expr_type,pt_eval_function_type,pt_where_type).pt_eval_expr_type— operator-driven + signature-driven type inference forPT_EXPR.pt_eval_function_type,pt_eval_function_type_old,pt_eval_function_type_new— same forPT_FUNCTION. The_newpath delegates tofunc_type::Nodeinfunc_type.cppfor new functions (JSON, REGEXP, BENCHMARK).pt_apply_expressions_definition— signature-table matcher.pt_get_expression_definition— returns the per-operatorEXPRESSION_DEFINITION.pt_coerce_expression_argument— wrap an argument inPT_CASTto a targetTP_DOMAIN.pt_check_expr_collation— collation-compatibility verification.pt_is_op_hv_late_bind— operator predicate; true if a host-var argument can be late-bound.pt_upd_domain_info— refreshdata_typeprecision/scale on thePT_EXPRfrom its operator and arguments.pt_where_type,pt_where_type_keep_true— predicate cut-off based on resolved Boolean values.pt_fold_constants_pre— descent gate (BENCHMARK).pt_fold_constants_post— per-node folder.pt_fold_const_expr— fold aPT_EXPRwhose every argument is aPT_VALUE.pt_fold_const_function— fold aPT_FUNCTIONsimilarly.pt_evaluate_db_value_expr— the actual compile-time evaluation primitive (lives inparse_evaluate.c); used by the folders.
Type infrastructure (parse_dbi.c, object_domain.c)
Section titled “Type infrastructure (parse_dbi.c, object_domain.c)”pt_data_type_to_db_domain(parse_dbi.c) — convertPT_DATA_TYPEtoTP_DOMAIN.tp_domain_new,tp_domain_construct(object_domain.c) — fresh domain allocation; the parser-side allocators wrap these.tp_Domains[]— the global per-DB_TYPEdomain cache.tp_is_domain_cached— cache lookup primitive for parameterized domains.
CNF normalization (cnf.c)
Section titled “CNF normalization (cnf.c)”pt_do_cnf— top-level walker that finds WHERE / HAVING inPT_SELECTand runs CNF.pt_cnf— three-step transform (AND/OR form → distribute → tag).pt_and_or_form— eliminate ⇒ / ⇔, push NOT inward.pt_aof_to_cnf— recursive AOF-to-CNF helper.pt_distributes_disjunction— distribute OR over AND.pt_transform_cnf_pre,pt_transform_cnf_post,pt_tag_start_of_cnf_post—parser_walk_treecallbacks.pt_tag_terms_with_specs— annotate each CNF conjunct with the spec-id bitset it touches.count_and_or— disjunct-count gate that selects betweenTRANSFORM_CNF_AND_ORandTRANSFORM_CNF_OR_COMPACTmodes.PT_EXPR_INFO_CNF_DONE— flag set on every conjunct after CNF.
Position hints as of 2026-04-30
Section titled “Position hints as of 2026-04-30”| Symbol | File | Line |
|---|---|---|
pt_compile | parser/compile.c | 381 |
pt_class_pre_fetch | parser/compile.c | 432 |
pt_semantic_check | parser/semantic_check.c | 12523 |
pt_check_with_info | parser/semantic_check.c | 12060 |
pt_check_where | parser/semantic_check.c | 17503 |
pt_check_and_replace_hostvar | parser/semantic_check.c | 11934 |
pt_semantic_check_local | parser/semantic_check.c | 10865 |
pt_check_into_clause | parser/semantic_check.c | 10684 |
pt_check_into_clause_for_static_sql | parser/semantic_check.c | 10657 |
pt_check_create_index | parser/semantic_check.c | 8888 |
pt_expand_isnull_preds | parser/semantic_check.c | 222 |
pt_mark_union_leaf_nodes | parser/semantic_check.c | 269 |
SCOPES struct | parser/name_resolution.c | 78 |
PT_BIND_NAMES_ARG struct | parser/name_resolution.c | 87 |
pt_bind_name_or_path_in_scope | parser/name_resolution.c | 841 |
pt_bind_scope | parser/name_resolution.c | 1206 |
pt_bind_names_post | parser/name_resolution.c | 1467 |
pt_bind_names | parser/name_resolution.c | 1974 |
pt_flat_spec_pre | parser/name_resolution.c | 4735 |
pt_resolve_star_reserved_names | parser/name_resolution.c | 7391 |
pt_resolve_star | parser/name_resolution.c | 7460 |
pt_resolve_natural_join_internal | parser/name_resolution.c | 8914 |
pt_resolve_natural_join | parser/name_resolution.c | 9026 |
pt_resolve_group_having_alias_pt_sort_spec | parser/name_resolution.c | 9142 |
pt_resolve_group_having_alias_pt_name | parser/name_resolution.c | 9158 |
pt_resolve_group_having_alias_pt_expr | parser/name_resolution.c | 9211 |
pt_resolve_group_having_alias_internal | parser/name_resolution.c | 9269 |
pt_resolve_group_having_alias | parser/name_resolution.c | 9302 |
pt_resolve_names | parser/name_resolution.c | 9350 |
pt_bind_names_merge_insert | parser/name_resolution.c | 10837 |
pt_bind_names_merge_update | parser/name_resolution.c | 10933 |
pt_bind_names_in_with_clause | parser/name_resolution.c | 11147 |
pt_bind_names_in_cte | parser/name_resolution.c | 11202 |
pt_coerce_expression_argument | parser/type_checking.c | 4473 |
pt_apply_expressions_definition | parser/type_checking.c | 5778 |
pt_where_type | parser/type_checking.c | 6495 |
pt_where_type_keep_true | parser/type_checking.c | 6689 |
pt_eval_type_pre | parser/type_checking.c | 7082 |
pt_fold_constants_pre | parser/type_checking.c | 7487 |
pt_fold_constants_post | parser/type_checking.c | 7661 |
pt_eval_type | parser/type_checking.c | 7714 |
pt_eval_expr_type | parser/type_checking.c | 8706 |
pt_upd_domain_info | parser/type_checking.c | 11245 |
pt_eval_function_type | parser/type_checking.c | 12285 |
pt_fold_const_expr | parser/type_checking.c | 17579 |
pt_fold_const_function | parser/type_checking.c | 18799 |
pt_semantic_type | parser/type_checking.c | 19073 |
pt_is_op_hv_late_bind | parser/type_checking.c | 20260 |
pt_check_expr_collation | parser/type_checking.c | 22076 |
pt_aof_to_cnf | parser/cnf.c | 255 |
pt_cnf | parser/cnf.c | 941 |
pt_do_cnf | parser/cnf.c | 1191 |
Cross-check Notes
Section titled “Cross-check Notes”The raw analyses are dated (versions 0.5 – 1.0, internal RND-2팀 review) and lag the current source on a few specific points:
-
pt_semantic_check_localis no longer the place wherept_semantic_typeis unconditionally called. The deck describespt_semantic_check_localas “calls semantic_type and constant folds”. In the current source,pt_check_with_infoschedules the walks and the call topt_semantic_typehappens both insidept_semantic_check_local(per-statement) and directly frompt_check_with_infofor non-DML node kinds (e.g.,PT_CREATE_INDEXcallspt_semantic_typedirectly atsemantic_check.c:12257). -
The
pt_eval_function_typesplit. The raw type-checking deck refers only topt_eval_function_typeas a single function. The current source has bothpt_eval_function_type_oldandpt_eval_function_type_new; the new C++17 path (func_type::Nodeinfunc_type.cpp) handles JSON, BENCHMARK, and REGEXP families. Most other functions still go through the old path. -
PT_JOIN_NATURALis dead. The deck flags it as “not used”; the comment is still in the enum (parse_tree.h). NATURAL JOIN is carried as abool naturalflag onPT_SPEC, not as a join-type.PT_JOIN_FULL_OUTERis also dead — full outer join is unsupported in CUBRID. -
The Oracle outer-join “Meaningless” predicate handling. The deck notes
SELECT ... FROM x, y ON y.i(+) = 1collapses toPT_JOIN_NONEand the predicate moves back to WHERE. Verified inpt_check_Oracle_outerjoin’s post-walk; the current source preserves this behavior. -
CNF
modeswitch threshold.pt_cnfswitches toTRANSFORM_CNF_OR_COMPACTwhencount_and_or > 100. The CNF deck does not mention this exponential-blow-up guard. This is a recent hardening; prefer the source. -
pt_check_into_clausehas a static-SQL variant. The “INTO” section of the per-statement deck describes only the regular arity check. The current source haspt_check_into_clause_for_static_sqlwhich performs the same check at static-SQL parse time (theis_parsing_static_sqlflag is also what gates folding inpt_semantic_type). -
The “host-var → expected_domain” handling has matured. The deck describes a single late-bind step. The current source performs five distinct settle-then-revisit operations on host vars: type unwrap (handle
-?andprior ?), late-bind operator classification (pt_is_op_hv_late_bind), expected-domain computation, expected-domain clear-on-late-bind (so the XASL generator emitsDB_TYPE_VARIABLE), and a final back-propagation of the inferred type into the host var’s expected domain. The five-step shape is implicit in the deck but explicit in the code. -
NATURAL JOIN attribute lookup for derived-tables / CTEs. The deck describes two cases (entity-fetched DB_ATTRIBUTE, derived table or CTE select-list / WITH-list). Verified in
pt_resolve_natural_join_internal; current source matches.
Open Questions
Section titled “Open Questions”-
pt_eval_function_type_newmigration plan. Three function families live in the C++17 path; the rest still usept_eval_function_type_old. Is there a documented roadmap for moving more functions over (and what does the migration look like for an existing aggregate)? Investigation path:git logfunc_type.cppplus thept_eval_function_type_newcallsites for merge-commits introducing new function families. -
donot_foldsemantics.SEMANTIC_CHK_INFO::donot_foldis set by some callers ofpt_check_with_info, including the view transform. The deck does not enumerate every site. Is there a complete list of folding-suppression contexts, and does any of them affect plan-cache correctness? Investigation path: grep fordonot_fold = trueacross the parser. -
PT_EXPR_INFO_CNF_DONEand re-entrance.pt_do_cnfclears the flag before applyingpt_cnf. What if the predicate already contains a partially CNF-tagged sub-tree from view transform? The current code unconditionally clears the flag on every top-level conjunct; a sub-conjunct’s flag is not cleared. Could this causept_cnfto skip a sub-tree that should have been re-CNFed? Investigation path: write a regression with a view that contains a CNF-tagged predicate andUNIONit with a query that re-enters CNF. -
pt_tag_terms_with_specsand natural-join post-conditions. The natural-join rewrite synthesizes newPT_EXPRpredicates and appends them to the rhs spec’son_cond. After this rewrite, the spec-id tag bitset on each synthesized predicate must be the union of lhs + rhs. Is this correctly produced? Investigation path: instrumentpt_tag_terms_with_specsand run the natural-join regression suite. -
Static-SQL parsing and folding.
pt_semantic_type’sis_parsing_static_sqlguard skips folding entirely. This means that constant predicates in static SQL stay alive in the parse tree until execution. Is the executor robust against this — does it fold lazily, or does it always evaluate? Investigation path: tracept_evaluate_db_value_exprcallers inquery/query_executor.c. -
Recursive-expression depth limit.
eval_recursive_expr_typewalks the chain to its bottom. Long chains (CASE WHEN ... CASE WHEN ...repeated hundreds of times) could blow the stack. Is there a depth guard? Investigation path: search forMAX_RECURSIVE_DEPTHor equivalent intype_checking.c. -
The five-bit
OR_MVCC_FLAG_MASKanalogue here? The MVCC record header has reserved bits; the parse-tree node structure has its own flag bytes (PT_NODE::flag,PT_EXPR_INFO_*,PT_NAME_INFO_*). Are any of those flag-byte bits reserved for future semantic-check use, or is the flag space allocated tight? Investigation path: enumeratePT_*_INFO_andPT_NODE_FLAG_*masks and confirm none overlap.
Sources
Section titled “Sources”Raw analyses (raw/code-analysis/cubrid/query-processing/)
Section titled “Raw analyses (raw/code-analysis/cubrid/query-processing/)”code_analysis_Semantic_check_Overview_v_1_0.pdf— the four-stage pipeline overview.code_analysis_Semantic_check-Name_Resolution_v_0_9.pdf— the longest deck (26 KB markdown after conversion); covers spec expansion, the scope stack,*resolution, derived tables, subqueries, joins (including Oracle(+)style), GROUP BY / HAVING aliasing, NATURAL JOIN rewriting, FOR UPDATE flagging.code_analysis_Semantic_check-Type_Checking_and_Constant_Folding_v_1_0.pdf—pt_semantic_typeand the type-evaluation rules.code_analysis_Semantic_check-Checking_Semantic_of_Particular_Statement_v_0_8.pdf— per-statement rewrites and error messages.type_checking_v1.0.pptx— the deeper companion deck for the type-checking pass (signature table, type coercion table, parameterized types, thePT_DATA_TYPE↔TP_DOMAINrelationship).code_analysis_Common_module_CNF_(parsercnf).pdf— CNF as a utility module shared between semantic check and view transform.
Sibling docs
Section titled “Sibling docs”knowledge/code-analysis/cubrid/cubrid-mvcc.md— same query-processing pipeline; the WHERE clause CNF normalization feeds into the optimizer that the MVCC-aware scan pages out.knowledge/code-analysis/cubrid/cubrid-heartbeat.md— unrelated but a style reference for the format of this document.
Textbook chapters / references
Section titled “Textbook chapters / references”- Database Internals (Petrov), Ch. 5 §“Query Processing” — sets the stage for the parser-to-optimizer pipeline.
- Database System Concepts (Silberschatz, Korth, Sudarshan), Ch. 4 §“Built-in Data Types”, Ch. 5 §“Functions and Procedures” — the type-system framing used in §“Theoretical Background”.
- Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, Ullman; “Dragon Book”), Ch. 6 §“Symbol Tables”, Ch. 9 §“Loop-Invariant Computations” — symbol table and constant-folding framing.
- Robinson, A Machine-Oriented Logic Based on the Resolution Principle, JACM 12, 1965 — the formal CNF transform.
- Nilsson, Principles of Artificial Intelligence, ch. “Resolution Refutations” — referenced directly by the CUBRID CNF deck.
CUBRID source (/data/hgryoo/references/cubrid/)
Section titled “CUBRID source (/data/hgryoo/references/cubrid/)”src/parser/compile.c—pt_compile,pt_class_pre_fetch.src/parser/semantic_check.c— the per-statement driver and every per-statement check.src/parser/semantic_check.h—SEMANTIC_CHK_INFOand the module’s public surface.src/parser/name_resolution.c— every name-resolution sub-pass.src/parser/type_checking.c—pt_semantic_typeand the type-evaluation rules.src/parser/cnf.c— CNF normalization.src/parser/parse_tree.h—PT_NODEdefinitions; the union ofinfo.<kind>payloads for every node type.src/parser/parse_dbi.c—pt_data_type_to_db_domainand thePT_DATA_TYPE↔TP_DOMAINbridge.src/parser/parse_evaluate.c—pt_evaluate_db_value_expr, the compile-time evaluator used by the constant folders.src/parser/func_type.cpp,func_type.hpp— the C++17 function type-evaluation machinery used bypt_eval_function_type_new.src/object/object_domain.{c,h}—TP_DOMAINand the domain cache.