CUBRID Parser — Flex/Bison Pipeline, PT_NODE Tree, and the Parser Memory Model
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A SQL parser sits between two textbook problems. The first is lexical analysis: chop a flat character stream into tokens that the grammar can reason about. The second is syntactic analysis: decide whether the token stream conforms to a context-free grammar and, if it does, build a tree that captures the derivation. The tooling for both halves was canonised by Aho, Sethi, Lam, and Ullman’s Compilers: Principles, Techniques, and Tools — the Dragon book — and its core abstractions are unchanged in modern DBMS parsers: regular languages → DFA for lexing, and a deterministic shift-reduce automaton (LR(1) / LALR(1) / GLR) for parsing.
Database Internals (Petrov), Ch. 1 §“Query Processor”, frames the same picture from the database side: SQL text enters the query processor, the parser produces an internal tree representation, the optimiser rewrites it, and the executor walks the rewritten tree. The parser owns one job — produce a tree the later passes can rely on — and one constraint — never silently accept input the grammar does not cover.
Lexing — regular languages and DFAs
A lexer is a regular-expression engine specialised for one
language. Every token class (keywords, identifiers, integer
literals, string literals, operators) is encoded as a regular
expression; the union of those regexes is compiled to a single
deterministic finite automaton; the DFA reads characters and
returns (token_class, lexeme) pairs. Compilers §3 derives the
NFA → DFA construction. Two practical points matter for any
real lexer: longest-match-wins disambiguation, and a stateful
start condition mechanism that lets the same surface syntax
mean different things in different contexts (string interior vs.
SQL keyword space, for example).
Flex is the standard free re-implementation of Lex. It generates a C
function yylex() that returns one token per call and exposes
yytext, yyleng, and a per-token yylval slot for the value
the parser will reduce on. CUBRID’s lexer is in
src/parser/csql_lexer.l; the generated csql_lexer.c is what
the build actually compiles.
Parsing — LALR(1) and GLR
Bison generates a shift-reduce parser from a context-free
grammar with semantic actions. The default flavour is
LALR(1) — Look-Ahead LR with one token of lookahead, the
same automaton family covered in Compilers §4.7. LALR(1)
covers most programming-language grammars cleanly; it does not
cover SQL cleanly because SQL has historic ambiguities (the
classic example is the parse of SELECT a FROM b GROUP BY c, d
where c, d could be a list or a tuple, depending on dialect).
Bison’s escape hatch is GLR (Generalized LR): when the
LALR(1) automaton hits a conflict it forks, parses both branches
in parallel, and discards the branches that fail to reduce. The
performance cost is real but bounded — most real SQL queries
encounter conflicts at only a handful of points. CUBRID compiles
its grammar with the %glr-parser declaration in
csql_grammar.y; this is what lets the grammar accept the
permissive superset of MySQL- and Oracle-style joins, hint
syntax, and method-call syntax that the surface SQL allows.
Bison’s semantic-action mechanism is what builds the tree.
Each grammar rule has an embedded C action block; when the
parser reduces by that rule, the action runs. The action sees
$1, $2, ... for the rule’s right-hand-side semantic values
and writes the rule’s own value into $$. The %union
declaration tells Bison the type of those values; in CUBRID
the union holds PT_NODE *node, int number, char *cptr,
and a few helper containers. The grammar is the only place
where parse-tree nodes are constructed for the user’s input —
every other module reads what the grammar built.
AST construction and error recovery
The parse tree is the contract the parser hands to the rest of the system. Two design choices shape every concrete realisation:
- Polymorphism mechanism. Every parse tree has nodes of many shapes (a SELECT is not a WHERE clause, which is not a literal). Three approaches dominate:
  - Class hierarchy (modern C++ / Java): one base class, virtual methods. Conventional but requires RTTI and virtual dispatch.
  - Tagged union (the C dialect): one struct with a discriminating tag and a union of per-type substructures. Cheaper at runtime, less safe at compile time.
  - Sum-type ADT (Haskell / OCaml / Rust): the language’s own union type with exhaustive matching.

  PostgreSQL, MySQL, and CUBRID all chose the tagged-union approach for the same C-era reasons.
- Error recovery. Compilers §4.1.4 enumerates the four classical strategies: panic-mode (skip until a synchronising token), phrase-level (replace tokens to make the rule reduce), error productions (grammar rules that match ill-formed input), and global correction (find the minimum-edit valid program). Bison’s built-in error token is panic-mode; production engines layer custom location tracking and message synthesis on top.
CUBRID’s combination is conventional: Flex with start
conditions, Bison %glr-parser, a tagged-union PT_NODE, and
panic-mode recovery enriched with YYLTYPE location info that
pinpoints the offending byte in the original buffer. The
distinguishing engineering choices are in the memory model
and in the per-type child layout, which we trace in
§“CUBRID’s Approach”.
Common DBMS Design
Every SQL engine that compiles SQL on the client side faces the same set of tasks; the engineering vocabulary that has settled across PostgreSQL, MySQL, and CUBRID is the lens this section adopts. The CUBRID-specific code in §“CUBRID’s Approach” is best read as one set of dials within this shared design space.
Flex + Bison as the default front-end stack
Three of the most-cited C/C++ RDBMSes — PostgreSQL, MySQL,
and CUBRID — generate their lexer and parser from Flex/Bison
sources. Postgres’ files are src/backend/parser/scan.l and
src/backend/parser/gram.y; MySQL’s are
sql/sql_lex.cc/sql/sql_yacc.yy; CUBRID’s are
src/parser/csql_lexer.l and src/parser/csql_grammar.y. The
generation step is part of the build (CMake on CUBRID, Autoconf
on Postgres) and the generated .c files are checked in only
sometimes; CUBRID checks in only the inputs.
The shared shape is:
- One .l file with three sections — Definitions, Rules, User subroutines — and the rules section returning token codes that the grammar’s %token declarations match.
- One .y file with %union, %type, %token, the start symbol, the grammar rules each carrying a C action block, and an epilogue with helper functions used from inside actions.
- Generated artifacts wired into the build through CMake’s flex_target / bison_target.
ParseNode / PT_NODE / Item — the tagged-union AST
PostgreSQL’s Node (src/include/nodes/nodes.h) tags every
parse-tree, plan-tree, and executor-tree element with a
NodeTag. The tag selects which concrete struct the pointer
actually refers to; visitor-style walks switch on it.
MySQL splits the surface differently — Item for expressions,
SELECT_LEX and friends for query blocks — but each variant
carries a tag and shares a base type; the discriminator is the
class identity in MySQL’s case (it does use C++ inheritance).
CUBRID’s PT_NODE is the same idea as Postgres’s Node minus
inheritance: a single struct with one PT_NODE_TYPE node_type
field and a giant union pt_statement_info info field whose
layout depends on node_type. Every per-type substructure is
named PT_<TYPE>_INFO and accessed as node->info.<lowercase>.
Per-PARSER_CONTEXT block allocator
Parser-time allocations are short-lived and many. Every engine
uses some form of arena / region / zone allocator: a few
big blocks of memory acquired up-front, individual nodes carved
out by bumping a pointer, and the whole arena released in one
free at the end. PostgreSQL has MemoryContext (per-context
palloc/pfree); MySQL has MEM_ROOT; CUBRID has the
PARSER_CONTEXT plus a per-parser pool of PARSER_NODE_BLOCKs
and PARSER_STRING_BLOCKs in parse_tree.c.
The win is the same in all three: nodes built during parsing
that turn out to be unreachable (because a rule failed to reduce
or because semantic check rewrote them) do not need individual
free calls — they vanish with the parser context. The cost is
that node pointers cannot safely outlive the parser; if the
optimizer or executor needs a node after the parser closes, it
has to copy it into its own arena (this is the role
PostgreSQL’s copyObject plays, and CUBRID’s
parser_copy_tree).
Per-type child layout — the apply function table
A parse tree walker has to know which children each node type has. Three approaches:
- Hard-coded per-type switches in every walker. Fast, brittle.
- Reflection (Java, C#) — read the field types at runtime. Slow, expensive at build time.
- Function-pointer table indexed by the discriminator — one function per type, manually written, that visits the correct set of children for that type.
Postgres uses the third approach with macros that expand to
visitor code; CUBRID uses it explicitly with the pt_apply_f
array of per-PT_NODE_TYPE functions, plus parallel
pt_init_f and pt_print_f arrays for initialisation and
pretty-printing. The trade-off is a small build-time tax (every
new node type must register three functions) for a fast and
debuggable walker.
Theory ↔ CUBRID mapping
| Theoretical / common concept | CUBRID name |
|---|---|
| Lexer source | src/parser/csql_lexer.l (flex) |
| Generated lexer | csql_lexer.c (built by flex_target in CMake) |
| Grammar source | src/parser/csql_grammar.y (bison %glr-parser) |
| Generated parser | csql_grammar.c (built by bison_target) |
| Token type prefix | csql_yy* (set via --name-prefix=csql_yy / --prefix=csql_yy) |
| Lexer entry | csql_yylex (invoked by Bison-generated csql_yyparse) |
| Parser entry | csql_yyparse → wrapped by parser_main (csql_grammar.y) |
| Parse-tree node | PT_NODE (parse_tree.h) |
| Node discriminator | enum pt_node_type PT_NODE_TYPE (parse_tree.h) |
| Per-type substructure union | union pt_statement_info PT_STATEMENT_INFO (parse_tree.h) |
| Polymorphic child-visit table | pt_apply_f[PT_NODE_NUMBER] (parse_tree_cl.c) |
| Polymorphic init table | pt_init_f[PT_NODE_NUMBER] |
| Polymorphic print table | pt_print_f[PT_NODE_NUMBER] |
| Tree walker | parser_walk_tree, parser_walk_leaves (parse_tree_cl.c) |
| Per-parser arena | PARSER_CONTEXT + PARSER_NODE_BLOCK + PARSER_STRING_BLOCK (parse_tree.c) |
| Node allocator | parser_new_node → parser_create_node → block bump (parse_tree_cl.c, parse_tree.c) |
| String allocator | parser_alloc (parse_tree.c) |
| Lexer-token name interning | pt_makename (scanner_support.c) |
| Bison location type | YYLTYPE { first_line, first_column, last_line, last_column, buffer_pos } (csql_grammar.y) |
| Per-node line/column | PT_NODE::line_number, column_number, buffer_pos |
| Error token | Bison built-in error; messages via csql_yyerror and PT_ERRORmf |
| Parser context lifecycle | parser_create_parser / parser_free_parser (parse_tree.c) |
| Public string-parse entry | parser_parse_string, parser_parse_string_with_escapes (parse_tree_cl.c) |
CUBRID’s Approach
We walk three levels in order: (1) the lex → parse → tree
pipeline at runtime, (2) the PT_NODE data model that the
pipeline produces, (3) the memory model that lets the
producer and the consumer share node pointers safely.
Pipeline at runtime
```mermaid
flowchart LR
    SQL["SQL text"]
    PC["PARSER_CONTEXT"]
    PARSE["parser_parse_string"]
    MAIN["parser_main"]
    LEX["csql_yylex<br/>(generated from csql_lexer.l)"]
    PARSER["csql_yyparse<br/>(generated from csql_grammar.y)"]
    ACT["reduce-action<br/>parser_new_node / parser_make_∗"]
    POOL["PARSER_NODE_BLOCK pool<br/>(per-parser arena)"]
    STK["parser->node_stack<br/>(top-level stmt list)"]
    TREE["PT_NODE root tree(s)"]
    STMTS["parser->statements[]"]
    SQL --> PARSE
    PARSE --> PC
    PARSE --> MAIN
    MAIN --> LEX
    MAIN --> PARSER
    LEX -- "(token, csql_yylval)" --> PARSER
    PARSER -- "on each reduce" --> ACT
    ACT -- "carve from arena" --> POOL
    ACT -- "wire children" --> TREE
    PARSER -- "top-level reduce" --> STK
    MAIN -- "snapshot stack" --> STMTS
    STMTS --> TREE
```
The figure makes three boundaries explicit. (grammar /
allocator) the grammar rules are the only place that calls
parser_new_node; reduce actions are where nodes are born.
(parse / drive) parser_main is a thin wrapper around
yyparse that owns the per-call setup (line/column reset,
buffer-position save/restore, hint table init, top-level
statement-array snapshot). (per-call arena) every node and
string allocated during the call lives in
PARSER_CONTEXT-owned blocks and is released by
parser_free_parser; nothing is heap-individual.
The lexer — csql_lexer.l
Flex source has the standard three-section layout:
```lex
// csql_lexer.l (skeleton)
%{
#include "csql_grammar.h"
#include "csql_grammar_scan.h"
#include "parse_tree.h"
#include "parser_message.h"
#include "system_parameter.h"
#include "message_catalog.h"

#define CSQL_MAXNAME 256
static int parser_yyinput (char *buff, int max_size);

#undef YY_INPUT
#define YY_INPUT(buffer, result, max_size) { \
  result = parser_yyinput(buffer, max_size); \
  result == 0 ? result = YY_NULL : result; \
}

#define YY_USER_ACTION { \
  yybuffer_pos += csql_yyget_leng (); \
  csql_yylloc.buffer_pos = yybuffer_pos; \
}
%}

%option yylineno
%x QUOTED_CHAR_STRING DOUBLY_QUOTED_CHAR_STRING DELIMITED_ID_NAME
%x BRACKET_ID_NAME BACKTICK_ID_NAME PLCSQL_TEXT
/* ... twelve more start conditions ... */

%%
[sS][eE][lL][eE][cC][tT]              { begin_token(yytext); return SELECT; }
[sS][eE][nN][sS][iI][tT][iI][vV][eE]  { begin_token(yytext); return SENSITIVE; }
[sS][eE][pP][aA][rR][aA][tT][oO][rR]  { begin_token(yytext);
                                        csql_yylval.cptr = pt_makename(yytext);
                                        return SEPARATOR; }
/* ... ~2000 more rules ... */
%%

/* user subroutines: parser_yyinput, csql_yyerror, yywrap, ... */
```

Three deliberate choices live in the snippet:
- YY_INPUT is overridden. The default Flex YY_INPUT reads from stdin. CUBRID parses an in-memory buffer (parser->buffer), so it routes Flex’s input through parser_yyinput, which copies bytes out of the parser context’s owned buffer and tracks yybuffer_pos.
- YY_USER_ACTION accumulates a byte-position counter. yybuffer_pos is the cumulative offset into the original SQL string of the token Flex just matched; it is the anchor every error message and every PT_NODE::buffer_pos later refers back to. Lines and columns alone are not enough — error messages quote the offending substring, which needs the byte offset.
- Keyword rules are case-insensitive by enumeration of character classes. [sS][eE][lL][eE][cC][tT] is the manual way of getting case-insensitive matching without asking Flex to apply %option case-insensitive (which would also affect identifier rules and string literals). The generated DFA pays a few extra states per keyword; the alternative would have been to lowercase the input first, which interferes with delimited identifiers like "FooBar".
The user-subroutines block holds the input drivers
(parser_yyinput, parser_yyinput_single_line,
parser_yyinput_multi_line), the error reporter
(csql_yyerror / csql_yyerror_explicit), and the
end-of-input handler yywrap (which always returns 1 — the
parser sees one buffer per parser_parse_string call, never a
chained input).
The fourteen start conditions (excerpt:
QUOTED_CHAR_STRING, DOUBLY_QUOTED_CHAR_STRING,
DELIMITED_ID_NAME, BRACKET_ID_NAME, BACKTICK_ID_NAME,
PL_LANG_SPEC, PLCSQL_TEXT, …) are how CUBRID handles the
fact that the same surface character (e.g. ") means
different things in the middle of a string literal versus at
the start of a delimited identifier. Each BEGIN(state)
action in the grammar/lexer flips the lexer into a state where
only the rules relevant to that context are eligible to match.
PL/CSQL text matches a multi-line balanced BEGIN ... END
block; expecting_pl_lang_spec is a one-shot flag that the
grammar sets right before it expects to see a procedure body.
The grammar — csql_grammar.y
The grammar uses Bison %glr-parser to handle SQL’s
ambiguities; the %union carries the value types reduce
actions can return; the rules build PT_NODEs.
```c
// csql_grammar.y — declarations (excerpt)
%{
/*%CODE_REQUIRES_START%*/
#include "json_table_def.h"
#include "parser.h"

typedef struct YYLTYPE
{
  int first_line;
  int first_column;
  int last_line;
  int last_column;
  int buffer_pos;   /* cumulative byte offset; updated by YY_USER_ACTION */
} YYLTYPE;
#define YYLTYPE_IS_DECLARED 1
/*%CODE_END%*/
%}

%locations
%glr-parser
%define parse.error verbose

%union
{
  int number;
  bool boolean;
  PT_NODE *node;
  char *cptr;
  container_2 c2;
  container_3 c3;
  container_4 c4;
  container_10 c10;
  container_11 c11;
  struct json_table_column_behavior jtcb;
}

%type <node> stmt
%type <node> select_stmt
%type <node> select_expression
%type <node> select_expression_opt_with
/* ... ~1100 more %type lines ... */
```

The YYLTYPE extension is the parser’s way of preserving the
lexer’s buffer_pos through reductions. The default Bison
YYLTYPE carries only (first_line, first_column, last_line, last_column); CUBRID adds buffer_pos so error
messages can render the offending byte range against the
original SQL text.
A representative reduce action — the one that pairs an
opt_with_clause with a select_expression — shows the
typical shape:
```c
// csql_grammar.y — select_expression_opt_with
select_expression_opt_with
  : opt_with_clause select_expression
    {{
      PT_NODE *with_clause = $1;
      PT_NODE *stmt = $2;
      if (stmt && with_clause)
        {
          stmt->info.query.with = with_clause;
        }
      $$ = stmt;
    }}
  ;
```

Three things to notice: (a) the action only wires up an
existing PT_NODE returned by the inner rule; the PT_SELECT
itself was allocated by an action deeper down, in
select_stmt, with parser_new_node (this_parser, PT_SELECT);
(b) the action does not return — it writes to $$, a local
variable in the generated parser; (c) this_parser is a
file-static PARSER_CONTEXT * that parser_main sets up
before calling yyparse and tears down afterwards.
The actual PT_SELECT allocation happens here:
```c
// csql_grammar.y — select_stmt (excerpt)
select_stmt
  : SELECT
    {{
      PT_NODE *node;
      parser_save_found_Oracle_outer ();
      if (parser_select_level >= 0)
        parser_select_level++;
      parser_hidden_incr_list = NULL;

      node = parser_new_node (this_parser, PT_SELECT);
      if (node)
        {
          node->info.query.q.select.flavor = PT_USER_SELECT;
          node->info.query.q.select.hint = PT_HINT_NONE;
        }
      parser_push_select_stmt_node (node);
      parser_push_hint_node (node);
    }}
    opt_hint_list          /* $3 */
    all_distinct           /* $4 */
    select_list            /* $5 */
    opt_select_param_list  /* $6 */
    /* ... wire children, pop the stmt node, set $$, ... */
  ;
```

Two patterns recur here and across the grammar: a
stmt-stack push in the action that creates the node, and
a stmt-stack pop in a later action that finalises it —
this is how mid-rule actions like opt_hint_list know which
PT_SELECT they belong to without threading explicit
arguments through every rule.
Driver — parser_main
parser_main lives at the bottom of csql_grammar.y (it has
to, because yyparse is a static function in the generated
csql_grammar.c and calling it from outside the .y file is
awkward):
```c
// parser_main — csql_grammar.y (condensed)
PT_NODE **
parser_main (PARSER_CONTEXT * parser)
{
  long desc_index = 0;
  long i, top;
  int rv, yybuffer_pos_save;
  PARSER_CONTEXT *this_parser_saved;

  if (!parser)
    return 0;

  parser_output_host_index = parser_input_host_index = desc_index = 0;

  this_parser_saved = this_parser;
  this_parser = parser;   /* publish to grammar-action TLS */

  dbcs_start_input ();

  yycolumn = yycolumn_end = 1;
  yybuffer_pos_save = yybuffer_pos;
  yybuffer_pos = 0;
  is_dblink_query_string = 0;
  expecting_pl_lang_spec = 0;
  csql_yylloc.buffer_pos = 0;

  g_query_string = NULL;
  g_query_string_pos = 0;
  g_query_string_len = 0;
  g_original_buffer_len = 0;

  pt_initialize_hint (parser, parser_hint_table);

  rv = yyparse ();   /* drives csql_yylex through tokens */

  /* parser_main can be re-entered (loaddb -s); restore yybuffer_pos. */
  yybuffer_pos = yybuffer_pos_save;

  pt_cleanup_hint (parser, parser_hint_table);

  if (pt_has_error (parser) || parser->stack_top <= 0 || !parser->node_stack)
    {
      parser->statements = NULL;
    }
  else
    {
      parser->statements =
        (PT_NODE **) parser_alloc (parser,
                                   (1 + parser->stack_top) * sizeof (PT_NODE *));
      if (parser->statements)
        {
          for (i = 0, top = parser->stack_top; i < top; i++)
            parser->statements[i] = parser->node_stack[i];
          parser->statements[top] = NULL;
        }
      parser->host_var_count = parser_input_host_index;
      /* ... allocate host_variables, host_var_expected_domains ... */
    }
  /* ... restore this_parser ... */
  return parser->statements;
}
```

The save/restore of this_parser and yybuffer_pos matters
because of the loaddb -s re-entry path: a parse-and-execute
loop inside loaddb can call parser_main recursively while
the outer parser_main is still parsing the input file’s
statements.
The parser->statements array is a NULL-terminated list of
top-level statements. CUBRID’s parser accepts a multi-statement
buffer (e.g. SELECT 1; SELECT 2;), pushes each one onto
node_stack as the top-level rule reduces, and parser_main
copies the stack into the result array at the end.
PT_NODE — the polymorphic-tagged-union node
The node type is one struct with one tag and one union:
```c
// parser_node — src/parser/parse_tree.h
struct parser_node
{
  PT_NODE_TYPE node_type;       /* the type of SQL statement this represents */
  int parser_id;                /* which parser did I come from */
  int line_number;              /* the user line number originating this */
  int column_number;            /* the user column number originating this */
  int buffer_pos;               /* position in the parse buffer of the string */
  char *sql_user_text;          /* user input sql string */
  int sql_user_text_len;

  PT_NODE *next;                /* forward link for NULL-terminated list */
  PT_NODE *or_next;             /* forward link for DNF list */
  PT_NODE *next_row;            /* PT_VALUE/NAME/EXPR row in PT_NODE_LIST */
  void *etc;                    /* application-specific info hook */
  UINTPTR spec_ident;           /* entity-spec equivalence class (for FROM resolve) */
  TP_DOMAIN *expected_domain;   /* expected domain for input marker */
  PT_NODE *data_type;           /* for non-primitive types, sets, objects */
  XASL_ID *xasl_id;             /* XASL_ID for this SQL statement */
  const char *alias_print;
  PARSER_VARCHAR *expr_before_const_folding;
  PT_TYPE_ENUM type_enum;       /* PT_TYPE_INTEGER / VARCHAR / ... */
  CACHE_TIME cache_time;
  int sub_host_var_count;
  int *sub_host_var_index;

  struct
  {                             /* 27 single-bit flags */
    unsigned recompile:1;
    unsigned cannot_prepare:1;
    /* ... use_plan_cache, is_hidden_column, with_rollup,
     * do_not_fold, is_value_query, do_not_replace_orderby,
     * use_auto_commit, ... 27 in total */
  } flag;

  PT_STATEMENT_INFO info;       /* depends on 'node_type' field */
};
```

Three observations:
- next / or_next / next_row are three independent list spines. next is the natural list (select-list elements, AND-conjoined predicates, FROM-list specs). or_next is the DNF spine for predicate normalisation (a = 1 OR a = 2 OR a = 3 becomes a next of one and an or_next of three). next_row is for VALUES queries that reduce to PT_NODE_LIST. The walker (pt_walk_private) descends each spine independently — this is what lets transformation passes splice into one spine without disturbing the others.
- data_type is a child, not an in-place type tag. type_enum is the cheap discriminator (PT_TYPE_INTEGER, PT_TYPE_VARCHAR); data_type is a separate PT_NODE of type PT_DATA_TYPE that carries domain-specific extras (precision, scale, collation). The walker’s PT_APPLY_WALK (parser, node->data_type, walk) is what makes type-checking transforms uniform across all parents.
- Line / column / buffer_pos live on every node. They are set by pt_parser_line_col inside parser_new_node. Error messages can therefore quote the source range of any node in any later pass without re-parsing — semantic check, the optimizer, even XASL generation can produce source-anchored diagnostics.
The PT_NODE_TYPE enum carries dual semantics: the values for
SQL statements (PT_SELECT, PT_INSERT, …) are
intentionally the same as CUBRID_STMT_* codes from the API,
so a PT_NODE discriminator can be returned to the API client
without translation:
```c
// PT_NODE_TYPE — src/parser/parse_tree.h (excerpt)
enum pt_node_type
{
  PT_NODE_NONE = CUBRID_STMT_NONE,
  PT_ALTER = CUBRID_STMT_ALTER_CLASS,
  PT_CREATE_ENTITY = CUBRID_STMT_CREATE_CLASS,
  PT_INSERT = CUBRID_STMT_INSERT,
  PT_SELECT = CUBRID_STMT_SELECT,
  PT_UPDATE = CUBRID_STMT_UPDATE,
  PT_DELETE = CUBRID_STMT_DELETE,
  /* ... ~50 more statement-typed nodes ... */

  PT_DIFFERENCE = CUBRID_MAX_STMT_TYPE,  /* internal-only types from here down */
  PT_INTERSECTION,
  PT_UNION,

  /* expression / clause / fragment nodes */
  PT_EXPR,
  PT_NAME,
  PT_VALUE,
  PT_FUNCTION,
  PT_DOT_,
  PT_SPEC,
  PT_DATA_TYPE,
  PT_SORT_SPEC,
  PT_HOST_VAR,
  /* ... ~40 more ... */

  PT_NODE_NUMBER,               /* the number of node types */
  PT_LAST_NODE_NUMBER = PT_NODE_NUMBER
};
```

The split is meaningful: node_type < CUBRID_MAX_STMT_TYPE
identifies a top-level statement that the API and the executor
both need; node_type >= CUBRID_MAX_STMT_TYPE is internal
parse-tree machinery.
The per-type info union is in parse_tree.h:3470:
```c
// pt_statement_info — src/parser/parse_tree.h (excerpt)
union pt_statement_info
{
  PT_ZZ_ERROR_MSG_INFO error_msg;
  PT_ALTER_INFO alter;
  PT_ALTER_TRIGGER_INFO alter_trigger;
  PT_CREATE_ENTITY_INFO create_entity;
  PT_DELETE_INFO delete_;
  PT_DOT_INFO dot;
  PT_EXPR_INFO expr;
  PT_FUNCTION_INFO function;
  PT_INSERT_INFO insert;
  PT_NAME_INFO name;
  PT_QUERY_INFO query;          /* SELECT, UNION, etc. */
  PT_SPEC_INFO spec;            /* FROM-clause table spec */
  PT_VALUE_INFO value;
  PT_DATA_TYPE_INFO data_type;
  /* ... 80+ more variants ... */
};
```

PT_QUERY_INFO::q.select is the one most users picture when
they think of a parse tree: it has list (the select list),
from, where, group_by, having, connect_by,
start_with, using_index, hint slots, order_by (lifted to
PT_QUERY_INFO::order_by because UNION shares it), limit,
for_update, and a dozen others. Each of these is a PT_NODE *
that points to whatever the grammar reduced into it.
Per-type child layout — pt_apply_f, pt_init_f, pt_print_f
Three function-pointer tables, all sized PT_NODE_NUMBER, are
the polymorphism mechanism:
```c
// parse_tree_cl.c — apply/init/print function tables (excerpt)
typedef PT_NODE *(*PARSER_INIT_NODE_FUNC) (PT_NODE *);
typedef PARSER_VARCHAR *(*PARSER_PRINT_NODE_FUNC) (PARSER_CONTEXT * parser,
                                                   PT_NODE * p);
typedef PT_NODE *(*PARSER_APPLY_NODE_FUNC) (PARSER_CONTEXT * parser,
                                            PT_NODE * p, void *arg);

static PARSER_INIT_NODE_FUNC *pt_init_f = NULL;
static PARSER_PRINT_NODE_FUNC *pt_print_f = NULL;
static PARSER_APPLY_NODE_FUNC *pt_apply_f = NULL;

static PARSER_APPLY_NODE_FUNC pt_apply_func_array[PT_NODE_NUMBER];
static PARSER_INIT_NODE_FUNC pt_init_func_array[PT_NODE_NUMBER];
static PARSER_PRINT_NODE_FUNC pt_print_func_array[PT_NODE_NUMBER];
```

parser_init_func_vectors (called once from
parser_create_parser when pt_apply_f == NULL) populates the
arrays. The per-PT_SELECT apply function is the canonical
example:
```c
// pt_apply_select — parse_tree_cl.c
static PT_NODE *
pt_apply_select (PARSER_CONTEXT * parser, PT_NODE * p, void *arg)
{
  PT_APPLY_WALK (parser, p->info.query.with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.list, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.from, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.where, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.connect_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.start_with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.after_cb_filter, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.group_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.having, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.using_index, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.with_increment, arg);
  /* ... more hint slots ... */
  PT_APPLY_WALK (parser, p->info.query.into_list, arg);
  PT_APPLY_WALK (parser, p->info.query.order_by, arg);
  PT_APPLY_WALK (parser, p->info.query.orderby_for, arg);
  PT_APPLY_WALK (parser, p->info.query.qcache_hint, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.check_where, arg);
  PT_APPLY_WALK (parser, p->info.query.limit, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.for_update, arg);
  return p;
}
```

The order matters: every walker (semantic check, type
inference, name resolution, plan generation, pretty-printing)
visits these children in this order. A new child added to
PT_QUERY_INFO::q.select that is not added here will be
silently invisible to every existing walker — which is one of
the few places in the parser where the build will not warn
about an oversight.
pt_init_select is the parallel for default initialisation:
```c
// pt_init_select — parse_tree_cl.c
static PT_NODE *
pt_init_select (PT_NODE * p)
{
  p->info.query.q.select.hint = PT_HINT_NONE;
  p->info.query.q.select.check_cycles = CONNECT_BY_CYCLES_ERROR;
  p->info.query.all_distinct = PT_ALL;
  p->info.query.is_subquery = (PT_MISC_TYPE) 0;
  p->info.query.hint = PT_HINT_NONE;
  p->info.query.scan_op_type = S_SELECT;
  return p;
}
```

pt_init_f[node->node_type] is called from parser_init_node
on every newly-allocated node, so the default values are
in-place before any reduce action gets to the new node.
The walker — parser_walk_tree / pt_walk_private
parser_walk_tree is the universal pre/post-order traversal:
```c
// parser_walk_tree — parse_tree_cl.c
PT_NODE *
parser_walk_tree (PARSER_CONTEXT * parser, PT_NODE * node,
                  PT_NODE_WALK_FUNCTION pre_function, void *pre_argument,
                  PT_NODE_WALK_FUNCTION post_function, void *post_argument)
{
  if (node == NULL)
    return NULL;

  PT_WALK_ARG walk_argument;
  walk_argument.continue_walk = PT_CONTINUE_WALK;
  walk_argument.pre_function = pre_function;
  walk_argument.pre_argument = pre_argument;
  walk_argument.post_function = post_function;
  walk_argument.post_argument = post_argument;

  PT_APPLY_WALK (parser, node, &walk_argument);
  return node;
}
```

The real engine is pt_walk_private. For each node it:
(a) calls pre_function, (b) dispatches through
pt_apply_f[node_type] to visit the typed children, then
visits data_type, then or_next, then next, in that order,
(c) calls post_function. The four continue_walk modes
(PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK,
PT_LIST_WALK) let the pre-function prune the search:
```c
// pt_walk_private — parse_tree_cl.c (condensed)
static PT_NODE *
pt_walk_private (PARSER_CONTEXT * parser, PT_NODE * node, void *void_arg)
{
  PT_WALK_ARG *walk = (PT_WALK_ARG *) void_arg;
  PT_NODE_TYPE node_type;
  PARSER_APPLY_NODE_FUNC apply;
  int save_continue;

  if (walk->pre_function)
    {
      node = (*walk->pre_function) (parser, node, walk->pre_argument,
                                    &(walk->continue_walk));
      if (!node)
        return NULL;
    }

  if (walk->continue_walk != PT_STOP_WALK)
    {
      save_continue = walk->continue_walk;

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK)
        {
          node_type = node->node_type;
          if (node_type >= PT_LAST_NODE_NUMBER
              || !(apply = pt_apply_f[node_type]))
            return NULL;

          (*apply) (parser, node, walk);                  /* typed children */
          PT_APPLY_WALK (parser, node->data_type, walk);  /* data_type child */
        }

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK
          || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->or_next, walk);

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->next, walk);

      if (walk->continue_walk != PT_STOP_WALK)
        walk->continue_walk = save_continue;
    }

  if (walk->post_function)
    node = (*walk->post_function) (parser, node, walk->post_argument,
                                   &(walk->continue_walk));
  return node;
}
```

The traversal shape — pre-visit, typed-child via apply,
data_type, or_next, next, post-visit — is fixed; only
the user-supplied pre_function and post_function vary.
This is how every later pass (pt_resolve_names,
pt_check_with_info, pt_semantic_type,
pt_to_buildlist_proc, pt_print_tree) plugs into the same
walker without each having to re-implement the per-type child
visit. The companion parser_walk_leaves is the variant that
starts at the leaves of the input node — it skips the
pre-visit on node itself and goes straight to the typed
children.
Memory model — per-parser arena
The third key piece is the allocator. Every parse-time allocation goes through one of two paths into a per-parser pool:
```c
// parse_tree.c — block-pool sketch (one block per pool node)
#define NODES_PER_BLOCK 256

typedef struct parser_node_block PARSER_NODE_BLOCK;
struct parser_node_block
{
  PARSER_NODE_BLOCK *next;
  int parser_id;
  PT_NODE nodes[NODES_PER_BLOCK];
};

static PARSER_NODE_BLOCK *parser_Node_blocks[HASH_NUMBER];
static PARSER_NODE_FREE_LIST *parser_Node_free_lists[HASH_NUMBER];
static PARSER_STRING_BLOCK *parser_String_blocks[HASH_NUMBER];

static int parser_id = 1;
```

`parser_create_parser` is the lifecycle entry:
```c
// parser_create_parser — parse_tree.c (condensed)
PARSER_CONTEXT *
parser_create_parser (void)
{
  PARSER_CONTEXT *parser;
  struct timeval t;
  struct drand48_data rand_buf;

  parser = (PARSER_CONTEXT *) calloc (sizeof (PARSER_CONTEXT), 1);
  if (parser == NULL)
    {
      er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_OUT_OF_VIRTUAL_MEMORY,
              1, sizeof (PARSER_CONTEXT));
      return NULL;
    }

#if !defined (SERVER_MODE)
  parser_init_func_vectors ();  /* lazy fill of pt_apply_f / init_f / print_f */
#endif

  parser->id = parser_id++;
  if (pt_register_parser (parser) == ER_FAILED)
    {
      free_and_init (parser);
      return NULL;
    }

  parser->execution_values.row_count = -1;
  /* ... seed lrand48 / drand48 / etc ... */
  parser->query_id = NULL_QUERY_ID;
  /* ... 15 single-bit flags ... */
  return parser;
}
```

Two non-obvious facts. First, `parser_init_func_vectors` is
lazy-fired here — the first parser created in a process pays
the cost of populating the three function-pointer tables, every
subsequent parser inherits them. Second, the parser does not
own its node blocks directly; it owns a parser_id and the
node blocks are looked up in a process-wide hash
(parser_Node_blocks[HASH_NUMBER]). This makes
parser_free_parser a hash-walk that reclaims every block
tagged with the dying parser’s id.
Node allocation:
```c
// parser_new_node — parse_tree_cl.c
PT_NODE *
parser_new_node (PARSER_CONTEXT * parser, PT_NODE_TYPE node_type)
{
  PT_NODE *node;

  node = parser_create_node (parser);    /* carve from PARSER_NODE_BLOCK */
  if (node)
    {
      parser_init_node (node, node_type);  /* memset + pt_init_f[node_type] */
      pt_parser_line_col (node);           /* stamp from current YYLTYPE */
      node->sql_user_text = g_query_string;
      node->sql_user_text_len = g_query_string_len;
    }
  return node;
}
```

Three things happen in a single call: (a) a `PT_NODE`-sized
slot is taken from the parser’s node-block pool; (b) the slot
is zeroed, the node_type field is set, and
pt_init_f[node_type] runs to fill in default values for the
type-specific union arm; (c) the (line, column, buffer_pos)
triple is stamped from the lexer’s current
csql_yylloc/g_query_string_pos so future error messages
can quote the source position of this exact node.
parser_init_node is the second half:
```c
// parser_init_node — parse_tree_cl.c
PT_NODE *
parser_init_node (PT_NODE * node, PT_NODE_TYPE node_type)
{
  assert (node != NULL);
  assert (node_type < PT_LAST_NODE_NUMBER);

  int parser_id = node->parser_id;

  memset (node, 0x00, sizeof (PT_NODE));
  node->buffer_pos = -1;
  node->type_enum = PT_TYPE_NONE;
  node->parser_id = parser_id;
  node->node_type = node_type;
  if (pt_init_f[node_type])
    node = (pt_init_f[node_type]) (node);

  return node;
}
```

The `parser_id` is preserved across the memset because the
slot’s identity (which parser does it belong to?) outlives the
node’s logical state; this is what lets parser_reinit_node
re-stamp an in-use slot during walk-rewrites without losing
the slot’s home parser.
The companion parser_alloc is for everything that is not
a PT_NODE — strings, arrays, host-variable arrays:
```c
// parser_alloc — parse_tree.c (excerpt)
void *
parser_alloc (const PARSER_CONTEXT * parser, const int length)
{
  /* ... locate or create a PARSER_STRING_BLOCK with `length` bytes free,
   * bump the block's high-water mark, return the new region ... */
}
```

Strings inside `parser_alloc`'s pool live and die with the
parser context just like nodes. The grammar’s calls to
pt_makename(yytext) (which calls pt_append_string) for
every identifier therefore do not leak — the entire universe
of identifier strings allocated during a parse is freed in one
sweep by parser_free_parser → pt_free_string_blocks.
Helpers used by reduce actions
A handful of functions live in `csql_grammar.y`'s epilogue
because every grammar action calls them:
```c
// parser_make_expression — csql_grammar.y
PT_NODE *
parser_make_expression (PARSER_CONTEXT * parser, PT_OP_TYPE OP,
                        PT_NODE * arg1, PT_NODE * arg2, PT_NODE * arg3)
{
  PT_NODE *expr = parser_new_node (parser, PT_EXPR);

  if (expr)
    {
      expr->info.expr.op = OP;
      if (pt_is_operator_logical (expr->info.expr.op))
        expr->type_enum = PT_TYPE_LOGICAL;

      expr->info.expr.arg1 = arg1;
      expr->info.expr.arg2 = arg2;
      expr->info.expr.arg3 = arg3;
      /* ... INST_NUM / GROUPBY_NUM / ORDERBY_NUM compatibility checks ... */
      /* ... mark parser_cannot_cache for SERIAL / SYS_DATE-family ops ... */
    }
  return expr;
}

static PT_NODE *
parser_make_link (PT_NODE * list, PT_NODE * node)
{
  parser_append_node (node, list);
  return list;
}
```

`parser_make_expression` is what every binary or ternary
operator rule in the grammar reduces to — <expr> + <expr>,
<expr> AND <expr>, <expr> BETWEEN <expr> AND <expr> all
funnel through it. parser_make_link is what lets a recursive
list rule (select_list : select_list ',' expr) extend the
list using the next pointer rather than allocating a separate
list node. These two helpers — plus
parser_make_func_with_arg_count for arity-checked function
nodes and pt_makename for interning identifier strings — are
about 80% of every reduce action.
Putting it together — the SELECT example
The deck's running example, `SELECT * FROM code WHERE s_name > 'G';`, exercises every piece above. The rough trace:
- The lexer returns the tokens `SELECT '*' FROM IDENTIFIER('code') WHERE IDENTIFIER('s_name') '>' STRING_LITERAL('G') ';'`. `IDENTIFIER` and `STRING_LITERAL` carry their lexeme through `pt_makename` / `pt_append_bytes`; the others carry only their token code.
- As the bottom-up parser reduces, each rule's action allocates the right `PT_NODE` and wires its children. Allocation order is roughly bottom-up: the literal `'G'` reduces first (`PT_VALUE`, `type_enum = PT_TYPE_CHAR`), then the identifier `s_name` (`PT_NAME`), then the comparison (`PT_EXPR` with `op = PT_GT`), then the wildcard `*` (`PT_VALUE`, `type_enum = PT_TYPE_STAR`), then the FROM-spec (`PT_SPEC` whose `entity_name` points at a `PT_NAME` for `code`), then the `PT_SELECT` whose `info.query.q.select.list` / `from` / `where` points at all of the above.
- The top-level reduce pushes the resulting `PT_SELECT` onto `parser->node_stack`. `parser_main` snapshots the stack into `parser->statements` and returns it.
After parser_parse_string returns, the rest of the system
sees only the PT_SELECT root. The lexer and grammar are out
of the picture; every later transformation walks
pt_apply_f/pt_walk_private and rewrites the same arena.
Build wiring — Flex/Bison via CMake
Both `.l` and `.y` are inputs to a CMake custom command that
runs Flex and Bison at build time:
```cmake
# CMakeLists.txt (root) — CSQL FLEX/BISON targets (excerpt)
# Replace old bison directives with new ones; regenerate csql_grammar.yy
# whenever csql_grammar.y is modified.
add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
  COMMAND ${CMAKE_COMMAND}
    -DGRAMMAR_INPUT_FILE=${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y
    -DGRAMMAR_OUTPUT_FILE=${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
    -P ${CMAKE_SOURCE_DIR}/cmake/CSQLGrammarTransform.cmake
  MAIN_DEPENDENCY ${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y)

set(CSQL_GRAMMAR_INPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy)

bison_target(csql_grammar ${CSQL_GRAMMAR_INPUT} ${CSQL_GRAMMAR_OUTPUT}
  COMPILE_FLAGS "--no-lines --name-prefix=csql_yy -d -r all")
flex_target(csql_lexer ${CSQL_LEXER_INPUT} ${CSQL_LEXER_OUTPUT}
  COMPILE_FLAGS "--noline --never-interactive --prefix=csql_yy")
add_flex_bison_dependency(csql_lexer csql_grammar)

add_custom_target(gen_csql_grammar DEPENDS ${BISON_csql_grammar_OUTPUTS})
add_custom_target(gen_csql_lexer DEPENDS ${FLEX_csql_lexer_OUTPUTS})
```

Two notable choices: `--name-prefix=csql_yy` (so all generated
symbols begin csql_yy* and do not collide with the loader’s
own Flex/Bison grammar in src/loaddb/), and the
csql_grammar.yy indirection — csql_grammar.y is
pre-processed by CSQLGrammarTransform.cmake before Bison
sees it, which is how the project rewrites legacy YACC
directives into Bison-3 syntax without editing the original
.y file.
Source Walkthrough
Anchor on symbol names, not line numbers. The CUBRID source moves; the position table at the end is scoped to the doc's
`updated:` date.
Lexer (src/parser/csql_lexer.l)
- Definitions block — `YY_INPUT` / `YY_USER_ACTION` / start conditions / globals (`yybuffer_pos`, `is_dblink_query_string`, `expecting_pl_lang_spec`).
- Keyword rules — about 600 `[xX]...` patterns each returning a Bison token code. Keywords that need their lexeme retained set `csql_yylval.cptr = pt_makename(yytext)` first.
- Identifier / numeric / string-literal rules.
- Start-condition rules — `<QUOTED_CHAR_STRING>`, `<DELIMITED_ID_NAME>`, `<BRACKET_ID_NAME>`, `<BACKTICK_ID_NAME>`, `<PLCSQL_TEXT>`, `<POST_PLCSQL_TEXT>`, `<PL_LANG_SPEC>`, etc.
- User-subroutines block — `parser_yyinput` and its single / multi-line variants, `csql_yyerror`, `csql_yyerror_explicit`, `yywrap`, `begin_token`, `parser_c_hint` / `parser_line_hint`.
Grammar (src/parser/csql_grammar.y)
- Prologue (`%{ ... %}`) — `YYLTYPE` extension with `buffer_pos`, helper-container typedefs (`container_2..11`), `parser_make_func_with_arg_count`, `parser_make_expression`, `parser_make_link` declarations.
- Bison declarations — `%locations`, `%glr-parser`, `%union`, `%type` (~1100 entries), `%token` (~600 entries).
- Grammar rules — `stmt`, `stmt_`, `select_stmt`, `select_expression`, `select_expression_opt_with`, `select_list`, `from_list`, `extended_table_specification`, `single_table_spec`, `subquery`, `expression`, `predicate`, `function_call`, `data_type`, `create_stmt`, `alter_stmt`, `insert_stmt`, `update_stmt`, `delete_stmt`, `merge_stmt` … ~1700 productions.
- Epilogue — `parser_initialize_parser_context`, `parser_make_func_with_arg_count`, `parser_make_expression`, `parser_make_link`, `parser_make_link_or`, `parser_main`, `parser_init_yydebug`.
Parse-tree definitions (src/parser/parse_tree.h)
- `enum pt_node_type` — `PT_NODE_NUMBER` / `PT_LAST_NODE_NUMBER` sentinel.
- `enum pt_type_enum` — runtime value-type tag (`PT_TYPE_INTEGER`, `PT_TYPE_VARCHAR`, …).
- Per-type `PT_*_INFO` structs (`PT_QUERY_INFO`, `PT_SELECT_INFO` inside `PT_QUERY_INFO`, `PT_SPEC_INFO`, `PT_NAME_INFO`, `PT_EXPR_INFO`, `PT_VALUE_INFO`, `PT_DATA_TYPE_INFO`, …).
- `union pt_statement_info PT_STATEMENT_INFO`.
- `struct parser_node` — the `PT_NODE` itself with the 27 flag bits and the `info` union.
- `struct parser_context` — node stack, statements array, per-parser globals (`error_msgs`, `host_variables`, …).
- `enum { PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK }` — walker continuation modes.
- `typedef PT_NODE *(*PT_NODE_WALK_FUNCTION) (...)` — the walker function-pointer type users plug into `parser_walk_tree`.
Tree-and-arena code (src/parser/parse_tree.c)
- `PARSER_NODE_BLOCK`, `PARSER_NODE_FREE_LIST`, `PARSER_STRING_BLOCK` — the per-parser pool types.
- `parser_create_node_block` — extend the block list.
- `parser_create_node` — bump-allocate one `PT_NODE` slot.
- `pt_free_node_blocks` — release all blocks for a parser id.
- `parser_create_string_block` / `parser_allocate_string_buffer` — strings.
- `pt_register_parser` / `pt_unregister_parser` — process-wide hash registration.
- `parser_alloc` — generic-allocator wrapper.
- `pt_append_string` / `pt_append_varchar` / `pt_append_bytes` — string accumulation.
- `parser_create_parser` / `parser_free_parser` — context lifecycle.
- `parser_free_node_resources` / `parser_free_node` — individual-node free (rarely called; the arena handles bulk free).
Walker, init, helpers (src/parser/parse_tree_cl.c)
- `pt_apply_f`, `pt_init_f`, `pt_print_f` — the three per-`PT_NODE_TYPE` function-pointer arrays.
- `pt_apply_func_array`, `pt_init_func_array`, `pt_print_func_array` — the static backing arrays.
- `pt_init_apply_f`, `pt_init_init_f`, `pt_init_print_f` — populate the three arrays.
- `parser_init_func_vectors` — lazy entry called from `parser_create_parser`.
- `pt_walk_private` — the recursive walker, shared by `parser_walk_tree` and `parser_walk_leaves`.
- `parser_walk_tree`, `parser_walk_leaves`, `pt_continue_walk` — public traversal API.
- `parser_new_node`, `parser_init_node`, `parser_reinit_node` — node creation and re-init.
- `parser_free_tree`, `parser_free_subtrees` — explicit subtree free (used by transforms that rewrite a subtree before the parser dies).
- `parser_copy_tree`, `parser_copy_tree_list` — deep-copy with `copy_node_in_tree_pre`.
- `parser_append_node`, `parser_append_node_or` — list-spine helpers.
- `pt_apply_select`, `pt_init_select`, `pt_print_select` — the canonical example for `PT_SELECT`; every other `PT_NODE_TYPE` has a parallel triple.
- `parser_parse_string`, `parser_parse_string_with_escapes`, `parser_parse_string_use_sys_charset`, `parser_parse_file` — public entrypoints.
Public API (src/parser/parser.h)
- `parser_main`, `parser_create_parser`, `parser_free_parser`, `parser_init_func_vectors`, `parser_alloc`.
- `parser_parse_string`, `parser_parse_string_with_escapes`, `parser_parse_string_use_sys_charset`, `parser_parse_file`.
- `parser_new_node`, `parser_init_node`, `parser_reinit_node`, `parser_free_tree`, `parser_free_subtrees`, `parser_clear_node`, `parser_free_node`, `parser_free_node_resources`.
- `parser_walk_tree`, `parser_walk_leaves`, `pt_continue_walk`.
- `parser_print_tree`, `parser_print_tree_with_quotes`, `parser_print_tree_list`.
Scanner support (src/parser/scanner_support.c)
- `pt_makename` — interns an identifier string into the per-parser arena and returns the `char *`.
- `pt_get_keyword` / `pt_check_word_for_keyword` — look up whether a candidate identifier is actually a reserved word.
- `dbcs_get_next` / `buffgetin` / `binarygetin` / `fgetin` — `next_char` / `next_byte` implementations the parser context wires up before calling the lexer.
Position hints as of 2026-04-30
| Symbol | File | Line |
|---|---|---|
enum pt_node_type (PT_NODE_TYPE) | parse_tree.h | 963 |
PT_LAST_NODE_NUMBER | parse_tree.h | 1091 |
enum pt_type_enum (PT_TYPE_ENUM) | parse_tree.h | 1096 |
union pt_statement_info | parse_tree.h | 3470 |
struct parser_node (the PT_NODE itself) | parse_tree.h | 3649 |
struct parser_context | parse_tree.h | 3766 |
PT_NODE_WALK_FUNCTION typedef | parse_tree.h | 1737 |
PT_STOP_WALK / CONTINUE / LEAF / LIST | parse_tree.h | 882 |
parser_main | csql_grammar.y | 23678 |
parser_make_expression | csql_grammar.y | 22573 |
parser_make_link | csql_grammar.y | 22633 |
parser_make_func_with_arg_count | csql_grammar.y | 22542 |
parser_initialize_parser_context | csql_grammar.y | 23480 |
select_stmt rule | csql_grammar.y | 12848 |
select_expression rule | csql_grammar.y | 12365 |
select_expression_opt_with rule | csql_grammar.y | 12263 |
stmt rule | csql_grammar.y | 1776 |
%union | csql_grammar.y | 545 |
%locations / %glr-parser | csql_grammar.y | 540 |
pt_walk_private | parse_tree_cl.c | 914 |
parser_walk_leaves | parse_tree_cl.c | 1001 |
parser_walk_tree | parse_tree_cl.c | 1045 |
parser_new_node | parse_tree_cl.c | 2245 |
parser_init_node | parse_tree_cl.c | 2266 |
parser_reinit_node | parse_tree_cl.c | 2292 |
parser_append_node | parse_tree_cl.c | 3238 |
pt_init_apply_f | parse_tree_cl.c | 5016 |
pt_init_init_f | parse_tree_cl.c | 5145 |
pt_init_print_f | parse_tree_cl.c | 5280 |
pt_apply_select | parse_tree_cl.c | 14302 |
pt_init_select | parse_tree_cl.c | 14340 |
pt_print_select | parse_tree_cl.c | 14358 |
parser_init_func_vectors | parse_tree_cl.c | 17754 |
parser_parse_string | parse_tree_cl.c | 1800 |
parser_parse_string_with_escapes | parse_tree_cl.c | 1813 |
parser_parse_string_use_sys_charset | parse_tree_cl.c | 1782 |
parser_parse_file | parse_tree_cl.c | 1924 |
parser_create_parser | parse_tree.c | 1174 |
parser_free_parser | parse_tree.c | 1253 |
parser_create_node | parse_tree.c | 276 |
parser_create_node_block | parse_tree.c | 219 |
parser_alloc | parse_tree.c | 956 |
parser_main declaration | parser.h | 68 |
parser_create_parser declaration | parser.h | 73 |
parser_walk_tree declaration | parser.h | 114 |
pt_makename | scanner_support.c | 213 |
Cross-check Notes
The raw analyses (PDF “Parser - Lexing and Parsing SQL v1.0”, PDF “Parser - Parsing Tree Structure v1.0”, PPTX “parser_semantic_check_basic v1.0”) were written against an earlier branch of CUBRID. Verified deltas as of the source state captured above:
- `parser_yyinput` is now a single function, not three. The PDF describes `parser_yyinput`, `parser_yyinput_single_line`, and `parser_yyinput_multi_line` as three peers in the user-subroutines block. Current `csql_lexer.l` declares only `parser_yyinput` as the active reader; the old multi-line helper is referenced from `parser_yyinput_single_mode`'s one-shot mode but is not the default. Behaviour is the same for typical multi-statement strings.
- The `extern int yyline; extern int dot_flag;` declarations in the deck no longer appear at the top of `csql_lexer.l`. Current source has only `extern int yycolumn` and `extern int yycolumn_end`; line tracking moved to Flex's own `%option yylineno`. The deck's snippet pre-dates that switch.
- `parser_make_link` returns `PT_NODE *`, not `void`. The PDF excerpt at the bottom of “Lexing and Parsing SQL” (`parser_make_link` returning a `PT_NODE *`) is correct; this is consistent with the current source.
- The `pt_walk_private` body in the deck's “Parsing Tree Structure” PDF is current, modulo one structural change. The PDF's `if (node && walk->pre_function)` opening guard has been simplified to an unconditional pre-call (the caller already null-checks). The end-of-walk `data_type` visit is now folded inside the typed-child arm (`PT_APPLY_WALK (parser, node->data_type, walk)` after `(*apply)`), not after it. The four `continue_walk` modes (`PT_STOP_WALK`, `PT_CONTINUE_WALK`, `PT_LEAF_WALK`, `PT_LIST_WALK`) and their semantics are unchanged.
- `PT_NODE::flag` is a 27-bit struct, not a free `int`. The deck's “PT_NODE” diagram shows `…` for “etc flag” — it glosses over the bitfield. The current source has 27 named single-bit flags including `recompile`, `cannot_prepare`, `partition_pruned`, `si_datetime`, `si_tran_id`, `clt_cache_check`, `use_plan_cache`, `is_hidden_column`, `with_rollup`, `do_not_fold`, `is_value_query`, `do_not_replace_orderby`, `is_added_by_parser`, `is_alias_enabled_expr`, `is_wrapped_res_for_coll`, `is_system_generated_stmt`, `use_auto_commit`, `done_reduce_equality_terms`, `print_in_value_for_dblink`, `do_not_use_subquery_cache`, `for_default_func` — the rest are pre-existing.
- The `next_row` link in `PT_NODE` is for `PT_NODE_LIST` / VALUES-query rows. The deck does not mention it; current source uses it for `PT_VALUE` / `PT_NAME` / `PT_EXPR` nodes that belong to a `PT_NODE_LIST` for `PT_SELECT` “values” generated queries. The walker (`pt_walk_private`) does not follow `next_row` directly — a `PT_NODE_LIST` apply visits its inner row list explicitly via the typed-child path.
- `parser_init_func_vectors` is now lazy-fired from `parser_create_parser` (under `#if !defined (SERVER_MODE)`). The deck describes `parser_init_func_vectors` as called during process startup; in current source it is called the first time a parser context is created in a process.
- `PT_NODE::sql_user_text` is per-node, set inside `parser_new_node`. The deck doesn't describe it; this is what lets the plan cache key on the literal SQL string for the node's owning statement (and what makes per-statement `EXPLAIN` quote the original text rather than a re-rendered form).
Open Questions
- GLR conflict count is unmeasured. The grammar declares `%glr-parser` because it has shift/reduce ambiguities; what is the actual ambiguous-state count, and which constructs are responsible? Investigation path: build with `-r all -d` (already in `bison_target`'s `COMPILE_FLAGS`), inspect `csql_grammar.output`, count entries in “State has conflicts”. The deck does not address this.
- `csql_grammar.yy` pre-processor scope. `cmake/CSQLGrammarTransform.cmake` rewrites `csql_grammar.y` into `csql_grammar.yy` before Bison sees it. What does the rewrite do — replace deprecated YACC directives (`%pure_parser` → `%define api.pure`)? Does it rewrite reduce-action syntax? This matters because every grammar edit goes through that filter. Investigation path: read `CSQLGrammarTransform.cmake` and diff a representative pair of input / output files in a debug build.
- `expecting_pl_lang_spec` is a one-shot flag. The lexer uses it to switch into the `PLCSQL_TEXT` start condition when the grammar is about to consume a procedure body. How does it get reset on a parse error mid-procedure-body — does an error inside `PLCSQL_TEXT` leak the start-condition state into the next `parser_main` call? Investigation path: trace the `BEGIN(INITIAL)` calls in `csql_lexer.l` and confirm that `parser_main`'s entry-time `expecting_pl_lang_spec = 0` is sufficient to recover (it is not obvious — Flex's start-condition stack is separate from the variable).
- Memory recovery on `parser_main` re-entry. The comment `parser_main can be reentered while executing statements loaded by loaddb -s. During the loaddb -s, the yybuffer_pos must not be currupted` saves and restores `yybuffer_pos` on entry/exit. Are there other thread-local-ish parser globals (`g_query_string`, `g_query_string_pos`, `parser_select_level`, `parser_within_join_condition`) that re-entry should protect but currently does not? Investigation path: grep for `^static .* parser_` in `csql_grammar.y` and audit each for re-entry safety.
- `parser_id` overflow. The static `parser_id` counter in `parse_tree.c` (`static int parser_id = 1;`) is the key into `parser_Node_blocks[HASH_NUMBER]`. After 2³¹ parser creations it overflows. In a long-lived `cub_server` (or a heavily-stressed broker CAS) is this reachable, and what happens — does the hash chain just collide and old blocks get returned, or is there a guard? Investigation path: read `pt_register_parser` for the collision check.
- `PT_NODE::flag` versus struct alignment growth. The flag struct currently has 27 single-bit fields; one more bit pushes it to a second word on 32-bit hosts. The `PT_NODE` is allocated 256-at-a-time; growing it by even one machine word changes the pool's per-block memory footprint by 8 KB. Is anyone tracking this? Investigation path: `sizeof (PT_NODE)` snapshots over recent commits.
Sources
Raw analyses (`raw/code-analysis/cubrid/query-processing/`)
- `code_analysis_Parser-Lexing_and_Parsing_SQL_v1_0.pdf` — the lex/parse pipeline, file-level walkthrough of `csql_lexer.l` and `csql_grammar.y`, build wiring snippet.
- `code_analysis_Parser-Parsing_Tree_Structure_v_1_0.pdf` — `PT_NODE` structure, three-element `(node, info, walk)` framing, BST analogy for traversal, `pt_walk_private` body.
- `parser_semantic_check_basic_v1.0.pptx` — query-processing pipeline diagram, `PT_NODE` quick reference, the running `SELECT * FROM code WHERE s_name > 'G'` example, full example trees for joins / DOT paths / derived tables / Oracle-outer joins. The document used the parser-related slides; the semantic-check half of the deck is upstream of the next code-analysis doc.
- `_converted/codeanalysisparser-lexingandparsingsqlv10.pdf.md`, `_converted/codeanalysisparser-parsingtreestructurev10.pdf.md`, `_converted/parsersemanticcheckbasicv1.0.pptx.md` — markitdown extracts of the above.
Sibling docs
- `knowledge/code-analysis/cubrid/cubrid-mvcc.md` — MVCC visibility runs against the `PT_NODE` tree only after semantic check; the parser is upstream of the MVCC code path.
Textbook chapters / references
- Database Internals (Petrov), Ch. 1 §“Query Processor” — the architectural picture of where the parser sits; cited for the framing of the parser as the producer of the tree every later pass walks.
- Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (Dragon book), 2nd ed. — Ch. 3 (lexing, regular languages → DFA, Lex), Ch. 4.7 (LALR(1)), Ch. 4.1.4 (error recovery), Ch. 5 (syntax-directed translation, semantic actions). The canonical reference for everything Flex/Bison is generating under the hood.
- Bison manual (Free Software Foundation) — `%glr-parser` semantics, `%locations`, the `csql_yyparse` / `csql_yylex` driver contract, `YYLTYPE` / `csql_yylval` interactions.
- PostgreSQL source — `src/backend/parser/scan.l`, `src/backend/parser/gram.y`, `src/include/nodes/nodes.h` (the `Node` / `NodeTag` pattern that CUBRID's `PT_NODE` / `PT_NODE_TYPE` mirrors).
- MySQL source — `sql/sql_yacc.yy`, `sql/sql_lex.cc` (the GLR-flavoured large grammar that CUBRID's pattern resembles).
CUBRID source (/data/hgryoo/references/cubrid/)
- `src/parser/csql_lexer.l` — Flex source.
- `src/parser/csql_grammar.y` — Bison source plus `parser_main` and reduce-action helpers.
- `src/parser/parse_tree.h` — `PT_NODE`, `PT_NODE_TYPE`, `PT_STATEMENT_INFO`, `parser_context`, walker typedefs.
- `src/parser/parse_tree.c` — per-parser arena, lifecycle.
- `src/parser/parse_tree_cl.c` — walker, init vectors, per-type apply/init/print, public `parser_parse_*` wrappers, node-creation entry points.
- `src/parser/parser.h` — public API surface.
- `src/parser/scanner_support.c` — `pt_makename`, `next_char` / `next_byte` implementations.
- `cmake/CSQLGrammarTransform.cmake`, `CMakeLists.txt` (root) — `flex_target` / `bison_target` wiring.