PostgreSQL Parser — Lexing, LALR(1) Grammar, and Semantic Analysis
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Every query processor begins with the same pipeline: a string of characters becomes a stream of tokens, the token stream is validated against a grammar and reduced to a structured tree, and the tree is then enriched with metadata from the system catalog so the rest of the engine can treat it as a typed, scoped, plannable object. The three phases — lexical analysis, syntactic analysis, and semantic analysis — map to classical compiler theory (Compilers: Principles, Techniques, and Tools, Aho et al., 2006, §§2–5; also Engineering a Compiler, Cooper & Torczon, 2012, ch. 2–4).
Lexical analysis (scanning) partitions the input character stream into tokens: keywords, identifiers, literals, operators, and punctuation. The scanner is typically generated from a set of regular expressions by a tool such as Flex. The key property a high-performance scanner needs is no backtracking: for every prefix of the input, some rule matches it, so the scanner never has to re-read characters. Backtrack-free scanning is achievable by careful rule ordering and by avoiding ambiguous overlapping patterns.
Syntactic analysis (parsing) validates that the token sequence conforms to the language grammar and builds a parse tree or abstract syntax tree (AST). The dominant technique for SQL and most programming languages is LALR(1) bottom-up parsing (Aho et al., §4.7): the parser maintains a stack of grammar states and shifts tokens onto it, reducing (folding) a stack suffix into a nonterminal whenever a complete right-hand side is recognized. LALR(1) means the parser looks exactly one token ahead to decide whether to shift or reduce, which is sufficient for most unambiguous grammars but requires the grammar itself to be carefully designed to avoid conflicts.
Two design choices follow from the LALR(1) constraint and shape every SQL parser:
-
Keyword/identifier tension. SQL has hundreds of reserved keywords. An identifier
selectorfromas a table or column name causes a shift/reduce conflict. The resolution is a graded keyword classification — fully reserved, unreserved, type-or-function- only — combined with a few grammar rules that allow unreserved words as bare identifiers in specific positions. -
Multi-word tokens. SQL has multi-word constructs (
NOT BETWEEN,NULLS FIRST,WITH ORDINALITY) whose meaning depends on the next token. A pure LALR(1) grammar cannot handle this without special machinery. The standard solution is a lookahead filter that replaces certain single tokens with composite tokens (NOT_LA,NULLS_LA,WITH_LA) before the parser sees them, keeping the grammar itself LALR(1).
Semantic analysis (parse analysis, transformation) takes the raw
tree — which contains only syntactic structure and no catalog
information — and transforms it into a fully typed, scoped, annotated
query tree. At this stage the engine opens the catalog: it resolves
table names to OIDs via the relcache, resolves column names to
attribute numbers, resolves function names to pg_proc entries,
coerces argument types, and expands wildcards. The output is the
Query node that the rewriter and planner operate on.
The two-phase split — raw parse then semantic analysis — is a
deliberate PostgreSQL design choice documented in gram.y’s file
header: “nothing in this file should initiate database accesses nor
depend on changeable state”. This means raw parsing can proceed even
inside an aborted transaction, where catalog access is forbidden, and
that a multi-statement string is fully parsed before the first
statement executes.
Common DBMS Design
Section titled “Common DBMS Design”The three-phase pipeline (lex → parse → analyze) is universal, but real SQL parsers share a set of engineering patterns that the textbooks do not describe explicitly.
No-backtrack flex scanner with state variables
Section titled “No-backtrack flex scanner with state variables”Every production SQL scanner is generated by Flex from a .l file.
The critical quality criterion is the no-backtrack property: Flex
can emit a lex.backup report listing any state where backing up is
needed; a clean scanner has no entries in that report. No-backtrack
is achieved by ensuring that for every input prefix, at least one
rule can match — typically by adding explicit rules for partial-match
prefixes or by using Flex’s “trailing context” mechanism only in ways
that do not introduce backtrack points. The benefit is several percent
of raw parsing throughput.
Scanners also carry lexer state variables — typically a small set of flags and a literal accumulation buffer — for handling constructs that span multiple rules (dollar-quoted strings, escape-unicode sequences, comments).
Keyword table as binary-searchable sorted array
Section titled “Keyword table as binary-searchable sorted array”SQL has hundreds of keywords. Rather than encoding them as Flex rules
(which would blow up the DFA table), every major SQL parser keeps
keywords as a sorted array and performs a binary search on each
IDENT token to decide if it is a keyword. The array entry carries
the keyword’s Bison token value and its classification (reserved,
unreserved, type-function-name, column-name-only), and classification
drives which grammar positions the keyword can appear as a bare
identifier.
One-token lookahead filter between scanner and parser
Section titled “One-token lookahead filter between scanner and parser”LALR(1) parsers have exactly one token of lookahead — but SQL has
multi-word constructs (NOT BETWEEN, NULLS FIRST/LAST, WITH ORDINALITY/TIME) that need two tokens to disambiguate. The solution
is a thin filter function that sits between the Flex scanner and the
Bison parser: it peeks at the next token and, if the current token
is one of the small set of problematic tokens, replaces it with a
distinct lookahead-annotated variant. This keeps the grammar LALR(1)
without burdening the scanner with parser-level knowledge.
Raw parse tree: catalog-free node graph
Section titled “Raw parse tree: catalog-free node graph”The grammar’s action code allocates node structures (structs typed by
a tag NodeTag) and links them into a tree. Crucially, every leaf
holds only syntactic content: names as plain char * strings, not
OIDs; expressions as unevaluated trees, not typed values; column
references as (relname, attname) pairs, not (relid, attnum)
integers. This catalog-free property is what lets raw parsing happen
inside aborted transactions.
ParseState: the namespace accumulator during semantic analysis
Section titled “ParseState: the namespace accumulator during semantic analysis”Semantic analysis is driven by a ParseState struct threaded through
every recursive transformation call. Its key fields accumulate the
range-table entries (p_rtable), the namespace items visible for
column lookup (p_namespace), the CTE namespace, the locking clauses,
and detection flags (has aggregates, has window functions, has
sub-links). Subqueries get their own child ParseState linked to the
parent via parentParseState.
Command-type dispatch: optimizable vs. utility
Section titled “Command-type dispatch: optimizable vs. utility”The semantic analyzer splits statement types into two buckets.
Optimizable statements (SELECT, INSERT, UPDATE, DELETE, MERGE,
VALUES) receive full transformation: range table construction,
expression tree rewriting, type coercion, aggregate/window detection.
The result is a Query with commandType of CMD_SELECT,
CMD_INSERT, etc. Utility statements (DDL, GRANT, EXPLAIN, COPY,
…) are wrapped in a Query with commandType = CMD_UTILITY and
passed through largely untransformed; their deep analysis is deferred
to execution time (or, for EXPLAIN/DECLARE CURSOR/CREATE TABLE AS,
the contained optimizable query is analyzed now while the outer shell
is kept as utility).
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / pattern | PostgreSQL name |
|---|---|
| Lexical analyzer | scan.l → generated scan.c (Flex) |
| No-backtrack guarantee | lex.backup empty (verified by Makefile) |
| Keyword sorted table | kwlist.h (521 lines), ScanKeywords[] / ScanKeywordTokens[] |
| One-token lookahead filter | base_yylex in parser.c |
| LALR(1) grammar | gram.y (~19 700 lines, Bison) |
| Raw AST root | List * of RawStmt * returned by raw_parser |
| Raw statement node | RawStmt wrapping one Node *stmt + location |
| Semantic analysis entry | parse_analyze_fixedparams / parse_analyze_varparams / parse_analyze_withcb |
| Transformation dispatcher | transformStmt (routes by NodeTag) |
| Namespace accumulator | ParseState (stack-linked for subqueries) |
| Range table entry | RangeTblEntry in ParseState.p_rtable |
| Optimizable result | Query with commandType ∈ {CMD_SELECT, CMD_INSERT, CMD_UPDATE, CMD_DELETE, CMD_MERGE} |
| Utility result | Query with commandType = CMD_UTILITY |
| Post-analysis hook | post_parse_analyze_hook |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s parser is a two-phase pipeline. Phase 1 (raw_parser →
scan.l + gram.y) produces a catalog-free raw parse tree. Phase 2
(parse_analyze_* → analyze.c + parse_*.c) transforms it into
a typed, scoped Query tree. The two phases are hard-separated: no
catalog access crosses into phase 1.
Phase 1 — Lexing and raw parsing
Section titled “Phase 1 — Lexing and raw parsing”Entry is raw_parser(str, mode) in parser.c. The function
initialises a Flex scanner, wires in a one-token lookahead filter,
invokes the Bison parser, and returns yyextra.parsetree — a List
of RawStmt nodes.
// raw_parser — src/backend/parser/parser.cList *raw_parser(const char *str, RawParseMode mode){ core_yyscan_t yyscanner; base_yy_extra_type yyextra;
/* initialise Flex scanner with keyword table */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, &ScanKeywords, ScanKeywordTokens);
/* inject mode-specific lookahead token for PL/pgSQL sub-parsers */ if (mode == RAW_PARSE_DEFAULT) yyextra.have_lookahead = false; else { static const int mode_token[] = { [RAW_PARSE_DEFAULT] = 0, [RAW_PARSE_TYPE_NAME] = MODE_TYPE_NAME, [RAW_PARSE_PLPGSQL_EXPR] = MODE_PLPGSQL_EXPR, [RAW_PARSE_PLPGSQL_ASSIGN1] = MODE_PLPGSQL_ASSIGN1, /* ... */ }; yyextra.have_lookahead = true; yyextra.lookahead_token = mode_token[mode]; }
parser_init(&yyextra); /* set up Bison glue */ yyresult = base_yyparse(yyscanner); scanner_finish(yyscanner);
return yyresult ? NIL : yyextra.parsetree;}Figure 1 — Phase 1 data flow.
flowchart LR
S["SQL string\n(const char *)"]
SI["scanner_init\n(scan.l / Flex)"]
BL["base_yylex filter\n(parser.c)"]
BY["base_yyparse\n(gram.y / Bison)"]
RT["List of RawStmt\n(raw parse tree)"]
S --> SI
SI -->|"token stream\ncore_yylex"| BL
BL -->|"filtered tokens\nNOT_LA / NULLS_LA / …"| BY
BY --> RT
Figure 1 — raw_parser drives scan.l through the base_yylex lookahead filter into gram.y; the result is a List of RawStmt nodes, no catalog access.
The Flex lexer (scan.l)
Section titled “The Flex lexer (scan.l)”scan.l is a ~1 300-line Flex specification. Its design notes state
that “the rules are designed so that the scanner never has to
backtrack” — the Makefile verifies this by running flex -b and
asserting that lex.backup is empty.
Keywords are handled outside the Flex DFA. The scanner matches any
[A-Za-z_][A-Za-z0-9_$]* sequence as a candidate identifier, then
calls ScanKeywordLookup — a binary search over the sorted
ScanKeywords array in src/common/keywords.c. If the lookup hits, the
token value comes from ScanKeywordTokens[] (built from kwlist.h
via PG_KEYWORD macro expansion). If it misses, the token is IDENT.
The keyword table has 521 entries (as of kwlist.h on REL_18) across
four categories:
| Category | Examples | Usable as bare identifier? |
|---|---|---|
RESERVED_KEYWORD | SELECT, FROM, WHERE | Never |
UNRESERVED_KEYWORD | ABORT, ACCESS, ACTION | Always |
TYPE_FUNC_NAME_KEYWORD | BETWEEN, BIGINT, BIT | In type/function positions |
COL_NAME_KEYWORD | BETWEEN, OUTER | In column-name positions |
scan.l carries GUC-coupled state variables for three behaviours that
vary at runtime: backslash_quote, escape_string_warning, and
standard_conforming_strings. The file header explicitly calls these
out as “a DIRECT violation” of the rule that the scanner should not
depend on changeable state, and notes they will remain until they can
be removed.
The lookahead filter (base_yylex)
Section titled “The lookahead filter (base_yylex)”Bison sees tokens only through base_yylex in parser.c, never
directly from core_yylex in scan.l. The filter maintains a
one-token buffer (yyextra->have_lookahead). When the current token
is one of FORMAT, NOT, NULLS_P, WITH, WITHOUT, UIDENT,
or USCONST, the filter peeks at the next token and may replace the
current token with a lookahead-annotated variant:
// base_yylex — src/backend/parser/parser.c (condensed)switch (cur_token){ case NOT: /* Replace NOT by NOT_LA if followed by BETWEEN, IN, LIKE, ILIKE, SIMILAR */ switch (next_token) { case BETWEEN: case IN_P: case LIKE: case ILIKE: case SIMILAR: cur_token = NOT_LA; break; } break; case NULLS_P: /* Replace NULLS_P by NULLS_LA if followed by FIRST or LAST */ switch (next_token) { case FIRST_P: case LAST_P: cur_token = NULLS_LA; break; } break; case WITH: /* Replace WITH by WITH_LA if followed by TIME or ORDINALITY */ switch (next_token) { case TIME: case ORDINALITY: cur_token = WITH_LA; break; } break; case UIDENT: case USCONST: /* Look for UESCAPE, apply Unicode de-escaping */ /* ... condensed ... */ cur_token = (cur_token == UIDENT) ? IDENT : SCONST; break;}return cur_token;Unicode identifier/string de-escaping (UIDENT/USCONST + optional
UESCAPE) is also handled here — the filter collapses the two- or
three-token sequence into a single IDENT or SCONST before Bison
ever sees it.
The Bison grammar (gram.y)
Section titled “The Bison grammar (gram.y)”gram.y is ~19 700 lines, the single largest file in the PostgreSQL
backend source. Its top-level production is stmtmulti (line 961),
a semicolon-separated list of toplevel_stmt → stmt nodes. Each
stmt production matches one SQL statement type and builds a Node *
via makeNode(T_XxxStmt) plus field assignments; the grammar uses
palloc/pfree via #define YYMALLOC palloc so all allocation
happens in the current memory context and frees automatically on error.
The grammar’s file header states the rule: “nothing in this file
should initiate database accesses nor depend on changeable state.”
This is enforced by convention, not by a compiler barrier. Actions
may call makeNode, palloc, list functions, and a handful of helper
routines (SystemFuncName, SystemTypeName, makeStringConst, …),
but must not call heap_open, SearchSysCache, or any function that
touches relations.
The grammar produces a rich variety of Node * subtypes: SelectStmt,
InsertStmt, UpdateStmt, DeleteStmt, MergeStmt for DML;
CreateStmt, AlterTableStmt, DropStmt, … for DDL; RawStmt as
the universal wrapper that carries stmt_location and stmt_len.
Phase 2 — Semantic analysis (parse analysis)
Section titled “Phase 2 — Semantic analysis (parse analysis)”Entry points in analyze.c:
// parse_analyze_fixedparams — src/backend/parser/analyze.cQuery *parse_analyze_fixedparams(RawStmt *parseTree, const char *sourceText, const Oid *paramTypes, int numParams, QueryEnvironment *queryEnv){ ParseState *pstate = make_parsestate(NULL); Query *query; JumbleState *jstate = NULL;
pstate->p_sourcetext = sourceText; if (numParams > 0) setup_parse_fixed_parameters(pstate, paramTypes, numParams); pstate->p_queryEnv = queryEnv;
query = transformTopLevelStmt(pstate, parseTree);
if (IsQueryIdEnabled()) jstate = JumbleQuery(query); if (post_parse_analyze_hook) (*post_parse_analyze_hook)(pstate, query, jstate);
free_parsestate(pstate); pgstat_report_query_id(query->queryId, false); return query;}Three variants exist: parse_analyze_fixedparams (known $n types,
used by simple queries), parse_analyze_varparams (types inferred
from context, used by extended-query protocol), and
parse_analyze_withcb (callback-based parameter resolution, used by
planner’s SPI-style callers). All three delegate to
transformTopLevelStmt → transformStmt.
ParseState
Section titled “ParseState”ParseState (defined in parse_node.h) is the mutable context
threaded through every transformation call. Its fields accumulate
everything the transformation functions need:
// ParseState — src/include/parser/parse_node.h (condensed)struct ParseState{ ParseState *parentParseState; /* stack link for subqueries */ const char *p_sourcetext; /* for error cursor positions */ List *p_rtable; /* range table under construction */ List *p_rteperminfos; /* permission-check entries */ List *p_joinlist; /* FROM-clause items → FromExpr.fromlist */ List *p_namespace; /* currently visible RTEs for name lookup */ List *p_ctenamespace; /* WITH clause CTEs in scope */ Relation p_target_relation; /* INSERT/UPDATE/DELETE/MERGE target */ ParseExprKind p_expr_kind; /* context for expression transformation */ int p_next_resno; /* next TargetEntry resno */ bool p_hasAggs; bool p_hasWindowFuncs; bool p_hasSubLinks; /* ... hook function pointers for PL/pgSQL and SPI callers ... */};make_parsestate(parentParseState) allocates and zero-initializes a
new ParseState, copying hook function pointers from the parent if
one is provided. free_parsestate closes p_target_relation if open
and frees the struct.
transformStmt dispatch
Section titled “transformStmt dispatch”transformStmt routes by NodeTag:
// transformStmt — src/backend/parser/analyze.c (condensed)Query *transformStmt(ParseState *pstate, Node *parseTree){ Query *result; switch (nodeTag(parseTree)) { case T_InsertStmt: result = transformInsertStmt(pstate, (InsertStmt *) parseTree); break; case T_DeleteStmt: result = transformDeleteStmt(pstate, (DeleteStmt *) parseTree); break; case T_UpdateStmt: result = transformUpdateStmt(pstate, (UpdateStmt *) parseTree); break; case T_MergeStmt: result = transformMergeStmt(pstate, (MergeStmt *) parseTree); break; case T_SelectStmt: result = transformSelectStmt(pstate, (SelectStmt *) parseTree); /* or transformValuesClause / transformSetOperationStmt */ break; /* ... other optimizable statement types ... */ default: /* utility statements */ result = makeNode(Query); result->commandType = CMD_UTILITY; result->utilityStmt = parseTree; break; } return result;}Utility statements are wrapped in a thin Query(CMD_UTILITY) shell
at line 477. EXPLAIN, DECLARE CURSOR, and CREATE TABLE AS are
exceptions: they contain embedded optimizable queries that
transformStmt recursively analyzes.
transformSelectStmt — the clause-by-clause orchestrator
Section titled “transformSelectStmt — the clause-by-clause orchestrator”transformSelectStmt is the canonical example of how a Query is
assembled. It allocates the Query, sets commandType = CMD_SELECT,
then calls one specialist transform per SQL clause in a fixed
order that respects data dependencies: the FROM clause must be
processed first so its namespace is visible to every later clause;
ORDER BY is processed before GROUP BY and DISTINCT because the latter
two consume the sort result. The final Query fields are harvested
from the ParseState accumulators (p_rtable, p_joinlist,
p_hasAggs, …) at the end:
// transformSelectStmt — src/backend/parser/analyze.c (condensed)qry->commandType = CMD_SELECT;if (stmt->withClause) /* WITH first, independent of all else */ qry->cteList = transformWithClause(pstate, stmt->withClause);
pstate->p_locking_clause = stmt->lockingClause; /* visible to addRangeTableEntry */pstate->p_windowdefs = stmt->windowClause;
transformFromClause(pstate, stmt->fromClause); /* builds RTEs + namespace */
qry->targetList = transformTargetList(pstate, stmt->targetList, EXPR_KIND_SELECT_TARGET);markTargetListOrigins(pstate, qry->targetList);
qual = transformWhereClause(pstate, stmt->whereClause, EXPR_KIND_WHERE, "WHERE");qry->havingQual = transformWhereClause(pstate, stmt->havingClause, EXPR_KIND_HAVING, "HAVING");qry->sortClause = transformSortClause(pstate, stmt->sortClause, &qry->targetList, ...);qry->groupClause = transformGroupClause(pstate, stmt->groupClause, ..., qry->sortClause, ...);/* ... DISTINCT, LIMIT/OFFSET, window clauses ... */
qry->rtable = pstate->p_rtable; /* harvest accumulators into Query */qry->rteperminfos = pstate->p_rteperminfos;qry->jointree = makeFromExpr(pstate->p_joinlist, qual);qry->hasSubLinks = pstate->p_hasSubLinks;qry->hasAggs = pstate->p_hasAggs;
assign_query_collations(pstate, qry); /* collation walk, after all exprs built */if (pstate->p_hasAggs || qry->groupClause || qry->groupingSets || qry->havingQual) parseCheckAggregates(pstate, qry); /* validate aggregate semantics last */return qry;Note the ordering invariants made explicit in the source comments:
ORDER BY runs before GROUP BY/DISTINCT (those transforms reuse the
sort list), and parseCheckAggregates runs after
assign_query_collations so expression comparison during aggregate
validation sees fully collated trees.
FROM-clause processing — building the range table and namespace
Section titled “FROM-clause processing — building the range table and namespace”transformFromClause walks the grammar’s list of RangeVar /
RangeSubselect / RangeFunction / JoinExpr items left-to-right
(left-to-right matters for LATERAL reference visibility). Each item
becomes a join-tree node plus a ParseNamespaceItem; the item is
checked for refname collisions, then appended to both p_joinlist
and p_namespace:
// transformFromClause — src/backend/parser/parse_clause.c (condensed)foreach(fl, frmList){ Node *n = lfirst(fl); ParseNamespaceItem *nsitem; List *namespace;
n = transformFromClauseItem(pstate, n, &nsitem, &namespace); checkNameSpaceConflicts(pstate, pstate->p_namespace, namespace); setNamespaceLateralState(namespace, true, true); /* visible only to LATERAL for now */
pstate->p_joinlist = lappend(pstate->p_joinlist, n); pstate->p_namespace = list_concat(pstate->p_namespace, namespace);}/* once the whole FROM list is parsed, make every item unconditionally visible */setNamespaceLateralState(pstate->p_namespace, false, true);For a plain table, the leaf that actually creates the range-table
entry is addRangeTableEntryForRelation. This is where catalog
state enters the picture: the relation has already been opened and
locked (so the function asserts the lock is held), and the RTE
records the relation OID, relkind, and lock mode. Column aliases are
materialized from the tuple descriptor, an RTEPermissionInfo with
ACL_SELECT is registered, and the RTE is appended to p_rtable —
its 1-based position in that list becomes the varno the planner
later uses:
// addRangeTableEntryForRelation — src/backend/parser/parse_relation.c (condensed)RangeTblEntry *rte = makeNode(RangeTblEntry);Assert(CheckRelationLockedByMe(rel, lockmode, true)); /* lock already taken upstream */
rte->rtekind = RTE_RELATION;rte->relid = RelationGetRelid(rel);rte->relkind = rel->rd_rel->relkind;rte->rellockmode = lockmode;rte->eref = makeAlias(refname, NIL);buildRelationAliases(rel->rd_att, alias, rte->eref); /* effective column names */
perminfo = addRTEPermissionInfo(&pstate->p_rteperminfos, rte);perminfo->requiredPerms = ACL_SELECT;
pstate->p_rtable = lappend(pstate->p_rtable, rte); /* index = its varno */return buildNSItemFromTupleDesc(rte, list_length(pstate->p_rtable), perminfo, rel->rd_att);Target-list construction and star expansion
Section titled “Target-list construction and star expansion”transformTargetList turns each grammar ResTarget into one or more
TargetEntry nodes. Its one structural job beyond per-expression
transformation is star expansion: a something.* reference
(which the grammar leaves as a ColumnRef ending in A_Star, or an
A_Indirection ending in A_Star) is expanded against the current
namespace into one target per column. Everything else is a single
transformTargetEntry call:
// transformTargetList — src/backend/parser/parse_target.c (condensed)expand_star = (exprKind != EXPR_KIND_UPDATE_SOURCE); /* no '*' on UPDATE source */foreach(o_target, targetlist){ ResTarget *res = (ResTarget *) lfirst(o_target);
if (expand_star && IsA(res->val, ColumnRef) && IsA(llast(((ColumnRef *) res->val)->fields), A_Star)) { p_target = list_concat(p_target, ExpandColumnRefStar(pstate, (ColumnRef *) res->val, true)); continue; /* one '*' → many targets */ } /* ... A_Indirection '*' handled symmetrically via ExpandIndirectionStar ... */
p_target = lappend(p_target, transformTargetEntry(pstate, res->val, NULL, exprKind, res->name, false));}Expression transformation — the recursive node dispatch
Section titled “Expression transformation — the recursive node dispatch”Every clause that contains an expression ultimately calls
transformExpr, whose worker transformExprRecurse is a switch
on nodeTag that rewrites each raw grammar node into its analyzed
counterpart. A_Const becomes a typed Const (via make_const),
ColumnRef is resolved against the namespace into a Var (via
transformColumnRef), FuncCall is resolved to a function or
aggregate OID, and so on. The first line guards against stack
overflow from deeply nested expressions:
// transformExprRecurse — src/backend/parser/parse_expr.c (condensed)check_stack_depth(); /* deeply nested exprs are user-supplied */switch (nodeTag(expr)){ case T_ColumnRef: result = transformColumnRef(pstate, (ColumnRef *) expr); /* → Var / field */ break; case T_ParamRef: result = transformParamRef(pstate, (ParamRef *) expr); /* $n → Param */ break; case T_A_Const: result = (Node *) make_const(pstate, (A_Const *) expr); /* literal → Const */ break; case T_A_Expr: switch (((A_Expr *) expr)->kind) { case AEXPR_OP: result = transformAExprOp(pstate, a); break; /* a + b */ case AEXPR_IN: result = transformAExprIn(pstate, a); break; /* IN (...) */ case AEXPR_BETWEEN: result = transformAExprBetween(pstate, a); break; /* ... AEXPR_OP_ANY/ALL, DISTINCT, NULLIF, LIKE/ILIKE/SIMILAR ... */ } break; case T_FuncCall: result = transformFuncCall(pstate, (FuncCall *) expr); /* name → pg_proc OID */ break; case T_SubLink: result = transformSubLink(pstate, (SubLink *) expr); /* child ParseState */ break; /* ... BoolExpr, TypeCast, CaseExpr, CoalesceExpr, etc. ... */}transformColumnRef dispatches on p_expr_kind (the
ParseExprKind the caller set, e.g. EXPR_KIND_WHERE vs.
EXPR_KIND_GROUP_BY) to reject column references in positions where
they are illegal (default expressions, partition bounds) before
attempting namespace resolution — the same ParseExprKind that
transformSelectStmt threads into each clause transform above.
The parse_*.c specialist files
Section titled “The parse_*.c specialist files”Each specialist file in the parser/ directory handles one concern
of semantic analysis. The key ones:
| File | Responsibility |
|---|---|
parse_relation.c | Add RTEs to the range table; name/column lookup; addRangeTableEntry, checkNameSpaceConflicts |
parse_expr.c | Recursive expression transformation; transformExpr, ParseExprKind dispatch |
parse_func.c | Function name resolution; aggregate detection; ParseFuncOrColumn |
parse_coerce.c | Type coercion; coerce_to_target_type, implicit cast insertion |
parse_clause.c | WHERE, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET; transformWhereClause, transformSortClause |
parse_agg.c | Aggregate/window function checks; transformAggregateCall, parseCheckAggregates |
parse_cte.c | WITH clause CTEs; recursive CTE type resolution |
parse_target.c | Target list construction; transformTargetList, star expansion |
parse_collate.c | Collation assignment walk after expression transformation |
parse_merge.c | MERGE statement transformation; transformMergeStmt |
The overall semantic analysis flow for a SELECT is:
flowchart TD
TS["transformSelectStmt\n(analyze.c:1389)"]
FR["transformFromClause\naddRangeTableEntry per relation/subquery\n(parse_relation.c)"]
WH["transformWhereClause\ntransformExpr on predicate\n(parse_clause.c)"]
TL["transformTargetList\nstar expansion, alias resolution\n(parse_target.c)"]
GB["transformGroupClause\nGROUP BY / HAVING\n(parse_clause.c)"]
OB["transformSortClause\nORDER BY\n(parse_clause.c)"]
AG["parseCheckAggregates\naggregate/window validation\n(parse_agg.c)"]
CO["assign_collations_to_grouplist\n(parse_collate.c)"]
QN["Query node\nCMD_SELECT"]
TS --> FR --> WH --> TL --> GB --> OB --> AG --> CO --> QN
Figure 2 — transformSelectStmt calls specialist transformation functions in order; each step may recursively enter transformStmt for subqueries, creating a child ParseState linked via parentParseState.
Error position reporting
Section titled “Error position reporting”Raw parse tree nodes carry a location field: a byte offset into
the original SQL string. The semantic analysis phase uses
parser_errposition(pstate, location) to convert this byte offset to
a 1-based character index (via pg_mbstrlen_with_len) and passes it
to errposition() in error reports, producing the familiar ^ cursor
in client error messages. For functions called inside semantic analysis
that do not have a ParseState, setup_parser_errposition_callback
installs an error-context callback that injects the location
automatically.
Query ID and post-analysis hook
Section titled “Query ID and post-analysis hook”After transformTopLevelStmt returns, parse_analyze_fixedparams
optionally runs JumbleQuery to compute a query fingerprint
(query->queryId) and fires post_parse_analyze_hook — the
extensibility hook used by pg_stat_statements and similar extensions
to intercept the fully-transformed query before it reaches the
rewriter.
Source Walkthrough
Section titled “Source Walkthrough”Phase 1 — Lexing and raw parsing
Section titled “Phase 1 — Lexing and raw parsing”raw_parser(parser.c) — top-level entry; initializes Flex scanner and Bison parser, returnsList *ofRawStmt.scanner_init/scanner_finish(scan.l) — allocate/free the Flex scanner state (core_yyscan_t); install keyword tables.core_yylex(scan.l) — Flex-generated inner scanner; emits raw tokens.base_yylex(parser.c) — one-token lookahead filter; convertsNOT→NOT_LA,NULLS_P→NULLS_LA,WITH→WITH_LA, resolvesUIDENT/USCONSTUnicode sequences.base_yyparse(gram.y, Bison-generated) — LALR(1) parser; builds raw AST via node-allocation actions.parser_init(gram.y) — initialize Bison glue struct.ScanKeywordLookup(src/common/keywords.c) — binary-search keyword table; called byscan.lfor every identifier token.ScanKeywords/ScanKeywordTokens/kwlist.h— 521-entry sorted keyword table.
Phase 2 — Semantic analysis
Section titled “Phase 2 — Semantic analysis”parse_analyze_fixedparams/parse_analyze_varparams/parse_analyze_withcb(analyze.c) — three entry points; wrapParseState, calltransformTopLevelStmt, fire hook.transformTopLevelStmt(analyze.c) — thin wrapper that transfers statement location data, callstransformStmt.transformStmt(analyze.c) — dispatch switch onNodeTag; optimizable statements to specialist transforms, utilities wrapped asCMD_UTILITY.transformSelectStmt(analyze.c:1389) — full SELECT transformation including subquery handling.transformInsertStmt(analyze.c:633) — INSERT with ON CONFLICT.transformUpdateStmt(analyze.c:2472) — UPDATE assignment list.transformDeleteStmt(analyze.c:553) — DELETE with USING.transformMergeStmt(parse_merge.c:107) — MERGE WHEN clauses.transformSetOperationStmt(analyze.c:1751) — UNION/INTERSECT/EXCEPT.transformValuesClause(analyze.c:1532) — VALUES lists.transformReturnStmt(analyze.c:2441) — PL/pgSQL RETURN.transformPLAssignStmt(analyze.c:2783) — PL/pgSQL assignment.parse_sub_analyze(analyze.c:222) — recursive entry for subqueries.make_parsestate/free_parsestate(parse_node.c) — allocate and releaseParseState; copies hook pointers from parent.parser_errposition(parse_node.c:106) — byte-offset to character- index conversion for error cursor.make_const(parse_node.c:347) — convertA_Constgrammar node to a typedConst(INT4, INT8, NUMERIC, BOOL, UNKNOWN, BIT).post_parse_analyze_hook(analyze.c:59) — extensibility hook fired after full transformation; used bypg_stat_statements.
Position-hint table (as of 2026-06-05, commit 273fe94)
Section titled “Position-hint table (as of 2026-06-05, commit 273fe94)”| Symbol | File | Line |
|---|---|---|
raw_parser | src/backend/parser/parser.c | 42 |
base_yylex | src/backend/parser/parser.c | 111 |
str_udeescape | src/backend/parser/parser.c | 372 |
scanner_init | src/backend/parser/scan.l | 1249 |
scanner_finish | src/backend/parser/scan.l | 1291 |
ScanKeywordTokens array | src/backend/parser/scan.l | 81 |
stmtmulti production | src/backend/parser/gram.y | 961 |
parser_init | src/backend/parser/gram.y | 19725 |
post_parse_analyze_hook | src/backend/parser/analyze.c | 59 |
parse_analyze_fixedparams | src/backend/parser/analyze.c | 105 |
parse_analyze_varparams | src/backend/parser/analyze.c | 145 |
parse_analyze_withcb | src/backend/parser/analyze.c | 186 |
parse_sub_analyze | src/backend/parser/analyze.c | 222 |
transformTopLevelStmt | src/backend/parser/analyze.c | 249 |
transformStmt | src/backend/parser/analyze.c | 312 |
transformDeleteStmt | src/backend/parser/analyze.c | 553 |
transformInsertStmt | src/backend/parser/analyze.c | 633 |
transformSelectStmt | src/backend/parser/analyze.c | 1389 |
transformValuesClause | src/backend/parser/analyze.c | 1532 |
transformSetOperationStmt | src/backend/parser/analyze.c | 1751 |
transformReturnStmt | src/backend/parser/analyze.c | 2441 |
transformUpdateStmt | src/backend/parser/analyze.c | 2472 |
transformPLAssignStmt | src/backend/parser/analyze.c | 2783 |
make_parsestate | src/backend/parser/parse_node.c | 39 |
free_parsestate | src/backend/parser/parse_node.c | 72 |
parser_errposition | src/backend/parser/parse_node.c | 106 |
make_const | src/backend/parser/parse_node.c | 347 |
transformMergeStmt | src/backend/parser/parse_merge.c | 107 |
transformFromClause | src/backend/parser/parse_clause.c | 112 |
transformWhereClause | src/backend/parser/parse_clause.c | 1854 |
transformTargetList | src/backend/parser/parse_target.c | 121 |
markTargetListOrigins | src/backend/parser/parse_target.c | 318 |
ExpandColumnRefStar | src/backend/parser/parse_target.c | 1123 |
transformExpr | src/backend/parser/parse_expr.c | 119 |
transformExprRecurse | src/backend/parser/parse_expr.c | 137 |
transformColumnRef | src/backend/parser/parse_expr.c | 509 |
checkNameSpaceConflicts | src/backend/parser/parse_relation.c | 442 |
buildNSItemFromTupleDesc | src/backend/parser/parse_relation.c | 1309 |
addRangeTableEntryForRelation | src/backend/parser/parse_relation.c | 1584 |
ParseState struct | src/include/parser/parse_node.h | 192 |
RawParseMode enum | src/include/parser/parser.h | 38 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
raw_parserreturnsNIL(not an error throw) on Bison failure. Verified inparser.cline 82:if (yyresult) return NIL;. The caller is responsible for detecting the empty list and issuing an error. Scanner-level errors useereport(ERROR)directly and never return. -
The no-backtrack property is enforced by the build system.
scan.l’s file header states: “verify that you haven’t broken the no-backtrack property by running flex with the -b option … (As of Postgres 9.2, this check is made automatically by the Makefile.)” The resultinglex.backupmust be empty. Verified insrc/backend/parser/Makefile(the-bflag is present in the Flex invocation and the empty-check is part of the build). -
gram.yis ~19 700 lines;kwlist.hhas 521 keyword entries. Counted directly:wc -l gram.y= 19 728;wc -l kwlist.h= 521. The keyword count grows with each major PostgreSQL version as new SQL features add new reserved words. -
base_yylexhandles exactly six token types that need lookahead:FORMAT,NOT,NULLS_P,WITH,WITHOUT,UIDENT/USCONST. Verified by reading theswitch (cur_token)inparser.clines 138–161 and 196–321. All other tokens pass through immediately via thedefault: return cur_token;branch. -
post_parse_analyze_hookis declared inanalyze.cand initialized toNULL. Verified at line 59. Extensions (e.g.,pg_stat_statements) install their hook at library load time. The hook fires afterJumbleQueryso thequeryIdis already set. -
Utility statements are not analyzed at parse time except for three cases.
transformStmt’s comment at line 437 lists the exceptions:DECLARE CURSOR,EXPLAIN, andCREATE TABLE ASeach contain an embedded optimizable query that is analyzed immediately. -
ParseStatecarries parser hook function pointers for PL/pgSQL and SPI extensibility.parse_node.hlines 238–242 declarep_pre_columnref_hook,p_post_columnref_hook,p_paramref_hook, andp_coerce_param_hook. These are copied from parent to childParseStateinmake_parsestate. PL/pgSQL installs its own hooks to intercept column reference resolution and parameter binding.
Open questions
Section titled “Open questions”-
GUC variables in
scan.l. Three GUC variables (backslash_quote,escape_string_warning,standard_conforming_strings) are read directly by the scanner. The file header calls this a “DIRECT violation” of the rule against GUC-dependent scanner/grammar code. Investigation path: determine whether PG18 has any plan to push these into the scanner’s extra state, or whetherstandard_conforming_strings = onby default (since PG9.1) has made the issue moot in practice. -
parse_utilcmd.crole. The README listsparse_utilcmd.cas “parse analysis for utility commands (done at execution time).” This file is outsideanalyze.c’s dispatch and is called fromProcessUtility. Investigation path: trace which utility commands calltransformCreateStmt/transformIndexStmtinparse_utilcmd.cand confirm they do not acquire catalog locks beforeProcessUtilityacquires them. -
query->queryIdstability.JumbleQueryproduces a fingerprint of the normalized query tree. The stability of this fingerprint across minor PG version upgrades (e.g., when a new expression node type is introduced) is not documented. Investigation path: read thejumble_querylogic innodes/queryjumble.cto understand whether added node types invalidate existing fingerprints.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
Hand-written recursive-descent parsers. MySQL 8.0 replaced its Bison grammar with a hand-written recursive-descent parser for better error messages and incremental parsing. A side-by-side analysis with PostgreSQL’s generated LALR(1) approach would show the error-quality vs. grammar-maintainability tradeoff. PostgreSQL has historically prioritized grammar expressiveness over error message quality.
-
Incremental / partial parsing. The PostgreSQL parser reparses the entire query string on every call. Tools like
libpg_queryand thepg_queryGo/Ruby wrappers exposeraw_parserexternally. Research on incremental tree-sitter grammars (Brunsfeld et al., 2018) demonstrates parse-state reuse across edits; applying this to PostgreSQL’s grammar would benefit IDEs and query analytics tools. -
Grammar-based query normalization. The
JumbleQuerypath innodes/queryjumble.cfingerprints the analyzed query tree, not the raw text. This means syntactically different but semantically equivalent queries (e.g.,SELECT 1vs.SELECT 1::int4) may get different IDs. Approaches based on canonical form reduction at the raw-parse level (before semantic analysis) are explored in thepg_stat_statementsissue tracker. -
Parser hooks as an extensibility surface. The
post_parse_analyze_hook,p_pre_columnref_hook, andp_paramref_hookslots give extensions access to parse analysis internals. A survey of extensions that use these hooks (Citus, TimescaleDB,pg_stat_statements, PL/pgSQL itself) would document what the hook contract actually is vs. what is documented.
Sources
Section titled “Sources”In-tree documentation
Section titled “In-tree documentation”src/backend/parser/README— module-level description of each file.src/backend/parser/gram.y(file header) — design rationale for the no-catalog rule and the LALR(1) requirement.src/backend/parser/scan.l(file header) — no-backtrack rule and sync requirement withpsqlscan.l/pgc.l.
Textbooks
Section titled “Textbooks”- Aho, Lam, Sethi, Ullman. Compilers: Principles, Techniques, and Tools (Dragon Book), 2nd ed., 2006. §§2–5 (lexical analysis, parsing, LALR).
- Cooper & Torczon. Engineering a Compiler, 2nd ed., 2012. Ch. 2–4.
Source code paths
Section titled “Source code paths”src/backend/parser/— all parser source files.src/include/parser/—parser.h,parse_node.h,kwlist.h,gramparse.h.src/common/keywords.c—ScanKeywordLookup(shared with frontend tools).src/nodes/queryjumble.c—JumbleQueryfor query fingerprinting.