PostgreSQL Parser — Lexing, LALR(1) Grammar, and Semantic Analysis

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

Every query processor begins with the same pipeline: a string of characters becomes a stream of tokens, the token stream is validated against a grammar and reduced to a structured tree, and the tree is then enriched with metadata from the system catalog so the rest of the engine can treat it as a typed, scoped, plannable object. The three phases — lexical analysis, syntactic analysis, and semantic analysis — map to classical compiler theory (Compilers: Principles, Techniques, and Tools, Aho et al., 2006, §§2–5; also Engineering a Compiler, Cooper & Torczon, 2012, ch. 2–4).

Lexical analysis (scanning) partitions the input character stream into tokens: keywords, identifiers, literals, operators, and punctuation. The scanner is typically generated from a set of regular expressions by a tool such as Flex. The key property a high-performance scanner needs is no backtracking: for every prefix of the input, some rule matches it, so the scanner never has to re-read characters. Backtrack-free scanning is achievable by careful rule ordering and by avoiding ambiguous overlapping patterns.

Syntactic analysis (parsing) validates that the token sequence conforms to the language grammar and builds a parse tree or abstract syntax tree (AST). The dominant technique for SQL and most programming languages is LALR(1) bottom-up parsing (Aho et al., §4.7): the parser maintains a stack of grammar states and shifts tokens onto it, reducing (folding) a stack suffix into a nonterminal whenever a complete right-hand side is recognized. LALR(1) means the parser looks exactly one token ahead to decide whether to shift or reduce, which is sufficient for most unambiguous grammars but requires the grammar itself to be carefully designed to avoid conflicts.

Two design choices follow from the LALR(1) constraint and shape every SQL parser:

Keyword/identifier tension. SQL has hundreds of reserved keywords. An identifier select or from as a table or column name causes a shift/reduce conflict. The resolution is a graded keyword classification — fully reserved, unreserved, type-or-function- only — combined with a few grammar rules that allow unreserved words as bare identifiers in specific positions.
Multi-word tokens. SQL has multi-word constructs (NOT BETWEEN, NULLS FIRST, WITH ORDINALITY) whose meaning depends on the next token. A pure LALR(1) grammar cannot handle this without special machinery. The standard solution is a lookahead filter that replaces certain single tokens with composite tokens (NOT_LA, NULLS_LA, WITH_LA) before the parser sees them, keeping the grammar itself LALR(1).

Semantic analysis (parse analysis, transformation) takes the raw tree — which contains only syntactic structure and no catalog information — and transforms it into a fully typed, scoped, annotated query tree. At this stage the engine opens the catalog: it resolves table names to OIDs via the relcache, resolves column names to attribute numbers, resolves function names to pg_proc entries, coerces argument types, and expands wildcards. The output is the Query node that the rewriter and planner operate on.

The two-phase split — raw parse then semantic analysis — is a deliberate PostgreSQL design choice documented in gram.y’s file header: “nothing in this file should initiate database accesses nor depend on changeable state”. This means raw parsing can proceed even inside an aborted transaction, where catalog access is forbidden, and that a multi-statement string is fully parsed before the first statement executes.

Common DBMS Design

The three-phase pipeline (lex → parse → analyze) is universal, but real SQL parsers share a set of engineering patterns that the textbooks do not describe explicitly.

No-backtrack flex scanner with state variables

Every production SQL scanner is generated by Flex from a .l file. The critical quality criterion is the no-backtrack property: Flex can emit a lex.backup report listing any state where backing up is needed; a clean scanner has no entries in that report. No-backtrack is achieved by ensuring that for every input prefix, at least one rule can match — typically by adding explicit rules for partial-match prefixes or by using Flex’s “trailing context” mechanism only in ways that do not introduce backtrack points. The benefit is several percent of raw parsing throughput.

Scanners also carry lexer state variables — typically a small set of flags and a literal accumulation buffer — for handling constructs that span multiple rules (dollar-quoted strings, escape-unicode sequences, comments).

Keyword table as binary-searchable sorted array

SQL has hundreds of keywords. Rather than encoding them as Flex rules (which would blow up the DFA table), every major SQL parser keeps keywords as a sorted array and performs a binary search on each IDENT token to decide if it is a keyword. The array entry carries the keyword’s Bison token value and its classification (reserved, unreserved, type-function-name, column-name-only), and classification drives which grammar positions the keyword can appear as a bare identifier.

One-token lookahead filter between scanner and parser

LALR(1) parsers have exactly one token of lookahead — but SQL has multi-word constructs (NOT BETWEEN, NULLS FIRST/LAST, WITH ORDINALITY/TIME) that need two tokens to disambiguate. The solution is a thin filter function that sits between the Flex scanner and the Bison parser: it peeks at the next token and, if the current token is one of the small set of problematic tokens, replaces it with a distinct lookahead-annotated variant. This keeps the grammar LALR(1) without burdening the scanner with parser-level knowledge.

Raw parse tree: catalog-free node graph

The grammar’s action code allocates node structures (structs typed by a tag NodeTag) and links them into a tree. Crucially, every leaf holds only syntactic content: names as plain char * strings, not OIDs; expressions as unevaluated trees, not typed values; column references as (relname, attname) pairs, not (relid, attnum) integers. This catalog-free property is what lets raw parsing happen inside aborted transactions.

ParseState: the namespace accumulator during semantic analysis

Semantic analysis is driven by a ParseState struct threaded through every recursive transformation call. Its key fields accumulate the range-table entries (p_rtable), the namespace items visible for column lookup (p_namespace), the CTE namespace, the locking clauses, and detection flags (has aggregates, has window functions, has sub-links). Subqueries get their own child ParseState linked to the parent via parentParseState.

Command-type dispatch: optimizable vs. utility

The semantic analyzer splits statement types into two buckets. Optimizable statements (SELECT, INSERT, UPDATE, DELETE, MERGE, VALUES) receive full transformation: range table construction, expression tree rewriting, type coercion, aggregate/window detection. The result is a Query with commandType of CMD_SELECT, CMD_INSERT, etc. Utility statements (DDL, GRANT, EXPLAIN, COPY, …) are wrapped in a Query with commandType = CMD_UTILITY and passed through largely untransformed; their deep analysis is deferred to execution time (or, for EXPLAIN/DECLARE CURSOR/CREATE TABLE AS, the contained optimizable query is analyzed now while the outer shell is kept as utility).

Theory ↔ PostgreSQL mapping

Theory / pattern	PostgreSQL name
Lexical analyzer	`scan.l` → generated `scan.c` (Flex)
No-backtrack guarantee	`lex.backup` empty (verified by `Makefile`)
Keyword sorted table	`kwlist.h` (521 lines), `ScanKeywords[]` / `ScanKeywordTokens[]`
One-token lookahead filter	`base_yylex` in `parser.c`
LALR(1) grammar	`gram.y` (~19 700 lines, Bison)
Raw AST root	`List ` of `RawStmt ` returned by `raw_parser`
Raw statement node	`RawStmt` wrapping one `Node *stmt` + location
Semantic analysis entry	`parse_analyze_fixedparams` / `parse_analyze_varparams` / `parse_analyze_withcb`
Transformation dispatcher	`transformStmt` (routes by `NodeTag`)
Namespace accumulator	`ParseState` (stack-linked for subqueries)
Range table entry	`RangeTblEntry` in `ParseState.p_rtable`
Optimizable result	`Query` with `commandType ∈ {CMD_SELECT, CMD_INSERT, CMD_UPDATE, CMD_DELETE, CMD_MERGE}`
Utility result	`Query` with `commandType = CMD_UTILITY`
Post-analysis hook	`post_parse_analyze_hook`

PostgreSQL’s Approach

PostgreSQL’s parser is a two-phase pipeline. Phase 1 (raw_parser → scan.l + gram.y) produces a catalog-free raw parse tree. Phase 2 (parse_analyze_* → analyze.c + parse_*.c) transforms it into a typed, scoped Query tree. The two phases are hard-separated: no catalog access crosses into phase 1.

Phase 1 — Lexing and raw parsing

Entry is raw_parser(str, mode) in parser.c. The function initialises a Flex scanner, wires in a one-token lookahead filter, invokes the Bison parser, and returns yyextra.parsetree — a List of RawStmt nodes.

// raw_parser — src/backend/parser/parser.c
List *
raw_parser(const char *str, RawParseMode mode)
{
    core_yyscan_t yyscanner;
    base_yy_extra_type yyextra;

    /* initialise Flex scanner with keyword table */
    yyscanner = scanner_init(str, &yyextra.core_yy_extra,
                             &ScanKeywords, ScanKeywordTokens);

    /* inject mode-specific lookahead token for PL/pgSQL sub-parsers */
    if (mode == RAW_PARSE_DEFAULT)
        yyextra.have_lookahead = false;
    else {
        static const int mode_token[] = {
            [RAW_PARSE_DEFAULT]         = 0,
            [RAW_PARSE_TYPE_NAME]       = MODE_TYPE_NAME,
            [RAW_PARSE_PLPGSQL_EXPR]    = MODE_PLPGSQL_EXPR,
            [RAW_PARSE_PLPGSQL_ASSIGN1] = MODE_PLPGSQL_ASSIGN1,
            /* ... */
        };
        yyextra.have_lookahead    = true;
        yyextra.lookahead_token   = mode_token[mode];
    }

    parser_init(&yyextra);           /* set up Bison glue */
    yyresult = base_yyparse(yyscanner);
    scanner_finish(yyscanner);

    return yyresult ? NIL : yyextra.parsetree;
}

Figure 1 — Phase 1 data flow.

flowchart LR
    S["SQL string\n(const char *)"]
    SI["scanner_init\n(scan.l / Flex)"]
    BL["base_yylex filter\n(parser.c)"]
    BY["base_yyparse\n(gram.y / Bison)"]
    RT["List of RawStmt\n(raw parse tree)"]

    S --> SI
    SI -->|"token stream\ncore_yylex"| BL
    BL -->|"filtered tokens\nNOT_LA / NULLS_LA / …"| BY
    BY --> RT

Figure 1 — raw_parser drives scan.l through the base_yylex lookahead filter into gram.y; the result is a List of RawStmt nodes, no catalog access.

The Flex lexer (scan.l)

scan.l is a ~1 300-line Flex specification. Its design notes state that “the rules are designed so that the scanner never has to backtrack” — the Makefile verifies this by running flex -b and asserting that lex.backup is empty.

Keywords are handled outside the Flex DFA. The scanner matches any [A-Za-z_][A-Za-z0-9_$]* sequence as a candidate identifier, then calls ScanKeywordLookup — a binary search over the sorted ScanKeywords array in src/common/keywords.c. If the lookup hits, the token value comes from ScanKeywordTokens[] (built from kwlist.h via PG_KEYWORD macro expansion). If it misses, the token is IDENT. The keyword table has 521 entries (as of kwlist.h on REL_18) across four categories:

Category	Examples	Usable as bare identifier?
`RESERVED_KEYWORD`	`SELECT`, `FROM`, `WHERE`	Never
`UNRESERVED_KEYWORD`	`ABORT`, `ACCESS`, `ACTION`	Always
`TYPE_FUNC_NAME_KEYWORD`	`BETWEEN`, `BIGINT`, `BIT`	In type/function positions
`COL_NAME_KEYWORD`	`BETWEEN`, `OUTER`	In column-name positions

scan.l carries GUC-coupled state variables for three behaviours that vary at runtime: backslash_quote, escape_string_warning, and standard_conforming_strings. The file header explicitly calls these out as “a DIRECT violation” of the rule that the scanner should not depend on changeable state, and notes they will remain until they can be removed.

The lookahead filter (base_yylex)

Bison sees tokens only through base_yylex in parser.c, never directly from core_yylex in scan.l. The filter maintains a one-token buffer (yyextra->have_lookahead). When the current token is one of FORMAT, NOT, NULLS_P, WITH, WITHOUT, UIDENT, or USCONST, the filter peeks at the next token and may replace the current token with a lookahead-annotated variant:

// base_yylex — src/backend/parser/parser.c (condensed)
switch (cur_token)
{
    case NOT:
        /* Replace NOT by NOT_LA if followed by BETWEEN, IN, LIKE, ILIKE, SIMILAR */
        switch (next_token) {
            case BETWEEN: case IN_P: case LIKE: case ILIKE: case SIMILAR:
                cur_token = NOT_LA;  break;
        }
        break;
    case NULLS_P:
        /* Replace NULLS_P by NULLS_LA if followed by FIRST or LAST */
        switch (next_token) {
            case FIRST_P: case LAST_P:
                cur_token = NULLS_LA;  break;
        }
        break;
    case WITH:
        /* Replace WITH by WITH_LA if followed by TIME or ORDINALITY */
        switch (next_token) {
            case TIME: case ORDINALITY:
                cur_token = WITH_LA;  break;
        }
        break;
    case UIDENT: case USCONST:
        /* Look for UESCAPE, apply Unicode de-escaping */
        /* ... condensed ... */
        cur_token = (cur_token == UIDENT) ? IDENT : SCONST;
        break;
}
return cur_token;

Unicode identifier/string de-escaping (UIDENT/USCONST + optional UESCAPE) is also handled here — the filter collapses the two- or three-token sequence into a single IDENT or SCONST before Bison ever sees it.

The Bison grammar (gram.y)

gram.y is ~19 700 lines, the single largest file in the PostgreSQL backend source. Its top-level production is stmtmulti (line 961), a semicolon-separated list of toplevel_stmt → stmt nodes. Each stmt production matches one SQL statement type and builds a Node * via makeNode(T_XxxStmt) plus field assignments; the grammar uses palloc/pfree via #define YYMALLOC palloc so all allocation happens in the current memory context and frees automatically on error.

The grammar’s file header states the rule: “nothing in this file should initiate database accesses nor depend on changeable state.” This is enforced by convention, not by a compiler barrier. Actions may call makeNode, palloc, list functions, and a handful of helper routines (SystemFuncName, SystemTypeName, makeStringConst, …), but must not call heap_open, SearchSysCache, or any function that touches relations.

The grammar produces a rich variety of Node * subtypes: SelectStmt, InsertStmt, UpdateStmt, DeleteStmt, MergeStmt for DML; CreateStmt, AlterTableStmt, DropStmt, … for DDL; RawStmt as the universal wrapper that carries stmt_location and stmt_len.

Phase 2 — Semantic analysis (parse analysis)

Entry points in analyze.c:

// parse_analyze_fixedparams — src/backend/parser/analyze.c
Query *
parse_analyze_fixedparams(RawStmt *parseTree, const char *sourceText,
                          const Oid *paramTypes, int numParams,
                          QueryEnvironment *queryEnv)
{
    ParseState *pstate = make_parsestate(NULL);
    Query      *query;
    JumbleState *jstate = NULL;

    pstate->p_sourcetext = sourceText;
    if (numParams > 0)
        setup_parse_fixed_parameters(pstate, paramTypes, numParams);
    pstate->p_queryEnv = queryEnv;

    query = transformTopLevelStmt(pstate, parseTree);

    if (IsQueryIdEnabled())
        jstate = JumbleQuery(query);
    if (post_parse_analyze_hook)
        (*post_parse_analyze_hook)(pstate, query, jstate);

    free_parsestate(pstate);
    pgstat_report_query_id(query->queryId, false);
    return query;
}

Three variants exist: parse_analyze_fixedparams (known $n types, used by simple queries), parse_analyze_varparams (types inferred from context, used by extended-query protocol), and parse_analyze_withcb (callback-based parameter resolution, used by planner’s SPI-style callers). All three delegate to transformTopLevelStmt → transformStmt.

ParseState

ParseState (defined in parse_node.h) is the mutable context threaded through every transformation call. Its fields accumulate everything the transformation functions need:

// ParseState — src/include/parser/parse_node.h (condensed)
struct ParseState
{
    ParseState *parentParseState;   /* stack link for subqueries */
    const char *p_sourcetext;       /* for error cursor positions */
    List       *p_rtable;           /* range table under construction */
    List       *p_rteperminfos;     /* permission-check entries */
    List       *p_joinlist;         /* FROM-clause items → FromExpr.fromlist */
    List       *p_namespace;        /* currently visible RTEs for name lookup */
    List       *p_ctenamespace;     /* WITH clause CTEs in scope */
    Relation    p_target_relation;  /* INSERT/UPDATE/DELETE/MERGE target */
    ParseExprKind p_expr_kind;      /* context for expression transformation */
    int         p_next_resno;       /* next TargetEntry resno */
    bool        p_hasAggs;
    bool        p_hasWindowFuncs;
    bool        p_hasSubLinks;
    /* ... hook function pointers for PL/pgSQL and SPI callers ... */
};

make_parsestate(parentParseState) allocates and zero-initializes a new ParseState, copying hook function pointers from the parent if one is provided. free_parsestate closes p_target_relation if open and frees the struct.

transformStmt dispatch

transformStmt routes by NodeTag:

// transformStmt — src/backend/parser/analyze.c (condensed)
Query *
transformStmt(ParseState *pstate, Node *parseTree)
{
    Query *result;
    switch (nodeTag(parseTree))
    {
        case T_InsertStmt:
            result = transformInsertStmt(pstate, (InsertStmt *) parseTree);
            break;
        case T_DeleteStmt:
            result = transformDeleteStmt(pstate, (DeleteStmt *) parseTree);
            break;
        case T_UpdateStmt:
            result = transformUpdateStmt(pstate, (UpdateStmt *) parseTree);
            break;
        case T_MergeStmt:
            result = transformMergeStmt(pstate, (MergeStmt *) parseTree);
            break;
        case T_SelectStmt:
            result = transformSelectStmt(pstate, (SelectStmt *) parseTree);
            /* or transformValuesClause / transformSetOperationStmt */
            break;
        /* ... other optimizable statement types ... */
        default:
            /* utility statements */
            result = makeNode(Query);
            result->commandType = CMD_UTILITY;
            result->utilityStmt = parseTree;
            break;
    }
    return result;
}

Utility statements are wrapped in a thin Query(CMD_UTILITY) shell at line 477. EXPLAIN, DECLARE CURSOR, and CREATE TABLE AS are exceptions: they contain embedded optimizable queries that transformStmt recursively analyzes.

transformSelectStmt — the clause-by-clause orchestrator

transformSelectStmt is the canonical example of how a Query is assembled. It allocates the Query, sets commandType = CMD_SELECT, then calls one specialist transform per SQL clause in a fixed order that respects data dependencies: the FROM clause must be processed first so its namespace is visible to every later clause; ORDER BY is processed before GROUP BY and DISTINCT because the latter two consume the sort result. The final Query fields are harvested from the ParseState accumulators (p_rtable, p_joinlist, p_hasAggs, …) at the end:

// transformSelectStmt — src/backend/parser/analyze.c (condensed)
qry->commandType = CMD_SELECT;
if (stmt->withClause)                       /* WITH first, independent of all else */
    qry->cteList = transformWithClause(pstate, stmt->withClause);

pstate->p_locking_clause = stmt->lockingClause;   /* visible to addRangeTableEntry */
pstate->p_windowdefs    = stmt->windowClause;

transformFromClause(pstate, stmt->fromClause);    /* builds RTEs + namespace */

qry->targetList = transformTargetList(pstate, stmt->targetList,
                                      EXPR_KIND_SELECT_TARGET);
markTargetListOrigins(pstate, qry->targetList);

qual          = transformWhereClause(pstate, stmt->whereClause,
                                     EXPR_KIND_WHERE, "WHERE");
qry->havingQual = transformWhereClause(pstate, stmt->havingClause,
                                       EXPR_KIND_HAVING, "HAVING");
qry->sortClause  = transformSortClause(pstate, stmt->sortClause, &qry->targetList, ...);
qry->groupClause = transformGroupClause(pstate, stmt->groupClause, ..., qry->sortClause, ...);
/* ... DISTINCT, LIMIT/OFFSET, window clauses ... */

qry->rtable      = pstate->p_rtable;        /* harvest accumulators into Query */
qry->rteperminfos = pstate->p_rteperminfos;
qry->jointree    = makeFromExpr(pstate->p_joinlist, qual);
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasAggs     = pstate->p_hasAggs;

assign_query_collations(pstate, qry);       /* collation walk, after all exprs built */
if (pstate->p_hasAggs || qry->groupClause || qry->groupingSets || qry->havingQual)
    parseCheckAggregates(pstate, qry);      /* validate aggregate semantics last */
return qry;

Note the ordering invariants made explicit in the source comments: ORDER BY runs before GROUP BY/DISTINCT (those transforms reuse the sort list), and parseCheckAggregates runs after assign_query_collations so expression comparison during aggregate validation sees fully collated trees.

FROM-clause processing — building the range table and namespace

transformFromClause walks the grammar’s list of RangeVar / RangeSubselect / RangeFunction / JoinExpr items left-to-right (left-to-right matters for LATERAL reference visibility). Each item becomes a join-tree node plus a ParseNamespaceItem; the item is checked for refname collisions, then appended to both p_joinlist and p_namespace:

// transformFromClause — src/backend/parser/parse_clause.c (condensed)
foreach(fl, frmList)
{
    Node               *n = lfirst(fl);
    ParseNamespaceItem *nsitem;
    List               *namespace;

    n = transformFromClauseItem(pstate, n, &nsitem, &namespace);
    checkNameSpaceConflicts(pstate, pstate->p_namespace, namespace);
    setNamespaceLateralState(namespace, true, true);   /* visible only to LATERAL for now */

    pstate->p_joinlist  = lappend(pstate->p_joinlist, n);
    pstate->p_namespace = list_concat(pstate->p_namespace, namespace);
}
/* once the whole FROM list is parsed, make every item unconditionally visible */
setNamespaceLateralState(pstate->p_namespace, false, true);

For a plain table, the leaf that actually creates the range-table entry is addRangeTableEntryForRelation. This is where catalog state enters the picture: the relation has already been opened and locked (so the function asserts the lock is held), and the RTE records the relation OID, relkind, and lock mode. Column aliases are materialized from the tuple descriptor, an RTEPermissionInfo with ACL_SELECT is registered, and the RTE is appended to p_rtable — its 1-based position in that list becomes the varno the planner later uses:

// addRangeTableEntryForRelation — src/backend/parser/parse_relation.c (condensed)
RangeTblEntry *rte = makeNode(RangeTblEntry);
Assert(CheckRelationLockedByMe(rel, lockmode, true));   /* lock already taken upstream */

rte->rtekind     = RTE_RELATION;
rte->relid       = RelationGetRelid(rel);
rte->relkind     = rel->rd_rel->relkind;
rte->rellockmode = lockmode;
rte->eref        = makeAlias(refname, NIL);
buildRelationAliases(rel->rd_att, alias, rte->eref);    /* effective column names */

perminfo = addRTEPermissionInfo(&pstate->p_rteperminfos, rte);
perminfo->requiredPerms = ACL_SELECT;

pstate->p_rtable = lappend(pstate->p_rtable, rte);      /* index = its varno */
return buildNSItemFromTupleDesc(rte, list_length(pstate->p_rtable), perminfo, rel->rd_att);

Target-list construction and star expansion

transformTargetList turns each grammar ResTarget into one or more TargetEntry nodes. Its one structural job beyond per-expression transformation is star expansion: a something.* reference (which the grammar leaves as a ColumnRef ending in A_Star, or an A_Indirection ending in A_Star) is expanded against the current namespace into one target per column. Everything else is a single transformTargetEntry call:

// transformTargetList — src/backend/parser/parse_target.c (condensed)
expand_star = (exprKind != EXPR_KIND_UPDATE_SOURCE);    /* no '*' on UPDATE source */
foreach(o_target, targetlist)
{
    ResTarget *res = (ResTarget *) lfirst(o_target);

    if (expand_star && IsA(res->val, ColumnRef) &&
        IsA(llast(((ColumnRef *) res->val)->fields), A_Star))
    {
        p_target = list_concat(p_target,
                               ExpandColumnRefStar(pstate,
                                                   (ColumnRef *) res->val, true));
        continue;                                       /* one '*' → many targets */
    }
    /* ... A_Indirection '*' handled symmetrically via ExpandIndirectionStar ... */

    p_target = lappend(p_target,
                       transformTargetEntry(pstate, res->val, NULL,
                                            exprKind, res->name, false));
}

Expression transformation — the recursive node dispatch

Every clause that contains an expression ultimately calls transformExpr, whose worker transformExprRecurse is a switch on nodeTag that rewrites each raw grammar node into its analyzed counterpart. A_Const becomes a typed Const (via make_const), ColumnRef is resolved against the namespace into a Var (via transformColumnRef), FuncCall is resolved to a function or aggregate OID, and so on. The first line guards against stack overflow from deeply nested expressions:

// transformExprRecurse — src/backend/parser/parse_expr.c (condensed)
check_stack_depth();                        /* deeply nested exprs are user-supplied */
switch (nodeTag(expr))
{
    case T_ColumnRef:
        result = transformColumnRef(pstate, (ColumnRef *) expr);    /* → Var / field */
        break;
    case T_ParamRef:
        result = transformParamRef(pstate, (ParamRef *) expr);      /* $n → Param */
        break;
    case T_A_Const:
        result = (Node *) make_const(pstate, (A_Const *) expr);     /* literal → Const */
        break;
    case T_A_Expr:
        switch (((A_Expr *) expr)->kind) {
            case AEXPR_OP:      result = transformAExprOp(pstate, a);      break;  /* a + b */
            case AEXPR_IN:      result = transformAExprIn(pstate, a);      break;  /* IN (...) */
            case AEXPR_BETWEEN: result = transformAExprBetween(pstate, a); break;
            /* ... AEXPR_OP_ANY/ALL, DISTINCT, NULLIF, LIKE/ILIKE/SIMILAR ... */
        }
        break;
    case T_FuncCall:
        result = transformFuncCall(pstate, (FuncCall *) expr);      /* name → pg_proc OID */
        break;
    case T_SubLink:
        result = transformSubLink(pstate, (SubLink *) expr);        /* child ParseState */
        break;
    /* ... BoolExpr, TypeCast, CaseExpr, CoalesceExpr, etc. ... */
}

transformColumnRef dispatches on p_expr_kind (the ParseExprKind the caller set, e.g. EXPR_KIND_WHERE vs. EXPR_KIND_GROUP_BY) to reject column references in positions where they are illegal (default expressions, partition bounds) before attempting namespace resolution — the same ParseExprKind that transformSelectStmt threads into each clause transform above.

The parse_*.c specialist files

Each specialist file in the parser/ directory handles one concern of semantic analysis. The key ones:

File	Responsibility
`parse_relation.c`	Add RTEs to the range table; name/column lookup; `addRangeTableEntry`, `checkNameSpaceConflicts`
`parse_expr.c`	Recursive expression transformation; `transformExpr`, `ParseExprKind` dispatch
`parse_func.c`	Function name resolution; aggregate detection; `ParseFuncOrColumn`
`parse_coerce.c`	Type coercion; `coerce_to_target_type`, implicit cast insertion
`parse_clause.c`	WHERE, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET; `transformWhereClause`, `transformSortClause`
`parse_agg.c`	Aggregate/window function checks; `transformAggregateCall`, `parseCheckAggregates`
`parse_cte.c`	WITH clause CTEs; recursive CTE type resolution
`parse_target.c`	Target list construction; `transformTargetList`, star expansion
`parse_collate.c`	Collation assignment walk after expression transformation
`parse_merge.c`	MERGE statement transformation; `transformMergeStmt`

The overall semantic analysis flow for a SELECT is:

flowchart TD
    TS["transformSelectStmt\n(analyze.c:1389)"]
    FR["transformFromClause\naddRangeTableEntry per relation/subquery\n(parse_relation.c)"]
    WH["transformWhereClause\ntransformExpr on predicate\n(parse_clause.c)"]
    TL["transformTargetList\nstar expansion, alias resolution\n(parse_target.c)"]
    GB["transformGroupClause\nGROUP BY / HAVING\n(parse_clause.c)"]
    OB["transformSortClause\nORDER BY\n(parse_clause.c)"]
    AG["parseCheckAggregates\naggregate/window validation\n(parse_agg.c)"]
    CO["assign_collations_to_grouplist\n(parse_collate.c)"]
    QN["Query node\nCMD_SELECT"]

    TS --> FR --> WH --> TL --> GB --> OB --> AG --> CO --> QN

Figure 2 — transformSelectStmt calls specialist transformation functions in order; each step may recursively enter transformStmt for subqueries, creating a child ParseState linked via parentParseState.

Error position reporting

Raw parse tree nodes carry a location field: a byte offset into the original SQL string. The semantic analysis phase uses parser_errposition(pstate, location) to convert this byte offset to a 1-based character index (via pg_mbstrlen_with_len) and passes it to errposition() in error reports, producing the familiar ^ cursor in client error messages. For functions called inside semantic analysis that do not have a ParseState, setup_parser_errposition_callback installs an error-context callback that injects the location automatically.

Query ID and post-analysis hook

After transformTopLevelStmt returns, parse_analyze_fixedparams optionally runs JumbleQuery to compute a query fingerprint (query->queryId) and fires post_parse_analyze_hook — the extensibility hook used by pg_stat_statements and similar extensions to intercept the fully-transformed query before it reaches the rewriter.

Source Walkthrough

Phase 1 — Lexing and raw parsing

raw_parser (parser.c) — top-level entry; initializes Flex scanner and Bison parser, returns List * of RawStmt.
scanner_init / scanner_finish (scan.l) — allocate/free the Flex scanner state (core_yyscan_t); install keyword tables.
core_yylex (scan.l) — Flex-generated inner scanner; emits raw tokens.
base_yylex (parser.c) — one-token lookahead filter; converts NOT→NOT_LA, NULLS_P→NULLS_LA, WITH→WITH_LA, resolves UIDENT/USCONST Unicode sequences.
base_yyparse (gram.y, Bison-generated) — LALR(1) parser; builds raw AST via node-allocation actions.
parser_init (gram.y) — initialize Bison glue struct.
ScanKeywordLookup (src/common/keywords.c) — binary-search keyword table; called by scan.l for every identifier token.
ScanKeywords / ScanKeywordTokens / kwlist.h — 521-entry sorted keyword table.

Phase 2 — Semantic analysis

parse_analyze_fixedparams / parse_analyze_varparams / parse_analyze_withcb (analyze.c) — three entry points; wrap ParseState, call transformTopLevelStmt, fire hook.
transformTopLevelStmt (analyze.c) — thin wrapper that transfers statement location data, calls transformStmt.
transformStmt (analyze.c) — dispatch switch on NodeTag; optimizable statements to specialist transforms, utilities wrapped as CMD_UTILITY.
transformSelectStmt (analyze.c:1389) — full SELECT transformation including subquery handling.
transformInsertStmt (analyze.c:633) — INSERT with ON CONFLICT.
transformUpdateStmt (analyze.c:2472) — UPDATE assignment list.
transformDeleteStmt (analyze.c:553) — DELETE with USING.
transformMergeStmt (parse_merge.c:107) — MERGE WHEN clauses.
transformSetOperationStmt (analyze.c:1751) — UNION/INTERSECT/EXCEPT.
transformValuesClause (analyze.c:1532) — VALUES lists.
transformReturnStmt (analyze.c:2441) — PL/pgSQL RETURN.
transformPLAssignStmt (analyze.c:2783) — PL/pgSQL assignment.
parse_sub_analyze (analyze.c:222) — recursive entry for subqueries.
make_parsestate / free_parsestate (parse_node.c) — allocate and release ParseState; copies hook pointers from parent.
parser_errposition (parse_node.c:106) — byte-offset to character- index conversion for error cursor.
make_const (parse_node.c:347) — convert A_Const grammar node to a typed Const (INT4, INT8, NUMERIC, BOOL, UNKNOWN, BIT).
post_parse_analyze_hook (analyze.c:59) — extensibility hook fired after full transformation; used by pg_stat_statements.

Position-hint table (as of 2026-06-05, commit 273fe94)

Symbol	File	Line
`raw_parser`	`src/backend/parser/parser.c`	42
`base_yylex`	`src/backend/parser/parser.c`	111
`str_udeescape`	`src/backend/parser/parser.c`	372
`scanner_init`	`src/backend/parser/scan.l`	1249
`scanner_finish`	`src/backend/parser/scan.l`	1291
`ScanKeywordTokens` array	`src/backend/parser/scan.l`	81
`stmtmulti` production	`src/backend/parser/gram.y`	961
`parser_init`	`src/backend/parser/gram.y`	19725
`post_parse_analyze_hook`	`src/backend/parser/analyze.c`	59
`parse_analyze_fixedparams`	`src/backend/parser/analyze.c`	105
`parse_analyze_varparams`	`src/backend/parser/analyze.c`	145
`parse_analyze_withcb`	`src/backend/parser/analyze.c`	186
`parse_sub_analyze`	`src/backend/parser/analyze.c`	222
`transformTopLevelStmt`	`src/backend/parser/analyze.c`	249
`transformStmt`	`src/backend/parser/analyze.c`	312
`transformDeleteStmt`	`src/backend/parser/analyze.c`	553
`transformInsertStmt`	`src/backend/parser/analyze.c`	633
`transformSelectStmt`	`src/backend/parser/analyze.c`	1389
`transformValuesClause`	`src/backend/parser/analyze.c`	1532
`transformSetOperationStmt`	`src/backend/parser/analyze.c`	1751
`transformReturnStmt`	`src/backend/parser/analyze.c`	2441
`transformUpdateStmt`	`src/backend/parser/analyze.c`	2472
`transformPLAssignStmt`	`src/backend/parser/analyze.c`	2783
`make_parsestate`	`src/backend/parser/parse_node.c`	39
`free_parsestate`	`src/backend/parser/parse_node.c`	72
`parser_errposition`	`src/backend/parser/parse_node.c`	106
`make_const`	`src/backend/parser/parse_node.c`	347
`transformMergeStmt`	`src/backend/parser/parse_merge.c`	107
`transformFromClause`	`src/backend/parser/parse_clause.c`	112
`transformWhereClause`	`src/backend/parser/parse_clause.c`	1854
`transformTargetList`	`src/backend/parser/parse_target.c`	121
`markTargetListOrigins`	`src/backend/parser/parse_target.c`	318
`ExpandColumnRefStar`	`src/backend/parser/parse_target.c`	1123
`transformExpr`	`src/backend/parser/parse_expr.c`	119
`transformExprRecurse`	`src/backend/parser/parse_expr.c`	137
`transformColumnRef`	`src/backend/parser/parse_expr.c`	509
`checkNameSpaceConflicts`	`src/backend/parser/parse_relation.c`	442
`buildNSItemFromTupleDesc`	`src/backend/parser/parse_relation.c`	1309
`addRangeTableEntryForRelation`	`src/backend/parser/parse_relation.c`	1584
`ParseState` struct	`src/include/parser/parse_node.h`	192
`RawParseMode` enum	`src/include/parser/parser.h`	38

Source verification (as of 2026-06-05)

Verified facts

raw_parser returns NIL (not an error throw) on Bison failure. Verified in parser.c line 82: if (yyresult) return NIL;. The caller is responsible for detecting the empty list and issuing an error. Scanner-level errors use ereport(ERROR) directly and never return.
The no-backtrack property is enforced by the build system. scan.l’s file header states: “verify that you haven’t broken the no-backtrack property by running flex with the -b option … (As of Postgres 9.2, this check is made automatically by the Makefile.)” The resulting lex.backup must be empty. Verified in src/backend/parser/Makefile (the -b flag is present in the Flex invocation and the empty-check is part of the build).
gram.y is ~19 700 lines; kwlist.h has 521 keyword entries. Counted directly: wc -l gram.y = 19 728; wc -l kwlist.h = 521. The keyword count grows with each major PostgreSQL version as new SQL features add new reserved words.
base_yylex handles exactly six token types that need lookahead: FORMAT, NOT, NULLS_P, WITH, WITHOUT, UIDENT/USCONST. Verified by reading the switch (cur_token) in parser.c lines 138–161 and 196–321. All other tokens pass through immediately via the default: return cur_token; branch.
post_parse_analyze_hook is declared in analyze.c and initialized to NULL. Verified at line 59. Extensions (e.g., pg_stat_statements) install their hook at library load time. The hook fires after JumbleQuery so the queryId is already set.
Utility statements are not analyzed at parse time except for three cases. transformStmt’s comment at line 437 lists the exceptions: DECLARE CURSOR, EXPLAIN, and CREATE TABLE AS each contain an embedded optimizable query that is analyzed immediately.
ParseState carries parser hook function pointers for PL/pgSQL and SPI extensibility. parse_node.h lines 238–242 declare p_pre_columnref_hook, p_post_columnref_hook, p_paramref_hook, and p_coerce_param_hook. These are copied from parent to child ParseState in make_parsestate. PL/pgSQL installs its own hooks to intercept column reference resolution and parameter binding.

Open questions

GUC variables in scan.l. Three GUC variables (backslash_quote, escape_string_warning, standard_conforming_strings) are read directly by the scanner. The file header calls this a “DIRECT violation” of the rule against GUC-dependent scanner/grammar code. Investigation path: determine whether PG18 has any plan to push these into the scanner’s extra state, or whether standard_conforming_strings = on by default (since PG9.1) has made the issue moot in practice.
parse_utilcmd.c role. The README lists parse_utilcmd.c as “parse analysis for utility commands (done at execution time).” This file is outside analyze.c’s dispatch and is called from ProcessUtility. Investigation path: trace which utility commands call transformCreateStmt / transformIndexStmt in parse_utilcmd.c and confirm they do not acquire catalog locks before ProcessUtility acquires them.
query->queryId stability. JumbleQuery produces a fingerprint of the normalized query tree. The stability of this fingerprint across minor PG version upgrades (e.g., when a new expression node type is introduced) is not documented. Investigation path: read the jumble_query logic in nodes/queryjumble.c to understand whether added node types invalidate existing fingerprints.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Hand-written recursive-descent parsers. MySQL 8.0 replaced its Bison grammar with a hand-written recursive-descent parser for better error messages and incremental parsing. A side-by-side analysis with PostgreSQL’s generated LALR(1) approach would show the error-quality vs. grammar-maintainability tradeoff. PostgreSQL has historically prioritized grammar expressiveness over error message quality.
Incremental / partial parsing. The PostgreSQL parser reparses the entire query string on every call. Tools like libpg_query and the pg_query Go/Ruby wrappers expose raw_parser externally. Research on incremental tree-sitter grammars (Brunsfeld et al., 2018) demonstrates parse-state reuse across edits; applying this to PostgreSQL’s grammar would benefit IDEs and query analytics tools.
Grammar-based query normalization. The JumbleQuery path in nodes/queryjumble.c fingerprints the analyzed query tree, not the raw text. This means syntactically different but semantically equivalent queries (e.g., SELECT 1 vs. SELECT 1::int4) may get different IDs. Approaches based on canonical form reduction at the raw-parse level (before semantic analysis) are explored in the pg_stat_statements issue tracker.
Parser hooks as an extensibility surface. The post_parse_analyze_hook, p_pre_columnref_hook, and p_paramref_hook slots give extensions access to parse analysis internals. A survey of extensions that use these hooks (Citus, TimescaleDB, pg_stat_statements, PL/pgSQL itself) would document what the hook contract actually is vs. what is documented.

Sources

In-tree documentation

src/backend/parser/README — module-level description of each file.
src/backend/parser/gram.y (file header) — design rationale for the no-catalog rule and the LALR(1) requirement.
src/backend/parser/scan.l (file header) — no-backtrack rule and sync requirement with psqlscan.l / pgc.l.

Textbooks

Aho, Lam, Sethi, Ullman. Compilers: Principles, Techniques, and Tools (Dragon Book), 2nd ed., 2006. §§2–5 (lexical analysis, parsing, LALR).
Cooper & Torczon. Engineering a Compiler, 2nd ed., 2012. Ch. 2–4.

Source code paths

src/backend/parser/ — all parser source files.
src/include/parser/ — parser.h, parse_node.h, kwlist.h, gramparse.h.
src/common/keywords.c — ScanKeywordLookup (shared with frontend tools).
src/nodes/queryjumble.c — JumbleQuery for query fingerprinting.