PostgreSQL Planner Prep — Subquery Pull-Up, Qual Canonicalization, and Set Ops
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Between parsing/rewriting and cost-based path generation, a query optimizer
performs a body of work that is best described as normalization rather
than optimization: it rewrites the query tree into a canonical, flatter
shape so that the subsequent cost-based search sees a uniform problem and
explores a richer space of plans. PostgreSQL calls this the prep phase,
and the four files under optimizer/prep/ implement it. This document
covers four prep tasks — subquery and SubLink pull-up, qual
canonicalization, target-list preprocessing, and set-operation planning —
and defers the surrounding planner driver (subquery_planner,
grouping_planner) and the join-search / path machinery to the cross-ref
docs postgres-planner-overview.md and postgres-path-generation.md.
Three theoretical threads run through prep:
-
Query unnesting (decorrelation). The classic result, traceable to Kim’s 1982 “On Optimizing an SQL-like Nested Query” and the System R optimizer lineage (Selinger et al. 1979, captured in
dbms-papers/systemr-optimizer.md), is that a nested subquery — a correlatedEXISTS, anIN/= ANYagainst a sub-SELECT — is semantically a join, but a cost-based join optimizer can only reorder and re-method joins it can see as joins in its flat relation set. As long as the subquery hides inside aWHEREpredicate, the optimizer is forced to evaluate it row-by-row as a correlated subplan. Unnesting rewritesWHERE x IN (SELECT y FROM t)into a semijoin andWHERE NOT EXISTS (...)into an antijoin, exposing the inner relation to join reordering, hashing, and merge methods. Database System Concepts (Silberschatz et al., the query-optimization chapter) frames this as “transforming nested subqueries into joins” and notes that the semijoin is the precise relational-algebra operator that preserves duplicate and NULL semantics ofIN/EXISTS. -
View / subquery merging (flattening). A subquery in
FROM—SELECT * FROM (SELECT ... FROM t WHERE ...) s WHERE ...— can, when it contains no grouping/aggregation/DISTINCT/LIMIT, be merged into the parent query: its range-table entries become the parent’s, its quals become the parent’s quals, and its output columns are substituted inline. This is the same operation as view inlining. The payoff is that the merged relations participate in the parent’s single global join search instead of forming an optimization barrier. The cost is bookkeeping: everyVarthat referenced the subquery’s output must be replaced with the underlying expression, and range-table indices must be offset. -
Boolean normalization. A
WHEREclause is a Boolean expression tree, and the optimizer benefits when that tree exposes as much top-level AND structure as possible — because each top-level AND conjunct is an independently pushable, independently indexable restriction. The textbook normal form is CNF (conjunctive normal form, an AND-of-ORs), reached by pushing negations to the leaves (DeMorgan) and distributing OR over AND. PostgreSQL deliberately stops short of full CNF (the distribution step can blow up exponentially) but does push NOTs down and applies the inverse OR-distributive law to factor common conjuncts out of ORs.
The unifying principle: prep trades a small amount of eager rewriting for a much larger downstream optimization opportunity. None of these steps is costed — they are heuristic rewrites believed to be always-or-usually beneficial, applied unconditionally when their safety preconditions hold.
Common DBMS Design
Section titled “Common DBMS Design”Every serious relational optimizer contains analogues of PostgreSQL’s prep phase, because the problems — nested subqueries, view merging, Boolean normalization, set operations — are inherent to SQL.
Subquery unnesting as a rewrite catalog
Section titled “Subquery unnesting as a rewrite catalog”Mature optimizers (Oracle, SQL Server, DB2) maintain a catalog of
unnesting transformations keyed on subquery shape: IN/= ANY →
semijoin; NOT IN/<> ALL → antijoin (with the notorious NULL
complications); correlated EXISTS → semijoin; scalar subquery in
SELECT list → outer join or lateral. The transformations are guarded by
correctness side-conditions — chiefly around three-valued logic (NULL)
and duplicate semantics. The recurring engineering insight is that the
pull-up is only sound at the top level of a conjunctive WHERE/ON,
because anywhere a subquery’s truth value can flow into an OR or a NOT, the
distinction between “no rows” (FALSE) and “NULL row” (UNKNOWN) becomes
observable and a flat semijoin cannot reproduce it.
Flattening with a barrier list
Section titled “Flattening with a barrier list”The flattening decision is universally gated by a barrier predicate: a
subquery is mergeable only if it lacks the features that make it an
optimization unit of its own — aggregation, GROUP BY, DISTINCT,
LIMIT/OFFSET, set operations, windowing, FOR UPDATE. These features
imply a cardinality-changing or order-sensitive boundary that cannot be
dissolved into the parent’s flat relation algebra. Every engine has some
is_simple_subquery-equivalent gate.
Negation push-down, bounded normalization
Section titled “Negation push-down, bounded normalization”No production engine forces full CNF/DNF unconditionally, because the
distribution step is worst-case exponential in clause size. The common
compromise is: (a) flatten nested same-operator Booleans (AND-under-AND,
OR-under-OR) into N-ary nodes; (b) push NOT to the leaves via DeMorgan and
operator-negation (NOT (a < b) → a >= b); (c) apply cheap, bounded
factoring like the inverse OR-distributive law. This keeps the clause small
while exposing top-level conjuncts.
Set operations as a plan sub-tree
Section titled “Set operations as a plan sub-tree”UNION, INTERSECT, and EXCEPT are planned as a small tree of physical
operators sitting above independently planned leaf queries. UNION ALL is
a pure concatenation (Append); UNION (distinct) adds duplicate
elimination (sort+unique or hash aggregate); INTERSECT/EXCEPT are
realized either by a sort/hash-based set operator or by rewriting to
grouped aggregation. The leaf queries are optimization barriers — each is
planned on its own — but the engine still propagates interesting orders
downward so a merge-based parent can request already-sorted children.
flowchart TD A["Parse/rewrite output<br/>Query tree (may be nested)"] --> B["Prep phase<br/>(optimizer/prep/*)"] B --> C["SubLink pull-up<br/>ANY/EXISTS to semi/anti-join"] B --> D["Subquery flatten<br/>merge FROM-subquery into parent"] B --> E["Qual canonicalize<br/>NOT push-down + OR factoring"] B --> F["Targetlist prep<br/>expand INSERT, add junk cols"] B --> G["Set-op planning<br/>UNION/INTERSECT/EXCEPT tree"] C --> H["Flat jointree + flat WHERE<br/>fed to join search"] D --> H E --> H H --> I["Path generation<br/>(postgres-path-generation.md)"] G --> I
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s prep runs inside subquery_planner (in planner.c), in a
fixed order chosen so that each rewrite sees the output of the previous one.
The driver sequence relevant to this doc is:
// subquery_planner — src/backend/optimizer/plan/planner.creplace_empty_jointree(parse); // FROM-less query gets RTE_RESULTif (parse->hasSubLinks) pull_up_sublinks(root); // ANY/EXISTS -> semi/anti-joinpreprocess_function_rtes(root); // inline SRFs into subqueriesparse = root->parse = expand_virtual_generated_columns(root);pull_up_subqueries(root); // merge FROM-subqueries into parentif (parse->setOperations) flatten_simple_union_all(root); // UNION ALL -> appendrel// ... later ...reduce_outer_joins(root); // strength-reduce OJs to innerremove_useless_result_rtes(root);Qual canonicalization (canonicalize_qual) and target-list prep
(preprocess_targetlist) are invoked from grouping_planner /
preprocess_expression further down; plan_set_operations is the
entry point for a set-op query. The ordering is load-bearing: SubLink
pull-up runs before subquery pull-up so that a subquery exposed by
flattening gets its own SubLinks processed when it is pulled up, and
flatten_simple_union_all runs after pull_up_subqueries because it
needs pull-up applied to the UNION ALL leaves.
SubLink pull-up: ANY/EXISTS become joins
Section titled “SubLink pull-up: ANY/EXISTS become joins”pull_up_sublinks walks the jointree and, for each WHERE/ON qual,
recurses through the explicit AND structure (the quals are not yet
AND-flat or implicit-AND at this stage) looking for convertible SubLink
nodes. The recursion deliberately stops at the first non-AND item, encoding
the “top level of conjunctive WHERE only” soundness rule:
// pull_up_sublinks_qual_recurse — src/backend/optimizer/prep/prepjointree.cif (sublink->subLinkType == ANY_SUBLINK){ if ((j = convert_ANY_sublink_to_join(root, sublink, available_rels1)) != NULL) { /* insert the new join node into the join tree */ j->larg = *jtlink1; *jtlink1 = (Node *) j; j->rarg = pull_up_sublinks_jointree_recurse(root, j->rarg, &child_rels); j->quals = pull_up_sublinks_qual_recurse(root, j->quals, &j->larg, available_rels1, &j->rarg, child_rels); return NULL; /* qual replaced by the join; TRUE remains */ } ...}else if (sublink->subLinkType == EXISTS_SUBLINK){ if ((j = convert_EXISTS_sublink_to_join(root, sublink, false, available_rels1)) != NULL) { ... }}The actual rewrite (convert_ANY_sublink_to_join,
convert_EXISTS_sublink_to_join) lives in plan/subselect.c; prep’s job is
to find the convertible site, build the JoinExpr (a JOIN_SEMI, or
JOIN_ANTI when the EXISTS is under a NOT), splice it into the jointree,
and return NULL so the qual position collapses to constant TRUE. The
available_rels argument carries the set of rels the sublink may reference
— a sublink referencing the nonnullable side of an outer join is not
convertible, which is why two link slots (jtlink1/jtlink2) are threaded
through.
flowchart TD
A["WHERE x IN (SELECT y FROM t)"] --> B{"ANY_SUBLINK at<br/>top of conjunctive WHERE?"}
B -->|yes, refs available_rels| C["convert_ANY_sublink_to_join<br/>build JOIN_SEMI"]
B -->|no / under OR or NOT| D["leave as correlated SubPlan"]
C --> E["splice JoinExpr into jointree<br/>larg = old tree, rarg = sub rel"]
E --> F["qual position returns NULL<br/>(constant TRUE)"]
F --> G["semijoin now visible to<br/>join-order search"]
Subquery flattening: merge the FROM-subquery
Section titled “Subquery flattening: merge the FROM-subquery”pull_up_subqueries recurses the jointree; at each RangeTblRef it asks
whether the RTE is a RTE_SUBQUERY that is_simple_subquery approves, and
if so calls pull_up_simple_subquery to merge it. The gate is a long list
of disqualifying features:
// is_simple_subquery — src/backend/optimizer/prep/prepjointree.cif (subquery->setOperations) return false; /* setops handled by another path */if (subquery->hasAggs || subquery->hasWindowFuncs || subquery->hasTargetSRFs || subquery->groupClause || subquery->groupingSets || subquery->havingQual || subquery->sortClause || subquery->distinctClause || subquery->limitOffset || subquery->limitCount || subquery->hasForUpdate || subquery->cteList) return false; /* any of these makes it an opt barrier */if (rte->security_barrier) return false; /* would leak rows past the barrier */When the gate passes, pull_up_simple_subquery creates a child
PlannerInfo, runs the early subquery-planner steps on the copied subquery,
offsets its range-table indices into the parent, and — the load-bearing
step — substitutes every parent Var that referenced the subquery’s output
columns with the corresponding subquery target-list expression
(pullup_replace_vars). A non-pullable subquery simply remains an
RTE_SUBQUERY that later becomes a SubqueryScan path — an optimization
barrier, but a correct one.
There are three sibling pull-ups in the same recursion: a simple
UNION ALL subquery becomes an appendrel (pull_up_simple_union_all), a
single-row VALUES becomes a RESULT (pull_up_simple_values), and an
inlinable set-returning function RTE becomes a subquery
(preprocess_function_rtes, then pulled up on the next pass).
The dispatch that chooses among these sits at the top of
pull_up_subqueries_recurse and reads as a cascade of RTE-kind tests, each
guarded by a feasibility predicate — this is the single point where prep
decides a relation’s fate:
// pull_up_subqueries_recurse — src/backend/optimizer/prep/prepjointree.cif (rte->rtekind == RTE_SUBQUERY && is_simple_subquery(root, rte->subquery, rte, lowest_outer_join) && (containing_appendrel == NULL || is_safe_append_member(rte->subquery))) return pull_up_simple_subquery(root, jtnode, rte, lowest_outer_join, containing_appendrel);if (rte->rtekind == RTE_SUBQUERY && is_simple_union_all(rte->subquery)) return pull_up_simple_union_all(root, jtnode, rte);if (rte->rtekind == RTE_VALUES && lowest_outer_join == NULL && containing_appendrel == NULL && is_simple_values(root, rte)) return pull_up_simple_values(root, jtnode, rte);if (rte->rtekind == RTE_FUNCTION) return pull_up_constant_function(root, jtnode, rte, containing_appendrel);/* Otherwise, do nothing at this node. */Qual canonicalization: bounded, not full CNF
Section titled “Qual canonicalization: bounded, not full CNF”canonicalize_qual is the entry point, and its header comment is candid
that the name is a “holdover” — it no longer forces canonical AND-of-ORs:
// canonicalize_qual — src/backend/optimizer/prep/prepqual.cExpr *canonicalize_qual(Expr *qual, bool is_check){ Expr *newqual; if (qual == NULL) return NULL; Assert(!IsA(qual, List)); /* not implicit-AND format */ /* Pull up redundant subclauses in OR-of-AND trees; drop NULL consts */ newqual = find_duplicate_ors(qual, is_check); return newqual;}The flattening of nested AND/OR is no longer done here — it is folded into
eval_const_expressions, which already makes a full pass. canonicalize_qual
assumes AND/OR-flat input and only has to preserve flatness while applying
find_duplicate_ors, which implements the inverse OR-distributive law
((A AND B) OR (A AND C)) => (A AND (B OR C)) plus top-level NULL-constant
removal (treating NULL::bool as FALSE in WHERE, TRUE in CHECK).
The NOT push-down lives in negate_clause, a helper invoked from
eval_const_expressions. It applies DeMorgan and operator negation:
// negate_clause — src/backend/optimizer/prep/prepqual.ccase T_OpExpr:{ OpExpr *opexpr = (OpExpr *) node; Oid negator = get_negator(opexpr->opno); /* (NOT (< A B)) => (>= A B) */ if (negator) { OpExpr *newopexpr = makeNode(OpExpr); newopexpr->opno = negator; ... return (Node *) newopexpr; }}break;case T_BoolExpr: switch (((BoolExpr *) node)->boolop) { case AND_EXPR: /* (NOT (AND A B)) => (OR (NOT A) (NOT B)) */ ... return (Node *) make_orclause(nargs); case OR_EXPR: /* (NOT (OR A B)) => (AND (NOT A) (NOT B)) */ ... return (Node *) make_andclause(nargs); case NOT_EXPR: /* NOT NOT cancels */ return (Node *) linitial(expr->args); }The header comment is explicit about the trade: DeMorgan may increase the
NOT count, but it is applied unconditionally anyway because exposing
top-level AND/OR structure (and making logically equal expressions compare
equal()) is worth more than the node count.
Target-list prep and set operations
Section titled “Target-list prep and set operations”preprocess_targetlist reshapes the parser/rewriter target list for the
executor: for INSERT it expands to a full attribute-ordered list with
NULLs for missing columns (expand_insert_targetlist); for UPDATE it
extracts the target column numbers and renumbers to consecutive resnos; and
for UPDATE/DELETE/MERGE it adds junk columns — the row-identity
ctid/wholerow/tableoid needed to locate and re-fetch rows, plus
FOR UPDATE row-mark junk and RETURNING Vars from other relations.
plan_set_operations plans a UNION/INTERSECT/EXCEPT query by
recursing the SetOperationStmt tree (recurse_set_operations),
planning each leaf subquery independently via subquery_planner, and
building Append paths for UNIONs (generate_union_paths) or SetOp paths for
INTERSECT/EXCEPT (generate_nonunion_paths). Interesting sort orders are
pushed down through the parentOp argument so a UNION (distinct) can
request already-sorted children for a merge-based unique-ification.
Source Walkthrough
Section titled “Source Walkthrough”This section anchors on stable symbol names, grouped by the four prep files
and their call flow. Read it alongside the driver in subquery_planner /
grouping_planner (planner.c), which is covered in
postgres-planner-overview.md.
prepjointree.c — jointree rewrites
Section titled “prepjointree.c — jointree rewrites”The file’s public entry points are all called from subquery_planner in a
fixed order. replace_empty_jointree handles a FROM-less query
(SELECT 1) by jamming a synthetic RTE_RESULT into the jointree so later
code never has to special-case an empty fromlist:
replace_empty_jointree— early normalization; inserts the*RESULT*RTE. Skipped for set-op top levels.pull_up_sublinks— driver for ANY/EXISTS → semi/anti-join. Callspull_up_sublinks_jointree_recurse(walksFromExpr/JoinExprnodes, collectingrelids) andpull_up_sublinks_qual_recurse(descends the explicit AND structure of a qual, stopping at the first non-AND, and at eachSubLinkcalls the converter).convert_ANY_sublink_to_join/convert_EXISTS_sublink_to_join— the actual rewrite, inplan/subselect.c(cross-file). They return aJoinExpr(JOIN_SEMI, orJOIN_ANTIforNOT EXISTS) or NULL if the sublink is not convertible at this site.
The pull-up site test threads two link slots through the recursion
(jtlink1/jtlink2) with two available_rels sets, encoding the rule that
a sublink referencing the nonnullable side of an outer join cannot be
pulled into the nullable side.
Subquery flattening is the second cluster:
pull_up_subqueries→pull_up_subqueries_recurse— walks the jointree; at eachRangeTblRefdispatches onrte->rtekind. Note how the recursion passes the enclosing outer join down so pull-up can be restricted under it:
// pull_up_subqueries_recurse — src/backend/optimizer/prep/prepjointree.ccase JOIN_LEFT:case JOIN_SEMI:case JOIN_ANTI: j->larg = pull_up_subqueries_recurse(root, j->larg, j, NULL); j->rarg = pull_up_subqueries_recurse(root, j->rarg, j, NULL); break;is_simple_subquery— the mergeability gate (no aggs/window/SRF/group/having/sort/distinct/limit/FOR UPDATE/CTE/setops, not a security barrier, LATERAL restrictions, no volatile tlist funcs).pull_up_simple_subquery— does the merge: builds a childPlannerInfo, offsets the subquery’s RT indices, and replaces parentVars via thepullup_replace_vars*family.pull_up_simple_union_all/is_simple_union_all/is_simple_union_all_recurse— flatten aUNION ALLsubquery into an appendrel; the recurse helper requires every node beSETOP_UNIONwithop->alland matchingcolTypes:
// is_simple_union_all_recurse — src/backend/optimizer/prep/prepjointree.celse if (IsA(setOp, SetOperationStmt)){ SetOperationStmt *op = (SetOperationStmt *) setOp; if (op->op != SETOP_UNION || !op->all) /* must be UNION ALL */ return false; return is_simple_union_all_recurse(op->larg, setOpQuery, colTypes) && is_simple_union_all_recurse(op->rarg, setOpQuery, colTypes);}preprocess_function_rtes— const-simplifies function RTEs and inlines SRFs (inline_set_returning_function), converting the RTE to a subquery so the next pull-up pass can merge it.flatten_simple_union_all— top-level companion: if the whole query isUNION ALL, turn the leftmost leaf into an appendrel member, mark the former leftmost RTE as an appendrel parent (inh = true), and clearparse->setOperationsso the query is replanned as a plain scan of the appendrel:
// flatten_simple_union_all — src/backend/optimizer/prep/prepjointree.cchildRTE = copyObject(leftmostRTE);parse->rtable = lappend(parse->rtable, childRTE);childRTI = list_length(parse->rtable);((RangeTblRef *) leftmostjtnode)->rtindex = childRTI;leftmostRTE->inh = true; /* now an appendrel parent */...parse->setOperations = NULL; /* pretend no setops */The third cluster is outer-join strength reduction (a prep rewrite that borders on optimization, included here because it lives in the same file):
reduce_outer_joins— two-pass:reduce_outer_joins_pass1gathers, per jointree node, the contained relids and whether any outer join sits below;reduce_outer_joins_pass2examines quals and converts an outer join to inner (or a full join to left) when a null-intolerant qual above it would reject the null-extended rows anyway. Reduced joins then have their nulling-rel references stripped viaremove_nulling_relids.remove_useless_result_rtes— dropsRTE_RESULTRTEs that became unnecessary after the above rewrites.
The pass-2 core converts a JOIN_LEFT to JOIN_ANTI (when the join’s own
quals are strict and there is no later use of the inner rows) and a
JOIN_RIGHT to the canonical JOIN_LEFT by swapping arms. Semi/anti joins
produced earlier by pull_up_sublinks are deliberately skipped because no
upper qual can reference their righthand side:
// reduce_outer_joins_pass2 — src/backend/optimizer/prep/prepjointree.ccase JOIN_SEMI:case JOIN_ANTI: /* * These could only have been introduced by pull_up_sublinks, so * there's no way that upper quals could refer to their righthand * sides, and no point in checking. */ break;...if (jointype == JOIN_RIGHT) /* canonicalize to LEFT by swapping */{ Node *tmparg = j->larg; j->larg = j->rarg; j->rarg = tmparg; jointype = JOIN_LEFT; right_state = linitial(state1->sub_states); left_state = lsecond(state1->sub_states);}prepqual.c — Boolean canonicalization
Section titled “prepqual.c — Boolean canonicalization”canonicalize_qual— public entry; assumes AND/OR-flat input (flattening now done byeval_const_expressions) and delegates tofind_duplicate_ors. Theis_checkflag switches NULL-constant handling between WHERE semantics (NULL≈FALSE) and CHECK semantics (NULL≈TRUE).negate_clause— NOT push-down helper called fromeval_const_expressions. HandlesConst,OpExpr(operator negator),ScalarArrayOpExpr(= ANY↔<> ALL),BoolExpr(DeMorgan, NOT-NOT cancel),NullTest, andBooleanTest. Falls through to a literalmake_notclausewhen no smarter rule applies.find_duplicate_ors— recurses the top-level AND/OR structure, drops NULL/constant inputs per the WHERE/CHECK rule, and hands each OR-list toprocess_duplicate_ors.process_duplicate_ors— implements the inverse OR-distributive law: picks the shortest AND child as a reference, finds reference terms present in every OR arm (“winners”), and factors them out:
// process_duplicate_ors — src/backend/optimizer/prep/prepqual.c/* Choose the shortest AND clause as the reference list */foreach(temp, orlist){ Expr *clause = (Expr *) lfirst(temp); if (is_andclause(clause)) { int nclauses = list_length(((BoolExpr *) clause)->args); if (reference == NIL || nclauses < num_subclauses) { reference = ((BoolExpr *) clause)->args; num_subclauses = nclauses; } } else { reference = list_make1(clause); break; }}pull_ands/pull_ors— flatten nested same-operator Booleans (AND-under-AND, OR-under-OR) into a single N-ary arglist, used to restore flatness after the factoring above pulls sub-clauses up.
preptlist.c — target-list preprocessing
Section titled “preptlist.c — target-list preprocessing”preprocess_targetlist— driver; result goes toroot->processed_tlist. Branches oncommandType:CMD_INSERT→expand_insert_targetlist(attribute-ordered list, NULL/Constfor missing or dropped columns, domain coercion viacoerce_null_to_domain).CMD_UPDATE→extract_update_targetlist_colnos(pull target column numbers intoroot->update_colnos, renumber resnos consecutively).CMD_UPDATE/DELETE/MERGEnon-inherited →add_row_identity_columns(junkctidetc.).
- Junk-column augmentation: the driver appends
ctid<id>(TID),wholerow<id>, andtableoid<id>junk TLEs for eachPlanRowMark, andRETURNINGVars from non-target relations:
// preprocess_targetlist — src/backend/optimizer/prep/preptlist.cif (rc->allMarkTypes & ~(1 << ROW_MARK_COPY)){ var = makeVar(rc->rti, SelfItemPointerAttributeNumber, TIDOID, ...); snprintf(resname, sizeof(resname), "ctid%u", rc->rowmarkId); tle = makeTargetEntry((Expr *) var, list_length(tlist) + 1, pstrdup(resname), true /* resjunk */); tlist = lappend(tlist, tle);}get_plan_rowmark— lookup helper (PlanRowMark by RT index).
prepunion.c — set-operation planning
Section titled “prepunion.c — set-operation planning”plan_set_operations— entry for a setop query; finds the leftmost leaf (for output column names), then eithergenerate_recursion_path(recursiveWITH RECURSIVEUNION) orrecurse_set_operations.recurse_set_operations— the dispatcher: a leafRangeTblReftriggerssubquery_planneron the leaf subquery (an optimization barrier); aSetOperationStmtdispatches togenerate_union_pathsorgenerate_nonunion_paths. TheparentOpargument carries interesting orderings downward.generate_union_paths— forUNION/UNION ALL: merges identical child UNION nodes (plan_union_children), builds the Append tlist (generate_append_tlist), builds child paths, creates an Append path, and for distinctUNIONadds sort+unique and/or hash-aggregate unique-ification.generate_nonunion_paths— forINTERSECT/EXCEPT: recurses both arms, builds aSetOptlist, computesgroupList, checks sort/hash feasibility, and (for non-EXCEPT) swaps inputs to put the smaller relation first. Errors out if datatypes support neither sorting nor hashing.generate_setop_tlist/generate_append_tlist/generate_setop_grouplist— tlist/group-clause construction with type-coercion insertion where child column types differ from the set-op output types.
For INTERSECT/EXCEPT, after the inputs are chosen and possibly swapped,
the result RelOptInfo is created at UPPERREL_SETOP and a SetOpCmd is
selected from the operator and the all flag; a hashed plan is just a
SetOp atop the cheapest child paths:
// generate_nonunion_paths — src/backend/optimizer/prep/prepunion.cswitch (op->op){ case SETOP_INTERSECT: cmd = op->all ? SETOPCMD_INTERSECT_ALL : SETOPCMD_INTERSECT; break; case SETOP_EXCEPT: cmd = op->all ? SETOPCMD_EXCEPT_ALL : SETOPCMD_EXCEPT; break;}if (can_hash){ path = (Path *) create_setop_path(root, result_rel, lpath, rpath, cmd, SETOP_HASHED, groupList, dNumGroups, dNumOutputRows); add_path(result_rel, path);}The input-swap heuristic (smaller relation first for INTERSECT, left
input fixed for EXCEPT) is what makes the hash-table size and the
empty-left-input fast path predictable; it is a plan-shape decision made
in prep, before costing the alternatives.
flowchart TD
A["plan_set_operations"] --> B{"hasRecursion?"}
B -->|yes| C["generate_recursion_path<br/>RecursiveUnion"]
B -->|no| D["recurse_set_operations"]
D --> E{"node kind"}
E -->|RangeTblRef leaf| F["subquery_planner<br/>(barrier, build_simple_rel)"]
E -->|UNION SetOpStmt| G["generate_union_paths<br/>Append + dedup"]
E -->|INTERSECT/EXCEPT| H["generate_nonunion_paths<br/>SetOp sort/hash"]
G --> I["root->processed_tlist"]
H --> I
C --> I
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
replace_empty_jointree | optimizer/prep/prepjointree.c | 410 |
pull_up_sublinks | optimizer/prep/prepjointree.c | 468 |
pull_up_sublinks_qual_recurse | optimizer/prep/prepjointree.c | 652 |
preprocess_function_rtes | optimizer/prep/prepjointree.c | 914 |
pull_up_subqueries | optimizer/prep/prepjointree.c | 1102 |
pull_up_subqueries_recurse | optimizer/prep/prepjointree.c | 1146 |
pull_up_simple_subquery | optimizer/prep/prepjointree.c | 1291 |
pull_up_simple_union_all | optimizer/prep/prepjointree.c | 1636 |
is_simple_subquery | optimizer/prep/prepjointree.c | 1826 |
is_simple_union_all | optimizer/prep/prepjointree.c | 2234 |
is_simple_union_all_recurse | optimizer/prep/prepjointree.c | 2262 |
flatten_simple_union_all | optimizer/prep/prepjointree.c | 3002 |
reduce_outer_joins | optimizer/prep/prepjointree.c | 3121 |
reduce_outer_joins_pass1 | optimizer/prep/prepjointree.c | 3194 |
remove_useless_result_rtes | optimizer/prep/prepjointree.c | 3615 |
negate_clause | optimizer/prep/prepqual.c | 73 |
canonicalize_qual | optimizer/prep/prepqual.c | 293 |
pull_ands | optimizer/prep/prepqual.c | 323 |
pull_ors | optimizer/prep/prepqual.c | 349 |
find_duplicate_ors | optimizer/prep/prepqual.c | 406 |
process_duplicate_ors | optimizer/prep/prepqual.c | 517 |
preprocess_targetlist | optimizer/prep/preptlist.c | 64 |
extract_update_targetlist_colnos | optimizer/prep/preptlist.c | 348 |
expand_insert_targetlist | optimizer/prep/preptlist.c | 382 |
get_plan_rowmark | optimizer/prep/preptlist.c | 526 |
plan_set_operations | optimizer/prep/prepunion.c | 93 |
recurse_set_operations | optimizer/prep/prepunion.c | 209 |
generate_recursion_path | optimizer/prep/prepunion.c | 357 |
generate_union_paths | optimizer/prep/prepunion.c | 676 |
generate_nonunion_paths | optimizer/prep/prepunion.c | 998 |
generate_setop_tlist | optimizer/prep/prepunion.c | 1357 |
generate_append_tlist | optimizer/prep/prepunion.c | 1485 |
convert_ANY_sublink_to_join | optimizer/plan/subselect.c | 1333 |
convert_EXISTS_sublink_to_join | optimizer/plan/subselect.c | 1450 |
subquery_planner (driver) | optimizer/plan/planner.c | ~700 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against the REL_18_STABLE working tree at commit 273fe94
(/data/hgryoo/references/postgres, PG 18.x). Checks performed:
-
Driver ordering.
subquery_plannerinplanner.ccalls, in order:replace_empty_jointree(line 748),pull_up_sublinks(757, guarded byparse->hasSubLinks),preprocess_function_rtes(765),expand_virtual_generated_columns(773),pull_up_subqueries(779),flatten_simple_union_all(788, guarded byparse->setOperations),reduce_outer_joins(1260),remove_useless_result_rtes(1269), andpreprocess_targetlist(1809). Confirmed by grep; the in-source comments at each call site state the ordering rationale quoted in §PostgreSQL’s Approach. -
canonicalize_qualno longer forces CNF. The header comment (lines 279–283) explicitly states the canonical-form behavior was removed because it had “more theoretical purity than actual usefulness”; the body only callsfind_duplicate_ors. Verified the flattening was moved toeval_const_expressions(header comment lines 14–17). -
SubLink converters are cross-file.
convert_ANY_sublink_to_joinandconvert_EXISTS_sublink_to_joinare defined inplan/subselect.c(lines 1333, 1450), not inprepjointree.c; prep only locates the site and splices the returnedJoinExpr. This boundary is asserted in the walkthrough and verified by grep across the optimizer tree. -
is_simple_subquerygate list. All disqualifying fields (hasAggs,hasWindowFuncs,hasTargetSRFs,groupClause,groupingSets,havingQual,sortClause,distinctClause,limitOffset,limitCount,hasForUpdate,cteList, plussetOperations,security_barrier, LATERAL refs, volatile tlist) appear verbatim in the function (lines 1841–1946). Excerpt quoted matches source. -
flatten_simple_union_allside effects. Confirmed it setsleftmostRTE->inh = trueandparse->setOperations = NULL(lines 3055, 3070), and that the in-source comment names thepull_up_simple_subqueryAssert as the reason setOperations must be cleared first. -
Junk column names.
preprocess_targetlistconstructs resjunk TLEs namedctid%u(line 249),wholerow%u(263), andtableoid%u(280) keyed onrc->rowmarkId; excerpt matches. -
Symbols out of scope. Path-construction details (
build_setop_child_paths,create_append_path, EquivalenceClass setup,make_pathkeys_for_sortclauses) are named only as call targets; their internals belong topostgres-path-generation.md.subquery_plannerandgrouping_plannerbodies belong topostgres-planner-overview.md. Line numbers in the position-hint table are scoped to thisupdated:date and will drift; the symbol names are the durable anchors. (Thesubquery_plannerline is given as approximate because the function is long and its declaration line moves with header edits.)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s prep phase is a pragmatic subset of a much larger research and engineering literature on query unnesting and normalization.
Unnesting: PostgreSQL vs. the full taxonomy
Section titled “Unnesting: PostgreSQL vs. the full taxonomy”PostgreSQL unnests exactly two SubLink shapes into joins — ANY/IN to
semijoin and EXISTS to semi/anti-join — and only at the top level of a
conjunctive WHERE/ON. The seminal taxonomy (Kim 1982; Ganski & Wong 1987,
which corrected Kim’s COUNT bug) and the later “complete” unnesting work of
Neumann & Kemper, “Unnesting Arbitrary Queries” (BTW 2015) describe a
strictly larger space: arbitrary correlated subqueries — scalar
subqueries in the SELECT list, subqueries under OR, subqueries inside
disjunctive predicates — can be unnested by introducing a dependent join
(d-join) operator and then algebraically pushing the dependency down until
it disappears, yielding a plan with no correlated execution at all. HyPer
and DuckDB implement this general algorithm; PostgreSQL does not. Scalar
subqueries and subqueries under OR in PostgreSQL remain correlated
SubPlans evaluated per-row (or cached via the Materialize/Memoize
nodes — see postgres-path-generation.md), which is correct but can be
orders of magnitude slower than a decorrelated join on large inputs.
The reason PostgreSQL stops where it does is the three-valued-logic soundness boundary discussed in §Common DBMS Design: a semijoin cannot reproduce the FALSE-vs-UNKNOWN distinction that becomes observable once a subquery’s truth value can flow through a NOT or an OR. The general algorithms sidestep this by not collapsing to a semijoin — they keep an explicit (anti)dependent join and reason about NULLs at the algebra level.
Magic-set / sideways-information-passing
Section titled “Magic-set / sideways-information-passing”A parallel line of decorrelation, magic sets (Bancilhon et al. 1986;
Mumick et al. 1990, “Magic is Relevant”), rewrites a correlated subquery so
the correlation values are collected into an auxiliary “magic” relation that
is joined in, restricting the subquery’s evaluation to only the relevant
parameter bindings. SQL Server and DB2 use magic-set rewrites for certain
correlated aggregates. PostgreSQL has no magic-set machinery; its closest
analogue is the parameterized-path / Memoize mechanism that caches
correlated subplan results by parameter value.
Normalization: bounded vs. full CNF
Section titled “Normalization: bounded vs. full CNF”PostgreSQL’s deliberate refusal to force CNF (the canonicalize_qual
“holdover name” comment) is the now-conventional position. Early System R
and academic optimizers normalized to CNF or DNF; the exponential blow-up of
distribution (an OR of n two-term ANDs distributes to 2ⁿ clauses) made this
untenable for machine-generated SQL. Modern engines instead do bounded
normalization plus targeted factoring. PostgreSQL’s specific factoring — the
inverse OR-distributive law in process_duplicate_ors — is unusual in being
documented (the header comment cites TPC benchmark queries that need it) and
is a nice example of an optimization motivated by generated SQL rather
than hand-written queries. Selinger-era cost-based search
(dbms-papers/systemr-optimizer.md) assumed a normalized predicate as
input; PostgreSQL’s prep is precisely the layer that delivers that
precondition without paying the CNF tax.
Set operations: SetOp node vs. grouped aggregation
Section titled “Set operations: SetOp node vs. grouped aggregation”PostgreSQL plans INTERSECT/EXCEPT with a dedicated SetOp executor
node (sort- or hash-based) rather than rewriting to grouped aggregation with
HAVING count(...) predicates the way some textbooks suggest. The PG 18
line modernized this area: the historical flag-column representation
(synthetic resjunk columns tagging each row with its source) was removed,
which is why recurse_set_operations carries the lingering “XXX Now that
there are no flag columns” comment quoted in the source. The
research frontier here is incremental set operations for materialized
views and streaming, where INTERSECT/EXCEPT must be maintained under
inserts/deletes — outside the scope of the batch planner described here.
Frontiers
Section titled “Frontiers”- Cost-based unnesting decisions. PostgreSQL pulls up unconditionally when safe; it never costs “leave it correlated vs. decorrelate.” For some shapes (a highly selective correlated EXISTS over a tiny outer) the correlated form can win. Cost-based unnesting (explored in DB2 and in academic work) remains a frontier PG has not adopted.
- Subquery caching. The
Memoizenode (PG 14+) narrows the gap for correlated subplans the optimizer cannot decorrelate, but it is a runtime cache, not a plan-time rewrite — orthogonal to prep. - LATERAL as the general escape hatch. PostgreSQL’s prep increasingly
leans on LATERAL semantics (and the
lateralrestrictions inis_simple_subquery) as the principled way to represent dependencies it chooses not to flatten — converging, in representation if not in algorithm, on the dependent-join model of the general unnesting work.
Sources
Section titled “Sources”-
Source tree.
/data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94 (PG 18.x). Core files:src/backend/optimizer/prep/prepjointree.c— SubLink pull-up, subquery flattening, UNION ALL flattening, outer-join reduction.src/backend/optimizer/prep/prepqual.c—negate_clause,canonicalize_qual,find_duplicate_ors,process_duplicate_ors.src/backend/optimizer/prep/preptlist.c—preprocess_targetlist,expand_insert_targetlist, junk-column augmentation.src/backend/optimizer/prep/prepunion.c—plan_set_operationsand thegenerate_*_paths/generate_*_tlistfamily.src/backend/optimizer/plan/planner.c—subquery_plannerdriver (call ordering).src/backend/optimizer/plan/subselect.c—convert_ANY_sublink_to_join,convert_EXISTS_sublink_to_join.
-
Cross-references (this KB).
postgres-planner-overview.md—subquery_planner/grouping_plannerdriver, PlannerInfo, overall pipeline.postgres-path-generation.md— RelOptInfo path construction, Append / SetOp / SubqueryScan paths, EquivalenceClass and pathkey machinery referenced as call targets here.postgres-join-ordering.md,postgres-join-nodes.md— what the pulled-up semijoins/antijoins feed into.postgres-rewriter.md— the upstream phase whose output prep consumes (notepreprocess_targetlistshares labor withrewriteTargetListIU).postgres-agg-sort-nodes.md— Sort/Unique/Agg nodes used for set-operation duplicate elimination.
-
Theory anchors (see
.omc/plans/postgres-paper-bibliography.md).- Selinger et al. 1979, System R access-path selection —
knowledge/research/dbms-papers/systemr-optimizer.md(cost-based search that consumes prep’s normalized output). - Database System Concepts (Silberschatz et al., 7e) —
knowledge/research/dbms-general/database-system-concepts.md(nested-subquery-to-join transformation, set-operation semantics). - Architecture of a Database System (Hellerstein et al. 2007) —
knowledge/research/dbms-papers/fntdb07-architecture.md(query rewrite / optimizer layering). - Kim 1982; Ganski & Wong 1987; Neumann & Kemper 2015 (arbitrary-query unnesting); Mumick et al. 1990 (magic sets) — external, named in §Beyond PostgreSQL for comparative context; not held in this KB.
- Selinger et al. 1979, System R access-path selection —