PostgreSQL Declarative Partitioning — Bounds, Tuple Routing, and Pruning
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Partitioning is the discipline of taking one logical relation and splitting its rows across several physical storage units according to a deterministic function of the row’s values. Database System Concepts (Silberschatz, 7e, ch. 21 “Parallel and Distributed Storage”, §21.2 “Data Partitioning”) frames it as the foundational step for both parallelism and manageability: “the most basic way of distributing data across multiple nodes is to partition the relations” — assigning each tuple to a partition by a partitioning function on a designated partitioning attribute.
The textbook names the three partitioning strategies that production engines converge on, and PostgreSQL implements exactly these three (§21.2):
- Range partitioning. Tuples are assigned to partitions by a
contiguous, ordered range of the partitioning attribute — partition i
holds all values in
[v_i, v_{i+1}). Range partitioning “is well suited to range queries” because a predicateattr BETWEEN a AND btouches only the partitions whose ranges overlap[a, b]. The cost is skew: a poorly chosen split (or a monotonically increasing key like a timestamp) piles recent rows into one partition. - List partitioning. A generalization where each partition is assigned
an explicit set of values (
FOR VALUES IN (...)). Useful when the key is categorical — country code, status enum — and the buckets are not naturally ordered into ranges. - Hash partitioning. Tuples are assigned by
h(attr) mod n. The textbook’s verdict: hash partitioning “tends to distribute data evenly,” defeating the skew that plagues range partitioning, at the cost of destroying range-query locality — aBETWEENpredicate now potentially touches every partition because adjacent values hash to unrelated buckets (§21.3 “Handling of Skew”).
The §21.3 discussion of skew is the reason all three coexist rather than
one winning: range gives query locality but risks skew; hash gives balance
but loses locality; list sits between, giving the DBA explicit control.
PostgreSQL exposes all three as first-class PARTITION BY {RANGE|LIST|HASH}
declarations and lets them nest (a range partition can itself be
hash-subpartitioned), so the DBA composes the trade-offs per level.
Two operations dominate the runtime of any partitioned system, and both are “find the partition for a value” problems in disguise:
- Tuple routing (write path). Given a new row, which one partition does it belong to? This must be cheap — it runs once per inserted row.
- Partition pruning (read path). Given a query predicate, which partitions can be skipped entirely? The textbook frames this as the core win of partitioning for queries: a scan that would touch a billion-row table instead touches only the handful of partitions whose value sets intersect the predicate. Pruning is what makes partitioning a performance feature and not merely a manageability one.
The elegance of declarative partitioning is that both operations reduce to the same primitive: a search over the sorted set of partition bounds. If the bounds are stored sorted, routing one tuple is a binary search keyed on the tuple’s value, and pruning a range predicate is a pair of binary searches bracketing the predicate’s endpoints. The rest of this document traces how PostgreSQL builds that sorted bound structure once, then reuses it for both paths.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the what — three strategies, two value-lookup operations. This section names the engineering conventions that production partitioning engines adopt to make those operations fast and correct, the patterns the textbook leaves implicit. PostgreSQL’s specific choices in the next section are one set of dials within this shared space.
Canonicalize bounds into a sorted, searchable structure
Section titled “Canonicalize bounds into a sorted, searchable structure”A partition’s bound, as the user wrote it (FOR VALUES FROM (...) TO (...)), is a parser node — convenient to declare, useless to search. The
universal convention is to canonicalize all of a table’s bounds, once,
into a single sorted array of boundary datums plus a parallel map from
boundary position to partition index. Range bounds collapse to a sorted
list of breakpoints; list bounds to a sorted list of accepted values; hash
bounds to a (modulus, remainder) table. After canonicalization, every
value lookup — routing a tuple, pruning a predicate — is a binary search
(range/list) or a modulo index (hash) into that one structure. Building it
is O(n log n) once; using it is O(log n) forever after.
Cache the bound structure on the relation, invalidate on DDL
Section titled “Cache the bound structure on the relation, invalidate on DDL”The bound structure is expensive to build and changes only on DDL (ATTACH
/ DETACH / CREATE ... PARTITION OF). The convention is to attach it to
the in-memory relation descriptor and rebuild lazily on cache invalidation,
not per query. A subtlety unique to partitioning: because different
sessions hold different snapshots, a partition that one transaction sees as
attached another may still see as detached, so the cached descriptor must
be snapshot-qualified rather than globally shared.
Route writes top-down through the hierarchy
Section titled “Route writes top-down through the hierarchy”Partitions nest. The convention for routing an inserted tuple is a top-down descent: start at the root partitioned table, find the child partition for the tuple’s key, and if that child is itself partitioned, recurse with (possibly) a re-mapped tuple — because each level may have a different column layout (dropped columns, attribute reordering). The descent terminates at a leaf table, the only kind that physically stores rows.
Prune in two phases — plan time and run time
Section titled “Prune in two phases — plan time and run time”A predicate known at planning (WHERE created > '2024-01-01') lets the
planner discard partitions before the plan is even finalized — plan-time
pruning. But a predicate whose value is unknown until execution — a
parameter from a generic prepared plan, or the current outer-row value
feeding the inner side of a nested loop — cannot prune at plan time. The
convention is to compile pruning into a reusable program of steps that
can run in both phases: once at plan time with constants folded, and
again at run time once the parameters bind. Run-time pruning is what lets a
single generic plan skip partitions per execution.
Push operators below the partitioning boundary
Section titled “Push operators below the partitioning boundary”If two tables are partitioned identically (same strategy, same bounds), a
join between them equals the union of joins between matching partition
pairs — partitionwise join. Likewise a GROUP BY whose keys include
the partition key equals the union of per-partition aggregates —
partitionwise aggregate. The convention is to recognize bound
compatibility (by merging the two bound structures) and rewrite one large
operator into N small ones, each touching a fraction of the data and each
independently parallelizable.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:
| Theory / convention | PostgreSQL name |
|---|---|
| Per-table partition catalog | PartitionDesc (PartitionDescData), cached on the Relation |
| Canonicalized sorted bound structure | PartitionBoundInfo (PartitionBoundInfoData): datums[] + indexes[] |
| Build the bound structure | partition_bounds_create → create_{hash,list,range}_bounds |
| Range/list value lookup | partition_range_datum_bsearch / partition_list_bsearch |
| Hash value lookup | compute_partition_hash_value then % nindexes |
| Route one tuple to a leaf | ExecFindPartition → get_partition_for_tuple |
| Per-INSERT routing state | PartitionTupleRouting + PartitionDispatch |
| Last-found routing cache | PartitionDesc.last_found_* + PARTITION_CACHED_FIND_THRESHOLD |
| Pruning program of steps | PartitionPruneStepOp / PartitionPruneStepCombine |
| Plan-time pruning | prune_append_rel_partitions → get_matching_partitions |
| Run-time (initial + exec) pruning | ExecDoInitialPruning / ExecFindMatchingSubPlans |
| Strategy-specific bound matching | get_matching_{hash,list,range}_bounds |
| Partitionwise join bound merge | partition_bounds_merge → merge_{list,range}_bounds |
The catalog DDL that creates partitions and the ATTACH/DETACH
validation are owned by postgres-ddl-execution.md. How the planner builds
the Append/MergeAppend paths over the surviving partitions, and how
partitionwise paths are costed, is owned by postgres-planner-overview.md.
The leaf-scan node mechanics (SeqScan, IndexScan over each partition)
are owned by postgres-scan-nodes.md. This document covers the
partitioning machinery proper: the bound structure, tuple routing, and
the two-phase pruning that all three layers share.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s declarative partitioning (the PARTITION BY syntax introduced
in PG 10 and steadily hardened since) is the textbook model rendered with
four design decisions that give it its shape:
- One canonical bound structure per table,
PartitionBoundInfo, built once bypartition_bounds_createand cached inside thePartitionDescon the relcache entry. Both routing and pruning search it; neither rebuilds it. - Top-down tuple routing through a
PartitionDispatchchain, with alast_foundcache on thePartitionDescthat skips the binary search when consecutive rows land in the same partition — the common case for time-series inserts. - Pruning as a compiled step program,
PartitionPruneStepnodes, so the same program runs at plan time (constants only) and run time (parameters bound). This is the single most important structural choice: it unifies two pruning phases under one executor. - Partitionwise operators gated behind a bound-merge,
partition_bounds_merge, which proves two tables’ bounds are compatible before the planner pushes a join or aggregate below theAppend.
PartitionDesc and PartitionBoundInfo — the cached bound structure
Section titled “PartitionDesc and PartitionBoundInfo — the cached bound structure”Every partitioned relation carries a PartitionDesc in its relcache entry.
It is the ordered list of partition OIDs plus the canonicalized bounds:
// PartitionDescData — src/include/partitioning/partdesc.htypedef struct PartitionDescData{ int nparts; /* Number of partitions */ bool detached_exist; /* Are there any detached partitions? */ Oid *oids; /* Array of 'nparts' elements containing * partition OIDs in order of their bounds */ bool *is_leaf; /* whether each oids[] entry is a leaf */ PartitionBoundInfo boundinfo; /* collection of partition bounds */
/* Caching fields to cache lookups in get_partition_for_tuple() */ int last_found_datum_index; int last_found_part_index; int last_found_count;} PartitionDescData;boundinfo is where canonicalization lives. Its layout is identical for
all three strategies — a datums[] array of boundary values and an
indexes[] array mapping boundary positions to partition indexes — only
the meaning of the arrays differs per strategy:
// PartitionBoundInfoData — src/include/partitioning/partbounds.htypedef struct PartitionBoundInfoData{ PartitionStrategy strategy; /* hash, list or range? */ int ndatums; /* Length of the datums[] array */ Datum **datums; PartitionRangeDatumKind **kind; /* per range datum; NULL for hash/list */ Bitmapset *interleaved_parts; /* possibly-interleaved LIST partitions */ int nindexes; /* Length of the indexes[] array */ int *indexes; /* Partition indexes */ int null_index; /* the null-accepting partition; -1 if none */ int default_index; /* the default partition; -1 if none */} PartitionBoundInfoData;The header comment in partbounds.h spells out the per-strategy invariant
that makes one struct serve all three:
- LIST:
indexes[]has one entry per datum (nindexes == ndatums): the partition that accepts rows matching that exact value. - RANGE:
indexes[]has one entry per distinct range datum plus one trailing entry (nindexes == ndatums + 1):indexes[i]is the partition for whichdatums[i]is the upper bound, or-1for a gap with no partition. - HASH:
nindexes == greatest modulus:indexes[r]is the partition accepting remainderr, or-1.
null_index and default_index are the two special slots — the partition
accepting NULL keys (LIST only) and the catch-all DEFAULT partition.
The structure is built by partition_bounds_create, a thin dispatch on
strategy:
// partition_bounds_create — src/backend/partitioning/partbounds.cPartitionBoundInfopartition_bounds_create(PartitionBoundSpec **boundspecs, int nparts, PartitionKey key, int **mapping){ *mapping = (int *) palloc(sizeof(int) * nparts); for (int i = 0; i < nparts; i++) (*mapping)[i] = -1;
switch (key->strategy) { case PARTITION_STRATEGY_HASH: return create_hash_bounds(boundspecs, nparts, key, mapping); case PARTITION_STRATEGY_LIST: return create_list_bounds(boundspecs, nparts, key, mapping); case PARTITION_STRATEGY_RANGE: return create_range_bounds(boundspecs, nparts, key, mapping); } Assert(false); return NULL;}Each create_*_bounds worker qsorts the raw bounds into ascending order
(the qsort_partition_{hbound,list_value,rbound}_cmp comparators) so the
later binary searches are correct, then fills datums[]/indexes[]. The
*mapping out-parameter records, for each original partition position,
its canonical (sorted) index — the caller needs it because the catalog
stores partitions in OID order, not bound order. For hash, the sorted
moduli let the code pick the greatest modulus as the index-array size:
// create_hash_bounds — src/backend/partitioning/partbounds.c (condensed)qsort(hbounds, nparts, sizeof(PartitionHashBound), qsort_partition_hbound_cmp);/* After sorting, moduli are now stored in ascending order. */greatest_modulus = hbounds[nparts - 1].modulus;
boundinfo->nindexes = greatest_modulus;boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));for (i = 0; i < greatest_modulus; i++) boundinfo->indexes[i] = -1;/* ... each partition with (modulus m, remainder r) claims every * index r, r+m, r+2m, ... < greatest_modulus ... */The descriptor itself is built lazily by RelationBuildPartitionDesc (in
partdesc.c) on first access through RelationGetPartitionDesc, and is
snapshot-qualified via the omit_detached flag — the comment in
partdesc.h notes “the partitions that are visible to each caller depends
on the snapshot it has, so it’s pretty much impossible to evict a
descriptor from cache at the right time,” which is why detached-partition
visibility is threaded through as a parameter rather than baked into one
shared descriptor.
flowchart TB
SPEC["PartitionBoundSpec[] (parser nodes)<br/>FOR VALUES FROM/TO/IN/WITH"]
SPEC -->|partition_bounds_create| DISP{key->strategy}
DISP -->|HASH| CH["create_hash_bounds<br/>qsort by (modulus,remainder)"]
DISP -->|LIST| CL["create_list_bounds<br/>qsort accepted values"]
DISP -->|RANGE| CR["create_range_bounds<br/>qsort breakpoints, de-dup"]
CH --> BI
CL --> BI
CR --> BI
subgraph BI["PartitionBoundInfo (canonical, sorted)"]
D["datums[] (sorted boundary values)"]
IX["indexes[] (boundary pos -> partition idx)"]
SP["null_index / default_index"]
end
BI --> PD["PartitionDesc on relcache<br/>oids[] + is_leaf[] + boundinfo"]
Figure 1 — Bound canonicalization. The user’s parser-node bounds are
sorted once by partition_bounds_create into a single PartitionBoundInfo
whose datums[]/indexes[] layout serves all three strategies. The
descriptor is cached on the relcache entry and reused by both the write
(routing) and read (pruning) paths.
The three value-lookup primitives
Section titled “The three value-lookup primitives”With datums[] sorted, finding the partition for a value is the same
binary search for list and range, and a modulo for hash. The list search
returns the greatest datum <= value:
// partition_list_bsearch — src/backend/partitioning/partbounds.cintpartition_list_bsearch(FmgrInfo *partsupfunc, Oid *partcollation, PartitionBoundInfo boundinfo, Datum value, bool *is_equal){ int lo = -1, hi = boundinfo->ndatums - 1, mid; while (lo < hi) { int32 cmpval; mid = (lo + hi + 1) / 2; cmpval = DatumGetInt32(FunctionCall2Coll(&partsupfunc[0], partcollation[0], boundinfo->datums[mid][0], value)); if (cmpval <= 0) { lo = mid; *is_equal = (cmpval == 0); if (*is_equal) break; } else hi = mid - 1; } return lo;}The caller treats a list match as valid only when *is_equal is true
(list partitioning is exact-match, not range-containment).
partition_range_datum_bsearch is structurally identical but uses
partition_rbound_datum_cmp to compare a multi-column tuple against a
multi-column range bound, and the caller takes indexes[bound_offset + 1]
(the partition whose upper bound is the next breakpoint). Hash needs no
search — compute_partition_hash_value combines the per-column hashes and
the caller takes % nindexes:
// compute_partition_hash_value — src/backend/partitioning/partbounds.cuint64compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc, const Oid *partcollation, const Datum *values, const bool *isnull){ uint64 rowHash = 0; Datum seed = UInt64GetDatum(HASH_PARTITION_SEED);
for (int i = 0; i < partnatts; i++) { if (!isnull[i]) /* Nulls are just ignored */ { Datum hash = FunctionCall2Coll(&partsupfunc[i], partcollation[i], values[i], seed); rowHash = hash_combine64(rowHash, DatumGetUInt64(hash)); } } return rowHash;}These three primitives are the entire arithmetic core of partitioning; everything else is plumbing that decides when to call them and what to do with the answer.
Tuple routing on INSERT — ExecFindPartition
Section titled “Tuple routing on INSERT — ExecFindPartition”When a row is inserted into a partitioned table, ModifyTable calls
ExecFindPartition (in execPartition.c) to find the leaf
ResultRelInfo that will physically store it. Routing state lives in a
PartitionTupleRouting struct whose partition_dispatch_info[] array
holds one PartitionDispatch per partitioned table touched (slot 0 is
always the root). ExecFindPartition is a top-down descent: at each
level it extracts the partition key, finds the child, and either stops (if
the child is a leaf) or recurses into the child’s dispatch:
// ExecFindPartition — src/backend/executor/execPartition.c (condensed)dispatch = pd[0]; /* start at the root table */while (dispatch != NULL){ int partidx = -1; rel = dispatch->reldesc; partdesc = dispatch->partdesc;
/* expression machinery in FormPartitionKeyDatum reads ecxt_scantuple */ ecxt->ecxt_scantuple = slot; FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
if (partdesc->nparts == 0 || (partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0) ereport(ERROR, (errcode(ERRCODE_CHECK_VIOLATION), errmsg("no partition of relation \"%s\" found for row", RelationGetRelationName(rel)), /* ... */ ));
if (partdesc->is_leaf[partidx]) { /* reached a leaf: fetch (or lazily build) its ResultRelInfo */ if (likely(dispatch->indexes[partidx] >= 0)) rri = proute->partitions[dispatch->indexes[partidx]]; else rri = ExecInitPartitionInfo(mtstate, estate, proute, dispatch, rootResultRelInfo, partidx); dispatch = NULL; /* terminate the loop */ } else { /* sub-partitioned: descend, possibly converting the tuple layout */ dispatch = /* the child PartitionDispatch for partidx */; }}FormPartitionKeyDatum evaluates the partition-key expressions into
values[]/isnull[] (it runs the compiled keystate ExprState, which
is why it needs ecxt_scantuple pointed at the right slot — the executor’s
expression machinery is covered in postgres-expression-eval.md). Note the
lazy construction: the ResultRelInfo and PartitionDispatch for a
partition are built only the first time a tuple is routed there
(ExecInitPartitionInfo / ExecInitPartitionDispatchInfo), so a bulk
INSERT ... SELECT that hits only three of a thousand partitions pays for
three.
get_partition_for_tuple is where the value-lookup primitives are called,
wrapped in the last-found cache that is the routing path’s signature
optimization. The header comment is explicit about the motivating case —
RANGE on a timestamp where “the partition key is the current time,” so
consecutive rows land in the same partition:
// get_partition_for_tuple — src/backend/executor/execPartition.c (LIST arm)case PARTITION_STRATEGY_LIST: if (isnull[0]) { if (partition_bound_accepts_nulls(boundinfo)) return boundinfo->null_index; } else { bool equal; if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD) { int last_datum_offset = partdesc->last_found_datum_index; Datum lastDatum = boundinfo->datums[last_datum_offset][0]; int32 cmpval; /* does the last found datum index match this datum? */ cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0], key->partcollation[0], lastDatum, values[0])); if (cmpval == 0) return boundinfo->indexes[last_datum_offset]; /* fall-through and do a manual lookup */ } bound_offset = partition_list_bsearch(key->partsupfunc, key->partcollation, boundinfo, values[0], &equal); if (bound_offset >= 0 && equal) part_index = boundinfo->indexes[bound_offset]; } break;The cache logic, run after the switch for the non-special cases, bumps
last_found_count when the same bound_offset recurs and resets it to 1
otherwise; only once the count reaches PARTITION_CACHED_FIND_THRESHOLD
(16) does the cache fast-path engage. The fast path replaces an
O(log n) binary search with a single comparison against the remembered
datum, and on a miss falls back to the search transparently:
// get_partition_for_tuple — src/backend/executor/execPartition.c (cache update)if (bound_offset == partdesc->last_found_datum_index) partdesc->last_found_count++;else{ partdesc->last_found_count = 1; partdesc->last_found_part_index = part_index; partdesc->last_found_datum_index = bound_offset;}return part_index;Hash is never cached (“too cheap to bother”) and NULL/DEFAULT matches
are returned directly without touching the cache (there is no bound_offset
for them). The threshold of 16 is a deliberate guard: a workload that
alternates partitions every row never reaches 16 in a row, so it never
pays the double-check cost — it just does the binary search it would have
done anyway.
flowchart TB
T["new tuple slot"] --> FK["FormPartitionKeyDatum<br/>values[] / isnull[]"]
FK --> GP["get_partition_for_tuple"]
GP --> ST{strategy}
ST -->|HASH| H["compute_partition_hash_value<br/>% nindexes"]
ST -->|LIST/RANGE| C{last_found_count<br/>>= 16 ?}
C -->|yes| FAST["compare vs cached datum<br/>hit -> return cached index"]
C -->|no| BS["partition_list_bsearch /<br/>partition_range_datum_bsearch"]
FAST -->|miss| BS
H --> IDX["partition index"]
BS --> IDX
IDX --> LEAF{is_leaf[partidx] ?}
LEAF -->|no| GP
LEAF -->|yes| RRI["leaf ResultRelInfo<br/>(built lazily)"]
Figure 2 — Tuple routing. ExecFindPartition descends the hierarchy; at
each level get_partition_for_tuple does a modulo (hash) or binary search
(list/range), short-circuited by the last-found cache once the same
partition recurs 16 times. The descent stops at a leaf; non-leaf matches
recurse.
Plan-time and runtime partition pruning
Section titled “Plan-time and runtime partition pruning”Pruning is the read-path mirror of routing: instead of “which partition does this value go to?”, it asks “which partitions can possibly satisfy this predicate?”. The pivotal design choice is that pruning compiles to a program of steps that runs in either phase. The two step node types are tiny:
// PartitionPruneStepOp / Combine — src/include/nodes/plannodes.htypedef struct PartitionPruneStepOp{ PartitionPruneStep step; StrategyNumber opstrategy; /* btree strategy of the clause's operator */ List *exprs; /* one expr per partition key column */ List *cmpfns; /* comparison support function OIDs */ Bitmapset *nullkeys; /* keys matched by IS NULL */} PartitionPruneStepOp;
typedef enum PartitionPruneCombineOp{ PARTPRUNE_COMBINE_UNION, PARTPRUNE_COMBINE_INTERSECT,} PartitionPruneCombineOp;
typedef struct PartitionPruneStepCombine{ PartitionPruneStep step; PartitionPruneCombineOp combineOp; List *source_stepids;} PartitionPruneStepCombine;An Op step is a leaf comparison (partkey < $1); a Combine step folds
the results of earlier steps with UNION (for OR) or INTERSECT (for
AND). gen_partprune_steps walks the clause tree and emits this program.
At plan time, prune_append_rel_partitions runs it with constants only:
// prune_append_rel_partitions — src/backend/partitioning/partprune.c (condensed)if (!enable_partition_pruning || clauses == NIL) return bms_add_range(NULL, 0, rel->nparts - 1);
gen_partprune_steps(rel, clauses, PARTTARGET_PLANNER, &gcontext);if (gcontext.contradictory) return NULL; /* WHERE 1=0 style: prune everything */pruning_steps = gcontext.steps;if (pruning_steps == NIL) return bms_add_range(NULL, 0, rel->nparts - 1); /* nothing usable */
/* ... fill PartitionPruneContext from rel->part_scheme / rel->boundinfo ... */context.planstate = NULL; /* not valid at plan time */context.exprcontext = NULL;return get_matching_partitions(&context, pruning_steps);get_matching_partitions is the interpreter for the step program. It
runs each step into a results[] slot, then reads the final step’s
bound_offsets bitmap and maps offsets to partition indexes through
boundinfo->indexes[], adding the null/default partitions as the result
flags demand:
// get_matching_partitions — src/backend/partitioning/partprune.c (condensed)foreach(lc, pruning_steps){ PartitionPruneStep *step = lfirst(lc); switch (nodeTag(step)) { case T_PartitionPruneStepOp: results[step->step_id] = perform_pruning_base_step(context, (PartitionPruneStepOp *) step); break; case T_PartitionPruneStepCombine: results[step->step_id] = perform_pruning_combine_step(context, (PartitionPruneStepCombine *) step, results); break; }}final_result = results[num_steps - 1];/* map surviving bound_offsets -> partition indexes, add null/default */perform_pruning_base_step builds the lookup key from the step’s exprs
(extracting a Datum per partition-key column, bailing out to “no match”
if any is NULL, since pruning operators are strict) and dispatches to the
strategy-specific matcher. The hash matcher shows the structure — if all
keys have values it computes one remainder and returns a singleton offset;
otherwise it conservatively returns all offsets:
// get_matching_hash_bounds — src/backend/partitioning/partprune.c (condensed)if (nvalues + bms_num_members(nullkeys) == partnatts){ for (i = 0; i < partnatts; i++) isnull[i] = bms_is_member(i, nullkeys); rowHash = compute_partition_hash_value(partnatts, partsupfunc, partcollation, values, isnull); greatest_modulus = boundinfo->nindexes; if (partindices[rowHash % greatest_modulus] >= 0) result->bound_offsets = bms_make_singleton(rowHash % greatest_modulus);}else /* not all keys known: every partition is a candidate */ result->bound_offsets = bms_add_range(NULL, 0, boundinfo->nindexes - 1);The same step program runs again at run time when values are not known
at plan time — a generic prepared plan’s $1, or the inner side of a
nested loop whose key depends on the current outer row.
ExecDoInitialPruning (called from InitPlan, before the plan executes)
performs the initial prune for external params, while
ExecFindMatchingSubPlans is re-invoked during execution for params that
change per outer row:
// ExecDoInitialPruning — src/backend/executor/execPartition.c (condensed)foreach(lc, estate->es_part_prune_infos){ PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc); PartitionPruneState *prunestate; Bitmapset *validsubplans = NULL;
prunestate = CreatePartitionPruneState(estate, pruneinfo, &all_leafpart_rtis); estate->es_part_prune_states = lappend(estate->es_part_prune_states, prunestate);
if (prunestate->do_initial_prune) validsubplans = ExecFindMatchingSubPlans(prunestate, true, &validsubplan_rtis); /* ... record surviving subplan indexes for Append/MergeAppend init ... */ estate->es_part_prune_results = lappend(estate->es_part_prune_results, validsubplans);}ExecFindMatchingSubPlans runs the steps in a temp memory context (it is
re-run per outer tuple, so its allocations must be cheaply discardable),
recurses the partition hierarchy via find_matching_subplans_recurse, and
returns the bitmap of Append/MergeAppend subplan indexes worth
executing. The surviving set drives which child plan nodes are even
initialized — pruned subplans are skipped, which is the executor wrinkle
noted in postgres-executor.md where the state subnode array falls out of
lockstep with the plan’s subplan list.
flowchart TB
CL["WHERE clauses"] -->|gen_partprune_steps| PROG["step program:<br/>PruneStepOp + PruneStepCombine"]
PROG --> PT["PLAN TIME<br/>prune_append_rel_partitions<br/>(constants only)"]
PROG --> RT["RUN TIME<br/>ExecDoInitialPruning /<br/>ExecFindMatchingSubPlans<br/>(params bound)"]
PT -->|get_matching_partitions| GM["interpret steps -><br/>bound_offsets bitmap"]
RT -->|get_matching_partitions| GM
GM --> BASE["perform_pruning_base_step<br/>get_matching_{hash,list,range}_bounds"]
GM --> COMB["perform_pruning_combine_step<br/>UNION (OR) / INTERSECT (AND)"]
BASE --> SURV["surviving partition indexes"]
COMB --> SURV
SURV --> AP["Append / MergeAppend<br/>only scans surviving children"]
Figure 3 — Two-phase pruning over one step program. gen_partprune_steps
compiles WHERE clauses into a reusable program of Op/Combine steps.
The same program is interpreted by get_matching_partitions at plan time
(constants) and re-interpreted at run time once params bind, so a single
generic plan prunes per execution.
Partitionwise join and aggregate
Section titled “Partitionwise join and aggregate”When enable_partitionwise_join is on and two joined tables are
partitioned compatibly, the planner rewrites the join as an Append
over per-partition-pair joins. Compatibility is decided by merging the
two bound structures with partition_bounds_merge, which dispatches to
merge_list_bounds / merge_range_bounds:
// partition_bounds_merge dispatch — src/backend/partitioning/partbounds.c/* called from try_partitionwise_join() in the planner */switch (outer_binfo->strategy){ case PARTITION_STRATEGY_LIST: return merge_list_bounds(partsupfunc, partcollation, outer_rel, inner_rel, jointype, outer_parts, inner_parts); case PARTITION_STRATEGY_RANGE: return merge_range_bounds(partnatts, partsupfunc, partcollation, outer_rel, inner_rel, jointype, outer_parts, inner_parts); /* HASH partitionwise join requires identical bounds, handled separately */}The merge produces a new PartitionBoundInfo for the join relation plus
two lists (outer_parts/inner_parts) pairing each merged partition’s
inner and outer source partitions, handling the asymmetries that a join
introduces: a DEFAULT partition on one side, a NULL partition, ranges
that partially overlap, and the outer-join semantics where an unmatched
outer partition must still emit rows (process_outer_partition /
merge_default_partitions handle these cases). The output is consumed by
the planner’s try_partitionwise_join, which is owned by
postgres-planner-overview.md; this document covers only the
bound-merge primitive that gates it.
enable_partitionwise_aggregate does the analogous rewrite for GROUP BY:
if the grouping keys are a superset of the partition key, each partition’s
rows form complete groups, so the aggregate can run below the Append
(one Agg per partition) instead of above it — and each per-partition
Agg is independently parallelizable. Both GUCs default off because the
planning-time cost of the bound merge and the extra path generation is only
worth it when partitions are numerous and the per-partition operators are
cheaper than the monolithic one.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use
git grep -n '<symbol>' src/backend/partitioning/to locate the current position. The line numbers in the position-hint table were observed at commit273fe94(REL_18_STABLE) and are quick hints only.
Bound structure (partbounds.c, partdesc.c)
Section titled “Bound structure (partbounds.c, partdesc.c)”partition_bounds_create— strategy dispatch that canonicalizes parserPartitionBoundSpec[]into aPartitionBoundInfo; fills the original→canonical*mapping.create_hash_bounds/create_list_bounds/create_range_bounds— per-strategy workers; each qsorts the raw bounds and populatesdatums[]/indexes[]. Hash sizesindexes[]to the greatest modulus.qsort_partition_hbound_cmp/qsort_partition_list_value_cmp/qsort_partition_rbound_cmp— the sort comparators that establish the invariant the binary searches rely on.partition_list_bsearch/partition_range_datum_bsearch/partition_range_bsearch/partition_hash_bsearch— the value-lookup primitives; return the greatest bound<=the probe.compute_partition_hash_value—hash_combine64over per-columnpartsupfunchashes withHASH_PARTITION_SEED; nulls ignored.partition_bounds_equal/partition_bounds_copy— structural equality (used to detect identical-bound partitionwise joins) and deep copy.RelationBuildPartitionDesc/RelationGetPartitionDesc(partdesc.c) — lazily build / fetch the cachedPartitionDesc;omit_detachedthreads snapshot-qualified detached-partition visibility.
Tuple routing (execPartition.c)
Section titled “Tuple routing (execPartition.c)”ExecSetupPartitionTupleRouting— allocate thePartitionTupleRoutingfor aModifyTabletargeting a partitioned table.ExecFindPartition— top-down descent; per level callsFormPartitionKeyDatum+get_partition_for_tuple, stops at a leaf.get_partition_for_tuple— strategy switch over the lookup primitives, wrapped in thelast_foundcache gated byPARTITION_CACHED_FIND_THRESHOLD(= 16).FormPartitionKeyDatum— run the partition-keykeystateExprStateintovalues[]/isnull[]; needsecxt_scantupleset.ExecInitPartitionInfo/ExecInitPartitionDispatchInfo/ExecInitRoutingInfo— lazy construction of a leaf’sResultRelInfoand a sub-partition’sPartitionDispatch, built on first routed tuple.ExecCleanupTupleRouting— close partitions opened only for routing.
Two-phase pruning (partprune.c, execPartition.c)
Section titled “Two-phase pruning (partprune.c, execPartition.c)”gen_partprune_steps/gen_partprune_steps_internal— compileWHEREclauses into thePartitionPruneStepprogram; setscontradictory.make_partition_pruneinfo— planner-side: build thePartitionPruneInfocarried in the plan for runtime pruning.prune_append_rel_partitions— plan-time entry; runs the steps with constants viaget_matching_partitions.get_matching_partitions— the step interpreter; runs each step intoresults[], maps finalbound_offsetsto partition indexes, adds null/default.perform_pruning_base_step— build the lookup key fromexprs; dispatch toget_matching_{hash,list,range}_bounds.perform_pruning_combine_step— fold child results withPARTPRUNE_COMBINE_UNION/PARTPRUNE_COMBINE_INTERSECT.get_matching_hash_bounds/get_matching_list_bounds/get_matching_range_bounds— strategy-specific offset matching.ExecDoInitialPruning— executor entry fromInitPlan; buildsPartitionPruneStates and runs the initial prune for external params.ExecInitPartitionExecPruning— wire up per-node exec pruning; resequence subplan maps after initial pruning.ExecFindMatchingSubPlans/find_matching_subplans_recurse— run the steps (in a temp context) to return survivingAppend/MergeAppendsubplan indexes; re-invoked per outer tuple for exec params.
Partitionwise operators (partbounds.c)
Section titled “Partitionwise operators (partbounds.c)”partition_bounds_merge— dispatch tomerge_list_bounds/merge_range_boundsto prove join-side bound compatibility.merge_list_bounds/merge_range_bounds— produce the mergedPartitionBoundInfo+ matchedouter_parts/inner_partslists.process_outer_partition/process_inner_partition/merge_null_partitions/merge_default_partitions— handle the null/default/outer-join asymmetries during the merge.partitions_are_ordered— test enabling orderedAppend(MergeAppend avoidance) over partitions.
Key structs and constants
Section titled “Key structs and constants”PartitionDescData(partdesc.h) —oids[]+is_leaf[]+boundinfolast_found_*cache fields.
PartitionBoundInfoData(partbounds.h) —strategy,datums[],kind[],indexes[],null_index,default_index.PartitionTupleRouting/PartitionDispatch(execPartition.c) — the per-INSERT routing state and per-level dispatch object.PartitionPruneStepOp/PartitionPruneStepCombine/PartitionPruneCombineOp(plannodes.h) — the compiled pruning program.PartitionPruneContext(partprune.h) — boundinfo + support funcs + optionalplanstate/exprcontext(NULL at plan time).PARTITION_CACHED_FIND_THRESHOLD(= 16,execPartition.c),HASH_PARTITION_SEED(catalog/partition.h),PARTITION_MAX_KEYS(= 32,pg_config_manual.h).
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
partition_bounds_create | src/backend/partitioning/partbounds.c | 299 |
create_hash_bounds | src/backend/partitioning/partbounds.c | 347 |
create_list_bounds | src/backend/partitioning/partbounds.c | 462 |
create_range_bounds | src/backend/partitioning/partbounds.c | 677 |
partition_list_bsearch | src/backend/partitioning/partbounds.c | 3607 |
partition_range_bsearch | src/backend/partitioning/partbounds.c | 3653 |
partition_range_datum_bsearch | src/backend/partitioning/partbounds.c | 3695 |
partition_hash_bsearch | src/backend/partitioning/partbounds.c | 3738 |
compute_partition_hash_value | src/backend/partitioning/partbounds.c | 4722 |
RelationBuildPartitionDesc | src/backend/partitioning/partdesc.c | 134 |
RelationGetPartitionDesc | src/backend/partitioning/partdesc.c | 71 |
ExecSetupPartitionTupleRouting | src/backend/executor/execPartition.c | 218 |
ExecFindPartition | src/backend/executor/execPartition.c | 265 |
ExecInitPartitionInfo | src/backend/executor/execPartition.c | 502 |
ExecInitRoutingInfo | src/backend/executor/execPartition.c | 994 |
ExecInitPartitionDispatchInfo | src/backend/executor/execPartition.c | 1102 |
ExecCleanupTupleRouting | src/backend/executor/execPartition.c | 1241 |
FormPartitionKeyDatum | src/backend/executor/execPartition.c | 1302 |
PARTITION_CACHED_FIND_THRESHOLD | src/backend/executor/execPartition.c | 1356 |
get_partition_for_tuple | src/backend/executor/execPartition.c | 1399 |
ExecDoInitialPruning | src/backend/executor/execPartition.c | 1824 |
ExecInitPartitionExecPruning | src/backend/executor/execPartition.c | 1880 |
ExecFindMatchingSubPlans | src/backend/executor/execPartition.c | 2498 |
find_matching_subplans_recurse | src/backend/executor/execPartition.c | 2571 |
make_partition_pruneinfo | src/backend/partitioning/partprune.c | 225 |
gen_partprune_steps | src/backend/partitioning/partprune.c | 744 |
prune_append_rel_partitions | src/backend/partitioning/partprune.c | 780 |
get_matching_partitions | src/backend/partitioning/partprune.c | 847 |
get_matching_hash_bounds | src/backend/partitioning/partprune.c | 2706 |
get_matching_list_bounds | src/backend/partitioning/partprune.c | 2783 |
get_matching_range_bounds | src/backend/partitioning/partprune.c | 2994 |
perform_pruning_base_step | src/backend/partitioning/partprune.c | 3458 |
perform_pruning_combine_step | src/backend/partitioning/partprune.c | 3606 |
PartitionBoundInfoData | src/include/partitioning/partbounds.h | 79 |
PartitionDescData | src/include/partitioning/partdesc.h | 29 |
PartitionPruneStepOp | src/include/nodes/plannodes.h | 1699 |
PartitionPruneStepCombine | src/include/nodes/plannodes.h | 1721 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”All symbols, struct fields, and code excerpts above were read from the
REL_18_STABLE working tree at commit 273fe94 (PostgreSQL 18.x). Spot
checks performed:
PartitionBoundInfoDatalayout — confirmedstrategy,ndatums,datums,kind,interleaved_parts,nindexes,indexes,null_index,default_indexinpartbounds.h. The per-strategyindexes[]invariants (LISTnindexes == ndatums; RANGEnindexes == ndatums + 1; HASHnindexes == greatest modulus) are stated verbatim in that header’s comment block.PartitionDescDatacache fields —last_found_datum_index,last_found_part_index,last_found_countconfirmed present inpartdesc.h, with the comment namingget_partition_for_tuple()as the consumer.PARTITION_CACHED_FIND_THRESHOLD— confirmed#define ... 16inexecPartition.c, and confirmed both LIST and RANGE arms ofget_partition_for_tuplegate the cache fast-path onlast_found_count >= PARTITION_CACHED_FIND_THRESHOLD.- Pruning step types —
PartitionPruneStepOp(withopstrategy,exprs,cmpfns,nullkeys) andPartitionPruneStepCombine(withcombineOp,source_stepids) and thePARTPRUNE_COMBINE_UNION/PARTPRUNE_COMBINE_INTERSECTenum confirmed inplannodes.h. - Two-phase entry points —
prune_append_rel_partitions(plan time, setscontext.planstate = NULL) andExecDoInitialPruning/ExecFindMatchingSubPlans(run time) both confirmed to callget_matching_partitionsover the samepruning_steps. - Partitionwise GUCs —
enable_partitionwise_joinandenable_partitionwise_aggregateconfirmed declared incostsize.c(both defaultfalse) and consumed inrelnode.c,allpaths.c,planner.c. - No REL_18-forbidden assertions — this document references no
XLOG2rmgr and noB_DATACHECKSUMSWORKER_*symbols.
Caveat: line numbers in the position-hint table decay; symbol names are the
durable anchor. The merge helpers
(merge_{list,range}_bounds, process_{outer,inner}_partition) were
confirmed present as static forward declarations and definitions in
partbounds.c but were quoted only at the dispatch level, since their
detailed asymmetry handling is planner-facing and deferred to
postgres-planner-overview.md.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s declarative partitioning is deliberately conservative: it is single-node partitioning — every partition lives in the same instance, on the same storage, under the same transaction manager. The textbook’s framing of partitioning (DSC §21.2) is broader: it is the foundation for distributed data placement across nodes. The interesting comparisons are along the axes PostgreSQL chose not to cross.
Local vs. distributed partitioning. Oracle, SQL Server, and DB2 have
shipped local declarative partitioning for decades with the same three
strategies (range/list/hash) plus composite (e.g. range-then-hash)
partitioning that PostgreSQL approximates via manual sub-partitioning.
Distributed engines — Citus (a PostgreSQL extension), CockroachDB,
Spanner, Vitess, YugabyteDB — extend the same h(key) mod n idea to place
partitions on different nodes, which turns the pruning problem into a
routing-to-node problem and adds the distributed-transaction concern that
DSC ch. 23 covers. PostgreSQL core stops at the node boundary; the
postgres_fdw + partitioning combination (“sharding via foreign tables”)
is the closest core comes, and it leans on the same PartitionBoundInfo
plus FDW pushdown rather than a native distributed planner.
Pruning sophistication. PostgreSQL’s plan-time pruning is a clause
interpreter; commercial engines layer additional pruning forms that
PostgreSQL lacks or only partially has: join pruning (eliminating
partitions of one table using a join predicate to a small dimension table —
Oracle’s “partition-wise join with partition pruning”), and dynamic
pruning driven by a bitmap built at runtime from the build side of a hash
join (Oracle’s “bloom filter pruning”, SQL Server’s “bitmap filtering”).
PostgreSQL’s runtime pruning (ExecFindMatchingSubPlans) handles
parameterized nested loops and generic plans, but does not yet build a
runtime value set from a sibling operator to prune with — a recognized gap
on the hackers list.
Adaptive / learned partitioning. A research frontier (and the subject
of DSC §21.3’s skew discussion) is choosing the partition boundaries
automatically from the data distribution rather than by DBA fiat.
Self-driving database work (Pavlo et al., “Self-Driving Database Management
Systems”, CIDR 2017) and learned-structure work (Kraska et al., “The Case
for Learned Index Structures”, SIGMOD 2018) both point at using observed
workloads to choose partition counts and breakpoints; the “learned
partitioning” idea replaces the sorted datums[] array with a learned
model that predicts the partition for a key. PostgreSQL’s datums[] +
binary search is the classical baseline these compare against — and its
last_found cache is a hand-rolled approximation of the locality that a
learned model would exploit.
The column-store axis. Database Internals (Petrov, ch. 1) draws the orthogonal distinction between horizontal partitioning (splitting rows, what PostgreSQL declarative partitioning does) and vertical partitioning (splitting columns, what column-stores do). Analytical engines (ClickHouse, DuckDB, Snowflake) combine both: horizontal partitions (“parts” / “micro-partitions”) each stored column-wise, with per-partition min/max “zone maps” that prune at a far finer grain than PostgreSQL’s per-partition bounds. PostgreSQL’s row-store partitions prune whole tables; a zone-map column-store prunes individual column blocks within a partition. The two techniques compose, and bringing zone-map-style block pruning into PostgreSQL’s heap (via the BRIN index, covered elsewhere) is the row-store analogue.
Where PostgreSQL’s design shines. The unification of routing and
pruning under one PartitionBoundInfo, and of plan-time and run-time
pruning under one PartitionPruneStep program, is unusually clean. Many
engines bolt runtime pruning on as a separate code path; PostgreSQL’s
“compile once, interpret in either phase” structure means a single
correctness argument covers both. The cost is that PostgreSQL pruning is
interpretive (a step program walked at runtime) rather than compiled to
native code — a deliberate trade favoring maintainability over the last
increment of speed, consistent with the executor’s interpreter-first
philosophy described in postgres-executor.md.
Sources
Section titled “Sources”Source tree (REL_18_STABLE, commit 273fe94, read 2026-06-05):
src/backend/partitioning/partbounds.c— bound canonicalization (partition_bounds_create,create_*_bounds), the value-lookup primitives (partition_*_bsearch,compute_partition_hash_value), and the partitionwise bound-merge (partition_bounds_merge,merge_*_bounds).src/backend/partitioning/partprune.c— pruning step compilation (gen_partprune_steps), the interpreter (get_matching_partitions,perform_pruning_*_step), and strategy matchers (get_matching_*_bounds).src/backend/partitioning/partdesc.c—RelationBuildPartitionDesc/RelationGetPartitionDesc(the cached, snapshot-qualified descriptor).src/backend/executor/execPartition.c— tuple routing (ExecFindPartition,get_partition_for_tuple,PartitionTupleRouting/PartitionDispatch) and runtime pruning (ExecDoInitialPruning,ExecFindMatchingSubPlans).src/backend/catalog/partition.c— partition catalog helpers (get_partition_parent,get_partition_ancestors,get_default_partition_oid,map_partition_varattnos).src/include/partitioning/partbounds.h,src/include/partitioning/partdesc.h,src/include/partitioning/partprune.h,src/include/nodes/plannodes.h— thePartitionBoundInfoData,PartitionDescData,PartitionPruneContext, andPartitionPruneStep*type definitions.
Theory:
- Database System Concepts, Silberschatz, Korth & Sudarshan, 7th ed.,
ch. 21 “Parallel and Distributed Storage” — §21.2 “Data Partitioning”
(range / list / hash strategies, partitioning function and attribute),
§21.3 “Dealing with Skew in Partitioning”. Captured at
knowledge/research/dbms-general/database-system-concepts.md. - Database Internals, Alex Petrov, ch. 1 — horizontal vs. vertical
partitioning, column-store data layout. Captured at
knowledge/research/dbms-general/database-internals.md.
Related KB docs (cross-references, not duplicated here):
postgres-ddl-execution.md—CREATE TABLE ... PARTITION BY,ATTACH/DETACH PARTITION, and bound-overlap validation (check_new_partition_bound).postgres-planner-overview.md—Append/MergeAppendpath generation over surviving partitions andtry_partitionwise_join.postgres-scan-nodes.md— the per-partition leaf scan node mechanics.postgres-executor.md— thePlanStatetree,ExecInitNode, and the pruned-subplan wrinkle in the state/plan-tree mapping.postgres-expression-eval.md— theExprStatemachinery behindFormPartitionKeyDatumandpartkey_datum_from_expr.postgres-relcache.md— relcache invalidation that rebuilds the cachedPartitionDesc.