Skip to content

PostgreSQL Declarative Partitioning — Bounds, Tuple Routing, and Pruning

Contents:

Partitioning is the discipline of taking one logical relation and splitting its rows across several physical storage units according to a deterministic function of the row’s values. Database System Concepts (Silberschatz, 7e, ch. 21 “Parallel and Distributed Storage”, §21.2 “Data Partitioning”) frames it as the foundational step for both parallelism and manageability: “the most basic way of distributing data across multiple nodes is to partition the relations” — assigning each tuple to a partition by a partitioning function on a designated partitioning attribute.

The textbook names the three partitioning strategies that production engines converge on, and PostgreSQL implements exactly these three (§21.2):

  1. Range partitioning. Tuples are assigned to partitions by a contiguous, ordered range of the partitioning attribute — partition i holds all values in [v_i, v_{i+1}). Range partitioning “is well suited to range queries” because a predicate attr BETWEEN a AND b touches only the partitions whose ranges overlap [a, b]. The cost is skew: a poorly chosen split (or a monotonically increasing key like a timestamp) piles recent rows into one partition.
  2. List partitioning. A generalization where each partition is assigned an explicit set of values (FOR VALUES IN (...)). Useful when the key is categorical — country code, status enum — and the buckets are not naturally ordered into ranges.
  3. Hash partitioning. Tuples are assigned by h(attr) mod n. The textbook’s verdict: hash partitioning “tends to distribute data evenly,” defeating the skew that plagues range partitioning, at the cost of destroying range-query locality — a BETWEEN predicate now potentially touches every partition because adjacent values hash to unrelated buckets (§21.3 “Handling of Skew”).

The §21.3 discussion of skew is the reason all three coexist rather than one winning: range gives query locality but risks skew; hash gives balance but loses locality; list sits between, giving the DBA explicit control. PostgreSQL exposes all three as first-class PARTITION BY {RANGE|LIST|HASH} declarations and lets them nest (a range partition can itself be hash-subpartitioned), so the DBA composes the trade-offs per level.

Two operations dominate the runtime of any partitioned system, and both are “find the partition for a value” problems in disguise:

  • Tuple routing (write path). Given a new row, which one partition does it belong to? This must be cheap — it runs once per inserted row.
  • Partition pruning (read path). Given a query predicate, which partitions can be skipped entirely? The textbook frames this as the core win of partitioning for queries: a scan that would touch a billion-row table instead touches only the handful of partitions whose value sets intersect the predicate. Pruning is what makes partitioning a performance feature and not merely a manageability one.

The elegance of declarative partitioning is that both operations reduce to the same primitive: a search over the sorted set of partition bounds. If the bounds are stored sorted, routing one tuple is a binary search keyed on the tuple’s value, and pruning a range predicate is a pair of binary searches bracketing the predicate’s endpoints. The rest of this document traces how PostgreSQL builds that sorted bound structure once, then reuses it for both paths.

The textbook gives the what — three strategies, two value-lookup operations. This section names the engineering conventions that production partitioning engines adopt to make those operations fast and correct, the patterns the textbook leaves implicit. PostgreSQL’s specific choices in the next section are one set of dials within this shared space.

Canonicalize bounds into a sorted, searchable structure

Section titled “Canonicalize bounds into a sorted, searchable structure”

A partition’s bound, as the user wrote it (FOR VALUES FROM (...) TO (...)), is a parser node — convenient to declare, useless to search. The universal convention is to canonicalize all of a table’s bounds, once, into a single sorted array of boundary datums plus a parallel map from boundary position to partition index. Range bounds collapse to a sorted list of breakpoints; list bounds to a sorted list of accepted values; hash bounds to a (modulus, remainder) table. After canonicalization, every value lookup — routing a tuple, pruning a predicate — is a binary search (range/list) or a modulo index (hash) into that one structure. Building it is O(n log n) once; using it is O(log n) forever after.

Cache the bound structure on the relation, invalidate on DDL

Section titled “Cache the bound structure on the relation, invalidate on DDL”

The bound structure is expensive to build and changes only on DDL (ATTACH / DETACH / CREATE ... PARTITION OF). The convention is to attach it to the in-memory relation descriptor and rebuild lazily on cache invalidation, not per query. A subtlety unique to partitioning: because different sessions hold different snapshots, a partition that one transaction sees as attached another may still see as detached, so the cached descriptor must be snapshot-qualified rather than globally shared.

Route writes top-down through the hierarchy

Section titled “Route writes top-down through the hierarchy”

Partitions nest. The convention for routing an inserted tuple is a top-down descent: start at the root partitioned table, find the child partition for the tuple’s key, and if that child is itself partitioned, recurse with (possibly) a re-mapped tuple — because each level may have a different column layout (dropped columns, attribute reordering). The descent terminates at a leaf table, the only kind that physically stores rows.

Prune in two phases — plan time and run time

Section titled “Prune in two phases — plan time and run time”

A predicate known at planning (WHERE created > '2024-01-01') lets the planner discard partitions before the plan is even finalized — plan-time pruning. But a predicate whose value is unknown until execution — a parameter from a generic prepared plan, or the current outer-row value feeding the inner side of a nested loop — cannot prune at plan time. The convention is to compile pruning into a reusable program of steps that can run in both phases: once at plan time with constants folded, and again at run time once the parameters bind. Run-time pruning is what lets a single generic plan skip partitions per execution.

Push operators below the partitioning boundary

Section titled “Push operators below the partitioning boundary”

If two tables are partitioned identically (same strategy, same bounds), a join between them equals the union of joins between matching partition pairspartitionwise join. Likewise a GROUP BY whose keys include the partition key equals the union of per-partition aggregates — partitionwise aggregate. The convention is to recognize bound compatibility (by merging the two bound structures) and rewrite one large operator into N small ones, each touching a fraction of the data and each independently parallelizable.

By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:

Theory / conventionPostgreSQL name
Per-table partition catalogPartitionDesc (PartitionDescData), cached on the Relation
Canonicalized sorted bound structurePartitionBoundInfo (PartitionBoundInfoData): datums[] + indexes[]
Build the bound structurepartition_bounds_createcreate_{hash,list,range}_bounds
Range/list value lookuppartition_range_datum_bsearch / partition_list_bsearch
Hash value lookupcompute_partition_hash_value then % nindexes
Route one tuple to a leafExecFindPartitionget_partition_for_tuple
Per-INSERT routing statePartitionTupleRouting + PartitionDispatch
Last-found routing cachePartitionDesc.last_found_* + PARTITION_CACHED_FIND_THRESHOLD
Pruning program of stepsPartitionPruneStepOp / PartitionPruneStepCombine
Plan-time pruningprune_append_rel_partitionsget_matching_partitions
Run-time (initial + exec) pruningExecDoInitialPruning / ExecFindMatchingSubPlans
Strategy-specific bound matchingget_matching_{hash,list,range}_bounds
Partitionwise join bound mergepartition_bounds_mergemerge_{list,range}_bounds

The catalog DDL that creates partitions and the ATTACH/DETACH validation are owned by postgres-ddl-execution.md. How the planner builds the Append/MergeAppend paths over the surviving partitions, and how partitionwise paths are costed, is owned by postgres-planner-overview.md. The leaf-scan node mechanics (SeqScan, IndexScan over each partition) are owned by postgres-scan-nodes.md. This document covers the partitioning machinery proper: the bound structure, tuple routing, and the two-phase pruning that all three layers share.

PostgreSQL’s declarative partitioning (the PARTITION BY syntax introduced in PG 10 and steadily hardened since) is the textbook model rendered with four design decisions that give it its shape:

  1. One canonical bound structure per table, PartitionBoundInfo, built once by partition_bounds_create and cached inside the PartitionDesc on the relcache entry. Both routing and pruning search it; neither rebuilds it.
  2. Top-down tuple routing through a PartitionDispatch chain, with a last_found cache on the PartitionDesc that skips the binary search when consecutive rows land in the same partition — the common case for time-series inserts.
  3. Pruning as a compiled step program, PartitionPruneStep nodes, so the same program runs at plan time (constants only) and run time (parameters bound). This is the single most important structural choice: it unifies two pruning phases under one executor.
  4. Partitionwise operators gated behind a bound-merge, partition_bounds_merge, which proves two tables’ bounds are compatible before the planner pushes a join or aggregate below the Append.

PartitionDesc and PartitionBoundInfo — the cached bound structure

Section titled “PartitionDesc and PartitionBoundInfo — the cached bound structure”

Every partitioned relation carries a PartitionDesc in its relcache entry. It is the ordered list of partition OIDs plus the canonicalized bounds:

// PartitionDescData — src/include/partitioning/partdesc.h
typedef struct PartitionDescData
{
int nparts; /* Number of partitions */
bool detached_exist; /* Are there any detached partitions? */
Oid *oids; /* Array of 'nparts' elements containing
* partition OIDs in order of their bounds */
bool *is_leaf; /* whether each oids[] entry is a leaf */
PartitionBoundInfo boundinfo; /* collection of partition bounds */
/* Caching fields to cache lookups in get_partition_for_tuple() */
int last_found_datum_index;
int last_found_part_index;
int last_found_count;
} PartitionDescData;

boundinfo is where canonicalization lives. Its layout is identical for all three strategies — a datums[] array of boundary values and an indexes[] array mapping boundary positions to partition indexes — only the meaning of the arrays differs per strategy:

// PartitionBoundInfoData — src/include/partitioning/partbounds.h
typedef struct PartitionBoundInfoData
{
PartitionStrategy strategy; /* hash, list or range? */
int ndatums; /* Length of the datums[] array */
Datum **datums;
PartitionRangeDatumKind **kind; /* per range datum; NULL for hash/list */
Bitmapset *interleaved_parts; /* possibly-interleaved LIST partitions */
int nindexes; /* Length of the indexes[] array */
int *indexes; /* Partition indexes */
int null_index; /* the null-accepting partition; -1 if none */
int default_index; /* the default partition; -1 if none */
} PartitionBoundInfoData;

The header comment in partbounds.h spells out the per-strategy invariant that makes one struct serve all three:

  • LIST: indexes[] has one entry per datum (nindexes == ndatums): the partition that accepts rows matching that exact value.
  • RANGE: indexes[] has one entry per distinct range datum plus one trailing entry (nindexes == ndatums + 1): indexes[i] is the partition for which datums[i] is the upper bound, or -1 for a gap with no partition.
  • HASH: nindexes == greatest modulus: indexes[r] is the partition accepting remainder r, or -1.

null_index and default_index are the two special slots — the partition accepting NULL keys (LIST only) and the catch-all DEFAULT partition. The structure is built by partition_bounds_create, a thin dispatch on strategy:

// partition_bounds_create — src/backend/partitioning/partbounds.c
PartitionBoundInfo
partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
PartitionKey key, int **mapping)
{
*mapping = (int *) palloc(sizeof(int) * nparts);
for (int i = 0; i < nparts; i++)
(*mapping)[i] = -1;
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
return create_hash_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_LIST:
return create_list_bounds(boundspecs, nparts, key, mapping);
case PARTITION_STRATEGY_RANGE:
return create_range_bounds(boundspecs, nparts, key, mapping);
}
Assert(false);
return NULL;
}

Each create_*_bounds worker qsorts the raw bounds into ascending order (the qsort_partition_{hbound,list_value,rbound}_cmp comparators) so the later binary searches are correct, then fills datums[]/indexes[]. The *mapping out-parameter records, for each original partition position, its canonical (sorted) index — the caller needs it because the catalog stores partitions in OID order, not bound order. For hash, the sorted moduli let the code pick the greatest modulus as the index-array size:

// create_hash_bounds — src/backend/partitioning/partbounds.c (condensed)
qsort(hbounds, nparts, sizeof(PartitionHashBound),
qsort_partition_hbound_cmp);
/* After sorting, moduli are now stored in ascending order. */
greatest_modulus = hbounds[nparts - 1].modulus;
boundinfo->nindexes = greatest_modulus;
boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));
for (i = 0; i < greatest_modulus; i++)
boundinfo->indexes[i] = -1;
/* ... each partition with (modulus m, remainder r) claims every
* index r, r+m, r+2m, ... < greatest_modulus ... */

The descriptor itself is built lazily by RelationBuildPartitionDesc (in partdesc.c) on first access through RelationGetPartitionDesc, and is snapshot-qualified via the omit_detached flag — the comment in partdesc.h notes “the partitions that are visible to each caller depends on the snapshot it has, so it’s pretty much impossible to evict a descriptor from cache at the right time,” which is why detached-partition visibility is threaded through as a parameter rather than baked into one shared descriptor.

flowchart TB
  SPEC["PartitionBoundSpec[] (parser nodes)<br/>FOR VALUES FROM/TO/IN/WITH"]
  SPEC -->|partition_bounds_create| DISP{key-&gt;strategy}
  DISP -->|HASH| CH["create_hash_bounds<br/>qsort by (modulus,remainder)"]
  DISP -->|LIST| CL["create_list_bounds<br/>qsort accepted values"]
  DISP -->|RANGE| CR["create_range_bounds<br/>qsort breakpoints, de-dup"]
  CH --> BI
  CL --> BI
  CR --> BI
  subgraph BI["PartitionBoundInfo (canonical, sorted)"]
    D["datums[] (sorted boundary values)"]
    IX["indexes[] (boundary pos -&gt; partition idx)"]
    SP["null_index / default_index"]
  end
  BI --> PD["PartitionDesc on relcache<br/>oids[] + is_leaf[] + boundinfo"]

Figure 1 — Bound canonicalization. The user’s parser-node bounds are sorted once by partition_bounds_create into a single PartitionBoundInfo whose datums[]/indexes[] layout serves all three strategies. The descriptor is cached on the relcache entry and reused by both the write (routing) and read (pruning) paths.

With datums[] sorted, finding the partition for a value is the same binary search for list and range, and a modulo for hash. The list search returns the greatest datum <= value:

// partition_list_bsearch — src/backend/partitioning/partbounds.c
int
partition_list_bsearch(FmgrInfo *partsupfunc, Oid *partcollation,
PartitionBoundInfo boundinfo,
Datum value, bool *is_equal)
{
int lo = -1, hi = boundinfo->ndatums - 1, mid;
while (lo < hi)
{
int32 cmpval;
mid = (lo + hi + 1) / 2;
cmpval = DatumGetInt32(FunctionCall2Coll(&partsupfunc[0],
partcollation[0],
boundinfo->datums[mid][0],
value));
if (cmpval <= 0)
{
lo = mid;
*is_equal = (cmpval == 0);
if (*is_equal)
break;
}
else
hi = mid - 1;
}
return lo;
}

The caller treats a list match as valid only when *is_equal is true (list partitioning is exact-match, not range-containment). partition_range_datum_bsearch is structurally identical but uses partition_rbound_datum_cmp to compare a multi-column tuple against a multi-column range bound, and the caller takes indexes[bound_offset + 1] (the partition whose upper bound is the next breakpoint). Hash needs no search — compute_partition_hash_value combines the per-column hashes and the caller takes % nindexes:

// compute_partition_hash_value — src/backend/partitioning/partbounds.c
uint64
compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
const Oid *partcollation,
const Datum *values, const bool *isnull)
{
uint64 rowHash = 0;
Datum seed = UInt64GetDatum(HASH_PARTITION_SEED);
for (int i = 0; i < partnatts; i++)
{
if (!isnull[i]) /* Nulls are just ignored */
{
Datum hash = FunctionCall2Coll(&partsupfunc[i], partcollation[i],
values[i], seed);
rowHash = hash_combine64(rowHash, DatumGetUInt64(hash));
}
}
return rowHash;
}

These three primitives are the entire arithmetic core of partitioning; everything else is plumbing that decides when to call them and what to do with the answer.

Tuple routing on INSERT — ExecFindPartition

Section titled “Tuple routing on INSERT — ExecFindPartition”

When a row is inserted into a partitioned table, ModifyTable calls ExecFindPartition (in execPartition.c) to find the leaf ResultRelInfo that will physically store it. Routing state lives in a PartitionTupleRouting struct whose partition_dispatch_info[] array holds one PartitionDispatch per partitioned table touched (slot 0 is always the root). ExecFindPartition is a top-down descent: at each level it extracts the partition key, finds the child, and either stops (if the child is a leaf) or recurses into the child’s dispatch:

// ExecFindPartition — src/backend/executor/execPartition.c (condensed)
dispatch = pd[0]; /* start at the root table */
while (dispatch != NULL)
{
int partidx = -1;
rel = dispatch->reldesc;
partdesc = dispatch->partdesc;
/* expression machinery in FormPartitionKeyDatum reads ecxt_scantuple */
ecxt->ecxt_scantuple = slot;
FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
if (partdesc->nparts == 0 ||
(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
ereport(ERROR,
(errcode(ERRCODE_CHECK_VIOLATION),
errmsg("no partition of relation \"%s\" found for row",
RelationGetRelationName(rel)), /* ... */ ));
if (partdesc->is_leaf[partidx])
{
/* reached a leaf: fetch (or lazily build) its ResultRelInfo */
if (likely(dispatch->indexes[partidx] >= 0))
rri = proute->partitions[dispatch->indexes[partidx]];
else
rri = ExecInitPartitionInfo(mtstate, estate, proute,
dispatch, rootResultRelInfo, partidx);
dispatch = NULL; /* terminate the loop */
}
else
{
/* sub-partitioned: descend, possibly converting the tuple layout */
dispatch = /* the child PartitionDispatch for partidx */;
}
}

FormPartitionKeyDatum evaluates the partition-key expressions into values[]/isnull[] (it runs the compiled keystate ExprState, which is why it needs ecxt_scantuple pointed at the right slot — the executor’s expression machinery is covered in postgres-expression-eval.md). Note the lazy construction: the ResultRelInfo and PartitionDispatch for a partition are built only the first time a tuple is routed there (ExecInitPartitionInfo / ExecInitPartitionDispatchInfo), so a bulk INSERT ... SELECT that hits only three of a thousand partitions pays for three.

get_partition_for_tuple is where the value-lookup primitives are called, wrapped in the last-found cache that is the routing path’s signature optimization. The header comment is explicit about the motivating case — RANGE on a timestamp where “the partition key is the current time,” so consecutive rows land in the same partition:

// get_partition_for_tuple — src/backend/executor/execPartition.c (LIST arm)
case PARTITION_STRATEGY_LIST:
if (isnull[0])
{
if (partition_bound_accepts_nulls(boundinfo))
return boundinfo->null_index;
}
else
{
bool equal;
if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
{
int last_datum_offset = partdesc->last_found_datum_index;
Datum lastDatum = boundinfo->datums[last_datum_offset][0];
int32 cmpval;
/* does the last found datum index match this datum? */
cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
key->partcollation[0],
lastDatum, values[0]));
if (cmpval == 0)
return boundinfo->indexes[last_datum_offset];
/* fall-through and do a manual lookup */
}
bound_offset = partition_list_bsearch(key->partsupfunc,
key->partcollation,
boundinfo, values[0], &equal);
if (bound_offset >= 0 && equal)
part_index = boundinfo->indexes[bound_offset];
}
break;

The cache logic, run after the switch for the non-special cases, bumps last_found_count when the same bound_offset recurs and resets it to 1 otherwise; only once the count reaches PARTITION_CACHED_FIND_THRESHOLD (16) does the cache fast-path engage. The fast path replaces an O(log n) binary search with a single comparison against the remembered datum, and on a miss falls back to the search transparently:

// get_partition_for_tuple — src/backend/executor/execPartition.c (cache update)
if (bound_offset == partdesc->last_found_datum_index)
partdesc->last_found_count++;
else
{
partdesc->last_found_count = 1;
partdesc->last_found_part_index = part_index;
partdesc->last_found_datum_index = bound_offset;
}
return part_index;

Hash is never cached (“too cheap to bother”) and NULL/DEFAULT matches are returned directly without touching the cache (there is no bound_offset for them). The threshold of 16 is a deliberate guard: a workload that alternates partitions every row never reaches 16 in a row, so it never pays the double-check cost — it just does the binary search it would have done anyway.

flowchart TB
  T["new tuple slot"] --> FK["FormPartitionKeyDatum<br/>values[] / isnull[]"]
  FK --> GP["get_partition_for_tuple"]
  GP --> ST{strategy}
  ST -->|HASH| H["compute_partition_hash_value<br/>% nindexes"]
  ST -->|LIST/RANGE| C{last_found_count<br/>&gt;= 16 ?}
  C -->|yes| FAST["compare vs cached datum<br/>hit -&gt; return cached index"]
  C -->|no| BS["partition_list_bsearch /<br/>partition_range_datum_bsearch"]
  FAST -->|miss| BS
  H --> IDX["partition index"]
  BS --> IDX
  IDX --> LEAF{is_leaf[partidx] ?}
  LEAF -->|no| GP
  LEAF -->|yes| RRI["leaf ResultRelInfo<br/>(built lazily)"]

Figure 2 — Tuple routing. ExecFindPartition descends the hierarchy; at each level get_partition_for_tuple does a modulo (hash) or binary search (list/range), short-circuited by the last-found cache once the same partition recurs 16 times. The descent stops at a leaf; non-leaf matches recurse.

Pruning is the read-path mirror of routing: instead of “which partition does this value go to?”, it asks “which partitions can possibly satisfy this predicate?”. The pivotal design choice is that pruning compiles to a program of steps that runs in either phase. The two step node types are tiny:

// PartitionPruneStepOp / Combine — src/include/nodes/plannodes.h
typedef struct PartitionPruneStepOp
{
PartitionPruneStep step;
StrategyNumber opstrategy; /* btree strategy of the clause's operator */
List *exprs; /* one expr per partition key column */
List *cmpfns; /* comparison support function OIDs */
Bitmapset *nullkeys; /* keys matched by IS NULL */
} PartitionPruneStepOp;
typedef enum PartitionPruneCombineOp
{
PARTPRUNE_COMBINE_UNION,
PARTPRUNE_COMBINE_INTERSECT,
} PartitionPruneCombineOp;
typedef struct PartitionPruneStepCombine
{
PartitionPruneStep step;
PartitionPruneCombineOp combineOp;
List *source_stepids;
} PartitionPruneStepCombine;

An Op step is a leaf comparison (partkey < $1); a Combine step folds the results of earlier steps with UNION (for OR) or INTERSECT (for AND). gen_partprune_steps walks the clause tree and emits this program. At plan time, prune_append_rel_partitions runs it with constants only:

// prune_append_rel_partitions — src/backend/partitioning/partprune.c (condensed)
if (!enable_partition_pruning || clauses == NIL)
return bms_add_range(NULL, 0, rel->nparts - 1);
gen_partprune_steps(rel, clauses, PARTTARGET_PLANNER, &gcontext);
if (gcontext.contradictory)
return NULL; /* WHERE 1=0 style: prune everything */
pruning_steps = gcontext.steps;
if (pruning_steps == NIL)
return bms_add_range(NULL, 0, rel->nparts - 1); /* nothing usable */
/* ... fill PartitionPruneContext from rel->part_scheme / rel->boundinfo ... */
context.planstate = NULL; /* not valid at plan time */
context.exprcontext = NULL;
return get_matching_partitions(&context, pruning_steps);

get_matching_partitions is the interpreter for the step program. It runs each step into a results[] slot, then reads the final step’s bound_offsets bitmap and maps offsets to partition indexes through boundinfo->indexes[], adding the null/default partitions as the result flags demand:

// get_matching_partitions — src/backend/partitioning/partprune.c (condensed)
foreach(lc, pruning_steps)
{
PartitionPruneStep *step = lfirst(lc);
switch (nodeTag(step))
{
case T_PartitionPruneStepOp:
results[step->step_id] =
perform_pruning_base_step(context, (PartitionPruneStepOp *) step);
break;
case T_PartitionPruneStepCombine:
results[step->step_id] =
perform_pruning_combine_step(context,
(PartitionPruneStepCombine *) step,
results);
break;
}
}
final_result = results[num_steps - 1];
/* map surviving bound_offsets -> partition indexes, add null/default */

perform_pruning_base_step builds the lookup key from the step’s exprs (extracting a Datum per partition-key column, bailing out to “no match” if any is NULL, since pruning operators are strict) and dispatches to the strategy-specific matcher. The hash matcher shows the structure — if all keys have values it computes one remainder and returns a singleton offset; otherwise it conservatively returns all offsets:

// get_matching_hash_bounds — src/backend/partitioning/partprune.c (condensed)
if (nvalues + bms_num_members(nullkeys) == partnatts)
{
for (i = 0; i < partnatts; i++)
isnull[i] = bms_is_member(i, nullkeys);
rowHash = compute_partition_hash_value(partnatts, partsupfunc,
partcollation, values, isnull);
greatest_modulus = boundinfo->nindexes;
if (partindices[rowHash % greatest_modulus] >= 0)
result->bound_offsets = bms_make_singleton(rowHash % greatest_modulus);
}
else
/* not all keys known: every partition is a candidate */
result->bound_offsets = bms_add_range(NULL, 0, boundinfo->nindexes - 1);

The same step program runs again at run time when values are not known at plan time — a generic prepared plan’s $1, or the inner side of a nested loop whose key depends on the current outer row. ExecDoInitialPruning (called from InitPlan, before the plan executes) performs the initial prune for external params, while ExecFindMatchingSubPlans is re-invoked during execution for params that change per outer row:

// ExecDoInitialPruning — src/backend/executor/execPartition.c (condensed)
foreach(lc, estate->es_part_prune_infos)
{
PartitionPruneInfo *pruneinfo = lfirst_node(PartitionPruneInfo, lc);
PartitionPruneState *prunestate;
Bitmapset *validsubplans = NULL;
prunestate = CreatePartitionPruneState(estate, pruneinfo, &all_leafpart_rtis);
estate->es_part_prune_states = lappend(estate->es_part_prune_states, prunestate);
if (prunestate->do_initial_prune)
validsubplans = ExecFindMatchingSubPlans(prunestate, true, &validsubplan_rtis);
/* ... record surviving subplan indexes for Append/MergeAppend init ... */
estate->es_part_prune_results = lappend(estate->es_part_prune_results,
validsubplans);
}

ExecFindMatchingSubPlans runs the steps in a temp memory context (it is re-run per outer tuple, so its allocations must be cheaply discardable), recurses the partition hierarchy via find_matching_subplans_recurse, and returns the bitmap of Append/MergeAppend subplan indexes worth executing. The surviving set drives which child plan nodes are even initialized — pruned subplans are skipped, which is the executor wrinkle noted in postgres-executor.md where the state subnode array falls out of lockstep with the plan’s subplan list.

flowchart TB
  CL["WHERE clauses"] -->|gen_partprune_steps| PROG["step program:<br/>PruneStepOp + PruneStepCombine"]
  PROG --> PT["PLAN TIME<br/>prune_append_rel_partitions<br/>(constants only)"]
  PROG --> RT["RUN TIME<br/>ExecDoInitialPruning /<br/>ExecFindMatchingSubPlans<br/>(params bound)"]
  PT -->|get_matching_partitions| GM["interpret steps -&gt;<br/>bound_offsets bitmap"]
  RT -->|get_matching_partitions| GM
  GM --> BASE["perform_pruning_base_step<br/>get_matching_{hash,list,range}_bounds"]
  GM --> COMB["perform_pruning_combine_step<br/>UNION (OR) / INTERSECT (AND)"]
  BASE --> SURV["surviving partition indexes"]
  COMB --> SURV
  SURV --> AP["Append / MergeAppend<br/>only scans surviving children"]

Figure 3 — Two-phase pruning over one step program. gen_partprune_steps compiles WHERE clauses into a reusable program of Op/Combine steps. The same program is interpreted by get_matching_partitions at plan time (constants) and re-interpreted at run time once params bind, so a single generic plan prunes per execution.

When enable_partitionwise_join is on and two joined tables are partitioned compatibly, the planner rewrites the join as an Append over per-partition-pair joins. Compatibility is decided by merging the two bound structures with partition_bounds_merge, which dispatches to merge_list_bounds / merge_range_bounds:

// partition_bounds_merge dispatch — src/backend/partitioning/partbounds.c
/* called from try_partitionwise_join() in the planner */
switch (outer_binfo->strategy)
{
case PARTITION_STRATEGY_LIST:
return merge_list_bounds(partsupfunc, partcollation,
outer_rel, inner_rel, jointype,
outer_parts, inner_parts);
case PARTITION_STRATEGY_RANGE:
return merge_range_bounds(partnatts, partsupfunc, partcollation,
outer_rel, inner_rel, jointype,
outer_parts, inner_parts);
/* HASH partitionwise join requires identical bounds, handled separately */
}

The merge produces a new PartitionBoundInfo for the join relation plus two lists (outer_parts/inner_parts) pairing each merged partition’s inner and outer source partitions, handling the asymmetries that a join introduces: a DEFAULT partition on one side, a NULL partition, ranges that partially overlap, and the outer-join semantics where an unmatched outer partition must still emit rows (process_outer_partition / merge_default_partitions handle these cases). The output is consumed by the planner’s try_partitionwise_join, which is owned by postgres-planner-overview.md; this document covers only the bound-merge primitive that gates it.

enable_partitionwise_aggregate does the analogous rewrite for GROUP BY: if the grouping keys are a superset of the partition key, each partition’s rows form complete groups, so the aggregate can run below the Append (one Agg per partition) instead of above it — and each per-partition Agg is independently parallelizable. Both GUCs default off because the planning-time cost of the bound merge and the extra path generation is only worth it when partitions are numerous and the per-partition operators are cheaper than the monolithic one.

Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use git grep -n '<symbol>' src/backend/partitioning/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

Bound structure (partbounds.c, partdesc.c)

Section titled “Bound structure (partbounds.c, partdesc.c)”
  • partition_bounds_create — strategy dispatch that canonicalizes parser PartitionBoundSpec[] into a PartitionBoundInfo; fills the original→canonical *mapping.
  • create_hash_bounds / create_list_bounds / create_range_bounds — per-strategy workers; each qsorts the raw bounds and populates datums[]/indexes[]. Hash sizes indexes[] to the greatest modulus.
  • qsort_partition_hbound_cmp / qsort_partition_list_value_cmp / qsort_partition_rbound_cmp — the sort comparators that establish the invariant the binary searches rely on.
  • partition_list_bsearch / partition_range_datum_bsearch / partition_range_bsearch / partition_hash_bsearch — the value-lookup primitives; return the greatest bound <= the probe.
  • compute_partition_hash_valuehash_combine64 over per-column partsupfunc hashes with HASH_PARTITION_SEED; nulls ignored.
  • partition_bounds_equal / partition_bounds_copy — structural equality (used to detect identical-bound partitionwise joins) and deep copy.
  • RelationBuildPartitionDesc / RelationGetPartitionDesc (partdesc.c) — lazily build / fetch the cached PartitionDesc; omit_detached threads snapshot-qualified detached-partition visibility.
  • ExecSetupPartitionTupleRouting — allocate the PartitionTupleRouting for a ModifyTable targeting a partitioned table.
  • ExecFindPartition — top-down descent; per level calls FormPartitionKeyDatum + get_partition_for_tuple, stops at a leaf.
  • get_partition_for_tuple — strategy switch over the lookup primitives, wrapped in the last_found cache gated by PARTITION_CACHED_FIND_THRESHOLD (= 16).
  • FormPartitionKeyDatum — run the partition-key keystate ExprState into values[]/isnull[]; needs ecxt_scantuple set.
  • ExecInitPartitionInfo / ExecInitPartitionDispatchInfo / ExecInitRoutingInfo — lazy construction of a leaf’s ResultRelInfo and a sub-partition’s PartitionDispatch, built on first routed tuple.
  • ExecCleanupTupleRouting — close partitions opened only for routing.

Two-phase pruning (partprune.c, execPartition.c)

Section titled “Two-phase pruning (partprune.c, execPartition.c)”
  • gen_partprune_steps / gen_partprune_steps_internal — compile WHERE clauses into the PartitionPruneStep program; sets contradictory.
  • make_partition_pruneinfo — planner-side: build the PartitionPruneInfo carried in the plan for runtime pruning.
  • prune_append_rel_partitions — plan-time entry; runs the steps with constants via get_matching_partitions.
  • get_matching_partitions — the step interpreter; runs each step into results[], maps final bound_offsets to partition indexes, adds null/default.
  • perform_pruning_base_step — build the lookup key from exprs; dispatch to get_matching_{hash,list,range}_bounds.
  • perform_pruning_combine_step — fold child results with PARTPRUNE_COMBINE_UNION / PARTPRUNE_COMBINE_INTERSECT.
  • get_matching_hash_bounds / get_matching_list_bounds / get_matching_range_bounds — strategy-specific offset matching.
  • ExecDoInitialPruning — executor entry from InitPlan; builds PartitionPruneStates and runs the initial prune for external params.
  • ExecInitPartitionExecPruning — wire up per-node exec pruning; resequence subplan maps after initial pruning.
  • ExecFindMatchingSubPlans / find_matching_subplans_recurse — run the steps (in a temp context) to return surviving Append/MergeAppend subplan indexes; re-invoked per outer tuple for exec params.
  • partition_bounds_merge — dispatch to merge_list_bounds / merge_range_bounds to prove join-side bound compatibility.
  • merge_list_bounds / merge_range_bounds — produce the merged PartitionBoundInfo + matched outer_parts/inner_parts lists.
  • process_outer_partition / process_inner_partition / merge_null_partitions / merge_default_partitions — handle the null/default/outer-join asymmetries during the merge.
  • partitions_are_ordered — test enabling ordered Append (MergeAppend avoidance) over partitions.
  • PartitionDescData (partdesc.h) — oids[] + is_leaf[] + boundinfo
    • last_found_* cache fields.
  • PartitionBoundInfoData (partbounds.h) — strategy, datums[], kind[], indexes[], null_index, default_index.
  • PartitionTupleRouting / PartitionDispatch (execPartition.c) — the per-INSERT routing state and per-level dispatch object.
  • PartitionPruneStepOp / PartitionPruneStepCombine / PartitionPruneCombineOp (plannodes.h) — the compiled pruning program.
  • PartitionPruneContext (partprune.h) — boundinfo + support funcs + optional planstate/exprcontext (NULL at plan time).
  • PARTITION_CACHED_FIND_THRESHOLD (= 16, execPartition.c), HASH_PARTITION_SEED (catalog/partition.h), PARTITION_MAX_KEYS (= 32, pg_config_manual.h).

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
partition_bounds_createsrc/backend/partitioning/partbounds.c299
create_hash_boundssrc/backend/partitioning/partbounds.c347
create_list_boundssrc/backend/partitioning/partbounds.c462
create_range_boundssrc/backend/partitioning/partbounds.c677
partition_list_bsearchsrc/backend/partitioning/partbounds.c3607
partition_range_bsearchsrc/backend/partitioning/partbounds.c3653
partition_range_datum_bsearchsrc/backend/partitioning/partbounds.c3695
partition_hash_bsearchsrc/backend/partitioning/partbounds.c3738
compute_partition_hash_valuesrc/backend/partitioning/partbounds.c4722
RelationBuildPartitionDescsrc/backend/partitioning/partdesc.c134
RelationGetPartitionDescsrc/backend/partitioning/partdesc.c71
ExecSetupPartitionTupleRoutingsrc/backend/executor/execPartition.c218
ExecFindPartitionsrc/backend/executor/execPartition.c265
ExecInitPartitionInfosrc/backend/executor/execPartition.c502
ExecInitRoutingInfosrc/backend/executor/execPartition.c994
ExecInitPartitionDispatchInfosrc/backend/executor/execPartition.c1102
ExecCleanupTupleRoutingsrc/backend/executor/execPartition.c1241
FormPartitionKeyDatumsrc/backend/executor/execPartition.c1302
PARTITION_CACHED_FIND_THRESHOLDsrc/backend/executor/execPartition.c1356
get_partition_for_tuplesrc/backend/executor/execPartition.c1399
ExecDoInitialPruningsrc/backend/executor/execPartition.c1824
ExecInitPartitionExecPruningsrc/backend/executor/execPartition.c1880
ExecFindMatchingSubPlanssrc/backend/executor/execPartition.c2498
find_matching_subplans_recursesrc/backend/executor/execPartition.c2571
make_partition_pruneinfosrc/backend/partitioning/partprune.c225
gen_partprune_stepssrc/backend/partitioning/partprune.c744
prune_append_rel_partitionssrc/backend/partitioning/partprune.c780
get_matching_partitionssrc/backend/partitioning/partprune.c847
get_matching_hash_boundssrc/backend/partitioning/partprune.c2706
get_matching_list_boundssrc/backend/partitioning/partprune.c2783
get_matching_range_boundssrc/backend/partitioning/partprune.c2994
perform_pruning_base_stepsrc/backend/partitioning/partprune.c3458
perform_pruning_combine_stepsrc/backend/partitioning/partprune.c3606
PartitionBoundInfoDatasrc/include/partitioning/partbounds.h79
PartitionDescDatasrc/include/partitioning/partdesc.h29
PartitionPruneStepOpsrc/include/nodes/plannodes.h1699
PartitionPruneStepCombinesrc/include/nodes/plannodes.h1721

All symbols, struct fields, and code excerpts above were read from the REL_18_STABLE working tree at commit 273fe94 (PostgreSQL 18.x). Spot checks performed:

  • PartitionBoundInfoData layout — confirmed strategy, ndatums, datums, kind, interleaved_parts, nindexes, indexes, null_index, default_index in partbounds.h. The per-strategy indexes[] invariants (LIST nindexes == ndatums; RANGE nindexes == ndatums + 1; HASH nindexes == greatest modulus) are stated verbatim in that header’s comment block.
  • PartitionDescData cache fieldslast_found_datum_index, last_found_part_index, last_found_count confirmed present in partdesc.h, with the comment naming get_partition_for_tuple() as the consumer.
  • PARTITION_CACHED_FIND_THRESHOLD — confirmed #define ... 16 in execPartition.c, and confirmed both LIST and RANGE arms of get_partition_for_tuple gate the cache fast-path on last_found_count >= PARTITION_CACHED_FIND_THRESHOLD.
  • Pruning step typesPartitionPruneStepOp (with opstrategy, exprs, cmpfns, nullkeys) and PartitionPruneStepCombine (with combineOp, source_stepids) and the PARTPRUNE_COMBINE_UNION / PARTPRUNE_COMBINE_INTERSECT enum confirmed in plannodes.h.
  • Two-phase entry pointsprune_append_rel_partitions (plan time, sets context.planstate = NULL) and ExecDoInitialPruning / ExecFindMatchingSubPlans (run time) both confirmed to call get_matching_partitions over the same pruning_steps.
  • Partitionwise GUCsenable_partitionwise_join and enable_partitionwise_aggregate confirmed declared in costsize.c (both default false) and consumed in relnode.c, allpaths.c, planner.c.
  • No REL_18-forbidden assertions — this document references no XLOG2 rmgr and no B_DATACHECKSUMSWORKER_* symbols.

Caveat: line numbers in the position-hint table decay; symbol names are the durable anchor. The merge helpers (merge_{list,range}_bounds, process_{outer,inner}_partition) were confirmed present as static forward declarations and definitions in partbounds.c but were quoted only at the dispatch level, since their detailed asymmetry handling is planner-facing and deferred to postgres-planner-overview.md.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

PostgreSQL’s declarative partitioning is deliberately conservative: it is single-node partitioning — every partition lives in the same instance, on the same storage, under the same transaction manager. The textbook’s framing of partitioning (DSC §21.2) is broader: it is the foundation for distributed data placement across nodes. The interesting comparisons are along the axes PostgreSQL chose not to cross.

Local vs. distributed partitioning. Oracle, SQL Server, and DB2 have shipped local declarative partitioning for decades with the same three strategies (range/list/hash) plus composite (e.g. range-then-hash) partitioning that PostgreSQL approximates via manual sub-partitioning. Distributed engines — Citus (a PostgreSQL extension), CockroachDB, Spanner, Vitess, YugabyteDB — extend the same h(key) mod n idea to place partitions on different nodes, which turns the pruning problem into a routing-to-node problem and adds the distributed-transaction concern that DSC ch. 23 covers. PostgreSQL core stops at the node boundary; the postgres_fdw + partitioning combination (“sharding via foreign tables”) is the closest core comes, and it leans on the same PartitionBoundInfo plus FDW pushdown rather than a native distributed planner.

Pruning sophistication. PostgreSQL’s plan-time pruning is a clause interpreter; commercial engines layer additional pruning forms that PostgreSQL lacks or only partially has: join pruning (eliminating partitions of one table using a join predicate to a small dimension table — Oracle’s “partition-wise join with partition pruning”), and dynamic pruning driven by a bitmap built at runtime from the build side of a hash join (Oracle’s “bloom filter pruning”, SQL Server’s “bitmap filtering”). PostgreSQL’s runtime pruning (ExecFindMatchingSubPlans) handles parameterized nested loops and generic plans, but does not yet build a runtime value set from a sibling operator to prune with — a recognized gap on the hackers list.

Adaptive / learned partitioning. A research frontier (and the subject of DSC §21.3’s skew discussion) is choosing the partition boundaries automatically from the data distribution rather than by DBA fiat. Self-driving database work (Pavlo et al., “Self-Driving Database Management Systems”, CIDR 2017) and learned-structure work (Kraska et al., “The Case for Learned Index Structures”, SIGMOD 2018) both point at using observed workloads to choose partition counts and breakpoints; the “learned partitioning” idea replaces the sorted datums[] array with a learned model that predicts the partition for a key. PostgreSQL’s datums[] + binary search is the classical baseline these compare against — and its last_found cache is a hand-rolled approximation of the locality that a learned model would exploit.

The column-store axis. Database Internals (Petrov, ch. 1) draws the orthogonal distinction between horizontal partitioning (splitting rows, what PostgreSQL declarative partitioning does) and vertical partitioning (splitting columns, what column-stores do). Analytical engines (ClickHouse, DuckDB, Snowflake) combine both: horizontal partitions (“parts” / “micro-partitions”) each stored column-wise, with per-partition min/max “zone maps” that prune at a far finer grain than PostgreSQL’s per-partition bounds. PostgreSQL’s row-store partitions prune whole tables; a zone-map column-store prunes individual column blocks within a partition. The two techniques compose, and bringing zone-map-style block pruning into PostgreSQL’s heap (via the BRIN index, covered elsewhere) is the row-store analogue.

Where PostgreSQL’s design shines. The unification of routing and pruning under one PartitionBoundInfo, and of plan-time and run-time pruning under one PartitionPruneStep program, is unusually clean. Many engines bolt runtime pruning on as a separate code path; PostgreSQL’s “compile once, interpret in either phase” structure means a single correctness argument covers both. The cost is that PostgreSQL pruning is interpretive (a step program walked at runtime) rather than compiled to native code — a deliberate trade favoring maintainability over the last increment of speed, consistent with the executor’s interpreter-first philosophy described in postgres-executor.md.

Source tree (REL_18_STABLE, commit 273fe94, read 2026-06-05):

  • src/backend/partitioning/partbounds.c — bound canonicalization (partition_bounds_create, create_*_bounds), the value-lookup primitives (partition_*_bsearch, compute_partition_hash_value), and the partitionwise bound-merge (partition_bounds_merge, merge_*_bounds).
  • src/backend/partitioning/partprune.c — pruning step compilation (gen_partprune_steps), the interpreter (get_matching_partitions, perform_pruning_*_step), and strategy matchers (get_matching_*_bounds).
  • src/backend/partitioning/partdesc.cRelationBuildPartitionDesc / RelationGetPartitionDesc (the cached, snapshot-qualified descriptor).
  • src/backend/executor/execPartition.c — tuple routing (ExecFindPartition, get_partition_for_tuple, PartitionTupleRouting/PartitionDispatch) and runtime pruning (ExecDoInitialPruning, ExecFindMatchingSubPlans).
  • src/backend/catalog/partition.c — partition catalog helpers (get_partition_parent, get_partition_ancestors, get_default_partition_oid, map_partition_varattnos).
  • src/include/partitioning/partbounds.h, src/include/partitioning/partdesc.h, src/include/partitioning/partprune.h, src/include/nodes/plannodes.h — the PartitionBoundInfoData, PartitionDescData, PartitionPruneContext, and PartitionPruneStep* type definitions.

Theory:

  • Database System Concepts, Silberschatz, Korth & Sudarshan, 7th ed., ch. 21 “Parallel and Distributed Storage” — §21.2 “Data Partitioning” (range / list / hash strategies, partitioning function and attribute), §21.3 “Dealing with Skew in Partitioning”. Captured at knowledge/research/dbms-general/database-system-concepts.md.
  • Database Internals, Alex Petrov, ch. 1 — horizontal vs. vertical partitioning, column-store data layout. Captured at knowledge/research/dbms-general/database-internals.md.

Related KB docs (cross-references, not duplicated here):

  • postgres-ddl-execution.mdCREATE TABLE ... PARTITION BY, ATTACH / DETACH PARTITION, and bound-overlap validation (check_new_partition_bound).
  • postgres-planner-overview.mdAppend / MergeAppend path generation over surviving partitions and try_partitionwise_join.
  • postgres-scan-nodes.md — the per-partition leaf scan node mechanics.
  • postgres-executor.md — the PlanState tree, ExecInitNode, and the pruned-subplan wrinkle in the state/plan-tree mapping.
  • postgres-expression-eval.md — the ExprState machinery behind FormPartitionKeyDatum and partkey_datum_from_expr.
  • postgres-relcache.md — relcache invalidation that rebuilds the cached PartitionDesc.