PostgreSQL Parallel Query — From No Parallelism to Parallel Joins, Aggregates, and Maintenance
This is a version-evolution document. It traces how one subsystem —
intra-query parallelism — was built up across PostgreSQL major releases,
ending at the current REL_18 state. It does not re-derive the
mechanism; for how parallel execution works today (the DSM plan, the
shm_mq tuple queues, worker state restoration, gather_readnext), read
the current-state module doc
postgres-parallel-query.md. This doc
answers a different question: what changed in each release, and why.
Contents:
- Why this subsystem had to evolve — the single-process limitation and what made parallelism finally feasible.
- Timeline — the eras from 9.6 through REL_18, left to right.
- Pre-9.6 — the single-process executor — the starting point and the infrastructure that had to land first.
- 9.6 — parallel sequential scan and Gather — the first parallel node; the partial-path concept; the read-only contract.
- 10 — parallel joins, parallel aggregate, and GatherMerge — making the subtree below Gather worth parallelizing; order-preserving merge.
- 11 — parallel CREATE INDEX and parallel-aware Append — parallelism enters DDL and the partition-scan layer.
- 12 — planner and cost refinements — parallel-safety bookkeeping and better partial-path generation.
- 13 — parallel VACUUM — parallelism reaches maintenance.
- 14–18 — incremental hardening — leader-participation tuning, more parallel-safe paths, and node-level extensions.
- Where it stands at REL_18 — the consolidated design and the PG19 forward note.
- Sources — release notes, module docs, key source files.
Why this subsystem had to evolve (the original limitation)
Section titled “Why this subsystem had to evolve (the original limitation)”For roughly its first two decades PostgreSQL executed every query in a
single backend process. One client connection mapped to one OS process
(forked by the postmaster), and that one process ran the entire plan tree
top to bottom by itself. This is the model still described structurally in
postgres-executor.md: the executor is a
demand-pull tree of PlanState nodes, and ExecProcNode pulls one tuple
at a time up through that tree inside a single process.
That design was a deliberate, defensible choice. A single-process executor is simple to reason about: there is exactly one snapshot, one transaction state, one set of locks, one memory hierarchy of contexts, and no cross-process coordination to get wrong. For OLTP — many small concurrent transactions — it is also close to optimal, because the server is already parallel across connections; each short query wants a whole core to itself, not a fraction of several.
The limitation bit hard on a different workload: a single large query on a
single large table or join. Consider SELECT count(*) FROM big WHERE ...
over a hundred-gigabyte table on a machine with 32 idle cores. The
single-process executor scans the whole heap on one core while the other
31 sit idle. Analytic and reporting queries — big sequential scans, large
hash joins, wide aggregations — were fundamentally CPU- and scan-bound on a
single thread, and PostgreSQL simply could not spend more than one core on
them no matter how much hardware was available.
Three things had to be true before this could be fixed, and they did not all arrive at once. That staging is why the evolution took several releases rather than landing in one:
-
A way to run code in extra processes tied to a transaction. PostgreSQL chose processes, not threads, so “use more cores” means “fork more backends” — and those backends must share the originating transaction’s visibility, not start their own. The background worker facility (9.3, generalized in 9.4) and dynamic shared memory (DSM) (9.4) were the enabling primitives. Without DSM there was no post-startup shared region to hand a plan and a result channel through.
-
A way to ship state between processes so visibility matches. A worker must see exactly what the leader sees: the same snapshot, the same in-progress transaction’s XID and combo CIDs, the same GUC settings, the same loaded libraries and user identity. That serialization machinery — the
ParallelContextinsrc/backend/access/transam/parallel.c— is itself a substantial subsystem that had to be built and is documented inpostgres-parallel-query.md. -
A planner that knows when parallelism is safe and cheaper. A function marked
VOLATILEwith side effects, a query that writes, a node that cannot be split across workers — all of these make parallelism wrong or illegal. The optimizer needed parallel-safety classification of every function and plan node, and a cost model that knows the price of launching workers and shuffling tuples through queues, before it could choose a parallel plan. That bookkeeping kept expanding release after release, which is much of why later versions parallelize more node types than earlier ones.
The arc below is the story of those three capabilities maturing: first the plumbing and a single parallel node (9.6), then a rich parallelizable subtree (10), then DDL and partition-aware parallelism (11), then maintenance (13), then years of widening the set of things the planner is willing and able to run in parallel.
Timeline
Section titled “Timeline”timeline
title PostgreSQL Intra-Query Parallelism — Evolution
section Pre-9.6 (≤2016)
Single-process executor : One backend runs the whole plan
: DSM (9.4) + bgworkers (9.3/9.4) land as enabling plumbing
section 9.6 (2016)
First parallel node : Parallel Seq Scan
: Gather node + per-worker tuple queues
: Partial paths, read-only parallel mode
section 10 (2017)
Rich parallel subtree : Parallel hash / merge / nestloop joins
: Partial + Finalize aggregation
: GatherMerge preserves sort order
: Parallel bitmap heap scan, parallel index scan
section 11 (2018)
Parallelism in DDL and partitions : Parallel CREATE INDEX (btree)
: Parallel-aware Append over partitions
: Parallel Hash (shared hash table)
section 12 (2019)
Planner refinements : Parallel-safety bookkeeping tightened
: Better partial-path generation
section 13 (2020)
Parallelism in maintenance : Parallel VACUUM (indexes in parallel)
: Parallel CREATE INDEX cost tuning
section 14-18 (2021-2025)
Incremental hardening : Leader-participation tuning
: More parallel-safe functions and paths
: Node-level extensions, REL_18 current state
Pre-9.6 — the single-process executor (the starting point)
Section titled “Pre-9.6 — the single-process executor (the starting point)”Before 9.6 there was no intra-query parallelism at all. The starting point is worth pinning down precisely, because every later era is a delta against it.
What execution looked like. The postmaster forked one backend per
connection. That backend parsed, planned, and executed the query inside
itself. Execution was the demand-pull PlanState tree still documented in
postgres-executor.md: ExecutePlan repeatedly
calls ExecProcNode on the top node, which recursively pulls tuples from
its children. A Seq Scan over a billion-row table returned those rows one
at a time, all on one core, with the rest of the machine idle for the
duration.
Why it could not simply spawn threads. PostgreSQL’s process-per-backend model means there is no shared address space to fan work into. Two design facts made naive parallelism impossible:
- Shared memory was fixed at postmaster startup. The main shared-memory segment (buffers, locks, ProcArray) was sized once and never grew. There was no general mechanism for a running backend to create a new shared region, hand it to a helper process, and tear it down at query end.
- Transaction state lived in process-local memory. The active snapshot, the current XID, combo CIDs, the resource owner tree — all of it was per-process. A helper process forking fresh would compute different visibility and corrupt results.
The enabling plumbing (9.3–9.4). The fix to the first problem arrived before parallel query did, as standalone infrastructure:
- Background workers (9.3, generalized in 9.4): a way to register and launch extra processes managed by the postmaster but tied to a backend’s lifetime.
- Dynamic shared memory / DSM (9.4): a backend can now create a shared segment after startup, addressable by cooperating processes, and destroy it when done. This is the substrate every parallel plan rides on.
- The shared-memory message queue (
shm_mq) and the table-of-contents (shm_toc) layout helpers: a single-reader / single-writer byte pipe over DSM, plus a keyed directory so processes can find sub-structures inside one segment by a known key.
None of this was yet parallel query. It was the runway. 9.6 was the first plane to use it.
Cross-link. The structural model that this era establishes — the
PlanState tree, ExecProcNode, the single registered snapshot — is the
baseline described in
postgres-executor.md. Parallel query does not
replace it; it replicates it across processes. Each worker runs its own
copy of essentially this same tree.
9.6 — parallel sequential scan and Gather (the first parallel node)
Section titled “9.6 — parallel sequential scan and Gather (the first parallel node)”PostgreSQL 9.6 introduced intra-query parallelism for the first time.
The release is the genesis of everything in
postgres-parallel-query.md: the Gather
node, the DSM-serialized plan, the per-worker tuple queues, and the
read-only parallel-mode contract all date from here.
What changed
Section titled “What changed”Three new concepts landed together, and they are still the load-bearing ones today:
-
The
Gathernode. A new executor node that sits above a partial plan. At execution time it launches N background workers, each running a copy of the subtree below it, and collects their output tuples into a single stream for its parent.Gatheris the boundary between “many processes” below and “one process” above. Its current implementation lives insrc/backend/executor/nodeGather.c. -
Partial paths and the parallel-aware subtree. The planner gained the notion of a partial path: a path that, when run by N workers simultaneously, collectively produces the full result with no duplication. The canonical first example is Parallel Seq Scan — each worker grabs a disjoint block range of the heap (coordinated through shared state in the scan descriptor) so that the union of all workers’ output equals one full scan. The planner builds these in the partial-path machinery now in
src/backend/optimizer/path/allpaths.c. -
The DSM-carried plan and tuple queues. On first execution
Gatherserializes the plan into a DSM segment, the postmaster forks workers, each worker attaches the segment, restores the leader’s transaction state so visibility matches, runs its executor copy, and writesMinimalTuples into a per-workershm_mqtuple queue. The leader drains those queues. This is exactly theExecInitParallelPlan/ParallelWorkerMain/gather_readnextflow detailed inpostgres-parallel-query.md.
The motivation was the limitation from the opening section: large scans were stuck on one core. A parallel sequential scan is the simplest possible win — heap blocks are independent, so dividing block ranges among workers needs no data exchange beyond a shared “next block” counter. It was the right first target precisely because it required the least coordination while exercising all the new plumbing end to end.
The structural shift (before → after)
Section titled “The structural shift (before → after)”flowchart TB
subgraph before["Before 9.6 — single process"]
A1["Aggregate count(*)"] --> A2["Seq Scan big"]
note1["One backend\nscans all blocks\non one core"]
end
subgraph after["9.6 — Gather + Parallel Seq Scan"]
B1["Aggregate count(*)\n(leader, finalizes)"] --> B2["Gather\nleader collects tuples"]
B2 --> B3["Partial Aggregate\nworker 1"]
B2 --> B4["Partial Aggregate\nworker 2"]
B2 --> B5["Partial Aggregate\nleader"]
B3 --> B6["Parallel Seq Scan\ndisjoint block range"]
B4 --> B7["Parallel Seq Scan\ndisjoint block range"]
B5 --> B8["Parallel Seq Scan\ndisjoint block range"]
end
The key shape change: a Gather node is spliced into the tree, and the
subtree below it becomes parallel-aware (multiple concurrent copies share
the scan). Above Gather the world is single-process as before. Note that
even in 9.6 a count(*) already needed a partial aggregate below Gather
and a final aggregate above it — but 9.6’s aggregate support was limited;
the rich partial/finalize split came into its own in PG10.
The read-only contract
Section titled “The read-only contract”9.6 made parallel mode strictly read-only, enforced by entering parallel
mode (EnterParallelMode) which forbids the leader and workers from
assigning new XIDs or writing. A query that writes, calls a parallel-unsafe
function, or uses constructs the planner cannot reason about simply runs
serially. This conservative contract — parallelize only what is provably
safe — is the design invariant that the rest of the evolution slowly
relaxes at the edges (more functions and node types proven safe) without
ever abandoning the core rule. See the parallel-mode discussion in
postgres-parallel-query.md.
Cross-link
Section titled “Cross-link”The mechanism introduced here — ExecInitParallelPlan, the shm_toc
layout, worker state restoration, gather_readnext — is documented as the
current design in
postgres-parallel-query.md. 9.6 is where
it was born; the later eras add new consumers of this same plumbing rather
than replacing it.
10 — parallel joins, parallel aggregate, and GatherMerge (a worth-it subtree)
Section titled “10 — parallel joins, parallel aggregate, and GatherMerge (a worth-it subtree)”9.6 could parallelize a scan, but the subtree below Gather was thin: a parallel scan, a filter, a limited partial aggregate. The moment a query needed a join or a real grouped aggregate, the plan fell back to serial. PostgreSQL 10 is the release that made the partial subtree rich enough to be worth parallelizing for real workloads.
What changed
Section titled “What changed”-
Parallel joins — all three strategies. 10 taught the planner to put a join below Gather so each worker joins its share of the outer side:
- Parallel hash join: in 10 each worker builds its own private copy of the inner hash table from the (shared) inner scan, then probes it with its slice of the outer rows. (The shared hash table — one hash table built cooperatively by all workers — is a PG11 addition; see the next era.)
- Parallel merge join and parallel nested-loop join: the outer
side is a partial path (workers each take a slice); the inner side is
re-executed per outer row / merged as usual within each worker.
The join-strategy mechanics themselves are documented in
postgres-join-nodes.md; what 10 added was the planner’s willingness to generate partial join paths and the executor’s parallel-awareness of the outer input.
-
Partial + Finalize aggregation. This is the centerpiece. 10 split aggregation into two cooperating nodes:
- Partial Aggregate (below Gather, one per worker): each worker
computes a partial aggregate state over its slice — e.g. a running
(count, sum)for anavg. - Finalize Aggregate (above Gather, in the leader): combines the
per-worker partial states into the final result using each aggregate’s
combine function (
aggcombinefn) and emits the user-visible value via its final function. This required every parallelizable aggregate to declare a combine function (combinefn not set for aggregate functionis the error you hit otherwise). The partial/finalize split is what letsSUM,COUNT,AVG,MIN/MAXand friends run across workers. Aggregate execution internals are inpostgres-agg-sort-nodes.md.
- Partial Aggregate (below Gather, one per worker): each worker
computes a partial aggregate state over its slice — e.g. a running
-
GatherMerge — order-preserving collection. Plain
Gathercollects tuples in arbitrary arrival order (round-robin over whichever worker queue has data). That destroys any sort order the workers produced, so a query withORDER BYcould not benefit: it would have to re-sort above Gather. 10 added GatherMerge (src/backend/executor/nodeGatherMerge.c): if each worker produces its slice already sorted, GatherMerge binary-heap-merges the per-worker streams to produce a single sorted output — the parallel analogue of a merge step. This unlocked parallelORDER BY, parallel merge joins feeding sorted output upward, and parallel grouped aggregation that relies on sorted input. -
More parallel scan types. 10 also added parallel bitmap heap scan and parallel index scan, so the leaf of the parallel subtree was no longer limited to a plain sequential scan.
After 9.6, the bottleneck moved from “can we scan in parallel” to “the interesting part of an analytic query — the join and the aggregate — still runs serially.” A parallel scan feeding a serial hash join wastes most of the benefit, because the join is where the CPU goes. Making joins and aggregates partial-aware is what turned parallel query from a proof-of-concept into something that accelerated real reporting queries. GatherMerge was the companion fix: without it, every sorted parallel plan paid a full re-sort at the leader, often erasing the gain.
The structural shift (before → after)
Section titled “The structural shift (before → after)”flowchart TB
subgraph nine6["9.6 — thin subtree, serial join"]
C1["Hash Join\n(serial, leader only)"] --> C2["Gather"]
C1 --> C3["Hash inner\n(serial)"]
C2 --> C4["Parallel Seq Scan\nouter"]
end
subgraph ten["10 — join + aggregate below Gather"]
D1["Finalize Aggregate\n(leader, combinefn)"] --> D2["GatherMerge\nmerge sorted streams"]
D2 --> D3["Partial Aggregate\nworker"]
D3 --> D4["Parallel Hash Join\nworker"]
D4 --> D5["Parallel Seq Scan\nouter slice"]
D4 --> D6["Hash inner\nprivate per worker"]
end
The shift: in 9.6 the join sat above Gather (serial); in 10 the entire join-and-partial-aggregate stack sits below Gather, with only the cheap finalize step left for the leader, and GatherMerge preserves order on the way up.
Cross-link
Section titled “Cross-link”The planner-side machinery that decides to generate these partial join and
partial aggregate paths — partial path generation, the parallel cost terms —
is the optimizer behavior summarized in
postgres-planner-overview.md. The
executor-side GatherMerge heap-merge and the partial/finalize node
execution are part of the current design in
postgres-parallel-query.md and
postgres-executor.md.
11 — parallel CREATE INDEX and parallel-aware Append (parallelism leaves SELECT)
Section titled “11 — parallel CREATE INDEX and parallel-aware Append (parallelism leaves SELECT)”Through 10, parallelism lived entirely inside read query execution. PG11 broke parallelism out of pure SELECT in two directions: into DDL (building an index) and into the partition layer (handing whole partitions to workers). It also upgraded the parallel hash join from private to shared hash tables.
What changed
Section titled “What changed”-
Parallel CREATE INDEX (btree). Building a btree is dominated by one thing: sorting every row’s index key. PG11 parallelized that sort. The leader and workers cooperatively scan the heap and feed a parallel external sort (Tuplesort gained a parallel mode); the workers produce sorted runs, and the final merge builds the btree. The driver lives in
src/backend/access/nbtree/nbtsort.c(_bt_parallel_*helpers, theBTBuildStateshared between leader and workers). This is the first time parallelism touched a write-side operation — legal becauseCREATE INDEXreads a stable snapshot of the heap and the only writer of the new index is the build itself. The sort mechanics are inpostgres-tuplesort.md; the build flow is inpostgres-index-creation.md. -
Parallel-aware Append. Before 11, an
Appendover a partitioned table (or aUNION ALL) ran its child subplans one after another within each worker. PG11 madeAppendparallel-aware (src/backend/executor/nodeAppend.c, theParallelAppendState): the children are placed in shared memory and workers claim subplans to run, so different workers process different partitions concurrently. Larger non-partial children are scheduled first (and can be claimed by only one worker), while partial children can be shared by several. This is the key to parallelizing queries over partitioned tables where each partition is itself only moderately large — the win comes from running partitions across workers rather than parallelizing each scan in isolation. -
Parallel Hash (shared hash table). PG10’s parallel hash join had each worker build a private copy of the inner hash table — duplicated work and duplicated memory. PG11 added Parallel Hash: the workers cooperatively build one shared hash table in DSM, partitioning the build work among themselves, then all probe the single shared table. This removed the per-worker memory blowup and made parallel hash join scale to large inner relations. The hash-join internals are in
postgres-join-nodes.md.
Two distinct pressures:
- Index builds were a serial wall during data loads and
pg_restore. After bulk-loading a large table, theCREATE INDEXstep ran on one core for minutes to hours. Parallelizing the sort attacks exactly the dominant cost, with no concurrency hazard because the new index is private until commit. - Partitioning had become a first-class feature (declarative
partitioning landed in 10), so queries over many partitions were common —
but a serial
Appendmeant only one partition was being scanned at a time even when the planner had chosen parallelism. Parallel-aware Append let the partition dimension itself become a source of parallelism.
The shared-hash upgrade is the natural maturation of PG10’s parallel hash join: once joins ran in parallel, the wasted per-worker hash table was the obvious next inefficiency to remove.
The structural shift (before → after)
Section titled “The structural shift (before → after)”flowchart TB
subgraph before["10 — serial Append, private hash"]
E1["Append\n(serial: child by child)"] --> E2["Scan part_1"]
E1 --> E3["Scan part_2"]
E1 --> E4["Scan part_3"]
E5["CREATE INDEX\n(serial sort, one core)"]
end
subgraph after["11 — parallel Append, shared hash, parallel index build"]
F1["Gather"] --> F2["Parallel Append\nworkers claim subplans"]
F2 --> F3["worker A -> part_1"]
F2 --> F4["worker B -> part_2"]
F2 --> F5["worker C -> part_3"]
G1["CREATE INDEX\nparallel Tuplesort"] --> G2["worker sorted run"]
G1 --> G3["worker sorted run"]
G1 --> G4["leader merge -> btree"]
end
The shift on the read side: Append stops being a serial loop over children
and becomes a work queue of subplans that workers pull from. On the
write side: index construction grows a parallel sort below it for the first
time.
Cross-link
Section titled “Cross-link”The current parallel-aware Append execution (ParallelAppendState,
subplan claiming) is part of the executor design in
postgres-executor.md and the parallel plumbing in
postgres-parallel-query.md. Parallel index
build flow is covered by
postgres-index-creation.md and
postgres-tuplesort.md.
12 — planner and cost refinements (no new node, a sharper planner)
Section titled “12 — planner and cost refinements (no new node, a sharper planner)”PG12 added no headline new parallel node, but it is a meaningful era because the constraint on parallelism by this point was no longer “can the executor do it” — it was “will the planner choose it correctly, and is the work classified safe.” 12 is representative of the steady planner-side investment that runs through every release.
What changed
Section titled “What changed”- Parallel-safety bookkeeping tightened. The optimizer’s classification of expressions, functions, and subplans as parallel-safe / -restricted / -unsafe continued to be refined so that more plans qualified for parallelism without risking incorrect results. (The three-level classification is the gate that decides whether a Gather may be placed above a given subtree at all.)
- Better partial-path generation and costing. The planner became more
willing and more accurate about generating partial paths in more places,
and the parallel cost terms (worker startup, tuple-queue transfer) were
refined so the chosen degree of parallelism better matched reality. The
cost framework is documented in
postgres-cost-model.md. - Surrounding features made more parallel-friendly. As features like partition pruning and partitionwise join matured, the interaction with parallel Append and parallel join improved, so partitioned analytic queries got faster plans even without a new executor node.
By PG12 the executor could parallelize scans, joins, aggregates, sorts (via GatherMerge), Append, and index builds. The remaining gains were less about new mechanisms and more about the planner applying the existing mechanisms in more situations and choosing degree-of-parallelism well. A node that the executor can run in parallel is useless if the planner never emits it or prices it wrong.
The structural shift
Section titled “The structural shift”There is no node-shape change to draw here; the shift is in the planner’s decision surface — more subtrees now qualify for a Gather, and the degree of parallelism is chosen more accurately. This is the kind of era that does not change the diagrams but changes how often the parallel diagrams from earlier eras actually get used.
Cross-link
Section titled “Cross-link”Parallel-safety classification and partial-path costing are part of the
optimizer behavior in
postgres-planner-overview.md and
postgres-cost-model.md.
13 — parallel VACUUM (parallelism reaches maintenance)
Section titled “13 — parallel VACUUM (parallelism reaches maintenance)”PG13 extended parallelism to maintenance, the third major domain after
read queries (9.6–10) and DDL index builds (11). It made VACUUM able to
process a table’s indexes in parallel.
What changed
Section titled “What changed”VACUUM over a table with many indexes spends much of its time in the
index-vacuum phase: for each index, scan it and remove pointers to
dead heap tuples. Serially, that is one index after another. PG13’s
parallel vacuum (src/backend/commands/vacuumparallel.c) launches
background workers and assigns each index to a worker, so multiple
indexes are vacuumed (and cleaned up) concurrently. The leader still
performs the heap scan and coordinates; the parallelism is across the
indexes, which is where the embarrassingly-parallel work is — each index
is an independent structure.
Key design points, distilled (the mechanism itself is documented in
postgres-vacuum.md, “Parallel vacuum”):
- Parallelism is per-index, not per-block within one index. An index is
eligible only if it is large enough to be worth a worker
(
min_parallel_index_scan_size) and its access method supports parallel vacuum. - It applies to the index vacuum and index cleanup phases, not the heap scan. The dead-TID list the workers consume is built by the leader’s heap pass and shared through DSM.
- It is available to manual
VACUUM (PARALLEL n)and plainVACUUM; autovacuum does not use it by default, to avoid autovacuum workers multiplying into many processes.
Maintenance windows were a real operational pain on large tables with many
indexes: a nightly VACUUM could run for hours, almost entirely
single-threaded, while the box had idle cores. Because indexes are
independent structures, vacuuming them concurrently is a clean,
low-coordination win — structurally the same insight as parallel Append
(independent children) and parallel scan (independent blocks): find the
axis along which the work is already disjoint, and assign disjoint pieces to
workers. Crucially, parallel vacuum reuses the same ParallelContext
plumbing from src/backend/access/transam/parallel.c that 9.6 built for
query execution — this era is mostly a new consumer of existing
infrastructure, not new infrastructure.
The structural shift (before → after)
Section titled “The structural shift (before → after)”flowchart TB
subgraph before["<=12 — serial index vacuum"]
H1["VACUUM leader"] --> H2["heap scan -> dead TIDs"]
H2 --> H3["vacuum index 1"]
H3 --> H4["vacuum index 2"]
H4 --> H5["vacuum index 3"]
end
subgraph after["13 — parallel index vacuum"]
I1["VACUUM leader"] --> I2["heap scan -> dead TIDs (shared in DSM)"]
I2 --> I3["worker -> vacuum index 1"]
I2 --> I4["worker -> vacuum index 2"]
I2 --> I5["leader -> vacuum index 3"]
end
The shift: the serial chain of per-index vacuum steps becomes a fan-out where each index is an independent unit of work claimed by a worker (or the leader), all consuming the one shared dead-TID list.
Cross-link
Section titled “Cross-link”The current parallel-vacuum design — eligibility rules, the DSM-shared
dead-TID list, the leader/worker split — is documented in
postgres-vacuum.md. The underlying
ParallelContext it rides on is the same one described in
postgres-parallel-query.md.
14–18 — incremental hardening (more safe paths, better tuning)
Section titled “14–18 — incremental hardening (more safe paths, better tuning)”From PG14 onward the evolution shifts from “add a new parallel domain” to hardening and widening what already exists. There is no single dramatic node in this span; instead, each release files down rough edges. Grouping them is honest about their character — they are deltas, not eras with a new architecture.
What changed (themes across 14–18)
Section titled “What changed (themes across 14–18)”-
Leader-participation tuning. The leader process can also run the partial plan itself (
parallel_leader_participation), but on some plans — notably certain parallel hash and GatherMerge shapes — a busy leader stalls draining the worker queues, hurting throughput. Successive releases refined when the leader participates and how the planner costs leader participation, so the default behavior is better on more plan shapes. The leader-participation mechanics are part of the current design inpostgres-parallel-query.md. -
More parallel-safe functions and constructs. The set of built-in functions and SQL constructs marked parallel-safe kept growing, and more query shapes (additional aggregate/window situations, more subquery forms, more of the partitionwise machinery) became eligible for parallelism. Each such change lets a Gather sit above subtrees that previously forced serial execution.
-
Node-level extensions and better partial paths. More plan node types gained partial-path variants or improved parallel costing, and partitionwise join/aggregate interplay with parallel Append continued to improve, so partitioned analytic queries got steadily better parallel plans.
-
Infrastructure that parallel query benefits from indirectly. REL_18 brings a unified asynchronous I/O subsystem (see
postgres-aio.md); parallel scans are among the workloads that benefit from better I/O concurrency underneath the per-worker scans, even though AIO is not itself “parallel query.”
By PG14 the architecture of parallel query was essentially complete: Gather / GatherMerge, the ParallelContext plumbing, partial paths, parallel joins/aggregates/Append, parallel index build, parallel vacuum. The remaining returns came from coverage and tuning — making the planner choose parallelism correctly in more cases and making the chosen plans run efficiently — rather than from inventing new parallel mechanisms. This is the expected maturity curve: the hard structural work front-loads into a few releases (9.6–13), then a long tail of incremental polish.
The structural shift
Section titled “The structural shift”No new node-shape diagram applies; the change is quantitative (more subtrees qualify, better-tuned execution) rather than structural. The “after” picture is simply the union of all the earlier diagrams, now reachable from more queries and tuned to run better — which is exactly the REL_18 picture below.
Where it stands at REL_18 (the current design)
Section titled “Where it stands at REL_18 (the current design)”At REL_18 (commit 273fe94, PG 18.x) intra-query parallelism is a mature,
multi-domain subsystem built on one consistent foundation. The current
design — which this doc deliberately does not re-derive — is documented
in postgres-parallel-query.md. In summary
of where the arc lands:
- The boundary node. Every parallel plan is a single-process tree with a
Gather(arbitrary order) orGatherMerge(sorted) node spliced in. Below it the subtree is parallel-aware; above it execution is single-process. Implementations:src/backend/executor/nodeGather.c,src/backend/executor/nodeGatherMerge.c. - The plumbing. On first execution
ExecInitParallelPlan(src/backend/executor/execParallel.c) serializes the plan into a DSM segment laid out by anshm_toc, sizes it by walking the PlanState tree, and lays out per-workershm_mqtuple queues. TheParallelContextinsrc/backend/access/transam/parallel.cforks the workers and restores leader state (snapshot, XID, combo CIDs, GUCs, libraries, user ID) so visibility matches.ParallelWorkerMainruns each worker’s executor copy; the leader’sgather_readnextdrains the queues. - What can be parallelized. Sequential / index / bitmap-heap scans;
hash / merge / nested-loop joins (with a shared Parallel Hash);
partial+finalize aggregation; sorts via GatherMerge; parallel-aware Append
over partitions and
UNION ALL; parallel btreeCREATE INDEX; and parallelVACUUMof indexes. The read-only contract from 9.6 still holds: parallel mode forbids writes, and only provably parallel-safe work is eligible. - Where each piece came from. Gather + parallel seq scan (9.6); joins, partial aggregate, GatherMerge (10); parallel CREATE INDEX, parallel-aware Append, shared Parallel Hash (11); planner/cost refinements (12); parallel VACUUM (13); leader-participation and coverage tuning (14–18).
The single most important structural fact is the one the timeline makes
visible: 9.6–13 built the mechanisms; everything since widens and tunes
them on the same ParallelContext substrate. New domains (DDL,
maintenance) were added by becoming consumers of the 9.6 plumbing, not by
reinventing it.
PG19 forward note. PostgreSQL development continues to push parallel coverage outward — more node types and utility operations gaining parallel-safe paths, and continued tuning of when parallelism and leader participation pay off. Treat any specific PG19 item as a just-released forward step, not as REL_18 behavior; the REL_18 design above is the current state of record.
Sources
Section titled “Sources”Release notes (feature attributions verified against these):
- PostgreSQL 9.6 release notes — parallel sequential scan, parallel joins (initial), Gather node, parallel aggregate (initial), background-worker and DSM infrastructure.
- PostgreSQL 10 release notes — parallel merge join, parallel bitmap heap scan, parallel index scan, improved parallel aggregation (partial/finalize), GatherMerge.
- PostgreSQL 11 release notes — parallel CREATE INDEX (btree), parallel-aware Append, parallel hash join with shared hash table.
- PostgreSQL 12 release notes — parallel-query planner and cost refinements; partitioning interactions.
- PostgreSQL 13 release notes — parallel VACUUM.
- PostgreSQL 14–18 release notes — incremental parallel-safety and leader-participation improvements; REL_18 (commit 273fe94) current state.
Current-state module docs (mechanism lives here — cited, not duplicated):
postgres-parallel-query.md— Gather / GatherMerge, DSM plan, ParallelContext, worker state restoration,gather_readnext, parallel mode.postgres-executor.md— the PlanState tree,ExecProcNode, parallel-aware node execution, parallel Append.postgres-planner-overview.md— partial-path generation, parallel-safety classification.postgres-vacuum.md— parallel vacuum design.postgres-cost-model.md— parallel cost terms.postgres-join-nodes.md,postgres-agg-sort-nodes.md,postgres-index-creation.md,postgres-tuplesort.md,postgres-aio.md— supporting mechanism for the parallelized operations.
Key source files (observable on REL_18, commit 273fe94):
src/backend/executor/execParallel.c—ExecInitParallelPlan, DSM/shm_toclayout, worker setup.src/backend/executor/nodeGather.c— Gather node,gather_readnext.src/backend/executor/nodeGatherMerge.c— order-preserving merge.src/backend/executor/nodeAppend.c—ParallelAppendState, parallel-aware Append (PG11).src/backend/access/transam/parallel.c—ParallelContext, worker fork and state restoration; reused by parallel vacuum and parallel index build.src/backend/access/nbtree/nbtsort.c—_bt_parallel_*, parallel btree build (PG11).src/backend/commands/vacuumparallel.c— parallel vacuum (PG13).src/backend/optimizer/path/allpaths.c— partial-path generation.