PostgreSQL Parallel Query — Gather, Workers, and the DSM Plan
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A single query, no matter how well optimized, runs on one CPU core in the
classic demand-pull iterator model (covered by postgres-executor.md).
When the bottleneck is a large scan or aggregate over millions of rows,
the only way to go faster is to split the work across cores. Database
System Concepts (Silberschatz, 7e, ch. 22 “Parallel and Distributed
Query Processing”) frames this as intraquery parallelism — “the
processing of different parts of the execution of a single query, in
parallel across multiple nodes” — and distinguishes it from interquery
parallelism (running many independent queries at once, the concern of a
transaction-processing system). Intraquery parallelism, the textbook
says, “is essential for speeding up long-running queries, and it is the
focus of this chapter” (§22.1).
The textbook then splits intraquery parallelism into two complementary sources:
- Intraoperation parallelism (§22.1, §22.2–22.4). A single relational operator — a scan, a join, an aggregate — is run in parallel by partitioning its input across nodes. “Since the number of tuples in a relation can be large, the degree of intraoperation parallelism is also potentially very large; thus, intraoperation parallelism is natural in a database system.” A sort over a range-partitioned relation, for example, sorts each partition independently and concatenates the results.
- Interoperation parallelism (§22.5.1). Different operators of one query run at the same time, either as a pipeline (operator B consumes A’s output as A produces it, on a different core) or independently (two subtrees with no dependency run concurrently). The textbook is candid that pipelined parallelism “does not scale up well” because pipeline chains are short and stalls propagate; “when the degree of parallelism is high, pipelining is a less important source of parallelism than partitioning” (§22.5.1.1).
The unifying mechanism the textbook settles on is the exchange operator (§22.5, §22.5.2), the contribution of Graefe’s Volcano system:
“a model of parallel query execution which breaks parallel query processing into two types of steps: partitioning of data using the exchange operator, and execution of operations on purely local data, without any data exchange. This model is surprisingly powerful and is widely used in parallel database implementations.” (DSC §22.5)
The exchange operator is the key abstraction because it encapsulates parallelism inside one operator type. Every other operator in the plan is written as if it were sequential, operating on “purely local data”; the exchange operator is the only place that knows about partitioning, data movement between processes, and the merge of multiple streams back into one. Drop an exchange into an otherwise sequential operator tree and the tree becomes parallel without any other operator being rewritten. Graefe’s 1990 paper is titled, precisely, “Encapsulation of Parallelism in the Volcano Query Processing System.”
Three properties of the exchange model matter before reading any engine’s source:
- Producers and a consumer. An exchange has N producer processes each running an identical copy of the subplan below it, and a consumer side that gathers their output streams into one. The producers move tuples to the consumer over some transport (a network socket on a shared-nothing cluster; shared memory on a multicore box).
- The subplan is unaware. The subplan below the exchange does not know it is one of N copies. Correctness requires that the N copies, taken together, produce the whole result with no duplication and no omission — which is a property of how the leaf scan partitions its input, not of the operators above it.
- Order is a separate concern. A plain gather makes no promise about the order tuples arrive in; if the query needs sorted output, the exchange must merge the already-sorted producer streams rather than simply concatenate them.
PostgreSQL is a faithful, multicore-shared-memory realization of this
model. Its exchange operator is the Gather node (and its
order-preserving variant GatherMerge); its N producers are
background worker processes; its transport is a shared-memory queue
(shm_mq); and “the subplan is unaware” becomes the parallel-aware
flag on the partial plan beneath the Gather. The rest of this document
traces those pieces in the REL_18 source.
Common DBMS Design
The textbook gives the model: an exchange operator over otherwise-local operators. This section names the engineering conventions a multicore-shared-memory engine adopts to make that model work, the problems the textbook’s shared-nothing framing leaves implicit. Read PostgreSQL’s choices in the next section as one point in this space.
Processes (or threads) plus a shared transport
The producers must run concurrently, so they are separate threads or processes; and they must hand tuples to the consumer cheaply, so they share an in-memory ring buffer rather than copying over a socket. A thread-per-worker engine shares the whole address space for free but pays in fragile global state and locking; a process-per-worker engine isolates faults and reuses the single-threaded executor unchanged, but must explicitly copy every piece of state a worker needs. PostgreSQL is firmly in the process camp (a parallel worker is a full backend), which is why so much of its machinery is about state transfer.
A serialized snapshot of the leader’s state
A worker process starts empty. To run the same plan and see the same rows, it must be handed the plan itself, the bound parameters, the MVCC snapshot, the set of in-progress transaction IDs (so visibility checks agree), combo-CID mappings, GUC settings, the authenticated user, and the loaded libraries. The convention is to serialize all of this into a shared segment the workers can read, with a small table-of-contents index so each consumer finds its piece by a well-known key. The plan is serialized by the engine’s generic node-tree serializer; the snapshot and transaction state by purpose-built routines.
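The table-of-contents idea can be sketched in a few lines of C. This is a toy under stated assumptions, not PostgreSQL's shm_toc API: Toc, toc_init, toc_insert and toc_lookup are hypothetical names, and the index stores offsets rather than pointers because each process may map the shared segment at a different address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_MAX_ENTRIES 16
#define TOC_NOT_FOUND   ((size_t) -1)

/* One index entry: a well-known key and where its chunk starts. */
typedef struct TocEntry
{
    uint64_t key;
    size_t   offset;     /* offset into the segment, not a pointer */
} TocEntry;

/* Hypothetical table of contents at the head of a shared segment. */
typedef struct Toc
{
    size_t   nentries;
    size_t   used;       /* bytes handed out so far */
    size_t   capacity;   /* total bytes available for chunks */
    TocEntry entries[TOC_MAX_ENTRIES];
} Toc;

void
toc_init(Toc *t, size_t capacity)
{
    t->nentries = 0;
    t->used = 0;
    t->capacity = capacity;
}

/* Reserve `size` bytes under `key`; returns the chunk offset or TOC_NOT_FOUND. */
size_t
toc_insert(Toc *t, uint64_t key, size_t size)
{
    size_t off;

    assert(t->used <= t->capacity);
    if (t->nentries == TOC_MAX_ENTRIES || size > t->capacity - t->used)
        return TOC_NOT_FOUND;
    off = t->used;
    t->entries[t->nentries].key = key;
    t->entries[t->nentries].offset = off;
    t->nentries++;
    t->used += size;
    return off;
}

/* A consumer finds its piece by key with a linear scan of the index. */
size_t
toc_lookup(const Toc *t, uint64_t key)
{
    for (size_t i = 0; i < t->nentries; i++)
        if (t->entries[i].key == key)
            return t->entries[i].offset;
    return TOC_NOT_FOUND;
}
```

A linear scan is adequate here because the index holds only a handful of entries and each consumer looks its key up once at startup.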
Two-pass sizing of the shared segment
The shared segment must be allocated at one fixed size before any worker exists, but its size depends on the plan (how many nodes need shared state, how big the serialized plan is). The convention is a two-pass walk of the operator tree: an estimate pass that sums up every node’s space request, then the allocation, then an initialize pass that fills the now-allocated space. Both passes visit the identical node set in the identical order so offsets line up.
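A minimal sketch of the two-pass contract, in plain C. Everything here is illustrative: Node, estimate_pass, initialize_pass and build_segment are hypothetical names, and malloc stands in for the shared segment. The point is only that both passes walk the tree in the same order, so the byte total computed in pass one matches the offsets handed out in pass two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical operator-tree node. */
typedef struct Node
{
    size_t       need;     /* bytes of shared state this node requests */
    size_t       offset;   /* filled in by the initialize pass */
    struct Node *left;
    struct Node *right;
} Node;

/* Pass 1: sum every node's request and count the nodes visited. */
size_t
estimate_pass(const Node *n, int *nnodes)
{
    size_t total;

    if (n == NULL)
        return 0;
    (*nnodes)++;
    total = n->need;
    total += estimate_pass(n->left, nnodes);
    total += estimate_pass(n->right, nnodes);
    return total;
}

/* Pass 2: same traversal order; give each node its slice and zero it. */
void
initialize_pass(Node *n, char *base, size_t *cursor, int *nnodes)
{
    if (n == NULL)
        return;
    (*nnodes)++;
    n->offset = *cursor;
    memset(base + n->offset, 0, n->need);
    *cursor += n->need;
    initialize_pass(n->left, base, cursor, nnodes);
    initialize_pass(n->right, base, cursor, nnodes);
}

/* Estimate, allocate once, then initialize. Returns the segment or NULL. */
char *
build_segment(Node *root, size_t *size_out)
{
    int    est_nodes = 0;
    int    init_nodes = 0;
    size_t cursor = 0;
    size_t size = estimate_pass(root, &est_nodes);
    char  *seg = malloc(size > 0 ? size : 1);

    if (seg == NULL)
        return NULL;
    initialize_pass(root, seg, &cursor, &init_nodes);
    /* Guardrail: the passes must agree on node count and total bytes. */
    assert(est_nodes == init_nodes);
    assert(cursor == size);
    *size_out = size;
    return seg;
}
```

If the two passes ever disagreed (say, one skipped a node type the other handled), the guardrail would fire instead of letting a node write past its slice.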
Strictly read-only workers
Allowing workers to write would require propagating new XIDs, command-counter increments, and combo CIDs back to the leader and the other workers mid-flight, a synchronization problem with no cheap solution. The usual convention for a first-generation parallel executor is therefore to make parallel mode read-only: no INSERT/UPDATE/DELETE, no DDL, no XID assignment. An explicit enter/exit parallel mode bracket arms error checks that catch violations.
Graceful degradation and best-effort workers
Worker slots are a finite, shared resource (max_worker_processes). A
parallel plan must run correctly even if zero workers are obtained —
the consumer falls back to running the plan itself. So worker launch is
best-effort: request N, accept however many the postmaster grants, and
have the leader optionally participate as an N+1-th producer.
Out-of-band error propagation
A worker that hits an error cannot just exit(1): the leader would block
forever waiting for tuples. The convention is a dedicated error channel
(separate from the tuple channel) plus a signal that pokes the leader
to drain it; the leader re-throws the worker’s error as its own at the
next interrupt-check point, so parallel errors “just work” from the user’s
perspective.
Theory ↔ PostgreSQL mapping
By the time you reach a named PostgreSQL symbol, you should know what kind of thing it is:
| Theory / convention | PostgreSQL name |
|---|---|
| Exchange operator (gather side) | Gather / GatherState (ExecGather) |
| Order-preserving exchange | GatherMerge / GatherMergeState (ExecGatherMerge) |
| “Subplan is unaware” / parallel producer | partial plan with plan->parallel_aware true |
| Producer process | background worker running ParallelWorkerMain → ParallelQueryMain |
| Shared transport for tuples | per-worker shm_mq tuple queue (PARALLEL_TUPLE_QUEUE_SIZE = 64 KB) |
| Tuple format on the wire | MinimalTuple (no txn header), via TupleQueueReader |
| Serialized leader state segment | DSM segment created by InitializeParallelDSM |
| Table-of-contents index | shm_toc keyed by PARALLEL_KEY_* magic numbers |
| Serialized plan | ExecSerializePlan → nodeToString(PlannedStmt) |
| Two-pass sizing | ExecParallelEstimate then ExecParallelInitializeDSM |
| Per-node shared scratch (e.g. scan cursor) | DSA area (PARALLEL_KEY_DSA, dsa_create_in_place) |
| Read-only bracket | EnterParallelMode / ExitParallelMode |
| Graceful degradation | need_to_scan_locally, parallel_leader_participation |
| Out-of-band errors | error shm_mq + PROCSIG_PARALLEL_MESSAGE + ProcessParallelMessages |
| Per-execution parallel handle | ParallelExecutorInfo (pei) / ParallelContext (pcxt) |
The generic process-launch and DSM/shm_mq plumbing — ParallelContext,
InitializeParallelDSM, shm_mq, background-worker registration — is the
subject of postgres-shared-memory-ipc.md; this document covers the
executor’s use of it. Why a Gather appears in the plan at all, and
how partial paths are costed, belongs to postgres-planner-overview.md
and postgres-path-generation.md. The single-process iterator model the
workers each run is postgres-executor.md. This document covers the
seam: how Gather turns a partial plan into a DSM-serialized job, how
workers are launched and synchronized, and how the leader+worker tuple
streams are merged back into one.
PostgreSQL’s Approach
PostgreSQL’s parallel query is the exchange model rendered for a multicore shared-memory box, with five design decisions that give it its shape:
- The exchange operator is a Gather node, planted by the planner above a partial plan. The partial subtree’s leaf scan is parallel-aware (e.g. a parallel SeqScan whose block range is doled out from a shared counter), so that N identical copies cover the whole relation exactly once. GatherMerge is the variant that preserves the input sort order by heap-merging the worker streams.
- Workers are full backend processes, forked lazily on first execution. The Gather node does nothing parallel at ExecInitNode time; only when the first tuple is pulled does ExecGather build the DSM segment and ask the postmaster to launch workers: “it needs to allocate a large dynamic segment, so it is better to do it only if it is really needed.”
- The plan and all leader state travel through a DSM segment indexed by an shm_toc. ExecInitParallelPlan serializes a dummy PlannedStmt with nodeToString and sizes the segment by a two-pass walk of the PlanState tree; the generic InitializeParallelDSM layers on GUCs, snapshots, the txn XID set, combo CIDs, and libraries: everything a worker needs so its visibility checks match the leader’s.
- Tuples flow leader-ward over per-worker shm_mq queues as MinimalTuples. Each worker is the sender on its own 64 KB shared-memory ring; the leader is the receiver on all of them and drains them round-robin. A minimal tuple carries no transaction header, so it is exactly what crosses a process boundary cheaply.
- Parallel mode is strictly read-only and best-effort. EnterParallelMode arms checks that forbid writes; the leader tolerates getting fewer workers than requested (down to zero) and can itself act as a producer when parallel_leader_participation is on.
The plan shape: Gather over a partial plan
A parallel plan is an ordinary serial plan with a Gather (or
GatherMerge) node spliced in. Everything below the Gather runs in
each worker; everything above runs only in the leader. The node directly
below is typically a partial path whose leaf is parallel-aware.
flowchart TB
subgraph LEADER["runs in leader only (above Gather)"]
AGG["Finalize Aggregate"]
GATH["Gather<br/>num_workers=2<br/>(the exchange operator)"]
AGG --> GATH
end
subgraph WORKERS["partial plan — one identical copy per worker (+ leader)"]
PAGG["Partial Aggregate"]
PSCAN["Parallel SeqScan<br/>(parallel_aware = true)<br/>block range from shared counter"]
PAGG --> PSCAN
end
GATH -->|"DSM-serialized plan;<br/>tuples back via shm_mq"| PAGG
Figure 1 — A parallel aggregate. The Gather is the exchange operator:
the partial subtree below it is copied into each worker (and optionally
run by the leader too); the parallel-aware SeqScan partitions the heap
by handing each worker a different block range from one shared counter, so
the copies together scan every block exactly once. The leader’s Finalize Aggregate combines the workers’ partial aggregates.
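The "block range from one shared counter" idea in Figure 1 can be sketched with C11 atomics. This is a simplified illustration, not PostgreSQL's actual parallel scan allocator: ScanCursor, cursor_init and claim_blocks are hypothetical names, and a fixed chunk size is assumed. Each caller claims a disjoint range with a single atomic fetch-and-add, so the copies cover every block exactly once without taking a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared scan cursor; a real engine keeps it in shared memory. */
typedef struct ScanCursor
{
    atomic_uint_fast64_t next_block;   /* first block not yet handed out */
    uint64_t             nblocks;      /* total blocks in the relation */
} ScanCursor;

void
cursor_init(ScanCursor *c, uint64_t nblocks)
{
    atomic_init(&c->next_block, 0);
    c->nblocks = nblocks;
}

/*
 * Claim up to `chunk` consecutive blocks. Returns false once the relation
 * is exhausted; otherwise sets [*start, *end) to the claimed range.
 */
bool
claim_blocks(ScanCursor *c, uint64_t chunk, uint64_t *start, uint64_t *end)
{
    uint64_t first;

    assert(chunk > 0);
    first = atomic_fetch_add(&c->next_block, chunk);
    if (first >= c->nblocks)
        return false;
    *start = first;
    *end = (first + chunk < c->nblocks) ? first + chunk : c->nblocks;
    return true;
}
```

Every worker (and the leader, when it participates) would call claim_blocks in a loop until it returns false; none of them needs to know how many other participants exist.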
ExecInitGather — set up the funnel, defer the workers
ExecInitGather builds the GatherState and, crucially, initializes the
outer (child) plan but does not launch any workers. It also sets up
the funnel slot, a MinimalTuple-backed slot that every worker tuple
will be stored into, because that is the format tuples arrive in:

    // ExecInitGather: src/backend/executor/nodeGather.c (condensed)
    gatherstate = makeNode(GatherState);
    gatherstate->ps.plan = (Plan *) node;
    gatherstate->ps.state = estate;
    gatherstate->ps.ExecProcNode = ExecGather;   /* the next() function */

    gatherstate->initialized = false;            /* workers launched lazily */
    gatherstate->need_to_scan_locally =
        !node->single_copy && parallel_leader_participation;
    gatherstate->tuples_needed = -1;

    /* now initialize outer plan (the partial subtree) */
    outerNode = outerPlan(node);
    outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
    tupDesc = ExecGetResultType(outerPlanState(gatherstate));

    /* Initialize funnel slot to same tuple descriptor as outer plan. */
    gatherstate->funnel_slot = ExecInitExtraTupleSlot(estate, tupDesc,
                                                      &TTSOpsMinimalTuple);

Two design facts are visible. First, initialized = false: the whole DSM + worker
apparatus is built on the first ExecGather call, not here. Second,
need_to_scan_locally records the parallel_leader_participation policy (whether
the leader will also act as a producer), but its final value is recomputed
after launch, once the actual worker count is known.
ExecGather — lazy launch, then pull from the funnel
ExecGather is the next() for the exchange operator. On the first call
it fires up the parallel machinery; on every call it pulls one tuple from
gather_getnext and optionally projects it.

    // ExecGather: src/backend/executor/nodeGather.c (condensed)
    if (!node->initialized)
    {
        EState     *estate = node->ps.state;
        Gather     *gather = (Gather *) node->ps.plan;

        if (gather->num_workers > 0 && estate->es_use_parallel_mode)
        {
            ParallelContext *pcxt;

            /* Build (or rebuild) the shared state the workers need. */
            if (!node->pei)
                node->pei = ExecInitParallelPlan(outerPlanState(node), estate,
                                                 gather->initParam,
                                                 gather->num_workers,
                                                 node->tuples_needed);
            else
                ExecParallelReinitialize(outerPlanState(node), node->pei,
                                         gather->initParam);

            pcxt = node->pei->pcxt;
            LaunchParallelWorkers(pcxt);             /* ask postmaster to fork */
            node->nworkers_launched = pcxt->nworkers_launched;

            if (pcxt->nworkers_launched > 0)
            {
                ExecParallelCreateReaders(node->pei);   /* one reader per worker */
                node->nreaders = pcxt->nworkers_launched;
                node->reader = palloc(node->nreaders * sizeof(TupleQueueReader *));
                memcpy(node->reader, node->pei->reader,
                       node->nreaders * sizeof(TupleQueueReader *));
            }
            else
            {
                node->nreaders = 0;
                node->reader = NULL;
            }
            node->nextreader = 0;
        }

        /* Run plan locally if no workers, or if leader participation is on. */
        node->need_to_scan_locally = (node->nreaders == 0)
            || (!gather->single_copy && parallel_leader_participation);
        node->initialized = true;
    }

The graceful-degradation convention is right here: num_workers > 0 && es_use_parallel_mode may still yield nworkers_launched == 0, in which
case need_to_scan_locally becomes true and the leader runs the plan
itself. The single_copy flag is the degenerate one-worker mode where the
leader does not also scan, so the child need not be parallel-aware at
all.
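The final need_to_scan_locally decision is small enough to restate as a pure function and check case by case. The helper name leader_scans_locally is hypothetical; the condition itself is copied from the condensed ExecGather excerpt.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Should the leader run the partial plan itself?
 *   nreaders              number of worker queues actually attached
 *   single_copy           the Gather's single-copy flag
 *   leader_participation  the parallel_leader_participation setting
 */
bool
leader_scans_locally(int nreaders, bool single_copy, bool leader_participation)
{
    assert(nreaders >= 0);
    return (nreaders == 0) || (!single_copy && leader_participation);
}
```

The first operand is the degradation path: with no workers attached the leader must scan, whatever the other two flags say. The second operand is the policy path, which applies only when workers exist.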
gather_getnext / gather_readnext — drain the queues, maybe scan locally
gather_getnext is the heart of the consumer side. It loops: try a worker
tuple (gather_readnext); if none and the leader is a participant, pull
one tuple from the local copy of the plan via ExecProcNode. The
local-scan branch installs the shared DSA area into es_query_dsa
first, because the parallel-aware leaf nodes the leader is now running
themselves reach into that shared area for their cursor state:

    // gather_getnext: src/backend/executor/nodeGather.c (condensed)
    while (gatherstate->nreaders > 0 || gatherstate->need_to_scan_locally)
    {
        CHECK_FOR_INTERRUPTS();

        if (gatherstate->nreaders > 0)
        {
            tup = gather_readnext(gatherstate);      /* a worker MinimalTuple */
            if (HeapTupleIsValid(tup))
            {
                ExecStoreMinimalTuple(tup, fslot, false);   /* into the funnel slot */
                return fslot;
            }
        }

        if (gatherstate->need_to_scan_locally)
        {
            EState     *estate = gatherstate->ps.state;

            /* Install our DSA area while executing the plan. */
            estate->es_query_dsa = gatherstate->pei ? gatherstate->pei->area : NULL;
            outerTupleSlot = ExecProcNode(outerPlan);   /* leader as producer */
            estate->es_query_dsa = NULL;
            if (!TupIsNull(outerTupleSlot))
                return outerTupleSlot;
            gatherstate->need_to_scan_locally = false;  /* local copy exhausted */
        }
    }
    return ExecClearTuple(fslot);                    /* everyone done */

gather_readnext is where the round-robin multiplexing lives. It reads
non-blocking from the current worker’s queue; advances to the next
worker only when the current one would block; removes a worker from the
active array when its queue signals done; and, when it has visited every
surviving worker without success and the leader is not also scanning,
sleeps on its latch until a worker wakes it:

    // gather_readnext: src/backend/executor/nodeGather.c (condensed)
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();                      /* drains worker errors */

        reader = gatherstate->reader[gatherstate->nextreader];
        tup = TupleQueueReaderNext(reader, true, &readerdone);   /* nowait=true */

        if (readerdone)                              /* this worker is finished */
        {
            --gatherstate->nreaders;
            if (gatherstate->nreaders == 0)
            {
                ExecShutdownGatherWorkers(gatherstate);
                return NULL;
            }
            memmove(/* compact the reader array, removing this slot */);
            if (gatherstate->nextreader >= gatherstate->nreaders)
                gatherstate->nextreader = 0;
            continue;
        }

        if (tup)
            return tup;                              /* got one; keep this queue */

        /* Empty for now: round-robin to the next worker. */
        gatherstate->nextreader++;
        if (gatherstate->nextreader >= gatherstate->nreaders)
            gatherstate->nextreader = 0;

        if (++nvisited >= gatherstate->nreaders)     /* visited all of them */
        {
            if (gatherstate->need_to_scan_locally)
                return NULL;                         /* let caller scan locally */
            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                             WAIT_EVENT_EXECUTE_GATHER);   /* nothing to do but wait */
            ResetLatch(MyLatch);
            nvisited = 0;
        }
    }

The comment in the source explains a tuning subtlety: PostgreSQL “used to advance the nextreader pointer after every tuple, but it turns out to be much more efficient to keep reading from the same queue until that would require blocking.” Staying on one queue maximizes batch locality and minimizes latch churn.
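The "stay on one queue until it would block" policy is easy to see in a self-contained toy. The sketch below is an assumption-laden model, not the executor code: ToyQueue, ToyGather, toy_read and toy_readnext are hypothetical names, queues are plain arrays with a `ready` watermark standing in for what a worker has sent so far, and where the real code would wait on a latch the toy simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUES 8

/* Hypothetical stand-in for one worker's tuple queue. */
typedef struct ToyQueue
{
    const int *items;      /* every tuple the worker will ever produce */
    size_t     nitems;     /* total number of tuples */
    size_t     ready;      /* how many have been "sent" so far */
    size_t     consumed;   /* how many the leader has read */
} ToyQueue;

typedef struct ToyGather
{
    ToyQueue *reader[MAX_QUEUES];
    int       nreaders;
    int       nextreader;
} ToyGather;

/* Non-blocking read: true plus *out if a tuple is ready; *done if finished. */
bool
toy_read(ToyQueue *q, int *out, bool *done)
{
    *done = (q->consumed == q->nitems);
    if (*done || q->consumed == q->ready)
        return false;
    *out = q->items[q->consumed++];
    return true;
}

/*
 * Returns true with a tuple in *out. Returns false when a full lap found
 * nothing ready, or when every queue has finished (nreaders reaches 0).
 */
bool
toy_readnext(ToyGather *g, int *out)
{
    int nvisited = 0;

    while (g->nreaders > 0)
    {
        bool      done;
        ToyQueue *q;

        assert(g->nextreader < g->nreaders);
        q = g->reader[g->nextreader];

        if (toy_read(q, out, &done))
            return true;                 /* keep reading this queue next time */

        if (done)
        {
            /* Compact the array, dropping the finished reader. */
            for (int i = g->nextreader; i < g->nreaders - 1; i++)
                g->reader[i] = g->reader[i + 1];
            g->nreaders--;
            if (g->nextreader >= g->nreaders)
                g->nextreader = 0;
            continue;
        }

        /* Would block: move round-robin to the next queue. */
        g->nextreader = (g->nextreader + 1) % g->nreaders;
        if (++nvisited >= g->nreaders)
            return false;                /* nothing ready anywhere */
    }
    return false;
}
```

With one queue holding three ready tuples and another holding one, successive calls drain the first queue completely before touching the second, which is the behavior the source comment describes.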
flowchart TB
subgraph LDR["Leader process"]
EG["ExecGather / gather_getnext"]
RR["gather_readnext<br/>round-robin, non-blocking"]
LOC["local scan<br/>(parallel_leader_participation)"]
EG --> RR
EG --> LOC
end
subgraph W0["Worker 0 (ParallelQueryMain)"]
P0["executor copy of partial plan"]
end
subgraph W1["Worker 1"]
P1["executor copy of partial plan"]
end
P0 -->|"MinimalTuple"| Q0[["shm_mq queue 0<br/>64 KB ring"]]
P1 -->|"MinimalTuple"| Q1[["shm_mq queue 1"]]
Q0 --> RR
Q1 --> RR
LOC -->|"reads via es_query_dsa<br/>shared scan cursor"| DSA[("DSA area")]
P0 -.-> DSA
P1 -.-> DSA
Figure 2 — The funnel. Each worker is the sole sender on its own 64 KB
shm_mq; the leader is the receiver on all of them and drains them
round-robin, parking on its latch only when every queue is empty and it is
not itself scanning. The parallel-aware leaf nodes (in both workers and,
when participating, the leader) coordinate which heap blocks each one reads
through a shared cursor in the DSA area.
GatherMerge — the order-preserving exchange
GatherMerge is used when the plan needs the gathered output sorted
(e.g. a Sort below each worker feeding an order-dependent node above).
Instead of gather_readnext’s round-robin concatenation, it maintains a
binary heap of the per-worker stream heads keyed by the sort columns,
and on each call pops the globally-smallest tuple and refills that one
worker’s slot. Its launch path is identical to Gather —
ExecInitParallelPlan, LaunchParallelWorkers, ExecParallelCreateReaders:

    // ExecGatherMerge: src/backend/executor/nodeGatherMerge.c (condensed)
    if (!node->initialized)
    {
        if (gm->num_workers > 0 && estate->es_use_parallel_mode)
        {
            if (!node->pei)
                node->pei = ExecInitParallelPlan(outerPlanState(node), estate,
                                                 gm->initParam, gm->num_workers,
                                                 node->tuples_needed);
            else
                ExecParallelReinitialize(outerPlanState(node), node->pei,
                                         gm->initParam);
            pcxt = node->pei->pcxt;
            LaunchParallelWorkers(pcxt);
            node->nworkers_launched = pcxt->nworkers_launched;
            if (pcxt->nworkers_launched > 0)
                ExecParallelCreateReaders(node->pei);
            /* ... build the binary heap, one slot per reader ... */
        }
    }

The cost is that GatherMerge cannot return any tuple until every worker
still in the merge has supplied its next tuple (or signalled that it is
done), so a slow or stalled worker holds back the merge; a plain Gather has
no such ordering coupling. That is the engineering price of the “order is a
separate concern” property from the theory.
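The heap-merge of sorted streams can be sketched as a standard k-way merge. The code below is a toy with integer keys and in-memory arrays, assuming each input is already sorted ascending; Stream, MergeState, merge_init and merge_next are hypothetical names and do not correspond to the executor's data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STREAMS 8

/* Hypothetical sorted input stream (stands in for one worker's output). */
typedef struct Stream
{
    const int *vals;
    size_t     n;
    size_t     pos;
} Stream;

typedef struct HeapEntry
{
    int val;   /* current head of the stream */
    int src;   /* which stream it came from */
} HeapEntry;

typedef struct MergeState
{
    Stream   *streams;
    HeapEntry heap[MAX_STREAMS];
    int       size;
} MergeState;

/* Restore the min-heap property below index i. */
static void
sift_down(HeapEntry *h, int size, int i)
{
    for (;;)
    {
        int       l = 2 * i + 1;
        int       r = l + 1;
        int       m = i;
        HeapEntry t;

        if (l < size && h[l].val < h[m].val)
            m = l;
        if (r < size && h[r].val < h[m].val)
            m = r;
        if (m == i)
            return;
        t = h[i];
        h[i] = h[m];
        h[m] = t;
        i = m;
    }
}

/* Load the head of every non-empty stream before emitting anything. */
void
merge_init(MergeState *ms, Stream *streams, int nstreams)
{
    assert(nstreams <= MAX_STREAMS);
    ms->streams = streams;
    ms->size = 0;
    for (int s = 0; s < nstreams; s++)
    {
        if (streams[s].pos < streams[s].n)
        {
            ms->heap[ms->size].val = streams[s].vals[streams[s].pos++];
            ms->heap[ms->size].src = s;
            ms->size++;
        }
    }
    for (int i = ms->size / 2 - 1; i >= 0; i--)
        sift_down(ms->heap, ms->size, i);
}

/* Pop the smallest head and refill from the same stream. Returns 0 at end. */
int
merge_next(MergeState *ms, int *out)
{
    Stream *s;

    if (ms->size == 0)
        return 0;
    *out = ms->heap[0].val;
    s = &ms->streams[ms->heap[0].src];
    if (s->pos < s->n)
        ms->heap[0].val = s->vals[s->pos++];   /* refill that one slot */
    else
        ms->heap[0] = ms->heap[--ms->size];    /* stream exhausted */
    sift_down(ms->heap, ms->size, 0);
    return 1;
}
```

Note where the ordering coupling shows up: merge_init must read one value from every stream before merge_next can return anything, and each pop is followed by a refill from the stream that was just consumed.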
Source Walkthrough
Anchor on symbol names, not line numbers. A function or struct name is the stable handle; line numbers drift the moment someone reformats. Use
git grep -n '<symbol>' src/backend/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.
Serializing the plan — ExecSerializePlan
The plan that travels to workers is a copy (workers must not scribble on
the leader’s plan), wrapped in a dummy PlannedStmt. Two fix-ups matter:
resjunk columns are un-marked (the tuples come back to another backend
that may need them, not to the user), and parallel-unsafe subplans are
replaced by a NULL “hole” so a worker can never even ExecInitNode one:

    // ExecSerializePlan: src/backend/executor/execParallel.c (condensed)
    plan = copyObject(plan);                 /* don't touch the original */

    foreach(lc, plan->targetlist)
        lfirst_node(TargetEntry, lc)->resjunk = false;   /* keep junk cols on the wire */

    pstmt = makeNode(PlannedStmt);
    pstmt->commandType = CMD_SELECT;         /* always read-only */
    pstmt->planTree = plan;
    pstmt->rtable = estate->es_range_table;
    /* Transfer only parallel-safe subplans, NULL-padding the unsafe ones. */
    pstmt->subplans = NIL;
    foreach(lc, estate->es_plannedstmt->subplans)
    {
        Plan       *subplan = (Plan *) lfirst(lc);

        if (subplan && !subplan->parallel_safe)
            subplan = NULL;
        pstmt->subplans = lappend(pstmt->subplans, subplan);
    }
    return nodeToString(pstmt);              /* the generic serializer */

nodeToString is the engine-wide node-tree serializer (owned by
postgres-node-trees.md); reusing it is what makes “ship the plan” a
one-liner.
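The NULL-padding step is worth isolating, because padding (rather than dropping) the unsafe entries keeps every list position stable. The sketch below assumes, as the padding suggests, that subplans are referred to by position; ToyPlan and pad_unsafe_subplans are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Plan node. */
typedef struct ToyPlan
{
    int  id;
    bool parallel_safe;
} ToyPlan;

/*
 * Copy `n` subplan pointers from `in` to `out`, replacing parallel-unsafe
 * entries with NULL. Positions are preserved, so an index that names a
 * subplan in the leader names the same slot in the worker's copy.
 */
void
pad_unsafe_subplans(ToyPlan *const *in, ToyPlan **out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
    {
        ToyPlan *p = in[i];

        if (p != NULL && !p->parallel_safe)
            p = NULL;            /* leave a hole instead of shifting */
        out[i] = p;
    }
}
```

Dropping the unsafe entry instead would shift every later subplan down by one position, and any positional reference in the shipped plan would then point at the wrong subplan.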
Two-pass DSM sizing — ExecParallelEstimate then init
ExecInitParallelPlan sizes the segment before creating it. The estimate
pass walks the PlanState tree via planstate_tree_walker, counting
nodes and letting each parallel-aware node reserve space; nodes that need
DSM only for EXPLAIN ANALYZE instrumentation reserve unconditionally:

    // ExecParallelEstimate: src/backend/executor/execParallel.c (condensed)
    e->nnodes++;                             /* count every node */
    switch (nodeTag(planstate))
    {
        case T_SeqScanState:
            if (planstate->plan->parallel_aware)
                ExecSeqScanEstimate((SeqScanState *) planstate, e->pcxt);
            break;
        case T_SortState:
            /* even when not parallel-aware, for EXPLAIN ANALYZE */
            ExecSortEstimate((SortState *) planstate, e->pcxt);
            break;
        /* ... Index/BitmapHeap/Append/Hash/Agg/Memoize/... ... */
        default:
            break;
    }
    return planstate_tree_walker(planstate, ExecParallelEstimate, e);

The estimator accumulates a chunk size and a key count for each of the
executor’s fixed pieces — fixed state, query text, the serialized
PlannedStmt, ParamListInfo, per-worker BufferUsage/WalUsage, the
tuple queues, optional instrumentation, and the DSA area — each tagged with
a PARALLEL_KEY_* magic number above 2^32 so individual nodes can use
smaller keys for their own state:

    // ExecInitParallelPlan: src/backend/executor/execParallel.c (condensed)
    pstmt_data = ExecSerializePlan(planstate->plan, estate);
    pcxt = CreateParallelContext("postgres", "ParallelQueryMain", nworkers);
    pei->pcxt = pcxt;

    shm_toc_estimate_chunk(&pcxt->estimator, sizeof(FixedParallelExecutorState));
    shm_toc_estimate_keys(&pcxt->estimator, 1);
    /* ... query text, PlannedStmt, ParamListInfo, Buffer/Wal usage ... */
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
    shm_toc_estimate_keys(&pcxt->estimator, 1);

    /* Let parallel-aware nodes add to the estimate; also counts nnodes. */
    e.pcxt = pcxt;
    e.nnodes = 0;
    ExecParallelEstimate(planstate, &e);

    shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);   /* DSA area */
    shm_toc_estimate_keys(&pcxt->estimator, 1);

    /* Everyone's asked for space; now create the DSM. */
    InitializeParallelDSM(pcxt);

After InitializeParallelDSM returns, the space exists but is
uninitialized; the function then allocates and fills each chunk —
shm_toc_allocate to carve it, shm_toc_insert to register it under its
key — storing the fixed state, the query string, the serialized plan, the
serialized params, and setting up the tuple queues:

    // ExecInitParallelPlan: store phase (condensed)
    fpes = shm_toc_allocate(pcxt->toc, sizeof(FixedParallelExecutorState));
    fpes->tuples_needed = tuples_needed;
    fpes->eflags = estate->es_top_eflags;
    fpes->jit_flags = estate->es_jit_flags;
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_FIXED, fpes);

    pstmt_space = shm_toc_allocate(pcxt->toc, pstmt_len);
    memcpy(pstmt_space, pstmt_data, pstmt_len);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, pstmt_space);

    pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);   /* the funnel queues */

    /* A DSA area shared by leader and workers (parallel-aware node scratch). */
    area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_DSA, area_space);
    pei->area = dsa_create_in_place(area_space, dsa_minsize,
                                    LWTRANCHE_PARALLEL_QUERY_DSA, pcxt->seg);

    /* Initialize pass: visits the identical nnodes in identical order. */
    estate->es_query_dsa = pei->area;
    ExecParallelInitializeDSM(planstate, &d);
    estate->es_query_dsa = NULL;
    if (e.nnodes != d.nnodes)
        elog(ERROR, "inconsistent count of PlanState nodes");

The e.nnodes != d.nnodes check (a runtime elog(ERROR), not a compiled-out
assertion) is the guardrail on the two-pass contract: estimate and init must
visit the same node set, or offsets computed in one pass would be wrong in
the other.
The tuple queues — ExecParallelSetupTupleQueues
Each worker gets one shm_mq of PARALLEL_TUPLE_QUEUE_SIZE (64 KB),
carved from a contiguous block. The leader sets itself as the receiver
of each here; the worker will set itself as the sender later:

    // ExecParallelSetupTupleQueues: src/backend/executor/execParallel.c (condensed)
    tqueuespace = shm_toc_allocate(pcxt->toc,
                                   mul_size(PARALLEL_TUPLE_QUEUE_SIZE,
                                            pcxt->nworkers));
    for (i = 0; i < pcxt->nworkers; ++i)
    {
        shm_mq     *mq = shm_mq_create(tqueuespace +
                                       ((Size) i) * PARALLEL_TUPLE_QUEUE_SIZE,
                                       (Size) PARALLEL_TUPLE_QUEUE_SIZE);

        shm_mq_set_receiver(mq, MyProc);         /* leader receives */
        responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
    }
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);
    return responseq;

The readers are created separately by ExecParallelCreateReaders, after
launch, because the leader can do useful setup work while workers fork:

    // ExecParallelCreateReaders: src/backend/executor/execParallel.c (condensed)
    for (i = 0; i < nworkers; i++)
    {
        shm_mq_set_handle(pei->tqueue[i], pei->pcxt->worker[i].bgwhandle);
        pei->reader[i] = CreateTupleQueueReader(pei->tqueue[i]);
    }

Launching workers — LaunchParallelWorkers
The generic LaunchParallelWorkers (in parallel.c) registers one
background worker per requested slot. The entry point name baked in is
ParallelWorkerMain; the DSM handle is passed as the worker’s main
argument so it can attach. The leader first becomes a lock-group
leader to prevent undetected deadlocks between itself and its workers:

    // LaunchParallelWorkers: src/backend/access/transam/parallel.c (condensed)
    BecomeLockGroupLeader();                 /* group locking */

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d",
             MyProcPid);
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS
        | BGWORKER_BACKEND_DATABASE_CONNECTION
        | BGWORKER_CLASS_PARALLEL;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    sprintf(worker.bgw_library_name, "postgres");
    sprintf(worker.bgw_function_name, "ParallelWorkerMain");
    worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(pcxt->seg));
    worker.bgw_notify_pid = MyProcPid;

    for (i = 0; i < pcxt->nworkers_to_launch; ++i)
    {
        memcpy(worker.bgw_extra, &i, sizeof(int));   /* this worker's number */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle))
        {
            shm_mq_set_handle(pcxt->worker[i].error_mqh,
                              pcxt->worker[i].bgwhandle);
            pcxt->nworkers_launched++;           /* best-effort count */
        }
        else
            any_registrations_failed = true;     /* hit max_worker_processes */
    }

Registration failure is not an error: it simply caps
nworkers_launched below the request, and the caller (the Gather) copes
by scanning locally. This is the graceful-degradation convention in code.
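The best-effort launch loop reduces to a small pattern: ask for N, stop asking after the first refusal, and report how many were granted. The sketch below is a generic model of that pattern, not the background-worker API: launch_best_effort, register_fn and take_slot are hypothetical names, and a simple counter stands in for the pool of worker slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registration callback: true if a worker slot was granted. */
typedef bool (*register_fn) (int worker_number, void *arg);

/*
 * Request `requested` workers. After the first refusal, stop trying (the
 * pool is assumed to be exhausted) and return the number actually granted.
 */
int
launch_best_effort(int requested, register_fn try_register, void *arg)
{
    int  launched = 0;
    bool failed = false;

    assert(requested >= 0);
    for (int i = 0; i < requested; i++)
    {
        if (!failed && try_register(i, arg))
            launched++;
        else
            failed = true;       /* not an error; just fewer workers */
    }
    return launched;
}

/* Example registrar: succeeds while a hypothetical slot pool is non-empty. */
bool
take_slot(int worker_number, void *arg)
{
    int *free_slots = arg;

    (void) worker_number;
    if (*free_slots <= 0)
        return false;
    (*free_slots)--;
    return true;
}
```

A caller that receives 0 from launch_best_effort is in the same position as a Gather with nworkers_launched == 0: it runs the plan itself.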
The worker — ParallelWorkerMain restores leader state
ParallelWorkerMain (in parallel.c) is what every parallel worker runs.
Before it can execute a single tuple it must make its private process look
like the leader. The ordering is delicate and heavily commented in the
source; the essential restores:
```c
// ParallelWorkerMain: src/backend/access/transam/parallel.c (condensed, reordered)
seg = dsm_attach(DatumGetUInt32(main_arg));     /* attach the segment */
toc = shm_toc_attach(PARALLEL_MAGIC, dsm_segment_address(seg));
fps = shm_toc_lookup(toc, PARALLEL_KEY_FIXED, false);

/* Attach the error queue FIRST, so later failures reach the leader. */
mq = (shm_mq *) (error_queue_space +
                 ParallelWorkerNumber * PARALLEL_ERROR_QUEUE_SIZE);
shm_mq_set_sender(mq, MyProc);
mqh = shm_mq_attach(mq, seg, NULL);
pq_redirect_to_shm_mq(seg, mqh);                /* elog/ereport -> leader */

BecomeLockGroupMember(fps->parallel_leader_pgproc, fps->parallel_leader_pid);
SetAuthenticatedUserId(fps->authenticated_user_id);
BackgroundWorkerInitializeConnectionByOid(fps->database_id, ...); /* same DB */

StartParallelWorkerTransaction(tstatespace);    /* XID set, txn block state */
RestoreComboCIDState(combocidspace);            /* visibility consistency */

asnapshot = RestoreSnapshot(asnapspace);        /* same MVCC view */
RestoreTransactionSnapshot(tsnapshot, fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
InvalidateSystemCaches();                       /* tuples we can see changed */

RestoreGUCState(gucspace);                      /* every GUC value */
SetUserIdAndSecContext(fps->current_user_id, fps->sec_context);

EnterParallelMode();                            /* arm read-only checks */
entrypt(seg, toc);                              /* == ParallelQueryMain */
```

The README is explicit about why each restore exists: the XID set and combo CIDs so “tuple visibility checks return the same results in the worker as they do in the initiating backend”; the snapshots for the same reason; GUCs so planner/executor behavior matches. The error-queue attach is deliberately first: “until we do that, any errors that happen here will not be reported back.”
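Every restore above begins with a keyed lookup in the segment's table of contents. The idea can be sketched as a table that stores offsets rather than pointers, so that leader and worker can map the same segment at different addresses. This is a simplified illustration, not the shm_toc implementation: Toc, TocEntry, toc_insert, toc_lookup, and TOC_MAX_ENTRIES are hypothetical names, and the real table of contents lives inside the segment itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_MAX_ENTRIES 16      /* hypothetical fixed capacity for the sketch */

typedef struct TocEntry
{
    uint64_t key;               /* caller-chosen key, in the role of PARALLEL_KEY_FIXED */
    size_t   offset;            /* byte offset from the start of the segment */
} TocEntry;

typedef struct Toc
{
    uint64_t magic;             /* sanity value checked on attach */
    int      nentries;
    TocEntry entries[TOC_MAX_ENTRIES];
} Toc;

/* Leader side: record that `key` lives `offset` bytes into the segment. */
static int toc_insert(Toc *toc, uint64_t key, size_t offset)
{
    assert(toc != NULL);
    if (toc->nentries >= TOC_MAX_ENTRIES)
        return -1;              /* table full */
    toc->entries[toc->nentries].key = key;
    toc->entries[toc->nentries].offset = offset;
    toc->nentries++;
    return 0;
}

/* Worker side: resolve `key` against this process's own mapping address. */
static void *toc_lookup(const Toc *toc, char *segment_base, uint64_t key)
{
    for (int i = 0; i < toc->nentries; ++i)
        if (toc->entries[i].key == key)
            return segment_base + toc->entries[i].offset;
    return NULL;                /* key not present */
}
```

Because only offsets are stored, the same entry resolves correctly against whatever base address each process received when it attached.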
The worker executor — ParallelQueryMain
Once the environment matches the leader, the executor-specific entry point
ParallelQueryMain (back in execParallel.c) reconstructs the
QueryDesc from the DSM and runs an ordinary executor against it, with the
twist that its DestReceiver writes into the worker’s tuple queue:
```c
// ParallelQueryMain: src/backend/executor/execParallel.c (condensed)
fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
receiver = ExecParallelGetReceiver(seg, toc);   /* writes to the shm_mq */
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);

area = dsa_attach_in_place(shm_toc_lookup(toc, PARALLEL_KEY_DSA, false), seg);

ExecutorStart(queryDesc, fpes->eflags);
queryDesc->planstate->state->es_query_dsa = area;   /* shared scan cursors */
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);

InstrStartParallelQuery();
ExecutorRun(queryDesc, ForwardScanDirection,
            fpes->tuples_needed < 0 ? (int64) 0 : fpes->tuples_needed);
ExecutorFinish(queryDesc);
/* report Buffer/Wal usage + instrumentation back through DSM, then: */
ExecutorEnd(queryDesc);
```

The receiver is built by ExecParallelGetReceiver, which finds this
worker’s slot in the tuple-queue block, marks the worker as the sender,
and wraps it in a CreateTupleQueueDestReceiver, so every tuple the
worker’s executor produces is serialized as a MinimalTuple and pushed
into the ring the leader is draining:
```c
// ExecParallelGetReceiver: src/backend/executor/execParallel.c
mqspace = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE, false);
mqspace += ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE;
mq = (shm_mq *) mqspace;
shm_mq_set_sender(mq, MyProc);                  /* worker sends */
return CreateTupleQueueDestReceiver(shm_mq_attach(mq, seg, NULL));
```

The plan is reconstructed by ExecParallelGetQueryDesc with
stringToNode (the inverse of the leader’s nodeToString) and given
GetActiveSnapshot(), the snapshot the leader pushed and that
ParallelWorkerMain already restored.
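The slot arithmetic in ExecParallelGetReceiver is plain pointer math over one contiguous block of equal-sized queues. A minimal sketch, assuming that layout: TUPLE_QUEUE_SIZE mirrors PARALLEL_TUPLE_QUEUE_SIZE, while tuple_queue_block_size and tuple_queue_slot are hypothetical helper names that do not exist in the source tree.

```c
#include <assert.h>
#include <stddef.h>

#define TUPLE_QUEUE_SIZE 65536  /* mirrors PARALLEL_TUPLE_QUEUE_SIZE (64 KB) */

/* Bytes the leader must reserve so that every worker gets its own queue. */
static size_t tuple_queue_block_size(int nworkers)
{
    assert(nworkers >= 0);
    return (size_t) nworkers * TUPLE_QUEUE_SIZE;
}

/* Start of worker_number's queue inside the shared block. */
static char *tuple_queue_slot(char *block, int worker_number)
{
    assert(worker_number >= 0);
    return block + (size_t) worker_number * TUPLE_QUEUE_SIZE;
}
```

Each worker needs only the block's base address and its own worker number to find its queue, so no per-worker pointer has to be communicated.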
```mermaid
flowchart TB
    L0["Leader: ExecGather (first call)"]
    L1["ExecInitParallelPlan<br/>serialize plan, size + fill DSM"]
    L2["LaunchParallelWorkers<br/>RegisterDynamicBackgroundWorker x N"]
    L0 --> L1 --> L2
    L2 -->|"postmaster forks"| W0["ParallelWorkerMain"]
    W0 --> W1["attach DSM + error queue<br/>restore GUC/snapshot/XID/comboCID"]
    W1 --> W2["EnterParallelMode<br/>ParallelQueryMain"]
    W2 --> W3["ExecutorStart/Run<br/>DestReceiver -> shm_mq"]
    W3 -->|"MinimalTuples"| L3["leader gather_readnext<br/>drains queues"]
    W3 -->|"on error"| ERR["error shm_mq +<br/>PROCSIG_PARALLEL_MESSAGE"]
    ERR --> L4["leader CHECK_FOR_INTERRUPTS<br/>ProcessParallelMessages re-throws"]
```
Figure 3 — End-to-end launch and dataflow. The leader serializes the plan
and state into the DSM, asks the postmaster to fork workers, and each
worker reconstructs the leader’s environment before running its own
executor copy whose output DestReceiver writes MinimalTuples into the
shared queue the leader drains. Errors take the separate error queue and
surface at the leader’s next interrupt check.
Error propagation — the out-of-band channel
A worker’s elog/ereport is redirected (pq_redirect_to_shm_mq) into
its dedicated error queue, distinct from its tuple queue. Sending a
message signals the leader with PROCSIG_PARALLEL_MESSAGE; the leader’s
next CHECK_FOR_INTERRUPTS runs ProcessParallelMessages, which drains
every worker’s error queue and re-throws:
```c
// ProcessParallelMessages: src/backend/access/transam/parallel.c (condensed)
dlist_foreach(iter, &pcxt_list)
{
    pcxt = dlist_container(ParallelContext, node, iter.cur);
    for (i = 0; i < pcxt->nworkers_launched; ++i)
        while (pcxt->worker[i].error_mqh != NULL)
        {
            res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, &data, true);
            if (res == SHM_MQ_WOULD_BLOCK)
                break;
            else if (res == SHM_MQ_SUCCESS)
                ProcessParallelMessage(pcxt, i, &msg);  /* re-throws ErrorResponse */
            else
                ereport(ERROR, ... "lost connection to parallel worker");
        }
}
```

This is why the README stresses that the leader must “execute
CHECK_FOR_INTERRUPTS() regularly”: that call is the only point at which a
worker’s error becomes the leader’s error. Note that gather_readnext calls
CHECK_FOR_INTERRUPTS on every loop iteration precisely so that a worker
error surfaces promptly while the leader is draining tuples.
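The three-way outcome of the non-blocking receive (nothing yet, a message, or a vanished sender) can be sketched with a toy queue. This is an illustration only: ErrorQueue, MqResult, mq_send, mq_receive_nowait, and drain_error_queues are hypothetical names, the queue is a bounded array rather than a shared-memory ring, error codes are assumed to be nonzero, and where the real code raises an ERROR on a lost worker the sketch just sets a flag.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MQ_SUCCESS, MQ_WOULD_BLOCK, MQ_DETACHED } MqResult;

#define MQ_CAPACITY 8           /* hypothetical capacity; not a ring buffer */

typedef struct ErrorQueue
{
    int  msgs[MQ_CAPACITY];
    int  head;                  /* next message to read */
    int  tail;                  /* next free write position */
    bool detached;              /* sender went away */
} ErrorQueue;

/* Worker side: enqueue an error code; fails if the queue is full or detached. */
static bool mq_send(ErrorQueue *q, int code)
{
    if (q->detached || q->tail == MQ_CAPACITY)
        return false;
    q->msgs[q->tail++] = code;
    return true;
}

/* Leader side: never blocks; pending messages are delivered before detach. */
static MqResult mq_receive_nowait(ErrorQueue *q, int *code)
{
    if (q->head < q->tail)
    {
        *code = q->msgs[q->head++];
        return MQ_SUCCESS;
    }
    return q->detached ? MQ_DETACHED : MQ_WOULD_BLOCK;
}

/* One interrupt-check pass: drain every launched worker's error queue. */
static int drain_error_queues(ErrorQueue *queues, int nworkers_launched,
                              int *first_error, bool *lost_worker)
{
    int nmessages = 0;

    assert(nworkers_launched >= 0);
    *first_error = 0;
    *lost_worker = false;
    for (int i = 0; i < nworkers_launched; ++i)
    {
        for (;;)
        {
            int      code;
            MqResult res = mq_receive_nowait(&queues[i], &code);

            if (res == MQ_WOULD_BLOCK)
                break;                  /* nothing more from this worker */
            if (res == MQ_DETACHED)
            {
                *lost_worker = true;    /* real code raises an ERROR here */
                break;
            }
            nmessages++;
            if (*first_error == 0)
                *first_error = code;    /* the error the leader would re-throw */
        }
    }
    return nmessages;
}
```

Calling drain_error_queues from the leader's polling loop plays the role that CHECK_FOR_INTERRUPTS plays in gather_readnext: errors become visible only when the leader looks.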
Teardown and reuse — ExecParallelFinish / Cleanup / Reinitialize
ExecParallelFinish detaches the tuple queues (so any still-running worker
notices no more output is wanted), destroys the readers, and calls
WaitForParallelWorkersToFinish: the leader must not clean up the
transaction until every worker has exited, or “the relation could
disappear while the worker is still busy scanning it.” ExecParallelCleanup
then accumulates instrumentation, frees serialized params, detaches the
DSA, and finally calls DestroyParallelContext, which tears down the DSM.
For a node that rescans (e.g. a Gather on the inner side of a nested loop),
ExecParallelReinitialize recycles the same DSM via
ReinitializeParallelDSM and re-serializes only the (possibly changed)
parameters, avoiding a full segment rebuild.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| ExecInitGather | src/backend/executor/nodeGather.c | 53 |
| ExecGather | src/backend/executor/nodeGather.c | 137 |
| gather_getnext | src/backend/executor/nodeGather.c | 263 |
| gather_readnext | src/backend/executor/nodeGather.c | 311 |
| ExecShutdownGatherWorkers | src/backend/executor/nodeGather.c | 400 |
| ExecShutdownGather | src/backend/executor/nodeGather.c | 418 |
| ExecReScanGather | src/backend/executor/nodeGather.c | 442 |
| ExecInitGatherMerge | src/backend/executor/nodeGatherMerge.c | 67 |
| ExecGatherMerge | src/backend/executor/nodeGatherMerge.c | 183 |
| gather_merge_readnext | src/backend/executor/nodeGatherMerge.c | 636 |
| ExecSerializePlan | src/backend/executor/execParallel.c | 146 |
| ExecParallelEstimate | src/backend/executor/execParallel.c | 233 |
| ExecParallelSetupTupleQueues | src/backend/executor/execParallel.c | 547 |
| ExecInitParallelPlan | src/backend/executor/execParallel.c | 599 |
| ExecParallelCreateReaders | src/backend/executor/execParallel.c | 890 |
| ExecParallelReinitialize | src/backend/executor/execParallel.c | 916 |
| ExecParallelFinish | src/backend/executor/execParallel.c | 1156 |
| ExecParallelCleanup | src/backend/executor/execParallel.c | 1209 |
| ExecParallelGetReceiver | src/backend/executor/execParallel.c | 1245 |
| ExecParallelGetQueryDesc | src/backend/executor/execParallel.c | 1261 |
| ExecParallelInitializeWorker | src/backend/executor/execParallel.c | 1334 |
| ParallelQueryMain | src/backend/executor/execParallel.c | 1429 |
| CreateParallelContext | src/backend/access/transam/parallel.c | 173 |
| InitializeParallelDSM | src/backend/access/transam/parallel.c | 211 |
| ReinitializeParallelDSM | src/backend/access/transam/parallel.c | 508 |
| LaunchParallelWorkers | src/backend/access/transam/parallel.c | 580 |
| WaitForParallelWorkersToFinish | src/backend/access/transam/parallel.c | 803 |
| ProcessParallelMessages | src/backend/access/transam/parallel.c | 1055 |
| ProcessParallelMessage | src/backend/access/transam/parallel.c | 1144 |
| ParallelWorkerMain | src/backend/access/transam/parallel.c | 1299 |
| PARALLEL_TUPLE_QUEUE_SIZE (= 65536) | src/backend/executor/execParallel.c | 69 |
Source verification (as of 2026-06-05)
Verified against /data/hgryoo/references/postgres at commit 273fe94
(REL_18_STABLE). Checks performed:

- Gather launches workers lazily on first ExecGather, not at init. Confirmed: ExecInitGather sets gatherstate->initialized = false and only calls ExecInitNode on the outer plan; the ExecInitParallelPlan / LaunchParallelWorkers block is inside if (!node->initialized) in ExecGather. The source comment states the rationale (“it needs to allocate a large dynamic segment”).
- Tuple queue size is 64 KB per worker. #define PARALLEL_TUPLE_QUEUE_SIZE 65536 in execParallel.c; each queue is created with shm_mq_create at that size in ExecParallelSetupTupleQueues.
- Tuples cross as MinimalTuple. The funnel slot is created with &TTSOpsMinimalTuple in ExecInitGather; gather_getnext stores worker tuples via ExecStoreMinimalTuple; the worker receiver is CreateTupleQueueDestReceiver.
- The plan is serialized with nodeToString and rebuilt with stringToNode. ExecSerializePlan returns nodeToString(pstmt); ExecParallelGetQueryDesc calls stringToNode(pstmtspace).
- Two-pass node-count invariant exists. ExecParallelEstimate increments e.nnodes; ExecParallelInitializeDSM increments d.nnodes; ExecInitParallelPlan errors with “inconsistent count of PlanState nodes” if they differ.
- Read-only enforcement. ExecSerializePlan hard-codes pstmt->commandType = CMD_SELECT; ParallelWorkerMain calls EnterParallelMode() before invoking the entry point. (The EnterParallelMode / ExitParallelMode check machinery lives in xact.c; see postgres-xact.md.)
- Error path. ParallelWorkerMain calls pq_redirect_to_shm_mq; ProcessParallelMessages drains pcxt->worker[i].error_mqh and ProcessParallelMessage re-throws. The PROCSIG_PARALLEL_MESSAGE signal constant is referenced in parallel.c.
- REL_18 sanity. No XLOG2 rmgr and no B_DATACHECKSUMSWORKER_* symbols were asserted. The worker reports per-query BufferUsage and WalUsage (PARALLEL_KEY_WAL_USAGE), consistent with modern releases.
- Symbols exist. Every symbol in the position-hint table was located by grep -nE '^<symbol>' in the cited file at the listed line (±0 at this commit).
Not independently re-derived (taken from source comments / README):
the precise list of state the generic InitializeParallelDSM copies
(libraries, combo CIDs, relmapper, REINDEX state, etc.) is enumerated in
README.parallel and the InitializeParallelDSM body; this doc quotes
rather than re-verifies each one.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s parallel executor is a deliberately conservative, first-generation implementation of the Volcano exchange model. Placing it against the design space and the literature clarifies which of its choices are fundamental and which are artifacts of its starting point.
Process-per-worker vs. thread-per-worker
PostgreSQL’s single most consequential decision is that a worker is a
full backend process, not a thread. The benefit is enormous code
reuse: a worker runs the same ExecutorStart/ExecutorRun as a normal
query (postgres-executor.md), and fault isolation is free — a crashing
worker cannot corrupt the leader’s heap. The cost is everything in §“State
Sharing”: every GUC, snapshot, XID, combo CID, and library must be
explicitly serialized and re-installed, because processes share no
address space. A thread-per-worker engine (e.g. SQL Server, MySQL,
DuckDB, most OLAP engines) shares all of that for free, paying instead in
fragile global state, pervasive locking, and the loss of crash isolation.
PostgreSQL chose processes because its entire architecture
(postgres-postmaster.md, postgres-backend-lifecycle.md) is
process-per-connection; parallel workers reuse that machinery wholesale.
The serialization tax is the price of that reuse.
Pull-based exchange vs. push-based / morsel-driven
PostgreSQL’s Gather is pull-based: the leader’s gather_readnext
calls down into per-worker queues, and each worker independently pulls its
partition through an iterator. This is the classic Volcano exchange. The
influential alternative is morsel-driven parallelism (Leis et al.,
“Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for
the Many-Core Age,” SIGMOD 2014), as implemented in HyPer and later
Umbra. There, a fixed pool of worker threads pulls small “morsels” of
input from a shared dispatcher and pushes them through a compiled
pipeline; parallelism is elastic (threads pick up the next morsel when
free), NUMA-aware, and avoids the static degree-of-parallelism PostgreSQL
fixes at plan time. PostgreSQL’s parallel-aware SeqScan block-range
handout is a coarse, single-operator echo of morsel dispatch (the shared
block counter in the scan’s shared-memory descriptor is a morsel dispatcher
in miniature), but the worker count stays fixed once the workers are
launched and work cannot be rebalanced across operators the way
morsel-driven engines do. Vectorized push engines (the
MonetDB/X100 lineage; Boncz, Zukowski & Nes, “MonetDB/X100:
Hyper-Pipelining Query Execution,” CIDR 2005) combine push dataflow with
batch-at-a-time processing, attacking the per-tuple interpreter overhead
that PostgreSQL’s tuple-at-a-time funnel still pays — though PG’s
expression JIT (postgres-expression-eval.md) narrows that gap inside a
worker.
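The "morsel dispatcher in miniature" can be sketched as a shared counter that workers advance with an atomic fetch-and-add, each call claiming the next run of blocks. This is a minimal sketch under stated assumptions: BlockDispatcher, dispatcher_init, and dispatcher_claim are hypothetical names, the chunk size is fixed for simplicity, and the counter is never reset, so it suits a single scan only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct BlockDispatcher
{
    atomic_uint_fast64_t next_block;    /* first block nobody has claimed yet */
    uint64_t             nblocks;       /* total blocks in the relation */
    uint64_t             chunk;         /* blocks handed out per claim */
} BlockDispatcher;

static void dispatcher_init(BlockDispatcher *d, uint64_t nblocks, uint64_t chunk)
{
    assert(chunk > 0);
    atomic_init(&d->next_block, 0);
    d->nblocks = nblocks;
    d->chunk = chunk;
}

/*
 * Claim the next run of blocks. On success, *start holds the first block and
 * the return value is the run length; a return of 0 means the scan is done.
 */
static uint64_t dispatcher_claim(BlockDispatcher *d, uint64_t *start)
{
    uint64_t first = atomic_fetch_add(&d->next_block, d->chunk);
    uint64_t remaining;

    if (first >= d->nblocks)
        return 0;                       /* nothing left to hand out */
    *start = first;
    remaining = d->nblocks - first;
    return remaining < d->chunk ? remaining : d->chunk;
}
```

A fast worker simply comes back for more runs than a slow one, which gives load balancing inside the scan. The sketch has no way to move a worker to a different operator, which is the limitation the paragraph above describes.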
The funnel as a scalability bottleneck
A single Gather is a funnel: every worker tuple is serialized to a
MinimalTuple, copied through a 64 KB ring, and deserialized by one
leader. For high-cardinality results this re-centralizes work the parallel
scan spread out, and the leader can become the bottleneck — the classic
“exchange is the scalability limit” observation from parallel-DB folklore
and quantified in many shared-nothing studies. Production parallel engines
mitigate this with multi-level exchanges (partial aggregation below the
exchange so fewer tuples cross it — which PostgreSQL does do via Partial
/ Finalize Aggregate) and with repartitioning exchanges that keep data
parallel across a pipeline of operators rather than funneling to one
node after a single operator. PostgreSQL has no general repartitioning
exchange: parallelism collapses at every Gather and must be re-expanded
below the next one, which is why deeply parallel plans favor pushing as
much work as possible below a single top Gather. Parallel hash join
(postgres-table-am.md adjacent; the T_HashJoinState parallel-aware
path in ExecParallelEstimate) is the main exception where state stays
shared across workers via the DSA area rather than funneling.
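Why partial aggregation relieves the funnel can be shown with an average: each worker reduces its rows to one (sum, count) pair, so only one small state per worker crosses the exchange. A minimal sketch, assuming hypothetical names (PartialAvg, partial_avg, combine_avg, finalize_avg) that only mirror the Partial / Finalize Aggregate split and are not PostgreSQL functions:

```c
#include <assert.h>
#include <stddef.h>

/* Transition state a worker would send instead of its raw rows. */
typedef struct PartialAvg
{
    double sum;
    long   count;
} PartialAvg;

/* Worker side ("Partial Aggregate"): fold local rows into one state. */
static PartialAvg partial_avg(const double *values, size_t n)
{
    PartialAvg p = { 0.0, 0 };

    for (size_t i = 0; i < n; ++i)
    {
        p.sum += values[i];
        p.count++;
    }
    return p;
}

/* Leader side: merge two partial states that crossed the exchange. */
static PartialAvg combine_avg(PartialAvg a, PartialAvg b)
{
    PartialAvg c = { a.sum + b.sum, a.count + b.count };

    return c;
}

/* Leader side ("Finalize Aggregate"): turn the merged state into a result. */
static double finalize_avg(PartialAvg p)
{
    assert(p.count >= 0);
    /* SQL would yield NULL for an empty input; the sketch returns 0.0. */
    return p.count == 0 ? 0.0 : p.sum / (double) p.count;
}
```

With two workers holding three rows each, two states cross the funnel instead of six tuples, and the saving grows with the input size while the number of states stays at one per worker.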
Static vs. adaptive degree of parallelism
The worker count is decided by the planner from table size and
max_parallel_workers_per_gather, then is best-effort at runtime
(nworkers_launched may be lower). It is never increased mid-query and
never rebalanced. Adaptive systems revisit the degree of parallelism based
on runtime feedback — the research frontier of adaptive query
processing (Deshpande, Ives & Raman, “Adaptive Query Processing,”
Foundations and Trends in Databases, 2007) and runtime re-optimization.
PostgreSQL’s parallel_leader_participation is a tiny adaptive touch (the
leader fills in when workers are scarce), but the architecture is
fundamentally plan-time-static.
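The two-stage decision (a plan-time count, then a best-effort launch) can be sketched as two small functions. This is a sketch under loud assumptions: it assumes the planner adds one worker each time the relation triples in size past a minimum scan threshold, a rule this section does not state, and planned_workers, launched_workers, and their parameters are hypothetical names. Per-table overrides and other planner inputs are ignored.

```c
#include <assert.h>
#include <limits.h>

/*
 * Plan-time estimate. ASSUMPTION: one worker once the table reaches
 * min_scan_pages, plus one more for every further tripling in size,
 * capped by the per-Gather limit.
 */
static int planned_workers(long heap_pages, int min_scan_pages,
                           int max_workers_per_gather)
{
    int workers = 0;

    assert(min_scan_pages > 0 && max_workers_per_gather >= 0);
    if (heap_pages >= min_scan_pages)
    {
        long threshold = min_scan_pages;

        workers = 1;
        while (heap_pages >= threshold * 3)
        {
            workers++;
            threshold *= 3;
            if (threshold > INT_MAX / 3)
                break;              /* guard the next multiplication */
        }
    }
    return workers < max_workers_per_gather ? workers : max_workers_per_gather;
}

/* Run-time outcome: never more than planned, possibly fewer. */
static int launched_workers(int planned, int free_slots)
{
    if (free_slots < 0)
        free_slots = 0;
    return planned < free_slots ? planned : free_slots;
}
```

The second function can only lower the first one's answer. Nothing in either stage looks at how the query is actually running, which is what "plan-time-static" means here.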
Distributed shared-nothing exchange
The DSC chapter frames the exchange operator on a shared-nothing
cluster where producers send tuples over the network. PostgreSQL’s
exchange is strictly single-node shared-memory: workers are local
processes and the transport is shared memory, never a socket. Cross-node
parallelism in the PostgreSQL ecosystem lives in extensions and forks
(Citus, Greenplum, postgres_fdw async append) layered above the core executor,
not in nodeGather.c. This is the cleanest illustration of scope: the
core implements exactly the multicore slice of the textbook model, and
defers the distributed slice entirely.
Frontier summary
The directions a modern reader should watch: (1) morsel-driven /
work-stealing schedulers to replace static num_workers; (2) vectorized
or fully compiled push pipelines to retire the tuple-at-a-time funnel;
(3) repartitioning exchanges so parallelism survives across operators;
(4) relaxing the read-only restriction (parallel DML; a parallel index
build already exists for B-tree, see postgres-nbtree.md); and (5)
NUMA-aware worker placement. PostgreSQL’s design is intentionally the simplest
correct point in this space — its conservatism (read-only, process-based,
static, single-node, funnel-per-Gather) is exactly what made parallel
query landable on a 30-year-old single-threaded executor without
rewriting it.
Sources
- Source tree (/data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94, observed 2026-06-05):
  - src/backend/executor/nodeGather.c: Gather node (lazy launch, gather_getnext / gather_readnext funnel, local-scan fallback).
  - src/backend/executor/nodeGatherMerge.c: order-preserving variant; binary-heap merge of worker streams.
  - src/backend/executor/execParallel.c: ExecInitParallelPlan, plan serialization, two-pass DSM sizing, tuple queues, ParallelQueryMain, finish/cleanup/reinitialize.
  - src/backend/access/transam/parallel.c: generic ParallelContext, InitializeParallelDSM, LaunchParallelWorkers, ParallelWorkerMain state restore, error message processing.
  - src/backend/access/transam/README.parallel: the design document (overview, error reporting, state sharing, transaction integration, coding conventions).
- Textbook (knowledge/research/dbms-general/database-system-concepts.md): Silberschatz, Korth & Sudarshan, Database System Concepts, 7e, ch. 22 “Parallel and Distributed Query Processing”: §22.1 (intra-/inter-query, intra-/inter-operation parallelism), §22.5 (parallel evaluation of query plans), §22.5.1 (interoperation: pipelined vs. independent parallelism), §22.5.2 (the exchange-operator model); ch. 15 §15.7.2 (pipelining, referenced by the parallel chapter).
- Foundational papers (see .omc/plans/postgres-paper-bibliography.md): G. Graefe, “Encapsulation of Parallelism in the Volcano Query Processing System,” SIGMOD 1990 (the exchange operator). G. Graefe & W. McKenna, “The Volcano Optimizer Generator,” ICDE 1993.
- Comparative / frontier literature (named in §6, not in the KB): Leis et al., “Morsel-Driven Parallelism,” SIGMOD 2014; Boncz, Zukowski & Nes, “MonetDB/X100: Hyper-Pipelining Query Execution,” CIDR 2005; Deshpande, Ives & Raman, “Adaptive Query Processing,” FnT Databases 2007.
- Cross-references (this KB): postgres-executor.md (the per-worker iterator model), postgres-shared-memory-ipc.md (DSM, shm_toc, shm_mq, background workers), postgres-planner-overview.md and postgres-path-generation.md (why a Gather and partial paths are produced), postgres-mvcc-snapshots.md (snapshot semantics restored in workers), postgres-xact.md (parallel-mode read-only enforcement), postgres-expression-eval.md (per-worker expression evaluation/JIT), postgres-postmaster.md / postgres-backend-lifecycle.md (the process-per-worker substrate).