PostgreSQL Asynchronous I/O — The PG18 AIO Subsystem, io_uring, and Read Streams
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
For most of its history PostgreSQL performed I/O the simplest way possible:
when a backend needed a page that was not in shared_buffers, it called
pread() and blocked until the kernel delivered the bytes. This worked far
better than it had any right to, for one reason — the operating system hides
storage latency on the backend’s behalf. The kernel’s own read-ahead notices
sequential access and prefetches the next few blocks into its page cache, so
the synchronous pread() often finds the data already resident and returns
almost immediately. PostgreSQL leaned on that, and on posix_fadvise() hints
(effective_io_concurrency) to nudge the kernel into prefetching for the
handful of access patterns — bitmap heap scans, mostly — where the engine knew
the future better than the kernel could guess.
The Architecture of a Database System survey (Hellerstein, Stonebraker &
Hamilton 2007; captured in dbms-papers/fntdb07-architecture.md) frames the
problem the textbook way: a DBMS is fundamentally a machine for overlapping
computation with I/O. A query that must read a million pages spends almost all
of its wall-clock time waiting for storage unless the system can have many
reads outstanding at once and do useful work — decode tuples, evaluate
predicates, build hash tables — while the disk arm (or the NVMe queue) is busy.
Synchronous blocking I/O makes that overlap impossible within a single
backend: the backend stops dead at every cache miss. PostgreSQL’s
process-per-connection model historically recovered some parallelism by running
many backends, but a single large scan in a single backend was latency-bound.
Two structural shifts made the old approach untenable. First, storage got
faster than a single blocking backend’s ability to feed it. On a modern NVMe array the device
can service hundreds of thousands of IOPS, but a single synchronous backend,
issuing one 8 KB read at a time and waiting for each, cannot keep the queue
deep enough to saturate it — the bottleneck moves from the device to the
round-trip latency of the syscall path. Second, buffered I/O itself became
the cost. Copying data from the kernel page cache into shared_buffers burns
CPU and double-buffers the data (once in the OS cache, once in Postgres). Direct
I/O (O_DIRECT) sidesteps both — DMA moves bytes straight from device to the
buffer pool, freeing the CPU — but, as the AIO README bluntly states, “Without
AIO, Direct IO is unusably slow for most purposes”: there is no kernel
read-ahead to hide latency, so every miss is a full storage round trip. Direct
I/O is therefore only viable once the engine itself issues explicit, deep,
asynchronous prefetches. AIO is the prerequisite for direct I/O, not a separate
feature.
Asynchronous I/O decouples issuing a read from consuming its result. The
backend submits a read, gets control back immediately, does other work (issues
more reads, processes already-resident pages), and only later waits for the
specific read it now needs. The classic abstraction is a pair of operations —
submit(io) returning a handle, and wait(handle) blocking until that IO
completes — plus a completion mechanism that updates shared state (marks a
buffer valid, verifies a checksum) when the bytes land. The engineering
difficulty is never the happy path; it is everything around it: bounding the
number of in-flight IOs, avoiding deadlock when shared resources (buffers) are
both the source and the target of concurrent asynchronous reads, and making
completion processing safe to run in a critical section (because WAL flushes,
which must run in critical sections, want to use AIO too).
Common DBMS Design
Database engines that have adopted asynchronous I/O converge on a recognizable set of building blocks, though they differ in mechanism.
An I/O request descriptor with a lifecycle. Every AIO design has some object — call it a request, a handle, or an IOCB — that represents one outstanding operation: which file, which offset, which buffer, how many bytes, and a state field tracking unsubmitted -> in flight -> complete. The descriptor must outlive the syscall that started it and must be reachable by whatever code processes the completion. Where the descriptor lives matters enormously in a process-per-connection engine like PostgreSQL: it cannot be a plain stack variable, because another process may need to complete the IO.
A pluggable transport. Portable engines abstract how the IO is actually
performed behind an interface, because the best mechanism is
platform-specific: Linux io_uring, POSIX AIO (aio_read), Windows overlapped
I/O / IOCP, a pool of synchronous worker threads, or plain blocking syscalls as
a fallback. The transport is a strategy object — a table of function pointers
for submit, wait, and initialization — selected at startup.
Completion handling decoupled from issuance. Because the issuer may be busy or blocked when the bytes arrive, mature designs let any worker drain completions, or offload completion to a dedicated thread/process. This is the crux of deadlock avoidance: if a backend prefetches ten pages and then blocks on a lock, those ten completions must still be processed — by someone — or the whole system can wedge. Engines solve this either by making completions globally drainable (io_uring: any backend can reap another’s ring) or by offloading the work entirely (worker pool: the worker that did the read also does the completion).
I/O combining and look-ahead. A read of one 8 KB page is wasteful when the
next four pages are also wanted; combining them into one vectored
preadv()/ring submission amortizes the per-syscall cost. The producer of “what
to read next” is usually decoupled from the consumer via a prefetch stream or
async scan iterator that looks a tunable distance ahead, merges adjacent
requests, and caps the number concurrently in flight. SQL Server’s read-ahead
manager, Oracle’s db_file_multiblock_read_count, and DB2’s prefetchers are all
variations on this theme.
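The combining step described above can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL code: `ReadRange` and `combine_blocks` are hypothetical names, and the sketch assumes the block numbers arrive in request order and that `combine_limit` is at least 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One combined read: nblocks contiguous blocks starting at start. */
typedef struct ReadRange
{
    uint32_t start;
    uint32_t nblocks;
} ReadRange;

/*
 * Merge requested block numbers into contiguous ranges, never letting a
 * range exceed combine_limit blocks. Returns the number of ranges written
 * to out[], which must have room for nblocks entries.
 */
size_t
combine_blocks(const uint32_t *blocks, size_t nblocks,
               uint32_t combine_limit, ReadRange *out)
{
    size_t nranges = 0;

    for (size_t i = 0; i < nblocks; i++)
    {
        ReadRange *cur = nranges > 0 ? &out[nranges - 1] : NULL;

        if (cur != NULL &&
            cur->nblocks < combine_limit &&
            cur->start + cur->nblocks == blocks[i])
        {
            cur->nblocks++;             /* contiguous: extend the pending read */
            continue;
        }
        out[nranges].start = blocks[i]; /* otherwise start a new read */
        out[nranges].nblocks = 1;
        nranges++;
    }
    return nranges;
}
```

With a limit of 4, the request sequence 10, 11, 12, 13, 14, 20, 21 becomes three reads: four blocks at 10, one block at 14, and two blocks at 20. Each range would map to one vectored read instead of one syscall per block.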
```c
// IoMethodOps: src/include/storage/aio_internal.h
// The pluggable-transport vtable: each io_method fills in a subset.
typedef struct IoMethodOps
{
    bool        wait_on_fd_before_close;
    size_t      (*shmem_size) (void);
    void        (*shmem_init) (bool first_time);
    void        (*init_backend) (void);
    bool        (*needs_synchronous_execution) (PgAioHandle *ioh);
    int         (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
    void        (*wait_one) (PgAioHandle *ioh, uint64 ref_generation);
} IoMethodOps;
```

The hardest design constraints are the ones the textbooks skip. Because
PostgreSQL is a multi-process server, AIO state must live in shared memory:
a handle initiated by backend A may be completed by backend B (when A is busy or
when io_uring lets B reap A’s ring), so the descriptor, its callbacks, and its
result cannot be process-local pointers. And because an EXEC_BACKEND build
re-maps each process’s code at a different address under ASLR, shared memory
cannot hold function pointers at all — a completion callback installed by
backend A would point at garbage in backend B. This single fact forces the
callback-by-integer-ID design that pervades the implementation.
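A small sketch may make the callback-by-ID idea concrete. Everything here (`DemoCallbackID`, `DemoHandle`, `demo_run_callback`) is hypothetical and only models the pattern: the shared record stores a one-byte ID, and each process resolves that ID through a static table compiled into its own image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback IDs; small integers are safe to keep in shared memory. */
typedef enum DemoCallbackID
{
    DEMO_CB_INVALID = 0,
    DEMO_CB_MARK_VALID,
    DEMO_CB_COUNT_BYTES,
    DEMO_CB_NUM
} DemoCallbackID;

typedef struct DemoHandle
{
    uint8_t callback_id;    /* stored in the shared record: an ID, not a pointer */
    int     result;         /* raw result of the IO */
    int     buffer_valid;
    long    bytes_read;
} DemoHandle;

typedef void (*DemoCallback) (DemoHandle *h);

static void
cb_mark_valid(DemoHandle *h)
{
    h->buffer_valid = (h->result >= 0);
}

static void
cb_count_bytes(DemoHandle *h)
{
    if (h->result > 0)
        h->bytes_read += h->result;
}

/*
 * Static data in every process image: each process maps the same ID to its
 * own, correctly relocated function address.
 */
static const DemoCallback demo_callbacks[DEMO_CB_NUM] = {
    [DEMO_CB_INVALID] = NULL,
    [DEMO_CB_MARK_VALID] = cb_mark_valid,
    [DEMO_CB_COUNT_BYTES] = cb_count_bytes,
};

/* Run the completion callback named by the handle; returns 0 on success. */
int
demo_run_callback(DemoHandle *h)
{
    if (h->callback_id == DEMO_CB_INVALID || h->callback_id >= DEMO_CB_NUM)
        return -1;
    demo_callbacks[h->callback_id] (h);
    return 0;
}
```

The design choice is that the table, not the handle, owns the function addresses, so a handle written by one process stays meaningful when another process completes it.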
PostgreSQL’s Approach
PostgreSQL 18 introduces asynchronous I/O as a self-contained subsystem under
src/backend/storage/aio/, layered so that most callers never touch the
low-level API at all. The whole subsystem is governed by one GUC,
io_method, with three values — sync, worker (the default), and
io_uring (Linux only, compiled in when liburing is present).
The AIO handle is the unit of work. A PgAioHandle (defined in
aio_internal.h) is a fixed-size shared-memory record. The pool is sized once
at startup: AioProcs() (every backend plus auxiliary processes) times
io_max_concurrency handles each. A backend acquires one with
pgaio_io_acquire(), which is guaranteed to succeed — if the backend’s
handles are all in use, it blocks waiting for one of its own in-flight IOs to
complete. To make that guarantee sound, the API enforces that a backend may
hold at most one handed-out, not-yet-staged handle at a time (handed_out_io); otherwise
a backend could exhaust its handles without any way to wait for them to free
up, and self-deadlock.
Definition is layered through the storage stack. The backend that acquires
the handle is usually not the code that defines the IO. For a shared-buffer
read, bufmgr.c acquires the handle and registers a completion callback, then
passes it down to smgr.c, which passes it to md.c, which translates the
block number into a segment file and offset and finally calls
pgaio_io_start_readv() in fd.c. Each layer on the way down can register its
own completion callback. This is how the AIO subsystem keeps layers ignorant of
each other: bufmgr does not know the IO goes through md, and md does not
know how to validate a page checksum — each contributes one callback that
understands only its own concern.
Eight states, one direction. A handle moves monotonically through the
states in PgAioHandleState: IDLE (in the backend’s free list) ->
HANDED_OUT (returned by acquire) -> DEFINED (operation associated) ->
STAGED (stage callbacks run, ready to submit) -> SUBMITTED (handed to the
kernel/worker) -> COMPLETED_IO (bytes landed, result known) ->
COMPLETED_SHARED (shared callbacks run — buffer marked valid) ->
COMPLETED_LOCAL (issuer’s local callbacks run), after which the handle’s
generation is bumped and it returns to IDLE for reuse.
```c
// PgAioHandleState: src/include/storage/aio_internal.h
typedef enum PgAioHandleState
{
    PGAIO_HS_IDLE = 0,
    PGAIO_HS_HANDED_OUT,        /* returned by pgaio_io_acquire() */
    PGAIO_HS_DEFINED,           /* pgaio_io_start_*() called */
    PGAIO_HS_STAGED,            /* stage() ran; ready to submit */
    PGAIO_HS_SUBMITTED,         /* given to the IO method */
    PGAIO_HS_COMPLETED_IO,      /* finished, result not yet processed */
    PGAIO_HS_COMPLETED_SHARED,  /* shared completion callbacks ran */
    PGAIO_HS_COMPLETED_LOCAL,   /* local completion callbacks ran */
} PgAioHandleState;
```

Wait references survive reuse. Because a handle is recycled the instant its
IO completes, you cannot wait on the handle itself — by the time you look, it
might already be servicing someone else’s read. Instead the issuer takes a
PgAioWaitRef before submitting, which packs the handle’s array index plus
its 64-bit generation counter. pgaio_wref_wait() resolves the reference and
checks the generation: if it has advanced, the IO you cared about is long done.
A wait reference can sit in shared memory and be waited on by any process.
Callbacks are integers, results are compact, errors are deferred. Completion
logic is registered as a PgAioHandleCallbackID — a one-byte index into a
static table aio_handle_cbs[] (e.g. PGAIO_HCB_SHARED_BUFFER_READV,
PGAIO_HCB_MD_READV). The indirection exists precisely because shared memory
cannot hold function pointers under EXEC_BACKEND ASLR. Callbacks run in
critical sections (so AIO is usable for WAL), which means they cannot
ereport(ERROR) — instead each callback “distills” the raw syscall result into
a compact PgAioResult (a status enum plus a few bits of error data). The
issuing backend later reads the result out of a backend-local PgAioReturn and,
if it indicates failure, calls pgaio_result_report() to raise the error in a
context where throwing is safe.
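The distill-now, report-later split can be sketched as two functions. This is an illustration only: `DemoResult`, `demo_distill`, and `demo_report` are hypothetical names, the bitfield widths are not PostgreSQL's, and the sketch assumes the raw result follows the kernel convention of a negative errno on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef enum DemoStatus
{
    DEMO_OK = 0,
    DEMO_PARTIAL,
    DEMO_ERROR
} DemoStatus;

/* Compact, copyable result; field widths are illustrative only. */
typedef struct DemoResult
{
    unsigned int status:3;
    unsigned int error_data:29;     /* here: an errno value */
    int32_t      result;            /* bytes transferred on success */
} DemoResult;

/* Completion side: must not raise errors, so it only records what happened. */
DemoResult
demo_distill(int raw_result, int32_t expected_bytes)
{
    DemoResult r = {DEMO_OK, 0, 0};

    if (raw_result < 0)
    {
        r.status = DEMO_ERROR;
        r.error_data = (unsigned int) -raw_result;  /* raw result is -errno */
    }
    else
    {
        r.result = raw_result;
        if (raw_result < expected_bytes)
            r.status = DEMO_PARTIAL;
    }
    return r;
}

/* Issuer side: runs later, where reporting is safe. Returns 0 if the IO was OK. */
int
demo_report(DemoResult r, char *msg, size_t msglen)
{
    if (r.status == DEMO_OK)
        return 0;
    if (r.status == DEMO_PARTIAL)
        snprintf(msg, msglen, "short read: %d bytes", (int) r.result);
    else
        snprintf(msg, msglen, "read failed: errno %u", (unsigned int) r.error_data);
    return -1;
}
```

Because the result is a small plain value, it can be copied from the shared handle into issuer-local memory before the handle is recycled, and the error can be raised whenever the issuer gets around to inspecting it.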
```c
// PgAioHandleFlags: src/include/storage/aio.h
typedef enum PgAioHandleFlags
{
    /* request synchronous execution even when AIO is configured */
    PGAIO_HF_SYNCHRONOUS = 1 << 0,
    /* references process-local memory; worker mode can't reopen it */
    PGAIO_HF_REFERENCES_LOCAL = 1 << 1,
    /* buffered (not direct) IO; io_uring may offload to its workers */
    PGAIO_HF_BUFFERED = 1 << 2,
} PgAioHandleFlags;
```

The read stream is the helper most callers actually use. Sequential scans,
VACUUM, ANALYZE, bitmap heap scans, and more do not call the AIO API
directly; they create a ReadStream (read_stream.c), hand it a callback that
yields successive block numbers, and pull pinned buffers out one at a time with
read_stream_next_buffer(). The stream looks ahead, merges adjacent blocks into
vectored reads up to io_combine_limit, keeps up to max_ios reads in flight,
and adapts its look-ahead distance to recent cache-hit history — shrinking to
1 when everything is cached (so it does no useless work) and doubling after each
real I/O. This is the bridge between the high-level “I want to read these
blocks” intent and the low-level AIO handle machinery, and it is where the
performance win actually shows up for ordinary queries.
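The adaptive look-ahead distance can be modelled with two tiny functions. This is a toy model under stated assumptions, not the read_stream.c logic: `DemoStream` and both functions are hypothetical, and the one-step decay on a cache hit is an assumption chosen to illustrate "shrinks toward 1"; only the doubling-with-clamp behaviour is taken from the description above.

```c
#include <assert.h>

typedef struct DemoStream
{
    int distance;       /* current look-ahead distance, in buffers */
    int max_distance;   /* stand-in for the stream's pin limit */
} DemoStream;

/* The consumer had to wait for a real I/O: ramp up quickly. */
void
demo_stream_on_io_wait(DemoStream *s)
{
    int d = s->distance * 2;

    s->distance = d < s->max_distance ? d : s->max_distance;
}

/* The block was already cached: decay gently, never below 1 (assumed policy). */
void
demo_stream_on_cache_hit(DemoStream *s)
{
    if (s->distance > 1)
        s->distance--;
}
```

Starting from a distance of 1 with a limit of 16, four consecutive waits reach the limit (2, 4, 8, 16), while a long run of cache hits walks the distance back down to 1.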
Source Walkthrough
The subsystem is small in line count but dense in invariants. We trace it
bottom-up: first the handle lifecycle (aio.c), then the three method
implementations (method_sync.c, method_worker.c, method_io_uring.c),
then the read-stream consumer that almost every caller actually uses
(read_stream.c). Buffer-pool integration (StartReadBuffer /
WaitReadBuffers, the PGAIO_HCB_SHARED_BUFFER_READV callback body) lives in
postgres-buffer-manager.md; segment-and-fd translation (md.c,
PGAIO_HCB_MD_READV) lives in postgres-smgr-md.md. Here we stay inside
src/backend/storage/aio/.
Acquiring a handle (aio.c)
pgaio_io_acquire() is the entry point and is guaranteed to return a handle:
if the backend’s free list is empty it calls pgaio_io_wait_for_free() to
reclaim one of its own in-flight IOs. The non-blocking variant
pgaio_io_acquire_nb() does the real work and enforces the two invariants
that make the guarantee sound. First, a backend may have at most
PGAIO_SUBMIT_BATCH_SIZE (32) IOs staged; if it is already at the cap it
flushes them with pgaio_submit_staged(). Second, a backend may hold at most
one handed-out handle (acquired but not yet staged) at a time: the
handed_out_io guard turns a second concurrent acquire into a hard elog(ERROR).
```c
// pgaio_io_acquire_nb: src/backend/storage/aio/aio.c
if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
{
    Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
    pgaio_submit_staged();
}

if (pgaio_my_backend->handed_out_io)
    elog(ERROR, "API violation: Only one IO can be handed out");
// ... pops from dclist idle_ios, moves IDLE -> HANDED_OUT ...
pgaio_io_update_state(ioh, PGAIO_HS_HANDED_OUT);
pgaio_my_backend->handed_out_io = ioh;
```

The handed_out_io rule is the linchpin of the README’s anti-self-deadlock
argument: because a backend can only ever be defining one IO at a time, it
can never paint itself into a corner where it holds N undefined handles with no
way to wait for any of them.
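The one-handed-out-handle rule is easy to model in isolation. The sketch below is hypothetical throughout (`DemoBackend`, `demo_acquire_nb`, `demo_stage`); it returns a status code where the real code raises an error, and it leaves out waiting for in-flight IOs entirely.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NHANDLES 4

typedef enum DemoState
{
    DEMO_IDLE = 0,
    DEMO_HANDED_OUT,
    DEMO_STAGED
} DemoState;

typedef struct DemoHandle
{
    DemoState state;
} DemoHandle;

typedef struct DemoBackend
{
    DemoHandle  handles[DEMO_NHANDLES];
    DemoHandle *handed_out;     /* at most one handle may occupy this slot */
} DemoBackend;

typedef enum DemoAcquireStatus
{
    DEMO_ACQ_OK,
    DEMO_ACQ_NONE_FREE,
    DEMO_ACQ_API_VIOLATION
} DemoAcquireStatus;

/* Non-blocking acquire: refuses a second handle while one is handed out. */
DemoAcquireStatus
demo_acquire_nb(DemoBackend *b, DemoHandle **out)
{
    *out = NULL;
    if (b->handed_out != NULL)
        return DEMO_ACQ_API_VIOLATION;
    for (int i = 0; i < DEMO_NHANDLES; i++)
    {
        if (b->handles[i].state == DEMO_IDLE)
        {
            b->handles[i].state = DEMO_HANDED_OUT;
            b->handed_out = *out = &b->handles[i];
            return DEMO_ACQ_OK;
        }
    }
    return DEMO_ACQ_NONE_FREE;  /* a caller would wait for one of its own IOs */
}

/* Staging releases the handed-out slot so another handle can be acquired. */
void
demo_stage(DemoBackend *b, DemoHandle *h)
{
    assert(b->handed_out == h);
    h->state = DEMO_STAGED;
    b->handed_out = NULL;
}
```

In this model every handle that is not idle is either the single handed-out one or already staged, which is the property the anti-self-deadlock argument relies on: anything the backend might need to wait for has been moved toward submission.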
Staging and the submit batch (aio.c)
Once bufmgr/md/fd have defined the operation and registered callbacks, the
handle is staged. pgaio_io_stage() runs the per-target stage callbacks
(the counterpart of completion callbacks — they pin resources, snapshot state),
moves the handle DEFINED -> STAGED, then decides whether the IO must run
synchronously (e.g. the method is sync, or the IO references process-local
memory). If not, the handle is appended to the backend’s staged_ios[] array
and — unless the caller explicitly opted into batching via
pgaio_enter_batchmode() — submitted immediately.
```c
// pgaio_io_stage: src/backend/storage/aio/aio.c
pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
pgaio_my_backend->handed_out_io = NULL;     /* allow a new IO to be staged */
pgaio_io_call_stage(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_STAGED);

needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
if (!needs_synchronous)
{
    pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
    if (!pgaio_my_backend->in_batchmode)
        pgaio_submit_staged();
}
```

pgaio_submit_staged() is where control crosses into the method vtable. It
wraps the call to pgaio_method_ops->submit() in a critical section — because
submission may itself need to complete earlier IOs (the WAL-in-critical-section
scenario from the README), and completion must be crash-safe.
```c
// pgaio_submit_staged: src/backend/storage/aio/aio.c
START_CRIT_SECTION();
did_submit = pgaio_method_ops->submit(pgaio_my_backend->num_staged_ios,
                                      pgaio_my_backend->staged_ios);
END_CRIT_SECTION();
pgaio_my_backend->num_staged_ios = 0;
```

Completion processing (aio.c)
No matter which method ran the IO, completion funnels through one function,
pgaio_io_process_completion(). It is always called inside a critical section
(Assert(CritSectionCount > 0)), drives the handle SUBMITTED -> COMPLETED_IO,
runs the shared completion callbacks (which update shared state visible to
every backend — marking a buffer valid, distilling errors into the
PgAioResult), moves to COMPLETED_SHARED, and broadcasts the handle’s
condition variable so waiters wake. Only if the issuing backend is the one
processing the completion does it go on to run local callbacks and reclaim the
handle.
```c
// pgaio_io_process_completion: src/backend/storage/aio/aio.c
Assert(ioh->state == PGAIO_HS_SUBMITTED);
Assert(CritSectionCount > 0);
ioh->result = result;
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
pgaio_io_call_complete_shared(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
ConditionVariableBroadcast(&ioh->cv);
if (ioh->owner_procno == MyProcNumber)
    pgaio_io_reclaim(ioh);      /* runs local callbacks too */
```

The shared/local split is exactly the EXEC_BACKEND constraint made concrete: the
shared callback (PGAIO_HCB_SHARED_BUFFER_READV) must run in whatever process
reaps the completion, so it touches only shared memory; the local callback runs
only in the issuer, so it may touch backend-local state. Both are named by
one-byte PgAioHandleCallbackID, never by pointer.
```mermaid
flowchart TD
    A["pgaio_io_acquire_nb()<br/>IDLE to HANDED_OUT"] --> B["bufmgr/smgr/md define IO<br/>register callbacks"]
    B --> C["pgaio_io_stage()<br/>stage cbs, DEFINED to STAGED"]
    C -->|sync needed| D["pgaio_io_perform_synchronously()"]
    C -->|async| E["staged_ios[]<br/>append, maybe batch"]
    E --> F["pgaio_submit_staged()<br/>CRIT: method->submit()"]
    F --> G["SUBMITTED<br/>kernel / worker owns it"]
    G --> H["completion reaped<br/>worker, io_uring drain, or sync"]
    H --> I["pgaio_io_process_completion()<br/>CRIT: shared cbs, COMPLETED_SHARED"]
    I --> J["ConditionVariableBroadcast(cv)<br/>waiters wake"]
    I -->|issuer == self| K["pgaio_io_reclaim()<br/>local cbs, bump generation, IDLE"]
    J -.->|"pgaio_wref_wait()"| K
```
Method 1 — sync (method_sync.c)
The sync method is the degenerate case and the safety net. Its submit
hook is never expected to run (the IO is executed inline during staging because
needs_synchronous_execution returns true), so there is nothing asynchronous
about it: each IO is performed with a blocking preadv()/pwritev() right where
it is issued. It exists for two reasons — debugging the upper layers without any
AIO machinery in the way, and providing a fallback on platforms or builds where
neither workers nor io_uring are usable. Because it has no out-of-process
completion, it sidesteps every deadlock concern by construction.
Method 2 — worker (method_worker.c, the default)
Worker mode dispatches IOs to a pool of dedicated B_IO_WORKER processes
(default io_workers = 3, hard cap MAX_IO_WORKERS = 32) over a shared-memory
ring, PgAioWorkerSubmissionQueue. Submission inserts each staged handle into
the queue under AioWorkerSubmissionQueueLock and wakes one idle worker by
setting its latch; any IOs that don’t fit in the queue are run synchronously by
the submitter, after dispatching the rest, to keep concurrency high.
```c
// pgaio_worker_submit_internal: src/backend/storage/aio/method_worker.c
LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
for (int i = 0; i < num_staged_ios; ++i)
{
    if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
    {
        /* queue full: fall back to synchronous, but only after dispatching */
        synchronous_ios[nsync++] = staged_ios[i];
        continue;
    }
    if (wakeup == NULL)
    {
        worker = pgaio_worker_choose_idle();
        if (worker >= 0)
            wakeup = io_worker_control->workers[worker].latch;
    }
}
LWLockRelease(AioWorkerSubmissionQueueLock);
if (wakeup)
    SetLatch(wakeup);
```

IoWorkerMain() is the per-worker loop. It consumes one queue entry under the
lock, opportunistically wakes up to IO_WORKER_WAKEUP_FANOUT (2) peers if the
queue is deep, re-opens the file descriptor (pgaio_io_reopen() — the worker is
a different process and does not share the issuer’s open fds), then performs the
IO synchronously. The completion runs in the worker, which is precisely how
worker mode satisfies the README’s anti-deadlock rule: even if the issuing
backend is blocked, the worker still completes the IO.
```c
// IoWorkerMain: src/backend/storage/aio/method_worker.c
LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
    io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
else
{
    io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
    nwakeups = Min(pgaio_worker_submission_queue_depth(),
                   IO_WORKER_WAKEUP_FANOUT);
    /* ... gather peer latches ... */
}
LWLockRelease(AioWorkerSubmissionQueueLock);
// ... later, for a real job:
HOLD_INTERRUPTS();
pgaio_io_reopen(ioh);
pgaio_io_perform_synchronously(ioh);    /* contains its own crit section */
RESUME_INTERRUPTS();
```

```mermaid
flowchart LR
    subgraph Issuer["issuing backend"]
        S["pgaio_worker_submit_internal()<br/>insert into ring"]
    end
    subgraph SHM["shared memory"]
        Q["PgAioWorkerSubmissionQueue<br/>(ring of io_handle indices)"]
        M["idle_worker_mask"]
    end
    subgraph Workers["B_IO_WORKER pool (io_workers, max 32)"]
        W1["IoWorkerMain #0<br/>reopen + preadv + complete"]
        W2["IoWorkerMain #1"]
        W3["IoWorkerMain #2"]
    end
    S -->|"queue_insert + SetLatch"| Q
    Q --> W1
    Q --> W2
    Q --> W3
    W1 -->|"pgaio_io_process_completion()"| SHM
    M -.->|"choose_idle"| S
```
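The queue-then-fall-back behaviour of worker submission can be sketched with a small ring of handle indices. This is a single-process model under stated assumptions: `DemoQueue` and the three functions are hypothetical, there is no locking or latch wake-up, and the ring size is an arbitrary power of two.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QUEUE_SIZE 4   /* must be a power of two for the index mask */

typedef struct DemoQueue
{
    uint32_t head;                  /* next slot to consume */
    uint32_t tail;                  /* next slot to fill */
    int      ios[DEMO_QUEUE_SIZE];  /* handle indices, not pointers */
} DemoQueue;

/* Returns 1 on success, 0 if the ring is full. */
int
demo_queue_insert(DemoQueue *q, int io_index)
{
    if (q->tail - q->head == DEMO_QUEUE_SIZE)
        return 0;
    q->ios[q->tail++ & (DEMO_QUEUE_SIZE - 1)] = io_index;
    return 1;
}

/* Returns the next handle index, or -1 if the ring is empty. */
int
demo_queue_consume(DemoQueue *q)
{
    if (q->head == q->tail)
        return -1;
    return q->ios[q->head++ & (DEMO_QUEUE_SIZE - 1)];
}

/*
 * Submit n IOs: queue what fits and collect the overflow in sync_ios[], so
 * the caller can run those itself once the rest have been dispatched.
 * Returns the number of overflow IOs.
 */
int
demo_submit(DemoQueue *q, const int *ios, int n, int *sync_ios)
{
    int nsync = 0;

    for (int i = 0; i < n; i++)
    {
        if (!demo_queue_insert(q, ios[i]))
            sync_ios[nsync++] = ios[i];
    }
    return nsync;
}
```

Submitting six IOs into a four-slot ring queues the first four and hands the last two back to the submitter. Deferring the synchronous work until after the loop means the workers can already be busy while the submitter performs its share.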
Method 3 — io_uring (method_io_uring.c, Linux)
io_uring mode gives each backend its own ring (PgAioUringContext, one per
pgaio_uring_procs() slot in shared memory) created with
io_uring_queue_init_mem() so the ring buffers live in the shared segment.
Submission fills a submission-queue entry per staged IO via
pgaio_uring_sq_from_io() and calls io_uring_submit(). A subtlety: for
buffered IO with several IOs already in flight, the code sets IOSQE_ASYNC
so the kernel offloads the page-cache copy to its own worker threads rather than
doing it inline (inline is lower-latency for the first few, but serializes the
copies under load).
```c
// pgaio_uring_submit: src/backend/storage/aio/method_io_uring.c
sqe = io_uring_get_sqe(uring_instance);
if (!sqe)
    elog(ERROR, "io_uring submission queue is unexpectedly full");
pgaio_io_prepare_submit(ioh);
pgaio_uring_sq_from_io(ioh, sqe);
if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
// ... then loop on io_uring_submit(), handling EINTR/EAGAIN ...
```

The crucial property is that any backend can drain another backend’s ring.
The ring is in shared memory, but the kernel’s CQ-reaping path is not
inherently multi-process-safe, so each context carries a completion_lock
LWLock; pgaio_uring_drain_locked() peeks a batch of CQEs, and for each one
recovers the PgAioHandle * stashed in the CQE user-data and routes it through
the same pgaio_io_process_completion() used by every method.
```c
// pgaio_uring_drain_locked: src/backend/storage/aio/method_io_uring.c
Assert(LWLockHeldByMeInMode(&context->completion_lock, LW_EXCLUSIVE));
orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
while (ready > 0)
{
    ncqes = io_uring_peek_batch_cqe(&context->io_uring_ring, cqes,
                                    Min(PGAIO_MAX_LOCAL_COMPLETED_IO, ready));
    ready -= ncqes;
    for (int i = 0; i < ncqes; i++)
    {
        struct io_uring_cqe *cqe = cqes[i];
        PgAioHandle *ioh = io_uring_cqe_get_data(cqe);

        io_uring_cqe_seen(&context->io_uring_ring, cqe);
        pgaio_io_process_completion(ioh, cqe->res);
    }
}
```

This is the other half of the anti-deadlock design: where worker mode
offloads completion, io_uring mode makes completion globally drainable —
backend B blocking on an IO that backend A issued simply takes A’s
completion_lock and reaps it itself. The PgAioUringContext comment states it
directly: “Multiple backends can process completions for this backend’s io_uring
instance … only a single backend gets io completions … at a time.”
Waiting: wait references (aio.c)
Because a handle is reclaimed and recycled the instant its IO completes, you
cannot safely wait on the handle pointer. The issuer captures a PgAioWaitRef
(index + 64-bit generation) before submitting; pgaio_wref_wait() resolves it
and returns immediately if the generation has moved on. The low-level
pgaio_io_wait() checks pgaio_io_was_recycled() — a generation comparison
behind a read barrier — so a stale wait is a cheap no-op rather than a
use-after-free.
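The index-plus-generation idea can be sketched as follows. This is a single-process illustration, not the aio.c implementation: `DemoWaitRef` and the helper functions are hypothetical, and the memory barriers a real multi-process version needs are deliberately omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NHANDLES 8

typedef struct DemoHandle
{
    uint64_t generation;    /* bumped every time the slot is recycled */
    int      in_flight;
} DemoHandle;

typedef struct DemoWaitRef
{
    uint32_t index;         /* which slot in the handle array */
    uint64_t generation;    /* the slot's generation when the ref was taken */
} DemoWaitRef;

static DemoHandle demo_handles[DEMO_NHANDLES];

/* Capture a reference to the IO currently occupying a slot. */
DemoWaitRef
demo_get_wref(uint32_t index)
{
    DemoWaitRef ref = {index, demo_handles[index].generation};

    return ref;
}

/* Completing an IO recycles the slot for its next user. */
void
demo_complete_and_recycle(uint32_t index)
{
    demo_handles[index].in_flight = 0;
    demo_handles[index].generation++;
}

/* Returns 1 if the referenced IO is already done (slot recycled), else 0. */
int
demo_wref_is_done(DemoWaitRef ref)
{
    return demo_handles[ref.index].generation != ref.generation;
}
```

A reference taken before completion keeps answering "done" even after the slot has been reused for an unrelated IO, because the comparison is against the generation it captured rather than against the slot's current contents.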
The read-stream consumer (read_stream.c)
Almost no caller touches the handle API directly. Sequential scans, VACUUM,
ANALYZE, bitmap heap scans, and more create a ReadStream via
read_stream_begin_relation() (or read_stream_begin_smgr_relation()), hand it
a ReadStreamBlockNumberCB that yields successive block numbers, and pull pinned
buffers out one at a time with read_stream_next_buffer(). The stream is the
piece that turns “I want these blocks” into deep, combined, concurrent AIO.
read_stream_look_ahead() is the engine. It runs while two budgets allow:
fewer than max_ios reads are in flight, and the number of pinned-or-pending
buffers is below the adaptive look-ahead distance. It merges a new block into
the pending read when it is contiguous, and starts the pending read (issuing the
vectored AIO) when it reaches io_combine_limit or can grow no further.
```c
// read_stream_look_ahead: src/backend/storage/aio/read_stream.c
while (stream->ios_in_progress < stream->max_ios &&
       stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
    // ...
    blocknum = read_stream_get_block(stream, per_buffer_data);
    if (blocknum == InvalidBlockNumber)
    {
        stream->distance = 0;   /* end of stream */
        break;
    }
    /* contiguous? merge into the pending vectored read */
    if (stream->pending_read_nblocks > 0 &&
        stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
    {
        stream->pending_read_nblocks++;
        continue;
    }
    /* non-contiguous: flush the pending read, then start a new one */
    while (stream->pending_read_nblocks > 0)
    {
        if (!read_stream_start_pending_read(stream) ||
            stream->ios_in_progress == stream->max_ios)
        {
            read_stream_unget_block(stream, blocknum);  /* rewind, stop */
            return;
        }
    }
    stream->pending_read_blocknum = blocknum;
    stream->pending_read_nblocks = 1;
}
```

read_stream_next_buffer() is the consumer side and where the distance
adapts. When it must actually wait for the IO behind the buffer it is about to
return (WaitReadBuffers), it doubles the look-ahead distance (clamped to
max_pinned_buffers) — so a stream that keeps hitting real I/O ramps its
prefetch depth up rapidly, while a stream finding everything in cache lets the
distance collapse toward 1 and does no wasted look-ahead.
```c
// read_stream_next_buffer: src/backend/storage/aio/read_stream.c
WaitReadBuffers(&stream->ios[io_index].op);
stream->ios_in_progress--;
if (++stream->oldest_io_index == stream->max_ios)
    stream->oldest_io_index = 0;

/* Look-ahead distance ramps up rapidly after we do I/O. */
distance = stream->distance * 2;
distance = Min(distance, stream->max_pinned_buffers);
stream->distance = distance;
```

Note batch_mode: when a stream’s callback is known to be deadlock-safe, the
stream brackets its look-ahead in pgaio_enter_batchmode() /
pgaio_exit_batchmode(), so several reads are staged and then submitted in one
pgaio_submit_staged() call — amortizing the submit cost. The default is to
submit each IO immediately (see pgaio_io_stage() above), precisely because
batching un-submitted IO is what can deadlock if the callback blocks.
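The difference between immediate submission and batch mode can be modelled with a counter of submit calls. This is a simplified sketch: all names are hypothetical, "submitting" only counts, and the flush-when-full check is placed in the staging function for brevity rather than mirroring where the real code performs it.

```c
#include <assert.h>

#define DEMO_BATCH_SIZE 32

typedef struct DemoBackend
{
    int staged[DEMO_BATCH_SIZE];
    int num_staged;
    int in_batchmode;
    int submit_calls;       /* how many times the "method" was invoked */
    int submitted_ios;      /* how many IOs those calls carried in total */
} DemoBackend;

/* Hand every staged IO to the method in one call. */
void
demo_submit_staged(DemoBackend *b)
{
    if (b->num_staged == 0)
        return;
    b->submit_calls++;
    b->submitted_ios += b->num_staged;
    b->num_staged = 0;
}

/* Stage one IO; outside batch mode it is submitted right away. */
void
demo_stage(DemoBackend *b, int io)
{
    if (b->num_staged == DEMO_BATCH_SIZE)
        demo_submit_staged(b);      /* staging array full: flush first */
    b->staged[b->num_staged++] = io;
    if (!b->in_batchmode)
        demo_submit_staged(b);
}

void
demo_enter_batchmode(DemoBackend *b)
{
    b->in_batchmode = 1;
}

void
demo_exit_batchmode(DemoBackend *b)
{
    b->in_batchmode = 0;
    demo_submit_staged(b);          /* flush whatever was deferred */
}
```

Three IOs staged outside batch mode cost three submit calls; the same three inside an enter/exit pair cost one. The price of the saving is that staged IOs stay unsubmitted until the batch ends, which is why batching is only safe when nothing in between can block on them.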
Position hints (as of 2026-06-06, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
pgaio_io_acquire | src/backend/storage/aio/aio.c | 162 |
pgaio_io_acquire_nb | src/backend/storage/aio/aio.c | 188 |
pgaio_io_get_wref | src/backend/storage/aio/aio.c | 366 |
pgaio_io_update_state | src/backend/storage/aio/aio.c | 386 |
pgaio_io_stage | src/backend/storage/aio/aio.c | 424 |
pgaio_io_prepare_submit | src/backend/storage/aio/aio.c | 510 |
pgaio_io_process_completion | src/backend/storage/aio/aio.c | 528 |
pgaio_io_was_recycled | src/backend/storage/aio/aio.c | 558 |
pgaio_io_wait | src/backend/storage/aio/aio.c | 579 |
pgaio_io_wait_for_free | src/backend/storage/aio/aio.c | 761 |
pgaio_wref_wait | src/backend/storage/aio/aio.c | 991 |
pgaio_submit_staged | src/backend/storage/aio/aio.c | 1123 |
PgAioWorkerSubmissionQueue (struct) | src/backend/storage/aio/method_worker.c | 55 |
pgaio_worker_submission_queue_insert | src/backend/storage/aio/method_worker.c | 181 |
pgaio_worker_submission_queue_consume | src/backend/storage/aio/method_worker.c | 202 |
pgaio_worker_submit_internal | src/backend/storage/aio/method_worker.c | 244 |
pgaio_worker_submit | src/backend/storage/aio/method_worker.c | 295 |
IoWorkerMain | src/backend/storage/aio/method_worker.c | 386 |
io_workers (GUC var) | src/backend/storage/aio/method_worker.c | 94 |
PgAioUringContext (struct) | src/backend/storage/aio/method_io_uring.c | 87 |
pgaio_uring_submit | src/backend/storage/aio/method_io_uring.c | 405 |
pgaio_uring_sq_from_io | src/backend/storage/aio/method_io_uring.c | (decl 59) |
pgaio_uring_drain_locked | src/backend/storage/aio/method_io_uring.c | 526 |
pgaio_uring_wait_one | src/backend/storage/aio/method_io_uring.c | 584 |
read_stream_get_block | src/backend/storage/aio/read_stream.c | 179 |
read_stream_start_pending_read | src/backend/storage/aio/read_stream.c | 230 |
read_stream_look_ahead | src/backend/storage/aio/read_stream.c | 429 |
read_stream_begin_relation | src/backend/storage/aio/read_stream.c | 746 |
read_stream_next_buffer | src/backend/storage/aio/read_stream.c | 800 |
PGAIO_SUBMIT_BATCH_SIZE (=32) | src/include/storage/aio_internal.h | 28 |
PgAioHandleState (enum) | src/include/storage/aio_internal.h | 43 |
IoMethodOps (vtable) | src/include/storage/aio_internal.h | 260 |
PgAioHandleCallbackID (enum) | src/include/storage/aio.h | 192 |
PgAioResult (struct) | src/include/storage/aio_types.h | 99 |
MAX_IO_WORKERS (=32) | src/include/storage/proc.h | 460 |
Source verification (as of 2026-06-05)
All claims below were checked against the working tree at
/data/hgryoo/references/postgres, branch REL_18_STABLE, commit
273fe94852b3a7e34fd171e8abdf1481beb302fa (PostgreSQL 18.x).
Verified facts
- Three io_methods, worker is the default. IoMethod in src/include/storage/aio.h enumerates IOMETHOD_SYNC, IOMETHOD_WORKER, and (guarded by IOMETHOD_IO_URING_ENABLED) IOMETHOD_IO_URING. The default is worker mode; io_workers = 3 is the default worker count (method_worker.c).
- At most one handed-out IO. pgaio_io_acquire_nb() (aio.c) raises elog(ERROR, "API violation: Only one IO can be handed out") when handed_out_io is already set (verified verbatim).
- Submit batch size is 32. PGAIO_SUBMIT_BATCH_SIZE is defined as 32 in aio_internal.h; staged_ios[PGAIO_SUBMIT_BATCH_SIZE] is the per-backend staging array.
- Eight-state handle machine. PgAioHandleState lists exactly PGAIO_HS_IDLE, _HANDED_OUT, _DEFINED, _STAGED, _SUBMITTED, _COMPLETED_IO, _COMPLETED_SHARED, _COMPLETED_LOCAL (8 values) in aio_internal.h.
- Completion runs in a critical section. pgaio_io_process_completion() asserts CritSectionCount > 0; pgaio_submit_staged() wraps pgaio_method_ops->submit() in START_CRIT_SECTION() / END_CRIT_SECTION(). Confirmed.
- Callbacks are integer IDs, not pointers. PgAioHandleCallbackID (aio.h) is an enum (PGAIO_HCB_INVALID, PGAIO_HCB_MD_READV, PGAIO_HCB_SHARED_BUFFER_READV, PGAIO_HCB_LOCAL_BUFFER_READV); the README states shared memory “currently cannot contain function pointers” under EXEC_BACKEND ASLR, motivating the indirection.
- PgAioResult is 8 bytes, bit-packed. aio_types.h defines it with bitfields id, status, error_data plus an int32 result, and a StaticAssertDecl(sizeof(PgAioResult) == 8, ...).
- Worker pool cap is 32. MAX_IO_WORKERS is #defined to 32 in src/include/storage/proc.h; IO_WORKER_WAKEUP_FANOUT is 2 in method_worker.c. Workers run as B_IO_WORKER (MyBackendType = B_IO_WORKER in IoWorkerMain).
- io_uring: one ring per backend, drainable by any backend. PgAioUringContext carries a per-ring completion_lock LWLock; its header comment confirms multiple backends may drain one ring under that lock. pgaio_uring_drain_locked() recovers the handle via io_uring_cqe_get_data() and routes through pgaio_io_process_completion().
- io_uring buffered-IO offload heuristic. pgaio_uring_submit() sets IOSQE_ASYNC only when in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED) (verified verbatim).
- Read-stream distance doubles after real I/O. read_stream_next_buffer() computes distance = stream->distance * 2; distance = Min(distance, stream->max_pinned_buffers) immediately after WaitReadBuffers(). Look-ahead is bounded by max_ios and the adaptive distance; merges are capped at io_combine_limit (DEFAULT_IO_COMBINE_LIMIT = Min(MAX_IO_COMBINE_LIMIT, (128*1024)/BLCKSZ) in bufmgr.h).
- “Direct IO is unusably slow without AIO.” Quoted from src/backend/storage/aio/README.md (Motivation), confirming AIO is the prerequisite for direct I/O rather than an independent feature.
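The 8-byte, bit-packed result layout noted in the list above can be illustrated with a minimal sketch. The type name SketchAioResult, the helper sketch_make_result(), and the three field widths are assumptions made for illustration; only the field names (id, status, error_data, result) and the 8-byte size check come from the verified facts.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed widths: the text only says the three bitfields plus an int32
 * must total 8 bytes, so any split that sums to 32 bits would do. */
#define SKETCH_ID_BITS      6
#define SKETCH_STATUS_BITS  3
#define SKETCH_ERROR_BITS   23

/* Hypothetical stand-in for PgAioResult; not the real definition. */
typedef struct SketchAioResult
{
    uint32_t id:SKETCH_ID_BITS;             /* which callback produced this */
    uint32_t status:SKETCH_STATUS_BITS;     /* outcome class */
    uint32_t error_data:SKETCH_ERROR_BITS;  /* callback-specific detail */
    int32_t  result;                        /* raw result of the operation */
} SketchAioResult;

/* Same idea as the StaticAssertDecl mentioned above: fail the build,
 * not the server, if the struct ever grows past 8 bytes. */
static_assert(sizeof(SketchAioResult) == 8, "result must stay 8 bytes");

/* Build a result, rejecting values that would not fit their bitfield. */
static SketchAioResult
sketch_make_result(unsigned id, unsigned status, unsigned error_data,
                   int32_t result)
{
    SketchAioResult r;

    assert(id < (1u << SKETCH_ID_BITS));
    assert(status < (1u << SKETCH_STATUS_BITS));
    assert(error_data < (1u << SKETCH_ERROR_BITS));

    r.id = id;
    r.status = status;
    r.error_data = error_data;
    r.result = result;
    return r;
}
```

Keeping the result this small means it can be copied by value between the shared handle and a backend-local variable, which is one plausible reason for the compile-time size check.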
Open questions / deferred
- The body of PGAIO_HCB_SHARED_BUFFER_READV (page-validity marking, checksum verification) and StartReadBuffer / WaitReadBuffers live in bufmgr.c and are covered in postgres-buffer-manager.md, not re-verified here.
- PGAIO_HCB_MD_READV and the block-to-segment translation it wraps live in md.c / smgr.c; see postgres-smgr-md.md.
- Exact sizing of the per-backend handle pool (AioProcs() * io_max_concurrency) is in aio_init.c / aio_funcs.c; the doc states the shape but the precise auxiliary-process accounting was not line-verified.
- io_uring capability probing (pgaio_uring_check_capabilities(), io_uring_queue_init_mem() vs io_uring_queue_init() fallback) is described at the design level only.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s AIO subsystem arrives late relative to other engines, and the design reflects lessons learned from watching them.
SQL Server has had an asynchronous, scatter/gather read-ahead manager since
the 1990s. Its buffer manager issues read-ahead based on the access pattern the
query processor declares (a range scan tells the storage engine the next N
pages it will want), and completions are serviced by I/O completion ports
(IOCP), where any thread in the pool may dequeue a completion. That is the
Windows analogue of the any-backend-may-drain completion model PostgreSQL
builds on top of io_uring. The key structural difference is threading: SQL
Server is a thread-per-task (or fiber) engine, so an outstanding read does not
block a worker the way a synchronous pread() blocks a PostgreSQL backend.
PostgreSQL’s process model is exactly why it needed shared-memory handles and
integer callback IDs: state that a threaded engine can keep in ordinary
process-private memory, visible to every thread, must here be placed where any
process can reach it.
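The integer-callback-ID indirection can be sketched as follows. This is a minimal illustration under stated assumptions, not PostgreSQL's code: every name here (SketchCallbackID, SketchSharedHandle, sketch_complete, and the two toy callbacks) is hypothetical. The point it shows is that the shared structure holds only a small integer, while each process resolves that integer through its own table of function pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback IDs; small integers are safe in shared memory. */
typedef enum SketchCallbackID
{
    SKETCH_CB_INVALID = 0,
    SKETCH_CB_DOUBLE,
    SKETCH_CB_NEGATE,
    SKETCH_CB_COUNT
} SketchCallbackID;

typedef int (*SketchCompleteFn)(int raw_result);

/* Toy completion callbacks standing in for real post-processing. */
static int sketch_double(int raw) { return raw * 2; }
static int sketch_negate(int raw) { return -raw; }

/* Per-process table. Function addresses may differ between processes
 * (for example under address-space randomization), so this table lives
 * in private memory and is never copied into the shared segment. */
static const SketchCompleteFn sketch_callbacks[SKETCH_CB_COUNT] = {
    [SKETCH_CB_INVALID] = NULL,
    [SKETCH_CB_DOUBLE]  = sketch_double,
    [SKETCH_CB_NEGATE]  = sketch_negate,
};

/* What would sit in shared memory: plain data only, no pointers. */
typedef struct SketchSharedHandle
{
    unsigned char callback_id;
    int           raw_result;
} SketchSharedHandle;

/* Any process can complete the handle by resolving the ID locally. */
static int
sketch_complete(const SketchSharedHandle *h)
{
    assert(h->callback_id > SKETCH_CB_INVALID);
    assert(h->callback_id < SKETCH_CB_COUNT);
    return sketch_callbacks[h->callback_id](h->raw_result);
}
```

Because the handle carries an ID rather than an address, the process that finishes an IO does not have to be the one that started it.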
Oracle exposes asynchronous I/O through DBWR/LGWR background processes
and db_file_multiblock_read_count for multi-block reads, the closest analogue
of PostgreSQL’s io_combine_limit. Oracle’s ASM and direct-path reads bypass
the OS cache much as PostgreSQL’s direct-I/O-plus-AIO does, and Oracle long ago
concluded what the PG18 README now states: direct I/O without engine-driven
prefetch is a performance trap.
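For a sense of scale on the PostgreSQL side, the default io_combine_limit formula quoted in the verified facts, Min(MAX_IO_COMBINE_LIMIT, (128*1024)/BLCKSZ), can be worked through in a few lines. The function name and the max_combine parameter are hypothetical; the real cap is a compile-time constant whose value is not stated in this document, so it is passed in as an argument here.

```c
#include <assert.h>

static int
sketch_min(int a, int b)
{
    return a < b ? a : b;
}

/*
 * blcksz:      page size in bytes (BLCKSZ in the formula).
 * max_combine: assumed stand-in for MAX_IO_COMBINE_LIMIT, in blocks.
 * Returns the default combine limit in blocks: a 128 kB target,
 * clamped to the cap.
 */
static int
sketch_default_io_combine_limit(int blcksz, int max_combine)
{
    assert(blcksz > 0 && max_combine > 0);
    return sketch_min(max_combine, (128 * 1024) / blcksz);
}
```

With 8192-byte pages the 128 kB target is 16 blocks, so the default is 16 whenever the cap is at least 16; with 32768-byte pages it drops to 4 blocks.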
The io_uring interface itself (Axboe, 2019) is the enabling technology, and
PostgreSQL’s adoption is a careful one. io_uring’s submission/completion ring
pair maps almost directly onto the submit()/drain model, but PG had to solve a
problem io_uring does not address: in a multi-process server, whose ring does a
completion land in, and who is allowed to reap it? The answer — one ring per
backend in shared memory, guarded by a per-ring completion_lock so any backend
can drain any ring — is PostgreSQL-specific glue around a Linux primitive. The
IOSQE_ASYNC heuristic for buffered I/O reflects a known io_uring sharp edge:
inline execution copies page-cache data on the submitting CPU, which serializes
under load, so PG offloads to kernel workers once a few IOs are already in
flight.
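The offload condition itself is small enough to restate as a predicate. The sketch below mirrors the expression quoted in the verified facts (in_flight_before > 4 together with the buffered flag); the function name and the SKETCH_HF_BUFFERED bit value are hypothetical, and the real code sets a flag on the submission queue entry rather than returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for PGAIO_HF_BUFFERED. */
#define SKETCH_HF_BUFFERED 0x01u

/*
 * Decide whether a submission should be pushed to kernel worker
 * threads instead of being attempted inline by the submitter.
 * Only buffered IO qualifies, and only once more than four IOs
 * were already in flight before this submission.
 */
static bool
sketch_should_force_async(int in_flight_before, unsigned flags)
{
    assert(in_flight_before >= 0);
    return in_flight_before > 4 && (flags & SKETCH_HF_BUFFERED) != 0;
}
```

The threshold keeps the first few reads cheap, since inline execution avoids a handoff, and switches to offloading only when the submitting CPU would otherwise become the place where page-cache copies queue up.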
Research frontiers. The textbook framing — a DBMS as a machine for
overlapping computation with I/O (Hellerstein, Stonebraker & Hamilton,
Architecture of a Database System, captured in
dbms-papers/fntdb07-architecture.md) — is now being pushed in two directions.
First, learned and adaptive prefetching: PostgreSQL’s read-stream distance
heuristic (double-on-miss, collapse-on-hit) is a hand-tuned controller, and
there is active research on replacing such controllers with models that predict
access patterns. Second, kernel-bypass and computational storage: SPDK-style
user-space NVMe drivers and smart SSDs that run predicate evaluation on the
device push the “overlap I/O with compute” idea past what a host-side AIO layer
can do. PostgreSQL’s pluggable IoMethodOps vtable is, deliberately, the seam
where such a method could one day be slotted in beside sync, worker, and
io_uring. The deadlock-avoidance contract (a method must either let any
backend complete an IO, or guarantee out-of-band completion) is the invariant
any future method must honor.
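The hand-tuned controller mentioned above can be sketched in a few lines. The doubling step and its cap at max_pinned_buffers follow the expression quoted in the verified facts; the hit path is an assumption, modeled here as a decay of one buffer per hit, and all names (SketchStream, sketch_on_io_wait, sketch_on_cache_hit) are hypothetical.

```c
#include <assert.h>

/* Hypothetical slice of read-stream state. */
typedef struct SketchStream
{
    int distance;           /* current look-ahead distance, in buffers */
    int max_pinned_buffers; /* hard cap on how far ahead we may pin */
} SketchStream;

/* After waiting on a real I/O: double the distance, clamped to the cap. */
static void
sketch_on_io_wait(SketchStream *s)
{
    int distance = s->distance * 2;

    if (distance > s->max_pinned_buffers)
        distance = s->max_pinned_buffers;
    s->distance = distance;
}

/* Assumed hit behavior: shrink by one, never below a distance of 1. */
static void
sketch_on_cache_hit(SketchStream *s)
{
    if (s->distance > 1)
        s->distance--;
}
```

A learned prefetcher would replace these two functions with a model-driven update while keeping the same cap, which is what makes the controller a natural point of comparison for that line of research.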
Sources
- Code (REL_18_STABLE, commit 273fe94):
  - src/backend/storage/aio/aio.c: handle lifecycle, staging, submission, completion, wait references.
  - src/backend/storage/aio/aio_callback.c, aio_target.c, aio_io.c: callback dispatch, target abstraction, IO-op definition.
  - src/backend/storage/aio/aio_init.c, aio_funcs.c: shared-memory sizing and SQL-visible introspection.
  - src/backend/storage/aio/method_sync.c: synchronous fallback method.
  - src/backend/storage/aio/method_worker.c: worker pool, submission queue, IoWorkerMain, B_IO_WORKER.
  - src/backend/storage/aio/method_io_uring.c: per-backend rings, submit, drain, wait_one.
  - src/backend/storage/aio/read_stream.c: look-ahead, IO combining, adaptive distance, the consumer API.
  - src/backend/storage/aio/README.md: motivation, deadlock/starvation design criteria, EXEC_BACKEND callback-ID rationale.
  - src/include/storage/aio.h, aio_internal.h, aio_types.h: public API, IoMethodOps vtable, PgAioHandleState, PgAioResult, callback IDs.
  - src/include/storage/proc.h: MAX_IO_WORKERS.
  - src/include/storage/bufmgr.h: io_combine_limit, MAX_IO_COMBINE_LIMIT.
- Theory: Hellerstein, Stonebraker & Hamilton, Architecture of a Database System (FnTDB 2007), on the overlap of computation and I/O; captured locally in knowledge/research/dbms-papers/fntdb07-architecture.md.
- Related KB docs: postgres-buffer-manager.md (shared-buffer completion callbacks, StartReadBuffer / WaitReadBuffers), postgres-smgr-md.md (segment/fd translation, PGAIO_HCB_MD_READV), postgres-shared-memory-ipc.md (shared-memory layout, LWLocks), postgres-aux-processes.md (B_IO_WORKER in the auxiliary-process taxonomy), postgres-checkpoint.md and postgres-xlog-wal.md (AIO for WAL/data writes).