PostgreSQL I/O — From Synchronous Buffered Reads to Asynchronous I/O

Why this subsystem had to evolve (the original limitation)

For most of its life PostgreSQL read data the simplest possible way. When a backend needed a page that was not already cached in shared_buffers, it called ReadBuffer, the buffer manager allocated a victim frame, and the storage manager issued a single blocking pread(2) against the relation segment file. The backend then stopped — descheduled by the kernel — until the read completed and the page bytes were in the frame. One miss, one syscall, one stall. Mechanism for the current cached path is in postgres-buffer-manager.md.

This design is correct, portable, and easy to reason about, and it was good enough for two decades because of a hidden subsidy: the operating system kernel. Linux (and every other supported OS) does its own readahead: when it sees sequential pread offsets it speculatively pulls the next chunks into the page cache, so a sequential scan rarely blocks on physical I/O even though PostgreSQL issues purely synchronous reads. PostgreSQL effectively outsourced I/O concurrency to the kernel’s heuristics.

The subsidy breaks down precisely where databases hurt most:

  • Random I/O. A bitmap heap scan, an index scan, or redo replay touches scattered blocks. Kernel readahead cannot guess a random access pattern, so every miss is a full round-trip to storage with the backend idle the entire time. On a device with 100 µs–10 ms latency and deep queues, a single-threaded chain of synchronous random reads leaves almost all of the device’s potential throughput on the floor: you are paying latency serially when the hardware could service dozens of requests in parallel.
  • One backend = one outstanding I/O. Even on sequential scans, a process blocked in pread has exactly one request in flight. Modern NVMe wants a queue depth in the tens or hundreds to reach peak bandwidth. A synchronous backend can never fill that queue.
  • Recovery is single-threaded. The startup process replays WAL records one at a time, and a record that modifies a not-cached page must read it first — synchronously. Replay then alternates “read a block (stall), apply the change (cheap CPU)” and is dominated by read latency. The redo loop is described in postgres-recovery-redo.md.
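
The cost of serial latency in the bullets above can be made concrete with a back-of-envelope bound. This is only a sketch using Little's law: it assumes the device holds a constant per-request latency L at the given queue depth N, which real hardware does only up to a point, and the figures (100 µs latency, 8 KiB reads, depth 32) are illustrative picks from the ranges mentioned above rather than measurements.

```latex
\[
  X \approx \frac{N}{L}
  \qquad\text{($X$: reads per second, $N$: requests in flight, $L$: latency per request)}
\]
\[
  N = 1:\quad X \approx \frac{1}{100\,\mu\mathrm{s}} = 10{,}000\ \text{reads/s}
  \approx 78\ \text{MiB/s at 8 KiB per read}
\]
\[
  N = 32:\quad X \approx \frac{32}{100\,\mu\mathrm{s}} = 320{,}000\ \text{reads/s}
  \approx 2.4\ \text{GiB/s at 8 KiB per read}
\]
```

Under these assumptions a synchronous backend is pinned to the first line no matter how fast the device is, because it can never raise N above 1.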

The fix the database wants is conceptually simple: start the read before you need the result, so storage latency overlaps with useful CPU work or with other in-flight reads. That is asynchronous I/O. But true async I/O is hard to do portably and hard to thread through a buffer pool whose entire contract was “ReadBuffer gives you a pinned, valid page.” So PostgreSQL got there in stages, each one buying a slice of the benefit while the harder plumbing matured underneath. The next four eras trace that climb; the mechanism endpoints live in postgres-aio.md.

timeline
    title PostgreSQL I/O — synchronous reads to asynchronous I/O
    section Pre-9.0 (baseline)
        Synchronous buffered reads : ReadBuffer -> blocking pread(2)
                                    : one miss = one stall
                                    : kernel readahead is the only concurrency
    section 9.0 (2010)
        posix_fadvise prefetch : PrefetchBuffer() added to bufmgr
                               : POSIX_FADV_WILLNEED hint to kernel
                               : wired into bitmap heap scans
                               : gated by effective_io_concurrency
    section 15 (2022)
        WAL prefetch in recovery : xlogprefetcher.c look-ahead
                                  : fadvise for blocks redo will touch
                                  : recovery_prefetch + maintenance_io_concurrency
    section 17 (2024)
        Read stream abstraction : read_stream.c producer/consumer
                                : block-number callback -> vectored reads
                                : io_combine_limit coalescing
                                : adaptive look-ahead distance
                                : retrofit seq scan / ANALYZE / VACUUM
    section 18 (2025)
        Asynchronous I/O subsystem : aio.c + PgAioHandle state machine
                                   : IoMethodOps vtable sync/worker/io_uring
                                   : B_IO_WORKER processes (default)
                                   : io_uring rings per backend (Linux)
                                   : read_stream on true async completion

Era 0 — Synchronous buffered reads (the long baseline, pre-9.0)

What it was. The read path that PostgreSQL carried essentially unchanged from its earliest releases through the 8.x line. The shape is still the fallback shape in REL_18, so it is worth pinning down precisely because every later era is defined as a departure from it.

A backend that needs block N of a relation fork calls ReadBuffer (today ReadBufferExtended / ReadBuffer_common in src/backend/storage/buffer/bufmgr.c). The buffer manager:

  1. computes the buffer tag (rel, fork, block) and probes the partitioned buffer-mapping hash;
  2. on a hit, pins the existing frame and returns — no I/O;
  3. on a miss, runs clock-sweep to select a victim frame (GetVictimBuffer), evicts it (flushing first if dirty, honoring the WAL-before-flush rule), and then issues the read.

The read itself is a single blocking call down the storage stack: smgrread -> mdread -> FileRead -> pread(2) against the segment file descriptor (see postgres-buffer-manager.md for the pin/evict mechanics). The calling process is parked by the kernel scheduler until pread returns the 8 KB page. There is no overlap: the backend does nothing else while the read is outstanding, and it has exactly one read outstanding.
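
To make the bottom of that call chain concrete, here is a minimal sketch of a blocking single-block read. It is not PostgreSQL code: the function name, the SKETCH_BLCKSZ constant, and the 0 / -1 return convention are assumptions made for illustration, and the real path goes through the smgr and fd.c layers rather than calling pread directly.

```c
#define _POSIX_C_SOURCE 200809L     /* exposes pread */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192          /* assumption: the default 8 KB page */

/*
 * Read block `blkno` of the file behind `fd` into `frame`.  The caller
 * blocks until the kernel has copied the whole page.  Returns 0 on success
 * and -1 on an error or short read (hypothetical convention).
 */
static int
sketch_read_block(int fd, uint32_t blkno, char *frame)
{
    off_t   offset = (off_t) blkno * SKETCH_BLCKSZ;
    ssize_t nread;

    assert(frame != NULL);

    /* One miss, one syscall, one stall: nothing else is in flight here. */
    nread = pread(fd, frame, SKETCH_BLCKSZ, offset);

    return (nread == SKETCH_BLCKSZ) ? 0 : -1;
}
```

A scan built on this helper would call it once per block in a loop, which is exactly the serial shape the later eras set out to break.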

Why it survived so long. Two reasons. First, the buffer pool absorbs the working set — a warm cache means most ReadBuffer calls are hits and never touch storage at all. Second, for the misses that remain on sequential access, the OS page cache and kernel readahead quietly do the prefetching: by the time PostgreSQL’s serial pread arrives, the next blocks are frequently already in the kernel cache. PostgreSQL got asynchronous-like behavior for sequential reads for free, without writing a line of async code.

Where it broke. The structural limits are exactly those listed in the “Why this had to evolve” section — random I/O defeats kernel readahead, one process can keep only one request in flight, and recovery is a serial read-then-apply loop. The effective_io_concurrency knob did not yet exist; there was simply no mechanism in the engine to express “I will need these other blocks soon.”

Structural shape.

flowchart LR
    Q[Backend executing<br/>a scan] --> RB[ReadBuffer block N]
    RB -->|cache miss| SMGR[smgrread -> mdread]
    SMGR --> PR[pread block N<br/>BLOCKING]
    PR -. backend descheduled,<br/>idle, nothing else<br/>in flight .-> PR
    PR --> RET[page returned]
    RET --> Q2[apply / scan block N]
    Q2 --> RB2[ReadBuffer block N+1<br/>... repeat serially]

This baseline still exists at REL_18 — it is what io_method=sync reduces to, and it is the path taken whenever no read stream or prefetch is involved. Everything after this era is about issuing the next read before the current block is consumed.

Era 1 — posix_fadvise prefetch for bitmap heap scans (9.0)

What changed. PostgreSQL 9.0 added the first in-engine mechanism to say “I will need this block soon”: PrefetchBuffer() in src/backend/storage/buffer/bufmgr.c. It does not read the page into shared_buffers. Instead it performs the same tag lookup, and on a miss calls down through smgrprefetch -> mdprefetch -> FilePrefetch, which on platforms that support it issues posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED) (still visible in src/backend/storage/file/fd.c). That advisory call tells the kernel “pull this block into your page cache”; the kernel starts the physical read asynchronously while the backend keeps working. When the backend later issues the real synchronous ReadBuffer for that block, the data is (hopefully) already in the OS cache, so the pread returns almost immediately.

Why. The motivating workload was the bitmap heap scan. A bitmap index scan produces a sorted bitmap of heap blocks to visit — a known, finite, random set of block numbers. That is the perfect case for prefetch: the executor knows in advance exactly which blocks it will touch, but they are scattered, so kernel readahead is useless. By walking ahead in the bitmap and firing PrefetchBuffer on upcoming blocks while processing the current one, the scan can keep several reads in flight on the device at once instead of stalling on each block serially.
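
The walk-ahead pattern can be sketched in a few lines. This is an illustration under assumptions, not the executor's code: the function name, the `distance` parameter (standing in for the prefetch depth), and the `hints_issued` counter are hypothetical, and the sketch assumes a platform that provides posix_fadvise (Linux, for example).

```c
#define _POSIX_C_SOURCE 200809L     /* exposes pread and posix_fadvise */

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192          /* assumption: the default 8 KB page */

/*
 * Visit `nblocks` block numbers in order, hinting up to `distance` blocks
 * ahead of each blocking read.  Returns nblocks on success, -1 if a read
 * fails.  If `hints_issued` is not NULL it receives the number of hints sent.
 */
static int
sketch_prefetching_scan(int fd, const uint32_t *blocks, int nblocks,
                        int distance, char *frame, int *hints_issued)
{
    int next_hint = 0;      /* first index that has not been hinted yet */
    int hints = 0;
    int rc = nblocks;

    assert(nblocks >= 0 && distance >= 0);

    for (int i = 0; i < nblocks; i++)
    {
        /* Never hint the block we are about to read synchronously. */
        if (next_hint <= i)
            next_hint = i + 1;

        /* Top the window up to `distance` blocks ahead of position i. */
        while (next_hint < nblocks && next_hint <= i + distance)
        {
            /* Advisory only: the result says nothing about completion. */
            (void) posix_fadvise(fd,
                                 (off_t) blocks[next_hint] * SKETCH_BLCKSZ,
                                 SKETCH_BLCKSZ, POSIX_FADV_WILLNEED);
            next_hint++;
            hints++;
        }

        /* The real read is still a blocking pread, as in Era 0. */
        if (pread(fd, frame, SKETCH_BLCKSZ,
                  (off_t) blocks[i] * SKETCH_BLCKSZ) != SKETCH_BLCKSZ)
        {
            rc = -1;
            break;
        }
    }

    if (hints_issued != NULL)
        *hints_issued = hints;
    return rc;
}
```

Note that the hint's return value is discarded: the caller has nothing to wait on and only finds out whether the hint helped when the later pread is fast or slow.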

The knob. 9.0 introduced effective_io_concurrency, the GUC that caps how many blocks ahead a bitmap heap scan will prefetch — i.e. a hint of the storage’s effective queue depth. A value of 1 means “one prefetch ahead”; higher values let more reads overlap, which matters most on RAID arrays and SSDs that service many concurrent requests. The variable is still declared in bufmgr.c:

int effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;

A sibling, maintenance_io_concurrency, was added later for maintenance paths (vacuum-style work) so they can use a different depth than user queries.

The structural shape shift: hint, don’t own. This is the crucial property of Era 1 and Era 2 — PostgreSQL still does not own the asynchronous read. It only hints to the kernel, which owns the page cache and the actual I/O. The engine has no handle on the in-flight read, no way to wait specifically for it, no way to know if it failed; it can only fire-and-forget a WILLNEED and hope the page is warm by the time the real read arrives. The completion path is unchanged from Era 0: a blocking pread, just one that now usually hits the OS cache.

flowchart LR
    subgraph Era0[Era 0: synchronous]
      A0[scan block N] --> R0[ReadBuffer N<br/>blocking pread] --> A0b[apply N]
      A0b --> A0c[ReadBuffer N+1<br/>blocking pread]
    end
    subgraph Era1[Era 1: fadvise prefetch]
      A1[scan block N] --> P1[PrefetchBuffer N+1..N+k<br/>fadvise WILLNEED]
      P1 --> RA1[ReadBuffer N<br/>likely OS-cache hit]
      RA1 --> A1b[apply N]
      P1 -. kernel pulls<br/>N+1..N+k in<br/>background .-> KC[(OS page cache)]
      KC -. warm .-> RA1
    end

Limits carried forward. Because prefetch is advisory, its benefit is unpredictable: the kernel may evict the prefetched page before the real read, the hint is a no-op on platforms without posix_fadvise, and a double read can occur (kernel reads it, evicts it, reads it again). It also only ever helped bitmap heap scans in 9.0 — sequential scans, index scans, ANALYZE and VACUUM still ran fully synchronous. Generalizing prefetch to those paths is exactly what Eras 3 and 4 do, after first extending the recovery path in Era 2. Current prefetch entry points are described in postgres-buffer-manager.md.

Era 2 — WAL prefetch during recovery (15)

What changed. PostgreSQL 15 brought the prefetch idea to the place it hurt most after random user queries: crash recovery, PITR, and standby replay. The new machinery lives in src/backend/access/transam/xlogprefetcher.c (the XLogPrefetcher and its LsnReadQueue). Mechanism is documented in postgres-recovery-redo.md; here we trace why it appeared and what shape it added.

Why. Redo is single-threaded and read-bound. The startup process decodes one WAL record, finds the block(s) it modifies, reads any not-cached block synchronously, applies the change, and moves on. Each not-cached block is a serial stall, and unlike a bitmap scan there was historically no look-ahead at all: the startup process learned which block it needed only at the moment it needed it. On a standby trying to keep up with a busy primary, or during a long crash-recovery window, this serial read-then-apply loop is the bottleneck that determines how fast the database comes back or how far a replica lags.

How. The WAL stream is itself a perfect prefetch oracle: it is a record of exactly which blocks will be touched, in order, in the near future. The prefetcher exploits this by decoding records ahead of the replay position (a look-ahead window over the not-yet-applied WAL), extracting the block references each upcoming record carries, and issuing read hints for those blocks so the kernel starts pulling them into cache while the startup process is still applying earlier records. It maintains a filter so it does not prefetch blocks that a record is about to create (e.g. relation extensions or full-page images that overwrite the whole page), which would be wasted I/O.
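
The look-ahead-with-filter idea can be sketched as a pure function over already-decoded block references. Everything here is hypothetical: the SketchBlockRef type, the `will_init` flag (standing in for "this record creates or fully overwrites the page"), and the skip-if-just-hinted check are simplifications chosen for illustration, not the structures in xlogprefetcher.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One block reference carried by a not-yet-replayed WAL record. */
typedef struct SketchBlockRef
{
    uint32_t blkno;
    bool     will_init;     /* record overwrites the whole page */
} SketchBlockRef;

/*
 * Starting at `replay_pos`, scan ahead and collect up to `max_ahead` block
 * numbers worth hinting.  `out` must have room for `max_ahead` entries.
 * Returns how many were collected.
 */
static int
sketch_collect_hints(const SketchBlockRef *refs, int nrefs,
                     int replay_pos, int max_ahead, uint32_t *out)
{
    int nout = 0;

    assert(replay_pos >= 0 && max_ahead >= 0);

    for (int i = replay_pos; i < nrefs && nout < max_ahead; i++)
    {
        /* Reading a page that is about to be overwritten is wasted I/O. */
        if (refs[i].will_init)
            continue;

        /* Skip a block that was hinted immediately before this one. */
        if (nout > 0 && out[nout - 1] == refs[i].blkno)
            continue;

        out[nout++] = refs[i].blkno;
    }
    return nout;
}
```

The bound on `max_ahead` plays the role the text assigns to maintenance_io_concurrency: it keeps the producer a limited distance in front of the replay position.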

The knobs. Two GUCs gate it, both still present in REL_18:

  • recovery_prefetch: try (default), on, or off. The header in xlogprefetcher.c shows the gating predicate combining it with the concurrency knob.
  • maintenance_io_concurrency: reused here as the look-ahead depth; the prefetcher stays no further ahead than this many concurrent reads.

/* from xlogprefetcher.c */
int recovery_prefetch = RECOVERY_PREFETCH_TRY;
/* ... enabled when recovery_prefetch != OFF && maintenance_io_concurrency > 0 */

A new view, pg_stat_recovery_prefetch, exposes counters (prefetch, hit, skip_*) so operators can see whether the look-ahead is actually saving reads.

Structural shape shift. Era 2 uses the same hint, don’t own model as Era 1 (it still issues advisory prefetch hints rather than owning real async reads), but it generalizes two things. First, it moves prefetch out of the executor and into the recovery/redo subsystem, proving the pattern beyond user queries. Second, and more importantly for the arc, it introduces an explicit, reusable look-ahead engine (LsnReadQueue): a producer that decodes future work and a consumer that applies it, with a bounded distance between them. That producer/consumer-with-bounded-distance structure is precisely the shape that PG17 will generalize into the read stream. WAL prefetch is, in retrospect, the dress rehearsal for the read stream abstraction.

Limits carried forward. Two big ones. (1) It is still kernel-advisory prefetch — no handle, no real completion, the same unpredictability as Era 1. (2) Its look-ahead engine is bespoke to recovery; the executor’s sequential and index scans still had no equivalent. Both limits fall in the next two eras.

Era 3 — The read stream abstraction (17)

What changed. PostgreSQL 17 introduced read_stream.c (in the new src/backend/storage/aio/ directory) — a reusable producer/consumer helper that finally generalized “look ahead and prefetch” out of the two bespoke call sites (bitmap scan, recovery) into an abstraction any scan could adopt. Full mechanism is in postgres-aio.md; the arc point is that PG17 separated what blocks will be needed from how they are read.

The shape. A consumer creates a stream with read_stream_begin_relation() (or _begin_smgr_relation()), passing a callback that, each time it is invoked, returns the next block number the consumer will want (or InvalidBlockNumber to end the stream). The consumer then just calls read_stream_next_buffer() in a loop and gets back pinned buffers in order — exactly the ReadBuffer contract, but with the reads already overlapped. Inside, the stream:

  • runs the callback ahead of consumption to discover upcoming blocks, keeping a bounded look-ahead distance (the same producer/consumer-with-bounded-distance structure the WAL prefetcher pioneered);
  • coalesces neighbours into a single vectored read up to io_combine_limit (a new PG17 GUC, default 128 KB), so a run of adjacent blocks becomes one preadv-style I/O instead of many;
  • caps concurrency at a max_ios derived from effective_io_concurrency / maintenance_io_concurrency, so several reads can be in flight at once;
  • adapts the look-ahead distance to recent hit/miss history: if blocks keep hitting shared_buffers it shrinks the window (no point prefetching cached data); if it keeps missing it grows the window toward max_ios.

/* read_stream.h — the consumer-facing API (REL_18) */
extern ReadStream *read_stream_begin_relation(int flags, ...);
extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_data);
#define READ_STREAM_SEQUENTIAL 0x02 /* hint: sequential access */
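
Two of the stream's internal behaviours, coalescing and the adaptive look-ahead distance, are easy to illustrate in isolation. The following is a sketch under assumptions, not read_stream.c: the type and function names are hypothetical, the combine limit is expressed in blocks rather than bytes, and the grow-by-doubling / shrink-by-one policy is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One vectored read: `nblocks` consecutive blocks starting at `start`. */
typedef struct SketchRead
{
    uint32_t start;
    uint32_t nblocks;
} SketchRead;

/*
 * Merge runs of consecutive block numbers into reads of at most
 * `combine_limit` blocks.  `out` must have room for `n` entries.
 * Returns the number of reads produced.
 */
static int
sketch_coalesce(const uint32_t *blocks, int n, uint32_t combine_limit,
                SketchRead *out)
{
    int nout = 0;

    assert(combine_limit >= 1);

    for (int i = 0; i < n; i++)
    {
        SketchRead *last = (nout > 0) ? &out[nout - 1] : NULL;

        if (last != NULL &&
            last->nblocks < combine_limit &&
            last->start + last->nblocks == blocks[i])
            last->nblocks++;            /* extends the pending read */
        else
        {
            out[nout].start = blocks[i];
            out[nout].nblocks = 1;
            nout++;
        }
    }
    return nout;
}

/*
 * New look-ahead distance after one block: shrink on a cache hit, grow on
 * a miss, never leaving the range 1..max_distance.
 */
static int
sketch_next_distance(int distance, int was_hit, int max_distance)
{
    assert(distance >= 1 && max_distance >= 1);

    if (was_hit)
        return (distance > 1) ? distance - 1 : 1;
    return (distance * 2 > max_distance) ? max_distance : distance * 2;
}
```

With 8 KB pages, the default io_combine_limit of 128 KB would correspond to a `combine_limit` of 16 blocks in this sketch.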

Why an abstraction, not another call site. By PG16 the codebase had two independent prefetch mechanisms (executor bitmap prefetch; recovery WAL prefetch) and a long list of scans that had none (sequential scan, ANALYZE’s sampling, VACUUM’s heap pass). Hand-coding fadvise look-ahead into each was unmaintainable and still left the “hint, don’t own” weakness. A single helper that owns look-ahead, coalescing, and concurrency lets every consumer benefit from one well-tuned engine — and, critically, gives the project one chokepoint to re-platform onto real async I/O later. That is the strategic move: PG17’s read stream still ran on the old prefetch-then-synchronous-read machinery underneath, but it put the entire engine behind a stable API so PG18 could swap the bottom out without touching consumers.

Retrofit. PG17 converted the obvious sequential consumers to the stream: sequential heap scans, ANALYZE block sampling, and parts of VACUUM’s heap scan moved to read_stream_* calls (the heap-AM references are still in src/backend/access/heap/heapam.c, heapam_handler.c, and vacuumlazy.c). For a seq scan the win is that coalescing turns many 8 KB reads into few large vectored reads, and the look-ahead keeps the device busy rather than relying purely on kernel readahead.

Structural shape shift: separate the oracle from the engine.

flowchart TB
    subgraph Before[Before PG17: bespoke prefetch per consumer]
      BHS[bitmap heap scan<br/>own fadvise loop] --> K1[(kernel<br/>readahead)]
      REC[recovery redo<br/>own LsnReadQueue] --> K1
      SEQ[seq scan / ANALYZE / VACUUM<br/>NO prefetch] --> K1
    end
    subgraph After[PG17: one read_stream engine]
      CB1[seq scan callback] --> RS[ReadStream<br/>look-ahead + coalesce<br/>+ max_ios]
      CB2[ANALYZE callback] --> RS
      CB3[VACUUM callback] --> RS
      RS --> COMB[vectored reads<br/>up to io_combine_limit]
      COMB --> K2[(still synchronous<br/>read underneath<br/>in PG17)]
    end

Limit carried forward — but now isolated. In PG17 the read stream is a better organizer of I/O, but the bottom of the stack is still the same: the actual read is fundamentally synchronous (with fadvise-style hinting), because PostgreSQL still had no first-class async I/O engine. The decisive difference from Eras 1–2 is that this limit is now concentrated in one place. PG18’s whole job is to replace that bottom with real asynchronous completion — and because consumers talk only to read_stream_next_buffer, they get it for free.

Era 4 — The asynchronous I/O subsystem (18)

What changed. PostgreSQL 18 added the thing the previous three eras were all stand-ins for: a first-class asynchronous I/O subsystem that the server owns end to end. The core lives in src/backend/storage/aio/aio.c and friends. Now PostgreSQL issues a real async read, tracks it through an explicit state machine in shared memory, and completes it itself — instead of firing a posix_fadvise hint and praying. Full mechanism is in postgres-aio.md; this section traces the structural leap and ties each piece to why the earlier eras needed it.

The central object: PgAioHandle. Where Eras 1–2 had no handle on an in-flight read at all, PG18 represents each operation as a PgAioHandle living in shared memory and moving through an eight-state lifecycle:

IDLE -> HANDED_OUT -> DEFINED -> STAGED -> SUBMITTED
-> COMPLETED_IO -> COMPLETED_SHARED -> COMPLETED_LOCAL

A backend acquires a handle (pgaio_io_acquire), the buffer manager and lower layers register completion callbacks and define the operation with pgaio_io_start_readv(), and the handle is staged and submitted through a pluggable method. Because the handle is in shared memory, a PgAioWaitRef (carrying a generation counter so a reused handle is not mistaken for the original) lets any process wait on the I/O — exactly what the kernel-hint model could never express. Completion callbacks are identified by small integer IDs rather than function pointers, because EXEC_BACKEND ASLR forbids shared-memory function pointers; errors are packed into a compact PgAioResult and re-raised in the issuing backend.
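
The role of the generation counter is easiest to see in a toy model. The sketch below is not the PgAioHandle implementation: all names are hypothetical, the strictly linear walk through the eight states is a simplification, and treating COMPLETED_SHARED onward as "finished" for a waiter is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum SketchState
{
    SK_IDLE, SK_HANDED_OUT, SK_DEFINED, SK_STAGED, SK_SUBMITTED,
    SK_COMPLETED_IO, SK_COMPLETED_SHARED, SK_COMPLETED_LOCAL
} SketchState;

typedef struct SketchHandle
{
    SketchState state;
    uint64_t    generation;     /* bumped each time the handle is recycled */
} SketchHandle;

/* A reference another process can hold on one particular I/O. */
typedef struct SketchWaitRef
{
    const SketchHandle *handle;
    uint64_t            generation;
} SketchWaitRef;

/* Move to the next state; recycling to IDLE starts a new generation. */
static void
sketch_advance(SketchHandle *h)
{
    assert(h != NULL);

    if (h->state == SK_COMPLETED_LOCAL)
    {
        h->state = SK_IDLE;
        h->generation++;
    }
    else
        h->state = (SketchState) (h->state + 1);
}

static SketchWaitRef
sketch_wait_ref(const SketchHandle *h)
{
    SketchWaitRef ref = {h, h->generation};

    return ref;
}

/* Has the I/O this reference points at finished? */
static bool
sketch_is_done(SketchWaitRef ref)
{
    /* A newer generation means the handle was reused: our I/O is long done. */
    if (ref.handle->generation != ref.generation)
        return true;
    return ref.handle->state >= SK_COMPLETED_SHARED;
}
```

Without the generation check, a waiter holding a stale reference would see the handle's new SUBMITTED state and wait on an I/O it never asked about.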

The pluggable engine: IoMethodOps vtable. The decisive design choice is that the mechanism of async I/O is abstracted behind a vtable (IoMethodOps), selected by the io_method GUC (PGC_POSTMASTER). Three implementations ship:

  • sync: no real async I/O. Submitting an IO just performs it synchronously inline. This is Era 0 reincarnated as a fallback / debug method, and it is what proves the abstraction is sound: the same read_stream consumer runs unmodified whether the bottom is synchronous or asynchronous.
  • worker: the default. The submitting backend hands the IO to a pool of dedicated I/O worker processes over a shared-memory submission queue; the workers perform the (blocking) preadv and signal completion. This gives real overlap on every platform, since it needs no kernel async API, just extra processes. The new process type is B_IO_WORKER (src/include/miscadmin.h), the count is set by the io_workers GUC, and the implementation is method_worker.c.
  • io_uring: on Linux 5.1+, uses the kernel’s io_uring interface for genuine in-kernel async submission/completion, with one ring per backend held in shared memory so any backend can drain completions (method_io_uring.c). This is the lowest-overhead path: no extra processes and no hand-off to another process for the preadv; the kernel does the async work.

/* guc_tables.c — the knobs that switch the engine (REL_18) */
{"io_method", PGC_POSTMASTER, RESOURCES_IO, ...}, /* sync|worker|io_uring */
{"io_workers", ...}, /* number of B_IO_WORKER procs (worker method) */
{"io_max_concurrency", ...}, /* per-backend max in-flight IOs */
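
The vtable idea itself fits in a short toy program. This is a sketch of the pattern only: the struct, its `submit` / `wait_one` members as used here, and the two in-memory "methods" are hypothetical stand-ins, with a memcpy playing the part of the read and a "deferred" method standing in for any engine that finishes the work after submission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A toy I/O: copy `len` bytes from `src` to `dst`. */
typedef struct SketchIo
{
    const char *src;
    char       *dst;
    size_t      len;
    bool        done;
} SketchIo;

/* The pluggable engine: consumers only ever call through this table. */
typedef struct SketchIoMethodOps
{
    const char *name;
    void        (*submit) (SketchIo *io);
    void        (*wait_one) (SketchIo *io);
} SketchIoMethodOps;

static void
sketch_perform(SketchIo *io)
{
    memcpy(io->dst, io->src, io->len);
    io->done = true;
}

/* "sync": the I/O is performed inline at submission time. */
static void sync_submit(SketchIo *io)   { sketch_perform(io); }
static void sync_wait(SketchIo *io)     { assert(io->done); }

/* "deferred": submission only records intent; work happens later. */
static void deferred_submit(SketchIo *io) { io->done = false; }
static void deferred_wait(SketchIo *io)
{
    if (!io->done)
        sketch_perform(io);
}

static const SketchIoMethodOps sketch_sync_ops = {"sync", sync_submit, sync_wait};
static const SketchIoMethodOps sketch_deferred_ops = {"deferred", deferred_submit, deferred_wait};

/* Consumer code: identical regardless of which method is plugged in. */
static void
sketch_read(const SketchIoMethodOps *ops, SketchIo *io)
{
    assert(ops != NULL && io != NULL);

    ops->submit(io);
    /* ... a real consumer would overlap other work here ... */
    ops->wait_one(io);
}
```

The point of the design is visible in sketch_read: switching the method changes when the work happens, not what the caller writes.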

Why a vtable with worker as default, not io_uring. Portability and safety. io_uring is Linux-only and has had a turbulent security and kernel-version history, so it cannot be the universal default. The worker method needs nothing but processes and shared memory, so it gives every supported platform real async I/O out of the box, while io_uring is opt-in for those who can use it. sync keeps the old behavior available for debugging and for environments where async is unwanted. This is the same “don’t bet the whole engine on one OS primitive” instinct that made the project hint-via-fadvise for fifteen years — but now the abstraction is internal, so the consumers never see which method is active.

The payoff: read_stream re-platformed. The reason PG17 hid everything behind read_stream_next_buffer becomes clear here. In PG18 the read stream no longer issues fadvise hints and then synchronous reads; it acquires PgAioHandles and submits true asynchronous reads through the selected method, then hands back buffers as completions arrive. Every consumer that adopted the stream in PG17 — sequential scans, ANALYZE, VACUUM — gets real async I/O in PG18 without a single line of consumer change. The bet paid off: one chokepoint, swapped underneath.

Structural shape shift: from hint to owned completion.

flowchart TB
    subgraph PG17[PG17 read stream: hint + sync read]
      C17[read_stream_next_buffer] --> RS17[ReadStream<br/>look-ahead, coalesce]
      RS17 --> H17[fadvise WILLNEED]
      H17 --> KC17[(OS page cache)]
      RS17 --> SR17[synchronous preadv<br/>likely cache hit]
    end
    subgraph PG18[PG18 read stream: owned async I/O]
      C18[read_stream_next_buffer] --> RS18[ReadStream<br/>look-ahead, coalesce]
      RS18 --> AH[acquire PgAioHandle<br/>start_readv + callbacks]
      AH --> VT{IoMethodOps<br/>vtable}
      VT -->|worker| WK[B_IO_WORKER procs<br/>blocking preadv]
      VT -->|io_uring| IU[io_uring ring<br/>kernel async]
      VT -->|sync| SY[inline synchronous<br/>fallback]
      WK --> CB[completion callback<br/>fills shared buffer]
      IU --> CB
      SY --> CB
      CB --> C18
    end

What this finally fixes. Every limit from the baseline is now addressable in-engine: random reads can be issued many-at-once with real handles (not advisory hints that may be evicted); a single backend can keep io_max_concurrency reads in flight to fill an NVMe queue; and the read path no longer depends on the kernel’s readahead heuristics or the presence of posix_fadvise. The completion is observed, so errors propagate properly and double-reads disappear. Recovery’s WAL prefetch (Era 2) and the bitmap/sequential consumers can all, over time, migrate onto the same owned-async substrate rather than maintaining bespoke fadvise loops.

Note on the write side. The eras above trace the read path, which is where the user-visible latency lived and where PG18’s work concentrated. The same PgAioHandle machinery is built to carry writes too (pgaio_io_start_writev), so checkpointer/background-writer flushing can ride the async substrate; details are in postgres-aio.md. The arc this doc traces is reads, but the subsystem is general.

At REL_18 (commit 273fe94, PG 18.x), PostgreSQL’s read path is a layered stack in which code from every previous era is still visible:

  • Consumers call read_stream_next_buffer() and receive pinned buffers in order — sequential scans, ANALYZE, VACUUM, and a growing set of others. They are oblivious to how the reads happen. Engine and API: src/backend/storage/aio/read_stream.c, documented in postgres-aio.md.
  • The read stream turns each consumer’s block-number callback into coalesced vectored reads (up to io_combine_limit), bounds concurrency by effective_io_concurrency / maintenance_io_concurrency, and adapts its look-ahead to hit/miss history.
  • The AIO core (src/backend/storage/aio/aio.c) issues each read as a PgAioHandle through the io_method-selected IoMethodOps vtable: worker (default, via B_IO_WORKER processes in method_worker.c), io_uring (Linux, method_io_uring.c), or sync (the Era-0 fallback).
  • The buffer manager (src/backend/storage/buffer/bufmgr.c) still owns frame allocation, pinning, clock-sweep eviction, and the WAL-before-flush rule — see postgres-buffer-manager.md. PrefetchBuffer / effective_io_concurrency from Era 1 remain present.
  • Recovery (src/backend/access/transam/xlogprefetcher.c) still carries the Era-2 WAL prefetcher gated by recovery_prefetch + maintenance_io_concurrency, with pg_stat_recovery_prefetch for observability — see postgres-recovery-redo.md.

The net result: the synchronous baseline (Era 0) survives only as io_method=sync; the advisory-hint model (Eras 1–2) persists where it has not yet been migrated; and the strategic abstraction (Era 3) is what made the PG18 leap (Era 4) a substrate swap rather than a rewrite of every scan. PostgreSQL finally owns the I/O-concurrency primitive it borrowed from the kernel for two decades.

Next step (PG19, just-released forward note). The PG18 subsystem deliberately landed the read path first and left writes, more consumers, and direct-I/O integration as follow-on work; the natural PG19-era direction is migrating additional read sites onto the read stream and extending owned-async to the write path (checkpointer / bgwriter), plus maturing io_uring and direct I/O. Treat this only as a forward pointer, not as current REL_18 behavior.

Release notes / project docs

  • PostgreSQL release notes: 9.0 (PrefetchBuffer / effective_io_concurrency, bitmap heap scan prefetch), 15 (recovery prefetch, recovery_prefetch, pg_stat_recovery_prefetch), 17 (read streams, io_combine_limit, seq-scan/ANALYZE/VACUUM conversions), 18 (asynchronous I/O, io_method, io_workers, I/O worker processes, io_uring).
  • src/backend/storage/aio/README.md (AIO design overview, in-tree).

Current-state module docs (mechanism — not re-derived here)

  • postgres-buffer-manager.md: cached read path, pin/evict mechanics, prefetch entry points.
  • postgres-recovery-redo.md: redo loop and WAL prefetch in recovery.
  • postgres-aio.md: read stream and AIO subsystem.

Key source files (observable on REL_18, commit 273fe94)

  • src/backend/storage/aio/aio.c: AIO core, handle state machine.
  • src/backend/storage/aio/read_stream.c: read stream producer/consumer.
  • src/backend/storage/aio/method_worker.c: worker method, B_IO_WORKER.
  • src/backend/storage/aio/method_io_uring.c: io_uring method (Linux).
  • src/backend/storage/buffer/bufmgr.c: ReadBuffer, PrefetchBuffer, effective_io_concurrency.
  • src/backend/storage/file/fd.c: FilePrefetch / posix_fadvise.
  • src/backend/access/transam/xlogprefetcher.c: WAL prefetch in recovery.