PostgreSQL Sequences — Relation-Backed Generators, Caching, and WAL
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A sequence is the relational database’s answer to a deceptively simple
need: hand out a stream of distinct numbers — 1, 2, 3, … — to callers
that arrive concurrently, without any two callers ever receiving the same
number. The textbook framing in Database System Concepts (Silberschatz
7e) places sequences in the SQL chapter under “sequences used to generate
unique key values”: they are the machinery behind surrogate primary keys,
SERIAL/bigserial pseudo-types, and the SQL-standard GENERATED … AS IDENTITY columns. The contract a sequence offers is narrow but precise,
and getting the contract right is the entire design problem:
- Uniqueness. No two successful
nextvalcalls ever return the same value. This is the one guarantee applications actually depend on for primary-key generation. - Monotonicity. Values advance in the direction of the increment (ascending by default). Within a single backend the values a session sees are strictly increasing.
- Explicitly not gap-free. A sequence makes no promise that the values it issues are contiguous. Gaps appear for three independent reasons — a transaction that consumed a value then rolled back, a cache of values discarded at session end, and a crash that lost pre-logged values. Treating a sequence as a gap-free row counter is the single most common application-level misuse, and the design actively refuses to support it because gap-freedom is fundamentally incompatible with concurrency: a gap-free counter would have to serialize every allocator against every other and hold the allocation until commit.
The tension that organizes every sequence implementation is (2)+(3)
versus performance. If correctness only required a global counter
protected by a lock, the design would be trivial and slow: every
nextval would contend on one lock and, if durability mattered, force one
disk write. Real systems issue millions of sequence values per second, so
the counter cannot be a single contended, synchronously-logged cell. The
escape hatch is batching: an allocator reserves a block of values
under a single lock acquisition and a single durability event, then serves
individual values from that block cheaply. The price of batching is
exactly the gaps in (3) — a reserved-but-unissued block that is lost
(session ends, crash) becomes a permanent hole. The design chooses to
pay that price, and the SQL standard blesses the choice.
A second, subtler tension is transactionality. A naive reading says a
generator is just a mutable counter, so its updates should be
transactional like any other row: roll back the transaction, give the
number back. But that is precisely the wrong behavior for a key generator.
If nextval were transactional, two concurrent transactions could each
call nextval, both see “next is 5,” and the loser’s rollback would
release 5 for reuse — destroying uniqueness the moment concurrency appears.
So a generator’s current position must advance with non-transactional,
immediately-visible semantics: the increment is durable and visible to
all sessions the instant it happens, independent of the calling
transaction’s fate. This is the deepest conceptual point about sequences:
they live inside the transactional storage engine but deliberately
opt out of transactional rollback for their core operation, while still
opting in for their definition changes (renaming, re-parameterizing).
The third axis is durability granularity. A counter that is purely
in-memory loses its position on crash and re-issues values — fatal for
uniqueness across restarts. A counter that fsyncs on every increment is
correct but unusably slow. The resolution, again, is to pre-reserve
durability: log a position ahead of where you are, so that after a crash
recovery resumes from the logged-ahead position and the only loss is the
gap between actual and logged — never a re-issue. This is the WAL
log_cnt batching that PostgreSQL implements, and it is the storage-engine
twin of the in-memory cache batching: one batches the lock, the other
batches the write-ahead log.
Common DBMS Design
Section titled “Common DBMS Design”Across engines a recurring playbook for sequence/identity generation emerges. Naming the shared moves first lets PostgreSQL’s specific choices read as points in a known design space.
Where does the counter live?
Section titled “Where does the counter live?”Two broad families:
- Catalog/metadata-row counter. The current position is a column in a system catalog (or a dedicated one-row table). Reading/advancing it is a catalog update. Simple, but every advance contends on catalog buffers and catalog cache invalidation. Several engines store the “next value” as metadata and rely on a dedicated latch.
- Dedicated sequence object with its own storage. The sequence is a first-class schema object backed by its own page(s), advanced under its own lock, logged under its own log-record type. This is the high-concurrency choice and the one PostgreSQL, Oracle, and DB2 all make.
The two-level cache
Section titled “The two-level cache”Almost universally, a fast generator has two batching levels:
- A durable/shared reservation — a block of values reserved in the
persistent counter under one lock + one log event. Oracle calls the size
CACHE n; PostgreSQL spells itCACHEtoo but applies it per-session (below), and separately pre-logsSEQ_LOG_VALSto the WAL. - A session-local cache — each backend pulls a sub-block into private
memory and serves
nextvalfrom it with no shared-state access at all.
The session-local cache is what makes nextval essentially free in the
common case, and it is the direct cause of inter-session gaps and
out-of-order values across sessions: session A may cache [1..10] and
session B [11..20], so the order rows physically appear can interleave
arbitrarily with the values they carry.
Crash recovery and “log ahead”
Section titled “Crash recovery and “log ahead””The durable counter must survive crashes without re-issuing. The shared
trick is to persist a position ahead of the issued position so replay
resumes past anything that was handed out. The recovery rule is “on redo,
fast-forward the counter to at least the logged value.” PostgreSQL’s
log_cnt is exactly this: the on-disk last_value after a log record is
the value after log_cnt more fetches, so replay restores the
logged-ahead position and the live system simply skips the unissued tail.
Transactional vs. non-transactional split
Section titled “Transactional vs. non-transactional split”The universally adopted resolution is to split a sequence’s state:
- Position advance (
nextval/setval) — non-transactional. Durable and globally visible immediately; never rolled back. - Definition (create/alter/drop, re-parameterize, reset) — transactional. Honors MVCC and rollback like any DDL.
Engines implement the second half by versioning the sequence’s storage:
an ALTER/RESTART writes a new physical file/segment that becomes
visible only on commit, so a rollback leaves the old storage in place. This
is exactly PostgreSQL’s RelationSetNewRelfilenumber rewrite.
flowchart TB
subgraph Defn["Definition state — TRANSACTIONAL (MVCC, rollback)"]
PGS["pg_sequence catalog row<br/>start, increment, min, max,<br/>cache, cycle, typid"]
REL["sequence relation identity<br/>relfilenumber (new file on ALTER/RESTART)"]
end
subgraph Pos["Position state — NON-TRANSACTIONAL (immediate, durable)"]
TUP["one heap tuple on block 0<br/>last_value, log_cnt, is_called"]
WAL["XLOG_SEQ_LOG record<br/>pre-logs future last_value"]
end
APP["application: nextval / currval / setval"] --> Pos
DDL["CREATE / ALTER / RESTART / DROP"] --> Defn
Pos -. "reads params from" .-> PGS
The “identity” layer on top
Section titled “The “identity” layer on top”SERIAL and GENERATED AS IDENTITY are not separate mechanisms: they are
a sequence plus an OWNED BY dependency plus a column default. The
generator underneath is the same nextval. PostgreSQL implements the
ownership link with process_owned_by, recording a pg_depend row so that
dropping the table drops the sequence.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s defining decision is that a sequence is a relation —
relkind = RELKIND_SEQUENCE — not a privileged in-memory object. It owns a
pg_class row, a relfilenode, a buffer-managed data file, and it flows
through the same buffer manager, WAL, and smgr machinery as a heap table.
What makes it a sequence is only that (a) its data file holds exactly one
heap tuple on block 0, with a special-area magic number, and (b) its
mutating operations go through sequence.c rather than the generic
executor. This “everything is a relation” choice is why sequences inherit
crash safety, replication, and pg_upgrade for free.
Two structs, two homes
Section titled “Two structs, two homes”A sequence’s state is split across two storage locations, mirroring the transactional/non-transactional divide:
// FormData_pg_sequence — src/include/catalog/pg_sequence.h// IMMUTABLE PARAMETERS, transactional, in the pg_sequence catalogtypedef struct FormData_pg_sequence{ Oid seqrelid; /* OID of the owning sequence relation */ Oid seqtypid; /* smallint | integer | bigint */ int64 seqstart; /* START WITH */ int64 seqincrement; /* INCREMENT BY */ int64 seqmax; /* MAXVALUE */ int64 seqmin; /* MINVALUE */ int64 seqcache; /* CACHE (per-backend batch size) */ bool seqcycle; /* CYCLE? */} FormData_pg_sequence;// FormData_pg_sequence_data — src/include/commands/sequence.h// MUTABLE POSITION, non-transactional, the single tuple on block 0typedef struct FormData_pg_sequence_data{ int64 last_value; /* last value handed to the on-disk state */ int64 log_cnt; /* how many more fetches are pre-logged */ bool is_called; /* has nextval ever advanced past last_value? */} FormData_pg_sequence_data;The split is deliberate. The parameters never change during normal
operation, so they live in a catalog reached through the SEQRELID
syscache (a hash lookup, no buffer lock) — see postgres-catcache-syscache.md.
The position changes on every cache refill, so it lives in the relation’s
own page where it can be ex-locked, mutated, and WAL-logged independently
of catalog machinery.
The per-backend SeqTable cache
Section titled “The per-backend SeqTable cache”The hot-path optimization is a process-local hash table, seqhashtab,
mapping each touched sequence’s relid to a SeqTableData:
// SeqTableData — src/backend/commands/sequence.ctypedef struct SeqTableData{ Oid relid; /* pg_class OID of this sequence (hash key) */ RelFileNumber filenumber; /* last seen relfilenumber of this sequence */ LocalTransactionId lxid; /* xact in which we last did a seq op */ bool last_valid; /* do we have a valid "last" value? */ int64 last; /* value last returned by nextval */ int64 cached; /* last value already cached for nextval */ /* if last != cached, we have not used up all the cached values */ int64 increment; /* copy of sequence's increment field */} SeqTableData;The invariant last != cached means “this backend still holds unissued
cached values.” When it holds, nextval returns by simply adding the
increment to last — no buffer, no lock, no WAL, no catalog lookup.
The lxid field implements once-per-transaction locking: the relation lock
is taken under the top transaction’s resource owner so it is acquired at
most once per xact regardless of how many nextval calls occur. cached
is reset to last (discarding the window) whenever the sequence is
detected to have been transactionally replaced — its relfilenumber
changed out from under us, the signature of an ALTER/RESTART commit.
flowchart TB
START["nextval_internal(relid)"] --> INIT["init_sequence:<br/>find/create SeqTable entry,<br/>lock_and_open_sequence once per xact"]
INIT --> CACHED{"elm->last != elm->cached ?<br/>(unissued cached values)"}
CACHED -- yes --> FAST["elm->last += increment;<br/>return — NO buffer/WAL"]
CACHED -- no --> PARAMS["SearchSysCache SEQRELID:<br/>load incby/max/min/cache/cycle"]
PARAMS --> READ["read_seq_tuple:<br/>ex-lock block 0, read last_value/log_cnt"]
READ --> DECIDE{"log_cnt < fetch OR<br/>page LSN <= redo ptr ?"}
DECIDE -- yes --> FORCE["force WAL: fetch = log = cache + SEQ_LOG_VALS;<br/>logit = true"]
DECIDE -- no --> NOLOG["serve from cache only, no new WAL"]
FORCE --> LOOP["loop: apply increment,<br/>check max/min, cycle, count rescnt"]
NOLOG --> LOOP
LOOP --> CRIT["START_CRIT_SECTION;<br/>MarkBufferDirty"]
CRIT --> WALW{"logit && RelationNeedsWAL ?"}
WALW -- yes --> EMIT["write tuple as 'after log more fetches';<br/>XLogInsert XLOG_SEQ_LOG; PageSetLSN"]
WALW -- no --> SKIP["skip WAL"]
EMIT --> FINAL["install final tuple:<br/>last_value=last, log_cnt=log"]
SKIP --> FINAL
FINAL --> RET["END_CRIT_SECTION; unlock; return result"]
Why nextval is non-transactional, and the guards that enforce it
Section titled “Why nextval is non-transactional, and the guards that enforce it”nextval_internal never enrolls its buffer change in the transaction’s
undo path — the increment is durable the instant the buffer is dirtied and
the WAL (if any) is flushed at commit. Consequently it must refuse contexts
where that would be unsafe or meaningless: it calls
PreventCommandIfParallelMode("nextval()") (parallel workers cannot share
the backend-local cache) and PreventCommandIfReadOnly("nextval()") for
non-temp sequences (a read-only xact may not durably advance shared state).
By contrast, the same file’s ResetSequence and AlterSequence are
transactional: they call RelationSetNewRelfilenumber to stamp a brand-new
storage file, so an aborted ALTER simply never publishes the new file and
the old position survives.
Source Walkthrough
Section titled “Source Walkthrough”The code divides cleanly into five clusters: the access-method shim
(access/sequence/sequence.c), creation/alteration (DefineSequence,
AlterSequence, ResetSequence, init_params), the nextval hot path
(init_sequence, read_seq_tuple, nextval_internal), read/set helpers
(currval, lastval, do_setval, pg_* functions), and WAL/redo
(seq_redo, seq_mask, seq_desc). All of commands/sequence.c is
walked below in call-flow order.
The sequence access method — a relkind assertion
Section titled “The sequence access method — a relkind assertion”The “sequence AM” is intentionally trivial. Unlike the heap or nbtree AMs,
it has no IndexAmRoutine/TableAmRoutine vector; it is just two wrappers
around relation_open/relation_close that assert the relkind. Sequences
reuse heap tuple format on the page, so no per-tuple AM is needed.
// sequence_open — src/backend/access/sequence/sequence.cRelationsequence_open(Oid relationId, LOCKMODE lockmode){ Relation r;
r = relation_open(relationId, lockmode); validate_relation_kind(r); /* ereport unless RELKIND_SEQUENCE */ return r;}validate_relation_kind raises ERRCODE_WRONG_OBJECT_TYPE if the relkind
is not RELKIND_SEQUENCE, which is how nextval('not_a_seq') is rejected.
Creating a sequence — DefineSequence
Section titled “Creating a sequence — DefineSequence”DefineSequence builds an ordinary three-column table (last_value,
log_cnt, is_called), creates it via DefineRelation with relkind
RELKIND_SEQUENCE, then writes the initial one-row tuple and inserts the
parameter row into pg_sequence:
// DefineSequence — src/backend/commands/sequence.cinit_params(pstate, seq->options, seq->for_identity, true, &seqform, &seqdataform, &need_seq_rewrite, &owned_by);/* ... build CreateStmt with three int8/bool columns ... */address = DefineRelation(stmt, RELKIND_SEQUENCE, seq->ownerId, NULL, NULL);seqoid = address.objectId;rel = sequence_open(seqoid, AccessExclusiveLock);tuple = heap_form_tuple(tupDesc, value, null);fill_seq_with_data(rel, tuple); /* lay down block 0 *//* ... then CatalogTupleInsert into pg_sequence with the parameters ... */init_params is the long validator that turns the WITH option list into
seqform/seqdataform, applying type-dependent defaults
(INT8/INT4/INT2 min/max), cross-checks (seqmin < seqmax, START
inside range), and crucially resets log_cnt to 0 whenever any
generation-affecting parameter changes — so a parameter change can never
leave a stale pre-log window.
Laying down block 0 — fill_seq_with_data
Section titled “Laying down block 0 — fill_seq_with_data”fill_seq_with_data initializes the relation’s first page. It extends the
buffer, lays a sequence_magic (0x1717) into the special area, force-sets
the tuple’s xmin to FrozenTransactionId (sequences are never VACUUMed, so
the tuple must be permanently visible), and WAL-logs the whole page with
REGBUF_WILL_INIT:
// fill_seq_fork_with_data — src/backend/commands/sequence.cPageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic));sm = (sequence_magic *) PageGetSpecialPointer(page);sm->magic = SEQ_MAGIC;/* sequences are never vacuumed → force the tuple permanently visible */HeapTupleHeaderSetXmin(tuple->t_data, FrozenTransactionId);HeapTupleHeaderSetXminFrozen(tuple->t_data);/* ... */START_CRIT_SECTION();MarkBufferDirty(buf);offnum = PageAddItem(page, (Item) tuple->t_data, tuple->t_len, InvalidOffsetNumber, false, false);if (RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM){ xl_seq_rec xlrec; XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); xlrec.locator = rel->rd_locator; XLogRegisterData(&xlrec, sizeof(xl_seq_rec)); XLogRegisterData(tuple->t_data, tuple->t_len); recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); PageSetLSN(page, recptr);}END_CRIT_SECTION();The wrapper fill_seq_with_data additionally handles unlogged
sequences: it writes the same content into the INIT_FORKNUM fork (and
flushes), so that after a crash the unlogged sequence is reset to its init
image rather than carrying torn data.
Per-transaction locking — init_sequence and lock_and_open_sequence
Section titled “Per-transaction locking — init_sequence and lock_and_open_sequence”Every entry point routes through init_sequence, which finds or creates the
SeqTable entry and opens the relation. The relation lock is taken at most
once per transaction by checking the cached lxid:
// lock_and_open_sequence — src/backend/commands/sequence.cLocalTransactionId thislxid = MyProc->vxid.lxid;if (seq->lxid != thislxid){ ResourceOwner currentOwner = CurrentResourceOwner; CurrentResourceOwner = TopTransactionResourceOwner; /* lock held to xact end */ LockRelationOid(seq->relid, RowExclusiveLock); CurrentResourceOwner = currentOwner; seq->lxid = thislxid; /* remember we hold it */}return sequence_open(seq->relid, NoLock);init_sequence then detects a transactional replacement: if the relation’s
relfilenode differs from the cached filenumber, the sequence was
ALTER/RESTART-ed under us, so any cached-but-unissued window is dropped
by setting elm->cached = elm->last (while last/currval state is left
intact):
// init_sequence — src/backend/commands/sequence.cseqrel = lock_and_open_sequence(elm);if (seqrel->rd_rel->relfilenode != elm->filenumber){ elm->filenumber = seqrel->rd_rel->relfilenode; elm->cached = elm->last; /* discard pre-fetched cache window */}Reading the one tuple — read_seq_tuple
Section titled “Reading the one tuple — read_seq_tuple”read_seq_tuple ex-locks block 0, validates the magic number, and returns
a pointer into the buffer. It also opportunistically scrubs a stale xmax
left by historical SELECT FOR UPDATE bugs, treating the fix as an
unlogged hint:
// read_seq_tuple — src/backend/commands/sequence.c*buf = ReadBuffer(rel, 0);LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);page = BufferGetPage(*buf);sm = (sequence_magic *) PageGetSpecialPointer(page);if (sm->magic != SEQ_MAGIC) elog(ERROR, "bad magic number in sequence \"%s\": %08X", ..., sm->magic);lp = PageGetItemId(page, FirstOffsetNumber);seqdatatuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);seqdatatuple->t_len = ItemIdGetLength(lp);/* clean up a non-frozen xmax left by old SELECT FOR UPDATE bugs */if (HeapTupleHeaderGetRawXmax(seqdatatuple->t_data) != InvalidTransactionId){ HeapTupleHeaderSetXmax(seqdatatuple->t_data, InvalidTransactionId); seqdatatuple->t_data->t_infomask |= HEAP_XMAX_INVALID; MarkBufferDirtyHint(*buf, true);}The hot path — nextval_internal
Section titled “The hot path — nextval_internal”nextval_internal is the heart of the module. First the cache fast path:
// nextval_internal — src/backend/commands/sequence.c (fast path)if (elm->last != elm->cached) /* some numbers were cached */{ Assert(elm->last_valid); Assert(elm->increment != 0); elm->last += elm->increment; sequence_close(seqrel, NoLock); last_used_seq = elm; return elm->last; /* no buffer, no WAL, no catalog */}On a cache miss it loads parameters from SEQRELID, reads the tuple, and
decides whether this refill must be WAL-logged. Two triggers force a log:
the local demand exceeds the pre-logged budget (log < fetch), or the
page’s last update predates the current checkpoint redo pointer — without
the latter, redo from the checkpoint would not advance the sequence past
already-issued values:
// nextval_internal — src/backend/commands/sequence.c (log decision)log = seq->log_cnt;if (log < fetch || !seq->is_called){ fetch = log = fetch + SEQ_LOG_VALS; /* grab 32 extra, pre-log them */ logit = true;}else{ XLogRecPtr redoptr = GetRedoRecPtr(); if (PageGetLSN(page) <= redoptr) /* first nextval after checkpoint */ { fetch = log = fetch + SEQ_LOG_VALS; logit = true; }}The allocation loop applies the increment up to fetch times, enforcing
MAXVALUE/MINVALUE and CYCLE, and recording the first issued value as
the function result and the last as the new cache high-water:
// nextval_internal — src/backend/commands/sequence.c (allocation loop, ascending arm)if ((maxv >= 0 && next > maxv - incby) || (maxv < 0 && next + incby > maxv)){ if (rescnt > 0) break; /* satisfied caller; stop */ if (!cycle) ereport(ERROR, (errcode(ERRCODE_SEQUENCE_GENERATOR_LIMIT_EXCEEDED), errmsg("nextval: reached maximum value of sequence \"%s\" (%lld)", ...))); next = minv; /* CYCLE wraps to MINVALUE */}else next += incby;fetch--;if (rescnt < cache) { log--; rescnt++; last = next; if (rescnt == 1) result = next; }Finally the WAL/buffer protocol. The buffer is dirtied before the values
are installed (safe because the buffer is ex-locked). The WAL record carries
the tuple as it would look after log more fetches, then the in-buffer
tuple is set to the actual state with the remaining log_cnt:
// nextval_internal — src/backend/commands/sequence.c (WAL + install)START_CRIT_SECTION();MarkBufferDirty(buf);if (logit && RelationNeedsWAL(seqrel)){ XLogBeginInsert(); XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); seq->last_value = next; /* the LOGGED-AHEAD value */ seq->is_called = true; seq->log_cnt = 0; xlrec.locator = seqrel->rd_locator; XLogRegisterData(&xlrec, sizeof(xl_seq_rec)); XLogRegisterData(seqdatatuple.t_data, seqdatatuple.t_len); recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); PageSetLSN(page, recptr);}/* now the intended final in-buffer state, possibly behind the logged value */seq->last_value = last;seq->is_called = true;seq->log_cnt = log; /* this many fetches remain pre-logged */END_CRIT_SECTION();This is the log_cnt batching: after one WAL write the on-disk last_value
is SEQ_LOG_VALS ahead, and the next ~32 refills decrement log_cnt in the
buffer with no new WAL record. A crash loses the unissued tail (the gap),
never re-issues. Note GetTopTransactionId() is called before the critical
section when logit — forcing the top xact to have a real XID so commit
triggers a WAL flush and synchronous-replication wait.
currval, lastval, and setval
Section titled “currval, lastval, and setval”currval_oid returns the session’s last nextval result straight from
elm->last (erroring if last_valid is false). lastval returns the last
value from any sequence this session touched, via the file-static
last_used_seq pointer (re-checking the sequence still exists). do_setval
is nextval’s simpler sibling: it validates the target against min/max,
discards the local cache (elm->cached = elm->last), and always WAL-logs
the new tuple with log_cnt = 0:
// do_setval — src/backend/commands/sequence.cif ((next < minv) || (next > maxv)) ereport(ERROR, (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), ...));if (iscalled) { elm->last = next; elm->last_valid = true; }elm->cached = elm->last; /* forget any future cached numbers */START_CRIT_SECTION();seq->last_value = next;seq->is_called = iscalled;seq->log_cnt = 0;MarkBufferDirty(buf);/* ... unconditional XLOG_SEQ_LOG when RelationNeedsWAL ... */The two- and three-argument SQL forms (setval_oid, setval3_oid) differ
only in whether the caller may clear is_called — the three-arg form exists
so pg_dump can restore a sequence’s exact is_called state.
WAL redo and standby masking
Section titled “WAL redo and standby masking”seq_redo handles XLOG_SEQ_LOG by reinitializing block 0 from
scratch. Because the same record type is used for live updates and a hot
standby may be reading the page concurrently, redo builds the new page in
palloc’d workspace and memcpys it in atomically so no torn intermediate is
ever visible:
// seq_redo — src/backend/commands/sequence.cbuffer = XLogInitBufferForRedo(record, 0);page = (Page) BufferGetPage(buffer);localpage = (Page) palloc(BufferGetPageSize(buffer));PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic));sm = (sequence_magic *) PageGetSpecialPointer(localpage);sm->magic = SEQ_MAGIC;item = (char *) xlrec + sizeof(xl_seq_rec);itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec);PageAddItem(localpage, (Item) item, itemsz, FirstOffsetNumber, false, false);PageSetLSN(localpage, lsn);memcpy(page, localpage, BufferGetPageSize(buffer)); /* atomic swap-in */MarkBufferDirty(buffer);The resource manager is registered in rmgrlist.h as
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL). seq_mask (used by wal_consistency_checking)
masks the page LSN/checksum and unused space before comparing standby and
primary pages; seq_desc prints the relfilelocator for pg_waldump.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
SEQ_LOG_VALS (=32) | src/backend/commands/sequence.c | 58 |
SEQ_MAGIC (=0x1717) | src/backend/commands/sequence.c | 63 |
sequence_magic | src/backend/commands/sequence.c | 65 |
SeqTableData | src/backend/commands/sequence.c | 76 |
seqhashtab (static) | src/backend/commands/sequence.c | 91 |
last_used_seq (static) | src/backend/commands/sequence.c | 97 |
DefineSequence | src/backend/commands/sequence.c | 121 |
ResetSequence | src/backend/commands/sequence.c | 262 |
fill_seq_with_data | src/backend/commands/sequence.c | 338 |
fill_seq_fork_with_data | src/backend/commands/sequence.c | 359 |
AlterSequence | src/backend/commands/sequence.c | 437 |
SequenceChangePersistence | src/backend/commands/sequence.c | 541 |
DeleteSequenceTuple | src/backend/commands/sequence.c | 570 |
nextval_oid | src/backend/commands/sequence.c | 615 |
nextval_internal | src/backend/commands/sequence.c | 623 |
currval_oid | src/backend/commands/sequence.c | 866 |
lastval | src/backend/commands/sequence.c | 897 |
do_setval | src/backend/commands/sequence.c | 945 |
setval_oid | src/backend/commands/sequence.c | 1049 |
lock_and_open_sequence | src/backend/commands/sequence.c | 1085 |
create_seq_hashtable | src/backend/commands/sequence.c | 1113 |
init_sequence | src/backend/commands/sequence.c | 1129 |
read_seq_tuple | src/backend/commands/sequence.c | 1190 |
init_params | src/backend/commands/sequence.c | 1257 |
process_owned_by | src/backend/commands/sequence.c | 1593 |
sequence_options | src/backend/commands/sequence.c | 1707 |
pg_get_sequence_data | src/backend/commands/sequence.c | 1787 |
pg_sequence_last_value | src/backend/commands/sequence.c | 1847 |
seq_redo | src/backend/commands/sequence.c | 1892 |
ResetSequenceCaches | src/backend/commands/sequence.c | 1945 |
seq_mask | src/backend/commands/sequence.c | 1960 |
sequence_open | src/backend/access/sequence/sequence.c | 37 |
sequence_close | src/backend/access/sequence/sequence.c | 58 |
validate_relation_kind | src/backend/access/sequence/sequence.c | 70 |
seq_desc / seq_identify | src/backend/access/rmgrdesc/seqdesc.c | 21 / 34 |
FormData_pg_sequence | src/include/catalog/pg_sequence.h | 23 |
FormData_pg_sequence_data / SEQ_COL_* | src/include/commands/sequence.h | 25 / 38 |
XLOG_SEQ_LOG / xl_seq_rec | src/include/commands/sequence.h | 46 / 48 |
PG_RMGR(RM_SEQ_ID, "Sequence", …) | src/include/access/rmgrlist.h | 43 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Checks performed against the REL_18_STABLE worktree at
/data/hgryoo/references/postgres (commit 273fe94):
SEQ_LOG_VALSis 32 andSEQ_MAGICis0x1717. Confirmed atsequence.c:58and:63. The pre-log batch size is a compile-time constant, not a GUC.FormData_pg_sequence_datahas exactly three fields{last_value, log_cnt, is_called}(commands/sequence.h:25), and the relation’s three columns map to them viaSEQ_COL_LASTVAL/LOG/CALLED= 1/2/3 (commands/sequence.h:38). Verified the column build loop inDefineSequenceemitslast_value,log_cnt,is_calledin that order.- Parameters live in
pg_sequencewith fieldsseqrelid, seqtypid, seqstart, seqincrement, seqmax, seqmin, seqcache, seqcycle(catalog/pg_sequence.h:23), reached throughMAKE_SYSCACHE(SEQRELID, pg_sequence_seqrelid_index, 32)— confirming theSearchSysCache1(SEQRELID, …)calls innextval_internal,do_setval,ResetSequence,sequence_options. - The cache fast path takes no lock/WAL. Re-read
nextval_internallines 668–676: whenelm->last != elm->cachedit returns after onlysequence_close(seqrel, NoLock), with noread_seq_tuple, noXLogInsert. Confirmed. - Checkpoint force-log condition is
PageGetLSN(page) <= redoptrwithredoptr = GetRedoRecPtr()(nextval_internal~721–728). Verified the comment block explaining “first nextval after a checkpoint.” - The logged tuple is the future state. In the WAL arm,
seq->last_value = next(the logged-ahead value) is set beforeXLogInsert, then overwritten withseq->last_value = last; seq->log_cnt = logafter. Verified the two assignments straddle theXLogInsertcall. This is the mechanism by which redo restores a value ahead of what was issued. nextval/setvalguards.PreventCommandIfParallelModeandPreventCommandIfReadOnly(the latter gated on!rd_islocaltemp) appear in bothnextval_internalanddo_setval. Confirmed.- ALTER/RESET are transactional via relfilenumber.
ResetSequenceandAlterSequence(whenneed_seq_rewrite) callRelationSetNewRelfilenumberthenfill_seq_with_data. Confirmed at:309and:503. seq_redoreinit-and-memcpy. VerifiedXLogInitBufferForRedo,palloclocal page,PageAddItem,memcpy(page, localpage, …). The comment explicitly cites hot-standby concurrent reads as the reason.- RM registration.
rmgrlist.h:43registersRM_SEQ_ID“Sequence” withseq_redo,seq_desc,seq_identify, andseq_mask(decode/undo hooks NULL). Confirmed no PG19-only rmgr changes are referenced. - The sequence AM is a relkind assertion only.
access/sequence/sequence.cis 79 lines:sequence_open,sequence_close,validate_relation_kind. No AM routine vector exists. Confirmed. - The pre-log critical-section XID forcing.
GetTopTransactionId()is invoked insidenextval_internal(line 396 path isfill_seq_with_data’s; the nextval pre-log site forces a top XID) and in both rewrite arms ofAlterSequence/SequenceChangePersistence(:497,:559), each immediately precedingRelationSetNewRelfilenumber(:309,:503,:562). Confirmed the rewrite path and the pre-log path both force a real top-transaction XID so the eventual commit drives a WAL flush.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
Oracle’s
CACHE/NOCACHEand theNOORDER/ORDERknob. Oracle stores the sequence’s high-water mark in theSEQ$dictionary table and caches a block of values in the SGA (shared, not per-session). ItsORDERclause forces RAC-wide ordering by serializing allocation across instances through the global lock — the same uniqueness-vs-ordering-vs-throughput trilemma PostgreSQL resolves by refusing cross-session ordering. PostgreSQL’sCACHEis per-backend (a sharper source of inter-session gaps than Oracle’s shared cache), but it never pays the cross-instance round-trip Oracle’sORDERdemands. A side-by-side ofnextval_internal’sSEQ_LOG_VALSpre-log against Oracle’s SGA cache + dictionary update would sharpen what “log ahead” buys versus “cache ahead.” -
Sequence ranges as a distributed-systems primitive. In a sharded or multi-master system a single monotonic counter is a global serialization point — exactly the bottleneck batching exists to avoid. The standard escape is range allocation: a coordinator hands each node a disjoint block (
[1000..1999]), and nodes serve locally. This is structurally identical to PostgreSQL’s per-backendSeqTablewindow scaled up one level — the backend is the “node,” the WAL pre-log is the “coordinator handoff.” Snowflake-style ID generators go further and abandon the central counter entirely, encoding(timestamp, node_id, per-node_counter)so no coordination is needed at the cost of strict global monotonicity. PostgreSQL’s design sits at the conservative end: one authoritative durable counter, batched twice (cache + WAL), trading distributed scalability for single-node simplicity and exact crash-recovery semantics. -
Lock-manager contention as the real cost. A Scalable Lock Manager for Multicores (
dbms-papers/scalable-lock-manager.md) frames why even a RowExclusiveLock pernextvalwould be ruinous on many cores: the lock table’s partition latches become the bottleneck long before the buffer does. PostgreSQL sidesteps this two ways — thelxidcheck takes the relation lock once per transaction (not once pernextval), and the cache fast path takes no lock at all whenelm->last != elm->cached. The paper’s speculative-lock-inheritance ideas are the research frontier for the rare workload that defeats both (millions of single-valuenextvals in autocommit, each its own transaction, on one hot sequence). -
MVCC-free position state. Database Internals (Petrov, ch. 5 “Transaction Processing and Recovery”) motivates why a generator must opt out of MVCC: rolling back an allocation would destroy uniqueness under concurrency. The frozen-xmin one-row tuple (
HeapTupleHeaderSetXminFrozen) is PostgreSQL’s concrete encoding of “this row is never versioned, never vacuumed, always visible” — a deliberate hole in the otherwise-uniform MVCC fabric. A research comparison against systems that do version sequence state (and pay the rollback-reuse cost) would quantify the trade. -
GENERATED AS IDENTITYvs.SERIALand the SQL standard. The SQL:2003 identity-column feature standardized whatSERIALdid informally. PostgreSQL implements both over the samenextvalmachinery, differing only in the catalog dependency (process_owned_byrecordspg_dependwithDEPENDENCY_INTERNALfor identity vs.DEPENDENCY_AUTOfor serial), which governs whether the sequence can be dropped independently. The generator contract — uniqueness, monotonicity, no gap-freedom — is identical; only the ownership lifecycle differs. This is a clean example of a storage mechanism (the relation-backed counter) being reused under two SQL surface syntaxes. -
Logical replication of sequence advances. Historically
nextvalon a publisher did not replicate to a logical subscriber, so failover lost the position. TheXLOG_SEQ_LOGrecord carries everything needed to replicate the logged-aheadlast_value; building a sequence-decoding output-plugin path on top ofseq_desc’s relfilelocator is an active area (see postgres-logical-decoding.md). The pre-log semantics matter here: a logical subscriber that replays the logged-ahead value inherits the same “skip the unissued tail” gap behavior, which is correct for uniqueness but surprising to operators expecting contiguous values.
Sources
Section titled “Sources”In-tree source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”src/backend/commands/sequence.c— the entire module:DefineSequence,AlterSequence,ResetSequence,SequenceChangePersistence,init_params,fill_seq_with_data/fill_seq_fork_with_data, theSeqTableDatacache,init_sequence/lock_and_open_sequence,read_seq_tuple,nextval_internal,currval_oid/lastval,do_setval/setval_oid,process_owned_by, and the WALseq_redo/seq_mask. The position-hint table above pins each symbol.src/backend/access/sequence/sequence.c— the 79-line “sequence AM”:sequence_open,sequence_close,validate_relation_kind. A relkind assertion overrelation_open, nothing more.src/backend/access/rmgrdesc/seqdesc.c—seq_desc/seq_identifyforpg_waldumpdecoding ofXLOG_SEQ_LOG.src/include/commands/sequence.h—FormData_pg_sequence_data{last_value, log_cnt, is_called},SEQ_COL_*column numbers,XLOG_SEQ_LOG, and thexl_seq_recWAL header struct.src/include/catalog/pg_sequence.h—FormData_pg_sequence(the immutable parameters:seqstart, seqincrement, seqmax, seqmin, seqcache, seqcycle, seqtypid).src/include/access/rmgrlist.h—PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)registration.
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), SQL chapter
— “sequences used to generate unique key values,”
GENERATED AS IDENTITY, surrogate keys (knowledge/research/dbms-general/database-system-concepts.md). - Database Internals (Petrov 2019), ch. 5 “Transaction Processing and
Recovery” — why a generator opts out of MVCC, and WAL log-ahead framing
(
knowledge/research/dbms-general/database-internals.md). - A Scalable Lock Manager for Multicores — lock-table partition-latch
contention; motivates the once-per-transaction lock and the lock-free cache
fast path (
knowledge/research/dbms-papers/scalable-lock-manager.md). - Optimistic Concurrency Control (Kung & Robinson 1981) — background for
why “validate then advance, never roll back” is the right discipline for a
shared counter (
knowledge/research/dbms-papers/occ.md).
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-heap-am.md— the heap tuple format and page item layout the one-row sequence tuple reuses; this doc treats the tuple as opaque.postgres-xlog-wal.md—XLogBeginInsert/XLogRegisterBuffer/XLogInsert,REGBUF_WILL_INIT, redo buffer initialization (XLogInitBufferForRedo), and the resource-manager dispatch that routesRM_SEQ_IDtoseq_redo.postgres-catcache-syscache.md— theSEQRELIDsyscache thatnextval_internal/do_setval/ResetSequenceuse to load sequence parameters without a buffer lock.postgres-smgr-md.md/postgres-buffer-manager.md— the storage-manager and buffer layers underReadBuffer(rel, 0)and the INIT_FORKNUM handling for unlogged sequences.postgres-ddl-execution.md—DefineRelationandRelationSetNewRelfilenumber, the DDL machineryDefineSequence/AlterSequence/ResetSequenceride on.postgres-architecture-overview.md— where the relation-backed sequence sits in the storage-engine axis as a degenerate one-row table.