CUBRID File & Disk Manager — Code-Level Deep Dive
Where this document fits: The high-level analysis `cubrid-disk-manager.md` covers design intent and theoretical background for both the file and disk managers. This document traces every branch and field at the code level, centred on `file_manager.c` with the disk manager as its substrate. Each chapter is self-contained, but reading in order follows the full lifecycle of a single data page, from reserved sector to owning file, inside the kernel.
Chapter 1: Data-Structure Map
This chapter is the field dictionary for the whole document; later chapters trace operations over these structures without re-explaining a field. The reader question: what are all the structures the disk and file managers share, and what does every field mean? For design rationale, read the companion `cubrid-disk-manager.md` (“Volume layout”, “File architecture”, “Permanent vs temporary purpose split”); this chapter assumes that theory and only names fields.
Two boundaries organize everything. The disk/file boundary: the disk manager owns volumes and hands out sectors (64-page extents); the file manager carves pages from them. The on-disk/in-memory boundary: some structures persist byte-for-byte in pages (disk_volume_header, file_header, file_extensible_data, file_partial_sector); others live only in server heap to summarize or coordinate them (disk_cache, disk_extend_info, disk_stab_cursor, disk_reserve_context).
1.1 The disk side, at a glance
```mermaid
flowchart TB
subgraph ondisk["On disk (one per volume)"]
VH["disk_volume_header<br/>page 0 of every volume"]
STAB["sector allocation table<br/>bitmap pages, 1 bit per sector"]
end
subgraph mem["In memory (one disk_Cache, process-wide)"]
DC["disk_cache"]
DC --> VOLS["vols[LOG_MAX_DBVOLID+1]<br/>per-volume disk_cache_volinfo"]
DC --> PERM["perm_purpose_info<br/>disk_perm_info"]
DC --> TEMP["temp_purpose_info<br/>disk_temp_info"]
PERM --> PEI["extend_info: disk_extend_info"]
TEMP --> TEI["extend_info: disk_extend_info"]
end
subgraph transient["Transient (per reserve call / per iteration)"]
RC["disk_reserve_context"]
RC --> CVR["cache_vol_reserve[]:<br/>disk_cache_vol_reserve"]
CUR["disk_stab_cursor"]
end
VH -. "cached as" .-> VOLS
STAB -. "walked by" .-> CUR
DC -. "drained into" .-> RC
```
Figure 1-1. Disk-side structure relationships. disk_Cache is the single in-memory summary of all volumes; disk_reserve_context and disk_stab_cursor are transient scratch used while reserving sectors and walking the bitmap.
disk_volume_header — the persisted page-0 of every volume
The only disk-manager structure with variable size: it ends in a `var_fields[1]` flexible region holding the full volume path strings, so `sizeof` is never used on it (note the literal comment `DON'T USE sizeof on this structure`).

```c
// disk_volume_header -- src/storage/disk_manager.c
struct disk_volume_header
{
  char magic[CUBRID_MAGIC_MAX_LENGTH];  /* magic for file/magic Unix utility; DON'T MOVE */
  INT16 iopagesize;
  INT16 volid;
  INT8 db_charset;
  INT8 dummy1;
  DB_VOLPURPOSE purpose;
  DB_VOLTYPE type;
  DKNPAGES sect_npgs;           /* pages per sector (== DISK_SECTOR_NPAGES = 64) */
  DKNSECTS nsect_total;
  DKNSECTS nsect_max;
  SECTID hint_allocsect;
  DKNPAGES stab_npages;
  PAGEID stab_first_page;
  PAGEID sys_lastpage;
  INT32 dummy2;
  INT64 db_creation;
  INT64 vol_creation;
  LOG_LSA chkpt_lsa;
  HFID boot_hfid;
  INT32 reserved0;
  INT32 reserved1;
  INT32 reserved2;
  INT32 reserved3;
  INT16 next_volid;
  INT16 offset_to_vol_fullname;
  INT16 offset_to_next_vol_fullname;
  INT16 offset_to_vol_remarks;
  char var_fields[1];           /* variable: vol_fullname, next_vol_fullname, remarks */
};
```

| Field | Role | Why it exists |
|---|---|---|
magic | Fixed signature at byte 0 | file/magic(5) and CUBRID’s own check identify a volume by it; must not move. |
iopagesize | IO page size at format | Sanity check only; authoritative size is in the log. |
volid | This volume’s id | Self-identification; traces a stray page to its volume. |
db_charset | Database charset code | Volume must match db charset; checked at attach. |
dummy1, dummy2 | Alignment padding. | — |
purpose | Permanent vs temporary data purpose | Picks which rollup the free space feeds (§1.2). |
type | Permanent vs temporary volume type | Differs from purpose: a perm-typed volume may serve temp purpose. |
sect_npgs | Pages per sector | Always 64; stored so the format is self-describing. |
nsect_total | Sectors currently formatted | Upper bound for sector ids that physically exist now. |
nsect_max | Max sectors after all extension | Sizes the allocation table once so the bitmap never moves. |
hint_allocsect | Next sector to scan | Skips known-full prefix of the bitmap. |
stab_npages | Table length in pages | DISK_STAB_NPAGES(nsect_max); bounds the cursor walk. |
stab_first_page | First bitmap page id | Table starts after the header; cursor maps offsets via this. |
sys_lastpage | Last system page | Everything <= sys_lastpage is header+table; user sectors follow. |
db_creation | DB creation timestamp | Replicated everywhere so a foreign volume can’t be attached. |
vol_creation | This volume’s creation time | Per-volume provenance. |
chkpt_lsa | Recovery start LSA | Recovery skips older log records for this volume. |
boot_hfid | Boot/system heap file id | Bootstraps multivolume access. |
reserved0..3 | Four spare INT32 for forward-compatible growth without an offset change. | — |
next_volid | Link to next volume | The volume set is a singly linked chain. |
offset_to_vol_fullname | Offset within var_fields | This volume’s path string. |
offset_to_next_vol_fullname | Offset within var_fields | Next volume’s path (chain followed without a catalog). |
offset_to_vol_remarks | Offset within var_fields | Free-text remarks. |
var_fields | Flexible tail | Holds the three strings; length is DB_PAGESIZE minus the byte offset of var_fields within the page. |
Invariant: the sector allocation table is sized once, for `nsect_max`, never for `nsect_total`. `stab_npages == DISK_STAB_NPAGES(nsect_max)`, and `sys_lastpage` covers the header plus the full table at creation. Extension (Ch.5) raises `nsect_total` toward `nsect_max` but never touches `stab_first_page`/`stab_npages`. If the table could move, every cached `disk_stab_cursor.pageid` and reserved `VSID` would dangle.
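The sizing rule can be sketched in a few lines. This is a minimal illustration under a stated assumption: every byte of a bitmap page stores allocation bits. The helper name `stab_npages_for` is hypothetical, and the real `DISK_STAB_NPAGES` macro may compute the page capacity differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the CUBRID macro: number of bitmap pages needed
 * to hold one bit per sector for nsect_max sectors, assuming every byte of
 * a page_size-byte page stores allocation bits. */
static int32_t
stab_npages_for (int32_t nsect_max, int32_t page_size)
{
  int32_t bits_per_page = page_size * 8;

  assert (nsect_max >= 0 && page_size > 0);
  return (nsect_max + bits_per_page - 1) / bits_per_page;   /* ceiling division */
}
```

Because the argument is `nsect_max` rather than `nsect_total`, the result is fixed at volume creation and does not change when the volume is extended.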
disk_cache_volinfo, disk_extend_info, disk_perm_info, disk_temp_info, disk_cache
These five form the in-memory free-space summary. `disk_cache` is the root; there is exactly one (`static DISK_CACHE *disk_Cache`).

```c
// disk_cache_volinfo -- src/storage/disk_manager.c
struct disk_cache_volinfo
{
  DB_VOLPURPOSE purpose;
  DKNSECTS nsect_free;          /* hint of free sectors on this volume */
};
```

| Field (disk_cache_volinfo) | Role | Why it exists |
|---|---|---|
purpose | Per-volume purpose (perm/temp) | Classifies vols[volid] without reading the volume header. |
nsect_free | Per-volume free-sector hint | Fast per-volume estimate; the bitmap holds the authoritative count, this is a cache hint. |
```c
// disk_extend_info -- src/storage/disk_manager.c
struct disk_extend_info
{
  volatile DKNSECTS nsect_free;         /* free sectors across all volumes of this purpose */
  volatile DKNSECTS nsect_total;
  volatile DKNSECTS nsect_max;
  volatile DKNSECTS nsect_intention;    /* sectors a thread intends to add by extending */
  pthread_mutex_t mutex_reserve;
#if !defined (NDEBUG)
  volatile int owner_reserve;           /* debug: tid holding mutex_reserve */
#endif
  DKNSECTS nsect_vol_max;
  VOLID volid_extend;
  DB_VOLTYPE voltype;
};
```

| Field (disk_extend_info) | Role | Why it exists |
|---|---|---|
nsect_free | Free sectors over all volumes of one purpose | Fast number reservation decrements before touching any bitmap (Ch.4); volatile for cross-thread visibility. |
nsect_total | Formatted sectors of this purpose | Distinguishes exhausted from merely fragmented. |
nsect_max | Ceiling of this purpose | Distinguishes “extend existing” from “add new volume”. |
nsect_intention | Sectors promised but not yet committed by an extender | Prevents thundering-herd extension (Ch.5). |
mutex_reserve | Lock guarding the four counters | Serializes the hot reservation path. |
owner_reserve | Debug owner tid | Debug-build-only lock-discipline aid (compiled out when NDEBUG is defined). |
nsect_vol_max | Largest sector count a new volume may take | Caps a single extension’s size. |
volid_extend | Volume the next extension grows | Cached target, no rescan. |
voltype | Volume type for this rollup | Tags perm vs temp. |
```c
// disk_perm_info / disk_temp_info -- src/storage/disk_manager.c
struct disk_perm_info
{
  DISK_EXTEND_INFO extend_info;
};

struct disk_temp_info
{
  DISK_EXTEND_INFO extend_info;
  DKNSECTS nsect_perm_free;     /* free sectors on PERMANENT volumes usable for temp purpose */
  DKNSECTS nsect_perm_total;
};
```

| Field | Role | Why it exists |
|---|---|---|
disk_perm_info.extend_info | The perm-purpose rollup | All permanent free space funnels here. |
disk_temp_info.extend_info | The temp-volume rollup | Free space on genuine temp volumes. |
disk_temp_info.nsect_perm_free | Free temp-usable sectors on perm volumes | Fallback pool when temp volumes exhausted (temp-on-perm); kept separate so temp alloc prefers real temp volumes first. |
disk_temp_info.nsect_perm_total | Total such sectors | Sizes the fallback pool. |
```c
// disk_cache -- src/storage/disk_manager.c
struct disk_cache
{
  int nvols_perm;
  int nvols_temp;
  DISK_CACHE_VOLINFO vols[LOG_MAX_DBVOLID + 1];  /* per-volume free hint, indexed by volid */
  DISK_PERM_PURPOSE_INFO perm_purpose_info;
  DISK_TEMP_PURPOSE_INFO temp_purpose_info;
  pthread_mutex_t mutex_extend;  /* never take while holding a reserve mutex */
#if !defined (NDEBUG)
  volatile int owner_extend;
#endif
};
```

| Field (disk_cache) | Role | Why it exists |
|---|---|---|
nvols_perm | Number of permanent volumes | Iteration bound / placement. |
nvols_temp | Number of temporary volumes | Same; temp volumes index from the high end of vols. |
vols[] | Per-volume disk_cache_volinfo | Direct vols[volid] lookup; sized LOG_MAX_DBVOLID + 1. |
perm_purpose_info | Permanent rollup | Aggregate perm free space. |
temp_purpose_info | Temporary rollup | Aggregate temp free space plus perm-fallback. |
mutex_extend | Lock for volume-set extension | Coarser than mutex_reserve. |
owner_extend | Debug owner | Debug builds only (compiled out when NDEBUG is defined). |
LOG_MAX_DBVOLID is VOLID_MAX - 1 (SHRT_MAX - 1), so vols[] indexes any valid VOLID.
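Invariant: lock ordering is `mutex_extend` before `mutex_reserve`, never the reverse. The comment on `mutex_extend` states the rule: never take it while holding a reserve mutex. Reservation (frequent) takes `mutex_reserve`; extension (rare) takes `mutex_extend`. A thread that holds `mutex_reserve` and discovers it must extend therefore has to release `mutex_reserve` before acquiring `mutex_extend`. Opposite ordering across two threads would deadlock. Ch.4 and Ch.5 rely on this.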
disk_stab_cursor and DISK_STAB_UNIT
The sector allocation table is a bitmap, one bit per sector. The iteration unit is a `UINT64`.

```c
// DISK_STAB_UNIT -- src/storage/disk_manager.c
typedef UINT64 DISK_STAB_UNIT;  /* one 64-bit word of the bitmap */
```

```c
// disk_stab_cursor -- src/storage/disk_manager.c
struct disk_stab_cursor
{
  const DISK_VOLUME_HEADER *volheader;
  PAGEID pageid;                /* current bitmap page id (real, not table-relative) */
  int offset_to_unit;
  int offset_to_bit;
  SECTID sectid;
  PAGE_PTR page;                /* fixed bitmap page (NULL until fixed) */
  DISK_STAB_UNIT *unit;         /* pointer to current unit inside page */
};
```

| Field | Role | Why it exists |
|---|---|---|
volheader | Volume being walked | Source of stab_first_page/nsect_total bounds. |
pageid | Current real bitmap page | From sectid plus stab_first_page. |
offset_to_unit | Which UINT64 word in the page | DISK_ALLOCTBL_SECTOR_UNIT_OFFSET. |
offset_to_bit | Which bit in the word | DISK_ALLOCTBL_SECTOR_BIT_OFFSET. |
sectid | Sector the cursor names | The (page, unit, bit) triple decomposes this. |
page | Pinned page pointer | NULL = no page fixed; non-NULL = a latch is held. |
unit | Pointer into page at offset_to_unit | Reads/writes the live word without recomputing the address. |
Invariant: `page == NULL` iff no latch is held. Crossing a page boundary must unfix the old `page` before fixing the next and resetting `unit`. A non-NULL `page` left after the walk is a leaked latch. Ch.4’s bitmap-commit step depends on this.
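The relationship between `sectid` and the `(pageid, offset_to_unit, offset_to_bit)` triple can be sketched as plain integer arithmetic. This is an illustration only: `cursor_locate`, `cursor_sectid`, and the `units_per_page` parameter are hypothetical names, and the sketch assumes each bitmap page stores a whole number of 64-bit units with no per-page header.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_BITS 64            /* bits in one DISK_STAB_UNIT (UINT64) */

/* Simplified stand-in for the positional fields of disk_stab_cursor. */
struct cursor_pos
{
  int32_t pageid;               /* real page id of the bitmap page */
  int offset_to_unit;           /* index of the 64-bit word inside the page */
  int offset_to_bit;            /* bit index inside that word */
};

/* Hypothetical helper: split a sector id into (page, unit, bit), assuming
 * each bitmap page stores units_per_page whole 64-bit words. */
static struct cursor_pos
cursor_locate (int32_t sectid, int32_t stab_first_page, int units_per_page)
{
  struct cursor_pos pos;
  int32_t bits_per_page = units_per_page * UNIT_BITS;

  assert (sectid >= 0 && units_per_page > 0);
  pos.pageid = stab_first_page + sectid / bits_per_page;
  pos.offset_to_unit = (sectid % bits_per_page) / UNIT_BITS;
  pos.offset_to_bit = sectid % UNIT_BITS;
  return pos;
}

/* Inverse mapping, useful as a consistency check on the decomposition. */
static int32_t
cursor_sectid (struct cursor_pos pos, int32_t stab_first_page, int units_per_page)
{
  int32_t unit_index = (pos.pageid - stab_first_page) * units_per_page + pos.offset_to_unit;

  return unit_index * UNIT_BITS + pos.offset_to_bit;
}
```

The round trip through both functions returning the original `sectid` is the property the table's "decomposes this" remark relies on.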
disk_cache_vol_reserve and disk_reserve_context
Transient scratch the two-step reservation (Ch.4) uses. `disk_reserve_context` lives on the caller’s stack for one reservation.

```c
// disk_cache_vol_reserve -- src/storage/disk_manager.c
struct disk_cache_vol_reserve
{
  VOLID volid;                  /* a volume from which sectors were drawn */
  DKNSECTS nsect;               /* how many sectors drawn from it */
};
```

```c
// disk_reserve_context -- src/storage/disk_manager.c
struct disk_reserve_context
{
  int nsect_total;              /* total sectors this request must reserve */
  VSID *vsidp;                  /* output cursor: next VSID write position */
  DISK_CACHE_VOL_RESERVE cache_vol_reserve[VOLID_MAX];  /* per-volume tally drawn from cache */
  int n_cache_vol_reserve;
  int n_cache_reserve_remaining;        /* entries not yet committed to bitmaps */
  DKNSECTS nsects_lastvol_remaining;    /* sectors still owed on the last volume */
  DB_VOLPURPOSE purpose;
};
```

| Field (disk_reserve_context) | Role | Why it exists |
|---|---|---|
nsect_total | Sectors the request needs | The loop’s goal. |
vsidp | Write cursor into the caller’s VSID[] | Each committed sector appends here. |
cache_vol_reserve[] | Per-volume plan (volume, count) | Step one fills it from cache; step two replays against bitmaps. Sized VOLID_MAX. |
n_cache_vol_reserve | Count of populated plan entries | Bounds the replay loop. |
n_cache_reserve_remaining | Entries not yet committed | Enables precise rollback. |
nsects_lastvol_remaining | Sectors still owed on the current volume | Progress within one entry. |
purpose | Request purpose | Routes to perm or temp rollup. |
disk_cache_vol_reserve is just a (volid, nsect) pair; an array is the reservation plan. DISK_PRERESERVE_BUF_DEFAULT (16) is the default batch the cache reserve fills.
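Step one of the two-step reservation, building the `(volid, nsect)` plan from per-volume free hints, can be sketched as follows. The function `plan_from_hints` and the simplified `vol_reserve` type are hypothetical; the real code also handles purposes, locking, and extension, none of which is modelled here.

```c
#include <assert.h>

struct vol_reserve
{
  short volid;                  /* mirrors disk_cache_vol_reserve.volid */
  int nsect;                    /* mirrors disk_cache_vol_reserve.nsect */
};

/* Hypothetical step one: take sectors from per-volume free hints, in volid
 * order, until `need` is met. Returns the number of plan entries written,
 * or -1 (with the hints restored) when the hints cannot cover the request. */
static int
plan_from_hints (int *nsect_free, int nvols, int need, struct vol_reserve *plan)
{
  int n = 0, volid, i;

  assert (need >= 0);
  for (volid = 0; volid < nvols && need > 0; volid++)
    {
      int take = nsect_free[volid] < need ? nsect_free[volid] : need;

      if (take == 0)
        continue;
      nsect_free[volid] -= take;        /* decrement the cache hint first */
      plan[n].volid = (short) volid;
      plan[n].nsect = take;
      n++;
      need -= take;
    }
  if (need > 0)
    {
      for (i = 0; i < n; i++)
        nsect_free[plan[i].volid] += plan[i].nsect;     /* roll the hints back */
      return -1;
    }
  return n;
}
```

Recording exactly what was taken from each volume is what makes the rollback in the failure branch precise, which mirrors the role of the `*_remaining` counters in the real context.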
1.2 The file side, at a glance
```mermaid
flowchart TB
subgraph hdrpage["File header page (page 0 of a file)"]
FH["file_header"]
FH --> TS["tablespace: file_tablespace"]
FH --> DESC["descriptor: file_descriptors (union, 64 B)"]
FH -. "offset_to_partial_ftab" .-> PART["partial table:<br/>file_extensible_data of file_partial_sector"]
FH -. "offset_to_full_ftab" .-> FULL["full table:<br/>file_extensible_data of VSID"]
FH -. "offset_to_user_page_ftab" .-> UPT["user page table (numerable):<br/>file_extensible_data of VPID"]
end
PART --> PS["file_partial_sector<br/>{ vsid, page_bitmap }"]
PART -. "vpid_next" .-> MOREP["overflow extdata page"]
```
Figure 1-2. File-side structure relationships. The header page embeds file_header, which carries the tablespace policy, the typed descriptor, and three byte-offsets into the three extensible tables co-located in the same page (overflowing via vpid_next).
file_header — the persisted page-0 of every file
```c
// file_header -- src/storage/file_manager.c
struct file_header
{
  INT64 time_creation;
  VFID self;                    /* this file's own VFID */
  FILE_TABLESPACE tablespace;
  FILE_DESCRIPTORS descriptor;

  /* Page counts. */
  int n_page_total;
  int n_page_user;
  int n_page_ftab;
  int n_page_free;              /* reserved-on-disk, not yet allocated */
  int n_page_mark_delete;       /* numerable: pages marked deleted */

  /* Sector counts. */
  int n_sector_total;
  int n_sector_partial;
  int n_sector_full;
  int n_sector_empty;           /* empty sectors are a subset of partial */

  FILE_TYPE type;
  INT32 file_flags;             /* NUMERABLE / TEMPORARY / ENCRYPTED_* */
  VOLID volid_last_expand;
  INT16 offset_to_partial_ftab;
  INT16 offset_to_full_ftab;
  INT16 offset_to_user_page_ftab;       /* user page table (numerable only) */
  VPID vpid_sticky_first;       /* first page if sticky; never deallocated */

  /* Temporary files: last-allocation cursor. */
  VPID vpid_last_temp_alloc;
  int offset_to_last_temp_alloc;

  /* Numerable files. */
  VPID vpid_last_user_page_ftab;        /* last page of user page table (append point) */
  VPID vpid_find_nth_last;      /* cache: page of last find_nth result */
  int first_index_find_nth_last;        /* cache: index of first entry in that page */

  INT32 reserved0;
  INT32 reserved1;
  INT32 reserved2;
  INT32 reserved3;
};
```

| Field | Role | Why it exists |
|---|---|---|
time_creation | File creation timestamp | Provenance. |
self | The file’s own VFID | A header page in isolation knows its file. |
tablespace | Expansion policy (below) | Drives growth aggressiveness. |
descriptor | Typed metadata union (below) | Each type stashes its ids here. |
n_page_total | Pages owned (user + table + free) | Master accounting. |
n_page_user | Pages handed to the owner | The useful count. |
n_page_ftab | Pages used by the three tables | Overhead; grows on overflow. |
n_page_free | Reserved-but-unallocated pages | Available without a new reservation. |
n_page_mark_delete | Numerable pages marked deleted | Numerable files flag, not remove (Ch.10). |
n_sector_total | Sectors reserved | = n_sector_partial + n_sector_full. |
n_sector_partial | Partial sectors | Have a free page; in the partial table. |
n_sector_full | Full sectors | All 64 pages used; in the full table (perm only). |
n_sector_empty | Sectors with zero pages | Subset of partial; reclaimed first on extension. |
type | FILE_TYPE | Selects table layout and numerable/temp eligibility. |
file_flags | Bit flags | NUMERABLE 0x1, TEMPORARY 0x2, ENCRYPTED_AES 0x4, ENCRYPTED_ARIA 0x8; via FILE_IS_*. |
volid_last_expand | Volume last grown | Locality hint for next expansion. |
offset_to_partial_ftab | Offset to partial table in this page | FILE_HEADER_GET_PART_FTAB asserts range. |
offset_to_full_ftab | Offset to full table | Asserted non-temporary (temp has no full table). |
offset_to_user_page_ftab | Offset to user page table | Numerable only; asserted numerable. |
vpid_sticky_first | First page, if sticky | Never deallocated (Ch.11). |
vpid_last_temp_alloc | Temp alloc cursor: page | Temp files alloc forward, never dealloc (Ch.8). |
offset_to_last_temp_alloc | Temp alloc cursor: sector offset | Offset component of that cursor. |
vpid_last_user_page_ftab | Numerable: append point | New user pages appended here (Ch.10). |
vpid_find_nth_last | Numerable: cached find-nth page | Optimizes sequential find-nth (Ch.10). |
first_index_find_nth_last | Numerable: cached first-entry index | Companion to the cache above. |
reserved0..3 | Four spare INT32 for forward compatibility. | — |
Invariant: accounting balances. `n_page_total == n_page_user + n_page_ftab + n_page_free`, `n_sector_total == n_sector_partial + n_sector_full`, and `n_sector_empty <= n_sector_partial`. Every alloc/dealloc in Ch.7–Ch.9 adjusts these as a set under the header latch. Drift means the file believes it owns space it does not, or leaks it; `file_validate` and the `FILE_HEADER_GET_*_FTAB` assertions guard against it. The empty-subset relation lets extension prefer empty sectors without a separate table.
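The three identities translate directly into a predicate. The sketch below copies only the counters involved; `file_counts`, `file_counts_balanced`, and `file_counts_alloc_user_page` are hypothetical names for illustration and do not appear in `file_manager.c`.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the file_header counters that take part in the identities. */
struct file_counts
{
  int n_page_total, n_page_user, n_page_ftab, n_page_free;
  int n_sector_total, n_sector_partial, n_sector_full, n_sector_empty;
};

/* True when all three accounting relations from the invariant hold. */
static bool
file_counts_balanced (const struct file_counts *c)
{
  return c->n_page_total == c->n_page_user + c->n_page_ftab + c->n_page_free
    && c->n_sector_total == c->n_sector_partial + c->n_sector_full
    && c->n_sector_empty <= c->n_sector_partial;
}

/* Hypothetical update: handing one reserved page to the owner moves it
 * from the free count to the user count, leaving the total unchanged. */
static void
file_counts_alloc_user_page (struct file_counts *c)
{
  assert (c->n_page_free > 0);
  c->n_page_free--;
  c->n_page_user++;
}
```

Adjusting the two counters together is the "as a set" requirement in miniature: changing only one of them makes the predicate fail.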
file_extensible_data — the generic multi-page table component
All three file tables are `file_extensible_data`: a small header followed by an array of fixed-size items, chained page-to-page.

```c
// file_extensible_data -- src/storage/file_manager.c
struct file_extensible_data
{
  VPID vpid_next;               /* next component page, NULL if last */
  INT16 max_size;               /* capacity in bytes for items in this component */
  INT16 size_of_item;           /* byte size of one item */
  INT16 n_items;                /* number of items currently stored */
};
```

| Field | Role | Why it exists |
|---|---|---|
vpid_next | Link to next component | Chains overflow pages; NULL terminates. |
max_size | Byte capacity here | Bounds n_items. |
size_of_item | Size of one item | Partial table = file_partial_sector (16 B), full = VSID (8 B), user-page = VPID (8 B). One format, three item types. |
n_items | Items stored | Drives iteration; insert/delete bump it. |
Invariant: `n_items * size_of_item <= max_size`, with items kept densely packed from `FILE_EXTDATA_HEADER_ALIGNED_SIZE`. An insert that would overflow allocates a new component linked via `vpid_next`; a delete shifts the tail down. The density invariant is what lets find-nth index by position (Ch.6–Ch.10).
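The capacity and density rules can be expressed with three small helpers. These are hypothetical stand-ins (`extdata_counts`, `extdata_max_items`, `extdata_is_full`, `extdata_item_offset`), written only to show the arithmetic; the `vpid_next` link and the header offset are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counters of file_extensible_data relevant to capacity (vpid_next omitted). */
struct extdata_counts
{
  int16_t max_size;             /* byte capacity for items */
  int16_t size_of_item;         /* byte size of one item */
  int16_t n_items;              /* items currently stored */
};

/* Largest n_items that still satisfies n_items * size_of_item <= max_size. */
static int
extdata_max_items (const struct extdata_counts *e)
{
  assert (e->size_of_item > 0);
  return e->max_size / e->size_of_item;
}

/* A full component forces the next insert into a new page via vpid_next. */
static bool
extdata_is_full (const struct extdata_counts *e)
{
  return e->n_items >= extdata_max_items (e);
}

/* Byte offset of item `index` from the start of the item area. Dense
 * packing means position alone determines the address. */
static int
extdata_item_offset (const struct extdata_counts *e, int index)
{
  assert (index >= 0 && index < e->n_items);
  return index * e->size_of_item;
}
```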
file_partial_sector, FILE_ALLOC_BITMAP, file_tablespace, file_descriptors
```c
// FILE_ALLOC_BITMAP -- src/storage/file_manager.h
typedef UINT64 FILE_ALLOC_BITMAP;       /* one bit per page in a sector (64 pages) */
#define FILE_FULL_PAGE_BITMAP 0xFFFFFFFFFFFFFFFF        /* Full allocation bitmap */
#define FILE_EMPTY_PAGE_BITMAP 0x0000000000000000       /* Empty allocation bitmap */
#define FILE_ALLOC_BITMAP_NBITS ((int) (sizeof (FILE_ALLOC_BITMAP) * CHAR_BIT))  /* 64 */
```

```c
// file_partial_sector -- src/storage/file_manager.h
struct file_partial_sector
{
  VSID vsid;                    /* MUST be first member: reinterpreted as VSID in file table */
  FILE_ALLOC_BITMAP page_bitmap;
};
```

`VSID` is `{ int32_t sectid; short volid; }` = 6 bytes padded to 8; `FILE_ALLOC_BITMAP` is a `UINT64` = 8 bytes; so `sizeof (file_partial_sector) == 16`.
| Field | Role | Why it exists |
|---|---|---|
file_partial_sector.vsid | The reserved sector’s id (8 B) | First member by contract: the full table stores bare VSID, so a file_partial_sector* is reinterpreted as VSID* on promotion. |
file_partial_sector.page_bitmap | 64-bit page allocation map (8 B) | Bit i set = page i allocated. FILE_FULL_PAGE_BITMAP = full; FILE_EMPTY_PAGE_BITMAP = empty. |
Invariant: `vsid` is the first member, deliberately. The source comment: “VSID must be first member … the FILE_PARTIAL_SECTOR pointers in file table are reinterpreted as VSID.” When a partial sector fills, the file manager moves the leading `VSID` bytes into the full table without copying the bitmap. Reordering would corrupt the full table silently. `FILE_ALLOC_BITMAP_NBITS == DISK_SECTOR_NPAGES == 64`, so one bitmap covers one sector exactly.
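A small sketch shows both halves of this invariant: the bitmap operations and the first-member reinterpretation. The lowercase types and the `partial_sector_*` helpers are hypothetical simplifications, not the file manager's own functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vsid
{
  int32_t sectid;
  short volid;
};

struct partial_sector
{
  struct vsid vsid;             /* first member, so the entry's address is also the VSID's */
  uint64_t page_bitmap;         /* bit i set = page i of the sector is allocated */
};

#define FULL_BITMAP UINT64_C (0xFFFFFFFFFFFFFFFF)

/* Mark page `offset` (0..63) of the sector as allocated. */
static void
partial_sector_set_page (struct partial_sector *ps, int offset)
{
  assert (offset >= 0 && offset < 64);
  ps->page_bitmap |= UINT64_C (1) << offset;
}

/* A sector is full once all 64 bits are set. */
static bool
partial_sector_is_full (const struct partial_sector *ps)
{
  return ps->page_bitmap == FULL_BITMAP;
}

/* Promotion view: the leading bytes of the entry are the VSID that the
 * full table stores. Valid in C because vsid is the first member. */
static const struct vsid *
partial_sector_as_vsid (const struct partial_sector *ps)
{
  return (const struct vsid *) ps;
}
```

If `page_bitmap` were declared first, `partial_sector_as_vsid` would hand back bitmap bytes interpreted as a sector id, which is the silent corruption the invariant warns about.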
```c
// file_tablespace -- src/storage/file_manager.h
struct file_tablespace
{
  INT64 initial_size;           /* bytes the file starts with */
  float expand_ratio;           /* fraction of current size to add when expanding */
  int expand_min_size;          /* lower clamp on an expansion, in bytes */
  int expand_max_size;          /* upper clamp on an expansion, in bytes */
};
```

| Field | Role | Why it exists |
|---|---|---|
initial_size | Starting byte size | MAX(1, npages) * DB_PAGESIZE at create. |
expand_ratio | Growth fraction | ~1% of current size for perm (FILE_TABLESPACE_DEFAULT_RATIO_EXPAND); 0 for temp. |
expand_min_size | Minimum expansion | At least one sector for perm; 0 for temp. |
expand_max_size | Maximum expansion | Caps one growth (1024 sectors perm; 0 temp). |
Temp files use FILE_TABLESPACE_FOR_TEMP_NPAGES, zeroing ratio/min/max — temp files do not auto-expand the same way.
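How the four tablespace fields could combine into one expansion size is sketched below. This is an assumption-laden illustration: `tablespace_expand_bytes` is a hypothetical helper that applies the ratio to the current size and clamps the result, which follows the field comments but is not taken from the file manager's expansion code.

```c
#include <assert.h>
#include <stdint.h>

struct tablespace
{
  int64_t initial_size;         /* bytes the file starts with */
  float expand_ratio;           /* fraction of current size to add */
  int expand_min_size;          /* lower clamp, in bytes */
  int expand_max_size;          /* upper clamp, in bytes */
};

/* Hypothetical: bytes to add on one expansion = ratio * current size,
 * clamped to [expand_min_size, expand_max_size]. */
static int64_t
tablespace_expand_bytes (const struct tablespace *ts, int64_t current_size)
{
  int64_t grow = (int64_t) ((double) current_size * ts->expand_ratio);

  if (grow < ts->expand_min_size)
    grow = ts->expand_min_size;
  if (grow > ts->expand_max_size)
    grow = ts->expand_max_size;
  return grow;
}
```

With ratio, minimum, and maximum all zero, as in the temporary-file tablespace, the helper always returns 0, which matches the note that temp files do not auto-expand through this policy.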
```c
// file_descriptors -- src/storage/file_manager.h
/* note: if you change file descriptors size, make sure to change disk compatibility version too! */
#define FILE_DESCRIPTORS_SIZE 64
union file_descriptors
{
  FILE_HEAP_DES heap;
  FILE_OVF_HEAP_DES heap_overflow;
  FILE_BTREE_DES btree;
  FILE_OVF_BTREE_DES btree_key_overflow;        /* TODO: rename FILE_OVF_BTREE_DES */
  FILE_EHASH_DES ehash;
  FILE_VACUUM_DATA_DES vacuum_data;
  char dummy_align[FILE_DESCRIPTORS_SIZE];
};
```

The per-member struct shapes below are added annotations (the source defines each `FILE_*_DES` separately, not inline):
| Member | Shape (annotation) | Role | Why it exists |
|---|---|---|---|
heap | { OID class_oid; HFID hfid; } | Heap file’s class OID + HFID | A heap file points back to its class and heap id. |
heap_overflow | { HFID hfid; OID class_oid; } | Overflow heap’s HFID + class OID | Overflow records for large heap rows. |
btree | { OID class_oid; int attr_id; } | Index file’s class OID + attribute id | A btree file knows the indexed class and attribute. |
btree_key_overflow | { BTID btid; OID class_oid; } | Long-key overflow file (file_ovf_btree_des) | Long keys overflow into a separate file. |
ehash | { OID class_oid; int attr_id; } | Extensible hash’s class OID + attr id | Identifies the hashed attribute. |
vacuum_data | { VPID vpid_first; } | First VPID of vacuum data | Vacuum’s bookkeeping file. |
dummy_align | char[FILE_DESCRIPTORS_SIZE] | 64-byte padding | Pins the union at FILE_DESCRIPTORS_SIZE; the source ties this size to the on-disk compatibility version, so it must not change casually. |
The union is interpreted per file_header.type. FILE_TYPE_CAN_BE_NUMERABLE, FILE_TYPE_IS_ALWAYS_TEMP, and the file_flags bits decide which tables the file actually carries — covered in Ch.6 and Ch.10.
1.3 Chapter summary — key takeaways
- There are two persisted page-0 structures, `disk_volume_header` (one per volume) and `file_header` (one per file); the rest either summarize them in memory (`disk_cache` family) or are scratch for one operation (`disk_reserve_context`, `disk_stab_cursor`).
- The disk manager hands out sectors (64-page extents) tracked by a one-bit-per-sector table sized once for `nsect_max`; the table is immovable, which is why every reserved `VSID` and cached cursor stays valid across volume extension.
- `disk_cache` is the single in-memory free-space oracle: `vols[]` per-volume hints feed two purpose rollups (`disk_perm_info`, `disk_temp_info`), and the lock order `mutex_extend` before `mutex_reserve` (never take `mutex_extend` while holding a reserve mutex) is a hard invariant against deadlock.
- Sector reservation is two-step: `disk_reserve_context` drains a plan from the cache into `cache_vol_reserve[]`, then replays it against the bitmaps via a `disk_stab_cursor`; the `*_remaining` counters make a partial reservation precisely reversible.
- The file manager carves pages from reserved sectors using three `file_extensible_data` tables (partial, full, user-page), all in the same chained, densely-packed, fixed-item format and differing only in `size_of_item`.
- `file_partial_sector` (16 B) puts `vsid` first on purpose so a filled sector can be promoted to the full table by reinterpreting the pointer as a bare `VSID`; its 64-bit `page_bitmap` maps exactly one sector’s 64 pages.
- `file_header`’s page and sector counters must balance as accounting identities; its three `offset_to_*_ftab`, the temp-alloc cursor, and the numerable find-nth cache are the only state distinguishing regular, temporary, and numerable files. Their operational meaning is deferred to Ch.7, Ch.8, and Ch.10.
Chapter 2: Initialization and Memory Management
The in-memory structures from Chapter 1 (the `disk_cache` family) have no on-disk form; the source of truth is the
per-volume header plus its sector allocation table (Chapter 3), and disk_Cache
is a derived rollup recomputed from those headers at every boot. This chapter
answers: where do disk_Cache and the file-manager globals come from at
server start, and how is the cache rebuilt by walking the mounted-volume
chain? For why CUBRID keeps a coarse RAM counter, see the companion’s
“In-memory cache” section.
2.1 The bootstrap call chain
Two modules wake up at boot: the disk manager (owns `disk_Cache`,
`disk_manager_init`) reconstructs state from disk; the file manager (owns
`file_Tempcache` and the tracker globals, `file_manager_init`) only zeroes RAM.

```mermaid
flowchart TD
  boot["server boot"] --> dmi["disk_manager_init(load_from_disk=true)"]
  dmi --> dci["disk_cache_init -> malloc disk_Cache"]
  dci --> dclav["disk_cache_load_all_volumes"]
  dclav --> fmm["fileio_map_mounted: walk mounted volumes"]
  fmm --> dclv["disk_cache_load_volume (per volid)"]
  dclv --> dvb["disk_volume_boot: read header + count free"]
  boot --> fmi["file_manager_init"]
  fmi --> ftci["file_tempcache_init -> zero file_Tempcache"]
```
Figure 2-1. Boot-time initialization fan-out.
2.2 disk_manager_init — parameter capture, reload guard, optional load
`disk_manager_init` does four things in order: derive the temp-volume sector
cap, capture the logging flag, (re)allocate the cache, and conditionally load
from disk.
```c
// disk_manager_init -- src/storage/disk_manager.c
int
disk_manager_init (THREAD_ENTRY * thread_p, bool load_from_disk)
{
  int error_code = NO_ERROR;

  disk_Temp_max_sects = (DKNSECTS) prm_get_integer_value (PRM_ID_BOSR_MAXTMP_PAGES);
  if (disk_Temp_max_sects < 0)
    disk_Temp_max_sects = SECTID_MAX;   /* <- negative param means "no cap" (infinite) */
  else
    disk_Temp_max_sects = disk_Temp_max_sects / DISK_SECTOR_NPAGES;     /* <- pages -> sectors */
  // ... condensed: disk_Logging = prm_get_bool_value (PRM_ID_DISK_LOGGING) ...

  if (disk_Cache != NULL)
    disk_cache_final ();        /* <- idempotent reload: tear down stale cache first */
  error_code = disk_cache_init ();
  if (error_code != NO_ERROR)
    {
      ASSERT_ERROR ();
      return error_code;        /* <- malloc failure: nothing to clean up */
    }
  assert (disk_Cache != NULL);

  if (load_from_disk && !disk_cache_load_all_volumes (thread_p))
    {
      ASSERT_ERROR_AND_SET (error_code);
      disk_manager_final ();    /* <- partial load failed: roll the whole cache back */
      return error_code;
    }
  return NO_ERROR;
}
```

Branch accounting:
| Branch | Condition | Effect |
|---|---|---|
disk_Temp_max_sects < 0 | param negative (default -1) | cap = SECTID_MAX -> infinite temp space |
| else | param >= 0 | param is a page count; / DISK_SECTOR_NPAGES -> sector cap |
disk_Cache != NULL | prior cache exists (reload) | disk_cache_final frees it first — makes init idempotent |
disk_cache_init != NO_ERROR | malloc failed | early return; nothing allocated to free |
load_from_disk && load fails | a volume failed to boot | disk_manager_final frees the half-cache, propagate error |
load_from_disk == false | first-volume format path | cache stays empty; caller fills it manually |
The static initializer static DKNSECTS disk_Temp_max_sects = -2; is a
pre-init sentinel (“not yet computed”), distinct from the parameter
default -1 (“Infinite”). disk_manager_init always overwrites it from
PRM_ID_BOSR_MAXTMP_PAGES (temp_file_max_size_in_pages) per the branch table;
this later bounds permanent-volume-as-temp growth.
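The parameter-to-cap conversion from the branch table is small enough to isolate. In this sketch `temp_max_sects_from_param` is a hypothetical helper, and `SECTID_LIMIT` is a stand-in for `SECTID_MAX` whose actual value is assumed here to be `INT32_MAX`.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_NPAGES 64                /* DISK_SECTOR_NPAGES */
#define SECTID_LIMIT  INT32_MAX         /* stand-in for SECTID_MAX (value assumed) */

/* Convert the temp-space page limit parameter into a sector cap. */
static int32_t
temp_max_sects_from_param (int32_t max_tmp_pages)
{
  if (max_tmp_pages < 0)
    return SECTID_LIMIT;                /* negative parameter: no cap */
  return max_tmp_pages / SECTOR_NPAGES; /* pages -> whole sectors, rounded down */
}
```

Integer division rounds down, so a limit that is not a multiple of 64 pages loses its partial sector under this reading.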
Invariant: the reload path is destructive-then-rebuilding. `disk_manager_init` may run more than once (reload after recovery phases), so it must never leak the old cache. The `disk_Cache != NULL` guard calls `disk_cache_final` first; without it a second init leaks the previous allocation and its three mutexes.
2.3 disk_cache_init — allocating and zeroing the global cache
`disk_cache_init` is the only allocator of `disk_Cache`. It mallocs one flat
`DISK_CACHE` (the `vols[]` array is inline, sized `LOG_MAX_DBVOLID + 1`), then
zeroes every counter so the per-volume load can simply add into the rollup.
```c
// disk_cache_init -- src/storage/disk_manager.c
static int
disk_cache_init (void)
{
  int i;
  assert (disk_Cache == NULL);  /* <- never double-allocate */
  disk_Cache = (DISK_CACHE *) malloc (sizeof (DISK_CACHE));
  if (disk_Cache == NULL)
    {
      /* ... er_set OUT_OF_VIRTUAL_MEMORY, return ER_OUT_OF_VIRTUAL_MEMORY ... */
    }

  disk_Cache->nvols_perm = disk_Cache->nvols_temp = 0;
  disk_Cache->perm_purpose_info.extend_info.nsect_vol_max =     /* default new-vol size */
    DISK_SECTS_ROUND_UP ((DKNSECTS) (prm_get_bigint_value (PRM_ID_DB_VOLUME_SIZE) / IO_SECTORSIZE));
  // ... condensed: perm free/total/max = 0 (load ADDS in); volid_extend = NULL_VOLID; voltype = PERM ...
  // ... condensed: temp extend_info same vol_max, zeroed, NULL_VOLID; nsect_perm_free/total = 0 ...
  // ... condensed: 3 pthread_mutex_init (perm/temp mutex_reserve, mutex_extend) ...
  for (i = 0; i <= LOG_MAX_DBVOLID; i++)        /* <- inclusive of highest legal volid */
    {
      disk_Cache->vols[i].purpose = DISK_UNKNOWN_PURPOSE;       /* <- every slot starts "no volume here" */
      disk_Cache->vols[i].nsect_free = 0;
    }
  return NO_ERROR;
}
```

`nsect_vol_max` (both purposes) is the default new-volume size for later
auto-extension, not a current value. Both volid_extend start NULL_VOLID
(discovered during load), both nvols_* start 0, and every slot starts
DISK_UNKNOWN_PURPOSE / zero free. Since load only adds, a fresh
disk_cache_init must precede any load.
2.4 disk_cache_load_all_volumes — walking the mounted-volume chain
disk_cache_load_all_volumes is a thin wrapper: it asserts the cache exists
and returns fileio_map_mounted (thread_p, disk_cache_load_volume, NULL),
handing the per-volume callback to the chain walker.
fileio_map_mounted (in file_io.c) is that walker. It iterates the
file-IO volume-info header in two passes: permanent volumes ascending from volid
0 up to next_perm_volid - 1, then temporary volumes descending to
next_temp_volid (the file-IO equivalent of the on-disk next_volid chain).
Unmounted slots (vol_info_p->vdes == NULL_VOLDES) are skipped. If the callback
returns false, the walk stops and returns false, which disk_manager_init
treats as fatal.
flowchart TD
start["fileio_map_mounted"] --> permloop{"perm volid <= next_perm_volid-1?"}
permloop -- "vdes live" --> cb1["disk_cache_load_volume(volid)"]
permloop -- "skip / done" --> temploop{"temp volid >= next_temp_volid?"}
cb1 -- false --> stopf["return false"]
cb1 -- true --> permloop
temploop -- "vdes live" --> cb2["disk_cache_load_volume(volid)"]
cb2 -- false --> stopf
cb2 -- true --> temploop
temploop -- done --> okt["return true"]
Figure 2-2. fileio_map_mounted two-pass walk driving the cache load.
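The two-pass walk can be sketched in isolation. This is a minimal model, not the file_io.c code: every `sketch_` name is hypothetical, the eight-slot table stands in for the real volume-info header, and starting the temporary pass at the top slot is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_VOLID 7        /* hypothetical; stands in for LOG_MAX_DBVOLID */
#define SKETCH_NULL_VOLDES (-1)   /* hypothetical; stands in for NULL_VOLDES */

typedef bool (*sketch_vol_func) (int volid, void *arg);

struct sketch_volinfo_header
{
  int vdes[SKETCH_MAX_VOLID + 1]; /* descriptor per slot; NULL_VOLDES if unmounted */
  int next_perm_volid;            /* one past the highest permanent volid */
  int next_temp_volid;            /* lowest temporary volid in use */
};

struct sketch_visit_log
{
  int volids[SKETCH_MAX_VOLID + 1];
  int count;
  int stop_at;                    /* callback returns false at this volid */
};

/* Example callback: record the visit, fail on a chosen volume. */
static bool
sketch_record_visit (int volid, void *arg)
{
  struct sketch_visit_log *log = (struct sketch_visit_log *) arg;
  assert (log->count <= SKETCH_MAX_VOLID);
  log->volids[log->count++] = volid;
  return volid != log->stop_at;
}

static bool
sketch_map_mounted (const struct sketch_volinfo_header *h, sketch_vol_func f, void *arg)
{
  int volid;

  /* Pass 1: permanent volumes, ascending. */
  for (volid = 0; volid <= h->next_perm_volid - 1; volid++)
    {
      if (h->vdes[volid] == SKETCH_NULL_VOLDES)
        continue;                 /* unmounted slot */
      if (!f (volid, arg))
        return false;             /* callback failure stops the walk */
    }
  /* Pass 2: temporary volumes, descending (top slot is an assumption). */
  for (volid = SKETCH_MAX_VOLID; volid >= h->next_temp_volid; volid--)
    {
      if (h->vdes[volid] == SKETCH_NULL_VOLDES)
        continue;
      if (!f (volid, arg))
        return false;
    }
  return true;
}
```

With slots 0, 2, 6 and 7 mounted, `next_perm_volid = 3` and `next_temp_volid = 6`, the callback sees volumes 0, 2, 7, 6 in that order; a false return at volume 2 ends the walk before any temporary volume is visited.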
2.5 disk_cache_load_volume — rolling one header into the rollup
The heart of cache reconstruction. Per volume it boots the header via
disk_volume_boot (reads the header, counts free sectors — Chapter 3), then
folds the result into the right purpose info.
```c
// disk_cache_load_volume -- src/storage/disk_manager.c
static bool
disk_cache_load_volume (THREAD_ENTRY * thread_p, INT16 volid, void *ignore)
{
  DB_VOLPURPOSE vol_purpose;
  DB_VOLTYPE vol_type;
  DISK_VOLUME_SPACE_INFO space_info = DISK_VOLUME_SPACE_INFO_INITIALIZER;

  if (disk_volume_boot (thread_p, volid, &vol_purpose, &vol_type, &space_info) != NO_ERROR)
    {
      ASSERT_ERROR ();
      return false;             /* <- aborts the whole map walk */
    }

  if (vol_type != DB_PERMANENT_VOLTYPE)
    {
      /* don't save temporary volumes... they will be dropped anyway */
      return true;              /* <- temp-type volumes are not cached at all */
    }

  if (vol_purpose == DB_PERMANENT_DATA_PURPOSE)
    {
      // perm_purpose_info.extend_info.nsect_{free,total,max} += space_info.n_{free,total,max}_sects
      // ... condensed: assert nsect_free <= nsect_total <= nsect_max ...
      if (space_info.n_total_sects < space_info.n_max_sects)
        {
          assert (disk_Cache->perm_purpose_info.extend_info.volid_extend == NULL_VOLID);
          disk_Cache->perm_purpose_info.extend_info.volid_extend = volid;       /* <- this vol can still grow */
        }
    }
  else                          /* perm type, temp purpose */
    {
      assert (space_info.n_total_sects == space_info.n_max_sects);      /* <- perm-as-temp is fully grown */
      // temp_purpose_info.nsect_perm_{free,total} += space_info.n_{free,total}_sects
      // ... condensed: assert nsect_perm_free <= nsect_perm_total ...
    }

  disk_Cache->vols[volid].nsect_free = space_info.n_free_sects;
  disk_Cache->vols[volid].purpose = vol_purpose;
  disk_Cache->nvols_perm++;     /* <- runs for BOTH branches above */
  return true;
}
```

Branch accounting:
| Branch | Condition | Effect |
|---|---|---|
| boot fails | disk_volume_boot != NO_ERROR | return false; map walk and whole init abort |
| vol_type != DB_PERMANENT_VOLTYPE | temporary-type volume | return true; not cached (dropped/reformatted at boot) |
| vol_purpose == DB_PERMANENT_DATA_PURPOSE | perm volume, perm data | add free/total/max into perm_purpose_info.extend_info; if below max size, set volid_extend |
| else (perm type, temp purpose) | perm volume repurposed for temp | add free/total into temp_purpose_info.nsect_perm_*; assert fully grown |
The else-branch is the subtle case: type (survives restart?) and purpose
(what it holds) are orthogonal. A perm-type/temp-purpose volume’s space rolls
into nsect_perm_* (“permanent sectors lent to temp”), distinct from
temp_purpose_info.extend_info (genuine temporary-type volumes, skipped above).
The perm-path assert (... == NULL_VOLID) enforces at most one permanent volume
“growing”. After the if/else, the slot recording (vols[volid].*) and
nvols_perm++ run unconditionally for every permanent-TYPE volume regardless
of purpose — so a perm-as-temp volume is counted in nvols_perm, never
nvols_temp; since temporary-type volumes returned early, after a full load
nvols_temp == 0.
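The type/purpose split above can be checked with a small model of the rollup. This is a sketch under stated assumptions: the `sketch_` types are hypothetical stand-ins for DISK_CACHE and DISK_VOLUME_SPACE_INFO, `-1` plays the role of NULL_VOLID, and boot failures are left out.

```c
#include <assert.h>

enum sketch_voltype { SKETCH_PERM_TYPE, SKETCH_TEMP_TYPE };
enum sketch_purpose { SKETCH_PERM_DATA, SKETCH_TEMP_DATA };

struct sketch_space { int n_free, n_total, n_max; };

struct sketch_cache
{
  int nvols_perm, nvols_temp;
  int perm_free, perm_total, perm_max;  /* models perm_purpose_info.extend_info */
  int volid_extend;                     /* -1 models NULL_VOLID */
  int temp_perm_free, temp_perm_total;  /* models temp_purpose_info.nsect_perm_* */
};

static void
sketch_load_volume (struct sketch_cache *c, int volid, enum sketch_voltype type,
                    enum sketch_purpose purpose, struct sketch_space s)
{
  if (type != SKETCH_PERM_TYPE)
    return;                             /* temporary-type volumes are not cached */

  if (purpose == SKETCH_PERM_DATA)
    {
      c->perm_free += s.n_free;
      c->perm_total += s.n_total;
      c->perm_max += s.n_max;
      if (s.n_total < s.n_max)
        {
          assert (c->volid_extend == -1);       /* at most one growing volume */
          c->volid_extend = volid;
        }
    }
  else                                  /* perm type, temp purpose */
    {
      assert (s.n_total == s.n_max);    /* perm-as-temp is fully grown */
      c->temp_perm_free += s.n_free;
      c->temp_perm_total += s.n_total;
    }
  c->nvols_perm++;                      /* both branches count as permanent */
}
```

Loading two perm-data volumes, one perm-type/temp-purpose volume and one temp-type volume leaves `nvols_perm == 3` and `nvols_temp == 0`, which is the counting rule the prose describes.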
Invariant — the cache is a derived rollup and may legitimately undercount.
nsect_free is allowed to be lower than reality at any time; the two-step reservation protocol (Chapter 4) depends on this: a reservation may pessimistically decrement the cache and reconcile against the allocation table later. Never treat nsect_free as exact; the allocation table is the source of truth.
2.6 disk_manager_final / disk_cache_final — teardown
Teardown is branch-light; disk_manager_final delegates to disk_cache_final.
```c
// disk_manager_final -- src/storage/disk_manager.c
void
disk_manager_final (void)
{
  disk_cache_final ();
}

// disk_cache_final -- src/storage/disk_manager.c
static void
disk_cache_final (void)
{
  if (disk_Cache == NULL)
    {
      return;                   /* <- safe to call when never initialized */
    }
  // ... condensed: assert perm/temp owner_reserve == -1 and owner_extend == -1 (no lock held at teardown) ...
  // ... condensed: pthread_mutex_destroy the perm/temp mutex_reserve and mutex_extend ...
  free_and_init (disk_Cache);   /* <- frees and NULLs the pointer */
}
```

The disk_Cache == NULL guard makes final idempotent, which is why both the
reload path and the load-failure rollback call it unconditionally. The three
owner_* asserts (debug only) document that no thread may hold the reserve or
extend mutex at teardown — a violation is caught here, not as a destroyed
locked mutex. free_and_init zeroes the pointer so a later disk_cache_init
passes assert (disk_Cache == NULL).
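The init/final pairing reduces to a NULL-sentinel pattern. The sketch below assumes a one-field cache and hypothetical `sketch_` names, and omits the mutexes and owner asserts of the real teardown.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_cache { int nvols_perm; };

static struct sketch_cache *sketch_Cache = NULL;

/* Idempotent teardown: a NULL pointer means there is nothing to free. */
static void
sketch_cache_final (void)
{
  if (sketch_Cache == NULL)
    return;
  free (sketch_Cache);
  sketch_Cache = NULL;          /* the "init" half of free_and_init */
}

static int
sketch_cache_init (void)
{
  assert (sketch_Cache == NULL);        /* never double-allocate */
  sketch_Cache = (struct sketch_cache *) malloc (sizeof (*sketch_Cache));
  if (sketch_Cache == NULL)
    return -1;
  sketch_Cache->nvols_perm = 0;         /* counters start at zero; load adds */
  return 0;
}

/* Reload-safe entry point: tear down any previous cache first. */
static int
sketch_manager_init (void)
{
  if (sketch_Cache != NULL)
    sketch_cache_final ();
  return sketch_cache_init ();
}
```

Calling `sketch_manager_init` twice leaves a single fresh allocation, and calling `sketch_cache_final` twice is harmless because the second call sees the NULL sentinel.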
2.7 file_manager_init / file_manager_final and the file-manager globals
The file manager reconstructs nothing from disk: it captures one logging flag, sanity-checks a size assumption, and initializes the temporary-file cache.
```c
// file_manager_init -- src/storage/file_manager.c
int
file_manager_init (void)
{
  file_Logging = prm_get_bool_value (PRM_ID_FILE_LOGGING);
  assert (FILE_DESCRIPTORS_SIZE == sizeof (FILE_DESCRIPTORS));  /* <- layout self-check */
  return file_tempcache_init ();
}

// file_manager_final -- src/storage/file_manager.c
void
file_manager_final (void)
{
  file_tempcache_final ();
}
```

file_manager_init does not touch file_Tracker_vfid / file_Tracker_vpid;
they are statically zero-initialized (VFID_INITIALIZER / VPID_INITIALIZER)
and only filled when the tracker file is created or located during boot
(Chapters 6 and 9). file_Tempcache is likewise static, “empty” until
file_tempcache_init populates it:
```c
// file_tempcache_init -- src/storage/file_manager.c
static int
file_tempcache_init (void)
{
  int ntrans = logtb_get_number_of_total_tran_indices () + 1;   /* SERVER_MODE; else 1 */
  assert (file_Tempcache.tran_files == NULL);   /* <- tran_files != NULL means "initialized" */

  // ... condensed: free_entries/cached_* = NULL, ncached_* = 0, nfree_entries_max = ntrans*8 ...
  file_Tempcache.ncached_max = prm_get_integer_value (PRM_ID_MAX_ENTRIES_IN_TEMP_FILE_CACHE);
  pthread_mutex_init (&file_Tempcache.mutex, NULL);

  file_Tempcache.tran_files = (FILE_TEMPCACHE_TRAN_ENTRY *) malloc (ntrans * sizeof (...));
  if (file_Tempcache.tran_files == NULL)
    {
      pthread_mutex_destroy (&file_Tempcache.mutex);    /* <- undo the mutex on alloc failure */
      // ... er_set OUT_OF_VIRTUAL_MEMORY; return ER_OUT_OF_VIRTUAL_MEMORY ...
    }
  // ... condensed: memset tran_files; per-tran mutex_init loop; memset spacedb_temp ...
  return NO_ERROR;
}
```

Branch accounting: the only non-trivial branch is the malloc failure, which
destroys file_Tempcache.mutex before returning so nothing is half-constructed.
file_tempcache_final mirrors this — early return if tran_files == NULL, else
free every per-transaction list, the cached numerable / not-numerable lists and
the free-entry pool, and destroy the mutexes.
Invariant —
file_Tempcache.tran_files == NULL is the “uninitialized” sentinel.
Both init (via assert) and final (via early return) treat tran_files as the single truth for whether the tempcache exists. Code that allocates or frees it must keep this honest, or final skips a real teardown or double-frees.
2.8 Chapter summary — key takeaways
- disk_manager_init is the only assembler of disk_Cache and is idempotent: the disk_Cache != NULL guard tears down any prior cache, disk_cache_init allocates, and a failed load_from_disk rolls back via disk_manager_final.
- disk_cache_init zeroes all rollup counters so load purely adds, and seeds every vols[] slot to DISK_UNKNOWN_PURPOSE.
- The cache is rebuilt by walking mounted volumes: fileio_map_mounted (two-pass perm-ascending / temp-descending, bounded by next_*_volid), one disk_cache_load_volume per live descriptor.
- disk_cache_load_volume distinguishes type from purpose: temp-type volumes are skipped; perm-data feeds perm_purpose_info.extend_info (may set the single volid_extend); perm-type/temp-purpose feeds temp_purpose_info.nsect_perm_*.
- nvols_perm++ runs for every permanent-type volume regardless of purpose, so after a full load nvols_temp == 0.
- The cache is a derived, lower-bound rollup that may legitimately undercount free sectors; the allocation table is the source of truth.
- disk_Temp_max_sects starts at -2 (pre-init sentinel, vs parameter default -1 = Infinite), overwritten from PRM_ID_BOSR_MAXTMP_PAGES: negatives map to SECTID_MAX, non-negative pages divide by DISK_SECTOR_NPAGES.
- The file manager reconstructs nothing from disk: file_manager_init only captures a flag and runs file_tempcache_init; the tracker globals stay static *_INITIALIZER zeros, and file_Tempcache.tran_files == NULL is the uninitialized sentinel guarding both init and final.
Chapter 3: Volume Format and the Sector Allocation Table
This chapter answers: how is a CUBRID volume laid out on disk, and how does the disk manager flip bits in the sector allocation table without scanning the bitmap one bit at a time? The high-level companion (cubrid-disk-manager.md) covers why a sector is the allocation quantum and why a bitmap beats a free-list; here we trace the byte layout, the format-time writers, and the bitmap-as-functor machinery. DISK_VOLUME_HEADER and DISK_STAB_CURSOR are introduced field-by-field in Chapter 1.
3.1 The on-disk volume layout
Every CUBRID volume (permanent or temporary, first or extension) shares one macro-layout: page 0 is the volume header, then a contiguous run of sector-table (STAB) pages, then data. Three header fields fix it:
```c
// disk_volume_header_set_stab -- src/storage/disk_manager.c
volheader->stab_first_page = DISK_VOLHEADER_PAGE + 1;   /* <- STAB always starts at page 1 */
volheader->stab_npages = CEIL_PTVDIV (volheader->nsect_max, DISK_STAB_PAGE_BIT_COUNT);  /* <- sized by nsect_max, not nsect_total */
volheader->sys_lastpage = volheader->stab_first_page + volheader->stab_npages - 1;      /* <- last reserved sys page */
```

DISK_VOLHEADER_PAGE is 0, so stab_first_page is always page 1. The decisive choice is the dividend, nsect_max rather than nsect_total: a volume grows its used size up to its capacity without moving the data region, because the STAB was sized for the maximum on day one. Chapter 5 (extension) depends on this: extension flips already-present STAB bits and never re-lays-out the volume.
flowchart LR
subgraph Volume["Volume file (pages)"]
H["page 0<br/>DISK_VOLUME_HEADER<br/>magic, volid, purpose,<br/>nsect_total, nsect_max,<br/>stab_first_page, stab_npages,<br/>sys_lastpage, hint_allocsect"]
S["pages 1 .. sys_lastpage<br/>SECTOR ALLOCATION TABLE<br/>stab_npages pages of UINT64 units<br/>1 bit == 1 sector"]
D["pages sys_lastpage+1 .. end<br/>DATA SECTORS<br/>64 pages each"]
end
H --> S --> D
Figure 3-1: macro-layout of any CUBRID volume. The STAB is sized for nsect_max so the data region’s start never moves.
A “sector” is 64 consecutive pages (DISK_SECTOR_NPAGES); SECTOR_FROM_PAGEID(pageid) is pageid / 64. The system sectors a volume self-reserves at format time number SECTOR_FROM_PAGEID(sys_lastpage) + 1 (header + all STAB pages, rounded up) — the value that drives disk_stab_init (§3.3).
Invariant — STAB sizing is pinned to nsect_max. disk_verify_volume_header asserts stab_npages == CEIL_PTVDIV(nsect_max, DISK_STAB_PAGE_BIT_COUNT), stab_npages >= CEIL_PTVDIV(nsect_total, ...), and stab_first_page == DISK_VOLHEADER_PAGE + 1. Sizing by nsect_total instead would leave a later extension with no bitmap bits for the new sectors, and the assert would fire on the next header fetch.
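The layout arithmetic can be worked through with concrete numbers. The sketch below assumes 16384 usable bytes per STAB page (so 131072 bits per page); that figure and all `SKETCH_` names are illustrative, and the real value follows from DB_PAGESIZE.

```c
#include <assert.h>

#define SKETCH_SECTOR_NPAGES 64         /* DISK_SECTOR_NPAGES, from the text */
#define SKETCH_VOLHEADER_PAGE 0         /* DISK_VOLHEADER_PAGE, from the text */
#define SKETCH_PAGE_BYTES 16384         /* assumption: usable bytes per STAB page */
#define SKETCH_STAB_PAGE_BIT_COUNT (SKETCH_PAGE_BYTES * 8)
#define SKETCH_CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

struct sketch_layout
{
  int stab_first_page;
  int stab_npages;
  int sys_lastpage;
  int nsects_sys;               /* sectors the volume reserves for itself */
};

static struct sketch_layout
sketch_layout_for (int nsect_max)
{
  struct sketch_layout l;

  l.stab_first_page = SKETCH_VOLHEADER_PAGE + 1;
  l.stab_npages = SKETCH_CEIL_DIV (nsect_max, SKETCH_STAB_PAGE_BIT_COUNT);
  l.sys_lastpage = l.stab_first_page + l.stab_npages - 1;
  l.nsects_sys = l.sys_lastpage / SKETCH_SECTOR_NPAGES + 1;     /* SECTOR_FROM_PAGEID + 1 */
  return l;
}
```

Under this assumption a volume with `nsect_max = 131072` needs one STAB page and one system sector, while `nsect_max = 8388608` needs 64 STAB pages, which pushes `sys_lastpage` to page 64 and the system region into a second sector.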
3.2 disk_format and disk_format_first_volume — writing the header
disk_format creates any volume; disk_format_first_volume is a thin shim that bootstraps the first volume (LOG_DBFIRST_VOLID) plus the cache: it calls disk_manager_init, sets disk_Cache->nvols_perm to 1 (rolled back to 0 on failure), and makes ext_info.nsect_total equal to ext_info.nsect_max (no headroom on the first volume).
disk_format has many error paths. The flowchart accounts for every branch via its edge labels; the prose below adds only what the flowchart cannot carry.
flowchart TD
A["validate name & purpose"] -->|name too long| RET1["return ER_..._TOO_LONG"]
A -->|bad purpose| RET2["return ER_DISK_UNKNOWN_PURPOSE"]
A -->|ok| B{"voltype == PERMANENT?"}
B -->|yes: log undo RVDK_FORMAT| C["force flush both paths<br/>then fileio_format OS file"]
B -->|no| C
C -->|NULL_VOLDES| RET3["return error, nothing to clean"]
C -->|ok| E["fix page 0 NEW_PAGE,<br/>ptype PAGE_VOLHEADER"]
E -->|fix fails| X["goto exit"]
E -->|ok| F["fill header,<br/>set_stab"]
F --> G{"sys_lastpage >= extend_npages?"}
G -->|yes: ER_IO_FORMAT_BAD_NPAGES| X
G -->|no: set params/name/remarks, err goto exit| I{"PERMANENT?"}
I -->|yes: RVDK_NEWVOL + RVDK_FORMAT redo offset=-1| K["disk_stab_init"]
I -->|no| K
K -->|err| X
K -->|ok| L{"PERMANENT and volid != FIRST?"}
L -->|yes: disk_set_link prev vol, err goto exit| N{"PERMANENT?"}
L -->|no| N
N -->|yes: RVDK_FORMAT redo offset=0| P{"TEMPORARY?"}
N -->|no| P
P -->|yes: flush+dwb, sys pages temp-LSA, err goto exit| R["nsect_free_out, dirty_and_free,<br/>flush + dwb_synchronize"]
P -->|no| R
R --> X["exit: unfix header page"]
X --> S{"error_code != NO_ERROR?"}
S -->|no| RET4["return NO_ERROR"]
S -->|yes| T["pgbuf_invalidate_all"]
T --> U{"TEMPORARY?"}
U -->|yes| V["disk_unformat now,<br/>temp not logged"]
U -->|no| RET5["return error, rollback removes it"]
V --> RET5
Figure 3-2: every branch of disk_format. The cleanup split at the bottom is the heart of crash safety.
Two points the flowchart cannot fully carry:
- Undo is logical, force-flush is unconditional. Only the undo RVDK_FORMAT (log_append_undo_data, carrying just the name) is gated on voltype == DB_PERMANENT_VOLTYPE; it lets rollback remove the whole volume, since there is no page-level undo. But logpb_force_flush_pages then runs on both paths, so the log reaches disk before the OS file exists and a crash mid-format is recoverable.
- The exit: split. After any post-fix error, goto exit unfixes the header page, then calls pgbuf_invalidate_all. A temporary volume is then disk_unformat-ed immediately (no log, no rollback to lean on); a permanent volume returns the error and lets the top-action rollback (Chapter 5) replay the logical undo. The two permanent RVDK_FORMAT redos use addr.offset = -1 before disk_stab_init and 0 after linking: the sentinel recovery uses to tell a started format from a completed one.
3.3 disk_stab_init — laying out the bitmap
After the header is written, disk_stab_init walks every STAB page and marks the system sectors (those the header+STAB occupy) reserved, leaving the rest zero (free).
```c
// disk_stab_init -- src/storage/disk_manager.c
DKNSECTS nsects_sys = SECTOR_FROM_PAGEID (volheader->sys_lastpage) + 1; /* <- sectors to pre-reserve */
assert (nsects_sys < DISK_STAB_PAGE_BIT_COUNT);   /* <- sys region fits in STAB page 0 */
for ( /* each STAB page */ ; ...; vpid_stab.pageid++)
  {
    page_stab = pgbuf_fix (..., NEW_PAGE, PGBUF_LATCH_WRITE, ...);      // NULL -> return error
    pgbuf_set_page_ptype (thread_p, page_stab, PAGE_VOLBITMAP);
    if (volheader->purpose == DB_TEMPORARY_DATA_PURPOSE)
      pgbuf_set_lsa_as_temporary (...);         /* <- no log for temp */
    memset (page_stab, 0, DB_PAGESIZE);         /* <- all sectors free by default */
    if (nsects_sys > 0)         /* <- only while sys sectors remain (page 0 only) */
      {
        nsect_copy = nsects_sys;
        disk_stab_cursor_set_at_sectid (volheader, /* page start */ ..., &start_cursor);
        if ( /* last STAB page */ )
          disk_stab_cursor_set_at_end (volheader, &end_cursor);         /* <- end = nsect_total */
        else
          disk_stab_cursor_set_at_sectid (volheader, /* next page start */ ..., &end_cursor);
        error_code = disk_stab_iterate_units (..., disk_stab_set_bits_contiguous, &nsect_copy);
      }                         // err -> unfix + return
    if (volheader->purpose != DB_TEMPORARY_DATA_PURPOSE)        /* <- permanent: log only the count, not the image */
      {
        DKNSECTS nsects_set = nsects_sys - nsect_copy;
        log_append_redo_data2 (thread_p, RVDK_INITMAP, NULL, page_stab, NULL_OFFSET,
                               sizeof (nsects_set), &nsects_set);
      }
    if (!LOG_ISRESTARTED ())
      {
        pgbuf_set_dirty (...);
        pgbuf_flush (..., FREE);
        page_stab = NULL;
      }                         /* <- format: flush, pool invalidated next */
    else
      pgbuf_set_dirty_and_free (thread_p, page_stab);   /* <- recovery replay: dirty+free */
    nsects_sys = nsect_copy;
    nsect_copy = 0;             /* <- carry leftover to next page (normally 0 after page 1) */
  }
```

Every branch is tagged inline.
The loop runs stab_npages times zeroing each page; the nsects_sys > 0 block fires only on the first page (the assert guarantees the system sectors fit there), and disk_stab_set_bits_contiguous fills whole BIT64_FULL units then trailing bits up to the end cursor.
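The "whole units, then trailing bits" fill can be sketched over a plain array. This is an illustration, not disk_stab_set_bits_contiguous itself: the `sketch_` names are hypothetical, and "trailing" is assumed to mean the least significant bits of a unit.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNIT_BITS 64

/* Set the n lowest bits of a unit (0 <= n <= 64). */
static uint64_t
sketch_set_trailing_bits (uint64_t unit, int n)
{
  if (n >= SKETCH_UNIT_BITS)
    return UINT64_MAX;                  /* whole unit: the BIT64_FULL case */
  return unit | ((UINT64_C (1) << n) - 1);
}

/* Mark the first nsects sectors of a zeroed bitmap as reserved.
 * Returns the number of sectors that did not fit in this range. */
static int
sketch_set_bits_contiguous (uint64_t *units, int nunits, int nsects)
{
  int i;

  for (i = 0; i < nunits && nsects > 0; i++)
    {
      int take = nsects < SKETCH_UNIT_BITS ? nsects : SKETCH_UNIT_BITS;
      units[i] = sketch_set_trailing_bits (units[i], take);
      nsects -= take;
    }
  return nsects;                        /* leftover carried to the next page */
}
```

Reserving 70 sectors fills the first unit completely and sets the low 6 bits of the second; the return value models the leftover that disk_stab_init carries to the next STAB page.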
3.4 disk_unformat — removing the OS file
Destruction is anticlimactic: the disk manager owns no in-memory bitmap, so disk_unformat only flushes, invalidates the page-buffer image, and deletes the file.
```c
// disk_unformat -- src/storage/disk_manager.c
volid = fileio_find_volume_id_with_label (thread_p, vol_fullname);
if (volid != NULL_VOLID)
  {
    (void) pgbuf_flush_all (thread_p, volid);           /* <- push any dirty pages */
    (void) pgbuf_invalidate_all (thread_p, volid);      /* <- drop them from the pool */
  }
fileio_unformat (thread_p, vol_fullname);       /* <- delete the OS file */
return ret;                     /* <- always NO_ERROR */
```

The single branch is volid != NULL_VOLID: an unmounted volume (no id for the label) skips flush/invalidate and only fileio_unformat runs. This is what disk_format calls on its temporary-volume error path (§3.2) and what recovery calls when undoing a permanent format.
3.5 The bitmap-as-functor pattern
Callers never read the STAB bit-by-bit. The manager quantizes it into 64-bit units and exposes one iterator, disk_stab_iterate_units, driving a DISK_STAB_UNIT_FUNC callback over a unit range. Reserve, unreserve, count-free, has-used, and contiguous-set are all just different callbacks.
Quantization. DISK_STAB_UNIT is UINT64. The macros mapping a SECTID to a position are pure integer arithmetic — a flat index split into (page, unit, bit):
```c
// allocation-table addressing macros -- src/storage/disk_manager.c
#define DISK_ALLOCTBL_SECTOR_PAGE_OFFSET(sect) ((sect) / DISK_STAB_PAGE_BIT_COUNT)
#define DISK_ALLOCTBL_SECTOR_UNIT_OFFSET(sect) (((sect) % DISK_STAB_PAGE_BIT_COUNT) / DISK_STAB_UNIT_BIT_COUNT)
#define DISK_ALLOCTBL_SECTOR_BIT_OFFSET(sect) (((sect) % DISK_STAB_PAGE_BIT_COUNT) % DISK_STAB_UNIT_BIT_COUNT)
#define DISK_STAB_NPAGES(nsect_max) (CEIL_PTVDIV (nsect_max, DISK_STAB_PAGE_BIT_COUNT))
```

DISK_STAB_NPAGES is the same CEIL_PTVDIV as in disk_volume_header_set_stab, keeping the header field and the macro in agreement.
flowchart LR
SECT["SECTID"] --> PG["page offset<br/>sect / PAGE_BIT_COUNT"]
SECT --> UN["unit offset<br/>(sect mod PAGE_BIT_COUNT) / 64"]
SECT --> BT["bit offset<br/>(sect mod PAGE_BIT_COUNT) mod 64"]
PG --> POS["cursor.pageid = stab_first_page + page offset"]
UN --> POS2["cursor.offset_to_unit"]
BT --> POS3["cursor.offset_to_bit"]
Figure 3-3: a SECTID split into (page, unit, bit) by three modulo/divide macros. The cursor stores all three plus the live unit pointer.
Cursor positioning
Three inline setters seed a DISK_STAB_CURSOR (fields in Chapter 1), differing only in the target sector; all leave page/unit NULL (the page is fixed lazily by disk_stab_cursor_fix).
- disk_stab_cursor_set_at_sectid: the general case. It asserts 0 <= sectid <= nsect_total and fills pageid/offset_to_unit/offset_to_bit from the three macros, asserting pageid stays within stab_npages.
- disk_stab_cursor_set_at_end: one past the last valid sector via set_at_sectid(volheader, nsect_total, cursor), first asserting nsect_total is unit-rounded (DISK_SECTS_ASSERT_ROUNDED) so iteration ends on a 64-bit boundary.
- disk_stab_cursor_set_at_start: hard-codes sectid = 0, pageid = stab_first_page, both offsets 0 (skips set_at_sectid; the all-zero position is trivial).
Invariant — cursor position consistency. disk_stab_cursor_check_valid asserts (pageid - stab_first_page) * PAGE_BIT_COUNT + offset_to_unit * 64 + offset_to_bit == sectid, and that whenever unit != NULL, (char*)unit - page == offset_to_unit * DISK_STAB_UNIT_SIZE_OF. The iterator re-establishes this before every callback. If the offsets drift from sectid, reserved VSIDs name the wrong sectors — silent cross-linking corruption.
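The decomposition and the consistency check are inverse operations, which a short sketch makes concrete. The page geometry (2048 units per page) and the `sketch_` names are assumptions for illustration; only the divide/modulo structure comes from the macros above.

```c
#include <assert.h>

#define SKETCH_UNIT_BIT_COUNT 64
#define SKETCH_PAGE_UNITS 2048          /* assumption: 16 KiB page / 8-byte unit */
#define SKETCH_PAGE_BIT_COUNT (SKETCH_PAGE_UNITS * SKETCH_UNIT_BIT_COUNT)

struct sketch_cursor
{
  int sectid;
  int pageid;
  int offset_to_unit;
  int offset_to_bit;
};

/* Split a sector id into (page, unit, bit), as set_at_sectid does. */
static struct sketch_cursor
sketch_cursor_at (int stab_first_page, int sectid)
{
  struct sketch_cursor c;

  c.sectid = sectid;
  c.pageid = stab_first_page + sectid / SKETCH_PAGE_BIT_COUNT;
  c.offset_to_unit = (sectid % SKETCH_PAGE_BIT_COUNT) / SKETCH_UNIT_BIT_COUNT;
  c.offset_to_bit = (sectid % SKETCH_PAGE_BIT_COUNT) % SKETCH_UNIT_BIT_COUNT;
  return c;
}

/* Recompose the position and compare it with sectid (the invariant). */
static int
sketch_cursor_is_consistent (int stab_first_page, const struct sketch_cursor *c)
{
  return (c->pageid - stab_first_page) * SKETCH_PAGE_BIT_COUNT
    + c->offset_to_unit * SKETCH_UNIT_BIT_COUNT + c->offset_to_bit == c->sectid;
}
```

With `stab_first_page = 1`, sector 131269 lands on page 2, unit 3, bit 5; bumping any one offset without updating `sectid` makes the check fail, which is the drift the invariant guards against.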
The iterator
```c
// disk_stab_iterate_units -- src/storage/disk_manager.c
assert (disk_stab_cursor_compare (start, end) < 0);     /* <- start strictly before end */
for (cursor = *start; cursor.pageid <= end->pageid; cursor.pageid++, cursor.offset_to_unit = 0)
  {
    error_code = disk_stab_cursor_fix (thread_p, &cursor, mode);        /* <- fix this STAB page */
    // ... err -> return ...
    end_unit = ((DISK_STAB_UNIT *) cursor.page)
      + (cursor.pageid == end->pageid ? end->offset_to_unit : DISK_STAB_PAGE_UNITS_COUNT);      /* <- clamp last page */
    for (; cursor.unit < end_unit;
         cursor.unit++, cursor.offset_to_unit++,
         cursor.sectid += (DISK_STAB_UNIT_BIT_COUNT - cursor.offset_to_bit),    /* <- advance by remaining bits */
         cursor.offset_to_bit = 0)
      {
        error_code = f_unit (thread_p, &cursor, &stop, f_unit_args);    /* <- the functor */
        if (error_code != NO_ERROR)
          {
            disk_stab_cursor_unfix (...);
            return error_code;
          }
        if (stop)
          {
            disk_stab_cursor_unfix (...);
            return NO_ERROR;    /* <- early-out */
          }
      }
    disk_stab_cursor_unfix (thread_p, &cursor);
  }
```

The inner stride advances sectid by DISK_STAB_UNIT_BIT_COUNT - cursor.offset_to_bit: normally a full 64, but a callback may leave offset_to_bit partway through a unit (as disk_stab_unit_reserve does), so the stride compensates. Two short-circuits unfix the page first: a callback error (returns the error) and a callback setting *stop = true (returns NO_ERROR). disk_stab_iterate_units_all wraps this with set_at_start/set_at_end.
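The functor idea is easier to see without page fixing. The sketch below iterates an in-memory array of units and supports the same two short-circuits (callback error, `*stop`). It is a simplified model with hypothetical `sketch_` names: one "page", no latching, and callbacks that take the unit pointer directly instead of a cursor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int (*sketch_unit_func) (uint64_t *unit, int first_sectid, bool *stop, void *args);

/* Visit units [start_unit, end_unit); end early on error or *stop. */
static int
sketch_iterate_units (uint64_t *units, int start_unit, int end_unit,
                      sketch_unit_func f, void *args)
{
  bool stop = false;
  int i;

  assert (start_unit < end_unit);       /* start strictly before end */
  for (i = start_unit; i < end_unit; i++)
    {
      int err = f (&units[i], i * 64, &stop, args);
      if (err != 0)
        return err;                     /* callback error propagates */
      if (stop)
        return 0;                       /* early-out is not an error */
    }
  return 0;
}

/* Callback 1: count free (zero) bits across the range. */
static int
sketch_count_free (uint64_t *unit, int first_sectid, bool *stop, void *args)
{
  uint64_t free_bits = ~*unit;
  (void) first_sectid;
  (void) stop;
  while (free_bits != 0)
    {
      free_bits &= free_bits - 1;       /* clear lowest set bit */
      (*(int *) args)++;
    }
  return 0;
}

/* Callback 2: stop at the first unit that has any sector in use. */
static int
sketch_find_used (uint64_t *unit, int first_sectid, bool *stop, void *args)
{
  if (*unit != 0)
    {
      *(int *) args = first_sectid;
      *stop = true;
    }
  return 0;
}
```

The same driver serves both callbacks: counting walks every unit, while the has-used search ends at the first non-zero unit without touching the rest.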
Reserve — disk_stab_unit_reserve
The most branch-rich functor: it reserves up to nsects_lastvol_remaining free bits and records each VSID. All three branches are tagged inline.
```c
// disk_stab_unit_reserve -- src/storage/disk_manager.c
if (*cursor->unit == BIT64_FULL)
  return NO_ERROR;              /* <- (1) full unit: nothing free, skip; no dirty/log */
context = (DISK_RESERVE_CONTEXT *) args;
if (*cursor->unit == 0)         /* <- (2) empty unit: grab up to 64 in one store */
  {
    int bits_to_set = MIN (context->nsects_lastvol_remaining, DISK_STAB_UNIT_BIT_COUNT);
    *cursor->unit = (bits_to_set == DISK_STAB_UNIT_BIT_COUNT)
      ? BIT64_FULL : bit64_set_trailing_bits (*cursor->unit, bits_to_set);
    log_unit = *cursor->unit;
    context->nsects_lastvol_remaining -= bits_to_set;
    /* ... emit one VSID per bit ... */
  }
else                            /* <- (3) mixed unit: skip the trailing (low-order) ones, set each free bit */
  {
    log_unit = 0;
    for (cursor->offset_to_bit = bit64_count_trailing_ones (*cursor->unit),
         cursor->sectid += cursor->offset_to_bit;
         cursor->offset_to_bit < DISK_STAB_UNIT_BIT_COUNT && context->nsects_lastvol_remaining > 0;
         cursor->offset_to_bit++, cursor->sectid++)
      if (!disk_stab_cursor_is_bit_set (cursor))
        {
          disk_stab_cursor_set_bit (cursor);
          log_unit = bit64_set (log_unit, cursor->offset_to_bit);
          context->nsects_lastvol_remaining--;
          /* ... push VSID ... */
        }
  }
assert (log_unit != 0 && (log_unit & *cursor->unit) == log_unit);
if (context->purpose == DB_PERMANENT_DATA_PURPOSE)      /* <- permanent: undoredo delta; temp skips logging */
  log_append_undoredo_data2 (thread_p, RVDK_RESERVE_SECTORS, NULL, cursor->page, cursor->offset_to_unit,
                             sizeof (log_unit), sizeof (log_unit), &log_unit, &log_unit);
pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
if (context->nsects_lastvol_remaining <= 0)
  *stop = true;
```

log_unit accumulates only the bits this call set; for permanent volumes it is both the redo and undo image of RVDK_RESERVE_SECTORS (redo re-sets, undo clears).
Invariant — log_unit is a non-empty subset of the unit’s set bits. The assert (log_unit != 0 && (log_unit & *cursor->unit) == log_unit) guarantees the logged delta holds only bits actually set and is never a no-op; in the empty-unit branch log_unit equals the whole unit, so the subset is not necessarily strict. A bit absent from *cursor->unit would make recovery’s redo set a bit the live run never set, a divergence between logged and live bitmaps.
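The three branches and the delta they produce can be modelled on a bare 64-bit word. This sketch is not the CUBRID functor: the `sketch_` names are hypothetical, the mixed-unit loop starts at bit 0 and tests every bit rather than jumping past the trailing ones, and the early return on `*remaining <= 0` is an added guard.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNIT_BITS 64

/* Reserve up to *remaining free bits in one unit.
 * Returns the delta: exactly the bits this call set (0 if none). */
static uint64_t
sketch_unit_reserve (uint64_t *unit, int *remaining)
{
  uint64_t log_unit = 0;
  int bit;

  if (*unit == UINT64_MAX || *remaining <= 0)
    return 0;                           /* (1) full unit: nothing to reserve */

  if (*unit == 0)
    {                                   /* (2) empty unit: one store */
      int n = *remaining < SKETCH_UNIT_BITS ? *remaining : SKETCH_UNIT_BITS;
      *unit = (n == SKETCH_UNIT_BITS) ? UINT64_MAX : ((UINT64_C (1) << n) - 1);
      *remaining -= n;
      return *unit;
    }

  for (bit = 0; bit < SKETCH_UNIT_BITS && *remaining > 0; bit++)
    {                                   /* (3) mixed unit: set each free bit */
      uint64_t mask = UINT64_C (1) << bit;
      if ((*unit & mask) == 0)
        {
          *unit |= mask;
          log_unit |= mask;
          (*remaining)--;
        }
    }
  assert (log_unit != 0 && (log_unit & *unit) == log_unit);
  return log_unit;
}
```

Because the delta holds only newly set bits, `unit | delta` replays the reservation and `unit & ~delta` undoes it without disturbing bits that were already reserved before the call.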
Unreserve — disk_stab_unit_unreserve
The mirror functor clears bits whose sector IDs the caller already knows (sorted in context->vsidp).
```c
// disk_stab_unit_unreserve -- src/storage/disk_manager.c
while (context->nsects_lastvol_remaining > 0
       && context->vsidp->sectid < cursor->sectid + DISK_STAB_UNIT_BIT_COUNT)
  {
    unreserve_bits = bit64_set (unreserve_bits, context->vsidp->sectid - cursor->sectid);       /* <- accumulate this unit's window, abs->rel bit */
    context->nsects_lastvol_remaining--;
    context->vsidp++;
    nsect++;
  }
assert ((unreserve_bits & (*cursor->unit)) == unreserve_bits);  /* <- only clear bits that are set */
if (unreserve_bits != 0)        /* <- skip an untouched unit */
  {
    if (context->purpose == DB_PERMANENT_DATA_PURPOSE)  /* <- permanent: postpone clears at commit, rollback skips it */
      log_append_postpone (thread_p, RVDK_UNRESERVE_SECTORS, &addr /* page,offset_to_unit */, ..., &unreserve_bits);
    else                        /* <- temp: clear now + cache update under temp reserve lock */
      {
        (*cursor->unit) &= ~unreserve_bits;
        pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
        disk_cache_update_vol_free (cursor->volheader->volid, nsect);
      }
  }
if (context->nsects_lastvol_remaining <= 0)
  *stop = true;
```

The purpose split is the asymmetry worth remembering, and it is tagged inline: permanent unreserve emits a postpone record, so a rollback never runs it and the sectors stay reserved; temporary unreserve clears immediately and updates the cache free count.
Invariant — unreserve only clears set bits. assert((unreserve_bits & *cursor->unit) == unreserve_bits) enforces that every sector being freed was actually reserved; a violation means double-free or a stale VSID list, corrupting free-sector accounting.
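The window accumulation (absolute sector id to unit-relative bit) and the "only clear set bits" check can be sketched as two small helpers. The `sketch_` names and the plain `int` sector list are assumptions; logging and the cache update are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Build the clear-mask for one unit from a sorted sector-id list.
 * Returns how many ids fell inside this unit's 64-sector window. */
static int
sketch_unit_unreserve_mask (const int *sectids, int nsectids,
                            int unit_first_sectid, uint64_t *mask_out)
{
  int consumed = 0;

  *mask_out = 0;
  while (consumed < nsectids && sectids[consumed] < unit_first_sectid + 64)
    {
      assert (sectids[consumed] >= unit_first_sectid);  /* list is sorted */
      *mask_out |= UINT64_C (1) << (sectids[consumed] - unit_first_sectid);
      consumed++;
    }
  return consumed;
}

/* Apply the mask, refusing to clear a bit that is not set. */
static void
sketch_unit_clear (uint64_t *unit, uint64_t mask)
{
  assert ((mask & *unit) == mask);      /* double-free / stale-list guard */
  *unit &= ~mask;
}
```

For the list {64, 66, 70, 130} and the unit starting at sector 64, three ids fall in the window and produce the mask 0x45; sector 130 is left for a later unit.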
3.6 The 64-bit coupling and the hint_allocsect note
Hidden 64-bit coupling. The cursor primitives call bit64_is_set, bit64_set, bit64_set_trailing_bits, bit64_count_trailing_ones, bit64_count_zeros, all hard-wired to 64-bit operands. The DISK_STAB_UNIT comment suggests the unit type “can be modified and handled automatically,” but changing typedef UINT64 DISK_STAB_UNIT would silently break every bit64_* call and BIT64_FULL. The quantization macros adapt via DISK_STAB_UNIT_SIZE_OF; the bit-op layer does not. (Open question: whether the “automatic” claim was ever true.) Treat 64 bits as a fixed contract.
hint_allocsect. disk_format only seeds this to NULL_SECTID; the live update is on the reservation path Chapter 4 owns (disk_reserve_sectors_in_volume). The subtlety relevant here: it goes stale after an unreserve — disk_stab_unit_unreserve frees bits below the hint but never lowers it, so a later reservation skips the freshly freed sectors until the wrap-around pass reclaims them. It is an optimization, not an invariant, so the code neither logs nor dirties it.
3.7 Chapter summary — key takeaways
- Layout is fixed and header-driven. Page 0 is the header; stab_first_page (always 1) begins a contiguous STAB sized by DISK_STAB_NPAGES(nsect_max); data follows sys_lastpage. Sizing by nsect_max, not nsect_total, lets a volume grow without re-layout.
- disk_format is branch-heavy for crash safety. The logical undo (RVDK_FORMAT) is permanent-only, but logpb_force_flush_pages runs unconditionally before the OS file is created; permanent volumes log the header redo twice (offset -1 then 0); temporary volumes get temp-LSAs and are disk_unformat-ed immediately on error.
- disk_stab_init pre-reserves exactly the system sectors (SECTOR_FROM_PAGEID(sys_lastpage)+1, all in the first STAB page), leaves the rest free, and logs only the count (RVDK_INITMAP), not the page image.
- The bitmap is never scanned bit-by-bit. A SECTID decomposes into (page, unit, bit) via three macros, and disk_stab_iterate_units drives a DISK_STAB_UNIT_FUNC over 64-bit units, short-circuiting on full/empty units.
- Reserve and unreserve are mirror functors with a purpose split. Permanent reserve logs an undoredo delta; permanent unreserve uses a postpone record so rollback keeps the sectors; temporary skips logging and updates the cache directly. The log_unit/unreserve_bits invariants keep logged and live bitmaps in lockstep.
- 64 bits is a hard contract, not a tunable: the bit64_* primitives and BIT64_FULL are not parameterized by unit size, despite the optimistic comment on DISK_STAB_UNIT.
- hint_allocsect is live state owned by Chapter 4; disk_format only seeds it to NULL_SECTID. Its one subtlety here is staleness after unreserve: freeing sectors below the hint never lowers it.
Chapter 4: Sector Reservation Two-Step Protocol
A file that needs N sectors does not flip N bits under one lock. The disk
manager splits the work into two disjoint phases (the high-level companion,
CUBRID Disk Manager, explains why the cache exists).
This chapter answers: when a file needs N sectors, how does the disk manager
hand them out across volumes while keeping the hot mutex short and staying
crash-safe?
4.1 The two structs that carry a reservation
A reservation is disk_reserve_context, a stack local in disk_reserve_sectors
(re-built in the unreserve path), threaded through every function below.
```c
// disk_reserve_context -- src/storage/disk_manager.c
struct disk_reserve_context
{
  int nsect_total;                      /* original request size */
  VSID *vsidp;                          /* write cursor into output array */
  DISK_CACHE_VOL_RESERVE cache_vol_reserve[VOLID_MAX];  /* per-volume ledger from step 1 */
  int n_cache_vol_reserve;              /* ledger slots used */
  int n_cache_reserve_remaining;        /* cache-phase debt */
  DKNSECTS nsects_lastvol_remaining;    /* current-volume bitmap debt */
  DB_VOLPURPOSE purpose;                /* permanent-data / temporary-data */
};
```

| Field | Role | Why it exists |
|---|---|---|
| nsect_total | Immutable copy of request N. | Final assert (vsidp - reserved_sectors == n_sectors); never decremented. |
| vsidp | Write pointer into reserved_sectors[]. | vsidp - reserved_sectors = sectors reserved so far; error path reads it for rollback. |
| cache_vol_reserve[] | Step-1 ledger, one {volid, nsect} per volume drawn from. | Step 2 replays it; error path refunds un-flipped sectors from it. |
| n_cache_vol_reserve | Count of used ledger slots. | Loop bound for step 2 and the rollback scan. |
| n_cache_reserve_remaining | Cache-phase debt; starts N, decremented by disk_reserve_from_cache_volume, 0 when satisfied. | Drives volume-iteration and extend decisions in step 1. |
| nsects_lastvol_remaining | Bitmap-phase debt within the current volume; seeded per-volume, decremented as bits flip. | disk_stab_unit_reserve drives off it; 0 sets *stop. |
| purpose | DB_PERMANENT_DATA_PURPOSE / DB_TEMPORARY_DATA_PURPOSE. | Selects cache mutex/extend-info and whether STAB changes are logged. |
// disk_cache_vol_reserve -- src/storage/disk_manager.cstruct disk_cache_vol_reserve { VOLID volid; DKNSECTS nsect; };| Field | Role | Why it exists |
|---|---|---|
volid | Volume the cache reserved from. | Step 2 fixes its header and flips its bits; rollback decrements its cache counter. |
nsect | Count promised from volid. | Seeds nsects_lastvol_remaining; rollback decrements it per sector returned by undo, leaving the not-yet-flipped remainder. |
Each ledger entry {volid, nsect} seeds one step-2 per-volume scan (Figure 4-1).

```mermaid
flowchart LR
  RC["disk_reserve_context"] --> L["cache_vol_reserve[i]\n{volid, nsect}"] -.seeds nsects_lastvol_remaining.-> S2["step 2 per-volume scan -> reserved_sectors[]"]
```

Figure 4-1. Reserve context, its per-volume ledger, and the step-2 scan that fills the output array.
Invariant — the two remaining-counters never alias.
n_cache_reserve_remaining is the cache debt; nsects_lastvol_remaining is the
current-volume bitmap debt. The cache phase finishes with
n_cache_reserve_remaining == 0 and sum(cache_vol_reserve[i].nsect) == N.
Separating them lets step 2 be re-driven per volume without re-touching the cache;
aliasing would let a partial volume scan corrupt cache accounting.
4.2 The outer driver: disk_reserve_sectors
disk_reserve_sectors(thread_p, purpose, volid_hint, n_sectors, reserved_sectors)
is the disk/file boundary call. volid_hint is accepted but ignored; volume order
is governed by purpose.
- Guards. assert purpose is perm or temp; n_sectors <= 0 || reserved_sectors == NULL -> assert_release(false); ER_FAILED.
- Sysop precondition for permanent reservations (their STAB changes are logged onto the outer transaction):

  ```c
  // disk_reserve_sectors -- src/storage/disk_manager.c
  if (purpose != DB_TEMPORARY_DATA_PURPOSE && !log_check_system_op_is_started (thread_p))
    {
      /* caller forgot sysop */
      assert (false);
      er_set (...ER_GENERIC_ERROR, 0);
      return ER_FAILED;
    }
  ```

- retry: / log_sysop_start. Even temp reservations open a sysop to scope the bitmap phase.
- CSECT_DISK_CHECK as reader (excludes the consistency checker). Fail -> log_sysop_abort; return.
- Init context in place: nsect_total = n_cache_reserve_remaining = n_sectors, vsidp = reserved_sectors, n_cache_vol_reserve = 0.
- Step 1: disk_reserve_from_cache. Error -> goto error.
- Step 2: loop disk_reserve_sectors_in_volume over [0, n_cache_vol_reserve). Any error -> goto error.
- Success. assert ((vsidp - reserved_sectors) == n_sectors); exit csect; log_sysop_attach_to_outer; in debug, if did_extend, disk_check; return NO_ERROR. The error: path (4.7) handles rollback.

```mermaid
flowchart TD
  C["csect_enter CSECT_DISK_CHECK after sysop start"] -->|fail| AB["log_sysop_abort, return err"]
  C -->|ok| D["init context"] --> E["step 1: disk_reserve_from_cache"]
  E -->|err| ERR["goto error: rollback (4.7)"]
  E -->|ok| F["step 2 loop: disk_reserve_sectors_in_volume per ledger entry"]
  F -->|err| ERR
  F -->|ok| H["assert vsidp-base==N, attach_to_outer, NO_ERROR"]
```

Figure 4-2. disk_reserve_sectors control flow.
4.3 Step 1 entry: disk_reserve_from_cache
Moves free-sector counts into the ledger, extending the disk if short, holding the reserve mutex only across counter math.
- disk_Cache == NULL -> assert_release(false); return ER_FAILED.
- Lock the purpose's reserve mutex (disk_cache_lock_reserve_for_purpose).
- Temp purpose prefers perm-type-temp-purpose volumes before genuine temp volumes:

  ```c
  // disk_reserve_from_cache -- src/storage/disk_manager.c
  if (context->purpose == DB_TEMPORARY_DATA_PURPOSE)
    {
      extend_info = &disk_Cache->temp_purpose_info.extend_info;
      if (disk_Cache->temp_purpose_info.nsect_perm_free > 0)
        disk_reserve_from_cache_vols (DB_PERMANENT_VOLTYPE, context);  /* <- perm-temp first */
      if (context->n_cache_reserve_remaining <= 0)  /* satisfied from perm-temp */
        {
          disk_cache_unlock_reserve_for_purpose (context->purpose);
          return NO_ERROR;
        }
      // ... temp-ceiling check, then fall through to temp-volume extend ...
    }
  else
    extend_info = &disk_Cache->perm_purpose_info.extend_info;
  ```

  nsect_perm_free = free sectors on perm-type volumes carrying temp purpose; when 0 those volumes are skipped.
- Temp-space ceiling (temp branch, before extending temp volumes): if extend_info->nsect_total - extend_info->nsect_free + n_cache_reserve_remaining > disk_Temp_max_sects -> er_set (ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED ...); unlock; return. Operands are the extend-info pool aggregates, not the context's nsect_total.
- Common tail: assert (n_cache_reserve_remaining > 0) and assert this thread holds the mutex.
- Reserve from existing free space if the pool is big enough:

  ```c
  if (extend_info->nsect_free > context->n_cache_reserve_remaining)  /* strict >: a hair of headroom, see Ch 5 */
    {
      disk_reserve_from_cache_vols (extend_info->voltype, context);
      if (context->n_cache_reserve_remaining <= 0)
        {
          /* <- done from existing */
          disk_cache_unlock_reserve (extend_info);
          return NO_ERROR;
        }
    }
  ```

- Short -> extend. Bump extend_info->nsect_intention (signals concurrent reservers), drop the reserve mutex, take disk_lock_extend(), re-take the reserve mutex and re-check. If a peer already extended so nsect_free now suffices: decrement intention, retry disk_reserve_from_cache_vols, return. Else call disk_extend (Ch 5) and back the intention out. Both locks released on every exit.
- Post-extend. disk_extend error -> return it. Still n_cache_reserve_remaining > 0 -> assert_release(false); ER_FAILED. Else *did_extend = true; return NO_ERROR.
Invariant — the reserve mutex is never held across a STAB scan or an extend. The intention counter is the hand-off token that lets the mutex drop during the slow extend without two threads double-extending.
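
The two numeric gates in step 1 (the temp-space ceiling and the strict "pool is big enough" test) can be sketched in isolation. This is a minimal model, not the CUBRID code: toy_extend_info, toy_temp_ceiling_exceeded, and toy_pool_can_cover are hypothetical names, and only the arithmetic quoted above is reproduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pool aggregates (nsect_total / nsect_free). */
typedef struct
{
  int nsect_total;
  int nsect_free;
} toy_extend_info;

/* True when granting `remaining` more sectors would push temp usage past the cap:
   (total - free) + remaining > temp_max_sects. */
static bool
toy_temp_ceiling_exceeded (const toy_extend_info *info, int remaining, int temp_max_sects)
{
  int in_use;

  assert (info->nsect_free <= info->nsect_total);
  in_use = info->nsect_total - info->nsect_free;
  return in_use + remaining > temp_max_sects;
}

/* The existing-space test uses a strict '>', so a pool holding exactly the
   requested amount does not count as sufficient. */
static bool
toy_pool_can_cover (const toy_extend_info *info, int remaining)
{
  return info->nsect_free > remaining;
}
```

With 70 sectors in use and a cap of 90, a request for 20 passes the ceiling and a request for 21 does not; a pool with 30 free sectors is treated as too small for a request of exactly 30.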
4.4 Iterating volumes: disk_reserve_from_cache_vols

```c
// disk_reserve_from_cache_vols -- src/storage/disk_manager.c
if (type == DB_PERMANENT_VOLTYPE)  /* perm: ascend 0..nvols_perm */
  {
    start_iter = 0;
    end_iter = disk_Cache->nvols_perm;
    incr = 1;
    min_free = MIN (context->nsect_total, perm...nsect_vol_max) / 2;
  }
else  /* temp: descend from top of volid space */
  {
    start_iter = LOG_MAX_DBVOLID;
    end_iter = LOG_MAX_DBVOLID - disk_Cache->nvols_temp;
    incr = -1;
    min_free = MIN (context->nsect_total, temp...nsect_vol_max) / 2;
  }
min_free = MAX (min_free, 1);  /* half the smaller of request/per-vol max, floored at 1 */

for (volid_iter = start_iter;
     volid_iter != end_iter && context->n_cache_reserve_remaining > 0;  /* stop when range exhausted or debt paid */
     volid_iter += incr)
  {
    if (disk_Cache->vols[volid_iter].purpose != context->purpose)
      continue;  /* wrong purpose */
    if (disk_Cache->vols[volid_iter].nsect_free < min_free)
      continue;  /* too fragmented */
    disk_reserve_from_cache_volume (volid_iter, context);
  }
```

4.5 Decrementing one volume’s counter: disk_reserve_from_cache_volume
The only place step 1 actually moves sectors out of the cache.

```c
// disk_reserve_from_cache_volume -- src/storage/disk_manager.c
if (context->n_cache_vol_reserve >= LOG_MAX_DBVOLID)
  {
    /* <- ledger overflow guard */
    assert_release (false);
    return;
  }
disk_check_own_reserve_for_purpose (context->purpose);  /* <- assert mutex held by us */
nsects = MIN (disk_Cache->vols[volid].nsect_free, context->n_cache_reserve_remaining);
disk_cache_update_vol_free (volid, -nsects);  /* <- decrement cache + purpose pool */
context->cache_vol_reserve[context->n_cache_vol_reserve].volid = volid;
context->cache_vol_reserve[context->n_cache_vol_reserve].nsect = nsects;
context->n_cache_reserve_remaining -= nsects;
context->n_cache_vol_reserve++;  /* <- bitmap untouched, only counters */
```

disk_cache_update_vol_free also adjusts the matching purpose-pool aggregate.
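
The volume walk and the per-volume charge can be modelled together as a small counter-only simulation. This is a sketch under stated assumptions: the toy_* names, the plain int array of per-volume free counts, and TOY_MAX_VOLS are all hypothetical; purpose filtering and the descending temp order are left out, and only the min_free filter and the MIN-based charge are taken from the excerpts above.

```c
#include <assert.h>

#define TOY_MAX_VOLS 8   /* hypothetical ledger capacity for this sketch */

typedef struct { int volid; int nsect; } toy_ledger_entry;

typedef struct
{
  int nsect_total;                        /* original request, never decremented */
  int n_cache_reserve_remaining;          /* cache-phase debt */
  toy_ledger_entry ledger[TOY_MAX_VOLS];  /* one entry per volume drawn from */
  int n_ledger;                           /* ledger slots used */
} toy_context;

static int toy_min (int a, int b) { return a < b ? a : b; }
static int toy_max (int a, int b) { return a > b ? a : b; }

/* Charge up to the remaining debt against one volume's free counter. */
static void
toy_reserve_from_volume (int *vol_free, int volid, toy_context *ctx)
{
  int nsects;

  assert (ctx->n_ledger < TOY_MAX_VOLS);   /* ledger overflow guard */
  nsects = toy_min (vol_free[volid], ctx->n_cache_reserve_remaining);
  vol_free[volid] -= nsects;               /* counters only, no bitmap here */
  ctx->ledger[ctx->n_ledger].volid = volid;
  ctx->ledger[ctx->n_ledger].nsect = nsects;
  ctx->n_ledger++;
  ctx->n_cache_reserve_remaining -= nsects;
}

/* Ascending walk (the permanent-type order); skip volumes below min_free. */
static void
toy_reserve_from_vols (int *vol_free, int nvols, int nsect_vol_max, toy_context *ctx)
{
  int min_free = toy_max (toy_min (ctx->nsect_total, nsect_vol_max) / 2, 1);
  int v;

  for (v = 0; v < nvols && ctx->n_cache_reserve_remaining > 0; v++)
    {
      if (vol_free[v] < min_free)
        continue;                          /* too little free space to bother */
      toy_reserve_from_volume (vol_free, v, ctx);
    }
}
```

With free counts {3, 10, 2, 20}, a request for 6 gives min_free = 3 and is paid from volumes 0 and 1. A request for 25 gives min_free = 12, so only volume 3 qualifies and 5 sectors of debt remain, which is the situation that sends the real code to the extend path.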
4.6 Step 2: disk_reserve_sectors_in_volume flips the bits
Per ledger entry, fixes the volume header under a write latch (cache mutex not held) and flips STAB bits until the per-volume debt hits zero.
- Read ledger. volid = cache_vol_reserve[vol_index].volid; if NULL_VOLID -> assert_release(false); ER_FAILED. Seed nsects_lastvol_remaining = cache_vol_reserve[vol_index].nsect.
- Fix volume header PGBUF_LATCH_WRITE; on error -> return.
- Hint-guided scan. Three scan shapes via disk_stab_iterate_units(..., disk_stab_unit_reserve, context); each error path does goto exit:

  ```c
  // disk_reserve_sectors_in_volume -- src/storage/disk_manager.c
  if (volheader->hint_allocsect > 0 && volheader->hint_allocsect < volheader->nsect_total)
    {
      // ... cursors hint..end; iterate ... /* after hint */
      if (context->nsects_lastvol_remaining > 0)  /* still short: wrap start..hint */
        {
          end_cursor = start_cursor;
          disk_stab_cursor_set_at_start (volheader, &start_cursor);
          error_code = disk_stab_iterate_units (...);
        }
    }
  else
    {
      /* ... cursors start..end; iterate whole table ... */
    }
  ```

- Must be satisfied. if (nsects_lastvol_remaining != 0) { assert_release(false); ER_FAILED; goto exit; }. Residue means cache and bitmap disagree (a bug).
- Advance the hint. hint_allocsect = (vsidp - 1)->sectid + 1; best-effort, neither dirtied nor logged.
- exit: unfix the header if fixed; return error_code.
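
The three scan shapes (hint to end, wrap from start to hint, or whole table when the hint is unusable) can be sketched on a per-sector boolean map. This is an illustrative model only: the real scan works on 64-bit STAB units through cursors, and toy_scan_range / toy_reserve_in_volume are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Reserve free sectors in [from, to), appending their ids at out[nout..];
   returns how many were taken and lowers *need accordingly. */
static int
toy_scan_range (bool *used, int from, int to, int *need, int *out, int nout)
{
  int taken = 0;
  int s;

  for (s = from; s < to && *need > 0; s++)
    {
      if (used[s])
        continue;
      used[s] = true;
      out[nout + taken] = s;
      taken++;
      (*need)--;
    }
  return taken;
}

/* Hint-guided reservation: scan hint..end, wrap to start..hint if still short,
   or scan the whole table when the hint is out of range. */
static int
toy_reserve_in_volume (bool *used, int nsect_total, int *hint, int need, int *out)
{
  int n = 0;

  if (*hint > 0 && *hint < nsect_total)
    {
      n += toy_scan_range (used, *hint, nsect_total, &need, out, n);
      if (need > 0)
        n += toy_scan_range (used, 0, *hint, &need, out, n);
    }
  else
    n += toy_scan_range (used, 0, nsect_total, &need, out, n);

  assert (need == 0);            /* residue would mean counters and bitmap disagree */
  if (n > 0)
    *hint = out[n - 1] + 1;      /* best-effort: one past the last sector written */
  return n;
}
```

The caller is assumed to request no more than the volume actually has free, which is what step 1's counter charge is meant to guarantee. Note that after a wrap the hint lands just past the last wrapped sector, not at the end of the table.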
The bit-flip lives in the disk_stab_unit_reserve callback, invoked per 64-bit
STAB unit (full unit BIT64_FULL returns early; empty unit 0 filled in bulk;
partial unit walked bit by bit), recording each new sector into context->vsidp.
Permanent purpose logs each change:

```c
// disk_stab_unit_reserve -- src/storage/disk_manager.c
if (context->purpose == DB_PERMANENT_DATA_PURPOSE)  /* redo+undo image = the changed bits mask */
  log_append_undoredo_data2 (thread_p, RVDK_RESERVE_SECTORS, NULL, cursor->page, cursor->offset_to_unit,
                             sizeof (log_unit), sizeof (log_unit), &log_unit, &log_unit);
pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
if (context->nsects_lastvol_remaining <= 0)
  {
    *stop = true;  /* <- end the volume scan */
  }
```

Redo and undo images are the same log_unit mask; the recovery handlers
disk_rv_reserve_sectors / disk_rv_unreserve_sectors re-sync the cache under
CSECT_DISK_CHECK (recovery chapter). Temporary reservations log nothing — their
bits reset wholesale on restart.
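
The three per-unit cases (full, empty, partial) can be sketched on a plain uint64_t. This is a hedged model, not the callback itself: toy_unit_reserve and TOY_UNIT_FULL are hypothetical names, logging and page dirtying are omitted, and the bulk path is assumed here to apply only when at least 64 sectors are still owed, a condition the text above does not spell out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_UNIT_FULL UINT64_MAX   /* plays the role of BIT64_FULL */

/* Reserve up to *need sectors inside one 64-bit unit whose bit 0 is sector
   first_sectid. Appends reserved ids to out[*nout..] and returns the mask of
   bits newly set (the value that would serve as both redo and undo image). */
static uint64_t
toy_unit_reserve (uint64_t *unit, int first_sectid, int *need, int *out, int *nout)
{
  uint64_t log_unit = 0;
  int b;

  assert (*need >= 0);
  if (*unit == TOY_UNIT_FULL || *need == 0)
    return 0;                                   /* nothing to take here */

  if (*unit == 0 && *need >= 64)                /* assumed bulk-fill condition */
    {
      for (b = 0; b < 64; b++)
        out[(*nout)++] = first_sectid + b;
      *unit = TOY_UNIT_FULL;
      *need -= 64;
      return TOY_UNIT_FULL;
    }

  for (b = 0; b < 64 && *need > 0; b++)         /* partial unit: walk bit by bit */
    {
      uint64_t bit = (uint64_t) 1 << b;

      if (*unit & bit)
        continue;
      *unit |= bit;
      log_unit |= bit;
      out[(*nout)++] = first_sectid + b;
      (*need)--;
    }
  return log_unit;
}
```

Because the returned mask contains only the bits this call changed, applying it with OR redoes the reservation and clearing it undoes it, which matches the "same mask for redo and undo" description.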
Invariant — the cache mutex is released throughout step 2. Step 1 charged the counters; step 2 touches only page latches and the WAL, so the hot reserve mutex is held for O(volumes) counter math, never O(sectors) bitmap I/O.
4.7 Failure and rollback
If either step errors, disk_reserve_sectors jumps to error:. Let
nreserved = vsidp - reserved_sectors be the sectors actually flipped.
1. nreserved > 0 and temp purpose: nothing was logged, so abort cannot undo the partial bitmap changes; disable interrupt checks, qsort the VSIDs, call disk_unreserve_ordered_sectors_without_csect. Permanent skips this; the log_sysop_abort below undoes its logged changes.
2. Reconcile the ledger with what abort/undo already returned: for each flipped sector, decrement its volume's cache_vol_reserve[].nsect, leaving only sectors charged to the cache but never flipped:

   ```c
   // disk_reserve_sectors (error path) -- src/storage/disk_manager.c
   for (iter_vsid = 0; iter_vsid < nreserved; iter_vsid++)
     {
       for (iter = 0; iter < context.n_cache_vol_reserve; iter++)
         if (reserved_sectors[iter_vsid].volid == context.cache_vol_reserve[iter].volid)
           {
             context.cache_vol_reserve[iter].nsect--;  /* <- don't double-credit */
             break;
           }
       assert (iter < context.n_cache_vol_reserve);
     }
   ```

3. Refund the residue via disk_cache_free_reserved(&context) (adds remaining nsect back through disk_cache_update_vol_free under the reserve mutex).
4. Exit csect, log_sysop_abort; for permanent purpose this rolls the logged STAB bits back.
5. Classify the error. Expected IO/interrupt errors (ER_INTERRUPTED, ER_IO_MOUNT_FAIL, ER_IO_FORMAT_OUT_OF_SPACE, ER_IO_WRITE, ER_BO_CANNOT_CREATE_VOL) return as-is. Anything else trips assert_release(false) and self-heals: if not yet retried, disk_check(thread_p, true); if it reports DISK_INVALID, clear the error, set retried = true, goto retry. A second failure or non-skew cause returns.
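
The reconcile-then-refund arithmetic of the error path can be checked with a small counter-only model. The toy_* types and functions below are hypothetical; the sketch assumes each flipped sector is credited back to the cache by the undo or unreserve path, so the ledger must shed exactly those sectors before the residue is refunded.

```c
#include <assert.h>

typedef struct { int volid; int sectid; } toy_vsid;
typedef struct { int volid; int nsect; } toy_ledger_entry;

/* For each sector that was actually flipped, drop it from its volume's ledger
   entry so it is not credited a second time by the refund below. */
static void
toy_reconcile (toy_ledger_entry *ledger, int n_ledger, const toy_vsid *flipped, int nflipped)
{
  int i, j;

  for (i = 0; i < nflipped; i++)
    {
      for (j = 0; j < n_ledger; j++)
        if (ledger[j].volid == flipped[i].volid)
          {
            ledger[j].nsect--;
            break;
          }
      assert (j < n_ledger);      /* every flipped sector must belong to a ledger volume */
    }
}

/* Refund what is left in the ledger to the per-volume free counters;
   returns the number of sectors refunded. */
static int
toy_refund_residue (const toy_ledger_entry *ledger, int n_ledger, int *vol_free)
{
  int refunded = 0;
  int j;

  for (j = 0; j < n_ledger; j++)
    {
      vol_free[ledger[j].volid] += ledger[j].nsect;
      refunded += ledger[j].nsect;
    }
  return refunded;
}
```

If the ledger promised 3 sectors from volume 0 and 4 from volume 2, and the failure struck after all of volume 0 and one sector of volume 2 had been flipped, reconciliation leaves {0, 3} and the refund returns only the 3 never-flipped sectors to volume 2.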
disk_unreserve_ordered_sectors_without_csect rebuilds a fresh context from the
ordered VSID list, grouping consecutive same-volid runs into ledger entries
(asserting increasing volids and sectids), then calls
disk_unreserve_sectors_from_volume per group, returning the first error
(ASSERT_ERROR (); return error_code;) without refunding the remaining groups.
Its disk_stab_unit_unreserve callback clears bits and returns sectors to the
cache — the “removed from cache too” effect loop (2) compensates for.
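
The grouping of an ordered VSID list into per-volume ledger entries can be sketched as a single pass. This is an assumption-laden illustration: toy_vsid, toy_group, and toy_group_runs are hypothetical names, and the ordering asserts follow the "increasing volids and sectids" description above.

```c
#include <assert.h>

typedef struct { int volid; int sectid; } toy_vsid;
typedef struct { int volid; int nsect; } toy_group;

/* Collapse a list sorted by (volid, sectid) into one {volid, count} entry per
   run of equal volids; returns the number of groups written. */
static int
toy_group_runs (const toy_vsid *sorted, int n, toy_group *groups, int max_groups)
{
  int ngroups = 0;
  int i;

  for (i = 0; i < n; i++)
    {
      if (i > 0)
        {
          /* input must be strictly increasing in (volid, sectid) */
          assert (sorted[i].volid > sorted[i - 1].volid
                  || (sorted[i].volid == sorted[i - 1].volid
                      && sorted[i].sectid > sorted[i - 1].sectid));
        }
      if (ngroups > 0 && groups[ngroups - 1].volid == sorted[i].volid)
        groups[ngroups - 1].nsect++;          /* extend the current run */
      else
        {
          assert (ngroups < max_groups);
          groups[ngroups].volid = sorted[i].volid;
          groups[ngroups].nsect = 1;          /* start a new run */
          ngroups++;
        }
    }
  return ngroups;
}
```

Sorting first is what makes the one-pass grouping valid: each volume then appears as a single contiguous run, so each volume header needs to be fixed only once.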
Invariant — reserve order cache->bitmap, release order bitmap->cache; the
cache never overcounts. On reserve the counter drops before the bit is set; on
release the bit clears before the counter rises. Both transients leave the cache
showing less free than the bitmap, so two reservers can never both be told a
sector is free; disk_check repairs the bounded skew, which is why the error:
path can retry through it.
4.8 Chapter summary — key takeaways
- Two disjoint phases. Step 1 (disk_reserve_from_cache) moves free-sector counts out of the cache under the short reserve mutex; step 2 (disk_reserve_sectors_in_volume) flips STAB bits under page latches with that mutex released.
- Two independent debt counters. n_cache_reserve_remaining (cache) and nsects_lastvol_remaining (per-volume bitmap) never alias, so step 2 is driven volume by volume off cache_vol_reserve[].
- Temp prefers perm-type-temp-purpose volumes. When nsect_perm_free > 0 those are scanned first, then fall through to temp-volume extension bounded by disk_Temp_max_sects.
- The hot mutex is never held across slow work. An intention counter lets the reserve mutex drop during disk_extend; step 2 never re-takes it.
- Permanent reservations are WAL-logged per STAB unit; temporary ones are not. Temp bits reset on restart, so temp rollback physically un-flips them via disk_unreserve_ordered_sectors_without_csect.
- The transient skew is always conservative. Reserve cache->bitmap, release bitmap->cache, so the cache never reports more free than exists; disk_check repairs the bounded skew and the error: path retries through it once.
- The error path reconciles before refunding. It decrements ledger entries for already-returned sectors, then disk_cache_free_reserved refunds only the never-flipped residue, avoiding double-credit.
Chapter 5: Volume Extension as a Nested Top Action
The reader question this chapter answers: what happens inside Step 1 of sector reservation when even the permanent-type / temporary-purpose fallback runs dry and the cache can no longer satisfy the request? The reserving thread must grow the database (extend an existing OS file or create a new volume) before it can finish reserving. This chapter traces that escalation from disk_reserve_from_cache through disk_extend, disk_volume_expand, and disk_add_volume, and shows why the growth must be a nested top action committed independently of the triggering reservation. It continues Chapter 4 and the cache-vs-disk split in the high-level companion (cubrid-disk-manager.md, “Sector reservation”).
5.1 Where extension is triggered: the race window in disk_reserve_from_cache
When the running free count cannot cover n_cache_reserve_remaining, the function records its intention, drops the reserve mutex, then takes the extend mutex. The order is mandatory: mutex_extend carries the comment "never get expand mutex while keeping reserve mutexes"; the opposite order would deadlock against a concurrent expander already holding mutex_extend.

```c
// disk_reserve_from_cache -- src/storage/disk_manager.c
extend_info->nsect_intention += context->n_cache_reserve_remaining;  /* <- publish demand BEFORE releasing */
disk_cache_unlock_reserve (extend_info);
disk_lock_extend ();  /* <- serializes all expanders; flips reserve -> extend mutex */
disk_cache_lock_reserve (extend_info);
if (extend_info->nsect_free > context->n_cache_reserve_remaining)  /* <- race: someone already grew it */
  {
    extend_info->nsect_intention -= context->n_cache_reserve_remaining;
    disk_reserve_from_cache_vols (extend_info->voltype, context);
    if (context->n_cache_reserve_remaining <= 0)
      {
        /* <- no extend */
        disk_cache_unlock_reserve (extend_info);
        disk_unlock_extend ();
        return NO_ERROR;
      }
    extend_info->nsect_intention += context->n_cache_reserve_remaining;
  }
save_remaining = context->n_cache_reserve_remaining;  /* <- snapshot, to undo intention after extend */
disk_cache_unlock_reserve (extend_info);
error_code = disk_extend (thread_p, extend_info, context);  /* <- the slow path */
```

The interval between the two mutexes is the race window: another thread can grab mutex_extend first, grow the volume, and refill nsect_free. The double-check after disk_lock_extend() catches that: if the volume is now large enough, this thread reverses its nsect_intention bump and reserves from the grown cache with no disk I/O. (disk_extend opens with assert (disk_Cache->owner_extend == thread_get_entry_index (thread_p)), proving it runs only under the extend mutex.)
INVARIANT: nsect_intention is the load-bearing accumulator of unmet demand. A thread adds its remaining need under the reserve mutex and subtracts the same save_remaining snapshot once met. If violated (an add with no matching subtract on an error path), every future disk_extend over-allocates by the leaked amount forever, since it reads nsect_intention as the floor of how much to grow.
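
The balance rule for the intention counter can be illustrated with a single-threaded model of the publish, re-check, and retire steps. This is a sketch only: toy_pool and the three toy_* functions are hypothetical, locking is omitted, and the re-check assumes the whole remaining debt fits once free space exceeds it (the real re-check goes through disk_reserve_from_cache_vols and may leave a residue).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
  int nsect_free;
  int nsect_intention;   /* sum of unmet demand published by waiting threads */
} toy_pool;

/* Done under the reserve mutex, before it is dropped. */
static void
toy_publish_intention (toy_pool *pool, int remaining)
{
  pool->nsect_intention += remaining;
}

/* Done after taking the extend lock and re-taking the reserve mutex.
   Returns true when the caller still has to extend. */
static bool
toy_recheck_needs_extend (toy_pool *pool, int *remaining)
{
  if (pool->nsect_free > *remaining)     /* a peer already grew the pool */
    {
      pool->nsect_intention -= *remaining;
      pool->nsect_free -= *remaining;    /* toy simplification: debt paid in full */
      *remaining = 0;
      return false;
    }
  return true;
}

/* Done after the grow step: retire exactly the snapshot that was published. */
static void
toy_retire_intention (toy_pool *pool, int save_remaining)
{
  pool->nsect_intention -= save_remaining;
  assert (pool->nsect_intention >= 0);   /* negative would mean unbalanced accounting */
}
```

In both outcomes (a peer grew the pool, or this thread extends itself) the counter returns to its starting value, which is the property the invariant asks for.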
5.2 disk_extend: deciding how much, then expand-then-add
disk_extend runs under mutex_extend over a snapshot of the DISK_EXTEND_INFO counters (Chapter 1), sizing the growth then executing it in two phases.

```c
// disk_extend -- src/storage/disk_manager.c
target_free = MAX ((DKNSECTS) (total * 0.01), DISK_MIN_VOLUME_SECTS);  /* <- 1% of size, floored */
nsect_extend = MAX (target_free - free, 0) + intention;  /* <- coalesce all unmet demand */
if (nsect_extend <= 0)
  return NO_ERROR;  /* <- branch 1: free exceeds target, no intentions */
// ... condensed ...
if (total < max)  /* <- phase 1: extendable volume still has room */
  {
    to_expand = MIN (nsect_extend, max - total);  /* <- never exceed this volume's ceiling */
    log_sysop_start (thread_p);  /* <- NESTED TOP ACTION begins */
    error_code = disk_volume_expand (thread_p, extend_info->volid_extend, voltype, to_expand, &nsect_free_new);
    if (error_code != NO_ERROR)
      {
        /* <- header undo */
        ASSERT_ERROR ();
        log_sysop_abort (thread_p);
        return error_code;
      }
    log_sysop_commit (thread_p);  /* <- commit independently of outer reservation */
    if (extend_info->nsect_total == extend_info->nsect_max)
      extend_info->volid_extend = NULL_VOLID;  /* <- maxed out; never extend this volume again */
    nsect_extend -= nsect_free_new;
    // ... condensed: bump nsect_total; under reserve mutex update vol_free + reserve ahead ...
    if (nsect_extend <= 0)
      return NO_ERROR;  /* <- expansion alone covered the demand */
  }
// ... condensed: assert (nsect_extend > 0); volext init (nsect_max, voltype, purpose, overwrite=false) ...
while (nsect_extend > 0)  /* <- phase 2: add fresh volumes */
  {
    if (check_interrupt && logtb_is_interrupted (thread_p, true, &continue_check))
      {
        /* <- branch: only if re-enabled */
        er_set (..., ER_INTERRUPTED, 0);
        return ER_INTERRUPTED;
      }
    volext.nsect_total = nsect_extend + DISK_SYS_NSECT_SIZE (volext.nsect_max);
    // ... condensed: clamp to [DISK_MIN_VOLUME_SECTS, nsect_max] then DISK_SECTS_ROUND_UP ...
    error_code = disk_add_volume (thread_p, &volext, &volid_new, &nsect_free_new);
    if (error_code != NO_ERROR)
      {
        /* <- disk_add_volume aborted its own sysop */
        ASSERT_ERROR ();
        return error_code;
      }
    nsect_extend -= nsect_free_new;
    // ... condensed: bump nsect_total/nsect_max; under reserve mutex set vol_free + reserve ahead ...
    if (extend_info->nsect_total < extend_info->nsect_max)
      extend_info->volid_extend = volid_new;  /* <- newest non-maxed volume becomes extendable */
  }
return NO_ERROR;
```

nsect_extend adds the (non-negative) headroom shortfall to intention, so one expansion serves every thread blocked on this purpose. Phase 1 grows the sub-ceiling volid_extend volume and reserves ahead; phase 2's three branches (interrupt, disk_add_volume error where the callee already aborted, and the volid_extend update only if sub-max) are annotated inline.
INVARIANT: exactly one volume per purpose is "extendable". volid_extend names the single volume phase 1 grows; the code clears it to NULL_VOLID the instant a volume reaches nsect_max and re-points it at the newest sub-max volume. If violated, a maxed volume could reach disk_volume_expand with to_expand = MIN(nsect_extend, max - total) non-positive, tripping an assert.
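
The sizing arithmetic at the top of disk_extend can be worked through with plain integers. This is a minimal sketch: toy_nsect_extend and toy_to_expand are hypothetical names, and min_volume_sects is a parameter standing in for DISK_MIN_VOLUME_SECTS, whose actual value is not given in this document.

```c
#include <assert.h>

static int toy_min (int a, int b) { return a < b ? a : b; }
static int toy_max (int a, int b) { return a > b ? a : b; }

/* How many sectors to grow by: the shortfall against a free-space target of
   1% of the pool (floored at min_volume_sects), plus all published intention. */
static int
toy_nsect_extend (int total, int free, int intention, int min_volume_sects)
{
  int target_free;

  assert (total >= 0 && free >= 0 && intention >= 0);
  target_free = toy_max ((int) (total * 0.01), min_volume_sects);
  return toy_max (target_free - free, 0) + intention;
}

/* Phase 1 cap: never grow past the ceiling (max - total). */
static int
toy_to_expand (int nsect_extend, int total, int max)
{
  return toy_min (nsect_extend, max - total);
}
```

For a pool of 100000 sectors with 200 free, 50 sectors of published intention, and a floor of 20, the target is 1000 free sectors, so the growth is 800 + 50 = 850; if the ceiling is 100500, phase 1 expands by 500 and the remaining 350 falls to the add-volume loop. When free space already exceeds the target and nobody has published an intention, the result is 0 and the function has nothing to do.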

```mermaid
flowchart TD
  C{"nsect_extend <= 0?"} -->|yes| Z1["return NO_ERROR"]
  C -->|no| D{"total < max?"}
  D -->|yes| E["sysop_start; disk_volume_expand"]
  E --> F{"error?"} -->|yes| G["sysop_abort; return error"]
  F -->|no| H["sysop_commit; cache + reserve ahead"]
  H --> I{"nsect_extend <= 0?"} -->|yes| Z2["return NO_ERROR"]
  I -->|no| J["phase 2 loop"]
  D -->|no| J
  J --> K{"interrupted?"} -->|yes| L["return ER_INTERRUPTED"]
  K -->|no| M["size volext; disk_add_volume"]
  M --> N{"error?"} -->|yes| O["return error"]
  N -->|no| P["cache + reserve ahead; set volid_extend if sub-max"]
  P --> Q{"nsect_extend > 0?"} -->|yes| K
  Q -->|no| Z3["return NO_ERROR"]
```

Figure 5-1. Branch-complete flow of disk_extend: sizing, optional in-place expand, then the add-volume loop.
5.3 disk_volume_expand: growing one file as its own sysop
disk_volume_expand grows a single volume in place. The ordering of its six steps is the whole point: it makes the growth crash-safe.

```c
// disk_volume_expand -- src/storage/disk_manager.c
error_code = disk_get_volheader (thread_p, volid, PGBUF_LATCH_WRITE, &page_volheader, &volheader);
if (error_code != NO_ERROR)
  {
    /* <- header fix fatal */
    assert_release (false);
    er_set (..., ER_GENERIC_ERROR, 0);
    return ER_FAILED;
  }
do_logging = (volheader->type == DB_PERMANENT_VOLTYPE);  /* <- temp volumes are not logged */
log_sysop_start (thread_p);  /* step 1: own sysop so header change can be undone */
volheader->nsect_total += nsect_extend;
if (do_logging)
  log_append_undoredo_data2 (thread_p, RVDK_VOLHEAD_EXPAND, ...);  /* step 2: header undo/redo */
volume_new_npages = DISK_SECTS_NPAGES (volheader->nsect_total);
if (do_logging)
  log_append_dboutside_redo (thread_p, RVDK_EXPAND_VOLUME, ...);  /* step 3: unattached redo */
pgbuf_set_dirty_and_free (thread_p, page_volheader);  /* free header only after step 3 is logged */
log_sysop_commit (thread_p);  /* step 4: cancel the header-undo */
logpb_force_flush_pages (thread_p);  /* step 5: log MUST be on disk before the file grows */
error_code = fileio_expand_to (thread_p, volid, volume_new_npages, voltype);  /* step 6: grow OS file */
if (error_code != NO_ERROR)
  {
    /* <- cannot-happen: growth already durable; cache desyncs */
    assert (false);
    return error_code;
  }
*nsect_extended_out = nsect_extend;
return NO_ERROR;
```

The header-fix failure is fatal; the do_logging branch skips both log records for temp volumes (never recovered); the fileio_expand_to failure is a cannot-happen branch since log_sysop_commit already made the growth durable. RVDK_VOLHEAD_EXPAND (disk_rv_volhead_extend_undo/..._redo) adjusts nsect_total and the cache by the same delta; RVDK_EXPAND_VOLUME re-runs fileio_expand_to on recovery.
INVARIANT: the file-growth redo log must be durable before the file is grown. Step 5 (logpb_force_flush_pages) sits between the committed header update and fileio_expand_to. If skipped, a crash in between leaves the recovered header and the OS file disagreeing on size.
5.4 disk_add_volume: a fresh OS file plumbed into three registries
When in-place expansion is exhausted, disk_extend calls disk_add_volume, the second nested top action, which wraps the cache-mutating steps in log_sysop_start/log_sysop_commit.

```c
// disk_add_volume -- src/storage/disk_manager.c
if (disk_Cache->nvols_perm + disk_Cache->nvols_temp >= LOG_MAX_DBVOLID)
  return ER_BO_MAXNUM_VOLS_HAS_BEEN_EXCEEDED;  /* <- volume-id space exhausted */
error_code = boot_get_new_volume_name_and_id (..., &volid);  /* step 1: name + id from boot */
// ... condensed: raw-device symlink, partition free-space check ...
if (nsect_part_max >= 0 && nsect_part_max < extinfo->nsect_max)
  return ER_IO_FORMAT_OUT_OF_SPACE;  /* step 2 failed: not enough OS disk space */
if (!extinfo->overwrite && fileio_is_volume_exist (extinfo->name))
  {
    /* ... condensed: disk_can_overwrite_data_volume check ... */
    return ER_BO_VOLUME_EXISTS;  /* <- refuse to clobber an existing file */
  }
log_sysop_start (thread_p);  /* NESTED TOP ACTION begins */
if (extinfo->voltype == DB_PERMANENT_VOLTYPE)
  disk_Cache->nvols_perm++;  /* step 3: cache before format */
else
  disk_Cache->nvols_temp++;
disk_Cache->vols[volid].purpose = extinfo->purpose;
error_code = disk_format (thread_p, boot_db_full_name (), volid, extinfo, nsects_free_out);  /* step 4 */
if (error_code != NO_ERROR)
  {
    ASSERT_ERROR ();
    goto exit;
  }
if (extinfo->voltype == DB_PERMANENT_VOLTYPE)
  if (logpb_add_volume (NULL, volid, extinfo->name, DB_PERMANENT_DATA_PURPOSE) == NULL_VOLID)
    {
      /* step 5: register in _vinf (perm only) */
      ASSERT_ERROR_AND_SET (error_code);
      goto exit;
    }
error_code = boot_dbparm_save_volume (thread_p, extinfo->voltype, volid);  /* step 6: persist in boot_Db_parm */
if (error_code != NO_ERROR)
  {
    ASSERT_ERROR ();
    if (extinfo->voltype == DB_TEMPORARY_VOLTYPE && disk_unformat (thread_p, extinfo->name) != NO_ERROR)
      assert (false);  /* <- rollback won't drop temp file; do it by hand */
    goto exit;
  }
*volid_out = volid;

exit:
if (error_code == NO_ERROR)
  log_sysop_commit (thread_p);
else
  {
    log_sysop_abort (thread_p);
    if (extinfo->voltype == DB_TEMPORARY_VOLTYPE)
      disk_Cache->nvols_temp--;  /* <- undo cache count manually */
    else
      disk_Cache->nvols_perm--;
  }
return error_code;
```

Three registries (Figure 5-2): boot_Db_parm, updated last (a crash before it leaves no dangling reference); the _vinf registry via logpb_add_volume, permanent only; and disk_Cache, whose nvols_* and vols[volid].purpose are bumped first so disk_format's page fixes find the volume classified. Every goto exit funnels into one log_sysop_abort. Two things logging cannot undo are fixed by hand on the error path: the raw nvols_* counter (decremented in the abort arm) and, for a temp volume, the file itself (disk_unformat in the boot_dbparm_save_volume failure branch, since temp creation is not journaled). A permanent volume's file is handled by recovery via the logged format records.

```mermaid
graph TD
  AV["disk_add_volume\nnew volume file"] --> BP["boot_Db_parm\nboot_dbparm_save_volume()"]
  AV --> VI["_vinf volinfo registry\nlogpb_add_volume() perm only"]
  AV --> DC["disk_Cache\nnvols_*++, vols[volid].purpose"]
  AV --> FMT["disk_format()\nzeroes file, writes volheader + sector table"]
```

Figure 5-2. The three registries disk_add_volume plumbs a new volume into, plus the on-disk format step.
5.5 disk_add_volume_extension: the addvoldb / boot-time entry, and the retired daemon
disk_extend is the automatic path; disk_add_volume_extension is the explicit entry, called by addvoldb and at database creation. It does not size against nsect_intention (the caller dictates npages) but respects the same serialization, taking disk_lock_extend() and the CSECT_DISK_CHECK reader latch so an admin addvol cannot race an automatic disk_extend.

```c
// disk_add_volume_extension -- src/storage/disk_manager.c
error_code = csect_enter_as_reader (thread_p, CSECT_DISK_CHECK, INF_WAIT);
disk_lock_extend ();  /* <- block other expansions */
// ... condensed: realpath, fill ext_info from caller args ...
ext_info.nsect_total = disk_sectors_to_extend_npages (npages);
ext_info.nsect_max = ext_info.nsect_total;  /* <- born at its max: never auto-grown */
if (voltype == DB_TEMPORARY_VOLTYPE)
  {
    if (disk_Cache->temp_purpose_info.extend_info.nsect_total + ext_info.nsect_total > disk_Temp_max_sects)
      {
        /* <- temp-space cap: release BOTH locks */
        er_set (..., ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED, ...);
        disk_unlock_extend ();
        csect_exit (thread_p, CSECT_DISK_CHECK);
        return ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED;
      }
    ext_info.voltype = DB_TEMPORARY_VOLTYPE;
  }
else
  ext_info.voltype = DB_PERMANENT_VOLTYPE;
error_code = disk_add_volume (thread_p, &ext_info, &volid_new, &nsect_free);
if (error_code != NO_ERROR)
  {
    ASSERT_ERROR ();
    disk_unlock_extend ();
    csect_exit (thread_p, CSECT_DISK_CHECK);
    return error_code;
  }
// ... condensed: bump per-purpose nsect_total/nsect_max, update vol_free under reserve mutex ...
disk_unlock_extend ();
csect_exit (thread_p, CSECT_DISK_CHECK);
*volid_out = volid_new;
return NO_ERROR;
```

ext_info.nsect_max = ext_info.nsect_total means a user-added volume is born at its maximum size, never a candidate for in-place expansion. Three branches: the temp-space-exceeded early return, the disk_add_volume error return (both releasing the extend mutex and the critical section), and the success path. The post-add bookkeeping distinguishes a permanent-type volume serving temporary purpose from a true temporary-type volume, the three-way classification used throughout the cache.
The retired daemon. The comment atop disk_extend still mentions an auto-expansion thread keeping “a stable level of free space,” but that daemon has been removed — which is why nsect_intention is now the sole coalescing mechanism: the first thread to take the extend mutex must grow enough for itself and every thread that published an intention while it waited.
5.6 Why a nested top action — and not the outer transaction
Both disk_volume_expand and disk_add_volume wrap their durable work in log_sysop_start/log_sysop_commit rather than letting it ride on the outer reservation's transaction. The grower acts on behalf of all co-users of the new space: reserve-ahead hands fresh sectors to the triggering reservation, but other waiting threads reserve from the same volume once the extend mutex is released. If the growth rode the outer transaction and that transaction later rolled back, every co-user would be forced to roll back too, because a volume several transactions depend on would vanish. Committing as an independent nested top action makes the space durable regardless of the triggering transaction's fate: the reservation can still abort; the volume stays. This is the discipline the companion describes for file-table updates, applied to the coarsest unit of growth.
5.7 Chapter summary — key takeaways
- The extend path is entered only after the cache fails twice. disk_reserve_from_cache records nsect_intention, releases the reserve mutex, takes mutex_extend, and re-checks free space; the double-check absorbs the race where another thread already grew the volume between the two mutexes.
- nsect_intention is the load-bearing accumulator. With the auto-expansion daemon removed, it is the only mechanism coalescing concurrent demand; disk_extend adds it to nsect_extend so one expansion serves every waiting thread, and paired += / -= (with a save_remaining snapshot) keep it balanced across error paths.
- disk_extend is expand-then-add. It grows the single volid_extend volume in place up to nsect_max (one volume per purpose may grow), then loops adding fresh volumes for residual demand, reserving ahead into the caller's context after each step.
- disk_volume_expand orders log-before-grow. Header undo/redo plus an unattached RVDK_EXPAND_VOLUME redo, a forced log flush, then fileio_expand_to, whose failure is unrecoverable by construction.
- disk_add_volume plumbs the new file into three registries: boot_Db_parm (last), the _vinf file (permanent only), and disk_Cache (counts first), manually undoing the unlogged cache counter and unformatting orphaned temp files on error.
- disk_add_volume_extension is the explicit addvol / boot-time twin: same mutex_extend serialization, caller-supplied size, and nsect_max == nsect_total so user volumes are never auto-grown.
- Growth is a nested top action so co-users are not held hostage: committing the expansion independently means a later rollback of the triggering reservation cannot destroy a volume other transactions now depend on.
Chapter 6: File Creation and the Three-Table Layout
The high-level companion (cubrid-disk-manager.md) explains why a file is a set of reserved sectors. This chapter answers the mechanical follow-up: once disk_reserve_sectors (Ch.4) returns a VSID array, which file_create then sorts, how does file_create turn it into a usable file with a header page, a VFID, and the partial / full / user-page tables every later allocation relies on? We assume the VSID array exists and trace the file-manager side.
file_create (in file_manager.c) is the one engine. Everything else — the file_create_heap / temp / ehash family — is a thin wrapper that picks two booleans (is_temp, is_numerable), a FILE_TYPE, a FILE_TABLESPACE, and an optional FILE_DESCRIPTORS, then calls it.
6.1 The four structs that live in the header page
A file’s header page (PAGE_FTAB) begins with one file_header struct, followed by one to three file_extensible_data table headers. Two of file_header’s members are themselves structs (FILE_TABLESPACE, FILE_DESCRIPTORS).

```c
// struct file_header -- src/storage/file_manager.c
struct file_header
{
  INT64 time_creation;          /* Time of file creation. */
  VFID self;                    /* Self VFID */
  FILE_TABLESPACE tablespace;   /* The table space definition */
  FILE_DESCRIPTORS descriptor;  /* File descriptor. Depends on file type. */
  // ... page / sector counters, flags, table offsets, temp+numerable cursors ...
};
```

file_header fields.

| Field | Role / why it exists |
|---|---|
| time_creation | Wall-clock create time; distinguishes reused fileids. |
| self | This file’s own VFID; self-identifying header for recovery. |
| tablespace | Embedded FILE_TABLESPACE; perm extension (Ch.5), zeroed temp. |
| descriptor | Embedded FILE_DESCRIPTORS union; type-specific owner metadata. |
| n_page_total | Total pages over all sectors; allocation ceiling. |
| n_page_user | User pages handed out (0); user vs table pages. |
| n_page_ftab | Pages used by the file’s tables; starts at 1 (header). |
| n_page_free | Reserved-but-unallocated pages; Ch.7/8 draws down. |
| n_page_mark_delete | Removed numerable pages; marked, not compacted. |
| n_sector_total | Reserved-sector count; equals n_sectors. |
| n_sector_partial | Sectors with a free page (total - full); alloc candidates. |
| n_sector_full | Sectors fully used by tables; perm only. |
| n_sector_empty | Sectors with no page allocated; starts -1 (header sector). |
| type | FILE_TYPE enum; type routing. |
| file_flags | NUMERABLE/TEMPORARY/ENCRYPTED_*; truth for FILE_IS_*. |
| volid_last_expand | Last volume that supplied a sector; seeds next extension. |
| offset_to_partial_ftab | Offset to partial table; anchors GET_PART_FTAB. |
| offset_to_full_ftab | Offset to full table; perm only, else NULL_OFFSET. |
| offset_to_user_page_ftab | Offset to user-page table; numerable only, else NULL_OFFSET. |
| vpid_sticky_first | Undeletable first page; set later (Ch.11). |
| vpid_last_temp_alloc + offset_to_last_temp_alloc | Temp-alloc cursor (page + offset); temp shortcut (Ch.8). |
| vpid_last_user_page_ftab | Last user-page-table page; numerable append (Ch.10). |
| vpid_find_nth_last / first_index_find_nth_last | Cached find_nth position; nth-lookup speedup (Ch.10). |
| reserved0..3 | Padding, zeroed; forward-compat. |

graph LR
    FH["file_header"] -->|embeds| TS["FILE_TABLESPACE"]
    FH -->|embeds union| DES["FILE_DESCRIPTORS"]
    FH -->|offset_to_partial_ftab| PT["partial table"]
    FH -->|offset_to_full_ftab perm| FT["full table"]
    FH -->|offset_to_user_page_ftab numerable| UT["user-page table"]
Figure 6-1. file_header embeds two structs and points at one to three file_extensible_data tables in the header page.
FILE_TABLESPACE — four fields set by FILE_TABLESPACE_FOR_PERM_NPAGES / _FOR_TEMP_NPAGES: initial_size (requested bytes MAX(1,npages)*DB_PAGESIZE, seeds total_size); expand_ratio (geometric-growth fraction, 0 for temp); expand_min_size / expand_max_size (per-extension clamps, both 0 for temp so temp never auto-extends).
FILE_DESCRIPTORS is a union padded to 64 bytes (FILE_DESCRIPTORS_SIZE). Arms: heap (class_oid, hfid), heap_overflow, btree (class_oid, attr_id), btree_key_overflow, ehash (class_oid, attr_id), vacuum_data (vpid_first), dummy_align (forces the 64-byte footprint). The fixed size is load-bearing — the header warns “if you change file descriptors size, make sure to change disk compatibility version too!”: the union size is part of the on-disk format.
file_extensible_data is the table header repeated up to three times after file_header. Four fields: vpid_next (continuation-page link when a table outgrows the header page), max_size (item capacity in bytes, fixed at file_extdata_init), size_of_item (bytes per item — one struct, three item types), n_items (items stored, starts 0).
6.2 Estimating size: data plus worst-case file-table sectors
file_create turns the requested byte size into a sector count, then reserves extra sectors for the file’s own tables. The estimate is pessimistic on purpose: over-reserving is cheap, under-reserving forces a mid-create extension.

```c
// file_create -- src/storage/file_manager.c
total_size = tablespace->initial_size;
if (!is_numerable)
  max_size_ftab = total_size / 8 / 1024;       /* <- partial+full (~1 byte/8KB) */
else
  max_size_ftab = total_size * 33 / 8 / 1024;  /* <- + user-page table */
total_size += max_size_ftab;
n_sectors = (int) CEIL_PTVDIV (total_size, DB_SECTORSIZE);
vsids_reserved = (VSID *) db_private_alloc (thread_p, n_sectors * sizeof (VSID));
```

On db_private_alloc failure: er_set(ER_OUT_OF_VIRTUAL_MEMORY) then goto exit (nothing reserved yet). Otherwise, for permanent files only (do_logging = !is_temp), log_sysop_start opens a system operation; temp files skip it. This do_logging split recurs at every dirty/unfix call below.
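The arithmetic above can be tried in isolation. The following is a minimal sketch, not the engine's code: it assumes a 16 KB page and the 64-page sector mentioned earlier, and the helper name sketch_estimate_sectors and the SKETCH_* constants are hypothetical stand-ins for CEIL_PTVDIV and DB_SECTORSIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; real builds derive them from configuration. */
#define SKETCH_PAGESIZE      16384
#define SKETCH_SECTOR_NPAGES 64
#define SKETCH_SECTORSIZE    ((int64_t) SKETCH_PAGESIZE * SKETCH_SECTOR_NPAGES)

/* Hypothetical helper that mirrors the estimate quoted from file_create. */
static int
sketch_estimate_sectors (int64_t initial_size, int is_numerable)
{
  int64_t total_size = initial_size;
  int64_t max_size_ftab;

  if (!is_numerable)
    max_size_ftab = total_size / 8 / 1024;        /* partial + full tables */
  else
    max_size_ftab = total_size * 33 / 8 / 1024;   /* plus the user-page table */

  total_size += max_size_ftab;

  /* Ceiling division, the role CEIL_PTVDIV plays in the quoted code. */
  return (int) ((total_size + SKETCH_SECTORSIZE - 1) / SKETCH_SECTORSIZE);
}
```

Under these assumptions a one-page request still fits in one sector, while a request of exactly one sector's worth of data spills into a second sector because of the table overhead.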
6.3 Reserving and choosing the VFID
```c
// file_create -- src/storage/file_manager.c
volpurpose = is_temp ? DB_TEMPORARY_DATA_PURPOSE : DB_PERMANENT_DATA_PURPOSE;
error_code = disk_reserve_sectors (thread_p, volpurpose, NULL_VOLID, n_sectors, vsids_reserved);
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
was_temp_reserved = is_temp;                              /* <- arm temp-leak cleanup */
volid_last_expand = vsids_reserved[n_sectors - 1].volid;  /* <- before sort! */
qsort (vsids_reserved, n_sectors, sizeof (VSID), disk_compare_vsids);
```

volid_last_expand is grabbed before the sort: sectors come back in reservation order, and the last one is the most-recently-extended volume, where future growth should continue. was_temp_reserved arms the manual unreserve at exit (temp reservations are not undone by recovery).
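Why the capture must precede the sort is easy to demonstrate. The sketch below uses a simplified stand-in for VSID and a comparator that orders by volume and then sector; that ordering is an assumption about disk_compare_vsids, and all SKETCH_/sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for VSID (illustration only). */
typedef struct
{
  int32_t sectid;
  int16_t volid;
} SKETCH_VSID;

/* Assumed ordering: by volume id first, then by sector id. */
static int
sketch_compare_vsids (const void *a, const void *b)
{
  const SKETCH_VSID *x = a;
  const SKETCH_VSID *y = b;

  if (x->volid != y->volid)
    return x->volid < y->volid ? -1 : 1;
  if (x->sectid != y->sectid)
    return x->sectid < y->sectid ? -1 : 1;
  return 0;
}

/* Reads the last reserved volume while the array is still in reservation order. */
static int16_t
sketch_capture_then_sort (SKETCH_VSID *vsids, int n)
{
  int16_t volid_last_expand = vsids[n - 1].volid;   /* must happen before qsort */

  qsort (vsids, (size_t) n, sizeof (SKETCH_VSID), sketch_compare_vsids);
  return volid_last_expand;
}
```

After the sort the last array slot holds the highest volume id, which is not necessarily the volume that supplied the final sector, so reading it afterwards would seed the next extension from the wrong place.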
Header-page (hence VFID) selection then branches on type (Figure 6-2):
flowchart TD
B{"SERVER_MODE and type in\nBTREE/HEAP/HEAP_REUSE_SLOTS?"}
B -->|yes| C["scan fileids in first volume\nvacuum_is_file_dropped per fileid"]
C --> D{"non-dropped found?"}
D -->|yes| E["vfid = found_vfid"]
D -->|no| F["assert_release false -> exit"]
B -->|no| G["vfid = first page of sectid[0]"]
E --> H["vpid_fhead = vfid"]
G --> H
Figure 6-2. Header-page / VFID selection.
The default branch takes the first page of the first sorted sector. The vacuum-aware branch (SERVER_MODE and type in BTREE/HEAP/HEAP_REUSE_SLOTS) exists because reusing a VFID that vacuum still believes is “dropped” would corrupt its dropped-files list. It walks every fileid of every sector in the first volume (the VFID must share that volume) and picks the first one that vacuum_is_file_dropped reports as not dropped. An error from that function leads to goto exit; a first volume in which every candidate is dropped is considered impossible and trips assert_release(false).
6.4 Initializing the header
```c
// file_create -- src/storage/file_manager.c
page_fhead = pgbuf_fix (thread_p, &vpid_fhead, NEW_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
if (page_fhead == NULL) { ASSERT_ERROR_AND_SET (error_code); goto exit; }
// ... condensed: memset(0), set ptype PAGE_FTAB, fhead = page; self/tablespace/type set ...
if (des != NULL) { fhead->descriptor = *des; }                    /* <- temp/query-area pass NULL */
if (is_numerable) { fhead->file_flags |= FILE_FLAG_NUMERABLE; }
if (is_temp) { fhead->file_flags |= FILE_FLAG_TEMPORARY; }
// ... condensed: time_creation, NULL cursors, zero counters ...
fhead->n_page_ftab = 1;     /* <- the header page is itself a table page */
fhead->n_sector_empty--;    /* <- start negative: header sector is not empty */
```

The header is fixed new (error path on NULL), zeroed, typed PAGE_FTAB, and self/tablespace/type stamped in. Two non-obvious seeds: n_page_ftab starts at 1 (the header is a table page) and n_sector_empty at -1 so the header’s sector is not counted as empty when partial sectors are later tallied.
6.5 The three-table layout — four flavors of the header byte budget
After the offset cursor offset_ftab is seeded to FILE_HEADER_ALIGNED_SIZE (the first byte past the fixed header), file_create carves the remaining DB_PAGESIZE - offset_ftab bytes into tables (Figure 6-3). The four-way split keys on the (is_temp, is_numerable) pair:
flowchart TD
N{"is_numerable?"}
N -->|yes| NT{"is_temp?"}
N -->|no| RT{"is_temp?"}
NT -->|yes| A["temp numerable\npartial 1/16, user-page 15/16"]
NT -->|no| B["perm numerable\npartial 1/32, full 1/32, user-page 15/16"]
RT -->|yes| C["temp regular\npartial = all remaining"]
RT -->|no| D["perm regular\npartial 1/2, full 1/2"]
Figure 6-3. The four flavors of header-page partitioning. Every flavor allocates a partial table; full and user-page are conditional.
Each table is initialized with file_extdata_init(item_size, size, extdata) — item_size is sizeof(FILE_PARTIAL_SECTOR) for partial, sizeof(VSID) for full, sizeof(VPID) for user-page. Each assignment fhead->offset_to_*_ftab = offset_ftab is followed by offset_ftab += file_extdata_max_size(extdata) so the next table starts aligned past it. The permanent-numerable arm is the only one to advance the cursor twice (after partial, after full) before the user-page table consumes the remainder; all others advance it at most once.
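The four flavors of Figure 6-3 can be summarized as a small function over the remaining byte budget. This is only a sketch of the proportions: it ignores the alignment rounding that file_extdata_max_size applies, and SKETCH_BUDGET and sketch_split_header are hypothetical names.

```c
#include <assert.h>

/* Byte budget per table; 0 means the table is absent (illustration only). */
typedef struct
{
  int partial;
  int full;
  int user_page;
} SKETCH_BUDGET;

/* Splits the bytes left after the fixed header, following the fractions in Figure 6-3. */
static SKETCH_BUDGET
sketch_split_header (int remaining, int is_temp, int is_numerable)
{
  SKETCH_BUDGET b = { 0, 0, 0 };

  if (is_numerable)
    {
      if (is_temp)
        {
          b.partial = remaining / 16;          /* temp numerable: 1/16 partial */
        }
      else
        {
          b.partial = remaining / 32;          /* perm numerable: 1/32 partial */
          b.full = remaining / 32;             /* ... and 1/32 full */
        }
      b.user_page = remaining - b.partial - b.full;   /* the remaining 15/16 */
    }
  else if (is_temp)
    {
      b.partial = remaining;                   /* temp regular: partial takes it all */
    }
  else
    {
      b.partial = remaining / 2;               /* perm regular: half and half */
      b.full = remaining - b.partial;
    }
  return b;
}
```

Every branch leaves partial non-zero, which is the shape of the invariant that every file has a partial table.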
Invariant: every file has a partial table, correctly aligned. All four branches end asserting offset_to_partial_ftab != NULL_OFFSET, and every offset_to_*_ftab assignment is followed by assert((INT16) DB_ALIGN(offset, MAX_ALIGNMENT) == offset). The partial table is the universal entry point (Ch.7/8 walk it first); full/user-page offsets stay NULL_OFFSET when unused. Alignment holds by construction (FILE_HEADER_ALIGNED_SIZE is pre-aligned, file_extdata_max_size returns an aligned span). The FILE_HEADER_GET_*_FTAB macros enforce the contract on every later read: GET_FULL_FTAB asserts !FILE_IS_TEMPORARY(fh), GET_USER_PAGE_FTAB asserts FILE_IS_NUMERABLE(fh), and all three bound the offset in [FILE_HEADER_ALIGNED_SIZE, DB_PAGESIZE). A mis-set offset is a loud crash, not silent corruption; broken alignment means unaligned INT64/VPID access.
6.6 Populating the partial table and splitting full sectors
file_create walks vsids_reserved, appending one FILE_PARTIAL_SECTOR per sector into the partial table (file_extdata_append). When the in-header table fills (file_extdata_is_full), it allocates a continuation page from the sectors it is currently recording, chains it via vpid_next, bumps n_page_ftab, and continues there; continuation pages’ bits are set in their sectors’ bitmaps so they are never re-handed to a user.
After the walk the last sector partsect_ftab points at may itself be full (it held the last table page); if so, partsect_ftab++; fhead->n_sector_full++;. Then, for permanent files only, sectors fully consumed by the file table migrate from the partial table into the full table:
```c
// file_create (full-sector migration) -- src/storage/file_manager.c
if (!is_temp && fhead->n_sector_full > 0)
  {
    // ... condensed: GET_PART_FTAB + GET_FULL_FTAB into extdata_part_ftab / extdata_full_ftab ...
    for (i = 0; i < fhead->n_sector_full; i++)
      {
        partsect_iter = (FILE_PARTIAL_SECTOR *) file_extdata_at (extdata_part_ftab, i);
        /* ... condensed: drops the file_extdata_is_full / assert_release(false) guard ... */
        file_extdata_append (extdata_full_ftab, &partsect_iter->vsid);   /* <- VSID only */
      }
    file_extdata_remove_at (extdata_part_ftab, 0, fhead->n_sector_full);
  }
```

Temp files skip this entirely (no full table); they instead seed the temp cursor (vpid_last_temp_alloc = vpid_fhead, offset_to_last_temp_alloc = n_sector_full). Numerable files (temp or perm) seed the user-page-table head (vpid_last_user_page_ftab and vpid_find_nth_last both set to vpid_fhead). Finally the counters are reconciled (n_sector_total = n_sectors, n_sector_partial = total - full, n_sector_empty += n_sector_partial, n_page_total = n_sector_total * DISK_SECTOR_NPAGES, n_page_free = n_page_total - n_page_ftab), and file_header_sanity_check asserts the header is internally consistent.
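The reconciliation formulas can be checked with a few numbers. The sketch below applies them to a pared-down counter struct; SKETCH_COUNTERS and sketch_reconcile are hypothetical names, and the 64-page sector is the value stated earlier in this document.

```c
#include <assert.h>

#define SKETCH_SECTOR_NPAGES 64

/* Only the counters touched by the final reconciliation (illustration only). */
typedef struct
{
  int n_sector_total;
  int n_sector_partial;
  int n_sector_full;
  int n_sector_empty;   /* seeded to -1 by header initialization */
  int n_page_total;
  int n_page_ftab;      /* seeded to 1 by header initialization */
  int n_page_free;
} SKETCH_COUNTERS;

/* Applies the formulas listed above, in the same order. */
static void
sketch_reconcile (SKETCH_COUNTERS *c, int n_sectors)
{
  c->n_sector_total = n_sectors;
  c->n_sector_partial = c->n_sector_total - c->n_sector_full;
  c->n_sector_empty += c->n_sector_partial;   /* the -1 seed discounts the header sector */
  c->n_page_total = c->n_sector_total * SKETCH_SECTOR_NPAGES;
  c->n_page_free = c->n_page_total - c->n_page_ftab;
}
```

For a three-sector file whose only table page is the header, this yields two empty sectors and 191 free pages: the header's sector is partial but not empty, and the header page itself is not free.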
6.7 Commit, tracker registration, and the error/exit path
```c
// file_create (finish) -- src/storage/file_manager.c
if (do_logging)
  {
    pgbuf_log_new_page (thread_p, page_fhead, DB_PAGESIZE, PAGE_FTAB);
    pgbuf_unfix_and_init (thread_p, page_fhead);
  }
else
  {
    pgbuf_set_dirty_and_free (thread_p, page_fhead);   /* <- temp: no redo log */
  }
if (!is_temp && file_type != FILE_TRACKER)
  {
    error_code = file_tracker_register (thread_p, vfid, file_type, NULL);
    if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
  }
if (is_temp)
  {
    ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.nfile, 1);   /* ...stats... */
  }
```

Permanent files log the header for redo and register with the file tracker, except FILE_TRACKER itself, which would be circular (tracker registration is Ch.11). Temp files only bump in-memory spacedb_temp counters. The shared exit label handles every branch’s failure: (1) unfix page_ftab / page_fhead if still held; (2) if is_sysop_started, on error log_sysop_abort (rolls back reserve+layout), on success log_sysop_end_logical_undo(RVFL_DESTROY, vfid) so a later transaction abort tears the whole file down; (3) on error VFID_SET_NULL(vfid) so callers never see a half-built id, and if was_temp_reserved the temp sectors are manually unreserved here (recovery won’t, since temp work isn’t logged) under logtb_set_check_interrupt(false); (4) always db_private_free(vsids_reserved).
6.8 The wrappers and what each sets
Every public creator funnels into file_create with a fixed (is_temp, is_numerable, file_type):

| Wrapper | file_type | is_temp | is_numerable | Descriptor | Tablespace |
|---|---|---|---|---|---|
| file_create_heap | FILE_HEAP / FILE_HEAP_REUSE_SLOTS | no | no | heap (class_oid) | perm, npages=1 |
| file_create_temp | FILE_TEMP | yes | no | NULL | temp |
| file_create_temp_numerable | FILE_TEMP | yes | yes | NULL | temp |
| file_create_query_area | FILE_QUERY_AREA | yes | no | NULL | temp, npages=1 |
| file_create_ehash | FILE_EXTENDIBLE_HASH | caller’s is_tmp | yes | ehash | temp-sized |
| file_create_ehash_dir | FILE_EXTENDIBLE_HASH_DIRECTORY | caller’s is_tmp | yes | ehash | temp-sized |

file_create_heap builds the descriptor (memset, then des.heap.class_oid = *class_oid) and routes through file_create_with_npages. The three temp wrappers all go through file_create_temp_internal, which is not a thin pass-through:
```c
// file_create_temp_internal -- src/storage/file_manager.c
error_code = file_tempcache_get (thread_p, ftype, is_numerable, &tempcache_entry);
if (VFID_ISNULL (&tempcache_entry->vfid))   /* <- cache miss: create fresh */
  {
    FILE_TABLESPACE_FOR_TEMP_NPAGES (&tablespace, npages);
    file_tempcache_lock_tran_entry (tran_entry);   /* <- rmutex_topop guard */
    error_code = file_create (thread_p, ftype, &tablespace, NULL, true, is_numerable, vfid_out);
    file_tempcache_unlock_tran_entry (tran_entry);
    // ... condensed: on error file_tempcache_retire_entry + return; else cache the vfid ...
  }
else
  {
    *vfid_out = tempcache_entry->vfid;   /* <- cache hit: reuse, no file_create */
  }
file_tempcache_push_tran_file (thread_p, tempcache_entry);
```

So temp creation may skip file_create entirely and return a cached file. When it does call file_create, it wraps the call in a per-transaction lock because file_create’s log_sysop_start uses rmutex_topop, unsafe across parallel workers of one transaction (tempcache is Ch.11). The ehash wrappers are thin: temp-sized tablespace, FILE_EHASH_DES as descriptor, is_numerable = true unconditionally (nth-page lookup), is_temp forwarded from the caller’s is_tmp.
6.9 Chapter summary — key takeaways
- file_create is the single engine; the wrappers only pick (file_type, is_temp, is_numerable, descriptor, tablespace). Heap/ehash supply a descriptor; temp/query-area pass NULL.
- The reserved-sector count is over-estimated to fit the file’s own tables (total/8/1024 extra bytes regular, total*33/8/1024 numerable), avoiding a mid-create extension.
- The VFID is the first page of the first reserved sector, except for heap/btree under SERVER_MODE, which scan past any fileid vacuum still considers dropped.
- The header page hosts one file_header plus one to three file_extensible_data tables, partitioned by flavor: perm regular 1/2+1/2, perm numerable 1/32+1/32+15/16, temp regular partial-only, temp numerable 1/16+15/16. Two invariants hold throughout: every file has a partial table, and every offset is MAX_ALIGNMENT-aligned (enforced by FILE_HEADER_GET_*_FTAB).
- Permanent files migrate fully-consumed file-table sectors into the full table; temp files keep one cursor (vpid_last_temp_alloc) instead.
- do_logging = !is_temp governs durability: perm files run a sysop that logs the header and registers RVFL_DESTROY as logical undo; temp files are set-dirty-and-free, manually unreserved on error, and may be served straight from the tempcache without reaching file_create.
Chapter 7: Permanent File Page Allocation
This chapter answers: given a permanent file that already owns sectors, how does file_alloc hand out the next user page while preserving the head-of-Partial-table invariant that keeps the next allocation O(1)? We trace file_alloc, the engine file_perm_alloc, and its helpers. Theory of sectors, partial vs. full tables, and FILE_EXTENSIBLE_DATA lives in the companion (cubrid-disk-manager.md, “File layout” / “Three-table model”). Temporary allocation (Ch.8) and numerable tables (Ch.10) are out of scope.
7.1 The data unit: file_partial_sector
Every entry in the partial table is a file_partial_sector (typedef FILE_PARTIAL_SECTOR); file_perm_alloc mutates its bitmap on every fast-path allocation.

```c
// file_partial_sector -- src/storage/file_manager.h
struct file_partial_sector
{
  VSID vsid;    /* Important - VSID must be first member ...
                 * Sometimes, the FILE_PARTIAL_SECTOR pointers
                 * in file table are reinterpreted as VSID. */
  FILE_ALLOC_BITMAP page_bitmap;
};
```

| Field | Role | Why it exists |
|---|---|---|
| vsid | Sector identity { volid, sectid }; the on-disk address of the 64-page run this entry covers. | Says “this sector is reserved by this file.” Also the bare full-table entry (see invariant). |
| page_bitmap | 64-bit FILE_ALLOC_BITMAP (UINT64); bit k set ⇒ page k allocated. DISK_SECTOR_NPAGES == 64, one bit per page. | Flip one bit instead of scanning. 0x0…0 = FILE_EMPTY_PAGE_BITMAP, 0xF…F = FILE_FULL_PAGE_BITMAP. |

Invariant — VSID must be the first member. Full-table and expansion code reinterprets a FILE_PARTIAL_SECTOR * as a VSID * (a full-table entry is just a VSID). A field placed before vsid would make that cast read garbage. The struct layout is the contract.
classDiagram
class FILE_PARTIAL_SECTOR {
VSID vsid
FILE_ALLOC_BITMAP page_bitmap
}
FILE_PARTIAL_SECTOR --> "full table reuses the prefix" VSID : vsid is first
Figure 7-1. The full table stores a bare VSID — exactly the leading field of a partial entry. This prefix compatibility recurs throughout the chapter.
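The layout contract can be stated as a compile-time check. The sketch below uses simplified stand-ins for VSID and FILE_PARTIAL_SECTOR (the SKETCH_ names are hypothetical); it relies on the C rule that a pointer to a struct, suitably converted, points to its first member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for VSID and FILE_PARTIAL_SECTOR (illustration only). */
typedef struct
{
  int32_t sectid;
  int16_t volid;
} SKETCH_VSID;

typedef struct
{
  SKETCH_VSID vsid;       /* must stay the first member */
  uint64_t page_bitmap;
} SKETCH_PARTSECT;

/* Fails the build if someone places a field in front of vsid. */
_Static_assert (offsetof (SKETCH_PARTSECT, vsid) == 0, "vsid must be the first member");

/* The reinterpretation that the prefix layout makes safe. */
static const SKETCH_VSID *
sketch_partsect_as_vsid (const SKETCH_PARTSECT *partsect)
{
  return (const SKETCH_VSID *) partsect;
}
```

Whether the real header carries an equivalent static check is not stated in this document; the comment on the struct is the documented guard.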
7.2 file_alloc: the dispatcher
file_alloc fixes the header, branches on FILE_IS_TEMPORARY, allocates, optionally registers and initializes the page, and frames the permanent path inside a logical-undo system operation.

```c
// file_alloc -- src/storage/file_manager.c
page_fhead = pgbuf_fix (thread_p, &vpid_fhead, OLD_PAGE, PGBUF_LATCH_WRITE, ...);
// ... condensed ...
if (FILE_IS_TEMPORARY (fhead))
  error_code = file_temp_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out);   /* <- Ch.8 */
else
  {
    log_sysop_start_atomic (thread_p);   /* <- nested top action, atomic so undo is one unit */
    is_sysop_started = true;
    error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out);   /* <- 7.3 */
    VFID_COPY ((VFID *) undo_log_data, vfid);
    VPID_COPY ((VPID *) (undo_log_data + sizeof (VFID)), vpid_out);   /* <- undo payload {vfid,vpid} */
  }
```

Remaining exits (errors → goto exit): (1) pgbuf_fix fails → return, nothing fixed. (2) numerable → file_numerable_add_page (tail call, Ch.10). (3) f_init supplied → fix the new page NEW_PAGE, init, set TDE, hand back via page_out or unfix; failure unfixes. (4) no f_init → asserted temporary; return the raw page if page_out requested. (5) exit → sysop aborts on error, else commit-and-undo via log_sysop_end_logical_undo (RVFL_ALLOC, …), then sanity-check and unfix. Structural changes are redo-logged eagerly inside file_perm_alloc; the sysop’s single logical undo (RVFL_ALLOC) is the “deallocate {vfid,vpid}” record, following the nested-top-action discipline of Ch.5.
7.3 file_perm_alloc: the engine
Four phases: ensure free pages, ensure the header section holds a partial sector, flip a head-sector bit, then migrate to the full table if it just filled.
flowchart TD
A["file_perm_alloc(alloc_type)"] --> B{"n_page_free == 0 ?"}
B -- yes --> C["file_perm_expand\nreserve more sectors"]
B -- no --> D
C --> D{"header partial section empty ?"}
D -- yes --> E["file_table_move_partial_sectors_to_header"]
E --> F{"vpid_alloc_out set ?"}
F -- yes --> Z["goto exit\npage already chosen"]
F -- no --> G
D -- no --> G["partsect = head of partial section"]
G --> H["file_partsect_alloc:\nset first 0-bit, emit vpid"]
H --> I{"alloc_type ==\nTABLE_PAGE_FULL_SECTOR ?"}
I -- yes --> J["file_table_append_full_sector_page"]
I -- no --> K
J --> K["file_header_alloc:\ncounters + WAL"]
K --> L{"sector now full ?"}
L -- no --> Z2["exit OK"]
L -- yes --> M["remove head from partial table"]
M --> N["file_table_add_full_sector(vsid)"]
N --> Z2
Figure 7-2. file_perm_alloc control flow — every branch and goto.
Phase 1 — guarantee a free page
```c
// file_perm_alloc -- src/storage/file_manager.c
if (fhead->n_page_free == 0)
  {
    error_code = file_perm_expand (thread_p, page_fhead);   /* <- 7.4 */
    if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
  }
assert (fhead->n_page_free > 0 && fhead->n_sector_partial > 0);
```

Invariant: the file holds a Partial entry while n_page_free > 0. Free pages live only inside partial sectors (full sectors have none; empty is a subset of partial), so any free page implies a partial sector, and the two conditions of the assert confirm it. Phase 2 guarantees one sits in the header section.
Phase 2 — guarantee the head section is non-empty
```c
FILE_HEADER_GET_PART_FTAB (fhead, extdata_part_ftab);
if (file_extdata_is_empty (extdata_part_ftab))
  {
    error_code = file_table_move_partial_sectors_to_header (thread_p, page_fhead, alloc_type,
                                                            vpid_alloc_out);   /* <- 7.5 */
    if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
    if (!VPID_ISNULL (vpid_alloc_out))
      {
        goto exit;   /* <- a freed overflow page was reused as the allocation; done */
      }
  }
assert (!file_extdata_is_empty (extdata_part_ftab));
```

Either the move repopulated the header (vpid_alloc_out NULL, fall through) or it drained an overflow page and reused that page as the result (vpid_alloc_out set → goto exit, no bitmap touched; this is the CBRD-21242 path, 7.5).
Phase 3 — flip the head bit
```c
partsect = (FILE_PARTIAL_SECTOR *) file_extdata_start (extdata_part_ftab);   /* <- head item, position 0 */
assert (!file_partsect_is_full (partsect));
was_empty = file_partsect_is_empty (partsect);
if (!file_partsect_alloc (partsect, vpid_alloc_out, &offset_to_alloc_bit))   /* <- 7.6 */
  {
    assert_release (false);   /* head sector must have a free page (invariant) */
    error_code = ER_FAILED;
    goto exit;
  }
log_append_undoredo_data2 (thread_p, RVFL_PARTSECT_ALLOC, NULL, page_fhead,
                           (PGLENGTH) ((char *) partsect - page_fhead),   /* <- offset of partsect in page */
                           ..., &offset_to_alloc_bit, &offset_to_alloc_bit);   /* <- undo == redo == bit offset */
```

Invariant: the head sector always has a free page. Allocation always reads position 0 (file_extdata_start); Phases 1–2 guarantee a non-full partial sector there, so the two asserts treat a full head as a logic error. RVFL_PARTSECT_ALLOC logs only the bit offset (undo == redo) at partsect’s byte offset in the header page.
FILE_ALLOC_TABLE_PAGE vs FILE_ALLOC_USER_PAGE
Right after the bit flip, if (alloc_type == FILE_ALLOC_TABLE_PAGE_FULL_SECTOR) calls file_table_append_full_sector_page (...) (7.7). alloc_type says what the page is for; file_header_alloc (7.8) bumps n_page_user or n_page_ftab accordingly. The enum file_alloc_type has three values: FILE_ALLOC_USER_PAGE, FILE_ALLOC_TABLE_PAGE, FILE_ALLOC_TABLE_PAGE_FULL_SECTOR. The last is requested by file_table_add_full_sector when the full table needs a page; that page must link in before the current sector migrates, else migration finds no room and recurses. That is the reason the third value exists.
Phase 4 — migrate to full on overflow
```c
is_full = file_partsect_is_full (partsect);
file_header_alloc (fhead, alloc_type, was_empty, is_full);   /* <- 7.8: counters + WAL */
file_log_fhead_alloc (thread_p, page_fhead, alloc_type, was_empty, is_full);
if (is_full)
  {
    VSID vsid_full = partsect->vsid;   /* <- save before removal */
    file_log_extdata_remove (thread_p, extdata_part_ftab, page_fhead, 0, 1);
    file_extdata_remove_at (extdata_part_ftab, 0, 1);   /* <- drop head item */
    error_code = file_table_add_full_sector (thread_p, page_fhead, &vsid_full);   /* <- 7.7 */
    if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
  }
```

Counters update first (correct before any nested allocation), then the full head sector is removed from position 0 and added to the full table, restoring the head-of-Partial invariant. Rollback is owned by the enclosing file_alloc sysop.
7.4 file_perm_expand: refill the partial table
Called when n_page_free == 0. Reserves a batch of new sectors, appending them as empty partial entries in the header.

```c
// file_perm_expand -- src/storage/file_manager.c
expand_size_in_sectors = (int) ((float) fhead->n_sector_total * fhead->tablespace.expand_ratio);
expand_size_in_sectors = MAX (expand_size_in_sectors, expand_min_size_in_sectors);
expand_size_in_sectors = MIN (expand_size_in_sectors, expand_max_size_in_sectors);   /* <- clamp to header capacity */
// ... condensed: db_private_alloc vsids_reserved buffer ...
log_sysop_start (thread_p);   /* <- separate committed sysop: expansion is permanent */
error_code = disk_reserve_sectors (thread_p, DB_PERMANENT_DATA_PURPOSE, fhead->volid_last_expand,
                                   expand_size_in_sectors, vsids_reserved);   /* fail -> goto exit, abort */
qsort (vsids_reserved, expand_size_in_sectors, sizeof (VSID), disk_compare_vsids);
partsect.page_bitmap = FILE_EMPTY_PAGE_BITMAP;
for (... each reserved vsid ...)
  {
    partsect.vsid = *vsid_iter;
    file_extdata_append (extdata_part_ftab, &partsect);   /* <- empty entries into header */
  }
fhead->n_sector_total += expand_size_in_sectors;
fhead->n_sector_empty = fhead->n_sector_partial = expand_size_in_sectors;   /* asserted 0 before */
fhead->n_page_free = expand_size_in_sectors * DISK_SECTOR_NPAGES;   /* asserted 0 before */
fhead->n_page_total += fhead->n_page_free;
```

Branches: (1) size clamped to header file_extdata_remaining_capacity, so expansion never needs a new table page. (2) VSID-buffer db_private_alloc fails → ER_OUT_OF_VIRTUAL_MEMORY, return before any sysop. (3) disk_reserve_sectors fails → goto exit, sysop aborted. (4) success sets the counters (each asserted 0 first, confirming expand runs only on full exhaustion). The inner sysop commits on success, aborts on error (its own nested top action, Ch.5); RVFL_EXPAND logs the reserved VSID array as redo with empty undo.
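The sizing rule is a ratio followed by clamps. The sketch below models the header-capacity limit as a separate final clamp; how the quoted code folds that limit into expand_max_size_in_sectors is condensed out of the excerpt, so that separation, like the name sketch_expand_size, is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: geometric growth, then clamps to [min, max] and to header room. */
static int
sketch_expand_size (int n_sector_total, float expand_ratio,
                    int min_sectors, int max_sectors, int header_capacity)
{
  int n = (int) ((float) n_sector_total * expand_ratio);

  if (n < min_sectors)
    n = min_sectors;          /* never expand by less than the floor */
  if (n > max_sectors)
    n = max_sectors;          /* never expand by more than the ceiling */
  if (n > header_capacity)
    n = header_capacity;      /* new entries must fit in the header's partial table */
  return n;
}
```

A small file grows by the floor, a large file by the ceiling, and a nearly full header table caps both.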
7.5 file_table_move_partial_sectors_to_header
When the header section is empty but overflow pages still hold partial sectors, this function hoists items from the first overflow page into the header.

```c
// file_table_move_partial_sectors_to_header -- src/storage/file_manager.c
page_part_ftab_first = pgbuf_fix (thread_p, &extdata_part_ftab_head->vpid_next, OLD_PAGE, ...);   /* fail -> exit */
n_items_to_move = file_extdata_item_count (extdata_part_ftab_first);
if (n_items_to_move == 0) { assert_release (false); error_code = ER_FAILED; goto exit; }
// ... condensed: re-check header is empty ...
n_items_to_move = MIN (n_items_to_move,
                       file_extdata_remaining_capacity (extdata_part_ftab_head));   /* <- cap to header room */
file_extdata_append_array (extdata_part_ftab_head, file_extdata_start (extdata_part_ftab_first), n_items_to_move);
file_log_extdata_add (thread_p, extdata_part_ftab_head, page_fhead, 0, n_items_to_move, ...);
if (n_items_to_move < file_extdata_item_count (extdata_part_ftab_first))
  {
    /* partial move: remove copied prefix; first page survives */
    file_log_extdata_remove (thread_p, extdata_part_ftab_first, page_part_ftab_first, 0, n_items_to_move);
    file_extdata_remove_at (extdata_part_ftab_first, 0, n_items_to_move);
  }
else
  {
    /* whole page drained: unlink and REUSE it (CBRD-21242) */
    VPID save_next = extdata_part_ftab_head->vpid_next;   /* <- drained page id, saved before relink */
    // ... relink: head->vpid_next = first->vpid_next (skip drained page) ...
    *vpid_alloc_out = save_next;
    pgbuf_dealloc_page (thread_p, page_part_ftab_first);
    if (alloc_type == FILE_ALLOC_TABLE_PAGE_FULL_SECTOR)
      {
        file_table_append_full_sector_page (...);
      }
    else if (alloc_type == FILE_ALLOC_USER_PAGE)
      {
        fhead->n_page_ftab--;
        fhead->n_page_user++;
        log_append_undoredo_data2 (thread_p, RVFL_FHEAD_CONVERT_FTAB_TO_USER, ...);
      }
  }
```

Error/assert branches before the split: header vpid_next NULL → assert(false), ER_FAILED; pgbuf_fix of the first overflow page fails → goto exit; n_items_to_move == 0 → assert_release(false); header not actually empty → silent goto exit.
The full-drain path saves vpid_next before relinking, reuses the drained page as the result, and converts a table page to a user page (RVFL_FHEAD_CONVERT_FTAB_TO_USER) — avoiding a deallocate-then-reallocate loop, which is why Phase 2 short-circuits on !VPID_ISNULL (vpid_alloc_out).
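The partial-move versus full-drain split can be modeled with two small arrays. This is a minimal sketch with plain int items and hypothetical names (SKETCH_TABLE, sketch_move_to_header); it omits page fixing, logging, and relinking, and keeps only the capped copy and the drained test.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_ITEMS 8

/* A table component reduced to a counted array (illustration only). */
typedef struct
{
  int n_items;
  int capacity;
  int items[SKETCH_MAX_ITEMS];
} SKETCH_TABLE;

/* Moves as many leading items as fit; returns 1 when the overflow page was drained. */
static int
sketch_move_to_header (SKETCH_TABLE *head, SKETCH_TABLE *first)
{
  int room = head->capacity - head->n_items;
  int n = first->n_items < room ? first->n_items : room;   /* cap to header room */

  /* Append the prefix of the overflow page to the header section. */
  memcpy (head->items + head->n_items, first->items, (size_t) n * sizeof (int));
  head->n_items += n;

  /* Remove the copied prefix from the overflow page. */
  memmove (first->items, first->items + n, (size_t) (first->n_items - n) * sizeof (int));
  first->n_items -= n;

  return first->n_items == 0;
}
```

A return value of 1 corresponds to the branch in which the drained page itself becomes the allocation result.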
7.6 file_partsect_alloc and the bit helpers
Allocation is one bit flip in the head sector’s bitmap.

```c
// file_partsect_alloc -- src/storage/file_manager.c
int offset_to_zero = bit64_count_trailing_ones (partsect->page_bitmap);   /* <- index of first 0-bit */
if (offset_to_zero >= FILE_ALLOC_BITMAP_NBITS)   /* 64: bitmap all ones */
  {
    assert (file_partsect_is_full (partsect));
    return false;   /* <- caller treats as logic error */
  }
file_partsect_set_bit (partsect, offset_to_zero);
if (offset_out)
  *offset_out = offset_to_zero;
if (vpid_out)   /* <- reconstruct VPID from vsid + offset */
  {
    vpid_out->volid = partsect->vsid.volid;
    vpid_out->pageid = SECTOR_FIRST_PAGEID (partsect->vsid.sectid) + offset_to_zero;
  }
return true;
```

bit64_count_trailing_ones finds the lowest unset bit (pages go out densely from the sector bottom). file_partsect_set_bit asserts the bit is clear and ORs it via bit64_set. The inverse file_partsect_pageid_to_offset subtracts SECTOR_FIRST_PAGEID (sectid); it is used by deallocation (Ch.9). The bitmap is the page list.
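The bit trick is self-contained enough to run on its own. The sketch below uses a portable loop in place of bit64_count_trailing_ones and assumes that the first page of a sector is sectid * 64, which is how SECTOR_FIRST_PAGEID is modeled here; the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NBITS 64

/* Index of the lowest 0-bit, or 64 when every bit is set. */
static int
sketch_count_trailing_ones (uint64_t bitmap)
{
  int n = 0;

  while (n < SKETCH_NBITS && ((bitmap >> n) & 1u))
    n++;
  return n;
}

/* Sets the lowest free bit and derives the page id; false when the sector is full. */
static bool
sketch_partsect_alloc (uint64_t *bitmap, int32_t sectid, int32_t *pageid_out)
{
  int offset = sketch_count_trailing_ones (*bitmap);

  if (offset >= SKETCH_NBITS)
    return false;                                  /* all 64 pages already handed out */

  *bitmap |= (uint64_t) 1 << offset;
  *pageid_out = sectid * SKETCH_NBITS + offset;    /* assumed SECTOR_FIRST_PAGEID + offset */
  return true;
}
```

With pages 0 to 2 of sector 2 in use, the next allocation sets bit 3 and yields page id 131; a bitmap of all ones is refused without being modified.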
7.7 Adding a full sector: file_table_add_full_sector and file_table_append_full_sector_page
When the head sector fills, its VSID migrates to the full table.

```c
// file_table_add_full_sector -- src/storage/file_manager.c
FILE_HEADER_GET_FULL_FTAB (fhead, extdata_full_ftab);
error_code = file_extdata_find_not_full (thread_p, &extdata_full_ftab, &page_ftab, &found);
if (!found)
  { /* full table is full: allocate a NEW table page for it */
    error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_TABLE_PAGE_FULL_SECTOR, &vpid_ftab_new); /* <- recursion */
    page_ftab = pgbuf_fix (thread_p, &vpid_ftab_new, OLD_PAGE, ...); /* already initialized */
    extdata_full_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
  }
page_extdata = page_ftab != NULL ? page_ftab : page_fhead; /* <- which page the add is logged against */
file_extdata_find_ordered (extdata_full_ftab, vsid, disk_compare_vsids, &found, &pos);
if (found)
  { assert_release (false); error_code = ER_FAILED; goto exit; } /* duplicate VSID */
file_extdata_insert_at (extdata_full_ftab, pos, 1, vsid); /* + file_log_extdata_add(..., page_extdata, ...) */
```

Branches: (1) free space in an existing component → insert ordered. (2) no space → recurse into file_perm_alloc with FILE_ALLOC_TABLE_PAGE_FULL_SECTOR; bounded because that type appends the new page to the full table before further migration. (3) duplicate VSID → ER_FAILED. Entries stay sorted by disk_compare_vsids for binary search.
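The find-ordered-then-insert step can be illustrated on a plain array. The sketch below is an assumption-laden model, not the extensible-data API: `sketch_find_ordered` and `sketch_insert_ordered` are hypothetical names, the comparison orders by (volid, sectid), and a duplicate or a full array is reported as `-1` instead of raising an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { int volid; int sectid; } sketch_vsid;

/* Order by volume id, then by sector id. */
static int
sketch_compare (const sketch_vsid *a, const sketch_vsid *b)
{
  if (a->volid != b->volid) return a->volid < b->volid ? -1 : 1;
  if (a->sectid != b->sectid) return a->sectid < b->sectid ? -1 : 1;
  return 0;
}

/* Binary search: *pos is the match, or the slot where key belongs. */
static bool
sketch_find_ordered (const sketch_vsid *items, int n, const sketch_vsid *key, int *pos)
{
  int lo = 0, hi = n;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      int c = sketch_compare (&items[mid], key);
      if (c == 0) { *pos = mid; return true; }
      if (c < 0) lo = mid + 1; else hi = mid;
    }
  *pos = lo;
  return false;
}

/* Ordered insert; returns the new count, or -1 on duplicate or full array. */
static int
sketch_insert_ordered (sketch_vsid *items, int n, int capacity, sketch_vsid key)
{
  int pos = 0;
  if (n == capacity || sketch_find_ordered (items, n, &key, &pos))
    {
      return -1;
    }
  /* Shift the tail one slot right to open the ordered position. */
  memmove (&items[pos + 1], &items[pos], (size_t) (n - pos) * sizeof (sketch_vsid));
  items[pos] = key;
  return n + 1;
}
```

Keeping the array sorted on every insert is what lets later lookups use the same binary search instead of a linear scan.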
file_table_append_full_sector_page initializes the new page and links it at the head of the chain:
```c
// file_table_append_full_sector_page -- src/storage/file_manager.c
page_ftab = pgbuf_fix (thread_p, vpid_new, NEW_PAGE, ...); /* fail -> ASSERT_ERROR_AND_SET, return */
pgbuf_set_page_ptype (thread_p, page_ftab, PAGE_FTAB);
file_extdata_init (sizeof (VSID), DB_PAGESIZE, extdata_new_ftab); /* <- full entries are bare VSIDs */
VPID_COPY (&extdata_new_ftab->vpid_next, &extdata_full_ftab->vpid_next); /* new page points at old head */
pgbuf_log_new_page (thread_p, page_ftab, file_extdata_size (extdata_new_ftab), PAGE_FTAB);
pgbuf_unfix_and_init (thread_p, page_ftab); /* <- new page no longer fixed */
file_log_extdata_set_next (thread_p, extdata_full_ftab, page_fhead, vpid_new); /* old head -> new page */
VPID_COPY (&extdata_full_ftab->vpid_next, vpid_new);
```

file_extdata_init uses sizeof (VSID), not sizeof (FILE_PARTIAL_SECTOR); this is the 7.1 prefix compatibility in action.
7.8 Counter updates in file_header_alloc
file_header_alloc is the single place maintaining the eight header counters (n_page_total/user/ftab/free, n_sector_total/partial/full/empty).

```c
// file_header_alloc -- src/storage/file_manager.c
fhead->n_page_free--;
if (alloc_type == FILE_ALLOC_USER_PAGE) fhead->n_page_user++;
else fhead->n_page_ftab++; /* table page of either flavor */
if (was_empty) fhead->n_sector_empty--; /* sector now holds a page: no longer empty */
if (is_full) { fhead->n_sector_partial--; fhead->n_sector_full++; } /* migrated to full */
```

The leading assert (!was_empty || !is_full) enforces that one allocation cannot take a sector empty→full (only empty→partial or partial→full). file_log_fhead_alloc writes a 3-bool redo {is_ftab_page, was_empty, is_full} replayed by file_rv_fhead_alloc. n_page_total/n_sector_total change only on expansion (7.4).
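The counter arithmetic is easy to check in isolation. The sketch below mirrors the excerpt on a hypothetical `sketch_counters` struct. The conservation check (`user + ftab + free == total`) is an assumption added for illustration and is not taken from the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the header counters named in the text. */
typedef struct
{
  int n_page_user, n_page_ftab, n_page_free;
  int n_sector_partial, n_sector_full, n_sector_empty;
} sketch_counters;

/* Apply one page allocation to the counters. */
static void
sketch_header_alloc (sketch_counters *c, bool is_user_page, bool was_empty, bool is_full)
{
  assert (!was_empty || !is_full);	/* one page cannot fill an empty 64-page sector */
  c->n_page_free--;
  if (is_user_page) c->n_page_user++; else c->n_page_ftab++;
  if (was_empty) c->n_sector_empty--;
  if (is_full) { c->n_sector_partial--; c->n_sector_full++; }
}

/* Assumed conservation rule: every page is user, table, or free. */
static bool
sketch_counters_consistent (const sketch_counters *c, int n_page_total)
{
  return c->n_page_user + c->n_page_ftab + c->n_page_free == n_page_total;
}
```

Filling a single 64-page sector that already holds one table page takes 63 user allocations; only the last one passes `is_full = true`, which moves the sector from the partial count to the full count.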
7.9 Chapter summary — key takeaways
- file_alloc dispatches on FILE_IS_TEMPORARY: temporary → file_temp_alloc (Ch.8), no sysop; permanent → an atomic nested-top-action sysop closed by log_sysop_end_logical_undo (RVFL_ALLOC, {vfid,vpid}).
- file_partial_sector is {vsid, page_bitmap}, and vsid MUST be first because full-table code reinterprets the pointer as a bare VSID; the 64-bit bitmap is one bit per page of a 64-page sector.
- Phases 1–2 (expand, then move-to-header) restore the two bold invariants before the bit flip in file_partsect_alloc, which uses bit64_count_trailing_ones and reconstructs the VPID from SECTOR_FIRST_PAGEID + offset.
- A filled sector migrates to the full table: file_header_alloc counters update first, then the head item moves to the sorted full table, which grows via bounded recursion using FILE_ALLOC_TABLE_PAGE_FULL_SECTOR.
- FILE_ALLOC_USER_PAGE vs _TABLE_PAGE[_FULL_SECTOR] decides n_page_user vs n_page_ftab; only file_perm_expand grows n_*_total (RVFL_EXPAND, its own committed sysop).
- Numerable registration is a tail call (file_numerable_add_page, Ch.10); the full-drain branch reuses the emptied overflow page (CBRD-21242), logging the table-to-user conversion via RVFL_FHEAD_CONVERT_FTAB_TO_USER.
Chapter 8: Temporary File Page Allocation
Temporary files back sorts, hash joins, and query-result materialization. They live and die inside a single transaction (or get parked in the tempcache for reuse, Ch.11), so the file manager throws away most of the machinery permanent files depend on. This chapter answers: why do temporary files skip the Partial-to-Full migration, and how does a single header cursor make allocation O(1) with no logging? The high-level rationale lives in the companion cubrid-disk-manager.md; this chapter traces the code, contrasting it with file_perm_alloc (Ch.7) rather than re-deriving it.
8.1 The fork in file_alloc
Every page allocation enters through file_alloc. The header is fixed and sanity-checked, then a single predicate splits the world:

```c
// file_alloc -- src/storage/file_manager.c
if (FILE_IS_TEMPORARY (fhead))
  error_code = file_temp_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out); /* <- no sysop, no undo */
else
  {
    log_sysop_start_atomic (thread_p); /* <- permanent path opens a nested top action (Ch.5) */
    is_sysop_started = true;
    error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out);
    VFID_COPY ((VFID *) undo_log_data, vfid); /* <- pack (VFID,VPID) logical-undo payload */
    VPID_COPY ((VPID *) (undo_log_data + sizeof (VFID)), vpid_out);
  }
```

Three asymmetries propagate everywhere: the temp branch starts no system operation (is_sysop_started stays false), builds no undo data, and calls file_temp_alloc. The exit-label sysop epilogue is guarded by if (is_sysop_started), so the temporary path skips the whole block (log_sysop_abort on error, else log_sysop_end_logical_undo (thread_p, RVFL_ALLOC, ...)). A temporary allocation thus produces nothing for recovery to replay; if the transaction dies mid-flight the file is simply discarded: nothing was logged, so there is nothing to roll back.
The f_init handling also diverges: a temporary file’s f_init may be NULL (sort buffers init their own pages), and the else branch asserts FILE_IS_TEMPORARY (fhead) before fixing the page NEW_PAGE. Numerable temp files still call file_numerable_add_page (Ch.10) — temporary does not exempt a file from the user page table. The fork is the top of Figure 8-2.
8.2 The header cursor: the entire bookkeeping state
A permanent file tracks two extensible tables (Partial and Full) and migrates sectors between them. A temporary file keeps only the Partial table plus a two-field cursor, which is its entire allocation state:

| Field | Role | Why it exists |
|---|---|---|
| vpid_last_temp_alloc | VPID of the Partial-table page holding the sector being filled | Lets allocation jump straight to the live table page; equals the header VPID for the in-header copy, else an overflow PAGE_FTAB page |
| offset_to_last_temp_alloc | Index, in that page’s extensible data, of the FILE_PARTIAL_SECTOR being filled | Names the exact sector; advances only when the sector fills, so it also counts fully-consumed sectors in the page |
The struct comment states the design contract directly — “Temporary file pages are never deallocated … keep a cursor: when the sector becomes full it is incremented; when all page becomes full it moves to next page”:
```c
// FILE_HEADER -- src/storage/file_manager.c
VPID vpid_last_temp_alloc; /* VPID of partial table page last used to allocate a page. */
int offset_to_last_temp_alloc; /* Sector offset in partial table last used to allocate a page. */
```

The cursor is seeded at creation: file_create’s temp branch sets vpid_last_temp_alloc = vpid_fhead (the header’s own Partial table) and offset_to_last_temp_alloc = fhead->n_sector_full, skipping sectors already full at creation.
Invariant (cursor consistency). offset_to_last_temp_alloc is always a valid index into the extensible data at vpid_last_temp_alloc, or exactly its item count (“advance to next page next call”); file_temp_alloc asserts both halves before dereferencing. If violated, file_extdata_at indexes past the array and corrupts an adjacent sector descriptor or reads garbage as a VSID.
graph LR
H["FILE_HEADER"] -->|vpid_last_temp_alloc| P0["Partial table page\nin-header or PAGE_FTAB"]
H -->|offset_to_last_temp_alloc| PS["FILE_PARTIAL_SECTOR + page_bitmap"]
P0 -->|vpid_next| P1["next Partial table page ..."]
Figure 8-1. Cursor-to-table relationship. No Full table — full sectors stay in place ahead of the cursor.
8.3 Walking file_temp_alloc branch by branch
The function first disables interrupt checking (logtb_set_check_interrupt (thread_p, false), saved into save_check_interrupt), because there is no rollback and a half-finished temp allocation must not be torn down; it then asserts FILE_IS_TEMPORARY (fhead).
Step 1: locate the live Partial-table page. If the cursor points at the header the in-header table is used directly; otherwise the overflow page is fixed with a write latch, only ER_INTERRUPTED tolerated on failure:

```c
// file_temp_alloc -- src/storage/file_manager.c
if (VPID_EQ (&vpid_fhead, &fhead->vpid_last_temp_alloc))
  FILE_HEADER_GET_PART_FTAB (fhead, extdata_part_ftab); /* <- table lives in header page */
else
  {
    page_ftab = pgbuf_fix (thread_p, &fhead->vpid_last_temp_alloc, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
    if (page_ftab == NULL)
      { error_code = er_errid (); if (error_code != ER_INTERRUPTED) assert_release (false); goto exit; }
    extdata_part_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
  }
```

Step 2: expand if out of free pages. The inline equivalent of file_temp_expand: when n_page_free == 0 it reserves one new sector via the disk manager (Ch.4) with DB_TEMPORARY_DATA_PURPOSE, so it lands in a temp volume:
```c
// file_temp_alloc -- src/storage/file_manager.c
if (fhead->n_page_free == 0)
  {
    FILE_PARTIAL_SECTOR partsect_new = FILE_PARTIAL_SECTOR_INITIALIZER;
    error_code = disk_reserve_sectors (thread_p, DB_TEMPORARY_DATA_PURPOSE, fhead->volid_last_expand, 1, &partsect_new.vsid);
    if (error_code != NO_ERROR)
      { /* same ER_INTERRUPTED-tolerated handling as Step 1 */ goto exit; }
```

Two sub-branches follow, on whether the current page has room for one more FILE_PARTIAL_SECTOR. Sub-branch 2a, table page is full: the new sector cannot be recorded here, so its first page is stolen to host a fresh Partial-table page (bit 0 set, type PAGE_FTAB, previous vpid_next linked forward, cursor wrapped to offset 0):
```c
// file_temp_alloc -- src/storage/file_manager.c
if (file_extdata_is_full (extdata_part_ftab))
  {
    vpid_ftab_new.volid = partsect_new.vsid.volid;
    vpid_ftab_new.pageid = SECTOR_FIRST_PAGEID (partsect_new.vsid.sectid);
    file_partsect_set_bit (&partsect_new, 0); /* <- page 0 becomes the table page */
    page_ftab_new = pgbuf_fix (thread_p, &vpid_ftab_new, NEW_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
    if (page_ftab_new == NULL) { error_code = ER_FAILED; goto exit; }
    pgbuf_set_page_ptype (thread_p, page_ftab_new, PAGE_FTAB);
    VPID_COPY (&extdata_part_ftab->vpid_next, &vpid_ftab_new); /* <- link old table -> new table */
    if (page_ftab != NULL) pgbuf_set_dirty_and_free (thread_p, page_ftab);
    VPID_COPY (&fhead->vpid_last_temp_alloc, &vpid_ftab_new); /* <- cursor wraps to fresh table page */
    fhead->offset_to_last_temp_alloc = 0;
    page_ftab = page_ftab_new;
    extdata_part_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
    file_extdata_init (sizeof (FILE_PARTIAL_SECTOR), DB_PAGESIZE, extdata_part_ftab);
    ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_reserved, DISK_SECTOR_NPAGES - 1);
    ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_ftab, 1);
  }
else
  ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_reserved, DISK_SECTOR_NPAGES); /* all pages reservable */
```

This is the only place the cursor wraps to a fresh table page during expansion; when a table page is carved out, one page counts as npage_ftab and only DISK_SECTOR_NPAGES - 1 are reservable. After either sub-branch the sector is appended and counters bumped; empty vs. table-hosting is encoded in partsect_new.page_bitmap (non-empty only in 2a):
```c
// file_temp_alloc -- src/storage/file_manager.c
file_extdata_append (extdata_part_ftab, &partsect_new);
fhead->n_sector_partial++; fhead->n_sector_total++; // n_page_free/n_page_total += DISK_SECTOR_NPAGES
if (partsect_new.page_bitmap == FILE_EMPTY_PAGE_BITMAP) fhead->n_sector_empty++;
else { fhead->n_page_free--; fhead->n_page_ftab++; } /* <- table page already consumed */
```

Invariant (sectors never leave Partial). A filled sector keeps its all-ones bitmap in place; nothing migrates it to a Full table. If violated, the cursor offset (which counts consumed sectors in the page) would no longer match the extensible-data layout and the Step-3 page-hop would skip live sectors.
Step 3: advance to the next page if the cursor sits at the item count. A previous call may have left offset_to_last_temp_alloc one past the last sector of a now-full page. The guard if (fhead->offset_to_last_temp_alloc == file_extdata_item_count (extdata_part_ftab)) then fires: it asserts file_extdata_is_full (...) && !VPID_ISNULL (&extdata_part_ftab->vpid_next), unfixes the old page_ftab, fixes vpid_next (write latch, only ER_INTERRUPTED tolerated), and sets vpid_last_temp_alloc = vpid_next; offset_to_last_temp_alloc = 0.

Step 4: allocate from the sector under the cursor. file_partsect_alloc sets the first zero bit. Its false return is impossible here (the cursor never points at a full sector) and is treated as a logic error:

```c
// file_temp_alloc -- src/storage/file_manager.c
partsect = (FILE_PARTIAL_SECTOR *) file_extdata_at (extdata_part_ftab, fhead->offset_to_last_temp_alloc);
was_empty = file_partsect_is_empty (partsect);
if (!file_partsect_alloc (partsect, vpid_alloc_out, NULL))
  { assert_release (false); error_code = ER_FAILED; goto exit; } /* <- full sector under cursor == bug */
if (file_partsect_is_full (partsect))
  { is_full = true; fhead->offset_to_last_temp_alloc++; } /* <- advance cursor; page hop deferred to next call */
file_header_alloc (fhead, alloc_type, was_empty, is_full); /* <- shared with perm path: pure counter math */
pgbuf_set_dirty (thread_p, page_fhead, DONT_FREE);
```

The cursor advances on fullness, not on every allocation: while a sector has free bits it stays put, so the common case touches only the header and one table page. file_header_alloc is the permanent path’s helper (Ch.7); its is_full shuffle still updates n_sector_full/n_sector_partial here, but no table migration accompanies it, so those counters are advisory statistics, not table membership.
Step 5: unconditional cleanup. The exit label runs on every path: file_header_sanity_check, unfix page_ftab if held, restore the saved interrupt flag via logtb_set_check_interrupt, return error_code. No pgbuf_set_dirty is ever paired with a log append; the only durability action is marking pages dirty for the non-WAL-ordered flush of temp data.
flowchart TD
A["file_temp_alloc\ndisable interrupt check"] --> C{"cursor == header VPID?"}
C -->|yes| F{"n_page_free == 0?"}
C -->|no| E["fix cursor's table page"] --> F
F -->|yes| G["disk_reserve_sectors 1 sector"] --> H{"table page full?"}
H -->|yes| I["carve page0 as PAGE_FTAB\nlink vpid_next, wrap cursor"] --> L["append sector, bump counters"]
H -->|no| J["reserve DISK_SECTOR_NPAGES"] --> L
F -->|no| K{"offset == item_count?"}
L --> K
K -->|yes| N["fix vpid_next, cursor offset 0"] --> O["file_extdata_at + file_partsect_alloc"]
K -->|no| O
O --> Q{"sector now full?"}
Q -->|yes| R["offset_to_last_temp_alloc++"] --> T["file_header_alloc, set_dirty"]
Q -->|no| T
T --> U["exit: unfix, restore interrupt"]
Figure 8-2. file_temp_alloc complete branch map, including both expansion sub-branches and the deferred page-hop.
8.4 Why no Full table, no postpone, no WAL
The Full table exists in permanent files only so the allocation scan can skip sectors with no free pages (Ch.7). A temporary file never scans: it allocates from the single cursor sector and advances linearly, so a filled sector is never revisited and a second table buys nothing while costing the logging the design avoids. The companion cubrid-disk-manager.md enumerates the savings; the code above is the mechanism.
Invariant (monotone, bookkeeping-free allocation). The cursor only advances (offset++ on sector-full, page-hop on item-count), never backward. file_dealloc never clears an allocation bit for a temporary file: it takes the empty else branch (no postpone, no bitmap change) and skips the deallocation entirely (Ch.9), so no sector regains a free bit behind the cursor and the monotone property holds without reconciliation. If violated (a freed bit behind the cursor), that page is silently leaked and n_page_free drifts from reality.
8.5 Recycling: the cursor reset
This minimalism lets the tempcache (Ch.11) recycle a file by reset rather than rebuild. file_temp_reset_user_pages re-collects the partial-table bitmaps, rebuilds the n_sector_*/n_page_* counters, zeroes the user count, and rewinds the cursor to the header VPID, offset 0:

```c
// file_temp_reset_user_pages -- src/storage/file_manager.c
fhead->n_page_user = 0; // ... n_sector_*/n_page_* rebuilt from re-collected bitmaps ...
fhead->vpid_last_temp_alloc = vpid_fhead; /* <- cursor rewinds to header VPID, offset 0 */
fhead->offset_to_last_temp_alloc = 0;
```

This seed differs from file_create’s, which sets offset_to_last_temp_alloc = fhead->n_sector_full; reset always rewinds to offset 0. A reset file keeps its reserved sectors (no disk round-trip) and hands pages out from the front again. This is the payoff of skipping the Partial-to-Full machinery: allocation state collapses to a two-field cursor that costs nothing to reset.
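The cursor mechanics of 8.3 and the reset of 8.5 can be exercised together in a toy model. The sketch below is hypothetical throughout: the `sk_*` names, the four-sector table capacity, and the fake in-order sector supply are assumptions chosen to keep it small, and table pages are modeled as array slots rather than fixed buffer pages. It has no header page, no latching, and no interrupt handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_NPAGES 64		/* pages per sector */
#define SK_TABLE_CAP 4		/* sectors per table page; tiny on purpose */
#define SK_MAX_TABLES 8

typedef struct { int32_t sectid; uint64_t bitmap; } sk_sector;
typedef struct { sk_sector items[SK_TABLE_CAP]; int n_items; } sk_table;

typedef struct
{
  sk_table tables[SK_MAX_TABLES];	/* tables[0] plays the in-header table */
  int n_tables;
  int cur_table;		/* stands in for vpid_last_temp_alloc */
  int cur_offset;		/* stands in for offset_to_last_temp_alloc */
  int n_page_free;
  int32_t next_sectid;		/* fake disk manager: sectors in order */
} sk_temp_file;

static void
sk_init (sk_temp_file *f)
{
  memset (f, 0, sizeof (*f));
  f->n_tables = 1;
}

static bool
sk_alloc (sk_temp_file *f, int32_t *pageid_out)
{
  sk_table *t = &f->tables[f->cur_table];
  sk_sector *s;
  int bit = 0;

  if (f->n_page_free == 0)	/* Step 2: reserve one sector */
    {
      sk_sector fresh = { f->next_sectid, 0 };
      int usable = SK_NPAGES;
      if (t->n_items == SK_TABLE_CAP)	/* 2a: page 0 hosts a new table */
	{
	  if (f->n_tables == SK_MAX_TABLES) return false;
	  fresh.bitmap = 1;
	  usable--;
	  f->cur_table = f->n_tables++;
	  f->cur_offset = 0;
	  t = &f->tables[f->cur_table];
	}
      f->next_sectid++;
      t->items[t->n_items++] = fresh;
      f->n_page_free += usable;
    }
  if (f->cur_offset == t->n_items)	/* Step 3: deferred hop to next table */
    {
      assert (f->cur_table + 1 < f->n_tables);
      t = &f->tables[++f->cur_table];
      f->cur_offset = 0;
    }
  s = &t->items[f->cur_offset];	/* Step 4: first zero bit under cursor */
  while (bit < SK_NPAGES && ((s->bitmap >> bit) & 1) != 0) bit++;
  if (bit == SK_NPAGES) return false;	/* full sector under cursor: bug */
  s->bitmap |= ((uint64_t) 1) << bit;
  if (s->bitmap == UINT64_MAX) f->cur_offset++;	/* advance on sector-full only */
  f->n_page_free--;
  *pageid_out = s->sectid * SK_NPAGES + bit;
  return true;
}

/* Reset: keep sectors, free user pages, rewind cursor to (table 0, offset 0). */
static void
sk_reset (sk_temp_file *f)
{
  f->n_page_free = 0;
  for (int ti = 0; ti < f->n_tables; ti++)
    for (int i = 0; i < f->tables[ti].n_items; i++)
      {
	bool hosts_table = (ti > 0 && i == 0);	/* table page stays allocated */
	f->tables[ti].items[i].bitmap = hosts_table ? 1 : 0;
	f->n_page_free += hosts_table ? SK_NPAGES - 1 : SK_NPAGES;
      }
  f->cur_table = 0;
  f->cur_offset = 0;
}
```

In this model the first 256 allocations return page ids 0 through 255. The next one skips page 256, which now hosts the second table, and returns 257. After `sk_reset` the same sequence repeats without reserving any new sector, and the hop into the second table happens through Step 3 instead of Step 2.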
8.6 Chapter summary — key takeaways
- file_alloc forks on FILE_IS_TEMPORARY: the temp lane calls file_temp_alloc with no sysop, no undo data, no log records; the permanent lane wraps file_perm_alloc in a nested top action with RVFL_ALLOC logical undo.
- Temporary files keep only a Partial sectors table; a filled sector stays in place. The complete allocation state is vpid_last_temp_alloc/offset_to_last_temp_alloc.
- The cursor makes allocation O(1): Step 4 allocates directly from the cursor sector via file_partsect_alloc, advancing the offset only on sector-full and deferring the page-hop to the next call (Step 3).
- Expansion is inline (n_page_free == 0): it reserves one sector with DB_TEMPORARY_DATA_PURPOSE; the table-full sub-branch carves the sector’s first page into a fresh PAGE_FTAB, links vpid_next, and wraps the cursor to offset 0.
- There is no Full-table migration, no postpone, zero WAL: temp pages are never individually deallocated (file_dealloc takes the empty else branch), so the cursor is provably monotone and needs no reconciliation.
- file_header_alloc is shared with the permanent path, but for temp files its is_full shuffle is advisory statistics only, with no table movement.
- The bookkeeping-free design lets the tempcache recycle a file by resetting the cursor (vpid_last_temp_alloc = header VPID, offset_to_last_temp_alloc = 0, n_page_user = 0) and rebuilding counters from the bitmaps, not rebuilding tables (Ch.11). Reset rewinds to offset 0, unlike file_create’s n_sector_full seed.
Chapter 9: Page Deallocation and File Destruction
This chapter traces the inverse of permanent allocation (Chapter 7):
how is a page — and an entire file — given back, and why is the actual
bit-flip postponed to commit time? It assumes Chapter 4 (the two-step
reservation protocol, the bitmap-then-cache release-order invariant)
and Chapter 7 (file_perm_alloc, the Partial/Full tables); the companion
cubrid-disk-manager.md covers the sector-bitmap and disk/file split. The
central fact: a freed permanent page or sector is not cleared
synchronously — the releaser stages a postpone log record and the
clear runs at do-postpone.
9.1 Why postpone — the committed-releaser hazard
If the bit were cleared immediately when transaction T1 freed page P, a second transaction could reserve that sector, allocate P, and commit its data; should T1 then abort, undo would restore P’s old contents and clobber the second transaction’s committed work. CUBRID defers the clear to do-postpone, which runs only after commit is logically certain; until then the bit stays set, so no allocator hands the page out. This is the same reasoning as Chapter 4’s release-order invariant.
INVARIANT (deferred-free): A permanent page/sector freed by an active transaction keeps its bit set until do-postpone, enforced by routing all permanent frees through log_append_postpone (RVFL_DEALLOC) / (RVDK_UNRESERVE_SECTORS) instead of mutating the bitmap inline. If violated, a concurrent allocator re-hands-out the page and a later abort corrupts the new owner’s data.
stateDiagram-v2
[*] --> Allocated
Allocated --> PostponeStaged : file_dealloc \n RVFL_DEALLOC appended, bit still set
PostponeStaged --> Allocated : transaction abort \n postpone discarded, page stays allocated
PostponeStaged --> Freed : do-postpone \n file_perm_dealloc clears bit
Freed --> [*]
Figure 9-1 — Lifecycle of a permanent page bit. The abort edge is the point: until do-postpone, nothing changed on disk.
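The deferred-free lifecycle in Figure 9-1 reduces to a staging list that is either replayed or thrown away. The sketch below is a single-sector toy under that assumption: `sketch_txn_sector` and its functions are hypothetical, and the "postpone log" is just an in-memory array of bit offsets, not a real log record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_POSTPONE 16

typedef struct
{
  uint64_t page_bitmap;			/* 1 = allocated; one sector only */
  int postponed[SKETCH_MAX_POSTPONE];	/* staged bit offsets */
  int n_postponed;
} sketch_txn_sector;

/* Stage a free: record the offset, leave the bit set. */
static bool
sketch_dealloc (sketch_txn_sector *s, int offset)
{
  if (s->n_postponed == SKETCH_MAX_POSTPONE)
    {
      return false;
    }
  assert (((s->page_bitmap >> offset) & 1) != 0);	/* page must be allocated */
  s->postponed[s->n_postponed++] = offset;
  return true;
}

/* Commit: run the staged frees (the do-postpone step). */
static void
sketch_commit (sketch_txn_sector *s)
{
  for (int i = 0; i < s->n_postponed; i++)
    {
      s->page_bitmap &= ~(((uint64_t) 1) << s->postponed[i]);
    }
  s->n_postponed = 0;
}

/* Abort: discard the staged frees; the bitmap was never touched. */
static void
sketch_abort (sketch_txn_sector *s)
{
  s->n_postponed = 0;
}

/* What a concurrent allocator sees: available only if the bit is clear. */
static bool
sketch_page_is_free (const sketch_txn_sector *s, int offset)
{
  return ((s->page_bitmap >> offset) & 1) == 0;
}
```

Between `sketch_dealloc` and `sketch_commit` the page still reads as allocated, which is the property that keeps another allocator from taking it while the releasing transaction can still abort.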
9.2 file_dealloc — staging, not freeing
file_dealloc is the public entry for giving back one page; despite its
name it usually stages rather than frees. The header fix is
conditional: a release build with a trustworthy concrete file_type_hint
skips it to save an I/O, while a debug build always fixes (the #if defined (NDEBUG) guard) to assert the hint matches fhead->type and that
vpid is not the sticky first page. The postpone decision is
conservative under uncertainty — it postpones unless it can prove the
file temporary:
```c
// file_dealloc -- src/storage/file_manager.c
if ((fhead != NULL && !FILE_IS_TEMPORARY (fhead)) || file_type_hint != FILE_TEMP)
  {
    VFID_COPY ((VFID *) log_data, vfid);
    VPID_COPY ((VPID *) (log_data + sizeof (VFID)), vpid);
    log_append_postpone (thread_p, RVFL_DEALLOC, &log_addr, LOG_DATA_SIZE, log_data); /* <- stage only */
  }
/* else: we do not deallocate pages from temporary files */
```

The RVFL_DEALLOC record carries only (VFID, VPID), with no bitmap state,
because the real work is recomputed at do-postpone. Temporary files take
the else (reclaimed wholesale at destroy / tempcache reset, Chapter 11).
Two early exits then key on numerability: goto exit if
!FILE_TYPE_CAN_BE_NUMERABLE (file_type_hint) (not numerable by type) and
again if !FILE_IS_NUMERABLE (fhead) (type allows it but this file is
not). Only a genuinely numerable file acts now — it searches the user
page table and sets FILE_USER_PAGE_MARK_DELETED, logging
RVFL_USER_PAGE_MARK_DELETE for non-temporary files (mechanics deferred
to Chapter 10).
INVARIANT (numerable consistency): In a numerable file the page must exist in the user page table and not already be marked deleted (enforced by assert_release (false) on !found and on FILE_USER_PAGE_IS_MARKED_DELETED). If violated, the user page table and the allocation tables have diverged, which is a hard bug.
The exit: label unfixes page_fhead and page_ftab if held.
9.3 file_perm_dealloc — the actual bit-flip at do-postpone
At commit, do-postpone replays each RVFL_DEALLOC through
file_rv_dealloc_on_postpone → file_rv_dealloc_internal, which fixes
the header, starts a system operation, and calls file_perm_dealloc —
where the bit is finally cleared. Entry asserts the contract:
log_check_system_op_is_started (must be inside a sysop) and
!FILE_IS_TEMPORARY (fhead) (permanent only); it then computes
vsid_dealloc from vpid_dealloc (SECTOR_FROM_PAGEID).
INVARIANT (sysop-wrapped table change): All file-table mutations in file_perm_dealloc must commit as a nested system operation before the header page is unfixed. If violated, a crash mid-update leaves the Partial/Full tables and header counters inconsistent with no atomic recovery boundary.
flowchart TB
START["file_perm_dealloc(vpid)"] --> SEARCH["search Partial table"]
SEARCH --> FOUND{found in Partial?}
FOUND -- yes --> CLEAR["clear bit in partsect<br/>log RVFL_PARTSECT_DEALLOC<br/>is_empty?"]
FOUND -- no --> REMOVE["remove vsid from Full table<br/>was_full = true"]
REMOVE --> MERGED{ftab page merged away?}
MERGED -- "same sector" --> SAMESEC["clear merged page's bit too<br/>simulate ftab dealloc"]
MERGED -- "other sector" --> RECURSE["file_perm_dealloc(merged) recursive"]
MERGED -- none --> BUILD["build partsect_new = FULL minus bit"]
SAMESEC --> BUILD
RECURSE --> BUILD
BUILD --> SPACE{free slot in Partial?}
SPACE -- yes --> INSERT["file_extdata_insert_at ordered"]
SPACE -- no --> NEWPG["file_perm_alloc new ftab page"]
CLEAR --> HDR["file_header_dealloc<br/>update counters"]
INSERT --> HDR
NEWPG --> HDR
HDR --> DEALLOC["pgbuf_dealloc_page(vpid)"]
DEALLOC --> EXIT["exit: unfix page_ftab"]
Figure 9-2 — Branch map of file_perm_dealloc. Left: sector already
Partial (common case). Right: sector was Full, where the
Full-to-Partial migration happens and may recurse.
Left branch — already Partial. The sector has a free page so it is
already in the Partial table: clear the bit, recompute is_empty, log it
with RVFL_PARTSECT_DEALLOC via log_append_undoredo_data — undoredo,
not postpone, because by do-postpone time we are executing the free,
so the table edit is a normal recoverable change.
Right branch — sector was Full. Every reserved sector is in exactly
one table (Chapter 6), so if not Partial it is Full. The function sets
was_full = true and calls file_extdata_find_and_remove_item on the
Full table; this may empty the last Full-table component, returning a
vpid_merged — a now-orphaned table page that must itself be freed. The
guard, written as the two merged cases in Figure 9-2, hinges on
VSID_IS_SECTOR_OF_VPID (&vsid_dealloc, &vpid_merged):
- different sector → file_perm_dealloc (..., &vpid_merged, FILE_ALLOC_TABLE_PAGE) recurses to free it normally;
- same sector (the one being moved to Partial) → do not recurse; set is_merged_page_from_sector, clear that page’s bit too in the new descriptor, simulate accounting via file_header_dealloc (..., FILE_ALLOC_TABLE_PAGE, ...) then pgbuf_dealloc_page (vpid_merged).
The new descriptor starts from partsect_new.page_bitmap = FILE_FULL_PAGE_BITMAP with the freed bit(s) file_partsect_clear_bit’d,
then is inserted at the ordered position; if Partial has no free slot a
new table page comes from file_perm_alloc (FILE_ALLOC_TABLE_PAGE).
Guards: file_extdata_find_ordered must report the VSID not present
(assert_release (false) on duplicate), and assert (page_ftab == NULL)
confirms all transient table pages were unfixed.
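The construction of the new Partial descriptor for a formerly Full sector can be sketched as pure bitmap arithmetic. The function below is hypothetical and simplified: it ignores volume ids, assumes 64 pages per sector with the sector id equal to `pageid / 64`, and uses a negative `merged_pageid` to mean that no table page was merged away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_NPAGES 64
#define SKETCH_FULL_BITMAP UINT64_MAX

/* Assumption: sector id is the page id divided by the sector size. */
static int32_t
sketch_sector_of_page (int32_t pageid)
{
  return pageid / SKETCH_SECTOR_NPAGES;
}

/* Build the Partial-table bitmap for a sector that was Full.
 * merged_pageid < 0 means no table page was merged away. */
static uint64_t
sketch_full_to_partial (int32_t freed_pageid, int32_t merged_pageid, bool *merged_in_same_sector)
{
  int32_t sectid = sketch_sector_of_page (freed_pageid);
  uint64_t bitmap = SKETCH_FULL_BITMAP;

  /* Start from all ones and clear the page being freed. */
  bitmap &= ~(((uint64_t) 1) << (freed_pageid % SKETCH_SECTOR_NPAGES));

  /* A merged table page in the same sector is cleared here instead of
   * being freed through a recursive call. */
  *merged_in_same_sector = merged_pageid >= 0 && sketch_sector_of_page (merged_pageid) == sectid;
  if (*merged_in_same_sector)
    {
      bitmap &= ~(((uint64_t) 1) << (merged_pageid % SKETCH_SECTOR_NPAGES));
    }
  return bitmap;
}
```

When the flag comes back false for a non-negative `merged_pageid`, the caller in this model would free that page through the normal deallocation path, matching the "different sector" case in the list above.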
Tail — both branches. file_header_dealloc (fhead, alloc_type, is_empty, was_full) adjusts n_page_free / sector counters
(file_log_fhead_dealloc logs it); the page is then fixed and handed to
pgbuf_dealloc_page (§9.6), and PSTAT_FILE_NUM_PAGE_DEALLOCS bumps.
is_empty/was_full drive the math: a was_full sector now contributes
free pages, an is_empty sector becomes fully free. Most error paths
unfix any held page_ftab via ASSERT_ERROR (); goto exit; the two
Full-branch sub-paths — the recursive file_perm_dealloc of an
other-sector orphan and the same-sector merged-page pgbuf_fix failure —
instead return error_code directly, which is safe because page_ftab
is still NULL at those points. The hard-fail-during-recovery guard lives
one level up in file_rv_dealloc_internal (§9.8), not here.
9.4 file_destroy — giving back the whole file
Destroying a file returns every sector it reserved; is_temp forks the
entire function. The prologue: a permanent file calls
file_tracker_unregister (catalog-visible, dropped first); a temporary
file calls logtb_set_check_interrupt (thread_p, false) so destroy cannot
abort halfway and leak pages. The header is fixed,
file_table_collect_all_vsids gathers every sector, then the forks
diverge on eviction and re-converge on one disk_unreserve_ordered_sectors
call.
flowchart TB
P["file_destroy(vfid, is_temp)"] --> FORK{is_temp?}
FORK -- no --> UNREG["file_tracker_unregister"]
FORK -- yes --> NOINT["disable interrupt check"]
UNREG --> FIX["fix header page"]
NOINT --> FIX
FIX --> COLLECT["file_table_collect_all_vsids<br/>-> vsid_collector"]
COLLECT --> FORK2{permanent or temporary?}
FORK2 -- permanent --> PDEAL["file_sector_map_dealloc over Partial+Full<br/>pgbuf_dealloc_page each user+ftab page<br/>pgbuf_dealloc_page(header)"]
FORK2 -- temporary --> TDEAL["file_sector_map_dealloc_temp over Partial<br/>pgbuf_dealloc_temp_page each<br/>decrement Tempcache counters"]
PDEAL --> UNRES["disk_unreserve_ordered_sectors"]
TDEAL --> UNRES
UNRES --> EXIT["exit: free collectors, unfix header,<br/>restore interrupt check"]
Figure 9-3 — file_destroy two forks.
Permanent fork. file_extdata_apply_funcs over Partial then Full
passes file_extdata_collect_ftab_pages (gather file-table-page sectors
into a FILE_FTAB_COLLECTOR) and file_sector_map_dealloc (fix each user
page, pgbuf_dealloc_page); it then evicts each collected table-page
sector and finally the header. Every owned page becomes a PAGE_UNKNOWN
eviction candidate before sectors are unreserved.
Temporary fork. No Full table, so only Partial is walked via
file_sector_map_dealloc_temp / pgbuf_dealloc_temp_page. It logs
nothing and tolerates a missing page (pgbuf_simple_fix NULL →
continue) since temporary pages need not be on disk, then decrements the
global tempcache spacedb_temp counters (Chapter 11) and frees the
header.
INVARIANT (evict-before-unreserve): Every buffer-pool page of a file must become an eviction candidate (pgbuf_dealloc_page/pgbuf_dealloc_temp_page) before its sectors are unreserved. If violated, a stale dirty BCB could be flushed to a sector already unreserved and re-reserved by another file, writing one file’s bytes into another.
The exit: label is universal cleanup: unfix the header, db_private_free
both collector arrays, restore the interrupt-check flag for the temporary
case.
9.5 file_vsid_collector and file_table_collect_all_vsids
The collector is a fixed-size array plus count:

```c
// struct file_vsid_collector -- src/storage/file_manager.c
struct file_vsid_collector
{
  VSID *vsids;
  int n_vsids;
};
```

| Field | Role | Why it exists |
|---|---|---|
| vsids | Pointer to a db_private_alloc’d array of fhead->n_sector_total VSIDs | Output buffer, sized exactly to the sector count so no realloc is ever needed. |
| n_vsids | Running count of sectors appended | Both the array cursor during collection and the element count handed to disk_unreserve_ordered_sectors. After collection it must equal n_sector_total. |
file_table_collect_all_vsids allocates the array, then applies
file_table_collect_vsid (collector->vsids[collector->n_vsids++] = *vsid)
across Partial and — for permanent files only — Full:
```c
// file_table_collect_all_vsids -- src/storage/file_manager.c
collector_out->vsids = (VSID *) db_private_alloc (thread_p, fhead->n_sector_total * sizeof (VSID));
FILE_HEADER_GET_PART_FTAB (fhead, extdata_ftab);
error_code = file_extdata_apply_funcs (thread_p, extdata_ftab, NULL, NULL, file_table_collect_vsid, collector_out, ...);
if (!FILE_IS_TEMPORARY (fhead))
  {
    FILE_HEADER_GET_FULL_FTAB (fhead, extdata_ftab); /* <- temporary files have no full table */
    error_code = file_extdata_apply_funcs (thread_p, extdata_ftab, NULL, NULL, file_table_collect_vsid, collector_out, ...);
  }
if (collector_out->n_vsids != fhead->n_sector_total) assert_release (false); /* <- the count invariant, checked */
qsort (collector_out->vsids, fhead->n_sector_total, sizeof (VSID), disk_compare_vsids); /* <- ordered output */
```

INVARIANT (complete collection): The collected VSID count must equal fhead->n_sector_total. If violated, the file’s bookkeeping is corrupt and destroy fails with assert_release (false).
The final qsort establishes the next function’s precondition — the VSID
list ordered by (volid, sectid) so disk_unreserve_ordered_sectors can
batch per volume in one pass.
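The collect-then-sort step can be illustrated with a minimal, self-contained sketch. Everything prefixed `toy_` is hypothetical: the struct layout and field types stand in for VSID and are assumptions, `malloc` replaces `db_private_alloc`, and the count check is done before copying (rather than after, as in the excerpt) so the toy cannot overrun its buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VSID; the real field types are not shown here. */
typedef struct
{
  short volid;
  int sectid;
} toy_vsid;

typedef struct
{
  toy_vsid *vsids;   /* exactly n_sector_total slots */
  int n_vsids;       /* cursor during collection, element count afterwards */
} toy_collector;

/* Orders by (volid, sectid), the order the unreserve step relies on. */
static int
toy_compare_vsids (const void *a, const void *b)
{
  const toy_vsid *x = (const toy_vsid *) a;
  const toy_vsid *y = (const toy_vsid *) b;

  if (x->volid != y->volid)
    return x->volid < y->volid ? -1 : 1;
  if (x->sectid != y->sectid)
    return x->sectid < y->sectid ? -1 : 1;
  return 0;
}

/* Collect both tables into one exactly-sized array, then sort it.
 * Returns 0 on success, -1 on a count mismatch or allocation failure. */
int
toy_collect_all (toy_collector *c, const toy_vsid *partial, int n_partial,
                 const toy_vsid *full, int n_full, int n_sector_total)
{
  int i;

  c->vsids = NULL;
  c->n_vsids = 0;
  if (n_sector_total <= 0 || n_partial + n_full != n_sector_total)
    return -1;                  /* the "complete collection" invariant */
  c->vsids = (toy_vsid *) malloc ((size_t) n_sector_total * sizeof (toy_vsid));
  if (c->vsids == NULL)
    return -1;
  for (i = 0; i < n_partial; i++)
    c->vsids[c->n_vsids++] = partial[i];    /* same shape as file_table_collect_vsid */
  for (i = 0; i < n_full; i++)
    c->vsids[c->n_vsids++] = full[i];
  assert (c->n_vsids == n_sector_total);
  qsort (c->vsids, (size_t) c->n_vsids, sizeof (toy_vsid), toy_compare_vsids);
  return 0;
}
```

Sizing the array to the sector count up front is what lets the append be a bare `vsids[n_vsids++] = ...` with no growth path.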
9.6 pgbuf_dealloc_page — the eviction hint
Both file_perm_dealloc and the permanent file_destroy fork hand each
freed page to pgbuf_dealloc_page, which does no flush or write I/O — it
resets the page type to PAGE_UNKNOWN and steers the BCB toward
victimization:

    // pgbuf_dealloc_page -- src/storage/page_buffer.c
    /* how it works: page is "deallocated" by resetting its type to PAGE_UNKNOWN. also prepare bcb for victimization.
     * note: the bcb used to be invalidated. but that means flushing page to disk and waiting for IO write. that may be
     * too slow. if we add the bcb to the bottom of a lru list, it will be eventually flushed by flush thread and
     * victimized. */
    CAST_PGPTR_TO_BFPTR (bcb, page_dealloc);
    assert (get_fcnt (&bcb->atomic_latch) == 1);  /* <- caller must hold the only latch */

Deallocation is a hint, not a synchronous discard: the page may still be flushed later by the flush thread, which is exactly why the evict-before-unreserve invariant (§9.4) requires that it be issued before the sector becomes reusable.
9.7 disk_unreserve_ordered_sectors — returning sectors
The disk-manager counterpart of Chapter 4’s reservation: a thin wrapper
that takes CSECT_DISK_CHECK as a reader and delegates to
disk_unreserve_ordered_sectors_without_csect. The worker exploits the
§9.5 sort — it groups consecutive vsids sharing a volid into per-volume
runs in a DISK_RESERVE_CONTEXT (asserting volid strictly increasing
across runs, sectid within one) and issues one
disk_unreserve_sectors_from_volume per volume, which iterates sector-table
units calling disk_stab_unit_unreserve — the leaf where the
permanent-vs-temporary postpone split lands, mirroring §9.2 at the
sector level:

    // disk_stab_unit_unreserve -- src/storage/disk_manager.c
    assert ((unreserve_bits & (*cursor->unit)) == unreserve_bits);  /* <- all target bits were actually set */
    if (unreserve_bits != 0)
      {
        if (context->purpose == DB_PERMANENT_DATA_PURPOSE)
          log_append_postpone (thread_p, RVDK_UNRESERVE_SECTORS, &addr, sizeof (unreserve_bits), &unreserve_bits);  /* <- deferred */
        else
          {
            (*cursor->unit) &= ~unreserve_bits;  /* <- bitmap cleared NOW */
            /* ... pgbuf_set_dirty + lock_reserve_for_purpose condensed ... */
            disk_cache_update_vol_free (cursor->volheader->volid, nsect);  /* <- then cache, Ch.4 order */
          }
      }

Permanent purpose stages the clear via
log_append_postpone (RVDK_UNRESERVE_SECTORS), upholding the deferred-free
invariant (§9.1) at sector granularity; temporary purpose clears the
bits immediately, in the bitmap-then-cache order Chapter 4’s
release-order invariant mandates (bit cleared, then
disk_cache_update_vol_free). The entry assert guards that every freed
sector was genuinely reserved.
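As a rough illustration of why the §9.5 sort matters, the sketch below splits a sorted VSID array into per-volume runs and shows the temporary-purpose bit clear on one bitmap unit. All `toy_` names, the struct layouts, and the 64-bit unit width are assumptions made for the example; the real worker fills a DISK_RESERVE_CONTEXT and asserts ordering rather than returning an error.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  short volid;
  int sectid;
} toy_vsid;

typedef struct
{
  short volid;   /* volume this run belongs to */
  int first;     /* index of the run's first VSID in the input array */
  int count;     /* number of VSIDs in the run */
} toy_run;

/* Split a (volid, sectid)-sorted array into one run per volume.
 * Returns the number of runs, or -1 if the input is not strictly ordered
 * or runs_cap is too small. */
int
toy_group_by_volume (const toy_vsid *v, int n, toy_run *runs, int runs_cap)
{
  int n_runs = 0;
  int i;

  for (i = 0; i < n; i++)
    {
      if (n_runs > 0 && runs[n_runs - 1].volid == v[i].volid)
        {
          if (v[i].sectid <= v[i - 1].sectid)
            return -1;          /* sectid must increase within a volume */
          runs[n_runs - 1].count++;
          continue;
        }
      if (n_runs > 0 && v[i].volid < runs[n_runs - 1].volid)
        return -1;              /* volid must increase across runs */
      if (n_runs == runs_cap)
        return -1;
      runs[n_runs].volid = v[i].volid;
      runs[n_runs].first = i;
      runs[n_runs].count = 1;
      n_runs++;
    }
  return n_runs;
}

/* Temporary-purpose leaf: clear the bits immediately.
 * Fails (and leaves the unit untouched) if any target bit was not set. */
int
toy_unit_unreserve (uint64_t *unit, uint64_t bits)
{
  if ((*unit & bits) != bits)
    return -1;
  *unit &= ~bits;
  return 0;
}
```

Because each volume appears as exactly one contiguous run, the caller can issue one per-volume unreserve call per run in a single pass over the array.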
9.8 The abort path — restoring state
Aborting a permanent deallocation is free: its staged postpone records
are discarded, never run, so the page stays allocated (Figure 9-1’s back
edge). Real undo happens only when a page allocation is rolled back.
Both do-postpone and undo route through file_rv_dealloc_internal, which
fixes the header, opens the sysop, calls file_perm_dealloc, and, because
a recovery replay must never fail silently, hard-fails via
if (error_code != NO_ERROR) { assert_release (false); } on any non-NO_ERROR
return. It then seals the sysop by one parameter: log_sysop_abort on
error, log_sysop_end_logical_compensate for FILE_RV_DEALLOC_COMPENSATE
(undo of an alloc), otherwise log_sysop_end_logical_run_postpone
(do-postpone of a dealloc) — all three making the table change durable
before the header is unfixed (§9.3).
9.9 Chapter summary — key takeaways
- file_dealloc stages, it does not free: a non-temporary file appends an RVFL_DEALLOC postpone record carrying (VFID, VPID); temporary files deallocate nothing; numerable files also mark-delete the user-page-table entry now (Chapter 10).
- Postpone closes the committed-releaser window: no transaction can grab a freed page before the releaser’s commit is irreversible, the sector-level analogue of Chapter 4’s release ordering.
- file_perm_dealloc is the real free, branch-rich. Partial: clear a bit. Full: migrate to Partial, recursing to free an orphaned table page except a same-sector orphan (inlined). Must run inside a system operation.
- file_destroy forks on is_temp end to end. Permanent: unregister, evict via pgbuf_dealloc_page, unreserve postponed. Temporary: disable interrupts, evict via pgbuf_dealloc_temp_page, adjust tempcache counters, unreserve immediately.
- Collection precedes destruction, sorted: file_table_collect_all_vsids gathers exactly n_sector_total VSIDs (asserting the count) and qsorts them so unreserve batches per volume.
- pgbuf_dealloc_page is an eviction hint, not a flush: it queues the PAGE_UNKNOWN BCB for victimization, so pages must be evicted before their sectors are unreserved.
- The postpone split bottoms out in disk_stab_unit_unreserve: permanent stages RVDK_UNRESERVE_SECTORS; temporary clears bitmap then cache inline; abort of a permanent dealloc is free.
Chapter 10: Numerable Files and the User Page Table
A numerable file adds one promise: ask for “the n-th page I allocated”, in allocation order, in amortized O(1). The sector allocation machinery of Ch 3-Ch 7 cannot answer this (it stores ownership, not order), so the numerable layer keeps a second, separately-externalized index over the same VPIDs: the User Page Table. See cubrid-disk-manager.md (“Numerable files”) for the high-level contract; here we trace every branch.
10.1 Why the sector table cannot recover allocation order
The Partial and Full sector tables (Ch 3) are kept VSID-sorted via disk_compare_vsids so reservation and lookup are binary searches. That sort destroys history in two ways: (1) promotion erases batch identity: a filled partial sector migrates Partial -> Full and is re-sorted by VSID, losing which batch reserved it; (2) cross-expand reorders: a batch spanning a fresh reservation gives new sectors a VSID order unrelated to the order in which pages were handed out, so a sorted bitmap scan yields pages in a different sequence than the user received them.
The sector table answers membership but not order. The User Page Table re-externalizes that lost order as an append-only list of VPIDs — one entry per user page, in allocation order — so find_nth(n) is a positional index into it.
The table is a chain of FILE_EXTENSIBLE_DATA components (the extdata primitive used throughout file_manager.c) whose items are bare VPIDs. The header caches the last component in vpid_last_user_page_ftab for O(1) appends; FILE_HEADER_GET_USER_PAGE_FTAB locates the first component in the header page.
10.2 The find-nth context and the header’s order-keeping fields
file_find_nth_context (struct { VPID *vpid_nth; int nth; int first_index; }) is the accumulator threaded through the scan callbacks:

| Field | Role | Why it exists |
|---|---|---|
| vpid_nth | Out-param pointer for the found VPID | Scan writes through it so the caller’s slot is filled in place |
| nth | Remaining index, decremented as components/items are skipped | Countdown; the scan stops when it reaches the target item |
| first_index | Absolute item index of the current component’s entry 0 | Feeds the cache: where in the global sequence the landing component begins |

Five FILE_HEADER fields carry the order machinery (struct covered in Ch 1):
| Field | Role | Why it exists |
|---|---|---|
| vpid_last_user_page_ftab | Hint to the last UPT component page | O(1) append target; equals the header VPID while the table lives in-header |
| vpid_find_nth_last | Cached page of the last find_nth landing | Lets sequential find_nth(n), find_nth(n+1)... resume mid-table |
| first_index_find_nth_last | Global index of entry 0 on vpid_find_nth_last | Turns the cached page into an absolute offset for the next search |
| n_page_user | Total user pages (incl. mark-deleted) | Base term of the live-page count |
| n_page_mark_delete | Count of mark-delete-bit entries | Correction term: live pages = n_page_user - n_page_mark_delete |

Invariant (live-count correction). Findable pages = n_page_user - n_page_mark_delete, never n_page_user. file_numerable_find_nth enforces this at the auto-alloc test and when skipping marked entries; drift would make find_nth return a deleted page or allocate at the wrong index. Kept exact by file_header_update_mark_deleted logging +1/-1 on every set/clear.

Invariant (cache validity). FILE_CACHE_LAST_FIND_NTH is true only for FILE_TEMP numerable files on a non-parallel thread, so the cache may be read/written without a write latch or dirty flag. Any deallocation resets it (VPID_SET_NULL (&fhead->vpid_find_nth_last)); appends leave it valid because they only extend the tail.
10.3 file_numerable_add_page — appending on every allocation
file_alloc calls file_numerable_add_page right after a page’s bit is set, whenever FILE_IS_NUMERABLE (fhead), so the UPT grows in lock-step. It resolves the tail from vpid_last_user_page_ftab (in-header if equal to the header VPID, else pgbuf_fix WRITE), chains a component if full, then appends:

    // file_numerable_add_page -- src/storage/file_manager.c
    if (VPID_EQ (&fhead->vpid_last_user_page_ftab, &vpid_fhead))
      FILE_HEADER_GET_USER_PAGE_FTAB (fhead, extdata_user_page_ftab);  /* tail in header */
    else
      page_ftab = pgbuf_fix (..., OLD_PAGE, PGBUF_LATCH_WRITE, ...);  /* else fix tail page */
    // ... condensed: if (file_extdata_is_full) chain via file_temp_alloc/file_perm_alloc ...
    file_extdata_append (extdata_user_page_ftab, vpid);  /* <- the append */

flowchart TD
A["hint = vpid_last_user_page_ftab"] --> B{"hint == header VPID?"}
B -->|yes| C["extdata = in-header UPT"]
B -->|no| D{"pgbuf_fix WRITE ok?"}
D -->|no| Z["ASSERT_ERROR_AND_SET, goto exit"]
D -->|yes| F["extdata = that ftab page"]
C --> G{"file_extdata_is_full?"}
F --> G
G -->|no| M["file_extdata_append vpid"]
G -->|yes| H{"FILE_IS_TEMPORARY?"}
H -->|yes| I["file_temp_alloc TABLE_PAGE"]
H -->|no| J["file_perm_alloc TABLE_PAGE"]
I --> K["fix NEW_PAGE, link prev->next, init extdata,\n advance last_user_page_ftab"]
J --> K
K --> M
M --> N{"temporary?"}
N -->|no| O["file_log_extdata_add WAL"]
N -->|yes| P["pgbuf_set_dirty only"]
O --> Q["exit: unfix page_ftab if held"]
P --> Q
Z --> Q
Figure 10-1. file_numerable_add_page, all branches.
The branch worth restating is the temp-vs-permanent logging asymmetry (Figure 10-1 node N): a permanent append emits file_log_extdata_add WAL (plus RVFL_FHEAD_SET_LAST_USER_PAGE_FTAB undoredo when a component is chained), a temporary append only marks pages dirty. A closing assert (!file_extdata_is_full (...)) rules out overflow.
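The tail-hint append can be pictured with a small in-memory sketch. It is only an analogy under stated assumptions: `toy_component`, `toy_upt`, the capacity of 4, and heap-allocated components are all hypothetical, integers stand in for VPIDs, and page fixing and logging are left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

enum { TOY_COMPONENT_CAP = 4 };   /* tiny on purpose, to force chaining */

typedef struct toy_component
{
  int items[TOY_COMPONENT_CAP];   /* stand-ins for VPIDs */
  int n_items;
  struct toy_component *next;
} toy_component;

typedef struct
{
  toy_component head;    /* first component, embedded like the in-header table */
  toy_component *last;   /* tail hint, playing the role of vpid_last_user_page_ftab */
  int n_page_user;
} toy_upt;

void
toy_upt_init (toy_upt *t)
{
  t->head.n_items = 0;
  t->head.next = NULL;
  t->last = &t->head;    /* the hint points at the header while the table lives there */
  t->n_page_user = 0;
}

/* Append one page id in allocation order. Returns 0 on success, -1 on allocation failure. */
int
toy_upt_append (toy_upt *t, int page)
{
  toy_component *tail = t->last;    /* O(1): no walk from the head */

  if (tail->n_items == TOY_COMPONENT_CAP)
    {
      toy_component *fresh = (toy_component *) calloc (1, sizeof (*fresh));
      if (fresh == NULL)
        return -1;
      tail->next = fresh;           /* link prev->next */
      t->last = fresh;              /* advance the hint */
      tail = fresh;
    }
  tail->items[tail->n_items++] = page;
  assert (tail->n_items <= TOY_COMPONENT_CAP);
  t->n_page_user++;
  return 0;
}

/* Release every chained component; the embedded head is not freed. */
void
toy_upt_free (toy_upt *t)
{
  toy_component *c = t->head.next;

  while (c != NULL)
    {
      toy_component *next = c->next;
      free (c);
      c = next;
    }
  t->head.next = NULL;
  t->last = &t->head;
}
```

Note that `toy_upt` must not be copied by value after `toy_upt_init`, since `last` may point into the struct itself.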
10.4 file_numerable_find_nth — the indexed lookup
The function fixes the header READ, asserts numerable, then branches three ways. Auto-alloc-at-end (auto_alloc && nth == fhead->n_page_user - fhead->n_page_mark_delete) promotes the latch and calls file_alloc to grow the file, re-fixing WRITE and re-checking on ER_PAGE_LATCH_PROMOTE_FAIL. Otherwise the search splits on n_page_mark_delete: with holes it visits every item (file_extdata_find_nth_vpid_and_skip_marked); with no holes it strides components and may resume from the cache, whose predicate is load-bearing:

    // file_numerable_find_nth (no-holes branch) -- src/storage/file_manager.c
    if (FILE_CACHE_LAST_FIND_NTH (fhead, thread_p) && !VPID_ISNULL (&fhead->vpid_find_nth_last)
        && !VPID_EQ (&vpid_fhead, &fhead->vpid_find_nth_last) && nth >= fhead->first_index_find_nth_last)
      {
        find_nth_context.first_index = fhead->first_index_find_nth_last;  /* resume from cache */
        find_nth_context.nth -= fhead->first_index_find_nth_last;  /* <- rebase the countdown */
      }

flowchart TD
A["fix header READ, assert numerable"] --> B{"auto_alloc and nth == live count?"}
B -->|yes| C{"promote latch ok?"}
C -->|FAIL| E["re-fix WRITE, re-check, file_alloc, exit"]
C -->|ok| F["file_alloc, exit"]
B -->|no| G{"n_page_mark_delete > 0?"}
G -->|yes| H["skip-marked over EVERY item"]
G -->|no| I{"cache usable?"}
I -->|yes| J["fix cached page, rebase nth"]
I -->|no| K["first_index = 0, from head"]
J --> L["find_nth_vpid: stride components"]
K --> L
L --> M{"cache eligible?"}
M -->|yes| N["store landing page + first_index"]
M -->|no| O["skip cache update"]
H --> P{"vpid_nth still NULL?"}
N --> P
O --> P
P -->|yes| Q["assert_release false, ER_FAILED"]
P -->|no| R["exit: unfix pages"]
Figure 10-2. file_numerable_find_nth, all branches.
The four predicate conjuncts above (notably nth >= first_index_find_nth_last, which forbids a backward resume) let the search start mid-table and walk only the landing component, which yields the amortized O(1) for the run-merge pattern find_nth(0), find_nth(1), .... Exit cleanup avoids double-unfixing aliased page pointers.
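A minimal sketch of the cache-resume idea follows, assuming an array of components instead of a page chain. The names `toy_component`, `toy_find_cache`, and `toy_find_nth_cached` are hypothetical, and the sketch simplifies the real predicate: it keeps only the "cache present" and "no backward resume" tests and omits the eligibility and header-page checks. The `visited` out-parameter exists only to make the saving observable.

```c
#include <assert.h>

typedef struct
{
  const int *items;   /* stand-ins for VPIDs */
  int n_items;
} toy_component;

typedef struct
{
  int comp;           /* index of the last landing component; < 0 means empty */
  int first_index;    /* global index of that component's entry 0 */
} toy_find_cache;

/* Find the nth item, resuming from the cache when nth is not behind it.
 * Returns 0 and fills *out on success, -1 if nth is out of range.
 * *visited reports how many components were inspected. */
int
toy_find_nth_cached (const toy_component *comps, int n_comps, int nth,
                     toy_find_cache *cache, int *out, int *visited)
{
  int c = 0;
  int base = 0;

  *visited = 0;
  if (nth < 0)
    return -1;
  if (cache->comp >= 0 && nth >= cache->first_index)
    {
      c = cache->comp;              /* resume from cache */
      base = cache->first_index;    /* rebase the countdown */
    }
  for (; c < n_comps; c++)
    {
      (*visited)++;
      if (nth - base >= comps[c].n_items)
        {
          base += comps[c].n_items; /* stride over the whole component */
          continue;
        }
      *out = comps[c].items[nth - base];
      cache->comp = c;              /* remember the landing component */
      cache->first_index = base;
      return 0;
    }
  return -1;
}
```

With a sequential access pattern each call after the first inspects only the landing component (or its immediate successor), which is where the amortized constant cost comes from; a backward request simply restarts from the head.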
10.5 The two scan callbacks
file_extdata_apply_funcs invokes a per-component and/or per-item function. file_extdata_find_nth_vpid is the per-component (no-holes) callback; a whole component is one O(1) stride:

    // file_extdata_find_nth_vpid -- src/storage/file_manager.c
    int count_vpid = file_extdata_item_count (extdata);
    if (count_vpid <= find_nth_context->nth)
      {
        find_nth_context->nth -= count_vpid;  /* <- skip whole component */
        find_nth_context->first_index += count_vpid;  /* <- keep global index accurate */
      }
    else
      {
        VPID_COPY (find_nth_context->vpid_nth, (VPID *) file_extdata_at (extdata, find_nth_context->nth));
        assert (!FILE_USER_PAGE_IS_MARKED_DELETED (find_nth_context->vpid_nth));  /* <- no holes */
        *stop = true;
      }

file_extdata_find_nth_vpid_and_skip_marked is the per-item (holes) callback; it inspects every VPID because a deleted entry consumes a slot but not an index:

    // file_extdata_find_nth_vpid_and_skip_marked -- src/storage/file_manager.c
    if (FILE_USER_PAGE_IS_MARKED_DELETED (vpidp))
      return NO_ERROR;  /* <- skip, do not advance nth */
    if (find_nth_context->nth == 0)
      {
        *find_nth_context->vpid_nth = *vpidp;
        *stop = true;
      }
    else
      find_nth_context->nth--;

The asymmetry is the point: no holes lets you stride components and keep first_index to prime the cache; holes do not, since a component’s live-entry count is not its item count.
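The two strategies can be put side by side in a self-contained sketch. The `toy_` names and the array-of-components layout are assumptions for illustration; only the flag value 0x80000000 is taken from the text (FILE_USER_PAGE_MARK_DELETE_FLAG), and 32-bit page ids stand in for full VPIDs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MARK_DELETE_FLAG 0x80000000u   /* same value the text gives for the real flag */

typedef struct
{
  const uint32_t *items;   /* page ids; the top bit marks a deleted entry */
  int n_items;
} toy_component;

/* No-holes lookup: skip whole components by their item count.
 * On success returns 0, fills *out, and reports the landing component's first index. */
int
toy_find_nth_stride (const toy_component *comps, int n_comps, int nth,
                     uint32_t *out, int *first_index)
{
  int base = 0;
  int c;

  if (nth < 0)
    return -1;
  for (c = 0; c < n_comps; c++)
    {
      if (comps[c].n_items <= nth)
        {
          nth -= comps[c].n_items;    /* skip whole component */
          base += comps[c].n_items;   /* keep the global index accurate */
          continue;
        }
      *out = comps[c].items[nth];
      *first_index = base;
      return 0;
    }
  return -1;
}

/* Holes lookup: every item must be inspected, because a marked entry
 * occupies a slot without consuming an index. */
int
toy_find_nth_skip_marked (const toy_component *comps, int n_comps, int nth,
                          uint32_t *out)
{
  int c, i;

  if (nth < 0)
    return -1;
  for (c = 0; c < n_comps; c++)
    for (i = 0; i < comps[c].n_items; i++)
      {
        uint32_t id = comps[c].items[i];

        if (id & TOY_MARK_DELETE_FLAG)
          continue;                   /* skip, do not advance nth */
        if (nth == 0)
          {
            *out = id;
            return 0;
          }
        nth--;
      }
  return -1;
}
```

The stride version does work proportional to the number of components, while the skip-marked version does work proportional to the number of items, which is the cost difference the two callbacks encode.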
10.6 The mark-delete machinery (permanent numerable)
A numerable page cannot vanish from the middle of the UPT mid-transaction (that would renumber later pages and corrupt concurrent find_nth), so file_dealloc removes in two phases. Phase 1, in-transaction, only sets the top bit of the pageid (FILE_USER_PAGE_MARK_DELETE_FLAG == 0x80000000) via FILE_USER_PAGE_MARK_DELETED (vpid_found), logs RVFL_USER_PAGE_MARK_DELETE undoredo (permanent only), bumps the counter via file_header_update_mark_deleted (..., 1), and resets the cache if FILE_CACHE_LAST_FIND_NTH. The entry keeps its slot, later indices are undisturbed, and find_nth skips it via the per-item callback.
Phase 2, at commit run-postpone, physically removes the entry via file_extdata_find_and_remove_item: it walks the chain (linear, ordered=false, since the UPT is append-ordered not VSID-ordered), removes the item with file_extdata_remove_at (logged via file_log_extdata_remove), pops the VPID into an out-param, and merges an emptied component with its predecessor, reporting the freed table page through vpid_merged; it asserts a system op is active and assert_release(false)s on a missing item. A marked pop decrements the counter:

    // file_dealloc run-postpone body -- src/storage/file_manager.c
    file_extdata_find_and_remove_item (..., vpid_dealloc, file_compare_vpids, false, &vpid_removed, &vpid_merged);
    if (!VPID_ISNULL (&vpid_merged))  /* table page emptied -> free it */
      file_perm_dealloc (thread_p, page_fhead, &vpid_merged, FILE_ALLOC_TABLE_PAGE);
    if (FILE_USER_PAGE_IS_MARKED_DELETED (&vpid_removed))
      file_header_update_mark_deleted (thread_p, page_fhead, -1);  /* <- counter back down */

On abort, file_rv_user_page_unmark_delete_logical undoes phase 1. Because concurrent transactions may have shifted the table, it cannot trust the original position: it re-searches by VPID (file_extdata_search_item), asserts the bit is set, clears it with FILE_USER_PAGE_CLEAR_MARK_DELETED, and logs a RVFL_USER_PAGE_MARK_DELETE_COMPENSATE record via log_append_compensate.

Invariant (slot stability under deletion). A marked-deleted entry never moves or re-indexes until commit, so a concurrent reader’s cached vpid_find_nth_last stays structurally valid through a mark (deallocation resets only the cache, not the slots). Compacting on mark would renumber pages mid-transaction.
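The two-phase removal can be sketched on a flat array to show how the counters move. This is a toy model under explicit assumptions: `toy_table` and its functions are hypothetical, the table is a single fixed array rather than a component chain, logging is omitted, and decrementing `n_page_user` at physical removal is an assumption of the sketch rather than something the text states.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MARK_DELETE_FLAG 0x80000000u   /* value quoted in the text for the real flag */
enum { TOY_CAP = 8 };

typedef struct
{
  uint32_t pageids[TOY_CAP];
  int n_page_user;          /* all slots, including marked ones */
  int n_page_mark_delete;   /* slots whose top bit is set */
} toy_table;

/* Search by page id, ignoring the mark bit. Returns the slot or -1. */
static int
toy_search (const toy_table *t, uint32_t pageid)
{
  int i;

  for (i = 0; i < t->n_page_user; i++)
    if ((t->pageids[i] & ~TOY_MARK_DELETE_FLAG) == pageid)
      return i;
  return -1;
}

int
toy_live_count (const toy_table *t)
{
  return t->n_page_user - t->n_page_mark_delete;
}

/* Phase 1: set the bit, keep the slot, bump the counter. */
int
toy_mark_delete (toy_table *t, uint32_t pageid)
{
  int i = toy_search (t, pageid);

  if (i < 0 || (t->pageids[i] & TOY_MARK_DELETE_FLAG))
    return -1;
  t->pageids[i] |= TOY_MARK_DELETE_FLAG;
  t->n_page_mark_delete++;
  return 0;
}

/* Abort: re-search by id (the slot may have shifted) and clear the bit. */
int
toy_unmark_delete (toy_table *t, uint32_t pageid)
{
  int i = toy_search (t, pageid);

  if (i < 0 || !(t->pageids[i] & TOY_MARK_DELETE_FLAG))
    return -1;
  t->pageids[i] &= ~TOY_MARK_DELETE_FLAG;
  t->n_page_mark_delete--;
  return 0;
}

/* Phase 2: physically remove the slot; later entries shift down only now. */
int
toy_remove (toy_table *t, uint32_t pageid)
{
  int i = toy_search (t, pageid);

  if (i < 0)
    return -1;
  if (t->pageids[i] & TOY_MARK_DELETE_FLAG)
    t->n_page_mark_delete--;        /* counter back down */
  memmove (&t->pageids[i], &t->pageids[i + 1],
           (size_t) (t->n_page_user - i - 1) * sizeof (uint32_t));
  t->n_page_user--;
  return 0;
}
```

The point to notice is that slot positions change only in `toy_remove`; marking and unmarking touch a single bit and a counter.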
10.7 file_numerable_truncate — dealloc-driven shrink
Truncation is the only public shrink path, leaning on find_nth + file_dealloc:

    // file_numerable_truncate -- src/storage/file_manager.c
    if (!FILE_IS_NUMERABLE (fhead))
      {
        assert_release (false);
        error_code = ER_FAILED;
        goto exit;
      }
    if (fhead->n_page_mark_delete != 0)
      {
        assert (false);
        return NO_ERROR;  /* <- refuse mid-dealloc */
      }
    while (fhead->n_page_user > npages)
      {  /* repeatedly drop index npages */
        file_numerable_find_nth (thread_p, vfid, npages, false, NULL, NULL, &vpid);  /* auto-alloc off */
        file_dealloc (thread_p, vfid, &vpid, fhead->type);
      }

Each iteration deallocates the page now at index npages; as n_page_user drops the loop ends exactly at npages. It bails on n_page_mark_delete != 0, since a half-finished dealloc makes the index meaningless.
10.8 Real callers and the dead-code finding
file_numerable_find_nth has three callers across two file-type families; mark-delete is exercised only by the permanent family. The extendible-hash family is consumed by both src/storage/extendible_hash.c and the file-hash-scan code in src/query/query_hash_scan.c: fhs_fix_nth_page calls file_numerable_find_nth, and its files are created via file_create_ehash / file_create_ehash_dir, so they are FILE_EXTENDIBLE_HASH(_DIRECTORY), the same family as the storage row.

| Caller | File type | Deallocates? | Mark-delete used? |
|---|---|---|---|
| External sort run files (external_sort.c, file_create_temp_numerable) | FILE_TEMP | never | no (dead) |
| Extendible hash bucket/directory (extendible_hash.c find_nth, truncate) | FILE_EXTENDIBLE_HASH(_DIRECTORY) | yes | yes |
| File-hash-scan FHS (query_hash_scan.c, fhs_fix_nth_page) | FILE_EXTENDIBLE_HASH(_DIRECTORY) | via truncate path | yes |

Non-numerable temp consumers — list_file query intermediates and the query result cache (FILE_QUERY_AREA) — never touch this layer; they chain pages via QFILE_PAGE_HEADER.next_vpid, with no find_nth contract.
Critical finding. For FILE_TEMP numerable files (external sort), file_temp_alloc never deallocates, so FILE_USER_PAGE_MARK_DELETED, n_page_mark_delete, and the whole RVFL_USER_PAGE_MARK_DELETE* chain are effectively dead code there: n_page_mark_delete stays 0 and find_nth always takes the no-holes/cache branch. The table data structure is still mandatory, since it supplies the order contract the sort merge depends on. The dead part is the deletion sub-apparatus, not the table.
10.9 Chapter summary — key takeaways
- VSID-sorted sector tables store which pages a file owns but discard order; the User Page Table re-externalizes order as an append-only VPID list, so find_nth(n) is a positional index.
- file_numerable_add_page appends one VPID per allocation inside file_alloc, using vpid_last_user_page_ftab as an O(1) tail hint and chaining a component (logged for permanent, dirty-only for temporary) when the tail fills.
- file_numerable_find_nth is O(1)-amortized only in the no-holes branch (file_extdata_find_nth_vpid strides components, the cache resumes mid-table); the holes branch falls back to a per-item skip scan.
- Live page count is n_page_user - n_page_mark_delete, governing auto-alloc-at-end and deleted-entry skipping, kept exact by logged deltas.
- Permanent deletion is two-phase: phase 1 sets FILE_USER_PAGE_MARK_DELETE_FLAG and bumps the counter (slot kept); phase 2 at run-postpone removes via file_extdata_find_and_remove_item; abort re-searches by VPID and clears the bit.
- file_numerable_truncate is a thin find_nth(npages) + file_dealloc loop, refusing to run while n_page_mark_delete != 0.
- Mark-delete is dead code for FILE_TEMP numerable (external sort never deallocates), yet the table itself stays necessary: the dead part is the deletion sub-apparatus, not the structure.
Chapter 11: Special Paths Tempcache Tracker Sticky Page TDE and Recovery
Five machines sit beside the single-page lifecycle of Ch 6-10: the temp-file cache, the File Tracker, the sticky-first-page escape hatch, the TDE flags, and the recovery handlers. This chapter dissects only the code; for the why, see the companion’s “Temporary file cache”, “File destruction and the File Tracker”, and “Two-step sector reservation” sections.
11.1 The temp-file cache: recycling whole files
file_Tempcache is a global pool holding retired temp files intact so the next request of the
same shape gets one back instead of destroy-and-recreate. Three structs cooperate.

    // file_tempcache_entry -- src/storage/file_manager.c
    struct file_tempcache_entry
    {
      VFID vfid;
      FILE_TYPE ftype;
      FILE_TEMPCACHE_ENTRY *next;
    };

    // file_tempcache_tran_entry -- src/storage/file_manager.c
    struct file_tempcache_tran_entry
    {
      pthread_mutex_t mutex;
      FILE_TEMPCACHE_ENTRY *head;
    #if !defined (NDEBUG)
      int owner_mutex;
    #endif
    };

    // file_tempcache -- src/storage/file_manager.c
    struct file_tempcache
    {
      FILE_TEMPCACHE_ENTRY *free_entries;
      int nfree_entries_max, nfree_entries;
      FILE_TEMPCACHE_ENTRY *cached_not_numerable, *cached_numerable;
      int ncached_max, ncached_not_numerable, ncached_numerable;
      pthread_mutex_t mutex;
    #if !defined (NDEBUG)
      int owner_mutex;
    #endif
      FILE_TEMPCACHE_TRAN_ENTRY *tran_files;
      SPACEDB_FILES spacedb_temp;
    };
    static FILE_TEMPCACHE file_Tempcache;

file_tempcache_entry

| Field | Role | Why it exists |
|---|---|---|
| vfid | identifies the cached file | the cache stores real, allocated files, not descriptors |
| ftype | file type of the cached file | a get matches by type; a near-miss is re-typed in place |
| next | list link | one entry travels between free_entries, a tran list, and a cached list, but is never on two at once |

file_tempcache_tran_entry (one per transaction index)
| Field | Role | Why it exists |
|---|---|---|
| mutex | per-transaction lock over this transaction’s head | held by file_tempcache_lock_tran_entry / unlock_tran_entry during the commit/abort drain |
| head | files this transaction created and still owns | drained at commit/abort by file_tempcache_drop_tran_temp_files |
| owner_mutex | debug-build-only ownership tracker (compiled only when NDEBUG is not defined) | records which thread holds mutex for the lock/unlock assertions |

file_tempcache (the global)
| Field | Role | Why it exists |
|---|---|---|
| free_entries | pool of empty entry shells | avoids malloc/free on every cache op |
| nfree_entries_max / nfree_entries | cap / current size of the shell pool | init sets max to ntrans * 8 |
| cached_not_numerable | retired regular temp files | a get(numerable=false) pops here |
| cached_numerable | retired numerable temp files | separate list since the user page table differs (Ch 10) |
| ncached_max | total capacity (PRM_ID_MAX_ENTRIES_IN_TEMP_FILE_CACHE) | put refuses once not_numerable + numerable >= max |
| ncached_not_numerable / ncached_numerable | per-list counts | kept lock-step with the lists (see invariant) |
| mutex | guards global lists and shell pool | one lock for all global state |
| owner_mutex | debug-build-only ownership tracker (compiled only when NDEBUG is not defined) | which thread holds mutex, for file_tempcache_lock / unlock asserts |
| tran_files | array of per-transaction lists | indexed by tran index so commit is O(1) |
| spacedb_temp | temp-space accounting | feeds SPACEDB reporting |

Invariant: list head and count agree. (cached_not_numerable == NULL) == (ncached_not_numerable == 0) and likewise for numerable; put asserts both before linking. If a count drifted, get underflows (it asserts ncached_* > 0) or put over-admits past ncached_max, leaking temp files.
11.1.1 file_tempcache_get — hand out a recycled file or a fresh shell

    // file_tempcache_get -- src/storage/file_manager.c
    *entry = numerable ? file_Tempcache.cached_numerable : file_Tempcache.cached_not_numerable;
    if (*entry != NULL && (*entry)->ftype != ftype)
      {  /* cached file is wrong type */
        error_code = file_temp_set_type (thread_p, &(*entry)->vfid, ftype);
        if (error_code != NO_ERROR)
          *entry = NULL;  /* <- re-type failed: fall to miss */
        else
          (*entry)->ftype = ftype;
      }
    if (*entry != NULL)
      {  /* hit: unlink, decrement the matching ncached_* */
        ...
      }
    else
      {
        error_code = file_tempcache_alloc_entry (entry);  /* miss: bare shell, VFID_SET_NULL */
      }

Five branches: hit/type-matches pops and decrements; hit/re-type succeeds patches ftype then
pops; hit/re-type fails nulls *entry to the miss path; miss allocates a shell
(vfid == NULL); shell-alloc failure propagates. A hit names an allocated file; a miss returns
a shell for the caller to create into.
11.1.2 file_tempcache_put — admit a file back, or refuse

    // file_tempcache_put -- src/storage/file_manager.c
    if (file_header_copy (...&entry->vfid, &fhead) != NO_ERROR
        || fhead.n_page_user > prm_get_integer_value (PRM_ID_MAX_PAGES_IN_TEMP_FILE_CACHE))
      return false;  /* <- too big / unreadable: no lock taken yet */
    file_tempcache_lock ();
    if (ncached_not_numerable + ncached_numerable < ncached_max)
      {
        if (file_temp_reset_user_pages (thread_p, &entry->vfid) != NO_ERROR)
          {
            file_tempcache_unlock ();
            return false;  /* <- reset failed: cannot reuse */
          }
        /* push onto cached_numerable / cached_not_numerable per FILE_IS_NUMERABLE(&fhead) */
        file_tempcache_unlock ();
        return true;
      }
    file_tempcache_unlock ();
    return false;  /* cache full */

Four exits, one keeps the file: header-copy-fails-or-too-big (false, before locking),
cache-full (false), reset-fails (false), and all-clear (push onto the list chosen by the
real header’s FILE_IS_NUMERABLE, true). A false return tells the caller to destroy it.
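The admission and hand-out logic of §11.1.1 and §11.1.2 can be condensed into a single-threaded sketch. All `toy_` names are hypothetical; locking, the header copy, the user-page reset, and the re-type failure path are omitted, so re-typing is assumed to always succeed here, and a miss returns NULL instead of a fresh shell.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_entry
{
  int vfid;                 /* stand-in for a VFID */
  int ftype;                /* stand-in for FILE_TYPE */
  struct toy_entry *next;
} toy_entry;

typedef struct
{
  toy_entry *cached_numerable;
  toy_entry *cached_not_numerable;
  int ncached_numerable;
  int ncached_not_numerable;
  int ncached_max;          /* total capacity across both lists */
  int max_pages;            /* largest file the cache will keep */
} toy_cache;

/* Try to admit a retired file. false tells the caller to destroy it instead. */
bool
toy_put (toy_cache *c, toy_entry *e, int n_page_user, bool numerable)
{
  toy_entry **head = numerable ? &c->cached_numerable : &c->cached_not_numerable;
  int *count = numerable ? &c->ncached_numerable : &c->ncached_not_numerable;

  if (n_page_user > c->max_pages)
    return false;           /* too big */
  if (c->ncached_not_numerable + c->ncached_numerable >= c->ncached_max)
    return false;           /* cache full */
  assert ((*head == NULL) == (*count == 0));   /* list head and count agree */
  e->next = *head;
  *head = e;
  (*count)++;
  return true;
}

/* Hand out a cached file of the requested shape, re-typed in place.
 * Returns NULL on a miss (the real code returns a bare shell instead). */
toy_entry *
toy_get (toy_cache *c, int ftype, bool numerable)
{
  toy_entry **head = numerable ? &c->cached_numerable : &c->cached_not_numerable;
  int *count = numerable ? &c->ncached_numerable : &c->ncached_not_numerable;
  toy_entry *e = *head;

  if (e == NULL)
    return NULL;
  assert (*count > 0);
  *head = e->next;
  e->next = NULL;
  (*count)--;
  e->ftype = ftype;         /* near-miss on type is patched, not rejected */
  return e;
}
```

Keeping the two lists separate while sharing one capacity limit mirrors the `ncached_not_numerable + ncached_numerable < ncached_max` test in the excerpt.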
11.1.3 Commit/abort drain — file_tempcache_drop_tran_temp_files

    // file_tempcache_drop_tran_temp_files -- src/storage/file_manager.c
    int tran_index = file_get_tempcache_entry_index (thread_p);
    file_tempcache_lock_tran_entry (&file_Tempcache.tran_files[tran_index]);
    if (file_Tempcache.tran_files[tran_index].head != NULL)
      file_tempcache_cache_or_drop_entries (thread_p, &file_Tempcache.tran_files[tran_index].head);
    file_tempcache_unlock_tran_entry (&file_Tempcache.tran_files[tran_index]);

file_tempcache_cache_or_drop_entries walks head; per entry it calls file_tempcache_put,
and on false calls file_destroy(..., true) (interrupts suppressed so nothing leaks mid-drop)
then file_tempcache_retire_entry; the list ends empty. tran_files is sized ntrans, where
ntrans = logtb_get_number_of_total_tran_indices () + 1 in server mode (the +1 reserves index
0) and 1 in SA mode — the array is ntrans-sized, not ntrans + 1.
11.1.4 Query-manager-owned files — file_temp_preserve / file_temp_retire_preserved
A temp file that must outlive the request but not the session cannot stay on the
transaction list, or the next commit reclaims it. file_temp_preserve removes it:

    // file_temp_preserve -- src/storage/file_manager.c
    entry = file_tempcache_pop_tran_file (thread_p, vfid);
    if (entry == NULL)
      assert_release (false);  /* must have been on the list */
    else
      file_tempcache_retire_entry (entry);  /* return the shell; file is now untracked */

When done, the owner calls file_temp_retire_preserved, which is file_temp_retire_internal(..., /*was_preserved=*/true). The flag changes how the entry is obtained: a preserved file is on no
list, so retire allocates a fresh shell with vfid from the argument; a non-preserved retire
pops the existing entry. Both funnel into file_tempcache_put and on false file_destroy(..., true).
Invariant: a temp file lives on exactly one tracking list: its transaction’s head, OR preserved (on no list), OR a global cached list, OR destroyed. file_temp_preserve enforces the hand-off by popping before retiring. Skip the pop and both the commit drain and the query manager retire it, which is a double-free.
11.2 The File Tracker: the catalog of permanent files
The File Tracker is one permanent file per database whose body is a single
FILE_EXTENSIBLE_DATA chain of FILE_TRACK_ITEM records — one per permanent file. It is
located through two globals seeded at boot from boot_Db_parm->trk_vfid: file_Tracker_vfid
(its VFID) and file_Tracker_vpid (its sticky first page, 11.3). boot_sr.c calls
file_tracker_create at creation, file_tracker_load at every restart.

    // file_track_metadata / file_track_item -- src/storage/file_manager.c
    union file_track_metadata
    {  /* 8 bytes, role depends on item->type */
      FILE_TRACK_HEAP_METADATA heap;  /* { bool is_marked_deleted; bool dummy[7]; } */
      INT64 metadata_size_tracker;  /* forces the union to exactly 8 bytes */
    };
    struct file_track_item
    {
      INT32 fileid;
      INT16 volid;
      INT16 type;  /* type is a FILE_TYPE cast to INT16 */
      FILE_TRACK_METADATA metadata;  /* total 16 bytes */
    };

file_track_item, where (volid, fileid) is the search key:

| Field | Role | Why it exists |
|---|---|---|
| fileid | low 4 bytes of the VFID | with volid, uniquely names the file |
| volid | volume of the file | items kept ordered by file_compare_track_items for binary search |
| type | FILE_TYPE as 16 bits | lets file_tracker_map filter by type without fixing each header |
| metadata | per-type side data | meaningful only for heaps; otherwise zero |

file_track_metadata, a role matrix, because the union means different things by type:

| item->type | Active member | Meaning |
|---|---|---|
| FILE_HEAP / FILE_HEAP_REUSE_SLOTS | heap.is_marked_deleted | heap is logically dropped but kept for reuse (file_tracker_item_reuse_heap) |
| any other type | metadata_size_tracker | unused; written 0 by file_tracker_register when metadata == NULL |

Invariant: items are ordered across the whole chain by file_compare_track_items; register inserts at the binary-search position. Both unregister (file_extdata_find_and_remove_item) and the iterator’s resume-by-cursor logic rely on this order, so an out-of-order insert makes a later lookup silently miss a file that exists.
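The ordered-insert invariant can be sketched over a flat array. The `toy_` names are hypothetical, the 8-byte metadata is flattened to a plain `int64_t`, and comparing volid before fileid is an assumption of the sketch (the text names the key but not the comparison order). The real tracker spreads items over a page chain; this sketch keeps them in one array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
  int32_t fileid;
  int16_t volid;
  int16_t type;
  int64_t metadata;   /* stand-in for the 8-byte metadata union */
} toy_track_item;

_Static_assert (sizeof (toy_track_item) == 16, "item is expected to be 16 bytes");

/* Assumed key order: volid first, then fileid. */
static int
toy_compare (const toy_track_item *a, const toy_track_item *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->fileid != b->fileid)
    return a->fileid < b->fileid ? -1 : 1;
  return 0;
}

/* Binary search for the first slot whose item is not less than key. */
static int
toy_lower_bound (const toy_track_item *items, int n, const toy_track_item *key, int *found)
{
  int lo = 0, hi = n;

  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;

      if (toy_compare (&items[mid], key) < 0)
        lo = mid + 1;
      else
        hi = mid;
    }
  *found = (lo < n && toy_compare (&items[lo], key) == 0);
  return lo;
}

/* Insert at the binary-search position. Returns the slot used,
 * or -1 if the array is full or the key is already registered. */
int
toy_register (toy_track_item *items, int *n, int cap, toy_track_item item)
{
  int found;
  int pos;

  if (*n == cap)
    return -1;
  pos = toy_lower_bound (items, *n, &item, &found);
  if (found)
    return -1;          /* duplicate registration */
  memmove (&items[pos + 1], &items[pos], (size_t) (*n - pos) * sizeof (*items));
  items[pos] = item;
  (*n)++;
  return pos;
}
```

Because every insert preserves the order, a later lookup or removal can use the same binary search and never needs a full scan.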
flowchart LR
parm["boot_Db_parm->trk_vfid"] --> vfid["file_Tracker_vfid"]
parm --> sticky["sticky first page"]
sticky --> vpid["file_Tracker_vpid"]
vpid --> head["FILE_EXTENSIBLE_DATA (head page)"]
head -->|vpid_next| more["FILE_EXTENSIBLE_DATA (more pages)"]
head --> items["FILE_TRACK_ITEM[] (volid,fileid,type,metadata)"]
Figure 11-1. Boot parameter to tracker globals to the extensible-data item chain.
11.2.1 file_tracker_register — add an item on permanent create
Section titled “11.2.1 file_tracker_register — add an item on permanent create”Called from file_create for every permanent file (Ch 6), under a started system op.
```c
// file_tracker_register -- src/storage/file_manager.c
assert (log_check_system_op_is_started (thread_p));
item.volid = vfid->volid;
item.fileid = vfid->fileid;
item.type = (INT16) ftype;
if (metadata == NULL)
  item.metadata.metadata_size_tracker = 0;      /* zero-fill */
else
  item.metadata = *metadata;
page_track_head = pgbuf_fix (..., &file_Tracker_vpid, OLD_PAGE, PGBUF_LATCH_WRITE, ...);
if (page_track_head == NULL)
  {
    ASSERT_ERROR_AND_SET (error_code);
    return error_code;
  }
error_code = file_tracker_register_internal (thread_p, page_track_head, &item);
```

Placement lives in `file_tracker_register_internal`:

1. Find a not-full page (`file_extdata_find_not_full`).
2. If none exists, allocate a new tracker page (`file_alloc(&file_Tracker_vfid, ...)`) and link it via `file_log_extdata_set_next`.
3. Binary-search the slot for the new item.
4. If the key is already present, fire `assert_release(false)`; a duplicate registration is a bug.
5. Insert with `file_extdata_insert_at`, log with `file_log_extdata_add`, and mark the page dirty.

Both error exits and the duplicate path `goto exit`.

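The slot search, duplicate check, and shifting insert can be sketched on a single fixed-capacity page. Everything named `sketch_*` / `SKETCH_*` is hypothetical, the page is a plain array rather than a `FILE_EXTENSIBLE_DATA`, and the sketch returns an error code for a duplicate so the path can be exercised, whereas the text above says the real function asserts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_CAPACITY 4

typedef struct
{
  int32_t fileid;
  int16_t volid;
  int16_t type;
} sketch_item;

typedef struct
{
  int count;
  sketch_item items[SKETCH_PAGE_CAPACITY];
} sketch_page;

enum { SKETCH_OK = 0, SKETCH_ERR_FULL = -1, SKETCH_ERR_DUPLICATE = -2 };

/* Assumed ordering: volid first, then fileid. */
static int
sketch_cmp (const sketch_item *a, const sketch_item *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->fileid != b->fileid)
    return a->fileid < b->fileid ? -1 : 1;
  return 0;
}

static int
sketch_register (sketch_page *page, const sketch_item *item)
{
  int lo = 0, hi = page->count;

  assert (page->count >= 0 && page->count <= SKETCH_PAGE_CAPACITY);
  if (page->count == SKETCH_PAGE_CAPACITY)
    return SKETCH_ERR_FULL;     /* the real code would pick or allocate another page */

  while (lo < hi)               /* binary search for the insert slot */
    {
      int mid = lo + (hi - lo) / 2;
      if (sketch_cmp (&page->items[mid], item) < 0)
        lo = mid + 1;
      else
        hi = mid;
    }
  if (lo < page->count && sketch_cmp (&page->items[lo], item) == 0)
    return SKETCH_ERR_DUPLICATE;

  /* Shift the tail right by one slot, then drop the item in place. */
  memmove (&page->items[lo + 1], &page->items[lo],
           (size_t) (page->count - lo) * sizeof (sketch_item));
  page->items[lo] = *item;
  page->count++;
  return SKETCH_OK;
}
```

Inserting in arrival order `(0,7)`, `(0,3)`, `(1,1)` leaves the page sorted as `(0,3)`, `(0,7)`, `(1,1)`, and a second `(0,7)` is rejected.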
11.2.2 file_tracker_unregister — remove an item on permanent destroy
```c
// file_tracker_unregister -- src/storage/file_manager.c
log_sysop_start (thread_p);     /* its own nested system op */
item_inout.volid = vfid->volid;
item_inout.fileid = vfid->fileid;
error_code = file_extdata_find_and_remove_item (..., &item_inout, file_compare_track_items, true,
                                                &item_inout, &vpid_merged);
if (error_code != NO_ERROR)
  goto exit;                    /* -> sysop_abort */
if (!VPID_ISNULL (&vpid_merged))        /* removal emptied/merged a page */
  error_code = file_dealloc (thread_p, &file_Tracker_vfid, &vpid_merged, FILE_TRACKER);

exit:
if (error_code != NO_ERROR)
  log_sysop_abort (thread_p);
else
  log_sysop_end_logical_undo (thread_p, RVFL_TRACKER_UNREGISTER, NULL, sizeof (item_inout), &item_inout);
```

Branches: a failed fix of the tracker head page (not shown in the excerpt) returns early, with no sysop; a find-and-remove failure or a merge-then-dealloc failure both `goto exit` → abort; success ends with logical undo. Logical (not physical) undo is the key: items shift between pages as the chain compacts, so a physical undo would target the wrong slot. The undo instead replays `file_tracker_register_internal` from the saved item (`file_rv_tracker_unregister_undo`).

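Why a remembered slot goes stale can be shown on a toy table. This is a sketch under simplifying assumptions, not the recovery code: plain `int` keys stand in for `(volid, fileid)`, one array stands in for the whole chain, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_CAP 8

/* Plain ints stand in for (volid, fileid) keys; one array stands in for the chain. */
typedef struct
{
  int count;
  int keys[SKETCH_CAP];
} sketch_table;

static int
sketch_lower_bound (const sketch_table *t, int key)
{
  int lo = 0, hi = t->count;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (t->keys[mid] < key)
        lo = mid + 1;
      else
        hi = mid;
    }
  return lo;
}

static void
sketch_insert_at (sketch_table *t, int slot, int key)
{
  assert (t->count < SKETCH_CAP && slot >= 0 && slot <= t->count);
  memmove (&t->keys[slot + 1], &t->keys[slot], (size_t) (t->count - slot) * sizeof (int));
  t->keys[slot] = key;
  t->count++;
}

/* Removes key and returns the slot it occupied, or -1 if absent. */
static int
sketch_remove (sketch_table *t, int key)
{
  int slot = sketch_lower_bound (t, key);
  if (slot == t->count || t->keys[slot] != key)
    return -1;
  memmove (&t->keys[slot], &t->keys[slot + 1], (size_t) (t->count - slot - 1) * sizeof (int));
  t->count--;
  return slot;
}

/* Physical undo: trust the slot remembered at removal time (clamped to the table). */
static void
sketch_undo_physical (sketch_table *t, int saved_slot, int key)
{
  sketch_insert_at (t, saved_slot < t->count ? saved_slot : t->count, key);
}

/* Logical undo: re-run the ordered insert from the saved key. */
static void
sketch_undo_logical (sketch_table *t, int key)
{
  sketch_insert_at (t, sketch_lower_bound (t, key), key);
}

static bool
sketch_is_sorted (const sketch_table *t)
{
  for (int i = 1; i < t->count; i++)
    if (t->keys[i - 1] >= t->keys[i])
      return false;
  return true;
}
```

If other removals shrink the table between the original removal and its undo, the physical variant reinserts at a position that no longer matches the key order, while the logical variant recomputes the slot and keeps the table sorted.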
11.2.3 file_tracker_map — enumerate every file
```c
// file_tracker_map -- src/storage/file_manager.c
page_track_head = pgbuf_fix (..., &file_Tracker_vpid, OLD_PAGE, latch_mode, ...);
while (true)                    /* walk the extdata chain */
  {
    for (index_item = 0; index_item < file_extdata_item_count (extdata); index_item++)
      {
        error_code = func (thread_p, page_extdata, extdata, index_item, &stop, args);
        if (error_code != NO_ERROR || stop)
          goto exit;            /* error, or callback early-out */
      }
    if (page_track_other != NULL)
      pgbuf_unfix_and_init (thread_p, page_track_other);
    if (VPID_ISNULL (&extdata->vpid_next))
      break;                    /* end of chain */
    page_track_other = pgbuf_fix (..., &extdata->vpid_next, OLD_PAGE, latch_mode, ...);
    if (page_track_other == NULL)
      goto exit;
    page_extdata = page_track_other;
  }
```

`map` holds the head page and rotates one `page_track_other` (at most two pages latched at once). The companion `file_tracker_interruptable_iterate` instead returns a cursor (`vfid`) plus an OID lock, so a long scan can be interrupted and resumed without pinning the tracker; its `FILE_GET_TRACKER_LOCK_MODE` macro picks `IX_LOCK` for B-trees and `SCH_S_LOCK` otherwise.

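Resume-by-cursor depends only on the key order, which a short sketch can make concrete. How the real iterator locates its next item is not shown in this chapter; the version below is an assumption that uses an upper-bound search, with hypothetical `sketch_*` names and a flat array in place of the chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
  int16_t volid;
  int32_t fileid;
} sketch_vfid;

/* Assumed ordering: volid first, then fileid. */
static int
sketch_cmp (const sketch_vfid *a, const sketch_vfid *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->fileid != b->fileid)
    return a->fileid < b->fileid ? -1 : 1;
  return 0;
}

/* Writes the first key strictly greater than *cursor into *next.
 * A NULL cursor starts the scan. Returns false when the scan is finished. */
static bool
sketch_next_after (const sketch_vfid *items, int count, const sketch_vfid *cursor, sketch_vfid *next)
{
  int lo = 0, hi = count;

  assert (count >= 0 && next != NULL);
  while (cursor != NULL && lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (sketch_cmp (&items[mid], cursor) <= 0)
        lo = mid + 1;
      else
        hi = mid;
    }
  if (lo == count)
    return false;
  *next = items[lo];
  return true;
}
```

Because the search compares keys rather than remembering a position, the scan still lands on the right successor when the file named by the cursor was unregistered while the scan was paused.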
11.3 Sticky first page
Some files must keep their first user page at a fixed VPID forever: the tracker itself and the boot HFID heap. `file_alloc_sticky_first_page` allocates page #1 and records it.

```c
// file_alloc_sticky_first_page -- src/storage/file_manager.c
assert (fhead->n_page_user == 0 && VPID_ISNULL (&fhead->vpid_sticky_first));    /* brand-new file */
error_code = file_alloc (thread_p, vfid, f_init, f_init_args, vpid_out, page_out);
if (error_code != NO_ERROR)
  goto exit;
log_append_undoredo_data2 (thread_p, RVFL_FHEAD_STICKY_PAGE, NULL, page_fhead, 0, sizeof (VPID), sizeof (VPID),
                           &fhead->vpid_sticky_first, vpid_out);
fhead->vpid_sticky_first = *vpid_out;   /* remember it */
pgbuf_set_dirty (thread_p, page_fhead, DONT_FREE);
```

An ordinary `file_alloc` plus one logged header write (recovered by `file_rv_fhead_sticky_page`). The payoff is on dealloc: `file_dealloc` and its helpers `assert (!VPID_EQ (&fhead->vpid_sticky_first, vpid))`, exempting the sticky page from the Ch 9 lifecycle. This is a debug assertion (compiled out under `NDEBUG`), not a runtime short-circuit; callers are simply expected never to pass the sticky VPID. `file_get_sticky_first_page` reads it back (`assert_release(false)` if NULL); this is how `file_tracker_load` recovers `file_Tracker_vpid`.

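The contract (record the first page once, then assert it is never deallocated) can be modelled in a few lines. This is a toy sketch: the `sketch_*` types, the counter-based allocator, and the null-page encoding are all hypothetical and only mirror the shape of the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NULL_PAGEID (-1)

typedef struct
{
  int32_t pageid;
  int16_t volid;
} sketch_vpid;

typedef struct
{
  int16_t volid;
  int32_t next_pageid;            /* toy allocator state, not a real header field */
  int n_page_user;
  sketch_vpid vpid_sticky_first;
} sketch_fhead;

static bool
sketch_vpid_is_null (const sketch_vpid *v)
{
  return v->pageid == SKETCH_NULL_PAGEID;
}

static bool
sketch_vpid_eq (const sketch_vpid *a, const sketch_vpid *b)
{
  return a->pageid == b->pageid && a->volid == b->volid;
}

static void
sketch_fhead_init (sketch_fhead *h, int16_t volid)
{
  h->volid = volid;
  h->next_pageid = 1;
  h->n_page_user = 0;
  h->vpid_sticky_first.pageid = SKETCH_NULL_PAGEID;
  h->vpid_sticky_first.volid = -1;
}

/* Stand-in for file_alloc: hands out consecutive page ids. */
static sketch_vpid
sketch_alloc (sketch_fhead *h)
{
  sketch_vpid v = { h->next_pageid++, h->volid };
  h->n_page_user++;
  return v;
}

static sketch_vpid
sketch_alloc_sticky_first_page (sketch_fhead *h)
{
  /* Only legal on a brand-new file. */
  assert (h->n_page_user == 0 && sketch_vpid_is_null (&h->vpid_sticky_first));
  h->vpid_sticky_first = sketch_alloc (h);
  return h->vpid_sticky_first;
}

static void
sketch_dealloc (sketch_fhead *h, const sketch_vpid *vpid)
{
  /* Debug-only guard: vanishes under NDEBUG, so callers carry the contract. */
  assert (!sketch_vpid_eq (&h->vpid_sticky_first, vpid));
  h->n_page_user--;
}
```

Compiling the sketch with `-DNDEBUG` removes both assertions, which is the point made above: release builds depend on callers never passing the sticky VPID.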
11.4 TDE flags — orthogonal to allocation
TDE is two mutually exclusive bits in `fhead->file_flags`: `FILE_FLAG_ENCRYPTED_AES` (0x4) and `FILE_FLAG_ENCRYPTED_ARIA` (0x8).

```c
// file_set_tde_algorithm_internal -- src/storage/file_manager.c
fhead->file_flags &= ~FILE_FLAG_ENCRYPTED_MASK; /* clear both bits first */
switch (tde_algo)
  {
  case TDE_ALGORITHM_AES:
    fhead->file_flags |= FILE_FLAG_ENCRYPTED_AES;
    break;
  case TDE_ALGORITHM_ARIA:
    fhead->file_flags |= FILE_FLAG_ENCRYPTED_ARIA;
    break;
  case TDE_ALGORITHM_NONE:
    break;                      /* already cleared */
  }
```

Neither sector reservation (Ch 4) nor page allocation (Ch 7/8) consults these flags. `file_get_tde_algorithm_internal` asserts the two bits are never both set, then reports AES, ARIA, or NONE. Encryption is applied per page at the buffer layer; the allocation machinery is algorithm-blind, so for a reader modifying allocation, TDE is a non-event.

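The clear-then-set pattern and the mutual-exclusion check can be exercised in isolation. In this sketch the bit values 0x4 and 0x8 come from the text; the mask being the OR of the two bits is an assumption, and the `sketch_*` / `SKETCH_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FLAG_ENCRYPTED_AES   0x4
#define SKETCH_FLAG_ENCRYPTED_ARIA  0x8
/* Assumed: the mask is simply the OR of the two algorithm bits. */
#define SKETCH_FLAG_ENCRYPTED_MASK  (SKETCH_FLAG_ENCRYPTED_AES | SKETCH_FLAG_ENCRYPTED_ARIA)

typedef enum
{
  SKETCH_TDE_NONE,
  SKETCH_TDE_AES,
  SKETCH_TDE_ARIA
} sketch_tde_algo;

/* Clear both bits, then set at most one; unrelated flag bits are preserved. */
static int32_t
sketch_set_tde (int32_t flags, sketch_tde_algo algo)
{
  flags &= ~SKETCH_FLAG_ENCRYPTED_MASK;
  switch (algo)
    {
    case SKETCH_TDE_AES:
      flags |= SKETCH_FLAG_ENCRYPTED_AES;
      break;
    case SKETCH_TDE_ARIA:
      flags |= SKETCH_FLAG_ENCRYPTED_ARIA;
      break;
    case SKETCH_TDE_NONE:
      break;                    /* already cleared */
    }
  return flags;
}

static sketch_tde_algo
sketch_get_tde (int32_t flags)
{
  /* The two bits must never be set together. */
  assert ((flags & SKETCH_FLAG_ENCRYPTED_MASK) != SKETCH_FLAG_ENCRYPTED_MASK);
  if (flags & SKETCH_FLAG_ENCRYPTED_AES)
    return SKETCH_TDE_AES;
  if (flags & SKETCH_FLAG_ENCRYPTED_ARIA)
    return SKETCH_TDE_ARIA;
  return SKETCH_TDE_NONE;
}
```

Clearing the mask before setting is what makes a switch from AES to ARIA safe: the old bit never survives alongside the new one.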
11.5 The shared primitive — file_extdata_apply_funcs
Every table in the module (tracker items, user-page table, sector tables) is a `FILE_EXTENSIBLE_DATA` chain, and almost every walk goes through this generic visitor.

```c
// file_extdata_apply_funcs -- src/storage/file_manager.c
while (true)
  {
    if (f_extdata != NULL)      /* per-page */
      {
        error_code = f_extdata (...);
        if (error_code || stop)
          goto exit;
      }
    if (f_item != NULL)         /* per-item */
      for (i = 0; i < file_extdata_item_count (extdata_in); i++)
        {
          error_code = f_item (..., file_extdata_at (extdata_in, i), i, &stop, ...);
          if (error_code || stop)
            goto exit;
        }
    if (VPID_ISNULL (&extdata_in->vpid_next))
      break;
    // ... unfix current, fix extdata_in->vpid_next, goto exit on NULL ...
  }

exit:
if (stop && page_out != NULL)
  *page_out = page_extdata;     /* hand page back, latched */
else if (page_extdata != NULL)
  pgbuf_unfix (thread_p, page_extdata);
```

There are two optional callbacks (`f_extdata` per page, `f_item` per item); either can stop the walk or return an error. The exit policy is the subtle part: on stop with `page_out` set, the page is handed back still latched so search-then-modify can act on it; otherwise it is unfixed here. This underlies `file_extdata_search_item` / `_find_ordered` (binary search), `_insert_at` / `_remove_at`, and `file_extdata_merge`. `latch_mode` is WRITE when `for_write`, else READ.

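The visitor shape and its exit policy can be sketched over an in-memory linked list. This is an illustration with hypothetical `sketch_*` names: a boolean `latched` flag stands in for a page-buffer fix, and the sketch latches the head page itself, whereas the excerpt above suggests the real function receives the first `extdata_in` from its caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ITEMS_PER_PAGE 4

typedef struct sketch_page
{
  struct sketch_page *next;     /* stands in for vpid_next */
  int count;
  int items[SKETCH_ITEMS_PER_PAGE];
  bool latched;                 /* stands in for a page-buffer fix */
} sketch_page;

typedef int (*sketch_page_fn) (sketch_page *page, bool *stop, void *args);
typedef int (*sketch_item_fn) (sketch_page *page, int index, bool *stop, void *args);

static int
sketch_apply_funcs (sketch_page *head, sketch_page_fn f_page, sketch_item_fn f_item,
                    void *args, sketch_page **page_out)
{
  sketch_page *page = head;
  bool stop = false;
  int error = 0;

  if (page != NULL)
    page->latched = true;
  while (page != NULL)
    {
      assert (page->count <= SKETCH_ITEMS_PER_PAGE);
      if (f_page != NULL)       /* per-page callback */
        {
          error = f_page (page, &stop, args);
          if (error != 0 || stop)
            goto exit;
        }
      if (f_item != NULL)       /* per-item callback */
        for (int i = 0; i < page->count; i++)
          {
            error = f_item (page, i, &stop, args);
            if (error != 0 || stop)
              goto exit;
          }
      if (page->next == NULL)
        break;
      page->latched = false;    /* unfix current */
      page = page->next;
      page->latched = true;     /* fix next */
    }

exit:
  if (stop && page_out != NULL)
    *page_out = page;           /* hand the page back, still latched */
  else if (page != NULL)
    page->latched = false;
  return error;
}

/* Example item callback: stop on the first item equal to *args. */
static int
sketch_find_item (sketch_page *page, int index, bool *stop, void *args)
{
  const int *target = (const int *) args;
  if (page->items[index] == *target)
    *stop = true;
  return 0;
}
```

A caller that passes `page_out` and gets a non-NULL page back owns the latch and must release it after modifying the page; a caller that passes NULL gets a pure read-only walk.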
11.6 Recovery handlers for this chapter, and the open question
Both modules register undo/redo/dump callbacks indexed by the `RV*` enum. The handlers introduced by this chapter's machinery:

| RV index | Handler(s) | What it replays |
|---|---|---|
| `RVFL_FHEAD_STICKY_PAGE` | `file_rv_fhead_sticky_page` | sticky-first-page VPID (11.3) |
| `RVFL_TRACKER_UNREGISTER` | `file_rv_tracker_unregister_undo` | logical undo of tracker removal (11.2.2) |
| `RVFL_SET_TDE_ALGORITHM` | `file_rv_set_tde_algorithm` | TDE flag change (11.4) |
| `RVFL_EXTDATA_ADD/REMOVE/SET_NEXT/MERGE` | `file_rv_extdata_add` / `_remove` / `_set_next` / `_merge` | every extensible-data edit (11.5) |
| `RVDK_FORMAT` | `disk_rv_undo_format`, `disk_rv_redo_format`, `disk_rv_dump_hdr` | volume create/format (the open question below) |

The core-lifecycle handlers — sector reserve/unreserve, volume-header expand, file-header
alloc/dealloc, partial-sector bitmap edits, postponed dealloc, file destroy — belong to
Chapters 3-5, 7, 9; see those for their RVDK_* / RVFL_* rows.
Open question: mid-`disk_format` crash idempotency. `disk_rv_redo_format` carries an `is_first_call` flag (`rcv->offset == -1`) that skips the disk-cache update on the first of its two calls, so the format handlers encode an implicit assumption about how far `disk_format` got before a crash. Whether every interleaving (volume file created / cache registered / `volume_info` written) is covered, notably a crash between the redo's two calls, is not provable from the handlers alone and is left as a verification target.

11.7 Chapter summary — key takeaways
- `file_Tempcache` recycles whole temp files: `get` pops a cached list or allocates a shell; `put` admits a reset file or returns `false` so the caller destroys it.
- A temp file lives on exactly one tracking list; `file_temp_preserve` pops it off the transaction list for the query manager, and `drop_tran_temp_files` drains the rest at commit/abort.
- The File Tracker is a `(volid, fileid)`-ordered `FILE_TRACK_ITEM` chain reached via `trk_vfid` → `file_Tracker_vfid` / `file_Tracker_vpid`; `unregister` uses logical undo because items migrate between pages.
- Sticky first pages are lifecycle-exempt by contract: `file_dealloc` only asserts (debug build) that the page is never `vpid_sticky_first`; release builds rely on callers never passing it.
- TDE (mutually exclusive AES/ARIA bits in `file_flags`) is orthogonal to allocation; the bits are read only when bytes hit disk.
- `file_extdata_apply_funcs` is the one engine behind every table, with per-page/per-item callbacks and an exit policy that can hand the stopped-on page back still latched.
- Recovery is indexed by `RV*` constants; the one unproven corner is `disk_rv_*_format` idempotency across mid-`disk_format` crash points.

Position hints as of this revision
The following are line numbers as observed on 2026-06-09; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
|---|---|---|
bit64_count_trailing_ones | src/base/bit.c | 515 |
PRM_ID_BOSR_MAXTMP_PAGES | src/base/system_parameter.c | 1246 |
DB_VOLPURPOSE | src/compat/dbtype_def.h | 196 |
DB_VOLTYPE | src/compat/dbtype_def.h | 203 |
VSID | src/compat/dbtype_def.h | 939 |
fhs_fix_nth_page | src/query/query_hash_scan.c | 1078 |
disk_volume_header | src/storage/disk_manager.c | 75 |
disk_cache_volinfo | src/storage/disk_manager.c | 155 |
disk_extend_info | src/storage/disk_manager.c | 162 |
disk_perm_info | src/storage/disk_manager.c | 180 |
disk_temp_info | src/storage/disk_manager.c | 186 |
nsect_perm_free | src/storage/disk_manager.c | 189 |
disk_cache | src/storage/disk_manager.c | 194 |
disk_Cache | src/storage/disk_manager.c | 209 |
disk_Temp_max_sects | src/storage/disk_manager.c | 211 |
DISK_STAB_UNIT | src/storage/disk_manager.c | 224 |
disk_stab_cursor | src/storage/disk_manager.c | 229 |
DISK_STAB_PAGE_BIT_COUNT | src/storage/disk_manager.c | 250 |
DISK_ALLOCTBL_SECTOR_PAGE_OFFSET | src/storage/disk_manager.c | 253 |
DISK_ALLOCTBL_SECTOR_UNIT_OFFSET | src/storage/disk_manager.c | 255 |
DISK_STAB_NPAGES | src/storage/disk_manager.c | 263 |
disk_cache_vol_reserve | src/storage/disk_manager.c | 273 |
DISK_PRERESERVE_BUF_DEFAULT | src/storage/disk_manager.c | 278 |
disk_reserve_context | src/storage/disk_manager.c | 281 |
DISK_MIN_VOLUME_SECTS | src/storage/disk_manager.c | 300 |
DISK_SYS_NSECT_SIZE | src/storage/disk_manager.c | 347 |
disk_format | src/storage/disk_manager.c | 512 |
disk_unformat | src/storage/disk_manager.c | 822 |
disk_rv_undo_format | src/storage/disk_manager.c | 1235 |
disk_rv_redo_format | src/storage/disk_manager.c | 1340 |
disk_extend | src/storage/disk_manager.c | 1633 |
disk_volume_expand | src/storage/disk_manager.c | 1904 |
disk_rv_volhead_extend_redo | src/storage/disk_manager.c | 2022 |
disk_rv_volhead_extend_undo | src/storage/disk_manager.c | 2081 |
disk_add_volume | src/storage/disk_manager.c | 2117 |
disk_add_volume_extension | src/storage/disk_manager.c | 2326 |
disk_volume_boot | src/storage/disk_manager.c | 2443 |
disk_cache_load_volume | src/storage/disk_manager.c | 2567 |
disk_cache_init | src/storage/disk_manager.c | 2627 |
disk_cache_final | src/storage/disk_manager.c | 2688 |
disk_cache_load_all_volumes | src/storage/disk_manager.c | 2714 |
disk_cache_free_reserved | src/storage/disk_manager.c | 2728 |
disk_cache_update_vol_free | src/storage/disk_manager.c | 2748 |
disk_lock_extend | src/storage/disk_manager.c | 2791 |
disk_unlock_extend | src/storage/disk_manager.c | 2817 |
disk_cache_lock_reserve_for_purpose | src/storage/disk_manager.c | 2837 |
disk_volume_header_set_stab | src/storage/disk_manager.c | 3166 |
disk_verify_volume_header | src/storage/disk_manager.c | 3179 |
disk_stab_cursor_set_at_sectid | src/storage/disk_manager.c | 3258 |
disk_stab_cursor_set_at_end | src/storage/disk_manager.c | 3284 |
disk_stab_cursor_set_at_start | src/storage/disk_manager.c | 3303 |
disk_stab_cursor_check_valid | src/storage/disk_manager.c | 3372 |
disk_stab_cursor_is_bit_set | src/storage/disk_manager.c | 3414 |
disk_stab_cursor_set_bit | src/storage/disk_manager.c | 3429 |
disk_stab_cursor_fix | src/storage/disk_manager.c | 3493 |
disk_stab_unit_reserve | src/storage/disk_manager.c | 3544 |
disk_stab_iterate_units | src/storage/disk_manager.c | 3665 |
disk_stab_iterate_units_all | src/storage/disk_manager.c | 3738 |
disk_stab_set_bits_contiguous | src/storage/disk_manager.c | 3807 |
disk_rv_reserve_sectors | src/storage/disk_manager.c | 3899 |
disk_rv_unreserve_sectors | src/storage/disk_manager.c | 3982 |
disk_reserve_sectors_in_volume | src/storage/disk_manager.c | 4066 |
disk_reserve_sectors | src/storage/disk_manager.c | 4290 |
disk_reserve_from_cache | src/storage/disk_manager.c | 4463 |
disk_reserve_from_cache_vols | src/storage/disk_manager.c | 4612 |
disk_reserve_from_cache_volume | src/storage/disk_manager.c | 4666 |
disk_unreserve_ordered_sectors | src/storage/disk_manager.c | 4703 |
disk_unreserve_ordered_sectors_without_csect | src/storage/disk_manager.c | 4735 |
disk_unreserve_sectors_from_volume | src/storage/disk_manager.c | 4794 |
disk_stab_unit_unreserve | src/storage/disk_manager.c | 4848 |
disk_stab_init | src/storage/disk_manager.c | 4909 |
disk_manager_init | src/storage/disk_manager.c | 5002 |
disk_manager_final | src/storage/disk_manager.c | 5044 |
disk_format_first_volume | src/storage/disk_manager.c | 5062 |
disk_sectors_to_extend_npages | src/storage/disk_manager.c | 6845 |
DISK_VOLHEADER_PAGE | src/storage/disk_manager.h | 35 |
fileio_map_mounted | src/storage/file_io.c | 3448 |
file_header | src/storage/file_manager.c | 90 |
n_page_mark_delete | src/storage/file_manager.c | 104 |
volid_last_expand | src/storage/file_manager.c | 117 |
vpid_sticky_first | src/storage/file_manager.c | 123 |
vpid_last_temp_alloc | src/storage/file_manager.c | 132 |
offset_to_last_temp_alloc | src/storage/file_manager.c | 133 |
vpid_last_user_page_ftab | src/storage/file_manager.c | 139 |
vpid_find_nth_last | src/storage/file_manager.c | 156 |
first_index_find_nth_last | src/storage/file_manager.c | 157 |
FILE_HEADER_ALIGNED_SIZE | src/storage/file_manager.c | 167 |
FILE_FLAG_NUMERABLE | src/storage/file_manager.c | 170 |
FILE_FLAG_ENCRYPTED_AES | src/storage/file_manager.c | 172 |
FILE_FLAG_ENCRYPTED_ARIA | src/storage/file_manager.c | 173 |
FILE_CACHE_LAST_FIND_NTH | src/storage/file_manager.c | 181 |
FILE_HEADER_GET_PART_FTAB | src/storage/file_manager.c | 199 |
FILE_HEADER_GET_FULL_FTAB | src/storage/file_manager.c | 203 |
FILE_HEADER_GET_USER_PAGE_FTAB | src/storage/file_manager.c | 208 |
file_extensible_data | src/storage/file_manager.c | 232 |
FILE_EXTDATA_HEADER_ALIGNED_SIZE | src/storage/file_manager.c | 240 |
FILE_TABLESPACE_FOR_PERM_NPAGES | src/storage/file_manager.c | 281 |
FILE_TABLESPACE_FOR_TEMP_NPAGES | src/storage/file_manager.c | 287 |
file_vsid_collector | src/storage/file_manager.c | 296 |
file_alloc_type | src/storage/file_manager.c | 388 |
FILE_USER_PAGE_MARK_DELETE_FLAG | src/storage/file_manager.c | 425 |
FILE_USER_PAGE_IS_MARKED_DELETED | src/storage/file_manager.c | 426 |
FILE_USER_PAGE_MARK_DELETED | src/storage/file_manager.c | 427 |
FILE_USER_PAGE_CLEAR_MARK_DELETED | src/storage/file_manager.c | 428 |
file_find_nth_context | src/storage/file_manager.c | 433 |
file_tempcache_entry | src/storage/file_manager.c | 448 |
file_tempcache_tran_entry | src/storage/file_manager.c | 457 |
file_tempcache | src/storage/file_manager.c | 467 |
file_Tempcache | src/storage/file_manager.c | 490 |
file_Tracker_vfid | src/storage/file_manager.c | 496 |
file_Tracker_vpid | src/storage/file_manager.c | 497 |
file_track_metadata | src/storage/file_manager.c | 507 |
file_track_item | src/storage/file_manager.c | 515 |
file_manager_init | src/storage/file_manager.c | 859 |
file_manager_final | src/storage/file_manager.c | 872 |
file_header_alloc | src/storage/file_manager.c | 1093 |
file_header_update_mark_deleted | src/storage/file_manager.c | 1317 |
file_extdata_init | src/storage/file_manager.c | 1492 |
file_extdata_max_size | src/storage/file_manager.c | 1520 |
file_extdata_apply_funcs | src/storage/file_manager.c | 1886 |
file_extdata_find_and_remove_item | src/storage/file_manager.c | 2571 |
file_partsect_is_full | src/storage/file_manager.c | 2758 |
file_partsect_is_empty | src/storage/file_manager.c | 2770 |
file_partsect_set_bit | src/storage/file_manager.c | 2796 |
file_partsect_pageid_to_offset | src/storage/file_manager.c | 2826 |
file_partsect_alloc | src/storage/file_manager.c | 2847 |
file_create_with_npages | src/storage/file_manager.c | 3101 |
file_create_heap | src/storage/file_manager.c | 3126 |
file_create_temp_internal | src/storage/file_manager.c | 3155 |
file_create_temp | src/storage/file_manager.c | 3217 |
file_create_temp_numerable | src/storage/file_manager.c | 3231 |
file_create_query_area | src/storage/file_manager.c | 3244 |
file_create_ehash | src/storage/file_manager.c | 3261 |
file_create_ehash_dir | src/storage/file_manager.c | 3285 |
file_create | src/storage/file_manager.c | 3311 |
file_table_collect_vsid | src/storage/file_manager.c | 3915 |
file_table_collect_all_vsids | src/storage/file_manager.c | 3934 |
file_destroy | src/storage/file_manager.c | 4121 |
file_temp_retire_preserved | src/storage/file_manager.c | 4445 |
file_temp_retire_internal | src/storage/file_manager.c | 4476 |
file_perm_expand | src/storage/file_manager.c | 4644 |
file_table_move_partial_sectors_to_header | src/storage/file_manager.c | 4772 |
file_table_append_full_sector_page | src/storage/file_manager.c | 4976 |
file_table_add_full_sector | src/storage/file_manager.c | 5026 |
file_perm_alloc | src/storage/file_manager.c | 5166 |
file_alloc | src/storage/file_manager.c | 5405 |
file_alloc_sticky_first_page | src/storage/file_manager.c | 5681 |
file_rv_fhead_sticky_page | src/storage/file_manager.c | 5753 |
file_get_sticky_first_page | src/storage/file_manager.c | 5779 |
file_set_tde_algorithm_internal | src/storage/file_manager.c | 5896 |
file_get_tde_algorithm_internal | src/storage/file_manager.c | 5963 |
file_dealloc | src/storage/file_manager.c | 6116 |
file_perm_dealloc | src/storage/file_manager.c | 6309 |
file_rv_dealloc_internal | src/storage/file_manager.c | 6616 |
file_rv_dealloc_on_undo | src/storage/file_manager.c | 6758 |
file_rv_dealloc_on_postpone | src/storage/file_manager.c | 6773 |
file_numerable_add_page | src/storage/file_manager.c | 7935 |
file_extdata_find_nth_vpid | src/storage/file_manager.c | 8119 |
file_extdata_find_nth_vpid_and_skip_marked | src/storage/file_manager.c | 8153 |
file_numerable_find_nth | src/storage/file_manager.c | 8193 |
file_rv_user_page_mark_delete | src/storage/file_manager.c | 8381 |
file_rv_user_page_unmark_delete_logical | src/storage/file_manager.c | 8406 |
file_numerable_truncate | src/storage/file_manager.c | 8577 |
file_temp_alloc | src/storage/file_manager.c | 8650 |
disk_reserve_sectors | src/storage/file_manager.c | 8715 |
file_temp_reset_user_pages | src/storage/file_manager.c | 8949 |
file_temp_preserve | src/storage/file_manager.c | 9143 |
file_tempcache_init | src/storage/file_manager.c | 9171 |
file_tempcache_final | src/storage/file_manager.c | 9234 |
file_tempcache_get | src/storage/file_manager.c | 9414 |
file_tempcache_put | src/storage/file_manager.c | 9541 |
file_tempcache_drop_tran_temp_files | src/storage/file_manager.c | 9645 |
file_tempcache_cache_or_drop_entries | src/storage/file_manager.c | 9664 |
file_tempcache_pop_tran_file | src/storage/file_manager.c | 9702 |
file_tracker_create | src/storage/file_manager.c | 9861 |
file_tracker_load | src/storage/file_manager.c | 9910 |
file_tracker_register | src/storage/file_manager.c | 9960 |
file_tracker_register_internal | src/storage/file_manager.c | 10016 |
file_tracker_unregister | src/storage/file_manager.c | 10113 |
file_tracker_map | src/storage/file_manager.c | 10306 |
file_tracker_interruptable_iterate | src/storage/file_manager.c | 10992 |
file_heap_des | src/storage/file_manager.h | 82 |
file_btree_des | src/storage/file_manager.h | 98 |
file_ovf_btree_des | src/storage/file_manager.h | 106 |
FILE_DESCRIPTORS_SIZE | src/storage/file_manager.h | 128 |
file_descriptors | src/storage/file_manager.h | 130 |
file_tablespace | src/storage/file_manager.h | 143 |
FILE_ALLOC_BITMAP | src/storage/file_manager.h | 153 |
FILE_FULL_PAGE_BITMAP | src/storage/file_manager.h | 154 |
FILE_ALLOC_BITMAP_NBITS | src/storage/file_manager.h | 157 |
file_partial_sector | src/storage/file_manager.h | 162 |
pgbuf_dealloc_page | src/storage/page_buffer.c | 14562 |
DISK_SECTOR_NPAGES | src/storage/storage_common.h | 109 |
trk_vfid | src/transaction/boot_sr.c | 119 |
LOG_MAX_DBVOLID | src/transaction/log_volids.hpp | 34 |
Sources
- `cubrid-disk-manager.md`: the high-level companion (covers both file and disk managers).
- Raw analyses under `raw/code-analysis/cubrid/storage/disk_manager/` and the numerable-file Q&A note `raw/code-analysis/cubrid/file-manager-numerable-qa.md`.
- Code: `src/storage/file_manager.{c,h}`, `src/storage/disk_manager.{c,h}`.
- Methodology: `knowledge/methodology/code-analysis-detail-doc.md`.