
CUBRID File & Disk Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-disk-manager.md covers design intent and theoretical background for both the file and disk managers. This document traces every branch and field at the code level, centred on file_manager.c with the disk manager as its substrate. Each chapter is self-contained, but reading in order follows the full lifecycle of a single data page — from reserved sector to owning file — inside the kernel.

Contents:

| Ch | Title |
| --- | --- |
| 1 | Data-Structure Map |
| 2 | Initialization and Memory Management |
| 3 | Volume Format and the Sector Allocation Table |
| 4 | Sector Reservation Two-Step Protocol |
| 5 | Volume Extension as a Nested Top Action |
| 6 | File Creation and the Three-Table Layout |
| 7 | Permanent File Page Allocation |
| 8 | Temporary File Page Allocation |
| 9 | Page Deallocation and File Destruction |
| 10 | Numerable Files and the User Page Table |
| 11 | Special Paths: Tempcache, Tracker, Sticky Page, TDE, and Recovery |

This chapter is the field dictionary for the whole document; later chapters trace operations over these structures without re-explaining a field. The reader question: what are all the structures the disk and file managers share, and what does every field mean? For design rationale, read the companion cubrid-disk-manager.md (“Volume layout”, “File architecture”, “Permanent vs temporary purpose split”); this chapter assumes that theory and only names fields.

Two boundaries organize everything. The disk/file boundary: the disk manager owns volumes and hands out sectors (64-page extents); the file manager carves pages from them. The on-disk/in-memory boundary: some structures persist byte-for-byte in pages (disk_volume_header, file_header, file_extensible_data, file_partial_sector); others live only in server heap to summarize or coordinate them (disk_cache, disk_extend_info, disk_stab_cursor, disk_reserve_context).

flowchart TB
  subgraph ondisk["On disk (one per volume)"]
    VH["disk_volume_header<br/>page 0 of every volume"]
    STAB["sector allocation table<br/>bitmap pages, 1 bit per sector"]
  end
  subgraph mem["In memory (one disk_Cache, process-wide)"]
    DC["disk_cache"]
    DC --> VOLS["vols[LOG_MAX_DBVOLID+1]<br/>per-volume disk_cache_volinfo"]
    DC --> PERM["perm_purpose_info<br/>disk_perm_info"]
    DC --> TEMP["temp_purpose_info<br/>disk_temp_info"]
    PERM --> PEI["extend_info: disk_extend_info"]
    TEMP --> TEI["extend_info: disk_extend_info"]
  end
  subgraph transient["Transient (per reserve call / per iteration)"]
    RC["disk_reserve_context"]
    RC --> CVR["cache_vol_reserve[]:<br/>disk_cache_vol_reserve"]
    CUR["disk_stab_cursor"]
  end
  VH -. "cached as" .-> VOLS
  STAB -. "walked by" .-> CUR
  DC -. "drained into" .-> RC

Figure 1-1. Disk-side structure relationships. disk_Cache is the single in-memory summary of all volumes; disk_reserve_context and disk_stab_cursor are transient scratch used while reserving sectors and walking the bitmap.

disk_volume_header — the persisted page-0 of every volume


This is the only disk-manager structure with variable size: it ends in a var_fields[1] region that holds the volume path strings and remarks, so sizeof must never be used on it (the source carries the literal comment DON'T USE sizeof on this structure).

// disk_volume_header -- src/storage/disk_manager.c
struct disk_volume_header
{
char magic[CUBRID_MAGIC_MAX_LENGTH]; /* magic for file/magic Unix utility; DON'T MOVE */
INT16 iopagesize;
INT16 volid;
INT8 db_charset;
INT8 dummy1;
DB_VOLPURPOSE purpose;
DB_VOLTYPE type;
DKNPAGES sect_npgs; /* pages per sector (== DISK_SECTOR_NPAGES = 64) */
DKNSECTS nsect_total;
DKNSECTS nsect_max;
SECTID hint_allocsect;
DKNPAGES stab_npages;
PAGEID stab_first_page;
PAGEID sys_lastpage;
INT32 dummy2;
INT64 db_creation;
INT64 vol_creation;
LOG_LSA chkpt_lsa;
HFID boot_hfid;
INT32 reserved0; INT32 reserved1; INT32 reserved2; INT32 reserved3;
INT16 next_volid;
INT16 offset_to_vol_fullname;
INT16 offset_to_next_vol_fullname;
INT16 offset_to_vol_remarks;
char var_fields[1]; /* variable: vol_fullname, next_vol_fullname, remarks */
};
| Field | Role | Why it exists |
| --- | --- | --- |
| magic | Fixed signature at byte 0 | file/magic(5) and CUBRID’s own check identify a volume by it; must not move. |
| iopagesize | IO page size at format | Sanity check only; authoritative size is in the log. |
| volid | This volume’s id | Self-identification; traces a stray page to its volume. |
| db_charset | Database charset code | Volume must match db charset; checked at attach. |
| dummy1, dummy2 | Alignment padding | |
| purpose | Permanent vs temporary data purpose | Picks which rollup the free space feeds (§1.2). |
| type | Permanent vs temporary volume type | Differs from purpose: a perm-typed volume may serve temp purpose. |
| sect_npgs | Pages per sector | Always 64; stored so the format is self-describing. |
| nsect_total | Sectors currently formatted | Upper bound for sector ids that physically exist now. |
| nsect_max | Max sectors after all extension | Sizes the allocation table once so the bitmap never moves. |
| hint_allocsect | Next sector to scan | Skips known-full prefix of the bitmap. |
| stab_npages | Table length in pages | DISK_STAB_NPAGES(nsect_max); bounds the cursor walk. |
| stab_first_page | First bitmap page id | Table starts after the header; cursor maps offsets via this. |
| sys_lastpage | Last system page | Everything <= sys_lastpage is header+table; user sectors follow. |
| db_creation | DB creation timestamp | Replicated everywhere so a foreign volume can’t be attached. |
| vol_creation | This volume’s creation time | Per-volume provenance. |
| chkpt_lsa | Recovery start LSA | Recovery skips older log records for this volume. |
| boot_hfid | Boot/system heap file id | Bootstraps multivolume access. |
| reserved0..3 | Four spare INT32 | Forward-compatible growth without an offset change. |
| next_volid | Link to next volume | The volume set is a singly linked chain. |
| offset_to_vol_fullname | Offset within var_fields | This volume’s path string. |
| offset_to_next_vol_fullname | Offset within var_fields | Next volume’s path (chain followed without a catalog). |
| offset_to_vol_remarks | Offset within var_fields | Free-text remarks. |
| var_fields | Flexible tail | Holds the three strings; length is DB_PAGESIZE minus the byte offset of var_fields within the page. |

Invariant — the sector allocation table is sized once, for nsect_max, never for nsect_total. stab_npages == DISK_STAB_NPAGES(nsect_max) and sys_lastpage cover header plus the full table at creation. Extension (Ch.5) raises nsect_total toward nsect_max but never touches stab_first_page/stab_npages. If the table could move, every cached disk_stab_cursor.pageid and reserved VSID would dangle.
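
The sizing rule can be made concrete with a minimal sketch. It assumes a bitmap page is used entirely for bits (no per-page header overhead) and uses a hypothetical helper name, stab_npages_for; the real DISK_STAB_NPAGES macro may compute the value differently.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper (not the CUBRID macro): number of bitmap pages needed
 * for a one-bit-per-sector table covering nsect_max sectors. */
static int
stab_npages_for (int nsect_max, int page_size_bytes)
{
  int bits_per_page;

  assert (nsect_max > 0 && page_size_bytes > 0);
  bits_per_page = page_size_bytes * CHAR_BIT;	/* assumption: whole page holds bits */
  return (nsect_max + bits_per_page - 1) / bits_per_page;	/* ceiling division */
}
```

Because the argument is nsect_max rather than nsect_total, the result does not change when a volume is extended, which is the property the invariant depends on.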

disk_cache_volinfo, disk_extend_info, disk_perm_info, disk_temp_info, disk_cache


These five form the in-memory free-space summary. disk_cache is the root; there is exactly one (static DISK_CACHE *disk_Cache).

// disk_cache_volinfo -- src/storage/disk_manager.c
struct disk_cache_volinfo
{
DB_VOLPURPOSE purpose;
DKNSECTS nsect_free; /* hint of free sectors on this volume */
};
| Field (disk_cache_volinfo) | Role | Why it exists |
| --- | --- | --- |
| purpose | Per-volume purpose (perm/temp) | Classifies vols[volid] without reading the volume header. |
| nsect_free | Per-volume free-sector hint | Fast per-volume estimate; the bitmap holds the authoritative count, this is a cache hint. |
// disk_extend_info -- src/storage/disk_manager.c
struct disk_extend_info
{
volatile DKNSECTS nsect_free; /* free sectors across all volumes of this purpose */
volatile DKNSECTS nsect_total;
volatile DKNSECTS nsect_max;
volatile DKNSECTS nsect_intention; /* sectors a thread intends to add by extending */
pthread_mutex_t mutex_reserve;
#if !defined (NDEBUG)
volatile int owner_reserve; /* debug: tid holding mutex_reserve */
#endif
DKNSECTS nsect_vol_max;
VOLID volid_extend;
DB_VOLTYPE voltype;
};
| Field (disk_extend_info) | Role | Why it exists |
| --- | --- | --- |
| nsect_free | Free sectors over all volumes of one purpose | Fast number reservation decrements before touching any bitmap (Ch.4); volatile for cross-thread visibility. |
| nsect_total | Formatted sectors of this purpose | Distinguishes exhausted from merely fragmented. |
| nsect_max | Ceiling of this purpose | Distinguishes “extend existing” from “add new volume”. |
| nsect_intention | Sectors promised but not yet committed by an extender | Prevents thundering-herd extension (Ch.5). |
| mutex_reserve | Lock guarding the four counters | Serializes the hot reservation path. |
| owner_reserve | Debug owner tid | Lock-discipline aid; present in debug builds only (compiled out when NDEBUG is defined). |
| nsect_vol_max | Largest sector count a new volume may take | Caps a single extension’s size. |
| volid_extend | Volume the next extension grows | Cached target, no rescan. |
| voltype | Volume type for this rollup | Tags perm vs temp. |
// disk_perm_info / disk_temp_info -- src/storage/disk_manager.c
struct disk_perm_info { DISK_EXTEND_INFO extend_info; };
struct disk_temp_info {
DISK_EXTEND_INFO extend_info;
DKNSECTS nsect_perm_free; /* free sectors on PERMANENT volumes usable for temp purpose */
DKNSECTS nsect_perm_total;
};
| Field | Role | Why it exists |
| --- | --- | --- |
| disk_perm_info.extend_info | The perm-purpose rollup | All permanent free space funnels here. |
| disk_temp_info.extend_info | The temp-volume rollup | Free space on genuine temp volumes. |
| disk_temp_info.nsect_perm_free | Free temp-usable sectors on perm volumes | Fallback pool when temp volumes exhausted (temp-on-perm); kept separate so temp alloc prefers real temp volumes first. |
| disk_temp_info.nsect_perm_total | Total such sectors | Sizes the fallback pool. |
// disk_cache -- src/storage/disk_manager.c
struct disk_cache
{
int nvols_perm;
int nvols_temp;
DISK_CACHE_VOLINFO vols[LOG_MAX_DBVOLID + 1]; /* per-volume free hint, indexed by volid */
DISK_PERM_PURPOSE_INFO perm_purpose_info;
DISK_TEMP_PURPOSE_INFO temp_purpose_info;
pthread_mutex_t mutex_extend; /* never take while holding a reserve mutex */
#if !defined (NDEBUG)
volatile int owner_extend;
#endif
};
| Field (disk_cache) | Role | Why it exists |
| --- | --- | --- |
| nvols_perm | Number of permanent volumes | Iteration bound / placement. |
| nvols_temp | Number of temporary volumes | Same; temp volumes index from the high end of vols. |
| vols[] | Per-volume disk_cache_volinfo | Direct vols[volid] lookup; sized LOG_MAX_DBVOLID + 1. |
| perm_purpose_info | Permanent rollup | Aggregate perm free space. |
| temp_purpose_info | Temporary rollup | Aggregate temp free space plus perm-fallback. |
| mutex_extend | Lock for volume-set extension | Coarser than mutex_reserve. |
| owner_extend | Debug owner | Present in debug builds only (compiled out when NDEBUG is defined). |

LOG_MAX_DBVOLID is VOLID_MAX - 1 (SHRT_MAX - 1), so vols[] indexes any valid VOLID.

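Invariant (lock ordering): never acquire mutex_extend while holding a mutex_reserve. The disk_cache comment on mutex_extend states this directly. Reservation (frequent) takes mutex_reserve; extension (rare) takes mutex_extend. A thread that discovers under mutex_reserve that it must extend has to release mutex_reserve before taking mutex_extend, so the only permitted nesting is mutex_extend first, then mutex_reserve. If one thread nested the locks one way and a second thread the other way, the two would deadlock. Ch.4 and Ch.5 rely on this.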

The sector allocation table is a bitmap, one bit per sector. The iteration unit is a UINT64.

// DISK_STAB_UNIT -- src/storage/disk_manager.c
typedef UINT64 DISK_STAB_UNIT; /* one 64-bit word of the bitmap */
// disk_stab_cursor -- src/storage/disk_manager.c
struct disk_stab_cursor
{
const DISK_VOLUME_HEADER *volheader;
PAGEID pageid; /* current bitmap page id (real, not table-relative) */
int offset_to_unit;
int offset_to_bit;
SECTID sectid;
PAGE_PTR page; /* fixed bitmap page (NULL until fixed) */
DISK_STAB_UNIT *unit; /* pointer to current unit inside page */
};
| Field | Role | Why it exists |
| --- | --- | --- |
| volheader | Volume being walked | Source of stab_first_page/nsect_total bounds. |
| pageid | Current real bitmap page | From sectid plus stab_first_page. |
| offset_to_unit | Which UINT64 word in the page | DISK_ALLOCTBL_SECTOR_UNIT_OFFSET. |
| offset_to_bit | Which bit in the word | DISK_ALLOCTBL_SECTOR_BIT_OFFSET. |
| sectid | Sector the cursor names | The (page, unit, bit) triple decomposes this. |
| page | Pinned page pointer | NULL = no page fixed; non-NULL = a latch is held. |
| unit | Pointer into page at offset_to_unit | Reads/writes the live word without recomputing the address. |

Invariant — page == NULL iff no latch is held. Crossing a page boundary must unfix the old page before fixing the next and resetting unit. A non-NULL page left after the walk is a leaked latch. Ch.4’s bitmap-commit step depends on this.
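
The way a sectid decomposes into the cursor's (page, unit, bit) triple can be sketched as below. This is a minimal illustration, not the CUBRID macros: it assumes the whole bitmap page holds bits and that the page size is a multiple of 8 bytes; stab_position and stab_locate are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical mirror of the three positional cursor fields. */
typedef struct
{
  int pageid;			/* real page id of the bitmap page */
  int offset_to_unit;		/* index of the 64-bit word inside that page */
  int offset_to_bit;		/* bit index inside that word */
} stab_position;

static stab_position
stab_locate (int sectid, int stab_first_page, int page_size_bytes)
{
  const int bits_per_unit = (int) (sizeof (uint64_t) * CHAR_BIT);	/* 64 */
  const int bits_per_page = page_size_bytes * CHAR_BIT;	/* assumption: no page header */
  stab_position pos;

  assert (sectid >= 0 && bits_per_page % bits_per_unit == 0);
  pos.pageid = stab_first_page + sectid / bits_per_page;
  pos.offset_to_unit = (sectid % bits_per_page) / bits_per_unit;
  pos.offset_to_bit = sectid % bits_per_unit;
  return pos;
}
```

With a 16384-byte page each bitmap page covers 131072 sectors, so sector 131137 lands on the second table page, word 1, bit 1.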

disk_cache_vol_reserve and disk_reserve_context


Transient scratch the two-step reservation (Ch.4) uses. disk_reserve_context lives on the caller’s stack for one reservation.

// disk_cache_vol_reserve -- src/storage/disk_manager.c
struct disk_cache_vol_reserve
{
VOLID volid; /* a volume from which sectors were drawn */
DKNSECTS nsect; /* how many sectors drawn from it */
};
// disk_reserve_context -- src/storage/disk_manager.c
struct disk_reserve_context
{
int nsect_total; /* total sectors this request must reserve */
VSID *vsidp; /* output cursor: next VSID write position */
DISK_CACHE_VOL_RESERVE cache_vol_reserve[VOLID_MAX]; /* per-volume tally drawn from cache */
int n_cache_vol_reserve;
int n_cache_reserve_remaining; /* entries not yet committed to bitmaps */
DKNSECTS nsects_lastvol_remaining; /* sectors still owed on the last volume */
DB_VOLPURPOSE purpose;
};
| Field (disk_reserve_context) | Role | Why it exists |
| --- | --- | --- |
| nsect_total | Sectors the request needs | The loop’s goal. |
| vsidp | Write cursor into the caller’s VSID[] | Each committed sector appends here. |
| cache_vol_reserve[] | Per-volume plan (volume, count) | Step one fills it from cache; step two replays against bitmaps. Sized VOLID_MAX. |
| n_cache_vol_reserve | Count of populated plan entries | Bounds the replay loop. |
| n_cache_reserve_remaining | Entries not yet committed | Enables precise rollback. |
| nsects_lastvol_remaining | Sectors still owed on the current volume | Progress within one entry. |
| purpose | Request purpose | Routes to perm or temp rollup. |

disk_cache_vol_reserve is just a (volid, nsect) pair; an array of these pairs is the reservation plan. DISK_PRERESERVE_BUF_DEFAULT (16) is the default batch the cache reserve fills.
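
How a plan of (volid, nsect) pairs is drawn from per-volume free hints, and how it can be given back, can be sketched as follows. This is a simplified illustration under stated assumptions: the names (vol_reserve, reserve_plan, plan_from_cache, plan_rollback) and the MAX_PLAN bound are hypothetical, there is no locking, and the second step (replaying the plan against the bitmaps) is omitted.

```c
#include <assert.h>

#define MAX_PLAN 8		/* hypothetical bound; the real array is sized VOLID_MAX */

typedef struct
{
  short volid;			/* volume the sectors are drawn from */
  int nsect;			/* how many sectors are drawn from it */
} vol_reserve;

typedef struct
{
  vol_reserve plan[MAX_PLAN];
  int n_plan;			/* populated plan entries */
} reserve_plan;

/* Step one: take sectors from the per-volume hints until the request is met.
 * Returns the number of sectors that could not be planned (0 on success). */
static int
plan_from_cache (int *nsect_free_hint, int nvols, int nsect_wanted, reserve_plan * out)
{
  int volid;

  assert (nsect_wanted >= 0);
  out->n_plan = 0;
  for (volid = 0; volid < nvols && nsect_wanted > 0 && out->n_plan < MAX_PLAN; volid++)
    {
      int take = nsect_free_hint[volid] < nsect_wanted ? nsect_free_hint[volid] : nsect_wanted;

      if (take <= 0)
	continue;
      nsect_free_hint[volid] -= take;	/* the hint is decremented before any bitmap is touched */
      out->plan[out->n_plan].volid = (short) volid;
      out->plan[out->n_plan].nsect = take;
      out->n_plan++;
      nsect_wanted -= take;
    }
  return nsect_wanted;
}

/* Rollback: return every planned entry to the hints. */
static void
plan_rollback (int *nsect_free_hint, reserve_plan * p)
{
  while (p->n_plan > 0)
    {
      p->n_plan--;
      nsect_free_hint[p->plan[p->n_plan].volid] += p->plan[p->n_plan].nsect;
    }
}
```

Keeping the plan as explicit pairs is what makes the rollback exact: each entry records where the sectors came from, so nothing has to be rediscovered on failure.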

flowchart TB
  subgraph hdrpage["File header page (page 0 of a file)"]
    FH["file_header"]
    FH --> TS["tablespace: file_tablespace"]
    FH --> DESC["descriptor: file_descriptors (union, 64 B)"]
    FH -. "offset_to_partial_ftab" .-> PART["partial table:<br/>file_extensible_data of file_partial_sector"]
    FH -. "offset_to_full_ftab" .-> FULL["full table:<br/>file_extensible_data of VSID"]
    FH -. "offset_to_user_page_ftab" .-> UPT["user page table (numerable):<br/>file_extensible_data of VPID"]
  end
  PART --> PS["file_partial_sector<br/>{ vsid, page_bitmap }"]
  PART -. "vpid_next" .-> MOREP["overflow extdata page"]

Figure 1-2. File-side structure relationships. The header page embeds file_header, which carries the tablespace policy, the typed descriptor, and three byte-offsets into the three extensible tables co-located in the same page (overflowing via vpid_next).

file_header — the persisted page-0 of every file

// file_header -- src/storage/file_manager.c
struct file_header
{
INT64 time_creation;
VFID self; /* this file's own VFID */
FILE_TABLESPACE tablespace;
FILE_DESCRIPTORS descriptor;
/* Page counts. */
int n_page_total;
int n_page_user;
int n_page_ftab;
int n_page_free; /* reserved-on-disk, not yet allocated */
int n_page_mark_delete; /* numerable: pages marked deleted */
/* Sector counts. */
int n_sector_total;
int n_sector_partial;
int n_sector_full;
int n_sector_empty; /* empty sectors are a subset of partial */
FILE_TYPE type;
INT32 file_flags; /* NUMERABLE / TEMPORARY / ENCRYPTED_* */
VOLID volid_last_expand;
INT16 offset_to_partial_ftab;
INT16 offset_to_full_ftab;
INT16 offset_to_user_page_ftab; /* user page table (numerable only) */
VPID vpid_sticky_first; /* first page if sticky; never deallocated */
/* Temporary files: last-allocation cursor. */
VPID vpid_last_temp_alloc;
int offset_to_last_temp_alloc;
/* Numerable files. */
VPID vpid_last_user_page_ftab; /* last page of user page table (append point) */
VPID vpid_find_nth_last; /* cache: page of last find_nth result */
int first_index_find_nth_last; /* cache: index of first entry in that page */
INT32 reserved0; INT32 reserved1; INT32 reserved2; INT32 reserved3;
};
| Field | Role | Why it exists |
| --- | --- | --- |
| time_creation | File creation timestamp | Provenance. |
| self | The file’s own VFID | A header page in isolation knows its file. |
| tablespace | Expansion policy (below) | Drives growth aggressiveness. |
| descriptor | Typed metadata union (below) | Each type stashes its ids here. |
| n_page_total | Pages owned (user + table + free) | Master accounting. |
| n_page_user | Pages handed to the owner | The useful count. |
| n_page_ftab | Pages used by the three tables | Overhead; grows on overflow. |
| n_page_free | Reserved-but-unallocated pages | Available without a new reservation. |
| n_page_mark_delete | Numerable pages marked deleted | Numerable files flag pages as deleted rather than removing them (Ch.10). |
| n_sector_total | Sectors reserved | = n_sector_partial + n_sector_full. |
| n_sector_partial | Partial sectors | Have a free page; in the partial table. |
| n_sector_full | Full sectors | All 64 pages used; in the full table (perm only). |
| n_sector_empty | Sectors with zero allocated pages | Subset of partial; reclaimed first on extension. |
| type | FILE_TYPE | Selects table layout and numerable/temp eligibility. |
| file_flags | Bit flags | NUMERABLE 0x1, TEMPORARY 0x2, ENCRYPTED_AES 0x4, ENCRYPTED_ARIA 0x8; via FILE_IS_*. |
| volid_last_expand | Volume last grown | Locality hint for next expansion. |
| offset_to_partial_ftab | Offset to partial table in this page | FILE_HEADER_GET_PART_FTAB asserts range. |
| offset_to_full_ftab | Offset to full table | Asserted non-temporary (temp has no full table). |
| offset_to_user_page_ftab | Offset to user page table | Numerable only; asserted numerable. |
| vpid_sticky_first | First page, if sticky | Never deallocated (Ch.11). |
| vpid_last_temp_alloc | Temp alloc cursor: page | Temp files alloc forward, never dealloc (Ch.8). |
| offset_to_last_temp_alloc | Temp alloc cursor: sector offset | Offset component of that cursor. |
| vpid_last_user_page_ftab | Numerable: append point | New user pages appended here (Ch.10). |
| vpid_find_nth_last | Numerable: cached find-nth page | Optimizes sequential find-nth (Ch.10). |
| first_index_find_nth_last | Numerable: cached first-entry index | Companion to the cache above. |
| reserved0..3 | Four spare INT32 | Forward compatibility. |

Invariant — accounting balances: n_page_total == n_page_user + n_page_ftab + n_page_free, n_sector_total == n_sector_partial + n_sector_full, and n_sector_empty <= n_sector_partial. Every alloc/dealloc in Ch.7–Ch.9 adjusts these as a set under the header latch. Drift means the file believes it owns space it does not, or leaks it; file_validate and the FILE_HEADER_GET_*_FTAB assertions guard. The empty-subset relation lets extension prefer empty sectors without a separate table.
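
The three identities can be written as a single predicate. The sketch below uses a hypothetical file_counters struct holding only the counters involved; it is not the real file_header and not the real file_validate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of file_header: just the counters in the invariant. */
typedef struct
{
  int n_page_total, n_page_user, n_page_ftab, n_page_free;
  int n_sector_total, n_sector_partial, n_sector_full, n_sector_empty;
} file_counters;

/* True when all three accounting relations hold. */
static bool
file_counters_balanced (const file_counters * c)
{
  return c->n_page_total == c->n_page_user + c->n_page_ftab + c->n_page_free
    && c->n_sector_total == c->n_sector_partial + c->n_sector_full
    && c->n_sector_empty <= c->n_sector_partial;
}

/* Handing one already-reserved page to the owner moves one unit from the
 * free pool to the user count; n_page_total is unchanged. */
static void
account_user_page_alloc (file_counters * c)
{
  assert (c->n_page_free > 0);
  c->n_page_free--;
  c->n_page_user++;
}
```

Updating the counters in pairs like this is what "adjusts these as a set" means in practice: a change to one counter alone breaks the predicate.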

file_extensible_data — the generic multi-page table component


All three file tables are file_extensible_data: a small header followed by an array of fixed-size items, chained page-to-page.

// file_extensible_data -- src/storage/file_manager.c
struct file_extensible_data
{
VPID vpid_next; /* next component page, NULL if last */
INT16 max_size; /* capacity in bytes for items in this component */
INT16 size_of_item; /* byte size of one item */
INT16 n_items; /* number of items currently stored */
};
| Field | Role | Why it exists |
| --- | --- | --- |
| vpid_next | Link to next component | Chains overflow pages; NULL terminates. |
| max_size | Byte capacity here | Bounds n_items. |
| size_of_item | Size of one item | Partial table = file_partial_sector (16 B), full = VSID (8 B), user-page = VPID (8 B). One format, three item types. |
| n_items | Items stored | Drives iteration; insert/delete bump it. |

Invariant — n_items * size_of_item <= max_size, items kept densely packed from FILE_EXTDATA_HEADER_ALIGNED_SIZE. An insert that would overflow allocates a new component linked via vpid_next; a delete shifts the tail down. The density invariant is what lets find-nth index by position (Ch.6–Ch.10).
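
A minimal sketch of the capacity check and the dense-packing rule follows. It models a single component with a small fixed buffer standing in for the page body; the extdata type and function names are hypothetical, and chaining to a new component through vpid_next is reduced to a -1 return.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical single component: header fields plus an inline item area. */
typedef struct
{
  int16_t max_size;		/* byte capacity for items */
  int16_t size_of_item;		/* byte size of one item */
  int16_t n_items;		/* items currently stored */
  unsigned char items[64];	/* stand-in for the page body */
} extdata;

static int
extdata_is_full (const extdata * e)
{
  return (e->n_items + 1) * e->size_of_item > e->max_size;
}

/* Append one item; returns 0 on success, -1 when a new component would be needed. */
static int
extdata_append (extdata * e, const void *item)
{
  assert (e->max_size <= (int) sizeof (e->items));
  if (extdata_is_full (e))
    return -1;
  memcpy (e->items + e->n_items * e->size_of_item, item, (size_t) e->size_of_item);
  e->n_items++;
  return 0;
}

/* Remove the item at index, shifting the tail down so items stay dense. */
static void
extdata_remove_at (extdata * e, int index)
{
  unsigned char *dst;

  assert (index >= 0 && index < e->n_items);
  dst = e->items + index * e->size_of_item;
  memmove (dst, dst + e->size_of_item, (size_t) ((e->n_items - index - 1) * e->size_of_item));
  e->n_items--;
}
```

After a removal the item that was at position k + 1 sits at position k, so positional lookup needs no free-slot bookkeeping.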

file_partial_sector, FILE_ALLOC_BITMAP, file_tablespace, file_descriptors

// FILE_ALLOC_BITMAP -- src/storage/file_manager.h
typedef UINT64 FILE_ALLOC_BITMAP; /* one bit per page in a sector (64 pages) */
#define FILE_FULL_PAGE_BITMAP 0xFFFFFFFFFFFFFFFF /* Full allocation bitmap */
#define FILE_EMPTY_PAGE_BITMAP 0x0000000000000000 /* Empty allocation bitmap */
#define FILE_ALLOC_BITMAP_NBITS ((int) (sizeof (FILE_ALLOC_BITMAP) * CHAR_BIT)) /* 64 */
// file_partial_sector -- src/storage/file_manager.h
struct file_partial_sector
{
VSID vsid; /* MUST be first member: reinterpreted as VSID in file table */
FILE_ALLOC_BITMAP page_bitmap;
};

VSID is { int32_t sectid; short volid; } = 6 bytes padded to 8; FILE_ALLOC_BITMAP is a UINT64 = 8 bytes; so sizeof (file_partial_sector) == 16.

| Field | Role | Why it exists |
| --- | --- | --- |
| file_partial_sector.vsid | The reserved sector’s id (8 B) | First member by contract: the full table stores bare VSID, so a file_partial_sector* is reinterpreted as VSID* on promotion. |
| file_partial_sector.page_bitmap | 64-bit page allocation map (8 B) | Bit i set = page i allocated. FILE_FULL_PAGE_BITMAP = full; FILE_EMPTY_PAGE_BITMAP = empty. |

Invariant — vsid is the first member, deliberately. The source comment: “VSID must be first member … the FILE_PARTIAL_SECTOR pointers in file table are reinterpreted as VSID.” When a partial sector fills, the file manager moves the leading VSID bytes into the full table without copying the bitmap. Reordering would corrupt the full table silently. FILE_ALLOC_BITMAP_NBITS == DISK_SECTOR_NPAGES == 64, so one bitmap covers one sector exactly.
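
The layout contract can be checked in a few lines. The sketch below uses stand-in types (vsid_t, partial_sector) with the field shapes described above; the helper names are hypothetical, and the 8-byte and 16-byte sizes assume a typical ABI with 4-byte int32 alignment and 8-byte uint64 alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
  int32_t sectid;
  int16_t volid;
} vsid_t;			/* 6 bytes of data, padded to 8 on typical ABIs */

typedef uint64_t alloc_bitmap;	/* one bit per page of a 64-page sector */

typedef struct
{
  vsid_t vsid;			/* must stay the first member */
  alloc_bitmap page_bitmap;
} partial_sector;

#define FULL_PAGE_BITMAP UINT64_C (0xFFFFFFFFFFFFFFFF)

/* Mark one page of the sector as allocated. */
static void
partial_set_page (partial_sector * ps, int page_offset)
{
  assert (page_offset >= 0 && page_offset < 64);
  ps->page_bitmap |= UINT64_C (1) << page_offset;
}

static int
partial_is_full (const partial_sector * ps)
{
  return ps->page_bitmap == FULL_PAGE_BITMAP;
}

/* Promotion: the leading bytes of the entry are exactly the VSID that a
 * full-table entry stores, so the pointer can be read as a VSID pointer. */
static vsid_t
partial_promote (const partial_sector * ps)
{
  assert (offsetof (partial_sector, vsid) == 0);
  return *(const vsid_t *) ps;
}
```

If vsid were moved after page_bitmap, the cast in partial_promote would read bitmap bytes as a sector id, which is the silent corruption the invariant warns about.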

// file_tablespace -- src/storage/file_manager.h
struct file_tablespace
{
INT64 initial_size; /* bytes the file starts with */
float expand_ratio; /* fraction of current size to add when expanding */
int expand_min_size; /* lower clamp on an expansion, in bytes */
int expand_max_size; /* upper clamp on an expansion, in bytes */
};
| Field | Role | Why it exists |
| --- | --- | --- |
| initial_size | Starting byte size | MAX(1, npages) * DB_PAGESIZE at create. |
| expand_ratio | Growth fraction | ~1% of current size for perm (FILE_TABLESPACE_DEFAULT_RATIO_EXPAND); 0 for temp. |
| expand_min_size | Minimum expansion | At least one sector for perm; 0 for temp. |
| expand_max_size | Maximum expansion | Caps one growth (1024 sectors perm; 0 temp). |

Temp files use FILE_TABLESPACE_FOR_TEMP_NPAGES, zeroing ratio/min/max — temp files do not auto-expand the same way.
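
The interaction of ratio and clamps can be sketched as below. This is an assumption-laden illustration: the struct mirrors the four fields, but expansion_bytes is a hypothetical function, and how the real file manager rounds the result to whole sectors is not shown in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of file_tablespace. */
typedef struct
{
  int64_t initial_size;		/* bytes the file starts with */
  float expand_ratio;		/* fraction of current size to add */
  int expand_min_size;		/* lower clamp, in bytes */
  int expand_max_size;		/* upper clamp, in bytes */
} tablespace;

/* Bytes to add on one expansion: ratio of the current size, clamped to
 * [expand_min_size, expand_max_size]. */
static int64_t
expansion_bytes (const tablespace * ts, int64_t current_size_bytes)
{
  int64_t want;

  assert (ts->expand_min_size <= ts->expand_max_size);
  want = (int64_t) ((double) current_size_bytes * ts->expand_ratio);
  if (want < ts->expand_min_size)
    want = ts->expand_min_size;
  if (want > ts->expand_max_size)
    want = ts->expand_max_size;
  return want;
}
```

With ratio, min, and max all zero, as for temporary files, the function returns 0 for any current size, which matches the statement that temp files do not auto-expand through this policy.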

// file_descriptors -- src/storage/file_manager.h
/* note: if you change file descriptors size, make sure to change disk compatibility version too! */
#define FILE_DESCRIPTORS_SIZE 64
union file_descriptors
{
FILE_HEAP_DES heap;
FILE_OVF_HEAP_DES heap_overflow;
FILE_BTREE_DES btree;
FILE_OVF_BTREE_DES btree_key_overflow; /* TODO: rename FILE_OVF_BTREE_DES */
FILE_EHASH_DES ehash;
FILE_VACUUM_DATA_DES vacuum_data;
char dummy_align[FILE_DESCRIPTORS_SIZE];
};

The per-member struct shapes below are added annotations (the source defines each FILE_*_DES separately, not inline):

| Member | Shape (annotation) | Role | Why it exists |
| --- | --- | --- | --- |
| heap | { OID class_oid; HFID hfid; } | Heap file’s class OID + HFID | A heap file points back to its class and heap id. |
| heap_overflow | { HFID hfid; OID class_oid; } | Overflow heap’s HFID + class OID | Overflow records for large heap rows. |
| btree | { OID class_oid; int attr_id; } | Index file’s class OID + attribute id | A btree file knows the indexed class and attribute. |
| btree_key_overflow | { BTID btid; OID class_oid; } | Long-key overflow file (file_ovf_btree_des) | Long keys overflow into a separate file. |
| ehash | { OID class_oid; int attr_id; } | Extensible hash’s class OID + attr id | Identifies the hashed attribute. |
| vacuum_data | { VPID vpid_first; } | First VPID of vacuum data | Vacuum’s bookkeeping file. |
| dummy_align | char[FILE_DESCRIPTORS_SIZE] | 64-byte padding | Pins the union at FILE_DESCRIPTORS_SIZE; the source ties this size to the on-disk compatibility version, so it must not change casually. |

The union is interpreted per file_header.type. FILE_TYPE_CAN_BE_NUMERABLE, FILE_TYPE_IS_ALWAYS_TEMP, and the file_flags bits decide which tables the file actually carries — covered in Ch.6 and Ch.10.

  1. There are two persisted page-0 structures — disk_volume_header (one per volume) and file_header (one per file) — and the rest either summarize them in memory (disk_cache family) or are scratch for one operation (disk_reserve_context, disk_stab_cursor).
  2. The disk manager hands out sectors (64-page extents) tracked by a one-bit-per-sector table sized once for nsect_max; the table is immovable, which is why every reserved VSID and cached cursor stays valid across volume extension.
  3. disk_cache is the single in-memory free-space oracle: vols[] per-volume hints feed two purpose rollups (disk_perm_info, disk_temp_info), and the rule that mutex_extend is never taken while a mutex_reserve is held is a hard invariant against deadlock.
  4. Sector reservation is two-step: disk_reserve_context drains a plan from the cache into cache_vol_reserve[], then replays it against the bitmaps via a disk_stab_cursor; the *_remaining counters make a partial reservation precisely reversible.
  5. The file manager carves pages from reserved sectors using three file_extensible_data tables — partial, full, user-page — the same chained, densely-packed, fixed-item format differing only in size_of_item.
  6. file_partial_sector (16 B) puts vsid first on purpose so a filled sector can be promoted to the full table by reinterpreting the pointer as a bare VSID; its 64-bit page_bitmap maps exactly one sector’s 64 pages.
  7. file_header’s page and sector counters must balance as accounting identities; its three offset_to_*_ftab, the temp-alloc cursor, and the numerable find-nth cache are the only state distinguishing regular, temporary, and numerable files — operational meaning deferred to Ch.7, Ch.8, and Ch.10.

Chapter 2: Initialization and Memory Management


The in-memory structures of Chapter 1 (the disk_cache family) have no on-disk form; the source of truth is the per-volume header plus its sector allocation table (Chapter 3), and disk_Cache is a derived rollup recomputed from those headers at every boot. This chapter answers: where do disk_Cache and the file-manager globals come from at server start, and how is the cache rebuilt by walking the mounted-volume chain? For why CUBRID keeps a coarse RAM counter, see the companion’s “In-memory cache” section.

Two modules wake up at boot: the disk manager (owns disk_Cache, disk_manager_init) reconstructs state from disk; the file manager (owns file_Tempcache and the tracker globals, file_manager_init) only zeroes RAM.

flowchart TD
  boot["server boot"] --> dmi["disk_manager_init(load_from_disk=true)"]
  dmi --> dci["disk_cache_init -> malloc disk_Cache"]
  dci --> dclav["disk_cache_load_all_volumes"]
  dclav --> fmm["fileio_map_mounted: walk mounted volumes"]
  fmm --> dclv["disk_cache_load_volume (per volid)"]
  dclv --> dvb["disk_volume_boot: read header + count free"]
  boot --> fmi["file_manager_init"]
  fmi --> ftci["file_tempcache_init -> zero file_Tempcache"]

Figure 2-1. Boot-time initialization fan-out.

2.2 disk_manager_init — parameter capture, reload guard, optional load


disk_manager_init does four things in order: derive the temp-volume sector cap, capture the logging flag, (re)allocate the cache, and conditionally load from disk.

// disk_manager_init -- src/storage/disk_manager.c
int
disk_manager_init (THREAD_ENTRY * thread_p, bool load_from_disk)
{
  int error_code = NO_ERROR;

  disk_Temp_max_sects = (DKNSECTS) prm_get_integer_value (PRM_ID_BOSR_MAXTMP_PAGES);
  if (disk_Temp_max_sects < 0)
    disk_Temp_max_sects = SECTID_MAX;	/* <- negative param means "no cap" (infinite) */
  else
    disk_Temp_max_sects = disk_Temp_max_sects / DISK_SECTOR_NPAGES;	/* <- pages -> sectors */

  // ... condensed: disk_Logging = prm_get_bool_value (PRM_ID_DISK_LOGGING) ...

  if (disk_Cache != NULL)
    disk_cache_final ();	/* <- idempotent reload: tear down stale cache first */

  error_code = disk_cache_init ();
  if (error_code != NO_ERROR)
    {
      ASSERT_ERROR ();
      return error_code;	/* <- malloc failure: nothing to clean up */
    }
  assert (disk_Cache != NULL);

  if (load_from_disk && !disk_cache_load_all_volumes (thread_p))
    {
      ASSERT_ERROR_AND_SET (error_code);
      disk_manager_final ();	/* <- partial load failed: roll the whole cache back */
      return error_code;
    }
  return NO_ERROR;
}

Branch accounting:

| Branch | Condition | Effect |
| --- | --- | --- |
| disk_Temp_max_sects < 0 | param negative (default -1) | cap = SECTID_MAX -> infinite temp space |
| else | param >= 0 | param is a page count; / DISK_SECTOR_NPAGES -> sector cap |
| disk_Cache != NULL | prior cache exists (reload) | disk_cache_final frees it first; makes init idempotent |
| disk_cache_init != NO_ERROR | malloc failed | early return; nothing allocated to free |
| load_from_disk && load fails | a volume failed to boot | disk_manager_final frees the half-cache, propagate error |
| load_from_disk == false | first-volume format path | cache stays empty; caller fills it manually |

The static initializer static DKNSECTS disk_Temp_max_sects = -2; is a pre-init sentinel (“not yet computed”), distinct from the parameter default -1 (“Infinite”). disk_manager_init always overwrites it from PRM_ID_BOSR_MAXTMP_PAGES (temp_file_max_size_in_pages) per the branch table; this later bounds permanent-volume-as-temp growth.
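
The pages-to-sectors conversion from the first two rows of the branch table can be isolated as a small sketch. The function name and the NO_CAP stand-in are hypothetical; the actual value of SECTID_MAX is not given here, so INT_MAX is used purely for illustration.

```c
#include <assert.h>
#include <limits.h>

#define SECTOR_NPAGES 64	/* pages per sector, as stated in Chapter 1 */
#define NO_CAP INT_MAX		/* illustrative stand-in for SECTID_MAX */

/* Convert the page-count parameter into a sector cap. */
static int
temp_max_sects_from_param (int max_tmp_pages)
{
  if (max_tmp_pages < 0)
    return NO_CAP;		/* negative means "no cap" */
  return max_tmp_pages / SECTOR_NPAGES;	/* integer division rounds down */
}
```

One consequence of the integer division shown in the source is that a non-negative setting below 64 pages yields a cap of zero sectors.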

Invariant — the reload path is destructive-then-rebuilding. disk_manager_init may run more than once (reload after recovery phases), so it must never leak the old cache. The disk_Cache != NULL guard calls disk_cache_final first; without it a second init leaks the previous allocation and its three mutexes.

2.3 disk_cache_init — allocating and zeroing the global cache


disk_cache_init is the only allocator of disk_Cache. It mallocs one flat DISK_CACHE (the vols[] array is inline, with LOG_MAX_DBVOLID + 1 slots), then zeroes every counter so the per-volume load can simply add into the rollup.

// disk_cache_init -- src/storage/disk_manager.c
static int
disk_cache_init (void)
{
  int i;

  assert (disk_Cache == NULL);	/* <- never double-allocate */
  disk_Cache = (DISK_CACHE *) malloc (sizeof (DISK_CACHE));
  if (disk_Cache == NULL)
    { /* ... er_set OUT_OF_VIRTUAL_MEMORY, return ER_OUT_OF_VIRTUAL_MEMORY ... */ }

  disk_Cache->nvols_perm = disk_Cache->nvols_temp = 0;
  disk_Cache->perm_purpose_info.extend_info.nsect_vol_max =	/* default new-vol size */
    DISK_SECTS_ROUND_UP ((DKNSECTS) (prm_get_bigint_value (PRM_ID_DB_VOLUME_SIZE) / IO_SECTORSIZE));
  // ... condensed: perm free/total/max = 0 (load ADDS in); volid_extend = NULL_VOLID; voltype = PERM ...
  // ... condensed: temp extend_info same vol_max, zeroed, NULL_VOLID; nsect_perm_free/total = 0 ...
  // ... condensed: 3 pthread_mutex_init (perm/temp mutex_reserve, mutex_extend) ...

  for (i = 0; i <= LOG_MAX_DBVOLID; i++)	/* <- inclusive of highest legal volid */
    {
      disk_Cache->vols[i].purpose = DISK_UNKNOWN_PURPOSE;	/* <- every slot starts "no volume here" */
      disk_Cache->vols[i].nsect_free = 0;
    }
  return NO_ERROR;
}

nsect_vol_max (both purposes) is the default new-volume size for later auto-extension, not a current value. Both volid_extend start NULL_VOLID (discovered during load), both nvols_* start 0, and every slot starts DISK_UNKNOWN_PURPOSE / zero free. Since load only adds, a fresh disk_cache_init must precede any load.

2.4 disk_cache_load_all_volumes — walking the mounted-volume chain


disk_cache_load_all_volumes is a thin wrapper — it asserts the cache exists and returns fileio_map_mounted (thread_p, disk_cache_load_volume, NULL), handing the per-volume callback to the chain walker.

fileio_map_mounted (in file_io.c) is that walker. It iterates the file-IO volume-info header in two passes: permanent volumes ascending from volid 0 up to next_perm_volid - 1, then temporary volumes descending to next_temp_volid (the file-IO equivalent of the on-disk next_volid chain). Unmounted slots (vol_info_p->vdes == NULL_VOLDES) are skipped. If the callback returns false, the walk stops and returns false, which disk_manager_init treats as fatal.

flowchart TD
  start["fileio_map_mounted"] --> permloop{"perm volid <= next_perm_volid-1?"}
  permloop -- "vdes live" --> cb1["disk_cache_load_volume(volid)"]
  permloop -- "skip / done" --> temploop{"temp volid >= next_temp_volid?"}
  cb1 -- false --> stopf["return false"]
  cb1 -- true --> permloop
  temploop -- "vdes live" --> cb2["disk_cache_load_volume(volid)"]
  cb2 -- false --> stopf
  cb2 -- true --> temploop
  temploop -- done --> okt["return true"]

Figure 2-2. fileio_map_mounted two-pass walk driving the cache load.
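
The two-pass walk in Figure 2-2 can be modelled with a small table and a callback. This is a toy sketch, not the file_io.c implementation: every toy_* name is hypothetical, the volume table is a fixed array, and next_temp_volid is treated as the lowest volid the temp pass visits, mirroring the bound shown in the figure.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NULL_VOLDES (-1)	/* stand-in for NULL_VOLDES */
#define TOY_MAX_VOLS 16

typedef bool (*toy_vol_func) (int volid, void *arg);

typedef struct
{
  int vdes[TOY_MAX_VOLS];	/* TOY_NULL_VOLDES means the slot is not mounted */
  int next_perm_volid;		/* perm pass visits 0 .. next_perm_volid - 1 */
  int next_temp_volid;		/* temp pass visits TOY_MAX_VOLS - 1 down to this */
} toy_vol_table;

typedef struct
{
  int visited[TOY_MAX_VOLS];
  int n;
  int fail_at;			/* volid whose callback returns false, or -1 */
} toy_visit_log;

/* Start with every slot unmounted and both passes empty. */
void
toy_vol_table_init (toy_vol_table * t)
{
  int i;
  for (i = 0; i < TOY_MAX_VOLS; i++)
    t->vdes[i] = TOY_NULL_VOLDES;
  t->next_perm_volid = 0;
  t->next_temp_volid = TOY_MAX_VOLS;
}

/* Perm ascending, then temp descending; unmounted slots are skipped and
 * the first callback returning false stops the whole walk. */
bool
toy_map_mounted (const toy_vol_table * t, toy_vol_func f, void *arg)
{
  int volid;

  for (volid = 0; volid < t->next_perm_volid; volid++)
    if (t->vdes[volid] != TOY_NULL_VOLDES && !f (volid, arg))
      return false;
  for (volid = TOY_MAX_VOLS - 1; volid >= t->next_temp_volid; volid--)
    if (t->vdes[volid] != TOY_NULL_VOLDES && !f (volid, arg))
      return false;
  return true;
}

/* Example callback: record the visit, fail on one chosen volid. */
bool
toy_log_visit (int volid, void *arg)
{
  toy_visit_log *log = (toy_visit_log *) arg;
  log->visited[log->n++] = volid;
  return volid != log->fail_at;
}
```

With perm volumes 0 and 2 mounted (slot 1 unmounted) and temp volumes 15 and 14 mounted, the callback sees 0, 2, 15, 14 in that order; a false return at volid 2 ends the walk before any temp volume is visited.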

2.5 disk_cache_load_volume — rolling one header into the rollup

The heart of cache reconstruction. Per volume it boots the header via disk_volume_boot (reads the header, counts free sectors — Chapter 3), then folds the result into the right purpose info.

// disk_cache_load_volume -- src/storage/disk_manager.c
static bool
disk_cache_load_volume (THREAD_ENTRY * thread_p, INT16 volid, void *ignore)
{
DB_VOLPURPOSE vol_purpose;
DB_VOLTYPE vol_type;
DISK_VOLUME_SPACE_INFO space_info = DISK_VOLUME_SPACE_INFO_INITIALIZER;
if (disk_volume_boot (thread_p, volid, &vol_purpose, &vol_type, &space_info) != NO_ERROR)
{
ASSERT_ERROR ();
return false; /* <- aborts the whole map walk */
}
if (vol_type != DB_PERMANENT_VOLTYPE)
{
/* don't save temporary volumes... they will be dropped anyway */
return true; /* <- temp-type volumes are not cached at all */
}
if (vol_purpose == DB_PERMANENT_DATA_PURPOSE)
{
// perm_purpose_info.extend_info.nsect_{free,total,max} += space_info.n_{free,total,max}_sects
// ... condensed: assert nsect_free <= nsect_total <= nsect_max ...
if (space_info.n_total_sects < space_info.n_max_sects)
{
assert (disk_Cache->perm_purpose_info.extend_info.volid_extend == NULL_VOLID);
disk_Cache->perm_purpose_info.extend_info.volid_extend = volid; /* <- this vol can still grow */
}
}
else /* perm type, temp purpose */
{
assert (space_info.n_total_sects == space_info.n_max_sects); /* <- perm-as-temp is fully grown */
// temp_purpose_info.nsect_perm_{free,total} += space_info.n_{free,total}_sects
// ... condensed: assert nsect_perm_free <= nsect_perm_total ...
}
disk_Cache->vols[volid].nsect_free = space_info.n_free_sects;
disk_Cache->vols[volid].purpose = vol_purpose;
disk_Cache->nvols_perm++; /* <- runs for BOTH branches above */
return true;
}

Branch accounting:

| Branch | Condition | Effect |
| --- | --- | --- |
| boot fails | disk_volume_boot != NO_ERROR | return false; map walk and whole init abort |
| vol_type != DB_PERMANENT_VOLTYPE | temporary-type volume | return true; not cached (dropped/reformatted at boot) |
| vol_purpose == DB_PERMANENT_DATA_PURPOSE | perm volume, perm data | add free/total/max into perm_purpose_info.extend_info; if below max size, set volid_extend |
| else (perm type, temp purpose) | perm volume repurposed for temp | add free/total into temp_purpose_info.nsect_perm_*; assert fully grown |

The else-branch is the subtle case: type (survives restart?) and purpose (what it holds) are orthogonal. A perm-type/temp-purpose volume’s space rolls into nsect_perm_* (“permanent sectors lent to temp”), distinct from temp_purpose_info.extend_info (genuine temporary-type volumes, skipped above). The perm-path assert (... == NULL_VOLID) enforces at most one permanent volume “growing”. After the if/else, the slot recording (vols[volid].*) and nvols_perm++ run unconditionally for every permanent-TYPE volume regardless of purpose — so a perm-as-temp volume is counted in nvols_perm, never nvols_temp; since temporary-type volumes returned early, after a full load nvols_temp == 0.

Invariant — the cache is a derived rollup and may legitimately undercount. nsect_free is allowed to be lower than reality at any time; the two-step reservation protocol (Chapter 4) depends on this — a reservation may pessimistically decrement the cache and reconcile against the allocation table later. Never treat nsect_free as exact; the allocation table is the source of truth.
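
The type-versus-purpose fold can be reproduced on a toy cache. This is a sketch under stated assumptions, not the disk_manager.c code: the toy_* types are hypothetical, the per-slot vols[] recording is omitted, and -1 plays the role of NULL_VOLID.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_PERM_TYPE, TOY_TEMP_TYPE } toy_voltype;
typedef enum { TOY_PERM_PURPOSE, TOY_TEMP_PURPOSE } toy_purpose;

typedef struct
{
  toy_voltype type;
  toy_purpose purpose;
  int n_free, n_total, n_max;
} toy_vol;

typedef struct
{
  int nvols_perm, nvols_temp;
  int perm_free, perm_total, perm_max;	/* models perm_purpose_info.extend_info */
  int temp_perm_free, temp_perm_total;	/* models temp_purpose_info.nsect_perm_* */
  int volid_extend;		/* -1 models NULL_VOLID */
} toy_cache;

/* Zero every counter so that loading only ever adds. */
void
toy_cache_init (toy_cache * c)
{
  c->nvols_perm = c->nvols_temp = 0;
  c->perm_free = c->perm_total = c->perm_max = 0;
  c->temp_perm_free = c->temp_perm_total = 0;
  c->volid_extend = -1;
}

bool
toy_cache_load_volume (toy_cache * c, int volid, const toy_vol * v)
{
  if (v->type != TOY_PERM_TYPE)
    return true;		/* temporary-type volume: not cached at all */
  if (v->purpose == TOY_PERM_PURPOSE)
    {
      c->perm_free += v->n_free;
      c->perm_total += v->n_total;
      c->perm_max += v->n_max;
      if (v->n_total < v->n_max)
	{
	  assert (c->volid_extend == -1);	/* at most one growing volume */
	  c->volid_extend = volid;
	}
    }
  else
    {
      assert (v->n_total == v->n_max);	/* perm-as-temp is fully grown */
      c->temp_perm_free += v->n_free;
      c->temp_perm_total += v->n_total;
    }
  c->nvols_perm++;		/* counted for both purposes */
  return true;
}
```

Loading one perm-data volume, one perm-type/temp-purpose volume, and one temporary-type volume leaves nvols_perm at 2 and nvols_temp at 0, which is the counting behaviour described above.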

2.6 disk_manager_final / disk_cache_final — teardown

Teardown is branch-light; disk_manager_final delegates to disk_cache_final.

// disk_manager_final -- src/storage/disk_manager.c
void disk_manager_final (void) { disk_cache_final (); }
// disk_cache_final -- src/storage/disk_manager.c
static void
disk_cache_final (void)
{
if (disk_Cache == NULL)
{
return; /* <- safe to call when never initialized */
}
// ... condensed: assert perm/temp owner_reserve == -1 and owner_extend == -1 (no lock held at teardown) ...
// ... condensed: pthread_mutex_destroy the perm/temp mutex_reserve and mutex_extend ...
free_and_init (disk_Cache); /* <- frees and NULLs the pointer */
}

The disk_Cache == NULL guard makes final idempotent, which is why both the reload path and the load-failure rollback call it unconditionally. The three owner_* asserts (debug only) document that no thread may hold the reserve or extend mutex at teardown — a violation is caught here, not as a destroyed locked mutex. free_and_init zeroes the pointer so a later disk_cache_init passes assert (disk_Cache == NULL).

2.7 file_manager_init / file_manager_final and the file-manager globals

The file manager reconstructs nothing from disk: it captures one logging flag, sanity-checks a size assumption, and initializes the temporary-file cache.

// file_manager_init -- src/storage/file_manager.c
int
file_manager_init (void)
{
file_Logging = prm_get_bool_value (PRM_ID_FILE_LOGGING);
assert (FILE_DESCRIPTORS_SIZE == sizeof (FILE_DESCRIPTORS)); /* <- layout self-check */
return file_tempcache_init ();
}
// file_manager_final -- src/storage/file_manager.c
void file_manager_final (void) { file_tempcache_final (); }

file_manager_init does not touch file_Tracker_vfid / file_Tracker_vpid; they are statically zero-initialized (VFID_INITIALIZER / VPID_INITIALIZER) and only filled when the tracker file is created or located during boot (Chapters 6 and 9). file_Tempcache is likewise static, “empty” until file_tempcache_init populates it:

// file_tempcache_init -- src/storage/file_manager.c
static int
file_tempcache_init (void)
{
int ntrans = logtb_get_number_of_total_tran_indices () + 1; /* SERVER_MODE; else 1 */
assert (file_Tempcache.tran_files == NULL); /* <- tran_files != NULL means "initialized" */
// ... condensed: free_entries/cached_* = NULL, ncached_* = 0, nfree_entries_max = ntrans*8 ...
file_Tempcache.ncached_max = prm_get_integer_value (PRM_ID_MAX_ENTRIES_IN_TEMP_FILE_CACHE);
pthread_mutex_init (&file_Tempcache.mutex, NULL);
file_Tempcache.tran_files = (FILE_TEMPCACHE_TRAN_ENTRY *) malloc (ntrans * sizeof (...));
if (file_Tempcache.tran_files == NULL)
{
pthread_mutex_destroy (&file_Tempcache.mutex); /* <- undo the mutex on alloc failure */
// ... er_set OUT_OF_VIRTUAL_MEMORY; return ER_OUT_OF_VIRTUAL_MEMORY ...
}
// ... condensed: memset tran_files; per-tran mutex_init loop; memset spacedb_temp ...
return NO_ERROR;
}

Branch accounting: the only non-trivial branch is the malloc failure, which destroys file_Tempcache.mutex before returning so nothing is half-constructed. file_tempcache_final mirrors this — early return if tran_files == NULL, else free every per-transaction list, the cached numerable / not-numerable lists and the free-entry pool, and destroy the mutexes.

Invariant — file_Tempcache.tran_files == NULL is the “uninitialized” sentinel. Both init (via assert) and final (via early return) treat tran_files as the single truth for whether the tempcache exists. Code that allocates or frees it must keep this honest, or final skips a real teardown or double-frees.

  1. disk_manager_init is the only assembler of disk_Cache and idempotent: the disk_Cache != NULL guard tears down any prior cache, disk_cache_init allocates, and a failed load_from_disk rolls back via disk_manager_final.
  2. disk_cache_init zeroes all rollup counters so load purely adds, and seeds every vols[] slot to DISK_UNKNOWN_PURPOSE.
  3. The cache is rebuilt by walking mounted volumes via fileio_map_mounted (two-pass perm-ascending / temp-descending, bounded by next_*_volid), one disk_cache_load_volume per live descriptor.
  4. disk_cache_load_volume distinguishes type from purpose: temp-type volumes are skipped; perm-data feeds perm_purpose_info.extend_info (may set the single volid_extend); perm-type/temp-purpose feeds temp_purpose_info.nsect_perm_*. nvols_perm++ runs for every permanent-type volume regardless of purpose, so after a full load nvols_temp == 0.
  5. The cache is a derived, lower-bound rollup that may legitimately undercount free sectors; the allocation table is the source of truth.
  6. disk_Temp_max_sects starts at -2 (pre-init sentinel, vs parameter default -1 = Infinite), overwritten from PRM_ID_BOSR_MAXTMP_PAGES: negatives map to SECTID_MAX, non-negative pages divide by DISK_SECTOR_NPAGES.
  7. The file manager reconstructs nothing from disk: file_manager_init only captures a flag and runs file_tempcache_init; the tracker globals stay static *_INITIALIZER zeros, and file_Tempcache.tran_files == NULL is the uninitialized sentinel guarding both init and final.

Chapter 3: Volume Format and the Sector Allocation Table

This chapter answers: how is a CUBRID volume laid out on disk, and how does the disk manager flip bits in the sector allocation table without scanning the bitmap one bit at a time? The high-level companion (cubrid-disk-manager.md) covers why a sector is the allocation quantum and why a bitmap beats a free-list; here we trace the byte layout, the format-time writers, and the bitmap-as-functor machinery. DISK_VOLUME_HEADER and DISK_STAB_CURSOR are introduced field-by-field in Chapter 1.

Every CUBRID volume — permanent or temporary, first or extension — shares one macro-layout: page 0 is the volume header, then a contiguous run of sector-table (STAB) pages, then data. Three header fields fix it:

// disk_volume_header_set_stab -- src/storage/disk_manager.c
volheader->stab_first_page = DISK_VOLHEADER_PAGE + 1; /* <- STAB always starts at page 1 */
volheader->stab_npages = CEIL_PTVDIV (volheader->nsect_max, DISK_STAB_PAGE_BIT_COUNT); /* <- sized by nsect_max, not nsect_total */
volheader->sys_lastpage = volheader->stab_first_page + volheader->stab_npages - 1; /* <- last reserved sys page */

DISK_VOLHEADER_PAGE is 0, so stab_first_page is always page 1. The decisive choice is the divisor — nsect_max, not nsect_total: a volume grows its used size up to its capacity without moving the data region, because the STAB was sized for the maximum on day one. Chapter 5 (extension) depends on this — extension flips already-present STAB bits and never re-lays-out the volume.

flowchart LR
  subgraph Volume["Volume file (pages)"]
    H["page 0<br/>DISK_VOLUME_HEADER<br/>magic, volid, purpose,<br/>nsect_total, nsect_max,<br/>stab_first_page, stab_npages,<br/>sys_lastpage, hint_allocsect"]
    S["pages 1 .. sys_lastpage<br/>SECTOR ALLOCATION TABLE<br/>stab_npages pages of UINT64 units<br/>1 bit == 1 sector"]
    D["pages sys_lastpage+1 .. end<br/>DATA SECTORS<br/>64 pages each"]
  end
  H --> S --> D

Figure 3-1: macro-layout of any CUBRID volume. The STAB is sized for nsect_max so the data region’s start never moves.

A “sector” is 64 consecutive pages (DISK_SECTOR_NPAGES); SECTOR_FROM_PAGEID(pageid) is pageid / 64. The system sectors a volume self-reserves at format time number SECTOR_FROM_PAGEID(sys_lastpage) + 1 (header + all STAB pages, rounded up) — the value that drives disk_stab_init (§3.3).

Invariant — STAB sizing is pinned to nsect_max. disk_verify_volume_header asserts stab_npages == CEIL_PTVDIV(nsect_max, DISK_STAB_PAGE_BIT_COUNT), stab_npages >= CEIL_PTVDIV(nsect_total, ...), and stab_first_page == DISK_VOLHEADER_PAGE + 1. Sizing by nsect_total instead would leave a later extension with no bitmap bits for the new sectors, and the assert would fire on the next header fetch.
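
The three header fields and the system-sector count follow from nsect_max by plain integer arithmetic. The sketch below is illustrative only: the toy_* names are hypothetical, the number of bits per STAB page is passed in as a parameter because DISK_STAB_PAGE_BIT_COUNT depends on the page size, and CEIL_PTVDIV is modelled as ordinary ceiling division for positive values.

```c
#include <assert.h>

#define TOY_SECTOR_NPAGES 64	/* pages per sector, as stated in the text */

typedef struct
{
  int stab_first_page;
  int stab_npages;
  int sys_lastpage;
  int nsects_sys;		/* sectors self-reserved at format time */
} toy_stab_layout;

/* Ceiling division for positive operands (stand-in for CEIL_PTVDIV). */
int
toy_ceil_div (int a, int b)
{
  assert (a > 0 && b > 0);
  return (a + b - 1) / b;
}

toy_stab_layout
toy_stab_layout_for (int nsect_max, int bits_per_stab_page)
{
  toy_stab_layout l;

  l.stab_first_page = 1;	/* header is page 0, STAB starts right after */
  l.stab_npages = toy_ceil_div (nsect_max, bits_per_stab_page);
  l.sys_lastpage = l.stab_first_page + l.stab_npages - 1;
  l.nsects_sys = l.sys_lastpage / TOY_SECTOR_NPAGES + 1;
  return l;
}
```

Assuming a hypothetical 131072 bits per STAB page, a volume with nsect_max = 200000 needs 2 STAB pages, so sys_lastpage is 2 and a single system sector covers the header and the STAB. Only at 64 STAB pages does sys_lastpage reach page 64 and spill into a second system sector.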

3.2 disk_format and disk_format_first_volume — writing the header

disk_format creates any volume; disk_format_first_volume is a thin shim that bootstraps the first volume (LOG_DBFIRST_VOLID) plus the cache: it calls disk_manager_init, sets disk_Cache->nvols_perm to 1 (rolled back to 0 on failure), and sets ext_info.nsect_total equal to ext_info.nsect_max (no headroom on the first volume).

disk_format has many error paths. The flowchart accounts for every branch via its edge labels; the prose below adds only what the flowchart cannot carry.

flowchart TD
  A["validate name & purpose"] -->|name too long| RET1["return ER_..._TOO_LONG"]
  A -->|bad purpose| RET2["return ER_DISK_UNKNOWN_PURPOSE"]
  A -->|ok| B{"voltype == PERMANENT?"}
  B -->|yes: log undo RVDK_FORMAT| C["force flush both paths<br/>then fileio_format OS file"]
  B -->|no| C
  C -->|NULL_VOLDES| RET3["return error, nothing to clean"]
  C -->|ok| E["fix page 0 NEW_PAGE,<br/>ptype PAGE_VOLHEADER"]
  E -->|fix fails| X["goto exit"]
  E -->|ok| F["fill header,<br/>set_stab"]
  F --> G{"sys_lastpage >= extend_npages?"}
  G -->|yes: ER_IO_FORMAT_BAD_NPAGES| X
  G -->|no: set params/name/remarks, err goto exit| I{"PERMANENT?"}
  I -->|yes: RVDK_NEWVOL + RVDK_FORMAT redo offset=-1| K["disk_stab_init"]
  I -->|no| K
  K -->|err| X
  K -->|ok| L{"PERMANENT and volid != FIRST?"}
  L -->|yes: disk_set_link prev vol, err goto exit| N{"PERMANENT?"}
  L -->|no| N
  N -->|yes: RVDK_FORMAT redo offset=0| P{"TEMPORARY?"}
  N -->|no| P
  P -->|yes: flush+dwb, sys pages temp-LSA, err goto exit| R["nsect_free_out, dirty_and_free,<br/>flush + dwb_synchronize"]
  P -->|no| R
  R --> X["exit: unfix header page"]
  X --> S{"error_code != NO_ERROR?"}
  S -->|no| RET4["return NO_ERROR"]
  S -->|yes| T["pgbuf_invalidate_all"]
  T --> U{"TEMPORARY?"}
  U -->|yes| V["disk_unformat now,<br/>temp not logged"]
  U -->|no| RET5["return error, rollback removes it"]
  V --> RET5

Figure 3-2: every branch of disk_format. The cleanup split at the bottom is the heart of crash safety.

Two points the flowchart cannot fully carry:

  • Undo is logical, force-flush is unconditional. Only the undo RVDK_FORMAT (log_append_undo_data, carrying just the name) is gated on voltype == DB_PERMANENT_VOLTYPE — it lets rollback remove the whole volume, since there is no page-level undo. But logpb_force_flush_pages then runs on both paths, so the log reaches disk before the OS file exists and a crash mid-format is recoverable.
  • The exit: split. After any post-fix error, goto exit unfixes the header page, then pgbuf_invalidate_all. A temporary volume is then disk_unformat-ed immediately (no log, no rollback to lean on); a permanent volume returns the error and lets the top-action rollback (Chapter 5) replay the logical undo. The two permanent RVDK_FORMAT redos use addr.offset = -1 before disk_stab_init and 0 after linking — the sentinel recovery uses to tell a started format from a completed one.

3.3 disk_stab_init — laying out the bitmap

After the header is written, disk_stab_init walks every STAB page and marks the system sectors (those the header+STAB occupy) reserved, leaving the rest zero (free).

// disk_stab_init -- src/storage/disk_manager.c
DKNSECTS nsects_sys = SECTOR_FROM_PAGEID (volheader->sys_lastpage) + 1; /* <- sectors to pre-reserve */
assert (nsects_sys < DISK_STAB_PAGE_BIT_COUNT); /* <- sys region fits in the first STAB page */
for ( /* each STAB page */ ; ...; vpid_stab.pageid++)
{
page_stab = pgbuf_fix (..., NEW_PAGE, PGBUF_LATCH_WRITE, ...); // NULL -> return error
pgbuf_set_page_ptype (thread_p, page_stab, PAGE_VOLBITMAP);
if (volheader->purpose == DB_TEMPORARY_DATA_PURPOSE) pgbuf_set_lsa_as_temporary (...); /* <- no log for temp */
memset (page_stab, 0, DB_PAGESIZE); /* <- all sectors free by default */
if (nsects_sys > 0) /* <- only while sys sectors remain (first STAB page only) */
{ nsect_copy = nsects_sys;
disk_stab_cursor_set_at_sectid (volheader, /* page start */ ..., &start_cursor);
if ( /* last STAB page */ ) disk_stab_cursor_set_at_end (volheader, &end_cursor); /* <- end = nsect_total */
else disk_stab_cursor_set_at_sectid (volheader, /* next page start */ ..., &end_cursor);
error_code = disk_stab_iterate_units (..., disk_stab_set_bits_contiguous, &nsect_copy); } // err -> unfix + return
if (volheader->purpose != DB_TEMPORARY_DATA_PURPOSE) /* <- permanent: log only the count, not the image */
{ DKNSECTS nsects_set = nsects_sys - nsect_copy;
log_append_redo_data2 (thread_p, RVDK_INITMAP, NULL, page_stab, NULL_OFFSET, sizeof (nsects_set), &nsects_set); }
if (!LOG_ISRESTARTED ()) { pgbuf_set_dirty (...); pgbuf_flush (..., FREE); page_stab = NULL; } /* <- format: flush, pool invalidated next */
else pgbuf_set_dirty_and_free (thread_p, page_stab); /* <- recovery replay: dirty+free */
nsects_sys = nsect_copy; nsect_copy = 0; /* <- carry leftover to next page (normally 0 after the first STAB page) */
}

Every branch is tagged inline. The loop runs stab_npages times zeroing each page; the nsects_sys > 0 block fires only on the first page (the assert guarantees the system sectors fit there), and disk_stab_set_bits_contiguous fills whole BIT64_FULL units then trailing bits up to the end cursor.

3.4 disk_unformat — removing the OS file

Destruction is anticlimactic: the disk manager owns no in-memory bitmap, so disk_unformat only flushes, invalidates the page-buffer image, and deletes the file.

// disk_unformat -- src/storage/disk_manager.c
volid = fileio_find_volume_id_with_label (thread_p, vol_fullname);
if (volid != NULL_VOLID)
{
(void) pgbuf_flush_all (thread_p, volid); /* <- push any dirty pages */
(void) pgbuf_invalidate_all (thread_p, volid); /* <- drop them from the pool */
}
fileio_unformat (thread_p, vol_fullname); /* <- delete the OS file */
return ret; /* <- always NO_ERROR */

The single branch is volid != NULL_VOLID: an unmounted volume (no id for the label) skips flush/invalidate and only fileio_unformat runs. This is what disk_format calls on its temporary-volume error path (§3.2) and what recovery calls when undoing a permanent format.

Callers never read the STAB bit-by-bit. The manager quantizes it into 64-bit units and exposes one iterator — disk_stab_iterate_units — driving a DISK_STAB_UNIT_FUNC callback over a unit range. Reserve, unreserve, count-free, has-used, and contiguous-set are all just different callbacks.

Quantization. DISK_STAB_UNIT is UINT64. The macros mapping a SECTID to a position are pure integer arithmetic — a flat index split into (page, unit, bit):

// allocation-table addressing macros -- src/storage/disk_manager.c
#define DISK_ALLOCTBL_SECTOR_PAGE_OFFSET(sect) ((sect) / DISK_STAB_PAGE_BIT_COUNT)
#define DISK_ALLOCTBL_SECTOR_UNIT_OFFSET(sect) (((sect) % DISK_STAB_PAGE_BIT_COUNT) / DISK_STAB_UNIT_BIT_COUNT)
#define DISK_ALLOCTBL_SECTOR_BIT_OFFSET(sect) (((sect) % DISK_STAB_PAGE_BIT_COUNT) % DISK_STAB_UNIT_BIT_COUNT)
#define DISK_STAB_NPAGES(nsect_max) (CEIL_PTVDIV (nsect_max, DISK_STAB_PAGE_BIT_COUNT))

DISK_STAB_NPAGES is the same CEIL_PTVDIV as in disk_volume_header_set_stab, keeping the header field and the macro in agreement.

flowchart LR
  SECT["SECTID"] --> PG["page offset<br/>sect / PAGE_BIT_COUNT"]
  SECT --> UN["unit offset<br/>(sect mod PAGE_BIT_COUNT) / 64"]
  SECT --> BT["bit offset<br/>(sect mod PAGE_BIT_COUNT) mod 64"]
  PG --> POS["cursor.pageid = stab_first_page + page offset"]
  UN --> POS2["cursor.offset_to_unit"]
  BT --> POS3["cursor.offset_to_bit"]

Figure 3-3: a SECTID split into (page, unit, bit) by three modulo/divide macros. The cursor stores all three plus the live unit pointer.

Three inline setters seed a DISK_STAB_CURSOR (fields in Chapter 1), differing only in the target sector; all leave page/unit NULL (the page is fixed lazily by disk_stab_cursor_fix).

  • disk_stab_cursor_set_at_sectid — general case: asserts 0 <= sectid <= nsect_total, fills pageid/offset_to_unit/offset_to_bit from the three macros, asserting pageid stays within stab_npages.
  • disk_stab_cursor_set_at_end — one past the last valid sector via set_at_sectid(volheader, nsect_total, cursor), first asserting nsect_total is unit-rounded (DISK_SECTS_ASSERT_ROUNDED) so iteration ends on a 64-bit boundary.
  • disk_stab_cursor_set_at_start — hard-codes sectid = 0, pageid = stab_first_page, both offsets 0 (skips set_at_sectid; the all-zero position is trivial).

Invariant — cursor position consistency. disk_stab_cursor_check_valid asserts (pageid - stab_first_page) * PAGE_BIT_COUNT + offset_to_unit * 64 + offset_to_bit == sectid, and that whenever unit != NULL, (char*)unit - page == offset_to_unit * DISK_STAB_UNIT_SIZE_OF. The iterator re-establishes this before every callback. If the offsets drift from sectid, reserved VSIDs name the wrong sectors — silent cross-linking corruption.
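
The decomposition and the consistency identity can be checked together. This is a minimal sketch with hypothetical toy_* names: the bits-per-page value is a parameter rather than DISK_STAB_PAGE_BIT_COUNT, and the live unit pointer of the real cursor is left out.

```c
#include <stdbool.h>

#define TOY_UNIT_BITS 64	/* bits per STAB unit */

typedef struct
{
  int sectid;
  int pageid;
  int offset_to_unit;
  int offset_to_bit;
} toy_cursor;

/* Split a flat sector id into (page, unit, bit), as the three macros do. */
toy_cursor
toy_cursor_at (int sectid, int stab_first_page, int page_bits)
{
  toy_cursor c;

  c.sectid = sectid;
  c.pageid = stab_first_page + sectid / page_bits;
  c.offset_to_unit = (sectid % page_bits) / TOY_UNIT_BITS;
  c.offset_to_bit = (sectid % page_bits) % TOY_UNIT_BITS;
  return c;
}

/* The position identity: recomposing the three offsets must give sectid. */
bool
toy_cursor_is_consistent (const toy_cursor * c, int stab_first_page, int page_bits)
{
  return (c->pageid - stab_first_page) * page_bits
    + c->offset_to_unit * TOY_UNIT_BITS + c->offset_to_bit == c->sectid;
}
```

With a hypothetical 131072 bits per page and stab_first_page = 1, sector 131269 lands on page 2, unit 3, bit 5. Nudging offset_to_bit without touching sectid makes the identity fail, which is the drift the invariant guards against.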

// disk_stab_iterate_units -- src/storage/disk_manager.c
assert (disk_stab_cursor_compare (start, end) < 0); /* <- start strictly before end */
for (cursor = *start; cursor.pageid <= end->pageid; cursor.pageid++, cursor.offset_to_unit = 0)
{
error_code = disk_stab_cursor_fix (thread_p, &cursor, mode); /* <- fix this STAB page */
// ... err -> return ...
end_unit = ((DISK_STAB_UNIT *) cursor.page)
+ (cursor.pageid == end->pageid ? end->offset_to_unit : DISK_STAB_PAGE_UNITS_COUNT); /* <- clamp last page */
for (; cursor.unit < end_unit;
cursor.unit++, cursor.offset_to_unit++,
cursor.sectid += (DISK_STAB_UNIT_BIT_COUNT - cursor.offset_to_bit), /* <- advance by remaining bits */
cursor.offset_to_bit = 0)
{
error_code = f_unit (thread_p, &cursor, &stop, f_unit_args); /* <- the functor */
if (error_code != NO_ERROR) { disk_stab_cursor_unfix (...); return error_code; }
if (stop) { disk_stab_cursor_unfix (...); return NO_ERROR; } /* <- early-out */
}
disk_stab_cursor_unfix (thread_p, &cursor);
}

The inner stride advances sectid by DISK_STAB_UNIT_BIT_COUNT - cursor.offset_to_bit — normally a full 64, but a callback may leave offset_to_bit partway through a unit (as disk_stab_unit_reserve does), so the stride compensates. Two short-circuits unfix the page first: a callback error (returns the error) and a callback setting *stop = true (returns NO_ERROR). disk_stab_iterate_units_all wraps this with set_at_start/set_at_end.
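
The functor pattern itself is small enough to show on a flat array. The sketch below is a simplification with hypothetical toy_* names: it iterates units of a single in-memory array, so page fixing, the cursor, and the partial-unit stride are all omitted. What it keeps is the contract: the callback receives one unit, may return an error, and may set a stop flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*toy_unit_func) (uint64_t * unit, size_t index, bool * stop, void *args);

/* Apply f to units [start, end); stop on error or when f sets *stop. */
int
toy_iterate_units (uint64_t * units, size_t start, size_t end, toy_unit_func f, void *args)
{
  size_t i;

  for (i = start; i < end; i++)
    {
      bool stop = false;
      int err = f (&units[i], i, &stop, args);

      if (err != 0)
	return err;		/* callback error aborts the walk */
      if (stop)
	return 0;		/* early-out requested by the callback */
    }
  return 0;
}

/* Functor: add the number of zero (free) bits of each unit to *args. */
int
toy_count_free (uint64_t * unit, size_t index, bool * stop, void *args)
{
  int *nfree = (int *) args;
  uint64_t zeros = ~*unit;

  (void) index;
  (void) stop;
  while (zeros != 0)
    {
      zeros &= zeros - 1;	/* clear the lowest set bit */
      (*nfree)++;
    }
  return 0;
}

/* Functor: record the index of the first unit with any used bit, then stop. */
int
toy_find_first_used (uint64_t * unit, size_t index, bool * stop, void *args)
{
  int *first = (int *) args;

  if (*unit != 0)
    {
      *first = (int) index;
      *stop = true;
    }
  return 0;
}
```

Swapping the callback changes the operation without touching the walk, which is the design the iterator relies on: counting, searching, and bit-setting all share one traversal.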

disk_stab_unit_reserve is the most branch-rich functor: it reserves up to nsects_lastvol_remaining free bits and records each VSID. All three branches are tagged inline.

// disk_stab_unit_reserve -- src/storage/disk_manager.c
if (*cursor->unit == BIT64_FULL) return NO_ERROR; /* <- (1) full unit: nothing free, skip; no dirty/log */
context = (DISK_RESERVE_CONTEXT *) args;
if (*cursor->unit == 0) /* <- (2) empty unit: grab up to 64 in one store */
{ int bits_to_set = MIN (context->nsects_lastvol_remaining, DISK_STAB_UNIT_BIT_COUNT);
*cursor->unit = (bits_to_set == DISK_STAB_UNIT_BIT_COUNT) ? BIT64_FULL
: bit64_set_trailing_bits (*cursor->unit, bits_to_set);
log_unit = *cursor->unit; context->nsects_lastvol_remaining -= bits_to_set; /* ... emit one VSID per bit ... */ }
else /* <- (3) mixed unit: skip leading ones, set each free bit */
{ log_unit = 0;
for (cursor->offset_to_bit = bit64_count_trailing_ones (*cursor->unit), cursor->sectid += cursor->offset_to_bit;
cursor->offset_to_bit < DISK_STAB_UNIT_BIT_COUNT && context->nsects_lastvol_remaining > 0;
cursor->offset_to_bit++, cursor->sectid++)
if (!disk_stab_cursor_is_bit_set (cursor))
{ disk_stab_cursor_set_bit (cursor); log_unit = bit64_set (log_unit, cursor->offset_to_bit);
context->nsects_lastvol_remaining--; /* ... push VSID ... */ } }
assert (log_unit != 0 && (log_unit & *cursor->unit) == log_unit);
if (context->purpose == DB_PERMANENT_DATA_PURPOSE) /* <- permanent: undoredo delta; temp skips logging */
log_append_undoredo_data2 (thread_p, RVDK_RESERVE_SECTORS, NULL, cursor->page, cursor->offset_to_unit,
sizeof (log_unit), sizeof (log_unit), &log_unit, &log_unit);
pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
if (context->nsects_lastvol_remaining <= 0) *stop = true;

log_unit accumulates only the bits this call set; for permanent volumes it is both the redo and undo image of RVDK_RESERVE_SECTORS (redo re-sets, undo clears).

Invariant — log_unit is a strict subset of the unit’s set bits. The assert (log_unit != 0 && (log_unit & *cursor->unit) == log_unit) guarantees the logged delta holds only bits actually set and is never a no-op. A bit absent from *cursor->unit would make recovery’s redo set a bit the live run never set — divergence between logged and live bitmaps.
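
The three branches and the log_unit delta can be exercised on a single 64-bit word. This is a toy model, not the CUBRID functor: the toy_* names are hypothetical, VSID output and logging are left out, and the mixed branch scans every bit instead of first skipping trailing ones.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_UNIT_BITS 64

/* Reserve up to *remaining free bits in *unit.
 * Returns the mask of bits set by this call (the "log_unit" delta). */
uint64_t
toy_unit_reserve (uint64_t * unit, int *remaining)
{
  uint64_t log_unit = 0;
  int bit;

  if (*unit == UINT64_MAX || *remaining <= 0)
    return 0;			/* (1) full unit or nothing to do: no change */

  if (*unit == 0)
    {				/* (2) empty unit: take a run of trailing bits */
      int n = *remaining < TOY_UNIT_BITS ? *remaining : TOY_UNIT_BITS;

      log_unit = (n == TOY_UNIT_BITS) ? UINT64_MAX : ((UINT64_C (1) << n) - 1);
      *unit = log_unit;
      *remaining -= n;
    }
  else
    {				/* (3) mixed unit: set each free bit in turn */
      for (bit = 0; bit < TOY_UNIT_BITS && *remaining > 0; bit++)
	{
	  uint64_t mask = UINT64_C (1) << bit;

	  if ((*unit & mask) == 0)
	    {
	      *unit |= mask;
	      log_unit |= mask;
	      (*remaining)--;
	    }
	}
    }
  /* The delta is non-empty and contains only bits that are now set. */
  assert (log_unit != 0 && (log_unit & *unit) == log_unit);
  return log_unit;
}

/* Undo of a reserve: clear exactly the bits recorded in the delta. */
void
toy_unit_undo_reserve (uint64_t * unit, uint64_t log_unit)
{
  *unit &= ~log_unit;
}
```

Because the delta records only the bits this call set, the same word serves both directions: OR-ing it in repeats the reservation, and clearing it restores the unit to its prior state without disturbing bits owned by earlier reservations.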

The mirror functor, disk_stab_unit_unreserve, clears bits whose sector IDs the caller already knows (sorted in context->vsidp).

// disk_stab_unit_unreserve -- src/storage/disk_manager.c
while (context->nsects_lastvol_remaining > 0 && context->vsidp->sectid < cursor->sectid + DISK_STAB_UNIT_BIT_COUNT)
{ unreserve_bits = bit64_set (unreserve_bits, context->vsidp->sectid - cursor->sectid); /* <- accumulate this unit's window, abs->rel bit */
context->nsects_lastvol_remaining--; context->vsidp++; nsect++; }
assert ((unreserve_bits & (*cursor->unit)) == unreserve_bits); /* <- only clear bits that are set */
if (unreserve_bits != 0) /* <- skip an untouched unit */
{
if (context->purpose == DB_PERMANENT_DATA_PURPOSE) /* <- permanent: postpone clears at commit, rollback skips it */
log_append_postpone (thread_p, RVDK_UNRESERVE_SECTORS, &addr /* page,offset_to_unit */, ..., &unreserve_bits);
else /* <- temp: clear now + cache update under temp reserve lock */
{ (*cursor->unit) &= ~unreserve_bits; pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
disk_cache_update_vol_free (cursor->volheader->volid, nsect); }
}
if (context->nsects_lastvol_remaining <= 0) *stop = true;

The purpose split is the asymmetry worth remembering, and it is tagged inline: permanent unreserve emits a postpone record, so a rollback never runs it and the sectors stay reserved; temporary unreserve clears immediately and updates the cache free count.

Invariant — unreserve only clears set bits. assert((unreserve_bits & *cursor->unit) == unreserve_bits) enforces that every sector being freed was actually reserved; a violation means double-free or a stale VSID list, corrupting free-sector accounting.
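
The window accumulation and the set-bits check can be sketched the same way. The toy_* names below are hypothetical, and the sketch assumes a sorted array of plain sector ids; it models only the immediate clear of the temporary path, not the postpone record of the permanent path.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_UNIT_BITS 64

/* Consume sorted sector ids that fall inside the 64-sector window starting
 * at unit_first_sectid; return them as a mask of unit-relative bits. */
uint64_t
toy_unit_unreserve_mask (const int *sectids, int nsectids, int *pos, int unit_first_sectid)
{
  uint64_t mask = 0;

  while (*pos < nsectids && sectids[*pos] < unit_first_sectid + TOY_UNIT_BITS)
    {
      assert (sectids[*pos] >= unit_first_sectid);	/* list must be sorted */
      mask |= UINT64_C (1) << (sectids[*pos] - unit_first_sectid);
      (*pos)++;
    }
  return mask;
}

/* Immediate clear: every bit being freed must currently be set. */
void
toy_unit_clear (uint64_t * unit, uint64_t mask)
{
  assert ((mask & *unit) == mask);
  *unit &= ~mask;
}
```

For sector ids 64, 66, and 130, the window starting at sector 64 consumes the first two (bits 0 and 2) and leaves 130 for the window starting at 128. A unit visited with no matching ids yields a zero mask, which corresponds to the skip-untouched-unit branch above.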

3.6 The 64-bit coupling and the hint_allocsect note

Hidden 64-bit coupling. The cursor primitives call bit64_is_set, bit64_set, bit64_set_trailing_bits, bit64_count_trailing_ones, bit64_count_zeros — all hard-wired to 64-bit operands. The DISK_STAB_UNIT comment suggests the unit type “can be modified and handled automatically,” but changing typedef UINT64 DISK_STAB_UNIT would silently break every bit64_* call and BIT64_FULL. The quantization macros adapt via DISK_STAB_UNIT_SIZE_OF; the bit-op layer does not. (Open question: whether the “automatic” claim was ever true.) Treat 64 bits as a fixed contract.

hint_allocsect. disk_format only seeds this to NULL_SECTID; the live update is on the reservation path Chapter 4 owns (disk_reserve_sectors_in_volume). The subtlety relevant here: it goes stale after an unreserve. disk_stab_unit_unreserve frees bits below the hint but never lowers it, so a later reservation skips the freshly freed sectors until the wrap-around pass reclaims them. It is an optimization, not an invariant, so the code neither logs nor dirties it.

  1. Layout is fixed and header-driven. Page 0 is the header; stab_first_page (always 1) begins a contiguous STAB sized by DISK_STAB_NPAGES(nsect_max); data follows sys_lastpage. Sizing by nsect_max not nsect_total lets a volume grow without re-layout.
  2. disk_format is branch-heavy for crash safety. The logical undo (RVDK_FORMAT) is permanent-only, but logpb_force_flush_pages runs unconditionally before the OS file is created; permanent volumes log the header redo twice (offset -1 then 0); temporary volumes get temp-LSAs and are disk_unformat-ed immediately on error.
  3. disk_stab_init pre-reserves exactly the system sectors (SECTOR_FROM_PAGEID(sys_lastpage)+1, all in the first STAB page), leaves the rest free, and logs only the count (RVDK_INITMAP), not the page image.
  4. The bitmap is never scanned bit-by-bit. A SECTID decomposes into (page, unit, bit) via three macros, and disk_stab_iterate_units drives a DISK_STAB_UNIT_FUNC over 64-bit units, short-circuiting on full/empty units.
  5. Reserve and unreserve are mirror functors with a purpose split. Permanent reserve logs an undoredo delta; permanent unreserve uses a postpone record so rollback keeps the sectors; temporary skips logging and updates the cache directly. The log_unit/unreserve_bits invariants keep logged and live bitmaps in lockstep.
  6. 64 bits is a hard contract, not a tunable: the bit64_* primitives and BIT64_FULL are not parameterized by unit size, despite the optimistic comment on DISK_STAB_UNIT.
  7. hint_allocsect is live state owned by Chapter 4; disk_format only seeds it to NULL_SECTID. Its one subtlety here is staleness after unreserve — freeing sectors below the hint never lowers it.

Chapter 4: Sector Reservation Two-Step Protocol

A file that needs N sectors does not flip N bits under one lock. The disk manager splits the work into two disjoint phases (the high-level companion, CUBRID Disk Manager, explains why the cache exists). This chapter answers: when a file needs N sectors, how does the disk manager hand them out across volumes while keeping the hot mutex short and staying crash-safe?

4.1 The two structs that carry a reservation

A reservation is carried by a disk_reserve_context, a stack local in disk_reserve_sectors (re-built in the unreserve path) that is threaded through every function below.

// disk_reserve_context -- src/storage/disk_manager.c
struct disk_reserve_context
{
int nsect_total; /* original request size */
VSID *vsidp; /* write cursor into output array */
DISK_CACHE_VOL_RESERVE cache_vol_reserve[VOLID_MAX]; /* per-volume ledger from step 1 */
int n_cache_vol_reserve; /* ledger slots used */
int n_cache_reserve_remaining; /* cache-phase debt */
DKNSECTS nsects_lastvol_remaining; /* current-volume bitmap debt */
DB_VOLPURPOSE purpose; /* permanent-data / temporary-data */
};

| Field | Role | Why it exists |
| --- | --- | --- |
| nsect_total | Immutable copy of request N. | Final assert (vsidp - reserved_sectors == n_sectors); never decremented. |
| vsidp | Write pointer into reserved_sectors[]. | vsidp - reserved_sectors = sectors reserved so far; error path reads it for rollback. |
| cache_vol_reserve[] | Step-1 ledger, one {volid, nsect} per volume drawn from. | Step 2 replays it; error path refunds un-flipped sectors from it. |
| n_cache_vol_reserve | Count of used ledger slots. | Loop bound for step 2 and the rollback scan. |
| n_cache_reserve_remaining | Cache-phase debt; starts N, decremented by disk_reserve_from_cache_volume, 0 when satisfied. | Drives volume-iteration and extend decisions in step 1. |
| nsects_lastvol_remaining | Bitmap-phase debt within the current volume; seeded per-volume, decremented as bits flip. | disk_stab_unit_reserve drives off it; 0 sets *stop. |
| purpose | DB_PERMANENT_DATA_PURPOSE / DB_TEMPORARY_DATA_PURPOSE. | Selects cache mutex/extend-info and whether STAB changes are logged. |

// disk_cache_vol_reserve -- src/storage/disk_manager.c
struct disk_cache_vol_reserve { VOLID volid; DKNSECTS nsect; };

| Field | Role | Why it exists |
| --- | --- | --- |
| volid | Volume the cache reserved from. | Step 2 fixes its header and flips its bits; rollback decrements its cache counter. |
| nsect | Count promised from volid. | Seeds nsects_lastvol_remaining; rollback decrements it per sector returned by undo, leaving the not-yet-flipped remainder. |

Each ledger entry {volid, nsect} seeds one step-2 per-volume scan (Figure 4-1).

flowchart LR
  RC["disk_reserve_context"] --> L["cache_vol_reserve[i]\n{volid, nsect}"] -.seeds nsects_lastvol_remaining.-> S2["step 2 per-volume scan -> reserved_sectors[]"]

Figure 4-1. Reserve context, its per-volume ledger, and the step-2 scan that fills the output array.

Invariant — the two remaining-counters never alias. n_cache_reserve_remaining is the cache debt; nsects_lastvol_remaining is the current-volume bitmap debt. The cache phase finishes with n_cache_reserve_remaining == 0 and sum(cache_vol_reserve[i].nsect) == N. Separating them lets step 2 be re-driven per volume without re-touching the cache; aliasing would let a partial volume scan corrupt cache accounting.

4.2 The outer driver: disk_reserve_sectors


disk_reserve_sectors(thread_p, purpose, volid_hint, n_sectors, reserved_sectors) is the disk/file boundary call. volid_hint is accepted but ignored; volume order is governed by purpose.

  1. Guards. assert purpose is perm or temp; n_sectors <= 0 || reserved_sectors == NULL -> assert_release(false); ER_FAILED.
  2. Sysop precondition for permanent reservations (their STAB changes are logged onto the outer transaction):
    // disk_reserve_sectors -- src/storage/disk_manager.c
    if (purpose != DB_TEMPORARY_DATA_PURPOSE && !log_check_system_op_is_started (thread_p))
    { assert (false); er_set (...ER_GENERIC_ERROR, 0); return ER_FAILED; } /* caller forgot sysop */
  3. retry: / log_sysop_start — even temp reservations open a sysop to scope the bitmap phase.
  4. CSECT_DISK_CHECK as reader (excludes the consistency checker). Fail -> log_sysop_abort; return.
  5. Init context in place: nsect_total = n_cache_reserve_remaining = n_sectors, vsidp = reserved_sectors, n_cache_vol_reserve = 0.
  6. Step 1 — disk_reserve_from_cache. Error -> goto error.
  7. Step 2 — loop disk_reserve_sectors_in_volume over [0, n_cache_vol_reserve). Any error -> goto error.
  8. Success. assert ((vsidp - reserved_sectors) == n_sectors); exit csect; log_sysop_attach_to_outer; in debug, if did_extend, disk_check; return NO_ERROR. The error: path (4.7) handles rollback.
flowchart TD
  C["csect_enter CSECT_DISK_CHECK after sysop start"] -->|fail| AB["log_sysop_abort, return err"]
  C -->|ok| D["init context"] --> E["step 1: disk_reserve_from_cache"]
  E -->|err| ERR["goto error: rollback (4.7)"]
  E -->|ok| F["step 2 loop: disk_reserve_sectors_in_volume per ledger entry"]
  F -->|err| ERR
  F -->|ok| H["assert vsidp-base==N, attach_to_outer, NO_ERROR"]

Figure 4-2. disk_reserve_sectors control flow.

disk_reserve_from_cache is step 1: it moves free-sector counts into the ledger, extending the disk if short, and holds the reserve mutex only across counter math.

  1. disk_Cache == NULL -> assert_release(false); return ER_FAILED.
  2. Lock the purpose’s reserve mutex (disk_cache_lock_reserve_for_purpose).
  3. Temp purpose prefers perm-type-temp-purpose volumes before genuine temp volumes:
    // disk_reserve_from_cache -- src/storage/disk_manager.c
    if (context->purpose == DB_TEMPORARY_DATA_PURPOSE)
      {
        extend_info = &disk_Cache->temp_purpose_info.extend_info;
        if (disk_Cache->temp_purpose_info.nsect_perm_free > 0)
          disk_reserve_from_cache_vols (DB_PERMANENT_VOLTYPE, context); /* <- perm-temp first */
        if (context->n_cache_reserve_remaining <= 0) /* satisfied from perm-temp */
          { disk_cache_unlock_reserve_for_purpose (context->purpose); return NO_ERROR; }
        // ... temp-ceiling check, then fall through to temp-volume extend ...
      }
    else
      extend_info = &disk_Cache->perm_purpose_info.extend_info;
    nsect_perm_free = free sectors on perm-type volumes carrying temp purpose; when 0 those volumes are skipped.
  4. Temp-space ceiling (temp branch, before extending temp volumes): if extend_info->nsect_total - extend_info->nsect_free + n_cache_reserve_remaining > disk_Temp_max_sects -> er_set (ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED ...); unlock; return. Operands are the extend-info pool aggregates, not the context’s nsect_total.
  5. Common tail: assert (n_cache_reserve_remaining > 0) and assert this thread holds the mutex.
  6. Reserve from existing free space if the pool is big enough:
    if (extend_info->nsect_free > context->n_cache_reserve_remaining) /* strict >: a hair of headroom, see Ch 5 */
      {
        disk_reserve_from_cache_vols (extend_info->voltype, context);
        if (context->n_cache_reserve_remaining <= 0)
          { disk_cache_unlock_reserve (extend_info); return NO_ERROR; } /* <- done from existing */
      }
  7. Short -> extend. Bump extend_info->nsect_intention (signals concurrent reservers), drop the reserve mutex, take disk_lock_extend(), re-take the reserve mutex and re-check. If a peer already extended so nsect_free now suffices: decrement intention, retry disk_reserve_from_cache_vols, return. Else call disk_extend (Ch 5) and back the intention out. Both locks released on every exit.
  8. Post-extend. disk_extend error -> return it. Still n_cache_reserve_remaining > 0 -> assert_release(false); ER_FAILED. Else *did_extend = true; return NO_ERROR.

Invariant — the reserve mutex is never held across a STAB scan or an extend. The intention counter is the hand-off token that lets the mutex drop during the slow extend without two threads double-extending.
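
The temp-space ceiling test in step 4 is plain arithmetic over the pool aggregates, so it can be sketched in isolation. The function below is a hypothetical stand-in, not CUBRID code: `toy_temp_ceiling_exceeded` and its parameter names are assumptions, and the four arguments model `extend_info->nsect_total`, `extend_info->nsect_free`, `n_cache_reserve_remaining`, and `disk_Temp_max_sects`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the step-4 check: sectors already in use plus the
 * still-unmet request must not exceed the configured temp ceiling. */
bool
toy_temp_ceiling_exceeded (long nsect_total, long nsect_free,
                           long nsect_needed, long temp_max_sects)
{
  assert (nsect_free >= 0 && nsect_free <= nsect_total);
  assert (nsect_needed > 0);

  /* in use = total - free; the comparison is strict, as in the text */
  return (nsect_total - nsect_free) + nsect_needed > temp_max_sects;
}
```

With a pool of 1000 sectors, 400 of them free, and a ceiling of 1000, a request for 400 more sectors lands exactly on the ceiling and passes; a request for 401 is rejected.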

4.4 Iterating volumes: disk_reserve_from_cache_vols

// disk_reserve_from_cache_vols -- src/storage/disk_manager.c
if (type == DB_PERMANENT_VOLTYPE) /* perm: ascend 0..nvols_perm */
  { start_iter = 0; end_iter = disk_Cache->nvols_perm; incr = 1; min_free = MIN (context->nsect_total, perm...nsect_vol_max) / 2; }
else /* temp: descend from top of volid space */
  { start_iter = LOG_MAX_DBVOLID; end_iter = LOG_MAX_DBVOLID - disk_Cache->nvols_temp; incr = -1; min_free = MIN (context->nsect_total, temp...nsect_vol_max) / 2; }
min_free = MAX (min_free, 1); /* half the smaller of request/per-vol max, floored at 1 */
for (volid_iter = start_iter;
     volid_iter != end_iter && context->n_cache_reserve_remaining > 0; /* stop when range exhausted or debt paid */
     volid_iter += incr)
  {
    if (disk_Cache->vols[volid_iter].purpose != context->purpose) continue; /* wrong purpose */
    if (disk_Cache->vols[volid_iter].nsect_free < min_free) continue; /* too fragmented */
    disk_reserve_from_cache_volume (volid_iter, context);
  }
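
The `min_free` threshold is easy to misread in the condensed form above, so here is a minimal sketch of just that computation. `toy_min_free` and the `TOY_*` macros are hypothetical names; the arguments stand in for `context->nsect_total` and the per-volume maximum that the excerpt elides.

```c
#include <assert.h>

#define TOY_MIN(a, b) ((a) < (b) ? (a) : (b))
#define TOY_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical helper: the smallest free count a volume must report
 * before the iteration bothers to reserve from it. */
int
toy_min_free (int nsect_request, int nsect_vol_max)
{
  int min_free;

  assert (nsect_request > 0 && nsect_vol_max > 0);

  /* half the smaller of request and per-volume maximum ... */
  min_free = TOY_MIN (nsect_request, nsect_vol_max) / 2;
  /* ... floored at 1, so a one-sector request is never filtered out */
  return TOY_MAX (min_free, 1);
}
```

A 10-sector request against a 100-sector volume maximum gives a threshold of 5; a 1000-sector request against a 64-sector maximum gives 32; a 1-sector request floors at 1.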

4.5 Decrementing one volume’s counter: disk_reserve_from_cache_volume


The only place step 1 actually moves sectors out of the cache.

// disk_reserve_from_cache_volume -- src/storage/disk_manager.c
if (context->n_cache_vol_reserve >= LOG_MAX_DBVOLID)
  { assert_release (false); return; } /* <- ledger overflow guard */
disk_check_own_reserve_for_purpose (context->purpose); /* <- assert mutex held by us */
nsects = MIN (disk_Cache->vols[volid].nsect_free, context->n_cache_reserve_remaining);
disk_cache_update_vol_free (volid, -nsects); /* <- decrement cache + purpose pool */
context->cache_vol_reserve[context->n_cache_vol_reserve].volid = volid;
context->cache_vol_reserve[context->n_cache_vol_reserve].nsect = nsects;
context->n_cache_reserve_remaining -= nsects;
context->n_cache_vol_reserve++; /* <- bitmap untouched, only counters */

disk_cache_update_vol_free also adjusts the matching purpose-pool aggregate.
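
As a rough illustration of how step 1 fills the ledger, the sketch below models the cache as a plain array of per-volume free counts. Everything here is an assumption made for illustration: the `toy_*` names, the fixed four-volume array, and the simplified filter (it skips only empty volumes, where the real loop also checks purpose and `min_free`). Locking is omitted entirely.

```c
#include <assert.h>

#define TOY_NVOLS 4

typedef struct { int volid; int nsect; } toy_cache_entry;

typedef struct
{
  int free_per_vol[TOY_NVOLS];        /* stand-in for per-volume nsect_free */
  toy_cache_entry ledger[TOY_NVOLS];  /* stand-in for cache_vol_reserve[] */
  int n_ledger;                       /* ledger slots used */
  int remaining;                      /* cache-phase debt */
} toy_reserve_state;

/* Charge one volume: take what it can give and record it in the ledger.
 * Only counters move; no bitmap is touched. */
void
toy_reserve_from_volume (toy_reserve_state *st, int volid)
{
  int take;

  assert (volid >= 0 && volid < TOY_NVOLS);
  assert (st->n_ledger < TOY_NVOLS);  /* ledger overflow guard */

  take = st->free_per_vol[volid] < st->remaining
    ? st->free_per_vol[volid] : st->remaining;
  st->free_per_vol[volid] -= take;
  st->ledger[st->n_ledger].volid = volid;
  st->ledger[st->n_ledger].nsect = take;
  st->n_ledger++;
  st->remaining -= take;
}

/* Walk volumes in ascending order until the debt is paid.
 * Returns the debt left over (0 when fully satisfied). */
int
toy_reserve_from_vols (toy_reserve_state *st, int request)
{
  int volid;

  st->n_ledger = 0;
  st->remaining = request;
  for (volid = 0; volid < TOY_NVOLS && st->remaining > 0; volid++)
    {
      if (st->free_per_vol[volid] == 0)
        continue;                     /* simplified filter, see lead-in */
      toy_reserve_from_volume (st, volid);
    }
  return st->remaining;
}
```

Starting from free counts {3, 0, 10, 2} and a request of 8, the ledger ends up as {vol 0: 3} and {vol 2: 5}, and volume 2 is left with 5 free sectors.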

4.6 Step 2: disk_reserve_sectors_in_volume flips the bits


Per ledger entry, fixes the volume header under a write latch (cache mutex not held) and flips STAB bits until the per-volume debt hits zero.

  1. Read ledger. volid = cache_vol_reserve[vol_index].volid; if NULL_VOLID -> assert_release(false); ER_FAILED. Seed nsects_lastvol_remaining = cache_vol_reserve[vol_index].nsect.
  2. Fix volume header PGBUF_LATCH_WRITE; on error -> return.
  3. Hint-guided scan. Three scan shapes via disk_stab_iterate_units(..., disk_stab_unit_reserve, context); each error path does goto exit:
    // disk_reserve_sectors_in_volume -- src/storage/disk_manager.c
    if (volheader->hint_allocsect > 0 && volheader->hint_allocsect < volheader->nsect_total)
    {
    // ... cursors hint..end; iterate ... /* after hint */
    if (context->nsects_lastvol_remaining > 0) /* still short: wrap start..hint */
    { end_cursor = start_cursor; disk_stab_cursor_set_at_start (volheader, &start_cursor);
    error_code = disk_stab_iterate_units (...); }
    }
    else
    { /* ... cursors start..end; iterate whole table ... */ }
  4. Must be satisfied. if (nsects_lastvol_remaining != 0) { assert_release(false); ER_FAILED; goto exit; } — residue means cache and bitmap disagree (a bug).
  5. Advance the hint. hint_allocsect = (vsidp - 1)->sectid + 1; best-effort, neither dirtied nor logged.
  6. exit: unfix the header if fixed; return error_code.
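
The hint-guided scan in step 3 can be sketched on a plain boolean array, one entry per sector, instead of the real STAB pages and cursors. The `toy_*` names and the array representation are assumptions; the branch structure (hint to end, then wrap from start to hint, else scan the whole table) follows the steps above.

```c
#include <assert.h>
#include <stdbool.h>

/* Reserve free slots in [from, to), writing their indexes to out.
 * Returns how many were taken (at most want). */
int
toy_scan_range (bool *used, int from, int to, int want, int *out)
{
  int i, got = 0;

  for (i = from; i < to && got < want; i++)
    {
      if (!used[i])
        {
          used[i] = true;
          out[got++] = i;
        }
    }
  return got;
}

/* Hint-guided reservation: a valid hint scans hint..end first and wraps to
 * start..hint only if still short; otherwise the whole table is scanned. */
int
toy_reserve_with_hint (bool *used, int nsect_total, int hint, int want, int *out)
{
  int got;

  if (hint > 0 && hint < nsect_total)
    {
      got = toy_scan_range (used, hint, nsect_total, want, out);
      if (got < want)
        got += toy_scan_range (used, 0, hint, want - got, out + got);
    }
  else
    {
      got = toy_scan_range (used, 0, nsect_total, want, out);
    }
  return got;
}
```

With sectors 1, 5, and 7 free out of eight and a hint of 4, a request for three sectors picks up 5 and 7 after the hint and then wraps to pick up 1.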

The bit-flip lives in the disk_stab_unit_reserve callback, invoked per 64-bit STAB unit (full unit BIT64_FULL returns early; empty unit 0 filled in bulk; partial unit walked bit by bit), recording each new sector into context->vsidp. Permanent purpose logs each change:

// disk_stab_unit_reserve -- src/storage/disk_manager.c
if (context->purpose == DB_PERMANENT_DATA_PURPOSE) /* redo+undo image = the changed bits mask */
  log_append_undoredo_data2 (thread_p, RVDK_RESERVE_SECTORS, NULL, cursor->page,
                             cursor->offset_to_unit, sizeof (log_unit), sizeof (log_unit), &log_unit, &log_unit);
pgbuf_set_dirty (thread_p, cursor->page, DONT_FREE);
if (context->nsects_lastvol_remaining <= 0) { *stop = true; } /* <- end the volume scan */

Redo and undo images are the same log_unit mask; the recovery handlers disk_rv_reserve_sectors / disk_rv_unreserve_sectors re-sync the cache under CSECT_DISK_CHECK (recovery chapter). Temporary reservations log nothing — their bits reset wholesale on restart.
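
To make the per-unit behaviour concrete, here is a minimal sketch of a reserve callback over one 64-bit unit. It is a simplified model, not the CUBRID function: `toy_unit_reserve` is a hypothetical name, the bulk path is taken only when the whole unit is wanted, and the returned mask plays the role of the `log_unit` image described above. Logging, page dirtying, and VSID recording are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_UNIT_FULL UINT64_MAX

/* Reserve up to *remaining free bits in one 64-bit unit.
 * Returns the mask of bits newly set (0 if nothing changed). */
uint64_t
toy_unit_reserve (uint64_t *unit, int *remaining)
{
  uint64_t changed = 0;
  int bit;

  if (*unit == TOY_UNIT_FULL || *remaining <= 0)
    return 0;                         /* full unit: return early */

  if (*unit == 0 && *remaining >= 64)
    {
      changed = TOY_UNIT_FULL;        /* empty unit: fill in bulk */
      *remaining -= 64;
    }
  else
    {
      /* partial unit (or a small request): walk bit by bit */
      for (bit = 0; bit < 64 && *remaining > 0; bit++)
        {
          uint64_t mask = (uint64_t) 1 << bit;

          if ((*unit & mask) == 0)
            {
              changed |= mask;
              (*remaining)--;
            }
        }
    }
  *unit |= changed;
  return changed;
}
```

For a unit holding 0x5 (bits 0 and 2 taken) and a debt of 2, the call sets bits 1 and 3, returns the mask 0xA, and leaves the unit at 0xF with the debt at zero, which is the point where the real callback would set *stop.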

Invariant — the cache mutex is released throughout step 2. Step 1 charged the counters; step 2 touches only page latches and the WAL, so the hot reserve mutex is held for O(volumes) counter math, never O(sectors) bitmap I/O.

If either step errors, disk_reserve_sectors jumps to error:. Let nreserved = vsidp - reserved_sectors be the sectors actually flipped.

  1. nreserved > 0 and temp purpose: nothing was logged, so abort cannot undo the partial bitmap changes; disable interrupt checks, qsort the VSIDs, call disk_unreserve_ordered_sectors_without_csect. Permanent skips this — the log_sysop_abort below undoes its logged changes.
  2. Reconcile the ledger with what abort/undo already returned: for each flipped sector, decrement its volume’s cache_vol_reserve[].nsect, leaving only sectors charged to the cache but never flipped:
    // disk_reserve_sectors (error path) -- src/storage/disk_manager.c
    for (iter_vsid = 0; iter_vsid < nreserved; iter_vsid++)
      {
        for (iter = 0; iter < context.n_cache_vol_reserve; iter++)
          if (reserved_sectors[iter_vsid].volid == context.cache_vol_reserve[iter].volid)
            { context.cache_vol_reserve[iter].nsect--; break; } /* <- don't double-credit */
        assert (iter < context.n_cache_vol_reserve);
      }
  3. Refund the residue via disk_cache_free_reserved(&context) (adds remaining nsect back through disk_cache_update_vol_free under the reserve mutex).
  4. Exit csect, log_sysop_abort — for permanent purpose this rolls the logged STAB bits back.
  5. Classify the error. Expected IO/interrupt errors (ER_INTERRUPTED, ER_IO_MOUNT_FAIL, ER_IO_FORMAT_OUT_OF_SPACE, ER_IO_WRITE, ER_BO_CANNOT_CREATE_VOL) return as-is. Anything else trips assert_release(false) and self-heals: if not yet retried, disk_check(thread_p, true); if it reports DISK_INVALID, clear the error, set retried = true, goto retry. A second failure or non-skew cause returns.

disk_unreserve_ordered_sectors_without_csect rebuilds a fresh context from the ordered VSID list, grouping consecutive same-volid runs into ledger entries (asserting increasing volids and sectids). It then calls disk_unreserve_sectors_from_volume per group and returns the first error (ASSERT_ERROR (); return error_code;) without refunding the remaining groups. Its disk_stab_unit_unreserve callback clears bits and also returns the sectors to the cache; this is the “removed from cache too” effect that the reconciliation loop in step 2 compensates for.
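
The reconciliation in step 2 of the error path can be replayed on toy data. The sketch below uses hypothetical `toy_rb_*` types in place of `VSID` and `DISK_CACHE_VOL_RESERVE`, and returns the total residue that a refund step would then hand back to the cache; it does not model the refund itself.

```c
#include <assert.h>

typedef struct { int volid; int sectid; } toy_rb_vsid;
typedef struct { int volid; int nsect; } toy_rb_entry;

/* For every sector whose bit was actually flipped, take one off the ledger
 * entry of its volume. What is left in the ledger was charged to the cache
 * but never flipped. Returns the total of that residue. */
int
toy_reconcile_ledger (toy_rb_entry *ledger, int n_ledger,
                      const toy_rb_vsid *flipped, int n_flipped)
{
  int i, j, residue = 0;

  for (i = 0; i < n_flipped; i++)
    {
      for (j = 0; j < n_ledger; j++)
        {
          if (ledger[j].volid == flipped[i].volid)
            {
              ledger[j].nsect--;      /* this sector is returned elsewhere */
              break;
            }
        }
      assert (j < n_ledger);          /* every flipped sector has an entry */
    }

  for (j = 0; j < n_ledger; j++)
    {
      assert (ledger[j].nsect >= 0);
      residue += ledger[j].nsect;
    }
  return residue;
}
```

If the ledger promised 3 sectors from volume 0 and 5 from volume 1, and the failure hit after all three bits on volume 0 and one bit on volume 1 were flipped, the residue is 4 sectors, all on volume 1.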

Invariant — reserve order cache->bitmap, release order bitmap->cache; the cache never overcounts. On reserve the counter drops before the bit is set; on release the bit clears before the counter rises. Both transients leave the cache showing less free than the bitmap, so two reservers can never both be told a sector is free; disk_check repairs the bounded skew, which is why the error: path can retry through it.

  1. Two disjoint phases. Step 1 (disk_reserve_from_cache) moves free-sector counts out of the cache under the short reserve mutex; step 2 (disk_reserve_sectors_in_volume) flips STAB bits under page latches with that mutex released.
  2. Two independent debt counters. n_cache_reserve_remaining (cache) and nsects_lastvol_remaining (per-volume bitmap) never alias, so step 2 is driven volume by volume off cache_vol_reserve[].
  3. Temp prefers perm-type-temp-purpose volumes. When nsect_perm_free > 0 those are scanned first, then fall through to temp-volume extension bounded by disk_Temp_max_sects.
  4. The hot mutex is never held across slow work. An intention counter lets the reserve mutex drop during disk_extend; step 2 never re-takes it.
  5. Permanent reservations are WAL-logged per STAB unit; temporary ones are not. Temp bits simply reset on restart, but a failed temp reservation has no log records to undo, so its rollback physically un-flips the bits via disk_unreserve_ordered_sectors_without_csect.
  6. The transient skew is always conservative. Reserve cache->bitmap, release bitmap->cache, so the cache never reports more free than exists; disk_check repairs the bounded skew and the error: path retries through it once.
  7. The error path reconciles before refunding. It decrements ledger entries for already-returned sectors, then disk_cache_free_reserved refunds only the never-flipped residue, avoiding double-credit.

Chapter 5: Volume Extension as a Nested Top Action


The reader question this chapter answers: what happens inside Step 1 of sector reservation when even the permanent-type / temporary-purpose fallback runs dry and the cache can no longer satisfy the request? The reserving thread must grow the database — extend an existing OS file or create a new volume — before it can finish reserving. This chapter traces that escalation from disk_reserve_from_cache through disk_extend, disk_volume_expand, and disk_add_volume, and shows why the growth must be a nested top action committed independently of the triggering reservation. It continues Chapter 4 and the cache-vs-disk split in the high-level companion (cubrid-disk-manager.md, “Sector reservation”).

5.1 Where extension is triggered: the race window in disk_reserve_from_cache


When the running free count cannot cover n_cache_reserve_remaining, the function records its intention, drops the reserve mutex, then takes the extend mutex. The order is mandatory — mutex_extend carries the comment never get expand mutex while keeping reserve mutexes; the opposite order would deadlock against a concurrent expander already holding mutex_extend.

// disk_reserve_from_cache -- src/storage/disk_manager.c
extend_info->nsect_intention += context->n_cache_reserve_remaining; /* <- publish demand BEFORE releasing */
disk_cache_unlock_reserve (extend_info);
disk_lock_extend (); /* <- serializes all expanders; flips reserve -> extend mutex */
disk_cache_lock_reserve (extend_info);
if (extend_info->nsect_free > context->n_cache_reserve_remaining) /* <- race: someone already grew it */
  {
    extend_info->nsect_intention -= context->n_cache_reserve_remaining;
    disk_reserve_from_cache_vols (extend_info->voltype, context);
    if (context->n_cache_reserve_remaining <= 0)
      { disk_cache_unlock_reserve (extend_info); disk_unlock_extend (); return NO_ERROR; } /* <- no extend */
    extend_info->nsect_intention += context->n_cache_reserve_remaining;
  }
save_remaining = context->n_cache_reserve_remaining; /* <- snapshot, to undo intention after extend */
disk_cache_unlock_reserve (extend_info);
error_code = disk_extend (thread_p, extend_info, context); /* <- the slow path */

The interval between the two mutexes is the race window: another thread can grab mutex_extend first, grow the volume, and refill nsect_free. The double-check after disk_lock_extend() catches that — if the volume is now large enough this thread reverses its nsect_intention bump and reserves from the grown cache with no disk I/O. (disk_extend opens with assert (disk_Cache->owner_extend == thread_get_entry_index (thread_p)), proving it runs only under the extend mutex.)

INVARIANT — nsect_intention is the load-bearing accumulator of unmet demand. A thread adds its remaining need under the reserve mutex and subtracts the same save_remaining snapshot once met. If violated (an add with no matching subtract on an error path), every future disk_extend over-allocates by the leaked amount forever, since it reads nsect_intention as the floor of how much to grow.
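
The balance rule for nsect_intention can be checked with a single-threaded model. This is a sketch under strong assumptions: the `toy_*` names are hypothetical, both mutexes are omitted, and the re-check assumes the retry always satisfies the whole need, whereas the real code may re-publish a remainder.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
  long nsect_free;       /* pool free count */
  long nsect_intention;  /* unmet demand published by waiting threads */
} toy_extend_pool;

/* Before dropping the reserve mutex: publish the unmet demand. */
void
toy_publish_intention (toy_extend_pool *pool, long need)
{
  pool->nsect_intention += need;
}

/* After taking the extend mutex: if a peer already grew the pool, retract
 * the intention and reserve from the free count. Returns true when no
 * extend is needed; otherwise the intention stays published. */
bool
toy_recheck_after_wait (toy_extend_pool *pool, long need)
{
  if (pool->nsect_free > need)        /* strict >, as in the excerpt */
    {
      pool->nsect_intention -= need;
      pool->nsect_free -= need;
      return true;
    }
  return false;
}

/* After the extend: back out exactly the snapshot that was published. */
void
toy_retire_intention (toy_extend_pool *pool, long saved_need)
{
  pool->nsect_intention -= saved_need;
  assert (pool->nsect_intention >= 0);  /* a negative value means a bad pairing */
}
```

Every path that calls `toy_publish_intention` ends in exactly one matching subtraction, either in the re-check or in the retire step; a path that skipped both would leave the counter permanently inflated, which is the leak the invariant warns about.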

5.2 disk_extend: deciding how much, then expand-then-add


disk_extend runs under mutex_extend over a snapshot of the DISK_EXTEND_INFO counters (Chapter 1), sizing the growth then executing it in two phases.

// disk_extend -- src/storage/disk_manager.c
target_free = MAX ((DKNSECTS) (total * 0.01), DISK_MIN_VOLUME_SECTS); /* <- 1% of size, floored */
nsect_extend = MAX (target_free - free, 0) + intention; /* <- coalesce all unmet demand */
if (nsect_extend <= 0)
  return NO_ERROR; /* <- branch 1: free exceeds target, no intentions */
// ... condensed ...
if (total < max) /* <- phase 1: extendable volume still has room */
  {
    to_expand = MIN (nsect_extend, max - total); /* <- never exceed this volume's ceiling */
    log_sysop_start (thread_p); /* <- NESTED TOP ACTION begins */
    error_code = disk_volume_expand (thread_p, extend_info->volid_extend, voltype, to_expand, &nsect_free_new);
    if (error_code != NO_ERROR)
      { ASSERT_ERROR (); log_sysop_abort (thread_p); return error_code; } /* <- header undo */
    log_sysop_commit (thread_p); /* <- commit independently of outer reservation */
    if (extend_info->nsect_total == extend_info->nsect_max)
      extend_info->volid_extend = NULL_VOLID; /* <- maxed out; never extend this volume again */
    nsect_extend -= nsect_free_new;
    // ... condensed: bump nsect_total; under reserve mutex update vol_free + reserve ahead ...
    if (nsect_extend <= 0)
      return NO_ERROR; /* <- expansion alone covered the demand */
  }
// ... condensed: assert (nsect_extend > 0); volext init (nsect_max, voltype, purpose, overwrite=false) ...
while (nsect_extend > 0) /* <- phase 2: add fresh volumes */
  {
    if (check_interrupt && logtb_is_interrupted (thread_p, true, &continue_check))
      { er_set (..., ER_INTERRUPTED, 0); return ER_INTERRUPTED; } /* <- branch: only if re-enabled */
    volext.nsect_total = nsect_extend + DISK_SYS_NSECT_SIZE (volext.nsect_max);
    // ... condensed: clamp to [DISK_MIN_VOLUME_SECTS, nsect_max] then DISK_SECTS_ROUND_UP ...
    error_code = disk_add_volume (thread_p, &volext, &volid_new, &nsect_free_new);
    if (error_code != NO_ERROR)
      { ASSERT_ERROR (); return error_code; } /* <- disk_add_volume aborted its own sysop */
    nsect_extend -= nsect_free_new;
    // ... condensed: bump nsect_total/nsect_max; under reserve mutex set vol_free + reserve ahead ...
    if (extend_info->nsect_total < extend_info->nsect_max)
      extend_info->volid_extend = volid_new; /* <- newest non-maxed volume becomes extendable */
  }
return NO_ERROR;

nsect_extend adds the (non-negative) headroom shortfall to intention, so one expansion serves every thread blocked on this purpose. Phase 1 grows the sub-ceiling volid_extend volume and reserves ahead; phase 2’s three branches — interrupt, disk_add_volume error (callee already aborted), and the volid_extend update (only if sub-max) — are annotated inline.
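
The sizing rule can be worked through with numbers. The sketch below is a hypothetical helper, not the CUBRID function: `toy_nsect_to_extend` is an invented name, `min_volume_sects` stands in for DISK_MIN_VOLUME_SECTS (whose value this chapter does not state), and integer division by 100 replaces the `total * 0.01` cast.

```c
#include <assert.h>

/* Hypothetical model of the disk_extend sizing step:
 * headroom shortfall (never negative) plus all published intentions. */
long
toy_nsect_to_extend (long total, long free_now, long intention,
                     long min_volume_sects)
{
  long target_free = total / 100;     /* about 1% of the pool size ... */
  long shortfall;

  if (target_free < min_volume_sects)
    target_free = min_volume_sects;   /* ... floored at the minimum */

  shortfall = target_free - free_now;
  if (shortfall < 0)
    shortfall = 0;                    /* surplus free space adds nothing */

  return shortfall + intention;
}
```

For a pool of 100000 sectors with 200 free, 50 sectors of published intention, and an assumed minimum of 20, the target is 1000 free sectors, the shortfall is 800, and the extension request is 850. A pool of 1000 sectors with 500 free and no intention yields 0, which is the early-return branch.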

INVARIANT — exactly one volume per purpose is “extendable”. volid_extend names the single volume phase 1 grows; the code clears it to NULL_VOLID the instant a volume reaches nsect_max and re-points it at the newest sub-max volume. If violated, a maxed volume could reach disk_volume_expand with to_expand = MIN(nsect_extend, max - total) non-positive, tripping an assert.

flowchart TD
  C{"nsect_extend <= 0?"} -->|yes| Z1["return NO_ERROR"]
  C -->|no| D{"total < max?"}
  D -->|yes| E["sysop_start; disk_volume_expand"]
  E --> F{"error?"} -->|yes| G["sysop_abort; return error"]
  F -->|no| H["sysop_commit; cache + reserve ahead"]
  H --> I{"nsect_extend <= 0?"} -->|yes| Z2["return NO_ERROR"]
  I -->|no| J["phase 2 loop"]
  D -->|no| J
  J --> K{"interrupted?"} -->|yes| L["return ER_INTERRUPTED"]
  K -->|no| M["size volext; disk_add_volume"]
  M --> N{"error?"} -->|yes| O["return error"]
  N -->|no| P["cache + reserve ahead; set volid_extend if sub-max"]
  P --> Q{"nsect_extend > 0?"} -->|yes| K
  Q -->|no| Z3["return NO_ERROR"]

Figure 5-1. Branch-complete flow of disk_extend: sizing, optional in-place expand, then the add-volume loop.

5.3 disk_volume_expand: growing one file as its own sysop


disk_volume_expand grows a single volume in place. It follows a six-step recipe, and the ordering of those steps is what makes the growth crash-safe.

// disk_volume_expand -- src/storage/disk_manager.c
error_code = disk_get_volheader (thread_p, volid, PGBUF_LATCH_WRITE, &page_volheader, &volheader);
if (error_code != NO_ERROR)
  { assert_release (false); er_set (..., ER_GENERIC_ERROR, 0); return ER_FAILED; } /* <- header fix fatal */
do_logging = (volheader->type == DB_PERMANENT_VOLTYPE); /* <- temp volumes are not logged */
log_sysop_start (thread_p); /* step 1: own sysop so header change can be undone */
volheader->nsect_total += nsect_extend;
if (do_logging)
  log_append_undoredo_data2 (thread_p, RVDK_VOLHEAD_EXPAND, ...); /* step 2: header undo/redo */
volume_new_npages = DISK_SECTS_NPAGES (volheader->nsect_total);
if (do_logging)
  log_append_dboutside_redo (thread_p, RVDK_EXPAND_VOLUME, ...); /* step 3: unattached redo */
pgbuf_set_dirty_and_free (thread_p, page_volheader); /* free header only after step 3 is logged */
log_sysop_commit (thread_p); /* step 4: cancel the header-undo */
logpb_force_flush_pages (thread_p); /* step 5: log MUST be on disk before the file grows */
error_code = fileio_expand_to (thread_p, volid, volume_new_npages, voltype); /* step 6: grow OS file */
if (error_code != NO_ERROR)
  { assert (false); return error_code; } /* <- cannot-happen: growth already durable; cache desyncs */
*nsect_extended_out = nsect_extend;
return NO_ERROR;

The header-fix failure is fatal; the do_logging branch skips both log records for temp volumes (never recovered); the fileio_expand_to failure is a cannot-happen branch since log_sysop_commit already made the growth durable. RVDK_VOLHEAD_EXPAND (disk_rv_volhead_extend_undo/..._redo) adjusts nsect_total and the cache by the same delta; RVDK_EXPAND_VOLUME re-runs fileio_expand_to on recovery.

INVARIANT — the file-growth redo log must be durable before the file is grown. Step 5 (logpb_force_flush_pages) sits between the committed header update and fileio_expand_to. If skipped, a crash in between leaves the recovered header and the OS file disagreeing on size.

5.4 disk_add_volume: a fresh OS file plumbed into three registries


When in-place expansion is exhausted, disk_extend calls disk_add_volume — the second nested top action — wrapping the cache-mutating steps in log_sysop_start/log_sysop_commit.

// disk_add_volume -- src/storage/disk_manager.c
if (disk_Cache->nvols_perm + disk_Cache->nvols_temp >= LOG_MAX_DBVOLID)
  return ER_BO_MAXNUM_VOLS_HAS_BEEN_EXCEEDED; /* <- volume-id space exhausted */
error_code = boot_get_new_volume_name_and_id (..., &volid); /* step 1: name + id from boot */
// ... condensed: raw-device symlink, partition free-space check ...
if (nsect_part_max >= 0 && nsect_part_max < extinfo->nsect_max)
  return ER_IO_FORMAT_OUT_OF_SPACE; /* step 2 failed: not enough OS disk space */
if (!extinfo->overwrite && fileio_is_volume_exist (extinfo->name))
  { /* ... condensed: disk_can_overwrite_data_volume check ... */
    return ER_BO_VOLUME_EXISTS; /* <- refuse to clobber an existing file */
  }
log_sysop_start (thread_p); /* NESTED TOP ACTION begins */
if (extinfo->voltype == DB_PERMANENT_VOLTYPE) disk_Cache->nvols_perm++; /* step 3: cache before format */
else disk_Cache->nvols_temp++;
disk_Cache->vols[volid].purpose = extinfo->purpose;
error_code = disk_format (thread_p, boot_db_full_name (), volid, extinfo, nsects_free_out); /* step 4 */
if (error_code != NO_ERROR)
  { ASSERT_ERROR (); goto exit; }
if (extinfo->voltype == DB_PERMANENT_VOLTYPE)
  if (logpb_add_volume (NULL, volid, extinfo->name, DB_PERMANENT_DATA_PURPOSE) == NULL_VOLID)
    { ASSERT_ERROR_AND_SET (error_code); goto exit; } /* step 5: register in _vinf (perm only) */
error_code = boot_dbparm_save_volume (thread_p, extinfo->voltype, volid); /* step 6: persist in boot_Db_parm */
if (error_code != NO_ERROR)
  {
    ASSERT_ERROR ();
    if (extinfo->voltype == DB_TEMPORARY_VOLTYPE && disk_unformat (thread_p, extinfo->name) != NO_ERROR)
      assert (false); /* <- rollback won't drop temp file; do it by hand */
    goto exit;
  }
*volid_out = volid;
exit:
if (error_code == NO_ERROR)
  log_sysop_commit (thread_p);
else
  {
    log_sysop_abort (thread_p);
    if (extinfo->voltype == DB_TEMPORARY_VOLTYPE) disk_Cache->nvols_temp--; /* <- undo cache count manually */
    else disk_Cache->nvols_perm--;
  }
return error_code;

Three registries (Figure 5-2): boot_Db_parm updated last (a crash before it leaves no dangling reference); the _vinf registry via logpb_add_volume, permanent only; and disk_Cache, nvols_* and vols[volid].purpose bumped first so disk_format’s page fixes find the volume classified. Every goto exit funnels into one log_sysop_abort; two things logging cannot undo are fixed by hand in the abort arm — the raw nvols_* counter and, for a temp volume, the file (disk_unformat, since temp creation is not journaled). A permanent volume’s file is handled by recovery via the logged format records.

graph TD
  AV["disk_add_volume\nnew volume file"] --> BP["boot_Db_parm\nboot_dbparm_save_volume()"]
  AV --> VI["_vinf volinfo registry\nlogpb_add_volume() perm only"]
  AV --> DC["disk_Cache\nnvols_*++, vols[volid].purpose"]
  AV --> FMT["disk_format()\nzeroes file, writes volheader + sector table"]

Figure 5-2. The three registries disk_add_volume plumbs a new volume into, plus the on-disk format step.

5.5 disk_add_volume_extension: the addvoldb / boot-time entry, and the retired daemon


disk_extend is the automatic path; disk_add_volume_extension is the explicit entry, called by addvoldb and at database creation. It does not size against nsect_intention — the caller dictates npages — but respects the same serialization, taking disk_lock_extend() and the CSECT_DISK_CHECK reader latch so an admin addvol cannot race an automatic disk_extend.

// disk_add_volume_extension -- src/storage/disk_manager.c
error_code = csect_enter_as_reader (thread_p, CSECT_DISK_CHECK, INF_WAIT);
disk_lock_extend (); /* <- block other expansions */
// ... condensed: realpath, fill ext_info from caller args ...
ext_info.nsect_total = disk_sectors_to_extend_npages (npages);
ext_info.nsect_max = ext_info.nsect_total; /* <- born at its max: never auto-grown */
if (voltype == DB_TEMPORARY_VOLTYPE)
  {
    if (disk_Cache->temp_purpose_info.extend_info.nsect_total + ext_info.nsect_total > disk_Temp_max_sects)
      { er_set (..., ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED, ...);
        disk_unlock_extend (); csect_exit (thread_p, CSECT_DISK_CHECK);
        return ER_BO_MAXTEMP_SPACE_HAS_BEEN_EXCEEDED; } /* <- temp-space cap: release BOTH locks */
    ext_info.voltype = DB_TEMPORARY_VOLTYPE;
  }
else
  ext_info.voltype = DB_PERMANENT_VOLTYPE;
error_code = disk_add_volume (thread_p, &ext_info, &volid_new, &nsect_free);
if (error_code != NO_ERROR)
  { ASSERT_ERROR (); disk_unlock_extend (); csect_exit (thread_p, CSECT_DISK_CHECK); return error_code; }
// ... condensed: bump per-purpose nsect_total/nsect_max, update vol_free under reserve mutex ...
disk_unlock_extend (); csect_exit (thread_p, CSECT_DISK_CHECK);
*volid_out = volid_new;
return NO_ERROR;

ext_info.nsect_max = ext_info.nsect_total means a user-added volume is born at its maximum size, never a candidate for in-place expansion. Three branches: the temp-space-exceeded early return, the disk_add_volume error return (both releasing the extend mutex and the critical section), and the success path. The post-add bookkeeping distinguishes a permanent-type volume serving temporary purpose from a true temporary-type volume — the three-way classification used throughout the cache.

The retired daemon. The comment atop disk_extend still mentions an auto-expansion thread keeping “a stable level of free space,” but that daemon has been removed — which is why nsect_intention is now the sole coalescing mechanism: the first thread to take the extend mutex must grow enough for itself and every thread that published an intention while it waited.

5.6 Why a nested top action — and not the outer transaction


Both disk_volume_expand and disk_add_volume wrap their durable work in log_sysop_start/log_sysop_commit rather than letting it ride on the outer reservation’s transaction. The grower acts on behalf of all co-users of the new space: reserve-ahead hands fresh sectors to the triggering reservation, but other waiting threads reserve from the same volume once the extend mutex is released. If the growth rode the outer transaction and that transaction later rolled back, every co-user would be forced to roll back too — a volume several transactions depend on would vanish. Committing as an independent nested top action makes the space durable regardless of the triggering transaction’s fate: the reservation can still abort; the volume stays. This is the discipline the companion describes for file-table updates, applied to the coarsest unit of growth.

  1. The extend path is entered only after the cache fails twice. disk_reserve_from_cache records nsect_intention, releases the reserve mutex, takes mutex_extend, and re-checks free space — the double-check absorbs the race where another thread already grew the volume between the two mutexes.
  2. nsect_intention is the load-bearing accumulator. With the auto-expansion daemon removed, it is the only mechanism coalescing concurrent demand; disk_extend adds it to nsect_extend so one expansion serves every waiting thread, and paired +=/-= (with a save_remaining snapshot) keep it balanced across error paths.
  3. disk_extend is expand-then-add. It grows the single volid_extend volume in place up to nsect_max (one volume per purpose may grow), then loops adding fresh volumes for residual demand, reserving ahead into the caller’s context after each step.
  4. disk_volume_expand orders log-before-grow. Header undo/redo plus an unattached RVDK_EXPAND_VOLUME redo, a forced log flush, then fileio_expand_to — whose failure is unrecoverable by construction.
  5. disk_add_volume plumbs the new file into three registries: boot_Db_parm (last), the _vinf registry (permanent only), and disk_Cache (counts first). On error it manually undoes the unlogged cache counter and unformats an orphaned temp file.
  6. disk_add_volume_extension is the explicit addvol / boot-time twin: same mutex_extend serialization, caller-supplied size, and nsect_max == nsect_total so user volumes are never auto-grown.
  7. Growth is a nested top action so co-users are not held hostage: committing the expansion independently means a later rollback of the triggering reservation cannot destroy a volume other transactions now depend on.

Chapter 6: File Creation and the Three-Table Layout

The high-level companion (cubrid-disk-manager.md) explains why a file is a set of reserved sectors. This chapter answers the mechanical follow-up: once disk_reserve_sectors (Ch.4) returns a sorted VSID array, how does file_create turn it into a usable file — header page, VFID, and the partial / full / user-page tables every later allocation relies on? We assume the VSID array exists and trace the file-manager side.

file_create (in file_manager.c) is the one engine. Everything else — the file_create_heap / temp / ehash family — is a thin wrapper that picks two booleans (is_temp, is_numerable), a FILE_TYPE, a FILE_TABLESPACE, and an optional FILE_DESCRIPTORS, then calls it.

6.1 The four structs that live in the header page

A file’s header page (PAGE_FTAB) begins with one file_header struct, followed by one to three file_extensible_data table headers. Two of file_header’s members are themselves structs (FILE_TABLESPACE, FILE_DESCRIPTORS).

// struct file_header -- src/storage/file_manager.c
struct file_header
{
  INT64 time_creation; /* Time of file creation. */
  VFID self; /* Self VFID */
  FILE_TABLESPACE tablespace; /* The table space definition */
  FILE_DESCRIPTORS descriptor; /* File descriptor. Depends on file type. */
  // ... page / sector counters, flags, table offsets, temp+numerable cursors ...
};

file_header fields.

| Field | Role / why it exists |
| --- | --- |
| time_creation | Wall-clock create time; distinguishes reused fileids. |
| self | This file’s own VFID; self-identifying header for recovery. |
| tablespace | Embedded FILE_TABLESPACE; perm extension (Ch.5), zeroed temp. |
| descriptor | Embedded FILE_DESCRIPTORS union; type-specific owner metadata. |
| n_page_total | Total pages over all sectors; allocation ceiling. |
| n_page_user | User pages handed out (0); user vs table pages. |
| n_page_ftab | Pages used by the file’s tables; starts at 1 (header). |
| n_page_free | Reserved-but-unallocated pages; Ch.7/8 draws down. |
| n_page_mark_delete | Removed numerable pages; marked, not compacted. |
| n_sector_total | Reserved-sector count; equals n_sectors. |
| n_sector_partial | Sectors with a free page (total - full); alloc candidates. |
| n_sector_full | Sectors fully used by tables; perm only. |
| n_sector_empty | Sectors with no page allocated; starts -1 (header sector). |
| type | FILE_TYPE enum; type routing. |
| file_flags | NUMERABLE/TEMPORARY/ENCRYPTED_*; truth for FILE_IS_*. |
| volid_last_expand | Last volume that supplied a sector; seeds next extension. |
| offset_to_partial_ftab | Offset to partial table; anchors GET_PART_FTAB. |
| offset_to_full_ftab | Offset to full table; perm only, else NULL_OFFSET. |
| offset_to_user_page_ftab | Offset to user-page table; numerable only, else NULL_OFFSET. |
| vpid_sticky_first | Undeletable first page; set later (Ch.11). |
| vpid_last_temp_alloc + offset_to_last_temp_alloc | Temp-alloc cursor (page + offset); temp shortcut (Ch.8). |
| vpid_last_user_page_ftab | Last user-page-table page; numerable append (Ch.10). |
| vpid_find_nth_last / first_index_find_nth_last | Cached find_nth position; nth-lookup speedup (Ch.10). |
| reserved0..3 | Padding, zeroed; forward-compat. |

graph LR
  FH["file_header"] -->|embeds| TS["FILE_TABLESPACE"]
  FH -->|embeds union| DES["FILE_DESCRIPTORS"]
  FH -->|offset_to_partial_ftab| PT["partial table"]
  FH -->|offset_to_full_ftab perm| FT["full table"]
  FH -->|offset_to_user_page_ftab numerable| UT["user-page table"]

Figure 6-1. file_header embeds two structs and points at one to three file_extensible_data tables in the header page.

FILE_TABLESPACE — four fields set by FILE_TABLESPACE_FOR_PERM_NPAGES / _FOR_TEMP_NPAGES: initial_size (requested bytes MAX(1,npages)*DB_PAGESIZE, seeds total_size); expand_ratio (geometric-growth fraction, 0 for temp); expand_min_size / expand_max_size (per-extension clamps, both 0 for temp so temp never auto-extends).

FILE_DESCRIPTORS is a union padded to 64 bytes (FILE_DESCRIPTORS_SIZE). Arms: heap (class_oid, hfid), heap_overflow, btree (class_oid, attr_id), btree_key_overflow, ehash (class_oid, attr_id), vacuum_data (vpid_first), dummy_align (forces the 64-byte footprint). The fixed size is load-bearing — the header warns “if you change file descriptors size, make sure to change disk compatibility version too!”: the union size is part of the on-disk format.
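The fixed-footprint idea can be sketched in a few lines of C. This is only an illustration: the member types below are simplified, hypothetical stand-ins rather than the real CUBRID definitions, and only the 64-byte size and the padding arm named dummy_align are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real OID / HFID types. */
typedef struct { int32_t pageid; int16_t slotid; int16_t volid; } sketch_oid;
typedef struct { int32_t fileid; int16_t volid; int32_t hpgid; } sketch_hfid;

#define SKETCH_DESCRIPTORS_SIZE 64

/* Every arm is smaller than 64 bytes; dummy_align forces the footprint. */
typedef union
{
  struct { sketch_oid class_oid; sketch_hfid hfid; } heap;
  struct { sketch_oid class_oid; int32_t attr_id; } btree;
  char dummy_align[SKETCH_DESCRIPTORS_SIZE];
} sketch_file_descriptors;

/* Compile-time guard: growing an arm past 64 bytes breaks the build
 * instead of silently changing the on-disk layout. */
_Static_assert (sizeof (sketch_file_descriptors) == SKETCH_DESCRIPTORS_SIZE,
                "descriptor union must keep its on-disk size");

size_t
sketch_descriptor_footprint (void)
{
  return sizeof (sketch_file_descriptors);
}
```

A compile-time check of this kind is one way to make the "change the disk compatibility version too" warning hard to overlook; whether the real header uses such a guard is not stated in the text.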

file_extensible_data is the table header repeated up to three times after file_header. Four fields: vpid_next (continuation-page link when a table outgrows the header page), max_size (item capacity in bytes, fixed at file_extdata_init), size_of_item (bytes per item — one struct, three item types), n_items (items stored, starts 0).

6.2 Estimating size: data plus worst-case file-table sectors

file_create turns the requested byte size into a sector count, then reserves extra sectors for the file’s own tables. The estimate is pessimistic on purpose — over-reserving is cheap, under-reserving forces a mid-create extension.

// file_create -- src/storage/file_manager.c
total_size = tablespace->initial_size;
if (!is_numerable) max_size_ftab = total_size / 8 / 1024; /* <- partial+full (~1 byte/8KB) */
else max_size_ftab = total_size * 33 / 8 / 1024; /* <- + user-page table */
total_size += max_size_ftab;
n_sectors = (int) CEIL_PTVDIV (total_size, DB_SECTORSIZE);
vsids_reserved = (VSID *) db_private_alloc (thread_p, n_sectors * sizeof (VSID));

On db_private_alloc failure: er_set(ER_OUT_OF_VIRTUAL_MEMORY) then goto exit (nothing reserved yet). Otherwise, for permanent files only (do_logging = !is_temp), log_sysop_start opens a system operation; temp files skip it. This do_logging split recurs at every dirty/unfix call below.
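The size estimate can be replayed in isolation. The sketch below mirrors the arithmetic of the excerpt; the page size (16 KiB) and the resulting sector size (64 pages, 1 MiB) are assumptions chosen for illustration, and sketch_ceil_div is a hypothetical stand-in for CEIL_PTVDIV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 16 KiB pages, 64 pages per sector. */
#define SKETCH_PAGESIZE   ((int64_t) 16384)
#define SKETCH_SECTORSIZE (SKETCH_PAGESIZE * 64)

/* Round-up division for positive operands (stand-in for CEIL_PTVDIV). */
static int64_t
sketch_ceil_div (int64_t dividend, int64_t divisor)
{
  return (dividend + divisor - 1) / divisor;
}

/* Requested bytes plus the pessimistic file-table allowance, in sectors. */
int
sketch_estimate_sectors (int64_t initial_size, bool is_numerable)
{
  int64_t total_size = initial_size;
  int64_t max_size_ftab;

  if (!is_numerable)
    max_size_ftab = total_size / 8 / 1024;      /* partial + full tables */
  else
    max_size_ftab = total_size * 33 / 8 / 1024; /* plus user-page table */

  total_size += max_size_ftab;
  return (int) sketch_ceil_div (total_size, SKETCH_SECTORSIZE);
}
```

Under these assumptions a request of exactly one sector's worth of data already needs two sectors, because the table allowance pushes the total past the sector boundary.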

// file_create -- src/storage/file_manager.c
volpurpose = is_temp ? DB_TEMPORARY_DATA_PURPOSE : DB_PERMANENT_DATA_PURPOSE;
error_code = disk_reserve_sectors (thread_p, volpurpose, NULL_VOLID, n_sectors, vsids_reserved);
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
was_temp_reserved = is_temp; /* <- arm temp-leak cleanup */
volid_last_expand = vsids_reserved[n_sectors - 1].volid; /* <- before sort! */
qsort (vsids_reserved, n_sectors, sizeof (VSID), disk_compare_vsids);

volid_last_expand is grabbed before the sort: sectors come back in reservation order, and the last one is the most-recently-extended volume, where future growth should continue. was_temp_reserved arms the manual unreserve at exit (temp reservations are not undone by recovery).
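The ordering matters because sorting destroys reservation order. A minimal sketch, using a hypothetical sketch_vsid type and comparator (not the real VSID or disk_compare_vsids), shows the captured value differing from the last element after the sort:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VSID. */
typedef struct { short volid; int sectid; } sketch_vsid;

/* Orders by volume first, then by sector id. */
static int
sketch_compare_vsids (const void *a, const void *b)
{
  const sketch_vsid *first = a;
  const sketch_vsid *second = b;

  if (first->volid != second->volid)
    return first->volid < second->volid ? -1 : 1;
  if (first->sectid != second->sectid)
    return first->sectid < second->sectid ? -1 : 1;
  return 0;
}

/* Reads the last reserved volume BEFORE sorting, then sorts in place. */
short
sketch_capture_then_sort (sketch_vsid *vsids, int n_sectors)
{
  short volid_last_expand = vsids[n_sectors - 1].volid;

  qsort (vsids, (size_t) n_sectors, sizeof (sketch_vsid), sketch_compare_vsids);
  return volid_last_expand;
}
```

If the capture were moved after the qsort call, it would always report the highest-numbered volume in the batch instead of the one reserved last.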

Header-page (hence VFID) selection then branches on type (Figure 6-2):

flowchart TD
  B{"SERVER_MODE and type in\nBTREE/HEAP/HEAP_REUSE_SLOTS?"}
  B -->|yes| C["scan fileids in first volume\nvacuum_is_file_dropped per fileid"]
  C --> D{"non-dropped found?"}
  D -->|yes| E["vfid = found_vfid"]
  D -->|no| F["assert_release false -> exit"]
  B -->|no| G["vfid = first page of sectid[0]"]
  E --> H["vpid_fhead = vfid"]
  G --> H

Figure 6-2. Header-page / VFID selection.

The default branch takes the first page of the first sorted sector. The vacuum-aware branch (SERVER_MODE and type in BTREE/HEAP/HEAP_REUSE_SLOTS) exists because reusing a VFID that vacuum still believes is “dropped” would corrupt its dropped-files list. It walks every fileid of every sector in the first volume (the VFID must share that volume) and picks the first one that vacuum_is_file_dropped reports as not dropped. If that function itself returns an error, control jumps to exit; a first volume in which every fileid is still dropped is treated as impossible and trips assert_release(false).

// file_create -- src/storage/file_manager.c
page_fhead = pgbuf_fix (thread_p, &vpid_fhead, NEW_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
if (page_fhead == NULL) { ASSERT_ERROR_AND_SET (error_code); goto exit; }
// ... condensed: memset(0), set ptype PAGE_FTAB, fhead = page; self/tablespace/type set ...
if (des != NULL) { fhead->descriptor = *des; } /* <- temp/query-area pass NULL */
if (is_numerable) { fhead->file_flags |= FILE_FLAG_NUMERABLE; }
if (is_temp) { fhead->file_flags |= FILE_FLAG_TEMPORARY; }
// ... condensed: time_creation, NULL cursors, zero counters ...
fhead->n_page_ftab = 1; /* <- the header page is itself a table page */
fhead->n_sector_empty--; /* <- start negative: header sector is not empty */

The header is fixed new (error path on NULL), zeroed, typed PAGE_FTAB, and self/tablespace/type stamped in. Two non-obvious seeds: n_page_ftab starts at 1 (the header is a table page) and n_sector_empty at -1 so the header’s sector is not counted as empty when partial sectors are later tallied.

6.5 The three-table layout — four flavors of the header byte budget

After the offset cursor offset_ftab is seeded to FILE_HEADER_ALIGNED_SIZE (the first byte past the fixed header), file_create carves the remaining DB_PAGESIZE - offset_ftab bytes into tables (Figure 6-3). The four-way split keys on the (is_temp, is_numerable) pair:

flowchart TD
  N{"is_numerable?"}
  N -->|yes| NT{"is_temp?"}
  N -->|no| RT{"is_temp?"}
  NT -->|yes| A["temp numerable\npartial 1/16, user-page 15/16"]
  NT -->|no| B["perm numerable\npartial 1/32, full 1/32, user-page 15/16"]
  RT -->|yes| C["temp regular\npartial = all remaining"]
  RT -->|no| D["perm regular\npartial 1/2, full 1/2"]

Figure 6-3. The four flavors of header-page partitioning. Every flavor allocates a partial table; full and user-page are conditional.

Each table is initialized with file_extdata_init(item_size, size, extdata), where item_size is sizeof(FILE_PARTIAL_SECTOR) for partial, sizeof(VSID) for full, sizeof(VPID) for user-page. Each assignment fhead->offset_to_*_ftab = offset_ftab is followed by offset_ftab += file_extdata_max_size(extdata) so the next table starts aligned past it. The permanent-numerable arm is the only one to advance the cursor twice (after partial, after full) before the user-page table consumes the remainder; all others advance it at most once.

Invariant: every file has a partial table, correctly aligned. All four branches end asserting offset_to_partial_ftab != NULL_OFFSET, and every offset_to_*_ftab assignment is followed by assert((INT16) DB_ALIGN(offset, MAX_ALIGNMENT) == offset). The partial table is the universal entry point (Ch.7/8 walk it first); full/user-page offsets stay NULL_OFFSET when unused. Alignment holds by construction (FILE_HEADER_ALIGNED_SIZE is pre-aligned, file_extdata_max_size returns an aligned span). The FILE_HEADER_GET_*_FTAB macros enforce the contract on every later read: GET_FULL_FTAB asserts !FILE_IS_TEMPORARY(fh), GET_USER_PAGE_FTAB asserts FILE_IS_NUMERABLE(fh), and all three bound the offset in [FILE_HEADER_ALIGNED_SIZE, DB_PAGESIZE). A mis-set offset is a loud crash, not silent corruption; broken alignment means unaligned INT64/VPID access.

6.6 Populating the partial table and splitting full sectors

file_create walks vsids_reserved, appending one FILE_PARTIAL_SECTOR per sector into the partial table (file_extdata_append). When the in-header table fills (file_extdata_is_full), it allocates a continuation page from the sectors it is currently recording, chains it via vpid_next, bumps n_page_ftab, and continues there; continuation pages’ bits are set in their sectors’ bitmaps so they are never re-handed to a user.

After the walk the last sector partsect_ftab points at may itself be full (it held the last table page); if so, partsect_ftab++; fhead->n_sector_full++;. Then, for permanent files only, sectors fully consumed by the file table migrate from the partial table into the full table:

// file_create (full-sector migration) -- src/storage/file_manager.c
if (!is_temp && fhead->n_sector_full > 0)
{
  // ... condensed: GET_PART_FTAB + GET_FULL_FTAB into extdata_part_ftab / extdata_full_ftab ...
  for (i = 0; i < fhead->n_sector_full; i++)
  {
    partsect_iter = (FILE_PARTIAL_SECTOR *) file_extdata_at (extdata_part_ftab, i);
    /* ... condensed: drops the file_extdata_is_full / assert_release(false) guard ... */
    file_extdata_append (extdata_full_ftab, &partsect_iter->vsid); /* <- VSID only */
  }
  file_extdata_remove_at (extdata_part_ftab, 0, fhead->n_sector_full);
}

Temp files skip this entirely (no full table); they instead seed the temp cursor (vpid_last_temp_alloc = vpid_fhead, offset_to_last_temp_alloc = n_sector_full). Numerable files (temp or perm) seed the user-page-table head (vpid_last_user_page_ftab and vpid_find_nth_last both set to vpid_fhead). Finally the counters are reconciled — n_sector_total = n_sectors, n_sector_partial = total - full, n_sector_empty += n_sector_partial, n_page_total = n_sector_total * DISK_SECTOR_NPAGES, n_page_free = n_page_total - n_page_ftab — and file_header_sanity_check asserts the header is internally consistent.
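The final counter reconciliation is plain arithmetic and can be checked by hand. The sketch below uses a hypothetical sketch_counters struct holding only the counters named above; the formulas are the ones listed in the preceding paragraph.

```c
#include <assert.h>

#define SKETCH_SECTOR_NPAGES 64

/* Hypothetical subset of file_header: only the counters. */
typedef struct
{
  int n_page_total, n_page_user, n_page_ftab, n_page_free;
  int n_sector_total, n_sector_partial, n_sector_full, n_sector_empty;
} sketch_counters;

/* Applies the end-of-create reconciliation to already-seeded counters. */
void
sketch_reconcile (sketch_counters *c, int n_sectors)
{
  c->n_sector_total = n_sectors;
  c->n_sector_partial = c->n_sector_total - c->n_sector_full;
  c->n_sector_empty += c->n_sector_partial;  /* seeded at -1 for the header sector */
  c->n_page_total = c->n_sector_total * SKETCH_SECTOR_NPAGES;
  c->n_page_free = c->n_page_total - c->n_page_ftab;
}
```

For a four-sector file whose only table page is the header, the -1 seed is what makes n_sector_empty come out as 3 rather than 4: the header's sector is partial but not empty.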

6.7 Commit, tracker registration, and the error/exit path

// file_create (finish) -- src/storage/file_manager.c
if (do_logging)
{ pgbuf_log_new_page (thread_p, page_fhead, DB_PAGESIZE, PAGE_FTAB);
pgbuf_unfix_and_init (thread_p, page_fhead); }
else
{ pgbuf_set_dirty_and_free (thread_p, page_fhead); } /* <- temp: no redo log */
if (!is_temp && file_type != FILE_TRACKER)
{ error_code = file_tracker_register (thread_p, vfid, file_type, NULL);
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; } }
if (is_temp) { ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.nfile, 1); /* ...stats... */ }

Permanent files log the header for redo and register with the file tracker — except FILE_TRACKER itself, which would be circular (tracker registration is Ch.11). Temp files only bump in-memory spacedb_temp counters. The shared exit label handles every branch’s failure: (1) unfix page_ftab / page_fhead if still held; (2) if is_sysop_started, on error log_sysop_abort (rolls back reserve+layout), on success log_sysop_end_logical_undo(RVFL_DESTROY, vfid) so a later transaction abort tears the whole file down; (3) on error VFID_SET_NULL(vfid) so callers never see a half-built id, and if was_temp_reserved the temp sectors are manually unreserved here (recovery won’t, since temp work isn’t logged) under logtb_set_check_interrupt(false); (4) always db_private_free(vsids_reserved).

Every public creator funnels into file_create with a fixed (is_temp, is_numerable, file_type):

| Wrapper | file_type | is_temp | is_numerable | Descriptor | Tablespace |
| --- | --- | --- | --- | --- | --- |
| file_create_heap | FILE_HEAP / FILE_HEAP_REUSE_SLOTS | no | no | heap (class_oid) | perm, npages=1 |
| file_create_temp | FILE_TEMP | yes | no | NULL | temp |
| file_create_temp_numerable | FILE_TEMP | yes | yes | NULL | temp |
| file_create_query_area | FILE_QUERY_AREA | yes | no | NULL | temp, npages=1 |
| file_create_ehash | FILE_EXTENDIBLE_HASH | caller’s is_tmp | yes | ehash | temp-sized |
| file_create_ehash_dir | FILE_EXTENDIBLE_HASH_DIRECTORY | caller’s is_tmp | yes | ehash | temp-sized |

file_create_heap builds the descriptor (memset, then des.heap.class_oid = *class_oid) and routes through file_create_with_npages. The three temp wrappers all go through file_create_temp_internal, which is not a thin pass-through:

// file_create_temp_internal -- src/storage/file_manager.c
error_code = file_tempcache_get (thread_p, ftype, is_numerable, &tempcache_entry);
if (VFID_ISNULL (&tempcache_entry->vfid)) /* <- cache miss: create fresh */
{
FILE_TABLESPACE_FOR_TEMP_NPAGES (&tablespace, npages);
file_tempcache_lock_tran_entry (tran_entry); /* <- rmutex_topop guard */
error_code = file_create (thread_p, ftype, &tablespace, NULL, true, is_numerable, vfid_out);
file_tempcache_unlock_tran_entry (tran_entry);
// ... condensed: on error file_tempcache_retire_entry + return; else cache the vfid ...
}
else { *vfid_out = tempcache_entry->vfid; } /* <- cache hit: reuse, no file_create */
file_tempcache_push_tran_file (thread_p, tempcache_entry);

So temp creation may skip file_create entirely and return a cached file. When it does call file_create, it wraps the call in a per-transaction lock because file_create’s log_sysop_start uses rmutex_topop, unsafe across parallel workers of one transaction (tempcache is Ch.11). The ehash wrappers are thin: temp-sized tablespace, FILE_EHASH_DES as descriptor, is_numerable = true unconditionally (nth-page lookup), is_temp forwarded from the caller’s is_tmp.

  1. file_create is the single engine; the wrappers only pick (file_type, is_temp, is_numerable, descriptor, tablespace). Heap/ehash supply a descriptor; temp/query-area pass NULL.
  2. The reserved-sector count is over-estimated to fit the file’s own tables (total/8/1024 extra bytes regular, total*33/8/1024 numerable), avoiding a mid-create extension. The VFID is the first page of the first reserved sector — except heap/btree under SERVER_MODE, which scan past any fileid vacuum still considers dropped.
  3. The header page hosts one file_header plus one to three file_extensible_data tables, partitioned by flavor: perm regular 1/2+1/2, perm numerable 1/32+1/32+15/16, temp regular partial-only, temp numerable 1/16+15/16. Two invariants hold throughout: every file has a partial table, and every offset is MAX_ALIGNMENT-aligned (enforced by FILE_HEADER_GET_*_FTAB).
  4. Permanent files migrate fully-consumed file-table sectors into the full table; temp files keep one cursor (vpid_last_temp_alloc) instead.
  5. do_logging = !is_temp governs durability: perm files run a sysop that logs the header and registers RVFL_DESTROY as logical undo; temp files are set-dirty-and-free, manually unreserved on error, and may be served straight from the tempcache without reaching file_create.

Chapter 7: Permanent File Page Allocation

This chapter answers: given a permanent file that already owns sectors, how does file_alloc hand out the next user page while preserving the head-of-Partial-table invariant that keeps the next allocation O(1)? We trace file_alloc, the engine file_perm_alloc, and its helpers. Theory of sectors, partial vs. full tables, and FILE_EXTENSIBLE_DATA lives in the companion (cubrid-disk-manager.md, “File layout” / “Three-table model”). Temporary allocation (Ch.8) and numerable tables (Ch.10) are out of scope.

Every entry in the partial table is a file_partial_sector (typedef FILE_PARTIAL_SECTOR); file_perm_alloc mutates its bitmap on every fast-path allocation.

// file_partial_sector -- src/storage/file_manager.h
struct file_partial_sector
{
  VSID vsid; /* Important - VSID must be first member ...
              * Sometimes, the FILE_PARTIAL_SECTOR pointers
              * in file table are reinterpreted as VSID. */
  FILE_ALLOC_BITMAP page_bitmap;
};

| Field | Role | Why it exists |
| --- | --- | --- |
| vsid | Sector identity { volid, sectid }; the on-disk address of the 64-page run this entry covers. | Says “this sector is reserved by this file.” Also the bare full-table entry (see invariant). |
| page_bitmap | 64-bit FILE_ALLOC_BITMAP (UINT64); bit k set ⇒ page k allocated. DISK_SECTOR_NPAGES == 64, one bit per page. | Flip one bit instead of scanning. 0x0…0 = FILE_EMPTY_PAGE_BITMAP, 0xF…F = FILE_FULL_PAGE_BITMAP. |

Invariant — VSID must be the first member. Full-table and expansion code reinterprets a FILE_PARTIAL_SECTOR * as a VSID * (a full-table entry is just a VSID). A field placed before vsid would make that cast read garbage. The struct layout is the contract.

classDiagram
  class FILE_PARTIAL_SECTOR {
    VSID vsid
    FILE_ALLOC_BITMAP page_bitmap
  }
  FILE_PARTIAL_SECTOR --> "full table reuses the prefix" VSID : vsid is first

Figure 7-1. The full table stores a bare VSID — exactly the leading field of a partial entry. This prefix compatibility recurs throughout the chapter.
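The prefix cast relies on a C guarantee: a pointer to a struct, suitably converted, points to its first member. The sketch below demonstrates that with hypothetical sketch_* types (field widths are assumptions, not the real VSID layout) and adds a compile-time guard on the member order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VSID and FILE_PARTIAL_SECTOR. */
typedef struct { int32_t sectid; int16_t volid; } sketch_vsid;
typedef struct { sketch_vsid vsid; uint64_t page_bitmap; } sketch_partial_sector;

/* The layout contract, made explicit: moving vsid breaks the build. */
_Static_assert (offsetof (sketch_partial_sector, vsid) == 0,
                "vsid must be the first member");

/* Reinterprets a partial-sector entry as the bare VSID it starts with. */
const sketch_vsid *
sketch_as_vsid (const sketch_partial_sector *partsect)
{
  return (const sketch_vsid *) partsect;
}
```

The static assertion is an addition for illustration; the text only documents the ordering requirement in a comment.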

file_alloc fixes the header, branches on FILE_IS_TEMPORARY, allocates, optionally registers and initializes the page, and frames the permanent path inside a logical-undo system operation.

// file_alloc -- src/storage/file_manager.c
page_fhead = pgbuf_fix (thread_p, &vpid_fhead, OLD_PAGE, PGBUF_LATCH_WRITE, ...);
// ... condensed ...
if (FILE_IS_TEMPORARY (fhead))
  error_code = file_temp_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out); /* <- Ch.8 */
else
{
  log_sysop_start_atomic (thread_p); /* <- nested top action, atomic so undo is one unit */
  is_sysop_started = true;
  error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out); /* <- 7.3 */
  VFID_COPY ((VFID *) undo_log_data, vfid);
  VPID_COPY ((VPID *) (undo_log_data + sizeof (VFID)), vpid_out); /* <- undo payload {vfid,vpid} */
}

Remaining exits (errors → goto exit): (1) pgbuf_fix fails → return, nothing fixed. (2) numerable → file_numerable_add_page (tail call, Ch.10). (3) f_init supplied → fix the new page NEW_PAGE, init, set TDE, hand back via page_out or unfix; failure unfixes. (4) no f_init → asserted temporary; return the raw page if page_out requested. (5) exit → sysop aborts on error, else commit-and-undo via log_sysop_end_logical_undo (RVFL_ALLOC, …), then sanity-check and unfix. Structural changes are redo-logged eagerly inside file_perm_alloc; the sysop’s single logical undo (RVFL_ALLOC) is the “deallocate {vfid,vpid}” record — the nested-top-action discipline of Ch.5.

Four phases: ensure free pages, ensure the header section holds a partial sector, flip a head-sector bit, then migrate to the full table if it just filled.

flowchart TD
  A["file_perm_alloc(alloc_type)"] --> B{"n_page_free == 0 ?"}
  B -- yes --> C["file_perm_expand\nreserve more sectors"]
  B -- no --> D
  C --> D{"header partial section empty ?"}
  D -- yes --> E["file_table_move_partial_sectors_to_header"]
  E --> F{"vpid_alloc_out set ?"}
  F -- yes --> Z["goto exit\npage already chosen"]
  F -- no --> G
  D -- no --> G["partsect = head of partial section"]
  G --> H["file_partsect_alloc:\nset first 0-bit, emit vpid"]
  H --> I{"alloc_type ==\nTABLE_PAGE_FULL_SECTOR ?"}
  I -- yes --> J["file_table_append_full_sector_page"]
  I -- no --> K
  J --> K["file_header_alloc:\ncounters + WAL"]
  K --> L{"sector now full ?"}
  L -- no --> Z2["exit OK"]
  L -- yes --> M["remove head from partial table"]
  M --> N["file_table_add_full_sector(vsid)"]
  N --> Z2

Figure 7-2. file_perm_alloc control flow — every branch and goto.

// file_perm_alloc -- src/storage/file_manager.c
if (fhead->n_page_free == 0)
{
  error_code = file_perm_expand (thread_p, page_fhead); /* <- 7.4 */
  if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
}
assert (fhead->n_page_free > 0 && fhead->n_sector_partial > 0);

Invariant — the header holds a Partial entry while n_page_free > 0. Free pages live only inside partial sectors (full sectors have none; empty is a subset of partial), so any free page implies a partial sector — the two asserts confirm it. Phase 2 guarantees one sits in the header section.

Phase 2 — guarantee the head section is non-empty

FILE_HEADER_GET_PART_FTAB (fhead, extdata_part_ftab);
if (file_extdata_is_empty (extdata_part_ftab))
{
  error_code = file_table_move_partial_sectors_to_header (thread_p, page_fhead, alloc_type, vpid_alloc_out); /* <- 7.5 */
  if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
  if (!VPID_ISNULL (vpid_alloc_out))
  {
    goto exit; /* <- a freed overflow page was reused as the allocation; done */
  }
}
assert (!file_extdata_is_empty (extdata_part_ftab));

Either the move repopulated the header (vpid_alloc_out NULL, fall through) or it drained an overflow page and reused that page as the result (vpid_alloc_out set → goto exit, no bitmap touched — the CBRD-21242 path, 7.5).

partsect = (FILE_PARTIAL_SECTOR *) file_extdata_start (extdata_part_ftab); /* <- head item, position 0 */
assert (!file_partsect_is_full (partsect));
was_empty = file_partsect_is_empty (partsect);
if (!file_partsect_alloc (partsect, vpid_alloc_out, &offset_to_alloc_bit)) /* <- 7.6 */
{
assert_release (false); /* head sector must have a free page (invariant) */
error_code = ER_FAILED; goto exit;
}
log_append_undoredo_data2 (thread_p, RVFL_PARTSECT_ALLOC, NULL, page_fhead,
(PGLENGTH) ((char *) partsect - page_fhead), /* <- offset of partsect in page */
..., &offset_to_alloc_bit, &offset_to_alloc_bit); /* <- undo == redo == bit offset */

Invariant — the head sector always has a free page. Allocation always reads position 0 (file_extdata_start); Phases 1–2 guarantee a non-full partial sector there, so the two asserts treat a full head as a logic error. RVFL_PARTSECT_ALLOC logs only the bit offset (undo == redo) at partsect’s byte offset in the header page.
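Logging only the bit offset works because the same payload drives both directions: redo sets the bit, undo clears it. The handlers below are hypothetical illustrations of that symmetry, not the actual recovery functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical redo handler: re-apply the allocation by setting the bit. */
void
sketch_redo_partsect_alloc (uint64_t *page_bitmap, int offset_to_alloc_bit)
{
  *page_bitmap |= (UINT64_C (1) << offset_to_alloc_bit);
}

/* Hypothetical undo handler: revert the allocation by clearing the bit. */
void
sketch_undo_partsect_alloc (uint64_t *page_bitmap, int offset_to_alloc_bit)
{
  *page_bitmap &= ~(UINT64_C (1) << offset_to_alloc_bit);
}
```

Both handlers receive the same integer, so the log record needs no before-image of the bitmap.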

FILE_ALLOC_TABLE_PAGE vs FILE_ALLOC_USER_PAGE

Right after the bit flip, if (alloc_type == FILE_ALLOC_TABLE_PAGE_FULL_SECTOR) calls file_table_append_full_sector_page (...) (7.7). alloc_type says what the page is for; file_header_alloc (7.8) bumps n_page_user or n_page_ftab accordingly. The enum file_alloc_type has three values — FILE_ALLOC_USER_PAGE, FILE_ALLOC_TABLE_PAGE, FILE_ALLOC_TABLE_PAGE_FULL_SECTOR. The last is requested by file_table_add_full_sector when the full table needs a page; that page must link in before the current sector migrates, else migration finds no room and recurses — the reason the third value exists.

is_full = file_partsect_is_full (partsect);
file_header_alloc (fhead, alloc_type, was_empty, is_full); /* <- 7.8: counters + WAL */
file_log_fhead_alloc (thread_p, page_fhead, alloc_type, was_empty, is_full);
if (is_full)
{
  VSID vsid_full = partsect->vsid; /* <- save before removal */
  file_log_extdata_remove (thread_p, extdata_part_ftab, page_fhead, 0, 1);
  file_extdata_remove_at (extdata_part_ftab, 0, 1); /* <- drop head item */
  error_code = file_table_add_full_sector (thread_p, page_fhead, &vsid_full); /* <- 7.7 */
  if (error_code != NO_ERROR) { ASSERT_ERROR (); goto exit; }
}

Counters update first (correct before any nested allocation), then the full head sector is removed from position 0 and added to the full table — restoring the head-of-Partial invariant. Rollback is owned by the enclosing file_alloc sysop.

7.4 file_perm_expand: refill the partial table

Called when n_page_free == 0. Reserves a batch of new sectors, appending them as empty partial entries in the header.

// file_perm_expand -- src/storage/file_manager.c
expand_size_in_sectors = (int) ((float) fhead->n_sector_total * fhead->tablespace.expand_ratio);
expand_size_in_sectors = MAX (expand_size_in_sectors, expand_min_size_in_sectors);
expand_size_in_sectors = MIN (expand_size_in_sectors, expand_max_size_in_sectors); /* <- clamp to header capacity */
// ... condensed: db_private_alloc vsids_reserved buffer ...
log_sysop_start (thread_p); /* <- separate committed sysop: expansion is permanent */
error_code = disk_reserve_sectors (thread_p, DB_PERMANENT_DATA_PURPOSE, fhead->volid_last_expand,
expand_size_in_sectors, vsids_reserved); /* fail -> goto exit, abort */
qsort (vsids_reserved, expand_size_in_sectors, sizeof (VSID), disk_compare_vsids);
partsect.page_bitmap = FILE_EMPTY_PAGE_BITMAP;
for (... each reserved vsid ...)
{ partsect.vsid = *vsid_iter; file_extdata_append (extdata_part_ftab, &partsect); } /* <- empty entries into header */
fhead->n_sector_total += expand_size_in_sectors;
fhead->n_sector_empty = fhead->n_sector_partial = expand_size_in_sectors; /* asserted 0 before */
fhead->n_page_free = expand_size_in_sectors * DISK_SECTOR_NPAGES; /* asserted 0 before */
fhead->n_page_total += fhead->n_page_free;

Branches: (1) size clamped to header file_extdata_remaining_capacity — expansion never needs a new table page. (2) VSID-buffer db_private_alloc fails → ER_OUT_OF_VIRTUAL_MEMORY, return before any sysop. (3) disk_reserve_sectors fails → goto exit, sysop aborted. (4) success sets the counters (each asserted 0 first, confirming expand runs only on full exhaustion). The inner sysop commits on success, aborts on error (its own nested top action, Ch.5); RVFL_EXPAND logs the reserved VSID array as redo with empty undo.
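The sizing rule is a ratio followed by two clamps. In the sketch below the min and max arguments are hypothetical parameters standing in for expand_min_size_in_sectors and expand_max_size_in_sectors; how the real code derives them (tablespace bytes, header capacity) is not reproduced here.

```c
#include <assert.h>

#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Geometric growth, floored at min_sectors and capped at max_sectors. */
int
sketch_expand_size_in_sectors (int n_sector_total, float expand_ratio,
                               int min_sectors, int max_sectors)
{
  int size = (int) ((float) n_sector_total * expand_ratio);

  size = SKETCH_MAX (size, min_sectors);
  size = SKETCH_MIN (size, max_sectors);
  return size;
}
```

The floor matters for small files: with two sectors and a ratio of 0.25 the raw product truncates to zero, and only the minimum keeps the expansion from reserving nothing.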

7.5 file_table_move_partial_sectors_to_header

Header section empty but overflow pages still hold partial sectors: hoist items from the first overflow page up.

// file_table_move_partial_sectors_to_header -- src/storage/file_manager.c
page_part_ftab_first = pgbuf_fix (thread_p, &extdata_part_ftab_head->vpid_next, OLD_PAGE, ...); /* fail -> exit */
n_items_to_move = file_extdata_item_count (extdata_part_ftab_first);
if (n_items_to_move == 0) { assert_release (false); error_code = ER_FAILED; goto exit; }
// ... condensed: re-check header is empty ...
n_items_to_move = MIN (n_items_to_move, file_extdata_remaining_capacity (extdata_part_ftab_head)); /* <- cap to header room */
file_extdata_append_array (extdata_part_ftab_head, file_extdata_start (extdata_part_ftab_first), n_items_to_move);
file_log_extdata_add (thread_p, extdata_part_ftab_head, page_fhead, 0, n_items_to_move, ...);
if (n_items_to_move < file_extdata_item_count (extdata_part_ftab_first))
{ /* partial move: remove copied prefix; first page survives */
file_log_extdata_remove (thread_p, extdata_part_ftab_first, page_part_ftab_first, 0, n_items_to_move);
file_extdata_remove_at (extdata_part_ftab_first, 0, n_items_to_move);
}
else
{ /* whole page drained: unlink and REUSE it (CBRD-21242) */
VPID save_next = extdata_part_ftab_head->vpid_next; /* <- drained page id, saved before relink */
// ... relink: head->vpid_next = first->vpid_next (skip drained page) ...
*vpid_alloc_out = save_next;
pgbuf_dealloc_page (thread_p, page_part_ftab_first);
if (alloc_type == FILE_ALLOC_TABLE_PAGE_FULL_SECTOR) { file_table_append_full_sector_page (...); }
else if (alloc_type == FILE_ALLOC_USER_PAGE) { fhead->n_page_ftab--; fhead->n_page_user++;
log_append_undoredo_data2 (thread_p, RVFL_FHEAD_CONVERT_FTAB_TO_USER, ...); }
}

Error/assert branches before the split: header vpid_next NULL → assert(false), ER_FAILED; pgbuf_fix of the first overflow page fails → goto exit; n_items_to_move == 0 → assert_release(false), ER_FAILED; header not actually empty → silent goto exit. The full-drain path saves vpid_next before relinking, reuses the drained page as the result, and converts a table page to a user page (RVFL_FHEAD_CONVERT_FTAB_TO_USER). This avoids a deallocate-then-reallocate loop, which is why Phase 2 short-circuits on !VPID_ISNULL (vpid_alloc_out).
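The partial-move versus full-drain split comes down to comparing the overflow page's item count with the header's remaining room. A minimal sketch over plain integer arrays (the sketch_table type and its capacities are hypothetical, and logging is omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size table standing in for file_extensible_data. */
typedef struct { int items[8]; int n_items; int capacity; } sketch_table;

/* Hoists as many items as fit from overflow into header.
 * Returns true when the overflow table was drained completely. */
bool
sketch_move_to_header (sketch_table *header, sketch_table *overflow)
{
  int room = header->capacity - header->n_items;
  int n_to_move = overflow->n_items < room ? overflow->n_items : room;

  memcpy (header->items + header->n_items, overflow->items,
          (size_t) n_to_move * sizeof (int));
  header->n_items += n_to_move;

  /* Remove the copied prefix from the overflow table. */
  memmove (overflow->items, overflow->items + n_to_move,
           (size_t) (overflow->n_items - n_to_move) * sizeof (int));
  overflow->n_items -= n_to_move;

  return overflow->n_items == 0;
}
```

A true return corresponds to the branch where the drained page is unlinked and handed back as the allocation result; a false return leaves the overflow page in the chain with its remaining items.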

7.6 file_partsect_alloc and the bit helpers

Allocation is one bit flip in the head sector’s bitmap.

// file_partsect_alloc -- src/storage/file_manager.c
int offset_to_zero = bit64_count_trailing_ones (partsect->page_bitmap); /* <- index of first 0-bit */
if (offset_to_zero >= FILE_ALLOC_BITMAP_NBITS) /* 64: bitmap all ones */
{
  assert (file_partsect_is_full (partsect));
  return false; /* <- caller treats as logic error */
}
file_partsect_set_bit (partsect, offset_to_zero);
if (offset_out)
  *offset_out = offset_to_zero;
if (vpid_out) /* <- reconstruct VPID from vsid + offset */
{
  vpid_out->volid = partsect->vsid.volid;
  vpid_out->pageid = SECTOR_FIRST_PAGEID (partsect->vsid.sectid) + offset_to_zero;
}
return true;

bit64_count_trailing_ones finds the lowest unset bit (pages go out densely from the sector bottom). file_partsect_set_bit asserts the bit is clear and ORs it via bit64_set. The inverse file_partsect_pageid_to_offset subtracts SECTOR_FIRST_PAGEID (sectid) — used by deallocation (Ch.9). The bitmap is the page list.
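A self-contained version of the same bit flip can be exercised outside the server. The sketch assumes 64 pages per sector, so the first page id of a sector is sectid * 64, and it uses a portable loop in place of bit64_count_trailing_ones; all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_NPAGES 64
#define SKETCH_BITMAP_NBITS  64

typedef struct { int32_t sectid; int16_t volid; } sketch_vsid;
typedef struct { int32_t pageid; int16_t volid; } sketch_vpid;
typedef struct { sketch_vsid vsid; uint64_t page_bitmap; } sketch_partial_sector;

/* Number of consecutive 1-bits starting at bit 0 (64 if all bits are set). */
static int
sketch_count_trailing_ones (uint64_t bits)
{
  int count = 0;

  while (count < SKETCH_BITMAP_NBITS && ((bits >> count) & 1u))
    count++;
  return count;
}

/* Allocates the lowest free page of the sector; false if the sector is full. */
bool
sketch_partsect_alloc (sketch_partial_sector *partsect, sketch_vpid *vpid_out,
                       int *offset_out)
{
  int offset_to_zero = sketch_count_trailing_ones (partsect->page_bitmap);

  if (offset_to_zero >= SKETCH_BITMAP_NBITS)
    return false;

  partsect->page_bitmap |= UINT64_C (1) << offset_to_zero;
  if (offset_out)
    *offset_out = offset_to_zero;
  if (vpid_out)
    {
      vpid_out->volid = partsect->vsid.volid;
      vpid_out->pageid = partsect->vsid.sectid * SKETCH_SECTOR_NPAGES + offset_to_zero;
    }
  return true;
}
```

With pages 0 to 2 already taken in sector 2 of volume 1, the call hands out offset 3, which maps to page id 131 under the assumed 64-page sector size.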

7.7 Adding a full sector: file_table_add_full_sector and file_table_append_full_sector_page

When the head sector fills, its VSID migrates to the full table.

// file_table_add_full_sector -- src/storage/file_manager.c
FILE_HEADER_GET_FULL_FTAB (fhead, extdata_full_ftab);
error_code = file_extdata_find_not_full (thread_p, &extdata_full_ftab, &page_ftab, &found);
if (!found)
{ /* full table is full: allocate a NEW table page for it */
error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_TABLE_PAGE_FULL_SECTOR, &vpid_ftab_new); /* <- recursion */
page_ftab = pgbuf_fix (thread_p, &vpid_ftab_new, OLD_PAGE, ...); /* already initialized */
extdata_full_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
}
page_extdata = page_ftab != NULL ? page_ftab : page_fhead; /* <- which page the add is logged against */
file_extdata_find_ordered (extdata_full_ftab, vsid, disk_compare_vsids, &found, &pos);
if (found) { assert_release (false); error_code = ER_FAILED; goto exit; } /* duplicate VSID */
file_extdata_insert_at (extdata_full_ftab, pos, 1, vsid); /* + file_log_extdata_add(..., page_extdata, ...) */

Branches: (1) free space in an existing component → insert ordered. (2) no space → recurse into file_perm_alloc with FILE_ALLOC_TABLE_PAGE_FULL_SECTOR; bounded because that type appends the new page to the full table before further migration. (3) duplicate VSID → ER_FAILED. Entries stay sorted by disk_compare_vsids for binary search.
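
The find-then-insert discipline behind branches (1) and (3) can be illustrated on a plain array. This is a hedged sketch under simplifying assumptions: a single fixed-capacity array replaces the chained extensible data, and the `sketch_` names are hypothetical, not the real file_extdata_* API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { int16_t volid; int32_t sectid; } sketch_vsid;

/* Order by volume first, then by sector, mirroring a (volid, sectid) sort key. */
static int
sketch_compare_vsids (const sketch_vsid *a, const sketch_vsid *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->sectid != b->sectid)
    return a->sectid < b->sectid ? -1 : 1;
  return 0;
}

/* Binary search: returns the index of key, or the index where it belongs. */
static int
sketch_find_ordered (const sketch_vsid *items, int n, const sketch_vsid *key, bool *found)
{
  int lo = 0, hi = n;
  *found = false;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      int c = sketch_compare_vsids (key, &items[mid]);
      if (c == 0)
	{
	  *found = true;
	  return mid;
	}
      if (c < 0)
	hi = mid;
      else
	lo = mid + 1;
    }
  return lo;
}

/* Insert key at its sorted position; refuses duplicates and a full array. */
static bool
sketch_insert_ordered (sketch_vsid *items, int *n, int capacity, const sketch_vsid *key)
{
  bool found;
  int pos;
  if (*n >= capacity)
    return false;		/* the real code allocates a new table page here */
  pos = sketch_find_ordered (items, *n, key, &found);
  if (found)
    return false;		/* duplicate VSID: treated as a logic error */
  memmove (&items[pos + 1], &items[pos], (size_t) (*n - pos) * sizeof (sketch_vsid));
  items[pos] = *key;
  (*n)++;
  return true;
}
```

Keeping the array sorted on every insert is what lets the later lookup stay a binary search instead of a linear scan.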

file_table_append_full_sector_page initializes the new page and links it at the head of the chain:

// file_table_append_full_sector_page -- src/storage/file_manager.c
page_ftab = pgbuf_fix (thread_p, vpid_new, NEW_PAGE, ...); /* fail -> ASSERT_ERROR_AND_SET, return */
pgbuf_set_page_ptype (thread_p, page_ftab, PAGE_FTAB);
file_extdata_init (sizeof (VSID), DB_PAGESIZE, extdata_new_ftab); /* <- full entries are bare VSIDs */
VPID_COPY (&extdata_new_ftab->vpid_next, &extdata_full_ftab->vpid_next); /* new page points at old head */
pgbuf_log_new_page (thread_p, page_ftab, file_extdata_size (extdata_new_ftab), PAGE_FTAB);
pgbuf_unfix_and_init (thread_p, page_ftab); /* <- new page no longer fixed */
file_log_extdata_set_next (thread_p, extdata_full_ftab, page_fhead, vpid_new); /* old head -> new page */
VPID_COPY (&extdata_full_ftab->vpid_next, vpid_new);

file_extdata_init uses sizeof (VSID), not sizeof (FILE_PARTIAL_SECTOR) — the 7.1 prefix compatibility in action.

file_header_alloc is the single place maintaining the eight header counters (n_page_total/user/ftab/free, n_sector_total/partial/full/empty).

// file_header_alloc -- src/storage/file_manager.c
fhead->n_page_free--;
if (alloc_type == FILE_ALLOC_USER_PAGE) fhead->n_page_user++;
else fhead->n_page_ftab++; /* table page of either flavor */
if (was_empty) fhead->n_sector_empty--; /* sector now holds a page: no longer empty */
if (is_full) { fhead->n_sector_partial--; fhead->n_sector_full++; } /* migrated to full */

The leading assert (!was_empty || !is_full) enforces that one allocation cannot take a sector empty→full (only empty→partial or partial→full). file_log_fhead_alloc writes a 3-bool redo {is_ftab_page, was_empty, is_full} replayed by file_rv_fhead_alloc. n_page_total/n_sector_total change only on expansion (7.4).

  1. file_alloc dispatches on FILE_IS_TEMPORARY: temporary → file_temp_alloc (Ch.8), no sysop; permanent → an atomic nested-top-action sysop closed by log_sysop_end_logical_undo (RVFL_ALLOC, {vfid,vpid}).
  2. file_partial_sector is {vsid, page_bitmap}, vsid MUST be first — full-table code reinterprets the pointer as a bare VSID; the 64-bit bitmap is one bit per page of a 64-page sector.
  3. Phases 1–2 (expand, then move-to-header) restore the two bold invariants before the bit flip in file_partsect_alloc, which uses bit64_count_trailing_ones and reconstructs the VPID from SECTOR_FIRST_PAGEID + offset.
  4. A filled sector migrates to the full table: file_header_alloc counters update first, then the head item moves to the sorted full table, which grows via bounded recursion using FILE_ALLOC_TABLE_PAGE_FULL_SECTOR. FILE_ALLOC_USER_PAGE vs _TABLE_PAGE[_FULL_SECTOR] decides n_page_user vs n_page_ftab; only file_perm_expand grows n_*_total (RVFL_EXPAND, its own committed sysop).
  5. Numerable registration is a tail call (file_numerable_add_page, Ch.10); the full-drain branch reuses the emptied overflow page (CBRD-21242), logging the table-to-user conversion via RVFL_FHEAD_CONVERT_FTAB_TO_USER.

Chapter 8: Temporary File Page Allocation

Temporary files back sorts, hash joins, and query-result materialization. They live and die inside a single transaction (or get parked in the tempcache for reuse — Ch.11), so the disk manager throws away most of the machinery permanent files depend on. This chapter answers: why do temporary files skip the Partial-to-Full migration, and how does a single header cursor make allocation O(1) with no logging? The high-level rationale lives in the companion cubrid-disk-manager.md; this chapter traces the code, contrasting file_perm_alloc (Ch.7) rather than re-deriving it.

Every page allocation enters through file_alloc. The header is fixed and sanity-checked, then a single predicate splits the world:

// file_alloc -- src/storage/file_manager.c
if (FILE_IS_TEMPORARY (fhead))
error_code = file_temp_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out); /* <- no sysop, no undo */
else
{
log_sysop_start_atomic (thread_p); /* <- permanent path opens a nested top action (Ch.5) */
is_sysop_started = true;
error_code = file_perm_alloc (thread_p, page_fhead, FILE_ALLOC_USER_PAGE, vpid_out);
VFID_COPY ((VFID *) undo_log_data, vfid); /* <- pack (VFID,VPID) logical-undo payload */
VPID_COPY ((VPID *) (undo_log_data + sizeof (VFID)), vpid_out);
}

Three asymmetries propagate everywhere: the temp branch starts no system operation (is_sysop_started stays false), builds no undo data, and calls file_temp_alloc. The exit-label sysop epilogue is guarded by if (is_sysop_started), so the temporary path skips the whole block (log_sysop_abort on error, else log_sysop_end_logical_undo (thread_p, RVFL_ALLOC, ...)). A temporary allocation thus produces nothing for recovery to replay; if the transaction dies mid-flight the file is simply discarded — nothing was logged, so nothing to roll back.

The f_init handling also diverges: a temporary file’s f_init may be NULL (sort buffers init their own pages), and the else branch asserts FILE_IS_TEMPORARY (fhead) before fixing the page NEW_PAGE. Numerable temp files still call file_numerable_add_page (Ch.10) — temporary does not exempt a file from the user page table. The fork is the top of Figure 8-2.

8.2 The header cursor: the entire bookkeeping state

A permanent file tracks two extensible tables (Partial and Full) and migrates sectors between them. A temporary file keeps only the Partial table plus a two-field cursor — its entire allocation state:

| Field | Role | Why it exists |
| --- | --- | --- |
| vpid_last_temp_alloc | VPID of the Partial-table page holding the sector being filled | Lets allocation jump straight to the live table page; equals the header VPID for the in-header copy, else an overflow PAGE_FTAB page |
| offset_to_last_temp_alloc | Index, in that page’s extensible data, of the FILE_PARTIAL_SECTOR being filled | Names the exact sector; advances only when the sector fills, so it also counts fully-consumed sectors in the page |

The struct comment states the design contract directly — “Temporary file pages are never deallocated … keep a cursor: when the sector becomes full it is incremented; when all page becomes full it moves to next page”:

// FILE_HEADER -- src/storage/file_manager.c
VPID vpid_last_temp_alloc; /* VPID of partial table page last used to allocate a page. */
int offset_to_last_temp_alloc; /* Sector offset in partial table last used to allocate a page. */

The cursor is seeded at creation: file_create’s temp branch sets vpid_last_temp_alloc = vpid_fhead (the header’s own Partial table) and offset_to_last_temp_alloc = fhead->n_sector_full, skipping sectors already full at creation.

Invariant (cursor consistency). offset_to_last_temp_alloc is always a valid index into the extensible data at vpid_last_temp_alloc, or exactly its item count (“advance to next page next call”); file_temp_alloc asserts both halves before dereferencing. If violated, file_extdata_at indexes past the array and corrupts an adjacent sector descriptor or reads garbage as a VSID.

graph LR
  H["FILE_HEADER"] -->|vpid_last_temp_alloc| P0["Partial table page\nin-header or PAGE_FTAB"]
  H -->|offset_to_last_temp_alloc| PS["FILE_PARTIAL_SECTOR + page_bitmap"]
  P0 -->|vpid_next| P1["next Partial table page ..."]

Figure 8-1. Cursor-to-table relationship. No Full table — full sectors stay in place ahead of the cursor.

8.3 Walking file_temp_alloc branch by branch

The function first disables interrupt checking (logtb_set_check_interrupt (thread_p, false), saved into save_check_interrupt) — there is no rollback, so a half-finished temp allocation must not be torn down — then asserts FILE_IS_TEMPORARY (fhead).

Step 1 — locate the live Partial-table page. If the cursor points at the header the in-header table is used directly; otherwise the overflow page is fixed with a write latch, only ER_INTERRUPTED tolerated on failure:

// file_temp_alloc -- src/storage/file_manager.c
if (VPID_EQ (&vpid_fhead, &fhead->vpid_last_temp_alloc))
FILE_HEADER_GET_PART_FTAB (fhead, extdata_part_ftab); /* <- table lives in header page */
else
{
page_ftab = pgbuf_fix (thread_p, &fhead->vpid_last_temp_alloc, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
if (page_ftab == NULL)
{ error_code = er_errid (); if (error_code != ER_INTERRUPTED) assert_release (false); goto exit; }
extdata_part_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
}

Step 2 — expand if out of free pages. The inline equivalent of file_temp_expand: when n_page_free == 0 it reserves one new sector via the disk manager (Ch.4) with DB_TEMPORARY_DATA_PURPOSE, so it lands in a temp volume:

// file_temp_alloc -- src/storage/file_manager.c
if (fhead->n_page_free == 0)
{
FILE_PARTIAL_SECTOR partsect_new = FILE_PARTIAL_SECTOR_INITIALIZER;
error_code = disk_reserve_sectors (thread_p, DB_TEMPORARY_DATA_PURPOSE, fhead->volid_last_expand, 1, &partsect_new.vsid);
if (error_code != NO_ERROR) { /* same ER_INTERRUPTED-tolerated handling as Step 1 */ goto exit; }

Two sub-branches follow, on whether the current page has room for one more FILE_PARTIAL_SECTOR. Sub-branch 2a — table page is full: the new sector cannot be recorded here, so its first page is stolen to host a fresh Partial-table page (bit 0 set, type PAGE_FTAB, previous vpid_next linked forward, cursor wrapped to offset 0):

// file_temp_alloc -- src/storage/file_manager.c
if (file_extdata_is_full (extdata_part_ftab))
{
vpid_ftab_new.volid = partsect_new.vsid.volid;
vpid_ftab_new.pageid = SECTOR_FIRST_PAGEID (partsect_new.vsid.sectid);
file_partsect_set_bit (&partsect_new, 0); /* <- page 0 becomes the table page */
page_ftab_new = pgbuf_fix (thread_p, &vpid_ftab_new, NEW_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
if (page_ftab_new == NULL) { error_code = ER_FAILED; goto exit; }
pgbuf_set_page_ptype (thread_p, page_ftab_new, PAGE_FTAB);
VPID_COPY (&extdata_part_ftab->vpid_next, &vpid_ftab_new); /* <- link old table -> new table */
if (page_ftab != NULL) pgbuf_set_dirty_and_free (thread_p, page_ftab);
VPID_COPY (&fhead->vpid_last_temp_alloc, &vpid_ftab_new); /* <- cursor wraps to fresh table page */
fhead->offset_to_last_temp_alloc = 0;
page_ftab = page_ftab_new; extdata_part_ftab = (FILE_EXTENSIBLE_DATA *) page_ftab;
file_extdata_init (sizeof (FILE_PARTIAL_SECTOR), DB_PAGESIZE, extdata_part_ftab);
ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_reserved, DISK_SECTOR_NPAGES - 1);
ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_ftab, 1);
}
else
ATOMIC_INC_32 (&file_Tempcache.spacedb_temp.npage_reserved, DISK_SECTOR_NPAGES); /* all pages reservable */

This is the only place the cursor wraps to a fresh table page during expansion; when a table page is carved out, one page counts as npage_ftab and only DISK_SECTOR_NPAGES - 1 are reservable. After either sub-branch the sector is appended and counters bumped — empty vs. table-hosting is encoded in partsect_new.page_bitmap (non-empty only in 2a):

// file_temp_alloc -- src/storage/file_manager.c
file_extdata_append (extdata_part_ftab, &partsect_new);
fhead->n_sector_partial++; fhead->n_sector_total++; // n_page_free/n_page_total += DISK_SECTOR_NPAGES
if (partsect_new.page_bitmap == FILE_EMPTY_PAGE_BITMAP) fhead->n_sector_empty++;
else { fhead->n_page_free--; fhead->n_page_ftab++; } /* <- table page already consumed */

Invariant (sectors never leave Partial). A filled sector keeps its all-ones bitmap in place; nothing migrates it to a Full table. If violated, the cursor offset (which counts consumed sectors in the page) would no longer match the extensible-data layout and the Step-3 page-hop would skip live sectors.

Step 3 — advance to the next page if the cursor sits at the item count. A previous call may have left offset_to_last_temp_alloc one past the last sector of a now-full page. The guard if (fhead->offset_to_last_temp_alloc == file_extdata_item_count (extdata_part_ftab)) then fires: it asserts file_extdata_is_full (...) && !VPID_ISNULL (&extdata_part_ftab->vpid_next), unfixes the old page_ftab, fixes vpid_next (write latch, only ER_INTERRUPTED tolerated), and sets vpid_last_temp_alloc = vpid_next; offset_to_last_temp_alloc = 0.

Step 4 — allocate from the sector under the cursor. file_partsect_alloc sets the first zero bit. Its false return is impossible here (the cursor never points at a full sector) and is treated as a logic error:

// file_temp_alloc -- src/storage/file_manager.c
partsect = (FILE_PARTIAL_SECTOR *) file_extdata_at (extdata_part_ftab, fhead->offset_to_last_temp_alloc);
was_empty = file_partsect_is_empty (partsect);
if (!file_partsect_alloc (partsect, vpid_alloc_out, NULL))
{ assert_release (false); error_code = ER_FAILED; goto exit; } /* <- full sector under cursor == bug */
if (file_partsect_is_full (partsect))
{ is_full = true; fhead->offset_to_last_temp_alloc++; } /* <- advance cursor; page hop deferred to next call */
file_header_alloc (fhead, alloc_type, was_empty, is_full); /* <- shared with perm path: pure counter math */
pgbuf_set_dirty (thread_p, page_fhead, DONT_FREE);

The cursor advances on fullness, not on every allocation: while a sector has free bits it stays put, so the common case touches only the header and one table page. file_header_alloc is the permanent path’s helper (Ch.7); its is_full shuffle still updates n_sector_full/n_sector_partial here, but no table migration accompanies it — those counters are advisory statistics, not table membership.

Step 5 — unconditional cleanup. The exit label runs on every path: file_header_sanity_check, unfix page_ftab if held, restore the saved interrupt flag via logtb_set_check_interrupt, return error_code. No pgbuf_set_dirty is ever paired with a log append — the only durability action is marking pages dirty for the non-WAL-ordered flush of temp data.

flowchart TD
  A["file_temp_alloc\ndisable interrupt check"] --> C{"cursor == header VPID?"}
  C -->|yes| F{"n_page_free == 0?"}
  C -->|no| E["fix cursor's table page"] --> F
  F -->|yes| G["disk_reserve_sectors 1 sector"] --> H{"table page full?"}
  H -->|yes| I["carve page0 as PAGE_FTAB\nlink vpid_next, wrap cursor"] --> L["append sector, bump counters"]
  H -->|no| J["reserve DISK_SECTOR_NPAGES"] --> L
  F -->|no| K{"offset == item_count?"}
  L --> K
  K -->|yes| N["fix vpid_next, cursor offset 0"] --> O["file_extdata_at + file_partsect_alloc"]
  K -->|no| O
  O --> Q{"sector now full?"}
  Q -->|yes| R["offset_to_last_temp_alloc++"] --> T["file_header_alloc, set_dirty"]
  Q -->|no| T
  T --> U["exit: unfix, restore interrupt"]

Figure 8-2. file_temp_alloc complete branch map, including both expansion sub-branches and the deferred page-hop.
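
The branch map can be exercised with a toy in-memory model. The sketch below is an illustration under loud assumptions, not the real function: table pages are array slots instead of latched buffer pages, a table holds only two sectors so the table-full sub-branch is easy to reach, a counter stands in for the disk manager, and every `sk_` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_SECTOR_NPAGES 64
#define SK_ITEMS_PER_TABLE 2	/* tiny on purpose: real table pages hold far more */
#define SK_MAX_TABLES 8

typedef struct { int sectid; uint64_t bitmap; } sk_partsect;
typedef struct { sk_partsect items[SK_ITEMS_PER_TABLE]; int n_items; int next; } sk_table;
typedef struct
{
  sk_table tables[SK_MAX_TABLES];
  int n_tables;
  int cur_table;		/* stands in for vpid_last_temp_alloc */
  int cur_offset;		/* stands in for offset_to_last_temp_alloc */
  int n_page_free;
  int next_sectid;		/* stands in for the disk manager's reservation */
} sk_tempfile;

static void
sk_init (sk_tempfile *f)
{
  f->n_tables = 1;
  f->tables[0].n_items = 0;
  f->tables[0].next = -1;
  f->cur_table = f->cur_offset = 0;
  f->n_page_free = 0;
  f->next_sectid = 0;
}

/* Returns the allocated page id, or -1 when the toy model runs out of tables. */
static int
sk_temp_alloc (sk_tempfile *f)
{
  sk_table *t = &f->tables[f->cur_table];
  sk_partsect *ps;
  int bit = 0;

  if (f->n_page_free == 0)	/* Step 2: expand by one sector */
    {
      sk_partsect fresh;
      int table_full = (t->n_items == SK_ITEMS_PER_TABLE);
      if (table_full && f->n_tables == SK_MAX_TABLES)
	return -1;
      fresh.sectid = f->next_sectid++;
      fresh.bitmap = 0;
      f->n_page_free += SK_SECTOR_NPAGES;
      if (table_full)		/* 2a: page 0 of the new sector hosts a new table */
	{
	  fresh.bitmap |= 1;
	  f->n_page_free--;
	  t->next = f->n_tables;
	  f->cur_table = f->n_tables++;
	  f->cur_offset = 0;
	  t = &f->tables[f->cur_table];
	  t->n_items = 0;
	  t->next = -1;
	}
      t->items[t->n_items++] = fresh;
    }
  if (f->cur_offset == t->n_items)	/* Step 3: deferred hop to the next table */
    {
      assert (t->n_items == SK_ITEMS_PER_TABLE && t->next >= 0);
      f->cur_table = t->next;
      f->cur_offset = 0;
      t = &f->tables[f->cur_table];
    }
  ps = &t->items[f->cur_offset];	/* Step 4: first zero bit under the cursor */
  assert (ps->bitmap != UINT64_MAX);
  while ((ps->bitmap >> bit) & 1u)
    bit++;
  ps->bitmap |= (uint64_t) 1 << bit;
  f->n_page_free--;
  if (ps->bitmap == UINT64_MAX)
    f->cur_offset++;		/* advance only when the sector fills */
  return ps->sectid * SK_SECTOR_NPAGES + bit;
}
```

In this model the first 128 allocations return page ids 0 through 127 from two sectors recorded in the first table. The next call finds the table full, spends page 128 on a new table, and returns page 129 with the cursor at offset 0 of that new table.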

8.4 Why no Full table, no postpone, no WAL

The Full table exists in permanent files only so the allocation scan can skip sectors with no free pages (Ch.7). A temporary file never scans — it allocates from the single cursor sector and advances linearly — so a filled sector is never revisited and a second table buys nothing while costing the logging the design avoids. The companion cubrid-disk-manager.md enumerates the savings; the code above is the mechanism.

Invariant (monotone, bookkeeping-free allocation). The cursor only advances (offset++ on sector-full, page-hop on item-count), never backward. file_dealloc never clears an allocation bit for a temporary file — it takes the empty else branch (no postpone, no bitmap change) and skips the deallocation entirely (Ch.9) — so no sector regains a free bit behind the cursor and the monotone property holds without reconciliation. If violated (a freed bit behind the cursor), that page is silently leaked and n_page_free drifts from reality.

This minimalism lets the tempcache (Ch.11) recycle a file by reset rather than rebuild. file_temp_reset_user_pages re-collects the partial-table bitmaps, rebuilds the n_sector_*/n_page_* counters, zeroes the user count, and rewinds the cursor to the header VPID, offset 0:

// file_temp_reset_user_pages -- src/storage/file_manager.c
fhead->n_page_user = 0; // ... n_sector_*/n_page_* rebuilt from re-collected bitmaps ...
fhead->vpid_last_temp_alloc = vpid_fhead; /* <- cursor rewinds to header VPID, offset 0 */
fhead->offset_to_last_temp_alloc = 0;

This seed differs from file_create’s, which sets offset_to_last_temp_alloc = fhead->n_sector_full; reset always rewinds to offset 0. A reset file keeps its reserved sectors (no disk round-trip) and hands pages out from the front again — the payoff of skipping the Partial-to-Full machinery: allocation state collapses to two integers that cost nothing to reset.

  1. file_alloc forks on FILE_IS_TEMPORARY: the temp lane calls file_temp_alloc with no sysop, no undo data, no log records; the permanent lane wraps file_perm_alloc in a nested top action with RVFL_ALLOC logical undo.
  2. Temporary files keep only a Partial sectors table; a filled sector stays in place. The complete allocation state is vpid_last_temp_alloc/offset_to_last_temp_alloc.
  3. The cursor makes allocation O(1): Step 4 allocates directly from the cursor sector via file_partsect_alloc, advancing the offset only on sector-full and deferring the page-hop to the next call (Step 3).
  4. Expansion is inline (n_page_free == 0): it reserves one sector with DB_TEMPORARY_DATA_PURPOSE; the table-full sub-branch carves the sector’s first page into a fresh PAGE_FTAB, links vpid_next, and wraps the cursor to offset 0.
  5. There is no Full-table migration, no postpone, zero WAL — temp pages are never individually deallocated (file_dealloc takes the empty else branch), so the cursor is provably monotone and needs no reconciliation.
  6. file_header_alloc is shared with the permanent path, but for temp files its is_full shuffle is advisory statistics only — no table movement.
  7. The bookkeeping-free design lets the tempcache recycle a file by resetting the cursor (vpid_last_temp_alloc = header VPID, offset_to_last_temp_alloc = 0, n_page_user = 0) and rebuilding counters from the bitmaps, not rebuilding tables (Ch.11). Reset rewinds to offset 0, unlike file_create’s n_sector_full seed.

Chapter 9: Page Deallocation and File Destruction

This chapter traces the inverse of permanent allocation (Chapter 7): how is a page — and an entire file — given back, and why is the actual bit-flip postponed to commit time? It assumes Chapter 4 (the two-step reservation protocol, the bitmap-then-cache release-order invariant) and Chapter 7 (file_perm_alloc, the Partial/Full tables); the companion cubrid-disk-manager.md covers the sector-bitmap and disk/file split. The central fact: a freed permanent page or sector is not cleared synchronously — the releaser stages a postpone log record and the clear runs at do-postpone.

9.1 Why postpone — the committed-releaser hazard

If the bit cleared immediately when transaction T1 freed page P, a second transaction could reserve that sector, allocate P, and commit its data; should T1 then abort, undo would restore P’s old contents and clobber the second’s committed work. CUBRID defers the clear to do-postpone, which runs only after commit is logically certain — until then the bit stays set, so no allocator hands the page out. Same reasoning as Chapter 4’s release-order invariant.

INVARIANT (deferred-free): A permanent page/sector freed by an active transaction keeps its bit set until do-postpone, enforced by routing all permanent frees through log_append_postpone (RVFL_DEALLOC) / (RVDK_UNRESERVE_SECTORS) instead of mutating the bitmap inline. If violated, a concurrent allocator re-hands-out the page and a later abort corrupts the new owner’s data.

stateDiagram-v2
  [*] --> Allocated
  Allocated --> PostponeStaged : file_dealloc \n RVFL_DEALLOC appended, bit still set
  PostponeStaged --> Allocated : transaction abort \n postpone discarded, page stays allocated
  PostponeStaged --> Freed : do-postpone \n file_perm_dealloc clears bit
  Freed --> [*]

Figure 9-1 — Lifecycle of a permanent page bit. The abort edge is the point: until do-postpone, nothing changed on disk.
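
The lifecycle in Figure 9-1 reduces to a three-state machine. The following is a minimal sketch with hypothetical names (`sk_page_state`, `sk_event`); it models only the transitions drawn in the figure, and treats any other event as a no-op, which is an assumption of the sketch rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_ALLOCATED, SK_POSTPONE_STAGED, SK_FREED } sk_page_state;
typedef enum { SK_EV_DEALLOC, SK_EV_ABORT, SK_EV_DO_POSTPONE } sk_event;

/* Transition function for one page; events not drawn in the figure leave the state unchanged. */
static sk_page_state
sk_next_state (sk_page_state s, sk_event e)
{
  switch (s)
    {
    case SK_ALLOCATED:
      return e == SK_EV_DEALLOC ? SK_POSTPONE_STAGED : s;
    case SK_POSTPONE_STAGED:
      if (e == SK_EV_ABORT)
	return SK_ALLOCATED;	/* postpone discarded, page stays allocated */
      if (e == SK_EV_DO_POSTPONE)
	return SK_FREED;	/* only now is the bit cleared */
      return s;
    default:
      return s;
    }
}

/* The allocation bit is visible as "set" in every state except SK_FREED. */
static bool
sk_bit_is_set (sk_page_state s)
{
  return s != SK_FREED;
}
```

The property worth checking is that the bit stays set in the staged state, so an allocator consulting the bitmap cannot hand the page out before the releasing transaction is certain to commit.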

file_dealloc is the public entry for giving back one page; despite its name it usually stages rather than frees. The header fix is conditional: a release build with a trustworthy concrete file_type_hint skips it to save an I/O, while a debug build always fixes (the #if defined (NDEBUG) guard) to assert the hint matches fhead->type and that vpid is not the sticky first page. The postpone decision is conservative under uncertainty — it postpones unless it can prove the file temporary:

// file_dealloc -- src/storage/file_manager.c
if ((fhead != NULL && !FILE_IS_TEMPORARY (fhead)) || file_type_hint != FILE_TEMP)
{
VFID_COPY ((VFID *) log_data, vfid);
VPID_COPY ((VPID *) (log_data + sizeof (VFID)), vpid);
log_append_postpone (thread_p, RVFL_DEALLOC, &log_addr, LOG_DATA_SIZE, log_data); /* <- stage only */
}
/* else: we do not deallocate pages from temporary files */

The RVFL_DEALLOC record carries only (VFID, VPID) — no bitmap state — because the real work is recomputed at do-postpone. Temporary files take the else (reclaimed wholesale at destroy / tempcache reset, Chapter 11). Two early exits then key on numerability: goto exit if !FILE_TYPE_CAN_BE_NUMERABLE (file_type_hint) (not numerable by type) and again if !FILE_IS_NUMERABLE (fhead) (type allows it but this file is not). Only a genuinely numerable file acts now — it searches the user page table and sets FILE_USER_PAGE_MARK_DELETED, logging RVFL_USER_PAGE_MARK_DELETE for non-temporary files (mechanics deferred to Chapter 10).
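
The postpone condition quoted above is easier to audit as a truth table. The helper below is a hypothetical restatement of that one predicate with booleans in place of the header pointer and type hint; it is a reading aid, not code from file_manager.c.

```c
#include <assert.h>
#include <stdbool.h>

/* header_fixed:   the header page was fixed (fhead != NULL)
 * header_is_temp: FILE_IS_TEMPORARY (fhead), meaningful only when header_fixed
 * hint_is_temp:   file_type_hint == FILE_TEMP */
static bool
sk_must_postpone (bool header_fixed, bool header_is_temp, bool hint_is_temp)
{
  return (header_fixed && !header_is_temp) || !hint_is_temp;
}
```

The only combinations that skip the postpone are those where the hint says temporary and no fixed header contradicts it, which matches the "postpone unless provably temporary" reading.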

INVARIANT (numerable consistency): In a numerable file the page must exist in the user page table and not already be marked deleted (enforced by assert_release (false) on !found and on FILE_USER_PAGE_IS_MARKED_DELETED). If violated, the user page table and the allocation tables have diverged — a hard bug.

The exit: label unfixes page_fhead and page_ftab if held.

9.3 file_perm_dealloc — the actual bit-flip at do-postpone

At commit, do-postpone replays each RVFL_DEALLOC through file_rv_dealloc_on_postpone → file_rv_dealloc_internal, which fixes the header, starts a system operation, and calls file_perm_dealloc, where the bit is finally cleared. Entry asserts the contract: log_check_system_op_is_started (must be inside a sysop) and !FILE_IS_TEMPORARY (fhead) (permanent only); it then computes vsid_dealloc from vpid_dealloc (SECTOR_FROM_PAGEID).

INVARIANT (sysop-wrapped table change): All file-table mutations in file_perm_dealloc must commit as a nested system operation before the header page is unfixed. If violated, a crash mid-update leaves the Partial/Full tables and header counters inconsistent with no atomic recovery boundary.

flowchart TB
  START["file_perm_dealloc(vpid)"] --> SEARCH["search Partial table"]
  SEARCH --> FOUND{found in Partial?}
  FOUND -- yes --> CLEAR["clear bit in partsect<br/>log RVFL_PARTSECT_DEALLOC<br/>is_empty?"]
  FOUND -- no --> REMOVE["remove vsid from Full table<br/>was_full = true"]
  REMOVE --> MERGED{ftab page merged away?}
  MERGED -- "same sector" --> SAMESEC["clear merged page's bit too<br/>simulate ftab dealloc"]
  MERGED -- "other sector" --> RECURSE["file_perm_dealloc(merged) recursive"]
  MERGED -- none --> BUILD["build partsect_new = FULL minus bit"]
  SAMESEC --> BUILD
  RECURSE --> BUILD
  BUILD --> SPACE{free slot in Partial?}
  SPACE -- yes --> INSERT["file_extdata_insert_at ordered"]
  SPACE -- no --> NEWPG["file_perm_alloc new ftab page"]
  CLEAR --> HDR["file_header_dealloc<br/>update counters"]
  INSERT --> HDR
  NEWPG --> HDR
  HDR --> DEALLOC["pgbuf_dealloc_page(vpid)"]
  DEALLOC --> EXIT["exit: unfix page_ftab"]

Figure 9-2 — Branch map of file_perm_dealloc. Left: sector already Partial (common case). Right: sector was Full, where the Full-to-Partial migration happens and may recurse.

Left branch: already Partial. The sector has a free page so it is already in the Partial table: clear the bit, recompute is_empty, and log it with RVFL_PARTSECT_DEALLOC via log_append_undoredo_data. The record is undoredo, not postpone, because by do-postpone time we are executing the free, so the table edit is a normal recoverable change.

Right branch — sector was Full. Every reserved sector is in exactly one table (Chapter 6), so if not Partial it is Full. The function sets was_full = true and calls file_extdata_find_and_remove_item on the Full table; this may empty the last Full-table component, returning a vpid_merged — a now-orphaned table page that must itself be freed. The guard, written as the two merged cases in Figure 9-2, hinges on VSID_IS_SECTOR_OF_VPID (&vsid_dealloc, &vpid_merged):

  • different sector → file_perm_dealloc (..., &vpid_merged, FILE_ALLOC_TABLE_PAGE) recurses to free it normally;
  • same sector (the one being moved to Partial) → do not recurse; set is_merged_page_from_sector, clear that page’s bit too in the new descriptor, simulate accounting via file_header_dealloc (..., FILE_ALLOC_TABLE_PAGE, ...) then pgbuf_dealloc_page (vpid_merged).

The new descriptor starts from partsect_new.page_bitmap = FILE_FULL_PAGE_BITMAP with the freed bit(s) file_partsect_clear_bit’d, then is inserted at the ordered position; if Partial has no free slot a new table page comes from file_perm_alloc (FILE_ALLOC_TABLE_PAGE). Guards: file_extdata_find_ordered must report the VSID not present (assert_release (false) on duplicate), and assert (page_ftab == NULL) confirms all transient table pages were unfixed.
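
How the new Partial descriptor's bitmap is derived, including the same-sector versus other-sector decision for a merged table page, can be sketched as follows. All `sk_` names are hypothetical, the types are simplified, and the sector of a page is assumed to be `pageid / 64`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_SECTOR_NPAGES 64

typedef struct { int16_t volid; int32_t sectid; } sk_vsid;
typedef struct { int16_t volid; int32_t pageid; } sk_vpid;

/* True when the page lives inside the given sector (assumed: sector = pageid / 64). */
static bool
sk_vsid_is_sector_of_vpid (const sk_vsid *vsid, const sk_vpid *vpid)
{
  return vsid->volid == vpid->volid && vsid->sectid == vpid->pageid / SK_SECTOR_NPAGES;
}

/* Clear the bit that corresponds to the page's offset inside its sector. */
static uint64_t
sk_clear_bit (uint64_t bitmap, const sk_vpid *vpid)
{
  int offset = vpid->pageid % SK_SECTOR_NPAGES;
  return bitmap & ~((uint64_t) 1 << offset);
}

/* Start from an all-ones bitmap and remove the freed page. A merged table
 * page (may be NULL) is removed too when it sits in the same sector;
 * otherwise the caller is told to free it through a separate, recursive call. */
static uint64_t
sk_full_to_partial (const sk_vsid *vsid, const sk_vpid *freed, const sk_vpid *merged, bool *must_recurse)
{
  uint64_t bitmap = UINT64_MAX;
  assert (sk_vsid_is_sector_of_vpid (vsid, freed));
  bitmap = sk_clear_bit (bitmap, freed);
  *must_recurse = false;
  if (merged != NULL)
    {
      if (sk_vsid_is_sector_of_vpid (vsid, merged))
	bitmap = sk_clear_bit (bitmap, merged);
      else
	*must_recurse = true;
    }
  return bitmap;
}
```

Handling the same-sector case inline avoids recursing into a sector that is, at that moment, in neither table.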

Tail — both branches. file_header_dealloc (fhead, alloc_type, is_empty, was_full) adjusts n_page_free / sector counters (file_log_fhead_dealloc logs it); the page is then fixed and handed to pgbuf_dealloc_page (§9.6), and PSTAT_FILE_NUM_PAGE_DEALLOCS bumps. is_empty/was_full drive the math: a was_full sector now contributes free pages, an is_empty sector becomes fully free. Most error paths unfix any held page_ftab via ASSERT_ERROR (); goto exit; the two Full-branch sub-paths — the recursive file_perm_dealloc of an other-sector orphan and the same-sector merged-page pgbuf_fix failure — instead return error_code directly, which is safe because page_ftab is still NULL at those points. The hard-fail-during-recovery guard lives one level up in file_rv_dealloc_internal (§9.8), not here.

9.4 file_destroy — giving back the whole file

Destroying a file returns every sector it reserved; is_temp forks the entire function. The prologue: a permanent file calls file_tracker_unregister (catalog-visible, dropped first); a temporary file calls logtb_set_check_interrupt (thread_p, false) so destroy cannot abort halfway and leak pages. The header is fixed, file_table_collect_all_vsids gathers every sector, then the forks diverge on eviction and re-converge on one disk_unreserve_ordered_sectors call.

flowchart TB
  P["file_destroy(vfid, is_temp)"] --> FORK{is_temp?}
  FORK -- no --> UNREG["file_tracker_unregister"]
  FORK -- yes --> NOINT["disable interrupt check"]
  UNREG --> FIX["fix header page"]
  NOINT --> FIX
  FIX --> COLLECT["file_table_collect_all_vsids<br/>-> vsid_collector"]
  COLLECT --> FORK2{permanent or temporary?}
  FORK2 -- permanent --> PDEAL["file_sector_map_dealloc over Partial+Full<br/>pgbuf_dealloc_page each user+ftab page<br/>pgbuf_dealloc_page(header)"]
  FORK2 -- temporary --> TDEAL["file_sector_map_dealloc_temp over Partial<br/>pgbuf_dealloc_temp_page each<br/>decrement Tempcache counters"]
  PDEAL --> UNRES["disk_unreserve_ordered_sectors"]
  TDEAL --> UNRES
  UNRES --> EXIT["exit: free collectors, unfix header,<br/>restore interrupt check"]

Figure 9-3 — file_destroy two forks.

Permanent fork. file_extdata_apply_funcs over Partial then Full passes file_extdata_collect_ftab_pages (gather file-table-page sectors into a FILE_FTAB_COLLECTOR) and file_sector_map_dealloc (fix each user page, pgbuf_dealloc_page); it then evicts each collected table-page sector and finally the header. Every owned page becomes a PAGE_UNKNOWN eviction candidate before sectors are unreserved.

Temporary fork. No Full table, so only Partial is walked via file_sector_map_dealloc_temp / pgbuf_dealloc_temp_page. It logs nothing and tolerates a missing page (pgbuf_simple_fix NULL → continue) since temporary pages need not be on disk, then decrements the global tempcache spacedb_temp counters (Chapter 11) and frees the header.

INVARIANT (evict-before-unreserve): Every buffer-pool page of a file must become an eviction candidate (pgbuf_dealloc_page / pgbuf_dealloc_temp_page) before its sectors are unreserved. If violated, a stale dirty BCB could be flushed to a sector already unreserved and re-reserved by another file, writing one file’s bytes into another.

The exit: label is universal cleanup: unfix the header, db_private_free both collector arrays, restore the interrupt-check flag for the temporary case.

9.5 file_vsid_collector and file_table_collect_all_vsids

The collector is a fixed-size array plus count:

// struct file_vsid_collector -- src/storage/file_manager.c
struct file_vsid_collector { VSID *vsids; int n_vsids; };

| Field | Role | Why it exists |
| --- | --- | --- |
| vsids | Pointer to a db_private_alloc’d array of fhead->n_sector_total VSIDs | Output buffer, sized exactly to the sector count so no realloc is ever needed. |
| n_vsids | Running count of sectors appended | Both the array cursor during collection and the element count handed to disk_unreserve_ordered_sectors. After collection it must equal n_sector_total. |

file_table_collect_all_vsids allocates the array, then applies file_table_collect_vsid (collector->vsids[collector->n_vsids++] = *vsid) across Partial and — for permanent files only — Full:

// file_table_collect_all_vsids -- src/storage/file_manager.c
collector_out->vsids = (VSID *) db_private_alloc (thread_p, fhead->n_sector_total * sizeof (VSID));
FILE_HEADER_GET_PART_FTAB (fhead, extdata_ftab);
error_code = file_extdata_apply_funcs (thread_p, extdata_ftab, NULL, NULL, file_table_collect_vsid, collector_out, ...);
if (!FILE_IS_TEMPORARY (fhead))
{
FILE_HEADER_GET_FULL_FTAB (fhead, extdata_ftab); /* <- temporary files have no full table */
error_code = file_extdata_apply_funcs (thread_p, extdata_ftab, NULL, NULL, file_table_collect_vsid, collector_out, ...);
}
if (collector_out->n_vsids != fhead->n_sector_total)
assert_release (false); /* <- the count invariant, checked */
qsort (collector_out->vsids, fhead->n_sector_total, sizeof (VSID), disk_compare_vsids); /* <- ordered output */

INVARIANT (complete collection): The collected VSID count must equal fhead->n_sector_total. If violated, the file’s bookkeeping is corrupt and destroy fails with assert_release (false).

The final qsort establishes the next function’s precondition — the VSID list ordered by (volid, sectid) so disk_unreserve_ordered_sectors can batch per volume in one pass.
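
Why the sort matters can be seen in a small sketch that sorts a VSID list and then splits it into per-volume runs in a single pass. The comparator and the grouping helper are hypothetical illustrations with `sk_` names; they assume the (volid, sectid) ordering described above and do not reproduce disk_compare_vsids or the real reserve context.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int16_t volid; int32_t sectid; } sk_vsid;

/* qsort comparator: volume first, then sector. */
static int
sk_compare_vsids (const void *a, const void *b)
{
  const sk_vsid *x = (const sk_vsid *) a;
  const sk_vsid *y = (const sk_vsid *) b;
  if (x->volid != y->volid)
    return x->volid < y->volid ? -1 : 1;
  if (x->sectid != y->sectid)
    return x->sectid < y->sectid ? -1 : 1;
  return 0;
}

/* Sort in place, then record the length of each run of equal volid.
 * Returns the number of runs, or -1 if run_lengths is too small. */
static int
sk_group_by_volume (sk_vsid *vsids, int n, int *run_lengths, int max_runs)
{
  int n_runs = 0, i;
  qsort (vsids, (size_t) n, sizeof (sk_vsid), sk_compare_vsids);
  for (i = 0; i < n; i++)
    {
      if (i == 0 || vsids[i].volid != vsids[i - 1].volid)
	{
	  if (n_runs == max_runs)
	    return -1;
	  run_lengths[n_runs++] = 0;	/* a new volume starts a new run */
	}
      run_lengths[n_runs - 1]++;
    }
  return n_runs;
}
```

Because equal volumes are adjacent after sorting, each volume is visited exactly once, so one unreserve call per run suffices.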

9.6 pgbuf_dealloc_page — the eviction hint

Both file_perm_dealloc and the permanent file_destroy fork hand each freed page to pgbuf_dealloc_page, which does no flush or write I/O — it resets the page type to PAGE_UNKNOWN and steers the BCB toward victimization:

// pgbuf_dealloc_page -- src/storage/page_buffer.c
/* how it works: page is "deallocated" by resetting its type to PAGE_UNKNOWN. also prepare bcb for victimization.
* note: the bcb used to be invalidated. but that means flushing page to disk and waiting for IO write. that may be
* too slow. if we add the bcb to the bottom of a lru list, it will be eventually flushed by flush thread and
* victimized. */
CAST_PGPTR_TO_BFPTR (bcb, page_dealloc);
assert (get_fcnt (&bcb->atomic_latch) == 1); /* <- caller must hold the only latch */

Deallocation is a hint, not a synchronous discard — the page may still be flushed later by the flush thread, which is exactly why the evict-before-unreserve invariant (§9.4) requires it be issued before the sector becomes reusable.

9.7 disk_unreserve_ordered_sectors — returning sectors

The disk-manager counterpart of Chapter 4’s reservation: a thin wrapper that takes CSECT_DISK_CHECK as a reader and delegates to disk_unreserve_ordered_sectors_without_csect. The worker exploits the §9.5 sort — it groups consecutive vsids sharing a volid into per-volume runs in a DISK_RESERVE_CONTEXT (asserting volid strictly increasing across runs, sectid within one) and issues one disk_unreserve_sectors_from_volume per volume, which iterates sector-table units calling disk_stab_unit_unreserve — the leaf where the permanent-vs-temporary postpone split lands, mirroring §9.2 at the sector level:

// disk_stab_unit_unreserve -- src/storage/disk_manager.c
assert ((unreserve_bits & (*cursor->unit)) == unreserve_bits); /* <- all target bits were actually set */
if (unreserve_bits != 0)
{
if (context->purpose == DB_PERMANENT_DATA_PURPOSE)
log_append_postpone (thread_p, RVDK_UNRESERVE_SECTORS, &addr, sizeof (unreserve_bits), &unreserve_bits); /* <- deferred */
else
{
(*cursor->unit) &= ~unreserve_bits; /* <- bitmap cleared NOW */
/* ... pgbuf_set_dirty + lock_reserve_for_purpose condensed ... */
disk_cache_update_vol_free (cursor->volheader->volid, nsect); /* <- then cache, Ch.4 order */
}
}

Permanent purpose stages the clear via log_append_postpone (RVDK_UNRESERVE_SECTORS), upholding the deferred-free invariant (§9.1) at sector granularity; temporary purpose clears the bits immediately, in the bitmap-then-cache order Chapter 4’s release-order invariant mandates (bit cleared, then disk_cache_update_vol_free). The entry assert guards that every freed sector was genuinely reserved.

Aborting a permanent deallocation is free: its staged postpone records are discarded, never run, so the page stays allocated (Figure 9-1’s back edge). Real undo happens only when a page allocation is rolled back. Both do-postpone and undo route through file_rv_dealloc_internal, which fixes the header, opens the sysop, and calls file_perm_dealloc. Because a recovery replay must not fail silently, it hard-fails via if (error_code != NO_ERROR) { assert_release (false); } on any non-NO_ERROR return. It then closes the sysop according to one parameter: log_sysop_abort on error, log_sysop_end_logical_compensate for FILE_RV_DEALLOC_COMPENSATE (undo of an alloc), otherwise log_sysop_end_logical_run_postpone (do-postpone of a dealloc). In every case the sysop is closed before the header is unfixed (§9.3).

  1. file_dealloc stages, it does not free — a non-temporary file appends an RVFL_DEALLOC postpone record carrying (VFID, VPID); temporary files deallocate nothing; numerable files also mark-delete the user-page-table entry now (Chapter 10).
  2. Postpone closes the committed-releaser window — no transaction can grab a freed page before the releaser’s commit is irreversible, the sector-level analogue of Chapter 4’s release ordering.
  3. file_perm_dealloc is the real free, branch-rich. Partial: clear a bit. Full: migrate to Partial, recursing to free an orphaned table page except a same-sector orphan (inlined). Must run inside a system operation.
  4. file_destroy forks on is_temp end to end. Permanent: unregister, evict via pgbuf_dealloc_page, unreserve postponed. Temporary: disable interrupts, evict via pgbuf_dealloc_temp_page, adjust tempcache counters, unreserve immediately.
  5. Collection precedes destruction, sorted. file_table_collect_all_vsids gathers exactly n_sector_total VSIDs (asserting the count) and qsorts them so unreserve batches per volume.
  6. pgbuf_dealloc_page is an eviction hint, not a flush — it queues the PAGE_UNKNOWN BCB for victimization, so pages must be evicted before their sectors are unreserved.
  7. The postpone split bottoms out in disk_stab_unit_unreserve — permanent stages RVDK_UNRESERVE_SECTORS; temporary clears bitmap then cache inline; abort of a permanent dealloc is free.

Chapter 10: Numerable Files and the User Page Table

A numerable file adds one promise: ask for “the n-th page I allocated”, in allocation order, in amortized O(1). The sector allocation machinery of Ch 3-Ch 7 cannot answer this — it stores ownership, not order — so the numerable layer keeps a second, separately-externalized index over the same VPIDs: the User Page Table. See cubrid-disk-manager.md (“Numerable files”) for the high-level contract; here we trace every branch.

10.1 Why the sector table cannot recover allocation order

The Partial and Full sector tables (Ch 3) are kept VSID-sorted via disk_compare_vsids so reservation and lookup are binary searches. That sort destroys history in two ways: (1) promotion erases batch identity, because a filled partial sector migrates Partial -> Full and is re-sorted by VSID, losing which batch reserved it; (2) cross-expansion reorders, because a batch spanning a fresh reservation gives new sectors a VSID order unrelated to the order in which pages were handed out, so a sorted bitmap scan yields pages in a different sequence than the user received them.

The sector table answers membership but not order. The User Page Table re-externalizes that lost order as an append-only list of VPIDs — one entry per user page, in allocation order — so find_nth(n) is a positional index into it.

The table is a chain of FILE_EXTENSIBLE_DATA components (the extdata primitive used throughout file_manager.c) whose items are bare VPIDs. The header caches the last component in vpid_last_user_page_ftab for O(1) appends; FILE_HEADER_GET_USER_PAGE_FTAB locates the first component in the header page.

10.2 The find-nth context and the header’s order-keeping fields

file_find_nth_context (struct { VPID *vpid_nth; int nth; int first_index; }) is the accumulator threaded through the scan callbacks:

| Field | Role | Why it exists |
| --- | --- | --- |
| vpid_nth | Out-param pointer for the found VPID | Scan writes through it so the caller’s slot is filled in place |
| nth | Remaining index, decremented as components/items are skipped | Countdown; the scan stops when it reaches the target item |
| first_index | Absolute item index of the current component’s entry 0 | Feeds the cache: where in the global sequence the landing component begins |

Five FILE_HEADER fields carry the order machinery (struct covered in Ch 1):

| Field | Role | Why it exists |
| --- | --- | --- |
| vpid_last_user_page_ftab | Hint to the last UPT component page | O(1) append target; equals the header VPID while the table lives in-header |
| vpid_find_nth_last | Cached page of the last find_nth landing | Lets sequential find_nth(n), find_nth(n+1)... resume mid-table |
| first_index_find_nth_last | Global index of entry 0 on vpid_find_nth_last | Turns the cached page into an absolute offset for the next search |
| n_page_user | Total user pages (incl. mark-deleted) | Minuend of the live-page count |
| n_page_mark_delete | Count of mark-delete-bit entries | Correction term: live pages = n_page_user - n_page_mark_delete |

Invariant (live-count correction). Findable pages = n_page_user - n_page_mark_delete, never n_page_user. file_numerable_find_nth enforces this at the auto-alloc test and when skipping marked entries; drift would make find_nth return a deleted page or allocate at the wrong index. Kept exact by file_header_update_mark_deleted logging +1/-1 on every set/clear.

Invariant (cache validity). FILE_CACHE_LAST_FIND_NTH is true only for FILE_TEMP numerable files on a non-parallel thread, so the cache may be read/written without a write latch or dirty flag. Any deallocation resets it (VPID_SET_NULL (&fhead->vpid_find_nth_last)); appends leave it valid because they only extend the tail.

10.3 file_numerable_add_page — appending on every allocation

file_alloc calls file_numerable_add_page right after a page’s bit is set, whenever FILE_IS_NUMERABLE (fhead), so the UPT grows in lock-step. It resolves the tail from vpid_last_user_page_ftab (in-header if equal to the header VPID, else pgbuf_fix WRITE), chains a component if full, then appends:

// file_numerable_add_page -- src/storage/file_manager.c
if (VPID_EQ (&fhead->vpid_last_user_page_ftab, &vpid_fhead))
FILE_HEADER_GET_USER_PAGE_FTAB (fhead, extdata_user_page_ftab); /* tail in header */
else page_ftab = pgbuf_fix (..., OLD_PAGE, PGBUF_LATCH_WRITE, ...); /* else fix tail page */
// ... condensed: if (file_extdata_is_full) chain via file_temp_alloc/file_perm_alloc ...
file_extdata_append (extdata_user_page_ftab, vpid); /* <- the append */
flowchart TD
  A["hint = vpid_last_user_page_ftab"] --> B{"hint == header VPID?"}
  B -->|yes| C["extdata = in-header UPT"]
  B -->|no| D{"pgbuf_fix WRITE ok?"}
  D -->|no| Z["ASSERT_ERROR_AND_SET, goto exit"]
  D -->|yes| F["extdata = that ftab page"]
  C --> G{"file_extdata_is_full?"}
  F --> G
  G -->|no| M["file_extdata_append vpid"]
  G -->|yes| H{"FILE_IS_TEMPORARY?"}
  H -->|yes| I["file_temp_alloc TABLE_PAGE"]
  H -->|no| J["file_perm_alloc TABLE_PAGE"]
  I --> K["fix NEW_PAGE, link prev->next, init extdata,\n advance last_user_page_ftab"]
  J --> K
  K --> M
  M --> N{"temporary?"}
  N -->|no| O["file_log_extdata_add WAL"]
  N -->|yes| P["pgbuf_set_dirty only"]
  O --> Q["exit: unfix page_ftab if held"]
  P --> Q
  Z --> Q

Figure 10-1. file_numerable_add_page, all branches.

The branch worth restating is the temp-vs-permanent logging asymmetry (Figure 10-1 node N): a permanent append emits file_log_extdata_add WAL (plus RVFL_FHEAD_SET_LAST_USER_PAGE_FTAB undoredo when a component is chained), a temporary append only marks pages dirty. A closing assert (!file_extdata_is_full (...)) rules out overflow.
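The tail-hint append and the chain-on-full step can be sketched with an in-memory list. This is a minimal sketch under stated assumptions: `sketch_component`, `sketch_upt` and the helper functions are hypothetical, heap nodes stand in for table pages, a pointer stands in for vpid_last_user_page_ftab, and logging and page fixing are omitted entirely.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_COMPONENT_CAPACITY 4     /* tiny on purpose, for illustration */

typedef struct sketch_component
{
  int n_items;
  long items[SKETCH_COMPONENT_CAPACITY];
  struct sketch_component *next;
} sketch_component;

typedef struct
{
  sketch_component head;        /* first component, embedded like the in-header table */
  sketch_component *tail;       /* plays the role of vpid_last_user_page_ftab */
  int n_page_user;
} sketch_upt;

static void
sketch_upt_init (sketch_upt *t)
{
  t->head.n_items = 0;
  t->head.next = NULL;
  t->tail = &t->head;           /* the hint starts out pointing at the header */
  t->n_page_user = 0;
}

/* Append one page in allocation order. Returns 0, or -1 if chaining fails. */
static int
sketch_upt_append (sketch_upt *t, long page)
{
  if (t->tail->n_items == SKETCH_COMPONENT_CAPACITY)
    {
      /* Tail is full: chain a new component and advance the hint. */
      sketch_component *c = calloc (1, sizeof (*c));
      if (c == NULL)
        return -1;
      t->tail->next = c;
      t->tail = c;
    }
  assert (t->tail->n_items < SKETCH_COMPONENT_CAPACITY);
  t->tail->items[t->tail->n_items++] = page;
  t->n_page_user++;
  return 0;
}

/* Release every chained component and reset to the empty state. */
static void
sketch_upt_free (sketch_upt *t)
{
  sketch_component *c = t->head.next;
  while (c != NULL)
    {
      sketch_component *next = c->next;
      free (c);
      c = next;
    }
  sketch_upt_init (t);
}
```

Because the hint always names the last component, an append never walks the chain; the only extra work is the occasional allocation when the tail fills.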

10.4 file_numerable_find_nth — the indexed lookup

The function fixes the header READ, asserts numerable, then branches three ways. Auto-alloc-at-end (auto_alloc && nth == fhead->n_page_user - fhead->n_page_mark_delete) promotes the latch and calls file_alloc to grow the file, re-fixing WRITE and re-checking on ER_PAGE_LATCH_PROMOTE_FAIL. Otherwise the search splits on n_page_mark_delete: with holes it visits every item (file_extdata_find_nth_vpid_and_skip_marked); with no holes it strides components and may resume from the cache, whose predicate is load-bearing:

// file_numerable_find_nth (no-holes branch) -- src/storage/file_manager.c
if (FILE_CACHE_LAST_FIND_NTH (fhead, thread_p) && !VPID_ISNULL (&fhead->vpid_find_nth_last)
&& !VPID_EQ (&vpid_fhead, &fhead->vpid_find_nth_last) && nth >= fhead->first_index_find_nth_last)
{ find_nth_context.first_index = fhead->first_index_find_nth_last; /* resume from cache */
find_nth_context.nth -= fhead->first_index_find_nth_last; } /* <- rebase the countdown */
flowchart TD
  A["fix header READ, assert numerable"] --> B{"auto_alloc and nth == live count?"}
  B -->|yes| C{"promote latch ok?"}
  C -->|FAIL| E["re-fix WRITE, re-check, file_alloc, exit"]
  C -->|ok| F["file_alloc, exit"]
  B -->|no| G{"n_page_mark_delete > 0?"}
  G -->|yes| H["skip-marked over EVERY item"]
  G -->|no| I{"cache usable?"}
  I -->|yes| J["fix cached page, rebase nth"]
  I -->|no| K["first_index = 0, from head"]
  J --> L["find_nth_vpid: stride components"]
  K --> L
  L --> M{"cache eligible?"}
  M -->|yes| N["store landing page + first_index"]
  M -->|no| O["skip cache update"]
  H --> P{"vpid_nth still NULL?"}
  N --> P
  O --> P
  P -->|yes| Q["assert_release false, ER_FAILED"]
  P -->|no| R["exit: unfix pages"]

Figure 10-2. file_numerable_find_nth, all branches.

The four predicate conjuncts above (notably nth >= first_index_find_nth_last, which forbids a backward resume) let the search start mid-table and walk only the landing component, which is what makes the run-merge pattern find_nth(0), find_nth(1), ... amortized O(1). Exit cleanup avoids double-unfixing aliased page pointers.
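The effect of the resume cache on a sequential scan can be illustrated with a small in-memory model. This is a sketch, not the real function: `sketch_comp`, `sketch_file` and `sketch_find_nth` are hypothetical, pointers stand in for VPIDs, and the `visited` counter exists only to make the saved work observable.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_comp
{
  int n_items;
  const int *items;
  struct sketch_comp *next;
} sketch_comp;

typedef struct
{
  sketch_comp *head;            /* first component (the in-header table) */
  sketch_comp *cached;          /* stands in for vpid_find_nth_last */
  int cached_first_index;       /* stands in for first_index_find_nth_last */
  int visited;                  /* instrumentation: components touched so far */
} sketch_file;

/* Find the nth item, resuming from the cached component when allowed.
 * Returns 0 and fills *out, or -1 if nth is out of range. */
static int
sketch_find_nth (sketch_file *f, int nth, int *out)
{
  sketch_comp *c = f->head;
  int first_index = 0;

  /* Resume only from a non-header cached component, and never backwards. */
  if (f->cached != NULL && f->cached != f->head && nth >= f->cached_first_index)
    {
      c = f->cached;
      first_index = f->cached_first_index;
      nth -= first_index;       /* rebase the countdown */
    }
  for (; c != NULL; c = c->next)
    {
      f->visited++;
      if (c->n_items <= nth)
        {
          nth -= c->n_items;    /* skip the whole component */
          first_index += c->n_items;
          continue;
        }
      *out = c->items[nth];
      f->cached = c;            /* remember the landing component */
      f->cached_first_index = first_index;
      return 0;
    }
  return -1;
}
```

With three components of two items each, looking up index 4 touches three components, while the following lookup of index 5 touches only the cached one.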

file_extdata_apply_funcs invokes a per-component and/or per-item function. file_extdata_find_nth_vpid is the per-component (no-holes) callback — a whole component is one O(1) stride:

// file_extdata_find_nth_vpid -- src/storage/file_manager.c
int count_vpid = file_extdata_item_count (extdata);
if (count_vpid <= find_nth_context->nth)
{ find_nth_context->nth -= count_vpid; /* <- skip whole component */
find_nth_context->first_index += count_vpid; } /* <- keep global index accurate */
else
{ VPID_COPY (find_nth_context->vpid_nth, (VPID *) file_extdata_at (extdata, find_nth_context->nth));
assert (!FILE_USER_PAGE_IS_MARKED_DELETED (find_nth_context->vpid_nth)); /* <- no holes */
*stop = true; }

file_extdata_find_nth_vpid_and_skip_marked is the per-item (holes) callback; it inspects every VPID because a deleted entry consumes a slot but not an index:

// file_extdata_find_nth_vpid_and_skip_marked -- src/storage/file_manager.c
if (FILE_USER_PAGE_IS_MARKED_DELETED (vpidp)) return NO_ERROR; /* <- skip, do not advance nth */
if (find_nth_context->nth == 0) { *find_nth_context->vpid_nth = *vpidp; *stop = true; }
else find_nth_context->nth--;

The asymmetry is the point: no holes lets you stride components and keep first_index to prime the cache; holes do not, since a component’s live-entry count is not its item count.
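The two scan strategies can be put side by side in a self-contained sketch. The names `sketch_comp`, `sketch_find_nth_no_holes` and `sketch_find_nth_skip_marked` are hypothetical, a bare 32-bit page id stands in for a VPID, and the flag value is the one quoted for FILE_USER_PAGE_MARK_DELETE_FLAG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MARK_DELETE_FLAG 0x80000000u

typedef struct sketch_comp
{
  int n_items;
  const uint32_t *items;
  const struct sketch_comp *next;
} sketch_comp;

/* No holes: a whole component is skipped with one subtraction.
 * *first_index ends as the global index of the landing component's entry 0. */
static int
sketch_find_nth_no_holes (const sketch_comp *c, int nth, uint32_t *out, int *first_index)
{
  *first_index = 0;
  for (; c != NULL; c = c->next)
    {
      if (c->n_items <= nth)
        {
          nth -= c->n_items;
          *first_index += c->n_items;
          continue;
        }
      *out = c->items[nth];
      return 0;
    }
  return -1;
}

/* Holes: every item is inspected, because a marked entry occupies a slot
 * without consuming an index. */
static int
sketch_find_nth_skip_marked (const sketch_comp *c, int nth, uint32_t *out)
{
  for (; c != NULL; c = c->next)
    for (int i = 0; i < c->n_items; i++)
      {
        if (c->items[i] & SKETCH_MARK_DELETE_FLAG)
          continue;             /* skip, do not advance nth */
        if (nth == 0)
          {
            *out = c->items[i];
            return 0;
          }
        nth--;
      }
  return -1;
}
```

The per-item variant has no `first_index` to report: once entries can be skipped, a component's item count no longer says how many indices it covers.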

10.6 The mark-delete machinery (permanent numerable)

A numerable page cannot vanish from the middle of the UPT mid-transaction — that would renumber later pages and corrupt concurrent find_nth — so file_dealloc removes in two phases. Phase 1, in-transaction, only sets the top bit of the pageid (FILE_USER_PAGE_MARK_DELETE_FLAG == 0x80000000) via FILE_USER_PAGE_MARK_DELETED (vpid_found), logs RVFL_USER_PAGE_MARK_DELETE undoredo (permanent only), bumps the counter via file_header_update_mark_deleted (..., 1), and resets the cache if FILE_CACHE_LAST_FIND_NTH. The entry keeps its slot, later indices are undisturbed, and find_nth skips it via the per-item callback.

Phase 2, at commit run-postpone, physically removes the entry via file_extdata_find_and_remove_item: it walks the chain (linear, ordered=false, since the UPT is append-ordered not VSID-ordered), removes the item with file_extdata_remove_at (logged via file_log_extdata_remove), pops the VPID into an out-param, and merges an emptied component with its predecessor, reporting the freed table page through vpid_merged; it asserts a system op is active and assert_release(false)s on a missing item. A marked pop decrements the counter:

// file_dealloc run-postpone body -- src/storage/file_manager.c
file_extdata_find_and_remove_item (..., vpid_dealloc, file_compare_vpids, false,
&vpid_removed, &vpid_merged);
if (!VPID_ISNULL (&vpid_merged)) /* table page emptied -> free it */
file_perm_dealloc (thread_p, page_fhead, &vpid_merged, FILE_ALLOC_TABLE_PAGE);
if (FILE_USER_PAGE_IS_MARKED_DELETED (&vpid_removed))
file_header_update_mark_deleted (thread_p, page_fhead, -1); /* <- counter back down */

On abort, file_rv_user_page_unmark_delete_logical undoes phase 1. Because concurrent transactions may have shifted the table, it cannot trust the original position — it re-searches by VPID (file_extdata_search_item), asserts the bit is set, clears it with FILE_USER_PAGE_CLEAR_MARK_DELETED, and logs a RVFL_USER_PAGE_MARK_DELETE_COMPENSATE record via log_append_compensate.

Invariant (slot stability under deletion). A marked-deleted entry never moves or re-indexes until commit, so a concurrent reader’s cached vpid_find_nth_last stays structurally valid through a mark (deallocation resets only the cache, not the slots). Compacting on mark would renumber pages mid-transaction.
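The mark, unmark and remove steps can be sketched over a flat array. This is an illustrative model only: `sketch_slots` and its helpers are hypothetical, an unsigned 32-bit id stands in for the pageid (an assumption made to keep the bit operations well defined), and logging and component merging are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MARK_DELETE_FLAG 0x80000000u     /* value quoted in the text */

typedef struct
{
  uint32_t slots[8];
  int n_page_user;              /* all entries, marked or not */
  int n_page_mark_delete;       /* entries with the flag set */
} sketch_slots;

/* Search by page id, ignoring the flag; returns the slot index or -1. */
static int
sketch_find_slot (const sketch_slots *t, uint32_t pageid)
{
  for (int i = 0; i < t->n_page_user; i++)
    if ((t->slots[i] & ~SKETCH_MARK_DELETE_FLAG) == pageid)
      return i;
  return -1;
}

/* Phase 1: set the flag in place; no entry moves. */
static int
sketch_mark_deleted (sketch_slots *t, uint32_t pageid)
{
  int i = sketch_find_slot (t, pageid);
  if (i < 0 || (t->slots[i] & SKETCH_MARK_DELETE_FLAG) != 0)
    return -1;
  t->slots[i] |= SKETCH_MARK_DELETE_FLAG;
  t->n_page_mark_delete++;
  return i;
}

/* Abort: re-search by page id, then clear the flag. */
static int
sketch_unmark_deleted (sketch_slots *t, uint32_t pageid)
{
  int i = sketch_find_slot (t, pageid);
  if (i < 0 || (t->slots[i] & SKETCH_MARK_DELETE_FLAG) == 0)
    return -1;
  t->slots[i] &= ~SKETCH_MARK_DELETE_FLAG;
  t->n_page_mark_delete--;
  return i;
}

/* Phase 2: physical removal; later entries shift down by one. */
static int
sketch_remove (sketch_slots *t, uint32_t pageid)
{
  int i = sketch_find_slot (t, pageid);
  if (i < 0)
    return -1;
  bool was_marked = (t->slots[i] & SKETCH_MARK_DELETE_FLAG) != 0;
  memmove (&t->slots[i], &t->slots[i + 1],
           (size_t) (t->n_page_user - i - 1) * sizeof (t->slots[0]));
  t->n_page_user--;
  if (was_marked)
    t->n_page_mark_delete--;    /* counter back down */
  return i;
}

static int
sketch_live_count (const sketch_slots *t)
{
  return t->n_page_user - t->n_page_mark_delete;
}
```

Only `sketch_remove` changes the position of later entries, which is why it corresponds to the commit-time phase in this model.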

10.7 file_numerable_truncate — dealloc-driven shrink

Truncation is the only public shrink path, leaning on find_nth + file_dealloc:

// file_numerable_truncate -- src/storage/file_manager.c
if (!FILE_IS_NUMERABLE (fhead)) { assert_release (false); error_code = ER_FAILED; goto exit; }
if (fhead->n_page_mark_delete != 0) { assert (false); return NO_ERROR; } /* <- refuse mid-dealloc */
while (fhead->n_page_user > npages) { /* repeatedly drop index npages */
file_numerable_find_nth (thread_p, vfid, npages, false, NULL, NULL, &vpid); /* auto-alloc off */
file_dealloc (thread_p, vfid, &vpid, fhead->type); }

Each iteration deallocates the page now at index npages; as n_page_user drops the loop ends exactly at npages. It bails on n_page_mark_delete != 0, since a half-finished dealloc makes the index meaningless.

10.8 Real callers and the dead-code finding

file_numerable_find_nth has three callers across two file-type families; mark-delete is exercised only by the permanent family. The extendible-hash family is consumed by both src/storage/extendible_hash.c and the file-hash-scan code in src/query/query_hash_scan.c: fhs_fix_nth_page calls file_numerable_find_nth, and its files are created via file_create_ehash / file_create_ehash_dir, so they are FILE_EXTENDIBLE_HASH(_DIRECTORY), the same family as the storage row.

| Caller | File type | Deallocates? | Mark-delete used? |
| --- | --- | --- | --- |
| External sort run files (external_sort.c, file_create_temp_numerable) | FILE_TEMP | never | no (dead) |
| Extendible hash bucket/directory (extendible_hash.c find_nth, truncate) | FILE_EXTENDIBLE_HASH(_DIRECTORY) | yes | yes |
| File-hash-scan FHS (query_hash_scan.c, fhs_fix_nth_page) | FILE_EXTENDIBLE_HASH(_DIRECTORY) | via truncate path | yes |

Non-numerable temp consumers — list_file query intermediates and the query result cache (FILE_QUERY_AREA) — never touch this layer; they chain pages via QFILE_PAGE_HEADER.next_vpid, with no find_nth contract.

Critical finding. For FILE_TEMP numerable files (external sort), file_temp_alloc never deallocates, so FILE_USER_PAGE_MARK_DELETED, n_page_mark_delete, and the whole RVFL_USER_PAGE_MARK_DELETE* chain are effectively dead code there: n_page_mark_delete stays 0 and find_nth always takes the no-holes/cache branch. The table data structure is still mandatory — it supplies the order contract the sort merge depends on. The dead part is the deletion sub-apparatus, not the table.

  1. VSID-sorted sector tables store which pages a file owns but discard order; the User Page Table re-externalizes order as an append-only VPID list, so find_nth(n) is a positional index.
  2. file_numerable_add_page appends one VPID per allocation inside file_alloc, using vpid_last_user_page_ftab as an O(1) tail hint and chaining a component (logged for permanent, dirty-only for temporary) when the tail fills.
  3. file_numerable_find_nth is O(1)-amortized only in the no-holes branch (file_extdata_find_nth_vpid strides components, the cache resumes mid-table); the holes branch falls back to a per-item skip scan.
  4. Live page count is n_page_user - n_page_mark_delete, governing auto-alloc-at-end and deleted-entry skipping, kept exact by logged deltas.
  5. Permanent deletion is two-phase: phase 1 sets FILE_USER_PAGE_MARK_DELETE_FLAG and bumps the counter (slot kept); phase 2 at run-postpone removes via file_extdata_find_and_remove_item; abort re-searches by VPID and clears the bit.
  6. file_numerable_truncate is a thin find_nth(npages) + file_dealloc loop, refusing to run while n_page_mark_delete != 0.
  7. Mark-delete is dead code for FILE_TEMP numerable (external sort never deallocates), yet the table itself stays necessary — the dead part is the deletion sub-apparatus, not the structure.

Chapter 11: Special Paths Tempcache Tracker Sticky Page TDE and Recovery

Five machines sit beside the single-page lifecycle of Ch 6-10: the temp-file cache, the File Tracker, the sticky-first-page escape hatch, the TDE flags, and the recovery handlers. This chapter dissects only the code; for the why, see the companion’s “Temporary file cache”, “File destruction and the File Tracker”, and “Two-step sector reservation” sections.

11.1 The temp-file cache: recycling whole files

file_Tempcache is a global pool holding retired temp files intact so the next request of the same shape gets one back instead of destroy-and-recreate. Three structs cooperate.

// file_tempcache_entry -- src/storage/file_manager.c
struct file_tempcache_entry { VFID vfid; FILE_TYPE ftype; FILE_TEMPCACHE_ENTRY *next; };
// file_tempcache_tran_entry -- src/storage/file_manager.c
struct file_tempcache_tran_entry {
pthread_mutex_t mutex;
FILE_TEMPCACHE_ENTRY *head;
#if !defined (NDEBUG)
int owner_mutex;
#endif
};
// file_tempcache -- src/storage/file_manager.c
struct file_tempcache {
FILE_TEMPCACHE_ENTRY *free_entries;
int nfree_entries_max, nfree_entries;
FILE_TEMPCACHE_ENTRY *cached_not_numerable, *cached_numerable;
int ncached_max, ncached_not_numerable, ncached_numerable;
pthread_mutex_t mutex;
#if !defined (NDEBUG)
int owner_mutex;
#endif
FILE_TEMPCACHE_TRAN_ENTRY *tran_files;
SPACEDB_FILES spacedb_temp;
};
static FILE_TEMPCACHE file_Tempcache;

file_tempcache_entry

| Field | Role | Why it exists |
| --- | --- | --- |
| vfid | identifies the cached file | the cache stores real, allocated files, not descriptors |
| ftype | file type of the cached file | a get matches by type; a near-miss is re-typed in place |
| next | list link | one entry travels between free_entries, a tran list, and a cached list, never on two at once |

file_tempcache_tran_entry (one per transaction index)

| Field | Role | Why it exists |
| --- | --- | --- |
| mutex | per-transaction lock over this transaction’s head | held by file_tempcache_lock_tran_entry / unlock_tran_entry during the commit/abort drain |
| head | files this transaction created and still owns | drained at commit/abort by file_tempcache_drop_tran_temp_files |
| owner_mutex | debug-build-only ownership tracker (compiled in when NDEBUG is not defined) | records which thread holds mutex for the lock/unlock assertions |

file_tempcache (the global)

| Field | Role | Why it exists |
| --- | --- | --- |
| free_entries | pool of empty entry shells | avoids malloc/free on every cache op |
| nfree_entries_max / nfree_entries | cap / current size of the shell pool | init sets max to ntrans * 8 |
| cached_not_numerable | retired regular temp files | a get(numerable=false) pops here |
| cached_numerable | retired numerable temp files | separate list since the user page table differs (Ch 10) |
| ncached_max | total capacity (PRM_ID_MAX_ENTRIES_IN_TEMP_FILE_CACHE) | put refuses once not_numerable + numerable >= max |
| ncached_not_numerable / ncached_numerable | per-list counts | kept lock-step with the lists (see invariant) |
| mutex | guards global lists and shell pool | one lock for all global state |
| owner_mutex | debug-build-only ownership tracker (compiled in when NDEBUG is not defined) | which thread holds mutex, for file_tempcache_lock / unlock asserts |
| tran_files | array of per-transaction lists | indexed by tran index so commit is O(1) |
| spacedb_temp | temp-space accounting | feeds SPACEDB reporting |

Invariant — list head and count agree. (cached_not_numerable == NULL) == (ncached_not_numerable == 0) and likewise for numerable; put asserts both before linking. If a count drifted, get underflows (it asserts ncached_* > 0) or put over-admits past ncached_max, leaking temp files.

11.1.1 file_tempcache_get — hand out a recycled file or a fresh shell

// file_tempcache_get -- src/storage/file_manager.c
*entry = numerable ? file_Tempcache.cached_numerable : file_Tempcache.cached_not_numerable;
if (*entry != NULL && (*entry)->ftype != ftype) { /* cached file is wrong type */
error_code = file_temp_set_type (thread_p, &(*entry)->vfid, ftype);
if (error_code != NO_ERROR) *entry = NULL; /* <- re-type failed: fall to miss */
else (*entry)->ftype = ftype;
}
if (*entry != NULL) { /* hit: unlink, decrement the matching ncached_* */ ... }
else { error_code = file_tempcache_alloc_entry (entry); /* miss: bare shell, VFID_SET_NULL */ }

Five branches: hit/type-matches pops and decrements; hit/re-type succeeds patches ftype then pops; hit/re-type fails nulls *entry to the miss path; miss allocates a shell (vfid == NULL); shell-alloc failure propagates. A hit names an allocated file; a miss returns a shell for the caller to create into.

11.1.2 file_tempcache_put — admit a file back, or refuse

// file_tempcache_put -- src/storage/file_manager.c
if (file_header_copy (...&entry->vfid, &fhead) != NO_ERROR
|| fhead.n_page_user > prm_get_integer_value (PRM_ID_MAX_PAGES_IN_TEMP_FILE_CACHE))
return false; /* <- too big / unreadable: no lock taken yet */
file_tempcache_lock ();
if (ncached_not_numerable + ncached_numerable < ncached_max) {
if (file_temp_reset_user_pages (thread_p, &entry->vfid) != NO_ERROR)
{ file_tempcache_unlock (); return false; } /* <- reset failed: cannot reuse */
/* push onto cached_numerable / cached_not_numerable per FILE_IS_NUMERABLE(&fhead) */
file_tempcache_unlock (); return true;
}
file_tempcache_unlock (); return false; /* cache full */

Four exits, one keeps the file: header-copy-fails-or-too-big (false, before locking), cache-full (false), reset-fails (false), and all-clear (push onto the list chosen by the real header’s FILE_IS_NUMERABLE, true). A false return tells the caller to destroy it.
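The admit-or-refuse logic and the head/count invariant can be sketched without any locking. This is a single-threaded sketch under stated assumptions: `sketch_entry`, `sketch_cache`, `sketch_put` and `sketch_get` are hypothetical, the page count and numerable flag live in the entry instead of a file header, and the re-typing and user-page reset steps are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_entry
{
  int file_id;
  int n_pages;
  bool numerable;
  struct sketch_entry *next;
} sketch_entry;

typedef struct
{
  sketch_entry *cached_not_numerable, *cached_numerable;
  int ncached_not_numerable, ncached_numerable;
  int ncached_max;              /* total capacity across both lists */
  int max_pages;                /* files larger than this are never cached */
} sketch_cache;

/* Returns true if the file was kept; false tells the caller to destroy it. */
static bool
sketch_put (sketch_cache *c, sketch_entry *e)
{
  if (e->n_pages > c->max_pages)
    return false;               /* too big */
  if (c->ncached_not_numerable + c->ncached_numerable >= c->ncached_max)
    return false;               /* cache full */
  sketch_entry **head = e->numerable ? &c->cached_numerable : &c->cached_not_numerable;
  int *count = e->numerable ? &c->ncached_numerable : &c->ncached_not_numerable;
  assert ((*head == NULL) == (*count == 0));    /* list head and count agree */
  e->next = *head;
  *head = e;
  (*count)++;
  return true;
}

/* Returns a cached file of the requested shape, or NULL on a miss. */
static sketch_entry *
sketch_get (sketch_cache *c, bool numerable)
{
  sketch_entry **head = numerable ? &c->cached_numerable : &c->cached_not_numerable;
  int *count = numerable ? &c->ncached_numerable : &c->ncached_not_numerable;
  if (*head == NULL)
    return NULL;                /* miss: the caller creates a fresh file */
  assert (*count > 0);
  sketch_entry *e = *head;
  *head = e->next;
  e->next = NULL;
  (*count)--;
  return e;
}
```

Every mutation updates the list and its counter together, so the capacity test in `sketch_put` can rely on the counters alone without walking either list.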

11.1.3 Commit/abort drain — file_tempcache_drop_tran_temp_files

// file_tempcache_drop_tran_temp_files -- src/storage/file_manager.c
int tran_index = file_get_tempcache_entry_index (thread_p);
file_tempcache_lock_tran_entry (&file_Tempcache.tran_files[tran_index]);
if (file_Tempcache.tran_files[tran_index].head != NULL)
file_tempcache_cache_or_drop_entries (thread_p, &file_Tempcache.tran_files[tran_index].head);
file_tempcache_unlock_tran_entry (&file_Tempcache.tran_files[tran_index]);

file_tempcache_cache_or_drop_entries walks head; per entry it calls file_tempcache_put, and on false calls file_destroy(..., true) (interrupts suppressed so nothing leaks mid-drop) then file_tempcache_retire_entry; the list ends empty. tran_files is sized ntrans, where ntrans = logtb_get_number_of_total_tran_indices () + 1 in server mode (the +1 reserves index 0) and 1 in SA mode — the array is ntrans-sized, not ntrans + 1.

11.1.4 Query-manager-owned files — file_temp_preserve / file_temp_retire_preserved

A temp file that must outlive the request but not the session cannot stay on the transaction list, or the next commit reclaims it. file_temp_preserve removes it:

// file_temp_preserve -- src/storage/file_manager.c
entry = file_tempcache_pop_tran_file (thread_p, vfid);
if (entry == NULL) assert_release (false); /* must have been on the list */
else file_tempcache_retire_entry (entry); /* return the shell; file is now untracked */

When done the owner calls file_temp_retire_preserved = file_temp_retire_internal(..., /*was_preserved=*/true). The flag changes how the entry is obtained: a preserved file is on no list, so retire allocates a fresh shell with vfid from the argument; a non-preserved retire pops the existing entry. Both funnel into file_tempcache_put and on false file_destroy(..., true).

Invariant: a temp file is in exactly one tracking state at a time. It sits on its transaction’s head, or is preserved (on no list), or sits on a global cached list, or has been destroyed. file_temp_preserve enforces the hand-off by popping before retiring. Skip the pop and both the commit drain and the query manager retire the file, which is a double-free.

11.2 The File Tracker: the catalog of permanent files

The File Tracker is one permanent file per database whose body is a single FILE_EXTENSIBLE_DATA chain of FILE_TRACK_ITEM records — one per permanent file. It is located through two globals seeded at boot from boot_Db_parm->trk_vfid: file_Tracker_vfid (its VFID) and file_Tracker_vpid (its sticky first page, 11.3). boot_sr.c calls file_tracker_create at creation, file_tracker_load at every restart.

// file_track_metadata / file_track_item -- src/storage/file_manager.c
union file_track_metadata { /* 8 bytes, role depends on item->type */
FILE_TRACK_HEAP_METADATA heap; /* { bool is_marked_deleted; bool dummy[7]; } */
INT64 metadata_size_tracker; /* forces the union to exactly 8 bytes */
};
struct file_track_item {
INT32 fileid; INT16 volid; INT16 type; /* type is a FILE_TYPE cast to INT16 */
FILE_TRACK_METADATA metadata; /* total 16 bytes */
};

file_track_item, whose (volid, fileid) pair is the search key:

| Field | Role | Why it exists |
| --- | --- | --- |
| fileid | low 4 bytes of the VFID | with volid, uniquely names the file |
| volid | volume of the file | items kept ordered by file_compare_track_items for binary search |
| type | FILE_TYPE as 16 bits | lets file_tracker_map filter by type without fixing each header |
| metadata | per-type side data | meaningful only for heaps; otherwise zero |

file_track_metadata — a role matrix, because the union means different things by type:

| item->type | Active member | Meaning |
| --- | --- | --- |
| FILE_HEAP / FILE_HEAP_REUSE_SLOTS | heap.is_marked_deleted | heap is logically dropped but kept for reuse (file_tracker_item_reuse_heap) |
| any other type | metadata_size_tracker | unused; written 0 by file_tracker_register when metadata == NULL |

Invariant — items are ordered across the whole chain by file_compare_track_items; register inserts at the binary-search position. Both unregister (file_extdata_find_and_remove_item) and the iterator’s resume-by-cursor logic rely on this order — an out-of-order insert makes a later lookup silently miss a file that exists.
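Ordered insertion by binary search can be sketched over a single flat array. This is a sketch under assumptions: `sketch_track_item` and its helpers are hypothetical, the comparison is assumed to order by volid first and fileid second, and the real tracker spreads items over a chain of pages rather than one array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same field order as the struct quoted above; metadata reduced to 8 bytes. */
typedef struct
{
  int32_t fileid;
  int16_t volid;
  int16_t type;
  int64_t metadata;
} sketch_track_item;

/* Assumed key order: volid first, then fileid. */
static int
sketch_compare_items (const sketch_track_item *a, const sketch_track_item *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->fileid != b->fileid)
    return a->fileid < b->fileid ? -1 : 1;
  return 0;
}

/* Lower bound: first index whose item is >= key. *found reports an exact match. */
static int
sketch_search (const sketch_track_item *items, int n, const sketch_track_item *key, int *found)
{
  int lo = 0, hi = n;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (sketch_compare_items (&items[mid], key) < 0)
        lo = mid + 1;
      else
        hi = mid;
    }
  *found = (lo < n && sketch_compare_items (&items[lo], key) == 0);
  return lo;
}

/* Insert at the sorted position. Returns the index, or -1 on a duplicate
 * key or a full array. */
static int
sketch_register (sketch_track_item *items, int *n, int capacity, const sketch_track_item *item)
{
  int found;
  int pos = sketch_search (items, *n, item, &found);
  if (found || *n == capacity)
    return -1;
  memmove (&items[pos + 1], &items[pos], (size_t) (*n - pos) * sizeof (*items));
  items[pos] = *item;
  (*n)++;
  return pos;
}
```

Lookup and insertion share one search routine, so an item placed anywhere other than its lower-bound position would be invisible to later lookups for the same key.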

flowchart LR
  parm["boot_Db_parm->trk_vfid"] --> vfid["file_Tracker_vfid"]
  parm --> sticky["sticky first page"]
  sticky --> vpid["file_Tracker_vpid"]
  vpid --> head["FILE_EXTENSIBLE_DATA (head page)"]
  head -->|vpid_next| more["FILE_EXTENSIBLE_DATA (more pages)"]
  head --> items["FILE_TRACK_ITEM[] (volid,fileid,type,metadata)"]

Figure 11-1. Boot parameter to tracker globals to the extensible-data item chain.

11.2.1 file_tracker_register — add an item on permanent create

Called from file_create for every permanent file (Ch 6), under a started system op.

// file_tracker_register -- src/storage/file_manager.c
assert (log_check_system_op_is_started (thread_p));
item.volid = vfid->volid; item.fileid = vfid->fileid; item.type = (INT16) ftype;
if (metadata == NULL) item.metadata.metadata_size_tracker = 0; /* zero-fill */
else item.metadata = *metadata;
page_track_head = pgbuf_fix (..., &file_Tracker_vpid, OLD_PAGE, PGBUF_LATCH_WRITE, ...);
if (page_track_head == NULL) { ASSERT_ERROR_AND_SET (error_code); return error_code; }
error_code = file_tracker_register_internal (thread_p, page_track_head, &item);

Placement lives in file_tracker_register_internal: find a not-full page (file_extdata_find_not_full); if none, allocate a new tracker page (file_alloc(&file_Tracker_vfid, ...)) linked via file_log_extdata_set_next; binary-search the slot; assert no duplicate (assert_release(false)); file_extdata_insert_at + file_log_extdata_add, mark dirty. Both error exits and the duplicate path goto exit.

11.2.2 file_tracker_unregister — remove an item on permanent destroy

// file_tracker_unregister -- src/storage/file_manager.c
log_sysop_start (thread_p); /* its own nested system op */
item_inout.volid = vfid->volid; item_inout.fileid = vfid->fileid;
error_code = file_extdata_find_and_remove_item (..., &item_inout, file_compare_track_items, true,
                                                &item_inout, &vpid_merged);
if (error_code != NO_ERROR) goto exit; /* -> sysop_abort */
if (!VPID_ISNULL (&vpid_merged)) /* removal emptied/merged a page */
  error_code = file_dealloc (thread_p, &file_Tracker_vfid, &vpid_merged, FILE_TRACKER);
exit:
  if (error_code != NO_ERROR) log_sysop_abort (thread_p);
  else log_sysop_end_logical_undo (thread_p, RVFL_TRACKER_UNREGISTER, NULL, sizeof (item_inout), &item_inout);

Branches: a failed fix of the tracker head page returns early, before the system op starts (the fix itself is not shown in the excerpt); a failed find-and-remove or a failed dealloc after a merge both goto exit and abort the system op; success ends it with a logical undo. The undo must be logical rather than physical because items shift between pages as the chain compacts, so a physical undo would target the wrong slot. Instead, the undo replays file_tracker_register_internal from the saved item (file_rv_tracker_unregister_undo).
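
Why a remembered slot goes stale can be shown with a toy model. The sketch below assumes a single sorted array of file ids in place of the tracker chain; the tracker type and the tracker_* helpers are hypothetical names, and nothing here is logged or recovered the way the real handlers are.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: a sorted array of file ids standing in for the item chain. */
typedef struct { int ids[16]; int count; } tracker;

/* Lower-bound binary search: first slot whose id is >= key. */
int tracker_slot (const tracker *t, int key)
{
  int lo = 0, hi = t->count;
  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (t->ids[mid] < key) lo = mid + 1;
      else hi = mid;
    }
  return lo;
}

/* Insert at an explicit slot, shifting the tail right. */
void tracker_insert_at (tracker *t, int slot, int key)
{
  memmove (&t->ids[slot + 1], &t->ids[slot], (size_t) (t->count - slot) * sizeof (int));
  t->ids[slot] = key;
  t->count++;
}

/* Remove key; returns the slot it occupied, or -1 if it is absent. */
int tracker_remove (tracker *t, int key)
{
  int slot = tracker_slot (t, key);
  if (slot == t->count || t->ids[slot] != key) return -1;
  memmove (&t->ids[slot], &t->ids[slot + 1], (size_t) (t->count - slot - 1) * sizeof (int));
  t->count--;
  return slot;
}

/* Logical undo: ignore the old slot and recompute the position from the key. */
void tracker_undo_remove_logical (tracker *t, int key)
{
  tracker_insert_at (t, tracker_slot (t, key), key);
}

/* Returns 1 when ids are strictly increasing. */
int tracker_is_sorted (const tracker *t)
{
  for (int i = 1; i < t->count; i++)
    if (t->ids[i - 1] >= t->ids[i]) return 0;
  return 1;
}
```

Starting from [10, 20, 30, 40], removing 30 records slot 2. If 10 is removed afterwards, writing 30 back at slot 2 produces [20, 40, 30], while tracker_undo_remove_logical recomputes the position and yields [20, 30, 40].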

11.2.3 file_tracker_map — enumerate every file

// file_tracker_map -- src/storage/file_manager.c
page_track_head = pgbuf_fix (..., &file_Tracker_vpid, OLD_PAGE, latch_mode, ...);
while (true) { /* walk the extdata chain */
  for (index_item = 0; index_item < file_extdata_item_count (extdata); index_item++) {
    error_code = func (thread_p, page_extdata, extdata, index_item, &stop, args);
    if (error_code != NO_ERROR || stop) goto exit; /* error, or callback early-out */
  }
  if (page_track_other != NULL) pgbuf_unfix_and_init (thread_p, page_track_other);
  if (VPID_ISNULL (&extdata->vpid_next)) break; /* end of chain */
  page_track_other = pgbuf_fix (..., &extdata->vpid_next, OLD_PAGE, latch_mode, ...);
  if (page_track_other == NULL) goto exit;
  page_extdata = page_track_other;
}

file_tracker_map keeps the head page fixed and rotates a single page_track_other, so at most two pages are latched at once. The companion file_tracker_interruptable_iterate instead returns a cursor (vfid) plus an OID lock so a long scan can be interrupted and resumed without pinning the tracker; its FILE_GET_TRACKER_LOCK_MODE macro picks IX_LOCK for B-trees and SCH_S_LOCK otherwise.
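
The two-latch bound can be checked on a small model. The sketch below is an illustration under stated assumptions: pages are plain structs linked by pointer, fix_page/unfix_page only count latches, and every name (page, latch_stats, chain_map, sum_until) is hypothetical rather than taken from the page buffer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ITEMS_PER_PAGE 4

/* Hypothetical page: a few items plus a link standing in for vpid_next. */
typedef struct page {
  int items[ITEMS_PER_PAGE];
  int n_items;
  struct page *next; /* NULL ends the chain */
} page;

/* Counts latches currently held and the high-water mark. */
typedef struct { int latched; int max_latched; } latch_stats;

static page *fix_page (page *p, latch_stats *s)
{
  s->latched++;
  if (s->latched > s->max_latched) s->max_latched = s->latched;
  return p;
}

static void unfix_page (latch_stats *s) { s->latched--; }

typedef int (*map_func) (int item, bool *stop, void *args);

/* Walk the chain: the head stays fixed, one "other" page rotates behind it. */
int chain_map (page *head, map_func func, void *args, latch_stats *stats)
{
  int error = 0;
  bool stop = false;
  page *other = NULL;
  page *cur = fix_page (head, stats);

  while (true)
    {
      for (int i = 0; i < cur->n_items; i++)
        {
          error = func (cur->items[i], &stop, args);
          if (error != 0 || stop) goto exit;
        }
      page *next = cur->next; /* read the link before releasing the page */
      if (other != NULL) { unfix_page (stats); other = NULL; }
      if (next == NULL) break;
      other = fix_page (next, stats);
      cur = other;
    }
exit:
  if (other != NULL) unfix_page (stats);
  unfix_page (stats); /* release the head */
  return error;
}

/* Example callback: sums items and stops after seeing stop_at. */
typedef struct { int sum; int stop_at; } sum_args;

int sum_until (int item, bool *stop, void *args)
{
  sum_args *a = args;
  a->sum += item;
  if (item == a->stop_at) *stop = true;
  return 0;
}
```

Walking a three-page chain with this model never holds more than two latches, and both the full walk and the early-out path leave the latch count at zero.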

11.3 Sticky first page

Some files must keep their first user page at a fixed VPID forever: the tracker itself and the boot HFID heap. file_alloc_sticky_first_page allocates page #1 and records it.

// file_alloc_sticky_first_page -- src/storage/file_manager.c
assert (fhead->n_page_user == 0 && VPID_ISNULL (&fhead->vpid_sticky_first)); /* brand-new file */
error_code = file_alloc (thread_p, vfid, f_init, f_init_args, vpid_out, page_out);
if (error_code != NO_ERROR) goto exit;
log_append_undoredo_data2 (thread_p, RVFL_FHEAD_STICKY_PAGE, NULL, page_fhead, 0,
sizeof (VPID), sizeof (VPID), &fhead->vpid_sticky_first, vpid_out);
fhead->vpid_sticky_first = *vpid_out; /* remember it */
pgbuf_set_dirty (thread_p, page_fhead, DONT_FREE);

This is an ordinary file_alloc plus one logged header write (recovered by file_rv_fhead_sticky_page). The payoff is on dealloc: file_dealloc and its helpers assert (!VPID_EQ (&fhead->vpid_sticky_first, vpid)), exempting the sticky page from the Ch 9 lifecycle. This is a debug assertion (compiled out under NDEBUG), not a runtime short-circuit; callers are simply expected never to pass the sticky VPID. file_get_sticky_first_page reads the VPID back (assert_release(false) if it is NULL); this is how file_tracker_load recovers file_Tracker_vpid.
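
The contract can be modelled in miniature. The sketch below is an assumption-laden toy: the header only counts pages, sketch_vpid and sketch_fhead are hypothetical types, and the brand-new-file precondition is returned as an error code here (the excerpt above uses an assert) so that it can be exercised.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical stand-ins for VPID and the file header. */
typedef struct { int volid; int pageid; } sketch_vpid;
typedef struct { int n_page_user; sketch_vpid vpid_sticky_first; } sketch_fhead;

static const sketch_vpid SKETCH_VPID_NULL = { -1, -1 };

bool vpid_eq (const sketch_vpid *a, const sketch_vpid *b)
{
  return a->volid == b->volid && a->pageid == b->pageid;
}

void fhead_init (sketch_fhead *fh)
{
  fh->n_page_user = 0;
  fh->vpid_sticky_first = SKETCH_VPID_NULL;
}

/* Plain allocation: this model only counts user pages. */
void alloc_page (sketch_fhead *fh) { fh->n_page_user++; }

/* Allocate the first user page and remember its VPID in the header.
   Returns -1 unless the file is brand new (assumption: error code instead of assert). */
int alloc_sticky_first_page (sketch_fhead *fh, sketch_vpid new_page)
{
  if (fh->n_page_user != 0 || !vpid_eq (&fh->vpid_sticky_first, &SKETCH_VPID_NULL))
    return -1;
  alloc_page (fh);
  fh->vpid_sticky_first = new_page;
  return 0;
}

/* Dealloc guards the sticky page with a debug-only assert: under NDEBUG the
   check disappears, so correctness rests on callers never passing that VPID. */
void dealloc_page (sketch_fhead *fh, const sketch_vpid *vpid)
{
  assert (!vpid_eq (&fh->vpid_sticky_first, vpid));
  fh->n_page_user--;
}
```

Deallocating any other page works as usual and leaves vpid_sticky_first untouched; passing the sticky VPID aborts in a debug build and silently proceeds when asserts are compiled out, which is the behaviour the text describes.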

11.4 TDE flags — orthogonal to allocation


The TDE algorithm is recorded as two mutually exclusive bits in fhead->file_flags: FILE_FLAG_ENCRYPTED_AES (0x4) and FILE_FLAG_ENCRYPTED_ARIA (0x8).

// file_set_tde_algorithm_internal -- src/storage/file_manager.c
fhead->file_flags &= ~FILE_FLAG_ENCRYPTED_MASK; /* clear both bits first */
switch (tde_algo) {
case TDE_ALGORITHM_AES: fhead->file_flags |= FILE_FLAG_ENCRYPTED_AES; break;
case TDE_ALGORITHM_ARIA: fhead->file_flags |= FILE_FLAG_ENCRYPTED_ARIA; break;
case TDE_ALGORITHM_NONE: break; /* already cleared */
}

Neither sector reservation (Ch 4) nor page allocation (Ch 7/8) consults these flags. file_get_tde_algorithm_internal asserts the two bits are never both set, then reports AES, ARIA, or NONE. Encryption is applied per page at the buffer layer; the allocation machinery is algorithm-blind, so for a reader modifying allocation, TDE is a non-event.
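
The clear-then-set pattern and the matching getter can be sketched standalone. The two bit values come from the text above; the mask being the OR of the two bits, the tde_algo enum, and the function names are assumptions made for this illustration, and flags are passed by value rather than through a file header.

```c
#include <assert.h>

#define FLAG_ENCRYPTED_AES  0x4
#define FLAG_ENCRYPTED_ARIA 0x8
/* Assumption: the mask is simply the OR of the two algorithm bits. */
#define FLAG_ENCRYPTED_MASK (FLAG_ENCRYPTED_AES | FLAG_ENCRYPTED_ARIA)

/* Hypothetical enum standing in for the TDE_ALGORITHM_* constants. */
typedef enum { ALGO_NONE, ALGO_AES, ALGO_ARIA } tde_algo;

/* Returns file_flags with the TDE bits replaced; all other bits are preserved. */
int set_tde_algorithm (int file_flags, tde_algo algo)
{
  file_flags &= ~FLAG_ENCRYPTED_MASK; /* clear both bits first */
  switch (algo)
    {
    case ALGO_AES:  file_flags |= FLAG_ENCRYPTED_AES;  break;
    case ALGO_ARIA: file_flags |= FLAG_ENCRYPTED_ARIA; break;
    case ALGO_NONE: break; /* already cleared */
    }
  return file_flags;
}

/* Reads the algorithm back; both bits set at once would violate the invariant. */
tde_algo get_tde_algorithm (int file_flags)
{
  assert ((file_flags & FLAG_ENCRYPTED_MASK) != FLAG_ENCRYPTED_MASK);
  if (file_flags & FLAG_ENCRYPTED_AES) return ALGO_AES;
  if (file_flags & FLAG_ENCRYPTED_ARIA) return ALGO_ARIA;
  return ALGO_NONE;
}
```

Because the setter clears the mask before OR-ing in a new bit, switching from AES to ARIA can never leave both bits set, and unrelated flag bits pass through unchanged.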

11.5 The shared primitive — file_extdata_apply_funcs


Every table in the module (tracker items, user-page table, sector tables) is a FILE_EXTENSIBLE_DATA chain, and almost every walk goes through this generic visitor.

// file_extdata_apply_funcs -- src/storage/file_manager.c
while (true) {
  if (f_extdata != NULL) { error_code = f_extdata (...); if (error_code || stop) goto exit; } /* per-page */
  if (f_item != NULL) /* per-item */
    for (i = 0; i < file_extdata_item_count (extdata_in); i++)
      { error_code = f_item (..., file_extdata_at (extdata_in, i), i, &stop, ...); if (error_code || stop) goto exit; }
  if (VPID_ISNULL (&extdata_in->vpid_next)) break;
  // ... unfix current, fix extdata_in->vpid_next, goto exit on NULL ...
}
exit:
  if (stop && page_out != NULL) *page_out = page_extdata; /* hand page back, latched */
  else if (page_extdata != NULL) pgbuf_unfix (thread_p, page_extdata);

Two optional callbacks (f_extdata per page, f_item per item) drive the walk, and either can stop it or return an error. The exit policy is the subtle part: on stop with page_out the page is handed back still latched so search-then-modify can act on it; otherwise it is unfixed here. This underlies file_extdata_search_item / _find_ordered (binary search), _insert_at / _remove_at, and file_extdata_merge. latch_mode is WRITE when for_write is set, else READ.
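
The exit policy is easier to see in a reduced model. The sketch below assumes that the first page belongs to the caller and that only pages reached through the chain are latched by the walk; latching is a counter on each page, and ext_page, apply_funcs, and find_item are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page in an extensible-data chain; latch_count models fix/unfix. */
typedef struct ext_page {
  int items[4];
  int n_items;
  struct ext_page *next; /* stands in for vpid_next */
  int latch_count;
} ext_page;

typedef int (*page_func) (const ext_page *pg, bool *stop, void *args);
typedef int (*item_func) (int item, int index, bool *stop, void *args);

/* Generic visitor: optional per-page and per-item callbacks, either may stop or fail. */
int apply_funcs (ext_page *first, page_func f_page, void *page_args,
                 item_func f_item, void *item_args, ext_page **page_out)
{
  int error = 0;
  bool stop = false;
  ext_page *cur = first;  /* assumption: the caller holds its own latch on this page */
  ext_page *fixed = NULL; /* page latched by this walk, if any */

  if (page_out != NULL) *page_out = NULL;
  while (true)
    {
      if (f_page != NULL)
        {
          error = f_page (cur, &stop, page_args);
          if (error || stop) goto exit;
        }
      if (f_item != NULL)
        for (int i = 0; i < cur->n_items; i++)
          {
            error = f_item (cur->items[i], i, &stop, item_args);
            if (error || stop) goto exit;
          }
      if (cur->next == NULL) break;
      ext_page *next = cur->next;
      if (fixed != NULL) fixed->latch_count--; /* unfix current */
      next->latch_count++;                     /* fix next */
      fixed = cur = next;
    }
exit:
  if (stop && page_out != NULL) *page_out = fixed; /* hand back, still latched */
  else if (fixed != NULL) fixed->latch_count--;    /* otherwise release here */
  return error;
}

/* Example per-item callback: stop at the first item equal to key. */
typedef struct { int key; int found_index; } find_args;

int find_item (int item, int index, bool *stop, void *args)
{
  find_args *a = args;
  if (item == a->key) { a->found_index = index; *stop = true; }
  return 0;
}
```

A search that stops on the second page returns that page through page_out with its latch still held, so the caller can modify it and then release it; a search that finds nothing returns with every latch taken by the walk already dropped.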

11.6 Recovery handlers for this chapter, and the open question


Both modules register undo/redo/dump callbacks indexed by the RV* enum. The handlers introduced by this chapter’s machinery:

| RV index | Handler(s) | What it replays |
| --- | --- | --- |
| RVFL_FHEAD_STICKY_PAGE | file_rv_fhead_sticky_page | sticky-first-page VPID (11.3) |
| RVFL_TRACKER_UNREGISTER | file_rv_tracker_unregister_undo | logical undo of tracker removal (11.2.2) |
| RVFL_SET_TDE_ALGORITHM | file_rv_set_tde_algorithm | TDE flag change (11.4) |
| RVFL_EXTDATA_ADD/REMOVE/SET_NEXT/MERGE | file_rv_extdata_add / _remove / _set_next / _merge | every extensible-data edit (11.5) |
| RVDK_FORMAT | disk_rv_undo_format, disk_rv_redo_format, disk_rv_dump_hdr | volume create/format (the open question below) |

The core-lifecycle handlers — sector reserve/unreserve, volume-header expand, file-header alloc/dealloc, partial-sector bitmap edits, postponed dealloc, file destroy — belong to Chapters 3-5, 7, 9; see those for their RVDK_* / RVFL_* rows.

Open question — mid-disk_format crash idempotency. disk_rv_redo_format carries an is_first_call flag (rcv->offset == -1) that skips the disk-cache update on the first of its two calls, so the format handlers encode an implicit assumption about how far disk_format got before a crash. Whether every interleaving (volume file created / cache registered / volume_info written) is covered — notably a crash between the redo’s two calls — is not provable from the handlers alone and is left as a verification target.

  1. file_Tempcache recycles whole temp files: get pops a cached list or allocates a shell; put admits a reset file or returns false so the caller destroys it.
  2. A temp file lives on exactly one tracking list; file_temp_preserve pops it off the transaction list for the query manager, drop_tran_temp_files drains the rest at commit/abort.
  3. The File Tracker is a (volid, fileid)-ordered FILE_TRACK_ITEM chain reached via trk_vfid → file_Tracker_vfid / file_Tracker_vpid; unregister uses logical undo because items migrate between pages.
  4. Sticky first pages are lifecycle-exempt by contract — file_dealloc only asserts (debug build) the page is never vpid_sticky_first; release builds rely on callers never passing it.
  5. TDE (mutually exclusive AES/ARIA bits in file_flags) is orthogonal to allocation; the bits are read only when bytes hit disk.
  6. file_extdata_apply_funcs is the one engine behind every table, with per-page/per-item callbacks and an exit policy that can hand the stopped-on page back still latched.
  7. Recovery is indexed by RV* constants; the one unproven corner is disk_rv_*_format idempotency across mid-disk_format crash points.

The following are line numbers as observed on 2026-06-09; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
| --- | --- | --- |
| bit64_count_trailing_ones | src/base/bit.c | 515 |
| PRM_ID_BOSR_MAXTMP_PAGES | src/base/system_parameter.c | 1246 |
| DB_VOLPURPOSE | src/compat/dbtype_def.h | 196 |
| DB_VOLTYPE | src/compat/dbtype_def.h | 203 |
| VSID | src/compat/dbtype_def.h | 939 |
| fhs_fix_nth_page | src/query/query_hash_scan.c | 1078 |
| disk_volume_header | src/storage/disk_manager.c | 75 |
| disk_cache_volinfo | src/storage/disk_manager.c | 155 |
| disk_extend_info | src/storage/disk_manager.c | 162 |
| disk_perm_info | src/storage/disk_manager.c | 180 |
| disk_temp_info | src/storage/disk_manager.c | 186 |
| nsect_perm_free | src/storage/disk_manager.c | 189 |
| disk_cache | src/storage/disk_manager.c | 194 |
| disk_Cache | src/storage/disk_manager.c | 209 |
| disk_Temp_max_sects | src/storage/disk_manager.c | 211 |
| DISK_STAB_UNIT | src/storage/disk_manager.c | 224 |
| disk_stab_cursor | src/storage/disk_manager.c | 229 |
| DISK_STAB_PAGE_BIT_COUNT | src/storage/disk_manager.c | 250 |
| DISK_ALLOCTBL_SECTOR_PAGE_OFFSET | src/storage/disk_manager.c | 253 |
| DISK_ALLOCTBL_SECTOR_UNIT_OFFSET | src/storage/disk_manager.c | 255 |
| DISK_STAB_NPAGES | src/storage/disk_manager.c | 263 |
| disk_cache_vol_reserve | src/storage/disk_manager.c | 273 |
| DISK_PRERESERVE_BUF_DEFAULT | src/storage/disk_manager.c | 278 |
| disk_reserve_context | src/storage/disk_manager.c | 281 |
| DISK_MIN_VOLUME_SECTS | src/storage/disk_manager.c | 300 |
| DISK_SYS_NSECT_SIZE | src/storage/disk_manager.c | 347 |
| disk_format | src/storage/disk_manager.c | 512 |
| disk_unformat | src/storage/disk_manager.c | 822 |
| disk_rv_undo_format | src/storage/disk_manager.c | 1235 |
| disk_rv_redo_format | src/storage/disk_manager.c | 1340 |
| disk_extend | src/storage/disk_manager.c | 1633 |
| disk_volume_expand | src/storage/disk_manager.c | 1904 |
| disk_rv_volhead_extend_redo | src/storage/disk_manager.c | 2022 |
| disk_rv_volhead_extend_undo | src/storage/disk_manager.c | 2081 |
| disk_add_volume | src/storage/disk_manager.c | 2117 |
| disk_add_volume_extension | src/storage/disk_manager.c | 2326 |
| disk_volume_boot | src/storage/disk_manager.c | 2443 |
| disk_cache_load_volume | src/storage/disk_manager.c | 2567 |
| disk_cache_init | src/storage/disk_manager.c | 2627 |
| disk_cache_final | src/storage/disk_manager.c | 2688 |
| disk_cache_load_all_volumes | src/storage/disk_manager.c | 2714 |
| disk_cache_free_reserved | src/storage/disk_manager.c | 2728 |
| disk_cache_update_vol_free | src/storage/disk_manager.c | 2748 |
| disk_lock_extend | src/storage/disk_manager.c | 2791 |
| disk_unlock_extend | src/storage/disk_manager.c | 2817 |
| disk_cache_lock_reserve_for_purpose | src/storage/disk_manager.c | 2837 |
| disk_volume_header_set_stab | src/storage/disk_manager.c | 3166 |
| disk_verify_volume_header | src/storage/disk_manager.c | 3179 |
| disk_stab_cursor_set_at_sectid | src/storage/disk_manager.c | 3258 |
| disk_stab_cursor_set_at_end | src/storage/disk_manager.c | 3284 |
| disk_stab_cursor_set_at_start | src/storage/disk_manager.c | 3303 |
| disk_stab_cursor_check_valid | src/storage/disk_manager.c | 3372 |
| disk_stab_cursor_is_bit_set | src/storage/disk_manager.c | 3414 |
| disk_stab_cursor_set_bit | src/storage/disk_manager.c | 3429 |
| disk_stab_cursor_fix | src/storage/disk_manager.c | 3493 |
| disk_stab_unit_reserve | src/storage/disk_manager.c | 3544 |
| disk_stab_iterate_units | src/storage/disk_manager.c | 3665 |
| disk_stab_iterate_units_all | src/storage/disk_manager.c | 3738 |
| disk_stab_set_bits_contiguous | src/storage/disk_manager.c | 3807 |
| disk_rv_reserve_sectors | src/storage/disk_manager.c | 3899 |
| disk_rv_unreserve_sectors | src/storage/disk_manager.c | 3982 |
| disk_reserve_sectors_in_volume | src/storage/disk_manager.c | 4066 |
| disk_reserve_sectors | src/storage/disk_manager.c | 4290 |
| disk_reserve_from_cache | src/storage/disk_manager.c | 4463 |
| disk_reserve_from_cache_vols | src/storage/disk_manager.c | 4612 |
| disk_reserve_from_cache_volume | src/storage/disk_manager.c | 4666 |
| disk_unreserve_ordered_sectors | src/storage/disk_manager.c | 4703 |
| disk_unreserve_ordered_sectors_without_csect | src/storage/disk_manager.c | 4735 |
| disk_unreserve_sectors_from_volume | src/storage/disk_manager.c | 4794 |
| disk_stab_unit_unreserve | src/storage/disk_manager.c | 4848 |
| disk_stab_init | src/storage/disk_manager.c | 4909 |
| disk_manager_init | src/storage/disk_manager.c | 5002 |
| disk_manager_final | src/storage/disk_manager.c | 5044 |
| disk_format_first_volume | src/storage/disk_manager.c | 5062 |
| disk_sectors_to_extend_npages | src/storage/disk_manager.c | 6845 |
| DISK_VOLHEADER_PAGE | src/storage/disk_manager.h | 35 |
| fileio_map_mounted | src/storage/file_io.c | 3448 |
| file_header | src/storage/file_manager.c | 90 |
| n_page_mark_delete | src/storage/file_manager.c | 104 |
| volid_last_expand | src/storage/file_manager.c | 117 |
| vpid_sticky_first | src/storage/file_manager.c | 123 |
| vpid_last_temp_alloc | src/storage/file_manager.c | 132 |
| offset_to_last_temp_alloc | src/storage/file_manager.c | 133 |
| vpid_last_user_page_ftab | src/storage/file_manager.c | 139 |
| vpid_find_nth_last | src/storage/file_manager.c | 156 |
| first_index_find_nth_last | src/storage/file_manager.c | 157 |
| FILE_HEADER_ALIGNED_SIZE | src/storage/file_manager.c | 167 |
| FILE_FLAG_NUMERABLE | src/storage/file_manager.c | 170 |
| FILE_FLAG_ENCRYPTED_AES | src/storage/file_manager.c | 172 |
| FILE_FLAG_ENCRYPTED_ARIA | src/storage/file_manager.c | 173 |
| FILE_CACHE_LAST_FIND_NTH | src/storage/file_manager.c | 181 |
| FILE_HEADER_GET_PART_FTAB | src/storage/file_manager.c | 199 |
| FILE_HEADER_GET_FULL_FTAB | src/storage/file_manager.c | 203 |
| FILE_HEADER_GET_USER_PAGE_FTAB | src/storage/file_manager.c | 208 |
| file_extensible_data | src/storage/file_manager.c | 232 |
| FILE_EXTDATA_HEADER_ALIGNED_SIZE | src/storage/file_manager.c | 240 |
| FILE_TABLESPACE_FOR_PERM_NPAGES | src/storage/file_manager.c | 281 |
| FILE_TABLESPACE_FOR_TEMP_NPAGES | src/storage/file_manager.c | 287 |
| file_vsid_collector | src/storage/file_manager.c | 296 |
| file_alloc_type | src/storage/file_manager.c | 388 |
| FILE_USER_PAGE_MARK_DELETE_FLAG | src/storage/file_manager.c | 425 |
| FILE_USER_PAGE_IS_MARKED_DELETED | src/storage/file_manager.c | 426 |
| FILE_USER_PAGE_MARK_DELETED | src/storage/file_manager.c | 427 |
| FILE_USER_PAGE_CLEAR_MARK_DELETED | src/storage/file_manager.c | 428 |
| file_find_nth_context | src/storage/file_manager.c | 433 |
| file_tempcache_entry | src/storage/file_manager.c | 448 |
| file_tempcache_tran_entry | src/storage/file_manager.c | 457 |
| file_tempcache | src/storage/file_manager.c | 467 |
| file_Tempcache | src/storage/file_manager.c | 490 |
| file_Tracker_vfid | src/storage/file_manager.c | 496 |
| file_Tracker_vpid | src/storage/file_manager.c | 497 |
| file_track_metadata | src/storage/file_manager.c | 507 |
| file_track_item | src/storage/file_manager.c | 515 |
| file_manager_init | src/storage/file_manager.c | 859 |
| file_manager_final | src/storage/file_manager.c | 872 |
| file_header_alloc | src/storage/file_manager.c | 1093 |
| file_header_update_mark_deleted | src/storage/file_manager.c | 1317 |
| file_extdata_init | src/storage/file_manager.c | 1492 |
| file_extdata_max_size | src/storage/file_manager.c | 1520 |
| file_extdata_apply_funcs | src/storage/file_manager.c | 1886 |
| file_extdata_find_and_remove_item | src/storage/file_manager.c | 2571 |
| file_partsect_is_full | src/storage/file_manager.c | 2758 |
| file_partsect_is_empty | src/storage/file_manager.c | 2770 |
| file_partsect_set_bit | src/storage/file_manager.c | 2796 |
| file_partsect_pageid_to_offset | src/storage/file_manager.c | 2826 |
| file_partsect_alloc | src/storage/file_manager.c | 2847 |
| file_create_with_npages | src/storage/file_manager.c | 3101 |
| file_create_heap | src/storage/file_manager.c | 3126 |
| file_create_temp_internal | src/storage/file_manager.c | 3155 |
| file_create_temp | src/storage/file_manager.c | 3217 |
| file_create_temp_numerable | src/storage/file_manager.c | 3231 |
| file_create_query_area | src/storage/file_manager.c | 3244 |
| file_create_ehash | src/storage/file_manager.c | 3261 |
| file_create_ehash_dir | src/storage/file_manager.c | 3285 |
| file_create | src/storage/file_manager.c | 3311 |
| file_table_collect_vsid | src/storage/file_manager.c | 3915 |
| file_table_collect_all_vsids | src/storage/file_manager.c | 3934 |
| file_destroy | src/storage/file_manager.c | 4121 |
| file_temp_retire_preserved | src/storage/file_manager.c | 4445 |
| file_temp_retire_internal | src/storage/file_manager.c | 4476 |
| file_perm_expand | src/storage/file_manager.c | 4644 |
| file_table_move_partial_sectors_to_header | src/storage/file_manager.c | 4772 |
| file_table_append_full_sector_page | src/storage/file_manager.c | 4976 |
| file_table_add_full_sector | src/storage/file_manager.c | 5026 |
| file_perm_alloc | src/storage/file_manager.c | 5166 |
| file_alloc | src/storage/file_manager.c | 5405 |
| file_alloc_sticky_first_page | src/storage/file_manager.c | 5681 |
| file_rv_fhead_sticky_page | src/storage/file_manager.c | 5753 |
| file_get_sticky_first_page | src/storage/file_manager.c | 5779 |
| file_set_tde_algorithm_internal | src/storage/file_manager.c | 5896 |
| file_get_tde_algorithm_internal | src/storage/file_manager.c | 5963 |
| file_dealloc | src/storage/file_manager.c | 6116 |
| file_perm_dealloc | src/storage/file_manager.c | 6309 |
| file_rv_dealloc_internal | src/storage/file_manager.c | 6616 |
| file_rv_dealloc_on_undo | src/storage/file_manager.c | 6758 |
| file_rv_dealloc_on_postpone | src/storage/file_manager.c | 6773 |
| file_numerable_add_page | src/storage/file_manager.c | 7935 |
| file_extdata_find_nth_vpid | src/storage/file_manager.c | 8119 |
| file_extdata_find_nth_vpid_and_skip_marked | src/storage/file_manager.c | 8153 |
| file_numerable_find_nth | src/storage/file_manager.c | 8193 |
| file_rv_user_page_mark_delete | src/storage/file_manager.c | 8381 |
| file_rv_user_page_unmark_delete_logical | src/storage/file_manager.c | 8406 |
| file_numerable_truncate | src/storage/file_manager.c | 8577 |
| file_temp_alloc | src/storage/file_manager.c | 8650 |
| disk_reserve_sectors | src/storage/file_manager.c | 8715 |
| file_temp_reset_user_pages | src/storage/file_manager.c | 8949 |
| file_temp_preserve | src/storage/file_manager.c | 9143 |
| file_tempcache_init | src/storage/file_manager.c | 9171 |
| file_tempcache_final | src/storage/file_manager.c | 9234 |
| file_tempcache_get | src/storage/file_manager.c | 9414 |
| file_tempcache_put | src/storage/file_manager.c | 9541 |
| file_tempcache_drop_tran_temp_files | src/storage/file_manager.c | 9645 |
| file_tempcache_cache_or_drop_entries | src/storage/file_manager.c | 9664 |
| file_tempcache_pop_tran_file | src/storage/file_manager.c | 9702 |
| file_tracker_create | src/storage/file_manager.c | 9861 |
| file_tracker_load | src/storage/file_manager.c | 9910 |
| file_tracker_register | src/storage/file_manager.c | 9960 |
| file_tracker_register_internal | src/storage/file_manager.c | 10016 |
| file_tracker_unregister | src/storage/file_manager.c | 10113 |
| file_tracker_map | src/storage/file_manager.c | 10306 |
| file_tracker_interruptable_iterate | src/storage/file_manager.c | 10992 |
| file_heap_des | src/storage/file_manager.h | 82 |
| file_btree_des | src/storage/file_manager.h | 98 |
| file_ovf_btree_des | src/storage/file_manager.h | 106 |
| FILE_DESCRIPTORS_SIZE | src/storage/file_manager.h | 128 |
| file_descriptors | src/storage/file_manager.h | 130 |
| file_tablespace | src/storage/file_manager.h | 143 |
| FILE_ALLOC_BITMAP | src/storage/file_manager.h | 153 |
| FILE_FULL_PAGE_BITMAP | src/storage/file_manager.h | 154 |
| FILE_ALLOC_BITMAP_NBITS | src/storage/file_manager.h | 157 |
| file_partial_sector | src/storage/file_manager.h | 162 |
| pgbuf_dealloc_page | src/storage/page_buffer.c | 14562 |
| DISK_SECTOR_NPAGES | src/storage/storage_common.h | 109 |
| trk_vfid | src/transaction/boot_sr.c | 119 |
| LOG_MAX_DBVOLID | src/transaction/log_volids.hpp | 34 |

  • cubrid-disk-manager.md — the high-level companion (covers both file and disk managers).
  • Raw analyses under raw/code-analysis/cubrid/storage/disk_manager/ and the numerable-file Q&A note raw/code-analysis/cubrid/file-manager-numerable-qa.md.
  • Code: src/storage/file_manager.{c,h}, src/storage/disk_manager.{c,h}.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.