Runtime & Weights

The device-side runtime: weight loading and materialization, the on-demand GPU weight cache, and the shared device memory pool. These are internal interfaces, documented here for operators and contributors. See On-demand Weights for the operational view.

Weights

ReactantServer.free_weights! — Method

free_weights!(backend, bufs) -> nothing

Release every device buffer in bufs. Used to evict an unpinned model's resident weights.

ReactantServer.host_materialize — Method

host_materialize(store, key, st, names) -> Vector{Any}

Materialize a model's host weight floor through a WeightStore. The private store allocates per-worker arrays (identical to materialize_host_weights); the shared store backs them with a node-shared SHM region so same-node workers share one copy. key is the model name.

ReactantServer.host_release! — Method

host_release!(store, key) -> nothing

Release a model's host weight floor previously obtained from host_materialize. A no-op for the private store; for the shared store it detaches and, if last on the node, unlinks the region. The caller must drop its references to the host arrays first.

ReactantServer.host_rename! — Method

host_rename!(store, old, new) -> nothing

Rekey a model's host weight floor from old to new after a model rename (the weights are unchanged, so nothing is re-materialized). A no-op for the private store.

ReactantServer.materialize_host_weights — Method

materialize_host_weights(st, names) -> Vector{Any}

Materialize the named weights from the lazy mmap safetensors views into resident host Arrays, in the order given by names. This is the expensive step (a layout copy at host-memory speed); doing it once at startup and keeping the result resident in RAM lets an on-demand GPU load be a pure host->device transfer (see transfer_to_device) rather than a re-materialization.

ReactantServer.transfer_to_device — Method

transfer_to_device(backend, pool, hosts) -> Vector{Any}

Transfer already-materialized host weight Arrays to device buffers, in order. This is the only cost paid on an on-demand load when weights are pinned in host RAM.

ReactantServer.weight_order — Method

weight_order(st) -> Vector{String}

Return weight names in StableHLO argument order. Reads the argument_order field from the safetensors metadata (a JSON-encoded list). A file with tensors but no argument_order is an error; a file with no tensors yields an empty order.

ReactantServer.weights_nbytes — Method

weights_nbytes(st, names) -> Int

Total device footprint of the named weights, summed from the lazy safetensors views' shape and element type. Reads only metadata (no collect), so it is cheap and lets the weight cache budget against a model before its weights are ever loaded.
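
The metadata-only sum can be illustrated with a minimal sketch. It is not the ReactantServer implementation: sketch_nbytes and the (eltype, dims) tuples are hypothetical stand-ins for the lazy safetensors views, chosen only to show that shape and element type are enough to compute the footprint.

```julia
# Hypothetical stand-in for weights_nbytes: each spec is (eltype, dims).
# No tensor data is touched; only shape and element size are used.
function sketch_nbytes(specs)
    return sum((sizeof(T) * prod(dims) for (T, dims) in specs); init=0)
end

specs = [(Float32, (2, 3)), (Int8, (4,))]
total = sketch_nbytes(specs)   # 4 bytes * 6 elements + 1 byte * 4 elements
```

Because nothing is collected, a cache could call a function like this for every registered model before deciding which ones fit.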

Weight stores (host RAM)

ReactantServerCore.PrivateWeightStore — Type

PrivateWeightStore()

The default store: each worker materializes its own host copy of a model's weights.

ReactantServerCore.SharedWeightStore — Type

SharedWeightStore(; mode=0o666)

Opt-in store backing each model's host weights with a node-level POSIX shared-memory region so same-node workers share one copy. See the file header for the flock-coordinated protocol. mode sets the permission bits for the regions and their lock files. The default 0o666 lets workers running as unrelated UIDs (for example different containers) share the regions, but is world-writable; 0o660 is recommended for production and multi-user systems.
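
The difference between the two modes comes down to the "other" write bit. A small sketch, using a hypothetical helper that is not part of the package:

```julia
# Hypothetical helper: true when the permission bits allow writes by "other".
is_world_writable(mode::Integer) = (mode & 0o002) != 0

default_mode    = is_world_writable(0o666)   # true: any UID can write the region
restricted_mode = is_world_writable(0o660)   # false: owner and group only
```

With 0o660, workers that need to share a region must run under the same user or a common group.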

ReactantServerCore.WeightStore — Type

WeightStore

Host-RAM residency backend for a model's materialized weights. PrivateWeightStore (the default) gives each worker its own copy; SharedWeightStore shares one copy per model across same-node workers through a POSIX shared-memory region. See the file header for the coordination protocol.

ReactantServerCore.materialize_host_weights! — Method

materialize_host_weights!(store, key, digest, specs, fill!) -> Vector{Any}

Return a model's host weight Arrays (in weight order), populating them via fill!(arrays) when this worker is the one that must materialize them. specs is a vector of (eltype, dims) per tensor. For PrivateWeightStore the arrays are freshly allocated; for SharedWeightStore they alias a node-shared region.
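
For the private-store case, the allocate-then-fill contract can be sketched as follows. This is an assumption-laden illustration: sketch_materialize and populate! are hypothetical names, and the real method also takes a store, key, and digest that are omitted here.

```julia
# Hypothetical private-store behaviour: allocate one fresh Array per spec,
# then let the caller-supplied callback populate all of them in place.
function sketch_materialize(specs, populate!)
    arrays = Any[Array{T}(undef, dims) for (T, dims) in specs]
    populate!(arrays)
    return arrays
end

specs = [(Float32, (2, 3)), (Int8, (4,))]
hosts = sketch_materialize(specs, arrs -> foreach(a -> fill!(a, one(eltype(a))), arrs))
```

A shared store would differ only in where the arrays live: they would alias a node-shared region, and the callback would run only in the worker that has to populate it.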

ReactantServerCore.release_host_weights! — Method

release_host_weights!(store, key) -> nothing

Release a model's host weights. For SharedWeightStore this detaches the region and, if this was the last holder on the node (a non-blocking upgrade to an exclusive flock succeeds), unlinks the region and its lock file. A no-op for the private store. The caller must drop all references to the arrays first.

ReactantServerCore.rename_host_weights! — Method

rename_host_weights!(store, old, new) -> nothing

Rekey a model's attached host-weight region from old to new (a model rename; the weights are unchanged). The region itself keeps its original content-addressed SHM name; only this worker's bookkeeping key moves, so a later release_host_weights!(store, new) detaches the same region. A no-op for the private store and when nothing is attached under old.

ReactantServerCore.weights_digest — Method

weights_digest(key, specs; content=0) -> UInt64

Digest over a model's identity, weight layout (name, per-tensor dtype/size/shape, a format version), and a content token identifying the weight file's on-disk version (see weights_file_token). Two workers computing this for the same model and the same file agree on the same region key. Without the content token a weights-only update (same name, same tensor layout) would collide with the previous version's region and silently serve stale weights, including across server restarts (regions in /dev/shm outlive the process by design).
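
The role of the content token can be shown with a small sketch. Base.hash is used here purely as a stand-in; the source does not say which hash the real function uses, and sketch_digest and its format version value are hypothetical.

```julia
# Hypothetical digest: folds the model key, a format version, the content
# token, and each tensor's dtype name, element size, and shape into one value.
function sketch_digest(key::AbstractString, specs; content::UInt64=UInt64(0))
    format_version = 1                       # assumed value, for illustration
    h = hash((String(key), format_version, content))
    for (T, dims) in specs
        h = hash((string(T), sizeof(T), dims), h)
    end
    return UInt64(h)
end

specs  = [(Float32, (2, 3)), (Int8, (4,))]
d_old  = sketch_digest("model-a", specs; content=UInt64(1))
d_same = sketch_digest("model-a", specs; content=UInt64(1))
d_new  = sketch_digest("model-a", specs; content=UInt64(2))
```

Same name and same layout with a different content token yield a different digest, so an updated weights file maps to a new region instead of reusing the stale one.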

ReactantServerCore.weights_file_token — Method

weights_file_token(path) -> UInt64

Identity token for a weights file's current on-disk version, folded into weights_digest via its content keyword. Derived from the file's size and mtime: cheap (no content read, so the shared store's zero-copy attach path stays zero-read) and identical across same-node workers stat'ing the same file. Returns 0 for an empty or missing path (hand-built test entries), which reproduces the layout-only digest.
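
A token of this kind can be sketched with stat alone. The way size and mtime are combined below (Base.hash over a tuple) is an assumption; only the inputs and the zero result for an empty or missing path come from the description above.

```julia
# Hypothetical token: derived from size and mtime only, never from file content.
function sketch_file_token(path::AbstractString)
    (isempty(path) || !isfile(path)) && return UInt64(0)
    s = stat(path)
    return UInt64(hash((s.size, s.mtime)))
end

missing_token = sketch_file_token("")        # empty path reproduces the layout-only case
file_tokens = mktemp() do path, io
    write(io, "abc")
    flush(io)
    (sketch_file_token(path), sketch_file_token(path))
end
```

Two calls against the same unchanged file agree, which is the property same-node workers rely on.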

Weight cache

ReactantServer.NotResidentError — Type

NotResidentError

Raised by acquire! in externally-managed mode when a request targets a model whose weights are not currently resident. A control plane is authoritative for residency in that mode, so the worker does not autonomously load; the model must be pinned first.

ReactantServer.acquire! — Method

acquire!(cache, entry) -> nothing

Guarantee entry.executable.weights is resident before the model runs. Device-pinned and already-resident models return immediately (the latter is bumped to most-recently-used). In self-managed mode an evicted model is loaded autonomously, evicting LRU victims until it fits the budget (a model larger than the whole budget is loaded anyway after evicting everything, with a warning, since it cannot run otherwise). In externally-managed mode the worker does not autonomously load: an evicted model raises NotResidentError.
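
The residency rules above can be sketched with a toy cache that tracks only names and byte counts. Everything here is hypothetical (SketchCache, SketchNotResident, sketch_acquire!) and pinning is left out; it illustrates the LRU policy described, not the actual data structures.

```julia
# Toy model of the weight cache: no device memory, only bookkeeping.
mutable struct SketchCache
    budget::Int                  # on-demand byte budget
    used::Int                    # bytes currently resident
    lru::Vector{String}          # least-recently-used first
    sizes::Dict{String,Int}      # resident model => footprint in bytes
    externally_managed::Bool
end

SketchCache(budget; externally_managed=false) =
    SketchCache(budget, 0, String[], Dict{String,Int}(), externally_managed)

struct SketchNotResident <: Exception
    model::String
end

function sketch_acquire!(c::SketchCache, model::String, nbytes::Int)
    if haskey(c.sizes, model)
        # Already resident: bump to most-recently-used and return.
        filter!(!=(model), c.lru)
        push!(c.lru, model)
        return nothing
    end
    # Externally-managed mode never loads on its own.
    c.externally_managed && throw(SketchNotResident(model))
    # Self-managed mode: evict LRU victims until the model fits.
    while c.used + nbytes > c.budget && !isempty(c.lru)
        victim = popfirst!(c.lru)
        c.used -= pop!(c.sizes, victim)
    end
    if c.used + nbytes > c.budget
        @warn "model exceeds the whole budget; loading anyway" model
    end
    c.sizes[model] = nbytes
    c.used += nbytes
    push!(c.lru, model)
    return nothing
end

cache = SketchCache(100)
sketch_acquire!(cache, "a", 40)
sketch_acquire!(cache, "b", 40)
sketch_acquire!(cache, "a", 40)   # bump "a"; "b" is now the LRU victim
sketch_acquire!(cache, "c", 40)   # evicts "b" to stay within the budget
```

After the last call the toy cache holds "a" and "c". In externally-managed mode the same request for a non-resident model would throw instead of evicting.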

ReactantServer.compact! — Method

compact!(cache, registry; reload) -> Int

Defragment the device arena. Frees every resident non-pinned device weight buffer at once so the BFC allocator coalesces its free list, then reloads each non-pinned model named in reload. Models not in reload are left freed; in self-managed mode they reload lazily through acquire! on their next dispatch. Host floors (host_weights) are never touched, so the reload is a pure host->device copy where a floor exists.

Device-pinned models are deliberately left in place: they are loaded once at startup, before any on-demand traffic, so they sit at the base of the arena, and compaction neither frees them nor pays to re-read them from disk. Only the on-demand churn region above them is freed and coalesced.

Runs on the dispatch-loop thread (the sole residency mutator), so no execution reads a buffer while it is being freed. In externally-managed mode acquire! will not autonomously load, so the reload list is ignored there (the control plane re-pins what it needs). Returns the number of models reloaded.

ReactantServer.preload_pinned! — Method

preload_pinned!(cache, registry) -> nothing

Ensure every pinned model's weights are resident. build_loaded_model already loads them when on-demand mode is on, so this is normally a no-op; it loads defensively otherwise.

ReactantServer.release_all! — Method

release_all!(cache, entry) -> nothing

Release every residency a model holds: free its device buffers, drop it from the LRU budget, and release any shared host floor. Used by evict! when a model is removed from the worker. Runs on the dispatch thread (sole residency mutator); the device free runs outside the lock.

ReactantServer.rename! — Method

rename!(cache, old, new) -> nothing

Rekey a model's residency bookkeeping from old to new after a model rename: the LRU entry (when the model is device-resident and non-pinned) and the host-weight store key. No device or host memory moves; the weights themselves are untouched. Runs on the dispatch thread (sole residency mutator), like every other cache mutation.

ReactantServer.set_residency_state! — Method

set_residency_state!(cache, entry, target) -> ResidencyState

Move a model to the target residency floor (see ResidencyState). Like acquire! this runs on the dispatch-loop thread, the sole mutator of residency. It materializes or drops the host floor and, for PINNED_DEVICE, ensures the weights are resident on the device (exempt from the budget). In externally-managed mode, leaving the device floor releases the device buffers (there is no autonomous evictor to reclaim them); in self-managed mode they stay resident but become evictable. The slow host/device work runs outside the lock; only the bookkeeping commit is locked.

ReactantServer.weight_budget — Method

weight_budget(; arena, fraction, wiggle, max_scratch, pinned_bytes) -> (; on_demand_budget, weight_pool, scratch_ceiling, pinned_over_commit)

Solve for the on-demand weight-cache byte budget from the unified memory model. The arena holds, at the worst instant, pinned weights + on-demand weights + one model's execution max_scratch + a wiggle fraction of free slack:

scratch_ceiling  = (1 - wiggle) * arena
weight_pool      = min(fraction * arena, scratch_ceiling - max_scratch)   # for pinned + on-demand
on_demand_budget = clamp(weight_pool - pinned_bytes, 0, weight_pool)

fraction caps the share of the arena devoted to weights (1.0 = all of it, minus scratch + wiggle). Pinned models reserve their footprint off the top, so the operator never subtracts it by hand. pinned_over_commit (pinned weights exceed the pool) flags a genuine over-commit that no sizing can fix. The function is pure arithmetic with no device access, so it is unit-testable.
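
A worked instance of the three formulas, written as a standalone sketch. sketch_weight_budget is a hypothetical re-implementation that follows the equations as stated; the real function may round or validate its inputs differently.

```julia
# Hypothetical re-implementation of the budget arithmetic, for illustration.
function sketch_weight_budget(; arena, fraction, wiggle, max_scratch, pinned_bytes)
    scratch_ceiling    = (1 - wiggle) * arena
    weight_pool        = min(fraction * arena, scratch_ceiling - max_scratch)
    on_demand_budget   = clamp(weight_pool - pinned_bytes, 0, weight_pool)
    pinned_over_commit = pinned_bytes > weight_pool
    return (; on_demand_budget, weight_pool, scratch_ceiling, pinned_over_commit)
end

# Arena of 1024 units, half of it for weights, a quarter kept as slack.
ok = sketch_weight_budget(arena=1024, fraction=0.5, wiggle=0.25,
                          max_scratch=128, pinned_bytes=200)
# scratch_ceiling = 768, weight_pool = min(512, 640) = 512, on_demand_budget = 312

# Same arena, but pinned weights alone exceed the pool.
over = sketch_weight_budget(arena=1024, fraction=0.5, wiggle=0.25,
                            max_scratch=128, pinned_bytes=600)
# on_demand_budget clamps to 0 and pinned_over_commit is true
```

In the first case fraction is the binding constraint; raising max_scratch above 256 would make the scratch ceiling the binding one instead.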

ReactantServer.weight_cache_stats — Method

weight_cache_stats(cache) -> NamedTuple

Snapshot the cache counters under its lock for observability.
