Runtime & Weights

The device-side runtime: weight loading and materialization, the on-demand GPU weight cache, and the shared device memory pool. These are internal interfaces, documented here for operators and contributors. See On-demand Weights for the operational view.

Weights

ReactantServer.free_weights! - Method
free_weights!(backend, bufs) -> nothing

Release every device buffer in bufs. Used to evict an unpinned model's resident weights.

ReactantServer.host_materialize - Method
host_materialize(store, key, st, names) -> Vector{Any}

Materialize a model's host weight floor through a WeightStore. The private store allocates per-worker arrays (identical to materialize_host_weights); the shared store backs them with a node-shared SHM region so same-node workers share one copy. key is the model name.

ReactantServer.host_release! - Method
host_release!(store, key) -> nothing

Release a model's host weight floor previously obtained from host_materialize. A no-op for the private store; for the shared store it detaches and, if last on the node, unlinks the region. The caller must drop its references to the host arrays first.

ReactantServer.materialize_host_weights - Method
materialize_host_weights(st, names) -> Vector{Any}

Materialize the named weights from the lazy mmap safetensors views into resident host Arrays, in weight_names order. This is the expensive step (a layout copy at host-memory speed); doing it once at startup and keeping the result resident in RAM lets an on-demand GPU load be a pure host->device transfer (see transfer_to_device) rather than a re-materialization.

ReactantServer.transfer_to_device - Method
transfer_to_device(backend, pool, hosts) -> Vector{Any}

Transfer already-materialized host weight Arrays to device buffers, in order. This is the only cost paid on an on-demand load when weights are pinned in host RAM.

ReactantServer.weight_order - Method
weight_order(st) -> Vector{String}

Return weight names in StableHLO argument order. Reads the argument_order field from the safetensors metadata (a JSON-encoded list). A file with tensors but no argument_order is an error; a file with no tensors yields an empty order.
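The two edge cases above can be sketched as follows. This is a minimal illustration, not the package's code: order_from_metadata is a hypothetical helper, the metadata is assumed to be a plain String-to-String map, the field is assumed to be spelled argument_order, and the quoted-string regex stands in for a real JSON parser (it does not handle escapes).

```julia
# Hypothetical helper. `metadata` stands in for the safetensors metadata map and
# `tensor_names` for the names of the tensors found in the file.
function order_from_metadata(metadata::Dict{String,String}, tensor_names)
    if !haskey(metadata, "argument_order")
        # No tensors: an empty order is fine. Tensors without an order: error.
        isempty(tensor_names) && return String[]
        error("file has tensors but no argument_order metadata")
    end
    # Simplification: pull quoted names out of a flat JSON list of plain strings.
    raw = metadata["argument_order"]
    return String[m.captures[1] for m in eachmatch(r"\"([^\"]*)\"", raw)]
end

meta = Dict("argument_order" => "[\"w1\", \"b1\"]")
order_from_metadata(meta, ["b1", "w1"])   # names in argument order, not file order
```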

ReactantServer.weights_nbytes - Method
weights_nbytes(st, names) -> Int

Total device footprint of the named weights, summed from the lazy safetensors views' shape and element type. Reads only metadata (no collect), so it is cheap and lets the weight cache budget against a model before its weights are ever loaded.
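The metadata-only sizing can be sketched as below. nbytes_from_specs and the (eltype, dims) tuple layout are illustrative assumptions rather than the package's internals; the point is that shape and element type alone determine the footprint, so no tensor data is read.

```julia
# Hypothetical helper: each spec is an (eltype, dims) pair as it would be read
# from tensor metadata. No tensor data is touched.
function nbytes_from_specs(specs)
    total = 0
    for (T, dims) in specs
        total += prod(dims) * sizeof(T)   # element count times bytes per element
    end
    return total
end

specs = [(Float32, (4096, 4096)), (Float16, (4096,))]
nbytes_from_specs(specs)   # 4096*4096*4 + 4096*2 bytes
```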

Weight stores (host RAM)

ReactantServerCore.SharedWeightStore - Type
SharedWeightStore(; mode=0o666)

Opt-in store backing each model's host weights with a node-level POSIX shared-memory region so same-node workers share one copy. See the file header for the flock-coordinated protocol. mode sets the permission bits for the regions and their lock files. The default 0o666 lets workers running as unrelated UIDs (for example different containers) share the regions, but is world-writable; 0o660 is recommended for production and multi-user systems.
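The difference between the two modes comes down to the "other" permission bits. The snippet below is only a sketch of that bit arithmetic; world_writable is a hypothetical helper and not part of the package.

```julia
# Hypothetical check: true when the "other" write bit (0o002) is set.
world_writable(mode) = (mode & 0o002) != 0

world_writable(0o666)   # default: any local user can write the region
world_writable(0o660)   # recommended: owner and group only
```

With 0o660, the workers sharing a region need a common owner or group, which is the trade-off against the permissive default.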

ReactantServerCore.WeightStore - Type
WeightStore

Host-RAM residency backend for a model's materialized weights. PrivateWeightStore (the default) gives each worker its own copy; SharedWeightStore shares one copy per model across same-node workers through a POSIX shared-memory region. See the file header for the coordination protocol.

ReactantServerCore.materialize_host_weights! - Method
materialize_host_weights!(store, key, digest, specs, fill!) -> Vector{Any}

Return a model's host weight Arrays (in weight order), populating them via fill!(arrays) when this worker is the one that must materialize them. specs is a vector of (eltype, dims) per tensor. For PrivateWeightStore the arrays are freshly allocated; for SharedWeightStore they alias a node-shared region.
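For the private store, the contract amounts to "allocate one array per spec, then let the callback populate them". A minimal sketch under that assumption follows; materialize_private and fill_fn are hypothetical names, and the shared-store path (aliasing a node-shared region, deciding which worker fills) is not modeled.

```julia
# Hypothetical private-store path: allocate one fresh Array per spec, then let
# the caller-supplied function populate all of them in place.
function materialize_private(specs, fill_fn)
    arrays = Any[Array{T}(undef, dims...) for (T, dims) in specs]
    fill_fn(arrays)
    return arrays
end

specs = [(Float32, (2, 3)), (Int8, (4,))]
# Stand-in fill function: a real one would copy tensor data from the file.
arrays = materialize_private(specs, arrs -> foreach(a -> fill!(a, 1), arrs))
```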

ReactantServerCore.release_host_weights! - Method
release_host_weights!(store, key) -> nothing

Release a model's host weights. For SharedWeightStore this detaches the region and, if this was the last holder on the node (a non-blocking upgrade to an exclusive flock succeeds), unlinks the region and its lock file. A no-op for the private store. The caller must drop all references to the arrays first.

ReactantServerCore.weights_digest - Method
weights_digest(key, specs) -> UInt64

Content digest over a model's identity and weight layout (name, per-tensor dtype/size/shape, and a format version). Two workers computing this for the same model agree on the same region key.
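One way such a layout digest could be built is to fold each component into a running hash. The sketch below is an assumption for illustration: layout_digest, the version keyword, and the use of Base.hash are not taken from the package, and Base.hash is only expected to agree between workers running the same Julia version.

```julia
# Hypothetical digest: fold name, format version, and per-tensor layout into one hash.
function layout_digest(key::AbstractString, specs; version::Int=1)
    h = hash(version, hash(key))
    for (T, dims) in specs
        # Hash the element type by name and size rather than by object identity.
        h = hash((string(T), sizeof(T), dims), h)
    end
    return UInt64(h)
end

specs = [(Float32, (2, 3)), (Int8, (4,))]
layout_digest("model-a", specs)
```

Because only the name and layout feed the digest, two workers that see the same model compute the same key without reading any weight data.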

Weight cache

ReactantServer.NotResidentError - Type
NotResidentError

Raised by acquire! in externally-managed mode when a request targets a model whose weights are not currently resident. A control plane is authoritative for residency in that mode, so the worker does not autonomously load; the model must be pinned first.

ReactantServer.acquire! - Method
acquire!(cache, entry) -> nothing

Guarantee entry.executable.weights is resident before the model runs. Device-pinned and already-resident models return immediately (the latter is bumped to most-recently-used). In self-managed mode an evicted model is loaded autonomously, evicting LRU victims until it fits the budget (a model larger than the whole budget is loaded anyway after evicting everything, with a warning, since it cannot run otherwise). In externally-managed mode the worker does not autonomously load: an evicted model raises NotResidentError.
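The residency policy described above can be sketched with a toy LRU cache. Everything here (ToyCache, toy_acquire!, NotResident) is hypothetical and tracks byte counts only; device-pinned models and the actual host-to-device load are left out.

```julia
struct NotResident <: Exception   # stand-in for NotResidentError
    name::String
end

mutable struct ToyCache
    budget::Int
    used::Int
    lru::Vector{String}        # least recently used first
    sizes::Dict{String,Int}    # resident, evictable models only
    external::Bool             # true = externally-managed mode
end

ToyCache(budget; external=false) =
    ToyCache(budget, 0, String[], Dict{String,Int}(), external)

# Returns the names evicted to make room (empty if none).
function toy_acquire!(c::ToyCache, name::String, nbytes::Int)
    if haskey(c.sizes, name)
        # Already resident: bump to most recently used.
        filter!(!=(name), c.lru)
        push!(c.lru, name)
        return String[]
    end
    # Externally-managed mode never loads on its own.
    c.external && throw(NotResident(name))
    evicted = String[]
    # Evict from the cold end until the model fits or nothing is left to evict.
    while c.used + nbytes > c.budget && !isempty(c.lru)
        victim = popfirst!(c.lru)
        c.used -= pop!(c.sizes, victim)
        push!(evicted, victim)
    end
    if c.used + nbytes > c.budget
        @warn "model exceeds the whole budget; loading anyway" name
    end
    c.sizes[name] = nbytes
    c.used += nbytes
    push!(c.lru, name)
    return evicted
end

c = ToyCache(10)
toy_acquire!(c, "a", 6)
toy_acquire!(c, "b", 4)
toy_acquire!(c, "a", 6)              # bump: "b" is now the coldest entry
evicted = toy_acquire!(c, "c", 4)    # evicts "b" only; "a" was used more recently
```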

ReactantServer.compact! - Method
compact!(cache, registry; reload) -> Int

Defragment the device arena. Frees every resident non-pinned device weight buffer at once so the BFC allocator coalesces its free list, then reloads each non-pinned model named in reload. Models not in reload are left freed; in self-managed mode they reload lazily through acquire! on their next dispatch. Host floors (host_weights) are never touched, so the reload is a pure host->device copy where a floor exists.

Device-pinned models are deliberately left in place: they are loaded once at startup, before any on-demand traffic, so they sit at the base of the arena, and compaction neither frees them nor pays to re-read them from disk. Only the on-demand churn region above them is freed and coalesced.

Runs on the dispatch-loop thread (the sole residency mutator), so no execution reads a buffer while it is being freed. In externally-managed mode acquire! will not autonomously load, so the reload list is ignored there (the control plane re-pins what it needs). Returns the number of models reloaded.
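The free-everything-then-reload shape of compaction can be sketched with a toy residency map. toy_compact! and the Dict layout are assumptions for illustration; the real function also drives the allocator and performs the device copies, and this sketch models self-managed mode only.

```julia
# Toy model: `resident` maps model name => true if device-pinned.
function toy_compact!(resident::Dict{String,Bool}, reload)
    # Free every non-pinned model in one pass so the freed space can coalesce.
    freed = [name for (name, pinned) in resident if !pinned]
    foreach(name -> delete!(resident, name), freed)
    # Reload only the requested models; pinned ones never left the device.
    reloaded = 0
    for name in reload
        haskey(resident, name) && continue
        resident[name] = false
        reloaded += 1
    end
    return reloaded
end

resident = Dict("base" => true, "a" => false, "b" => false)
n = toy_compact!(resident, ["a", "base"])   # "b" stays freed until next used
```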

ReactantServer.preload_pinned! - Method
preload_pinned!(cache, registry) -> nothing

Ensure every pinned model's weights are resident. build_loaded_model already loads them when on-demand mode is on, so this is normally a no-op; it loads defensively otherwise.

ReactantServer.release_all! - Method
release_all!(cache, entry) -> nothing

Release every residency a model holds: free its device buffers, drop it from the LRU budget, and release any shared host floor. Used by evict! when a model is removed from the worker. Runs on the dispatch thread (sole residency mutator); the device free runs outside the lock.

ReactantServer.set_residency_state! - Method
set_residency_state!(cache, entry, target) -> ResidencyState

Move a model to the target residency floor (see ResidencyState). Like acquire! this runs on the dispatch-loop thread, the sole mutator of residency. It materializes or drops the host floor and, for PINNED_DEVICE, ensures the weights are resident on the device (exempt from the budget). In externally-managed mode, leaving the device floor releases the device buffers (there is no autonomous evictor to reclaim them); in self-managed mode they stay resident but become evictable. The slow host/device work runs outside the lock; only the bookkeeping commit is locked.

ReactantServer.weight_budget - Method
weight_budget(; arena, fraction, wiggle, max_scratch, pinned_bytes) -> (; on_demand_budget, weight_pool, scratch_ceiling, pinned_over_commit)

Solve for the on-demand weight-cache byte budget from the unified memory model. The arena holds, at the worst instant, pinned weights + on-demand weights + one model's execution max_scratch + a wiggle fraction of free slack:

scratch_ceiling  = (1 - wiggle) * arena
weight_pool      = min(fraction * arena, scratch_ceiling - max_scratch)   # for pinned + on-demand
on_demand_budget = clamp(weight_pool - pinned_bytes, 0, weight_pool)

fraction caps the share of the arena devoted to weights (1.0 = all of it, minus scratch + wiggle). Pinned models reserve their footprint off the top, so the operator never subtracts it by hand. pinned_over_commit (pinned weights exceed the pool) flags a genuine over-commit that no sizing can fix. The function is pure arithmetic with no device access, so it is unit-testable.
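A worked example makes the three formulas concrete. toy_weight_budget below is a hypothetical re-statement of them, not the package function; the Float64 byte counts and the rule pinned_over_commit = pinned_bytes > weight_pool are assumptions based on the description above.

```julia
# Hypothetical re-statement of the budget formulas, using Float64 byte counts.
function toy_weight_budget(; arena, fraction, wiggle, max_scratch, pinned_bytes)
    scratch_ceiling    = (1 - wiggle) * arena
    weight_pool        = min(fraction * arena, scratch_ceiling - max_scratch)
    on_demand_budget   = clamp(weight_pool - pinned_bytes, 0, weight_pool)
    pinned_over_commit = pinned_bytes > weight_pool   # assumed definition
    return (; on_demand_budget, weight_pool, scratch_ceiling, pinned_over_commit)
end

GiB = 2.0^30
b = toy_weight_budget(; arena=64GiB, fraction=0.75, wiggle=0.125,
                      max_scratch=12GiB, pinned_bytes=16GiB)
# scratch_ceiling  = 0.875 * 64       = 56 GiB
# weight_pool      = min(48, 56 - 12) = 44 GiB
# on_demand_budget = 44 - 16          = 28 GiB
```

Here the scratch reservation, not fraction, is the binding constraint on the weight pool. Raising pinned_bytes to 50 GiB in the same setup clamps the on-demand budget to zero and sets pinned_over_commit.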

Memory pool