Configuration

The typed configuration a worker resolves from its node file. These types live in ReactantServerCore (the shared substrate) and are re-exported by ReactantServer. See Node Configuration for how these map onto the YAML.

ReactantServerCore.ServerConfig — Type

ServerConfig

The fully resolved configuration for a single worker process, frozen for the process lifetime. It is produced from a node file (see node.jl) with environment-variable overrides applied, then checked by validate_config. Fields: model_dirs (bundle search paths), cache_dir, the RuntimeConfig, SchedulerConfig, and EndpointsConfig sub-configs, models_include (an allowlist of model names to load; empty means load all), model_poll_seconds (the dynamic-mode interval at which the worker re-scans its model_dirs for added, changed, or removed bundles and hot-swaps them), model_control_mode (see ModelControlMode: static, dynamic, or explicit), and the GrpcConfig grpc sub-config (gRPC message-size limits). ReactantServer.serve also accepts a ServerConfig directly.

ReactantServerCore.RuntimeConfig — Type

RuntimeConfig

Runtime and device settings (the runtime: config block).

backend selects CPU or CUDA execution; device_ordinal picks the GPU among several visible ones; mem_fraction is the fraction of device memory claimed for the pool; preallocate claims that pool up front; allow_cpu_fallback permits falling back to CPU when the device is unavailable.

weight_cache_fraction is the single knob that sizes on-demand (unpinned) weight residency. It is the fraction of the BFC arena (mem_fraction * device memory) devoted to all weights, pinned plus on-demand. The default, 1.0, uses the whole arena minus measured scratch and wiggle; 0 disables the cache, so every model's weights stay resident. The value is resolved to a byte budget at startup and applies to GPU only. weight_cache_wiggle_fraction is the fraction of the arena kept free as anti-fragmentation slack. It feeds the startup auto-sizing: the worker probes each model's peak device usage and sizes the cache so that pinned weights + on-demand + scratch + this slack fit.

residency_mode selects self-managed or externally-managed residency; shared_host_weights opts the node into the shared-memory host-weight store, so same-node workers share one host copy of each system-pinned model; shared_host_weights_mode sets the permission bits (an octal string, default "666") for those shared regions and their lock files. The "666" default keeps cross-UID container setups working but is world-writable; "660" is recommended for production and multi-user systems.

autotune (default true) enables XLA's GPU compile autotuner. Set it to false to compile with xla_gpu_autotune_level=0 (default gemm/conv algorithm selection, no timing trials), which removes autotuning's run-to-run non-determinism and the device scratch that otherwise inflates the startup memory probe (_probe_max_scratch!) on the first, un-cached start. autotune_cache (default nothing, meaning inherit Reactant's LocalPreferences.toml) toggles the persistent per-fusion autotune cache, and autotune_cache_dir (default "", meaning inherit) sets its directory; both are applied to Reactant's compile cache at worker startup, so a container can drive them through environment variables. numerics (default auto) sets the f32 matmul/convolution precision policy; see NumericsMode.
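The weight-cache sizing described above can be sketched as plain arithmetic. This is only one plausible reading of the prose, not the worker's code: the function name weight_budget_bytes, its arguments, and the exact way the terms combine are assumptions made for illustration.

```julia
# Illustrative sketch only; the function name, its arguments, and the way the
# terms combine are assumptions, not the worker's implementation.
function weight_budget_bytes(device_bytes::Integer;
                             mem_fraction::Real,
                             weight_cache_wiggle_fraction::Real,
                             scratch_bytes::Integer,
                             weight_cache_fraction::Real = 1.0)
    arena = floor(Int, mem_fraction * device_bytes)            # BFC arena
    slack = floor(Int, weight_cache_wiggle_fraction * arena)   # anti-fragmentation slack
    # Room left for weights once measured scratch and slack are set aside.
    room = max(arena - scratch_bytes - slack, 0)
    # The knob requests a fraction of the arena; it cannot exceed the room.
    # (The documented special case 0, which disables the cache, is not modeled.)
    requested = floor(Int, weight_cache_fraction * arena)
    return min(requested, room)
end

const GiB = 2^30
budget = weight_budget_bytes(16 * GiB; mem_fraction = 0.5,
                             weight_cache_wiggle_fraction = 0.25,
                             scratch_bytes = 1 * GiB)
```

With these made-up numbers the arena is 8 GiB, the slack 2 GiB, and the scratch 1 GiB, so the default fraction of 1.0 leaves 5 GiB for pinned plus on-demand weights.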

ReactantServerCore.SchedulerConfig — Type

SchedulerConfig

Global scheduler settings (the scheduler: config block). ema_halflife_seconds is the half-life of the recent-compute moving average that drives fairness; recency_penalty_cap bounds the recency adjustment; coalescing_discount is the cost discount applied to coalesced batches; cost_ema_alpha is the smoothing factor for the learned per-batch-size cost; max_queue_depth caps each model's queue independently (a full model rejects new requests without affecting admission for the others); dispatch_timeout_seconds is the per-request execution timeout; discipline selects the inter-model ordering (see SchedulingDiscipline); compaction_interval runs worker-local memory compaction every N on-demand weight-cache loads (0 disables). It is the trigger for standalone (no-gateway) workers and is off by default, so a gateway-fronted worker never self-compacts. models holds the per-model ModelSchedConfig overrides.
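To make the two averaging settings concrete, the sketch below shows the textbook forms of a half-life decay and an alpha-smoothed average. The helper names decay and smooth are hypothetical, and the scheduler's actual update rules are not shown in this reference, so treat the formulas as an assumption.

```julia
# Illustrative only: standard forms of the two moving averages; the scheduler's
# real update rules may differ.

# Decay a running compute total after dt_seconds, given a half-life in seconds.
decay(value, dt_seconds, halflife_seconds) = value * 0.5^(dt_seconds / halflife_seconds)

# Blend a new cost sample into the learned cost with smoothing factor alpha in (0, 1].
smooth(old_cost, sample, alpha) = alpha * sample + (1 - alpha) * old_cost

recent = decay(8.0, 20.0, 10.0)    # two half-lives have elapsed
cost = smooth(10.0, 20.0, 0.25)    # a quarter of the way toward the new sample
```

A shorter half-life forgets past compute faster; a larger alpha tracks new cost samples more aggressively.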

ReactantServerCore.ModelSchedConfig — Type

ModelSchedConfig

Per-model scheduler overrides (an entry under scheduler.models). weight is the model's relative compute share (default 1.0, so all-default weights yield uniform shares; consulted only by the fair discipline). residency is the model's initial residency floor (see ResidencyState); nothing means unspecified, which the server resolves at startup to PINNED_SYSTEM when the on-demand weight cache is enabled (so every model's weights are materialized into host RAM and an on-demand GPU load is a pure host-to-device transfer) and UNPINNED otherwise. max_batch_size caps how many rows the scheduler coalesces into one dispatch of this model; nothing means uncapped. The cap does not change compiled shapes: a partial fill still pads up to the smallest compiled batch size, and a single request larger than the cap is still served (requests are never split).
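The max_batch_size rules above (cap on coalescing, padding to a compiled size, no request splitting) can be sketched as follows. The function plan_dispatch, its greedy in-order policy, and its return shape are hypothetical and stand in for scheduler internals this reference does not show.

```julia
# Hypothetical sketch; the name, the greedy policy, and the return shape are
# assumptions, not the scheduler's code.
# `queued` holds the row count of each queued request, in arrival order.
# Returns (requests taken, rows dispatched, padded batch size).
function plan_dispatch(queued::Vector{Int}, compiled_sizes::Vector{Int},
                       max_batch_size::Union{Int,Nothing})
    isempty(queued) && return (0, 0, 0)
    cap = something(max_batch_size, typemax(Int))   # nothing means uncapped
    taken, rows = 1, queued[1]      # the head request is served whole, even above the cap
    for r in queued[2:end]
        rows + r > cap && break     # coalescing never pushes past the cap
        taken += 1
        rows += r
    end
    # Pad up to the smallest compiled batch size that holds the rows.
    # (What happens when no compiled size fits is outside this sketch.)
    fits = filter(>=(rows), compiled_sizes)
    padded = isempty(fits) ? rows : minimum(fits)
    return (taken, rows, padded)
end

capped   = plan_dispatch([3, 2, 4], [1, 4, 8, 16], 6)
oversize = plan_dispatch([10, 1], [1, 4, 8, 16], 6)
uncapped = plan_dispatch([3, 2, 4], [1, 4, 8, 16], nothing)
```

In the capped case two requests (5 rows) are coalesced and padded to the compiled size 8; the 10-row request is dispatched alone despite the cap of 6.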

ReactantServerCore.EndpointsConfig — Type

EndpointsConfig

The listen addresses (the endpoints: config block): host, the gRPC port, the optional metrics_port for the Prometheus exposition endpoint (0 = disabled), and max_concurrent_requests, the cap on simultaneously-handled RPCs (0 = uncapped). For a worker fronted by the gateway, bind host to all interfaces (0.0.0.0) so the gateway and Prometheus can reach it; the gRPC port is usually derived from the node file's base_port and the metrics port from metrics_base_port.

max_concurrent_requests is a worker-level overload backstop: past the cap, new requests are shed immediately with RESOURCE_EXHAUSTED rather than queued. Keep it strictly above the gateway's per-worker outbound stream limit so it never sheds traffic the gateway has already admitted (and so it never rejects a meta-model's loopback sub-call); in single-worker mode (no gateway, clients hit the worker directly) it is the only inbound admission control.
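A shed-rather-than-queue backstop of this kind amounts to a guarded in-flight counter. The sketch below is a minimal illustration under that assumption; AdmissionGate, try_admit!, and release! are hypothetical names, not part of the server's API.

```julia
# Hypothetical sketch of a shed-don't-queue admission gate; not the server's code.
mutable struct AdmissionGate
    limit::Int              # 0 means uncapped
    inflight::Int
    mutex::ReentrantLock
end
AdmissionGate(limit::Int) = AdmissionGate(limit, 0, ReentrantLock())

# Returns true when admitted; false means the caller should reply
# RESOURCE_EXHAUSTED immediately instead of queueing the request.
function try_admit!(g::AdmissionGate)
    lock(g.mutex) do
        (g.limit > 0 && g.inflight >= g.limit) && return false
        g.inflight += 1
        return true
    end
end

# Call once per admitted request when its RPC finishes.
function release!(g::AdmissionGate)
    lock(g.mutex) do
        g.inflight -= 1
    end
    return nothing
end

gate = AdmissionGate(2)
results = [try_admit!(gate), try_admit!(gate), try_admit!(gate)]
release!(gate)
after_release = try_admit!(gate)
```

The third request is shed while two are in flight, and a slot opens again as soon as one of them is released.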

ReactantServerCore.GrpcConfig — Type

GrpcConfig

gRPC transport limits for a worker's server. max_recv_msg_bytes / max_send_msg_bytes bound a single gRPC message in each direction (decode/encode caps, not allocations). They are configured under the grpc: block of a worker (or node global:) config, with INFERENCE_SERVER_GRPC_MAX_RECV_MSG_BYTES / INFERENCE_SERVER_GRPC_MAX_SEND_MSG_BYTES environment overrides. Both default to DEFAULT_GRPC_MSG_BYTES (512 MiB).
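The precedence implied above (environment override, then configured value, then default) can be sketched like this. The helper resolve_limit is hypothetical, and parsing the variable as a plain integer byte count is an assumption; the accepted value format is not stated here.

```julia
# Sketch under assumptions: `resolve_limit` is a hypothetical helper, and the
# environment value is assumed to be a plain integer number of bytes.
const DEFAULT_GRPC_MSG_BYTES = 512 * 1024^2   # 512 MiB, as documented

function resolve_limit(env::AbstractDict, key::AbstractString,
                       configured::Union{Int,Nothing})
    raw = get(env, key, nothing)
    raw !== nothing && return parse(Int, raw)           # environment wins
    return something(configured, DEFAULT_GRPC_MSG_BYTES)
end

key = "INFERENCE_SERVER_GRPC_MAX_RECV_MSG_BYTES"
from_env     = resolve_limit(Dict(key => "1048576"), key, 2048)
from_config  = resolve_limit(Dict{String,String}(), key, 2048)
from_default = resolve_limit(Dict{String,String}(), key, nothing)
```

Passing ENV as the first argument would read the real process environment; a Dict is used here so the example has no side effects.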

ReactantServerCore.ModelControlMode — Type

ModelControlMode

How a worker manages the set of loaded models over its lifetime (mirrors NVIDIA Triton's model-control-mode). STATIC loads and compiles every bundle once at startup and never changes the set. DYNAMIC (the default) additionally runs a filesystem watcher that polls the model repository every model_poll_seconds and hot-swaps bundles as they are added, changed, or removed on disk. EXPLICIT cedes authority to an upstream control plane (the externally-managed residency behavior): no autonomous watcher, and non-resident models are not served until the control plane pins them.

ReactantServerCore.SchedulingDiscipline — Type

SchedulingDiscipline

The inter-model dispatch ordering. FAIR is the deficit-weighted, cost-aware policy with per-model weights and the coalescing discount; FIFO serves in global arrival order; EDF (earliest-deadline-first) serves the model whose most-urgent queued request has the soonest deadline. Coalescing runs underneath all three.

Guidance: FAIR is for deployments where models share this worker with no upstream placement authority (a single-GPU worker, or a multi-GPU fleet behind the round-robin gateway), so the worker itself must stop one model from crowding out the rest. Under the gateway's lpt_packing scheduling, the gateway is the fairness authority (placement concentration plus the per-worker share cap), and workers must run FIFO or EDF so that the two do not fight; lpt_packing supersedes the role FAIR played on manually-assigned multi-GPU fleets.

EDF is for deadline-sensitive deployments where requests carry a remaining-budget timeout (see the request-level timeout parameter). It degrades to FIFO whenever queued requests share the same deadline, so its only divergence from FIFO is to promote requests with less budget left; in practice these are the in-flight meta-model sub-calls that have already spent part of their budget on an earlier stage. It also sheds work that cannot finish before its deadline given its learned compute cost (laxity), so it trades some throughput (batch fragmentation, and no per-model fairness) for hitting more deadlines under load. NOTE: EDF derives urgency purely from the deadline, so issuing different per-client deadlines for the same model reorders that model's service and therefore affects fairness across clients; uniform deadlines keep it behaving like FIFO.
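The EDF behavior described here (earliest deadline wins, ties fall back to arrival order, infeasible work is shed) can be sketched per request. The Req type, the laxity formula, and edf_next are hypothetical; the real scheduler orders models by their most-urgent request and coalesces underneath, which this sketch leaves out.

```julia
# Hypothetical sketch of EDF selection with FIFO tie-breaking and laxity
# shedding; it works per request and ignores coalescing.
struct Req
    id::Int
    arrival::Float64    # arrival time, seconds
    deadline::Float64   # absolute deadline, seconds
    cost::Float64       # learned compute cost estimate, seconds
end

# Laxity: time to spare if the request started right now.
laxity(r::Req, now) = r.deadline - now - r.cost

function edf_next(queue::Vector{Req}, now::Float64)
    viable = filter(r -> laxity(r, now) >= 0, queue)   # shed what cannot finish in time
    isempty(viable) && return nothing
    # Earliest deadline first; equal deadlines fall back to arrival order.
    return first(sort(viable; by = r -> (r.deadline, r.arrival)))
end

queue = [Req(1, 0.0, 10.0, 1.0), Req(2, 1.0, 10.0, 1.0),
         Req(3, 2.0, 5.0, 1.0),  Req(4, 3.0, 4.0, 5.0)]
chosen  = edf_next(queue, 3.0)
uniform = edf_next(queue[1:2], 3.0)
```

Request 4 has the soonest deadline but negative laxity, so it is shed and request 3 is chosen; with the two equal-deadline requests alone, the earlier arrival wins, which is the FIFO fallback.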

ReactantServerCore.NumericsMode — Type

NumericsMode

How f32 matmul/convolution precision is resolved at model compile time (the runtime.numerics knob). NUMERICS_AUTO (the default) is hardware-adaptive: TF32 is used where the GPU supports it (compute capability >= 8.0) and explicit TF32 algorithms are stripped where it does not, so the same bundle compiles everywhere but its numerics follow the hardware. NUMERICS_F32 pins full f32 everywhere: TF32 DotAlgorithms are rewritten to f32 and every algorithm-free f32 dot_general/convolution gets precision_config = HIGHEST, so numerics are identical across GPU generations (the mode for validated deployments; costs tensor-core throughput on TF32-capable GPUs). NUMERICS_TF32 compiles exactly like auto (TF32 permitted; kernel choice stays with XLA/cuBLAS, and StableHLO cannot force TF32 for convolutions at all) but makes the hardware requirement a guarantee: the worker fails startup on hardware that cannot run TF32, rather than silently degrading per worker in a mixed fleet.
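The decision table implied by the three modes can be sketched as below. The enum and function names are stand-ins rather than the package's definitions, and reducing the outcome to a single :tf32 or :f32 symbol is a simplification: under auto and tf32 the kernel choice stays with XLA/cuBLAS.

```julia
# Stand-in names for illustration; these are not the package's definitions.
@enum Numerics AUTO F32 TF32   # NUMERICS_AUTO / NUMERICS_F32 / NUMERICS_TF32

# TF32 needs compute capability >= 8.0, as stated above.
tf32_capable(cc::VersionNumber) = cc >= v"8.0"

# Returns :tf32 when TF32 is permitted, :f32 otherwise, and throws when the
# tf32 mode is requested on hardware that cannot run it.
function effective_precision(mode::Numerics, cc::VersionNumber)
    mode == F32 && return :f32
    tf32_capable(cc) && return :tf32
    mode == TF32 && error("numerics=tf32 requires compute capability >= 8.0, got $cc")
    return :f32   # AUTO on older hardware: explicit TF32 algorithms are stripped
end

on_new_gpu = effective_precision(AUTO, v"8.6")
on_old_gpu = effective_precision(AUTO, v"7.5")
pinned_f32 = effective_precision(F32, v"9.0")
```

The only difference between auto and tf32 in this sketch is the failure on older hardware, which matches the "guarantee rather than silent degradation" wording.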

ReactantServerCore.ResidencyMode — Type

ResidencyMode

Who owns device residency on a worker, fixed at startup. SELF_MANAGED lets the worker autonomously transfer and evict weights above each model's floor within the device budget; EXTERNALLY_MANAGED makes a control plane authoritative (no autonomous eviction, non-resident models are not served until pinned). This is no longer configured directly: it is derived from ModelControlMode (explicit ⇒ EXTERNALLY_MANAGED, otherwise SELF_MANAGED).
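The derivation rule in the last sentence is small enough to state as code. The enums below are stand-ins declared only for this sketch (the real types are ModelControlMode and ResidencyMode), and derive_residency is a hypothetical name.

```julia
# Stand-in enums for illustration; not the package's definitions.
@enum ControlMode STATIC DYNAMIC EXPLICIT
@enum Residency SELF_MANAGED EXTERNALLY_MANAGED

# explicit => externally managed; static and dynamic => self managed.
derive_residency(mode::ControlMode) =
    mode == EXPLICIT ? EXTERNALLY_MANAGED : SELF_MANAGED

derived = [derive_residency(m) for m in (STATIC, DYNAMIC, EXPLICIT)]
```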

ReactantServerCore.ResidencyState — Type

ResidencyState

The residency floor an operator (self-managed) or control plane (externally-managed) sets for a model's weights. UNPINNED keeps no guaranteed residency (loaded from the mmap on demand); PINNED_SYSTEM guarantees the weights stay resident in host RAM (and must be transferred to the device before execution); PINNED_DEVICE guarantees them resident on the GPU for the server's lifetime (exempt from eviction).

Node files

Parsing and validation of the node file, and the resolution of one worker's ServerConfig from it:

ReactantServerCore.load_node — Method

load_node(path) -> Dict{String,Any}

Read and structurally validate a node config file.

ReactantServerCore.load_node_raw — Method

load_node_raw(path) -> Dict{String,Any}

Read a node config file without structural validation. Used by the supervisor, which may need to synthesize the workers: list (see materialize_node!) before validating.

ReactantServerCore.materialize_node! — Method

materialize_node!(node, devices; cpu_workers=1) -> Vector{Union{String,Nothing}}

Prepare a raw node config for supervised single-container deployment: assign one visible device per worker and return the per-worker device selectors (in worker declaration order; nothing means the worker gets no CUDA_VISIBLE_DEVICES of its own, the CPU case).

When the node has no workers: list, one is synthesized as worker0..workerN-1, one per device (or cpu_workers workers when devices is empty). An explicit workers: list wins: each worker is assigned devices[i] positionally, or devices[gpu+1] when the entry carries a gpu: key. The gpu: key is consumed here (deleted after assignment) so that each child process, seeing a single device through CUDA_VISIBLE_DEVICES, resolves device ordinal 0. Assigning one device to two workers, or having more workers than devices, is a ConfigError. Call validate_node on the result before use.
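The assignment rules above can be re-created on plain dictionaries to see how they interact. This is a hypothetical sketch, not materialize_node! itself: assign_devices! is an invented name, the "name" key of a worker entry is an assumed schema detail, plain error calls stand in for ConfigError, and treating every worker as a CPU worker when devices is empty is an assumption.

```julia
# Hypothetical re-creation of the assignment rules; not the package's code.
function assign_devices!(node::Dict{String,Any}, devices::Vector{String};
                         cpu_workers::Int = 1)
    if !haskey(node, "workers")
        # Synthesize worker0..workerN-1 ("name" is an assumed key).
        n = isempty(devices) ? cpu_workers : length(devices)
        node["workers"] = Any[Dict{String,Any}("name" => "worker$(i - 1)") for i in 1:n]
    end
    workers = node["workers"]
    # Assumed: with no devices at all, every worker is a CPU worker.
    isempty(devices) && return Union{String,Nothing}[nothing for _ in workers]
    length(workers) > length(devices) && error("more workers than devices")
    selected = Union{String,Nothing}[]
    for (i, w) in enumerate(workers)
        idx = haskey(w, "gpu") ? w["gpu"] + 1 : i   # gpu: is 0-based, Julia is 1-based
        dev = devices[idx]
        dev in selected && error("device $dev assigned to two workers")
        delete!(w, "gpu")                           # the key is consumed here
        push!(selected, dev)
    end
    return selected
end

auto_node = Dict{String,Any}()
auto_sel = assign_devices!(auto_node, ["GPU-a", "GPU-b"])

explicit_node = Dict{String,Any}("workers" => Any[
    Dict{String,Any}("name" => "a", "gpu" => 1),
    Dict{String,Any}("name" => "b", "gpu" => 0)])
explicit_sel = assign_devices!(explicit_node, ["0", "1"])
```

In the explicit case worker a receives device "1" and worker b device "0", and neither entry still carries a gpu: key afterwards.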

ReactantServerCore.node_gpus — Method

node_gpus(node) -> :auto | Int | Vector{String}

Parse the optional top-level gpus: key: auto (the default when absent) asks the supervisor to enumerate visible devices; an integer is a device count (expanded to ordinals 0..n-1); a list gives explicit device selectors (ordinals or GPU UUIDs, passed to CUDA_VISIBLE_DEVICES verbatim).
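The three accepted shapes of gpus: can be illustrated with a small parser over an already-decoded dictionary. Both parse_gpus and selectors are hypothetical names, and the visible argument stands in for whatever device list the supervisor enumerates in the auto case.

```julia
# Hypothetical sketch of the three accepted shapes; not node_gpus itself.
function parse_gpus(node::AbstractDict)
    v = get(node, "gpus", "auto")          # absent means auto
    v == "auto" && return :auto
    v isa Integer && return Int(v)         # a device count
    v isa AbstractVector && return String[string(x) for x in v]   # explicit selectors
    error("gpus: must be auto, an integer, or a list")
end

# Expand a parsed value into device selectors; `visible` stands in for the
# supervisor's own enumeration of visible devices.
selectors(g::Symbol, visible::Vector{String}) = visible
selectors(g::Int, visible::Vector{String}) = [string(i) for i in 0:g-1]
selectors(g::Vector{String}, visible::Vector{String}) = g

absent  = parse_gpus(Dict{String,Any}())
counted = selectors(parse_gpus(Dict{String,Any}("gpus" => 2)), String[])
listed  = parse_gpus(Dict{String,Any}("gpus" => Any[0, "GPU-abc"]))
```

A count of 2 expands to the ordinals "0" and "1", while list entries pass through as strings unchanged.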

ReactantServerCore.node_server_config — Method

node_server_config(node, worker) -> (ServerConfig, applied_overrides, worker_name)

Resolve the ServerConfig for one worker of a node. worker may be nothing when the node has exactly one worker (it defaults to that sole entry); otherwise it must name a defined worker. Environment overrides (INFERENCE_SERVER_*) are applied on top, as for any server config. Does not validate; call validate_config on the result.

ReactantServerCore.validate_node — Method

validate_node(node)

Structural validation of a parsed node config. Raises ConfigError on a malformed file: missing model_repo, duplicate worker names, colliding ports, or a models: entry that targets an undefined worker. The models: map is optional for any node: when omitted, every worker loads every bundle in the repo; when present, it is a per-model override that pins the named models to device memory on the listed workers (see worker_raw_config).

ReactantServerCore.worker_names — Method

worker_names(node) -> Vector{String}

The worker names in declaration order.

ReactantServerCore.worker_raw_config — Method

worker_raw_config(node, name) -> Dict{String,Any}

Resolve a single worker's raw config dict from a (validated) node config: deep-merge global with the named worker's override blocks, then set model_dirs to the shared repo, the endpoint port from base_port, the runtime device ordinal from the worker's GPU, and the node-level shared_host_weights flag. Every worker loads every bundle in the repo; the top-level models: map is an optional per-model override that pins the named models to device memory on the listed workers (translated here into scheduler.models.<name>.residency: device). The result has the shape build_config consumes.
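Two of the steps above, the deep merge and the translation of the models: map into residency overrides, can be sketched on plain dictionaries. The helpers deep_merge and pin_to_device! are hypothetical, and the keys and values in the example ("cuda", 0.9, "resnet") are illustrative rather than taken from a real node file.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Recursively merge `override` into a copy of `base`: nested dicts merge key by
# key, and any other value in `override` replaces the one in `base`.
function deep_merge(base::AbstractDict, override::AbstractDict)
    out = Dict{String,Any}(deepcopy(base))
    for (k, v) in override
        if haskey(out, k) && out[k] isa AbstractDict && v isa AbstractDict
            out[k] = deep_merge(out[k], v)
        else
            out[k] = deepcopy(v)
        end
    end
    return out
end

# Write scheduler.models.<name>.residency = "device" for each pinned model.
function pin_to_device!(cfg::AbstractDict, names)
    sched = get!(() -> Dict{String,Any}(), cfg, "scheduler")
    models = get!(() -> Dict{String,Any}(), sched, "models")
    for name in names
        entry = get!(() -> Dict{String,Any}(), models, name)
        entry["residency"] = "device"
    end
    return cfg
end

defaults = Dict{String,Any}("runtime" =>
    Dict{String,Any}("backend" => "cuda", "mem_fraction" => 0.9))
override = Dict{String,Any}("runtime" => Dict{String,Any}("mem_fraction" => 0.5))
cfg = pin_to_device!(deep_merge(defaults, override), ["resnet"])
```

The worker's override changes only mem_fraction and inherits backend from the global block. Because deep_merge copies its inputs, pinning a model on one worker's config leaves the shared defaults untouched.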