Node Configuration
A deployment is described by a single node file. It is the only supported config format. It describes one or more single-GPU workers on one machine and, optionally, the gateway that fronts them. Each worker reads this same file, resolves its own entry by name, and loads (and can serve) every bundle in the shared model repository.
A single GPU is just a one-worker node: keep one entry under workers: (or omit workers: entirely under the supervisor, below). Growing to more GPUs means adding workers, not changing the config format.
Under the node supervisor (the container default, see Docker Deployment) the workers: list is optional: omit it and add gpus: auto (or an integer count, or an explicit device list) and the supervisor synthesizes one worker per detected GPU, assigning each its device. An explicit workers: list still wins when present. The keys below describe that explicit form, which the supervisor also honors.
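The supervisor's worker synthesis can be pictured with a minimal sketch. This is not the supervisor's actual code: the helper names, the worker0/worker1 naming scheme, the treatment of a missing gpus: key as auto, and the representation of the device list as a Python list are all assumptions made for illustration.

```python
from typing import Union


def synthesize_workers(gpus: Union[str, int, list], detected: list) -> list:
    """Return one worker entry per selected GPU (hypothetical helper)."""
    if gpus == "auto":
        devices = list(detected)              # one worker per detected GPU
    elif isinstance(gpus, int):
        if gpus > len(detected):
            raise ValueError(f"asked for {gpus} GPUs, detected {len(detected)}")
        devices = list(detected[:gpus])       # integer count: take the first N
    else:
        devices = list(gpus)                  # explicit device list
        missing = [d for d in devices if d not in detected]
        if missing:
            raise ValueError(f"devices not visible: {missing}")
    # Assumed naming scheme: worker<index>; each worker is assigned one device.
    return [{"name": f"worker{i}", "gpu": dev} for i, dev in enumerate(devices)]


def resolve_workers(node: dict, detected: list) -> list:
    """An explicit workers: list wins; otherwise synthesize from gpus:."""
    if node.get("workers"):
        return node["workers"]
    # Assumption: a missing gpus: key is treated like "auto".
    return synthesize_workers(node.get("gpus", "auto"), detected)


print(resolve_workers({"gpus": "auto"}, detected=[0, 1]))
print(resolve_workers({"gpus": "auto", "workers": [{"name": "solo"}]}, detected=[0, 1]))
```

The second call shows the precedence rule: the explicit list is returned untouched even though gpus: is also set.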
Top-level keys
# One shared bundle repository. Each immediate subdirectory containing a manifest.yaml is a
# bundle; its directory name is the model name.
model_repo: /var/lib/reactantserver/models
# Worker at index i binds base_port + i unless it sets an explicit `port:`.
base_port: 8080
global: # defaults merged into every worker (any block may be overridden per worker)
workers: # one entry per GPU
models: # optional; pins the named models to device memory on the listed workers
gateway: # optional; read only by the gateway, never by a worker
Global settings
The global: block holds defaults merged into every worker; a worker entry may override any of these blocks. The sub-blocks map onto the resolved ServerConfig:
global:
  cache_dir: /var/cache/reactantserver
  model_control_mode: dynamic         # dynamic (watch the repo) | static | explicit (control plane)
  model_poll_seconds: 15.0            # repository poll interval in dynamic mode
  runtime:                            # -> RuntimeConfig
    backend: cuda                     # cpu or cuda
    mem_fraction: 0.9                 # fraction of device memory claimed for the pool (GPU only)
    preallocate: true                 # claim the pool up front (GPU only)
    allow_cpu_fallback: false
    weight_cache_fraction: 1.0        # arena fraction for all weights (pinned + on-demand); 0 disables, GPU only
    weight_cache_wiggle_fraction: 0.1 # arena fraction kept free; drives startup peak probe + auto-sizing
  scheduler:                          # -> SchedulerConfig
    discipline: fair                  # fair | fifo | edf (use fifo or edf behind a gateway running lpt_packing)
    ema_halflife_seconds: 30.0
    max_queue_depth: 1024             # per-model queue cap; a full model rejects new requests
    dispatch_timeout_seconds: 30.0
    compaction_interval: 0            # worker-local: defragment device memory every N on-demand weight loads;
                                      # 0 disables (the default). Leave 0 behind a gateway (see On-demand Weights)
    models: {}                        # per-model overrides -> ModelSchedConfig (see below)
  endpoints:                          # -> EndpointsConfig
    host: 0.0.0.0                     # bind all interfaces so the gateway/clients can reach the worker
    max_concurrent_requests: 64       # in-flight RPC cap; 0 = uncapped. Past the cap the worker sheds
                                      # with RESOURCE_EXHAUSTED. Keep it above the gateway's per-worker
                                      # outbound stream limit (worker_client.max_concurrent_streams)
model_control_mode sets how the loaded model set evolves: dynamic (the default) watches the repository and loads, unloads, and reloads bundles online as files change; static fixes the startup set; explicit cedes the lifecycle to an external control plane via the worker control RPCs.
scheduler.discipline selects the dispatch policy: fair shares GPU time across models by weighted deficit and learned cost, while fifo serves the oldest queued request first. Workers fronted by a gateway in lpt_packing mode must run fifo or edf (not fair), so the gateway stays the placement and fairness authority (see Multi-GPU Gateway).
edf (earliest-deadline-first) serves the model whose most urgent queued request has the soonest deadline; the deadline is derived from the request's remaining-budget timeout. A meta model is not scheduled here (it runs on the request task), but each of its in-flight sub-calls inherits the meta's deadline, so under edf a meta's continuation is ordered ahead of fresher regular work. edf is designed for deadline-sensitive serving: while every client uses the same deadline it behaves exactly like fifo, and it diverges only to dispatch requests with less budget left ahead of those with more, so a request close to its deadline is served before a fresher one instead of expiring in the queue behind it. edf also sheds work that can no longer finish before its deadline given its learned compute cost (a laxity check), trading some throughput (batch fragmentation, and no per-model weighting) for meeting more deadlines under load. Because edf derives urgency solely from the deadline, issuing different per-client deadlines for the same model reorders that model's service and therefore affects fairness across clients; keep deadlines uniform to retain fifo-like fairness.
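The edf selection rule and the laxity check can be sketched in a few lines. This is an illustration under assumptions, not the scheduler's implementation: the Request type, the function names, and the exact shedding condition (remaining budget smaller than the learned cost) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    model: str
    arrival: float   # seconds since some epoch
    timeout: float   # remaining-budget timeout at arrival, in seconds

    @property
    def deadline(self) -> float:
        return self.arrival + self.timeout


def next_model_edf(queues: dict) -> Optional[str]:
    """Pick the model whose most urgent queued request has the soonest deadline."""
    soonest = {m: min(r.deadline for r in q) for m, q in queues.items() if q}
    if not soonest:
        return None
    return min(soonest, key=lambda m: soonest[m])


def should_shed(req: Request, now: float, learned_cost: float) -> bool:
    """Assumed laxity check: shed when the remaining budget cannot cover the cost."""
    return req.deadline - now < learned_cost


# Uniform timeouts: the older request has the earlier deadline, so edf matches fifo.
uniform = {"a": [Request("a", 0.0, 10.0)], "b": [Request("b", 1.0, 10.0)]}
# Mixed timeouts: the fresher but tighter request on "b" is served first.
mixed = {"a": [Request("a", 0.0, 10.0)], "b": [Request("b", 1.0, 2.0)]}
print(next_model_edf(uniform), next_model_edf(mixed))
```

The mixed case is the fairness caveat in miniature: a client that sends shorter deadlines moves ahead of older work.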
Each sub-block corresponds to a typed config struct: RuntimeConfig, SchedulerConfig, ModelSchedConfig, and EndpointsConfig. See the API Reference for every field and its default.
Workers
workers:
  - { name: worker0 }
  - { name: worker1 }
name is the routing identity (and, under Docker, the compose service name). The listen port is base_port + index unless the worker sets an explicit port:.
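The port rule is small enough to show directly. The sketch below assumes the node file has already been parsed into a dictionary; resolve_ports is a hypothetical helper name, and the third worker with an explicit port is added only for illustration.

```python
def resolve_ports(node: dict) -> dict:
    """Map each worker name to its listen port: explicit port, else base_port + index."""
    base = node["base_port"]
    return {w["name"]: w.get("port", base + i) for i, w in enumerate(node["workers"])}


node = {
    "base_port": 8080,
    "workers": [
        {"name": "worker0"},
        {"name": "worker1"},
        {"name": "worker2", "port": 9000},  # explicit port overrides base_port + 2
    ],
}
print(resolve_ports(node))  # {'worker0': 8080, 'worker1': 8081, 'worker2': 9000}
```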
Under the supervisor (the container default) you do not assign GPUs yourself: it detects the visible devices, gives each worker one of them, and sets that worker's own CUDA_VISIBLE_DEVICES, so every worker sees a single GPU at ordinal 0. Influence the assignment with gpus: above (auto, a count, or an explicit device list), the REACTANT_GPUS environment variable, or by adding gpu: N to a worker entry to pin it to a specific visible device. A container-level CUDA_VISIBLE_DEVICES acts as a coarse filter on which physical GPUs the supervisor sees (see Docker Deployment). When you run a single worker by hand without the supervisor (a bare serve), the worker uses device ordinal 0; set gpu: N to pick another.
A worker entry may also carry override blocks (for example a runtime: block) that merge over global.
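One way to picture the merge is a recursive dictionary overlay. The sketch below assumes a key-by-key merge inside each block (the text does not say whether a worker block replaces the global block wholesale or merges into it), and the helper names and the list of non-config keys are hypothetical.

```python
import copy

# Keys on a worker entry that identify the worker instead of configuring it (assumed list).
NON_CONFIG_KEYS = ("name", "port", "gpu")


def deep_merge(defaults: dict, override: dict) -> dict:
    """Overlay `override` on `defaults`, recursing into nested blocks."""
    out = copy.deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


def resolve_worker(node: dict, name: str) -> dict:
    """Resolve one worker's effective config from global defaults plus its overrides."""
    entry = next(w for w in node["workers"] if w["name"] == name)
    overrides = {k: v for k, v in entry.items() if k not in NON_CONFIG_KEYS}
    return deep_merge(node.get("global", {}), overrides)


node = {
    "global": {"runtime": {"backend": "cuda", "mem_fraction": 0.9}},
    "workers": [
        {"name": "worker0"},
        {"name": "worker1", "runtime": {"mem_fraction": 0.5}},
    ],
}
print(resolve_worker(node, "worker1"))  # {'runtime': {'backend': 'cuda', 'mem_fraction': 0.5}}
```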
Device pinning (the models: map)
models:
  resnet50: [worker0, worker1] # hot on both GPUs
  vsq_coral: [worker0]         # hot on worker0 only
Every worker loads (and can serve) every bundle in model_repo; the gateway discovers what each worker serves and schedules requests across them. The optional top-level models: map is a per-model override that pins the named models to device memory on the listed workers for the lowest latency (it translates into scheduler.models.<name>.residency: device on those workers). Unlisted models stay system-pinned in host RAM and load to the device on demand. Omit the block entirely to keep every model on-demand. To load only a subset of the repository on a worker, the resolved config also supports a models_include allowlist (empty means load all).
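The translation from the models: map into per-worker scheduler overrides can be sketched as follows. The function name is hypothetical and the output shape simply mirrors the scheduler.models.<name>.residency path named above.

```python
def residency_overrides(models_map: dict, worker: str) -> dict:
    """Build the scheduler.models overrides implied by the models: map for one worker."""
    pinned = {
        name: {"residency": "device"}
        for name, workers in models_map.items()
        if worker in workers
    }
    return {"scheduler": {"models": pinned}}


models_map = {"resnet50": ["worker0", "worker1"], "vsq_coral": ["worker0"]}
print(residency_overrides(models_map, "worker1"))
# {'scheduler': {'models': {'resnet50': {'residency': 'device'}}}}
```

On worker1, vsq_coral gets no entry, so it keeps the default residency and loads to the device on demand.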
Per-model scheduler overrides
Tune individual models under scheduler.models, which builds the ModelSchedConfig entries:
scheduler:
  models:
    resnet50:
      weight: 2.0          # relative compute share (default 1.0)
      residency: device    # keep weights GPU-resident for the server's lifetime
      max_batch_size: 8    # cap on rows coalesced per dispatch (default uncapped)
    yolo:
      residency: unpinned  # no host floor; re-materialized from disk on each load
weight sets the model's fair share. residency sets the model's residency floor (unpinned, system, or device; pin_to_gpu: true is a back-compat alias for residency: device). When the on-demand weight cache is enabled, models without an explicit residency default to system (weights pinned in host RAM); see On-demand Weights.
max_batch_size caps how many rows the scheduler coalesces into one dispatch of the model. It does not change compiled shapes: the dispatch sizes come from the batch sizes the bundle was compiled for (a partial fill still pads up to the smallest compiled size), the batch axis comes from the bundle manifest, and a single request larger than the cap is still served in one dispatch because requests are never split.
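The interaction between the cap, padding, and unsplittable requests is easier to see with numbers. The sketch below is a simplification under assumptions: it coalesces requests greedily in arrival order, treats the largest compiled size as the limit when no cap is set, and reads "smallest compiled size" as the smallest one that fits the rows. plan_dispatches is a hypothetical name.

```python
import bisect


def plan_dispatches(request_rows, compiled_sizes, max_batch_size=None):
    """Return (rows, padded_size) per dispatch for a queue of per-request row counts."""
    sizes = sorted(compiled_sizes)
    limit = sizes[-1] if max_batch_size is None else min(max_batch_size, sizes[-1])

    # Greedy coalescing: close the batch when the next request would exceed the limit.
    batches, current = [], []
    for rows in request_rows:
        if current and sum(current) + rows > limit:
            batches.append(current)
            current = []
        current.append(rows)  # a request is never split, even if it exceeds the cap
    if current:
        batches.append(current)

    plans = []
    for batch in batches:
        total = sum(batch)
        idx = bisect.bisect_left(sizes, total)  # smallest compiled size >= total
        if idx == len(sizes):
            raise ValueError(f"{total} rows exceed the largest compiled size {sizes[-1]}")
        plans.append((total, sizes[idx]))
    return plans


print(plan_dispatches([3, 3, 3, 10], compiled_sizes=[1, 4, 8, 16], max_batch_size=8))
# [(6, 8), (3, 4), (10, 16)]
```

The first dispatch coalesces two requests and pads 6 rows to the compiled size 8; the last is a single 10-row request that exceeds the cap of 8 but is still served in one dispatch, padded to 16.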
Gateway configuration
The gateway does not read the node file at all: it is configured by its own standalone gateway.yml, which carries its listen addresses, limits, and a flat endpoints: list of worker host:port addresses (see Multi-GPU Gateway). A gateway.yml looks like:
listen:
  grpc: "0.0.0.0:8001"
  metrics: "0.0.0.0:8002"
grpc:
  max_recv_msg_bytes: 268435456            # 256 MiB
  max_send_msg_bytes: 268435456
  max_concurrent_requests_per_worker: 64   # inbound cap is this x worker count; 0 = uncapped.
                                           # Sized above the outbound stream limit so a startup
                                           # burst has headroom rather than being shed early
worker_client:
  request_timeout_seconds: 60
  max_concurrent_streams: 32               # outbound in-flight RPCs the gateway multiplexes to one worker
logging:
  level: "info"
  format: "json"
scheduling:
  mode: round_robin                        # round_robin | lpt_packing (see Multi-GPU Gateway)
endpoints:                                 # worker host:port addresses, across any number of nodes
  - "worker0:8080"
  - "worker1:8081"
Environment-variable overrides
Any worker value can be overridden per process by an environment variable of the form INFERENCE_SERVER_<SECTION>_<FIELD>, for example:
INFERENCE_SERVER_ENDPOINTS_PORT=9100
INFERENCE_SERVER_RUNTIME_BACKEND=cpu
INFERENCE_SERVER_RUNTIME_WEIGHT_CACHE_FRACTION=0.8
List-valued overrides (INFERENCE_SERVER_MODEL_DIRS, INFERENCE_SERVER_MODELS_INCLUDE) are colon-separated. Overrides are applied on top of the resolved node config, and the effective configuration is logged at startup.
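The naming scheme can be illustrated with a small sketch that overlays such variables on a resolved config dictionary. How the server actually splits a variable name into section and field is not specified here, so the section list, the handling of sectionless names such as MODELS_INCLUDE, and the decision to leave values as uncoerced strings are assumptions; apply_env_overrides is a hypothetical helper.

```python
import copy

PREFIX = "INFERENCE_SERVER_"
SECTIONS = ("runtime", "scheduler", "endpoints")   # assumed set of section names
LIST_FIELDS = ("model_dirs", "models_include")     # colon-separated, per the text


def apply_env_overrides(config: dict, environ: dict) -> dict:
    """Overlay INFERENCE_SERVER_* variables on a resolved config (values stay strings)."""
    out = copy.deepcopy(config)
    for key, raw in environ.items():
        if not key.startswith(PREFIX):
            continue
        path = key[len(PREFIX):].lower()
        section, _, field = path.partition("_")
        if section in SECTIONS and field:
            out.setdefault(section, {})[field] = raw
        elif path in LIST_FIELDS:
            out[path] = [part for part in raw.split(":") if part]
        else:
            out[path] = raw  # assumed: any other name is a top-level field
    return out


env = {
    "INFERENCE_SERVER_ENDPOINTS_PORT": "9100",
    "INFERENCE_SERVER_RUNTIME_WEIGHT_CACHE_FRACTION": "0.8",
    "INFERENCE_SERVER_MODELS_INCLUDE": "resnet50:yolo",
    "PATH": "/usr/bin",  # ignored: no INFERENCE_SERVER_ prefix
}
print(apply_env_overrides({"runtime": {"backend": "cuda"}}, env))
```

In a real process the second argument would be os.environ; a plain dictionary keeps the example deterministic.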