Scheduling

The scheduler is the dispatch engine: it decides which model runs next and coalesces same-model requests into batches. See Architecture for the decision policy.

ReactantServer.Scheduler — Type

Scheduler

The deficit-weighted, cost-aware, coalescing dispatch engine. Concurrent requests land on per-model queues, and a single dispatch loop runs one GPU execution at a time. The loop coalesces same-model requests into one batched execution at a compiled size, and it shares GPU time among models according to relative model weight and a learned per-batch-size cost estimate. The scheduler holds the model registry, the backend, the device memory pool, the SchedulerConfig, per-model dispatch state, and an optional on-demand WeightCache. Submit work with infer and read observability counters with scheduler_metrics.
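To make the description above concrete, here is a minimal, self-contained sketch of one way a deficit-weighted, cost-aware, coalescing pick could work. It is not the ReactantServer implementation: ToyModelState, pick_next, batch_size, and dispatch_once! are hypothetical names, the "lowest weight-normalized consumption wins" rule and the "largest compiled size that fits" rule are assumptions, and the cost table is hand-written rather than learned. The actual decision policy is the one described in Architecture.

```julia
# Hypothetical toy model; none of these names come from ReactantServer.
mutable struct ToyModelState
    weight::Float64       # relative share of GPU time (assumed meaning)
    consumed::Float64     # compute charged to this model so far
    queue::Vector{Int}    # pending request ids
end

# Pick the model with pending work whose weight-normalized consumption is lowest.
function pick_next(states::Dict{String,ToyModelState})
    best, best_score = nothing, Inf
    for name in sort!(collect(keys(states)))   # sorted so ties resolve deterministically
        s = states[name]
        isempty(s.queue) && continue
        score = s.consumed / s.weight
        if score < best_score
            best, best_score = name, score
        end
    end
    return best
end

# Largest compiled batch size that the queue can fill; otherwise the smallest
# compiled size (a real engine would pad the batch in that case).
function batch_size(compiled::Vector{Int}, pending::Int)
    fitting = filter(<=(pending), compiled)
    return isempty(fitting) ? minimum(compiled) : maximum(fitting)
end

# One iteration of the toy dispatch loop: pick, coalesce, charge the cost.
function dispatch_once!(states, compiled, cost)
    name = pick_next(states)
    name === nothing && return nothing
    s = states[name]
    n = batch_size(compiled, length(s.queue))
    taken = splice!(s.queue, 1:min(n, length(s.queue)))   # coalesced requests
    s.consumed += cost[n]                                  # per-batch-size cost estimate
    return (name, n, taken)
end

states = Dict(
    "a" => ToyModelState(2.0, 0.0, collect(1:5)),
    "b" => ToyModelState(1.0, 0.0, collect(1:3)),
)
compiled = [1, 2, 4]
cost = Dict(1 => 1.0, 2 => 1.5, 4 => 2.5)

first_pick = dispatch_once!(states, compiled, cost)    # ("a", 4, [1, 2, 3, 4])
second_pick = dispatch_once!(states, compiled, cost)   # ("b", 2, [1, 2])
```

In this toy, model "a" is charged 2.5 units after its first batch, so its normalized consumption (1.25) exceeds that of "b" (0.0) and "b" runs next, which is the fairness effect the weights are meant to produce.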

ReactantServer.infer — Function

infer(scheduler, request) -> Vector{NamedTensor}

Submit a request and block until the scheduler returns the result. Re-raises any error captured during dispatch.

infer runs the model's preprocess and postprocess hooks on the caller's task rather than on the dispatch loop: preprocess runs before the request is queued, and postprocess runs on the raw device outputs the loop hands back. Because each gRPC request runs on its own task, the hook work of many requests proceeds in parallel and overlaps the single, serialized GPU execution. The dispatch loop coalesces and runs qr.prepared, and it never executes a request whose preprocess has not finished, because a request is enqueued (made visible to the loop) only after preprocess returns.
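The ordering described above can be illustrated with a small, self-contained toy built on Julia tasks and channels. This is only a sketch of the pattern, not the ReactantServer code: ToyRequest, toy_loop, and toy_infer are hypothetical names, the "device execution" is a stand-in that doubles its input, and the per-request reply channel is an assumed mechanism for handing results and captured errors back to the caller.

```julia
# Hypothetical toy; these names are illustrative and not part of ReactantServer.
struct ToyRequest
    prepared::Vector{Float64}   # output of preprocess, ready to execute
    reply::Channel{Any}         # the loop puts a result or a captured exception here
end

# Single serialized loop: handles one request at a time until the queue is closed.
function toy_loop(queue::Channel{ToyRequest})
    for req in queue
        result = try
            any(isnan, req.prepared) && error("NaN input")
            req.prepared .* 2   # stand-in for the device execution
        catch err
            err                 # capture the error instead of killing the loop
        end
        put!(req.reply, result)
    end
end

function toy_infer(queue, input; preprocess = identity, postprocess = identity)
    prepared = preprocess(input)          # runs on the caller's task, before enqueue
    req = ToyRequest(prepared, Channel{Any}(1))
    put!(queue, req)                      # only now is the request visible to the loop
    raw = take!(req.reply)                # block until the loop answers
    raw isa Exception && throw(raw)       # re-raise the captured error on the caller
    return postprocess(raw)               # runs on the caller's task again
end

queue = Channel{ToyRequest}(32)
loop = Threads.@spawn toy_loop(queue)

# Four concurrent callers; their hooks can overlap the serialized loop.
tasks = [Threads.@spawn toy_infer(queue, Float64[i, i + 1];
                                  preprocess = x -> x .+ 1,
                                  postprocess = sum) for i in 1:4]
results = fetch.(tasks)   # [10.0, 14.0, 18.0, 22.0]

close(queue)
wait(loop)
```

Because put! happens only after preprocess returns, the loop in this toy can never observe a half-prepared request, which mirrors the guarantee stated above.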

ReactantServer.scheduler_metrics — Function

scheduler_metrics(scheduler) -> Dict{String,NamedTuple}

Snapshot per-model observability: dispatch count, total compute consumed, current recent-compute EMA, queue-wait P50/P99, the histogram of dispatch batch sizes, and residency.
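As a rough illustration of how the listed quantities could be derived from raw samples, the sketch below builds one per-model entry from recorded queue waits, batch sizes, and per-dispatch costs. Everything here is an assumption made for illustration: toy_snapshot, the field names, the smoothing factor alpha, and the model name "resnet" are hypothetical, and the real NamedTuple returned by scheduler_metrics may use different fields and estimators. Residency is omitted because the text does not say how it is represented.

```julia
using Statistics: quantile

# Hypothetical helper; field names and the EMA smoothing factor are assumptions.
function toy_snapshot(waits::Vector{Float64}, batch_sizes::Vector{Int},
                      costs::Vector{Float64}; alpha = 0.2)
    # Exponential moving average of per-dispatch compute, oldest sample first.
    ema = 0.0
    for c in costs
        ema = alpha * c + (1 - alpha) * ema
    end
    # Histogram: batch size => number of dispatches at that size.
    hist = Dict{Int,Int}()
    for b in batch_sizes
        hist[b] = get(hist, b, 0) + 1
    end
    return (dispatches = length(batch_sizes),
            total_compute = sum(costs),
            recent_compute_ema = ema,
            wait_p50 = quantile(waits, 0.5),    # requires at least one wait sample
            wait_p99 = quantile(waits, 0.99),
            batch_histogram = hist)
end

snapshot = Dict{String,NamedTuple}(
    "resnet" => toy_snapshot([1.0, 2.0, 3.0], [4, 4, 1], [2.5, 2.5, 1.0]),
)

entry = snapshot["resnet"]
println(entry.dispatches, " dispatches, median wait ", entry.wait_p50)
```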