
feat(sandbox): make L7 inference proxy CHUNK_IDLE_TIMEOUT configurable per route #866

@vnicolici


Problem Statement

The L7 inference proxy in openshell-sandbox aborts streaming inference responses when no chunk arrives within CHUNK_IDLE_TIMEOUT, a hardcoded constant (crates/openshell-sandbox/src/proxy.rs:39). PR #834 raised it from 30 s to 120 s in v0.0.30 to accommodate reasoning models, but the constant remains baked into the binary with no operator override.

120 s is enough for typical generation, but several legitimate workloads exceed it on the prefill phase alone — at which point the proxy injects an SSE error and truncates a perfectly healthy stream:

  • Large initial tool-call returns / pasted context. A single user turn carrying a 20K+ token tool result has to be prefilled before the model emits any token.

  • Full-context reprocess after prefix-cache invalidation. Long-running OpenClaw conversations can carry hundreds of thousands of tokens. When something near the start of the context changes (a system-prompt edit, memory injection, tool-result rewrite), every subsequent token is no longer prefix-cacheable and the entire context must be re-prefilled. Empirically measured single-request prefill on this setup (NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 served by local vLLM on an ASUS Ascent GX10 / NVIDIA GB10, DGX Spark–compatible, otherwise idle GPU):

    tokens    elapsed   tok/s   vs 120 s cap
    16,000      6 s     2647       5 %
    64,000     26 s     2441      22 %
    128,000    61 s     2094      51 %
    256,000   158 s     1616     132 %
    400,000   293 s     1356     245 %
    779,685   866 s      901     722 %

    Rate degrades from 2647 → 901 tok/s as size grows (consistent with attention's O(N²) work). 256K already crosses the cap on a fresh, otherwise-idle GPU. 780K — close to Nemotron-3-Super's 1M context window — takes ~14.4 minutes of pure prefill, ~7× the cap. Extrapolating to a true 1M reprocess, ~20–25 minutes is realistic. Full data and methodology in Agent Investigation below.

  • Backend restart / KV-cache eviction. vLLM losing its KV cache (process restart, OOM-driven eviction, cache pressure from concurrent users) forces the next request to re-prefill from scratch with no prefix-cache hits. Same math as above.

  • GPU contention from concurrent OpenClaw sessions. OpenClaw can spawn sub-agents and run multiple sessions against the same backend. Per-request prefill rate divides roughly with concurrency, so the safe single-request ceiling above (~250K tokens at 120 s) drops to ~125K with two concurrent prefills, ~80K with three. Workloads that fit comfortably under the cap in isolation can fail under realistic load.
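The concurrency arithmetic above can be sketched as a quick back-of-envelope check (illustrative helper, not project code; the rates are the measured single-request values from the table):

```rust
/// Estimated prefill time when `concurrency` requests share the GPU,
/// assuming per-request rate divides roughly evenly (as observed).
fn prefill_secs(tokens: f64, single_rate_tok_s: f64, concurrency: f64) -> f64 {
    tokens / (single_rate_tok_s / concurrency)
}

fn main() {
    let cap = 120.0; // hardcoded CHUNK_IDLE_TIMEOUT, in seconds
    // 256K tokens at the measured 1616 tok/s crosses the cap even alone.
    assert!(prefill_secs(256_000.0, 1616.0, 1.0) > cap);
    // 128K fits in isolation (~61 s) but fails with two concurrent prefills.
    assert!(prefill_secs(128_000.0, 2094.0, 1.0) < cap);
    assert!(prefill_secs(128_000.0, 2094.0, 2.0) > cap);
}
```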

In every case the upstream is healthy and would happily stream tokens given enough time; only the proxy's hardcoded ceiling is at fault. The 120 s value is a reasonable default but not a universal one — the right ceiling depends on the operator's hardware, model, and expected prompt sizes.

Proposed Design

Make the per-chunk idle timeout configurable per inference route, mirroring the existing timeout_secs plumbing.

  1. Config plumbing. Add chunk_idle_timeout_secs: Option<u64> to RouteConfig / ResolvedRoute in openshell-router, alongside the existing timeout_secs. None = use the compiled-in default (currently 120 s); 0 = disable the per-chunk timeout entirely (consistent with how --timeout 0 already means "default" for total request timeout).

  2. CLI surface. Extend openshell inference set and openshell inference update with --chunk-idle-timeout <secs> (default 0 = unchanged/use default). Mirror the existing --timeout flag behavior. Show the resolved value in openshell inference get.

  3. Proxy wiring. In route_inference_request (crates/openshell-sandbox/src/proxy.rs:1261), replace the constant with the route-resolved value:

    let chunk_idle = route.chunk_idle_timeout.unwrap_or(DEFAULT_CHUNK_IDLE_TIMEOUT);

    Keep DEFAULT_CHUNK_IDLE_TIMEOUT = 120 s as the fallback when unset, so existing deployments behave identically.

  4. Truncation event metadata. Include the resolved timeout value in the OCSF streaming response chunk idle timeout event so operators can see in logs which timeout fired.

  5. Validation. Sanity-check the value at config load (e.g., reject < 5 s to avoid foot-guns; cap at some upper bound like 3600 s to avoid lock-up scenarios).
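The default/disable/validate semantics from items 1 and 5 can be sketched as a single resolution function. This is a sketch under assumed names; the actual shape of ResolvedRoute and the exact bounds are implementation decisions:

```rust
use std::time::Duration;

const DEFAULT_CHUNK_IDLE_TIMEOUT: Duration = Duration::from_secs(120);
const MIN_CHUNK_IDLE_SECS: u64 = 5;    // reject foot-gun values
const MAX_CHUNK_IDLE_SECS: u64 = 3600; // avoid lock-up scenarios

/// Resolve a route's chunk_idle_timeout_secs:
///   None    -> compiled-in default (existing behavior, unchanged)
///   Some(0) -> per-chunk timeout disabled
///   Some(n) -> n seconds, validated against [MIN, MAX]
fn resolve_chunk_idle(configured: Option<u64>) -> Result<Option<Duration>, String> {
    match configured {
        None => Ok(Some(DEFAULT_CHUNK_IDLE_TIMEOUT)),
        Some(0) => Ok(None),
        Some(n) if n < MIN_CHUNK_IDLE_SECS => Err(format!(
            "chunk_idle_timeout_secs must be 0 or >= {MIN_CHUNK_IDLE_SECS}"
        )),
        Some(n) if n > MAX_CHUNK_IDLE_SECS => Err(format!(
            "chunk_idle_timeout_secs must be <= {MAX_CHUNK_IDLE_SECS}"
        )),
        Some(n) => Ok(Some(Duration::from_secs(n))),
    }
}

fn main() {
    assert_eq!(resolve_chunk_idle(None), Ok(Some(DEFAULT_CHUNK_IDLE_TIMEOUT)));
    assert_eq!(resolve_chunk_idle(Some(0)), Ok(None));
    assert_eq!(resolve_chunk_idle(Some(300)), Ok(Some(Duration::from_secs(300))));
    assert!(resolve_chunk_idle(Some(2)).is_err());
    assert!(resolve_chunk_idle(Some(7200)).is_err());
}
```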

User-facing change is additive and backward-compatible.

Alternatives Considered

  • Just bump the constant again. No value satisfies all workloads — 600 s helps a 200K-token reprocess but penalizes operators on weak hardware who'd rather fail-fast on stuck streams. A constant always picks one operating point.
  • Derive from existing --timeout. Reusing route.timeout_secs as both total-request and per-chunk idle timeout would conflate two distinct safety properties: total lifetime (cap on runaway requests) vs. per-chunk liveness (catch a genuinely dead stream). Operators want to set them independently — e.g., 3600 s total, 300 s per-chunk idle.
  • Auto-tune from a backend probe. Inferring the right value from /v1/models or a warmup probe is too magical and brittle; it can't know what the operator's worst-case prompt looks like.
  • Disable the per-chunk timeout entirely for self-hosted backends. Removes a useful safety net that catches genuinely stuck streams (network partitions, hung backend processes). Should remain available as an opt-in (--chunk-idle-timeout 0) but not the default.

Agent Investigation

Reproduced and isolated with NemoClaw v0.0.10 + OpenShell v0.0.26 + local vLLM serving NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on an ASUS Ascent GX10 (NVIDIA GB10 GPU, 20-core Cortex-X925, aarch64; DGX Spark–compatible reference platform):

  • vLLM logs show every aborted request was killed exactly 30 s after acceptance; matches CHUNK_IDLE_TIMEOUT = 30s in proxy.rs:34 of v0.0.26.
  • Streaming curl from inside the sandbox through the L7 proxy with a 25K–30K random-token prompt: aborted at 30 s, vLLM logs the abort, client receives the format_sse_error("response truncated: chunk idle timeout exceeded") frame.
  • Same prompt sent direct host → vLLM (bypassing the sandbox): completed normally with no abort. Confirms the cap lives in openshell-sandbox, not in vLLM, the OpenAI SDK, undici, or OpenClaw's agents.defaults.llm.idleTimeoutSeconds (which is a client-side iterator timer that only fires after the proxy has already aborted).
  • Closed PR fix(inference): prevent silent truncation of large streaming responses #834 / commit 355d845d already shows maintainers see the constant as a tuning knob ("Reasoning models … can pause for 60+ seconds … 120 s provides headroom"). This proposal is the next step: lift it from a compile-time constant to a per-route config.

Prefill benchmark methodology (numbers in Problem Statement table): non-streaming /v1/chat/completions requests sent direct to vLLM, max_tokens: 5 to make generation a negligible fraction of total time, each request prefixed with a fresh UUID to defeat prefix-cache hits. Verified usage.prompt_tokens_details.cached_tokens == 0 on every response. Cross-validated synthetic random-number prompts against real book text (Romeo & Juliet through Moby Dick and Chambers Dictionary): rates match within ~3 % at the same token sizes.
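The cache-busting step of that methodology can be sketched as follows (stdlib-only sketch: a clock-derived nonce stands in for the UUID, and a real run would POST the body to vLLM's /v1/chat/completions):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Build a non-streaming chat-completions body whose prompt begins with a
/// fresh nonce, so no prefix-cache hit is possible. max_tokens: 5 keeps
/// generation a negligible fraction of total request time.
fn cache_busted_body(model: &str, prompt: &str) -> String {
    let nonce = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    format!(
        r#"{{"model":"{model}","max_tokens":5,"messages":[{{"role":"user","content":"{nonce} {prompt}"}}]}}"#
    )
}

/// Prefill rate as reported in the tables: prompt tokens over wall time.
fn prefill_rate(prompt_tokens: u64, elapsed_secs: f64) -> f64 {
    prompt_tokens as f64 / elapsed_secs
}

fn main() {
    let body = cache_busted_body("NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "long prompt here");
    assert!(body.contains(r#""max_tokens":5"#));
    // The 254,541-token sweep point: 157.49 s -> ~1616 tok/s.
    assert!((prefill_rate(254_541, 157.49) - 1616.0).abs() < 1.0);
}
```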

Full benchmark data (16 sweep points, random + book text):

    tokens    elapsed     tok/s  source
    15,974      6.04 s    2647   random
    31,850     12.19 s    2614   random
    42,216     16.82 s    2510   alice-in-wonderland
    44,881     17.95 s    2501   romeo-and-juliet
    63,696     26.09 s    2441   random
    71,105     29.97 s    2373   the-great-gatsby
    104,166    47.76 s    2181   frankenstein
    127,372    60.82 s    2094   random
    170,463    90.60 s    1881   wuthering-heights
    178,023    96.82 s    1839   pride-and-prejudice
    254,541   157.49 s    1616   random
    263,972   168.64 s    1565   jane-eyre
    317,863   218.20 s    1457   moby-dick
    326,075   226.87 s    1437   city-of-god-volume-i
    397,856   293.47 s    1356   random
    779,685   865.70 s     901   chambers-dictionary

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
