Problem Statement

The L7 inference proxy in openshell-sandbox aborts streaming inference responses when no chunk arrives within CHUNK_IDLE_TIMEOUT, a hardcoded constant (crates/openshell-sandbox/src/proxy.rs:39). PR #834 raised it from 30 s to 120 s in v0.0.30 to accommodate reasoning models, but the constant remains baked into the binary with no operator override.
120 s is enough for typical generation, but several legitimate workloads exceed it on the prefill phase alone — at which point the proxy injects an SSE error and truncates a perfectly healthy stream:
Large initial tool-call returns / pasted context. A single user turn carrying a 20K+ token tool result has to be prefilled before the model emits any token.
Full-context reprocess after prefix-cache invalidation. Long-running OpenClaw conversations can carry hundreds of thousands of tokens. When something near the start of the context changes (a system-prompt edit, memory injection, tool-result rewrite), every subsequent token is no longer prefix-cacheable and the entire context must be re-prefilled. Empirically measured single-request prefill on this setup (NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 served by local vLLM on an ASUS Ascent GX10 / NVIDIA GB10, DGX Spark–compatible, otherwise idle GPU):
| tokens | elapsed | tok/s | vs 120 s cap |
|---|---|---|---|
| 16,000 | 6 s | 2647 | 5 % |
| 64,000 | 26 s | 2441 | 22 % |
| 128,000 | 61 s | 2094 | 51 % |
| 256,000 | 158 s | 1616 | 132 % |
| 400,000 | 293 s | 1356 | 245 % |
| 779,685 | 866 s | 901 | 722 % |
Rate degrades from 2647 → 901 tok/s as size grows (consistent with attention's O(N²) work). 256K already crosses the cap on a fresh, otherwise-idle GPU. 780K — close to Nemotron-3-Super's 1M context window — takes ~14.4 minutes of pure prefill, ~7× the cap. Extrapolating to a true 1M reprocess, ~20–25 minutes is realistic. Full data and methodology in Agent Investigation below.
Backend restart / KV-cache eviction. vLLM losing its KV cache (process restart, OOM-driven eviction, cache pressure from concurrent users) forces the next request to re-prefill from scratch with no prefix-cache hits. Same math as above.
GPU contention from concurrent OpenClaw sessions. OpenClaw can spawn sub-agents and run multiple sessions against the same backend. Per-request prefill rate divides roughly with concurrency, so the safe single-request ceiling above (~250K tokens at 120 s) drops to ~125K with two concurrent prefills, ~80K with three. Workloads that fit comfortably under the cap in isolation can fail under realistic load.
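The ceiling numbers above can be sketched with simple arithmetic (assuming, as the text does, that prefill throughput divides roughly evenly across concurrent requests — an approximation, not an OpenShell API):

```rust
/// Approximate largest prompt (in tokens) that finishes prefill before the
/// per-chunk idle timeout fires, assuming the measured single-request rate
/// divides evenly across `concurrency` simultaneous prefills.
/// Illustrative arithmetic only.
fn safe_prefill_ceiling(single_rate_tok_s: f64, timeout_secs: f64, concurrency: u32) -> f64 {
    (single_rate_tok_s / concurrency as f64) * timeout_secs
}

// With a representative ~2000 tok/s from the table above and the 120 s cap:
//   1 prefill  -> ~240K tokens
//   2 prefills -> ~120K tokens
//   3 prefills -> ~80K tokens
```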
In every case the upstream is healthy and would happily stream tokens given enough time; only the proxy's hardcoded ceiling is at fault. The 120 s value is a reasonable default but not a universal one — the right ceiling depends on the operator's hardware, model, and expected prompt sizes.
Proposed Design
Make the per-chunk idle timeout configurable per inference route, mirroring the existing timeout_secs plumbing.
Config plumbing. Add chunk_idle_timeout_secs: Option<u64> to RouteConfig / ResolvedRoute in openshell-router, alongside the existing timeout_secs. None = use the compiled-in default (currently 120 s); 0 = disable the per-chunk timeout entirely (consistent with how --timeout 0 already means "default" for total request timeout).
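A minimal sketch of the shape this field might take (illustrative only — the real RouteConfig in openshell-router has more fields than shown here):

```rust
/// Illustrative subset of a route config; only `chunk_idle_timeout_secs`
/// is new, the other field stands in for the existing plumbing.
#[derive(Debug, Clone, Default, PartialEq)]
struct RouteConfig {
    /// Existing: total request timeout (0 = default).
    timeout_secs: Option<u64>,
    /// Proposed: per-chunk idle timeout. None = compiled-in default
    /// (currently 120 s), Some(0) = per-chunk timeout disabled.
    chunk_idle_timeout_secs: Option<u64>,
}
```

An unset field deserializes to None, so configs written before this change resolve to today's behavior.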
CLI surface. Extend openshell inference set and openshell inference update with --chunk-idle-timeout <secs> (default 0 = unchanged/use default). Mirror the existing --timeout flag behavior. Show the resolved value in openshell inference get.
Proxy wiring. In route_inference_request (crates/openshell-sandbox/src/proxy.rs:1261), replace the constant with the route-resolved value:
```rust
let chunk_idle = route.chunk_idle_timeout.unwrap_or(DEFAULT_CHUNK_IDLE_TIMEOUT);
```
Keep DEFAULT_CHUNK_IDLE_TIMEOUT = 120s as the fallback when unset, so existing deployments behave identically.
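One way the full None/0 semantics from the config section could resolve at the proxy (function name and return shape are hypothetical, not the actual proxy.rs code):

```rust
use std::time::Duration;

/// Compiled-in default, matching today's behavior.
const DEFAULT_CHUNK_IDLE_TIMEOUT: Duration = Duration::from_secs(120);

/// Resolve the effective per-chunk idle timeout for a route: unset keeps
/// existing deployments identical, 0 opts out of the per-chunk check
/// entirely, and any other value overrides the default.
fn resolve_chunk_idle(configured_secs: Option<u64>) -> Option<Duration> {
    match configured_secs {
        None => Some(DEFAULT_CHUNK_IDLE_TIMEOUT),
        Some(0) => None, // --chunk-idle-timeout 0: per-chunk timeout disabled
        Some(s) => Some(Duration::from_secs(s)),
    }
}
```

Returning Option<Duration> lets the select/timeout branch in the stream loop be skipped cleanly when the timeout is disabled.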
Truncation event metadata. Include the resolved timeout value in the OCSF streaming response chunk idle timeout event so operators can see in logs which timeout fired.
Validation. Sanity-check the value at config load (e.g., reject < 5 s to avoid foot-guns; cap at some upper bound like 3600 s to avoid lock-up scenarios).
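The load-time sanity check could look like the following (bounds from the text above; the error type and function name are illustrative). Note that the explicit 0 sentinel must pass, since it means "disabled" rather than a 0-second timeout:

```rust
/// Sanity-check a configured chunk-idle timeout at config load.
/// Explicit values below 5 s are rejected as foot-guns and values above
/// 3600 s as lock-up risks; unset (None) and the 0 "disabled" sentinel
/// pass through unchanged.
fn validate_chunk_idle(secs: Option<u64>) -> Result<(), String> {
    match secs {
        None | Some(0) => Ok(()),
        Some(s) if s < 5 => Err(format!("chunk_idle_timeout_secs = {s}: below 5 s minimum")),
        Some(s) if s > 3600 => Err(format!("chunk_idle_timeout_secs = {s}: above 3600 s maximum")),
        Some(_) => Ok(()),
    }
}
```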
User-facing change is additive and backward-compatible.
Alternatives Considered
Just bump the constant again. No value satisfies all workloads — 600 s helps a 200K-token reprocess but penalizes operators on weak hardware who'd rather fail-fast on stuck streams. A constant always picks one operating point.
Derive from existing --timeout. Reusing route.timeout_secs as both total-request and per-chunk idle timeout would conflate two distinct safety properties: total lifetime (cap on runaway requests) vs. per-chunk liveness (catch a genuinely dead stream). Operators want to set them independently — e.g., 3600 s total, 300 s per-chunk idle.
Auto-tune from a backend probe. Inferring the right value from /v1/models or a warmup probe is too magical and brittle; it can't know what the operator's worst-case prompt looks like.
Disable the per-chunk timeout entirely for self-hosted backends. Removes a useful safety net that catches genuinely stuck streams (network partitions, hung backend processes). Should remain available as an opt-in (--chunk-idle-timeout 0) but not the default.
Agent Investigation
Reproduced and isolated with NemoClaw v0.0.10 + OpenShell v0.0.26 + local vLLM serving NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on an ASUS Ascent GX10 (NVIDIA GB10 GPU, 20-core Cortex-X925, aarch64; DGX Spark–compatible reference platform):
vLLM logs show every aborted request was killed exactly 30 s after acceptance; matches CHUNK_IDLE_TIMEOUT = 30s in proxy.rs:34 of v0.0.26.
Streaming curl from inside the sandbox through the L7 proxy with a 25K–30K random-token prompt: aborted at 30 s, vLLM logs the abort, client receives the format_sse_error("response truncated: chunk idle timeout exceeded") frame.
Same prompt sent direct host → vLLM (bypassing the sandbox): completed normally with no abort. Confirms the cap lives in openshell-sandbox, not in vLLM, the OpenAI SDK, undici, or OpenClaw's agents.defaults.llm.idleTimeoutSeconds (which is a client-side iterator timer that only fires after the proxy has already aborted).
Closed PR #834 ("fix(inference): prevent silent truncation of large streaming responses", commit 355d845d) already shows maintainers see the constant as a tuning knob ("Reasoning models … can pause for 60+ seconds … 120 s provides headroom"). This proposal is the next step: lift it from a compile-time constant to a per-route config.
Prefill benchmark methodology (numbers in Problem Statement table): non-streaming /v1/chat/completions requests sent direct to vLLM, max_tokens: 5 to make generation a negligible fraction of total time, each request prefixed with a fresh UUID to defeat prefix-cache hits. Verified usage.prompt_tokens_details.cached_tokens == 0 on every response. Cross-validated synthetic random-number prompts against real book text (Romeo & Juliet through Moby Dick and Chambers Dictionary): rates match within ~3 % at the same token sizes.
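The cache-busting request body described above can be sketched as follows (illustrative: the actual benchmark used fresh UUIDs rather than a timestamp nonce, and a real client should JSON-escape the prompt instead of interpolating it):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Build a non-streaming /v1/chat/completions body for a prefill probe:
/// a unique nonce prefixed to the prompt defeats vLLM prefix caching, and
/// max_tokens: 5 keeps generation a negligible fraction of elapsed time,
/// so wall-clock time ~= prefill time.
fn prefill_probe_body(model: &str, prompt: &str) -> String {
    let nonce = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    format!(
        r#"{{"model":"{model}","max_tokens":5,"messages":[{{"role":"user","content":"{nonce} {prompt}"}}]}}"#
    )
}
```

After each response, the benchmark checked usage.prompt_tokens_details.cached_tokens == 0 to confirm the nonce actually forced a full prefill.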
Full benchmark data (16 sweep points, random + book text)
| tokens | elapsed | tok/s | source |
|---|---|---|---|
| 15,974 | 6.04 s | 2647 | random |
| 31,850 | 12.19 s | 2614 | random |
| 42,216 | 16.82 s | 2510 | alice-in-wonderland |
| 44,881 | 17.95 s | 2501 | romeo-and-juliet |
| 63,696 | 26.09 s | 2441 | random |
| 71,105 | 29.97 s | 2373 | the-great-gatsby |
| 104,166 | 47.76 s | 2181 | frankenstein |
| 127,372 | 60.82 s | 2094 | random |
| 170,463 | 90.60 s | 1881 | wuthering-heights |
| 178,023 | 96.82 s | 1839 | pride-and-prejudice |
| 254,541 | 157.49 s | 1616 | random |
| 263,972 | 168.64 s | 1565 | jane-eyre |
| 317,863 | 218.20 s | 1457 | moby-dick |
| 326,075 | 226.87 s | 1437 | city-of-god-volume-i |
| 397,856 | 293.47 s | 1356 | random |
| 779,685 | 865.70 s | 901 | chambers-dictionary |
Checklist
- I've reviewed existing issues and the architecture docs
- This is a design proposal, not a "please build this" request