Problem Statement

The L7 inference proxy in openshell-sandbox aborts streaming inference responses when no chunk arrives within CHUNK_IDLE_TIMEOUT, a hardcoded constant (crates/openshell-sandbox/src/proxy.rs:39). PR #834 raised it from 30 s to 120 s in v0.0.30 to accommodate reasoning models, but the constant remains baked into the binary with no operator override.
120 s is enough for typical generation, but several legitimate workloads exceed it on the prefill phase alone — at which point the proxy injects an SSE error and truncates a perfectly healthy stream:
Large initial tool-call returns / pasted context. A single user turn carrying a 20K+ token tool result has to be prefilled before the model emits any token.
Full-context reprocess after prefix-cache invalidation. Long-running OpenClaw conversations can carry hundreds of thousands of tokens. When something near the start of the context changes (a system-prompt edit, memory injection, tool-result rewrite), every subsequent token is no longer prefix-cacheable and the entire context must be re-prefilled. Empirically measured single-request prefill on this setup (NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 served by local vLLM on an ASUS Ascent GX10 / NVIDIA GB10, DGX Spark–compatible, otherwise idle GPU):
| tokens | elapsed | tok/s | vs 120 s cap |
|---|---|---|---|
| 16,000 | 6 s | 2647 | 5 % |
| 64,000 | 26 s | 2441 | 22 % |
| 128,000 | 61 s | 2094 | 51 % |
| 256,000 | 158 s | 1616 | 132 % |
| 400,000 | 293 s | 1356 | 245 % |
| 779,685 | 866 s | 901 | 722 % |
Rate degrades from 2647 → 901 tok/s as size grows (consistent with attention's O(N²) work). 256K already crosses the cap on a fresh, otherwise-idle GPU. 780K — close to Nemotron-3-Super's 1M context window — takes ~14.4 minutes of pure prefill, ~7× the cap. Extrapolating to a true 1M reprocess, ~20–25 minutes is realistic. Full data and methodology in Agent Investigation below.
Backend restart / KV-cache eviction. vLLM losing its KV cache (process restart, OOM-driven eviction, cache pressure from concurrent users) forces the next request to re-prefill from scratch with no prefix-cache hits. Same math as above.
GPU contention from concurrent OpenClaw sessions. OpenClaw can spawn sub-agents and run multiple sessions against the same backend. Per-request prefill rate divides roughly with concurrency, so the safe single-request ceiling above (~250K tokens at 120 s) drops to ~125K with two concurrent prefills, ~80K with three. Workloads that fit comfortably under the cap in isolation can fail under realistic load.
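The ceiling numbers above can be sketched with simple arithmetic (assuming, as the text does, that prefill throughput divides roughly evenly across concurrent requests — an approximation, not an OpenShell API):

```rust
/// Approximate largest prompt (in tokens) that finishes prefill before the
/// per-chunk idle timeout fires, assuming the measured single-request rate
/// divides evenly across `concurrency` simultaneous prefills.
/// Illustrative arithmetic only.
fn safe_prefill_ceiling(single_rate_tok_s: f64, timeout_secs: f64, concurrency: u32) -> f64 {
    (single_rate_tok_s / concurrency as f64) * timeout_secs
}

// With a representative ~2000 tok/s from the table above and the 120 s cap:
//   1 prefill  -> ~240K tokens
//   2 prefills -> ~120K tokens
//   3 prefills -> ~80K tokens
```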
In every case the upstream is healthy and would happily stream tokens given enough time; only the proxy's hardcoded ceiling is at fault. The 120 s value is a reasonable default but not a universal one — the right ceiling depends on the operator's hardware, model, and expected prompt sizes.
Proposed Design
Make the per-chunk idle timeout configurable per inference route, mirroring the existing timeout_secs plumbing.
Config plumbing. Add chunk_idle_timeout_secs: Option<u64> to RouteConfig / ResolvedRoute in openshell-router, alongside the existing timeout_secs. None = use the compiled-in default (currently 120 s); 0 = disable the per-chunk timeout entirely (consistent with how --timeout 0 already means "default" for total request timeout).
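A minimal sketch of the shape this field might take (illustrative only — the real RouteConfig in openshell-router has more fields than shown here):

```rust
/// Illustrative subset of a route config; only `chunk_idle_timeout_secs`
/// is new, the other field stands in for the existing plumbing.
#[derive(Debug, Clone, Default, PartialEq)]
struct RouteConfig {
    /// Existing: total request timeout (0 = default).
    timeout_secs: Option<u64>,
    /// Proposed: per-chunk idle timeout. None = compiled-in default
    /// (currently 120 s), Some(0) = per-chunk timeout disabled.
    chunk_idle_timeout_secs: Option<u64>,
}
```

An unset field deserializes to None, so configs written before this change resolve to today's behavior.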
CLI surface. Extend openshell inference set and openshell inference update with --chunk-idle-timeout <secs> (default 0 = unchanged/use default). Mirror the existing --timeout flag behavior. Show the resolved value in openshell inference get.
Proxy wiring. In route_inference_request (crates/openshell-sandbox/src/proxy.rs:1261), replace the constant with the route-resolved value:
```rust
let chunk_idle = route.chunk_idle_timeout.unwrap_or(DEFAULT_CHUNK_IDLE_TIMEOUT);
```
Keep DEFAULT_CHUNK_IDLE_TIMEOUT = 120s as the fallback when unset, so existing deployments behave identically.
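One way the full None/0 semantics from the config section could resolve at the proxy (function name and return shape are hypothetical, not the actual proxy.rs code):

```rust
use std::time::Duration;

/// Compiled-in default, matching today's behavior.
const DEFAULT_CHUNK_IDLE_TIMEOUT: Duration = Duration::from_secs(120);

/// Resolve the effective per-chunk idle timeout for a route: unset keeps
/// existing deployments identical, 0 opts out of the per-chunk check
/// entirely, and any other value overrides the default.
fn resolve_chunk_idle(configured_secs: Option<u64>) -> Option<Duration> {
    match configured_secs {
        None => Some(DEFAULT_CHUNK_IDLE_TIMEOUT),
        Some(0) => None, // --chunk-idle-timeout 0: per-chunk timeout disabled
        Some(s) => Some(Duration::from_secs(s)),
    }
}
```

Returning Option<Duration> lets the select/timeout branch in the stream loop be skipped cleanly when the timeout is disabled.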
Truncation event metadata. Include the resolved timeout value in the OCSF streaming response chunk idle timeout event so operators can see in logs which timeout fired.
Validation. Sanity-check the value at config load (e.g., reject < 5 s to avoid foot-guns; cap at some upper bound like 3600 s to avoid lock-up scenarios).
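The load-time sanity check could look like the following (bounds from the text above; the error type and function name are illustrative). Note that the explicit 0 sentinel must pass, since it means "disabled" rather than a 0-second timeout:

```rust
/// Sanity-check a configured chunk-idle timeout at config load.
/// Explicit values below 5 s are rejected as foot-guns and values above
/// 3600 s as lock-up risks; unset (None) and the 0 "disabled" sentinel
/// pass through unchanged.
fn validate_chunk_idle(secs: Option<u64>) -> Result<(), String> {
    match secs {
        None | Some(0) => Ok(()),
        Some(s) if s < 5 => Err(format!("chunk_idle_timeout_secs = {s}: below 5 s minimum")),
        Some(s) if s > 3600 => Err(format!("chunk_idle_timeout_secs = {s}: above 3600 s maximum")),
        Some(_) => Ok(()),
    }
}
```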
User-facing change is additive and backward-compatible.
Alternatives Considered
Just bump the constant again. No value satisfies all workloads — 600 s helps a 200K-token reprocess but penalizes operators on weak hardware who'd rather fail-fast on stuck streams. A constant always picks one operating point.
Derive from existing --timeout. Reusing route.timeout_secs as both total-request and per-chunk idle timeout would conflate two distinct safety properties: total lifetime (cap on runaway requests) vs. per-chunk liveness (catch a genuinely dead stream). Operators want to set them independently — e.g., 3600 s total, 300 s per-chunk idle.
Auto-tune from a backend probe. Inferring the right value from /v1/models or a warmup probe is too magical and brittle; it can't know what the operator's worst-case prompt looks like.
Disable the per-chunk timeout entirely for self-hosted backends. Removes a useful safety net that catches genuinely stuck streams (network partitions, hung backend processes). Should remain available as an opt-in (--chunk-idle-timeout 0) but not the default.
Agent Investigation
Reproduced and isolated with NemoClaw v0.0.10 + OpenShell v0.0.26 + local vLLM serving NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on an ASUS Ascent GX10 (NVIDIA GB10 GPU, 20-core Cortex-X925, aarch64; DGX Spark–compatible reference platform):
vLLM logs show every aborted request was killed exactly 30 s after acceptance; matches CHUNK_IDLE_TIMEOUT = 30s in proxy.rs:34 of v0.0.26.
Streaming curl from inside the sandbox through the L7 proxy with a 25K–30K random-token prompt: aborted at 30 s, vLLM logs the abort, client receives the format_sse_error("response truncated: chunk idle timeout exceeded") frame.
Same prompt sent direct host → vLLM (bypassing the sandbox): completed normally with no abort. Confirms the cap lives in openshell-sandbox, not in vLLM, the OpenAI SDK, undici, or OpenClaw's agents.defaults.llm.idleTimeoutSeconds (which is a client-side iterator timer that only fires after the proxy has already aborted).
Closed PR #834 ("fix(inference): prevent silent truncation of large streaming responses", commit 355d845d) already shows maintainers see the constant as a tuning knob ("Reasoning models … can pause for 60+ seconds … 120 s provides headroom"). This proposal is the next step: lift it from a compile-time constant to a per-route config.
Prefill benchmark methodology (numbers in Problem Statement table): non-streaming /v1/chat/completions requests sent direct to vLLM, max_tokens: 5 to make generation a negligible fraction of total time, each request prefixed with a fresh UUID to defeat prefix-cache hits. Verified usage.prompt_tokens_details.cached_tokens == 0 on every response. Cross-validated synthetic random-number prompts against real book text (Romeo & Juliet through Moby Dick and Chambers Dictionary): rates match within ~3 % at the same token sizes.
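The cache-busting request body described above can be sketched as follows (illustrative: the actual benchmark used fresh UUIDs rather than a timestamp nonce, and a real client should JSON-escape the prompt instead of interpolating it):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Build a non-streaming /v1/chat/completions body for a prefill probe:
/// a unique nonce prefixed to the prompt defeats vLLM prefix caching, and
/// max_tokens: 5 keeps generation a negligible fraction of elapsed time,
/// so wall-clock time ~= prefill time.
fn prefill_probe_body(model: &str, prompt: &str) -> String {
    let nonce = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    format!(
        r#"{{"model":"{model}","max_tokens":5,"messages":[{{"role":"user","content":"{nonce} {prompt}"}}]}}"#
    )
}
```

After each response, the benchmark checked usage.prompt_tokens_details.cached_tokens == 0 to confirm the nonce actually forced a full prefill.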
Full benchmark data (16 sweep points, random + book text)
| tokens | elapsed | tok/s | source |
|---|---|---|---|
| 15,974 | 6.04 s | 2647 | random |
| 31,850 | 12.19 s | 2614 | random |
| 42,216 | 16.82 s | 2510 | alice-in-wonderland |
| 44,881 | 17.95 s | 2501 | romeo-and-juliet |
| 63,696 | 26.09 s | 2441 | random |
| 71,105 | 29.97 s | 2373 | the-great-gatsby |
| 104,166 | 47.76 s | 2181 | frankenstein |
| 127,372 | 60.82 s | 2094 | random |
| 170,463 | 90.60 s | 1881 | wuthering-heights |
| 178,023 | 96.82 s | 1839 | pride-and-prejudice |
| 254,541 | 157.49 s | 1616 | random |
| 263,972 | 168.64 s | 1565 | jane-eyre |
| 317,863 | 218.20 s | 1457 | moby-dick |
| 326,075 | 226.87 s | 1437 | city-of-god-volume-i |
| 397,856 | 293.47 s | 1356 | random |
| 779,685 | 865.70 s | 901 | chambers-dictionary |
Checklist
- I've reviewed existing issues and the architecture docs
- This is a design proposal, not a "please build this" request