feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay by pimlock · Pull Request #867 · NVIDIA/OpenShell

pimlock · 2026-04-16T19:01:00Z

Summary

Introduces a persistent supervisor-to-gateway session (ConnectSupervisor) and migrates /connect/ssh and ExecSandbox onto relay channels that ride the session's HTTP/2 connection. Removes the requirement for direct gateway→sandbox network connectivity.

Two-plane design, one TCP+TLS connection per sandbox:

Control plane: ConnectSupervisor bidirectional gRPC stream — session lifecycle (hello, heartbeat, accept/reject) and relay lifecycle (RelayOpen, RelayOpenResult, RelayClose).
Data plane: RelayStream bidirectional gRPC RPC — one per relay, multiplexed as a separate HTTP/2 stream on the same connection. First RelayFrame carries a typed RelayInit { channel_id } to match a pending-relay slot; subsequent frames carry raw bytes.

The supervisor stays a dumb byte bridge with no SSH protocol awareness.
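The first-frame handshake on the data plane can be sketched as follows. This is a minimal illustration, not the generated prost types: the `RelayFrame` enum, `HandshakeError`, and `claim_from_first_frame` are hypothetical names, and the pending-relay map uses a placeholder value type where the real gateway presumably holds a stream half.

```rust
use std::collections::HashMap;

/// Hypothetical Rust mirror of the `RelayFrame` oneof described above.
#[derive(Debug, Clone, PartialEq)]
pub enum RelayFrame {
    /// First frame from the supervisor: names the pending relay to claim.
    Init { channel_id: String },
    /// Every later frame, in either direction: opaque bytes.
    Data(Vec<u8>),
}

#[derive(Debug, PartialEq)]
pub enum HandshakeError {
    /// The stream ended before any frame arrived.
    EmptyStream,
    /// The first frame carried data instead of an init message.
    FirstFrameNotInit,
    /// No pending relay slot matches the channel id.
    UnknownChannel(String),
}

/// Consume the first frame of a relay stream and claim the matching slot.
/// `pending` stands in for the gateway's pending-relay map (assumption).
pub fn claim_from_first_frame<I>(
    frames: &mut I,
    pending: &mut HashMap<String, &'static str>,
) -> Result<&'static str, HandshakeError>
where
    I: Iterator<Item = RelayFrame>,
{
    match frames.next() {
        None => Err(HandshakeError::EmptyStream),
        Some(RelayFrame::Data(_)) => Err(HandshakeError::FirstFrameNotInit),
        Some(RelayFrame::Init { channel_id }) => pending
            .remove(&channel_id)
            .ok_or(HandshakeError::UnknownChannel(channel_id)),
    }
}
```

Only the init frame is consumed; the remaining frames stay on the iterator for the byte bridge, which is what lets the supervisor remain protocol-unaware.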

Removes ResolveSandboxEndpoint from the proto, gateway, and K8s driver — no code path now dials the sandbox directly for connect or exec.

Closes OS-86. Design: Plan. Supersedes (and closes) #861.

History

The initial approach (#861) used reverse HTTP CONNECT tunnels for the data plane — one new TCP+TLS handshake per relay. This PR replaces that with a RelayStream gRPC RPC that rides the existing supervisor session connection as a new HTTP/2 stream. Both approaches were benchmarked side-by-side on nemoclaw; after tuning HTTP/2 flow-control windows, throughput and latency are within noise, and the gRPC path wins decisively on the architectural metric (supervisor→gateway TCP count during a 50-relay storm: 3 vs 53). See the perf comments inline on this PR for the full numbers.

Why

One TLS handshake per sandbox instead of one per relay — every sandbox connect / ExecSandbox saves one RTT + crypto cost.
One supervisor→gateway TCP instead of 1 + N — fewer file descriptors, simpler firewall/LB story.
Less code and fewer deps — no relay.rs, no reverse HTTP CONNECT plumbing. Drops hyper, hyper-util, http, http-body-util from openshell-sandbox.
Auth and observability reuse — mTLS identity, tracing, gRPC status codes, and keepalive all inherited from the control channel.

Changes

proto: new RelayStream(stream RelayFrame) returns (stream RelayFrame) RPC alongside ConnectSupervisor. RelayFrame is a oneof { RelayInit init | bytes data } — the first frame from the supervisor must be Init; subsequent frames (both directions) carry data. Remove ResolveSandboxEndpoint and its request/response messages.
server: handle_relay_stream reads channel_id from the first Init frame, claims the pending relay slot (same SupervisorSessionRegistry path as before), and bridges the gRPC stream to a DuplexStream in 16 KiB reads.
sandbox: on RelayOpen, opens a RelayStream on the existing Channel and bridges the local SSH Unix socket. openshell-sandbox loses ~200 lines of TLS + HTTP CONNECT plumbing and the entire NSSH1 preface path.
SSH daemon → Unix socket: supervisor SSH daemon listens on a filesystem path (sandbox_ssh_socket_path, default /run/openshell/ssh.sock) with 0700 parent / 0600 socket perms. Removes port 2222, the NSSH1 HMAC handshake, and nonce replay detection — filesystem permissions are the access control boundary now.
HTTP/2 flow control: adaptive window tuning (adaptive_window(true)) on both the gateway-side builder and the sandbox-side Endpoint so bulk transfers aren't throttled by the 64 KiB defaults.
session registry hardening: session_id-based remove_if_current to survive a supersede race, spawn_relay_reaper to reap pending relay entries a supervisor never claimed, per-caller session-wait timeouts (30 s for SSH connect's cold start, 15 s for ExecSandbox steady state).
OCSF telemetry: sandbox-side NetworkActivityBuilder events for supervisor session open/close/fail and relay open/close/fail (7 event shapes, extracted to pure builder fns with 10 unit tests).
Client-side SSH keepalives: generated ssh-config and direct ssh invocations carry ServerAliveInterval=15 / ServerAliveCountMax=3 so in-flight sessions detect a silently-dropped relay (gateway or supervisor restart) in ~45 s instead of hanging indefinitely.
tests: 12 registry unit tests + 5 relay gRPC integration tests (tests/supervisor_relay_integration.rs) + 10 OCSF event-shape tests, plus live-cluster verification (SSH, SFTP/scp, ssh -L, 3 concurrent sessions, gateway-restart recovery, supervisor-restart recovery).

Security note

This change moves SSH/exec data flow onto the supervisor gRPC path. That path is not yet bound to a per-sandbox transport identity, so the gateway cannot fully enforce caller identity == target sandbox at the RPC boundary today.

As a result, this branch inherits the existing weakness in sandbox-originated RPC identity and applies it to a more privileged path. The intended fix is proper per-sandbox identity via sandbox-specific mTLS work in OS-109, rather than introducing another temporary authentication mechanism in this PR.

This is a conscious tradeoff to minimize churn while the transport identity model is being replaced.

Performance

Benchmarked side-by-side on nemoclaw, same cluster, same script (architecture/plans/relay-bench.sh), 15 iterations per latency metric, 50 concurrent relays for the storm. See the perf comments on this PR for all three runs (HTTP CONNECT baseline, gRPC with default windows, gRPC with adaptive windows).

Headline:

Supervisor→gateway TCPs during 50-relay storm: HTTP = 53, gRPC = 3.
Bulk 50 MiB throughput: HTTP 567 Mbps, gRPC (adaptive) 478 Mbps — within a reasonable operational band. Fixed 2/4 MiB windows matched or beat HTTP (595 Mbps) if we want absolute parity, at the cost of a predictable memory ceiling per connection.
Connect / exec latency: tied within noise.
Rapid serial churn (50 back-to-back exec -- true): gRPC ~16 % slower, due to per-RPC overhead (fresh RelayStream + SSH session per exec). Addressable later by direct exec via SupervisorExec — tracked as OS-91.

Follow-ups (tracked separately)

Out of scope for this PR; filed as issues so the PR stays focused on the transport migration.

OS-92 phase 1 — hook supervisor-session cleanup into compute::cleanup_sandbox_state so sandbox delete proactively tears down the registry entry (today it's cleaned up lazily via stream death).
OS-92 phase 3 — surface session connect/disconnect on GetSandbox / WatchSandbox.
OS-91 — direct SupervisorExec RPC bypassing SSH to recover the ~16 % per-exec overhead on rapid serial churn.
OS-102 — drop the now-dead ssh_handshake_secret / ssh_handshake_skew_secs plumbing across 7 crates + bootstrap now that NSSH1 is gone.
OS-109 — introduce proper per-sandbox transport identity so sandbox-originated RPCs are bound to the calling sandbox instead of shared client identity.
Perf: compare adaptive_window vs a fixed 2 MiB / 4 MiB window in a WAN scenario before committing to the default.
Perf: evaluate swapping Vec<u8> for prost::bytes::Bytes in RelayFrame::data to recover the remaining per-chunk copy cost (~30 ms win on exec -- true).
Gateway-side OCSF: adding GatewayContext + emits on the server side (the OCSF crate is currently sandbox-shaped).

Testing

Checklist

Conforms to Conventional Commits
No new secrets or credentials
Scope limited to the connect/exec transport migration

…on relay

Introduce a persistent supervisor-to-gateway session (ConnectSupervisor bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto relay channels coordinated through it.

Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP CONNECT to /relay/{channel_id} on the gateway; the gateway bridges the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness; the gateway sends the NSSH1 preface through the relay

Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver

Closes OS-86

When a sandbox first reports Ready, the supervisor session may not have completed its gRPC handshake yet. Instead of failing immediately with 502 / "supervisor session not connected", the relay open now retries with exponential backoff (100ms → 2s) for up to 15 seconds. This fixes the race between K8s marking the pod Ready and the supervisor establishing its ConnectSupervisor session.
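The retry schedule described above (doubling from 100 ms, capped at 2 s, bounded by a 15 s budget) can be sketched as a pure function. `backoff_schedule` is a hypothetical helper written for illustration; whether the real wait loop adds jitter or checks the deadline differently is not stated in this PR.

```rust
use std::time::Duration;

/// Delays between session lookups: start at `initial`, double after each
/// attempt, cap at `max`, and stop before the cumulative wait would exceed
/// `budget`. Hypothetical helper, not the PR's actual wait loop.
pub fn backoff_schedule(initial: Duration, max: Duration, budget: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    // A zero initial delay would never make progress toward the budget.
    if initial.is_zero() {
        return delays;
    }
    let mut next = initial;
    let mut waited = Duration::ZERO;
    while waited + next <= budget {
        delays.push(next);
        waited += next;
        next = (next * 2).min(max);
    }
    delays
}
```

With the numbers from this commit (100 ms, 2 s, 15 s) the schedule is 100, 200, 400, 800, 1600 ms followed by five 2 s sleeps: ten sleeps totalling 13.1 s, so the cap is reached after the fifth attempt and most of the budget is spent polling every 2 s.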

Three related changes:

1. Fold the session-wait into `open_relay` itself via a new `wait_for_session` helper with exponential backoff (100ms → 2s). Callers pass an explicit `session_wait_timeout`:
   - SSH connect uses 30s: it typically runs right after `sandbox create`, so the timeout has to cover a cold supervisor's TLS + gRPC handshake.
   - ExecSandbox uses 15s: during normal operation it only needs to cover a transient supervisor reconnect window.

   This covers both the startup race (pod Ready before the supervisor's ConnectSupervisor stream is up) and mid-lifetime reconnects after a network blip or gateway/supervisor restart; both look identical to the caller.

2. Fix a supersede cleanup race. `LiveSession` now tracks a `session_id`, and `remove_if_current(sandbox_id, session_id)` only evicts when the registered entry still matches. Previously an old session's cleanup could run after a reconnect had already registered the new session, unconditionally removing the live registration.

3. Wire up `spawn_relay_reaper` alongside the existing SSH session reaper so expired pending relay entries (supervisor acknowledged RelayOpen but never opened the reverse CONNECT) are swept every 30s instead of leaking until someone tries to claim them.

Adds 12 unit tests covering: open_relay happy path, timeout, mid-wait session appearance, closed-receiver failure, supersede routing; claim_relay unknown/expired/receiver-dropped/round-trip; and the remove_if_current cleanup-race regression.
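The supersede race in the second change can be illustrated with a small in-memory registry. This is a sketch under assumptions: `Registry` and `current` are hypothetical, and `session_id` is modelled as a `u64` even though the real field type is not shown in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the registry's per-sandbox entry.
#[derive(Debug, Clone, PartialEq)]
pub struct LiveSession {
    pub session_id: u64, // assumption: the real id type is not shown here
}

#[derive(Default)]
pub struct Registry {
    sessions: HashMap<String, LiveSession>,
}

impl Registry {
    /// Register a session, superseding any previous one for the sandbox.
    pub fn register(&mut self, sandbox_id: &str, session_id: u64) -> Option<LiveSession> {
        self.sessions
            .insert(sandbox_id.to_string(), LiveSession { session_id })
    }

    /// Evict only if the registered entry still belongs to `session_id`.
    /// Returns true when an entry was removed.
    pub fn remove_if_current(&mut self, sandbox_id: &str, session_id: u64) -> bool {
        let is_current = self
            .sessions
            .get(sandbox_id)
            .map_or(false, |live| live.session_id == session_id);
        if is_current {
            self.sessions.remove(sandbox_id);
        }
        is_current
    }

    /// Session id currently registered for the sandbox, if any.
    pub fn current(&self, sandbox_id: &str) -> Option<u64> {
        self.sessions.get(sandbox_id).map(|s| s.session_id)
    }
}
```

If session 1 is superseded by session 2 and session 1's cleanup then runs late, `remove_if_current("sb", 1)` returns false and the live registration for session 2 survives, which is the regression the unit test guards against.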

Replace the supervisor's reverse HTTP CONNECT data plane with a new `RelayStream` gRPC RPC. Each relay now rides the supervisor's existing `ConnectSupervisor` TCP+TLS+HTTP/2 connection as a new HTTP/2 stream, multiplexed natively. Removes one TLS handshake per SSH/exec session.

- proto: add `RelayStream(stream RelayChunk) returns (stream RelayChunk)`; the first chunk from the supervisor carries `channel_id` and no data, matching the existing RelayOpen channel_id. Subsequent chunks are bytes-only; leaving channel_id off data frames avoids a ~36 B per-frame tax that would hurt interactive SSH.
- server: add `handle_relay_stream` alongside `handle_connect_supervisor`. It reads the first RelayChunk for channel_id, claims the pending relay (same `SupervisorSessionRegistry::claim_relay` path as before, returning a `DuplexStream` half), then bridges that half ↔ the gRPC stream via two tasks (16 KiB chunks). Delete `relay.rs` and its `/relay/{channel_id}` HTTP endpoint.
- sandbox: on `RelayOpen`, open a `RelayStream` RPC on the existing `Channel`, send `RelayChunk { channel_id, data: [] }` as the first frame, then bridge the local SSH socket. Drop `open_reverse_connect`, `send_connect_request`, `connect_tls`, and the `hyper`, `hyper-util`, `http`, `http-body-util` deps that existed solely for the reverse CONNECT.
- tests: add `RelayStreamStream` type alias and `relay_stream` stub to the seven `OpenShell` mock impls in server + CLI integration tests.

The registry shape (pending_relays, claim_relay, RelayOpen control message, DuplexStream bridging) is unchanged, so the existing session-wait / supersede / reaper hardening on feat/supervisor-session-relay carries over intact.
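As a rough illustration of the 16 KiB framing, the sketch below splits an in-memory buffer into data-frame payloads. It is a simplification: the bridge described above reads from a stream rather than a buffer, and `RELAY_CHUNK_BYTES` and `split_into_payloads` are hypothetical names.

```rust
/// Chunk size named in the commit message (16 KiB). Hypothetical constant name.
pub const RELAY_CHUNK_BYTES: usize = 16 * 1024;

/// Split one buffer into data-frame payloads of at most `RELAY_CHUNK_BYTES`.
/// Only the first frame of a relay carries the channel id, so these payloads
/// are raw bytes with no per-frame identifier attached.
pub fn split_into_payloads(buf: &[u8]) -> Vec<Vec<u8>> {
    buf.chunks(RELAY_CHUNK_BYTES).map(|c| c.to_vec()).collect()
}
```

A 40 KiB buffer becomes three payloads of 16 KiB, 16 KiB, and 8 KiB; an empty buffer yields no payloads at all.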

… plane

Default h2 initial windows are 64 KiB per stream and 64 KiB per connection. That throttles a single RelayStream SSH tunnel to ~500 Mbps on LAN, roughly 35% below the raw HTTP CONNECT baseline measured on `nemoclaw`.

Bump both server (hyper-util auto::Builder via multiplex.rs) and client (tonic Endpoint in openshell-sandbox/grpc_client.rs) windows to 2 MiB / 4 MiB. This is the window size at which bulk throughput on a 50 MiB transfer matches the reverse HTTP CONNECT path.

The numbers apply only to the RelayStream data plane in this branch; ConnectSupervisor and all other RPCs benefit too but are low-rate.

pimlock · 2026-04-16T19:57:31Z

Round 3 — adaptive vs fixed windows

Swapped the fixed 2 MiB / 4 MiB windows for `adaptive_window(true)` on both sides in `1ec551a6` and reran the bench.

Metric	HTTP CONNECT	gRPC default (64 KiB)	gRPC fixed (2/4 MiB)	gRPC adaptive
Exec latency p50	0.279 s	0.304 s	0.308 s	0.313 s
Connect latency p50	0.235 s	0.260 s	0.268 s	0.270 s
Bulk 50 MiB	567 Mbps	395 Mbps	595 Mbps	478 Mbps
Small-frame 10k	0.244 s	0.320 s	0.264 s	0.271 s
20× parallel zero-sleep	0.52 s	0.55 s	0.48 s	0.56 s
50-relay storm	4.01 s	4.37 s	3.93 s	3.96 s
Rapid serial churn (50×)	13.2 s	15.3 s	15.3 s	16.1 s
Non-loopback TCPs (50-storm)	53	3	3	3

What adaptive bought us

Unthrottles the 64 KiB default — bulk goes from 395 to 478 Mbps (+21 %).
Zero configuration constants — no fixed budget, memory footprint sized by measured BDP.
The architectural win is unchanged — still 3 non-loopback TCPs during a 50-relay storm.

Where adaptive loses to fixed 2/4 MiB

Bulk throughput: 478 vs 595 Mbps (~20 % slower). Expected on a low-RTT LAN — adaptive sizes windows from measured bandwidth × delay, and delay is essentially zero here. The fixed 2/4 MiB committed enough headroom that the TCP pipe could fill; adaptive runs tighter.
Latency / concurrency / storm — all within noise of fixed.
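The bandwidth × delay reasoning in the list above follows from a general flow-control bound: a stream can have at most one window of unacknowledged data in flight per round trip. The sketch below puts numbers on that bound. The function names are hypothetical, and the 1 ms round-trip time used in the note is an illustrative assumption, since this PR does not report the measured RTT.

```rust
/// Upper bound on per-stream throughput (Mbit/s) imposed by flow control:
/// at most `window_bytes` can be in flight per round trip.
pub fn window_limited_mbps(window_bytes: u64, rtt_micros: u64) -> f64 {
    let bits = (window_bytes * 8) as f64;
    let rtt_secs = rtt_micros as f64 / 1_000_000.0;
    bits / rtt_secs / 1_000_000.0
}

/// Window (bytes) needed to keep a link of `mbps` busy at the given RTT,
/// i.e. the bandwidth-delay product.
pub fn bdp_bytes(mbps: f64, rtt_micros: u64) -> f64 {
    let bytes_per_sec = mbps * 1_000_000.0 / 8.0;
    bytes_per_sec * (rtt_micros as f64 / 1_000_000.0)
}
```

Under the assumed 1 ms RTT, a 64 KiB window caps a stream at about 524 Mbit/s, and sustaining 595 Mbit/s needs roughly 74 KB in flight. A fixed 2 MiB window is far above that, while a window sized from a measured near-zero delay can land much closer to the limit.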

Recommendation

On this LAN, fixed 2/4 MiB gives the best numbers. Adaptive is the safer default for mixed / unknown network conditions (WAN clients, variable RTTs) and avoids the "pick a number" debate, at a ~20 % bulk-throughput cost.

I'd lean fixed 2/4 MiB for production — the worst-case memory (max_concurrent_streams × stream_window ≈ 200 MiB per connection) is bounded and the throughput headroom is real. If we ever see pathological memory usage, adaptive is a one-line revert.

Full numbers in `architecture/plans/perf-grpc-adaptive.txt`, comparison table in `architecture/plans/perf-comparison.md`.

drew · 2026-04-17T03:56:30Z

This looks good to me

…, drop NSSH1

The embedded SSH daemon in openshell-sandbox no longer listens on a TCP port. Instead it binds a root-owned Unix socket (default /run/openshell/ssh.sock, 0700 parent dir, 0600 socket). The supervisor's relay bridge connects to that socket instead of 127.0.0.1:2222.

With the socket gated by filesystem permissions, the NSSH1 HMAC preface is redundant and has been removed:
- openshell-sandbox: drop `verify_preface`, `hmac_sha256`, the nonce cache and reaper, and the preface read/write on every SSH accept. `run_ssh_server` takes a `PathBuf` and uses `UnixListener`.
- openshell-server/ssh_tunnel: remove the NSSH1 write + response read before bridging the client's upgraded CONNECT stream; the relay is now bridged immediately.
- openshell-server/grpc/sandbox: same cleanup in the exec-path relay proxy. `stream_exec_over_relay` and `start_single_use_ssh_proxy_over_relay` stop taking a `handshake_secret`.
- openshell-server lib: the K8s driver is now configured with the socket path ("/run/openshell/ssh.sock") instead of "0.0.0.0:2222".
- Parent directory of the socket is created with 0700 root:root by the supervisor at startup to keep the sandbox entrypoint user out.

`ssh_handshake_secret` is still accepted on the CLI / env for backwards compatibility but is no longer used for SSH.

Adds `sandbox_ssh_socket_path` to `Config` (default `/run/openshell/ssh.sock`). The K8s driver is now wired with the configured value instead of a hard-coded path. K8s and VM drivers already isolate the socket via per-pod / per-VM filesystems, so the default is safe there. This makes it easy to override in local dev when multiple supervisors share a filesystem, matching the prior `OPENSHELL_SSH_LISTEN_ADDR` knob on the supervisor side.
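The socket permission scheme (0700 parent directory, 0600 socket) can be sketched with the standard library alone. `bind_private_socket` is a hypothetical helper, not the supervisor's code: it does not set root:root ownership, and it tightens the socket mode after `bind`, which leaves a brief window that the 0700 parent directory is what closes.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Bind a Unix socket at `path` with a 0700 parent directory and a 0600
/// socket, removing any stale socket file left by a previous run.
/// Hypothetical helper; ownership changes are out of scope for this sketch.
pub fn bind_private_socket(path: &Path) -> io::Result<UnixListener> {
    let parent = path.parent().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "socket path has no parent")
    })?;
    // Lock the directory down first so the socket is never reachable by
    // other users, even before its own mode is tightened.
    fs::create_dir_all(parent)?;
    fs::set_permissions(parent, fs::Permissions::from_mode(0o700))?;

    // A leftover socket file would make bind fail with AddrInUse.
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    let listener = UnixListener::bind(path)?;
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

In local development, where several supervisors may share a filesystem, each one would call this with its own configured path rather than the shared default.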

Adds tests/supervisor_relay_integration.rs covering the RelayStream wire contract, handshake frame, bridging, and claim timing. Five cases: happy-path echo, gateway drop, supervisor drop, no-session timeout, and concurrent multiplexed relays on one session. Narrows handle_relay_stream to take &SupervisorSessionRegistry directly so the test can exercise the real handler without standing up a full ServerState. Adds register_for_test for the same reason.

…ents

Emits NetworkActivity events for session open/close/fail and relay open/close/fail from the sandbox side. Keeps plain tracing for internal plumbing (SSH socket connect, gateway stream close observation). Event shapes are extracted into pure builder fns so unit tests can assert activity/severity/status without wiring up a tracing subscriber. Gateway endpoint is parsed into host + port for dst_endpoint.

Adds ServerAliveInterval=15 and ServerAliveCountMax=3 to both the rendered ssh-config block and the direct ssh invocation used by `openshell sandbox connect`. Without this, a client-side SSH session hangs indefinitely when the gateway or supervisor dies mid-session: the relay transport's TCP connection can't signal EOF to the client because the peer process is gone, not cleanly closing. Detection now takes ~45s instead of the TCP keepalive default of 2 hours. Verified on a live cluster by deleting the gateway pod and the sandbox pod mid-session — SSH exits with "Broken pipe" after one missed ServerAlive reply.
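The two keepalive settings and the ~45 s figure relate by simple multiplication, which the sketch below makes explicit. The function and constant names are hypothetical and do not mirror the CLI's real rendering code; the option spellings (`-o Key=Value` on the command line, `Key Value` in ssh-config) are standard OpenSSH.

```rust
/// Client keepalive settings named in this commit.
pub const SERVER_ALIVE_INTERVAL_SECS: u32 = 15;
pub const SERVER_ALIVE_COUNT_MAX: u32 = 3;

/// `-o` arguments for a direct `ssh` invocation (hypothetical helper).
pub fn keepalive_args() -> Vec<String> {
    vec![
        "-o".to_string(),
        format!("ServerAliveInterval={}", SERVER_ALIVE_INTERVAL_SECS),
        "-o".to_string(),
        format!("ServerAliveCountMax={}", SERVER_ALIVE_COUNT_MAX),
    ]
}

/// Lines for a generated ssh-config `Host` block (hypothetical helper).
pub fn keepalive_config_lines() -> Vec<String> {
    vec![
        format!("ServerAliveInterval {}", SERVER_ALIVE_INTERVAL_SECS),
        format!("ServerAliveCountMax {}", SERVER_ALIVE_COUNT_MAX),
    ]
}

/// Upper bound on dead-peer detection: the client gives up after
/// `count_max` consecutive unanswered probes sent `interval` apart.
pub fn detection_bound_secs() -> u32 {
    SERVER_ALIVE_INTERVAL_SECS * SERVER_ALIVE_COUNT_MAX
}
```

The 45 s value is the bound implied by the settings when probes simply go unanswered; a session can end sooner if the first probe provokes an immediate transport error.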

pimlock · 2026-04-17T21:37:59Z

Live-cluster testing findings

Ran the unchecked items from the Testing section on nemoclaw with the merged branch (including the Unix-socket + NSSH1-removal changes).

Pass

#	Test	Observation
1	SFTP/scp through relay	scp 512 KiB upload + sftp download, sha256 round-trip matches
2	SSH port forwarding	`ssh -L 19090:localhost:18080` to `python3 -m http.server` inside sandbox; `curl` through the tunnel returns expected body
3	Concurrent SSH sessions on one supervisor session	3 parallel SSH sessions, each 4 s sleep — all complete successfully in ~4 s (not 12). Confirms HTTP/2 multiplexing over the one supervisor session

Pass after a fix

Tests 4 and 5 initially exposed a client-side hang: when the gateway or supervisor disappears mid-session, the in-flight SSH client stalls indefinitely because the relay transport's TCP socket can't signal EOF (peer process is gone, not cleanly closing).

Fixed in 4bd88f56 by adding ServerAliveInterval=15 / ServerAliveCountMax=3 to both the generated ssh-config block and the openshell sandbox connect direct ssh invocation. SSH-level keepalives make the session detect the dead relay within ~45 s and exit cleanly with Broken pipe.

#	Test	Before fix	After fix
4	Gateway restart mid-session	client hangs forever after tick-6	exits in ~17 s with `Broken pipe`; recovery SSH works immediately
5	Supervisor restart mid-relay (pod delete)	same hang	exits in ~17 s with `Broken pipe`; recovery SSH works immediately

Notes for reviewers

I couldn't test the literal "kill -9 1" case — PID-namespace init protection drops SIGKILL silently. Pod delete is semantically equivalent (supervisor terminates, k8s recreates) and is what I ran.
The ~17 s detection time is one 15 s keepalive interval plus ~2 s for SSH's internal timeout — inside operator expectations, tunable via ServerAliveInterval.
3 non-loopback outbound TCPs observed on the supervisor during the concurrent-SSH test — those are the separate sandbox clients (policy fetch, inference bundle refresh, supervisor session), not a reflection of the SSH session count. The session-multiplexing claim is confirmed by the parallel completion timing.

All Testing boxes on this PR are now checked.

The RPC was used by the direct gateway→sandbox SSH/exec path, which is gone — connect/ssh and ExecSandbox both ride the supervisor session relay now. Removes the RPC, SandboxEndpoint/ResolveSandboxEndpoint* messages, and the now-dead ssh_port / sandbox_ssh_port config fields across openshell-core, openshell-server, openshell-driver-kubernetes, and openshell-driver-vm. The k8s driver's standalone binary also stops synthesizing a TCP listen address ("0.0.0.0:<port>") and reads the Unix socket path directly from OPENSHELL_SANDBOX_SSH_SOCKET_PATH.

…rename ssh-listen-addr → ssh-socket-path

Renames the sandbox binary's `--ssh-listen-addr` / `OPENSHELL_SSH_LISTEN_ADDR` / `ssh_listen_addr` to `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH` / `ssh_socket_path` so the flag name matches its sole accepted form (a Unix socket filesystem path) after the supervisor-initiated relay migration.

Migrates the VM compute driver to the same supervisor-initiated model used by the K8s driver: the in-guest sandbox now binds `/run/openshell/ssh.sock` and opens its own outbound `ConnectSupervisor` session to the gateway, so the host→guest SSH port-forward is no longer needed. Drops `--vm-port` plumbing, the `ssh_port` allocation path, the `port_is_ready` TCP probe, and the now-unused `GUEST_SSH_PORT` import from `driver.rs`. Readiness falls back to the existing console-log marker from `guest_ssh_ready`.

Remaining `ssh_port` / `GUEST_SSH_PORT` residue in `openshell-driver-vm/src/runtime.rs` (gvproxy port-mapping plan) is dead but left for OS-102, which already covers NSSH1/handshake plumbing removal across crates.

…p historical prose

Updates `sandbox-connect.md`, `gateway.md`, `sandbox.md`, `gateway-security.md`, and `system-architecture.md` to describe the current supervisor-initiated model forward-facing: two-plane `ConnectSupervisor` + `RelayStream` design, the registry's `open_relay` / `claim_relay` / reaper behaviour, Unix-socket sshd access control, and the sandbox-side OCSF event surface.

Strips historical framing that describes what was removed (the "Earlier designs..." paragraph, the "Historical: NSSH1 Handshake (removed)" subsection, retained-for-compat config/env table rows, and scattered "no longer X" prose) in favour of clean current-state descriptions.

Syncs env-var and flag names to the renamed `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH`.

Updates user-facing docs to match the connect/exec transport change:
- `docs/security/best-practices.mdx`: SSH tunnel section now describes traffic riding the sandbox's mTLS session (transport auth) plus a short-lived session token scoped to the sandbox (authorization), with the sandbox's sshd bound to a local Unix socket rather than a TCP port. Removes the stale mention of the NSSH1 HMAC handshake.
- `docs/observability/logging.mdx`: example OCSF shorthand lines for SSH:LISTEN / SSH:OPEN updated to reflect the current emit shape (no peer endpoint on the Unix-socket listener, no NSSH1 auth tag).

github-actions · 2026-04-18T02:39:01Z

🌿 Preview your docs: https://nvidia-preview-pr-867.docs.buildwithfern.com/openshell

Adds two `ResourceExhausted`-returning guards on `open_relay` to bound the `pending_relays` map against runaway or abusive callers:
- `MAX_PENDING_RELAYS = 256`: upper bound across all sandboxes. Caps the memory a caller can pin by calling `open_relay` in a loop while no supervisor ever claims (or the supervisor is hung).
- `MAX_PENDING_RELAYS_PER_SANDBOX = 32`: per-sandbox ceiling so one noisy tenant can't consume the entire global budget. Sits above the existing SSH-tunnel per-sandbox cap (20) so tunnel-specific limits still fire first for that caller.

Both checks and the `pending_relays` insert happen under a single lock hold so concurrent callers can't each observe "under the cap" and both insert past it. Adds a `sandbox_id` field on `PendingRelay` so the per-sandbox count is a single filter over the map without extra indexes.

Tests:
- Two unit tests in `supervisor_session.rs`: assert the global cap and the per-sandbox cap both return `ResourceExhausted` with the right message, and a cap-hit on one sandbox doesn't leak onto others.
- One integration test in `supervisor_relay_integration.rs`: bursts 64 concurrent `open_relay` calls at a single sandbox and asserts exactly 32 succeed, exactly 32 are rejected with the per-sandbox message, and a different sandbox still accepts new relays.

Reaper behaviour is unchanged; the cap makes the map bounded, so the existing `HashMap::retain` pass stays cheap under any load.
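The check-then-insert under one lock hold can be sketched as below. The two constants use the names and values from the commit message; everything else (`PendingRelays`, `try_insert`, the `OpenRelayError` enum standing in for a gRPC `ResourceExhausted` status, and `u64` channel ids) is an assumption made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub const MAX_PENDING_RELAYS: usize = 256;
pub const MAX_PENDING_RELAYS_PER_SANDBOX: usize = 32;

/// Stand-in for the gRPC `ResourceExhausted` status (assumption).
#[derive(Debug, PartialEq)]
pub enum OpenRelayError {
    GlobalCapReached,
    SandboxCapReached,
}

pub struct PendingRelay {
    pub sandbox_id: String,
}

#[derive(Default)]
pub struct PendingRelays {
    inner: Mutex<HashMap<u64, PendingRelay>>,
}

impl PendingRelays {
    /// Check both caps and insert while holding the lock once, so two
    /// concurrent callers cannot both pass the check and overshoot the cap.
    pub fn try_insert(&self, channel_id: u64, sandbox_id: &str) -> Result<(), OpenRelayError> {
        let mut map = self.inner.lock().unwrap();
        if map.len() >= MAX_PENDING_RELAYS {
            return Err(OpenRelayError::GlobalCapReached);
        }
        // The sandbox_id stored on each entry makes this a plain filter.
        let for_sandbox = map.values().filter(|p| p.sandbox_id == sandbox_id).count();
        if for_sandbox >= MAX_PENDING_RELAYS_PER_SANDBOX {
            return Err(OpenRelayError::SandboxCapReached);
        }
        map.insert(channel_id, PendingRelay { sandbox_id: sandbox_id.to_string() });
        Ok(())
    }

    /// Number of relays currently pending across all sandboxes.
    pub fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}
```

The per-sandbox count is a linear scan, which is acceptable here because the global cap bounds the map at 256 entries.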

pimlock added 4 commits April 15, 2026 20:28

style: apply cargo fmt to CLI test mocks

ebc72b1

pimlock mentioned this pull request Apr 16, 2026

feat(server,sandbox): move SSH connect and exec onto supervisor session relay #861

Closed

18 tasks

perf(server,sandbox): use adaptive HTTP/2 flow control instead of fixed windows

1ec551a

pimlock self-assigned this Apr 16, 2026

pimlock changed the base branch from feat/supervisor-session-relay to main April 17, 2026 16:05

pimlock changed the title ~~refactor(server,sandbox): move relay data plane onto HTTP/2 streams~~ feat(server,sandbox): move SSH connect and exec onto supervisor session relay Apr 17, 2026

pimlock changed the title ~~feat(server,sandbox): move SSH connect and exec onto supervisor session relay~~ feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay Apr 17, 2026

pimlock added 6 commits April 17, 2026 11:34

Merge branch 'main' into feat/supervisor-session-grpc-data

2dabbad

chore: exclude rfc/0002 from this PR (will land separately)

a0e8391

refactor(server,sandbox): use typed relay init frames

28427f6

pimlock force-pushed the feat/supervisor-session-grpc-data branch from f75f8ee to 3e8a245 Compare April 17, 2026 20:34

pimlock added 2 commits April 17, 2026 13:51

pimlock added the test:e2e Requires end-to-end coverage label Apr 17, 2026

style: cargo fmt

264ebb1

pimlock marked this pull request as ready for review April 17, 2026 23:17

pimlock requested a review from a team as a code owner April 17, 2026 23:17

pimlock mentioned this pull request Apr 17, 2026

test(sandbox): fix flaky arm64 procfs binary_path tests #881

Merged

7 tasks

pimlock added 2 commits April 17, 2026 17:13

Merge remote-tracking branch 'origin/main' into feat/supervisor-session-grpc-data

bcea46b

pimlock force-pushed the feat/supervisor-session-grpc-data branch from 9135762 to 7a850ae Compare April 18, 2026 03:07

pimlock wants to merge 22 commits into main from feat/supervisor-session-grpc-data
