feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867
…on relay
Introduce a persistent supervisor-to-gateway session (ConnectSupervisor
bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto
relay channels coordinated through it.
Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and
relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP
CONNECT to /relay/{channel_id} on the gateway; the gateway bridges
the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness;
the gateway sends the NSSH1 preface through the relay
Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with
pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct
TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver
Closes OS-86
When a sandbox first reports Ready, the supervisor session may not have completed its gRPC handshake yet. Instead of failing immediately with 502 / "supervisor session not connected", the relay open now retries with exponential backoff (100ms → 2s) for up to 15 seconds. This fixes the race between K8s marking the pod Ready and the supervisor establishing its ConnectSupervisor session.
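The retry timing described above can be sketched as a pure function. This is a minimal illustration, not the code in `open_relay`: the name `backoff_schedule` is hypothetical, and the doubling factor is an assumption (the text only gives the 100ms → 2s range and the 15 second budget).

```rust
use std::time::Duration;

/// Hypothetical helper: the delays slept between session lookups.
/// Starts at `initial`, doubles after every attempt (assumed factor),
/// caps at `max`, and never sleeps past `timeout` in total.
fn backoff_schedule(initial: Duration, max: Duration, timeout: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut next = initial;
    let mut elapsed = Duration::ZERO;
    while elapsed < timeout {
        // Clamp the final sleep so the total never exceeds the budget.
        let delay = next.min(timeout - elapsed);
        delays.push(delay);
        elapsed += delay;
        next = (next * 2).min(max);
    }
    delays
}
```

With 100 ms, 2 s and 15 s as inputs, the delay reaches the 2 s cap on the sixth attempt, so a supervisor that connects within the first second or two is picked up quickly while a slow one is polled at a steady 2 s.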
Three related changes:
1. Fold the session-wait into `open_relay` itself via a new `wait_for_session`
helper with exponential backoff (100ms → 2s). Callers pass an explicit
`session_wait_timeout`:
- SSH connect uses 30s — it typically runs right after `sandbox create`,
so the timeout has to cover a cold supervisor's TLS + gRPC handshake.
- ExecSandbox uses 15s — during normal operation it only needs to cover
a transient supervisor reconnect window.
This covers both the startup race (pod Ready before the supervisor's
ConnectSupervisor stream is up) and mid-lifetime reconnects after a
network blip or gateway/supervisor restart — both look identical to the
caller.
2. Fix a supersede cleanup race. `LiveSession` now tracks a `session_id`,
and `remove_if_current(sandbox_id, session_id)` only evicts when the
registered entry still matches. Previously an old session's cleanup
could run after a reconnect had already registered the new session,
unconditionally removing the live registration.
3. Wire up `spawn_relay_reaper` alongside the existing SSH session reaper
so expired pending relay entries (supervisor acknowledged RelayOpen but
never opened the reverse CONNECT) are swept every 30s instead of
leaking until someone tries to claim them.
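The sweep in item 3 might look roughly like the sketch below. The `created_at` field, the `reap_expired` name and the TTL parameter are assumptions for illustration; the text only states that expired pending entries are swept every 30s.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical pending-relay entry; the real struct carries more state.
struct PendingRelay {
    created_at: Instant,
}

/// Drop every pending relay older than `ttl` and report how many were reaped.
/// A periodic task would call this on each tick with `Instant::now()`.
fn reap_expired(pending: &mut HashMap<String, PendingRelay>, now: Instant, ttl: Duration) -> usize {
    let before = pending.len();
    pending.retain(|_, relay| now.saturating_duration_since(relay.created_at) < ttl);
    before - pending.len()
}
```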
Adds 12 unit tests covering: open_relay happy path, timeout, mid-wait
session appearance, closed-receiver failure, supersede routing; claim_relay
unknown/expired/receiver-dropped/round-trip; and the remove_if_current
cleanup-race regression.
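The supersede fix in item 2 comes down to a compare-before-remove. A minimal sketch, assuming a plain `HashMap` registry and a `u64` session id (the real `LiveSession` and `SupervisorSessionRegistry` types are richer, and the id type is not stated in this PR):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `LiveSession`.
struct LiveSession {
    session_id: u64,
}

#[derive(Default)]
struct Registry {
    sessions: HashMap<String, LiveSession>,
}

impl Registry {
    /// Register a session, superseding any previous one for the sandbox.
    fn register(&mut self, sandbox_id: &str, session_id: u64) -> Option<LiveSession> {
        self.sessions.insert(sandbox_id.to_owned(), LiveSession { session_id })
    }

    /// Evict only when the registered entry is still the caller's session,
    /// so a late cleanup from a superseded session becomes a no-op.
    fn remove_if_current(&mut self, sandbox_id: &str, session_id: u64) -> bool {
        match self.sessions.get(sandbox_id) {
            Some(live) if live.session_id == session_id => {
                self.sessions.remove(sandbox_id);
                true
            }
            _ => false,
        }
    }
}
```

If session 1 is superseded by session 2 and session 1's cleanup then runs, `remove_if_current("sb-1", 1)` returns `false` and the live registration survives.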
Replace the supervisor's reverse HTTP CONNECT data plane with a new
`RelayStream` gRPC RPC. Each relay now rides the supervisor's existing
`ConnectSupervisor` TCP+TLS+HTTP/2 connection as a new HTTP/2 stream,
multiplexed natively. Removes one TLS handshake per SSH/exec session.
- proto: add `RelayStream(stream RelayChunk) returns (stream RelayChunk)`;
the first chunk from the supervisor carries `channel_id` and no data,
matching the existing RelayOpen channel_id. Subsequent chunks are
bytes-only — leaving channel_id off data frames avoids a ~36 B
per-frame tax that would hurt interactive SSH.
- server: add `handle_relay_stream` alongside `handle_connect_supervisor`.
It reads the first RelayChunk for channel_id, claims the pending relay
(same `SupervisorSessionRegistry::claim_relay` path as before, returning
a `DuplexStream` half), then bridges that half ↔ the gRPC stream via
two tasks (16 KiB chunks). Delete `relay.rs` and its `/relay/{channel_id}`
HTTP endpoint.
- sandbox: on `RelayOpen`, open a `RelayStream` RPC on the existing
`Channel`, send `RelayChunk { channel_id, data: [] }` as the first frame,
then bridge the local SSH socket. Drop `open_reverse_connect`,
`send_connect_request`, `connect_tls`, and the `hyper`, `hyper-util`,
`http`, `http-body-util` deps that existed solely for the reverse CONNECT.
- tests: add `RelayStreamStream` type alias and `relay_stream` stub to the
seven `OpenShell` mock impls in server + CLI integration tests.
The registry shape (pending_relays, claim_relay, RelayOpen control message,
DuplexStream bridging) is unchanged, so the existing session-wait / supersede
/ reaper hardening on feat/supervisor-session-relay carries over intact.
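To make the framing concrete, here is a small sketch of the first-chunk handshake and the 16 KiB chunking described above. The struct mirrors the `RelayChunk { channel_id, data }` shape from the commit message, but the helper names and the strict rejection of a data-bearing first chunk are assumptions, not the generated prost/tonic code.

```rust
/// Chunk size used when bridging, per the commit message.
const CHUNK_SIZE: usize = 16 * 1024;

/// Hand-written stand-in for the generated `RelayChunk` message.
#[derive(Debug, Clone, PartialEq)]
struct RelayChunk {
    channel_id: String,
    data: Vec<u8>,
}

/// First frame from the supervisor: channel_id only, no payload.
fn handshake_chunk(channel_id: &str) -> RelayChunk {
    RelayChunk { channel_id: channel_id.to_owned(), data: Vec::new() }
}

/// Subsequent frames: bytes only, channel_id left empty to save per-frame overhead.
fn data_chunks(payload: &[u8]) -> Vec<RelayChunk> {
    payload
        .chunks(CHUNK_SIZE)
        .map(|part| RelayChunk { channel_id: String::new(), data: part.to_vec() })
        .collect()
}

/// Gateway side: extract the channel_id used to claim the pending relay.
/// Rejecting a first chunk that carries data is an assumed policy.
fn read_handshake(first: &RelayChunk) -> Result<&str, &'static str> {
    if first.channel_id.is_empty() {
        return Err("first chunk must carry channel_id");
    }
    if !first.data.is_empty() {
        return Err("first chunk must not carry data");
    }
    Ok(&first.channel_id)
}
```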
… plane
Default h2 initial windows are 64 KiB per stream and 64 KiB per connection. That throttles a single RelayStream SSH tunnel to ~500 Mbps on LAN, roughly 35% below the raw HTTP CONNECT baseline measured on `nemoclaw`. Bump both server (hyper-util auto::Builder via multiplex.rs) and client (tonic Endpoint in openshell-sandbox/grpc_client.rs) windows to 2 MiB / 4 MiB. This is the window size at which bulk throughput on a 50 MiB transfer matches the reverse HTTP CONNECT path. The numbers apply only to the RelayStream data plane in this branch; ConnectSupervisor and all other RPCs benefit too but are low-rate.
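The reason a small window caps throughput is that a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window / RTT. A back-of-the-envelope sketch follows; the 1 ms RTT used in the note is an assumption picked for illustration, since the measured RTT is not stated in this PR.

```rust
/// Upper bound on throughput (in Mbit/s) for a flow-controlled stream:
/// at most `window_bytes` can be in flight per round trip.
fn window_limited_mbps(window_bytes: u64, rtt_secs: f64) -> f64 {
    (window_bytes as f64 * 8.0) / rtt_secs / 1_000_000.0
}
```

Under that assumed 1 ms RTT, the default 65,535-byte window gives about 524 Mbit/s, the same order as the ~500 Mbps observed, while a 2 MiB stream window raises the bound to roughly 16.8 Gbit/s, so the window is no longer the bottleneck.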
Round 3: adaptive vs fixed windows
Swapped the fixed 2 MiB / 4 MiB windows for `adaptive_window(true)` on both sides in `1ec551a6` and reran the bench.
What adaptive bought us
Where adaptive loses to fixed 2/4 MiB
Recommendation
On this LAN, fixed 2/4 MiB gives the best numbers. Adaptive is the safer default for mixed / unknown network conditions (WAN clients, variable RTTs) and avoids the "pick a number" debate, at a ~20 % bulk-throughput cost. I'd lean fixed 2/4 MiB for production: the worst-case memory (max_concurrent_streams × stream_window ≈ 200 MiB per connection) is bounded and the throughput headroom is real. If we ever see pathological memory usage, adaptive is a one-line revert. Full numbers in `architecture/plans/perf-grpc-adaptive.txt`, comparison table in `architecture/plans/perf-comparison.md`.
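The memory bound quoted above is a simple product. As a sketch, assuming 100 concurrent streams (the value implied by 200 MiB / 2 MiB; the configured `max_concurrent_streams` is not stated here):

```rust
const MIB: u64 = 1024 * 1024;

/// Worst-case receive buffering on one connection if every stream
/// fills its flow-control window at the same time.
fn worst_case_buffer_bytes(max_concurrent_streams: u64, stream_window_bytes: u64) -> u64 {
    max_concurrent_streams * stream_window_bytes
}
```

`worst_case_buffer_bytes(100, 2 * MIB)` evaluates to 200 MiB, matching the figure in the comment.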
This looks good to me
…, drop NSSH1
The embedded SSH daemon in openshell-sandbox no longer listens on a TCP
port. Instead it binds a root-owned Unix socket (default
/run/openshell/ssh.sock, 0700 parent dir, 0600 socket). The supervisor's
relay bridge connects to that socket instead of 127.0.0.1:2222.
With the socket gated by filesystem permissions, the NSSH1 HMAC preface
is redundant and has been removed:
- openshell-sandbox: drop `verify_preface`, `hmac_sha256`, the nonce
cache and reaper, and the preface read/write on every SSH accept.
`run_ssh_server` takes a `PathBuf` and uses `UnixListener`.
- openshell-server/ssh_tunnel: remove the NSSH1 write + response read
before bridging the client's upgraded CONNECT stream; the relay is
now bridged immediately.
- openshell-server/grpc/sandbox: same cleanup in the exec-path relay
proxy. `stream_exec_over_relay` and `start_single_use_ssh_proxy_over_relay`
stop taking a `handshake_secret`.
- openshell-server lib: the K8s driver is now configured with the
socket path ("/run/openshell/ssh.sock") instead of "0.0.0.0:2222".
- Parent directory of the socket is created with 0700 root:root by the
supervisor at startup to keep the sandbox entrypoint user out.
`ssh_handshake_secret` is still accepted on the CLI / env for backwards
compatibility but is no longer used for SSH.
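The permission scheme above (0700 parent directory, 0600 socket) can be sketched with the standard library alone. This is an illustration, not the `run_ssh_server` implementation: `bind_private_socket` is a hypothetical name, the stale-socket cleanup is an assumed convenience, and the `root:root` ownership is not handled here because it depends on the process already running as root.

```rust
use std::fs::{self, DirBuilder, Permissions};
use std::io;
use std::os::unix::fs::{DirBuilderExt, PermissionsExt};
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Bind a Unix socket that only its owner can reach.
fn bind_private_socket(socket_path: &Path) -> io::Result<UnixListener> {
    let parent = socket_path
        .parent()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "socket path has no parent"))?;

    // Create the parent as 0700, then re-apply the mode in case it already existed.
    DirBuilder::new().recursive(true).mode(0o700).create(parent)?;
    fs::set_permissions(parent, Permissions::from_mode(0o700))?;

    // Remove a stale socket left by a previous run (assumed behaviour).
    match fs::remove_file(socket_path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }

    // The 0700 parent already blocks other users during the short window
    // between bind() and the chmod below.
    let listener = UnixListener::bind(socket_path)?;
    fs::set_permissions(socket_path, Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

Locking the parent directory first matters: the socket is created with umask-derived permissions, so the directory is what keeps other users out until the socket mode is tightened.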
Adds `sandbox_ssh_socket_path` to `Config` (default `/run/openshell/ssh.sock`). The K8s driver is now wired with the configured value instead of a hard-coded path. K8s and VM drivers already isolate the socket via per-pod / per-VM filesystems, so the default is safe there. This makes it easy to override in local dev when multiple supervisors share a filesystem, matching the prior `OPENSHELL_SSH_LISTEN_ADDR` knob on the supervisor side.
Adds tests/supervisor_relay_integration.rs covering the RelayStream wire contract, handshake frame, bridging, and claim timing. Five cases: happy-path echo, gateway drop, supervisor drop, no-session timeout, and concurrent multiplexed relays on one session. Narrows handle_relay_stream to take &SupervisorSessionRegistry directly so the test can exercise the real handler without standing up a full ServerState. Adds register_for_test for the same reason.
f75f8ee to 3e8a245
…ents
Emits NetworkActivity events for session open/close/fail and relay open/close/fail from the sandbox side. Keeps plain tracing for internal plumbing (SSH socket connect, gateway stream close observation). Event shapes are extracted into pure builder fns so unit tests can assert activity/severity/status without wiring up a tracing subscriber. Gateway endpoint is parsed into host + port for dst_endpoint.
Adds ServerAliveInterval=15 and ServerAliveCountMax=3 to both the rendered ssh-config block and the direct ssh invocation used by `openshell sandbox connect`. Without this, a client-side SSH session hangs indefinitely when the gateway or supervisor dies mid-session: the relay transport's TCP connection can't signal EOF to the client because the peer process is gone, not cleanly closing. Detection now takes ~45s instead of the TCP keepalive default of 2 hours. Verified on a live cluster by deleting the gateway pod and the sandbox pod mid-session — SSH exits with "Broken pipe" after one missed ServerAlive reply.
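The ~45s figure follows from the two options: the client probes every `ServerAliveInterval` seconds and gives up after `ServerAliveCountMax` unanswered probes. A sketch of how the arguments and the detection bound relate; the helper names are hypothetical and do not reflect how the CLI actually assembles its `ssh` invocation.

```rust
use std::time::Duration;

const SERVER_ALIVE_INTERVAL_SECS: u64 = 15;
const SERVER_ALIVE_COUNT_MAX: u64 = 3;

/// `-o` arguments carrying the keepalive settings for an `ssh` invocation.
fn keepalive_args() -> Vec<String> {
    vec![
        "-o".to_string(),
        format!("ServerAliveInterval={}", SERVER_ALIVE_INTERVAL_SECS),
        "-o".to_string(),
        format!("ServerAliveCountMax={}", SERVER_ALIVE_COUNT_MAX),
    ]
}

/// Approximate time for the client to notice a silently dead peer:
/// interval multiplied by the number of tolerated missed replies.
fn worst_case_detection() -> Duration {
    Duration::from_secs(SERVER_ALIVE_INTERVAL_SECS * SERVER_ALIVE_COUNT_MAX)
}
```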
Live-cluster testing findings
Ran the unchecked items from the Testing section on …
Pass
Pass after a fix
Tests 4 and 5 initially exposed a client-side hang: when the gateway or supervisor disappears mid-session, the in-flight SSH client stalls indefinitely because the relay transport's TCP socket can't signal EOF (peer process is gone, not cleanly closing). Fixed in …
Notes for reviewers
All Testing boxes on this PR are now checked.
The RPC was used by the direct gateway→sandbox SSH/exec path, which is
gone — connect/ssh and ExecSandbox both ride the supervisor session
relay now. Removes the RPC, SandboxEndpoint/ResolveSandboxEndpoint*
messages, and the now-dead ssh_port / sandbox_ssh_port config fields
across openshell-core, openshell-server, openshell-driver-kubernetes,
and openshell-driver-vm.
The k8s driver's standalone binary also stops synthesizing a TCP
listen address ("0.0.0.0:<port>") and reads the Unix socket path
directly from OPENSHELL_SANDBOX_SSH_SOCKET_PATH.
…rename ssh-listen-addr → ssh-socket-path
Renames the sandbox binary's `--ssh-listen-addr` / `OPENSHELL_SSH_LISTEN_ADDR` / `ssh_listen_addr` to `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH` / `ssh_socket_path` so the flag name matches its sole accepted form (a Unix socket filesystem path) after the supervisor-initiated relay migration.
Migrates the VM compute driver to the same supervisor-initiated model used by the K8s driver: the in-guest sandbox now binds `/run/openshell/ssh.sock` and opens its own outbound `ConnectSupervisor` session to the gateway, so the host→guest SSH port-forward is no longer needed. Drops `--vm-port` plumbing, the `ssh_port` allocation path, the `port_is_ready` TCP probe, and the now-unused `GUEST_SSH_PORT` import from `driver.rs`. Readiness falls back to the existing console-log marker from `guest_ssh_ready`.
Remaining `ssh_port` / `GUEST_SSH_PORT` residue in `openshell-driver-vm/src/runtime.rs` (gvproxy port-mapping plan) is dead but left for OS-102, which already covers NSSH1/handshake plumbing removal across crates.
…p historical prose
Updates `sandbox-connect.md`, `gateway.md`, `sandbox.md`, `gateway-security.md`, and `system-architecture.md` to describe the current supervisor-initiated model forward-facing: two-plane `ConnectSupervisor` + `RelayStream` design, the registry's `open_relay` / `claim_relay` / reaper behaviour, Unix-socket sshd access control, and the sandbox-side OCSF event surface.
Strips historical framing that describes what was removed (the "Earlier designs..." paragraph, the "Historical: NSSH1 Handshake (removed)" subsection, retained-for-compat config/env table rows, and scattered "no longer X" prose) in favour of clean current-state descriptions. Syncs env-var and flag names to the renamed `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH`.
Updates user-facing docs to match the connect/exec transport change:
- `docs/security/best-practices.mdx`: SSH tunnel section now describes traffic riding the sandbox's mTLS session (transport auth) plus a short-lived session token scoped to the sandbox (authorization), with the sandbox's sshd bound to a local Unix socket rather than a TCP port. Removes the stale mention of the NSSH1 HMAC handshake.
- `docs/observability/logging.mdx`: example OCSF shorthand lines for SSH:LISTEN / SSH:OPEN updated to reflect the current emit shape (no peer endpoint on the Unix-socket listener, no NSSH1 auth tag).
🌿 Preview your docs: https://nvidia-preview-pr-867.docs.buildwithfern.com/openshell
Adds two `ResourceExhausted`-returning guards on `open_relay` to bound the `pending_relays` map against runaway or abusive callers:
- `MAX_PENDING_RELAYS = 256`: upper bound across all sandboxes. Caps the memory a caller can pin by calling `open_relay` in a loop while no supervisor ever claims (or the supervisor is hung).
- `MAX_PENDING_RELAYS_PER_SANDBOX = 32`: per-sandbox ceiling so one noisy tenant can't consume the entire global budget. Sits above the existing SSH-tunnel per-sandbox cap (20) so tunnel-specific limits still fire first for that caller.
Both checks and the `pending_relays` insert happen under a single lock hold so concurrent callers can't each observe "under the cap" and both insert past it. Adds a `sandbox_id` field on `PendingRelay` so the per-sandbox count is a single filter over the map without extra indexes.
Tests:
- Two unit tests in `supervisor_session.rs`: assert the global cap and the per-sandbox cap both return `ResourceExhausted` with the right message, and a cap-hit on one sandbox doesn't leak onto others.
- One integration test in `supervisor_relay_integration.rs`: bursts 64 concurrent `open_relay` calls at a single sandbox and asserts exactly 32 succeed, exactly 32 are rejected with the per-sandbox message, and a different sandbox still accepts new relays.
Reaper behaviour is unchanged; the cap makes the map bounded, so the existing `HashMap::retain` pass stays cheap under any load.
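The check-then-insert under one lock can be sketched as follows. The constants come from the commit message; the `reserve` method, the error strings and the synchronous `std::sync::Mutex` are assumptions for illustration (the real guard lives inside `open_relay` and returns a gRPC `ResourceExhausted` status).

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const MAX_PENDING_RELAYS: usize = 256;
const MAX_PENDING_RELAYS_PER_SANDBOX: usize = 32;

/// Hypothetical pending-relay entry keyed by channel id.
struct PendingRelay {
    sandbox_id: String,
}

#[derive(Default)]
struct Registry {
    pending_relays: Mutex<HashMap<String, PendingRelay>>,
}

impl Registry {
    /// Both cap checks and the insert share one lock hold, so two callers
    /// cannot both observe "under the cap" and then both insert past it.
    fn reserve(&self, channel_id: &str, sandbox_id: &str) -> Result<(), &'static str> {
        let mut pending = self.pending_relays.lock().unwrap();
        if pending.len() >= MAX_PENDING_RELAYS {
            return Err("global pending-relay limit reached");
        }
        // Per-sandbox count is a single filter over the map, no extra index.
        let for_sandbox = pending.values().filter(|r| r.sandbox_id == sandbox_id).count();
        if for_sandbox >= MAX_PENDING_RELAYS_PER_SANDBOX {
            return Err("per-sandbox pending-relay limit reached");
        }
        pending.insert(channel_id.to_owned(), PendingRelay { sandbox_id: sandbox_id.to_owned() });
        Ok(())
    }
}
```

Firing 64 reservations at one sandbox admits 32 and rejects 32, while a second sandbox is still accepted, which is the behaviour the integration test asserts.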
9135762 to 7a850ae
Summary
Introduces a persistent supervisor-to-gateway session (`ConnectSupervisor`) and migrates `/connect/ssh` and `ExecSandbox` onto relay channels that ride the session's HTTP/2 connection. Removes the requirement for direct gateway→sandbox network connectivity.
Two-plane design, one TCP+TLS connection per sandbox:
- `ConnectSupervisor` bidirectional gRPC stream: session lifecycle (hello, heartbeat, accept/reject) and relay lifecycle (`RelayOpen`, `RelayOpenResult`, `RelayClose`).
- `RelayStream` bidirectional gRPC RPC: one per relay, multiplexed as a separate HTTP/2 stream on the same connection. First `RelayFrame` carries a typed `RelayInit { channel_id }` to match a pending-relay slot; subsequent frames carry raw bytes.
The supervisor stays a dumb byte bridge with no SSH protocol awareness.
Removes `ResolveSandboxEndpoint` from the proto, gateway, and K8s driver; no code path now dials the sandbox directly for connect or exec.
Closes OS-86. Design: Plan. Supersedes (and closes) #861.
History
The initial approach (#861) used reverse HTTP CONNECT tunnels for the data plane, with one new TCP+TLS handshake per relay. This PR replaces that with a `RelayStream` gRPC RPC that rides the existing supervisor session connection as a new HTTP/2 stream. Both approaches were benchmarked side-by-side on `nemoclaw`; after tuning HTTP/2 flow-control windows, throughput and latency are within noise, and the gRPC path wins decisively on the architectural metric (supervisor→gateway TCP count during a 50-relay storm: 3 vs 53). See the perf comments inline on this PR for the full numbers.
Why
- … `sandbox connect` / `ExecSandbox` saves one RTT + crypto cost.
- … `1 + N`: fewer file descriptors, simpler firewall/LB story.
- … `relay.rs`, no reverse HTTP CONNECT plumbing. Drops `hyper`, `hyper-util`, `http`, `http-body-util` from `openshell-sandbox`.
Changes
- … `RelayStream(stream RelayFrame) returns (stream RelayFrame)` RPC alongside `ConnectSupervisor`. `RelayFrame` is a `oneof { RelayInit init | bytes data }`: the first frame from the supervisor must be `Init`; subsequent frames (both directions) carry `data`. Remove `ResolveSandboxEndpoint` and its request/response messages.
- … `handle_relay_stream` reads `channel_id` from the first `Init` frame, claims the pending relay slot (same `SupervisorSessionRegistry` path as before), and bridges the gRPC stream to a `DuplexStream` in 16 KiB reads.
- … `RelayOpen`, opens a `RelayStream` on the existing `Channel` and bridges the local SSH Unix socket. `openshell-sandbox` loses ~200 lines of TLS + HTTP CONNECT plumbing and the entire NSSH1 preface path.
- … `sandbox_ssh_socket_path`, default `/run/openshell/ssh.sock`) with 0700 parent / 0600 socket perms. Removes port 2222, the NSSH1 HMAC handshake, and nonce replay detection; filesystem permissions are the access control boundary now.
- … `adaptive_window(true)`) on both the gateway-side builder and the sandbox-side `Endpoint` so bulk transfers aren't throttled by the 64 KiB defaults.
- … `session_id`-based `remove_if_current` to survive a supersede race, `spawn_relay_reaper` to reap pending relay entries a supervisor never claimed, per-caller session-wait timeouts (30 s for SSH connect's cold start, 15 s for `ExecSandbox` steady state).
- … `NetworkActivityBuilder` events for supervisor session open/close/fail and relay open/close/fail (7 event shapes, extracted to pure builder fns with 10 unit tests).
- … `ssh` invocations carry `ServerAliveInterval=15` / `ServerAliveCountMax=3` so in-flight sessions detect a silently-dropped relay (gateway or supervisor restart) in ~45 s instead of hanging indefinitely.
- … `tests/supervisor_relay_integration.rs`) + 10 OCSF event-shape tests, plus live-cluster verification (SSH, SFTP/scp, `ssh -L`, 3 concurrent sessions, gateway-restart recovery, supervisor-restart recovery).
Security note
This change moves SSH/exec data flow onto the supervisor gRPC path. That path is not yet bound to a per-sandbox transport identity, so the gateway cannot fully enforce `caller identity == target sandbox` at the RPC boundary today.
As a result, this branch inherits the existing weakness in sandbox-originated RPC identity and applies it to a more privileged path. The intended fix is proper per-sandbox identity via sandbox-specific mTLS work in OS-109, rather than introducing another temporary authentication mechanism in this PR.
This is a conscious tradeoff to minimize churn while the transport identity model is being replaced.
Performance
Benchmarked side-by-side on `nemoclaw`, same cluster, same script (`architecture/plans/relay-bench.sh`), 15 iterations per latency metric, 50 concurrent relays for the storm. See the perf comments on this PR for all three runs (HTTP CONNECT baseline, gRPC with default windows, gRPC with adaptive windows).
Headline:
- … `exec -- true`): gRPC ~16 % slower, due to per-RPC overhead (fresh `RelayStream` + SSH session per exec). Addressable later by direct exec via `SupervisorExec`, tracked as OS-91.
Follow-ups (tracked separately)
Out of scope for this PR; filed as issues so the PR stays focused on the transport migration.
- … `compute::cleanup_sandbox_state` so sandbox delete proactively tears down the registry entry (today it's cleaned up lazily via stream death).
- … `GetSandbox` / `WatchSandbox`.
- … `SupervisorExec` RPC bypassing SSH to recover the ~16 % per-exec overhead on rapid serial churn.
- … `ssh_handshake_secret` / `ssh_handshake_skew_secs` plumbing across 7 crates + bootstrap now that NSSH1 is gone.
- … `adaptive_window` vs a fixed 2 MiB / 4 MiB window in a WAN scenario before committing to the default.
- … `Vec<u8>` for `prost::bytes::Bytes` in `RelayFrame::data` to recover the remaining per-chunk copy cost (~30 ms win on `exec -- true`).
- … `GatewayContext` + emits on the server side (the OCSF crate is currently sandbox-shaped).
Testing
- `cargo build -p openshell-server -p openshell-sandbox -p openshell-cli`
- `cargo test -p openshell-server --lib`: 227 pass (including 18 supervisor_session unit tests)
- `cargo test -p openshell-server --test supervisor_relay_integration`: 5 pass
- `cargo test -p openshell-sandbox --lib`: 492 pass (including 10 OCSF event-shape tests)
- `cargo test -p openshell-cli --lib`: 78 pass
- `sandbox exec` works through relay (verified on `nemoclaw`)
- `sandbox connect` works through relay (verified on `nemoclaw`)
- `ssh -L 19090:localhost:18080` to in-sandbox http server verified
- … `Broken pipe` (`ServerAliveInterval`)
- … `Broken pipe` (`ServerAliveInterval`)
Checklist