Skip to content

feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e#901

Open
drew wants to merge 29 commits intomainfrom
drew/vm-driver-install-hangs-on-startup
Open

feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e#901
drew wants to merge 29 commits intomainfrom
drew/vm-driver-install-hangs-on-startup

Conversation

@drew
Copy link
Copy Markdown
Collaborator

@drew drew commented Apr 21, 2026

Summary

Makes the VM compute driver end-to-end path work on top of the supervisor-initiated relay in #867, and moves the authoritative "sandbox is Ready" transition from each compute driver onto the gateway. The smoke test against openshell-gateway --drivers vm (mise run e2e:vm) goes from hanging at 180s to passing in ~10s.

Related Issue

Stacked on top of #867. No issue link.

Changes

feat(server): promote sandbox phase on supervisor session connect

  • New SupervisorSessionObserver trait. SupervisorSessionRegistry invokes the observer on register / remove_if_current outside the internal mutex.
  • ComputeRuntime::install_supervisor_observer wires a ComputeSessionObserver bridge; the runtime holds a Weak<SupervisorSessionRegistry> to break the Arc cycle between registry and observer.
  • New mark_sandbox_session_connected / mark_sandbox_session_disconnected flip phase and rewrite the Ready condition with reason=SupervisorConnected / SupervisorDisconnected. Terminal states (Deleting, Error) are preserved.
  • Backfill path in apply_sandbox_update_locked handles the register-before-store race: if a driver snapshot arrives and the registry already holds a live session for that sandbox, phase is promoted on the spot.

refactor(driver-vm): drop log-grep readiness; always run gvproxy

  • Delete guest_ssh_ready() and ready_condition(). The driver no longer owns Ready; monitor_sandbox only surfaces Error for launcher-process failures.
  • Critical fix: runtime.rs now starts gvproxy unconditionally. With the SSH port forward removed in feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867, port_map was empty by default, which skipped gvproxy startup entirely — leaving the guest with no eth0 and no route to the host gateway. The guest supervisor's ConnectSupervisor stream needs gvproxy to reach host.containers.internal (rewritten to 192.168.127.1 inside the guest).
  • Remove dead VmContext::set_port_map; mark the libkrun FFI binding #[allow(dead_code)].

e2e(vm): run smoke against openshell-gateway with the VM compute driver

  • Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (former openshell-vm K8s-in-a-VM binary is gone).
  • Pin --driver-dir target/debug so the gateway picks up the freshly cargo-built driver rather than a stale ~/.local/libexec/openshell/openshell-driver-vm from a prior install-vm.sh run.
  • Anchor per-run state under /tmp (macOS AF_UNIX SUN_LEN is 104 bytes; worktree paths routinely blow it).
  • On failure, preserve the state dir and dump the gateway log + every sandbox's rootfs-console.log inline for post-mortem.
  • Drop build:docker:gateway and vm:build dependencies from tasks/test.toml's e2e:vm.

Testing

  • mise run pre-commit passes (lint + format + license headers clean; clippy warnings unchanged from baseline)
  • Unit tests added/updated
    • openshell-server lib: 255 pass (+8 compute promotion tests, +4 registry observer tests)
    • openshell-driver-vm lib: 17 pass
    • openshell-server integration (supervisor_relay_integration): 6 pass
  • E2E tests added/updated: mise run e2e:vm passes in ~10s, stable across back-to-back runs

Checklist

Notes for the reviewer

  • Base is feat/supervisor-session-grpc-data, not main. Merge order: land feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867 first, then rebase this onto main.
  • The gvproxy-always change in driver-vm/runtime.rs could arguably belong in feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867 itself (it's a latent bug in that PR — VMs have no network after the SSH port forward was dropped). Happy to split that hunk off into a separate commit against feat/supervisor-session-grpc-data directly if that's the preferred landing path.
  • Readiness semantics change: before = "sshd is bound", after = "supervisor→gateway session is live". The latter is what clients actually need before opening a relay, so this is a strict improvement. The Kubernetes driver is unaffected because its own Ready=True from kubelet still wins (the gateway override only applies when phase is Provisioning / Unknown).

pimlock and others added 28 commits April 15, 2026 20:28
…on relay

Introduce a persistent supervisor-to-gateway session (ConnectSupervisor
bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto
relay channels coordinated through it.

Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and
  relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP
  CONNECT to /relay/{channel_id} on the gateway; the gateway bridges
  the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness;
  the gateway sends the NSSH1 preface through the relay

Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with
  pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct
  TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver

Closes OS-86
When a sandbox first reports Ready, the supervisor session may not have
completed its gRPC handshake yet. Instead of failing immediately with
502 / "supervisor session not connected", the relay open now retries
with exponential backoff (100ms → 2s) for up to 15 seconds.

This fixes the race between K8s marking the pod Ready and the
supervisor establishing its ConnectSupervisor session.
Three related changes:

1. Fold the session-wait into `open_relay` itself via a new `wait_for_session`
   helper with exponential backoff (100ms → 2s). Callers pass an explicit
   `session_wait_timeout`:
   - SSH connect uses 30s — it typically runs right after `sandbox create`,
     so the timeout has to cover a cold supervisor's TLS + gRPC handshake.
   - ExecSandbox uses 15s — during normal operation it only needs to cover
     a transient supervisor reconnect window.

   This covers both the startup race (pod Ready before the supervisor's
   ConnectSupervisor stream is up) and mid-lifetime reconnects after a
   network blip or gateway/supervisor restart — both look identical to the
   caller.

2. Fix a supersede cleanup race. `LiveSession` now tracks a `session_id`,
   and `remove_if_current(sandbox_id, session_id)` only evicts when the
   registered entry still matches. Previously an old session's cleanup
   could run after a reconnect had already registered the new session,
   unconditionally removing the live registration.

3. Wire up `spawn_relay_reaper` alongside the existing SSH session reaper
   so expired pending relay entries (supervisor acknowledged RelayOpen but
   never opened the reverse CONNECT) are swept every 30s instead of
   leaking until someone tries to claim them.

Adds 12 unit tests covering: open_relay happy path, timeout, mid-wait
session appearance, closed-receiver failure, supersede routing; claim_relay
unknown/expired/receiver-dropped/round-trip; and the remove_if_current
cleanup-race regression.
Replace the supervisor's reverse HTTP CONNECT data plane with a new
`RelayStream` gRPC RPC. Each relay now rides the supervisor's existing
`ConnectSupervisor` TCP+TLS+HTTP/2 connection as a new HTTP/2 stream,
multiplexed natively. Removes one TLS handshake per SSH/exec session.

- proto: add `RelayStream(stream RelayChunk) returns (stream RelayChunk)`;
  the first chunk from the supervisor carries `channel_id` and no data,
  matching the existing RelayOpen channel_id. Subsequent chunks are
  bytes-only — leaving channel_id off data frames avoids a ~36 B
  per-frame tax that would hurt interactive SSH.

- server: add `handle_relay_stream` alongside `handle_connect_supervisor`.
  It reads the first RelayChunk for channel_id, claims the pending relay
  (same `SupervisorSessionRegistry::claim_relay` path as before, returning
  a `DuplexStream` half), then bridges that half ↔ the gRPC stream via
  two tasks (16 KiB chunks). Delete `relay.rs` and its `/relay/{channel_id}`
  HTTP endpoint.

- sandbox: on `RelayOpen`, open a `RelayStream` RPC on the existing
  `Channel`, send `RelayChunk { channel_id, data: [] }` as the first frame,
  then bridge the local SSH socket. Drop `open_reverse_connect`,
  `send_connect_request`, `connect_tls`, and the `hyper`, `hyper-util`,
  `http`, `http-body-util` deps that existed solely for the reverse CONNECT.

- tests: add `RelayStreamStream` type alias and `relay_stream` stub to the
  seven `OpenShell` mock impls in server + CLI integration tests.

The registry shape (pending_relays, claim_relay, RelayOpen control message,
DuplexStream bridging) is unchanged, so the existing session-wait / supersede
/ reaper hardening on feat/supervisor-session-relay carries over intact.
… plane

Default h2 initial windows are 64 KiB per stream and 64 KiB per
connection. That throttles a single RelayStream SSH tunnel to ~500 Mbps
on LAN, roughly 35% below the raw HTTP CONNECT baseline measured on
`nemoclaw`.

Bump both server (hyper-util auto::Builder via multiplex.rs) and client
(tonic Endpoint in openshell-sandbox/grpc_client.rs) windows to 2 MiB /
4 MiB. This is the window size at which bulk throughput on a 50 MiB
transfer matches the reverse HTTP CONNECT path.

The numbers apply only to the RelayStream data plane in this branch;
ConnectSupervisor and all other RPCs benefit too but are low-rate.
…, drop NSSH1

The embedded SSH daemon in openshell-sandbox no longer listens on a TCP
port. Instead it binds a root-owned Unix socket (default
/run/openshell/ssh.sock, 0700 parent dir, 0600 socket). The supervisor's
relay bridge connects to that socket instead of 127.0.0.1:2222.

With the socket gated by filesystem permissions, the NSSH1 HMAC preface
is redundant and has been removed:

- openshell-sandbox: drop `verify_preface`, `hmac_sha256`, the nonce
  cache and reaper, and the preface read/write on every SSH accept.
  `run_ssh_server` takes a `PathBuf` and uses `UnixListener`.
- openshell-server/ssh_tunnel: remove the NSSH1 write + response read
  before bridging the client's upgraded CONNECT stream; the relay is
  now bridged immediately.
- openshell-server/grpc/sandbox: same cleanup in the exec-path relay
  proxy. `stream_exec_over_relay` and `start_single_use_ssh_proxy_over_relay`
  stop taking a `handshake_secret`.
- openshell-server lib: the K8s driver is now configured with the
  socket path ("/run/openshell/ssh.sock") instead of "0.0.0.0:2222".
- Parent directory of the socket is created with 0700 root:root by the
  supervisor at startup to keep the sandbox entrypoint user out.

`ssh_handshake_secret` is still accepted on the CLI / env for backwards
compatibility but is no longer used for SSH.
Adds `sandbox_ssh_socket_path` to `Config` (default
`/run/openshell/ssh.sock`). The K8s driver is now wired with the
configured value instead of a hard-coded path.

K8s and VM drivers already isolate the socket via per-pod / per-VM
filesystems, so the default is safe there. This makes it easy to
override in local dev when multiple supervisors share a filesystem,
matching the prior `OPENSHELL_SSH_LISTEN_ADDR` knob on the supervisor
side.
Adds tests/supervisor_relay_integration.rs covering the RelayStream wire
contract, handshake frame, bridging, and claim timing. Five cases:
happy-path echo, gateway drop, supervisor drop, no-session timeout, and
concurrent multiplexed relays on one session.

Narrows handle_relay_stream to take &SupervisorSessionRegistry directly
so the test can exercise the real handler without standing up a full
ServerState. Adds register_for_test for the same reason.
…ents

Emits NetworkActivity events for session open/close/fail and relay
open/close/fail from the sandbox side. Keeps plain tracing for internal
plumbing (SSH socket connect, gateway stream close observation).

Event shapes are extracted into pure builder fns so unit tests can
assert activity/severity/status without wiring up a tracing subscriber.
Gateway endpoint is parsed into host + port for dst_endpoint.
Adds ServerAliveInterval=15 and ServerAliveCountMax=3 to both the
rendered ssh-config block and the direct ssh invocation used by
`openshell sandbox connect`. Without this, a client-side SSH session
hangs indefinitely when the gateway or supervisor dies mid-session:
the relay transport's TCP connection can't signal EOF to the client
because the peer process is gone, not cleanly closing.

Detection now takes ~45s instead of the TCP keepalive default of
2 hours. Verified on a live cluster by deleting the gateway pod and
the sandbox pod mid-session — SSH exits with "Broken pipe" after one
missed ServerAlive reply.
The RPC was used by the direct gateway→sandbox SSH/exec path, which is
gone — connect/ssh and ExecSandbox both ride the supervisor session
relay now. Removes the RPC, SandboxEndpoint/ResolveSandboxEndpoint*
messages, and the now-dead ssh_port / sandbox_ssh_port config fields
across openshell-core, openshell-server, openshell-driver-kubernetes,
and openshell-driver-vm.

The k8s driver's standalone binary also stops synthesizing a TCP
listen address ("0.0.0.0:<port>") and reads the Unix socket path
directly from OPENSHELL_SANDBOX_SSH_SOCKET_PATH.
…rename ssh-listen-addr → ssh-socket-path

Renames the sandbox binary's `--ssh-listen-addr` / `OPENSHELL_SSH_LISTEN_ADDR`
/ `ssh_listen_addr` to `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH` /
`ssh_socket_path` so the flag name matches its sole accepted form (a Unix
socket filesystem path) after the supervisor-initiated relay migration.

Migrates the VM compute driver to the same supervisor-initiated model used by
the K8s driver: the in-guest sandbox now binds `/run/openshell/ssh.sock` and
opens its own outbound `ConnectSupervisor` session to the gateway, so the
host→guest SSH port-forward is no longer needed. Drops `--vm-port` plumbing,
the `ssh_port` allocation path, the `port_is_ready` TCP probe, and the now-
unused `GUEST_SSH_PORT` import from `driver.rs`. Readiness falls back to the
existing console-log marker from `guest_ssh_ready`.

Remaining `ssh_port` / `GUEST_SSH_PORT` residue in
`openshell-driver-vm/src/runtime.rs` (gvproxy port-mapping plan) is dead but
left for OS-102, which already covers NSSH1/handshake plumbing removal across
crates.
…p historical prose

Updates `sandbox-connect.md`, `gateway.md`, `sandbox.md`, `gateway-security.md`,
and `system-architecture.md` to describe the current supervisor-initiated
model forward-facing: two-plane `ConnectSupervisor` + `RelayStream` design,
the registry's `open_relay` / `claim_relay` / reaper behaviour, Unix-socket
sshd access control, and the sandbox-side OCSF event surface.

Strips historical framing that describes what was removed — the
"Earlier designs..." paragraph, the "Historical: NSSH1 Handshake (removed)"
subsection, retained-for-compat config/env table rows, and scattered "no
longer X" prose — in favour of clean current-state descriptions. Syncs env-
var and flag names to the renamed `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH`.
Updates user-facing docs to match the connect/exec transport change:

- `docs/security/best-practices.mdx` — SSH tunnel section now describes
  traffic riding the sandbox's mTLS session (transport auth) plus a
  short-lived session token scoped to the sandbox (authorization), with
  the sandbox's sshd bound to a local Unix socket rather than a TCP port.
  Removes the stale mention of the NSSH1 HMAC handshake.

- `docs/observability/logging.mdx` — example OCSF shorthand lines for
  SSH:LISTEN / SSH:OPEN updated to reflect the current emit shape (no
  peer endpoint on the Unix-socket listener, no NSSH1 auth tag).
Adds two `ResourceExhausted`-returning guards on `open_relay` to bound the
`pending_relays` map against runaway or abusive callers:

- `MAX_PENDING_RELAYS = 256` — upper bound across all sandboxes. Caps the
  memory a caller can pin by calling `open_relay` in a loop while no
  supervisor ever claims (or the supervisor is hung).
- `MAX_PENDING_RELAYS_PER_SANDBOX = 32` — per-sandbox ceiling so one noisy
  tenant can't consume the entire global budget. Sits above the existing
  SSH-tunnel per-sandbox cap (20) so tunnel-specific limits still fire
  first for that caller.

Both checks and the `pending_relays` insert happen under a single lock
hold so concurrent callers can't each observe "under the cap" and both
insert past it. Adds a `sandbox_id` field on `PendingRelay` so the
per-sandbox count is a single filter over the map without extra indexes.

Tests:

- Two unit tests in `supervisor_session.rs` — assert the global cap and
  the per-sandbox cap both return `ResourceExhausted` with the right
  message, and a cap-hit on one sandbox doesn't leak onto others.
- One integration test in `supervisor_relay_integration.rs` — bursts 64
  concurrent `open_relay` calls at a single sandbox and asserts exactly
  32 succeed, exactly 32 are rejected with the per-sandbox message, and
  a different sandbox still accepts new relays.

Reaper behaviour is unchanged; the cap makes the map bounded, so the
existing `HashMap::retain` pass stays cheap under any load.
…on-grpc-data

# Conflicts:
#	architecture/gateway.md
#	crates/openshell-sandbox/src/main.rs
…on-grpc-data

# Conflicts:
#	crates/openshell-cli/tests/sandbox_name_fallback_integration.rs
The gateway now owns the Ready transition for sandboxes. When a
ConnectSupervisor RPC registers a session in SupervisorSessionRegistry,
ComputeRuntime promotes the sandbox record from Provisioning to Ready
with reason=SupervisorConnected. When the session ends, the sandbox
is demoted back to Provisioning with reason=SupervisorDisconnected.

This replaces compute-driver-specific liveness probes (log grep, TCP
poll) with the authoritative signal that the relay plane for SSH and
exec is live end-to-end. Drivers that still want to report Error
conditions (e.g. the VM driver on a dead launcher process) continue
to do so; only the Ready transition moves.

Wiring is observer-based: SupervisorSessionRegistry holds an optional
trait object installed by ComputeRuntime::install_supervisor_observer
during server startup. Session register/remove_if_current invoke
on_session_connected/on_session_disconnected off the internal lock;
the observer spawns async tasks to update the persisted sandbox record
and notify the watch bus.

Backfill path in apply_sandbox_update_locked handles the
register-before-store race: when a driver snapshot arrives and the
registry already has a live session for that sandbox, the phase is
promoted on the spot.

Adds 8 unit tests covering promote/demote transitions, terminal-state
no-ops, idempotent re-register, and the backfill race. Adds 4
registry-level tests covering observer fire-once semantics and
supersede-race guards.
The VM driver no longer owns the Ready transition — the gateway-side
SupervisorSessionObserver now promotes sandboxes to Ready when their
supervisor session connects. Remove guest_ssh_ready() (a brittle
grep over the serial console) and the ready_condition() helper.
monitor_sandbox still watches the launcher child process and emits
Error conditions on ProcessExited / ProcessPollFailed.

Also always start gvproxy, not just when port_map is non-empty. With
the supervisor-initiated relay migration in #867, the SSH port forward
was dropped; that left port_map empty in the default path, which in
turn skipped gvproxy startup, which left the guest with no eth0 and
no route to the host gateway. The guest supervisor's outbound
ConnectSupervisor stream needs gvproxy to reach
host.containers.internal (rewritten to 192.168.127.1 inside the guest),
so gvproxy is structurally required for any sandbox that talks to
the gateway.

Inline the gvproxy setup into an unconditional block that returns
(guard, api_sock, forwarded_port_map), dropping the mutable plumbing
the prior conditional form needed. Remove the now-dead
VmContext::set_port_map wrapper; mark its libkrun FFI binding
#[allow(dead_code)] so a future reintroduction doesn't need to touch
the symbol table.
Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (openshell-gateway
+ openshell-driver-vm) now that the former openshell-vm K8s-in-a-VM
binary is gone. The new flow:

  1. Stage the embedded VM runtime (libkrun + gvproxy + base rootfs)
     via mise run vm:setup and mise run vm:rootfs -- --base, both
     idempotent and run only when artifacts are missing.
  2. Build openshell-gateway, openshell-driver-vm, and the openshell
     CLI from the current workspace with cargo.
  3. On macOS, codesign the driver with the Hypervisor.framework
     entitlement so libkrun can start the microVM.
  4. Start the gateway with --drivers vm --disable-tls
     --disable-gateway-auth --db-url sqlite::memory:, pinning
     --driver-dir target/debug so the gateway picks up the freshly
     built driver rather than ~/.local/libexec/openshell from a
     prior install-vm.sh run.
  5. Wait for 'Server listening', run the cluster-agnostic Rust smoke
     test against OPENSHELL_GATEWAY_ENDPOINT=http://127.0.0.1:<port>,
     then SIGTERM the gateway.

State paths root under /tmp rather than target/ because the VM
driver's compute-driver.sock lives under --vm-driver-state-dir; with
AF_UNIX SUN_LEN = 104 bytes on macOS (108 on Linux), worktree paths
under target/ routinely blow the limit.

On failure, the trap preserves the per-run state dir plus dumps the
gateway log and every sandbox's rootfs-console.log inline so CI
artifacts capture post-mortem data.

Drop the former --vm-port / --vm-name reuse path entirely — the new
gateway is cheap to start (a few seconds, no k3s bootstrap) and that
reuse flow mapped to openshell-vm's StatefulSet rollout, which no
longer exists. Drop the build:docker:gateway and vm:build task
dependencies from tasks/test.toml's e2e:vm for the same reason.
@drew drew self-assigned this Apr 21, 2026
@drew drew requested a review from a team as a code owner April 21, 2026 05:24
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

With the SSH port forward removed in #867 and no other host→guest port
mappings in play, everything that configured gvproxy's port-forwarder
is dead weight. gvproxy stays because the VM still needs its virtual
NIC, DHCP server, and default router for guest egress, and because
the sandbox supervisor's per-sandbox netns (veth + iptables, see
openshell-sandbox/src/sandbox/linux/netns.rs) needs a real kernel
network stack inside the guest to branch off of — libkrun's built-in
TSI socket impersonation would not satisfy those primitives.

What we stop doing:

* Dropping the `-listen` API socket. No one calls
  `/services/forwarder/expose` on it any more.
* Passing `-ssh-port -1`. gvproxy's default 2222 SSH forward binds
  a host-side TCP listener that would race concurrent sandboxes
  and surface a misleading 'sshd is reachable' endpoint.
  `-1` is gvproxy's documented switch for 'no SSH forward'; see
  getForwardsMap in containers/gvisor-tap-vsock cmd/gvproxy/main.go.
* Removing VmLaunchConfig::port_map and the CLI --vm-port flag.
* Removing krun_set_port_map from the libkrun FFI bindings.
* Removing helpers that only made sense when we had a port map to
  manage: plan_gvproxy_ports, parse_port_mapping, expose_port_map,
  gvproxy_expose, pick_gvproxy_ssh_port, kill_stale_gvproxy_by_port,
  kill_stale_gvproxy_by_port_map, kill_gvproxy_pid, is_process_named,
  and the GUEST_SSH_PORT constant.
* Removing the four port-mapping unit tests.

Verified: after `sandbox create -- echo hi`, `lsof` shows gvproxy
opens zero TCP listeners; only its qemu/vfkit unixgram data socket
remains. E2E smoke still passes in ~10s.
Base automatically changed from feat/supervisor-session-grpc-data to main April 21, 2026 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants