
feat: Support restricted SecurityContextConstraints for managed Kubernetes platforms #899

@rdwj

Description


Problem Statement

OpenShell sandbox pods require CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYSLOG, and runAsUser: 0. On Red Hat OpenShift, the default restricted-v2 SecurityContextConstraint drops all capabilities, enforces runAsNonRoot: true, and sets allowPrivilegeEscalation: false. Granting a custom SCC with these capabilities weakens the cluster's security posture and requires cluster-admin approval -- a non-starter for many enterprise deployments.

This means OpenShell cannot be deployed on OpenShift (or any managed Kubernetes platform with enforced pod security standards at the restricted level) without a security exception that many platform teams cannot allow.

Related: #873 (roadmap for local workstation drivers), #882 (Podman driver / CRI-O compatibility), #579 (closed -- reduce SYS_ADMIN/SYS_PTRACE), #586 (closed -- graceful degradation without netns, decided fail-closed), #398 (CDI for GPU injection, prerequisite for OpenShift GPU support).

Proposed Design

Add a Platform variant to the NetworkMode enum. When active, the sandbox supervisor skips network namespace creation, bypass monitoring, and iptables rules, and instead binds the CONNECT proxy to loopback. The Kubernetes driver omits elevated capabilities and runAsUser: 0 from the pod spec. Egress control is enforced by a Kubernetes NetworkPolicy emitted by the driver. The OPA policy engine, L7 inspection, inference routing, and credential injection continue to function through the loopback proxy.

Capability elimination path

| Requirement | Current Location | Platform Mode Alternative |
| --- | --- | --- |
| Network namespace + veth (SYS_ADMIN, NET_ADMIN) | `sandbox/linux/netns.rs` | Skip entirely; Kubernetes NetworkPolicy provides L3/L4 egress control |
| `/proc/<pid>/exe` resolution (SYS_PTRACE) | `procfs.rs` | Use `shareProcessNamespace: true` on the pod; proxy and sandbox share a PID namespace, so same-UID `/proc` reads work without ptrace |
| Privilege drop via setuid/setgid (root) | `process.rs` `pre_exec` | Container starts as non-root; no privilege drop needed |
| Landlock PathFd opening (root) | `sandbox/linux/mod.rs` Phase 1 | Run Phase 1 as the pod's non-root user; degrade gracefully via existing `best_effort` mode for inaccessible paths |
| dmesg bypass detection (SYSLOG) | `bypass_monitor.rs` | Disabled; already degrades gracefully. Unnecessary when NetworkPolicy enforces egress outside the pod's trust boundary |
| Supervisor sideload via hostPath | `driver.rs:703-732` | Bake supervisor into the sandbox image or use an emptyDir init container |
| Workspace init container (root) | `driver.rs:821-898` | Run as the image's default non-root user (image should have read access to `/sandbox`) |
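Put together, the pod spec the Kubernetes driver would emit in Platform mode could look like the following restricted-v2-compatible fragment. This is an illustrative sketch, not the driver's actual output; the container name, image tag, and labels are placeholders.

```yaml
# Illustrative Platform-mode sandbox pod fragment: passes restricted-v2
# (and PSS "restricted") with no added capabilities and no root user.
spec:
  shareProcessNamespace: true         # same-UID /proc reads without SYS_PTRACE
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: sandbox                   # placeholder name
      image: openshell-sandbox:latest # supervisor baked in, no hostPath sideload
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]               # nothing added back
```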

Key architectural change

Move network isolation from "inside the sandbox pod" to "platform-provided, before the pod starts." The CONNECT proxy continues running on 127.0.0.1:3128 for cooperative L7 inspection, OPA policy evaluation, credential injection, and inference routing. Kubernetes NetworkPolicy acts as the hard L3/L4 enforcement backstop at the CNI level.

This is architecturally consistent with how service meshes operate alongside NetworkPolicy in production Kubernetes clusters: the sidecar proxy handles L7 for cooperative traffic, the platform handles L3/L4 for all traffic.
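As a sketch of that L3/L4 backstop, the egress NetworkPolicy the driver might emit could look like the following. The name, pod label, and destination CIDR are illustrative assumptions; the real selector would come from the driver's pod labels.

```yaml
# Illustrative default-deny egress policy for sandbox pods: cluster DNS
# plus explicitly allowed destinations; everything else dropped at the CNI.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: openshell-sandbox-egress   # placeholder name
spec:
  podSelector:
    matchLabels:
      app: openshell-sandbox       # assumed pod label
  policyTypes: ["Egress"]
  egress:
    - to:                          # allow cluster DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                          # example allowed destination
        - ipBlock:
            cidr: 203.0.113.0/24   # placeholder CIDR
      ports:
        - protocol: TCP
          port: 443
```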

Implementation sketch

  1. Add Platform to the NetworkMode enum in policy.rs and a network_enforcement field to the proto SandboxPolicy message (backward-compatible: default zero value = current Namespace mode)
  2. In run_sandbox() (lib.rs), add a Platform branch that skips netns creation and bypass monitoring, binds proxy to loopback (existing fallback path at proxy.rs:158), and still starts the OPA engine
  3. In spawn_impl() (process.rs), skip setns() when netns_fd is None (already handled), set proxy env vars to 127.0.0.1:3128, skip privilege drop when already non-root (already handled at process.rs:431-454)
  4. In the Kubernetes driver (driver.rs), conditionally omit capabilities.add and runAsUser: 0, and emit an egress NetworkPolicy for sandbox pods
  5. In seccomp.rs, include Platform alongside Proxy in the allow_inet decision (seccomp works without root via no_new_privs)
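Steps 1-3 can be sketched as follows. The `NetworkPlan`/`plan_for` names are hypothetical stand-ins for the real `run_sandbox()` internals, and the veth host IP is a placeholder; the point is the branching, not the API.

```rust
// Sketch: what a Platform variant gates on and off, assuming
// hypothetical NetworkPlan/plan_for names (not the real run_sandbox API).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NetworkMode {
    Block,
    Proxy,
    Allow,
    Platform, // new: platform-enforced egress, loopback proxy
}

struct NetworkPlan {
    create_netns: bool,
    start_bypass_monitor: bool,
    proxy_bind: Option<&'static str>,
}

fn plan_for(mode: NetworkMode) -> NetworkPlan {
    match mode {
        NetworkMode::Proxy => NetworkPlan {
            create_netns: true,
            start_bypass_monitor: true,
            proxy_bind: Some("10.200.0.1:3128"), // veth host IP (placeholder)
        },
        NetworkMode::Platform => NetworkPlan {
            create_netns: false,
            start_bypass_monitor: false,
            proxy_bind: Some("127.0.0.1:3128"), // existing loopback fallback
        },
        NetworkMode::Block | NetworkMode::Allow => NetworkPlan {
            create_netns: false,
            start_bypass_monitor: false,
            proxy_bind: None,
        },
    }
}
```

In both Proxy and Platform branches the OPA engine and proxy stack still start; only the namespace plumbing differs.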

What still works in Platform mode

  • Seccomp BPF -- prctl(PR_SET_NO_NEW_PRIVS) and seccomp(SET_MODE_FILTER) do not require any capability
  • Landlock -- restrict_self() works via no_new_privs path; Phase 1 PathFd opening works for user-readable paths, degrades gracefully otherwise
  • OPA policy evaluation -- the Rego rules are completely decoupled from the network namespace; they operate on abstract JSON input
  • L7 inspection -- for cooperative clients honoring HTTP_PROXY
  • Credential injection, inference routing, SSRF protection -- all proxy features, unchanged
  • Process identity binding -- preserved via shareProcessNamespace: true

What is reduced in Platform mode

  • Non-cooperative process enforcement: Processes ignoring HTTP_PROXY can attempt direct connections. NetworkPolicy is the enforcement boundary, not the network namespace. This is the primary security trade-off.
  • L7 inspection coverage: Only applies to cooperative proxy traffic. Non-proxy traffic gets L3/L4 enforcement only.
  • Bypass detection: No iptables LOG rules, no /dev/kmsg monitoring. Replaced by NetworkPolicy deny logging at the CNI level.

Scope boundaries:

  • This does NOT remove the existing in-pod namespace (Proxy) mode -- it remains the default for Docker/K3s deployments
  • Platform mode trades some defense-in-depth (no in-pod netns isolation) for deployability on locked-down platforms
  • The platform's NetworkPolicy enforcement is the network isolation layer in this mode
  • Landlock + seccomp remain fully functional (both work under no_new_privs)

Alternatives Considered

  1. Runtime capability probing -- Auto-detect whether CAP_NET_ADMIN is available and fall back to Platform mode. Rejected: implicit behavior is harder to reason about, test, and debug. A failed ip netns add could be transient rather than a capability restriction. Explicit configuration is recommended.

  2. NetworkPolicy-only (no in-pod proxy) -- Eliminate the CONNECT proxy entirely. Rejected: loses OPA per-binary policy evaluation, L7 inspection, inference routing, credential injection, and denial aggregator -- all core OpenShell features.

  3. User namespaces -- Map container root to an unprivileged host UID. Rejected: Kubernetes user namespace support is alpha (KEP-127), not available on OpenShift, and the seccomp filter currently blocks CLONE_NEWUSER.

  4. Custom SCC grant -- Just grant the capabilities. This is what we'd have to do today, but platform teams reject it because it weakens the namespace's security posture. Not a solution for enterprise adoption.

  5. gVisor RuntimeClass -- Referenced in #4 (Evaluate sandbox isolation options: gVisor runtime, Firecracker microVMs, or cluster-in-VM). Would eliminate in-pod namespace manipulation via syscall interception. Not available on OpenShift without a custom RuntimeClass and cluster-admin involvement.

Agent Investigation

Investigation performed with a coding agent pointed at the repo. Skills loaded: create-spike, generate-sandbox-policy. Full findings below.

Architecture overview

The sandbox employs a defense-in-depth model with six layers, four of which require elevated capabilities:

| Layer | Mechanism | Requires Elevated Caps | Platform Mode |
| --- | --- | --- | --- |
| Seccomp BPF | `prctl(PR_SET_NO_NEW_PRIVS)` + `seccomp(SET_MODE_FILTER)` | No | Works unchanged |
| Landlock LSM | Phase 1 PathFds + Phase 2 `restrict_self()` | Phase 1 needs root for restricted paths | Degrades gracefully via `best_effort` |
| Network namespace + veth | `ip netns add`, `ip link add`, `setns()` | Yes (SYS_ADMIN, NET_ADMIN) | Replaced by NetworkPolicy |
| iptables bypass detection | OUTPUT chain LOG + REJECT rules | Yes (NET_ADMIN) | Disabled |
| Process identity via procfs | `/proc/<pid>/exe`, `/proc/<pid>/fd/` | Yes (SYS_PTRACE for cross-user) | Works via `shareProcessNamespace` |
| Bypass monitor via dmesg | `dmesg --follow` | Yes (SYSLOG) | Disabled |

Code references

| Location | Description |
| --- | --- |
| `crates/openshell-driver-kubernetes/src/driver.rs:1100-1113` | Hardcoded `capabilities.add: ["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYSLOG"]` |
| `crates/openshell-driver-kubernetes/src/driver.rs:748-804` | `apply_supervisor_sideload()` forces `runAsUser: 0` |
| `crates/openshell-driver-kubernetes/src/driver.rs:821-898` | Workspace init container also `runAsUser: 0` |
| `crates/openshell-driver-kubernetes/src/driver.rs:703-732` | Supervisor sideload via hostPath volume (also blocked by restricted-v2) |
| `crates/openshell-sandbox/src/policy.rs:59-65` | `NetworkMode` enum: `Block`, `Proxy`, `Allow` -- no `Platform` variant |
| `crates/openshell-sandbox/src/policy.rs:98-119` | `TryFrom<ProtoSandboxPolicy>` unconditionally forces `NetworkMode::Proxy` |
| `crates/openshell-sandbox/src/lib.rs:376-412` | Netns creation gated on `NetworkMode::Proxy` -- fatal failure if caps unavailable |
| `crates/openshell-sandbox/src/lib.rs:423-481` | Proxy startup, identity cache, OPA engine gated on Proxy mode |
| `crates/openshell-sandbox/src/process.rs:144-262` | `spawn_impl()`: `setns()` at 236, `drop_privileges()` at 245, Landlock+seccomp at 255 |
| `crates/openshell-sandbox/src/process.rs:171-193` | Proxy URL env var injection gated on `NetworkMode::Proxy` |
| `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs:28-44` | Seccomp: `prctl(PR_SET_NO_NEW_PRIVS)` + `apply_filter()` -- confirmed no root needed |
| `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs:29` | `allow_inet` decision based on network mode |
| `crates/openshell-sandbox/src/sandbox/linux/netns.rs:53-178` | `NetworkNamespace::create()` -- requires root + CAP_NET_ADMIN |
| `crates/openshell-sandbox/src/sandbox/linux/netns.rs:252-331` | `install_bypass_rules()` -- iptables inside netns |
| `crates/openshell-sandbox/src/bypass_monitor.rs:117-292` | `spawn()` -- requires CAP_SYSLOG |
| `crates/openshell-sandbox/src/procfs.rs:49-79` | `binary_path()` -- `/proc/<pid>/exe`, needs SYS_PTRACE across users |
| `crates/openshell-sandbox/src/procfs.rs:276-315` | `find_pid_by_socket_inode()` -- `/proc/<pid>/fd/` scanning |
| `crates/openshell-sandbox/src/proxy.rs:143-159` | `start_with_bind_addr()` -- proxy binds to veth host IP or loopback |
| `proto/sandbox.proto:17-28` | `SandboxPolicy` message -- no network mode field currently |
| `proto/compute_driver.proto` | `DriverSandboxTemplate.platform_config` -- existing opaque extensibility point |
| `deploy/helm/openshell/templates/networkpolicy.yaml` | Existing ingress-only NetworkPolicy |

OPA/Rego decoupling (from generate-sandbox-policy investigation)

The Rego rules (crates/openshell-sandbox/data/sandbox-policy.rego) have zero dependency on the network namespace. They evaluate against an abstract input JSON object containing host, port, binary_path, and ancestors. The coupling to in-pod networking exists solely in how that input is constructed -- specifically, process identity resolution via /proc. The OPA engine, policy loading, hot-reload, L7 inspection chain, credential injection, and SSRF protection are all mode-agnostic.
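For illustration, an input document of the shape the Rego rules evaluate. Only the field names come from the investigation above; the values are made up.

```json
{
  "host": "api.example.com",
  "port": 443,
  "binary_path": "/usr/bin/curl",
  "ancestors": ["/usr/bin/bash", "/sandbox/supervisor"]
}
```

Nothing in this input depends on how the connection reached the proxy, which is why the policy layer carries over to Platform mode unchanged; only the `binary_path`/`ancestors` resolution path (procfs) changes.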

Proto extensibility

The SandboxPolicy proto message can be extended with backward-compatible fields:

```proto
enum NetworkEnforcementMode {
  NETWORK_ENFORCEMENT_NAMESPACE = 0;  // default, backward-compatible
  NETWORK_ENFORCEMENT_PLATFORM = 1;
}

message PlatformNetworkConfig {
  string network_policy_name = 1;
  string network_policy_namespace = 2;
  bool shared_pid_namespace = 3;
  string proxy_listen_addr = 4;
}
```

Default zero value preserves current behavior. The existing DriverSandboxTemplate.platform_config (google.protobuf.Struct) can carry Kubernetes-specific configuration without touching the core policy schema.
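Backward compatibility then falls out of proto3 defaults. A sketch of the decoding rule, using the enum values proposed above; the Rust type and function are hypothetical, not the existing `TryFrom` implementation:

```rust
// Proto3 decodes an unset enum field as 0 and preserves unknown values
// as raw integers; mapping anything other than 1 to the namespace
// behavior keeps old clients and old messages on the current code path.
#[derive(Debug, PartialEq, Eq)]
enum NetworkEnforcement {
    Namespace, // NETWORK_ENFORCEMENT_NAMESPACE = 0 (default)
    Platform,  // NETWORK_ENFORCEMENT_PLATFORM = 1
}

fn decode_enforcement(raw: i32) -> NetworkEnforcement {
    match raw {
        1 => NetworkEnforcement::Platform,
        _ => NetworkEnforcement::Namespace, // zero and unknown values
    }
}
```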

Existing patterns followed

  • NetworkMode enum gating pattern: codebase uses matches!(policy.network.mode, NetworkMode::Proxy) extensively
  • Graceful degradation: Landlock BestEffort, bypass monitor None return
  • Proxy loopback fallback: proxy.rs:158 already handles binding to 127.0.0.1:3128
  • Helm chart conditional rendering: existing networkpolicy.yaml via .Values.networkPolicy.enabled
  • OCSF event emission for security state changes
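Extending the first of those patterns is mostly mechanical. A sketch, with the enum and function as illustrative stand-ins for the real policy types:

```rust
#[derive(Clone, Copy)]
enum NetworkMode {
    Block,
    Proxy,
    Allow,
    Platform,
}

// Sites that currently gate on Proxy would widen the match wherever the
// proxy/OPA stack should also run in Platform mode.
fn proxy_stack_enabled(mode: NetworkMode) -> bool {
    matches!(mode, NetworkMode::Proxy | NetworkMode::Platform)
}
```

Sites that gate namespace-specific work (netns creation, iptables, bypass monitor) would keep matching `Proxy` alone.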

Scope assessment

  • Complexity: High
  • Confidence: Medium (core approach is sound; design decisions needed for NetworkPolicy reconciliation, init container alternatives, identity resolution degradation)
  • Estimated files to change: 12-15
  • Issue type: feat

Risks & open questions:

  1. NetworkPolicy operates at IP/port level, not per-binary or per-request -- fundamental security downgrade from in-pod proxy model. Proxy on loopback + NetworkPolicy as backstop is the mitigation. How much trust do we place in NetworkPolicy as sole enforcement?
  2. Dynamic NetworkPolicy updates: if OPA network policies change at runtime, the driver needs a reconciliation loop to keep Kubernetes NetworkPolicy in sync. Significant new subsystem.
  3. Init container without root: workspace persistence init container (driver.rs:821-898) uses runAsUser: 0. Needs alternative seeding strategy.
  4. Landlock without root: Phase 1 opens PathFds as root. Without root, BestEffort mode handles inaccessible paths gracefully. Should Platform mode force BestEffort?
  5. hostPath volume for supervisor sideload (driver.rs:703-732) is also blocked by restricted-v2. Supervisor must be baked into the sandbox image or use emptyDir.
  6. Proxy bypass: without netns, processes ignoring HTTP_PROXY connect directly. NetworkPolicy is the only enforcement. Prominently document this trade-off.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
