feat(auth): per-sandbox authentication to gateway#1404
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
381784e to
9bc2e11
Compare
834b56e to
f4daea6
Compare
|
Label |
|
🌿 Preview your docs: https://nvidia-preview-pr-1404.docs.buildwithfern.com/openshell |
719f5f5 to
0d7df90
Compare
602d5ea to
a5cb368
Compare
|
Label |
6801cb1 to
c3657cd
Compare
|
Security/stability review notes from a focused pass on the per-sandbox auth changes. 1. Sandbox callers can drop
|
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Require persisted sandbox records before IssueSandboxToken and RefreshSandboxToken mint gateway JWTs. This closes the stale-token path where a deleted sandbox identity could continue refreshing itself until token expiry windows were repeatedly extended. Pin PushSandboxLogs streams to the first validated sandbox id. A sandbox now validates scope and sandbox existence once, then any later batch that changes sandbox_id is rejected instead of being accepted under the original validation. For Kubernetes bootstrap, add service_account_name to the Kubernetes driver config, set it on sandbox pod specs, and require TokenReview usernames to match system:serviceaccount:<sandbox-namespace>:<service-account>. The Helm chart provisions a dedicated sandbox ServiceAccount, places it in the sandbox namespace, scopes sandbox RBAC there, and writes the generated name into gateway.toml. Update Helm unit coverage, Helm README, gateway/driver docs, architecture notes, and debug-openshell-cluster guidance for the new sandbox ServiceAccount behavior. Validation: mise run pre-commit; Kubernetes smoke e2e via helm-dev-environment/k3d; Docker smoke e2e; Podman smoke e2e.
Address PR review feedback on the per-sandbox authentication changes. Remove the implicit permissive user fallback once sandbox or user auth is configured. Missing credentials now fail closed unless an explicit local mode is selected. Keep mTLS user auth as a local single-player option for Docker, Podman, and VM gateways, reject it for Kubernetes, and add an explicit unsafe unauthenticated-user switch for trusted local Skaffold/Kubernetes development. Deliver sandbox JWTs through driver-owned token files for Docker, Podman, and VM sandboxes instead of placing raw bearers in container or guest environment metadata. Strip token env overrides from user-provided sandbox environments and update debug-rpc helpers to print token fingerprints, expiry, and claims rather than raw bearer values. Make certgen upgrades recover existing TLS-only installs by creating just the missing gateway JWT signing material while preserving existing TLS certificates and keys. Keep partial-state failures for inconsistent TLS or JWT sets. Improve supervisor token refresh behavior for short JWT TTLs by removing the 60-second refresh floor, using shorter retry backoff, and re-running the Kubernetes ServiceAccount bootstrap path after unauthenticated refresh failures. Update Helm defaults, Skaffold values, e2e gateway setup, Python gateway metadata handling, architecture notes, published docs, and generated chart docs to describe the new auth modes and local development behavior. Validation: mise run pre-commit; Docker smoke e2e; Podman smoke e2e; Kubernetes smoke e2e.
Tighten the follow-up review points from PR 1404. Restrict GetInferenceBundle to sandbox principals because the response carries provider route credentials. User callers continue to manage inference through the user-facing inference APIs. Fail startup for in-cluster K8s ServiceAccount bootstrap when the Kubernetes driver config is missing instead of silently falling back to the default namespace. Collapse sandbox-principal name lookups on supervisor-callable policy RPCs so missing and foreign sandbox names both return PermissionDenied, avoiding sandbox name enumeration. Rename the local unauthenticated development identity provider from Internal to LocalDev, add Helm coverage for OIDC CA mounts with TLS disabled, fill generated sandboxJwt values docs, and document the current offline JWT signing-key rotation procedure. Follow-up: created #1510 for online gateway sandbox JWT key rotation. Validation: mise run pre-commit.
Require Kubernetes ServiceAccount bootstrap to validate the live pod's controlling Sandbox ownerReference before minting a gateway sandbox JWT. The K8s resolver now verifies the pod sandbox-id annotation, the controlling Sandbox CR UID, and the Sandbox CR sandbox-id label in addition to TokenReview and live pod UID checks. Update gateway architecture and user-facing auth docs to describe the additional Kubernetes bootstrap binding checks. Tested with focused k8s_sa tests, full pre-commit, and a fresh Helm dev cluster sandbox create.
c3657cd to
c0342fd
Compare
Sandbox pods now run as the openshell-sandbox service account, so OpenShift installs must grant the privileged SCC to that service account instead of default. Update the published OpenShift guide and Helm chart README template/generated README. Tested with markdown lint and Helm docs check.
Serialize the supervisor's first sandbox JWT acquisition so concurrent startup clients reuse the process-wide token slot instead of racing into duplicate K8s ServiceAccount bootstrap exchanges. Set XDG_STATE_HOME inside Docker e2e's host-visible workdir. GitHub Actions container jobs talk to the host Docker daemon, so driver-owned sandbox JWT bind mounts must resolve from a path visible on both sides. Verification: mise run pre-commit; OPENSHELL_E2E_DOCKER_TEST=bypass_detection OPENSHELL_SUPERVISOR_IMAGE=openshell/supervisor:dev-25-g15a2a59fb-dirty e2e/rust/e2e-docker.sh; local helm dev deploy plus sandbox log check showed one K8s token exchange line.
The cred-inject fields already use the 9000+ range reserved for openlock fork additions, but the later bind-mount `volumes` field and the `SANDBOX_PHASE_STOPPED` enum value grabbed the next sequential numbers (11 and 6). That put them in upstream's path: upstream's per-sandbox-auth work (NVIDIA#1404) took DriverSandboxSpec field 11 (`sandbox_token`), colliding with `volumes`, and upstream could extend SandboxPhase past 5 at any time. Move both into the fork range so the delta is permanently collision-proof and the convention is uniform: - SandboxSpec.volumes 11 -> 9003 (openshell.proto) - DriverSandboxSpec.volumes 12 -> 9003 (compute_driver.proto) - SANDBOX_PHASE_STOPPED 6 -> 9004 (openshell.proto) Safe to renumber: `volumes` is transient provisioning input and the phase is re-derived from backend state on every watch (never persisted as the enum int), so no gateway-DB upgrade-path break. Gateway, CLI, and sandbox binaries always ship as one matched fork tag, so there is no wire skew.
PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
… enforce at the router (#1596) * feat(server): per-handler gRPC auth annotations Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name. Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server): rename `sandbox-secret` auth mode to `sandbox` PR #1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(server): enforce per-handler AuthMode at the router Addresses review feedback on the per-handler auth-annotation work. - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods. - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile. - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): drop standalone `rpc_auth` stub The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`. Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception. Addresses review comment on PR #1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * refactor(server-macros): emit fixed `AUTH_METADATA` const per service The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally. Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly. Addresses review comment on PR #1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR #1596. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. * Use `_make_cluster_bearer_provider` — a per-RPC closure that reads `oidc_token.json` on every invocation, returning the current `access_token` if fresh and raising `SandboxError` with a "re-authenticate with: openshell gateway login" hint if the token is missing, malformed, or expired (the 30 s grace window matches `openshell-bootstrap::oidc_token::is_token_expired`). A long-lived `SandboxClient` now picks up token rotations done by the CLI without being reconstructed. OAuth2 refresh itself stays in the CLI; the SDK only consumes what's on disk. Tested: - 23 SDK unit tests pass (5 existing + 18 new across the bearer interceptor, token provider, `TlsConfig` validation, and the `from_active_cluster` auth ladder). `mise run test:python` → 31 passed total. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift (deploy recipe at `architecture/plans/deploy-openshift.md`): * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against four pre-PR failure modes: * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel` * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError` * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Summary
Adds per-sandbox supervisor authentication for gateway RPCs and closes the
cross-sandbox access gap tracked in #1354. Sandbox supervisors now authenticate
as a specific
Principal::Sandbox, and gateway handlers authorize access bycomparing that authenticated principal to the sandbox named in each
sandbox-scoped request.
The implementation has two bootstrap paths:
through driver-managed supervisor secret files or guest secret material.
the same kind of gateway-minted JWT. The gateway validates the projected token
with Kubernetes
TokenReview, requires the configured sandbox ServiceAccountin the sandbox namespace, checks the pod name and UID, fetches the live pod,
and reads the gateway-owned
openshell.io/sandbox-idannotation.After bootstrap, all drivers converge on the same steady state: the supervisor
presents
Authorization: Bearer <gateway-jwt>, refreshes that credential inmemory, and is authorized only for its own sandbox.
Related Issue
Closes #1354
Changes
Authenticator/Principalrouting for gateway gRPCauthentication.
JWT files without exposing tokens through public APIs or user entrypoint
environments.
IssueSandboxTokenusing the Kubernetes
TokenReviewAPI.pods, with support for using an existing ServiceAccount.
and sets it on sandbox pods while keeping automatic ServiceAccount token
mounting disabled.
the configured sandbox namespace.
and inject it on every gateway call.
PushSandboxLogsto the first validated sandbox ID in the stream andrejects later frames that try to switch sandbox identity.
IssueSandboxTokenorRefreshSandboxTokenmint a token.debug-rpchelpers for end-to-end authentication testing.speed up Kubernetes e2e smoke tests.
per-sandbox identity model.
Implementation Details
Problem Context
Before this PR, sandbox-class handlers trusted a
sandbox_idor sandbox namesupplied in the request body. The shared mTLS client certificate only proved
that the caller had a gateway client certificate; it did not prove that the
caller was sandbox A rather than sandbox B. Any holder of that shared credential
could therefore ask for another sandbox's policy, drafts, provider environment,
or related sandbox-private state.
This PR moves the identity decision into the gateway authentication layer. The
router authenticates the caller, inserts a
Principalinto request extensions,and handlers compare that principal to the requested sandbox before serving
sandbox-private data.
Shared Gateway Auth Model
The gateway now uses a pluggable authenticator chain. Each authenticator can
produce a
Principal, decline so the next authenticator can try, or reject therequest fail-closed.
The steady-state sandbox credential is a gateway-minted Ed25519 JWT. Validation
checks issuer, audience, key ID, expiry, algorithm, and sandbox identity. The
token is intentionally short lived. Refresh mints a replacement for the same
sandbox principal, and older tokens remain valid only until their own expiry.
This JWT is supervisor identity material:
CreateSandboxResponse.Docker, Podman, And VM Bootstrap
Docker, Podman, and VM deployments do not have a platform identity service
equivalent to Kubernetes projected ServiceAccount tokens. For those drivers, the
gateway uses a push-based bootstrap pattern.
At sandbox creation time, the gateway mints a sandbox JWT for the new sandbox
and passes it to the in-process driver boundary as secret material. The driver
writes that token to a supervisor-only file, or VM guest secret material, and
starts the sandbox with
OPENSHELL_SANDBOX_TOKEN_FILEpointing at that file.The supervisor reads the file once at startup and then keeps the active token in
memory.
This path avoids the unsafe parts of the old model:
material.
Kubernetes Bootstrap
Kubernetes uses a pull-based bootstrap pattern because kubelet can provide a
short-lived, audience-bound, pod-bound ServiceAccount token to the sandbox pod.
The sandbox pod gets a projected ServiceAccount token mounted at a
supervisor-only path. On startup, the supervisor presents that token to
IssueSandboxToken. The gateway validates the token with KubernetesTokenReview, verifies the accepted audience, requires the exact configuredsandbox ServiceAccount username, extracts the bound pod name and UID, fetches
the live pod from the sandbox namespace, checks the UID, and reads the
openshell.io/sandbox-idannotation to derive the sandbox identity.The Helm chart now creates a dedicated sandbox ServiceAccount by default and
passes its name into the gateway's Kubernetes driver configuration. Operators
can disable creation and provide an existing ServiceAccount name. Sandbox pods
continue to set
automountServiceAccountToken: false; the only token madeavailable to the supervisor is the explicit projected token used for bootstrap.
Handler Authorization
Authentication alone is not enough; handlers still need to authorize access to
the requested sandbox.
Direct
sandbox_idhandlers compare the authenticatedPrincipal::Sandbox.sandbox_idto the requested ID. Name-keyed handlers resolvethe sandbox name to the canonical ID and then compare.
PushSandboxLogsauthorizes the first non-empty batch, verifies the sandbox still exists, stores
that sandbox ID for the stream, and rejects any later batch that names a
different sandbox.
User principals continue through the normal RBAC path. Sandbox principals are
limited to their own sandbox. Anonymous principals are rejected for
sandbox-scoped paths.
Token Lifecycle
IssueSandboxTokenis only available to Kubernetes ServiceAccount bootstrapprincipals.
RefreshSandboxTokenis only available to supervisors alreadyholding a gateway-minted JWT. Both paths require the sandbox record to still
exist before minting a token, so deleted or unknown sandboxes cannot keep
refreshing credentials.
Kubernetes supervisors can recover from restart by repeating the ServiceAccount
bootstrap exchange. Docker, Podman, and VM supervisors use their file or guest
secret bootstrap material and then rely on in-memory refresh for steady state.
Signing Key Persistence
The gateway JWT signing key is persisted through the existing local and Helm
PKI paths. Helm mounts the JWT key material into the gateway even when local TLS
is disabled, because per-sandbox authentication is independent from TLS
enablement.
Design Decisions For Reviewers
supervisor-only bootstrap material; Kubernetes pulls a token through
ServiceAccount exchange. Both become the same gateway JWT.
TokenReviewinstead of in-gateway JWT verification so theapiserver remains the source of truth for projected ServiceAccount token
validity and audience acceptance.
creating per-sandbox ServiceAccounts in this PR.
CreateSandboxResponse, sandboxmetadata, ordinary user environments, or logs.
identifies the target sandbox.
Reviewer Focus Areas
leakage into entrypoint environment, and cleanup behavior.
TokenReview, accepted audience, configuredServiceAccount, pod name/UID binding, annotation handling, and RBAC scope.
sandbox-scope guard or have a documented reason not to.
PushSandboxLogsshould not allow a stream to changesandbox identity after validation.
across gateway restarts; multi-replica gateways must share the same key
material.
gateway-minted tokens.
Testing
mise run pre-commitcargo test -p openshell-server auth_rpccargo test -p openshell-server auth::k8s_sacargo test -p openshell-server log_stream_scopecargo test -p openshell-driver-kubernetesmise run helm:docs:checkmise run helm:lintmise run helm:test/helm-dev-environmentChecklist