feat(auth): per-sandbox authentication to gateway by TaylorMutch · Pull Request #1404 · NVIDIA/OpenShell

TaylorMutch · 2026-05-15T16:16:25Z

Summary

Adds per-sandbox supervisor authentication for gateway RPCs and closes the
cross-sandbox access gap tracked in #1354. Sandbox supervisors now authenticate
as a specific Principal::Sandbox, and gateway handlers authorize access by
comparing that authenticated principal to the sandbox named in each
sandbox-scoped request.

The implementation has two bootstrap paths:

- Docker, Podman, and VM sandboxes receive gateway-minted JWT bootstrap material
  through driver-managed supervisor secret files or guest secret material.
- Kubernetes sandboxes exchange a projected, pod-bound ServiceAccount token for
  the same kind of gateway-minted JWT. The gateway validates the projected token
  with Kubernetes TokenReview, requires the configured sandbox ServiceAccount
  in the sandbox namespace, checks the pod name and UID, fetches the live pod,
  and reads the gateway-owned openshell.io/sandbox-id annotation.

After bootstrap, all drivers converge on the same steady state: the supervisor
presents Authorization: Bearer <gateway-jwt>, refreshes that credential in
memory, and is authorized only for its own sandbox.

Related Issue

Closes #1354

Changes

- Introduces Authenticator/Principal routing for gateway gRPC
  authentication.
- Adds gateway-minted sandbox JWT signing, validation, and refresh support.
- Adds Docker, Podman, and VM bootstrap plumbing that delivers supervisor-only
  JWT files without exposing tokens through public APIs or user entrypoint
  environments.
- Adds Kubernetes ServiceAccount token bootstrap through IssueSandboxToken
  using the Kubernetes TokenReview API.
- Provisions and configures a Helm-managed sandbox ServiceAccount for sandbox
  pods, with support for using an existing ServiceAccount.
- Configures the Kubernetes compute driver with the sandbox ServiceAccount name
  and sets it on sandbox pods while keeping automatic ServiceAccount token
  mounting disabled.
- Restricts Kubernetes bootstrap to the configured sandbox ServiceAccount and
  the configured sandbox namespace.
- Updates the supervisor gRPC client to acquire a bearer credential at startup
  and inject it on every gateway call.
- Enforces per-handler sandbox ID equality for sandbox-scoped RPCs.
- Pins PushSandboxLogs to the first validated sandbox ID in the stream and
  rejects later frames that try to switch sandbox identity.
- Requires persisted sandbox records before IssueSandboxToken or
  RefreshSandboxToken mint a token.
- Adds sandbox debug-rpc helpers for end-to-end authentication testing.
- Mounts sandbox JWT keys in Helm deployments even when local TLS is disabled.
- Updates helm-dev k3d setup to preload the default community sandbox image to
  speed up Kubernetes e2e smoke tests.
- Updates docs, Helm chart tests, and debugging guidance for the new
  per-sandbox identity model.

Implementation Details

Problem Context

Before this PR, sandbox-class handlers trusted a sandbox_id or sandbox name
supplied in the request body. The shared mTLS client certificate only proved
that the caller had a gateway client certificate; it did not prove that the
caller was sandbox A rather than sandbox B. Any holder of that shared credential
could therefore ask for another sandbox's policy, drafts, provider environment,
or related sandbox-private state.

This PR moves the identity decision into the gateway authentication layer. The
router authenticates the caller, inserts a Principal into request extensions,
and handlers compare that principal to the requested sandbox before serving
sandbox-private data.

Shared Gateway Auth Model

The gateway now uses a pluggable authenticator chain. Each authenticator can
produce a Principal, decline so the next authenticator can try, or reject the
request fail-closed.
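
The three outcomes can be pictured with a minimal sketch. Everything here is illustrative: the `AuthOutcome`, `Request`, `ToySandboxAuthenticator`, and `run_chain` names are hypothetical, the toy bearer format stands in for real JWT verification, and the assumption that an exhausted chain is an error is mine rather than taken from the gateway code.

```rust
/// Identity attached to a request once authentication succeeds.
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    User { subject: String },
    Sandbox { sandbox_id: String },
}

/// The three outcomes an authenticator can return (names are hypothetical).
#[derive(Debug, PartialEq)]
pub enum AuthOutcome {
    Authenticated(Principal),
    Declined,
    Rejected(String),
}

/// Hypothetical request view: only the Authorization header value, if any.
pub struct Request<'a> {
    pub authorization: Option<&'a str>,
}

pub trait Authenticator {
    fn authenticate(&self, req: &Request) -> AuthOutcome;
}

/// Toy authenticator that treats "Bearer sandbox:<id>" as a sandbox credential.
/// A real implementation would verify a signed JWT instead.
pub struct ToySandboxAuthenticator;

impl Authenticator for ToySandboxAuthenticator {
    fn authenticate(&self, req: &Request) -> AuthOutcome {
        // No header at all: decline so the next authenticator can try.
        let Some(header) = req.authorization else {
            return AuthOutcome::Declined;
        };
        // Not a bearer scheme: also not ours to judge.
        let Some(token) = header.strip_prefix("Bearer ") else {
            return AuthOutcome::Declined;
        };
        match token.strip_prefix("sandbox:") {
            Some(id) if !id.is_empty() => AuthOutcome::Authenticated(Principal::Sandbox {
                sandbox_id: id.to_string(),
            }),
            // A bearer was presented but is not acceptable: fail closed.
            _ => AuthOutcome::Rejected("invalid sandbox bearer".to_string()),
        }
    }
}

/// Runs authenticators in order and stops at the first success or rejection.
/// Assumption: a chain where every authenticator declines is an error.
pub fn run_chain(chain: &[Box<dyn Authenticator>], req: &Request) -> Result<Principal, String> {
    for authenticator in chain {
        match authenticator.authenticate(req) {
            AuthOutcome::Authenticated(principal) => return Ok(principal),
            AuthOutcome::Declined => continue,
            AuthOutcome::Rejected(reason) => return Err(reason),
        }
    }
    Err("no authenticator accepted the request".to_string())
}
```

In this sketch a rejection short-circuits the chain, so a later, more permissive authenticator never sees a request that carried a bad sandbox bearer.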

The steady-state sandbox credential is a gateway-minted Ed25519 JWT. Validation
checks issuer, audience, key ID, expiry, algorithm, and sandbox identity. The
token is intentionally short lived. Refresh mints a replacement for the same
sandbox principal, and older tokens remain valid only until their own expiry.
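
As a rough illustration of the checks listed above, the sketch below validates already-decoded claims. The struct and error names are hypothetical, signature verification is assumed to have succeeded beforehand, and the order of the checks is a guess. The only outside fact it relies on is that `EdDSA` is the JOSE algorithm name used for Ed25519 signatures.

```rust
/// Hypothetical decoded header and claims. Ed25519 signature verification is
/// assumed to have succeeded before this runs and is not shown.
#[derive(Debug, Clone)]
pub struct SandboxClaims {
    pub alg: String,
    pub kid: String,
    pub iss: String,
    pub aud: String,
    pub exp: u64, // seconds since the Unix epoch
    pub sandbox_id: String,
}

/// Values the gateway expects to see (field names are illustrative).
pub struct Expected {
    pub issuer: String,
    pub audience: String,
    pub key_id: String,
}

#[derive(Debug, PartialEq)]
pub enum ClaimError {
    WrongAlgorithm,
    UnknownKeyId,
    WrongIssuer,
    WrongAudience,
    Expired,
    MissingSandboxId,
}

/// Returns the sandbox ID the token speaks for, or the first failed check.
pub fn validate_claims(
    claims: &SandboxClaims,
    expected: &Expected,
    now: u64,
) -> Result<String, ClaimError> {
    if claims.alg != "EdDSA" {
        return Err(ClaimError::WrongAlgorithm);
    }
    if claims.kid != expected.key_id {
        return Err(ClaimError::UnknownKeyId);
    }
    if claims.iss != expected.issuer {
        return Err(ClaimError::WrongIssuer);
    }
    if claims.aud != expected.audience {
        return Err(ClaimError::WrongAudience);
    }
    // A token is treated as expired at the exact expiry second.
    if now >= claims.exp {
        return Err(ClaimError::Expired);
    }
    if claims.sandbox_id.is_empty() {
        return Err(ClaimError::MissingSandboxId);
    }
    Ok(claims.sandbox_id.clone())
}
```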

This JWT is supervisor identity material:

- It is not returned in CreateSandboxResponse.
- It is not stored in public sandbox metadata.
- It is not logged.
- It is kept out of ordinary user entrypoint environments.

Docker, Podman, And VM Bootstrap

Docker, Podman, and VM deployments do not have a platform identity service
equivalent to Kubernetes projected ServiceAccount tokens. For those drivers, the
gateway uses a push-based bootstrap pattern.

At sandbox creation time, the gateway mints a sandbox JWT for the new sandbox
and passes it to the in-process driver boundary as secret material. The driver
writes that token to a supervisor-only file, or VM guest secret material, and
starts the sandbox with OPENSHELL_SANDBOX_TOKEN_FILE pointing at that file.
The supervisor reads the file once at startup and then keeps the active token in
memory.

This path avoids the unsafe parts of the old model:

- The raw token does not cross the public gRPC API.
- The token is not placed in the user command environment.
- The token is scoped to one sandbox ID.
- Refresh rotates the in-memory bearer token without rewriting bootstrap
  material.
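
The supervisor-side read described in this section might look like the sketch below. Only the `OPENSHELL_SANDBOX_TOKEN_FILE` variable name comes from the PR. The function names, the whitespace trimming, and the empty-file check are assumptions.

```rust
use std::{env, fs, io, path::Path};

/// Environment variable named in the PR description.
pub const TOKEN_FILE_ENV: &str = "OPENSHELL_SANDBOX_TOKEN_FILE";

/// Reads the bootstrap token from a file and trims surrounding whitespace.
/// Rejecting an empty file is an assumption of this sketch.
pub fn read_token_file(path: &Path) -> io::Result<String> {
    let raw = fs::read_to_string(path)?;
    let token = raw.trim();
    if token.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "sandbox token file is empty",
        ));
    }
    Ok(token.to_string())
}

/// Resolves the file path from the environment and reads it once.
/// The caller keeps the returned token in memory; the file is not re-read.
pub fn load_bootstrap_token() -> io::Result<String> {
    let path = env::var_os(TOKEN_FILE_ENV).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::NotFound,
            "OPENSHELL_SANDBOX_TOKEN_FILE is not set",
        )
    })?;
    read_token_file(Path::new(&path))
}
```

A supervisor built this way would call `load_bootstrap_token()` once at startup and hand the result to whatever holds the active bearer, so later refreshes never touch the file.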

Kubernetes Bootstrap

Kubernetes uses a pull-based bootstrap pattern because kubelet can provide a
short-lived, audience-bound, pod-bound ServiceAccount token to the sandbox pod.

The sandbox pod gets a projected ServiceAccount token mounted at a
supervisor-only path. On startup, the supervisor presents that token to
IssueSandboxToken. The gateway validates the token with Kubernetes
TokenReview, verifies the accepted audience, requires the exact configured
sandbox ServiceAccount username, extracts the bound pod name and UID, fetches
the live pod from the sandbox namespace, checks the UID, and reads the
openshell.io/sandbox-id annotation to derive the sandbox identity.
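
The sequence of checks in this paragraph can be sketched as a pure function over already-fetched data. The struct and error names are hypothetical, the Kubernetes API calls are replaced by plain values and a lookup closure, and the sketch covers only the checks named in this paragraph. The `system:serviceaccount:<namespace>:<name>` username format is the standard Kubernetes form for ServiceAccount identities.

```rust
/// Hypothetical, simplified view of a TokenReview result.
pub struct TokenReview {
    pub authenticated: bool,
    pub username: String,
    pub audiences: Vec<String>,
    pub pod_name: Option<String>, // pod the token is bound to, if reported
    pub pod_uid: Option<String>,
}

/// Hypothetical view of the live pod fetched from the sandbox namespace.
pub struct LivePod {
    pub uid: String,
    pub sandbox_id_annotation: Option<String>, // openshell.io/sandbox-id
}

pub struct BootstrapConfig {
    pub namespace: String,
    pub service_account: String,
    pub audience: String,
}

#[derive(Debug, PartialEq)]
pub enum BootstrapError {
    NotAuthenticated,
    WrongAudience,
    WrongServiceAccount,
    MissingPodBinding,
    PodNotFound,
    PodUidMismatch,
    MissingAnnotation,
}

/// Applies the checks in order and returns the sandbox ID on success.
/// `fetch_pod` stands in for a live pod lookup by name.
pub fn resolve_sandbox_id<F>(
    review: &TokenReview,
    cfg: &BootstrapConfig,
    fetch_pod: F,
) -> Result<String, BootstrapError>
where
    F: Fn(&str) -> Option<LivePod>,
{
    if !review.authenticated {
        return Err(BootstrapError::NotAuthenticated);
    }
    if !review.audiences.iter().any(|a| a == &cfg.audience) {
        return Err(BootstrapError::WrongAudience);
    }
    let expected_user = format!(
        "system:serviceaccount:{}:{}",
        cfg.namespace, cfg.service_account
    );
    if review.username != expected_user {
        return Err(BootstrapError::WrongServiceAccount);
    }
    let (Some(name), Some(uid)) = (review.pod_name.as_deref(), review.pod_uid.as_deref()) else {
        return Err(BootstrapError::MissingPodBinding);
    };
    let pod = fetch_pod(name).ok_or(BootstrapError::PodNotFound)?;
    // The UID check guards against a recreated pod that reuses the name.
    if pod.uid != uid {
        return Err(BootstrapError::PodUidMismatch);
    }
    pod.sandbox_id_annotation
        .filter(|id| !id.is_empty())
        .ok_or(BootstrapError::MissingAnnotation)
}
```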

The Helm chart now creates a dedicated sandbox ServiceAccount by default and
passes its name into the gateway's Kubernetes driver configuration. Operators
can disable creation and provide an existing ServiceAccount name. Sandbox pods
continue to set automountServiceAccountToken: false; the only token made
available to the supervisor is the explicit projected token used for bootstrap.

Handler Authorization

Authentication alone is not enough; handlers still need to authorize access to
the requested sandbox.

Direct sandbox_id handlers compare the authenticated
Principal::Sandbox.sandbox_id to the requested ID. Name-keyed handlers resolve
the sandbox name to the canonical ID and then compare. PushSandboxLogs
authorizes the first non-empty batch, verifies the sandbox still exists, stores
that sandbox ID for the stream, and rejects any later batch that names a
different sandbox.

User principals continue through the normal RBAC path. Sandbox principals are
limited to their own sandbox. Anonymous principals are rejected for
sandbox-scoped paths.
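
A minimal sketch of the guard and the stream pinning described in this section follows. All names are hypothetical, the error variants are placeholders for whatever gRPC status the gateway returns, and user RBAC is assumed to be enforced elsewhere.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    User { subject: String },
    Sandbox { sandbox_id: String },
    Anonymous,
}

#[derive(Debug, PartialEq)]
pub enum AuthzError {
    PermissionDenied,
    Unauthenticated,
}

/// Same-sandbox guard: a sandbox principal may only name its own sandbox.
pub fn ensure_sandbox_scope(principal: &Principal, requested: &str) -> Result<(), AuthzError> {
    match principal {
        Principal::Sandbox { sandbox_id } if sandbox_id.as_str() == requested => Ok(()),
        Principal::Sandbox { .. } => Err(AuthzError::PermissionDenied),
        // Assumption: user principals are checked by the normal RBAC path.
        Principal::User { .. } => Ok(()),
        Principal::Anonymous => Err(AuthzError::Unauthenticated),
    }
}

/// Per-stream state that pins the stream to the first validated sandbox ID.
pub struct LogStreamGuard {
    pinned: Option<String>,
}

impl LogStreamGuard {
    pub fn new() -> Self {
        Self { pinned: None }
    }

    /// Checks one batch. `entries` is the number of log entries it carries and
    /// `sandbox_exists` stands in for a lookup of the persisted record.
    pub fn check_batch<F>(
        &mut self,
        principal: &Principal,
        sandbox_id: &str,
        entries: usize,
        sandbox_exists: F,
    ) -> Result<(), AuthzError>
    where
        F: Fn(&str) -> bool,
    {
        // Once pinned, any batch naming a different sandbox is rejected.
        if let Some(pinned) = &self.pinned {
            return if pinned.as_str() == sandbox_id {
                Ok(())
            } else {
                Err(AuthzError::PermissionDenied)
            };
        }
        // Empty batches carry no data and do not pin the stream.
        if entries == 0 {
            return Ok(());
        }
        ensure_sandbox_scope(principal, sandbox_id)?;
        if !sandbox_exists(sandbox_id) {
            return Err(AuthzError::PermissionDenied);
        }
        self.pinned = Some(sandbox_id.to_string());
        Ok(())
    }
}
```

Validating once and then comparing against the pinned ID keeps the per-batch cost low while still preventing a stream from switching identity after it was authorized.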

Token Lifecycle

IssueSandboxToken is only available to Kubernetes ServiceAccount bootstrap
principals. RefreshSandboxToken is only available to supervisors already
holding a gateway-minted JWT. Both paths require the sandbox record to still
exist before minting a token, so deleted or unknown sandboxes cannot keep
refreshing credentials.

Kubernetes supervisors can recover from restart by repeating the ServiceAccount
bootstrap exchange. Docker, Podman, and VM supervisors use their file or guest
secret bootstrap material and then rely on in-memory refresh for steady state.
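
The minting rules in this section reduce to two checks: the caller kind must match the RPC, and the sandbox record must still exist. The sketch below illustrates that under assumed names (`Caller`, `TokenRpc`, `authorize_mint`), with a `HashSet` standing in for the persisted sandbox store.

```rust
use std::collections::HashSet;

/// How the caller authenticated (variant names are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Caller {
    K8sBootstrap { sandbox_id: String },
    SandboxJwt { sandbox_id: String },
    User,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TokenRpc {
    Issue,
    Refresh,
}

#[derive(Debug, PartialEq)]
pub enum MintError {
    WrongCallerKind,
    SandboxNotFound,
}

/// Returns the sandbox ID a new token may be minted for.
pub fn authorize_mint(
    rpc: TokenRpc,
    caller: &Caller,
    records: &HashSet<String>,
) -> Result<String, MintError> {
    // Issue is reserved for ServiceAccount bootstrap principals; Refresh is
    // reserved for holders of a gateway-minted JWT.
    let sandbox_id = match (rpc, caller) {
        (TokenRpc::Issue, Caller::K8sBootstrap { sandbox_id }) => sandbox_id,
        (TokenRpc::Refresh, Caller::SandboxJwt { sandbox_id }) => sandbox_id,
        _ => return Err(MintError::WrongCallerKind),
    };
    // Deleted or unknown sandboxes cannot obtain new credentials.
    if !records.contains(sandbox_id) {
        return Err(MintError::SandboxNotFound);
    }
    Ok(sandbox_id.clone())
}
```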

Signing Key Persistence

The gateway JWT signing key is persisted through the existing local and Helm
PKI paths. Helm mounts the JWT key material into the gateway even when local TLS
is disabled, because per-sandbox authentication is independent from TLS
enablement.

Design Decisions For Reviewers

- Two bootstrap patterns, one steady-state credential: local and VM drivers push
  supervisor-only bootstrap material; Kubernetes pulls a token through
  ServiceAccount exchange. Both become the same gateway JWT.
- Kubernetes uses TokenReview instead of in-gateway JWT verification so the
  apiserver remains the source of truth for projected ServiceAccount token
  validity and audience acceptance.
- The Helm chart provisions a sandbox ServiceAccount by default rather than
  creating per-sandbox ServiceAccounts in this PR.
- No per-sandbox Kubernetes Secret objects are created for bootstrap.
- No raw token is exposed through public APIs, CreateSandboxResponse, sandbox
  metadata, ordinary user environments, or logs.
- mTLS remains transport protection, not sandbox identity.
- Handler checks are explicit because handlers know which request field
  identifies the target sandbox.

Reviewer Focus Areas

- Docker, Podman, and VM token file handling: supervisor-only placement, no
  leakage into entrypoint environment, and cleanup behavior.
- Kubernetes bootstrap validation: TokenReview, accepted audience, configured
  ServiceAccount, pod name/UID binding, annotation handling, and RBAC scope.
- Handler coverage: every sandbox-private RPC should either call the
  sandbox-scope guard or have a documented reason not to.
- Streaming RPC behavior: PushSandboxLogs should not allow a stream to change
  sandbox identity after validation.
- Signing key persistence: local and Helm deployments must preserve the JWT key
  across gateway restarts; multi-replica gateways must share the same key
  material.
- Token lifecycle edges: unknown or deleted sandbox records must not receive new
  gateway-minted tokens.

Testing

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-05-15T16:16:29Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


github-actions · 2026-05-15T22:08:03Z

Label test:e2e applied for f4daea6. Open Branch E2E Checks, find the run for commit f4daea6, and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

github-actions · 2026-05-18T18:53:26Z

🌿 Preview your docs: https://nvidia-preview-pr-1404.docs.buildwithfern.com/openshell

github-actions · 2026-05-20T00:52:28Z

Label test:e2e-kubernetes applied for a5cb368. Open the existing run and click Re-run all jobs to execute with the label set. The matching required CI gate status on this PR will flip green automatically once the run finishes.

dimityrmirchev

Thanks for addressing my comment regarding the TokenReview API and for introducing additional security hardening with 2bcf8a9 and 3a15997. The changes look good to me.

johntmyers · 2026-05-21T14:20:13Z

Security/stability review notes from a focused pass on the per-sandbox auth changes.

1. Sandbox callers can drop `Authorization` and become `Principal::User`

Refs: crates/openshell-server/src/multiplex.rs, crates/openshell-server/src/auth/authenticator.rs, crates/openshell-server/src/auth/guard.rs

The new router still has paths where a sandbox can avoid Principal::Sandbox by omitting the sandbox JWT. In no-OIDC deployments, the chain falls through to PermissiveUserAuthenticator, producing a synthetic user. In mTLS deployments, missing bearer auth can fall back to the shared peer cert, which is issued as openshell-user. Once represented as Principal::User, the sandbox allowlist and same-sandbox guard no longer constrain the request.

Suggested way forward:

- Treat sandbox-originated transport identity separately from user identity. A request coming from supervisor/shared sandbox mTLS material should not be promotable to Principal::User just because the bearer is missing.
- Consider removing the peer-cert fallback for authenticated gRPC methods once sandbox JWT auth is configured, or only allowing it for known CLI/front-proxy user paths.
- In dev/no-OIDC mode, keep permissive user auth for external user clients, but deny supervisor-callable paths when the request lacks a valid sandbox principal.
- Add regression tests that call a sandbox-allowed method and a user/admin method with no Authorization under no-OIDC and mTLS configurations.

2. VM sandboxes do not receive any supervisor token source

Refs: crates/openshell-driver-vm/src/driver.rs, crates/openshell-sandbox/src/grpc_client.rs

The supervisor now requires OPENSHELL_SANDBOX_TOKEN, OPENSHELL_SANDBOX_TOKEN_FILE, or OPENSHELL_K8S_SA_TOKEN_FILE. The VM driver builds guest environment without any of those values, so VM sandboxes will fail before opening the gateway control stream.

Suggested way forward:

- Plumb DriverSandboxSpec.sandbox_token into the VM guest as a root-only guest file and set OPENSHELL_SANDBOX_TOKEN_FILE.
- Avoid passing the raw token in the process environment if possible.
- Add VM coverage that validates the supervisor reaches ConnectSupervisor after auth is enabled.

3. Existing installs cannot upgrade cleanly when JWT material is missing

Refs: crates/openshell-server/src/certgen.rs

Existing deployments already have server/client TLS material but no JWT secret/files. The new certgen state machine treats that as partial PKI state and aborts, requiring operators to delete all PKI artifacts and rotate TLS just to add JWT keys.

Suggested way forward:

- Split TLS material and JWT material into independently recoverable states.
- If server/client TLS already exist and only JWT material is missing, generate only the JWT keypair/secret.
- Keep the fatal partial-state path for genuinely inconsistent TLS sets, but make “old install missing JWT” a supported migration.
- Add a test for the pre-PR upgrade state: server secret exists, client secret exists, JWT secret missing.

4. A leaked sandbox JWT can be refreshed indefinitely

Refs: crates/openshell-server/src/grpc/auth_rpc.rs, crates/openshell-server/src/auth/sandbox_jwt.rs

RefreshSandboxToken accepts any still-valid gateway JWT and mints another token while the sandbox record exists. There is no jti, token generation counter, revocation state, rotation binding, or proof-of-possession. A one-time bearer leak can therefore become durable sandbox impersonation until the sandbox is deleted.

Suggested way forward:

- Add a jti or token generation claim and persist the currently valid refresh generation per sandbox.
- Rotate on refresh and reject reused/old tokens after a grace window.
- Alternatively, make access tokens non-refreshable and require refresh through a stronger bootstrap proof such as K8s SA token exchange or driver-owned secret material.
- Consider binding refresh to the sandbox transport identity where available.

5. Docker/Podman and debug tooling expose raw gateway JWTs

Refs: crates/openshell-driver-docker/src/lib.rs, crates/openshell-driver-podman/src/container.rs, crates/openshell-sandbox/src/debug_rpc.rs, proto/compute_driver.proto, crates/openshell-core/src/sandbox_env.rs

The proto/core comments say local drivers materialize the token via a per-sandbox file, but Docker/Podman inject the raw JWT into the container environment. That makes the bearer visible through container metadata and privileged exec workflows. openshell-sandbox debug-rpc refresh also prints a fresh token to stdout.

Suggested way forward:

- Use a root-only bind-mounted token file for Docker/Podman and set OPENSHELL_SANDBOX_TOKEN_FILE instead of OPENSHELL_SANDBOX_TOKEN.
- Keep stripping token-related env vars from the user entrypoint, but do not put the credential in container metadata in the first place.
- Gate token-printing debug subcommands behind an explicit dev/test feature or remove raw token output. If a diagnostic is needed, print claims/fingerprint/expiry instead of the bearer.

6. K8s bootstrap trusts a forgeable pod annotation

Refs: crates/openshell-server/src/auth/k8s_sa.rs, crates/openshell-driver-kubernetes/src/driver.rs

TokenReview proves that the caller has a pod-bound token for the shared sandbox ServiceAccount, then the gateway trusts the pod's openshell.io/sandbox-id annotation. Any actor that can create a pod in the sandbox namespace using that ServiceAccount can annotate it with a victim sandbox ID and mint a gateway JWT for that sandbox.

Suggested way forward:

- Verify the pod is OpenShell-owned, not just annotated. For example, check owner references/UID against the corresponding Sandbox CR or a gateway-persisted pod UID recorded at create time.
- Consider per-sandbox ServiceAccounts or a non-forgeable driver-generated nonce stored both in gateway state and the pod spec.
- Add a negative test where a pod with the right ServiceAccount but forged annotation is rejected.

7. Short JWT TTLs break refresh deterministically

Refs: crates/openshell-core/src/config.rs, crates/openshell-sandbox/src/grpc_client.rs

gateway_jwt.ttl_secs accepts any value, but the refresh loop floors its sleep at 60 seconds. With ttl_secs <= 60, refresh happens after expiry. If the gateway is unreachable until after expiry, the loop also keeps trying refresh with an expired gateway JWT rather than re-running bootstrap.

Suggested way forward:

- Enforce a server-side minimum TTL or reject unsupported values at config load.
- Compute refresh delay without a 60s floor when the token lifetime is short.
- On Unauthenticated/expired-token refresh failures, re-run the available bootstrap path where possible, especially K8s SA token exchange.
- Add a TTL=30s regression test and an outage-past-expiry recovery test.
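
The arithmetic behind finding 7 can be illustrated with a small sketch. The 80% refresh fraction and all names are assumptions made for illustration. Only the 60-second floor and the re-bootstrap suggestion come from the finding itself.

```rust
use std::time::Duration;

/// Delay with a 60-second floor, as the finding describes.
/// The 4/5 fraction is an assumption of this sketch.
pub fn floored_delay(remaining: Duration) -> Duration {
    (remaining * 4 / 5).max(Duration::from_secs(60))
}

/// Delay proportional to the remaining lifetime, with no floor, so it is
/// always shorter than the lifetime for any non-zero TTL.
pub fn refresh_delay(remaining: Duration) -> Duration {
    remaining * 4 / 5
}

#[derive(Debug, PartialEq)]
pub enum RefreshFailure {
    Unauthenticated,
    Unavailable,
}

#[derive(Debug, PartialEq)]
pub enum NextStep {
    RerunBootstrap,
    RetryRefresh,
}

/// After an unauthenticated refresh failure, fall back to bootstrap when a
/// bootstrap path (for example the K8s SA token exchange) is available.
pub fn next_step(failure: &RefreshFailure, can_rebootstrap: bool) -> NextStep {
    match failure {
        RefreshFailure::Unauthenticated if can_rebootstrap => NextStep::RerunBootstrap,
        _ => NextStep::RetryRefresh,
    }
}
```

With a 30-second TTL, `floored_delay` yields 60 seconds, which is past expiry, while `refresh_delay` yields 24 seconds.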

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

Require persisted sandbox records before IssueSandboxToken and RefreshSandboxToken mint gateway JWTs. This closes the stale-token path where a deleted sandbox identity could continue refreshing itself until token expiry windows were repeatedly extended.

Pin PushSandboxLogs streams to the first validated sandbox id. A sandbox now validates scope and sandbox existence once, then any later batch that changes sandbox_id is rejected instead of being accepted under the original validation.

For Kubernetes bootstrap, add service_account_name to the Kubernetes driver config, set it on sandbox pod specs, and require TokenReview usernames to match system:serviceaccount:<sandbox-namespace>:<service-account>. The Helm chart provisions a dedicated sandbox ServiceAccount, places it in the sandbox namespace, scopes sandbox RBAC there, and writes the generated name into gateway.toml.

Update Helm unit coverage, Helm README, gateway/driver docs, architecture notes, and debug-openshell-cluster guidance for the new sandbox ServiceAccount behavior.

Validation: mise run pre-commit; Kubernetes smoke e2e via helm-dev-environment/k3d; Docker smoke e2e; Podman smoke e2e.

Address PR review feedback on the per-sandbox authentication changes.

Remove the implicit permissive user fallback once sandbox or user auth is configured. Missing credentials now fail closed unless an explicit local mode is selected. Keep mTLS user auth as a local single-player option for Docker, Podman, and VM gateways, reject it for Kubernetes, and add an explicit unsafe unauthenticated-user switch for trusted local Skaffold/Kubernetes development.

Deliver sandbox JWTs through driver-owned token files for Docker, Podman, and VM sandboxes instead of placing raw bearers in container or guest environment metadata. Strip token env overrides from user-provided sandbox environments and update debug-rpc helpers to print token fingerprints, expiry, and claims rather than raw bearer values.

Make certgen upgrades recover existing TLS-only installs by creating just the missing gateway JWT signing material while preserving existing TLS certificates and keys. Keep partial-state failures for inconsistent TLS or JWT sets.

Improve supervisor token refresh behavior for short JWT TTLs by removing the 60-second refresh floor, using shorter retry backoff, and re-running the Kubernetes ServiceAccount bootstrap path after unauthenticated refresh failures.

Update Helm defaults, Skaffold values, e2e gateway setup, Python gateway metadata handling, architecture notes, published docs, and generated chart docs to describe the new auth modes and local development behavior.

Validation: mise run pre-commit; Docker smoke e2e; Podman smoke e2e; Kubernetes smoke e2e.

Tighten the follow-up review points from PR 1404.

Restrict GetInferenceBundle to sandbox principals because the response carries provider route credentials. User callers continue to manage inference through the user-facing inference APIs.

Fail startup for in-cluster K8s ServiceAccount bootstrap when the Kubernetes driver config is missing instead of silently falling back to the default namespace.

Collapse sandbox-principal name lookups on supervisor-callable policy RPCs so missing and foreign sandbox names both return PermissionDenied, avoiding sandbox name enumeration.

Rename the local unauthenticated development identity provider from Internal to LocalDev, add Helm coverage for OIDC CA mounts with TLS disabled, fill generated sandboxJwt values docs, and document the current offline JWT signing-key rotation procedure.

Follow-up: created #1510 for online gateway sandbox JWT key rotation.

Validation: mise run pre-commit.

Require Kubernetes ServiceAccount bootstrap to validate the live pod's controlling Sandbox ownerReference before minting a gateway sandbox JWT. The K8s resolver now verifies the pod sandbox-id annotation, the controlling Sandbox CR UID, and the Sandbox CR sandbox-id label in addition to TokenReview and live pod UID checks. Update gateway architecture and user-facing auth docs to describe the additional Kubernetes bootstrap binding checks. Tested with focused k8s_sa tests, full pre-commit, and a fresh Helm dev cluster sandbox create.

Sandbox pods now run as the openshell-sandbox service account, so OpenShift installs must grant the privileged SCC to that service account instead of default. Update the published OpenShift guide and Helm chart README template/generated README. Tested with markdown lint and Helm docs check.

Serialize the supervisor's first sandbox JWT acquisition so concurrent startup clients reuse the process-wide token slot instead of racing into duplicate K8s ServiceAccount bootstrap exchanges.

Set XDG_STATE_HOME inside Docker e2e's host-visible workdir. GitHub Actions container jobs talk to the host Docker daemon, so driver-owned sandbox JWT bind mounts must resolve from a path visible on both sides.

Verification: mise run pre-commit; OPENSHELL_E2E_DOCKER_TEST=bypass_detection OPENSHELL_SUPERVISOR_IMAGE=openshell/supervisor:dev-25-g15a2a59fb-dirty e2e/rust/e2e-docker.sh; local helm dev deploy plus sandbox log check showed one K8s token exchange line.
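
One way to picture the serialized first acquisition is a lock-guarded slot, sketched below. The `TokenSlot` type and its method are hypothetical, and a blocking `std::sync::Mutex` stands in for whatever synchronization primitive the supervisor actually uses.

```rust
use std::sync::Mutex;

/// Process-wide slot holding the active sandbox token (hypothetical type).
pub struct TokenSlot {
    inner: Mutex<Option<String>>,
}

impl TokenSlot {
    pub const fn new() -> Self {
        Self {
            inner: Mutex::new(None),
        }
    }

    /// Returns the cached token, or runs `bootstrap` while holding the lock so
    /// concurrent callers wait for the first exchange instead of repeating it.
    pub fn get_or_bootstrap<F>(&self, bootstrap: F) -> Result<String, String>
    where
        F: FnOnce() -> Result<String, String>,
    {
        let mut slot = self.inner.lock().map_err(|_| "token slot poisoned".to_string())?;
        if let Some(token) = slot.as_ref() {
            return Ok(token.clone());
        }
        // Only the first caller reaches this point; a failure leaves the slot
        // empty so a later caller can try again.
        let token = bootstrap()?;
        *slot = Some(token.clone());
        Ok(token)
    }
}
```

Holding the lock across the bootstrap call is the point of the design: it trades a short wait for the guarantee that only one exchange is issued per process start.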

The cred-inject fields already use the 9000+ range reserved for openlock fork additions, but the later bind-mount `volumes` field and the `SANDBOX_PHASE_STOPPED` enum value grabbed the next sequential numbers (11 and 6). That put them in upstream's path: upstream's per-sandbox-auth work (NVIDIA#1404) took DriverSandboxSpec field 11 (`sandbox_token`), colliding with `volumes`, and upstream could extend SandboxPhase past 5 at any time.

Move both into the fork range so the delta is permanently collision-proof and the convention is uniform:

- SandboxSpec.volumes 11 -> 9003 (openshell.proto)
- DriverSandboxSpec.volumes 12 -> 9003 (compute_driver.proto)
- SANDBOX_PHASE_STOPPED 6 -> 9004 (openshell.proto)

Safe to renumber: `volumes` is transient provisioning input and the phase is re-derived from backend state on every watch (never persisted as the enum int), so no gateway-DB upgrade-path break. Gateway, CLI, and sandbox binaries always ship as one matched fork tag, so there is no wire skew.

PR NVIDIA#1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-NVIDIA#1404 identity model. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

… enforce at the router (#1596)

* feat(server): per-handler gRPC auth annotations

  Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name.

  Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server): rename `sandbox-secret` auth mode to `sandbox`

  PR #1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-#1404 identity model.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(server): enforce per-handler AuthMode at the router

  Addresses review feedback on the per-handler auth-annotation work.

  - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods.
  - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile.
  - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): drop standalone `rpc_auth` stub

  The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.

  Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope"; clear enough, and one fewer import + lint exception.

  Addresses review comment on PR #1596.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): emit fixed `AUTH_METADATA` const per service

  The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services; a fixed `AUTH_METADATA` name reads more naturally.

  Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly.

  Addresses review comment on PR #1596.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment

  OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.

  Addresses review nit on PR #1596.

  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap: it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass:
  * full mTLS (all three)
  * CA-only trust (`ca_path` only)
  * system roots (`TlsConfig()`, for OIDC gateways behind a public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`:
  * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS.
  * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer.
  * Use `_make_cluster_bearer_provider`, a per-RPC closure that reads `oidc_token.json` on every invocation, returning the current `access_token` if fresh and raising `SandboxError` with a "re-authenticate with: openshell gateway login" hint if the token is missing, malformed, or expired (the 30 s grace window matches `openshell-bootstrap::oidc_token::is_token_expired`). A long-lived `SandboxClient` now picks up token rotations done by the CLI without being reconstructed. OAuth2 refresh itself stays in the CLI; the SDK only consumes what's on disk.

Tested:

- 23 SDK unit tests pass (5 existing + 18 new across the bearer interceptor, token provider, `TlsConfig` validation, and the `from_active_cluster` auth ladder). `mise run test:python` → 31 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift (deploy recipe at `architecture/plans/deploy-openshift.md`):
  * unauthenticated `Health` bypass works
  * admin + `openshell:all` reaches user-callable methods
  * reader (`sandbox:read`) denied on `CreateSandbox` by scope
  * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK)
  * full provider CRUD lifecycle via the SDK
  * callable token provider rotates per RPC as expected
- Regression-probed against four pre-PR failure modes:
  * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel`
  * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError`
  * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached
  * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

TaylorMutch force-pushed the tmutch/gateway-config-impl branch 2 times, most recently from 381784e to 9bc2e11 Compare May 15, 2026 19:17

Base automatically changed from tmutch/gateway-config-impl to main May 15, 2026 19:43

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch from 834b56e to f4daea6 Compare May 15, 2026 20:41

TaylorMutch changed the title ~~feat: per-sandbox authentication~~ feat: per-sandbox authentication to gateway May 15, 2026

TaylorMutch changed the title ~~feat: per-sandbox authentication to gateway~~ feat(auth): per-sandbox authentication to gateway May 15, 2026

TaylorMutch added the test:e2e Requires end-to-end coverage label May 15, 2026

TaylorMutch marked this pull request as ready for review May 15, 2026 22:07

TaylorMutch requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 22:07

TaylorMutch mentioned this pull request May 15, 2026

feat(auth): add SPIFFE supervisor authentication #1414

Open

7 tasks

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch 3 times, most recently from 719f5f5 to 0d7df90 Compare May 19, 2026 00:06

dimityrmirchev reviewed May 19, 2026

Comment thread crates/openshell-server/src/auth/k8s_sa.rs

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch 3 times, most recently from 602d5ea to a5cb368 Compare May 19, 2026 23:23

TaylorMutch added the test:e2e-kubernetes Requires Kubernetes end-to-end coverage label May 20, 2026

TaylorMutch requested a review from dimityrmirchev May 20, 2026 06:03

dimityrmirchev reviewed May 20, 2026

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch 2 times, most recently from 6801cb1 to c3657cd Compare May 20, 2026 21:09

TaylorMutch added 14 commits May 21, 2026 16:06

fix(helm): mount sandbox JWT keys without TLS

9bff8b0

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

test(e2e): configure sandbox JWT keys in harnesses

e0683d1

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

refactor(auth): remove sandbox token revocation

527ade3

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

test(server): fix rebased test fixtures

db4b90a

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

docs(helm): update chart values reference

231b5f8

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

chore(markdown): ignore local architecture plans

6d1bc65

fix(server): restrict sandbox principal RPC access

90ba664

fix(server): validate k8s serviceaccount tokens with tokenreview

6426627

fix(server): allow sandbox inference bundle fetch

c3eb704

fix(server): treat podman created state as provisioning

c70cb15

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch from c3657cd to c0342fd Compare May 21, 2026 23:20

TaylorMutch added 2 commits May 21, 2026 16:30

johntmyers approved these changes May 22, 2026

TaylorMutch merged commit a3b16c1 into main May 22, 2026
44 checks passed

TaylorMutch deleted the tmutch/per-supervisor-authn branch May 22, 2026 00:58

zanetworker mentioned this pull request May 22, 2026

fix(server): add ConnectSupervisor and RelayStream to SANDBOX_METHODS #1475

Merged

8 tasks

vessux mentioned this pull request May 26, 2026

chore(sync): rebase fork onto upstream/main (2026-05-27) vessux/OpenShell#6

Merged

5 tasks

mrunalp mentioned this pull request May 27, 2026

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router #1586

Closed

4 tasks

zanetworker mentioned this pull request May 27, 2026

docs(rfc): propose SDK consumption entrypoints and file transfer #1590

Open

7 tasks

mrunalp mentioned this pull request May 27, 2026

feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router #1596

Merged

7 tasks

snarkipus mentioned this pull request May 28, 2026

bug(auth): local-driver sandboxes cannot restart after on-disk bootstrap JWT expires #1603

Open

3 tasks

TaylorMutch merged 26 commits into main from tmutch/per-supervisor-authn

4 participants

Conversation

TaylorMutch commented May 15, 2026 • edited

Summary

Related Issue

Changes

Implementation Details

Problem Context

Shared Gateway Auth Model

Docker, Podman, And VM Bootstrap

Kubernetes Bootstrap

Handler Authorization

Token Lifecycle

Signing Key Persistence

Design Decisions For Reviewers

Reviewer Focus Areas

Testing

Checklist

copy-pr-bot Bot commented May 15, 2026

github-actions Bot commented May 15, 2026

github-actions Bot commented May 18, 2026

github-actions Bot commented May 20, 2026

dimityrmirchev left a comment

johntmyers commented May 21, 2026 • edited

1. Sandbox callers can drop Authorization and become Principal::User

2. VM sandboxes do not receive any supervisor token source

3. Existing installs cannot upgrade cleanly when JWT material is missing

4. A leaked sandbox JWT can be refreshed indefinitely

5. Docker/Podman and debug tooling expose raw gateway JWTs

6. K8s bootstrap trusts a forgeable pod annotation

7. Short JWT TTLs break refresh deterministically

johntmyers commented May 21, 2026 • edited

1. Sandbox callers can drop `Authorization` and become `Principal::User`