feat(auth): add OIDC/Keycloak authentication with RBAC and scope-based permissions #935
Merged
TaylorMutch reviewed Apr 28, 2026
Add OAuth2/OIDC authentication to the gateway server with role-based access control, CLI login flows, and full deployment plumbing.

Server: JWT validation against configurable OIDC issuer (oidc.rs), JWKS key caching with TTL and rotation handling, method classification (unauthenticated/sandbox-secret/dual-auth/bearer), identity extraction with provider-agnostic Identity type, and RBAC enforcement via AuthzPolicy with configurable admin/user roles and auth-only mode.

CLI: browser-based Authorization Code + PKCE flow, Client Credentials flow for CI/automation, token storage with refresh, gateway add/login/logout commands, OIDC bearer token injection over mTLS transport, discovery endpoint for auto-configuration.

Security: sandbox-secret scope restriction on UpdateConfig (policy sync only), anti-spoofing header stripping, dual-auth fallthrough from sandbox-secret to Bearer token.

Deployment: OIDC config wired through DeployOptions, Docker env vars, Helm values/templates, HelmChart manifest, cluster-entrypoint.sh, and bootstrap scripts. Keycloak dev server script with pre-configured realm (test users, roles, PKCE client, CI client).

Tested with Keycloak. The roles claim path and role names are configurable to support other OIDC providers.
Add opt-in scope enforcement on top of existing OIDC role-based access control. When --oidc-scopes-claim is set, the server extracts scopes from the JWT and checks them per-method against an exhaustive scope map.

Scopes: sandbox:read, sandbox:write, provider:read, provider:write, config:read, config:write, inference:read, inference:write, and openshell:all (wildcard). Methods not in the scope map require openshell:all. Scopes layer on top of roles and cannot escalate privilege. Auth-only mode (empty role names) still enforces scopes when enabled.

Server: scopes_claim in OidcConfig, scope extraction from JWT (space-delimited and JSON array formats), standard OIDC scope filtering, scope check in AuthzPolicy after role check.

CLI: --oidc-scopes on gateway add/start stored in metadata and consumed by gateway login, --oidc-scopes-claim on gateway start forwarded to server, scopes parameter in browser and client credentials OAuth2 flows with openid deduplication.

Deployment: oidc_scopes_claim wired through DeployOptions, docker.rs, Helm, bootstrap scripts, and cluster entrypoint.

Keycloak: realm config updated with built-in OIDC scopes and 9 OpenShell client scopes as optional on openshell-cli and openshell:all as default on openshell-ci.
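The role-then-scope ordering described in this commit can be sketched roughly as follows. This is an illustrative Python model, not the gateway's Rust `AuthzPolicy`: the `SCOPE_MAP` entries, the set of filtered standard OIDC scopes, and every function name here are assumptions made for the example.

```python
from typing import Any

# Assumption: which standard OIDC scopes get filtered out is not stated in the commit.
STANDARD_OIDC_SCOPES = {"openid", "profile", "email", "address", "phone", "offline_access"}
WILDCARD = "openshell:all"

# Hypothetical excerpt of the scope map; the pairings are illustrative only.
SCOPE_MAP = {
    "ListSandboxes": "sandbox:read",
    "CreateSandbox": "sandbox:write",
    "GetSandboxConfig": "config:read",
}


def extract_scopes(claim: Any) -> set[str]:
    """Accept a space-delimited string or a JSON array and drop standard OIDC scopes."""
    if isinstance(claim, str):
        items = claim.split()
    elif isinstance(claim, list):
        items = [s for s in claim if isinstance(s, str)]
    else:
        items = []
    return {s for s in items if s not in STANDARD_OIDC_SCOPES}


def required_scope(method: str) -> str:
    """Methods missing from the scope map fall back to the wildcard scope."""
    return SCOPE_MAP.get(method, WILDCARD)


def is_authorized(method: str, roles: set[str], scopes: set[str],
                  required_role: str, scopes_enabled: bool) -> bool:
    # Role check runs first; an empty required role models auth-only mode.
    if required_role and required_role not in roles:
        return False
    # Scopes are opt-in and can only narrow what the role already allows.
    if not scopes_enabled:
        return True
    return WILDCARD in scopes or required_scope(method) in scopes
```

Because the scope check only runs after the role check has passed, a token carrying `openshell:all` but lacking the required role is still rejected, which is one way to read "scopes cannot escalate privilege".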
Add GetInferenceBundle to sandbox-secret methods so sandbox inference route refresh works under OIDC. Make GetSandboxConfig dual-auth so CLI users can read sandbox settings with Bearer tokens. Preserve OIDC gateway metadata on restart — a bare gateway start without --oidc-* flags no longer erases the stored OIDC registration. Document CI client ID requirement (openshell-ci vs openshell-cli) in the testing guide. Add security note about auth-only mode blast radius for GitHub Actions.
Move OpenShell/GetSandboxConfig from sandbox-secret-only to dual-auth so CLI users can read sandbox settings with Bearer tokens while sandbox supervisors continue using the shared secret. Add sandbox secret interceptor to the inference bundle fetch path so GetInferenceBundle works under OIDC-enabled gateways. Extract shared interceptor constructor to avoid duplication. Add GetSandboxConfig to the config:read scope map so scope enforcement applies consistently when scopes are enabled. Refactor OIDC metadata preservation into apply_oidc_gateway_metadata() with explicit resume semantics — only preserve existing OIDC metadata on real resume paths, not on fresh deployments. Update architecture docs and testing guide to reflect the corrected method classifications and add new test coverage for interceptor injection, scope requirements, metadata preservation, and dual-auth classification.
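A minimal sketch of the four method classes and the dual-auth fallthrough, assuming a hypothetical `x-sandbox-secret` header and a caller-supplied `validate_bearer` function. The three table entries follow the classifications named in this PR; defaulting every other method to bearer is an assumption of the sketch, not a statement about the server.

```python
import hmac
from enum import Enum
from typing import Callable, Mapping


class AuthClass(Enum):
    UNAUTHENTICATED = "unauthenticated"
    SANDBOX_SECRET = "sandbox-secret"
    DUAL_AUTH = "dual-auth"
    BEARER = "bearer"


SECRET_HEADER = "x-sandbox-secret"  # hypothetical header name

# Illustrative excerpt; the real table lives in the server's auth module.
CLASSIFICATION = {
    "Health": AuthClass.UNAUTHENTICATED,
    "GetInferenceBundle": AuthClass.SANDBOX_SECRET,
    "GetSandboxConfig": AuthClass.DUAL_AUTH,
}


def classify(method: str) -> AuthClass:
    # Assumption: anything not listed requires a Bearer token.
    return CLASSIFICATION.get(method, AuthClass.BEARER)


def authenticate(method: str, headers: Mapping[str, str], shared_secret: str,
                 validate_bearer: Callable[[str], bool]) -> str:
    """Return 'anonymous', 'sandbox', or 'user'; raise PermissionError otherwise."""
    kind = classify(method)
    if kind is AuthClass.UNAUTHENTICATED:
        return "anonymous"
    if kind in (AuthClass.SANDBOX_SECRET, AuthClass.DUAL_AUTH):
        presented = headers.get(SECRET_HEADER, "")
        # Constant-time comparison of the presented secret.
        if presented and hmac.compare_digest(presented.encode(), shared_secret.encode()):
            return "sandbox"
        if kind is AuthClass.SANDBOX_SECRET:
            raise PermissionError("sandbox secret required")
    # Bearer-only methods, or dual-auth methods falling through from the secret check.
    token = headers.get("authorization", "")
    if token.startswith("Bearer ") and validate_bearer(token[len("Bearer "):]):
        return "user"
    raise PermissionError("valid bearer token required")
```

With this shape, a sandbox supervisor and a CLI user can both call `GetSandboxConfig`, while `GetInferenceBundle` stays reachable only with the shared secret.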
Replace hand-written PKCE generation, authorization URL construction, token exchange, client credentials, and token refresh with the oauth2 crate's typed API. Eliminates sha2, hex, and getrandom dependencies from the CLI. The custom urlencoded() helper and manual form POST logic are replaced by BasicClient methods with proper type-state safety. Discovery and the callback server remain custom since the oauth2 crate does not provide OIDC discovery or a localhost redirect listener.
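For readers unfamiliar with what the removed hand-written PKCE code had to compute, here is a short Python sketch of the S256 method from RFC 7636. It is not the CLI's code (the CLI now delegates this to the `oauth2` crate), and the function names are made up for the example.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 code challenge: base64url(SHA-256(verifier)) with padding removed."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for an authorization request."""
    # 32 random bytes encode to 43 URL-safe characters, the RFC 7636 minimum length.
    verifier = secrets.token_urlsafe(32)
    return verifier, challenge_for(verifier)
```

The client sends the challenge with the authorization request and the verifier with the token exchange, so an intercepted authorization code is useless without the verifier.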
Group oidc.rs, authz.rs, identity.rs, and the auth HTTP endpoints under src/auth/ module directory. No behavioral changes.

auth/mod.rs      - module root, re-exports HTTP router
auth/oidc.rs     - JWT validation, JWKS caching, method classification
auth/authz.rs    - role and scope authorization policy
auth/identity.rs - provider-agnostic Identity type
auth/http.rs     - /auth/connect and /auth/oidc-config endpoints
The oauth2 crate defaults to BasicAuth (HTTP Basic header) but Keycloak and most OIDC providers expect client_secret_post (credentials in the request body). Set AuthType::RequestBody explicitly to match the pre-refactor behavior. Also re-export Identity, IdentityProvider, and JwksCache from the auth module so ServerState's public API remains nameable by external consumers.
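The difference between the two client authentication styles can be shown by building (not sending) the token request each one produces. This is a Python sketch of the wire format only; the helper names are invented here, and the client ID and secret in the usage note are the test values listed in this PR's description.

```python
import base64
from urllib.parse import quote, urlencode

FORM = "application/x-www-form-urlencoded"


def client_secret_basic(client_id: str, client_secret: str, params: dict) -> tuple[dict, str]:
    """Credentials travel in an HTTP Basic Authorization header."""
    # RFC 6749 section 2.3.1: form-encode the id and secret before Base64.
    creds = f"{quote(client_id, safe='')}:{quote(client_secret, safe='')}"
    token = base64.b64encode(creds.encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Content-Type": FORM}
    return headers, urlencode(params)


def client_secret_post(client_id: str, client_secret: str, params: dict) -> tuple[dict, str]:
    """Credentials travel in the form body next to the grant parameters."""
    body = dict(params, client_id=client_id, client_secret=client_secret)
    return {"Content-Type": FORM}, urlencode(body)
```

For example, `client_secret_post("openshell-ci", "ci-test-secret", {"grant_type": "client_credentials"})` yields a body carrying `client_id` and `client_secret` and no `Authorization` header, which corresponds to the request-body style this commit selects.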
Pass --oidc-scopes to gateway start so the metadata includes requested scopes after cluster bootstrap. Without this, users had to manually edit metadata.json to set scopes for gateway login.

Usage:

    OPENSHELL_OIDC_SCOPES="openshell:all" mise run cluster
Add 10 end-to-end tests covering OIDC authentication against a live K3s cluster with Keycloak:

RBAC (5 tests): admin can create providers, user cannot, user can list sandboxes, unauthenticated requests rejected, health probe works without auth.

Scopes (4 tests): sandbox-scoped token can list sandboxes but not providers, openshell:all grants full access, no-scopes token denied.

Client credentials (1 test): CI token via client_credentials grant.

Tests are opt-in via OPENSHELL_E2E_OIDC=1 and OPENSHELL_E2E_OIDC_SCOPES=1 env vars. They derive the Keycloak URL from gateway metadata to match the server's configured issuer.

Run with:

    OPENSHELL_E2E_OIDC=1 OPENSHELL_E2E_OIDC_SCOPES=1 \
      PYTHONPATH=python uv run pytest e2e/python/oidc/ -v
TaylorMutch previously approved these changes Apr 29, 2026
Collaborator TaylorMutch left a comment:
This is a great starting point. Looking forward to iterating on it - thanks @mrunalp!
Add blank lines before lists and fenced code blocks to satisfy markdownlint MD031 and MD032 rules.
Collaborator, Author
@TaylorMutch I rebased the branch and the tests are green.
Collaborator
/ok to test a1196d6
TaylorMutch approved these changes Apr 30, 2026
Collaborator
Thank you @mrunalp!
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap: it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass:
  * full mTLS (all three)
  * CA-only trust (`ca_path` only)
  * system roots (`TlsConfig()`, for OIDC gateways behind a public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`:
  * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS.
  * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer.
  * Use `_make_cluster_bearer_provider`, a per-RPC closure that reads `oidc_token.json` on every invocation, returning the current `access_token` if fresh and raising `SandboxError` with a "re-authenticate with: openshell gateway login" hint if the token is missing, malformed, or expired (the 30 s grace window matches `openshell-bootstrap::oidc_token::is_token_expired`). A long-lived `SandboxClient` now picks up token rotations done by the CLI without being reconstructed. OAuth2 refresh itself stays in the CLI; the SDK only consumes what's on disk.

Tested:

- 23 SDK unit tests pass (5 existing + 18 new across the bearer interceptor, token provider, `TlsConfig` validation, and the `from_active_cluster` auth ladder). `mise run test:python` → 31 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift (deploy recipe at `architecture/plans/deploy-openshift.md`):
  * unauthenticated `Health` bypass works
  * admin + `openshell:all` reaches user-callable methods
  * reader (`sandbox:read`) denied on `CreateSandbox` by scope
  * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK)
  * full provider CRUD lifecycle via the SDK
  * callable token provider rotates per RPC as expected
- Regression-probed against four pre-PR failure modes:
  * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel`
  * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError`
  * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached
  * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
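The "required-together-or-not-at-all" rule and the three transport profiles could look roughly like the dataclass below. It is a stand-in written for illustration, not the SDK's actual `TlsConfig`: the field types, the `profile` property, and the profile labels are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TlsConfig:
    """Illustrative stand-in; every field is optional."""
    ca_path: Optional[str] = None
    cert_path: Optional[str] = None
    key_path: Optional[str] = None

    def __post_init__(self) -> None:
        # A client certificate is useless without its key, and vice versa.
        if (self.cert_path is None) != (self.key_path is None):
            raise ValueError("cert_path and key_path must be set together")

    @property
    def profile(self) -> str:
        """Hypothetical helper naming the transport profile this config selects."""
        if self.cert_path is not None:
            return "mtls"
        if self.ca_path is not None:
            return "ca-only"
        return "system-roots"
```

`TlsConfig()` then maps to system roots, `TlsConfig(ca_path="ca.crt")` to CA-only trust, and a config with all three paths to full mTLS; supplying only one of `cert_path` or `key_path` fails at construction time rather than at connect time.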
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap: it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass:
  * full mTLS (all three)
  * CA-only trust (`ca_path` only)
  * system roots (`TlsConfig()`, for OIDC gateways behind a public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`:
  * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS.
  * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer.
- `_OidcRefresher`: thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the IdP's error body included for diagnostics.
- `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=False)` factory. Default is the refresher path; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` (opt-in) atomically persists the rotated bundle with 0600 mode so other processes (including the Rust CLI) see the rotation. Off by default; treats the Rust CLI as the canonical writer.
- `from_active_cluster` exposes `auto_refresh` / `write_back` kwargs (defaults: True / False).

OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do; see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`):

| Library                        | Native refresh | Writes back |
|--------------------------------|----------------|-------------|
| google-auth Credentials        | yes            | no          |
| botocore SSOTokenProvider      | yes            | yes         |
| openshell SandboxClient (here) | yes (opt-out)  | opt-in      |

Refresh in the SDK is the production answer because:

- Long-running Python orchestrators (agent runs, data pipelines) outlast a Keycloak 1-hour access token. Without in-SDK refresh, they crash at expiry.
- Headless containers (sandbox-controller pods, GitHub Actions runners) may not have the Rust CLI installed but always have Python and a refresh_token.
- Subprocess-to-CLI per RPC would spawn `openshell` on every gRPC call, including hot streaming paths. Unacceptable.

The Rust CLI keeps owning interactive flows (browser/device-code, keyring storage, the initial login). The SDK owns refresh during script execution.

Tested:

- 32 SDK unit tests pass (5 existing + 27 new across the bearer interceptor, fail-closed provider, refresher behavior, `TlsConfig` validation, `from_active_cluster` auth ladder, and the refresher's concurrency / rotation / write-back / error paths). `mise run test:python` → 40 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
  * unauthenticated `Health` bypass works
  * admin + `openshell:all` reaches user-callable methods
  * reader (`sandbox:read`) denied on `CreateSandbox` by scope
  * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only methods at the router (the new gate is honored from the SDK)
  * full provider CRUD lifecycle via the SDK
  * callable token provider rotates per RPC as expected
- Regression-probed against the four pre-review failure modes:
  * `https://` OIDC gateway without `mtls/` no longer falls back to `insecure_channel`
  * CA-only `mtls/ca.crt` layout no longer raises `FileNotFoundError`
  * plaintext gateway with stale `oidc_token.json` no longer gets a bearer attached
  * long-lived client picks up rotated tokens; expired tokens surface as `SandboxError`, not silent gateway 401s
- Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with error body, and the client_credentials / no-refresh_token error path.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
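The "one refresh shared across concurrent RPCs" behavior can be approximated with a lock plus a second freshness check once the lock is held. The class below is a simplified sketch rather than `_OidcRefresher` itself: it leaves out the disk re-read, OIDC discovery, and write-back, and the `refresh` callable, the class name, and the 30-second skew default are assumptions.

```python
import threading
import time
from typing import Callable, Optional


class SingleFlightRefresher:
    """Caches a token and lets at most one thread refresh it at a time."""

    def __init__(self, refresh: Callable[[], tuple[str, float]],
                 skew: float = 30.0, clock: Callable[[], float] = time.time) -> None:
        self._refresh = refresh      # returns (access_token, expires_at_epoch_seconds)
        self._skew = skew            # treat tokens as stale this many seconds early
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def _fresh(self) -> bool:
        return self._token is not None and self._clock() < self._expires_at - self._skew

    def __call__(self) -> str:
        # Fast path: no lock while the cached token is still fresh.
        if self._fresh():
            return self._token
        with self._lock:
            # Re-check under the lock: another thread may have refreshed already.
            if not self._fresh():
                self._token, self._expires_at = self._refresh()
            return self._token
```

Threads that arrive while a refresh is in flight block on the lock, then find a fresh token on the second check and return it without calling the identity provider again.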
Summary
Add OAuth2/OIDC authentication with Keycloak, role-based access control, and scope-based fine-grained permissions to the gateway server and CLI.
Related Issue
Fixes #930
Changes
`--oidc-scopes` for fine-grained token requests, `oauth2` crate for typed OAuth2 flows
(`oidc-auth.md`), local testing guide (`oidc-local-testing.md`)

RBAC Role Mapping

openshell-user
openshell-user
openshell-admin
openshell-admin
openshell-admin

Scope Definitions (opt-in via `--oidc-scopes-claim`)

sandbox:read
sandbox:write
provider:read
provider:write
config:read
config:write
inference:read
inference:write
openshell:all

Scopes layer on top of roles: a caller needs both the required role AND the required scope. Scopes cannot escalate privilege.

Keycloak Test Users

admin@test | admin | openshell-admin, openshell-user
user@test | user | openshell-user

OIDC Clients

openshell-cli
openshell-ci | ci-test-secret

Testing

mise run pre-commit passes

Manual Testing Steps

Checklist