feat(auth): add OIDC/Keycloak authentication with RBAC and scope-based permissions #935

Merged: TaylorMutch merged 10 commits into NVIDIA:main from mrunalp:feat/oidc-keycloak on Apr 30, 2026

@mrunalp (Collaborator) commented Apr 23, 2026

Summary

Add OAuth2/OIDC authentication with Keycloak, role-based access control, and scope-based fine-grained permissions to the gateway server and CLI.

Related Issue

Fixes #930

Changes

  • Server: JWT validation against configurable OIDC issuer with JWKS caching, method classification (unauthenticated/sandbox-secret/dual-auth/bearer), provider-agnostic Identity type, RBAC with configurable admin/user roles, scope enforcement with exhaustive per-method scope map
  • CLI: Authorization Code + PKCE browser flow, Client Credentials flow for CI, token storage with auto-refresh, gateway login/logout commands, --oidc-scopes for fine-grained token requests, oauth2 crate for typed OAuth2 flows
  • Sandbox: SandboxSecretInterceptor for supervisor-to-gateway auth under OIDC, GetInferenceBundle classified as sandbox-secret for credential isolation
  • Deployment: Full plumbing through DeployOptions, Docker env, Helm values/templates, bootstrap scripts, gateway metadata preservation across restarts
  • Keycloak: Dev server script, pre-configured realm with test users/roles/clients/scopes
  • Docs: Architecture doc (oidc-auth.md), local testing guide (oidc-local-testing.md)
  • E2E: OIDC-specific e2e tests covering RBAC, scope enforcement, and client credentials

RBAC Role Mapping

| Operation | Required Role |
|---|---|
| Health probes, reflection | (no auth) |
| Supervisor RPCs (ReportPolicyStatus, PushSandboxLogs, etc.) | (sandbox secret) |
| Sandbox create, list, delete, exec, SSH | openshell-user |
| Provider list, get | openshell-user |
| Provider create, update, delete | openshell-admin |
| Config/policy mutations, draft approvals | openshell-admin |
| SetClusterInference | openshell-admin |

Scope Definitions (opt-in via --oidc-scopes-claim)

| Scope | Operations |
|---|---|
| sandbox:read | GetSandbox, ListSandboxes, WatchSandbox, GetSandboxLogs, GetSandboxPolicyStatus, ListSandboxPolicies |
| sandbox:write | CreateSandbox, DeleteSandbox, ExecSandbox, CreateSshSession, RevokeSshSession |
| provider:read | GetProvider, ListProviders |
| provider:write | CreateProvider, UpdateProvider, DeleteProvider |
| config:read | GetGatewayConfig, GetSandboxConfig, GetDraftPolicy, GetDraftHistory |
| config:write | UpdateConfig, ApproveDraftChunk, ApproveAllDraftChunks, RejectDraftChunk, EditDraftChunk, UndoDraftChunk, ClearDraftChunks |
| inference:read | GetClusterInference |
| inference:write | SetClusterInference |
| openshell:all | All of the above (wildcard) |

Scopes layer on top of roles — a caller needs both the required role AND the required scope. Scopes cannot escalate privilege.
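The "role AND scope" rule can be illustrated with a small Python sketch. This is not the gateway's Rust code: the `Identity` dataclass, `is_authorized`, and the abridged maps are hypothetical, and the sketch assumes no role hierarchy (a caller must hold the exact required role) and that methods missing from the role map are denied.

```python
from dataclasses import dataclass

# Abridged copies of the two tables above; the real maps are exhaustive.
METHOD_ROLES = {
    "ListSandboxes": "openshell-user",
    "CreateSandbox": "openshell-user",
    "ListProviders": "openshell-user",
    "CreateProvider": "openshell-admin",
    "SetClusterInference": "openshell-admin",
}
METHOD_SCOPES = {
    "ListSandboxes": "sandbox:read",
    "CreateSandbox": "sandbox:write",
    "ListProviders": "provider:read",
    "CreateProvider": "provider:write",
    "SetClusterInference": "inference:write",
}
WILDCARD = "openshell:all"


@dataclass(frozen=True)
class Identity:
    """Hypothetical stand-in for the caller's validated roles and scopes."""
    roles: frozenset
    scopes: frozenset


def is_authorized(identity, method, scopes_enabled=True):
    # Role check comes first; a scope can never make up for a missing role.
    required_role = METHOD_ROLES.get(method)
    if required_role is None or required_role not in identity.roles:
        return False
    if not scopes_enabled:
        return True
    # Methods absent from the scope map fall back to the wildcard scope.
    required_scope = METHOD_SCOPES.get(method, WILDCARD)
    return WILDCARD in identity.scopes or required_scope in identity.scopes
```

With this shape, a token carrying `openshell:all` but only the `openshell-user` role is still rejected for `CreateProvider`, which is the "scopes cannot escalate privilege" property.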

Keycloak Test Users

| Username | Password | Roles |
|---|---|---|
| admin@test | admin | openshell-admin, openshell-user |
| user@test | user | openshell-user |

OIDC Clients

| Client ID | Type | Grant | Secret |
|---|---|---|---|
| openshell-cli | Public | Auth Code + PKCE | N/A |
| openshell-ci | Confidential | Client Credentials | ci-test-secret |

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (38 auth tests: JWT validation, role check, scope enforcement, interceptor injection, metadata preservation)
  • E2E tests added/updated (10 OIDC e2e tests against live K3s + Keycloak)
  • Manual testing: standalone server, K3s cluster, browser login, CI login, scope enforcement, RBAC, token refresh, logout

Manual Testing Steps

# 1. Start Keycloak
mise run keycloak

# 2. Start K3s cluster with OIDC + scope enforcement
HOST_IP=$(hostname -I | awk '{print $1}')
OPENSHELL_OIDC_ISSUER="http://${HOST_IP}:8180/realms/openshell" \
OPENSHELL_OIDC_SCOPES_CLAIM="scope" \
OPENSHELL_OIDC_SCOPES="openshell:all" \
mise run cluster

# 3. Verify OIDC is active
CONTAINER=$(docker ps --format '{{.Names}}' | grep openshell-cluster)
docker exec $CONTAINER kubectl -n openshell logs openshell-0 | grep OIDC

# 4. Login and test
openshell gateway login  # admin@test / admin
openshell sandbox list   # should work
openshell sandbox create # should work

# 5. Test scope enforcement (narrow scopes)
jq '.oidc_scopes = "sandbox:read sandbox:write"' \
  ~/.config/openshell/gateways/openshell/metadata.json > /tmp/meta.json \
  && mv /tmp/meta.json ~/.config/openshell/gateways/openshell/metadata.json
openshell gateway login
openshell sandbox list    # should work
openshell provider list   # should fail: scope 'provider:read' required

# 6. Test RBAC (login as user)
# Sign out at http://localhost:8180/realms/openshell/account/#/ first
openshell gateway login  # user@test / user
openshell sandbox list   # should work
openshell provider create --name test --type claude --credential API_KEY=test
# should fail: role 'openshell-admin' required

# 7. Test CI client credentials
jq '.oidc_client_id = "openshell-ci"' \
  ~/.config/openshell/gateways/openshell/metadata.json > /tmp/meta.json \
  && mv /tmp/meta.json ~/.config/openshell/gateways/openshell/metadata.json
OPENSHELL_OIDC_CLIENT_SECRET=ci-test-secret openshell gateway login
openshell sandbox list   # should work

# 8. Test logout
openshell gateway logout
openshell sandbox list   # should fail: unauthenticated

# 9. Run OIDC e2e tests
OPENSHELL_E2E_OIDC=1 OPENSHELL_E2E_OIDC_SCOPES=1 \
PYTHONPATH=python uv run pytest e2e/python/oidc/ -v

# 10. Cleanup
mise run cluster:stop
mise run keycloak:stop

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@mrunalp mrunalp requested a review from a team as a code owner April 23, 2026 04:07
copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread architecture/oidc-auth.md Outdated
mrunalp added 9 commits April 29, 2026 14:54
Add OAuth2/OIDC authentication to the gateway server with role-based
access control, CLI login flows, and full deployment plumbing.

Server: JWT validation against configurable OIDC issuer (oidc.rs),
JWKS key caching with TTL and rotation handling, method classification
(unauthenticated/sandbox-secret/dual-auth/bearer), identity extraction
with provider-agnostic Identity type, and RBAC enforcement via
AuthzPolicy with configurable admin/user roles and auth-only mode.

CLI: browser-based Authorization Code + PKCE flow, Client Credentials
flow for CI/automation, token storage with refresh, gateway add/login/
logout commands, OIDC bearer token injection over mTLS transport,
discovery endpoint for auto-configuration.

Security: sandbox-secret scope restriction on UpdateConfig (policy
sync only), anti-spoofing header stripping, dual-auth fallthrough
from sandbox-secret to Bearer token.

Deployment: OIDC config wired through DeployOptions, Docker env vars,
Helm values/templates, HelmChart manifest, cluster-entrypoint.sh, and
bootstrap scripts. Keycloak dev server script with pre-configured
realm (test users, roles, PKCE client, CI client).

Tested with Keycloak. The roles claim path and role names are
configurable to support other OIDC providers.
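As an illustration of "JWKS key caching with TTL and rotation handling", here is a minimal Python sketch. It is not the `oidc.rs` implementation: the class shape, the `fetch` callable, and the 300-second default TTL are assumptions, and a production cache would likely also rate-limit refetches triggered by unknown key IDs.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Cache signing keys by `kid`; refetch on TTL expiry or unknown `kid`."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, dict]],  # hypothetical: returns {kid: jwk}
        ttl_seconds: float = 300.0,            # assumed default, not from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = dict(self._fetch())
        self._fetched_at = self._clock()

    def get(self, kid: str) -> Optional[dict]:
        expired = (
            self._fetched_at is None
            or self._clock() - self._fetched_at >= self._ttl
        )
        # An unknown kid may mean the issuer rotated its keys, so refetch once.
        if expired or kid not in self._keys:
            self._refresh()
        return self._keys.get(kid)
```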

Add opt-in scope enforcement on top of existing OIDC role-based access
control. When --oidc-scopes-claim is set, the server extracts scopes
from the JWT and checks them per-method against an exhaustive scope map.

Scopes: sandbox:read, sandbox:write, provider:read, provider:write,
config:read, config:write, inference:read, inference:write, and
openshell:all (wildcard). Methods not in the scope map require
openshell:all. Scopes layer on top of roles and cannot escalate
privilege. Auth-only mode (empty role names) still enforces scopes
when enabled.

Server: scopes_claim in OidcConfig, scope extraction from JWT
(space-delimited and JSON array formats), standard OIDC scope
filtering, scope check in AuthzPolicy after role check.

CLI: --oidc-scopes on gateway add/start stored in metadata and
consumed by gateway login, --oidc-scopes-claim on gateway start
forwarded to server, scopes parameter in browser and client
credentials OAuth2 flows with openid deduplication.

Deployment: oidc_scopes_claim wired through DeployOptions, docker.rs,
Helm, bootstrap scripts, and cluster entrypoint.

Keycloak: realm config updated with built-in OIDC scopes and 9
OpenShell client scopes as optional on openshell-cli and openshell:all
as default on openshell-ci.
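The scope extraction described in this commit (space-delimited or JSON array claim, with standard OIDC scopes filtered out) might look roughly like the following Python sketch. The function name and the exact set of filtered scopes are assumptions; the set below lists the scope values defined by OpenID Connect Core.

```python
# Scope values defined by OpenID Connect Core; which of these the server
# actually filters is an assumption of this sketch.
STANDARD_OIDC_SCOPES = frozenset(
    {"openid", "profile", "email", "address", "phone", "offline_access"}
)


def extract_scopes(claims: dict, scopes_claim: str = "scope") -> set:
    """Return application scopes from a decoded JWT payload."""
    raw = claims.get(scopes_claim)
    if isinstance(raw, str):
        # Space-delimited form, e.g. "openid sandbox:read".
        items = raw.split()
    elif isinstance(raw, list):
        # JSON array form, e.g. ["openid", "sandbox:read"].
        items = [s for s in raw if isinstance(s, str)]
    else:
        # Missing or malformed claim yields no scopes.
        items = []
    return {s for s in items if s not in STANDARD_OIDC_SCOPES}
```

Returning an empty set for a missing claim means a token without scopes is denied once scope enforcement is on, which matches the "no-scopes token denied" e2e case.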

Add GetInferenceBundle to sandbox-secret methods so sandbox inference
route refresh works under OIDC. Make GetSandboxConfig dual-auth so CLI
users can read sandbox settings with Bearer tokens.

Preserve OIDC gateway metadata on restart — a bare gateway start
without --oidc-* flags no longer erases the stored OIDC registration.

Document CI client ID requirement (openshell-ci vs openshell-cli) in
the testing guide. Add security note about auth-only mode blast radius
for GitHub Actions.

Move OpenShell/GetSandboxConfig from sandbox-secret-only to dual-auth
so CLI users can read sandbox settings with Bearer tokens while sandbox
supervisors continue using the shared secret.

Add sandbox secret interceptor to the inference bundle fetch path so
GetInferenceBundle works under OIDC-enabled gateways. Extract shared
interceptor constructor to avoid duplication.

Add GetSandboxConfig to the config:read scope map so scope enforcement
applies consistently when scopes are enabled.

Refactor OIDC metadata preservation into apply_oidc_gateway_metadata()
with explicit resume semantics — only preserve existing OIDC metadata
on real resume paths, not on fresh deployments.

Update architecture docs and testing guide to reflect the corrected
method classifications and add new test coverage for interceptor
injection, scope requirements, metadata preservation, and dual-auth
classification.
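The four method classes and the dual-auth fallthrough can be sketched in Python as below. Only the method names mentioned in this PR are listed; the enum, the function names, and the default of treating every other method as bearer-only are illustrative assumptions rather than the server's code.

```python
from enum import Enum


class AuthClass(Enum):
    UNAUTHENTICATED = "unauthenticated"
    SANDBOX_SECRET = "sandbox-secret"
    DUAL_AUTH = "dual-auth"
    BEARER = "bearer"


# Method names taken from the PR text; the lists are intentionally partial.
_UNAUTHENTICATED = {"Health"}
_SANDBOX_SECRET = {"ReportPolicyStatus", "PushSandboxLogs", "GetInferenceBundle"}
_DUAL_AUTH = {"GetSandboxConfig"}


def classify(method: str) -> AuthClass:
    if method in _UNAUTHENTICATED:
        return AuthClass.UNAUTHENTICATED
    if method in _SANDBOX_SECRET:
        return AuthClass.SANDBOX_SECRET
    if method in _DUAL_AUTH:
        return AuthClass.DUAL_AUTH
    # Assumption: anything not listed requires a Bearer token.
    return AuthClass.BEARER


def credentials_accepted(method: str, has_sandbox_secret: bool, has_bearer: bool) -> bool:
    """Check only which credential type is acceptable, not roles or scopes."""
    auth_class = classify(method)
    if auth_class is AuthClass.UNAUTHENTICATED:
        return True
    if auth_class is AuthClass.SANDBOX_SECRET:
        return has_sandbox_secret
    if auth_class is AuthClass.DUAL_AUTH:
        # Sandbox secret first, then fall through to the Bearer token.
        return has_sandbox_secret or has_bearer
    return has_bearer
```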

Replace hand-written PKCE generation, authorization URL construction,
token exchange, client credentials, and token refresh with the oauth2
crate's typed API.

Eliminates sha2, hex, and getrandom dependencies from the CLI. The
custom urlencoded() helper and manual form POST logic are replaced by
BasicClient methods with proper type-state safety.

Discovery and the callback server remain custom since the oauth2 crate
does not provide OIDC discovery or a localhost redirect listener.
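For readers unfamiliar with what the removed hand-written PKCE code had to compute, here is a short Python sketch of the S256 method from RFC 7636. It is not the CLI's code (which now delegates to the oauth2 crate); the function names are hypothetical, and the use of S256 is an assumption inferred from the removed sha2 dependency.

```python
import base64
import hashlib
import secrets


def generate_code_verifier(num_bytes: int = 32) -> str:
    """Random verifier: 32 bytes become 43 unpadded base64url characters."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), no padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token exchange, so an intercepted authorization code alone cannot be redeemed.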

Group oidc.rs, authz.rs, identity.rs, and the auth HTTP endpoints
under src/auth/ module directory. No behavioral changes.

  auth/mod.rs      — module root, re-exports HTTP router
  auth/oidc.rs     — JWT validation, JWKS caching, method classification
  auth/authz.rs    — role and scope authorization policy
  auth/identity.rs — provider-agnostic Identity type
  auth/http.rs     — /auth/connect and /auth/oidc-config endpoints

The oauth2 crate defaults to BasicAuth (HTTP Basic header) but Keycloak
and most OIDC providers expect client_secret_post (credentials in the
request body). Set AuthType::RequestBody explicitly to match the
pre-refactor behavior.

Also re-export Identity, IdentityProvider, and JwksCache from the auth
module so ServerState's public API remains nameable by external consumers.
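The difference between the two client authentication styles is only where the credentials travel. The Python sketch below builds a client credentials token request both ways; the function name and its return shape are hypothetical, and it does not reproduce the oauth2 crate's behavior beyond the placement of the credentials.

```python
import base64
from urllib.parse import quote, urlencode


def client_credentials_request(client_id, client_secret, scopes, auth_in_body=True):
    """Return (headers, form_body) for a client_credentials token request."""
    form = {"grant_type": "client_credentials"}
    if scopes:
        form["scope"] = " ".join(scopes)
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if auth_in_body:
        # client_secret_post: credentials are ordinary form fields.
        form["client_id"] = client_id
        form["client_secret"] = client_secret
    else:
        # client_secret_basic: form-encode id and secret, then HTTP Basic.
        pair = f"{quote(client_id, safe='')}:{quote(client_secret, safe='')}"
        token = base64.b64encode(pair.encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    return headers, urlencode(form)
```

For the test realm's `openshell-ci` client, the body-based variant yields a form containing `client_id=openshell-ci` and `client_secret=ci-test-secret` with no `Authorization` header.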

Pass --oidc-scopes to gateway start so the metadata includes requested
scopes after cluster bootstrap. Without this, users had to manually
edit metadata.json to set scopes for gateway login.

Usage: OPENSHELL_OIDC_SCOPES="openshell:all" mise run cluster

Add 10 end-to-end tests covering OIDC authentication against a live
K3s cluster with Keycloak:

RBAC (5 tests): admin can create providers, user cannot, user can list
sandboxes, unauthenticated requests rejected, health probe works
without auth.

Scopes (4 tests): sandbox-scoped token can list sandboxes but not
providers, openshell:all grants full access, no-scopes token denied.

Client credentials (1 test): CI token via client_credentials grant.

Tests are opt-in via OPENSHELL_E2E_OIDC=1 and OPENSHELL_E2E_OIDC_SCOPES=1
env vars. They derive the Keycloak URL from gateway metadata to match
the server's configured issuer.

Run with:

  OPENSHELL_E2E_OIDC=1 OPENSHELL_E2E_OIDC_SCOPES=1 \
  PYTHONPATH=python uv run pytest e2e/python/oidc/ -v
@mrunalp mrunalp force-pushed the feat/oidc-keycloak branch 3 times, most recently from 011c25b to ec4a1e6 Compare April 29, 2026 22:04
@TaylorMutch (Collaborator) previously approved these changes Apr 29, 2026

This is a great starting point. Looking forward to iterating on it - thanks @mrunalp!

Comment thread deploy/helm/openshell/templates/statefulset.yaml
@mrunalp mrunalp force-pushed the feat/oidc-keycloak branch 2 times, most recently from 80c9001 to 44a1051 Compare April 29, 2026 23:56
Add blank lines before lists and fenced code blocks to satisfy
markdownlint MD031 and MD032 rules.
@mrunalp mrunalp force-pushed the feat/oidc-keycloak branch from 44a1051 to a1196d6 Compare April 30, 2026 00:34
@mrunalp (Collaborator, Author) commented Apr 30, 2026

@TaylorMutch I rebased the branch and the tests are green.

@TaylorMutch (Collaborator)

/ok to test a1196d6

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label Apr 30, 2026
@github-actions

Label test:e2e applied for a1196d6. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@TaylorMutch (Collaborator)

Thank you @mrunalp !

@TaylorMutch TaylorMutch merged commit 0845054 into NVIDIA:main Apr 30, 2026
36 of 38 checks passed
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
    * Use `_make_cluster_bearer_provider` — a per-RPC closure that
      reads `oidc_token.json` on every invocation, returning the
      current `access_token` if fresh and raising `SandboxError`
      with a "re-authenticate with: openshell gateway login" hint
      if the token is missing, malformed, or expired (the 30 s
      grace window matches
      `openshell-bootstrap::oidc_token::is_token_expired`). A
      long-lived `SandboxClient` now picks up token rotations done
      by the CLI without being reconstructed. OAuth2 refresh itself
      stays in the CLI; the SDK only consumes what's on disk.
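The `TlsConfig` rule described above (all fields optional, `cert_path` and `key_path` required together) could be expressed as in this sketch. The field names come from the commit message; the `ValueError`, the frozen dataclass, and the `profile` property with its labels are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TlsConfig:
    ca_path: Optional[str] = None
    cert_path: Optional[str] = None
    key_path: Optional[str] = None

    def __post_init__(self) -> None:
        # A client certificate is useless without its key, and vice versa.
        if (self.cert_path is None) != (self.key_path is None):
            raise ValueError("cert_path and key_path must be provided together")

    @property
    def profile(self) -> str:
        """Hypothetical helper naming the transport profile in use."""
        if self.cert_path is not None:
            # Client cert present; the "full mTLS" case also sets ca_path.
            return "mtls"
        if self.ca_path is not None:
            return "ca-only"
        return "system-roots"
```

Under this sketch, `TlsConfig()` selects system roots, `TlsConfig(ca_path="ca.crt")` selects CA-only trust, and supplying all three paths selects mTLS.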

Tested:

- 23 SDK unit tests pass (5 existing + 18 new across the bearer
  interceptor, token provider, `TlsConfig` validation, and the
  `from_active_cluster` auth ladder). `mise run test:python` →
  31 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift
  (deploy recipe at `architecture/plans/deploy-openshift.md`):
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against four pre-PR failure modes:
    * `https://` OIDC gateway without `mtls/` no longer falls back
      to `insecure_channel`
    * CA-only `mtls/ca.crt` layout no longer raises
      `FileNotFoundError`
    * plaintext gateway with stale `oidc_token.json` no longer gets
      a bearer attached
    * long-lived client picks up rotated tokens; expired tokens
      surface as `SandboxError`, not silent gateway 401s

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
mrunalp added a commit to mrunalp/OpenShell that referenced this pull request May 28, 2026
PR NVIDIA#1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR NVIDIA#935 / PR NVIDIA#1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the IdP's error body included for
  diagnostics.
- `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=False)`
  factory. Default is the refresher path; `auto_refresh=False` falls
  back to the read-only fail-closed behavior for callers that don't
  want the SDK to make outbound HTTP calls to the IdP.
  `write_back=True` (opt-in) atomically persists the rotated bundle
  with 0600 mode so other processes — including the Rust CLI — see
  the rotation. Off by default; treats the Rust CLI as the canonical
  writer.
- `from_active_cluster` exposes `auto_refresh` / `write_back`
  kwargs (defaults: True / False).
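The "concurrent RPCs share a single refresh" behavior is a double-checked lock around the refresh call. The sketch below shows that pattern only: `SingleFlightToken` and its `refresh` callable are hypothetical stand-ins for `_OidcRefresher`, the 30-second skew is borrowed from the grace window mentioned for the CLI, and the disk re-read and IdP exchange are collapsed into the one callable.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class SingleFlightToken:
    """Per-RPC token provider that refreshes at most once per expiry."""

    def __init__(
        self,
        refresh: Callable[[], Tuple[str, float]],  # returns (token, expires_at)
        skew_seconds: float = 30.0,                # assumed grace window
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._refresh = refresh
        self._skew = skew_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def _is_fresh(self) -> bool:
        return self._token is not None and self._clock() < self._expires_at - self._skew

    def __call__(self) -> str:
        # Fast path: no locking while the cached token is still fresh.
        if self._is_fresh():
            return self._token
        with self._lock:
            # Re-check under the lock: another thread may have refreshed already.
            if not self._is_fresh():
                self._token, self._expires_at = self._refresh()
            return self._token
```

Because the instance is a zero-argument callable, it fits the `bearer_token: str | Callable[[], str] | None` shape described earlier, for example `SingleFlightToken(my_refresh)` where `my_refresh` is a caller-supplied function.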

OAuth2 refresh policy and write-back semantics deliberately
mirror what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                       | Native refresh | Writes back |
|-------------------------------|----------------|-------------|
| google-auth Credentials       | yes            | no          |
| botocore SSOTokenProvider     | yes            | yes         |
| openshell SandboxClient (here)| yes (opt-out)  | opt-in      |

Refresh in the SDK is the production answer because:

- Long-running Python orchestrators (agent runs, data pipelines)
  outlast a Keycloak 1-hour access token. Without in-SDK refresh,
  they crash at expiry.
- Headless containers (sandbox-controller pods, GitHub Actions
  runners) may not have the Rust CLI installed but always have
  Python and a refresh_token.
- Subprocess-to-CLI per RPC would spawn `openshell` on every gRPC
  call, including hot streaming paths. Unacceptable.

The Rust CLI keeps owning interactive flows (browser/device-code,
keyring storage, the initial login). The SDK owns refresh during
script execution.

Tested:

- 32 SDK unit tests pass (5 existing + 27 new across the bearer
  interceptor, fail-closed provider, refresher behavior, `TlsConfig`
  validation, `from_active_cluster` auth ladder, and the refresher's
  concurrency / rotation / write-back / error paths).
  `mise run test:python` → 40 passed total.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR NVIDIA#1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against the four pre-review failure modes:
    * `https://` OIDC gateway without `mtls/` no longer falls back
      to `insecure_channel`
    * CA-only `mtls/ca.crt` layout no longer raises
      `FileNotFoundError`
    * plaintext gateway with stale `oidc_token.json` no longer gets
      a bearer attached
    * long-lived client picks up rotated tokens; expired tokens
      surface as `SandboxError`, not silent gateway 401s
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode, concurrent N-thread coordination (one refresh shared
  across 8 threads), IdP failure surfaced with error body, and the
  client_credentials / no-refresh_token error path.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Labels

test:e2e Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

OIDC/Keycloak authentication with RBAC and scope-based permissions

2 participants