Skip to content

fix(mcp): auto-recover remote MCP OAuth on server-side invalid_token#3207

Open
aheritier wants to merge 4 commits into
mainfrom
fix/mcp-oauth-invalid-token-recovery-signed
Open

fix(mcp): auto-recover remote MCP OAuth on server-side invalid_token#3207
aheritier wants to merge 4 commits into
mainfrom
fix/mcp-oauth-invalid-token-recovery-signed

Conversation

@aheritier

@aheritier aheritier commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes the bug where a remote MCP server rejecting a stored OAuth access token with a spec-compliant 401 invalid_token was not auto-recovered: the lifecycle supervisor treated the auth failure as a generic transient error, burned all 5 reconnect attempts, and the toolset instead permanently in StateFailed. The only workaround was restarting the process. Observed with the Notion remote MCP server (https://mcp.notion.com/mcp) but the root cause is transport-agnostic.

Fixes #3198

Root causes addressed

  1. Error misclassification (pkg/tools/lifecycle/classify.go): a 401/invalid_token was not mapped to the existing ErrAuthRequired sentinel, so IsPermanent returned false and the supervisor retried 5× pointlessly before failing.
  2. Silent refresh only on local expiry (pkg/tools/mcp/oauth.go): refresh fired only when the token was locally expired; a server-side rejection of a token we still believed valid was never refreshed or evicted.
  3. Re-auth handler never ran on the reconnect path (pkg/tools/mcp/oauth.go + pkg/tools/lifecycle/supervisor.go): background reconnects had no ready elicitation bridge, so the toolset silently died with no re-auth affordance.

What changed

Commit 1 — fix(mcp): stop retry storm on permanent OAuth reconnect error

  • classifyByMessage maps invalid_token / invalid token (RFC 6750 §3.1) to ErrAuthRequired. A bare unauthorized without invalid_token is intentionally NOT matched, to avoid false positives on app-level 401s.
  • supervisor.tryRestart now checks IsPermanent after connector.Connect fails and transitions straight to StateFailed instead of exhausting the retry budget (symmetric with the existing Wait()-path check).
  • Adds withBackgroundReconnect/IsBackgroundReconnect context helpers; clientConnector.Connect applies WithoutInteractivePrompts for background reconnects so a background 401 defers cleanly, and maps auth/deferred-OAuth errors to lifecycle.ErrAuthRequired.

Commit 2 — fix(mcp): evict + refresh on server-side OAuth invalid_token rejection

  • errorCodeFromWWWAuth extracts the RFC 6750 error= code from the WWW-Authenticate header.
  • refreshStoredToken is extracted from getValidToken and shared by the local-expiry and server-rejection paths; it honours the existing 30s refreshFailedAt backoff.
  • roundTrip tracks whether a token was attached; on a 401 with invalid_token signalled and a token attached, it runs handleServerRejectedToken (serialised under oauthFlowMu so concurrent initialize-stage RPCs coalesce): re-check, evict stale token, attempt silent refresh, otherwise fall back to interactive OAuth or defer. The lastOAuthDeclined sticky-decline latch is honoured so a dismissed dialog is not re-popped.

Commit 3 — fix(mcp): surface re-auth notice and ensure background reconnects are non-interactive (Fixes #3198)

  • StartableToolSet gains a recoveryStreak (distinct from startStreak) and ShouldReportRecoveryFailure(), which fires exactly once per recovery-failure streak and stays silent for initial-startup auth deferral.
  • emitToolsProgressively (and agent.ensureToolSetsAreStarted) emit a single deduped, user-visible notice — " needs re-authentication — it will prompt on your next message, or use /toolset-restart" — when a previously-working toolset fails with ErrAuthRequired.

Commit 4 — docs: document invalid_token auto-recovery behavior

  • Documents the silent-refresh path and the re-auth-on-next-message fallback in docs/features/remote-mcp, docs/community/troubleshooting, and docs/tools/mcp.

Behaviour after this change

  • A server-side invalid_token no longer triggers a 5-retry storm.
  • When a refresh token is available, the toolset silently refreshes and reconnects — fully automatic, no user interaction.
  • When refresh is not possible, the user gets a clear one-time re-auth notice and the OAuth prompt reappears on the next message (or via /toolset-restart).

Testing

  • task build, task lint (clean for all changed files), and go test -race ./pkg/tools/lifecycle/... ./pkg/tools/mcp/... ./pkg/tools/... ./pkg/runtime/... ./pkg/agent/... all pass.
  • New/updated tests:
    • TestSupervisor_PermanentConnectErrorDoesNotRetry — permanent Connect error fails fast (2 calls, not 5).
    • classify_test.goinvalid_tokenErrAuthRequired, including idempotency and a bare-unauthorized negative case.
    • TestRoundTrip_ServerInvalidTokenEvictsAndRefreshes, ..._RefreshFails_DefersWhenNonInteractive, ..._NoRefreshToken_NonInteractive, TestRoundTrip_FirstContact401Unchanged, TestRoundTrip_ConcurrentInvalidToken_RefreshesOnce, TestRoundTrip_Bare401WithAttachedTokenUnchanged, TestRoundTrip_ConcurrentInvalidToken_NoRefresh_StickyDecline.
    • TestEmitStartupInfo_RecoveryAuthNoticeEmittedOnce and the StartableToolSet.ShouldReportRecoveryFailure suite (once-per-streak, not-on-initial-startup, resets on success/stop).

Notes

  • Commits are SSH-signed.
  • No config/schema changes.

@aheritier aheritier requested a review from a team as a code owner June 22, 2026 20:16
@aheritier aheritier added area/mcp MCP protocol, MCP tool servers, integration kind/bug Bug report or bug fix labels Jun 22, 2026
Three layered changes to eliminate the 5-attempt reconnect storm when a
remote MCP server rejects the stored OAuth token:

1. pkg/tools/lifecycle/classify.go: classifyByMessage now maps
   "invalid_token" and "invalid token" substrings (RFC 6750 §3.1 error
   codes) to ErrAuthRequired. Bare "unauthorized" without "invalid_token"
   is intentionally NOT matched to avoid false-positives on app-level 401s
   (human decision Q3).

2. pkg/tools/lifecycle/supervisor.go: tryRestart now checks IsPermanent
   after connector.Connect returns an error. A permanent error (including
   ErrAuthRequired) transitions directly to StateFailed, firing OnFailed
   and signalDone without burning the MaxAttempts budget. Symmetric with
   the existing shouldRestart check on the Wait() path.

   Also introduces withBackgroundReconnect/IsBackgroundReconnect context
   helpers. tryRestart stamps the reconnect context so Connector
   implementations can distinguish background watcher reconnects from the
   initial interactive Start.

3. pkg/tools/mcp/mcp.go (clientConnector.Connect): applies
   WithoutInteractivePrompts for background reconnects (IsBackgroundReconnect)
   so a background 401 always defers rather than blocking on a dead
   elicitation bridge. Maps AuthorizationRequiredError, OAuthDeclinedError,
   and lifecycle.Classify-detected auth errors to lifecycle.ErrAuthRequired
   so the supervisor's new permanent-error guard fires correctly.

Tests:
- classify_test.go: new cases for invalid_token (with idempotency and
  bare-unauthorized negative assertion)
- supervisor_test.go: TestSupervisor_PermanentConnectErrorDoesNotRetry —
  verifies a permanent Connect error fails-fast in 2 calls (initial +
  1 reconnect) rather than exhausting MaxAttempts

Refs #3198
Adds silent recovery for the common "token was rotated/revoked server-side"
case, so the toolset transparently reconnects without any user interaction
when a refresh token is available.

Changes in pkg/tools/mcp/oauth.go:

1. errorCodeFromWWWAuth: new helper that extracts the RFC 6750 error=
   parameter from a WWW-Authenticate header (e.g. error="invalid_token").

2. refreshStoredToken: the refresh logic previously inlined in
   getValidToken is extracted into a reusable method. Both getValidToken
   (local-expiry path) and the new server-rejection path share one
   implementation. The method honours the 30-second refreshFailedAt
   backoff so the token endpoint is never hammered, and resets the
   backoff on success.

3. roundTrip: tracks whether a Bearer token was attached to the outgoing
   request (attachedToken). On 401 with WWW-Authenticate, if a token was
   attached AND the server signals invalid_token (RFC 6750 error= field
   or body message), the new handleServerRejectedToken path runs before
   falling through to the existing first-contact OAuth flow. Non-invalid-
   token 401s against an attached token also attempt one evict+refresh as
   best-effort but do not gate on it.

4. handleServerRejectedToken: serialised under oauthFlowMu (same mutex
   authorizeOnce uses) so concurrent initialize-stage RPCs coalesce.
   Re-checks the stored token first (another goroutine may have already
   refreshed it); evicts the stale token; attempts refreshStoredToken; on
   success calls oauthSuccess() and returns nil; on failure falls back to
   interactive OAuth if the context allows it, else returns
   AuthorizationRequiredError.

Detection heuristic (defense in depth, avoids misfiring on app-level 401s):
- Token was attached to the request (we believed it valid)
- Server responded with RFC 6750 error="invalid_token" in WWW-Authenticate
  OR "invalid_token"/"invalid token" in the response body
- Trigger: evict + refresh. If that also 401s → fall through to existing
  interactive/deferred paths.

Loop guards:
- isRetry flag prevents a second call to handleServerRejectedToken within
  one roundTrip execution.
- refreshFailedAt 30-second backoff prevents token-endpoint hammering.
- lastOAuthDeclined latch prevents re-popping a dismissed dialog.

Tests (5 new cases in oauth_test.go):
- TestRoundTrip_ServerInvalidTokenEvictsAndRefreshes: end-to-end happy
  path, asserts 200, exactly one /token call, fresh token stored,
  oauthSuccess fired.
- TestRoundTrip_ServerInvalidToken_RefreshFails_DefersWhenNonInteractive:
  /token returns 400, non-interactive ctx → IsAuthorizationRequired and
  token evicted; refreshFailedAt prevents a second /token hit.
- TestRoundTrip_ServerInvalidToken_NoRefreshToken_NonInteractive: no
  refresh token → /token never called, returns IsAuthorizationRequired.
- TestRoundTrip_FirstContact401Unchanged: no stored token → existing
  first-contact OAuth flow unchanged (regression guard).
- TestRoundTrip_ConcurrentInvalidToken_RefreshesOnce: N concurrent
  round-trips → oauthFlowMu coalescing ensures exactly one /token call.

Refs #3198
… non-interactive

Two changes that complete the fix and close the loop on user-visible recovery:

1. pkg/tools/startable.go: ShouldReportRecoveryFailure method.

   StartableToolSet gains a recoveryStreak (separate from startStreak) that
   is incremented only when Start() enters the recovery path (the toolset was
   previously started, inner.IsStarted()==false) AND Restart() fails. This
   distinguishes:

   - Initial-startup auth deferral (silent; dialog appears naturally on the
     first user message) — recoveryStreak untouched.
   - Recovery failure (toolset was working, background watcher hit
     invalid_token, supervisor reached StateFailed) — recoveryStreak.fail().

   ShouldReportRecoveryFailure() returns true exactly once per recovery-
   failure streak (same once-per-streak dedup as ShouldReportFailure) and
   false for initial-startup failures, preventing the re-auth notice from
   appearing on first-time OAuth setup.

   Stop() also resets recoveryStreak so a future recovery cycle is reported
   fresh. recoveryStreak is also reset on successful Start().

2. pkg/runtime/runtime.go: emit targeted re-auth notice in emitToolsProgressively.

   The IsAuthorizationRequired branch is split:
   - ShouldReportRecoveryFailure() true → emit "<server> needs re-authentication
     — it will prompt on your next message, or use /toolset-restart" via
     a.AddToolWarning (same warning channel as other toolset failures), with
     slog.Warn.
   - ShouldReportRecoveryFailure() false (initial startup deferral) → silent
     continue as before.

   This gives users a concrete, deduped notice when a previously-working MCP
   server has its token revoked, so they know the OAuth prompt they are about
   to see on the next message is intentional and can be acted on immediately
   with /toolset-restart if they prefer not to wait.

Tests (4 new cases in startable_test.go):
- TestStartableToolSet_ShouldReportRecoveryFailure_OncePerStreak
- TestStartableToolSet_ShouldReportRecoveryFailure_NotFiredForInitialStartup
- TestStartableToolSet_ShouldReportRecoveryFailure_ResetsOnSuccess
- TestStartableToolSet_ShouldReportRecoveryFailure_ResetsOnStop

Fixes #3198
- docs/features/remote-mcp/index.md: new callout "Automatic recovery from
  revoked or rotated OAuth tokens" explaining the silent-refresh path and the
  re-auth-on-next-message fallback, plus the /toolset-restart shortcut.

- docs/community/troubleshooting/index.md: add a paragraph in the tool-
  lifecycle section noting that 401 invalid_token is now self-healing (silent
  refresh or interactive re-auth prompt), with /toolset-restart as the
  immediate option.

- docs/tools/mcp/index.md: one-line addition noting that tokens are refreshed
  silently when revoked server-side, and the OAuth prompt reappears when silent
  refresh is not possible.

Refs #3198
@aheritier aheritier force-pushed the fix/mcp-oauth-invalid-token-recovery-signed branch from 0d1deda to f6f06d8 Compare June 22, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP protocol, MCP tool servers, integration kind/bug Bug report or bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote MCP OAuth: invalid_token (server-side 401) is not auto-recovered — toolset dies after retry storm

1 participant