fix(mcp): auto-recover remote MCP OAuth on server-side invalid_token#3207
Open
aheritier wants to merge 4 commits into
Open
fix(mcp): auto-recover remote MCP OAuth on server-side invalid_token#3207aheritier wants to merge 4 commits into
aheritier wants to merge 4 commits into
Conversation
Three layered changes to eliminate the 5-attempt reconnect storm when a remote MCP server rejects the stored OAuth token: 1. pkg/tools/lifecycle/classify.go: classifyByMessage now maps "invalid_token" and "invalid token" substrings (RFC 6750 §3.1 error codes) to ErrAuthRequired. Bare "unauthorized" without "invalid_token" is intentionally NOT matched to avoid false-positives on app-level 401s (human decision Q3). 2. pkg/tools/lifecycle/supervisor.go: tryRestart now checks IsPermanent after connector.Connect returns an error. A permanent error (including ErrAuthRequired) transitions directly to StateFailed, firing OnFailed and signalDone without burning the MaxAttempts budget. Symmetric with the existing shouldRestart check on the Wait() path. Also introduces withBackgroundReconnect/IsBackgroundReconnect context helpers. tryRestart stamps the reconnect context so Connector implementations can distinguish background watcher reconnects from the initial interactive Start. 3. pkg/tools/mcp/mcp.go (clientConnector.Connect): applies WithoutInteractivePrompts for background reconnects (IsBackgroundReconnect) so a background 401 always defers rather than blocking on a dead elicitation bridge. Maps AuthorizationRequiredError, OAuthDeclinedError, and lifecycle.Classify-detected auth errors to lifecycle.ErrAuthRequired so the supervisor's new permanent-error guard fires correctly. Tests: - classify_test.go: new cases for invalid_token (with idempotency and bare-unauthorized negative assertion) - supervisor_test.go: TestSupervisor_PermanentConnectErrorDoesNotRetry — verifies a permanent Connect error fails-fast in 2 calls (initial + 1 reconnect) rather than exhausting MaxAttempts Refs #3198
Adds silent recovery for the common "token was rotated/revoked server-side" case, so the toolset transparently reconnects without any user interaction when a refresh token is available. Changes in pkg/tools/mcp/oauth.go: 1. errorCodeFromWWWAuth: new helper that extracts the RFC 6750 error= parameter from a WWW-Authenticate header (e.g. error="invalid_token"). 2. refreshStoredToken: the refresh logic previously inlined in getValidToken is extracted into a reusable method. Both getValidToken (local-expiry path) and the new server-rejection path share one implementation. The method honours the 30-second refreshFailedAt backoff so the token endpoint is never hammered, and resets the backoff on success. 3. roundTrip: tracks whether a Bearer token was attached to the outgoing request (attachedToken). On 401 with WWW-Authenticate, if a token was attached AND the server signals invalid_token (RFC 6750 error= field or body message), the new handleServerRejectedToken path runs before falling through to the existing first-contact OAuth flow. Non-invalid- token 401s against an attached token also attempt one evict+refresh as best-effort but do not gate on it. 4. handleServerRejectedToken: serialised under oauthFlowMu (same mutex authorizeOnce uses) so concurrent initialize-stage RPCs coalesce. Re-checks the stored token first (another goroutine may have already refreshed it); evicts the stale token; attempts refreshStoredToken; on success calls oauthSuccess() and returns nil; on failure falls back to interactive OAuth if the context allows it, else returns AuthorizationRequiredError. Detection heuristic (defense in depth, avoids misfiring on app-level 401s): - Token was attached to the request (we believed it valid) - Server responded with RFC 6750 error="invalid_token" in WWW-Authenticate OR "invalid_token"/"invalid token" in the response body - Trigger: evict + refresh. If that also 401s → fall through to existing interactive/deferred paths. Loop guards: - isRetry flag prevents a second call to handleServerRejectedToken within one roundTrip execution. - refreshFailedAt 30-second backoff prevents token-endpoint hammering. - lastOAuthDeclined latch prevents re-popping a dismissed dialog. Tests (5 new cases in oauth_test.go): - TestRoundTrip_ServerInvalidTokenEvictsAndRefreshes: end-to-end happy path, asserts 200, exactly one /token call, fresh token stored, oauthSuccess fired. - TestRoundTrip_ServerInvalidToken_RefreshFails_DefersWhenNonInteractive: /token returns 400, non-interactive ctx → IsAuthorizationRequired and token evicted; refreshFailedAt prevents a second /token hit. - TestRoundTrip_ServerInvalidToken_NoRefreshToken_NonInteractive: no refresh token → /token never called, returns IsAuthorizationRequired. - TestRoundTrip_FirstContact401Unchanged: no stored token → existing first-contact OAuth flow unchanged (regression guard). - TestRoundTrip_ConcurrentInvalidToken_RefreshesOnce: N concurrent round-trips → oauthFlowMu coalescing ensures exactly one /token call. Refs #3198
… non-interactive
Two changes that complete the fix and close the loop on user-visible recovery:
1. pkg/tools/startable.go: ShouldReportRecoveryFailure method.
StartableToolSet gains a recoveryStreak (separate from startStreak) that
is incremented only when Start() enters the recovery path (the toolset was
previously started, inner.IsStarted()==false) AND Restart() fails. This
distinguishes:
- Initial-startup auth deferral (silent; dialog appears naturally on the
first user message) — recoveryStreak untouched.
- Recovery failure (toolset was working, background watcher hit
invalid_token, supervisor reached StateFailed) — recoveryStreak.fail().
ShouldReportRecoveryFailure() returns true exactly once per recovery-
failure streak (same once-per-streak dedup as ShouldReportFailure) and
false for initial-startup failures, preventing the re-auth notice from
appearing on first-time OAuth setup.
Stop() also resets recoveryStreak so a future recovery cycle is reported
fresh. recoveryStreak is also reset on successful Start().
2. pkg/runtime/runtime.go: emit targeted re-auth notice in emitToolsProgressively.
The IsAuthorizationRequired branch is split:
- ShouldReportRecoveryFailure() true → emit "<server> needs re-authentication
— it will prompt on your next message, or use /toolset-restart" via
a.AddToolWarning (same warning channel as other toolset failures), with
slog.Warn.
- ShouldReportRecoveryFailure() false (initial startup deferral) → silent
continue as before.
This gives users a concrete, deduped notice when a previously-working MCP
server has its token revoked, so they know the OAuth prompt they are about
to see on the next message is intentional and can be acted on immediately
with /toolset-restart if they prefer not to wait.
Tests (4 new cases in startable_test.go):
- TestStartableToolSet_ShouldReportRecoveryFailure_OncePerStreak
- TestStartableToolSet_ShouldReportRecoveryFailure_NotFiredForInitialStartup
- TestStartableToolSet_ShouldReportRecoveryFailure_ResetsOnSuccess
- TestStartableToolSet_ShouldReportRecoveryFailure_ResetsOnStop
Fixes #3198
- docs/features/remote-mcp/index.md: new callout "Automatic recovery from revoked or rotated OAuth tokens" explaining the silent-refresh path and the re-auth-on-next-message fallback, plus the /toolset-restart shortcut. - docs/community/troubleshooting/index.md: add a paragraph in the tool- lifecycle section noting that 401 invalid_token is now self-healing (silent refresh or interactive re-auth prompt), with /toolset-restart as the immediate option. - docs/tools/mcp/index.md: one-line addition noting that tokens are refreshed silently when revoked server-side, and the OAuth prompt reappears when silent refresh is not possible. Refs #3198
0d1deda to
f6f06d8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the bug where a remote MCP server rejecting a stored OAuth access token with a spec-compliant
401 invalid_tokenwas not auto-recovered: the lifecycle supervisor treated the auth failure as a generic transient error, burned all 5 reconnect attempts, and the toolset instead permanently inStateFailed. The only workaround was restarting the process. Observed with the Notion remote MCP server (https://mcp.notion.com/mcp) but the root cause is transport-agnostic.Fixes #3198
Root causes addressed
pkg/tools/lifecycle/classify.go): a401/invalid_tokenwas not mapped to the existingErrAuthRequiredsentinel, soIsPermanentreturned false and the supervisor retried 5× pointlessly before failing.pkg/tools/mcp/oauth.go): refresh fired only when the token was locally expired; a server-side rejection of a token we still believed valid was never refreshed or evicted.pkg/tools/mcp/oauth.go+pkg/tools/lifecycle/supervisor.go): background reconnects had no ready elicitation bridge, so the toolset silently died with no re-auth affordance.What changed
Commit 1 —
fix(mcp): stop retry storm on permanent OAuth reconnect errorclassifyByMessagemapsinvalid_token/invalid token(RFC 6750 §3.1) toErrAuthRequired. A bareunauthorizedwithoutinvalid_tokenis intentionally NOT matched, to avoid false positives on app-level 401s.supervisor.tryRestartnow checksIsPermanentafterconnector.Connectfails and transitions straight toStateFailedinstead of exhausting the retry budget (symmetric with the existingWait()-path check).withBackgroundReconnect/IsBackgroundReconnectcontext helpers;clientConnector.ConnectappliesWithoutInteractivePromptsfor background reconnects so a background 401 defers cleanly, and maps auth/deferred-OAuth errors tolifecycle.ErrAuthRequired.Commit 2 —
fix(mcp): evict + refresh on server-side OAuth invalid_token rejectionerrorCodeFromWWWAuthextracts the RFC 6750error=code from theWWW-Authenticateheader.refreshStoredTokenis extracted fromgetValidTokenand shared by the local-expiry and server-rejection paths; it honours the existing 30srefreshFailedAtbackoff.roundTriptracks whether a token was attached; on a401withinvalid_tokensignalled and a token attached, it runshandleServerRejectedToken(serialised underoauthFlowMuso concurrent initialize-stage RPCs coalesce): re-check, evict stale token, attempt silent refresh, otherwise fall back to interactive OAuth or defer. ThelastOAuthDeclinedsticky-decline latch is honoured so a dismissed dialog is not re-popped.Commit 3 —
fix(mcp): surface re-auth notice and ensure background reconnects are non-interactive(Fixes #3198)StartableToolSetgains arecoveryStreak(distinct fromstartStreak) andShouldReportRecoveryFailure(), which fires exactly once per recovery-failure streak and stays silent for initial-startup auth deferral.emitToolsProgressively(andagent.ensureToolSetsAreStarted) emit a single deduped, user-visible notice — " needs re-authentication — it will prompt on your next message, or use /toolset-restart" — when a previously-working toolset fails withErrAuthRequired.Commit 4 —
docs: document invalid_token auto-recovery behaviordocs/features/remote-mcp,docs/community/troubleshooting, anddocs/tools/mcp.Behaviour after this change
invalid_tokenno longer triggers a 5-retry storm./toolset-restart).Testing
task build,task lint(clean for all changed files), andgo test -race ./pkg/tools/lifecycle/... ./pkg/tools/mcp/... ./pkg/tools/... ./pkg/runtime/... ./pkg/agent/...all pass.TestSupervisor_PermanentConnectErrorDoesNotRetry— permanent Connect error fails fast (2 calls, not 5).classify_test.go—invalid_token→ErrAuthRequired, including idempotency and a bare-unauthorizednegative case.TestRoundTrip_ServerInvalidTokenEvictsAndRefreshes,..._RefreshFails_DefersWhenNonInteractive,..._NoRefreshToken_NonInteractive,TestRoundTrip_FirstContact401Unchanged,TestRoundTrip_ConcurrentInvalidToken_RefreshesOnce,TestRoundTrip_Bare401WithAttachedTokenUnchanged,TestRoundTrip_ConcurrentInvalidToken_NoRefresh_StickyDecline.TestEmitStartupInfo_RecoveryAuthNoticeEmittedOnceand theStartableToolSet.ShouldReportRecoveryFailuresuite (once-per-streak, not-on-initial-startup, resets on success/stop).Notes