
fix(acp/pool): preserve session ID on session/load timeout #1140

Merged
thepagent merged 2 commits into openabdev:main from chenhan-agent:fix/session-load-preserve-on-timeout
Jun 19, 2026

Conversation

@chenhan-agent

@chenhan-agent chenhan-agent commented Jun 18, 2026

Contributor

What problem does this solve?

When session/load times out (30-second limit), OpenAB falls through to session/new and permanently overwrites thread_map.json with the new session ID. The user's previous conversation history then becomes inaccessible without manual SSH intervention, even though the cause was a transient network condition rather than a real session loss.

Discord Discussion: https://discord.com/channels/1491295327620169908/1517011191447158795

At a Glance

Before (timeout path):
  session/load timeout
       │
       ▼
  session/new          ← user's message processed against empty context
       │
       ▼
  thread_map.json      ← old session ID OVERWRITTEN ← history lost


After (this PR):
  session/load fails
       │
       ├── permanent rejection → fall through to session/new (unchanged)
       │
       └── timeout
             │
             ▼
         return Err        ← current message NOT processed
             │
             ▼
         thread_map.json   ← old session ID PRESERVED (never touched)
             │
             ▼
         user sees: "Session Load Timeout. Send any message to retry, or /reset"

Prior Art & Industry Research

OpenClaw (session-thread-info-loaded.ts):
On session key resolution failure, OpenClaw explicitly preserves the original session key rather than generating a new one. Quote: "if the channel hook has no thread id, preserve the original session key." Same conservative principle: on uncertainty, keep what you have.

Hermes Agent (use-session-actions.ts):
Hermes tracks resume failures via resumeFailedSessionId state, arms a retry UI on RPC failure, and never automatically discards the session ID. The session ID is only cleared by explicit user action (/reset equivalent). This directly matches the approach in this PR: distinguish transient failure from permanent loss, preserve the ID, let the user decide.

Proposed Solution

  • In pool.rs, distinguish timeout errors from permanent rejections using the existing "timeout waiting for" string (produced by send_request in connection.rs)
  • On timeout: return Err immediately — the original session ID is already in state.persisted (never modified on this code path), so the next message retries session/load automatically
  • On permanent rejection (session/load rejected): fall through to session/new as before
  • Add a "session load timeout" match in format_user_error with a clear user-facing message, plus a unit test
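
The classification step in the list above can be sketched as follows. This is a minimal illustration, not the code in pool.rs: the `LoadFailure` enum, `classify_load_error`, and `TIMEOUT_MARKER` are hypothetical names, and the text after the "timeout waiting for" marker in the sample error is invented. Only the marker substring and the "session/load rejected" string come from this PR.

```rust
/// Hypothetical outcome type; the PR itself branches inline in pool.rs.
#[derive(Debug, PartialEq)]
enum LoadFailure {
    /// Transient: return Err, leave the stored session ID alone, retry on the next message.
    RetryLater,
    /// Permanent: fall through to session/new as before.
    StartFresh,
}

/// Substring that, per the PR description, send_request in connection.rs produces on timeout.
const TIMEOUT_MARKER: &str = "timeout waiting for";

/// Decide what to do with a failed session/load based on its error text.
fn classify_load_error(err: &str) -> LoadFailure {
    if err.contains(TIMEOUT_MARKER) {
        LoadFailure::RetryLater
    } else {
        LoadFailure::StartFresh
    }
}

fn main() {
    // The wording after the marker is an assumption made for this example.
    let timeout = classify_load_error("timeout waiting for session/load response");
    let rejected = classify_load_error("session/load rejected");
    assert_eq!(timeout, LoadFailure::RetryLater);
    assert_eq!(rejected, LoadFailure::StartFresh);
    println!("timeout -> {timeout:?}, rejected -> {rejected:?}");
}
```

Matching on a substring keeps the change small, at the cost of coupling pool.rs to the exact error wording in connection.rs.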

Why this approach?

The core insight is that state.persisted already holds the old session ID — we never touch it before the timeout, so there is nothing to "preserve". The fix is purely about not overwriting it (by not reaching session/new) and not processing the current message against an empty context.

Known limitations:

  • "Retry" does not guarantee success — if the agent's session file is gone, the next retry hits a permanent rejection and falls through to session/new. This is acceptable: the user ends up in a fresh session, same as before, but only after a deliberate retry rather than silently on a transient failure.
  • If a session consistently exceeds the 30-second load timeout (e.g. very large history), the user will see repeated timeout errors. This is intentional: the user retains control and can use /reset to start fresh at any time. Automatic fallback would silently destroy history, which is the bug this PR fixes.

Alternatives Considered

  • Raise timeout to 120s: Mitigates the symptom but doesn't fix the destructive fallback. Rejected.
  • Add in-pool retry loop: Adds complexity, delays error response, still doesn't guarantee success. Rejected.
  • Preserve ID + fall through to session/new anyway (v1 of this PR): Message processed against blank context, confusing response. Rejected.
  • Retry counter cap (fall through to session/new after N timeouts): Removes user control — the user is in the best position to decide when to give up and /reset. Rejected.

Validation

  • cargo check
  • cargo test ✅ 507 passed; 0 failed (includes new format_user_error_session_load_timeout test)
  • cargo clippy

@chenhan-agent chenhan-agent requested a review from thepagent as a code owner June 18, 2026 10:38
@openab-app openab-app Bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 18, 2026
@chenhan-agent chenhan-agent marked this pull request as draft June 18, 2026 10:40
@openab-app openab-app Bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 18, 2026
@chaodu-agent

This comment has been minimized.

@chaodu-agent chaodu-agent left a comment

Collaborator


CHANGES REQUESTED ⚠️ — See review comment: #1140 (comment)

@chenhan-agent chenhan-agent force-pushed the fix/session-load-preserve-on-timeout branch from aad99f9 to f4e4b0e Compare June 18, 2026 10:53
When session/load times out transiently, return an error to the user
instead of falling through to session/new with no history context.
The original session ID is already in state.persisted (never modified on
this code path), so the next message automatically retries session/load.

Only actual timeouts trigger this path; permanent rejections (e.g.
session/load rejected) still fall through to session/new as before.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chenhan-agent chenhan-agent force-pushed the fix/session-load-preserve-on-timeout branch from f4e4b0e to 78f33ec Compare June 18, 2026 11:24
@chenhan-agent

Contributor Author

Thanks for the thorough review! The findings are accurate for the version you reviewed, but this PR was force-pushed with a significantly different approach before you commented.

The current version no longer calls session/new on timeout — it returns Err immediately, before new_conn is ever inserted into state.active. This resolves F1–F4:

  • F1: cleanup_idle/shutdown only iterate state.active. Since the failed connection is dropped and never inserted, state.persisted[old_sid] is never overwritten.
  • F2–F4: The dead old connection stays in state.active with its original acp_session_id = old_sid. The next message's get_or_create finds it dead, falls through, and retries session/load(old_sid) naturally.
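
The state argument above can be modeled with a toy pool. This is a simplified sketch under stated assumptions: `PoolState`, `LoadResult`, and this `get_or_create` signature are hypothetical stand-ins, the real pool also tracks state.active and live connections, and the load result is passed in rather than produced by a real session/load call.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the pool state discussed above.
struct PoolState {
    /// Thread ID -> persisted session ID (the data behind thread_map.json).
    persisted: HashMap<String, String>,
}

/// Simulated outcome of one session/load attempt (illustrative only).
enum LoadResult {
    Loaded,
    Timeout,
    Rejected,
}

impl PoolState {
    fn with_session(thread: &str, sid: &str) -> Self {
        let mut persisted = HashMap::new();
        persisted.insert(thread.to_string(), sid.to_string());
        PoolState { persisted }
    }

    /// Returns the session ID to use, or Err when the load failed transiently.
    fn get_or_create(&mut self, thread: &str, load: LoadResult, new_sid: &str) -> Result<String, String> {
        if let Some(old_sid) = self.persisted.get(thread).cloned() {
            match load {
                LoadResult::Loaded => return Ok(old_sid),
                // Transient: bail out before anything is written.
                LoadResult::Timeout => return Err("session load timeout".to_string()),
                // Permanent: fall through to session/new below.
                LoadResult::Rejected => {}
            }
        }
        // session/new path: only here is the persisted ID replaced.
        self.persisted.insert(thread.to_string(), new_sid.to_string());
        Ok(new_sid.to_string())
    }
}

fn main() {
    let mut state = PoolState::with_session("thread-1", "old-sid");

    // First message: load times out, the old ID survives.
    assert!(state.get_or_create("thread-1", LoadResult::Timeout, "new-sid").is_err());
    assert_eq!(state.persisted["thread-1"], "old-sid");

    // Next message: the retry succeeds against the same old ID.
    assert_eq!(
        state.get_or_create("thread-1", LoadResult::Loaded, "new-sid"),
        Ok("old-sid".to_string())
    );
    println!("old session ID preserved across the timeout");
}
```

The early return is what protects the mapping: no write happens on the timeout path, so the retry needs no extra bookkeeping.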

Sorry for the confusion — the force push happened concurrently with your review.

@chenhan-agent chenhan-agent marked this pull request as ready for review June 18, 2026 11:38
@chaodu-agent

This comment has been minimized.

@chaodu-agent

This comment has been minimized.

- Extract TRANSIENT_LOAD_ERRORS constant to make the implicit coupling
  between connection.rs error strings and pool.rs explicit
- Include channel-closed errors (agent crash during session/load) in the
  transient path alongside timeouts — both are recoverable and should
  preserve the session ID for retry
- Distinguish timeout vs connection-lost in user-facing error messages
  so users can see the reason for the failure
- Update error_display.rs with two separate patterns and matching tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chenhan-agent

chenhan-agent commented Jun 19, 2026

Contributor Author

Thanks for the detailed review! Here's what was addressed in the latest push:

F1 (channel-closed path) — Fixed. Added "channel closed" to TRANSIENT_LOAD_ERRORS so agent crashes during session/load are treated the same as timeouts (preserve + retry).

F2 (string coupling) — Fixed. Extracted TRANSIENT_LOAD_ERRORS constant in pool.rs to make the contract explicit.

F3 (pool-level test) — Deferred. Mocking AcpConnection would require a trait abstraction refactor; out of scope for this PR.

F4 (repeated timeout loop) — By design. The user retains control and can /reset at any time. Added this as a known limitation in the PR description.

F5 (UX message) — Fixed. Updated user-facing messages to clarify the reason and that the current message was not sent. Timeout and connection-lost now show distinct messages.

@chaodu-agent

Collaborator

LGTM ✅ — Correctly preserves session ID on transient failures, preventing destructive history loss.

What This PR Does

When session/load times out (30s) or the connection drops, OpenAB previously fell through to session/new, permanently overwriting thread_map.json and losing conversation history. This PR distinguishes transient failures from permanent rejections: on timeout or channel close, returns early with an error, preserving the session ID for automatic retry on the next message.

How It Works

  1. Defines TRANSIENT_LOAD_ERRORS constant with "timeout waiting for" and "channel closed" — making the coupling between connection.rs error strings and pool.rs classification explicit
  2. After session_load fails, checks if the error is transient vs permanent
  3. On transient failure: returns Err before reaching session/new — the spawned process is cleaned up via Drop, while state.persisted/state.suspended remain untouched
  4. On permanent rejection (e.g. "session/load rejected"): falls through to session/new as before
  5. error_display.rs matches the new error strings with distinct user-facing messages for timeout vs connection-lost, both placed before the generic timeout pattern to prevent false matching
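
Steps 1, 2, and 5 above can be sketched together. The constant name and its two substrings come from this review, as does the "session load timeout" pattern. Everything else is an assumption for illustration: the helper `is_transient_load_error`, the "session load connection lost" pattern, the exact user-facing wording, and the idea that the pool wraps the underlying error text.

```rust
/// Substrings that mark a failed session/load as transient (named in this review).
const TRANSIENT_LOAD_ERRORS: &[&str] = &["timeout waiting for", "channel closed"];

/// Hypothetical helper: true when the error text contains any transient marker.
fn is_transient_load_error(err: &str) -> bool {
    TRANSIENT_LOAD_ERRORS.iter().any(|marker| err.contains(*marker))
}

/// Sketch of the display mapping; message wording here is assumed, not quoted.
fn format_user_error(err: &str) -> String {
    // Specific patterns come first so the generic timeout arm cannot shadow them.
    if err.contains("session load timeout") {
        "Session load timed out. Your message was not sent. Send any message to retry, or /reset."
            .to_string()
    } else if err.contains("session load connection lost") {
        "Connection lost while loading the session. Your message was not sent. Send any message to retry, or /reset."
            .to_string()
    } else if err.contains("timeout waiting for") {
        "The agent took too long to respond.".to_string()
    } else {
        format!("Error: {err}")
    }
}

fn main() {
    // Assumed shape: the pool prefixes the underlying connection error.
    let inner = "timeout waiting for session/load response";
    let wrapped = format!("session load timeout: {inner}");

    assert!(is_transient_load_error(inner));
    assert!(!is_transient_load_error("session/load rejected"));

    // The wrapped error contains both patterns; ordering picks the specific one.
    assert!(format_user_error(&wrapped).starts_with("Session load timed out"));
    println!("{}", format_user_error(&wrapped));
}
```

If the generic "timeout waiting for" arm came first, a wrapped error like the one in `main` would match it and the user would never see the retry guidance.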

Findings

| # | Severity | Finding | Location |
|---|----------|---------|----------|
| 1 | 🟢 | Transient errors correctly identified: both timeout and channel-closed from send_request are internal strings under this crate's control | pool.rs:14 |
| 2 | 🟢 | Resource cleanup verified: AcpConnection::Drop kills the spawned process on early return | pool.rs:271 |
| 3 | 🟢 | State invariant upheld: persisted/suspended untouched before load_failed check | pool.rs:273 |
| 4 | 🟢 | Error ordering correct: specific "session load timeout" before generic "timeout waiting for" prevents false match | error_display.rs:14 |
| 5 | 🟢 | User message complete: includes "Your message was not sent" for clarity | error_display.rs:15 |
| 6 | 🟢 | Previous review findings addressed: channel-closed path now protected (F1), constant extracted (F2), user message clarified (F5) | |

What's Good (🟢)
  • Conservative principle: On transient failure, preserve what you have rather than destructively overwrite — correct architectural decision aligned with cited prior art
  • Minimal blast radius: 2 files changed, no new dependencies, no structural refactoring
  • Explicit coupling: TRANSIENT_LOAD_ERRORS constant makes the string-matching contract visible and searchable
  • Both transient paths covered: Timeout and channel-closed (agent crash during load) both preserve the session ID
  • Clean separation: Distinct user-facing messages for timeout vs connection-lost helps debugging
  • Test coverage: Unit tests for both new format_user_error paths
  • CI green: All checks pass (cargo check, clippy, 507 tests, all smoke tests)
  • Responsive to feedback: Commit bca20d1 directly addresses F1 (channel-closed), F2 (constant extraction), and F5 (message clarity) from prior review round

Baseline Check
  • PR opened: 2026-06-18
  • Main already has: session/load with fallback to session/new on any error; format_user_error for user-facing display
  • Net-new value: Distinguishes transient failures (timeout, channel-closed) from permanent rejection, prevents destructive session ID overwrite, gives users clear guidance to retry or reset
  • CI: All checks green

Previous Review Findings: Resolution Status

| # | Previous Finding | Status |
|---|------------------|--------|
| F1 | Channel-closed path unprotected | ✅ Fixed: added to TRANSIENT_LOAD_ERRORS |
| F2 | String-based coupling implicit | ✅ Fixed: extracted to named constant |
| F3 | No pool-level integration test | ℹ️ Accepted: testing this path requires mocking internal send_request timing; existing smoke tests provide end-to-end coverage |
| F4 | Repeated timeout for large sessions | ℹ️ Accepted: documented as known limitation in PR description; user retains control via /reset |
| F5 | UX message incomplete | ✅ Fixed: "Your message was not sent" included |

@thepagent thepagent merged commit f79b2d8 into openabdev:main Jun 19, 2026
32 of 33 checks passed
@chenhan-agent chenhan-agent deleted the fix/session-load-preserve-on-timeout branch June 20, 2026 03:08
angmeng added a commit to angmeng/openab that referenced this pull request Jun 20, 2026
Adopt upstream's refactors, re-port our fork features on top.

Conflicts resolved (5 files):
- acp/pool.rs: keep our team-system-prompt injection (session/new _meta) +
  TTL resume gate; add upstream's TRANSIENT_LOAD_ERRORS (openabdev#1140 session-id
  preservation).
- config.rs: keep our OwnerOrMentions variant; adopt upstream's
  MultibotMentions-as-default doc.
- slack.rs: take upstream's reconnect loop wholesale (backoff +
  IDLE_TIMEOUT_SECS + socket_idle, a superset of our PR#3 timeout guards) and
  AllowListSource abstraction for allowed_users; re-port our runtime-mutable
  allowed_channels (auto-allow invited/created channels) at the per-message
  gate; SlackAdapter struct/ctor = superset (our fields + upstream
  multibot_cache); keep streaming/trusted_bot_ids/file_upload_cache/peer-mention.
- main.rs: keep both relay_ctx+AdapterRouter (ours) and ctl IPC
  (ctl_shard/registry/handle, upstream openabdev#1147); keep slack_config_path +
  multibot_cache init; run_slack_adapter call updated to new signature.
- adapter.rs: merge both feature sets — our context-usage footer +
  suppress_send ack + meta-preamble stripping AND upstream's discord mention
  propagation + delivery_failed tracking.
- discord.rs: add OwnerOrMentions arm to upstream's new reaction handler
  (gated like Involved — reactions carry no @mention); pass mentions to
  DiscordAdapter::new.

Build: debug + release green. Tests: slack 75/75; full 609/610 (the 1 failure
is pre-existing secrets.rs OS-error-wording, unrelated to this merge).

Not yet deployed — live binary untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>