fix(acp/pool): preserve session ID on session/load timeout#1140
Conversation
This comment has been minimized.
This comment has been minimized.
chaodu-agent
left a comment
There was a problem hiding this comment.
CHANGES REQUESTED
aad99f9 to
f4e4b0e
Compare
When session/load times out transiently, return an error to the user instead of falling through to session/new with no history context. The original session ID is already in state.persisted (never modified on this code path), so the next message automatically retries session/load. Only actual timeouts trigger this path; permanent rejections (e.g. session/load rejected) still fall through to session/new as before. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f4e4b0e to
78f33ec
Compare
|
Thanks for the thorough review! The findings are accurate for the version you reviewed, but this PR was force-pushed with a significantly different approach before you commented. The current version no longer calls
Sorry for the confusion — the force push happened concurrently with your review. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
- Extract TRANSIENT_LOAD_ERRORS constant to make the implicit coupling between connection.rs error strings and pool.rs explicit - Include channel-closed errors (agent crash during session/load) in the transient path alongside timeouts — both are recoverable and should preserve the session ID for retry - Distinguish timeout vs connection-lost in user-facing error messages so users can see the reason for the failure - Update error_display.rs with two separate patterns and matching tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the detailed review! Here's what was addressed in the latest push: F1 (channel-closed path) — Fixed. Added F2 (string coupling) — Fixed. Extracted F3 (pool-level test) — Deferred. Mocking F4 (repeated timeout loop) — By design. The user retains control and can F5 (UX message) — Fixed. Updated user-facing messages to clarify the reason and that the current message was not sent. Timeout and connection-lost now show distinct messages. |
|
LGTM ✅ — Correctly preserves session ID on transient failures, preventing destructive history loss. What This PR DoesWhen How It Works
Findings
What's Good (🟢)
Baseline Check
Previous Review Findings — Resolution Status
|
Adopt upstream's refactors, re-port our fork features on top. Conflicts resolved (5 files): - acp/pool.rs: keep our team-system-prompt injection (session/new _meta) + TTL resume gate; add upstream's TRANSIENT_LOAD_ERRORS (openabdev#1140 session-id preservation). - config.rs: keep our OwnerOrMentions variant; adopt upstream's MultibotMentions-as-default doc. - slack.rs: take upstream's reconnect loop wholesale (backoff + IDLE_TIMEOUT_SECS + socket_idle, a superset of our PR#3 timeout guards) and AllowListSource abstraction for allowed_users; re-port our runtime-mutable allowed_channels (auto-allow invited/created channels) at the per-message gate; SlackAdapter struct/ctor = superset (our fields + upstream multibot_cache); keep streaming/trusted_bot_ids/file_upload_cache/peer-mention. - main.rs: keep both relay_ctx+AdapterRouter (ours) and ctl IPC (ctl_shard/registry/handle, upstream openabdev#1147); keep slack_config_path + multibot_cache init; run_slack_adapter call updated to new signature. - adapter.rs: merge both feature sets — our context-usage footer + suppress_send ack + meta-preamble stripping AND upstream's discord mention propagation + delivery_failed tracking. - discord.rs: add OwnerOrMentions arm to upstream's new reaction handler (gated like Involved — reactions carry no @mention); pass mentions to DiscordAdapter::new. Build: debug + release green. Tests: slack 75/75; full 609/610 (the 1 failure is pre-existing secrets.rs OS-error-wording, unrelated to this merge). Not yet deployed — live binary untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What problem does this solve?
When
session/loadtimes out (30-second limit), OpenAB falls through tosession/newand permanently overwritesthread_map.jsonwith the new session ID. The user's previous conversation history becomes inaccessible without manual SSH intervention — caused by a transient network condition, not a real session loss.Discord Discussion: https://discord.com/channels/1491295327620169908/1517011191447158795
At a Glance
Prior Art & Industry Research
OpenClaw (session-thread-info-loaded.ts):
On session key resolution failure, OpenClaw explicitly preserves the original session key rather than generating a new one. Quote: "if the channel hook has no thread id, preserve the original session key." Same conservative principle: on uncertainty, keep what you have.
Hermes Agent (use-session-actions.ts):
Hermes tracks resume failures via
resumeFailedSessionIdstate, arms a retry UI on RPC failure, and never automatically discards the session ID. The session ID is only cleared by explicit user action (/resetequivalent). This directly matches the approach in this PR: distinguish transient failure from permanent loss, preserve the ID, let the user decide.Proposed Solution
pool.rs, distinguish timeout errors from permanent rejections using the existing"timeout waiting for"string (produced bysend_requestinconnection.rs)Errimmediately — the original session ID is already instate.persisted(never modified on this code path), so the next message retriessession/loadautomaticallysession/load rejected): fall through tosession/newas before"session load timeout"match informat_user_errorwith a clear user-facing message, plus a unit testWhy this approach?
The core insight is that
state.persistedalready holds the old session ID — we never touch it before the timeout, so there is nothing to "preserve". The fix is purely about not overwriting it (by not reachingsession/new) and not processing the current message against an empty context.Known limitations:
session/new. This is acceptable: the user ends up in a fresh session, same as before, but only after a deliberate retry rather than silently on a transient failure./resetto start fresh at any time. Automatic fallback would silently destroy history, which is the bug this PR fixes.Alternatives Considered
Validation
cargo check✅cargo test✅ 507 passed; 0 failed (includes newformat_user_error_session_load_timeouttest)cargo clippy✅