Add auto-resume feature and bump version (0.3.31)#142

Merged
aliroberts merged 1 commit into main from dev
Apr 24, 2026

@aliroberts
Contributor

Summary

Wraps the optimization loop so that transient network failures (ConnectionError, ReadTimeout, HTTP 502/503/504) auto-resume the run instead of bailing out. Enabled by
default, with CLI overrides. Non-transient failures (auth, 4xx, insufficient credits, Ctrl-C) still propagate unchanged. Bumps version to 0.3.31.
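
As a rough illustration of the transient/non-transient split described above, the HTTP side can be sketched as a simple status-code check. The names `TRANSIENT_STATUS_CODES` and `is_transient_status` are hypothetical and not taken from the PR; only the 502/503/504 set comes from the summary.

```python
# Hypothetical helper: only the 502/503/504 set is taken from the PR summary.
TRANSIENT_STATUS_CODES = frozenset({502, 503, 504})


def is_transient_status(status_code: int) -> bool:
    """Return True for gateway-style errors that are worth retrying."""
    return status_code in TRANSIENT_STATUS_CODES


# Auth and other 4xx responses fall through and propagate unchanged.
for code in (401, 404, 502, 503, 504):
    print(code, is_transient_status(code))
```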

CLI

weco run ... [--no-auto-resume] [--auto-resume-max-attempts N]
weco resume ... [--no-auto-resume] [--auto-resume-max-attempts N]

Defaults: enabled, 5 attempts, 5s initial backoff, exponential (×2) capped at 60s.
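
For reference, the default schedule can be worked out with a few lines of Python. This is a minimal sketch that assumes the first wait equals the initial backoff and that each later wait doubles before the cap is applied; `backoff_schedule` is a hypothetical name, not a function from the PR.

```python
def backoff_schedule(max_attempts=5, initial_s=5.0, factor=2.0, cap_s=60.0):
    """List the wait before each attempt: initial * factor**i, capped at cap_s."""
    return [min(initial_s * factor ** i, cap_s) for i in range(max_attempts)]


print(backoff_schedule())  # [5.0, 10.0, 20.0, 40.0, 60.0]
```

Under these assumptions the cap only takes effect on the fifth attempt, where the uncapped value would be 80s.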

Implementation notes

  • _run_loop_with_auto_resume in weco/optimizer.py drives run_optimization_loop as a closure. On transient exit it sleeps with exponential backoff, calls
    WecoClient.resume_run silently (_silent_resume), and re-enters the loop with start_step = result.final_step. Non-transient results return verbatim.
  • run_optimization_loop now catches ConnectionError / ReadTimeout explicitly and tags them transient_network_error instead of landing in the generic unknown bucket.
    The HTTPError branch is unchanged; a result is classified as transient when its reason ∈ {transient_network_error, http_502, http_503, http_504}.
  • _silent_resume failures retry in-place (don't re-invoke the loop), so when the backend is unreachable we don't spin in get_execution_tasks for 10 minutes between resume
    attempts.
  • New UI events on_reconnecting(attempt, max, backoff_s) / on_reconnected() on the OptimizationUI protocol. Rich UI adds a reconnecting status (📡, yellow) with
    attempt/backoff in the status row; plain UI prints [RECONNECTING] / [RECONNECTED] lines. Exhaustion routes through on_error so it lands in the prominent Error row.
  • AutoResumePolicy dataclass carries the overrides; both optimize() and resume_optimization() accept one and default to AutoResumePolicy() when absent.
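
The control flow in the notes above can be sketched roughly as follows. This is not the code from weco/optimizer.py: the field names on `AutoResumePolicy`, the `LoopResult` type, the `run_with_auto_resume` signature, and the "completed" reason used in the usage note are all assumptions made for illustration. The sketch also assumes the attempt counter is shared between loop exits and failed resumes and is never reset, which the PR does not state.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Reason tags listed in the PR as transient.
TRANSIENT_REASONS = frozenset(
    {"transient_network_error", "http_502", "http_503", "http_504"}
)


@dataclass
class AutoResumePolicy:
    # Field names are hypothetical; the default values come from the PR.
    enabled: bool = True
    max_attempts: int = 5
    initial_backoff_s: float = 5.0
    max_backoff_s: float = 60.0


@dataclass
class LoopResult:
    # Hypothetical stand-in for whatever run_optimization_loop returns.
    reason: str
    final_step: int


def run_with_auto_resume(
    run_loop: Callable[[int], LoopResult],
    resume: Callable[[], bool],
    policy: Optional[AutoResumePolicy] = None,
    sleep: Callable[[float], None] = time.sleep,
    on_reconnecting: Optional[Callable[[int, int, float], None]] = None,
) -> LoopResult:
    """Re-enter run_loop after transient exits, backing off between attempts."""
    policy = policy or AutoResumePolicy()
    result = run_loop(0)
    attempt = 0
    while policy.enabled and result.reason in TRANSIENT_REASONS:
        if attempt >= policy.max_attempts:
            break  # exhausted: hand the transient result back to the caller
        attempt += 1
        backoff = min(
            policy.initial_backoff_s * 2 ** (attempt - 1), policy.max_backoff_s
        )
        if on_reconnecting is not None:
            on_reconnecting(attempt, policy.max_attempts, backoff)
        sleep(backoff)
        if not resume():
            continue  # resume failed: retry it in place, do not re-enter the loop
        result = run_loop(result.final_step)
    return result
```

A caller would pass the real loop and resume functions; in a test, fakes such as a `run_loop` that returns `LoopResult("transient_network_error", 3)` once and then `LoopResult("completed", 10)` are enough to exercise the transient-then-success path, and injecting `sleep` avoids real waits.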

Tests

  • tests/test_auto_resume.py: 21 tests covering classification across 12 reasons, happy path, transient-then-success, exhaustion, disabled policy, _silent_resume failure
    retries without re-invoking the loop, exponential backoff with cap, and event payload shape.

@aliroberts aliroberts merged commit 6a4dc2d into main Apr 24, 2026
2 checks passed