Add submit timeout flag and surface api errors to user #137

Merged
aliroberts merged 1 commit into dev from feature/submit-timeout on Apr 17, 2026

@aliroberts
Contributor

Summary

Prevents runs from hanging on lost /suggest responses and surfaces real backend error detail instead of the generic submit_failed termination.

Context

Users were seeing runs stuck in submitting for ~40 min before the server heartbeat reaper killed them. Root cause: submit_execution_result had a 3650 s read timeout; the backend /suggest would commit the next execution_tasks row and return 200, but the reply was getting dropped in transit (LB/proxy/network). Meanwhile, every failure in the submit path was swallowed by a try/except Exception: return None, so users only ever saw termination_details="Failed to submit execution result", which hid real causes like insufficient credits, auth errors, and candidate-generation failures.

Changes

Queue-mode recovery on submit (core/api.py)

  • WecoClient.suggest now runs _recover_queue_suggest on ReadTimeout / ConnectionError / 5xx when a task_id is supplied.
  • Recovery calls get_run_status(include_history=True) + get_execution_tasks: if a ready task is queued (or the run is completed), synthesize a success response
    so the main loop continues via its normal poll/claim path. Also pulls the previous step's metric_value so ui.on_metric fires for the recovered step.
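
The recovery decision described above can be sketched roughly as follows. This is an illustration only, not the code in core/api.py: TransientSubmitError, recover_queue_suggest, suggest_with_recovery, and every dictionary key ("status", "history", "metric_value", "recovered", "is_done", "previous_metric_value") are hypothetical stand-ins, and the real client calls are replaced by plain callables.

```python
from typing import Any, Callable, Dict, List, Optional


class TransientSubmitError(Exception):
    """Hypothetical stand-in for ReadTimeout / ConnectionError / 5xx."""


def recover_queue_suggest(
    run_status: Dict[str, Any], tasks: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Return a synthesized success response if the server evidently
    processed the submit, otherwise None. All field names are assumed."""
    has_ready_task = any(t.get("status") == "ready" for t in tasks)
    run_completed = run_status.get("status") == "completed"
    if not (has_ready_task or run_completed):
        return None
    # Pull the previous step's metric so the UI can still report it.
    history = run_status.get("history") or []
    previous_metric = history[-1].get("metric_value") if history else None
    return {
        "recovered": True,
        "is_done": run_completed,
        "previous_metric_value": previous_metric,
    }


def suggest_with_recovery(
    post_suggest: Callable[[], Dict[str, Any]],
    fetch_status: Callable[[], Dict[str, Any]],
    fetch_tasks: Callable[[], List[Dict[str, Any]]],
    task_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Try the submit; on a transient failure, check server state first."""
    try:
        return post_suggest()
    except TransientSubmitError:
        if task_id is None:
            raise  # recovery only applies when a task_id was supplied
        recovered = recover_queue_suggest(fetch_status(), fetch_tasks())
        if recovered is None:
            raise  # nothing queued and run not completed: a real failure
        return recovered
```

In this sketch the original exception is re-raised whenever recovery finds no evidence that the server handled the submit, so genuine failures still reach the caller.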

Configurable submit timeout

  • Optional timeout plumbed through the call chain: WecoClient.suggest, submit_execution_result, _run_optimization_loop, optimize / resume_optimization.
  • Hidden --submit-timeout SECONDS flag on weco run and weco resume. Default (None) preserves the existing (10, 3650) behavior — no impact on existing
    clients.
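
A hidden flag with a behavior-preserving default could look something like the sketch below. It uses argparse only for illustration; the actual CLI wiring in weco is not shown in this PR description, and build_parser, resolve_timeout, the float type, and the choice to override only the read half of the (connect, read) tuple are all assumptions.

```python
import argparse
from typing import Optional, Tuple

# Existing (connect, read) timeout in seconds, as stated in the PR description.
DEFAULT_TIMEOUT: Tuple[float, float] = (10, 3650)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the prog name is a placeholder."""
    parser = argparse.ArgumentParser(prog="weco-sketch")
    parser.add_argument(
        "--submit-timeout",
        type=float,
        default=None,
        metavar="SECONDS",
        help=argparse.SUPPRESS,  # keeps the flag out of --help output
    )
    return parser


def resolve_timeout(submit_timeout: Optional[float]) -> Tuple[float, float]:
    """None keeps the existing behavior; a value replaces the read timeout.

    Overriding only the read timeout is an assumption of this sketch.
    """
    if submit_timeout is None:
        return DEFAULT_TIMEOUT
    return (DEFAULT_TIMEOUT[0], submit_timeout)
```

For example, `resolve_timeout(build_parser().parse_args(["--submit-timeout", "120"]).submit_timeout)` yields a 120 s read timeout, while omitting the flag leaves the default tuple untouched.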

Error surfacing (ported from feature/derive-run; derive feature excluded)

  • New format_api_error renders backend detail / suggestion / extras as a multi-line string suitable for ui.on_error.
  • submit_execution_result no longer swallows exceptions; signature tightened from Optional[Dict] to Dict; docstring now records what it raises.
  • Central except HTTPError in _run_optimization_loop pushes the formatted error through the UI and returns OptimizationResult(reason=f"http_{status_code}", details=error_message) so runs.termination_reason / termination_details carry the real cause.
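
The error-surfacing flow can be illustrated with a small self-contained sketch. It does not reproduce the real format_api_error or _run_optimization_loop: ApiHTTPError is a hypothetical stand-in for the HTTP error the client raises, the payload keys ("detail", "suggestion") and the output layout are assumed, and run_step plus the "completed" reason are invented for the example. Only the http_{status_code} reason format and the reason/details fields come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


class ApiHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a JSON payload."""

    def __init__(self, status_code: int, payload: Dict[str, Any]):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code
        self.payload = payload


@dataclass
class OptimizationResult:
    reason: str
    details: Optional[str] = None


def format_api_error(status_code: int, payload: Dict[str, Any]) -> str:
    """Render detail / suggestion / extras as a multi-line string."""
    lines = [f"HTTP {status_code}: {payload.get('detail', 'Unknown error')}"]
    suggestion = payload.get("suggestion")
    if suggestion:
        lines.append(f"Suggestion: {suggestion}")
    for key, value in payload.items():
        if key not in ("detail", "suggestion"):
            lines.append(f"{key}: {value}")  # any extra backend fields
    return "\n".join(lines)


def run_step(
    submit: Callable[[], None], on_error: Callable[[str], None]
) -> OptimizationResult:
    """One loop step with a central handler instead of a silent None."""
    try:
        submit()
    except ApiHTTPError as exc:
        message = format_api_error(exc.status_code, exc.payload)
        on_error(message)  # show the real cause to the user
        return OptimizationResult(
            reason=f"http_{exc.status_code}", details=message
        )
    return OptimizationResult(reason="completed")  # placeholder reason
```

Because the submit callable is allowed to raise, the handler sees the status code and payload directly, which is what lets the termination reason and details carry the backend's own explanation.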

@aliroberts aliroberts merged commit 8fc8e1a into dev Apr 17, 2026
1 check passed
@aliroberts aliroberts deleted the feature/submit-timeout branch April 17, 2026 09:16