Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/seclab_taskflow_agent/agent.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
from agents.run import DEFAULT_MAX_TURNS
from dotenv import find_dotenv, load_dotenv
from openai import AsyncOpenAI
import httpx

from .capi import get_AI_endpoint, get_AI_token, get_provider

Expand Down Expand Up @@ -182,6 +183,7 @@ def __init__(
base_url=resolved_endpoint,
api_key=resolved_token,
default_headers=provider.extra_headers or None,
timeout=httpx.Timeout(connect=10.0, read=300.0, write=300.0, pool=60.0),
)
Comment on lines 182 to 187
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client read timeout is set to 300s while the runner’s STREAM_IDLE_TIMEOUT is 1800s. Since these are two independent timeouts affecting streaming behavior, the effective idle-kill behavior can become unclear to maintainers and may not match the intended 30-minute threshold. Consider centralizing/documenting the relationship between these timeouts (or aligning them) so the configured behavior is unambiguous.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are intentionally different — they guard different failure modes at different layers:

  • httpx.Timeout(read=300s) — TCP-level. Catches dead connections where the socket itself stops delivering bytes (CLOSE_WAIT). This is the first line of defense.
  • STREAM_IDLE_TIMEOUT(1800s) — Application-level. Catches hangs where the connection is technically alive but no events arrive (e.g. the async generator is stuck, or the server stops sending SSE frames while keeping the connection open).

The read timeout fires per individual socket read; the idle timeout fires when no complete event has been yielded for 30 minutes. In practice the httpx timeout catches most dead-connection cases and the idle timeout is a backstop for subtler hangs. I'll add a comment in the code clarifying the relationship.

set_tracing_disabled(True)
self.run_hooks = run_hooks or TaskRunHooks()
Expand Down
18 changes: 17 additions & 1 deletion src/seclab_taskflow_agent/runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@
MAX_API_RETRY = 5 # Maximum number of consecutive API error retries
TASK_RETRY_LIMIT = 3 # Maximum retry attempts for a failed task
TASK_RETRY_BACKOFF = 10 # Initial backoff in seconds between task retries
STREAM_IDLE_TIMEOUT = 1800 # Kill a streaming run if no events received for this long (seconds)


def _resolve_model_config(
Expand Down Expand Up @@ -391,7 +392,22 @@ async def _run_streamed() -> None:
while rate_limit_backoff:
try:
result = agent0.run_streamed(prompt, max_turns=max_turns)
async for event in result.stream_events():
stream = result.stream_events()
async_iter = stream.__aiter__()
while True:
try:
event = await asyncio.wait_for(
async_iter.__anext__(),
timeout=STREAM_IDLE_TIMEOUT,
)
except StopAsyncIteration:
break
except asyncio.TimeoutError:
logging.error(
f"Stream idle for {STREAM_IDLE_TIMEOUT}s — "
"connection likely dead, raising APITimeoutError"
)
raise APITimeoutError("Stream idle timeout exceeded")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces a stream-idle timeout behavior but there are no unit tests covering the new retry path (timeout -> APITimeoutError -> retry) or verifying that iteration stops/cleans up correctly. Consider adding a focused test that stubs run_streamed().stream_events() to yield a couple events and then hang, asserting the idle timeout triggers and cleanup/retry behavior occurs as intended.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a targeted hotfix for a production hang we hit during a live experiment — the timeout/retry path was validated manually against Python 3.13 (normal iteration, StopAsyncIteration propagation, TimeoutError, and cancellation safety). Happy to add a proper test suite in a follow-up but didn't want to block the fix on that.

if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
await render_model_output(event.data.delta, async_task=async_task, task_id=task_id)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual __aiter__/__anext__ loop no longer guarantees the stream iterator is closed when exiting early (e.g., on asyncio.TimeoutError). Unlike async for, this can leave the async generator / underlying streaming run unclosed, potentially leaking tasks/sockets and undermining the goal of recovering from dead connections. Ensure the stream is explicitly closed in a finally (e.g., via contextlib.aclosing(...) or calling aclose()/a provided result cleanup API) before raising/retrying.

Suggested change
stream = result.stream_events()
async_iter = stream.__aiter__()
while True:
try:
event = await asyncio.wait_for(
async_iter.__anext__(),
timeout=STREAM_IDLE_TIMEOUT,
)
except StopAsyncIteration:
break
except asyncio.TimeoutError:
logging.error(
f"Stream idle for {STREAM_IDLE_TIMEOUT}s — "
"connection likely dead, raising APITimeoutError"
)
raise APITimeoutError("Stream idle timeout exceeded")
if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
await render_model_output(event.data.delta, async_task=async_task, task_id=task_id)
stream = None
try:
stream = result.stream_events()
async_iter = stream.__aiter__()
while True:
try:
event = await asyncio.wait_for(
async_iter.__anext__(),
timeout=STREAM_IDLE_TIMEOUT,
)
except StopAsyncIteration:
break
except asyncio.TimeoutError:
logging.error(
f"Stream idle for {STREAM_IDLE_TIMEOUT}s — "
"connection likely dead, raising APITimeoutError"
)
raise APITimeoutError("Stream idle timeout exceeded")
if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
await render_model_output(event.data.delta, async_task=async_task, task_id=task_id)
finally:
if stream is not None:
aclose = getattr(stream, "aclose", None)
if aclose is not None:
await aclose()

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in abb4ba3. Added finally block with aclose() to ensure the async generator is cleaned up on timeout/early exit.

await render_model_output("\n\n", async_task=async_task, task_id=task_id)
Expand Down
Loading