fix: inter-chunk stream timeout and download part guard (RFC) by planetf1 · Pull Request #1236 · generative-computing/mellea

planetf1 · 2026-06-09T10:15:46Z

Closes #650. Addresses all three concerns in #1235.

What changed

Streaming hang fix (helpers/async_helpers.py)

A stalled backend iterator previously blocked the process forever. send_to_queue had no per-chunk timeout, so its iteration loop waited indefinitely and no sentinel ever reached the queue. Every downstream consumer blocked with it.

This wraps each anext() call in asyncio.timeout(). A stalled stream now surfaces as a TimeoutError through the queue. On timeout, the underlying iterator's close method is called via duck-typing (getattr(ait, "aclose", None) or getattr(ait, "close", None)) to release HTTP connections cleanly — this covers both async def-style AsyncGenerator and class-based iterators such as OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer.

The timeout context manager is captured (async with asyncio.timeout(...) as cm) and cm.expired() is checked before rewriting the error — so a TimeoutError raised by the backend iterator itself (e.g. an httpx read timeout) is forwarded verbatim rather than being misreported as a misconfigured STREAM_TIMEOUT.

DEFAULT_CHUNK_TIMEOUT = 120.0 is a public constant exported from helpers/. All eight backend call sites import and use it — no hardcoded values. The new ModelOption.STREAM_TIMEOUT key lets callers override per-call or at the backend/session level. chunk_timeout is keyword-only in the send_to_queue signature.

The 120 s default covers time-to-first-token as well as inter-chunk gaps. Slow local inference (large models on CPU, long prompts) can legitimately exceed this before the first token. Use a higher value or None for those deployments. Note: this timeout only applies to streaming responses; non-streaming coroutines bypass the per-chunk loop entirely and are unaffected.

Retriever download guard (formatters/granite/retrievers/util.py)

The parquet download loop now stops at _MAX_PARTS + 1 attempts. The extra iteration allows a corpus of exactly _MAX_PARTS parts to receive the terminating 404 rather than raising spuriously. The largest corpus currently has 20 parts; 50 gives headroom. _MAX_PARTS is a module-level constant (patchable in tests).

Documentation (docs/)

ModelOption reference table updated: STREAM_TIMEOUT description notes it covers TTFT and that non-streaming calls are unaffected. MAX_NEW_TOKENS row flags that backend defaults vary widely (vLLM defaults to 16). Both streaming how-to pages show three examples: tight timeout, slow-model value (300 s), and None to disable.

Tests

Four new cases in test/helpers/test_async_helpers.py: timeout fires and puts TimeoutError with no trailing sentinel; None disables the timeout and clean completion still gets a sentinel; DEFAULT_CHUNK_TIMEOUT is 120.0. One new case in test/formatters/granite/test_retrievers_util.py: _MAX_PARTS exhaustion raises RuntimeError.

Follow-up

#1242 tracks the cancel hook not firing for direct avalue()/astream() callers on HF STREAM_TIMEOUT — the worker thread keeps generating until natural completion on those paths. Marked with TODO(#1242) at the two _cancel_hook arming sites. The stream_with_chunking path already calls cancel_generation() and is not affected.

🤖 Generated with Claude Code

Adds ModelOption.STREAM_TIMEOUT (default 60 s) and threads it through all backend send_to_queue call sites. A hung or stalled backend iterator now surfaces as a TimeoutError in the queue rather than blocking forever. Addresses the root cause reported in generative-computing#650. Adds a _MAX_PARTS=1000 guard to the retriever parquet download loop so a malformed or unexpectedly large dataset cannot exhaust disk. Prototype / RFC — see generative-computing#1235 for discussion. Default timeout value and whether MAX_NEW_TOKENS should also gain a default cap are open questions. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Adds a complete ModelOption reference table to configure-model-options, a streaming-timeout section explaining the default 60 s inter-chunk bound and how to override it, and a matching note on the async/streaming how-to page. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

The largest corpus in frreiss/mt-rag-embeddings currently has 20 parts. 1000 was effectively unbounded for this dataset; 50 gives reasonable headroom while actually catching runaway downloads. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Backend defaults vary widely — vLLM defaults to 16 tokens which silently truncates most responses. Added a warning callout and recommendation to always set MAX_NEW_TOKENS explicitly in production code. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

- Rename _DEFAULT_CHUNK_TIMEOUT -> DEFAULT_CHUNK_TIMEOUT (public export); all backends now import and reference it, eliminating the hardcoded 60.0 - Use aiter()/anext() builtins instead of __aiter__/__anext__ dunders - Close underlying AsyncGenerator on timeout via isinstance guard - Clarify docstring: timeout covers TTFT as well as inter-chunk gaps; no trailing sentinel after TimeoutError (exception item is the terminator) - Fix off-by-one in retriever download guard: allow _MAX_PARTS+1 loop iterations so a corpus of exactly _MAX_PARTS parts can see the 404 - Add tests: timeout fires + no trailing sentinel, None disables timeout, DEFAULT_CHUNK_TIMEOUT value exported correctly (all 14 pass) - Update docs: explicit TTFT note, add 300s slow-model example alongside the None disable example Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Backend-default STREAM_TIMEOUT was silently dropped because the call sites read from the raw per-call model_options instead of the merged model_opts dict produced by _simplify_and_merge(). This meant setting STREAM_TIMEOUT on a backend constructor or via push_model_options() had no effect. Fix: four sites (ollama, litellm, watsonx, openai standard path) now read `model_opts.get(ModelOption.STREAM_TIMEOUT, DEFAULT_CHUNK_TIMEOUT)` in line with every other option lookup in those functions. HF's three sites were already correct (they receive the merged dict under the model_options param). Also: update model_options.py docstrings to say "per-chunk" and note time-to-first-token coverage; bump test timeout from 0.05s to 0.5s to reduce flakiness risk on loaded CI runners. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

ajbozarth

A quick review from Claude to start:

Nice scoped fix — the streaming hang really did need bounding, and the API surface (single DEFAULT_CHUNK_TIMEOUT constant, single STREAM_TIMEOUT ModelOption, None to disable) is restrained. Tests pin the right invariants, especially that a timeout uses the exception itself as the stream terminator with no trailing sentinel.

Four action items inline. The (model_options or {}).get(...) consistency item and the aclose() reach-through are worth addressing in this PR; the cancel-hook miss can be a follow-up.

Replace the isinstance(ait, AsyncGenerator) guard before aclose() with duck-typing via getattr — OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer are class-based AsyncIterators, not AsyncGenerator instances, so the original branch was unreachable on the two streaming paths that matter most. The new path handles both aclose() (async generators) and close() (class-based iterators), logging any cleanup failure at DEBUG rather than swallowing it silently. Drop the `(model_options or {})` defensive wrapper at the four STREAM_TIMEOUT read sites (hf.py:3, openai.py:1). The parameter is typed dict[str, Any] and is always a merged non-None dict at those call sites — the guard was dead code and inconsistent with the other four sites in the same files. Add TODO(generative-computing#1242) at both _cancel_hook arming sites noting that direct avalue()/astream() callers do not fire the cancel event on STREAM_TIMEOUT; the stream_with_chunking path already mitigates this. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

ajbozarth

LGTM on a code read through and Claude says so too

planetf1 · 2026-06-15T09:45:23Z

@jakelorocco can you take a look - I need extra approval on this one?

planetf1 · 2026-06-19T12:53:05Z

@AngeloDanducci @psschwei — this PR has been open for 10 days and has one approval. It needs a sign-off from the mellea-intrinsics team (CODEOWNERS for mellea/formatters/granite). Would appreciate a look when you can.

AngeloDanducci · 2026-06-19T15:01:27Z

I'm not in the mellea-intrinsics approver list - do you want additional reviews other than that? I saw one approve and considering my review is non blocking I did not follow up.

frreiss · 2026-06-19T18:35:43Z

The core issue being addressed, #650, appears to involve the system operating correctly as designed.

My understanding of the scenario is:

A test case issues a combination of requests that triggers a bug in ollama.
The bug causes the ollama server to hang.
The frozen server causes the test case to freeze waiting for a response.
The test harness correctly identifies that the test has frozen and correctly fails the session.

The proposed solution for this issue involves adding a timeout to some backend operations but not to others. In the event that an inference server becomes unresponsive, the new timeout logic may or may not trigger. If the timeout logic does trigger, it will cause the entire application to fail with a stack trace involving multiple threads and code from obscure asyncio classes. This failure mode does not seem like an improvement to me.

Knowledgeable users could make their code catch TimeoutError to prevent the entire application from crashing due to a timeout. However, a hypothetical user who adds a try-except block to catch TimeoutError could just as easily add their own call to asyncio.timeout() at a location that is more application-appropriate than deep inside certain backend streaming operations.

Claude's flagging of the in mellea/formatters/granite/retrievers/util.py is a false positive. That file-downloading code is specfically related to the MTRAG benchmark data set. This code would only perform unbounded amounts of downloads if the MTRAG benchmark were to become infinitely large. That said, Claude's fix doesn't appear to do any harm.

planetf1 · 2026-06-22T11:09:01Z

Thanks for the detailed review — let me take the points in turn.

On #650 being "correct as designed": You're right that the test harness caught the hang — but the fix is aimed at production use, not tests. A user streaming from a stalled backend has nothing watching them; today they hang indefinitely with no error and no way out. It doesn't have to be an ollama bug — any unresponsive server produces the same result.

On "some backends but not others": The timeout lives in send_to_queue rather than in each backend — every streaming path goes through that function. The backend changes just wire the argument through. All five streaming backends are covered (huggingface, openai, ollama, litellm, watsonx); bedrock and dummy don't implement streaming. The one genuine gap is the cancel-hook cleanup on timeout, which is already split out as #1242.

On the asyncio stack trace: The asyncio TimeoutError is caught inside send_to_queue and discarded — a fresh one is constructed before it ever leaves the function, so there's no chained __context__ from the internal cancellation machinery. Here's the actual uncaught traceback:

Traceback (most recent call last):
  File "myapp.py", line 23, in <module>
    asyncio.run(main())
  File ".../asyncio/runners.py", line 194, in run
    return runner.run(main)
  File ".../asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File ".../asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
  File "myapp.py", line 21, in main
    raise item2
TimeoutError: Stream timed out after 60s without a chunk (covers time-to-first-token and inter-chunk gaps). Set ModelOption.STREAM_TIMEOUT to a larger value or None to disable.

The asyncio.run frames at the top appear in any unhandled async exception — they're not specific to this timeout. The raise is at the caller's own site and the message is the actionable one. No CancelledError chain, no task cancellation internals. If you've seen a different trace in practice I'd want to fix it.

On "users could add their own asyncio.timeout()": Fair point, and STREAM_TIMEOUT=None is there for anyone who wants the old behaviour. But there's a practical difference: a user-level asyncio.timeout() budgets the entire response, which would cut off legitimate long outputs. The per-chunk timeout only triggers when the backend goes completely silent — a slow model that's actively generating is unaffected however long it runs. That's hard to replicate cleanly from outside the library.

On the retriever guard: You're right — it's your repo so you know the data. The while True looked unbounded in isolation. Happy to revert it; since you say it does no harm I'll leave the call to you.

jakelorocco

I agree that a timeout is needed / nice. I think the documentation / comments in relation to this code are misleading though. All generation uses the send_to_queue function; it doesn't matter if STREAM==True.

I think the default value of 60s is likely too low. I think it should either default to None or be some very high value. A large prompt to a sluggish / overloaded inference provider can easily take more time. In the non-streaming case, the client will have to wait for the whole response back while fitting within this timeout window.

To educate our users, I would expect most examples to have some sort of try-catch block for these TimeoutErrors as well. For that reason, I think it makes more sense for end users to opt-in to a reasonable timeout.

planetf1 · 2026-06-23T09:26:14Z

The send_to_queue observation is correct — it's called for all backends regardless of STREAM. However, the timeout itself only fires inside the isinstance(aresponse, AsyncIterator) branch, so non-streaming paths (where the coroutine resolves to a plain response object) skip the per-chunk loop entirely. The STREAM_TIMEOUT name and docs are scoped accordingly, but you're right that neither explicitly says "this is a no-op for non-streaming callers" — worth making that clearer in the docstring.

The non-streaming point in your second concern doesn't apply for the same reason. The streaming concern is fair though: the same window covers TTFT as well as inter-chunk gaps, and 60 s can be tight for a large model loading locally before the first token arrives. Bumping the default to 120 s seems like a reasonable middle ground — still bounded (an unbounded default feels risky), but less likely to surprise users on slower hardware. On the TTFT vs inter-chunk distinction: they do have different profiles and a split timeout would give finer control — worth a follow-up issue if that's of interest?

On error handling in examples — agreed that showing what to do when a timeout fires would be useful. We could update the existing streaming examples to include a try-catch, though a dedicated error-handling example might cover it better. Worth a follow-up issue?

- Bump DEFAULT_CHUNK_TIMEOUT from 60 s to 120 s to give more headroom for TTFT on slow local inference - Capture asyncio.timeout() context manager and check cm.expired() before rewriting TimeoutError — backend-raised TimeoutErrors (e.g. httpx read timeout) now forward verbatim instead of being misreported as a misconfigured STREAM_TIMEOUT - Make chunk_timeout keyword-only in send_to_queue signature - Document that chunk_timeout=0 aborts immediately; None is the disable path - Hoist _MAX_PARTS to module level so tests can patch it; add test case covering the exhaustion RuntimeError path - Tighten test timeout from 0.5 s to 0.05 s for faster suite Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Backend-sourced TimeoutError is re-raised verbatim in the per-chunk guard path; the CI quality gate requires a matching Raises: entry for every raise in a public function. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

github-actions Bot added the bug Something isn't working label Jun 9, 2026

planetf1 marked this pull request as ready for review June 9, 2026 10:31

planetf1 requested review from a team as code owners June 9, 2026 10:31

planetf1 requested review from AngeloDanducci, ajbozarth and psschwei June 9, 2026 10:31

planetf1 added 3 commits June 9, 2026 11:47

planetf1 force-pushed the fix/stream-liveness-bounds branch from 0a1e230 to 036e813 Compare June 9, 2026 10:56

ajbozarth reviewed Jun 9, 2026

View reviewed changes

Comment thread mellea/backends/huggingface.py Outdated

Comment thread mellea/helpers/async_helpers.py Outdated

Comment thread mellea/backends/huggingface.py

Comment thread docs/docs/how-to/configure-model-options.md

planetf1 mentioned this pull request Jun 9, 2026

fix(hf): fire _cancel_event on STREAM_TIMEOUT for direct avalue/astream paths #1242

Open

planetf1 requested a review from ajbozarth June 9, 2026 18:07

ajbozarth approved these changes Jun 9, 2026

View reviewed changes

planetf1 enabled auto-merge June 10, 2026 10:02

planetf1 mentioned this pull request Jun 12, 2026

Security hardening: webhook validation, adapter path traversal, API key hygiene, and input validation #1246

Open

jakelorocco reviewed Jun 22, 2026

View reviewed changes

planetf1 added 2 commits June 23, 2026 10:48

jakelorocco approved these changes Jun 23, 2026

View reviewed changes

planetf1 added this pull request to the merge queue Jun 23, 2026

Merged via the queue into generative-computing:main with commit cc06476 Jun 23, 2026
9 checks passed

planetf1 deleted the fix/stream-liveness-bounds branch June 23, 2026 15:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: inter-chunk stream timeout and download part guard (RFC)#1236

fix: inter-chunk stream timeout and download part guard (RFC)#1236
planetf1 merged 9 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds

planetf1 commented Jun 9, 2026 •

edited

Loading

Uh oh!

ajbozarth left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajbozarth left a comment

Uh oh!

planetf1 commented Jun 15, 2026

Uh oh!

planetf1 commented Jun 19, 2026

Uh oh!

AngeloDanducci commented Jun 19, 2026

Uh oh!

frreiss commented Jun 19, 2026

Uh oh!

planetf1 commented Jun 22, 2026 •

edited

Loading

Uh oh!

jakelorocco left a comment

Uh oh!

planetf1 commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

planetf1 commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Follow-up

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

planetf1 commented Jun 15, 2026

Uh oh!

planetf1 commented Jun 19, 2026

Uh oh!

AngeloDanducci commented Jun 19, 2026

Uh oh!

frreiss commented Jun 19, 2026

Uh oh!

planetf1 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

planetf1 commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

planetf1 commented Jun 9, 2026 •

edited

Loading

planetf1 commented Jun 22, 2026 •

edited

Loading