Skip to content

fix: inter-chunk stream timeout and download part guard (RFC)#1236

Merged
planetf1 merged 9 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds
Jun 23, 2026
Merged

fix: inter-chunk stream timeout and download part guard (RFC)#1236
planetf1 merged 9 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds

Conversation

@planetf1

@planetf1 planetf1 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Closes #650. Addresses all three concerns in #1235.

Fixes #650
Fixes #1235

What changed

Streaming hang fix (helpers/async_helpers.py)

A stalled backend iterator previously blocked the process forever. send_to_queue had no per-chunk timeout, so its iteration loop waited indefinitely and no sentinel ever reached the queue. Every downstream consumer blocked with it.

This wraps each anext() call in asyncio.timeout(). A stalled stream now surfaces as a TimeoutError through the queue. On timeout, the underlying iterator's close method is called via duck-typing (getattr(ait, "aclose", None) or getattr(ait, "close", None)) to release HTTP connections cleanly — this covers both async def-style AsyncGenerator and class-based iterators such as OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer.

The timeout context manager is captured (async with asyncio.timeout(...) as cm) and cm.expired() is checked before rewriting the error — so a TimeoutError raised by the backend iterator itself (e.g. an httpx read timeout) is forwarded verbatim rather than being misreported as a misconfigured STREAM_TIMEOUT.

DEFAULT_CHUNK_TIMEOUT = 120.0 is a public constant exported from helpers/. All eight backend call sites import and use it — no hardcoded values. The new ModelOption.STREAM_TIMEOUT key lets callers override per-call or at the backend/session level. chunk_timeout is keyword-only in the send_to_queue signature.

The 120 s default covers time-to-first-token as well as inter-chunk gaps. Slow local inference (large models on CPU, long prompts) can legitimately exceed this before the first token. Use a higher value or None for those deployments. Note: this timeout only applies to streaming responses; non-streaming coroutines bypass the per-chunk loop entirely and are unaffected.

Retriever download guard (formatters/granite/retrievers/util.py)

The parquet download loop now stops at _MAX_PARTS + 1 attempts. The extra iteration allows a corpus of exactly _MAX_PARTS parts to receive the terminating 404 rather than raising spuriously. The largest corpus currently has 20 parts; 50 gives headroom. _MAX_PARTS is a module-level constant (patchable in tests).

Documentation (docs/)

ModelOption reference table updated: STREAM_TIMEOUT description notes it covers TTFT and that non-streaming calls are unaffected. MAX_NEW_TOKENS row flags that backend defaults vary widely (vLLM defaults to 16). Both streaming how-to pages show three examples: tight timeout, slow-model value (300 s), and None to disable.

Tests

Four new cases in test/helpers/test_async_helpers.py: timeout fires and puts TimeoutError with no trailing sentinel; None disables the timeout and clean completion still gets a sentinel; DEFAULT_CHUNK_TIMEOUT is 120.0. One new case in test/formatters/granite/test_retrievers_util.py: _MAX_PARTS exhaustion raises RuntimeError.

Follow-up

#1242 tracks the cancel hook not firing for direct avalue()/astream() callers on HF STREAM_TIMEOUT — the worker thread keeps generating until natural completion on those paths. Marked with TODO(#1242) at the two _cancel_hook arming sites. The stream_with_chunking path already calls cancel_generation() and is not affected.

🤖 Generated with Claude Code

Adds ModelOption.STREAM_TIMEOUT (default 60 s) and threads it through
all backend send_to_queue call sites. A hung or stalled backend iterator
now surfaces as a TimeoutError in the queue rather than blocking forever.
Addresses the root cause reported in generative-computing#650.

Adds a _MAX_PARTS=1000 guard to the retriever parquet download loop so
a malformed or unexpectedly large dataset cannot exhaust disk.

Prototype / RFC — see generative-computing#1235 for discussion. Default timeout value and
whether MAX_NEW_TOKENS should also gain a default cap are open questions.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@github-actions github-actions Bot added the bug Something isn't working label Jun 9, 2026
Adds a complete ModelOption reference table to configure-model-options,
a streaming-timeout section explaining the default 60 s inter-chunk
bound and how to override it, and a matching note on the async/streaming
how-to page.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review June 9, 2026 10:31
@planetf1 planetf1 requested review from a team as code owners June 9, 2026 10:31
planetf1 added 3 commits June 9, 2026 11:47
The largest corpus in frreiss/mt-rag-embeddings currently has 20 parts.
1000 was effectively unbounded for this dataset; 50 gives reasonable
headroom while actually catching runaway downloads.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Backend defaults vary widely — vLLM defaults to 16 tokens which silently
truncates most responses. Added a warning callout and recommendation to
always set MAX_NEW_TOKENS explicitly in production code.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Rename _DEFAULT_CHUNK_TIMEOUT -> DEFAULT_CHUNK_TIMEOUT (public export);
  all backends now import and reference it, eliminating the hardcoded 60.0
- Use aiter()/anext() builtins instead of __aiter__/__anext__ dunders
- Close underlying AsyncGenerator on timeout via isinstance guard
- Clarify docstring: timeout covers TTFT as well as inter-chunk gaps; no
  trailing sentinel after TimeoutError (exception item is the terminator)
- Fix off-by-one in retriever download guard: allow _MAX_PARTS+1 loop
  iterations so a corpus of exactly _MAX_PARTS parts can see the 404
- Add tests: timeout fires + no trailing sentinel, None disables timeout,
  DEFAULT_CHUNK_TIMEOUT value exported correctly (all 14 pass)
- Update docs: explicit TTFT note, add 300s slow-model example alongside
  the None disable example

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 force-pushed the fix/stream-liveness-bounds branch from 0a1e230 to 036e813 Compare June 9, 2026 10:56
Backend-default STREAM_TIMEOUT was silently dropped because the call sites
read from the raw per-call model_options instead of the merged model_opts
dict produced by _simplify_and_merge(). This meant setting STREAM_TIMEOUT
on a backend constructor or via push_model_options() had no effect.

Fix: four sites (ollama, litellm, watsonx, openai standard path) now read
`model_opts.get(ModelOption.STREAM_TIMEOUT, DEFAULT_CHUNK_TIMEOUT)` in line
with every other option lookup in those functions. HF's three sites were
already correct (they receive the merged dict under the model_options param).

Also: update model_options.py docstrings to say "per-chunk" and note
time-to-first-token coverage; bump test timeout from 0.05s to 0.5s to
reduce flakiness risk on loaded CI runners.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick review from Claude to start:

Nice scoped fix — the streaming hang really did need bounding, and the API surface (single DEFAULT_CHUNK_TIMEOUT constant, single STREAM_TIMEOUT ModelOption, None to disable) is restrained. Tests pin the right invariants, especially that a timeout uses the exception itself as the stream terminator with no trailing sentinel.

Four action items inline. The (model_options or {}).get(...) consistency item and the aclose() reach-through are worth addressing in this PR; the cancel-hook miss can be a follow-up.

Comment thread mellea/backends/huggingface.py Outdated
Comment thread mellea/helpers/async_helpers.py Outdated
Comment thread mellea/backends/huggingface.py
Comment thread docs/docs/how-to/configure-model-options.md
Replace the isinstance(ait, AsyncGenerator) guard before aclose() with
duck-typing via getattr — OpenAI's AsyncStream and HF's
AsyncTextIteratorStreamer are class-based AsyncIterators, not
AsyncGenerator instances, so the original branch was unreachable on the
two streaming paths that matter most. The new path handles both aclose()
(async generators) and close() (class-based iterators), logging any
cleanup failure at DEBUG rather than swallowing it silently.

Drop the `(model_options or {})` defensive wrapper at the four
STREAM_TIMEOUT read sites (hf.py:3, openai.py:1). The parameter is
typed dict[str, Any] and is always a merged non-None dict at those call
sites — the guard was dead code and inconsistent with the other four
sites in the same files.

Add TODO(generative-computing#1242) at both _cancel_hook arming sites noting that direct
avalue()/astream() callers do not fire the cancel event on
STREAM_TIMEOUT; the stream_with_chunking path already mitigates this.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 requested a review from ajbozarth June 9, 2026 18:07

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on a code read through and Claude says so too

@planetf1

Copy link
Copy Markdown
Contributor Author

@jakelorocco can you take a look - I need extra approval on this one?

@planetf1

Copy link
Copy Markdown
Contributor Author

@AngeloDanducci @psschwei — this PR has been open for 10 days and has one approval. It needs a sign-off from the mellea-intrinsics team (CODEOWNERS for mellea/formatters/granite). Would appreciate a look when you can.

@AngeloDanducci

Copy link
Copy Markdown
Contributor

I'm not in the mellea-intrinsics approver list - do you want additional reviews other than that? I saw one approve and considering my review is non blocking I did not follow up.

@frreiss

frreiss commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

The core issue being addressed, #650, appears to involve the system operating correctly as designed.

My understanding of the scenario is:

  • A test case issues a combination of requests that triggers a bug in ollama.
  • The bug causes the ollama server to hang.
  • The frozen server causes the test case to freeze waiting for a response.
  • The test harness correctly identifies that the test has frozen and correctly fails the session.

The proposed solution for this issue involves adding a timeout to some backend operations but not to others. In the event that an inference server becomes unresponsive, the new timeout logic may or may not trigger. If the timeout logic does trigger, it will cause the entire application to fail with a stack trace involving multiple threads and code from obscure asyncio classes. This failure mode does not seem like an improvement to me.

Knowledgeable users could make their code catch TimeoutError to prevent the entire application from crashing due to a timeout. However, a hypothetical user who adds a try-except block to catch TimeoutError could just as easily add their own call to asyncio.timeout() at a location that is more application-appropriate than deep inside certain backend streaming operations.

Claude's flagging of the in mellea/formatters/granite/retrievers/util.py is a false positive. That file-downloading code is specfically related to the MTRAG benchmark data set. This code would only perform unbounded amounts of downloads if the MTRAG benchmark were to become infinitely large. That said, Claude's fix doesn't appear to do any harm.

@planetf1

planetf1 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review — let me take the points in turn.

On #650 being "correct as designed": You're right that the test harness caught the hang — but the fix is aimed at production use, not tests. A user streaming from a stalled backend has nothing watching them; today they hang indefinitely with no error and no way out. It doesn't have to be an ollama bug — any unresponsive server produces the same result.

On "some backends but not others": The timeout lives in send_to_queue rather than in each backend — every streaming path goes through that function. The backend changes just wire the argument through. All five streaming backends are covered (huggingface, openai, ollama, litellm, watsonx); bedrock and dummy don't implement streaming. The one genuine gap is the cancel-hook cleanup on timeout, which is already split out as #1242.

On the asyncio stack trace: The asyncio TimeoutError is caught inside send_to_queue and discarded — a fresh one is constructed before it ever leaves the function, so there's no chained __context__ from the internal cancellation machinery. Here's the actual uncaught traceback:

Traceback (most recent call last):
  File "myapp.py", line 23, in <module>
    asyncio.run(main())
  File ".../asyncio/runners.py", line 194, in run
    return runner.run(main)
  File ".../asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File ".../asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
  File "myapp.py", line 21, in main
    raise item2
TimeoutError: Stream timed out after 60s without a chunk (covers time-to-first-token and inter-chunk gaps). Set ModelOption.STREAM_TIMEOUT to a larger value or None to disable.

The asyncio.run frames at the top appear in any unhandled async exception — they're not specific to this timeout. The raise is at the caller's own site and the message is the actionable one. No CancelledError chain, no task cancellation internals. If you've seen a different trace in practice I'd want to fix it.

On "users could add their own asyncio.timeout()": Fair point, and STREAM_TIMEOUT=None is there for anyone who wants the old behaviour. But there's a practical difference: a user-level asyncio.timeout() budgets the entire response, which would cut off legitimate long outputs. The per-chunk timeout only triggers when the backend goes completely silent — a slow model that's actively generating is unaffected however long it runs. That's hard to replicate cleanly from outside the library.

On the retriever guard: You're right — it's your repo so you know the data. The while True looked unbounded in isolation. Happy to revert it; since you say it does no harm I'll leave the call to you.

@jakelorocco jakelorocco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a timeout is needed / nice. I think the documentation / comments in relation to this code are misleading though. All generation uses the send_to_queue function; it doesn't matter if STREAM==True.

I think the default value of 60s is likely too low. I think it should either default to None or be some very high value. A large prompt to a sluggish / overloaded inference provider can easily take more time. In the non-streaming case, the client will have to wait for the whole response back while fitting within this timeout window.

To educate our users, I would expect most examples to have some sort of try-catch block for these TimeoutErrors as well. For that reason, I think it makes more sense for end users to opt-in to a reasonable timeout.

@planetf1

Copy link
Copy Markdown
Contributor Author

The send_to_queue observation is correct — it's called for all backends regardless of STREAM. However, the timeout itself only fires inside the isinstance(aresponse, AsyncIterator) branch, so non-streaming paths (where the coroutine resolves to a plain response object) skip the per-chunk loop entirely. The STREAM_TIMEOUT name and docs are scoped accordingly, but you're right that neither explicitly says "this is a no-op for non-streaming callers" — worth making that clearer in the docstring.

The non-streaming point in your second concern doesn't apply for the same reason. The streaming concern is fair though: the same window covers TTFT as well as inter-chunk gaps, and 60 s can be tight for a large model loading locally before the first token arrives. Bumping the default to 120 s seems like a reasonable middle ground — still bounded (an unbounded default feels risky), but less likely to surprise users on slower hardware. On the TTFT vs inter-chunk distinction: they do have different profiles and a split timeout would give finer control — worth a follow-up issue if that's of interest?

On error handling in examples — agreed that showing what to do when a timeout fires would be useful. We could update the existing streaming examples to include a try-catch, though a dedicated error-handling example might cover it better. Worth a follow-up issue?

planetf1 added 2 commits June 23, 2026 10:48
- Bump DEFAULT_CHUNK_TIMEOUT from 60 s to 120 s to give more headroom
  for TTFT on slow local inference
- Capture asyncio.timeout() context manager and check cm.expired()
  before rewriting TimeoutError — backend-raised TimeoutErrors (e.g.
  httpx read timeout) now forward verbatim instead of being
  misreported as a misconfigured STREAM_TIMEOUT
- Make chunk_timeout keyword-only in send_to_queue signature
- Document that chunk_timeout=0 aborts immediately; None is the
  disable path
- Hoist _MAX_PARTS to module level so tests can patch it; add test
  case covering the exhaustion RuntimeError path
- Tighten test timeout from 0.5 s to 0.05 s for faster suite

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Backend-sourced TimeoutError is re-raised verbatim in the
per-chunk guard path; the CI quality gate requires a matching
Raises: entry for every raise in a public function.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 added this pull request to the merge queue Jun 23, 2026
Merged via the queue into generative-computing:main with commit cc06476 Jun 23, 2026
9 checks passed
@planetf1 planetf1 deleted the fix/stream-liveness-bounds branch June 23, 2026 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reliability: missing resource bounds on streaming (liveness hang, download guard, token cap) bug: generate_from_raw hanging when ollama under load

5 participants