fix(tracing+ollama): split streaming span test into fast mock + slow e2e; add 300s default timeout#1302
Merged
ajbozarth merged 5 commits into generative-computing:main on Jun 19, 2026
test_streaming_span_duration was hanging for the full 900 s pytest-timeout budget on CPU-only CI runners (generative-computing#1272).

The root cause: OllamaModelBackend defaulted timeout=None, so the httpx client had no read deadline. When Ollama stalled mid-stream on a loaded runner, the event loop parked on selector.poll(-1) with nothing to wake it (no timers, no ready callbacks) until SIGALRM arrived 15 minutes later.

Primary fix: default timeout changed from None to 300 s. A stalled stream now raises httpx.ReadTimeout promptly (the existing error-forwarding path in send_to_queue catches it and puts it on the async queue, where astream() picks it up and raises). Callers can still pass timeout=None to opt out.

300 s is the right balance: generous enough for slow-but-healthy CPU-only runners (non-streaming generation and cold model loads can take 60–120 s), while still bounding genuine stalls to 1/3 of the original 900 s hang budget. 60 s was too tight: test_span_not_closed_prematurely ("Count to 5") and test_multiple_generations_separate_spans were hitting ReadTimeout on Python 3.12/3.13 CI runners after model reload.

Closes generative-computing#1272
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
Locks the documented escape hatch: OllamaModelBackend(timeout=None) must omit the key from client_kwargs so callers who want unbounded waits keep the upstream SDK default (no timeout).

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
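The escape hatch this commit locks in can be sketched roughly as below. This is an illustration under assumptions, not the backend's actual code: `build_client_kwargs`, `DEFAULT_TIMEOUT_S`, and the `host` parameter are hypothetical names; only `timeout` and `client_kwargs` come from the commit message.

```python
from typing import Any, Dict, Optional

DEFAULT_TIMEOUT_S = 300.0  # the default introduced by this PR


def build_client_kwargs(
    host: Optional[str] = None,
    timeout: Optional[float] = DEFAULT_TIMEOUT_S,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for the Ollama client."""
    client_kwargs: Dict[str, Any] = {}
    if host is not None:
        client_kwargs["host"] = host
    # When the caller opts out with timeout=None, leave the key out entirely
    # so the client keeps its own default instead of receiving an explicit value.
    if timeout is not None:
        client_kwargs["timeout"] = timeout
    return client_kwargs
```

A test for the opt-out then only needs to assert that `"timeout"` is absent from the resulting dictionary when `timeout=None` is passed.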
Closing while we work out the right approach; see the updated diagnosis in #1272.
…; mark real e2e as slow
test_streaming_span_duration was marked e2e+ollama but NOT slow, so it ran
on every standard CI push against a CPU-only runner. After ~30 min of test
suite the granite4.1:3b model gets evicted by OLLAMA_KEEP_ALIVE=5m; reloading
it on a saturated runner routinely exceeds any reasonable scalar timeout,
making the test either hang (timeout=None) or flake (timeout=300s).
The root fix is to split the coverage in two:
1. test_streaming_span_creates_and_closes_span (new, integration, no Ollama
marker) — patches ollama.AsyncClient.chat to return a fake async-generator
with artificial per-chunk delays, then asserts that the TracingPlugin opens
a "chat" span, keeps it open for the full stream, and closes it only once
all chunks are consumed. Runs on every CI push, takes ~200 ms.
2. test_streaming_span_duration (existing) — kept to exercise a real Ollama
model end-to-end. Adding @pytest.mark.slow excludes it from the default
pyproject.toml addopts ("-m not slow") so it no longer runs on standard
CPU-only CI pushes. It belongs on the nightly GPU runner.
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
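The shape of the mocked test described in item 1 can be sketched without Ollama or a tracing library. Everything here is a stand-in for illustration: `FakeSpan`, `fake_chat_stream`, `traced_stream`, and `run_demo` are hypothetical names, and the real test patches `ollama.AsyncClient.chat` and inspects spans produced by the TracingPlugin rather than this toy wrapper.

```python
import asyncio
import time


class FakeSpan:
    """Stand-in for a tracing span; records open and close times."""

    def __init__(self, name):
        self.name = name
        self.start = time.monotonic()
        self.end = None  # None means the span is still open

    def close(self):
        self.end = time.monotonic()

    @property
    def duration(self):
        return self.end - self.start


async def fake_chat_stream(chunks, delay_s):
    """Fake async generator: yields each chunk after an artificial delay."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)
        yield chunk


async def traced_stream(stream, spans):
    """Open a "chat" span, re-yield the stream, close the span on exhaustion."""
    span = FakeSpan("chat")
    spans.append(span)
    try:
        async for chunk in stream:
            yield chunk
    finally:
        span.close()


async def run_demo():
    spans, received = [], []
    stream = fake_chat_stream(["a", "b", "c"], delay_s=0.05)
    async for chunk in traced_stream(stream, spans):
        # The span must still be open while chunks are arriving.
        assert spans[0].end is None
        received.append(chunk)
    return spans[0], received
```

Running `asyncio.run(run_demo())` returns the closed span and the three chunks; the span's duration covers all three artificial delays because it is closed only after the inner stream is exhausted.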
With the new 300 s default, a stalled model pull now triggers httpx.ReadTimeout instead of waiting forever. _pull_ollama_model was catching only ollama.ResponseError, so the httpx exception propagated uncaught from __init__, bypassing the friendly error path. Broaden the except to also handle httpx.TimeoutException and httpx.ConnectError so the caller still gets a clean False return and the "could not be pulled" message fires as expected.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
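The broadened except clause follows a common pattern, sketched here with local stand-in exception classes so the snippet has no third-party dependencies. The three classes only mimic `ollama.ResponseError`, `httpx.TimeoutException`, and `httpx.ConnectError`, and `pull_model` with its `pull` callable is a hypothetical simplification, not the real `_pull_ollama_model`.

```python
class ResponseError(Exception):
    """Stand-in for ollama.ResponseError."""


class TimeoutException(Exception):
    """Stand-in for httpx.TimeoutException."""


class ConnectError(Exception):
    """Stand-in for httpx.ConnectError."""


def pull_model(pull, model_name):
    """Return True if the pull succeeded, False on any expected failure.

    `pull` is any callable that performs the pull and may raise.
    """
    try:
        pull(model_name)
    except (ResponseError, TimeoutException, ConnectError):
        # All three failure modes collapse into the same clean False return,
        # so the caller can emit one friendly "could not be pulled" message.
        return False
    return True
```

Unexpected exception types still propagate, which keeps genuine bugs visible instead of hiding them behind the friendly message.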
Thanks @akihikokuroda. I'll wait for @ajbozarth to give it a once-over given his interest in otel. Alex, feel free to merge if you are happy with it.
ajbozarth reviewed on Jun 19, 2026
Strategy makes sense — the mocked integration test gives us the structural check in CI without the CPU-only flakiness, and gating the real test behind slow is the right call. A few things to consider inline.
…tion test

- Widen _check_ollama_server except clause to catch httpx.TimeoutException and httpx.ConnectError in addition to ConnectionError; with the new 300s default a stalled ps() call raised httpx.TimeoutException, bypassing the friendly "ollama server not running" message
- Tighten streaming span-duration assertion from >= 0.05 s to >= 0.1 s; the fake stream is 3 chunks x 50 ms ~= 150 ms, so the looser threshold passed even if the span closed after the first chunk

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
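The threshold change can be checked with simple arithmetic. The sketch below uses idealised durations (real measurements would be slightly larger because of scheduling overhead); `passes` and the constant names are hypothetical helpers for illustration only.

```python
CHUNKS = 3
DELAY_S = 0.05  # artificial per-chunk delay in the fake stream

full_stream_s = CHUNKS * DELAY_S  # span closed after the last chunk (~0.15 s)
premature_s = 1 * DELAY_S         # span wrongly closed after the first chunk (~0.05 s)


def passes(duration_s, threshold_s):
    """Mirror the test's `duration >= threshold` assertion."""
    return duration_s >= threshold_s


# Old threshold: a prematurely closed span still satisfies the assertion.
old_threshold_catches_bug = not passes(premature_s, 0.05)
# New threshold: the premature close fails while the full stream passes.
new_threshold_catches_bug = not passes(premature_s, 0.1)
```

With these numbers, `old_threshold_catches_bug` is `False` and `new_threshold_catches_bug` is `True`, while a correctly closed span still clears the 0.1 s bar with about 50 ms of headroom.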
Thanks @ajbozarth, addressed in the latest commit.
Happy to leave the file rename to your cleanup in #1051.
Merged via the queue into generative-computing:main with commit 31a3479 on Jun 19, 2026. 11 checks passed.
The test was hanging because two things combined badly: `OllamaModelBackend` had no httpx timeout at all, so a stalled stream would park the event loop forever; and the test wasn't marked `slow`, so it ran on every push on a CPU-only runner where after ~30 min the model gets evicted and takes minutes to reload — longer than any safe scalar timeout.
The main fix is moving the test out of standard CI. A new `integration`-marked test does the same structural check against a mocked backend (no Ollama needed, ~200 ms). The original real-model test stays but picks up `@pytest.mark.slow`, so `pyproject.toml`'s `addopts = "-m not slow"` keeps it off standard CI pushes. The 300 s default timeout is a secondary safety net for when the slow test does run.
Why not `httpx.Timeout(connect=30, read=600, ...)`? A structured timeout would let us tune connection setup and read time independently. The trade-off is an explicit `import httpx` in `ollama.py` and a wider type annotation. The scalar is fine for now; a follow-up can add the structured form if we need finer control.
Breaking change: default timeout
`OllamaModelBackend` now defaults to a 300 s timeout (previously unlimited). Users running large models on CPU, very long contexts, or multi-minute generations may hit `httpx.ReadTimeout` where they previously saw the call complete eventually. Pass `timeout=None` to restore the previous unlimited behaviour:
```python
backend = OllamaModelBackend("granite3.1-dense:8b", timeout=None)
```
Tested
Mocked integration test — no Ollama needed, runs in standard CI:
```
uv run pytest test/telemetry/test_streaming_tracing_integration.py -v
```
Passed locally in ~4.75 s. CI: ✅ passes on 3.11, 3.12, 3.13.
Slow e2e test — real Ollama with `granite3.1-dense:8b`, excluded from standard CI via `-m not slow`:
```
uv run pytest test/telemetry/test_tracing_backend.py::test_streaming_span_duration -v -m slow
```
Passed locally in ~6 s against a local Ollama instance. In CI it will run only in nightly runs, where a GPU-backed Ollama server is available.
Closes #1272