Skip to content

fix(tracing+ollama): split streaming span test into fast mock + slow e2e; add 300s default timeout#1302

Merged
ajbozarth merged 5 commits into
generative-computing:mainfrom
planetf1:worktree-issue-1272
Jun 19, 2026
Merged

fix(tracing+ollama): split streaming span test into fast mock + slow e2e; add 300s default timeout#1302
ajbozarth merged 5 commits into
generative-computing:mainfrom
planetf1:worktree-issue-1272

Conversation

@planetf1

@planetf1 planetf1 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

The test was hanging because two things combined badly: `OllamaModelBackend` had no httpx timeout at all, so a stalled stream would park the event loop forever; and the test wasn't marked `slow`, so it ran on every push on a CPU-only runner where after ~30 min the model gets evicted and takes minutes to reload — longer than any safe scalar timeout.

The main fix is moving the test out of standard CI. A new `integration`-marked test does the same structural check against a mocked backend (no Ollama needed, ~200 ms). The original real-model test stays but picks up `@pytest.mark.slow`, so `pyproject.toml`'s `addopts = "-m not slow"` keeps it off standard CI pushes. The 300 s default timeout is a secondary safety net for when the slow test does run.

Test Markers What it verifies
`test_streaming_span_creates_and_closes_span` (new) `integration` TracingPlugin opens a `chat` span, keeps it open for the full stream, and closes it when done — mocked `AsyncClient.chat` returning a fake async-generator with artificial per-chunk delays
`test_streaming_span_duration` (existing) `e2e + ollama + slow` Same invariant against a real Ollama model — belongs on the nightly GPU runner

Why not `httpx.Timeout(connect=30, read=600, ...)`? A structured timeout would let us tune connection setup and read time independently. The trade-off is an explicit `import httpx` in `ollama.py` and a wider type annotation. The scalar is fine for now; a follow-up can add the structured form if we need finer control.

Breaking change: default timeout

`OllamaModelBackend` now defaults to a 300 s timeout (previously unlimited). Users running large models on CPU, very long contexts, or multi-minute generations may hit `httpx.ReadTimeout` where they previously saw the call complete eventually. Pass `timeout=None` to restore the previous unlimited behaviour:

```python
backend = OllamaModelBackend("granite3.1-dense:8b", timeout=None)
```

Tested

Mocked integration test — no Ollama needed, runs in standard CI:

```
uv run pytest test/telemetry/test_streaming_tracing_integration.py -v
```

Passed locally in ~4.75 s. CI: ✅ passes on 3.11, 3.12, 3.13.

Slow e2e test — real Ollama with `granite3.1-dense:8b`, excluded from standard CI via `-m not slow`:

```
uv run pytest test/telemetry/test_tracing_backend.py::test_streaming_span_duration -v -m slow
```

Passed locally in ~6 s against a local Ollama instance. This test is excluded from standard CI pushes and will run in overnight/nightly runs where a GPU-backed Ollama server is available.

Closes #1272

@github-actions github-actions Bot added the bug Something isn't working label Jun 19, 2026
@planetf1 planetf1 force-pushed the worktree-issue-1272 branch 2 times, most recently from 7c17c35 to 91fe247 Compare June 19, 2026 09:31
test_streaming_span_duration was hanging for the full 900 s pytest-timeout
budget on CPU-only CI runners (generative-computing#1272). The root cause: OllamaModelBackend
defaulted timeout=None, so the httpx client had no read deadline. When Ollama
stalled mid-stream on a loaded runner, the event loop parked on
selector.poll(-1) with nothing to wake it — no timers, no ready callbacks —
until SIGALRM arrived 15 minutes later.

Primary fix: default timeout changed from None to 300 s. A stalled stream
now raises httpx.ReadTimeout promptly (the existing error-forwarding path in
send_to_queue catches it and puts it on the async queue, where astream() picks
it up and raises). Callers can still pass timeout=None to opt out.

300 s is the right balance: generous enough for slow-but-healthy CPU-only
runners (non-streaming generation and cold model loads can take 60–120 s),
while still bounding genuine stalls to 1/3 of the original 900 s hang budget.
60 s was too tight — test_span_not_closed_prematurely ("Count to 5") and
test_multiple_generations_separate_spans were hitting ReadTimeout on Python
3.12/3.13 CI runners after model reload.

Closes generative-computing#1272

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
@planetf1 planetf1 force-pushed the worktree-issue-1272 branch from 91fe247 to 94002c9 Compare June 19, 2026 10:02
@planetf1 planetf1 changed the title fix(ollama): add 120 s default read timeout to prevent mid-stream hangs fix(ollama): add 300 s default HTTP timeout to prevent mid-stream hangs Jun 19, 2026
Locks the documented escape hatch: OllamaModelBackend(timeout=None) must
omit the key from client_kwargs so callers who want unbounded waits keep
the upstream SDK default (no timeout).

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
@planetf1

Copy link
Copy Markdown
Contributor Author

Closing while we work out the right approach — see the updated diagnosis in #1272.

@planetf1 planetf1 closed this Jun 19, 2026
…; mark real e2e as slow

test_streaming_span_duration was marked e2e+ollama but NOT slow, so it ran
on every standard CI push against a CPU-only runner.  After ~30 min of test
suite the granite4.1:3b model gets evicted by OLLAMA_KEEP_ALIVE=5m; reloading
it on a saturated runner routinely exceeds any reasonable scalar timeout,
making the test either hang (timeout=None) or flake (timeout=300s).

The root fix is to split the coverage in two:

1. test_streaming_span_creates_and_closes_span (new, integration, no Ollama
   marker) — patches ollama.AsyncClient.chat to return a fake async-generator
   with artificial per-chunk delays, then asserts that the TracingPlugin opens
   a "chat" span, keeps it open for the full stream, and closes it only once
   all chunks are consumed.  Runs on every CI push, takes ~200 ms.

2. test_streaming_span_duration (existing) — kept to exercise a real Ollama
   model end-to-end.  Adding @pytest.mark.slow excludes it from the default
   pyproject.toml addopts ("-m not slow") so it no longer runs on standard
   CPU-only CI pushes.  It belongs on the nightly GPU runner.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 reopened this Jun 19, 2026
@planetf1 planetf1 changed the title fix(ollama): add 300 s default HTTP timeout to prevent mid-stream hangs fix(tracing+ollama): split streaming span test into fast mock + slow e2e; add 300s default timeout Jun 19, 2026
With the new 300 s default, a stalled model pull now triggers
httpx.ReadTimeout instead of waiting forever.  _pull_ollama_model was
catching only ollama.ResponseError, so the httpx exception propagated
uncaught from __init__, bypassing the friendly error path.

Broaden the except to also handle httpx.TimeoutException and
httpx.ConnectError so the caller still gets a clean False return and
the "could not be pulled" message fires as expected.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review June 19, 2026 12:59
@planetf1 planetf1 requested a review from a team as a code owner June 19, 2026 12:59
@planetf1 planetf1 enabled auto-merge June 19, 2026 12:59

@akihikokuroda akihikokuroda left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@planetf1 planetf1 added this pull request to the merge queue Jun 19, 2026
@planetf1 planetf1 removed this pull request from the merge queue due to a manual request Jun 19, 2026
@planetf1

Copy link
Copy Markdown
Contributor Author

thanks @akihikokuroda . I'll wait for @ajbozarth to give it a once-over given interest in otel -- Alex feel free to merge if you are happy with it

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be cleaning up streaming spans in #1051 as soon as #1289 is merged, but I have one ask for a file rename in addition to Claudes findings below.

Comment thread test/telemetry/test_streaming_tracing_integration.py

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strategy makes sense — the mocked integration test gives us the structural check in CI without the CPU-only flakiness, and gating the real test behind slow is the right call. A few things to consider inline.

Comment thread mellea/backends/ollama.py
Comment thread mellea/backends/ollama.py
Comment thread test/telemetry/test_streaming_tracing_integration.py Outdated
…tion test

- Widen _check_ollama_server except clause to catch httpx.TimeoutException
  and httpx.ConnectError in addition to ConnectionError; with the new 300s
  default a stalled ps() call raised httpx.TimeoutException, bypassing the
  friendly "ollama server not running" message
- Tighten streaming span-duration assertion from >= 0.05 s to >= 0.1 s;
  the fake stream is 3 chunks x 50 ms ~= 150 ms, so the looser threshold
  passed even if the span closed after the first chunk

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1

Copy link
Copy Markdown
Contributor Author

Thanks @ajbozarth — addressed in the latest commit:

  • Widened _check_ollama_server to also catch httpx.TimeoutException and httpx.ConnectError (matching the pattern already used in _pull_ollama_model)
  • Tightened the span-duration assertion to >= 0.1 s with a comment tying the threshold to the 3 × 50 ms fake stream
  • Updated the PR body to call out the default timeout change and the timeout=None escape hatch for the auto-generated release notes

Happy to leave the file rename to your cleanup in #1051.

@ajbozarth ajbozarth added this pull request to the merge queue Jun 19, 2026
Merged via the queue into generative-computing:main with commit 31a3479 Jun 19, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: test_streaming_span_duration hangs >900s on slow CPU-only Ollama runner

3 participants