Rewrite kaggle kernels logs --follow to use SSE log stream #999
The previous `--follow` implementation polled the persisted log blob, which is only written at session end — so it never showed live output for in-progress notebooks. Switch to the midtier SSE proxy (mirroring how the editor/viewer FE consume logs), parsing `data:` events and stopping on the `STREAM_END` sentinel. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Per review feedback, the SSE stream terminates with `END_OF_LOG`, not `STREAM_END`. Updated the sentinel constant, docstring, and test fixtures to match the actual server behavior. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
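For reference, a minimal sketch of the parsing described in the two commits above: read `data:` lines and stop at the `END_OF_LOG` sentinel. The function name `iter_sse_events` and the framing of the sentinel as a bare `data: END_OF_LOG` line are assumptions for illustration, not the branch's actual code; the sketch also treats each `data:` line as one event and does not join multi-line payloads.

```python
import json
from typing import Iterable, Iterator

END_OF_LOG = "END_OF_LOG"  # sentinel named in the commit message


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per `data:` line until the END_OF_LOG sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ":" comments, other SSE fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE drops one optional leading space
        if payload == END_OF_LOG:
            return  # assumption: the sentinel arrives as its own data line
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            event = None
        # Assumption: non-JSON payloads are passed through as raw text.
        yield event if isinstance(event, dict) else {"data": payload}
```

Anything after the sentinel is ignored, which matches the "stop on sentinel" behavior the commit describes.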
…mplete: `kernels_logs_stream` now branches on `Content-Type`, using `_iter_sse_events` for live SSE and `_iter_blob_lines` for the persisted GCS blob returned after the session finishes. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
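A rough sketch of the `Content-Type` branch this commit mentions. `_iter_sse_events` and `_iter_blob_lines` are the names from the commit message; the helper `is_sse_response` below is hypothetical, and it assumes the live stream is served with the standard `text/event-stream` media type.

```python
from typing import Optional


def is_sse_response(content_type: Optional[str]) -> bool:
    """True when the header's media type is text/event-stream."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/event-stream"
```

A caller would route to the SSE parser when this returns true and to the blob parser otherwise.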
The midtier `additional_bindings` for `GetKernelSessionLogsStream`
registers the slug-mode route as `/api/v1/kernels/logs/stream/
{user_name}/{kernel_slug}` (PR #43349), with the slug pair after
`logs/stream` to match the other kernels output endpoints. The
client was calling `/v1/kernels/{owner}/{slug}/logs/stream`, which
would 404 in production. Update the URL, docstring, and test
assertion to the correct path.
Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
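To make the path change concrete, here is a small sketch that builds the slug-mode route quoted above. The helper name `logs_stream_path` and the percent-encoding of each segment are assumptions for illustration; the real client may build the URL differently (for example with a different base prefix).

```python
from urllib.parse import quote


def logs_stream_path(user_name: str, kernel_slug: str) -> str:
    """Build the slug-mode route: the slug pair comes after logs/stream."""
    # Assumption: segments are percent-encoded defensively.
    owner = quote(user_name, safe="")
    slug = quote(kernel_slug, safe="")
    return f"/api/v1/kernels/logs/stream/{owner}/{slug}"
```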
djherbis
left a comment
Can you fix the merge conflicts?
Take main's `parse_kernel_string` + None-check in `kernels_logs` to match the convention used by the other kernel functions, and merge main's new `kernels_output` paging tests alongside this branch's SSE log-stream tests (keeping both `tempfile` and `json` imports). Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Re: @djherbis's review: Conflict resolutions are staged in the working tree from the previous turn:
Ready for the orchestrator to commit these resolutions and complete the rebase.
djherbis
left a comment
You fixed some merge conflicts, but I still see a few for test_kernels_logs.py
[no-agent] Sorry about this. I had pushed some changes to allow the agent to handle merge conflicts but had to revert that due to some rebase bug. For now, you can use the VSCode button to connect to the repo (checked out under /git/kaggleazure even though it is kaggle-cli) to rebase manually. You can run
Main reordered the imports and dropped the section-divider comment in a follow-up commit, leaving this branch with two fresh merge conflicts on rebase. Restructure the imports to match main's layout (with `import json` inserted in the new order) and drop the divider so the rebase auto-merges cleanly. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Re: @djherbis's review: You're right — main moved forward (commit 9aa7aa6) and reintroduced two conflicts in
Working tree is now staged for a clean rebase. Note: main also rewrote
djherbis
left a comment
Add automatic reconnection with deduplication to kernels_logs_stream and
kernels_logs_cli in src/kaggle/api/kaggle_api_extended.py so that
--follow survives connection drops (e.g. load balancer 2-minute timeout).
The current `kernels_logs_cli` follow path is a single pass:

    for event in self.kernels_logs_stream(kernel):
        data = event.get("data")
        if data is None:
            continue
        print(data, flush=True, end="" if data.endswith("\n") else "\n")

Replace it with a reconnect loop:
- Wrap the `kernels_logs_stream()` iteration in a retry loop (e.g. `while True`).
- Track how many log lines have been printed so far (`printed_count`).
- On each connection, iterate events from `kernels_logs_stream()`. Skip the first `printed_count` events (dedup: the server replays from the beginning on reconnect). Print the rest normally and increment `printed_count`.
- Catch `requests.exceptions.ChunkedEncodingError`, `requests.exceptions.ConnectionError`, and `urllib3.exceptions.ProtocolError`; these are what surface as "Response ended prematurely". On catch, log a short warning (e.g. "Connection lost, reconnecting...") to stderr via `print(..., file=sys.stderr)`, sleep 1-2 seconds, and continue the outer loop.
- Exit the loop normally when `kernels_logs_stream()` yields all events without error (meaning the END_OF_LOG sentinel was received), or when the reconnect gets back a non-SSE content-type (GCS blob fallback for a finished session); in that case print only unseen lines from the blob and exit.
- Add a max retry limit (e.g. 5 consecutive failures with no new data) to avoid infinite loops if the session is truly gone.
Update the existing tests in tests/test_kernels_logs.py to cover:
- Normal stream completes without reconnect (already exists, keep it).
- Stream interrupted mid-way, reconnects, deduplicates, prints remaining lines.
- Max retries exceeded raises or exits gracefully.
Load balancers cut idle SSE connections after ~2 minutes, so a long-running `kaggle kernels logs --follow` would die mid-session. The CLI now wraps `kernels_logs_stream()` in a retry loop, skipping the events the server replays from the start of each reconnect, and gives up after 5 consecutive failures with no new data. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
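The retry loop described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the branch's code: `follow_with_reconnect`, `open_stream`, and `consumed_count` (which plays the role of `printed_count` from the review) are hypothetical names, and the default `retry_on` is the built-in `ConnectionError` so the sketch runs without `requests`. A real caller would pass the `requests`/`urllib3` exception classes named in the review, since those do not derive from the built-in.

```python
import sys
import time
from typing import Callable, Iterable, TextIO, Tuple, Type


def follow_with_reconnect(
    open_stream: Callable[[], Iterable[dict]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_failures: int = 5,
    delay_seconds: float = 1.0,
    out: TextIO = sys.stdout,
    err: TextIO = sys.stderr,
) -> bool:
    """Print each event once across reconnects; return False on give-up."""
    consumed_count = 0  # events already handled on earlier connections
    failures_without_progress = 0
    while True:
        position = 0
        progressed = False
        try:
            for event in open_stream():
                position += 1
                if position <= consumed_count:
                    continue  # replayed prefix, already handled
                consumed_count = position
                progressed = True
                data = event.get("data")
                if data is None:
                    continue
                end = "" if data.endswith("\n") else "\n"
                print(data, end=end, file=out, flush=True)
            return True  # stream ended cleanly
        except retry_on:
            # Assumption: a drop that follows new data restarts the count.
            if progressed:
                failures_without_progress = 1
            else:
                failures_without_progress += 1
            if failures_without_progress >= max_failures:
                print("Log stream unavailable, giving up.", file=err)
                return False
            print("Log stream connection lost, reconnecting...", file=err)
            time.sleep(delay_seconds)
```

Counting every consumed event, including ones without `data`, keeps the skip index aligned with what the server replays.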
Re: @djherbis's review: This is already implemented on the branch.
    file=sys.stderr,
    )
    return
    print("Log stream connection lost, reconnecting...", file=sys.stderr)
The current load balancer resets the connection every 3m, which means this log is going to stick out like a sore thumb in the average case.
Can we only print this message if we have repeat failures? ex. failures_without_progress > 1
That way a successful retry is hidden under the hood?
Already addressed on this branch. The reconnect message is gated on failures_without_progress > 1 at src/kaggle/api/kaggle_api_extended.py:5407, so the routine ~3-minute LB reset (single failure, then successful retry) stays silent and only persistent trouble surfaces. Tests in tests/test_kernels_logs.py cover both cases: test_kernels_logs_cli_follow_reconnects_and_dedupes asserts stderr is empty after a single drop, and test_kernels_logs_cli_follow_reports_only_repeat_failures asserts the warning appears exactly once on the second consecutive failure. All 21 tests pass.
The load balancer resets idle SSE connections roughly every 3 minutes, so any long `--follow` session normally hits one drop and recovers silently on retry. Logging a warning each time would clutter stderr in the common case, so the message now only fires on the second consecutive failure with no new data — i.e. when something actually looks wrong. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
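A small sketch of the gating described here, assuming the counter restarts whenever a connection delivered new data before dropping. `ReconnectTracker` and its return values are hypothetical; only the `failures_without_progress > 1` condition and the limit of 5 come from the discussion above.

```python
class ReconnectTracker:
    """Decide what to do after each dropped connection."""

    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self.failures_without_progress = 0

    def record_failure(self, made_progress: bool) -> str:
        """Return "silent", "warn", or "give_up" for this drop."""
        if made_progress:
            # Assumption: new data before the drop restarts the count.
            self.failures_without_progress = 1
        else:
            self.failures_without_progress += 1
        if self.failures_without_progress >= self.max_failures:
            return "give_up"
        if self.failures_without_progress > 1:
            return "warn"  # repeat failure with no new data
        return "silent"  # routine load-balancer reset
```

With this shape, a routine reset followed by a successful retry never reaches stderr.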
djherbis
left a comment
When the logs are for a completed session, the output format looks very different:
[{"stream_name":"stderr","time":10.168431974,"data":"0.00s - Debugger warning: It seems that frozen modules are being used, which may\n"}
,{"stream_name":"stderr","time":10.168515914,"data":"0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\n"}
,{"stream_name":"stderr","time":10.168521834,"data":"0.00s - to python to disable frozen modules.\n"}
,{"stream_name":"stderr","time":10.168525604,"data":"0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.\n"}
,{"stream_name":"stderr","time":10.894785371,"data":"0.00s - Debugger warning: It seems that frozen modules are being used, which may\n"}
,{"stream_name":"stderr","time":10.894837561,"data":"0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\n"}
,{"stream_name":"stderr","time":10.894842661,"data":"0.00s - to python to disable frozen modules.\n"}
,{"stream_name":"stderr","time":10.894845571,"data":"0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.\n"}
,{"stream_name":"stdout","time":12.3878314,"data":"hello 0\n"}
,{"stream_name":"stdout","time":13.387848709,"data":"hello 1\n"}
,{"stream_name":"stdout","time":14.388121119000001,"data":"hello 2\n"}
...
]
This is much different than the output for SSE live streams:
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
hello 0
hello 1
hello 2
...
Can we parse the JSON array output and print it in the same format as the live stream?
The midtier serves the persisted log blob as a JSON array of
`{stream_name, time, data}` objects — the same shape as live SSE
events — but the client was dumping it as raw text, so completed-
session output looked like JSON noise instead of log lines. Parse the
array and yield each entry, so the CLI's existing per-event rendering
produces identical output whether the session is live or finished.
Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
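As an illustration of the change in this commit, the sketch below parses the persisted blob and renders it the same way as live events. `iter_blob_events` and `render` are hypothetical names, and the sketch assumes the whole blob fits in memory as one JSON array of `{stream_name, time, data}` objects, as in the sample above.

```python
import json
from typing import Iterable, Iterator


def iter_blob_events(body: str) -> Iterator[dict]:
    """Yield each log entry from the persisted JSON array."""
    entries = json.loads(body)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of log entries")
    for entry in entries:
        if isinstance(entry, dict):
            yield entry


def render(events: Iterable[dict]) -> str:
    """Join each event's data, one line per entry, like the live path."""
    parts = []
    for event in events:
        data = event.get("data")
        if data is None:
            continue
        parts.append(data if data.endswith("\n") else data + "\n")
    return "".join(parts)
```

Because both sources yield the same event dicts, the CLI can reuse a single rendering path for live and finished sessions.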
Re: @djherbis's review: Already addressed on this branch.
Today `--follow` polls the persisted-log endpoint on an interval, so
users wait seconds for output and miss anything written between polls.
This branch points `--follow` at the midtier's SSE log stream
(`/api/v1/kernels/logs/stream/{owner}/{slug}`, the slug-mode binding
added in Kaggle/kaggleazure#43349) so log lines reach the terminal as
the kernel emits them. When the session has already finished the
midtier returns the persisted log as a JSON array of the same event
objects, so the client branches on `Content-Type`, parses the array,
and renders it identically to a live stream. To survive load-balancer
idle timeouts (~3 minutes) mid-session, the CLI wraps the stream in a
reconnect loop that deduplicates the replayed prefix; a single drop
recovers silently and only repeat failures with no new data surface a
warning or trigger giveup. The one-shot `kaggle kernels logs` path is
unchanged.
Task: herbison-20260505160932-d7e63a14
Context: https://chat.kaggle.net/kaggle/pl/9n7sbkzftbfkuy7cy6n4s937eo