Rewrite kaggle kernels logs --follow to use SSE log stream #999

Merged
djherbis merged 10 commits into main from agent/herbison-20260505160932-d7e63a14 on Jun 17, 2026

Conversation

@kaggle-agent kaggle-agent commented May 5, 2026

Collaborator

Today --follow polls the persisted-log endpoint on an interval, so
users wait seconds for output and miss anything written between polls.
This branch points --follow at the midtier's SSE log stream
(/api/v1/kernels/logs/stream/{owner}/{slug}, the slug-mode binding
added in Kaggle/kaggleazure#43349) so log lines reach the terminal as
the kernel emits them. When the session has already finished the
midtier returns the persisted log as a JSON array of the same event
objects, so the client branches on Content-Type, parses the array,
and renders it identically to a live stream. To survive load-balancer
idle timeouts (~3 minutes) mid-session, the CLI wraps the stream in a
reconnect loop that deduplicates the replayed prefix; a single drop
recovers silently, and only repeated failures with no new data surface
a warning or make the CLI give up. The one-shot kaggle kernels logs
path is unchanged.


Task: herbison-20260505160932-d7e63a14
Context: https://chat.kaggle.net/kaggle/pl/9n7sbkzftbfkuy7cy6n4s937eo

The previous `--follow` implementation polled the persisted log
blob, which is only written at session end — so it never showed
live output for in-progress notebooks. Switch to the midtier SSE
proxy (mirroring how the editor/viewer FE consume logs), parsing
`data:` events and stopping on the `STREAM_END` sentinel.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
Per review feedback, the SSE stream terminates with `END_OF_LOG`,
not `STREAM_END`. Updated the sentinel constant, docstring, and
test fixtures to match the actual server behavior.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
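
For readers following along, here is a minimal sketch of the parsing these two commit messages describe: read `data:` lines and stop at the `END_OF_LOG` sentinel. It is not the code on the branch. The function name `iter_sse_events`, the JSON payload shape, and the assumption that the sentinel arrives as a bare `data:` payload are all illustrative guesses.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Sentinel named in the review; how it is framed on the wire is an assumption.
END_OF_LOG = "END_OF_LOG"


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per `data:` line until the sentinel is seen."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE drops a single leading space
        if payload == END_OF_LOG:
            return  # server signalled the end of the log
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            event = None
        if isinstance(event, dict):
            yield event
        else:
            # Assumed fallback: wrap anything that is not a JSON object.
            yield {"data": payload}
```

Anything after the sentinel is ignored, so a caller can treat normal exhaustion of the iterator as "session finished".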
@djherbis djherbis requested review from bovard and stevemessick May 5, 2026 16:33
…mplete: `kernels_logs_stream` now branches on `Content-Type`, using `_iter_sse_events` for live SSE and `_iter_blob_lines` for the persisted GCS blob returned after the session finishes.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
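
A rough sketch of the `Content-Type` branch described above, assuming the live stream is served as `text/event-stream` (the standard SSE media type) and that anything else is the persisted blob. The helper names `is_sse_response` and `choose_parser` are hypothetical; the branch itself dispatches to `_iter_sse_events` and `_iter_blob_lines`.

```python
from typing import Mapping


def is_sse_response(content_type: str) -> bool:
    """Return True when the Content-Type value marks a live SSE stream."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/event-stream"


def choose_parser(headers: Mapping[str, str]) -> str:
    """Return "sse" for a live stream, "blob" for everything else."""
    # A plain mapping is case-sensitive; real HTTP clients usually are not.
    content_type = headers.get("Content-Type", "")
    return "sse" if is_sse_response(content_type) else "blob"
```

Treating every non-SSE response as the blob keeps the fallback working even if the server labels the persisted log as `application/octet-stream` rather than JSON.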
The midtier `additional_bindings` for `GetKernelSessionLogsStream`
registers the slug-mode route as `/api/v1/kernels/logs/stream/
{user_name}/{kernel_slug}` (PR #43349), with the slug pair after
`logs/stream` to match the other kernels output endpoints. The
client was calling `/v1/kernels/{owner}/{slug}/logs/stream`, which
would 404 in production. Update the URL, docstring, and test
assertion to the correct path.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
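
To make the corrected route concrete, a small sketch that builds the slug-mode path from an `owner/slug` string. The helper `logs_stream_path` is hypothetical and only illustrates the path layout from the commit message; the client's own parsing goes through `parse_kernel_string`, whose behavior is not reproduced here.

```python
from urllib.parse import quote


def logs_stream_path(kernel: str) -> str:
    """Build the slug-mode log stream path for an 'owner/slug' string."""
    owner, sep, slug = kernel.partition("/")
    if not sep or not owner or not slug or "/" in slug:
        raise ValueError(f"expected 'owner/slug', got {kernel!r}")
    # The slug pair comes after logs/stream, per the midtier binding.
    return (
        "/api/v1/kernels/logs/stream/"
        f"{quote(owner, safe='')}/{quote(slug, safe='')}"
    )
```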
@djherbis djherbis self-requested a review June 16, 2026 17:54

@djherbis djherbis left a comment

Contributor

Can you fix the merge conflicts?

Take main's `parse_kernel_string` + None-check in `kernels_logs` to
match the convention used by the other kernel functions, and merge
main's new `kernels_output` paging tests alongside this branch's SSE
log-stream tests (keeping both `tempfile` and `json` imports).

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@kaggle-agent

Collaborator Author

Re: @djherbis's review:

Conflict resolutions are staged in the working tree from the previous turn:

  • src/kaggle/api/kaggle_api_extended.py (kernels_logs): adopted main's version using parse_kernel_string + the explicit None check, matching the convention used by the other kernel functions on main.
  • tests/test_kernels_logs.py: kept both import json (needed by the SSE tests on this branch) and import tempfile (needed by main's new output tests), and inserted main's three kernels_output paging tests ahead of the kernels_logs section divider.

Ready for the orchestrator to commit these resolutions and complete the rebase.

@kaggle-agent kaggle-agent changed the title Rewrite kaggle kernels logs --follow to use SSE log stream Stream kernel logs via midtier SSE endpoint with blob fallback Jun 16, 2026

@djherbis djherbis left a comment

Contributor

You fixed some merge conflicts, but I still see a few for test_kernels_logs.py

@erdalsivri

Contributor

[no-agent] Sorry about this. I had pushed some changes to allow the agent to handle merge conflicts but had to revert them due to a rebase bug. For now, you can use the VSCode button to connect to the repo (checked out under /git/kaggleazure even though it is kaggle-cli) and rebase manually. You can also run claude --resume to manually invoke the agent to help with the merge.

Main reordered the imports and dropped the section-divider comment
in a follow-up commit, leaving this branch with two fresh merge
conflicts on rebase. Restructure the imports to match main's layout
(with `import json` inserted in the new order) and drop the divider
so the rebase auto-merges cleanly.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@kaggle-agent

Collaborator Author

Re: @djherbis's review:

You're right — main moved forward (commit 9aa7aa6) and reintroduced two conflicts in tests/test_kernels_logs.py:

  1. Import block — main reorganized to put import unittest / from unittest.mock at the top, then io / tempfile / sys. My branch had alphabetized everything and added import json. Restructured the working tree to match main's import order with import json inserted in the new layout, so the rebase will auto-merge this hunk.
  2. Section divider comment — I had added # kernels_logs (one-shot, persisted blob) above the existing tests; main removed/never had it. Dropped the divider so the rebase has nothing to conflict on.

Working tree is now staged for a clean rebase. Note: main also rewrote kernels_logs_cli to a polling implementation (#966) while this branch uses the SSE rewrite, so once the textual rebase lands, the polling-flavored tests main added (test_kernels_logs_cli_follow_mode, _stops_on_error, _stops_on_cancel) will fail against the SSE handler — flagging that as a separate follow-up since you only asked about merge markers here.

@kaggle-agent kaggle-agent changed the title Stream kernel logs via midtier SSE endpoint with blob fallback Rewrite kaggle kernels logs --follow to use SSE log stream Jun 16, 2026
@djherbis djherbis self-requested a review June 16, 2026 21:44

@djherbis djherbis left a comment

Contributor

Add automatic reconnection with deduplication to kernels_logs_stream and
kernels_logs_cli in src/kaggle/api/kaggle_api_extended.py so that
--follow survives connection drops (e.g. load balancer 2-minute timeout).

The current kernels_logs_cli follow path is a single pass:

for event in self.kernels_logs_stream(kernel):
    data = event.get("data")
    if data is None:
        continue
    print(data, flush=True, end="" if data.endswith("\n") else "\n")

Replace it with a reconnect loop:

  1. Wrap the kernels_logs_stream() iteration in a retry loop
    (e.g. while True).
  2. Track how many log lines have been printed so far (printed_count).
  3. On each connection, iterate events from kernels_logs_stream().
    Skip the first printed_count events (dedup — the server replays
    from the beginning on reconnect). Print the rest normally and
    increment printed_count.
  4. Catch requests.exceptions.ChunkedEncodingError,
    requests.exceptions.ConnectionError, and urllib3.exceptions.ProtocolError
    — these are what surface as "Response ended prematurely". On
    catch, log a short warning (e.g. "Connection lost, reconnecting...")
    to stderr via print(..., file=sys.stderr), sleep 1-2 seconds,
    and continue the outer loop.
  5. Exit the loop normally when kernels_logs_stream() yields all
    events without error (meaning the END_OF_LOG sentinel was received),
    or when the reconnect gets back a non-SSE content-type (GCS blob
    fallback for a finished session) — in that case print only unseen
    lines from the blob and exit.
  6. Add a max retry limit (e.g. 5 consecutive failures with no new
    data) to avoid infinite loops if the session is truly gone.

Update the existing tests in tests/test_kernels_logs.py to cover:

  • Normal stream completes without reconnect (already exists, keep it).
  • Stream interrupted mid-way, reconnects, deduplicates, prints remaining lines.
  • Max retries exceeded raises or exits gracefully.

Load balancers cut idle SSE connections after ~2 minutes, so a long-
running `kaggle kernels logs --follow` would die mid-session. The CLI
now wraps `kernels_logs_stream()` in a retry loop, skipping the events
the server replays from the start of each reconnect, and gives up
after 5 consecutive failures with no new data.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
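
The reconnect-and-dedupe loop requested in the review can be sketched roughly as follows. This is an illustration under stated assumptions, not the branch's `kernels_logs_cli`: the function name `follow_with_reconnect`, the injected `open_stream`, `emit`, and `sleep` callables, and the use of the built-in `ConnectionError` in place of the `requests` and `urllib3` exceptions are all stand-ins chosen so the example runs without a network.

```python
import time
from typing import Any, Callable, Dict, Iterable


def follow_with_reconnect(
    open_stream: Callable[[], Iterable[Dict[str, Any]]],
    emit: Callable[[str], None],
    max_failures: int = 5,
    delay_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the stream finished, False if we gave up."""
    seen_count = 0  # events already handled across all connections
    failures_without_progress = 0
    while True:
        try:
            for index, event in enumerate(open_stream()):
                if index < seen_count:
                    continue  # server replays from the start; skip the prefix
                seen_count = index + 1
                failures_without_progress = 0  # new data counts as progress
                data = event.get("data")
                if data is not None:
                    emit(data)
            return True  # iterator ended cleanly: sentinel or blob fallback
        except ConnectionError:
            # Stand-in for ChunkedEncodingError / ConnectionError / ProtocolError.
            failures_without_progress += 1
            if failures_without_progress >= max_failures:
                return False
            sleep(delay_sec)
```

Counting events by position only works because the server replays the log from the beginning on every reconnect, as the review notes; a server that resumed mid-stream would need a different dedup key.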
@kaggle-agent

Collaborator Author

Re: @djherbis's review:

This is already implemented on the branch. kernels_logs_cli (src/kaggle/api/kaggle_api_extended.py:5351) wraps kernels_logs_stream() in a while True reconnect loop. It tracks seen_count across reconnects, skips events whose index is below seen_count (the midtier replays from the start on reconnect), and catches requests.exceptions.ChunkedEncodingError, requests.exceptions.ConnectionError, and urllib3.exceptions.ProtocolError. On catch it prints "Log stream connection lost, reconnecting..." to stderr, sleeps 1s (_LOG_STREAM_RECONNECT_DELAY_SEC), and retries. After _LOG_STREAM_MAX_FAILURES = 5 consecutive failures with no new events the CLI prints a giveup message to stderr and returns. Normal completion (END_OF_LOG sentinel or non-SSE blob fallback) just falls through and returns. kernels_logs_stream itself didn't need structural changes — each call opens a fresh connection, which is what the reconnect path depends on. Two tests in tests/test_kernels_logs.py cover the new behavior: test_kernels_logs_cli_follow_reconnects_and_dedupes (mid-stream drop, reconnect, dedup the replayed events, print only the new third line) and test_kernels_logs_cli_follow_gives_up_after_max_failures (5 consecutive ConnectionErrors exit with the giveup message). All 20 tests in the file pass.

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
file=sys.stderr,
)
return
print("Log stream connection lost, reconnecting...", file=sys.stderr)

Contributor

The current load balancer resets the connection every 3m, which means this log is going to stick out like a sore thumb in the average case.

Can we only print this message if we have repeat failures? ex. failures_without_progress > 1

That way a successful retry is hidden under the hood?

Collaborator Author

Already addressed on this branch. The reconnect message is gated on failures_without_progress > 1 at src/kaggle/api/kaggle_api_extended.py:5407, so the routine ~3-minute LB reset (single failure, then successful retry) stays silent and only persistent trouble surfaces. Tests in tests/test_kernels_logs.py cover both cases: test_kernels_logs_cli_follow_reconnects_and_dedupes asserts stderr is empty after a single drop, and test_kernels_logs_cli_follow_reports_only_repeat_failures asserts the warning appears exactly once on the second consecutive failure. All 21 tests pass.

The load balancer resets idle SSE connections roughly every 3 minutes,
so any long `--follow` session normally hits one drop and recovers
silently on retry. Logging a warning each time would clutter stderr in
the common case, so the message now only fires on the second
consecutive failure with no new data — i.e. when something actually
looks wrong.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
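
The gating rule from this thread can be written as a tiny decision function. This is a sketch, not the branch code: `failure_action` is a hypothetical name, and the exact comparison used for the give-up limit (`>=` here) is an assumption. Only the `failures_without_progress > 1` warning gate and the limit of 5 come from the discussion.

```python
def failure_action(failures_without_progress: int, max_failures: int = 5) -> str:
    """Decide what to do after a dropped connection.

    Returns "silent", "warn", or "give_up".
    """
    if failures_without_progress >= max_failures:
        return "give_up"  # session is probably gone; stop retrying
    if failures_without_progress > 1:
        return "warn"  # repeated failure with no new data looks wrong
    return "silent"  # routine load-balancer reset: retry quietly
```

With the default limit, the first failure is silent, the second through fourth warn, and the fifth gives up.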

@djherbis djherbis left a comment

Contributor

When the logs are for a completed session, the output format looks very different:

[{"stream_name":"stderr","time":10.168431974,"data":"0.00s - Debugger warning: It seems that frozen modules are being used, which may\n"}
,{"stream_name":"stderr","time":10.168515914,"data":"0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\n"}
,{"stream_name":"stderr","time":10.168521834,"data":"0.00s - to python to disable frozen modules.\n"}
,{"stream_name":"stderr","time":10.168525604,"data":"0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.\n"}
,{"stream_name":"stderr","time":10.894785371,"data":"0.00s - Debugger warning: It seems that frozen modules are being used, which may\n"}
,{"stream_name":"stderr","time":10.894837561,"data":"0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\n"}
,{"stream_name":"stderr","time":10.894842661,"data":"0.00s - to python to disable frozen modules.\n"}
,{"stream_name":"stderr","time":10.894845571,"data":"0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.\n"}
,{"stream_name":"stdout","time":12.3878314,"data":"hello 0\n"}
,{"stream_name":"stdout","time":13.387848709,"data":"hello 1\n"}
,{"stream_name":"stdout","time":14.388121119000001,"data":"hello 2\n"}
...
]

This is much different than the output for SSE live streams:

0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
hello 0
hello 1
hello 2
...

Can we parse the JSON array output and print it in the same format as the live stream?

The midtier serves the persisted log blob as a JSON array of
`{stream_name, time, data}` objects — the same shape as live SSE
events — but the client was dumping it as raw text, so completed-
session output looked like JSON noise instead of log lines. Parse the
array and yield each entry, so the CLI's existing per-event rendering
produces identical output whether the session is live or finished.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
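
A minimal sketch of the blob handling this commit describes: parse the JSON array, yield each entry, and fall back to plain lines when the body is not a JSON array. The names `iter_blob_events` and `render` are illustrative, not the branch's `_iter_blob_lines`; skipping non-object array entries is an assumption made here for safety.

```python
import json
from typing import Any, Dict, Iterator


def iter_blob_events(text: str) -> Iterator[Dict[str, Any]]:
    """Yield {stream_name, time, data}-style events from a persisted log."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, list):
        for entry in parsed:
            if isinstance(entry, dict):  # assumption: ignore non-object entries
                yield entry
        return
    # Fallback for unexpected formats: one event per raw line.
    for line in text.splitlines(keepends=True):
        yield {"data": line}


def render(event: Dict[str, Any]) -> str:
    """Format one event the way the follow loop prints it."""
    data = event.get("data")
    if data is None:
        return ""
    return data if data.endswith("\n") else data + "\n"
```

Because both the live stream and the blob end up as the same event dicts, one rendering path covers both cases.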
@kaggle-agent

Collaborator Author

Re: @djherbis's review:

Already addressed on this branch. _iter_blob_lines at src/kaggle/api/kaggle_api_extended.py:5336 now reads response.text, parses it with json.loads, and yields each {stream_name, time, data} entry as-is. The CLI's existing event.get("data") rendering then produces the same per-line output for completed sessions as for live SSE streams — no more raw JSON dump. Non-JSON blobs fall back to line-by-line so unexpected formats still print something readable. Tests in tests/test_kernels_logs.py cover the new behavior: the _blob_response helper builds a JSON array, test_kernels_logs_stream_falls_back_to_blob_for_completed_session and test_kernels_logs_stream_blob_fallback_with_octet_stream assert on the parsed event shape, and test_kernels_logs_stream_blob_fallback_handles_non_json covers the fallback path. All 22 tests pass.

@djherbis djherbis merged commit c97b626 into main Jun 17, 2026
14 checks passed
@djherbis djherbis deleted the agent/herbison-20260505160932-d7e63a14 branch June 17, 2026 13:07

5 participants