Skip to content

feat(resume-build): add -shell flag for envd-backed PTY access#2528

Open
arkamar wants to merge 3 commits intomainfrom
feat/resume-build-pty
Open

feat(resume-build): add -shell flag for envd-backed PTY access#2528
arkamar wants to merge 3 commits intomainfrom
feat/resume-build-pty

Conversation

@arkamar
Copy link
Copy Markdown
Contributor

@arkamar arkamar commented Apr 30, 2026

Opens an interactive PTY session via envd's Process.Start RPC instead of relying on sshd in the template image. Host stdin goes to raw mode, output streams back to stdout, and SIGWINCH is forwarded via Update. TERM and locale vars are passed through so curses apps (htop, tmux, vim) initialise correctly — envd only forwards PATH/HOME/USER/LOGNAME by default.

In interactive mode, Ctrl+D exits and tears down the sandbox. In -signal-pause, the shell runs alongside the signal wait; Ctrl+D before the signal skips the snapshot. The existing nsenter+ssh hint is still printed for images with running sshd.

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 30, 2026

PR Summary

Medium Risk
Touches interactive terminal handling and long-lived streaming RPCs, which can introduce hangs or cleanup issues if error paths are wrong. Scope is limited to a local CLI path and doesn’t change core orchestration logic.

Overview
Adds a new -shell option to resume-build that, in interactive mode, attaches an in-sandbox PTY shell via envd’s process RPC (including TERM/locale passthrough, stdin raw-mode proxying, and SIGWINCH resize forwarding) and then tears down the sandbox on shell exit. This threads the shell option through run/runner, tightens flag compatibility checks so -shell can’t be combined with command, pause, or benchmark modes, and adjusts interactive cleanup to be deferred rather than only on Ctrl+C.

Reviewed by Cursor Bugbot for commit 997bb5a. Bugbot is set up for automated code reviews on this repo. Configure here.

Opens an interactive PTY session via envd's Process.Start RPC instead
of relying on sshd in the template image. Host stdin goes to raw mode,
output streams back to stdout, and SIGWINCH is forwarded via Update.
TERM and locale vars are passed through so curses apps (htop, tmux,
vim) initialise correctly — envd only forwards PATH/HOME/USER/LOGNAME
by default.

In interactive mode, Ctrl+D exits and tears down the sandbox. In
-signal-pause, the shell runs alongside the signal wait; Ctrl+D before
the signal skips the snapshot. The existing nsenter+ssh hint is still
printed for images with running sshd.
@arkamar arkamar force-pushed the feat/resume-build-pty branch from 210d07f to 20c434d Compare April 30, 2026 11:40
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 20c434d. Configure here.

Comment thread packages/orchestrator/cmd/resume-build/main.go Outdated
Comment thread packages/orchestrator/cmd/resume-build/shell.go
arkamar added 2 commits April 30, 2026 15:54
processC.Start errors (transport/auth) and stream-closed-before-start
errors were unconditionally wrapped as missingShellError, so attachShell
silently retried with /bin/sh and surfaced a misleading "no usable shell
found" message — hiding the real cause (e.g. envd unreachable, bad
token).

Only the "process emitted no output and exited immediately" case is
genuinely indicative of a missing binary. Return everything else
verbatim so the operator sees what actually went wrong.
Drop support for combining -shell with any non-interactive mode and
revert waitForPauseSignal to its pre-shell shape. If a real workflow
for "interact, then snapshot" emerges later we can revisit, but it
deserves a purpose-built UX (e.g. an in-shell sentinel) rather than
piggybacking on -signal-pause.
@arkamar arkamar marked this pull request as ready for review April 30, 2026 15:11
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 997bb5a666

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +138 to +140
stream, err := processC.Start(ctx, startReq)
if err != nil {
return fmt.Errorf("start %s: %w", cmd, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat Start RPC failures as missing-shell for fallback

When /bin/bash is absent (common in minimal images), processC.Start fails before any stream events are emitted, and this branch returns immediately with start /bin/bash: ... instead of a missingShellError. Because attachShell only falls back on isMissingShell(err), it never tries the /bin/sh candidate even though one is configured, so -shell fails outright on images that only provide sh.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — developer-only CLI tool (resume-build), prior bot concerns addressed in follow-up commits, remaining bug is a documented nit.

Extended reasoning...

Overview

This PR adds a new opt-in -shell flag to the resume-build developer CLI that opens an interactive PTY session via envd's Process.Start RPC instead of relying on sshd in the template image. The change adds a new shell.go file (~330 lines) wiring up stdin raw-mode proxying, stdout streaming, SIGWINCH forwarding, and TERM/locale passthrough. main.go plumbs the flag through run/runner and adds a flag-compatibility guard restricting -shell to interactive mode. go.mod promotes golang.org/x/term from indirect to direct.

Security risks

None of significance. This is a developer-only CLI (cmd/resume-build) used to resume build snapshots locally; it is not a service code path. The PTY session runs as root inside the sandbox, but root-in-sandbox is already the established model for this tool (the existing nsenter+ssh hint also drops a root shell). The access token is reused from sbx.Config.Envd.AccessToken consistent with runCommandInSandbox.

Level of scrutiny

Low-to-medium. The code adds non-trivial goroutine/stream plumbing, but it lives entirely in a local debugging CLI with no impact on the orchestrator service path. The interactive-only restriction (added in commit 997bb5a) eliminates the most impactful prior concern (shell goroutine leaving terminal raw across pause output), and commit cbe4b80 narrowed the missingShell classification so infrastructure errors propagate properly.

Other factors

  • Two prior cursor bot Medium-severity issues were both addressed via follow-up commits before this review.
  • The remaining bug surfaced by the bug hunter is a defer-ordering / Receive-vs-Err race on external SIGINT/SIGTERM. Both verifier groups concluded it is technically real but a nit, with no impact under normal Ctrl+D shell exit and only a microsecond corruption window under external signals to the resume-build host process.
  • No tests in the package and no test changes here, but this is consistent with the rest of cmd/resume-build which is a manual debugging tool.

Comment on lines +219 to +228
<-sessionCtx.Done()

select {
case end := <-endCh:
return endToError(end)
default:
}

if err := stream.Err(); err != nil && !errors.Is(err, context.Canceled) {
return fmt.Errorf("shell stream: %w", err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In runShell, when the parent ctx is cancelled (e.g. external SIGINT/SIGTERM to the resume-build process), sessionCtx.Done() fires before the output-pump goroutine has exited, leaving two narrow issues at unwind: (1) a data race between stream.Err() (line 227) and the still-running stream.Receive() in the goroutine — connect.ServerStreamForClient is not concurrency-safe between Receive/Err, and go test -race would flag it; and (2) terminal corruption — the LIFO defer order is cancel()term.Restore (line 185) → stream.Close() (line 142), so any PTY bytes the goroutine writes between term.Restore and stream.Close go to a cooked-mode terminal as bare LFs/escape sequences. Both can be fixed together by closing the stream (or waiting on a goroutine done channel) before restoring the terminal.

Extended reasoning...

What is happening

runShell (packages/orchestrator/cmd/resume-build/shell.go, lines 142–228) registers three deferred operations that run in LIFO order on return:

  1. defer stream.Close() (line 142, registered first → fires last)
  2. defer func() { term.Restore(...); fmt.Println() }() (line 185–188, registered second → fires second)
  3. defer cancel() (line 191, registered third → fires first)

The output pump goroutine spawned at line 196–211 loops on stream.Receive() and writes PTY bytes to os.Stdout. Critically, the stream itself was created with the parent ctx (line 138), not sessionCtx, so cancelling sessionCtx does not unblock the goroutine — only stream.Close() does.

Why it bites only on parent-ctx cancellation

In the normal path (Ctrl+D / shell exit / stream errors), the goroutine receives an End event (or Receive() returns false), the goroutine returns, and its own defer cancel() is what wakes main from <-sessionCtx.Done(). By the time main runs its defers, the goroutine has already exited — no race, no corruption.

However, sessionCtx is derived from the parent ctx via context.WithCancel(ctx). When the parent ctx is cancelled (e.g. orchestrator caught SIGINT and called the top-level cancel(), or supervisor kill -TERM), sessionCtx.Done() fires synchronously while the output pump is still parked inside stream.Receive(). Note: keyboard Ctrl+C does NOT trigger this because term.MakeRaw disables ISIG and the byte goes to the sandbox shell — it requires an external signal.

Issue 1: Data race on stream.Err()

Once <-sessionCtx.Done() unblocks at line 219, the endCh select at line 221 falls through (the goroutine has not sent yet) and main proceeds to stream.Err() at line 227. Meanwhile the output pump is still inside stream.Receive(). connect.ServerStreamForClient stores receive state in unsynchronised fields (s.err, s.msg); Receive() writes s.err and Err() reads it. Concurrent unsynchronised access is a data race per the Go memory model and would be flagged by go test -race.

Issue 2: Terminal corruption window

LIFO unwind: cancel() (no-op on the goroutine, since it watches the stream not ctx) → term.Restore puts the terminal into cooked mode and prints a newline → stream.Close() finally interrupts the goroutine. Between steps 2 and 3, any buffered Data event the goroutine processes is written as raw-mode PTY bytes (bare LFs without CR, escape sequences) to a now-cooked terminal, producing stairstepped/garbled output.

Step-by-step proof

  1. User runs resume-build -shell; main goroutine enters runShell, terminal is in raw mode, output pump goroutine is blocked in stream.Receive().
  2. Operator runs kill -INT <pid> against the resume-build host process; the signal handler in main.go calls the top-level cancel().
  3. Parent ctx is cancelled; sessionCtx.Done() channel closes (because sessionCtx = context.WithCancel(parent)).
  4. Main goroutine unblocks at line 219. endCh is empty (goroutine is still in Receive), so the select falls through (default branch).
  5. Main calls stream.Err() at line 227 — at the same moment the output pump may be inside stream.Receive() writing to s.err. Race.
  6. Main returns. Defers fire in LIFO: (a) cancel() — already done, no-op for the goroutine; (b) term.Restore + Println — terminal back to cooked mode; (c) stream.Close() — only now does the goroutine's Receive() return false.
  7. In the window between (b) and (c), if any envd Data event was already buffered in the connect stream, the goroutine pulls it and does os.Stdout.Write(pty) of raw-mode bytes onto a cooked terminal → stairstepped output, mangled prompts, possibly visible escape sequences.

Why existing code does not prevent this

  • The output goroutine's defer cancel() only helps when the goroutine itself triggers the unwind. It cannot help when sessionCtx is cancelled before the goroutine has noticed.
  • The endCh select at line 221 only short-circuits the End-event path; on parent-ctx cancellation, endCh is empty.
  • The stream is bound to parent ctx, not sessionCtx, so cancelling sessionCtx does not cause Receive() to return.
  • errors.Is(err, context.Canceled) filtering at line 227 hides the resulting error message from the user, but does not eliminate the race itself.

Impact

Real but narrow. The data race is undefined behavior per Go's memory model; in practice on this single error-interface read it is usually benign (stale value of an interface that is then filtered against context.Canceled). The terminal corruption window is microseconds and only meaningful if PTY data is actually buffered at cancellation time. Trigger requires external SIGINT/SIGTERM (not keyboard Ctrl+C, which raw mode swallows). Both verifier groups concluded this is technically correct but low practical severity — a nit.

Suggested fix

Either:

  • Move stream.Close() so it runs before term.Restore (e.g. swap the registration order, or do an explicit stream.Close() before main returns); or
  • Use a done channel / sync.WaitGroup so main waits for the output pump to exit before reading stream.Err() and before unwinding the terminal.

The first is one line; the second is a few. Either eliminates both issues simultaneously.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants