feat(resume-build): add -shell flag for envd-backed PTY access#2528
feat(resume-build): add -shell flag for envd-backed PTY access#2528
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 997bb5a. Bugbot is set up for automated code reviews on this repo. Configure here. |
Opens an interactive PTY session via envd's Process.Start RPC instead of relying on sshd in the template image. Host stdin goes to raw mode, output streams back to stdout, and SIGWINCH is forwarded via Update. TERM and locale vars are passed through so curses apps (htop, tmux, vim) initialise correctly — envd only forwards PATH/HOME/USER/LOGNAME by default. In interactive mode, Ctrl+D exits and tears down the sandbox. In -signal-pause, the shell runs alongside the signal wait; Ctrl+D before the signal skips the snapshot. The existing nsenter+ssh hint is still printed for images with running sshd.
210d07f to
20c434d
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 20c434d. Configure here.
processC.Start errors (transport/auth) and stream-closed-before-start errors were unconditionally wrapped as missingShellError, so attachShell silently retried with /bin/sh and surfaced a misleading "no usable shell found" message — hiding the real cause (e.g. envd unreachable, bad token). Only the "process emitted no output and exited immediately" case is genuinely indicative of a missing binary. Return everything else verbatim so the operator sees what actually went wrong.
Drop support for combining -shell with any non-interactive mode and revert waitForPauseSignal to its pre-shell shape. If a real workflow for "interact, then snapshot" emerges later we can revisit, but it deserves a purpose-built UX (e.g. an in-shell sentinel) rather than piggybacking on -signal-pause.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 997bb5a666
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| stream, err := processC.Start(ctx, startReq) | ||
| if err != nil { | ||
| return fmt.Errorf("start %s: %w", cmd, err) |
There was a problem hiding this comment.
Treat Start RPC failures as missing-shell for fallback
When /bin/bash is absent (common in minimal images), processC.Start fails before any stream events are emitted, and this branch returns immediately with start /bin/bash: ... instead of a missingShellError. Because attachShell only falls back on isMissingShell(err), it never tries the /bin/sh candidate even though one is configured, so -shell fails outright on images that only provide sh.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
LGTM — developer-only CLI tool (resume-build), prior bot concerns addressed in follow-up commits, remaining bug is a documented nit.
Extended reasoning...
Overview
This PR adds a new opt-in -shell flag to the resume-build developer CLI that opens an interactive PTY session via envd's Process.Start RPC instead of relying on sshd in the template image. The change adds a new shell.go file (~330 lines) wiring up stdin raw-mode proxying, stdout streaming, SIGWINCH forwarding, and TERM/locale passthrough. main.go plumbs the flag through run/runner and adds a flag-compatibility guard restricting -shell to interactive mode. go.mod promotes golang.org/x/term from indirect to direct.
Security risks
None of significance. This is a developer-only CLI (cmd/resume-build) used to resume build snapshots locally; it is not a service code path. The PTY session runs as root inside the sandbox, but root-in-sandbox is already the established model for this tool (the existing nsenter+ssh hint also drops a root shell). The access token is reused from sbx.Config.Envd.AccessToken consistent with runCommandInSandbox.
Level of scrutiny
Low-to-medium. The code adds non-trivial goroutine/stream plumbing, but it lives entirely in a local debugging CLI with no impact on the orchestrator service path. The interactive-only restriction (added in commit 997bb5a) eliminates the most impactful prior concern (shell goroutine leaving terminal raw across pause output), and commit cbe4b80 narrowed the missingShell classification so infrastructure errors propagate properly.
Other factors
- Two prior cursor bot Medium-severity issues were both addressed via follow-up commits before this review.
- The remaining bug surfaced by the bug hunter is a defer-ordering / Receive-vs-Err race on external SIGINT/SIGTERM. Both verifier groups concluded it is technically real but a nit, with no impact under normal Ctrl+D shell exit and only a microsecond corruption window under external signals to the resume-build host process.
- No tests in the package and no test changes here, but this is consistent with the rest of
cmd/resume-buildwhich is a manual debugging tool.
| <-sessionCtx.Done() | ||
|
|
||
| select { | ||
| case end := <-endCh: | ||
| return endToError(end) | ||
| default: | ||
| } | ||
|
|
||
| if err := stream.Err(); err != nil && !errors.Is(err, context.Canceled) { | ||
| return fmt.Errorf("shell stream: %w", err) |
There was a problem hiding this comment.
🟡 In runShell, when the parent ctx is cancelled (e.g. external SIGINT/SIGTERM to the resume-build process), sessionCtx.Done() fires before the output-pump goroutine has exited, leaving two narrow issues at unwind: (1) a data race between stream.Err() (line 227) and the still-running stream.Receive() in the goroutine — connect.ServerStreamForClient is not concurrency-safe between Receive/Err, and go test -race would flag it; and (2) terminal corruption — the LIFO defer order is cancel() → term.Restore (line 185) → stream.Close() (line 142), so any PTY bytes the goroutine writes between term.Restore and stream.Close go to a cooked-mode terminal as bare LFs/escape sequences. Both can be fixed together by closing the stream (or waiting on a goroutine done channel) before restoring the terminal.
Extended reasoning...
What is happening
runShell (packages/orchestrator/cmd/resume-build/shell.go, lines 142–228) registers three deferred operations that run in LIFO order on return:
defer stream.Close()(line 142, registered first → fires last)defer func() { term.Restore(...); fmt.Println() }()(line 185–188, registered second → fires second)defer cancel()(line 191, registered third → fires first)
The output pump goroutine spawned at line 196–211 loops on stream.Receive() and writes PTY bytes to os.Stdout. Critically, the stream itself was created with the parent ctx (line 138), not sessionCtx, so cancelling sessionCtx does not unblock the goroutine — only stream.Close() does.
Why it bites only on parent-ctx cancellation
In the normal path (Ctrl+D / shell exit / stream errors), the goroutine receives an End event (or Receive() returns false), the goroutine returns, and its own defer cancel() is what wakes main from <-sessionCtx.Done(). By the time main runs its defers, the goroutine has already exited — no race, no corruption.
However, sessionCtx is derived from the parent ctx via context.WithCancel(ctx). When the parent ctx is cancelled (e.g. orchestrator caught SIGINT and called the top-level cancel(), or supervisor kill -TERM), sessionCtx.Done() fires synchronously while the output pump is still parked inside stream.Receive(). Note: keyboard Ctrl+C does NOT trigger this because term.MakeRaw disables ISIG and the byte goes to the sandbox shell — it requires an external signal.
Issue 1: Data race on stream.Err()
Once <-sessionCtx.Done() unblocks at line 219, the endCh select at line 221 falls through (the goroutine has not sent yet) and main proceeds to stream.Err() at line 227. Meanwhile the output pump is still inside stream.Receive(). connect.ServerStreamForClient stores receive state in unsynchronised fields (s.err, s.msg); Receive() writes s.err and Err() reads it. Concurrent unsynchronised access is a data race per the Go memory model and would be flagged by go test -race.
Issue 2: Terminal corruption window
LIFO unwind: cancel() (no-op on the goroutine, since it watches the stream not ctx) → term.Restore puts the terminal into cooked mode and prints a newline → stream.Close() finally interrupts the goroutine. Between steps 2 and 3, any buffered Data event the goroutine processes is written as raw-mode PTY bytes (bare LFs without CR, escape sequences) to a now-cooked terminal, producing stairstepped/garbled output.
Step-by-step proof
- User runs
resume-build -shell; main goroutine entersrunShell, terminal is in raw mode, output pump goroutine is blocked instream.Receive(). - Operator runs
kill -INT <pid>against the resume-build host process; the signal handler inmain.gocalls the top-levelcancel(). - Parent
ctxis cancelled;sessionCtx.Done()channel closes (becausesessionCtx = context.WithCancel(parent)). - Main goroutine unblocks at line 219.
endChis empty (goroutine is still in Receive), so the select falls through (default branch). - Main calls
stream.Err()at line 227 — at the same moment the output pump may be insidestream.Receive()writing tos.err. Race. - Main returns. Defers fire in LIFO: (a)
cancel()— already done, no-op for the goroutine; (b)term.Restore+Println— terminal back to cooked mode; (c)stream.Close()— only now does the goroutine'sReceive()return false. - In the window between (b) and (c), if any envd Data event was already buffered in the connect stream, the goroutine pulls it and does
os.Stdout.Write(pty)of raw-mode bytes onto a cooked terminal → stairstepped output, mangled prompts, possibly visible escape sequences.
Why existing code does not prevent this
- The output goroutine's
defer cancel()only helps when the goroutine itself triggers the unwind. It cannot help when sessionCtx is cancelled before the goroutine has noticed. - The
endChselect at line 221 only short-circuits the End-event path; on parent-ctx cancellation, endCh is empty. - The stream is bound to parent
ctx, notsessionCtx, so cancellingsessionCtxdoes not causeReceive()to return. errors.Is(err, context.Canceled)filtering at line 227 hides the resulting error message from the user, but does not eliminate the race itself.
Impact
Real but narrow. The data race is undefined behavior per Go's memory model; in practice on this single error-interface read it is usually benign (stale value of an interface that is then filtered against context.Canceled). The terminal corruption window is microseconds and only meaningful if PTY data is actually buffered at cancellation time. Trigger requires external SIGINT/SIGTERM (not keyboard Ctrl+C, which raw mode swallows). Both verifier groups concluded this is technically correct but low practical severity — a nit.
Suggested fix
Either:
- Move
stream.Close()so it runs beforeterm.Restore(e.g. swap the registration order, or do an explicitstream.Close()before main returns); or - Use a
donechannel /sync.WaitGroupso main waits for the output pump to exit before readingstream.Err()and before unwinding the terminal.
The first is one line; the second is a few. Either eliminates both issues simultaneously.

Opens an interactive PTY session via envd's Process.Start RPC instead of relying on sshd in the template image. Host stdin goes to raw mode, output streams back to stdout, and SIGWINCH is forwarded via Update. TERM and locale vars are passed through so curses apps (htop, tmux, vim) initialise correctly — envd only forwards PATH/HOME/USER/LOGNAME by default.
In interactive mode, Ctrl+D exits and tears down the sandbox. In -signal-pause, the shell runs alongside the signal wait; Ctrl+D before the signal skips the snapshot. The existing nsenter+ssh hint is still printed for images with running sshd.