fix: concurrency-safe & idempotent release commit-push to main#188

Draft
PaulNewling wants to merge 3 commits into v4-beta from fix/release-commit-push-race

@PaulNewling PaulNewling commented Jun 24, 2026

Collaborator

What

Make the block-release flow's "Commit changed files to main" step concurrency-safe and idempotent, in both reusable release workflows (node-simple-pnpm.yaml and node-matrix-pnpm.yaml).

Why — a half-published release

Motivating failure: antibody-tcr-lead-selection — run 28113341538, job 83255658745 (merge of PR #158).

Build, tests, and the security scan passed. The job failed at "Commit changed files to main":

! [remote rejected] main -> main (cannot lock ref 'refs/heads/main':
    is at f8b3c906… but expected 7db2518…)
error: failed to push some refs
Process completed with exit code 1

Root cause

The release runs in two passes. A merge carrying changesets runs changeset version, commits the bump as "Auto-generated changes", and pushes it to main. Publish and tag are gated on has-changes == '0', so they run on the follow-up run that this push triggers.
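The two-pass gate can be illustrated with a minimal sketch. The PR text does not show how has-changes is computed, so this assumes it is the number of modified or untracked paths left in the working tree after changeset version; `count_changes` is a hypothetical helper name, and the hand-edited package.json stands in for the real version bump.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (assumption): count modified or untracked paths.
count_changes() {
  git -C "$1" status --porcelain | wc -l | tr -d ' '
}

work="$(mktemp -d)"
export GIT_AUTHOR_NAME=bot GIT_AUTHOR_EMAIL=bot@example.invalid
export GIT_COMMITTER_NAME=bot GIT_COMMITTER_EMAIL=bot@example.invalid

# A throwaway repo with one committed package.json.
git init -q "$work/repo"
echo '{"version":"4.2.0"}' > "$work/repo/package.json"
git -C "$work/repo" add .
git -C "$work/repo" commit -q -m "base"

# Pass 2 shape: nothing left to commit, so publish and tag may run.
clean_count="$(count_changes "$work/repo")"

# Pass 1 shape: a version bump is pending (stand-in for `changeset version`).
echo '{"version":"4.3.0"}' > "$work/repo/package.json"
dirty_count="$(count_changes "$work/repo")"

echo "has-changes=$clean_count: publish and tag run"
echo "has-changes=$dirty_count: bump is committed and pushed, publish waits for the follow-up run"
```

Under this assumption a run sees either a pending bump or a clean tree, never both, which is why publish depends on the bump-push succeeding and triggering the next run.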

The commit step pushed bare — no serialization, no retry:

git checkout main
git add .
git commit -m "Auto-generated changes"
git push

Two defects compound:

  1. No concurrency control. Two near-simultaneous main releases each ran changeset version and produced the byte-identical "Auto-generated changes" commit — same parent, same bot author, same second-resolution timestamp, so the same SHA. One push won the compare-and-swap; the other lost.

  2. A failed bump-push strands the release. The bump landed on main, but the step exited 1, so the publish/tag pass never ran for that content. main then declared versions npm never received: model/ui/workflow at 4.3.0 on main, still 4.2.x on npm, and no v3.2.0 tag. Re-running can't recover — a re-run checks out the frozen run head, now behind main, so every push fails non-fast-forward.
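The "same SHA" claim in defect 1 can be checked locally. The sketch below pins the inputs that feed a commit hash (parent, tree, identity, both timestamps, message) and builds the bump commit twice in separate clones. The identity, the timestamp, and the package.json content are illustrative assumptions, not values from the failing run.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"

# Pin identity and both timestamps, as two runs in the same second would share them.
export GIT_AUTHOR_NAME="release-bot" GIT_AUTHOR_EMAIL="bot@example.invalid"
export GIT_COMMITTER_NAME="release-bot" GIT_COMMITTER_EMAIL="bot@example.invalid"
export GIT_AUTHOR_DATE="1700000000 +0000" GIT_COMMITTER_DATE="1700000000 +0000"

# Shared starting point standing in for the tip of main.
git init -q "$work/base"
git -C "$work/base" commit -q --allow-empty -m "base"

# Clone the shared parent, apply the same bump, commit it, print the SHA.
make_bump() {
  git clone -q "$work/base" "$work/$1"
  echo '{"version":"4.3.0"}' > "$work/$1/package.json"
  git -C "$work/$1" add .
  git -C "$work/$1" commit -q -m "Auto-generated changes"
  git -C "$work/$1" rev-parse HEAD
}

sha_a="$(make_bump run-a)"
sha_b="$(make_bump run-b)"
echo "run A: $sha_a"
echo "run B: $sha_b"
```

Both runs print the same hash, so the two pushes try to move main to the same commit and only one can win the ref update.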

What this PR changes

For the release job in both workflows (build-test-publish / build-publish):

  1. Serialize releases per ref so two runs cannot race the push:

    concurrency:
      group: release-${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: false

    cancel-in-progress: false lets an in-flight release finish (publish + tag) rather than being cut mid-publish.

  2. Push with rebase-and-retry — replace git push with a loop that rebases onto the latest main and treats an identical commit already on main as success:

    for attempt in 1 2 3 4 5; do
      if git push origin HEAD:main; then ... break; fi
      git fetch origin main
      if git diff --quiet FETCH_HEAD -- .; then echo "already on main"; break; fi
      git rebase FETCH_HEAD || { git rebase --abort; exit 1; }
    done

    This survives main advancing between checkout and push and succeeds when the bump is already on main.
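A fuller sketch of the loop, exercised against throwaway local repos. This is an illustration under stated assumptions, not the script in the workflow files: `push_with_retry` is a hypothetical name, the elided success branch is filled with a plain message, and the function fails explicitly once all five attempts are used up, which the abbreviated loop above does not show.

```shell
#!/bin/sh
set -eu

# Hypothetical helper; "origin" and "main" follow the PR text.
push_with_retry() {
  for attempt in 1 2 3 4 5; do
    if git push -q origin HEAD:main; then
      echo "pushed on attempt $attempt"
      return 0
    fi
    # Push rejected: look at what main holds now.
    git fetch -q origin main || return 1
    # Same tree already on main means the bump landed; count it as success.
    if git diff --quiet FETCH_HEAD -- .; then
      echo "already on main"
      return 0
    fi
    # Replay the local commit on the new tip; stop on conflict.
    git rebase -q FETCH_HEAD || { git rebase --abort; return 1; }
  done
  echo "giving up after 5 attempts" >&2   # assumption: exhausting retries is a failure
  return 1
}

# Demo with local repos only; the temp directory is left in place for inspection.
work="$(mktemp -d)"
export GIT_AUTHOR_NAME=bot GIT_AUTHOR_EMAIL=bot@example.invalid
export GIT_COMMITTER_NAME=bot GIT_COMMITTER_EMAIL=bot@example.invalid

git init -q --bare "$work/origin.git"
git -C "$work/origin.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
git -C "$work/seed" commit -q --allow-empty -m "base"
git -C "$work/seed" push -q "$work/origin.git" HEAD:refs/heads/main

git clone -q "$work/origin.git" "$work/runner"
git clone -q "$work/origin.git" "$work/competitor"

# The runner commits its bump while still on the old tip.
echo '{"version":"4.3.0"}' > "$work/runner/package.json"
git -C "$work/runner" add .
git -C "$work/runner" commit -q -m "Auto-generated changes"

# A competitor advances origin/main before the runner pushes.
echo "unrelated change" > "$work/competitor/other.txt"
git -C "$work/competitor" add .
git -C "$work/competitor" commit -q -m "Another merge"
git -C "$work/competitor" push -q origin HEAD:main

cd "$work/runner"
outcome="$(push_with_retry)"
echo "$outcome"
```

The first push is rejected, the rebase replays the bump onto the competitor's commit, and the second attempt lands with both files on main.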

Concurrency semantics — what this does and does not guarantee

GitHub keeps one running and one pending run per group (queue: single, the default). cancel-in-progress: false protects the in-progress run, not a pending one: a newer same-ref run supersedes the pending run and takes its place.

This is safe because the release is two-pass and self-heals:

  • Publish runs only on a follow-up run where has-changes == '0'.
  • Every bump-push triggers a fresh run, and the last run in the group is never superseded.
  • That terminal run publishes the current package.json versions — the full accumulated bump — so the latest code always reaches npm once merges quiesce.

Supersession costs only intermediate releases, and only under burst merges. If a second merge lands while an earlier release is still running, the earlier release's pending publish run can be superseded; npm then never receives that intermediate version or its v… tag, and those changes fold into the next published version. On a normal cadence, with one release finishing before the next merge, nothing is superseded and every version publishes.
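The burst-merge case can be traced with a toy model of the behaviour described above: one running slot, one pending slot, and a newer run replacing whatever is pending. This models the PR's description only, not GitHub's scheduler; the run names and the `enqueue` and `finish` helpers are hypothetical.

```shell
#!/bin/sh
set -eu

running=""
pending=""
superseded=""
finished=""

# A new run arrives in the concurrency group.
enqueue() {
  if [ -z "$running" ]; then
    running="$1"
  else
    # Single pending slot: a newer run replaces the one that was waiting.
    [ -z "$pending" ] || superseded="$superseded $pending"
    pending="$1"
  fi
}

# The in-progress run completes and the pending run, if any, is promoted.
finish() {
  finished="$finished $running"
  running="$pending"
  pending=""
}

enqueue bump-A      # merge A: bump pass starts
enqueue publish-A   # A's bump-push triggers the publish pass, which waits
enqueue bump-B      # merge B lands while A is still running: replaces publish-A
finish              # bump-A completes, bump-B starts
enqueue publish-B   # B's bump-push triggers the terminal publish pass
finish              # bump-B completes, publish-B starts
finish              # publish-B publishes the accumulated versions

echo "finished:  $finished"
echo "superseded:$superseded"
```

Only publish-A is lost in this trace; publish-B is the last run in the group and carries both bumps.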

To give every bump its own version and tag, set queue: max (compatible with cancel-in-progress: false; the rejected combination is queue: max with cancel-in-progress: true). I left it at the default; flag if you want per-bump tags.

Scope and trade-offs

  • Tier 1 only. Hardening the push makes the bump land reliably, which triggers the publish pass. The two-pass design stays.
  • Eventual consistency remains. Bump and publish live in separate runs, so the release strands only if the follow-up run never triggers at all — e.g. the bump-push fails to start a workflow. Closing that gap means publishing in the same run after the bump-push, a separate change I can make if you prefer.
  • cancel-in-progress: false also touches PR runs sharing a ref. Rapid re-pushes to a PR queue the release job instead of cancelling the in-progress one. Acceptable and safer, but flagging it in case you prefer scoping concurrency to main.

Testing

Static. Both files validated with yq and ruby; the concurrency blocks and the new push script sit under the correct jobs and steps.

Concurrency semantics confirmed against GitHub docs. queue: single supersedes the pending run regardless of cancel-in-progress; queue: max is the escape hatch and is incompatible only with cancel-in-progress: true.

Behavioral — deterministic local reproduction. Built a git harness that recreates the failure window without GitHub or timing luck: a bare "origin", a "runner" clone left pointing at the old tip, and a competitor that advances origin/main before the runner pushes. Fixed commit dates and local-only repos keep it reproducible. Each scenario runs the old push (bare git push) and the new push (rebase-retry) against the same setup:

| Scenario | Old: bare git push | New: rebase-retry |
| --- | --- | --- |
| main advanced under the runner | [rejected]: step exits 1, release stranded | rebases and pushes; both changes preserved on main |
| bump already on main (idempotent re-entry) | | succeeds, no duplicate commit |
| competing bump on the same line | | fails on conflict (documented limit; concurrency prevents it in practice) |

The harness reproduces the failure class (push rejected → exit 1 → release stranded). The literal cannot lock ref … but expected … string is HTTPS-transport-specific and collapses to "up-to-date" against a local remote — the rebase-retry covers it regardless, and concurrency removes the concurrent-twin trigger.
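A minimal sketch of such a harness, covering only the first scenario with the old push. The PR's actual harness is not shown, so the repository layout, file names, and fixed timestamp here are assumptions; the explicit `git push origin main` stands in for the workflow's bare `git push`.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
export GIT_AUTHOR_NAME=bot GIT_AUTHOR_EMAIL=bot@example.invalid
export GIT_COMMITTER_NAME=bot GIT_COMMITTER_EMAIL=bot@example.invalid
# Fixed commit dates keep the resulting SHAs reproducible between runs.
export GIT_AUTHOR_DATE="1700000000 +0000" GIT_COMMITTER_DATE="1700000000 +0000"

# Bare "origin" with a single base commit on main.
git init -q --bare "$work/origin.git"
git -C "$work/origin.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
git -C "$work/seed" commit -q --allow-empty -m "base"
git -C "$work/seed" push -q "$work/origin.git" HEAD:refs/heads/main

# "Runner" clone: commits its bump but stays on the old tip.
git clone -q "$work/origin.git" "$work/runner"
echo '{"version":"4.3.0"}' > "$work/runner/package.json"
git -C "$work/runner" add .
git -C "$work/runner" commit -q -m "Auto-generated changes"

# Competitor advances origin/main before the runner pushes.
git clone -q "$work/origin.git" "$work/competitor"
echo "unrelated change" > "$work/competitor/other.txt"
git -C "$work/competitor" add .
git -C "$work/competitor" commit -q -m "Another merge"
git -C "$work/competitor" push -q origin HEAD:main

# Old behaviour: a single push with no fetch, rebase, or retry.
if git -C "$work/runner" push -q origin main 2>"$work/push.log"; then
  old_push="accepted"
else
  old_push="rejected"
fi
echo "bare push: $old_push"
```

In the workflow a rejected push ends the step with exit code 1, and origin/main keeps the competitor's commit without the runner's bump.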

Not yet run on a live release. Canary against a low-traffic block PR (flip its build.yaml to @v4-beta) and confirm the bump, npm publish, and tag all complete before promoting to v4.

The release job's "Commit changed files to main" step ran `changeset version`
then a bare `git push`, with no serialization and no retry. When `main`
advanced between checkout and push — a concurrent run producing the identical
"Auto-generated changes" commit, another merge, or a queued run — the push
failed with "cannot lock ref 'refs/heads/main': is at X but expected Y", the
step exited 1, and the release was left half-done: versions bumped on main but
npm publish + tag (gated on the follow-up run) never happened. Re-running could
not recover, since a re-run checks out the frozen run head, now behind main.

- Add per-ref `concurrency` (cancel-in-progress: false) to serialize releases
  so two runs cannot race the push.
- Replace the bare push with a rebase-and-retry loop that treats an identical
  commit already on main as success (idempotent).

Applied to both node-simple-pnpm.yaml and node-matrix-pnpm.yaml.

Correct the concurrency comment: with queue:single (default) a pending
publish run is superseded, not queued, when a newer same-ref run arrives.
Document why this is safe (two-pass release self-heals; only an intermediate
version/tag is skipped under burst merges) and note queue:max as the
per-bump-tag escape hatch. Reconcile the push-loop comment with the
concurrency block.