
chore(scenarios): rewrite 001 + 003 to walkable contracts; promote both to active #40

Merged
bborbe merged 2 commits into master from chore/scenarios-walkable on Jun 3, 2026

@bborbe (Owner) commented Jun 3, 2026

Summary

Original scenarios 001 + 003 required infrastructure that was impractical to orchestrate from a single Claude Code session:

  • 001: fresh `PATH=$PATH_MASKED claude` subprocess running `/coding:code-review` and `/coding:pr-review`
  • 003: LLM-call counter PATH-shim wrapping every `claude` invocation inside a real `/coding:pr-review` run

Rewrote both to test the code under test at a tighter scope, so the contract is satisfiable from any operator's terminal without bespoke harnesses. Then walked both end-to-end on master sha `b695450` and promoted to `status: active`.

Scope changes

| Scenario | Before | After |
|---|---|---|
| 001 | Run `/coding:code-review` under masked PATH; assert dispatcher aborts | Extract the Step 4.0 bash block verbatim from `pr-review.md` + `code-review.md`, plus runner Step 0 from `ast-grep-runner.md`; run each under masked PATH; assert exit 1 + documented stderr + JSON contract |
| 003 | LLM-call count ≤30 via shim wrapping `claude` | Distinct Owners-with-findings ≤30 via direct funnel measurement (structurally equivalent because Step 4b invokes one Task per Owner) |
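For 001, the rewritten contract amounts to "run a shell snippet with an empty PATH and inspect exit code and stderr". A minimal Python sketch of that harness follows. `PREFLIGHT` is a hypothetical stand-in for the Step 4.0 block (the real text lives in `commands/pr-review.md` and is not reproduced in this PR); only the stderr message comes from the walk results, and `run_masked` is an illustrative name.

```python
import os
import subprocess
import tempfile

# Hypothetical stand-in for the Step 4.0 preflight block. The real text lives
# under "#### 4.0:" in commands/pr-review.md and commands/code-review.md; only
# the stderr message below is taken from the walk results.
PREFLIGHT = """
if ! command -v ast-grep >/dev/null 2>&1 && ! command -v sg >/dev/null 2>&1; then
  echo "ast-grep/sg not in PATH" >&2
  exit 1
fi
"""


def run_masked(script: str) -> subprocess.CompletedProcess:
    """Run `script` with /bin/sh while PATH points at an empty directory."""
    with tempfile.TemporaryDirectory() as empty_dir:
        env = dict(os.environ, PATH=empty_dir)  # a copy; os.environ is untouched
        return subprocess.run(
            ["/bin/sh", "-c", script],  # absolute path, since PATH resolves nothing
            env=env,
            capture_output=True,
            text=True,
        )


if __name__ == "__main__":
    host_path = os.environ.get("PATH")
    result = run_masked(PREFLIGHT)
    print("exit:", result.returncode)           # exit: 1
    print("stderr:", result.stderr.strip())     # stderr: ast-grep/sg not in PATH
    assert os.environ.get("PATH") == host_path  # host PATH unaffected
```

Passing a modified copy of the environment to the child, rather than exporting a new PATH, is what keeps the "host shell PATH unaffected" checkbox trivially true.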

The full Phase-10 LLM-count measurement is deferred to a companion scenario 004 — not written yet because it still needs the LLM-shim infrastructure; 003 captures the load-bearing decoupling proof via Owner count.

Walk results

001 — 10/10 PASS

  • pr-review Step 4.0: exit 1, stderr contains "ast-grep/sg not in PATH"
  • code-review Step 4.0: exit 1, parity stderr
  • runner Step 0: exit 1, JSON `yamls_run=0`, `findings_by_owner={}`, `errors[0].kind=missing-tool`
  • Preflight wall: 10 ms (target ≤ 1s)
  • Host shell PATH unaffected after all subshells
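The runner Step 0 checkbox can be expressed as a validator over the runner's JSON. This sketch assumes only the three fields named in this PR (`stats.yamls_run`, `findings_by_owner`, `errors[].kind`); `missing_tool_violations` is a hypothetical helper and the sample payload is illustrative, not captured runner output.

```python
import json


def missing_tool_violations(raw: str) -> list[str]:
    """Return contract violations for a runner result produced without ast-grep."""
    doc = json.loads(raw)
    problems = []
    if doc.get("stats", {}).get("yamls_run") != 0:
        problems.append("stats.yamls_run should be 0")
    if doc.get("findings_by_owner") != {}:
        problems.append("findings_by_owner should be {}")
    errors = doc.get("errors") or []
    if not errors or errors[0].get("kind") != "missing-tool":
        problems.append("errors[0].kind should be 'missing-tool'")
    return problems


# Illustrative payload shaped after the walk results; real output may carry more fields.
sample = '{"stats": {"yamls_run": 0}, "findings_by_owner": {}, "errors": [{"kind": "missing-tool"}]}'
print(missing_tool_violations(sample))  # []
```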

003 — 5/5 PASS

  • 100-file fixture committed
  • Funnel wall: 64 ms (target ≤ 30000 ms; 469× margin)
  • Distinct owners: 2 (target ≤ 30; 15× margin) — go-architecture-assistant + go-time-assistant
  • Findings count: 115 (negative control: > 0)
  • Distinct rule_ids: 4 (target ≥ 3) — constructor-returns-interface, no-globals-or-singletons, no-raw-go-func, no-time-now-direct
  • Citation discipline: every surfaced rule_id present in `rules/index.json`
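The 003 checkboxes reduce to a few set operations over the funnel output, plus the margin arithmetic (budget divided by observed value: 30000 / 64 ≈ 469, 30 / 2 = 15). The sketch assumes `findings_by_owner` maps an Owner name to a list of findings that each carry a `rule_id`, and that the rule IDs from `rules/index.json` are already loaded into a set; both shapes are assumptions. The Owner and rule names come from the walk, but the owner-to-rule assignment and the four-finding sample are hypothetical.

```python
def summarize_funnel(findings_by_owner: dict, indexed_rule_ids: set) -> dict:
    """Compute the scenario-003 metrics from the mechanical funnel output."""
    owners = {owner for owner, found in findings_by_owner.items() if found}
    rule_ids = {f["rule_id"] for found in findings_by_owner.values() for f in found}
    return {
        "distinct_owners": len(owners),
        "findings_count": sum(len(found) for found in findings_by_owner.values()),
        "distinct_rule_ids": len(rule_ids),
        "uncited_rule_ids": sorted(rule_ids - indexed_rule_ids),  # citation discipline
    }


def margin(budget: float, observed: float) -> float:
    """How many times the observed value fits inside the budget."""
    return budget / observed


# Hypothetical owner-to-rule assignment; only the names come from the walk.
funnel = {
    "go-architecture-assistant": [
        {"rule_id": "constructor-returns-interface"},
        {"rule_id": "no-globals-or-singletons"},
        {"rule_id": "no-raw-go-func"},
    ],
    "go-time-assistant": [{"rule_id": "no-time-now-direct"}],
}
index = {"constructor-returns-interface", "no-globals-or-singletons",
         "no-raw-go-func", "no-time-now-direct"}

print(summarize_funnel(funnel, index))
print(round(margin(30000, 64)), margin(30, 2))  # 469 15.0
```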

Aftermath

All 3 scenarios on master will be `active` after merge.

Test plan

  • `make precommit` clean (131 rules, 28 mechanical YAMLs, no drift)
  • First successful walk of each scenario, evidence in commit message
  • Bot review

…th to active

Original contracts required infrastructure (fresh PATH-masked claude
subprocess for 001; LLM-call counter shim wrapping claude for 003)
that is impractical to orchestrate from a single Claude Code session.
Rewrote both to test the CODE UNDER TEST at a tighter scope:

  001 — extracts the Step 4.0 bash block verbatim from pr-review.md
  + code-review.md + the Step 0 block from ast-grep-runner.md, runs
  each under a masked PATH, asserts exit 1 + documented stderr + JSON
  contract. That IS the dispatcher's preflight — the slash command
  just invokes this bash. No fresh-claude subprocess needed.

  003 — replaces 'LLM-call count ≤30' (which required an LLM-shim
  + real /coding:pr-review run) with 'distinct Owners with findings
  ≤30' (structurally equivalent because Step 4b invokes one Task per
  Owner with findings; the LLM ceiling is the Owner count plus a
  small fixed overhead). Measurable directly from the mechanical
  funnel output. The full Phase-10 LLM-count measurement deferred to
  companion scenario 004 (not yet written).

First successful walk 2026-06-03 on master sha b695450:

  001: 10/10 checkboxes PASS.
    pr-review Step 4.0   — exit 1, stderr contains 'ast-grep/sg not in PATH'
    code-review Step 4.0 — exit 1, parity stderr
    runner Step 0        — exit 1, JSON: yamls_run=0, findings_by_owner={},
                           errors[0].kind='missing-tool'
    preflight wall       — 10 ms (≤ 1s target)
    host shell PATH      — unaffected after subshells

  003: 5/5 checkboxes PASS.
    100-file fixture     — 100 .go files committed
    funnel wall          — 64 ms (≤ 30000 ms target; 469× margin)
    distinct owners      — 2 (≤ 30 target; 15× margin: go-architecture-assistant
                           + go-time-assistant)
    findings_count       — 115 (> 0 target — negative control passes)
    distinct rule_ids    — 4 (≥ 3 target): constructor-returns-interface,
                           no-globals-or-singletons, no-raw-go-func,
                           no-time-now-direct
    citation discipline  — every surfaced rule_id present in rules/index.json

After this commit all 3 scenarios on master are 'active'.
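The commit's structural-equivalence argument for 003 can be written as a bound: Step 4b dispatches one Task per Owner with findings, so LLM calls for a run are at most that Owner count plus a small fixed overhead. A minimal sketch, where `plan_step_4b`, `llm_call_ceiling`, and the `fixed_overhead` value are hypothetical (the commit only says the overhead is small and fixed), and Owners with empty finding lists are assumed to receive no Task:

```python
def plan_step_4b(findings_by_owner: dict) -> list:
    """One Task per Owner with at least one finding, sorted for a stable order."""
    return sorted(owner for owner, found in findings_by_owner.items() if found)


def llm_call_ceiling(findings_by_owner: dict, fixed_overhead: int) -> int:
    """Upper bound on LLM calls: one per dispatched Task plus a fixed overhead."""
    return len(plan_step_4b(findings_by_owner)) + fixed_overhead


OWNER_CEILING = 30  # scenario 003 threshold on distinct Owners with findings

findings = {
    "go-architecture-assistant": ["f1", "f2"],
    "go-time-assistant": ["f3"],
    "go-concurrency-assistant": [],  # hypothetical Owner with no findings: no Task
}
tasks = plan_step_4b(findings)
assert len(tasks) <= OWNER_CEILING
print(tasks)  # ['go-architecture-assistant', 'go-time-assistant']
print(llm_call_ceiling(findings, fixed_overhead=2))  # 4
```

Because the bound depends only on Owners with findings, doubling the fixture's file count leaves it unchanged as long as no new Owner starts producing findings, which is the decoupling the deferred scenario 004 would measure directly.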

@ben-s-pull-request-reviewer (Bot) left a comment

PR Review: chore/scenarios-walkable → master

Scope: 2 files changed — both are scenario walkthrough documents (.md).


Must Fix (Critical)

None — no production code, no security-sensitive changes.


Should Fix (Important)

  1. scenarios/003-scaling-funnel-100-files.md:7 — Companion scenario 004 is referenced as if written, but scenarios/004-funnel-decoupling-doubled-fixture.md does not exist in the repo. The scenario description calls it out as the Phase-10 LLM-call measurement but this is a forward reference with no implementation. Future readers will follow the link and find nothing.

  2. scenarios/001-toolchain-preflight.md:48 — Wall-clock assertion tightened from ≤30 s to ≤1 s. The plan's concerns note CI environment variability. A single `command -v` check in a subshell is unlikely to come near 1 s, but the rationale for the 30× reduction should be documented inline (e.g., `# 1s is generous: the block is a single command -v check + exit; ast-grep itself never runs`).


Nice to Have (Optional)

  1. scenarios/003-scaling-funnel-100-files.md — The Expected section tests distinct Owner count via a Python one-liner that reads rules/index.json. If the index is stale (not regenerated after a new rule is added), the distinct-Owner intersection may silently miss newly added rules. Consider adding a note that make build-index must be run before executing the scenario.

  2. scenarios/003-scaling-funnel-100-files.md — The fixture's four violation layers map cleanly to four rule_ids (go-architecture/constructor-returns-interface, go-concurrency/no-raw-go-func, go-architecture/no-globals-or-singletons, go-time/no-time-now-direct). This is correct and all four are in rules/index.json. No action needed.

  3. scenarios/001-toolchain-preflight.md — The setup step extracts bash blocks from commands/pr-review.md, commands/code-review.md, and agents/ast-grep-runner.md by searching for section headers. If any of those section headers shift (e.g., a later refactor renumbers steps), the extraction will silently pick up nothing or the wrong block. This is an accepted fragility of text-based extraction but worth noting.
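One way to remove the silent-failure mode in item 3 is to make extraction fail loudly when the anchor or its block is missing. A Python sketch, assuming the target is the first `bash`-tagged fenced block after a line that starts with the anchor, with no other heading in between (the scenario itself uses awk, whose exact rules are not shown here); `extract_bash_block` and the sample headings are hypothetical.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def extract_bash_block(markdown: str, anchor: str) -> str:
    """Return the body of the first bash-tagged fenced block after `anchor`.

    Raises ValueError instead of returning nothing when the anchor line or
    the block is missing, so a renamed heading fails the scenario loudly.
    """
    lines = markdown.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith(anchor)), None)
    if start is None:
        raise ValueError(f"anchor not found: {anchor!r}")
    body, inside = [], False
    for ln in lines[start + 1:]:
        stripped = ln.strip()
        if inside:
            if stripped == FENCE:
                return "\n".join(body)
            body.append(ln)
        elif stripped == FENCE + "bash":
            inside = True
        elif stripped.startswith("#"):
            break  # reached the next heading without finding a block
    raise ValueError(f"no bash block under anchor: {anchor!r}")


doc = "\n".join([
    "#### 4.0: Toolchain preflight",  # hypothetical heading text after the anchor
    "Prose before the block.",
    FENCE + "bash",
    "command -v ast-grep >/dev/null || exit 1",
    FENCE,
    "#### 4.1: Next step",
])
print(extract_bash_block(doc, "#### 4.0:"))
```

Passing a stable marker such as `<!-- SCEN001_ANCHOR -->` as `anchor` works the same way and would decouple the scenario from heading text.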


Concerns from Plan

| Concern | Status |
|---|---|
| Scenario 001 bash block drift from source docs | Partially addressed: all three source blocks verified to exist at the expected section headers; risk of drift remains (see Nice to Have #3) |
| Scenario 003 fixture / rule_ids mismatch | Addressed: all four rule_ids verified present in rules/index.json |
| Wall-clock ≤1 s achievability in CI | Open concern: tightened assertion is aggressive; no evidence it fails; see Should Fix #2 |
| Companion scenario 004 not written | Open concern: forward reference in scenario 003 is unresolved; see Should Fix #1 |

Next Steps

  • /coding:write-test not applicable — no production code changed.
  • Resolve Should Fix #1 by either writing scenarios/004-funnel-decoupling-doubled-fixture.md or removing the forward reference from scenario 003.
  • Resolve Should Fix #2 by adding an inline comment in scenario 001 justifying the 1 s wall-clock budget.

{
  "verdict": "request-changes",
  "summary": "Both changed files are scenario walkthroughs with no production code. Two should-fix issues were found: scenario 003 references a non-existent companion scenario 004, and scenario 001's 1-second wall-clock assertion is aggressive without inline justification. The mechanical-layer rule-id mapping in scenario 003 is verified correct against rules/index.json.",
  "comments": [
    {
      "file": "scenarios/003-scaling-funnel-100-files.md",
      "line": 7,
      "severity": "major",
      "message": "Companion scenario 004 (004-funnel-decoupling-doubled-fixture.md) is referenced but does not exist. The scenario description calls it the Phase-10 LLM-call measurement but it has not been written. Either write the companion or remove the forward reference."
    },
    {
      "file": "scenarios/001-toolchain-preflight.md",
      "line": 48,
      "severity": "major",
      "message": "Wall-clock assertion was tightened from ≤30s to ≤1s (30× reduction) without inline justification. Add a comment explaining that the block is a single `command -v` check that exits immediately, making 1s a generous budget even in CI environments with slower disk I/O."
    },
    {
      "file": "scenarios/003-scaling-funnel-100-files.md",
      "line": 52,
      "severity": "nit",
      "message": "The distinct-Owner count test reads rules/index.json directly. If the index is stale (not regenerated after a new rule is added), the Python intersection may silently miss new rules. Consider noting that make build-index should be run before executing this scenario."
    },
    {
      "file": "scenarios/001-toolchain-preflight.md",
      "line": 19,
      "severity": "nit",
      "message": "Bash block extraction relies on searching for section header text (e.g., '#### 4.0:'). If commands/pr-review.md or agents/ast-grep-runner.md is refactored and section headers shift, the extraction will silently pick up the wrong block. Consider noting this as an accepted fragility."
    }
  ],
  "concerns_addressed": [
    "correctness: scenario 001 bash block drift — blocks verified to exist at expected section headers; risk remains (see nit comment)",
    "correctness: scenario 003 rule_ids — all four fixture violation types map to rule_ids present in rules/index.json",
    "tests: wall-clock ≤1s — open concern, no evidence of failure in CI; needs inline justification (see major comment)",
    "correctness: companion scenario 004 not written — not addressed; forward reference remains (see major comment)"
  ]
}

…nale, NITs

MAJORs:
- 003 description: removed broken markdown link to non-existent
  004-funnel-decoupling-doubled-fixture.md. Reworded as 'deferred
  to a future scenario 004 — not yet written' so the forward
  reference no longer pretends to link to a file.
- 001 Expected: added inline rationale for the 1s wall-clock budget.
  Each block is a single command -v check plus exit; 1s is generous
  even on slow CI I/O. The regression risk is the 30-min loop on
  coding#34, so any threshold under 30 min proves the contract, and
  1s catches it with an 1800x margin.

NITs:
- 001 Setup: noted the awk extraction is anchored on literal section
  headers (#### 4.0:, ### 0. Preflight:). If a refactor renames the
  heading, the extraction silently picks up the wrong block. Future
  edits should re-anchor.
- 003 Setup: added 'make build-index before scanning' precondition.
  The Owner-count assertion reads rules/index.json directly; a stale
  index would silently miss new rule_ids during the intersection.
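The `make build-index` precondition could also be checked mechanically. A minimal sketch, assuming staleness can be approximated by comparing modification times of `rules/index.json` against the rule source files; `index_is_stale` and the temporary paths are hypothetical, and mtime is only a heuristic (a fresh `git checkout` resets mtimes, so it can report a false "stale").

```python
import os
import tempfile


def index_is_stale(index_path: str, rule_paths: list[str]) -> bool:
    """True when the index is missing or older than any rule source file."""
    if not os.path.exists(index_path):
        return True
    index_mtime = os.path.getmtime(index_path)
    return any(os.path.getmtime(p) > index_mtime for p in rule_paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        rule = os.path.join(root, "rule.md")      # hypothetical rule file
        index = os.path.join(root, "index.json")  # stands in for rules/index.json
        for path in (rule, index):
            open(path, "w").close()
        os.utime(rule, (1_000, 1_000))
        os.utime(index, (2_000, 2_000))
        print(index_is_stale(index, [rule]))  # False: index built after the rule
        os.utime(rule, (3_000, 3_000))
        print(index_is_stale(index, [rule]))  # True: rule edited after the build
```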
@ben-s-pull-request-reviewer (Bot) dismissed their stale review June 3, 2026 10:06

superseded by new automated review

@ben-s-pull-request-reviewer (Bot) left a comment

{
  "verdict": "approve",
  "summary": "PR is a documentation-only refactor promoting two scenario markdown files from draft to active status. Changes include extracting inline bash blocks for preflight testing (replacing full slash-command invocations), streamlining the 100-file fixture generation, and shifting measurement from LLM-call counting to structural Owner-count ceiling analysis. No code files were modified; all ast-grep rules target Go code and would produce no findings against markdown content.",
  "comments": [
    {
      "file": "scenarios/001-toolchain-preflight.md",
      "line": 19,
      "severity": "nit",
      "message": "Fragility note acknowledged by author: extraction relies on literal heading `#### 4.0:` — future renumbering will silently pick up wrong block. Consider adding a comment in commands/pr-review.md near that block with a stable anchor marker (e.g., `<!-- SCEN001_ANCHOR -->`) that the scenario can reference instead of the heading text."
    },
    {
      "file": "scenarios/001-toolchain-preflight.md",
      "line": 46,
      "severity": "nit",
      "message": "Cleanup uses brace expansion `/tmp/scen001-*.{stderr,exit,stdout}`, which POSIX sh does not define (dash, for example, leaves the braces unexpanded). Could use explicit globs or separate rm calls for maximum portability, though this is a minor concern for a scenario intended for Claude Code environments."
    },
    {
      "file": "scenarios/003-scaling-funnel-100-files.md",
      "line": 55,
      "severity": "nit",
      "message": "Hardcoded `cd ~/Documents/workspaces/coding` — if operator's coding repo root differs, the ast-grep scan will run against the wrong tree. Consider using `${CODING_ROOT:-$PWD}` or requiring the operator to export `CODING_ROOT` in setup."
    },
    {
      "file": "scenarios/003-scaling-funnel-100-files.md",
      "line": 55,
      "severity": "nit",
      "message": "ast-grep presence is validated at line 50 (`ast-grep --version` resolves on host) but the mechanical funnel step at line 55 has no explicit error if ast-grep is present but broken. A broken ast-grep binary could produce a 0-owner result that passes the ≤30 check. Consider adding an explicit pre-validation that ast-grep scan returns exit 0 against the fixture."
    }
  ],
  "concerns_addressed": [
    "correctness: heading anchor brittleness in 001 — documented as fragility note, not fixed",
    "correctness: brace expansion portability in 001 cleanup — not addressed, minor",
    "correctness: hardcoded ~/Documents/workspaces/coding in 003 Step 2 — documented as issue, not fixed",
    "tests: owner-count structural correctness in 003 — addressed by new ≤30 structural ceiling approach instead of LLM-call counting",
    "correctness: ast-grep missing validation in 003 — acknowledged but not structurally fixed"
  ]
}
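Two of the nits above, the hardcoded checkout path and a present-but-broken ast-grep, can be guarded together: resolve the repo root from an environment variable and refuse to use scanner output unless it exited 0. (Scenario 003's `findings_count > 0` negative control already catches a silent zero-output run; the exit-code check fails earlier and says why.) A Python sketch: `CODING_ROOT` is the variable the bot proposes, `coding_root` and `run_scanner` are hypothetical helpers, and the scanner command is passed in because the PR does not show the exact ast-grep invocation.

```python
import os
import subprocess
import sys


def coding_root() -> str:
    """Equivalent of ${CODING_ROOT:-$PWD}: the env var if set and non-empty, else cwd."""
    return os.environ.get("CODING_ROOT") or os.getcwd()


def run_scanner(cmd: list[str], cwd: str) -> str:
    """Run the scanner; raise instead of returning output from a failed run."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"scanner exited {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command for demonstration; substitute the real ast-grep scan.
    out = run_scanner([sys.executable, "-c", "print('ok')"], cwd=coding_root())
    print(out.strip())  # ok
```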

@ben-s-pull-request-reviewer (Bot) left a comment

{
  "verdict": "approve",
  "summary": "Both scenario rewrites are well-structured contract documents. Scenario 001 correctly anchors extraction on literal section headers (`#### 4.0:` and `### 0. Preflight:`) that exist in the source files, and its JSON output schema fields (`stats.yamls_run`, `findings_by_owner`, `errors[].kind`) match the actual runner output documented in `agents/ast-grep-runner.md`. Scenario 003's structural claim (≤30 distinct Owners bounding Step 4b invocations) is verified: the index has 17 distinct owners, well under the ceiling. No issues found.",
  "comments": [],
  "concerns_addressed": [
    "correctness: Scenario 001 extraction anchors verified — `#### 4.0:` exists in pr-review.md:96 and code-review.md:77; `### 0. Preflight:` exists in ast-grep-runner.md:57",
    "correctness: Scenario 001 JSON contract fields match runner schema — `stats.yamls_run`, `findings_by_owner`, `errors[].kind == \"missing-tool\"` all present in runner output at ast-grep-runner.md:75",
    "correctness: Scenario 003 structural equivalence holds — 17 distinct owners in rules/index.json (≤30 ceiling confirmed)",
    "tests: Scenario 001 error message `ast-grep/sg not in PATH` matches verbatim between pr-review.md:102 and code-review.md:83"
  ]
}

@bborbe merged commit 3e0d052 into master on Jun 3, 2026
1 check passed
@bborbe deleted the chore/scenarios-walkable branch June 3, 2026 10:50
bborbe added a commit that referenced this pull request Jun 3, 2026
…bot review

Bot correctly flagged that scenarios/004 was not referenced in
README.md or llms.txt; broader audit found scenarios/001-003 were
also never indexed (oversight in PRs #37, #38, #40). Closing the
gap retroactively for all 4 active scenarios.

README.md: new 'Acceptance Scenarios' section above Contributing
with a 4-row table linking each scenario file + one-line
'Validates...' description.

llms.txt: parallel section at the bottom so AI agents relying on
llms.txt for discovery can find the scenarios alongside guides.
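The indexing gap this follow-up closes (scenario files that neither README.md nor llms.txt mention) is easy to check mechanically. A sketch that treats a scenario as indexed when its filename appears anywhere in a document's text; that matching rule, the helper name `unindexed`, and the README snippet are assumptions, and only the two scenario filenames confirmed in this PR's reviews are used.

```python
def unindexed(scenario_files: list[str], docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each doc name to the scenario files its text never mentions."""
    return {
        name: [f for f in scenario_files if f not in text]
        for name, text in docs.items()
    }


scenarios = [
    "001-toolchain-preflight.md",
    "003-scaling-funnel-100-files.md",
]
docs = {
    # Hypothetical README row; the real table layout is not shown in the commit.
    "README.md": "| [001](scenarios/001-toolchain-preflight.md) | Validates ... |",
    "llms.txt": "",
}
print(unindexed(scenarios, docs))
```

A check like this in `make precommit` would have caught the omission in #37, #38, and #40 at PR time rather than retroactively.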
bborbe added a commit that referenced this pull request Jun 3, 2026
/coding:commit's pipeline-only detection lumped scenarios/ with
prompts/ and specs/ — meaning a PR adding ONLY scenario files
would route to Workflow E (commit + push, no changelog entry).
That's wrong for repos where scenarios are shipped acceptance
contracts that users invoke via /dark-factory:run-scenario; in
that role they're release-relevant artifacts on the same footing
as docs/ and rules/.

Without this fix, today's PRs #37 (3 scenarios drafted), #40
(scenarios rewritten + promoted to active), and #41 (4th scenario
added) would each have shipped with no changelog record. The
v0.15.0 release notes that this PR drafts had to retroactively
list scenarios as a feature.

Updated commands/commit.md:139-153 — drop scenarios/ from the
pipeline-only list and the corresponding rationale paragraph.
Prompts and specs remain pipeline-only (dark-factory daemon
runtime state, not shipped artifacts).
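After this change the routing rule reads as a predicate: a change set is pipeline-only (Workflow E, no changelog entry) only when every changed path sits under `prompts/` or `specs/`. This is a hypothetical Python rendering of the rule described for `commands/commit.md`, not its actual implementation; `is_pipeline_only` is an illustrative name, and treating an empty change set as not pipeline-only is an assumption.

```python
PIPELINE_ONLY_DIRS = ("prompts/", "specs/")  # scenarios/ deliberately absent


def is_pipeline_only(changed_paths: list[str]) -> bool:
    """True when every changed path is dark-factory runtime state."""
    # An empty change set is treated as not pipeline-only (assumption).
    return bool(changed_paths) and all(
        p.startswith(PIPELINE_ONLY_DIRS) for p in changed_paths
    )


print(is_pipeline_only(["prompts/x.md", "specs/y.md"]))           # True
print(is_pipeline_only(["scenarios/001-toolchain-preflight.md"]))  # False
```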