Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 42 additions & 0 deletions scenarios/001-toolchain-preflight.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
---
status: draft
---

# Scenario 001: Dispatcher fail-fasts when ast-grep binary is missing

Validates that `/coding:code-review` and `/coding:pr-review` abort with an actionable error when `ast-grep` / `sg` is absent from PATH — closing the silent-empty-review failure mode observed on [bborbe/coding#34](https://github.com/bborbe/coding/pull/34).

## Setup

- [ ] Build a masked PATH that excludes any directory containing `ast-grep` or `sg`:
```bash
PATH_MASKED=$(echo "$PATH" | tr ':' '\n' | while read d; do
[ -x "$d/ast-grep" ] || [ -x "$d/sg" ] || echo "$d"
done | paste -sd: -)
```
- [ ] `PATH=$PATH_MASKED command -v ast-grep; echo "exit=$?"` prints `exit=1`
- [ ] `PATH=$PATH_MASKED command -v sg; echo "exit=$?"` prints `exit=1`
- [ ] Create a minimal Go fixture: `WORK=$(mktemp -d) && cd "$WORK" && git init -q && printf 'package main\n\nfunc main() {}\n' > main.go && git add . && git commit -qm initial`
- [ ] Host shell `command -v ast-grep` (outside any subshell) still resolves a path

## Action

- [ ] In a fresh Claude Code session launched as `PATH=$PATH_MASKED claude`, run `/coding:code-review master` against `$WORK`; tee stdout to `/tmp/cr-stdout.log`, stderr to `/tmp/cr-stderr.log`, and capture exit code: `... ; echo $? > /tmp/cr-exit`
- [ ] In a fresh Claude Code session launched as `PATH=$PATH_MASKED claude`, run `/coding:pr-review master` against `$WORK`; same tee + exit-code capture to `/tmp/pr-{stdout,stderr,exit}.log`
- [ ] Direct runner test: in a `PATH=$PATH_MASKED` subshell, invoke the `coding:ast-grep-runner` agent via `claude` with the prompt `TARGET_DIR=$WORK. Run every YAML in rules/<lang>/*.yml. Return findings grouped by Owner.` — tee stdout to `/tmp/runner-stdout.log`, capture exit code to `/tmp/runner-exit`

## Expected

- [ ] `/tmp/cr-stderr.log` contains the literal string `ast-grep/sg not in PATH`
- [ ] `cat /tmp/cr-exit` prints `1`
- [ ] `grep -c 'coding:ast-grep-runner agent:' /tmp/cr-stdout.log` returns `0` (Step 4a was never invoked)
- [ ] `/tmp/pr-stderr.log` contains `ast-grep/sg not in PATH` and `cat /tmp/pr-exit` prints `1`
- [ ] `jq -e '.errors[] | select(.kind == "missing-tool" and .tool == "ast-grep")' /tmp/runner-stdout.log` exits 0
- [ ] `jq '.stats.yamls_run' /tmp/runner-stdout.log` returns `0` and `jq '.findings_by_owner' /tmp/runner-stdout.log` returns `{}`
- [ ] `cat /tmp/runner-exit` prints `1`
- [ ] Each invocation's wall-clock under 30 seconds (`time` output's `real` < 30s) — the regression risk being locked down is a 30-min `activeDeadlineSeconds` kill instead of immediate fail
- [ ] Host shell `command -v ast-grep` (re-run after all subshells) still resolves a path

## Cleanup

- `rm -rf "$WORK" /tmp/cr-* /tmp/pr-* /tmp/runner-*`
32 changes: 32 additions & 0 deletions scenarios/002-clean-pr-zero-findings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
---
status: draft
---

# Scenario 002: Zero-violation PR in standard mode produces empty findings, no false positives

Validates that a diff with no mechanical-rule violations flows through `/coding:code-review master standard` (Step 4.0 → 4a → 4b → 4c → 4d → 5) and produces a report with empty Must Fix / Should Fix / Nice to Have sections — locking down the regression risk that the LLM tier hallucinates findings to fill the void when the mechanical layer surfaces none.

## Setup

- [ ] Clone master at the current HEAD: `WORK=$(mktemp -d) && cd "$WORK" && git clone --depth=1 --branch master git@github.com:bborbe/coding.git . && git checkout -b clean-pr-fixture`
- [ ] Apply a README-only whitespace edit (no `.go` / `.py` / `rules/**` / `docs/**` touched): `sed -i.bak 's/^# bborbe\/coding$/# bborbe\/coding /' README.md && rm README.md.bak && git add README.md && git commit -qm 'docs: typo whitespace'`
- [ ] `git diff --name-only master..HEAD` prints exactly one line: `README.md`
- [ ] `ast-grep --version` resolves on host (Step 4.0 preflight will pass)

## Action

- [ ] Run `/coding:code-review master` against `$WORK` in a fresh Claude Code session (standard mode is the default — do not pass a mode argument); tee stdout to `/tmp/code-review-stdout.log`, stderr to `/tmp/code-review-stderr.log`, capture exit code to `/tmp/code-review-exit`

## Expected

- [ ] `cat /tmp/code-review-exit` prints `0`
- [ ] Runner block in stdout has `findings_count: 0`: extract the JSON block via `awk '/^{$/,/^}$/' /tmp/code-review-stdout.log > /tmp/code-review-runner.json && jq '.stats.findings_count' /tmp/code-review-runner.json` returns `0`
- [ ] Report includes a `Must Fix (Critical)` section whose body line is exactly `None.` — verified via: `awk '/^#### Must Fix/{flag=1; next} /^#### /{flag=0} flag && NF' /tmp/code-review-stdout.log` prints exactly `None.`
- [ ] Report includes a `Should Fix (Important)` section whose body line is exactly `None.` (same `awk` shape)
- [ ] Report includes a `Nice to Have (Optional)` section whose body line is exactly `None.` (same `awk` shape)
- [ ] `grep -c 'dropped finding' /tmp/code-review-stderr.log` returns `0` — citation validator had nothing to drop because no findings were generated
- [ ] Run completes in under 5 minutes (`time` output's `real` < 5m)

## Cleanup

- `rm -rf "$WORK" /tmp/code-review-*`
71 changes: 71 additions & 0 deletions scenarios/003-scaling-funnel-100-files.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
---
status: draft
---

# Scenario 003: 100-file synthetic PR completes review in ≤30 LLM calls

Validates that a 100-file synthetic PR with known mechanical violations completes `/coding:pr-review master standard` within ≤30 LLM calls and ≤30 minutes — proving the ast-grep funnel decouples LLM cost from PR size and stays under the prod bot's `activeDeadlineSeconds=1800` ceiling. Companion decoupling scenario (200-file re-run with ≤30 LLM calls) lives separately as scenario 004.

## Setup

- [ ] Build the synthetic-PR fixture:
```bash
WORK=$(mktemp -d) && cd "$WORK" && git init -q
mkdir -p pkg/{handler,service,store,worker,internal}/{user,order,product,customer,billing}
i=0
for dir in pkg/handler/{user,order,product,customer,billing} \
pkg/service/{user,order,product,customer,billing} \
pkg/store/{user,order,product,customer,billing} \
pkg/worker/{user,order,product,customer,billing} \
pkg/internal/{user,order,product,customer,billing}; do
for n in 1 2 3 4; do
i=$((i+1))
cat > "$dir/file$n.go" <<EOF
package $(basename $dir)
import "fmt"
func NewService$i() *Service$i { return &Service$i{} }
type Service$i struct{}
func (s *Service$i) Process() error { return fmt.Errorf("processing failed") }
EOF
done
done
git add . && git commit -qm initial
```
- [ ] `git ls-files '*.go' | wc -l` returns exactly `100`
- [ ] `ast-grep --version` resolves on host
- [ ] Pin the LLM-call counter via a Claude CLI shim:
```bash
CLAUDE_BIN=$(command -v claude)
mkdir -p /tmp/llm-count-shim
cat > /tmp/llm-count-shim/claude <<SHIM
#!/bin/sh
printf '1\n' >> /tmp/llm-call-count.log
exec "$CLAUDE_BIN" "\$@"
SHIM
chmod +x /tmp/llm-count-shim/claude
: > /tmp/llm-call-count.log
export PATH=/tmp/llm-count-shim:$PATH
```
Every `claude` invocation under the shim appends one line; final count is `wc -l < /tmp/llm-call-count.log`
- [ ] Positive control on the shim: `claude --version >/dev/null && [ "$(wc -l < /tmp/llm-call-count.log)" = "1" ]` (after the manual probe, reset with `: > /tmp/llm-call-count.log` before the Action step)

## Action

- [ ] Run `/coding:pr-review master standard` against `$WORK` in a fresh Claude Code session under the shim PATH; tee stdout to `/tmp/scaling-pr-stdout.log`, stderr to `/tmp/scaling-pr-stderr.log`, capture exit code to `/tmp/scaling-pr-exit`
- [ ] Record wall-clock duration with `time` wrapping the slash command; capture the `real` line to `/tmp/scaling-pr-time.log`
- [ ] Extract the Step 5 Consolidated Report section: `awk '/^### Step 5:/{flag=1} flag' /tmp/scaling-pr-stdout.log > /tmp/scaling-pr-report.md`

## Expected

- [ ] `cat /tmp/scaling-pr-exit` prints `0`
- [ ] `wc -l < /tmp/llm-call-count.log` returns ≤ `30` (proves the funnel: ast-grep mechanical layer carries 100-file scale at constant LLM cost)
- [ ] `/tmp/scaling-pr-time.log` `real` line parses to ≤ `30m00s` (stays under the prod bot's `activeDeadlineSeconds=1800` ceiling — the same one that killed coding#27)
- [ ] `jq '.stats.findings_count' /tmp/scaling-pr-stdout.log` returns > `0` — negative control: the synthetic violations were actually surfaced (if 0, the funnel didn't run and the LLM count is misleadingly low)
- [ ] `grep -oE 'go-[a-z-]+/[a-z-]+' /tmp/scaling-pr-report.md | sort -u | wc -l` returns ≥ `3` — at least 3 distinct rule_ids surfaced (the per-Owner adjudication phase processed findings, didn't drop them)
- [ ] `grep -c 'dropped finding' /tmp/scaling-pr-stderr.log` returns `0` — citation validator dropped nothing; every surfaced finding cites a real rule_id

## Cleanup

- `rm -rf "$WORK" /tmp/scaling-pr-* /tmp/llm-count-shim /tmp/llm-call-count.log`

After the scenario passes, the operator should record the measured `(LLM count, duration, findings count)` tuple in the Progress section of the task page (`[[Refactor coding pr-review to doc-driven rules pipeline]]`) so future Phase-10 reruns have a baseline. This is a follow-up note, not part of the scenario contract.
Loading