diff --git a/plugins/aidd-refine/CATALOG.md b/plugins/aidd-refine/CATALOG.md index 153d6edd..8f7cb4f7 100644 --- a/plugins/aidd-refine/CATALOG.md +++ b/plugins/aidd-refine/CATALOG.md @@ -50,9 +50,10 @@ Auto-generated index of skills, agents, references and assets shipped by the `ai | Group | File | Description | |-------|------|---| | `actions` | [01-challenge.md](skills/02-challenge/actions/01-challenge.md) | - | +| `assets` | [report-template.md](skills/02-challenge/assets/report-template.md) | - | | `-` | [README.md](skills/02-challenge/README.md) | - | | `references` | [confidence-rubric.md](skills/02-challenge/references/confidence-rubric.md) | - | -| `-` | [SKILL.md](skills/02-challenge/SKILL.md) | `Rethink prior work to verify correctness against an agreed plan, classifying findings as deal-breakers, suggestions, or correct, with a confidence score. Use when the user says "challenge this", "rethink your plan", "is this correct", "review my last decision", "challenge my decision", "challenge what you did", "is my decision right", "criticize this", "find flaws", or asks for a critical review of just-completed work. Do NOT use for line-by-line code review against a style guide, implementing features, writing tests, or generating new code.` | +| `-` | [SKILL.md](skills/02-challenge/SKILL.md) | `Rethink just-completed work against an agreed plan, classify findings as deal-breakers, suggestions, or correct, with a confidence score. 
Use to challenge a decision, criticize, or critically review recent work; not for line-by-line style review or writing code.` | #### `skills/03-condense` @@ -62,7 +63,7 @@ Auto-generated index of skills, agents, references and assets shipped by the `ai | `actions` | [02-stats.md](skills/03-condense/actions/02-stats.md) | - | | `-` | [README.md](skills/03-condense/README.md) | - | | `references` | [intensity-levels.md](skills/03-condense/references/intensity-levels.md) | - | -| `-` | [SKILL.md](skills/03-condense/SKILL.md) | `Toggle terse output mode with intensity levels (lite, full, ultra) so prose drops articles, filler, and pleasantries while code, quoted errors, and security warnings stay verbatim. Also reports real token usage and estimated savings under condense mode for the current session. Use when the user says "condense", "condense output", "be more concise", "shorter answers", "tighten output", "/condense", "/condense full", "/condense ultra", "stop condense", "normal mode", "/condense-stats", "how much have we saved", or "token savings". Do NOT use for editing existing prose, summarizing a long document, or compressing source code (only output style is affected, not content).` | +| `-` | [SKILL.md](skills/03-condense/SKILL.md) | `Toggle terse output mode (lite, full, ultra) that drops filler while code, errors, and warnings stay verbatim, and report token savings for the session. 
Use to condense output, shorten answers, switch intensity, or check savings; not for editing prose or compressing code.` | #### `skills/04-shadow-areas` @@ -77,7 +78,7 @@ Auto-generated index of skills, agents, references and assets shipped by the `ai | `references` | [locked-sets.json](skills/04-shadow-areas/references/locked-sets.json) | - | | `references` | [probe-style.md](skills/04-shadow-areas/references/probe-style.md) | - | | `references` | [severity-rubric.md](skills/04-shadow-areas/references/severity-rubric.md) | - | -| `-` | [SKILL.md](skills/04-shadow-areas/SKILL.md) | `Analytical scan of a markdown artifact (idea, user-stories, PRD, spec) to surface blind spots - unstated assumption, missing actor, missing failure mode, ambiguous term, missing acceptance criterion, missing edge case, and missing dependency - emitting a structured shadow report grouped by category and sorted by severity. Use when the user says "find blind spots in this spec", "what's missing in this PRD", "shadow report", "shadow analysis", "scan for gaps", "find what's missing", "spot blind spots", "review for gaps", or asks for an analytical gap scan of a written artifact. Do NOT use for interactive clarification through iterative Q&A (use aidd-refine:01-brainstorm for that), implementing features, writing tests, or reviewing code style.` | +| `-` | [SKILL.md](skills/04-shadow-areas/SKILL.md) | `Scan a markdown artifact (idea, user stories, PRD, spec) for blind spots and emit a structured shadow report grouped by category and sorted by severity. 
Use to find gaps, missing parts, or what's missing in a written artifact; not for interactive Q&A (use aidd-refine:01-brainstorm) or code review.` | #### `skills/05-fact-check` @@ -89,6 +90,7 @@ Auto-generated index of skills, agents, references and assets shipped by the `ai | `assets` | [report-template.md](skills/05-fact-check/assets/report-template.md) | - | | `-` | [README.md](skills/05-fact-check/README.md) | - | | `references` | [claim-categories.md](skills/05-fact-check/references/claim-categories.md) | - | +| `references` | [report-output-discipline.md](skills/05-fact-check/references/report-output-discipline.md) | - | | `references` | [verification-cascade.md](skills/05-fact-check/references/verification-cascade.md) | - | -| `-` | [SKILL.md](skills/05-fact-check/SKILL.md) | `Verify factual claims in a piece of text against authoritative sources and rewrite it with footnote citations, hedging any claim that cannot be confirmed. Runs a cheapest-first verification cascade (project memory and docs, then codebase inspection, then web lookup) and reports both sources when they disagree. Use when the user says "fact-check this", "verify that claim", "are you sure about that", "is that actually true", "cite your sources", "where did you get that fact", "did you make that up", "double-check the version you gave me", "vérifie cette information", or "es-tu sûr de ça". Do NOT use to auto-guard the AI's own output (this skill only fires on an explicit request), to judge code logic correctness, or to clarify vague requirements through iterative Q&A - use `aidd-refine:01-brainstorm` for that.` | +| `-` | [SKILL.md](skills/05-fact-check/SKILL.md) | `Verify factual claims in a text against authoritative sources and rewrite it with footnote citations, hedging anything unconfirmed. 
Use to fact-check, verify a claim, or cite sources on explicit request; not for judging code logic or clarifying vague requirements (use aidd-refine:01-brainstorm).` | diff --git a/plugins/aidd-refine/skills/02-challenge/SKILL.md b/plugins/aidd-refine/skills/02-challenge/SKILL.md index 0a51675a..053694b9 100644 --- a/plugins/aidd-refine/skills/02-challenge/SKILL.md +++ b/plugins/aidd-refine/skills/02-challenge/SKILL.md @@ -1,17 +1,17 @@ --- name: 02-challenge -description: Rethink prior work to verify correctness against an agreed plan, classifying findings as deal-breakers, suggestions, or correct, with a confidence score. Use when the user says "challenge this", "rethink your plan", "is this correct", "review my last decision", "challenge my decision", "challenge what you did", "is my decision right", "criticize this", "find flaws", or asks for a critical review of just-completed work. Do NOT use for line-by-line code review against a style guide, implementing features, writing tests, or generating new code. +description: Rethink just-completed work against an agreed plan, classify findings as deal-breakers, suggestions, or correct, with a confidence score. Use to challenge a decision, criticize, or critically review recent work; not for line-by-line style review or writing code. --- # Challenge Rethink prior work and surface what is wrong, missing, or duplicated. Output a structured report with a confidence score so the user knows whether to ship, iterate, or rework. -## Available actions +## Actions | # | Action | Role | Input | | --- | ----------- | ------------------------------------------------------------- | ------------------------------ | -| 01 | `challenge` | Rethink prior work, classify findings, score confidence | review_target + agreed_plan | +| 01 | `challenge` | Rethink prior work, classify findings, score confidence | the work + agreed reference | ## Default flow @@ -19,20 +19,14 @@ Single action skill. 
The router dispatches to `challenge` whenever the trigger p ## Transversal rules -- Think in first principles. Every step must be logical, with no gap and no missing information. -- Challenge own assumptions and the user's decisions before declaring confidence. -- Look for edge cases, errors, inconsistencies, missing parts, duplications, and optimizations. +- Reason from first principles, no logical gaps. - Aim for simplifications. If the work can be smaller, say so. -- Output the structured report verbatim per the action's `## Outputs` block. +- Fill `assets/report-template.md` verbatim. ## References -- `@references/confidence-rubric.md`: tiered rubric for the confidence percentage. +- `references/confidence-rubric.md`: tiered rubric for the confidence percentage. ## Assets -- None. - -## External data - -- None. +- `assets/report-template.md`: findings report skeleton, filled per run. diff --git a/plugins/aidd-refine/skills/02-challenge/actions/01-challenge.md b/plugins/aidd-refine/skills/02-challenge/actions/01-challenge.md index eae6932b..b7167e1a 100644 --- a/plugins/aidd-refine/skills/02-challenge/actions/01-challenge.md +++ b/plugins/aidd-refine/skills/02-challenge/actions/01-challenge.md @@ -2,39 +2,25 @@ Rethink prior work and verify correctness against an agreed plan, then emit a structured findings report. -## Inputs +## Input -- `review_target` (required): what to review. One of: last assistant turn, specific file paths, plan document, or commit range. -- `agreed_plan` (required): the prior agreement, specification, or set of requirements to compare against. +- The work to review: the last answer, specific files, a plan, or a commit range. +- The agreed reference to judge it against: a plan, a spec, or stated requirements. Without one, judge against stated user intent. 
-## Outputs +## Output -```text -My confidence level of correctness now: XX% - -# Previous work to review - -# Correctness (100%) -- - -# Deal breakers -- - -# Suggestions (enhancements only) -- -``` +The findings report following `@../assets/report-template.md`: a confidence percentage plus the Correctness, Deal breakers, and Suggestions sections. ## Process -1. Read `review_target` and align it against `agreed_plan`. -2. Challenge own assumptions and the user's decisions. -3. Scan for edge cases, errors, gaps, duplications, and inconsistencies. -4. Classify each finding as Correctness, Deal breaker, or Suggestion. -5. Score confidence per the rubric in `references/confidence-rubric.md`. -6. Emit the Output report verbatim. +1. **Align.** Read the work and line it up against the agreed reference. +2. **Challenge.** Challenge own assumptions and the user's decisions. +3. **Scan.** Scan for edge cases, errors, gaps, duplications, and inconsistencies. +4. **Classify.** Classify each finding as Correctness, Deal breaker, or Suggestion. +5. **Score.** Score confidence per the rubric in `@../references/confidence-rubric.md`. +6. **Emit.** Fill `@../assets/report-template.md` verbatim and emit it. ## Test -- The emitted report contains a confidence percentage and the three classification sections. -- `confidence >= 95%` if and only if the Deal breakers section is empty. -- The confidence value sits in the rubric tier consistent with the findings. +- The report has a confidence percentage and the Correctness, Deal breakers, and Suggestions sections. +- The Deal breakers section is non-empty only when confidence is below 75%. 
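Reviewer note: the two Test rules for `01-challenge` can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not part of the plugin: `parse_report` and `check_report` are hypothetical names, and the parsing assumes the report keeps the `My confidence level of correctness now: XX%` line and the `# Correctness`, `# Deal breakers`, and `# Suggestions` headings.

```python
import re

# Section titles the Test block requires (matched by prefix, since the
# report writes e.g. "Correctness (100%)").
REQUIRED_SECTIONS = ("Correctness", "Deal breakers", "Suggestions")


def parse_report(text):
    """Return (confidence or None, {section title: [non-empty bullets]})."""
    match = re.search(r"My confidence level of correctness now:\s*(\d{1,3})%", text)
    confidence = int(match.group(1)) if match else None
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# "):
            current = stripped[2:]
            sections[current] = []
        elif current is not None and stripped.startswith("-"):
            item = stripped[1:].strip()
            if item:  # a bare "-" is an empty placeholder bullet
                sections[current].append(item)
    return confidence, sections


def check_report(text):
    """Return a list of rule violations; an empty list means the report passes."""
    confidence, sections = parse_report(text)
    problems = []
    if confidence is None:
        problems.append("missing confidence percentage")
    for name in REQUIRED_SECTIONS:
        if not any(title.startswith(name) for title in sections):
            problems.append(f"missing section: {name}")
    deal_breakers = next(
        (items for title, items in sections.items() if title.startswith("Deal breakers")),
        [],
    )
    # Rule: Deal breakers may be non-empty only when confidence is below 75%.
    if deal_breakers and confidence is not None and confidence >= 75:
        problems.append("deal breakers listed but confidence is not below 75%")
    return problems
```

An unfilled template (still showing `XX%`) is reported as missing its confidence percentage, which is probably the desired behaviour for a smoke test.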
diff --git a/plugins/aidd-refine/skills/02-challenge/assets/report-template.md b/plugins/aidd-refine/skills/02-challenge/assets/report-template.md new file mode 100644 index 00000000..e7b0af23 --- /dev/null +++ b/plugins/aidd-refine/skills/02-challenge/assets/report-template.md @@ -0,0 +1,14 @@ + + +My confidence level of correctness now: XX% + +# Previous work to review + +# Correctness (100%) +- + +# Deal breakers +- + +# Suggestions (enhancements only) +- diff --git a/plugins/aidd-refine/skills/03-condense/SKILL.md b/plugins/aidd-refine/skills/03-condense/SKILL.md index 70173e0e..93ea005f 100644 --- a/plugins/aidd-refine/skills/03-condense/SKILL.md +++ b/plugins/aidd-refine/skills/03-condense/SKILL.md @@ -1,6 +1,6 @@ --- name: 03-condense -description: Toggle terse output mode with intensity levels (lite, full, ultra) so prose drops articles, filler, and pleasantries while code, quoted errors, and security warnings stay verbatim. Also reports real token usage and estimated savings under condense mode for the current session. Use when the user says "condense", "condense output", "be more concise", "shorter answers", "tighten output", "/condense", "/condense full", "/condense ultra", "stop condense", "normal mode", "/condense-stats", "how much have we saved", or "token savings". Do NOT use for editing existing prose, summarizing a long document, or compressing source code (only output style is affected, not content). +description: Toggle terse output mode (lite, full, ultra) that drops filler while code, errors, and warnings stay verbatim, and report token savings for the session. Use to condense output, shorten answers, switch intensity, or check savings; not for editing prose or compressing code. argument-hint: condense | stats --- @@ -8,12 +8,12 @@ argument-hint: condense | stats Toggles a terse output mode with three intensity levels (lite, full, ultra). 
Strips articles, filler, and pleasantries from prose while preserving technical substance, code blocks, quoted errors, and security warnings. -## Available actions +## Actions | # | Action | Role | Input | | --- | ---------- | --------------------------------------------------------------------- | ------------------------------------ | -| 01 | `condense` | Toggle terse mode and apply intensity rules | current_state + requested_intensity | -| 02 | `stats` | Report real token usage and estimated savings for the current session | session log + intensity timeline | +| 01 | `condense` | Toggle terse mode and apply intensity rules | current state + requested level | +| 02 | `stats` | Report real token usage and estimated savings for the current session | session messages + level timeline | ## Default flow @@ -25,24 +25,18 @@ Router dispatches by intent: ## Transversal rules - **Persistence**: once active, terse mode applies to EVERY response until explicitly turned off. Do not drift back to verbose prose after many turns, when uncertain, or when the task changes. The level remains active for the rest of the session unless changed or stopped. -- **Off switch**: terse mode stops only on explicit user signal - `stop condense`, `normal mode`, `/condense off`, or invoking the skill again to toggle. +- **Off switch**: terse mode stops only on explicit user signal: `stop condense`, `normal mode`, `/condense off`, or invoking the skill again to toggle. - **Toggle**: invoking the skill while active toggles it off; invoking while off turns it on at the default level (`full`) unless an explicit intensity is given. - **Drop fluff**: drop articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries (sure/certainly/of course/happy to), and hedging. Fragments are acceptable. - **Short synonyms**: prefer short words (big not extensive, fix not "implement a solution for"). Technical terms stay exact. Code blocks are unchanged. Errors are quoted verbatim. 
- **Pattern**: `[thing] [action] [reason]. [next step].` - - Bad: "Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by..." - - Good: "Bug in auth middleware. Token expiry check uses `<` not `<=`. Fix:" -- **Auto-pause** (drop terse mode for these passages, then resume): security warnings, irreversible action confirmations, multi-step sequences where fragment order or omitted conjunctions risk misread, compression itself creating technical ambiguity, user asks to clarify or repeats a question. -- **Boundaries**: code, commits, and pull request bodies are written in normal English regardless of intensity. The intensity level persists until toggled off or until session end. +- **Auto-pause**: drop terse mode for the passages listed in `references/intensity-levels.md` (security warnings, irreversible confirmations, ambiguity risks), then resume. +- **Boundaries**: code, commits, and pull request bodies are written in normal English regardless of intensity. ## References -- `@references/intensity-levels.md`: detailed per-level rules and side-by-side examples. - -## Assets - -- None. +- `references/intensity-levels.md`: detailed per-level rules and side-by-side examples. ## External data -- `../hooks/condense-stats.js` - UserPromptSubmit hook that intercepts stats triggers, reads the session transcript, and returns the formatted savings report without invoking the model. +- `../hooks/condense-stats.js`: UserPromptSubmit hook that intercepts stats triggers, reads the session transcript, and returns the formatted savings report without invoking the model. 
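Reviewer note: the Off switch and Toggle rules in this file amount to a small state machine. A minimal Python sketch, assuming the message has already been routed to the condense skill; `resolve_state` is a hypothetical helper, not something the plugin ships, and the naive word matching is only for illustration.

```python
from typing import Optional

LEVELS = ("lite", "full", "ultra")
OFF_PHRASES = ("stop condense", "normal mode", "/condense off")
DEFAULT_LEVEL = "full"  # level used when toggling on without an explicit intensity


def resolve_state(current: Optional[str], request: str) -> Optional[str]:
    """Return the new state: a level name when on, None when off.

    current is the active level, or None when condense is off.
    """
    text = request.strip().lower()
    # Explicit off phrases always win.
    if any(phrase in text for phrase in OFF_PHRASES):
        return None
    # An explicit intensity turns the mode on at that level (or switches level).
    words = text.split()
    for level in LEVELS:
        if level in words:
            return level
    # Plain invocation toggles: on -> off, off -> default level.
    return None if current is not None else DEFAULT_LEVEL
```

For example, `resolve_state(None, "/condense")` yields `"full"`, while the same request with condense already active yields `None`.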
diff --git a/plugins/aidd-refine/skills/03-condense/actions/01-condense.md b/plugins/aidd-refine/skills/03-condense/actions/01-condense.md index b5892323..dbdf45bb 100644 --- a/plugins/aidd-refine/skills/03-condense/actions/01-condense.md +++ b/plugins/aidd-refine/skills/03-condense/actions/01-condense.md @@ -2,39 +2,26 @@ Toggle terse output mode and apply the requested intensity rules to subsequent prose turns. -## Inputs +## Input -- `current_state` (required): inferred from session context. Either `on` (with current intensity level) or `off`. -- `requested_intensity` (required): one of `lite`, `full`, `ultra`, or `toggle` to flip the current state. +- Whether condense is currently on (and at which level) or off, read from session context. +- The requested change: a level (lite, full, ultra) or a plain on/off toggle. -## Outputs +## Output -```text -Condense: ON (full). -Articles dropped, filler removed. Code, errors, warnings stay verbatim. Stop with "stop condense" or "normal mode". -``` - -Or on off: - -```text -Condense: OFF. -Normal prose resumed. -``` +A single confirmation line: `Condense: ON ().` when enabling, or `Condense: OFF.` when disabling. ## Process -1. Detect the toggle command and target intensity from the user message. -2. Resolve the new state by combining `current_state` with `requested_intensity`: +1. **Detect.** Read the toggle command and target level from the user message. +2. **Resolve.** Combine the current state with the request: - Explicit level (`lite | full | ultra`) sets that level (or switches level if already on). - `toggle` flips on/off; default level when turning on is `full`. - Off phrases (`stop condense`, `normal mode`, `/condense off`) force off. -3. Emit the confirmation block with the resolved state filled in. -4. Apply the transversal rules to every subsequent prose turn until the next off signal, using per-level rules from `@../references/intensity-levels.md`. -5. 
**Hold persistence.** Do not drift back to verbose prose across many turns, when uncertain, or when the topic changes. Auto-pause only for the specific passages listed in the reference. +3. **Emit.** The reply MUST begin with this exact line, filled in and unaltered: `Condense: ON ().` when enabling, or `Condense: OFF.` when disabling. The stats action and the hook parse this line from the transcript, so never paraphrase, decorate, or omit it. +4. **Apply.** Apply the transversal rules to every subsequent prose turn until the next off signal, using per-level rules and auto-pause passages from `@../references/intensity-levels.md`. ## Test -- After turning condense ON: the next non-code, non-warning assistant turn drops articles consistent with the active intensity. -- After turning condense OFF: the next assistant turn returns to normal prose. -- Code blocks, quoted errors, and security warnings remain verbatim regardless of condense state. -- After 5 consecutive turns post-activation: the terse style is still applied (no drift back to verbose). +- After ON, the next non-code, non-warning turn drops articles at the active intensity; after OFF, it returns to normal prose. +- Code blocks, quoted errors, and security warnings stay verbatim regardless of state. diff --git a/plugins/aidd-refine/skills/03-condense/actions/02-stats.md b/plugins/aidd-refine/skills/03-condense/actions/02-stats.md index 031e3073..135f27ee 100644 --- a/plugins/aidd-refine/skills/03-condense/actions/02-stats.md +++ b/plugins/aidd-refine/skills/03-condense/actions/02-stats.md @@ -1,42 +1,25 @@ # 02 - Stats -Show real token usage and estimated savings for the current session under condense mode. +Show real token usage and estimated savings for the current session under condense mode. On Claude Code the bundled `hooks/condense-stats.js` hook owns this path; the model runs the steps below only on tools without hook support. 
-## Inputs +## Input -- Session log of the current AI tool (assistant messages produced since session start). -- Active intensity level and its on/off transitions during the session. +- The session's assistant messages since it started. +- The active level and every on/off switch during the session. -## Outputs +## Output -```text -Condense session stats ----------------------- -Mode: ON (full) -Active turns: 18 / 32 (56%) -Tokens out (active): 4,210 -Tokens out (off): 5,830 -Avg saved / turn: ~38% (vs unmodified prose baseline) -Approx total saved: ~2,650 tokens - -Top savings: full (-42%), ultra (-58%), lite (-18%). -``` +A stats block reporting, in order: mode, active turns and ratio, tokens out while active, tokens out while off, average saved per turn versus the unmodified baseline, approximate total saved, and per-level top savings. ## Process -1. **Read the session log** for the current AI tool (Claude Code: the active session JSONL; other tools: their equivalent transcript). -2. **Detect intensity transitions** by scanning assistant messages for the confirmation block emitted by `01-condense` (`Condense: ON (...)` / `Condense: OFF`). Build a timeline of `(turn_index, level)` segments. -3. **Tokenize each assistant message.** Use the AI tool's token counter when available, otherwise approximate at 4 chars per token. -4. **Compute the baseline.** For each `active` segment, estimate the verbose-prose baseline using the level's compression ratio (`lite ~18%`, `full ~38%`, `ultra ~58%` - published averages, replaceable by measured ratios when available). -5. **Render the report block** with the exact field order shown in `## Outputs`. Round percentages to whole numbers; round token counts to the nearest 10. +1. **Read.** Load the session log for the current AI tool (Claude Code: the active session JSONL; other tools: their equivalent transcript). +2. 
**Detect.** Scan assistant messages for the confirmation line emitted by `01-condense` (`Condense: ON (...)` / `Condense: OFF`). Build a timeline of `(turn_index, level)` segments. +3. **Tokenize.** Count tokens per assistant message. Use the AI tool's token counter when available, otherwise approximate at 4 chars per token. +4. **Compute.** For each `active` segment, estimate the verbose-prose baseline using the level's compression ratio (`lite ~18%`, `full ~38%`, `ultra ~58%`, published averages, replaceable by measured ratios when available). +5. **Render.** Emit the report with the exact field order shown in `## Output`. Round percentages to whole numbers; round token counts to the nearest 10. 6. **Stop.** Do not invoke any other action. -## Implementation - -A `UserPromptSubmit` hook bundled with this plugin at `hooks/condense-stats.js` intercepts the trigger phrase, reads the active session transcript, detects intensity transitions emitted by `01-condense`, computes the report, and returns `{ decision: "block", reason: "" }` so the model is not invoked. - -The model only runs this action's inline logic on AI tools that lack hook support; on Claude Code the hook owns the path. - ## Test -The output matches the field order in `## Outputs`, every numeric field is filled (no `-` placeholders), and the active-turns ratio is consistent with the detected intensity transitions in the session. +Output follows the `## Output` field order, every numeric field is filled (no `-`), and the active-turns ratio matches the detected intensity transitions. 
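Reviewer note: the Detect, Tokenize, Compute, and Render steps of `02-stats` can be sketched as below. This is an illustrative Python sketch, not the bundled `hooks/condense-stats.js`; `session_stats` and the returned field names are hypothetical. It assumes the fallback of 4 characters per token, and it assumes one reading of the baseline step: if a level saves a fraction r, the verbose baseline for t emitted tokens is t / (1 - r).

```python
import re

# Published average savings per level, as quoted in the Compute step.
SAVINGS = {"lite": 0.18, "full": 0.38, "ultra": 0.58}

# Confirmation line emitted by 01-condense; group 1 is None for OFF.
CONFIRMATION = re.compile(r"^Condense: (?:ON \((lite|full|ultra)\)|OFF)", re.MULTILINE)


def estimate_tokens(text):
    """Fallback tokenizer: roughly 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def session_stats(messages):
    """Compute condense stats from assistant messages in session order."""
    level = None  # None means condense is off
    active_turns = 0
    tokens_active = 0
    tokens_off = 0
    saved = 0.0
    for text in messages:
        match = CONFIRMATION.search(text)
        if match:
            level = match.group(1)  # switch level, or turn off on OFF
        tokens = estimate_tokens(text)
        if level is None:
            tokens_off += tokens
        else:
            active_turns += 1
            tokens_active += tokens
            # Assumed baseline: tokens / (1 - ratio); saved = baseline - tokens.
            saved += tokens / (1 - SAVINGS[level]) - tokens
    total_turns = len(messages)
    baseline = tokens_active + saved
    return {
        "mode": f"ON ({level})" if level else "OFF",
        "active_turns": active_turns,
        "total_turns": total_turns,
        # Percentages rounded to whole numbers, token counts to the nearest 10.
        "active_pct": round(100 * active_turns / total_turns) if total_turns else 0,
        "tokens_active": int(round(tokens_active, -1)),
        "tokens_off": int(round(tokens_off, -1)),
        "avg_saved_pct": round(100 * saved / baseline) if baseline else 0,
        "approx_saved": int(round(saved, -1)),
    }
```

A real implementation would prefer the AI tool's own token counter and measured ratios when they are available, as the Process steps note.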
diff --git a/plugins/aidd-refine/skills/04-shadow-areas/SKILL.md b/plugins/aidd-refine/skills/04-shadow-areas/SKILL.md index a6c7fdf4..1def8159 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/SKILL.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/SKILL.md @@ -1,6 +1,6 @@ --- name: 04-shadow-areas -description: Analytical scan of a markdown artifact (idea, user-stories, PRD, spec) to surface blind spots - unstated assumption, missing actor, missing failure mode, ambiguous term, missing acceptance criterion, missing edge case, and missing dependency - emitting a structured shadow report grouped by category and sorted by severity. Use when the user says "find blind spots in this spec", "what's missing in this PRD", "shadow report", "shadow analysis", "scan for gaps", "find what's missing", "spot blind spots", "review for gaps", or asks for an analytical gap scan of a written artifact. Do NOT use for interactive clarification through iterative Q&A (use aidd-refine:01-brainstorm for that), implementing features, writing tests, or reviewing code style. +description: Scan a markdown artifact (idea, user stories, PRD, spec) for blind spots and emit a structured shadow report grouped by category and sorted by severity. Use to find gaps, missing parts, or what's missing in a written artifact; not for interactive Q&A (use aidd-refine:01-brainstorm) or code review. argument-hint: detect | render-report | diff --- @@ -8,7 +8,7 @@ argument-hint: detect | render-report | diff Analytically scans a written artifact for gaps the author has not addressed. Unlike iterative Q&A clarification, this skill reads the existing material and emits a structured report: each gap carries a category from a locked 7-category taxonomy, a 3-tier severity, and a direct-question probe the author can act on immediately. 
-## Available actions +## Actions | # | Action | Role | Input | | --- | ---------------- | ------------------------------------------------------------------------ | ---------------------------------------- | @@ -28,23 +28,19 @@ The `02-render-report` action always runs last and writes `-shadow-repor ## Transversal rules - Never modify the source artifact. -- Every emitted gap must have all three fields populated: `category`, `severity`, `probe`. -- Every probe must be a direct question ending with `?`. -- Categories and severities must come from the locked sets in `@references/locked-sets.json`. +- Every gap carries all three: a category, a severity, and a probe question. +- Every probe is a direct question ending with `?`. +- Categories and severities come from the locked sets in `references/locked-sets.json`. - When zero blockers and zero majors remain, stamp the report `status: clean`. -- On re-runs, the identity key for diffing is `category + normalized snippet` - not probe wording - so minor probe rephrasing does not create spurious "newly introduced" gaps. +- On re-runs, gaps are matched by category and snippet, never by question wording, so rephrasing a question never creates a spurious "newly introduced" gap. ## References -- `@references/categories.md`: locked 7-category taxonomy with definition and example per category. -- `@references/severity-rubric.md`: blocker / major / minor decision rules and examples. -- `@references/probe-style.md`: direct-question form rules. -- `@references/locked-sets.json`: machine-readable sets reused by the validator. +- `references/categories.md`: locked 7-category taxonomy with definition and example per category. +- `references/severity-rubric.md`: blocker / major / minor decision rules and examples. +- `references/probe-style.md`: direct-question form rules. +- `references/locked-sets.json`: machine-readable sets reused by the validator. 
## Assets -- `@assets/report-template.md`: report skeleton with header, per-category sections, and `status: clean` block. - -## External data - -- None. +- `assets/report-template.md`: report skeleton with header, per-category sections, and `status: clean` block. diff --git a/plugins/aidd-refine/skills/04-shadow-areas/actions/01-detect.md b/plugins/aidd-refine/skills/04-shadow-areas/actions/01-detect.md index 8a43dfb2..9460c674 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/actions/01-detect.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/actions/01-detect.md @@ -1,41 +1,28 @@ # 01 - Detect -Parse the source artifact and extract a structured list of gaps, each classified by category, severity, and direct-question probe. +Parse the source artifact and pull out a list of gaps, each tagged with a category, a severity, and a direct question. -## Inputs +## Input -- `source` (required): file path OR inline markdown text. - - Accept absolute paths and relative paths inside the working directory. - - Reject paths outside the working directory and filenames matching `*-shadow-report.md`. +- The source to scan: a file path or inline markdown text. -## Outputs +## Output -Two arrays: - -1. `gaps[]`: each entry has `category`, `severity`, `probe`, and optional `snippet`. - - `category` ∈ the 7 locked categories in `references/locked-sets.json`. - - `severity` ∈ `{blocker, major, minor}` (see `references/severity-rubric.md`). - - `probe`: direct question ending with `?` (see `references/probe-style.md`). - - `snippet`: quoted excerpt from the source when traceable. - -2. `warnings[]`: top-of-report notes that are not gap entries (e.g. non-markdown source). +A list of gaps, each with its category, severity, a probe question, and the quoted snippet it came from, plus any top-of-report warnings such as a non-markdown source. ## Process -1. Load locked sets from `references/locked-sets.json` and category definitions from `references/categories.md`. -2. 
Validate the source. Reject per the rules in Inputs. -3. Edge cases: - - Empty source → emit one blocker gap `{category: "missing acceptance criterion", probe: "What content should this artifact contain?"}` and stop. - - Non-markdown source → append warning `"Source is not markdown; gap attribution may be imprecise."` and continue. -4. Scan content for each of the 7 categories in their locked order. Emit one gap per distinct issue found, assigning severity per the rubric and drafting probes per the style rules. -5. Deduplicate by `(category, normalized_snippet)`. Snippet-less gaps fall back to `(category, severity)`. -6. Return `gaps` and `warnings`. Sorting is done by `02-render-report`. +1. **Load.** Read the locked categories and their definitions from `@../references/locked-sets.json` and `@../references/categories.md`. +2. **Validate.** Check the source. Reject anything outside the working directory or already named `*-shadow-report.md`. +3. **Handle edges.** An empty source emits one blocker gap asking what content the artifact should hold, then stops. A non-markdown source adds a warning that attribution may be imprecise, then continues. +4. **Scan.** Walk the seven categories in their locked order. Emit one gap per distinct issue, set its severity from `@../references/severity-rubric.md`, and write its question per `@../references/probe-style.md`. +5. **Dedupe.** Treat two gaps with the same category and snippet as one. A snippet-less gap falls back to its category plus severity. +6. **Return.** Hand the gaps and warnings to the next action: `03-diff` when a prior report exists, else `02-render-report`. Sorting happens there. ## Test -- Outside-tree relative path → error and empty `gaps`. -- Filename matching `*-shadow-report.md` → error and empty `gaps`. -- Empty source → exactly one blocker gap (`missing acceptance criterion`). -- Non-markdown source → one entry in `warnings`, scanning continues. 
-- Every emitted gap has `category` and `severity` in the locked set and `probe` ending with `?`. -- A duplicated gap (same `category` + normalized `snippet`) appears once in the output. +- A path outside the working directory, or a file named `*-shadow-report.md`, is rejected with no gaps. +- An empty source yields exactly one blocker gap about missing content. +- A non-markdown source adds one warning and keeps scanning. +- Every gap has a category and severity from the locked set and a question ending in `?`. +- A repeated gap (same category and snippet) appears once. diff --git a/plugins/aidd-refine/skills/04-shadow-areas/actions/02-render-report.md b/plugins/aidd-refine/skills/04-shadow-areas/actions/02-render-report.md index c6d08bd8..16996ce6 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/actions/02-render-report.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/actions/02-render-report.md @@ -1,50 +1,35 @@ # 02 - Render Report -Render the detected gap list into a structured markdown shadow report and write it next to the source artifact. +Turn the detected gaps into a structured markdown report and write it next to the source. -## Inputs +## Input -- `gaps[]` (required when `diff` is absent): gap objects from `01-detect`. Ignored when `diff` is supplied. -- `warnings[]` (required, may be empty): warnings from `01-detect`. -- `source_path` (required): path used to derive the output filename and directory. -- `diff` (optional): the three labeled sets from `03-diff` — `closed[]`, `still_open[]`, `newly_introduced[]`. Triggers diff mode. +- The gaps and warnings from `01-detect`. +- The source path, used to name and place the report. +- Optional: the three labelled sets from `03-diff` (closed, still open, newly introduced). Their presence switches on diff mode. -## Outputs +## Output -A markdown file at `/-shadow-report.md`. - -Filename rule: strip the last extension from the source filename and append `-shadow-report.md`. 
Examples: - -| Source | Report | -| --- | --- | -| `prd.md` | `prd-shadow-report.md` | -| `feature-v2.draft.md` | `feature-v2.draft-shadow-report.md` | -| `Makefile` | `Makefile-shadow-report.md` | - -The source artifact is never modified. +A markdown report written next to the source, named by stripping the source's last extension and appending `-shadow-report.md`. ## Process -1. Load the skeleton from `assets/report-template.md`. -2. Derive `source_dir` and `source_stem` per the filename rule. -3. If `warnings` is non-empty, emit `## Warnings` at the top with each entry as a bullet. Otherwise omit the block. -4. Render gaps grouped by category in locked order (`references/locked-sets.json`): - - Non-diff mode: emit one `### ` per category with at least one gap. - - Diff mode: for each category, emit `#### Closed`, `#### Still Open`, `#### Newly Introduced` in that fixed order, omitting empty subsections. -5. Within any subsection, sort gaps by severity: `blocker` → `major` → `minor`. -6. Render each gap as `**[severity]** `. If `snippet` is non-empty, append a blockquote on the next line. -7. Populate header counts: total + per-severity. In diff mode, counts come from `still_open` + `newly_introduced` only. -8. Stamp `status: clean` in front matter when zero `blocker` and zero `major` entries remain in scope. Otherwise omit the `status` key entirely. -9. Write to `/-shadow-report.md`. +1. **Load.** Start from the skeleton in `@../assets/report-template.md`. +2. **Name.** Derive the report's folder and filename from the source per the rule above. +3. **Warn.** If there are warnings, list them under `## Warnings` at the top. Otherwise omit the block. +4. **Group.** Lay gaps out by category in locked order (`@../references/locked-sets.json`). In plain mode, one heading per category that has a gap. In diff mode, split each category into Closed, Still Open, and Newly Introduced, in that order, dropping empty parts. +5. 
**Sort.** Within a part, order gaps blocker, then major, then minor. +6. **Render.** Write each gap as `**[severity]** `, with its snippet as a blockquote on the next line when present. +7. **Count.** Fill the header totals: overall and per severity. In diff mode, count only still-open and newly-introduced gaps. +8. **Stamp.** Mark the front matter `status: clean` when no blocker and no major remain in scope. Otherwise leave the status out. +9. **Write.** Save the report at the derived path. ## Test -- Grouping: gaps spanning multiple categories produce one `### ` per category, in locked order. -- Sorting: within a category, `blocker` precedes `major` precedes `minor`. -- Filename: source ending in `feature-v2.draft.md` produces `feature-v2.draft-shadow-report.md`. Source `Makefile` produces `Makefile-shadow-report.md`. -- Clean: zero blocker and zero major → front matter contains `status: clean`. -- Dirty: at least one blocker or major → front matter has no `status` key. -- Warnings: non-empty `warnings` → `## Warnings` section emitted. Empty → omitted. -- No source mutation: `source_path` content and mtime unchanged after the action. -- Diff order: when a category has entries in all three subsets, output is `Closed` → `Still Open` → `Newly Introduced`. -- Diff clean: a `blocker` in `closed[]` does not block clean status; only `still_open` + `newly_introduced` count. +- Gaps spanning several categories produce one heading per category, in locked order. +- Within a category, blocker comes before major before minor. +- A source named `feature-v2.draft.md` produces `feature-v2.draft-shadow-report.md`; `Makefile` produces `Makefile-shadow-report.md`. +- Zero blocker and zero major in scope stamps `status: clean`; closed gaps never count toward scope; otherwise no status key. +- Warnings present emits `## Warnings`; none omits it. +- The source content and timestamp are unchanged after the run. 
+- In diff mode a category with entries in all three parts renders Closed, then Still Open, then Newly Introduced. diff --git a/plugins/aidd-refine/skills/04-shadow-areas/actions/03-diff.md b/plugins/aidd-refine/skills/04-shadow-areas/actions/03-diff.md index f4a290c3..61a4bff6 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/actions/03-diff.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/actions/03-diff.md @@ -1,39 +1,28 @@ # 03 - Diff -Load the prior shadow report, compare it against the freshly detected gaps, and classify each gap as closed, still open, or newly introduced. +Load the prior shadow report, compare it with the freshly detected gaps, and sort each gap into closed, still open, or newly introduced. -## Inputs +## Input -- `current_gaps[]` (required): gap objects from `01-detect` for the current run. -- `source_path` (required): path used to derive the prior report location, applying the filename rule from `02-render-report`. +- The current run's gaps from `01-detect`. +- The source path, used to find the prior report by the same naming rule as `02-render-report`. -## Outputs +## Output -Three labeled sets, passed to `02-render-report`: - -- `closed[]`: gaps present in the prior report but absent from `current_gaps`. -- `still_open[]`: gaps present in both runs. -- `newly_introduced[]`: gaps present in `current_gaps` but absent from the prior report. - -Each entry carries `category`, `severity`, and `probe`. `closed` entries carry the prior probe; `still_open` and `newly_introduced` carry the current probe. +Three labelled sets handed to `02-render-report`: closed (in the prior report, gone now), still open (in both runs), and newly introduced (new this run). ## Process -1. Derive prior report path from `source_path` using the filename rule of `02-render-report`. -2. If the prior report does not exist: emit `closed = []`, `still_open = []`, `newly_introduced = current_gaps`. Stop. This is the expected first-run behavior. -3. 
Parse the prior report: - - Locate `## Gaps by Category` and walk `### ` subsections. - - Each line matching `**[severity]** ` is a gap; an immediately following blockquote line is its `snippet`. - - Diff-mode subsections (`Closed` / `Still Open` / `Newly Introduced`) are parsed identically. -4. Build identity keys: `(category, normalized_snippet)`. Probe wording is NOT part of the key. Snippet-less gaps fall back to `(category, severity)`. This matches `01-detect`'s dedup rule so identity is consistent between runs. -5. Compute the three sets by set difference / intersection on identity keys. -6. Pass `closed`, `still_open`, `newly_introduced` to `02-render-report`. +1. **Locate.** Derive the prior report's path from the source, same rule as `02-render-report`. +2. **First run.** If no prior report exists, everything is newly introduced and the other two sets are empty. Stop. This is the expected first run. +3. **Parse.** Read the prior report: walk its category sections, treat each `**[severity]** ` line as a gap, and take an immediately following blockquote as its snippet. Diff-mode sections parse the same way. +4. **Match.** Identify gaps by category plus snippet, ignoring question wording, so a reworded question is not seen as new. A snippet-less gap falls back to category plus severity, matching `01-detect`'s dedupe rule so identity stays consistent across runs. +5. **Sort.** Compare the two runs to fill closed, still open, and newly introduced. Each gap keeps its category and severity; closed gaps carry the prior question, the others the current one. +6. **Hand off.** Pass the three sets to `02-render-report`. ## Test -- No change between runs → `closed = []`, all gaps in `still_open`, `newly_introduced = []`. -- A prior gap whose source anchor is removed → appears in `closed`, not in `still_open`. -- A new gap not in the prior report → appears in `newly_introduced`, not in `still_open`. 
-- First run (no prior report) → `closed = []`, `still_open = []`, all current gaps in `newly_introduced`. No error. -- Probe wording change with same category + snippet → classified as `still_open` (probe wording is not part of identity). -- Snippet-less gaps with identical `(category, severity)` across runs → classified as `still_open`. +- No change between runs: every gap is still open, closed and newly introduced empty. +- A prior gap whose anchor is gone lands in closed; a gap absent from the prior report lands in newly introduced. +- First run with no prior report: every current gap is newly introduced, the rest empty, no error. +- A reworded question, or a snippet-less gap with the same category and severity, stays still open (gaps match by category and snippet, not wording). diff --git a/plugins/aidd-refine/skills/04-shadow-areas/references/categories.md b/plugins/aidd-refine/skills/04-shadow-areas/references/categories.md index 4e865dde..14383274 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/references/categories.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/references/categories.md @@ -38,7 +38,7 @@ The 7 locked categories below come from `references/locked-sets.json`. No other ## missing actor -**Definition**: An entity - person, system, or role - that takes an action or is affected by the system is absent from the document. The process cannot be fully traced without naming it. +**Definition**: An entity (person, system, or role) that takes an action or is affected by the system is absent from the document. The process cannot be fully traced without naming it. **Positive example**: A user-story describes the approval workflow: a request is submitted, reviewed, and approved. The document names the requester but never names the reviewer role or what system sends the approval notification. 
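As a rough illustration of the `03-diff` steps changed above (locate the prior report by the naming rule, match gaps by category plus snippet, then sort them into closed, still open, and newly introduced), a minimal Python sketch might look like the following. It is not the plugin's implementation: `Gap`, `report_path`, `identity_key`, and `classify` are hypothetical names, and the snippet normalization (collapse whitespace, lowercase) is an assumption, since the diff does not say how snippets are normalized.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Gap(NamedTuple):
    """Hypothetical gap record; field names follow the pre-change action docs."""
    category: str
    severity: str                  # "blocker", "major", or "minor"
    probe: str                     # the direct question
    snippet: Optional[str] = None  # quoted excerpt, when traceable


def report_path(source_path: str) -> Path:
    """Strip the source's last extension and append -shadow-report.md."""
    source = Path(source_path)
    # Path.stem drops only the final suffix, so "feature-v2.draft.md"
    # keeps ".draft" and "Makefile" is left unchanged.
    return source.with_name(source.stem + "-shadow-report.md")


def identity_key(gap: Gap) -> tuple:
    """Category plus snippet; snippet-less gaps fall back to category plus severity."""
    if gap.snippet:
        # Assumption: normalization = collapse whitespace and lowercase.
        normalized = " ".join(gap.snippet.split()).lower()
        return (gap.category, "snippet", normalized)
    # The middle tag keeps a snippet that happens to read "major"
    # from colliding with the severity fallback.
    return (gap.category, "severity", gap.severity)


def classify(prior: list, current: list) -> tuple:
    """Return (closed, still_open, newly_introduced) for two runs."""
    prior_by_key = {identity_key(g): g for g in prior}
    current_by_key = {identity_key(g): g for g in current}
    # Closed gaps come from the prior run, so they carry the prior question.
    closed = [g for k, g in prior_by_key.items() if k not in current_by_key]
    # The other two sets come from the current run and carry the current question.
    still_open = [g for k, g in current_by_key.items() if k in prior_by_key]
    newly_introduced = [g for k, g in current_by_key.items() if k not in prior_by_key]
    return closed, still_open, newly_introduced
```

Because the question text is left out of the key, a reworded question with the same category and snippet stays in the still-open set. Passing an empty `prior` list reproduces the first-run case: every current gap lands in newly introduced and the other two sets are empty.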
diff --git a/plugins/aidd-refine/skills/04-shadow-areas/references/probe-style.md b/plugins/aidd-refine/skills/04-shadow-areas/references/probe-style.md index 7fbf280f..bce1c07b 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/references/probe-style.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/references/probe-style.md @@ -11,7 +11,7 @@ Rules for writing direct-question probes. The locked question forms come from `r 1. Each probe begins with a question form from the locked list: `what`, `when`, `who`, `which`, `how`, `why`, `where`, `does`, `can`, `will`, `should`, `is`, `are`, `do`. 2. Each probe ends with `?`. 3. Each probe targets one specific gap. Do not combine two questions into a single probe. -4. The probe names the specific subject (role, field, condition, term) - not the artifact or a generic concept. +4. The probe names the specific subject (role, field, condition, term), not the artifact or a generic concept. 5. Prefer the shortest question form that makes the gap actionable. Avoid preamble. --- @@ -20,11 +20,11 @@ Rules for writing direct-question probes. The locked question forms come from `r These probes satisfy all 5 rules. The question form used is noted for clarity. -- `who` - Who is responsible for approving the access request before it is acted on? -- `what` - What should the system return to the caller when the payment provider responds with a timeout? -- `which` - Which user roles are permitted to delete a published record? -- `how` - How is the session invalidated when the user's account is suspended mid-session? -- `does` - Does the 10 MB file-size limit apply to each individual file in a multi-file upload or to the combined total? +- `who`: Who is responsible for approving the access request before it is acted on? +- `what`: What should the system return to the caller when the payment provider responds with a timeout? +- `which`: Which user roles are permitted to delete a published record? 
+- `how`: How is the session invalidated when the user's account is suspended mid-session? +- `does`: Does the 10 MB file-size limit apply to each individual file in a multi-file upload or to the combined total? --- @@ -32,6 +32,6 @@ These probes satisfy all 5 rules. The question form used is noted for clarity. These examples violate one or more rules. Do not write probes in these forms. -- "The spec is unclear about authentication." - Statement, no question form, does not end with `?`. Describes the problem abstractly instead of asking for the specific missing information. -- "Authentication and authorization both need clarification, and the roles section is incomplete." - Combines multiple targets in one sentence. Breaks rule 3 (one specific gap per probe). -- "Could you clarify the access control model?" - Vague subject. Does not identify which part of the access control model is missing or ambiguous. Breaks rule 4. +- "The spec is unclear about authentication." Statement, no question form, does not end with `?`. Describes the problem abstractly instead of asking for the specific missing information. +- "Authentication and authorization both need clarification, and the roles section is incomplete." Combines multiple targets in one sentence. Breaks rule 3 (one specific gap per probe). +- "Could you clarify the access control model?" Vague subject. Does not identify which part of the access control model is missing or ambiguous. Breaks rule 4. diff --git a/plugins/aidd-refine/skills/04-shadow-areas/references/severity-rubric.md b/plugins/aidd-refine/skills/04-shadow-areas/references/severity-rubric.md index 0fe42c1d..23dac7cb 100644 --- a/plugins/aidd-refine/skills/04-shadow-areas/references/severity-rubric.md +++ b/plugins/aidd-refine/skills/04-shadow-areas/references/severity-rubric.md @@ -24,7 +24,7 @@ The 3 locked severity tiers below come from `references/locked-sets.json`. 
Assig ## major -**Decision rule**: The gap does not prevent starting work, but it will cause rework - incomplete implementation, a failed review cycle, or a missed requirement that surfaces during testing. +**Decision rule**: The gap does not prevent starting work, but it will cause rework: incomplete implementation, a failed review cycle, or a missed requirement that surfaces during testing. **When to assign**: - A failure mode is undescribed; it will be discovered during integration or QA and require a code change. @@ -39,7 +39,7 @@ The 3 locked severity tiers below come from `references/locked-sets.json`. Assig ## minor -**Definition rule**: The gap is cosmetic or affects documentation clarity only. Resolving it improves precision or readability but will not change an implementation decision or require rework. +**Decision rule**: The gap is cosmetic or affects documentation clarity only. Resolving it improves precision or readability but will not change an implementation decision or require rework. **When to assign**: - An ambiguous term exists in a non-critical context where both interpretations lead to the same implementation. diff --git a/plugins/aidd-refine/skills/05-fact-check/SKILL.md b/plugins/aidd-refine/skills/05-fact-check/SKILL.md index b09fae28..460737f8 100644 --- a/plugins/aidd-refine/skills/05-fact-check/SKILL.md +++ b/plugins/aidd-refine/skills/05-fact-check/SKILL.md @@ -1,6 +1,6 @@ --- name: 05-fact-check -description: Verify factual claims in a piece of text against authoritative sources and rewrite it with footnote citations, hedging any claim that cannot be confirmed. Runs a cheapest-first verification cascade (project memory and docs, then codebase inspection, then web lookup) and reports both sources when they disagree. 
Use when the user says "fact-check this", "verify that claim", "are you sure about that", "is that actually true", "cite your sources", "where did you get that fact", "did you make that up", "double-check the version you gave me", "vérifie cette information", or "es-tu sûr de ça". Do NOT use to auto-guard the AI's own output (this skill only fires on an explicit request), to judge code logic correctness, or to clarify vague requirements through iterative Q&A - use `aidd-refine:01-brainstorm` for that. +description: Verify factual claims in a text against authoritative sources and rewrite it with footnote citations, hedging anything unconfirmed. Use to fact-check, verify a claim, or cite sources on explicit request; not for judging code logic or clarifying vague requirements (use aidd-refine:01-brainstorm). argument-hint: identify-claims | verify | report --- @@ -8,7 +8,7 @@ argument-hint: identify-claims | verify | report Verifies the factual claims inside a target text and rewrites it grounded in evidence. The skill extracts each verifiable claim, runs a cheapest-first verification cascade, and emits a rewritten answer where every confirmed claim carries a footnote citation and every unconfirmed claim is explicitly hedged. When sources disagree, both are reported rather than one being silently chosen. -## Available actions +## Actions | # | Action | Role | Input | | --- | ----------------- | ----------------------------------------------------------------------------- | --------------------------- | @@ -22,25 +22,21 @@ Sequential skill: `01 → 02 → 03`. No skipping. The router materializes the t ## Transversal rules -- Never alter the meaning of a claim while verifying it - verify what was stated, not a charitable reinterpretation. -- The verification cascade is cheapest-first and short-circuits: stop at the first tier that resolves a claim. See `@references/verification-cascade.md`. 
+- Never alter the meaning of a claim while verifying it: verify what was stated, not a charitable reinterpretation. +- The verification cascade is cheapest-first and short-circuits: stop at the first tier that resolves a claim. See `references/verification-cascade.md`. - A web lookup is the last resort, never the first. Reach it only when project memory and codebase inspection both fail to resolve a claim. -- Claim categories come from the locked set in `@references/claim-categories.md`. Opinion, preference, and trivially-known statements are not claims and are skipped. -- When two sources disagree, report both with their origin - never silently pick one. +- Claim categories come from the locked set in `references/claim-categories.md`. Opinion, preference, and trivially-known statements are not claims and are skipped. +- When two sources disagree, report both with their origin, never silently pick one. - An unverified claim is never deleted and never asserted as fact: it is kept and marked `(unverified - no source found)`. -- Caching a verified fact is opt-in: propose it with a recommendation, never cache silently. The skill itself stores nothing - on approval it restates the fact for the user's own memory tooling. Persistence is out of scope. -- The final report is reader-facing prose - the corrected text, `## Sources`, and `## Unverified claims`, nothing else. Internal mechanics never appear in the output: no cascade or tier trace (`Cascade:`, `tier 1/2/3`, `miss`, `resolved`), no category labels, no raw verdict words. State conclusions, not the process. Action 03 holds the exhaustive forbidden list. +- Caching a verified fact is opt-in: propose it with a recommendation, never cache silently. The skill persists nothing; the mechanics live in action 03. +- The final report is reader-facing prose: the corrected text, `## Sources`, and `## Unverified claims`, nothing else. 
Internal mechanics never appear in the output: no cascade or tier trace (`Cascade:`, `tier 1/2/3`, `miss`, `resolved`), no category labels, no raw verdict words. State conclusions, not the process. Action 03 holds the exhaustive forbidden list. - The report is rendered in plain prose and is never restyled by an active session output mode (terse, caveman, condensed). The skill's output format is fixed by action 03 alone. ## References -- `@references/claim-categories.md` - locked taxonomy of verifiable claim categories with definition and example. -- `@references/verification-cascade.md` - the three-tier cascade, short-circuit rule, web-cost guardrail. +- `references/claim-categories.md`: locked taxonomy of verifiable claim categories with definition and example. +- `references/verification-cascade.md`: the three-tier cascade, short-circuit rule, web-cost guardrail. ## Assets -- `@assets/report-template.md` - rewritten-answer skeleton with the `## Sources` footnote block. - -## External data - -- None. +- `assets/report-template.md`: rewritten-answer skeleton with the `## Sources` footnote block. diff --git a/plugins/aidd-refine/skills/05-fact-check/actions/01-identify-claims.md b/plugins/aidd-refine/skills/05-fact-check/actions/01-identify-claims.md index 07af7644..472c229a 100644 --- a/plugins/aidd-refine/skills/05-fact-check/actions/01-identify-claims.md +++ b/plugins/aidd-refine/skills/05-fact-check/actions/01-identify-claims.md @@ -1,30 +1,23 @@ # 01 - Identify claims -Extract every verifiable factual claim from the target text and classify each one. +Pull every verifiable factual claim out of the target text and tag each one. -## Inputs +## Input -- `target_text` (required) - string, the text whose facts must be checked. The user's prior answer, a quoted passage, or an explicitly pasted block. +- The text whose facts need checking: the user's prior answer, a quoted passage, or a pasted block. -## Outputs +## Output -A claim list. 
Each entry pairs the claim text with one category from the locked taxonomy. - -```json -[ - { "claim": "Next.js 15 shipped the use cache directive in 2024", "category": "version" }, - { "claim": "the file aidd_docs/memory/architecture.md exists in this repo", "category": "project-fact" } -] -``` +A list of claims, each paired with one category from the locked taxonomy. ## Process -1. Read `target_text` sentence by sentence. -2. For each sentence, decide: does it assert a fact? Split a mixed sentence into its factual part and its opinion part. -3. Drop every non-claim per `@../references/claim-categories.md` - opinion, preference, trivially-known general knowledge, the AI's own intent. -4. Assign each surviving claim exactly one category from the locked taxonomy. When two categories fit, prefer the one routing to the cheapest tier (`project-fact` over `hard-to-know` for repo claims). -5. Emit the claim list. If the list is empty, report "no verifiable claims" and stop the skill. +1. **Read.** Go through the text sentence by sentence. +2. **Decide.** For each sentence, ask whether it states a fact. Split a mixed sentence into its factual part and its opinion part. +3. **Drop.** Discard every non-claim per `@../references/claim-categories.md`: opinion, preference, trivially-known general knowledge, the AI's own intent. +4. **Tag.** Give each surviving claim one category. When two fit, pick the one routing to the cheapest tier (a repo fact over a hard-to-know fact). +5. **Emit.** Return the claim list. If it is empty, report "no verifiable claims" and stop the skill. ## Test -Run on the text `"Next.js 15 shipped the use cache directive in 2024. This naming is clean."` - the output lists the first sentence as a claim (category `version` or `date-event-person`) and excludes "This naming is clean" as opinion. +- Run on `"Next.js 15 shipped the use cache directive in 2024. 
This naming is clean."`: the output lists the first sentence as a claim and excludes "This naming is clean" as opinion. diff --git a/plugins/aidd-refine/skills/05-fact-check/actions/02-verify.md b/plugins/aidd-refine/skills/05-fact-check/actions/02-verify.md index d7188885..18de9541 100644 --- a/plugins/aidd-refine/skills/05-fact-check/actions/02-verify.md +++ b/plugins/aidd-refine/skills/05-fact-check/actions/02-verify.md @@ -1,40 +1,24 @@ # 02 - Verify -Run the cheapest-first verification cascade against each claim and assign it a verdict. +Run the cheapest-first verification cascade against each claim and give it a verdict. -## Inputs +## Input -- `claim_list` (required) - the classified claim list from action 01. +- The tagged claims from `01-identify-claims`. -## Outputs +## Output -A verdict list. Each claim gains a verdict and its supporting sources. - -```json -[ - { - "claim": "the source file plugins/aidd-refine/hooks/condense-stats.js exists in this repo", - "category": "project-fact", - "verdict": "verified", - "tier": "codebase", - "sources": ["plugins/aidd-refine/hooks/condense-stats.js"] - } -] -``` - -## Depends on - -- `01-identify-claims` +A list of verdicts: each claim gains one verdict (verified, refuted, conflict, or unverified) with the sources behind it and the tier that resolved it. ## Process -1. For each claim, walk the cascade in `@../references/verification-cascade.md`: tier 1 project memory and docs, tier 2 codebase inspection, tier 3 web lookup. -2. Route by category - `project-fact` favors tiers 1 and 2; other categories favor tier 1 then tier 3. -3. Short-circuit: the first tier that resolves the claim sets the verdict. Do not consult later tiers. -4. Respect the web-cost guardrail - reach tier 3 only after tiers 1 and 2 fail, prefer one authoritative source, stop once resolved. -5. 
Assign exactly one verdict: `verified` (record every source), `conflict` (record both sides with origin, pick no winner), or `unverified` (cascade exhausted, no source). -6. Emit the verdict list. +1. **Walk.** For each claim, walk the cascade in `@../references/verification-cascade.md`: first project memory and docs, then codebase inspection, then web lookup. +2. **Route.** Send repo facts to memory and codebase first; send other claims to memory then the web. +3. **Short-circuit.** The first tier that resolves a claim sets its verdict. Do not consult later tiers. +4. **Guard.** Reach the web only after memory and codebase both fail. Prefer one authoritative source, and stop once resolved. +5. **Judge.** Give each claim one verdict: verified (record every source), refuted (a source contradicts the claim, record it), conflict (record both sides with their origin, pick no winner), or unverified (cascade exhausted, no source). +6. **Emit.** Return the verdict list. ## Test -Run on the single claim `"the source file plugins/aidd-refine/hooks/condense-stats.js exists in this repo"` - the cascade resolves at the codebase tier (tier 2), the verdict is `verified`, the source is that file path, and the web tier is never reached. +- Run on `"the source file plugins/aidd-refine/hooks/condense-stats.js exists in this repo"`: the cascade resolves at the codebase tier, the verdict is verified, the source is that file path, and the web tier is never reached. diff --git a/plugins/aidd-refine/skills/05-fact-check/actions/03-report.md b/plugins/aidd-refine/skills/05-fact-check/actions/03-report.md index 887a613b..e1c41cff 100644 --- a/plugins/aidd-refine/skills/05-fact-check/actions/03-report.md +++ b/plugins/aidd-refine/skills/05-fact-check/actions/03-report.md @@ -1,49 +1,27 @@ # 03 - Report -Rewrite the original text grounded in the verdicts: cite verified claims, hedge unverified ones, surface conflicts. The output is reader-facing prose only. 
+Rewrite the original text on the evidence: cite verified claims, hedge unverified ones, surface conflicts. The output is reader-facing prose only. -## Output discipline (hard constraint - read first) +## Input -The delivered output contains EXACTLY these blocks, in this order, and nothing else: +- The verdicts from `02-verify`. +- The original text, reused as the base for the rewrite. -1. The rewritten text, with a `[n]` marker on each verified claim and a `(unverified - no source found)` marker on each unverified claim. -2. A `## Sources` block. -3. A `## Unverified claims` block - only when at least one claim is unverified. +## Output -The following are internal to actions 01 and 02. They are FORBIDDEN in the output - never write them: - -- Any cascade or tier trace. Never emit the words `Cascade`, `tier 1`, `tier 2`, `tier 3`, `miss`, `N/A`, or `resolved` in the output. -- Any category label - `hard-to-know`, `version`, `api-signature`, `date-event-person`, `project-fact`. -- Any raw verdict vocabulary - `verdict`, `claim false`, `claim true`, or the enum values `verified` / `conflict` / `unverified` used as a status word (the inline `(unverified - no source found)` marker is the one allowed exception). -- Any sentence explaining why a cache line was or was not added. -- The report is plain prose. It is NOT styled by any active session output mode (terse, caveman, condensed, etc.). Render it normally regardless of how the surrounding conversation is styled. - -Before delivering, scan the draft: if any line contains a forbidden item, delete that line. - -## Inputs - -- `verdict_list` (required) - the verdict list from action 02. -- `target_text` (required) - the original text, reused as the rewrite base. 
- -## Outputs - -The rewritten answer following `@../assets/report-template.md`: original content preserved, a `[n]` marker on each verified claim, a `(unverified - no source found)` marker on each unverified claim, conflicts stated with both sides, and a `## Sources` footnote block. - -## Depends on - -- `02-verify` +The rewritten answer per `@../assets/report-template.md`, obeying `@../references/report-output-discipline.md`. ## Process -1. Copy `@../assets/report-template.md` as the structure. -2. Rewrite `target_text`: append `[n]` to each `verified` claim, numbered in reading order. -3. For each `conflict`, state both sides in full ("Source A reports X; source B reports Y") - choose no winner. -4. Append `(unverified - no source found)` to each `unverified` claim; never delete it, never assert it. -5. Build the `## Sources` block - one numbered entry per source, with title or file path, location, and the claim it verifies. Conflicts get one entry per side. -6. Add the `## Unverified claims` section only when at least one claim is unverified; omit it otherwise. -7. If any verified fact is stable (project paths, pinned-version APIs), append a single cache-suggestion line proposing the user cache it, with a yes/no recommendation. The skill persists nothing itself: on approval, restate the fact and its source plainly so the user (or their memory tooling) can store it. When no fact qualifies, omit the line silently - never explain its absence. -8. Apply the Output discipline scan above, then deliver. +1. **Copy.** Start from `@../assets/report-template.md`. +2. **Rewrite.** Carry the original text over, appending `[n]` to each verified claim, numbered in reading order. Replace each refuted claim with the corrected fact and cite the contradicting source `[n]`; never restate the false claim as true. +3. **Surface.** For each conflict, state both sides in full ("Source A reports X; source B reports Y"), choosing no winner. +4. 
**Mark.** Append the exact marker `(unverified - no source found)`, verbatim, to each unverified claim. Never delete the claim, never assert it. +5. **Cite.** Build the `## Sources` block: one numbered entry per source, with its title or file path, location, and the claim it backs. Each side of a conflict gets its own entry. +6. **List.** Add the `## Unverified claims` section only when at least one claim is unverified; otherwise omit it. +7. **Suggest.** When a verified fact is stable (project paths, pinned-version APIs), append one cache-suggestion line with a yes/no recommendation. The skill stores nothing itself: on approval, restate the fact and its source so the user's own memory tooling can keep it. When nothing qualifies, omit the line silently. +8. **Scrub.** Apply `@../references/report-output-discipline.md`, then deliver. ## Test -Given one `verified` claim and one `unverified` claim, the rendered output contains a `## Sources` section with a `[1]` footnote for the verified claim and an inline `(unverified - no source found)` marker on the other, and contains none of the forbidden words (`Cascade`, `tier 1/2/3`, `verdict`, category labels). +- Given one verified claim and one unverified claim, the output carries a `## Sources` section with a `[1]` footnote for the verified claim, an inline `(unverified - no source found)` marker on the other, and none of the forbidden words from `@../references/report-output-discipline.md`. diff --git a/plugins/aidd-refine/skills/05-fact-check/assets/report-template.md b/plugins/aidd-refine/skills/05-fact-check/assets/report-template.md index 8389af5f..3387cf29 100644 --- a/plugins/aidd-refine/skills/05-fact-check/assets/report-template.md +++ b/plugins/aidd-refine/skills/05-fact-check/assets/report-template.md @@ -1,8 +1,4 @@ - - - - - + @@ -17,11 +13,10 @@ ## Unverified claims -- "" - cascade exhausted (memory, codebase, web), no source found. +- "": no source found. 
- > Cache suggestion: "" looks stable - cache it for reuse? (recommended: ) diff --git a/plugins/aidd-refine/skills/05-fact-check/references/claim-categories.md b/plugins/aidd-refine/skills/05-fact-check/references/claim-categories.md index 7518ec93..14bfec38 100644 --- a/plugins/aidd-refine/skills/05-fact-check/references/claim-categories.md +++ b/plugins/aidd-refine/skills/05-fact-check/references/claim-categories.md @@ -9,10 +9,10 @@ Locked taxonomy. Every extracted claim is assigned exactly one category. A state | `version` | A version, release number, or the existence of a package or tool | "React 19 is released", "the `zod` package exists" | | `api-signature` | A function or method signature, its parameters, return type, or documented behavior | "`useEffect` runs after paint", "`fetch` returns a Promise" | | `date-event-person` | A date, event, release timeline, or a fact about a person | "Node 22 shipped in 2024", "X wrote library Y" | -| `project-fact` | A claim about this repository - a file, function, config value, or structure | "the file `src/auth.ts` exists", "the API runs on port 3000" | -| `hard-to-know` | Any non-trivially-knowable fact not covered above - statistics, quotes, external facts | "this framework has 40k stars", "the RFC says Z" | +| `project-fact` | A claim about this repository: a file, function, config value, or structure | "the file `src/auth.ts` exists", "the API runs on port 3000" | +| `hard-to-know` | Any non-trivially-knowable fact not covered above: statistics, quotes, external facts | "this framework has 40k stars", "the RFC says Z" | -## Not claims - skip +## Not claims: skip - Opinion, preference, or aesthetic judgment ("this naming is clean", "the design feels heavy"). - Trivially-known general knowledge a competent reader would never dispute ("HTTP 404 means not found"). @@ -21,4 +21,4 @@ Locked taxonomy. Every extracted claim is assigned exactly one category. 
A state ## Classification rule -When a sentence mixes a fact and an opinion, split it: verify the fact, drop the opinion. When a claim could fit two categories, pick the one that drives the cheapest verification tier - `project-fact` over `hard-to-know` whenever the claim concerns this repository. +When a sentence mixes a fact and an opinion, split it: verify the fact, drop the opinion. When a claim could fit two categories, pick the one that drives the cheapest verification tier: `project-fact` over `hard-to-know` whenever the claim concerns this repository. diff --git a/plugins/aidd-refine/skills/05-fact-check/references/report-output-discipline.md b/plugins/aidd-refine/skills/05-fact-check/references/report-output-discipline.md new file mode 100644 index 00000000..05cf3955 --- /dev/null +++ b/plugins/aidd-refine/skills/05-fact-check/references/report-output-discipline.md @@ -0,0 +1,25 @@ +# Report output discipline + +The hard constraint on what the `03-report` action may deliver. Read it before drafting the report. + +## What the output contains + +Exactly these blocks, in this order, and nothing else: + +1. The rewritten text, with a `[n]` marker on each verified or corrected claim and a `(unverified - no source found)` marker on each unverified claim. +2. A `## Sources` block. +3. A `## Unverified claims` block, only when at least one claim is unverified. + +## What is forbidden + +These belong to the earlier actions and never appear in the output: + +- Any cascade or tier trace. Never write `Cascade`, `tier 1`, `tier 2`, `tier 3`, `miss`, `N/A`, or `resolved`. +- Any category label: `hard-to-know`, `version`, `api-signature`, `date-event-person`, `project-fact`. +- Any raw verdict word: `verdict`, `claim false`, `claim true`, or the values `verified` / `refuted` / `conflict` / `unverified` used as a status (the inline `(unverified - no source found)` marker is the one allowed exception). 
+- How a claim was checked: shell commands, `ls` / `find` / grep output, or phrases like "by inspection" or "codebase inspection". Cite the source and state the conclusion, not the method. +- Any sentence explaining why a cache line was or was not added. + +The report is plain prose. No active session output mode (terse, caveman, condensed) restyles it; render it normally however the surrounding conversation is styled. + +Before delivering, scan the draft: if a line carries any forbidden item, delete that line. diff --git a/plugins/aidd-refine/skills/05-fact-check/references/verification-cascade.md b/plugins/aidd-refine/skills/05-fact-check/references/verification-cascade.md index 08b90fb8..dd5af0c4 100644 --- a/plugins/aidd-refine/skills/05-fact-check/references/verification-cascade.md +++ b/plugins/aidd-refine/skills/05-fact-check/references/verification-cascade.md @@ -1,6 +1,6 @@ # Verification cascade -Claims are verified cheapest-first. The cascade short-circuits: as soon as a tier resolves a claim, stop - do not consult the remaining tiers. +Claims are verified cheapest-first. The cascade short-circuits: as soon as a tier resolves a claim, stop and skip the remaining tiers. ## Tiers @@ -12,12 +12,12 @@ Claims are verified cheapest-first. The cascade short-circuits: as soon as a tie ## Short-circuit rule -For each claim, walk tiers in order. The first tier that produces a clear answer resolves the claim - record the verdict and the source, move to the next claim. Never run a later tier once an earlier one has resolved. +For each claim, walk tiers in order. The first tier that produces a clear answer resolves the claim: record the verdict and the source, then move to the next claim. Never run a later tier once an earlier one has resolved. ## Tier routing by category -- `project-fact` - tier 1, then tier 2. A web lookup is almost never needed and must be skipped once tier 1 or 2 resolves. 
-- `version`, `api-signature`, `date-event-person`, `hard-to-know` - tier 1 first (a pinned version or doc may already answer it), then tier 3. Tier 2 only helps when the project itself embeds the fact. +- `project-fact`: tier 1, then tier 2. A web lookup is almost never needed and must be skipped once tier 1 or 2 resolves. +- `version`, `api-signature`, `date-event-person`, `hard-to-know`: tier 1 first (a pinned version or doc may already answer it), then tier 3. Tier 2 only helps when the project itself embeds the fact. ## Web-cost guardrail @@ -25,12 +25,13 @@ A web lookup is a last resort, never an opener. - Reach tier 3 only when tiers 1 and 2 both failed to resolve the claim. - Prefer one authoritative source (official documentation, the package registry, the primary publication) over many low-quality pages. -- Stop as soon as the claim is resolved or a contradiction is found - do not keep fetching for extra confirmation. +- Stop as soon as the claim is resolved or a contradiction is found; do not keep fetching for extra confirmation. ## Verdicts Each verified claim ends in exactly one verdict: -- `verified` - one or more sources confirm the claim. Record every source. -- `conflict` - sources disagree. Record both sides with their origin; do not pick a winner. -- `unverified` - no tier produced a source. The claim is kept and hedged, never asserted and never deleted. +- `verified`: one or more sources confirm the claim. Record every source. +- `refuted`: a source contradicts the claim. Record the contradicting source. +- `conflict`: sources disagree. Record both sides with their origin; do not pick a winner. +- `unverified`: no tier produced a source. The claim is kept and hedged, never asserted and never deleted. diff --git a/scripts/skill-eval.mjs b/scripts/skill-eval.mjs new file mode 100644 index 00000000..ab66e1f7 --- /dev/null +++ b/scripts/skill-eval.mjs @@ -0,0 +1,140 @@ +#!/usr/bin/env node +// Behavioral eval harness for aidd-refine skills. 
+// +// Each case runs the skill for real through a headless `claude -p`, in an +// isolated temp project where the skill is installed under a unique name as a +// project skill (`.claude/skills//`). The unique name guarantees the +// worktree copy runs, never a globally-installed plugin of the same name. +// +// Usage: +// node scripts/skill-eval.mjs # run every case (deterministic checks) +// node scripts/skill-eval.mjs 04-shadow-areas # run cases for one skill +// node scripts/skill-eval.mjs --judge # also run LLM-judge criteria (metered) +// node scripts/skill-eval.mjs --keep # keep temp dirs for inspection +// +// Local / opt-in only: needs an authenticated `claude` CLI and spends tokens. +// Not a CI gate. + +import { mkdtempSync, mkdirSync, writeFileSync, cpSync, rmSync, readFileSync, existsSync } from "node:fs"; +import { tmpdir } from "node:os"; +import { join, dirname, resolve } from "node:path"; +import { spawnSync } from "node:child_process"; +import { fileURLToPath } from "node:url"; + +const HERE = dirname(fileURLToPath(import.meta.url)); +const ROOT = resolve(HERE, ".."); +const SKILLS_DIR = join(ROOT, "plugins", "aidd-refine", "skills"); +const CASES = JSON.parse(readFileSync(join(HERE, "skill-eval", "cases.json"), "utf8")); + +const args = process.argv.slice(2); +const JUDGE = args.includes("--judge"); +const KEEP = args.includes("--keep"); +const filter = args.find((a) => !a.startsWith("--")); +const cases = CASES.filter((c) => !filter || c.skill === filter); + +if (cases.length === 0) { + console.error(`No cases match ${filter ? `"${filter}"` : "(any)"}.`); + process.exit(2); +} + +function runClaude(prompt, cwd) { + const res = spawnSync( + "claude", + // --setting-sources project,local isolates the run from the user's global + // settings (hooks, output modes) so results are reproducible across machines. 
+ ["-p", prompt, "--setting-sources", "project,local", "--add-dir", cwd, "--dangerously-skip-permissions"], + { cwd, input: "", encoding: "utf8", timeout: 600000, maxBuffer: 10 * 1024 * 1024 }, + ); + if (res.error) throw res.error; + return (res.stdout || "") + (res.stderr || ""); +} + +function has(haystack, needle) { + return haystack.toLowerCase().includes(String(needle).toLowerCase()); +} + +// One assertion = { ok: boolean, label: string } +function evaluate(expect, ctx) { + const checks = []; + const fileText = (name) => (existsSync(join(ctx.tmp, name)) ? readFileSync(join(ctx.tmp, name), "utf8") : null); + + for (const name of expect.filesExist || []) { + checks.push({ ok: existsSync(join(ctx.tmp, name)), label: `file exists: ${name}` }); + } + for (const [name, subs] of Object.entries(expect.fileContains || {})) { + const text = fileText(name); + for (const s of subs) checks.push({ ok: text != null && has(text, s), label: `${name} contains "${s}"` }); + } + for (const [name, subs] of Object.entries(expect.fileNotContains || {})) { + const text = fileText(name); + for (const s of subs) checks.push({ ok: text != null && !has(text, s), label: `${name} omits "${s}"` }); + } + for (const s of expect.stdoutContains || []) { + checks.push({ ok: has(ctx.stdout, s), label: `stdout contains "${s}"` }); + } + for (const s of expect.stdoutNotContains || []) { + checks.push({ ok: !has(ctx.stdout, s), label: `stdout omits "${s}"` }); + } + for (const re of expect.stdoutMatches || []) { + checks.push({ ok: new RegExp(re, "i").test(ctx.stdout), label: `stdout matches /${re}/` }); + } + + if (expect.judge) { + if (!JUDGE) { + checks.push({ ok: true, skipped: true, label: `judge (skipped, pass --judge): ${expect.judge}` }); + } else { + const evidence = [ctx.stdout, ...(expect.judgeFiles || []).map((n) => fileText(n) || "")].join("\n\n"); + const verdict = runClaude( + `You grade a test. Criterion: "${expect.judge}". 
Output under test follows between <<< and >>>.\n` + + `Reply with exactly PASS or FAIL on the first line, then one line of reason.\n<<<\n${evidence}\n>>>`, + ctx.tmp, + ); + checks.push({ ok: /^\s*pass\b/i.test(verdict), label: `judge: ${expect.judge}`, note: verdict.split("\n")[0].trim() }); + } + } + return checks; +} + +function setupCase(c) { + const tmp = mkdtempSync(join(tmpdir(), "skilleval-")); + const skillDst = join(tmp, ".claude", "skills", c.evalName); + mkdirSync(skillDst, { recursive: true }); + cpSync(join(SKILLS_DIR, c.skill), skillDst, { recursive: true }); + // Rewrite the frontmatter name so it matches the unique eval folder. + const skillMd = join(skillDst, "SKILL.md"); + const rewritten = readFileSync(skillMd, "utf8").replace(/^name:.*$/m, `name: ${c.evalName}`); + writeFileSync(skillMd, rewritten); + for (const [rel, content] of Object.entries(c.setup?.files || {})) { + const dst = join(tmp, rel); + mkdirSync(dirname(dst), { recursive: true }); + writeFileSync(dst, content); + } + return tmp; +} + +console.log(`Running ${cases.length} case(s)${JUDGE ? " with --judge" : ""}.\n`); +let failed = 0; +for (const c of cases) { + const tmp = setupCase(c); + let checks; + try { + const prompt = c.prompt.replaceAll("{{SKILL}}", c.evalName); + const stdout = runClaude(prompt, tmp); + checks = evaluate(c.expect, { tmp, stdout }); + } catch (err) { + checks = [{ ok: false, label: `run error: ${err.message}` }]; + } + const caseFailed = checks.some((k) => !k.ok); + if (caseFailed) failed++; + console.log(`${caseFailed ? "FAIL" : "PASS"} ${c.skill} :: ${c.name}`); + for (const k of checks) { + const mark = k.skipped ? "~" : k.ok ? "✓" : "✗"; + console.log(` ${mark} ${k.label}${k.note ? ` (${k.note})` : ""}`); + } + if (KEEP) console.log(` tmp: ${tmp}`); + else rmSync(tmp, { recursive: true, force: true }); + console.log(""); +} + +console.log(`${cases.length - failed}/${cases.length} cases passed.`); +process.exit(failed ? 
1 : 0); diff --git a/scripts/skill-eval/README.md b/scripts/skill-eval/README.md new file mode 100644 index 00000000..663cda8e --- /dev/null +++ b/scripts/skill-eval/README.md @@ -0,0 +1,78 @@ +# skill-eval + +Behavioral eval harness for the `aidd-refine` skills. Runs each skill for real +through a headless `claude -p` and asserts the outcomes its action `## Test` +blocks describe. + +## Run + +```bash +node scripts/skill-eval.mjs # every case, deterministic checks +node scripts/skill-eval.mjs 04-shadow-areas # one skill +node scripts/skill-eval.mjs --judge # also run LLM-judge criteria (metered) +node scripts/skill-eval.mjs --keep # keep temp dirs to inspect +``` + +Local and opt-in. Needs an authenticated `claude` CLI and spends tokens, so it +is not a CI gate. + +## How it works + +Each case runs in a throwaway temp project. The skill under test is copied into +`.claude/skills//` and its frontmatter `name` is rewritten to a unique +`xeval-*` value. The unique name guarantees the worktree copy runs, never a +globally-installed plugin of the same name. The harness writes the case setup +files, runs `claude -p` in that project, then checks the written files and the +output. + +## Cases + +`cases.json` holds the cases. Each one: + +```json +{ + "skill": "04-shadow-areas", + "evalName": "xeval-shadow", + "name": "short description", + "setup": { "files": { "prd.md": "..." } }, + "prompt": "Use the {{SKILL}} skill on ./prd.md ...", + "expect": { + "filesExist": ["prd-shadow-report.md"], + "fileContains": { "prd-shadow-report.md": ["## Gaps by Category"] }, + "fileNotContains": { "report.md": ["..."] }, + "stdoutContains": ["..."], + "stdoutNotContains": ["tier 1"], + "stdoutMatches": ["\\d+%"], + "judge": "natural-language criterion, only checked under --judge", + "judgeFiles": ["prd-shadow-report.md"] + } +} +``` + +`{{SKILL}}` is replaced with the case `evalName`. 
Deterministic checks run +always; `judge` criteria run only with `--judge` and use a second `claude -p` +as grader, for outcomes that cannot be matched literally. + +## Coverage and limits + +- Deterministic where possible (file written, filename rule, required sections, + hedged unverified claim, forbidden mechanics absent). Fuzzy outcomes + (claim-extraction quality, terseness) go to `--judge` and can flake. +- `01-brainstorm` is interactive (multi-turn Q&A) and is not covered here; a + single headless turn cannot exercise its loop. + +## Known findings + +The harness surfaced two limitations worth tracking: + +- **condense's `Condense: ON ().` line is model-emitted, not guaranteed.** + `02-stats` and the `condense-stats.js` hook parse that exact line from the + transcript, but the model paraphrases it ("Condense mode on, level full") + even when the action mandates the literal. So the condense case gates on + semantics (mentions condense + the level), not the literal, and stats + detection is best-effort. A robust fix would emit the marker from the hook + rather than rely on model output. +- **empty-source scanning varies.** Most runs produce exactly one blocker, but + the exact count/header drifts between runs. The case asserts deterministically + only that a report is written, and gates the "exactly one blocker" semantics + behind `--judge`. diff --git a/scripts/skill-eval/cases.json b/scripts/skill-eval/cases.json new file mode 100644 index 00000000..ad8bd4b8 --- /dev/null +++ b/scripts/skill-eval/cases.json @@ -0,0 +1,69 @@ +[ + { + "skill": "04-shadow-areas", + "evalName": "xeval-shadow", + "name": "scans a PRD and writes a structured report", + "setup": { + "files": { + "prd.md": "# Checkout PRD\n\nUsers can buy items. 
The system must respond quickly when the cart is submitted.\nOn submit, the order is created and a confirmation is shown.\n" + } + }, + "prompt": "Use the {{SKILL}} skill on ./prd.md to find blind spots and write the shadow report file next to it.", + "expect": { + "filesExist": ["prd-shadow-report.md"], + "fileContains": { "prd-shadow-report.md": ["# Shadow Areas Report", "## Gaps by Category", "Total gaps:", "[blocker]"] } + } + }, + { + "skill": "04-shadow-areas", + "evalName": "xeval-shadow", + "name": "filename rule keeps a dotless name (Makefile)", + "setup": { "files": { "Makefile": "build:\n\techo hi\n" } }, + "prompt": "Use the {{SKILL}} skill on ./Makefile to find blind spots and write the shadow report next to it.", + "expect": { "filesExist": ["Makefile-shadow-report.md"] } + }, + { + "skill": "04-shadow-areas", + "evalName": "xeval-shadow", + "name": "empty source yields a single blocker", + "setup": { "files": { "empty.md": "" } }, + "prompt": "Use the {{SKILL}} skill on ./empty.md to find blind spots and write the shadow report next to it.", + "expect": { + "filesExist": ["empty-shadow-report.md"], + "judge": "The report flags exactly one blocker gap, asking what content the (empty) artifact should contain.", + "judgeFiles": ["empty-shadow-report.md"] + } + }, + { + "skill": "05-fact-check", + "evalName": "xeval-factcheck", + "name": "verifies a real file, refutes a fake one, hedges the unknowable, hides mechanics", + "setup": { "files": { "src/auth.ts": "export const auth = true;\n" } }, + "prompt": "Use the {{SKILL}} skill to fact-check this text and rewrite it with citations: \"The file src/auth.ts exists in this repo. The file src/ghost9000.ts exists in this repo. 
The lead maintainer of this project owns exactly 47 cats.\"", + "expect": { + "stdoutContains": ["## Sources", "(unverified - no source found)"], + "stdoutNotContains": ["tier 1", "tier 2", "tier 3", "Cascade", "codebase inspection"], + "judge": "src/auth.ts is confirmed with a citation; the src/ghost9000.ts claim is corrected as not present (never restated as true); the 47-cats claim is marked unverified; no shell commands or inspection methods appear.", + "judgeFiles": [] + } + }, + { + "skill": "03-condense", + "evalName": "xeval-condense", + "name": "turning on confirms condense at the full level", + "prompt": "Use the {{SKILL}} skill to turn condense mode on, set to the full level (not lite, not ultra).", + "expect": { + "stdoutContains": ["condense", "full"] + } + }, + { + "skill": "02-challenge", + "evalName": "xeval-challenge", + "name": "flags a plan violation as a deal breaker with low confidence", + "prompt": "Use the {{SKILL}} skill. Agreed plan: the function must RETURN THE SUM of a and b. Work to review: function add(a, b) { return a - b }. Challenge the work against the plan and emit the report.", + "expect": { + "stdoutContains": ["confidence", "Deal breakers"], + "judge": "A confidence percentage below 75% is stated, and the subtraction (returning a - b instead of a + b) is listed as a deal breaker." + } + } +]
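
The `stdoutContains` and `stdoutNotContains` lists in the cases above are matched as case-insensitive substrings. The sketch below is a rough illustration of that behavior: `has` mirrors the helper in `scripts/skill-eval.mjs`, while `checkStdout` and the sample output string are hypothetical and exist only for this example.

```javascript
// Case-insensitive substring test, mirroring `has` in scripts/skill-eval.mjs.
function has(haystack, needle) {
  return haystack.toLowerCase().includes(String(needle).toLowerCase());
}

// Hypothetical helper (not part of the harness): applies the stdoutContains
// and stdoutNotContains lists of one `expect` block to a captured output.
function checkStdout(expect, stdout) {
  const checks = [];
  for (const s of expect.stdoutContains || []) {
    checks.push({ ok: has(stdout, s), label: `stdout contains "${s}"` });
  }
  for (const s of expect.stdoutNotContains || []) {
    checks.push({ ok: !has(stdout, s), label: `stdout omits "${s}"` });
  }
  return checks;
}

// Invented sample output, shaped like a fact-check report.
const sample =
  "The file src/auth.ts exists [1].\n\n## Sources\n[1] src/auth.ts\n\n" +
  "The maintainer owns 47 cats (unverified - no source found).";

const results = checkStdout(
  {
    stdoutContains: ["## Sources", "(unverified - no source found)"],
    stdoutNotContains: ["tier 1", "Cascade"],
  },
  sample,
);
const allPassed = results.every((k) => k.ok);
console.log(allPassed ? "PASS" : "FAIL");
```

Because both sides are lowercased, a forbidden entry such as `Cascade` also rejects output that uses the word in lowercase, so an ordinary sentence that mentions a "cascade" would fail that check.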