[Experiment] code-review: inline-knowledge arm (pre-#8700 al-code-review skill) by gggdttt · Pull Request #714 · microsoft/BC-Bench

gggdttt · 2026-06-29T15:20:44Z

Experiment Description

Replicate how BCApps prod ran Copilot PR review before microsoft/BCApps#8700 — the "inline knowledge" arm — for the code-review category.

Before #8700, the reviewer lived in-repo under tools/Code Review/: an al-code-review orchestrating super-skill that dispatched the 6 domain checklists (security / performance / style / accessibility / upgrade / privacy). #8700 later replaced this with a runtime clone+filter of microsoft/BCQuality ("live skills"). This branch reconstructs the pre-#8700 mechanism faithfully so we can measure it as a treatment arm.

The 6 domain checklists already landed on main via #707. This PR adds the orchestrating skill and wires the config.

Configuration Changes

Custom instructions (instructions.enabled: true) — superset; copies the whole microsoft-BCApps/ folder (the al-code-review skill + the 6 instructions/*.md checklists) into <repo>/.github/
Skills (skills.enabled: true) — not needed (covered by instructions.enabled)
Custom agents (agents.enabled: true, name: ___)
MCP servers
Other:
- Add instructions/microsoft-BCApps/skills/al-code-review/SKILL.md (super-skill; references ../../instructions/<domain>.md)
- config.yaml code-review-template now invokes the al-code-review skill (full-domain, no domain arg) instead of /review. The review.json output schema is unchanged (current evaluator contract preserved)
- Remove test-generation confounders (agents/ALTest.agent.md, skills/al-test-generation/) to keep the experiment variable clean
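To make the copy step and the skill's relative references concrete, here is a minimal sketch, not the harness's actual code. It assumes the contents of microsoft-BCApps/ land directly under <repo>/.github/ and that the checklists are named after the six domains listed above. The helper names `install_instruction_tree` and `missing_checklists` are hypothetical.

```python
from pathlib import Path
import shutil
import tempfile

# Domain names as listed in the description; the <domain>.md file naming is an assumption.
DOMAINS = ["security", "performance", "style", "accessibility", "upgrade", "privacy"]


def install_instruction_tree(source_tree: Path, repo_root: Path) -> Path:
    """Copy the contents of source_tree into <repo_root>/.github/ (hypothetical helper)."""
    target = repo_root / ".github"
    shutil.copytree(source_tree, target, dirs_exist_ok=True)
    return target


def missing_checklists(skill_file: Path) -> list[str]:
    """Return the domains whose ../../instructions/<domain>.md does not resolve."""
    instructions_dir = (skill_file.parent / ".." / ".." / "instructions").resolve()
    return [d for d in DOMAINS if not (instructions_dir / f"{d}.md").is_file()]


def demo() -> list[str]:
    """Build a throwaway tree shaped like the description and check the references."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "instructions" / "microsoft-BCApps"
        skill_dir = source / "skills" / "al-code-review"
        skill_dir.mkdir(parents=True)
        (skill_dir / "SKILL.md").write_text("# al-code-review\n")
        (source / "instructions").mkdir()
        for domain in DOMAINS:
            (source / "instructions" / f"{domain}.md").write_text(f"# {domain}\n")

        repo = root / "repo"
        repo.mkdir()
        github_dir = install_instruction_tree(source, repo)
        return missing_checklists(github_dir / "skills" / "al-code-review" / "SKILL.md")


if __name__ == "__main__":
    print(demo())
```

An empty list from `demo()` means every `../../instructions/<domain>.md` reference resolves after the copy; a non-empty list names the checklists the skill would fail to find.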

Agent & Model

Agent: GitHub Copilot CLI
Model: (default)
Category: code-review

Hypothesis / Expected Outcome

Injecting the pre-#8700 inline review knowledge (al-code-review skill + 6 domain checklists) should improve code-review quality (precision/recall/F1 of findings against gold) over the vanilla /review baseline, since the agent reviews against explicit domain rules instead of generic judgment. Expected ordering: vanilla < inline knowledge (this arm) < live BCQuality.
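For reference, the metrics named in the hypothesis can be sketched as follows. This is an illustration only: it assumes a finding matches gold when its (file, line) pair is identical, which may differ from the evaluator's actual matching rule, and `score_findings` is a hypothetical name.

```python
def score_findings(predicted: set, gold: set) -> dict:
    """Precision, recall and F1 of predicted findings against gold findings.

    Findings are compared by exact equality (assumed here: (file, line) tuples).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up findings for illustration; not taken from any benchmark run.
    predicted = {("a.al", 10), ("a.al", 42), ("b.al", 7)}
    gold = {("a.al", 10), ("b.al", 7), ("c.al", 3), ("c.al", 9)}
    print(score_findings(predicted, gold))
```

With these made-up inputs, two of three predictions are correct (precision 2/3) and two of four gold findings are recovered (recall 1/2), giving F1 = 4/7.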

Notes

Draft only — not meant to merge; serves as the entry point describing exactly what is evaluated.
All 81 codereview.jsonl entries target microsoft/BCApps, so only the microsoft-BCApps/ instruction tree matters here.
The al-code-review skill content is taken verbatim from the prior proven BC-Bench inline-knowledge run (commit 6c2437b).

Replicate how BCApps prod ran PR review before microsoft/BCApps#8700: the al-code-review orchestrating skill dispatches the 6 domain checklists (security/performance/style/accessibility/upgrade/privacy) already on main.

- Add skills/al-code-review/SKILL.md (super-skill referencing ../../instructions/<domain>.md)
- config.yaml: code-review-template invokes the al-code-review skill (full-domain); instructions.enabled=true so the whole microsoft-BCApps folder (skill + 6 instructions) is copied into .github/
- Remove test-generation confounders (agents/ALTest.agent.md, skills/al-test-generation) to keep the experiment variable clean
- review.json output schema unchanged (current evaluator contract preserved)

gggdttt wants to merge 1 commit into main from experiment/code-review/inline-knowledge
