Address residual P2s+P3 from re-audit of PR #402 by igerber · Pull Request #420 · igerber/diff-diff

igerber · 2026-05-12T22:18:10Z

Summary

Audit follow-up to PR #402. The restored CI reviewer surfaced three findings the degraded reviewer missed across its 8 R rounds:

P2 - The `llms-full.txt` HAD usage block reused one `data` symbol for both `aggregate="overall"` (two-period only) and `aggregate="event_study"` (multi-period) calls. A reader copy-pasting the block literally hit a front-door error on the second `fit()`. Split into explicit `data_2p` and `data_mp` panels with an explanatory header.
P2 - The practitioner Step-3 wording on both `_handle_had` and `_handle_had_event_study` said survey-weighted fits "skip QUG" and return a linearity-conditional verdict. That is only true for `pweight + PSU/FPC`. Stratified (`SurveyDesign(strata=...)`) and replicate-weight (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` on the linearity kernels (see `had_pretests.py:1725`, `:1927`). Both step descriptions now qualify the supported subset and call out the deferred regimes explicitly.
P3 - The REGISTRY claim that HAD constructor/fit "signatures match the real API (regression-tested via inspect.signature)" overstated what `test_llms_full_had_*_signature_matches_real_api` actually checks - the tests assert parameter-name presence only, not defaults, type annotations, or return-type unions. Relax the REGISTRY wording to match the test's actual contract.
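The first P2 item comes down to one panel being unable to satisfy both period-count gates. A minimal sketch of that failure mode follows, assuming a hypothetical `fit_sketch` stand-in rather than the real `HeterogeneousAdoptionDiD.fit()`. The exactly-two-periods rule for `aggregate="overall"` comes from this thread; the more-than-two rule for `aggregate="event_study"` and the toy row layout are assumptions made for illustration.

```python
from typing import Dict, List

Panel = List[Dict[str, float]]


def make_panel(n_units: int, n_periods: int) -> Panel:
    """Build a toy long-format panel: one row per (unit, period)."""
    return [
        {"unit": float(u), "period": float(t)}
        for u in range(n_units)
        for t in range(1, n_periods + 1)
    ]


def fit_sketch(panel: Panel, aggregate: str) -> str:
    """Hypothetical period-count gate; NOT the library's validator."""
    n_periods = len({row["period"] for row in panel})
    if aggregate == "overall" and n_periods != 2:
        raise ValueError(f"overall needs exactly 2 periods, got {n_periods}")
    # Assumption: the event-study path needs more than two periods.
    if aggregate == "event_study" and n_periods <= 2:
        raise ValueError(f"event_study needs >2 periods, got {n_periods}")
    return f"{aggregate}: {n_periods} periods"


# Separate symbols, mirroring the data_2p / data_mp split in the guide.
data_2p = make_panel(n_units=3, n_periods=2)
data_mp = make_panel(n_units=3, n_periods=4)

overall_summary = fit_sketch(data_2p, aggregate="overall")
event_study_summary = fit_sketch(data_mp, aggregate="event_study")
```

Passing `data_2p` to the `event_study` call, or `data_mp` to the `overall` call, raises in this sketch. That is the copy-paste trap the single shared `data` symbol created.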

No estimator behavior, weighting, variance/SE, identification check, or default statistical surface changed - guidance/documentation accuracy only.

Test plan

CI - existing `tests/test_guides.py` and `tests/test_practitioner.py` regressions cover the documented HAD surfaces.
Verify by inspection that the new wording matches `had_pretests.py:1725-1740` (replicate-weight `NotImplementedError`) and `:1927-1940` (stratified `NotImplementedError`).

🤖 Generated with Claude Code

The restored CI reviewer surfaced three findings the degraded reviewer missed across its 8 R rounds on PR #402:

1. (P2) The llms-full.txt HAD usage block reused one `data` symbol for both `aggregate="overall"` (two-period-only) and `aggregate="event_study"` (multi-period) calls. A reader copy-pasting hit a front-door error on the second `fit()` call when the first two calls' panel was used as-is. Split into `data_2p` and `data_mp` with an explanatory header.

2. (P2) The practitioner Step-3 wording on both `_handle_had` and `_handle_had_event_study` said survey-weighted fits "skip QUG" and return a linearity-conditional verdict. That was only true for pweight + PSU/FPC. Stratified (SurveyDesign(strata=...)) and replicate-weight (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError on the linearity kernels. Qualify both instances to the supported subset and note the deferred regimes explicitly.

3. (P3) The REGISTRY claim that HAD constructor/fit "signatures match the real API (regression-tested via inspect.signature)" overstated what `test_llms_full_had_*_signature_matches_real_api` actually checks - the test asserts parameter-name presence only, not defaults, type annotations, or return-type unions. Relax the REGISTRY note to match the test's actual contract.

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-12T22:22:44Z

Overall Assessment

✅ Looks good

Executive Summary

The diff is documentation-only; I found no changes to estimator behavior, weighting, variance/SE, identification checks, or defaults.
diff_diff/guides/llms-full.txt:L639-L661 now correctly separates the two-period aggregate="overall" path from the multi-period aggregate="event_study" path, matching the actual validators in diff_diff/had.py:L980-L987 and diff_diff/had.py:L1181-L1217.
The updated HAD practitioner wording in diff_diff/practitioner.py:L874-L889 and diff_diff/practitioner.py:L1031-L1046 is consistent with the shipped survey-pretest contract in diff_diff/had_pretests.py:L4555-L4581, the methodology registry, and the tracked follow-ups in TODO.md:L99-L113.
One documentation issue remains: docs/methodology/REGISTRY.md still has an adjacent bullet from PR #402 (HAD Phase 5 wave 1: agent-facing surfaces (_handle_had + llms-full.txt)) that overstates what the HAD inspect.signature tests prove.
No security, performance, or runtime-code concerns in the diff.

Methodology

No findings. Affected surfaces are HeterogeneousAdoptionDiD.fit(), did_had_pretest_workflow(), stute_test(), and the joint Stute survey path. The revised guide/practitioner text matches the actual two-period vs multi-period gatekeeping and weighted-pretest limitations in diff_diff/had.py:L980-L987, diff_diff/had.py:L1181-L1217, diff_diff/had_pretests.py:L4434-L4581, and docs/methodology/REGISTRY.md:L2271-L2277, L2356-L2357, L2452-L2455, L2531-L2534.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The weighted/stratified HAD pretest follow-ups and the T22 survey tutorial remain explicitly tracked in TODO.md:L99-L113, so the updated guidance is aligned with existing debt tracking.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: docs/methodology/REGISTRY.md:L2555-L2556 is still internally inconsistent about the HAD signature tests. L2555 correctly narrows the claim to parameter-name presence, but L2556 still says the constructor / fit() signatures “match the real API (regression-tested via inspect.signature)”. The actual tests in tests/test_guides.py:L353-L405 only assert that real parameter names appear in the guide text; they do not verify defaults, annotations, return-type unions, or absence of extra documented kwargs. Concrete fix: reword docs/methodology/REGISTRY.md:L2556 to the same limited contract as L2555, or collapse the two bullets into one precise statement.
Severity: P3. Impact: the new practitioner caveat about supported survey weighting and unsupported stratified / replicate-weight pretests is not regression-locked at the practitioner layer. diff_diff/practitioner.py:L874-L885 and L1031-L1046 add new user-facing constraints, but tests/test_practitioner.py:L860-L895 only checks QUG-skipping and “linearity-conditional” wording. The underlying behavior is tested in tests/test_had_pretests.py:L4183-L4208 and L4302-L4439, but the practitioner text could drift silently. Concrete fix: add practitioner tests asserting the supported pweight + PSU/FPC scope and explicit deferral of SurveyDesign(strata=...) and replicate-weight designs for both HAD Step-3 handlers.
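The gap between "parameter names appear in the guide" and "signatures match" can be seen in a small sketch. The `fit` function and `GUIDE_TEXT` below are hypothetical stand-ins, not diff_diff's real signature or guide text. Only `inspect.signature` and its `parameters` mapping are standard-library facts.

```python
import inspect
from typing import Any, Dict


def fit(data, outcome, aggregate="overall", survey=None):
    """Hypothetical documented function; not the real HAD fit() signature."""
    return None


# Hypothetical guide snippet whose documented default has drifted.
GUIDE_TEXT = 'fit(data, outcome, aggregate="event_study", survey=None)'


def names_present(func, text: str) -> bool:
    """Weak contract: every real parameter name occurs somewhere in the text."""
    return all(name in text for name in inspect.signature(func).parameters)


def defaults_match(func, documented: Dict[str, Any]) -> bool:
    """Stricter contract: documented defaults equal the real defaults."""
    params = inspect.signature(func).parameters
    return all(params[name].default == value for name, value in documented.items())


weak_ok = names_present(fit, GUIDE_TEXT)
strict_ok = defaults_match(fit, {"aggregate": "event_study", "survey": None})
```

Here the name-presence check passes while the default comparison fails. A registry note should therefore claim only what the weaker check establishes.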

…rvey-scope regression test

R0 review on the prior commit caught two follow-on items:

(P2) REGISTRY.md:2555-2556 had two adjacent Phase 5 wave 1 bullets about HAD signatures. The first (already narrowed) correctly limits the regression-lock to parameter-name presence. The second still claimed "constructor / fit() signatures match the real API (regression-tested via inspect.signature)" - the same overstatement the prior commit fixed in the first bullet. Bring the second bullet in line with the narrower contract.

(P3) The new practitioner Step-3 caveats about the supported survey-pretest scope (pweight + PSU/FPC) and the deferred stratified + replicate-weight regimes were not regression-locked at the practitioner test layer. The existing test_had_step_3_flags_qug_under_survey_deferral only covers the QUG-skip / linearity-conditional wording, leaving the new scope qualifications free to drift silently. Add test_had_step_3_qualifies_supported_survey_scope asserting the supported subset is named explicitly (pweight + PSU + FPC) and the deferred regimes are flagged by name (stratif, replicate, NotImplementedError) on both HAD handler variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
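A wording regression of the kind this commit describes can be sketched as a token-presence check. The token list follows the commit message (pweight, PSU, FPC, stratif, replicate, NotImplementedError). `STEP_3_TEXT` and `missing_tokens` are hypothetical illustrations, not the practitioner module's emitted text or the real test body.

```python
from typing import List, Sequence

# Tokens the commit message says the new test asserts on.
REQUIRED_TOKENS = ("pweight", "PSU", "FPC", "stratif", "replicate", "NotImplementedError")

# Hypothetical Step-3 wording, written only for this sketch.
STEP_3_TEXT = (
    "Under survey weighting the QUG step is skipped and the verdict is "
    "linearity-conditional. Supported subset: pweight + PSU/FPC. "
    "Stratified designs and replicate weights raise NotImplementedError "
    "and are deferred."
)


def missing_tokens(text: str, tokens: Sequence[str] = REQUIRED_TOKENS) -> List[str]:
    """Return the required tokens that do not occur in text (case-insensitive)."""
    lowered = text.lower()
    return [token for token in tokens if token.lower() not in lowered]


qualified_gaps = missing_tokens(STEP_3_TEXT)
unqualified_gaps = missing_tokens("Survey-weighted fits skip QUG.")
```

Returning the missing tokens, rather than a bare boolean, makes a failing assertion name the qualification that drifted.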

github-actions · 2026-05-12T22:29:15Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 80cb9aef36a0d35719aafd0ed7aaf1d2ebb65aed

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. In re-review scope, the prior residual P2/P3 items appear resolved, and I did not find any new methodology, inference, or default-behavior defects in the changed files.

Executive Summary

The prior REGISTRY overstatement is fixed: the updated HAD bullets now describe the signature checks as parameter-name-presence tests only, which matches the actual inspect.signature assertions in tests/test_guides.py:L353-L405 and the revised registry text in docs/methodology/REGISTRY.md:L2555-L2556.
The HAD guide usage block now correctly separates two-period aggregate="overall" from multi-period aggregate="event_study", matching the estimator validators in diff_diff/had.py:L974-L987, diff_diff/had.py:L1181-L1217, and the workflow contract in diff_diff/had_pretests.py:L4434-L4465; see diff_diff/guides/llms-full.txt:L639-L661.
The revised practitioner Step-3 text now aligns with the shipped survey-pretest surface: QUG is skipped under survey, the supported weighted subset is narrowed, and stratified / replicate-weight regimes are explicitly deferred, consistent with diff_diff/practitioner.py:L874-L885, diff_diff/practitioner.py:L1031-L1046, diff_diff/had_pretests.py:L1725-L1740, diff_diff/had_pretests.py:L1927-L1940, and TODO.md:L99-L113.
The prior practitioner-layer regression gap is addressed by the new assertions in tests/test_practitioner.py:L897-L936.
Validation note: pytest is not installed in this sandbox, so test verification here is by inspection rather than execution.

Methodology

No findings. Affected surfaces are HAD fit() usage docs, did_had_pretest_workflow(), stute_test(), and the joint Stute wrappers. The revised text matches the Methodology Registry and in-code contracts; I found no undocumented deviation from the shipped Baker/de Chaisemartin workflow. See diff_diff/guides/llms-full.txt:L639-L661, diff_diff/guides/llms-full.txt:L1415-L1429, and diff_diff/had_pretests.py:L4434-L4465.

Code Quality

No findings.

Performance

No findings. This is a docs/tests-only diff.

Maintainability

No findings. The registry wording is now materially more precise about what the HAD signature regression tests do and do not prove. See docs/methodology/REGISTRY.md:L2555-L2556.

Tech Debt

No findings. The unsupported weighted HAD pretest regimes referenced by the new practitioner text are already tracked in TODO.md:L99-L113, so they are non-blocking.

Security

No findings.

Documentation/Tests

No findings. The prior re-review items appear resolved in docs/methodology/REGISTRY.md:L2555-L2556 and tests/test_practitioner.py:L897-L936.
Residual validation gap: I could not execute the touched tests because pytest is unavailable in this sandbox.

…ndoff + fit() Union annotation + REGISTRY contract clarifications

Holistic codex re-audit of merged igerber#402 (HAD Phase 5 agent surfaces) + igerber#420 (cleanup) surfaced residuals that the per-PR CI review path could not see because the cleanup PR's diff scope hid the holistic state. 10 codex rounds against the combined post-PR state surfaced:

Methodology / contracts (REGISTRY):
- did_had_pretest_workflow Phase 3 Note still claimed the overall path "reduces a multi-period panel"; the validator actually requires exactly two periods. Updated to describe the aggregate-dispatched contract explicitly (overall = 2-period; event_study = multi-period).
- Phase 5 wave-1 notes still said the fit() return Union was "not test-enforced"; new regression closes the gap.

Code / API:
- HeterogeneousAdoptionDiD.fit() return annotation widened to Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults] so the static type contract matches the documented runtime polymorphism on aggregate. Updated the stale dispatch comment that claimed the annotation was single-period.
- ContinuousDiD -> HeterogeneousAdoptionDiD Step-4 handoff in practitioner.py now (a) explicitly recodes first_treat=inf -> 0 before both HAD example fits (HAD's _validate_had_panel rejects any first_treat outside {0, t_post}; ContinuousDiD's silent inf-normalization is HAD-incompatible), (b) frames the routing nudge around the WAS vs ATT(d) estimand difference rather than around the existence of untreated units, and (c) names HAD's stricter panel-shape and encoding requirements before showing the code.

Agent guides (llms-full.txt, llms-practitioner.txt):
- HeterogeneousAdoptionDiDEventStudyResults table now mirrors the single-period table's variance_formula and effective_dose_mean semantics (event-study path populates the same four pweight / survey_binder_tsl / pweight_2sls / survey_binder_tsl_2sls labels per had.py:639-648; effective_dose_mean inherits the same mass-point Wald-IV semantics per had.py:721-734).
- llms-practitioner.txt continuous-treatment decision tree rewritten estimand-first (WAS vs ATT(d)) with panel-shape contract spelled out for HAD (aggregate='overall' two-period vs 'event_study' multi-period; staggered last-cohort-only WAS warning).

Tests:
- tests/test_had.py::TestFitReturnAnnotation pins the Union return annotation via typing.get_type_hints; drift would have been silent before.
- tests/test_had_pretests.py::TestMultiPeriodWorkflow::test_overall_aggregate_rejects_multi_period locks the registry-described 2-period-only contract.
- tests/test_guides.py::TestLLMsFullHADCoverage::test_llms_full_had_event_study_mirrors_weighted_metadata_semantics asserts the event-study guide table covers all four variance_formula labels and the mass-point Wald-IV effective_dose_mean.
- tests/test_practitioner.py::TestHADDispatch::test_handle_continuous_step_4_recodes_first_treat_inf_for_had locks the inf->0 recode in the emitted snippet.
- Plus 5 additional regression tests for cross-surface alignment between the practitioner handler, llms-practitioner.txt routing, and the underlying estimator contracts.

CHANGELOG:
- Original igerber#402 entry updated to credit the new return-annotation regression and clarify what the inspect.signature-based test does and does not pin.

No methodology changes, no behavior changes to fit() outputs on any existing surface; all changes are contract clarifications, guidance corrections, and test additions on surfaces that igerber#402 + igerber#420 already established.
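The return-annotation pin described in the commit above can be sketched with `typing.get_type_hints`. The `Estimator`, `OverallResults`, and `EventStudyResults` classes are hypothetical stand-ins for the HAD estimator and its two result types. Only the `get_type_hints` call and `Union` comparison are standard-library behavior.

```python
from typing import Union, get_type_hints


class OverallResults:
    """Hypothetical stand-in for the two-period result type."""


class EventStudyResults:
    """Hypothetical stand-in for the event-study result type."""


class Estimator:
    """Hypothetical estimator whose fit() dispatches on `aggregate`."""

    def fit(self, data, aggregate="overall") -> Union[OverallResults, EventStudyResults]:
        if aggregate == "event_study":
            return EventStudyResults()
        return OverallResults()


def return_annotation(func):
    """Resolve the declared return type, including string annotations."""
    return get_type_hints(func).get("return")


EXPECTED = Union[OverallResults, EventStudyResults]
annotation_is_pinned = return_annotation(Estimator.fit) == EXPECTED
```

Comparing against the full `Union` means that narrowing the annotation back to a single result class would fail the check. Without such a test the static contract can diverge from the runtime dispatch unnoticed.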

igerber added the ready-for-ci label (Triggers CI test workflows) May 12, 2026

igerber merged commit d8b0f67 into main May 12, 2026
31 of 32 checks passed

igerber deleted the fix-audit-402 branch May 12, 2026 23:54

igerber mentioned this pull request May 14, 2026

Fix #402 holistic audit residuals: HAD↔ContinuousDiD Step-4 + fit() Union annotation + REGISTRY contracts #431

Merged

2 tasks

igerber merged 2 commits into main from fix-audit-402
