Address residual P2s+P3 from re-audit of PR #402 #420
Merged
The restored CI reviewer surfaced three findings the degraded reviewer missed across its 8 R rounds on PR #402:

1. (P2) The llms-full.txt HAD usage block reused one `data` symbol for both `aggregate="overall"` (two-period-only) and `aggregate="event_study"` (multi-period) calls. A reader copy-pasting hit a front-door error on the second `fit()` call when the first two calls' panel was used as-is. Split into `data_2p` and `data_mp` with an explanatory header.
2. (P2) The practitioner Step-3 wording on both `_handle_had` and `_handle_had_event_study` said survey-weighted fits "skip QUG" and return a linearity-conditional verdict. That was only true for pweight + PSU/FPC. Stratified (SurveyDesign(strata=...)) and replicate-weight (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError on the linearity kernels. Qualify both instances to the supported subset and note the deferred regimes explicitly.
3. (P3) The REGISTRY claim that HAD constructor/fit "signatures match the real API (regression-tested via inspect.signature)" overstated what `test_llms_full_had_*_signature_matches_real_api` actually checks - the test asserts parameter-name presence only, not defaults, type annotations, or return-type unions. Relax the REGISTRY note to match the test's actual contract.

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
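The first finding is easiest to see with a toy panel. The sketch below is a minimal illustration, not the library's validator: `require_two_periods` and the row layout are hypothetical stand-ins, and it models only the one contract this thread confirms (`aggregate="overall"` requires exactly two periods).

```python
def require_two_periods(rows, aggregate):
    """Hypothetical stand-in for a front-door period check (not the real validator)."""
    n_periods = len({row["period"] for row in rows})
    if aggregate == "overall" and n_periods != 2:
        raise ValueError(
            f"aggregate='overall' requires exactly two periods, got {n_periods}"
        )
    return n_periods


# Two separate panels, mirroring the data_2p / data_mp split in the guide.
data_2p = [{"unit": u, "period": t} for u in (1, 2) for t in (0, 1)]
data_mp = [{"unit": u, "period": t} for u in (1, 2) for t in (0, 1, 2, 3)]

require_two_periods(data_2p, "overall")      # two-period panel: accepted
require_two_periods(data_mp, "event_study")  # multi-period panel: accepted here

# Reusing a single panel for both calls is what breaks.
try:
    require_two_periods(data_mp, "overall")
except ValueError as exc:
    reused_panel_error = str(exc)
```

Keeping two explicitly named panels means each example call can be copied on its own without inheriting the wrong panel shape.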
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…rvey-scope regression test

R0 review on the prior commit caught two follow-on items:

(P2) REGISTRY.md:2555-2556 had two adjacent Phase 5 wave 1 bullets about HAD signatures. The first (already narrowed) correctly limits the regression-lock to parameter-name presence. The second still claimed "constructor / fit() signatures match the real API (regression-tested via inspect.signature)" - the same overstatement the prior commit fixed in the first bullet. Bring the second bullet in line with the narrower contract.

(P3) The new practitioner Step-3 caveats about the supported survey-pretest scope (pweight + PSU/FPC) and the deferred stratified + replicate-weight regimes were not regression-locked at the practitioner test layer. The existing test_had_step_3_flags_qug_under_survey_deferral only covers the QUG-skip / linearity-conditional wording, leaving the new scope qualifications free to drift silently. Add test_had_step_3_qualifies_supported_survey_scope asserting the supported subset is named explicitly (pweight + PSU + FPC) and the deferred regimes are flagged by name (stratif, replicate, NotImplementedError) on both HAD handler variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
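A wording lock of the kind described in the P3 item can be sketched as a fragment check. This is only an illustration under assumptions: `STEP_3_TEXT` is invented sample wording, and `missing_fragments` is a hypothetical helper, not the body of `test_had_step_3_qualifies_supported_survey_scope`.

```python
# Invented sample wording; the real handler text is not reproduced here.
STEP_3_TEXT = (
    "Survey-weighted fits skip QUG and return a linearity-conditional verdict. "
    "Supported subset: pweight + PSU + FPC. "
    "Deferred: stratified designs and replicate weights raise NotImplementedError."
)

# Fragments the commit message says must be named explicitly.
REQUIRED_FRAGMENTS = (
    "pweight", "psu", "fpc", "stratif", "replicate", "notimplementederror",
)


def missing_fragments(text, fragments=REQUIRED_FRAGMENTS):
    """Return the required fragments absent from `text` (case-insensitive)."""
    lowered = text.lower()
    return [fragment for fragment in fragments if fragment not in lowered]
```

Matching on a stem such as `stratif` tolerates both "stratified" and "stratification", so the check pins the scope qualification without freezing the exact sentence.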
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
No unmitigated P0/P1 findings. In re-review scope, the prior residual P2/P3 items appear resolved, and I did not find any new methodology, inference, or default-behavior defects in the changed files.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
2 tasks
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request on May 18, 2026
…ndoff + fit() Union annotation + REGISTRY contract clarifications

Holistic codex re-audit of merged igerber#402 (HAD Phase 5 agent surfaces) + igerber#420 (cleanup) surfaced residuals that the per-PR CI review path could not see because the cleanup PR's diff scope hid the holistic state. 10 codex rounds against the combined post-PR state surfaced:

Methodology / contracts (REGISTRY):
- did_had_pretest_workflow Phase 3 Note still claimed the overall path "reduces a multi-period panel"; the validator actually requires exactly two periods. Updated to describe the aggregate-dispatched contract explicitly (overall = 2-period; event_study = multi-period).
- Phase 5 wave-1 notes still said the fit() return Union was "not test-enforced"; new regression closes the gap.

Code / API:
- HeterogeneousAdoptionDiD.fit() return annotation widened to Union[HeterogeneousAdoptionDiDResults, HeterogeneousAdoptionDiDEventStudyResults] so the static type contract matches the documented runtime polymorphism on aggregate. Updated the stale dispatch comment that claimed the annotation was single-period.
- ContinuousDiD -> HeterogeneousAdoptionDiD Step-4 handoff in practitioner.py now (a) explicitly recodes first_treat=inf -> 0 before both HAD example fits (HAD's _validate_had_panel rejects any first_treat outside {0, t_post}; ContinuousDiD's silent inf-normalization is HAD-incompatible), (b) frames the routing nudge around the WAS vs ATT(d) estimand difference rather than around the existence of untreated units, and (c) names HAD's stricter panel-shape and encoding requirements before showing the code.

Agent guides (llms-full.txt, llms-practitioner.txt):
- HeterogeneousAdoptionDiDEventStudyResults table now mirrors the single-period table's variance_formula and effective_dose_mean semantics (event-study path populates the same four pweight / survey_binder_tsl / pweight_2sls / survey_binder_tsl_2sls labels per had.py:639-648; effective_dose_mean inherits the same mass-point Wald-IV semantics per had.py:721-734).
- llms-practitioner.txt continuous-treatment decision tree rewritten estimand-first (WAS vs ATT(d)) with panel-shape contract spelled out for HAD (aggregate='overall' two-period vs 'event_study' multi-period; staggered last-cohort-only WAS warning).

Tests:
- tests/test_had.py::TestFitReturnAnnotation pins the Union return annotation via typing.get_type_hints; drift would have been silent before.
- tests/test_had_pretests.py::TestMultiPeriodWorkflow::test_overall_aggregate_rejects_multi_period locks the registry-described 2-period-only contract.
- tests/test_guides.py::TestLLMsFullHADCoverage::test_llms_full_had_event_study_mirrors_weighted_metadata_semantics asserts the event-study guide table covers all four variance_formula labels and the mass-point Wald-IV effective_dose_mean.
- tests/test_practitioner.py::TestHADDispatch::test_handle_continuous_step_4_recodes_first_treat_inf_for_had locks the inf->0 recode in the emitted snippet.
- Plus 5 additional regression tests for cross-surface alignment between the practitioner handler, llms-practitioner.txt routing, and the underlying estimator contracts.

CHANGELOG:
- Original igerber#402 entry updated to credit the new return-annotation regression and clarify what the inspect.signature-based test does and does not pin.

No methodology changes, no behavior changes to fit() outputs on any existing surface: all changes are contract clarifications, guidance corrections, and test additions on surfaces that igerber#402 + igerber#420 already established.
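Pinning a Union return annotation with `typing.get_type_hints` might look roughly like the sketch below. The classes `OverallResults`, `EventStudyResults`, and `Estimator` are hypothetical stand-ins for the real result and estimator types, and `return_union_members` is an illustrative helper rather than the code in `TestFitReturnAnnotation`.

```python
from typing import Union, get_args, get_origin, get_type_hints


class OverallResults:
    """Hypothetical stand-in for the single-period results type."""


class EventStudyResults:
    """Hypothetical stand-in for the event-study results type."""


class Estimator:
    """Hypothetical estimator whose fit() return type depends on `aggregate`."""

    def fit(self, data, aggregate: str = "overall") -> Union[OverallResults, EventStudyResults]:
        if aggregate == "event_study":
            return EventStudyResults()
        return OverallResults()


def return_union_members(func):
    """Resolve the return annotation and return its members as a tuple."""
    annotation = get_type_hints(func)["return"]
    if get_origin(annotation) is not Union:
        return (annotation,)  # a plain, non-Union annotation
    return get_args(annotation)
```

A test built on this would compare the resolved members against the expected set, so narrowing the annotation back to a single type fails loudly instead of drifting unnoticed.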
Summary
Audit follow-up to PR #402. The restored CI reviewer surfaced three findings the degraded reviewer missed across its 8 R rounds:
P2 - The `llms-full.txt` HAD usage block reused one `data` symbol for both `aggregate="overall"` (two-period only) and `aggregate="event_study"` (multi-period) calls. A reader copy-pasting the block literally hit a front-door error on the second `fit()`. Split into explicit `data_2p` and `data_mp` panels with an explanatory header.
P2 - The practitioner Step-3 wording on both `_handle_had` and `_handle_had_event_study` said survey-weighted fits "skip QUG" and return a linearity-conditional verdict. That is only true for `pweight + PSU/FPC`. Stratified (`SurveyDesign(strata=...)`) and replicate-weight (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` on the linearity kernels (see `had_pretests.py:1725`, `:1927`). Both step descriptions now qualify the supported subset and call out the deferred regimes explicitly.
P3 - The REGISTRY claim that HAD constructor/fit "signatures match the real API (regression-tested via inspect.signature)" overstated what `test_llms_full_had_*_signature_matches_real_api` actually checks - the tests assert parameter-name presence only, not defaults, type annotations, or return-type unions. Relax the REGISTRY wording to match the test's actual contract.
No estimator behavior, weighting, variance/SE, identification check, or default statistical surface changed - guidance/documentation accuracy only.
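The gap behind the P3 finding can be shown with `inspect.signature` on a toy function. Everything named here is hypothetical (`fit`, `fit_drifted`, the parameter names, and both helpers); the sketch only illustrates why a name-presence check does not pin defaults.

```python
import inspect


def fit(data, outcome_col, dose_col, aggregate="overall", survey=None):
    """Hypothetical signature, used only for illustration."""


def fit_drifted(data, outcome_col, dose_col, aggregate="event_study", survey=None):
    """Same parameter names as `fit`, but one default has drifted."""


DOCUMENTED_NAMES = {"data", "outcome_col", "dose_col", "aggregate", "survey"}


def missing_params(func, documented):
    """Name-presence check: documented names absent from the real signature."""
    actual = set(inspect.signature(func).parameters)
    return sorted(documented - actual)


def default_of(func, name):
    """Stricter check: the default value bound to one parameter."""
    return inspect.signature(func).parameters[name].default
```

Both functions pass `missing_params`, yet `default_of(..., "aggregate")` differs between them, which is the kind of drift a presence-only test cannot detect.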
Test plan
🤖 Generated with Claude Code