HAD Phase 4.5 C0: QUG-under-survey decision gate #367
Add survey=/weights= kwargs to qug_test and did_had_pretest_workflow as keyword-only with default None. Both raise NotImplementedError when either kwarg is non-None, with an educational message naming the methodology rationale and pointing users to joint Stute (Phase 4.5 C, planned) as the survey-compatible alternative. Mutex guard on survey=+weights= mirrors HeterogeneousAdoptionDiD.fit() at had.py:2890.

QUG-under-survey is permanently deferred. The test statistic uses extreme order statistics (D_(1), D_(2)) which are not smooth functionals of the empirical CDF -- standard survey machinery (Binder TSL, Rao-Wu rescaled bootstrap, Krieger-Pfeffermann (1997) EDF tests) does not yield a calibrated test; under cluster sampling the Exp(1)/Exp(1) limit law's independence assumption breaks; and the EVT-under-unequal-probability-sampling literature (Quintos et al. 2001, Beirlant et al.) addresses tail-index estimation, not boundary tests.

The workflow's gate is temporary -- Phase 4.5 C will close it for the linearity-family pretests (stute_test, yatchew_hr_test, joint variants) via Rao-Wu rescaled bootstrap. Sister pretests keep their closed signatures in this release; Phase 4.5 C will add kwargs and implementation together to avoid API churn.

- 11 new tests: 5 on TestQUGTest covering rejection / mutex / message-text checks / unweighted regression; 6 on new TestHADPretestWorkflowSurveyGuards covering both kwarg paths, mutex, methodology pointer, both aggregate paths, and unweighted regression.
- docs/methodology/REGISTRY.md: Note (Phase 4.5 C0) under QUG section with three-reason rationale plus a research-direction sketch (the theoretical bridge would combine Hall 1982 / Aarssen-de Haan 1994 / Hall-Wang 1999 endpoint EVT, Boistard-Lopuhaa-Ruiz-Gazen 2017 / Bertail-Chautru-Clemencon 2017 survey-aware functional CLT, and Drees 2003 tail-empirical-process theory -- publishable methodology research, not engineering work).
- CHANGELOG.md: Phase 4.5 C0 entry under [Unreleased].
- TODO.md: replaces decision-gate row with carry-forward research row.

Unweighted qug_test(d) and did_had_pretest_workflow(...) calls are bit-exact pre-PR (kwargs are keyword-only after *; positional path unchanged). 138 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
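The guard ordering described in this commit message (mutex check first, then rejection of either kwarg) can be sketched as follows. This is a minimal illustration, not the code in had_pretests.py: the names `_guard_survey_kwargs` and `qug_test_sketch`, the message wording, and the placeholder return value are all hypothetical.

```python
def _guard_survey_kwargs(survey, weights, caller):
    """Hypothetical helper: mutex check first, then the survey rejection."""
    if survey is not None and weights is not None:
        # Mutex guard fires before the NotImplementedError gate.
        raise ValueError(f"{caller}: pass survey= or weights=, not both.")
    if survey is not None or weights is not None:
        # Illustrative message text only; the real message is not reproduced here.
        raise NotImplementedError(
            f"{caller}: survey/weighted inference is not supported because the "
            "statistic depends on extreme order statistics; consider the joint "
            "Stute pretest as the survey-compatible alternative."
        )


def qug_test_sketch(d, *, survey=None, weights=None):
    """Hypothetical stand-in for a pretest with keyword-only survey kwargs."""
    _guard_survey_kwargs(survey, weights, "qug_test_sketch")
    # Placeholder for the unweighted path: return the two smallest doses.
    return sorted(d)[:2]
```

Because `survey` and `weights` sit after the bare `*`, a call such as `qug_test_sketch(d, some_design)` fails with `TypeError`, so existing positional calls cannot silently bind to the new parameters.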
Overall Assessment
✅ Looks good
No unmitigated P0/P1 findings. The diff adds front-door rejection for unsupported
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
R1 P3: forward-looking docs were internally inconsistent about Phase 4.5 C survey support for yatchew_hr_test. TODO.md correctly distinguishes Rao-Wu (Stute family) from weighted-OLS+weighted-variance (Yatchew); the new C0 docs in had_pretests.py, REGISTRY.md, and CHANGELOG.md described the whole linearity family as Rao-Wu-based, which would mislead the next implementer (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based).

Updated all three new docs to spell out the per-test mechanism explicitly: Rao-Wu rescaled bootstrap for stute_test + stute_joint_pretest + joint_pretrends_test + joint_homogeneity_test; weighted OLS residuals + weighted variance estimator for yatchew_hr_test. Now consistent with TODO.md.

Test message-text assertions still pass (assert "Rao-Wu" in msg holds because Stute family still uses Rao-Wu).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
The affected methods are
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…cross all canonical surfaces

P2 - HeterogeneousAdoptionDiD shipped (Phases 1a-3 + 3-followup + 4.5 A/B/C0 merged via PRs #366/#367) but missing from the new authoritative documentation surfaces this PR establishes

- README.md ## Estimators catalog: add one-line entry after ContinuousDiD (de Chaisemartin, Ciccia, D'Haultfoeuille & Knau 2026; alias HAD).
- diff_diff/guides/llms.txt ## Estimators: add matching one-liner.
- .claude/commands/docs-check.md required-estimators table: add HAD row pointing at had.rst as the API target.
- docs/references.rst: add new "Heterogeneous Adoption (No-Untreated Designs)" sub-section with the de Chaisemartin et al. (2026) arXiv:2405.04465v6 citation.
- docs/api/index.rst: add HeterogeneousAdoptionDiD to estimators autosummary; add HeterogeneousAdoptionDiDResults + HeterogeneousAdoptionDiDEventStudyResults to results autosummary.
- docs/api/had.rst (NEW): autoclass page for the three classes with a brief intro and a "When to use HAD" note pointing at sibling estimators (ContinuousDiD for never-treated controls, dCDH for binary reversible).
- docs/doc-deps.yaml: add diff_diff/had.py + diff_diff/had_pretests.py source mappings (REGISTRY methodology, had.rst api_reference, README catalog, references.rst, llms.txt). The llms-full.txt mapping is intentionally omitted with a comment - that section is deferred to the Phase 5 follow-up tracked in TODO.md.

TODO.md: narrow the Phase 5 entry from "llms.txt updates" to "llms-full.txt HeterogeneousAdoptionDiD section" since the catalog one-liner, API page, and bibliography landed here. practitioner_next_steps() integration and tutorial notebook remain deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity-family pretests now accept survey=/weights= keyword-only kwargs:

- stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow.

Stute family: PSU-level Mammen multiplier bootstrap via generate_survey_multiplier_weights_batch. Each replicate draws (B, n_psu) Mammen multipliers, broadcast to per-obs perturbation eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new _cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling -- multiplier bootstrap is a different mechanism.

Yatchew: closed-form weighted OLS + pweight-sandwich variance components (no bootstrap):

    sigma2_lin  = sum(w * eps^2) / sum(w)
    sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w))   [Reviewer CRITICAL #2]
    sigma4_W    = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)
    T_hr        = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W

where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997 Section 3). All three components reduce bit-exactly to existing unweighted formulas at w=ones(G); locked at atol=1e-14 by direct helper test.

Workflow under survey/weights: skips the QUG step with UserWarning (per C0 deferral), sets qug=None on the report, dispatches the linearity family with survey-aware mechanism, appends "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the verdict. all_pass drops the QUG-conclusiveness gate (one less precondition). HADPretestReport.qug retyped from QUGTestResults to Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to None-tolerant rendering.
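The PSU-level Mammen multiplier draw and the per-observation broadcast eta_obs[g] = eta_psu[psu(g)] described in this commit message can be sketched as follows. This is a standard-library illustration under stated assumptions, not the library's generate_survey_multiplier_weights_batch: the function names are hypothetical, one replicate is drawn at a time instead of a (B, n_psu) batch, and stratification is ignored.

```python
import math
import random

SQRT5 = math.sqrt(5.0)
# Mammen two-point distribution: mean 0, variance 1.
MAMMEN_LOW = -(SQRT5 - 1.0) / 2.0
MAMMEN_HIGH = (SQRT5 + 1.0) / 2.0
P_LOW = (SQRT5 + 1.0) / (2.0 * SQRT5)  # probability of drawing MAMMEN_LOW


def draw_mammen(n, rng):
    """Draw n i.i.d. Mammen multipliers using the supplied random.Random."""
    return [MAMMEN_LOW if rng.random() < P_LOW else MAMMEN_HIGH for _ in range(n)]


def broadcast_psu_multipliers(psu_ids, rng):
    """Draw one multiplier per PSU and copy it to every observation in that PSU."""
    labels = sorted(set(psu_ids))
    eta_psu = dict(zip(labels, draw_mammen(len(labels), rng)))
    return [eta_psu[p] for p in psu_ids]
```

Observations in the same PSU receive the same multiplier, which is what keeps within-cluster dependence intact in each replicate. Reusing the same list of multipliers for every horizon within a replicate would correspond to the shared-matrix behaviour described for joint Stute.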
Pweight shortcut routing: weights= passes through a synthetic trivial ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern on HAD sup-t.

Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth, reciprocal-guard discipline). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; deferred to a parallel follow-up.

Per-row weights= / survey=col aggregated to per-unit via existing HAD helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey; constant-within-unit invariant enforced) through new _resolve_pretest_unit_weights helper. Strictly-positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks).

Stability invariants preserved:

- Unweighted code paths bit-exact pre-PR (the new survey/weights branch is a separate if arm; existing 138 pretest tests pass unchanged).
- Yatchew weighted variance components reduce to unweighted at w=1 at atol=1e-14 (locked by TestYatchewHRTestSurvey).
- HADPretestReport schema bit-exact on the unweighted path; qug=None triggers the new None-tolerant rendering only on the survey path.

20 new tests across TestHADPretestWorkflowSurveyGuards (revised from C0 rejection-only to C functional + 2 mutex/replicate-weight retained), TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey (5). Full pretest suite: 158 tests pass.

Patch-level addition (additive on stable surfaces). See docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for the full methodology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
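The weighted Yatchew variance components in this commit message, and their reduction to the unweighted formulas at w=1, can be sketched in plain Python. Several things here are assumptions rather than facts from the PR: the function name is hypothetical, `diff` is taken to be the adjacent difference of the dose-sorted outcome, and `sigma2_W` is taken to be `sqrt(sigma4_W)` (the commit message defines sigma4_W but divides by sigma2_W without stating the link).

```python
import math


def weighted_yatchew_components(y, eps, w):
    """Hypothetical sketch; y and eps are assumed already sorted by dose."""
    if not (len(y) == len(eps) == len(w)) or len(y) < 2:
        raise ValueError("y, eps and w must share a length of at least 2")
    if any(wi <= 0 for wi in w):
        # Mirrors the strictly-positive-weights requirement stated for Yatchew.
        raise ValueError("weights must be strictly positive")
    n = len(y)
    sw = sum(w)
    # Pair weight for adjacent observations g-1 and g.
    w_avg = [(w[g] + w[g - 1]) / 2 for g in range(1, n)]
    sigma2_lin = sum(wi * e * e for wi, e in zip(w, eps)) / sw
    sigma2_diff = sum(
        wa * (y[g] - y[g - 1]) ** 2 for wa, g in zip(w_avg, range(1, n))
    ) / (2 * sw)
    sigma4_w = sum(
        wa * eps[g] ** 2 * eps[g - 1] ** 2 for wa, g in zip(w_avg, range(1, n))
    ) / sum(w_avg)
    # Assumption: sigma2_W = sqrt(sigma4_W).
    t_hr = math.sqrt(sw) * (sigma2_lin - sigma2_diff) / math.sqrt(sigma4_w)
    return {
        "sigma2_lin": sigma2_lin,
        "sigma2_diff": sigma2_diff,
        "sigma4_W": sigma4_w,
        "T_hr": t_hr,
    }
```

With all weights equal to 1, sum(w) is G and sum(w_avg) is G - 1, so each component collapses to the corresponding unweighted average.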
Summary
- qug_test(..., survey=) and qug_test(..., weights=) now raise NotImplementedError permanently with an educational methodology message; same gate on did_had_pretest_workflow.
- The QUG test statistic uses the extreme order statistics D_(1), D_(2), which are not smooth functionals of the empirical CDF, so standard survey machinery (Binder TSL linearization, Rao-Wu rescaled bootstrap) does not yield a calibrated test. Permanent deferral with documented rationale.
- Phase 4.5 C will close the workflow gate for the linearity-family pretests (stute_test, yatchew_hr_test, joint variants) via Rao-Wu rescaled bootstrap. Joint Stute is the survey-compatible alternative.
- 11 new tests (5 on TestQUGTest + 6 on new TestHADPretestWorkflowSurveyGuards); 138 pretest tests pass.

Methodology rationale (mirrored across docstring ↔ error message ↔ REGISTRY)
- Standard survey machinery (Binder TSL via compute_survey_if_variance, Rao-Wu via bootstrap_utils.generate_rao_wu_weights, Krieger-Pfeffermann (1997) EDF tests) all rely on Hadamard differentiability, which the first two order statistics fail.
- Under cluster sampling, D_(1) and D_(2) may both come from the same PSU, breaking independence; under stratification the smallest dose may come from a small over- or under-sampled stratum, biasing the test.

The survey-compatible alternative for HAD pretesting is joint Stute (a CvM cusum of regression residuals), a smooth functional of the empirical CDF that admits Krieger-Pfeffermann (1997) + Rao-Wu rescaled bootstrap. Phase 4.5 C ships this.
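To see why the rationale above centres on D_(1) and D_(2), a small sketch of the unweighted calculation helps. It assumes the statistic has the form T = D_(1) / (D_(2) - D_(1)); that form and the function names are assumptions for illustration. The Exp(1)/Exp(1) part is standard: the ratio of two independent Exp(1) variables has survival function 1 / (1 + t), giving a level-alpha critical value of 1/alpha - 1.

```python
def qug_statistic(doses):
    """Assumed form T = D_(1) / (D_(2) - D_(1)), using only the two smallest doses."""
    d = sorted(doses)
    if len(d) < 2:
        raise ValueError("need at least two doses")
    d1, d2 = d[0], d[1]
    if d2 == d1:
        raise ValueError("the two smallest doses are tied")
    return d1 / (d2 - d1)


def exp_ratio_pvalue(t):
    """P(E1 / E2 > t) for independent Exp(1) variables E1 and E2, t >= 0."""
    return 1.0 / (1.0 + t)


def exp_ratio_critical_value(alpha):
    """Value c such that P(E1 / E2 > c) equals alpha."""
    return 1.0 / alpha - 1.0
```

Every observation other than the two smallest leaves the statistic unchanged, so any sampling design that alters which units supply D_(1) and D_(2), or makes them dependent, acts on the test directly and cannot be averaged out.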
Research direction sketch (out of scope)
The theoretical bridge is sketchable: combine endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006), survey-aware functional CLTs (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017), and tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" λ_eff = Σ_h W_h · f_h(0+). Under a "no boundary clumping" assumption, the Exp(1)/Exp(1) pivotality is preserved and only the calibration needs a survey-aware bootstrap. Publishable methodology research, ~6-12 months for a methods PhD student. Not engineering work for this library. See docs/methodology/REGISTRY.md § "QUG Null Test", Note (Phase 4.5 C0), for the full sketch.

Files
- diff_diff/had_pretests.py: qug_test + did_had_pretest_workflow: new keyword-only survey=/weights= kwargs, mutex + reject guards, docstring Survey/weighted data sections.
- docs/methodology/REGISTRY.md: Note (Phase 4.5 C0) under the QUG section.
- tests/test_had_pretests.py: 5 new tests on TestQUGTest + new TestHADPretestWorkflowSurveyGuards class (6 tests).
- CHANGELOG.md: [Unreleased] Added entry.
- TODO.md: replaces decision-gate row with carry-forward research row.

Stability invariants preserved
- Unweighted qug_test(d) and did_had_pretest_workflow(...) are bit-exact pre-PR (kwargs are keyword-only after *; no positional change).
- Existing TestQUGTest tests pass unchanged at atol=1e-12.
- All 138 tests in tests/test_had_pretests.py pass.
- Mutex guard on survey= + weights= mirrors HeterogeneousAdoptionDiD.fit() at had.py:2890, for cross-surface consistency.

Test plan
- pytest tests/test_had_pretests.py::TestQUGTest tests/test_had_pretests.py::TestHADPretestWorkflowSurveyGuards -v: 21/21 green
- pytest tests/test_had_pretests.py -v: 138/138 green (full pretest regression)
- black diff_diff/had_pretests.py tests/test_had_pretests.py: clean
- ruff check diff_diff/had_pretests.py tests/test_had_pretests.py: clean
- weights= raises, survey= raises, mutex raises ValueError before NotImplementedError

🤖 Generated with Claude Code