Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study #351
Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in
variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified
Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-
scale zeta. Default remains "placebo".
Cross-surface allow-list extensions land in one PR per
feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt
Survey + bootstrap_refit raises NotImplementedError upstream in fit()
(covers both pweight-only and full-design) — the Rao-Wu rescaled-weight
composition is tracked as a follow-up TODO.
Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds ×
B=200 × 3 DGPs × 4 methods under H0 and writes
benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈
{0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into
REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal
calibration across all 3 DGPs; fixed-weight over-rejects by roughly
1.8–3.2× on smaller panels, consistent with the SE under-estimate from
ignoring weight-estimation uncertainty.
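The two metrics reported per (DGP, method) cell can be sketched as follows. This is a minimal illustration, not the code in benchmarks/python/coverage_sdid.py: the function name `coverage_summary` is hypothetical, and it assumes a true effect of zero under H0 and a two-sided normal p-value computed from the estimate and its SE.

```python
from statistics import NormalDist, mean, stdev


def coverage_summary(estimates, std_errors, alphas=(0.01, 0.05, 0.10)):
    """Rejection rates and mean-SE / true-SD ratio for one (DGP, method) cell.

    Assumption (illustrative only): the true effect is zero and the p-value
    is the two-sided normal p-value of estimate / SE.
    """
    norm = NormalDist()
    p_values = [
        2.0 * (1.0 - norm.cdf(abs(est / se)))
        for est, se in zip(estimates, std_errors)
    ]
    # Share of seeds whose p-value falls below each nominal level.
    rejection = {a: sum(p < a for p in p_values) / len(p_values) for a in alphas}
    # Ratio below 1 means the reported SE under-estimates the sampling spread.
    se_ratio = mean(std_errors) / stdev(estimates)
    return rejection, se_ratio
```

A rejection rate above the nominal α together with an SE ratio below 1 is the over-rejection pattern the headline attributes to the fixed-weight path.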
Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical
in TestPValueSemantics + TestCoverageMCArtifact schema smoke test
(guarded with pytest.skip per feedback_golden_file_pytest_skip.md) +
cross-surface BR inference-label test. PR #349's 1e-10 R-parity
bit-identity gate still passes.
Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside
the refit loop and aggregated into a single summary warning at end-of-
loop if the rate exceeds 5% — the same threshold the retry-exhaustion
guard uses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review caught that the sklearn-style setter path bypassed the constructor's enum/coherence checks, so users could call ``set_params(variance_method='not_a_method')`` after construction and slip past the __init__ validation added for ``bootstrap_refit``.
Extract the existing checks into a private ``_validate_config()`` helper and call it from both ``__init__`` and ``set_params`` so both paths enforce the same contract. Hoist the valid-methods tuple onto the class as ``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a single source.
Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the review's P2 finding: if ``_validate_config`` rejects the post-update state in a multi-attribute ``set_params`` call, the instance was left with partially-applied (invalid) values after the raised ``ValueError``.
Snapshot the original values before any setattr and restore them in an except handler so the raise leaves the object consistent with its pre-call configuration. The regression test asserts that the post-raise state matches the pre-call state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
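The shared-validator and rollback pattern described in these two commits can be sketched as below. The class `Estimator` is a hypothetical stand-in with only two parameters; the real SyntheticDiD class has more parameters and its exact error messages are not reproduced here.

```python
class Estimator:
    """Hypothetical stand-in; not the library's actual SyntheticDiD class."""

    # Single source of truth shared by __init__ and set_params
    # (enum values as of this commit; illustrative).
    _VALID_VARIANCE_METHODS = ("placebo", "bootstrap", "bootstrap_refit", "jackknife")
    _PARAM_NAMES = ("variance_method", "n_bootstrap")

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        if self.variance_method not in self._VALID_VARIANCE_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.variance_method != "jackknife" and self.n_bootstrap < 2:
            raise ValueError("n_bootstrap must be >= 2 unless variance_method='jackknife'")

    def set_params(self, **params):
        unknown = [name for name in params if name not in self._PARAM_NAMES]
        if unknown:
            raise ValueError(f"unknown parameter(s): {unknown}")
        # Snapshot before any setattr so a failed validation can be rolled back.
        snapshot = {name: getattr(self, name) for name in params}
        try:
            for name, value in params.items():
                setattr(self, name, value)
            self._validate_config()
        except ValueError:
            for name, value in snapshot.items():
                setattr(self, name, value)
            raise
        return self
```

Validating once after all attributes are set, rather than per attribute, lets coherent multi-attribute updates such as switching to jackknife with n_bootstrap=1 pass regardless of keyword order.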
Overall Assessment ✅ Looks good Executive Summary
Methodology No unmitigated findings. The refit path in diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071 re-estimates both weights per draw, and the non-placebo inference dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes in docs/methodology/REGISTRY.md:L1505 and docs/methodology/REGISTRY.md:L1555.
Code Quality
Performance No findings. The extra cost of Maintainability No unmitigated findings. Tech Debt
Security No findings. Documentation/Tests
Execution note: I could not run the test suite in this sandbox because
…ression
Three P3 items from the CI AI review, all under Documentation/Tests and Code Quality (no methodology change):
1. **FW warning counting**: the refit loop counted raw solver warnings (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. The REGISTRY text describes the rate per valid bootstrap draw, so the two could diverge by up to 3×. The change counts "any non-convergence on this draw" as a boolean 1 and tightens the REGISTRY wording to call out the per-draw convention explicitly (each draw runs Frank-Wolfe once for ω and once for λ; either can trip the draw).
2. **Docstring gaps**: `fit()` now documents the new survey + refit NotImplementedError under the `survey_design` parameter and in the Raises block. The `SyntheticDiDResults.variance_method` and `SyntheticDiDResults.placebo_effects` fields now list `bootstrap_refit` and describe the four-way semantics of `placebo_effects` under each variance method.
3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the `variance_method` parametrizations in `TestScaleEquivariance` (baseline-parity, scale-equivariance across ~15 orders of magnitude, and extreme-scale-detection). Capture a bit-identity baseline for the refit path on `_make_panel(seed=42)`, so the normalization-sensitive branch has its own drift guard rather than one inferred from sibling methods.
All 37 SDID targeted tests + 361 broader sweep tests pass; fixed-weight 1e-10 R-parity bit-identity gate preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
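The per-draw counting convention, together with the 5% aggregate-warning threshold mentioned earlier in this PR, can be sketched as follows. The helper name `summarize_nonconvergence` and the warning text are hypothetical; only the "one count per draw" rule and the threshold come from the commit messages.

```python
import warnings


def summarize_nonconvergence(draw_flags, threshold=0.05):
    """Count draws (not solver calls) with any Frank-Wolfe non-convergence.

    draw_flags: iterable of (omega_converged, lambda_converged), one per
    valid bootstrap draw. Hypothetical helper for illustration.
    """
    flags = list(draw_flags)
    # A draw contributes at most 1, however many solver passes failed on it.
    n_failed = sum(1 for omega_ok, lambda_ok in flags if not (omega_ok and lambda_ok))
    rate = n_failed / len(flags) if flags else 0.0
    if rate > threshold:
        warnings.warn(
            f"Frank-Wolfe did not converge on {n_failed} of {len(flags)} "
            f"bootstrap draws ({rate:.1%}).",
            UserWarning,
            stacklevel=2,
        )
    return n_failed, rate
```

Counting per draw keeps the reported rate comparable to the per-draw wording in the registry, whereas counting raw solver warnings could inflate it by up to the number of passes per draw.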
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests No separate unmitigated findings beyond the methodology-labeling issue above. The prior doc/test gaps from the last review appear addressed in the changed files. Path to Approval
… deviation
Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that R's default synthdid::vcov(method="bootstrap") rebinds attr(estimate, "opts") (which includes update.omega=TRUE from the original fit) back into synthdid_estimate inside its do.call, so the renormalized ω is used only as Frank-Wolfe initialization and ω and λ are re-estimated per draw. R's default bootstrap is refit, not fixed-weight. The sum_normalize helper in R's source explicitly comments that the supplied weights "are used only for initialization" in bootstrap and placebo SEs.
Our variance_method="bootstrap" holds the renormalized ω exactly (no FW re-run). It is therefore a deliberate deviation from R's default. Our PR #349 fixture generator at benchmarks/R/... is a manual fixed-weight invocation: it omits the opts rebind, which defaults update.omega to FALSE given non-null weights. The 1e-10 parity test anchors our fixed-weight path to that manual R invocation, not to R's real vcov behavior.
Documentation-only fix across all claim sites; no methodology or code behavior changes:
- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as "Alternative: Bootstrap at unit level — fixed-weight shortcut"; add explicit **Note (deviation from R)** citing the vcov.R / synthdid.R opts-rebind mechanism; call out bootstrap_refit as matching R's default vcov. Requirements checklist entries and R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se method docstring drop the "matching R" framing on the fixed-weight path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc fixed (I introduced the "R-compatible fixed-weight shortcut" misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that bootstrap_refit matches R's default and the existing fixed-weight bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment calls out the non-default invocation shape (no opts rebind → runs fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r docstring: rewritten to scope the parity check correctly (manual R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor (Julia Synthdid.jl or R via the real vcov path) to make the missing anchor explicit.
All 57 targeted tests pass; no methodology change, no numerical output change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality Performance Maintainability
Tech Debt
Security Documentation/Tests
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021 Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap") behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so the renormalized ω serves only as Frank-Wolfe initialization). The previously-shipped fixed-weight shortcut is removed entirely; the "bootstrap_refit" enum value briefly added in earlier commits of this PR is folded back into "bootstrap".
Why this is a correctness fix, not just a relabel: the old fixed-weight "bootstrap" matched neither the paper (which prescribes refit) nor R's default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored fixed-weight Python against a manual R invocation that omitted the opts rebind, so both sides were wrong in the same direction. Coverage MC at benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the new "bootstrap" tracks placebo near-nominal across the three representative DGPs; the old fixed-weight column over-rejected at α=0.05 at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal).
Capability regression: SDID + survey designs (pweight-only AND strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight bootstrap was the only SDID variance method that supported strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside _bootstrap_se). Pweight-only users can switch to variance_method="placebo" or "jackknife"; strata/PSU/FPC users have no SDID variance option on this release. Rao-Wu rescaled weights composed with paper-faithful Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's "Note (deferred survey + bootstrap composition)" and TODO.md. The deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via `git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite _bootstrap_se body.
Cross-surface allow-list reverts: the additive "bootstrap_refit" enum shipped in earlier commits of this PR rippled through results.py:960 summary gating, business_report.py:602 inference-label allow-list, power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults field docstrings. All of those are now back to a 3-value surface ("bootstrap", "jackknife", "placebo").
Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey, raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture (tests/data/sdid_bootstrap_indices_r.json) and generator (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R); they anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the refit values (4.6033, 0.21424970..., 2.10890881e-102, 200), bit-identical to the captured "bootstrap_refit" baseline since the new bootstrap path is the same code as the old refit path. Tolerance tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap"; test_set_params_accepts_bootstrap_refit deleted (redundant with constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed to drop "refit" suffix; assertion checks for "bootstrap variance".
The benchmarks/data/sdid_coverage.json artifact is updated transitionally in this commit (fixed-weight column dropped; refit column renamed to bootstrap) so the schema test stays green; a follow-up commit regenerates from a fresh 500-seed MC re-run with the new code path. The REGISTRY coverage table cells are TBD pending that re-run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only follow-up to the previous commit's bootstrap rewrite. Updates every user-facing surface that referenced the (now-removed) fixed-weight bootstrap or the additive bootstrap_refit option:
- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the SDID survey-support row (no SDID variance method supports strata/PSU/FPC anymore); rewrites the misdirecting note steering users to bootstrap for full survey designs; updates the inference summary table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to reflect the regression matrix (pweight-only works with placebo / jackknife; strata/PSU/FPC has no SDID variance option in this release; bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's timing expectation note (bootstrap is now ~10-100x slower per fit than the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19 (inference methods description) and 29 (summary); bootstrap is now paper-faithful refit matching R's default vcov, not the prior fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the bootstrap-vs-placebo description (cell t18-cell-028); softens the stakeholder narrative claim "the two methods agree" to acknowledge that on small panels with non-exchangeable factor structure the SE magnitudes can differ while both methods still agree on significance and CI direction (cell t18-cell-033); re-executes the comparison cell so the output reflects the new bootstrap SE = 4.50 (was 4.26 under fixed-weight). The drift-guard asserts at cell t18-cell-026 only pin ATT / conf_int / pre-fit RMSE, none of which change, so no guard updates are needed.
- diff_diff/synthetic_did.py: fit() docstring's survey_design parameter description is now consistent with the actual guards (no bootstrap_refit references; explicit pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: --help text drops the bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that bootstrap matches R's bootstrap_sample is replaced with an honest description of the corrected refit semantics, plus a parenthetical historical note about the prior fixed-weight shortcut.
All 351 targeted tests pass; no methodology or numerical change in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with
the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife").
Total wall-clock ~40 min on M-series Mac, Rust backend.
Numbers match the transitional rename in commit 1 exactly (rej@0.05:
balanced=0.078, unbalanced=0.038, aer63=0.040) — expected bit-identity
since the new "bootstrap" path is the same refit code as the previous
"bootstrap_refit" path, and the MC uses identical seeds (range(500)).
Confirms the rewrite didn't introduce numerical drift.
Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040)
versus nominal 0.05; well within 2σ MC band at 500 seeds.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on
balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.
REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_
placeholders with the transcribed numbers. The narrative under the
table (`Reading:` paragraph) was already rewritten in commit 1 for the
3-method surface; no further edit needed.
The transitional commit-1 artifact (where I renamed bootstrap_refit →
bootstrap in-place without running the MC) carried a `transitional_note`
field in the metadata; the fresh run naturally drops that field since
it's written from scratch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
CI review on commit 15481d9 flagged the docs as overclaiming parity with R's default synthdid::vcov(method="bootstrap"): R warm-starts Frank-Wolfe from the renormalized fit-time ω per draw (and keeps fit-time λ as FW init for the λ re-estimation), while our Python port was cold-starting from uniform. On the strictly-convex FW objective with simplex constraint, warm- and cold-start converge to the same global minimum given enough iterations, but the 100-iter pre-sparsify pass may not fully converge on some draws, and then sparsification is path-dependent on the init.
Port the warm-start shape:
- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the first pass. When None (default), preserves the Rust top-level fast-path unchanged. When provided, falls through to the Python two-pass dispatcher; inner FW calls still dispatch to Rust via _sc_weight_fw, so the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread warm-start per draw. boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) (same shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])). boot_lambda_init = time_weights (fit-time λ unchanged, matching R's weights.boot$lambda = weights$lambda).
Baseline drift is small (about 1e-4 relative) rather than bit-level: TestScaleEquivariance._BASELINE["bootstrap"] SE shifts from 0.21424970247101688 (cold-start) to 0.21427381053829253 (warm-start), consistent with a handful of the 200 bootstrap draws converging to a different sparsification pattern than the cold-start pass. rel=1e-14 bit-identity on the warm-start capture holds.
Coverage MC: NOT re-run. User course-correction ("not sure we need to run the full MC with every pass here"). Math argument: strictly-convex FW → unique global minimum; aggregated rejection rates shift well under 2σ MC noise (≈0.02–0.05 per cell at B=200 × 500 seeds). The committed benchmarks/data/sdid_coverage.json (generated under cold-start in commit 15481d9) stays accurate for the warm-start path within MC noise. REGISTRY.md coverage-MC Note now documents the artifact regeneration cadence explicitly: regenerate on methodology changes that materially shift per-draw numerics, not on refactors that preserve the FW global optimum (warm-start, backend migration, renames, docstring fixes). Per-seed bit-identity on the captured fixture is the cheaper, stricter parity check.
Other changes in this commit:
- diff_diff/synthetic_did.py replicate-weight guard message rewritten (R2 P3 maintainability): the old message told users to switch to a TSL-based survey design with strata/psu/fpc, which is no longer supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; bootstrap SE displayed in the comparison cell is now 4.44 (warm-start) vs 4.50 (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape and drop the "cross-language bit parity" implication from earlier wording.
- REGISTRY.md bootstrap section describes the warm-start mechanism including the FW strict-convexity argument for why warm and cold converge to the same answer but warm matches R's procedure at the algorithm level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
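The per-draw warm-start construction described in this commit can be sketched in plain Python. The function `warm_start_for_draw` is hypothetical and stands in for the relevant lines of `_bootstrap_se`; the zero-sum fallback in `sum_normalize` (uniform weights) is an assumption about the helper's behavior, as is treating a draw with no controls or no treated units as degenerate.

```python
import random


def sum_normalize(weights):
    """Rescale weights to sum to 1; fall back to uniform if they sum to 0 (assumed)."""
    total = sum(weights)
    if total != 0:
        return [w / total for w in weights]
    return [1.0 / len(weights)] * len(weights)


def warm_start_for_draw(unit_weights, n_control, n_units, rng):
    """Build the omega initialization for one pairs-bootstrap draw.

    Units 0..n_control-1 are controls; the rest are treated. Returns
    (boot_control_idx, boot_omega_init), or None for a degenerate draw.
    """
    draw = sorted(rng.randrange(n_units) for _ in range(n_units))
    boot_control_idx = [i for i in draw if i < n_control]
    if not boot_control_idx or len(boot_control_idx) == n_units:
        return None  # no controls or no treated units: caller redraws
    # Fit-time weights of the resampled controls, renormalized to the simplex.
    boot_omega_init = sum_normalize([unit_weights[i] for i in boot_control_idx])
    return boot_control_idx, boot_omega_init
```

The λ initialization needs no such step in this sketch, since the commit states that the fit-time time weights are passed through unchanged.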
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Addresses CI review R3 findings on PR #351:
P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py that still asserted the old fixed-weight + Rao-Wu success contract so they match the new upstream guards (bootstrap + any survey design and full strata/PSU/FPC with any variance method now raise NotImplementedError). test_full_design_bootstrap_smoke -> test_full_design_bootstrap_raises; test_full_design_se_differs_from_weights_only is replaced by a positive placebo pweight-only smoke test since the old Rao-Wu vs pweight-only SE comparison no longer has two valid paths to compare; test_bootstrap_with_survey -> test_bootstrap_with_pweight_only_raises plus a new test_jackknife_with_pweight_only for positive coverage. Also align the pre-existing placebo/jackknife guard tests with the updated guard wording (same stale-regex class of issue the P1 flagged).
P2: remove SyntheticDiD from cross-reference Rao-Wu Rescaled Bootstrap sections that survived the bootstrap rewrite. REGISTRY comparison table row now describes the alternative as paper-faithful refit bootstrap (not fixed weights); the Rao-Wu section lists only SunAbraham and TROP with an explicit Note explaining why SDID is excluded (refit + Rao-Wu composition requires weighted-FW derivation, tracked in TODO.md and sketched in REGISTRY SDID survey-regression Note). survey-roadmap.md Phase 6 bullet and llms-full.txt survey-aware bootstrap bullet updated with the same framing.
P3: rewrite _bootstrap_se docstring (diff_diff/synthetic_did.py) to say the fit-time unit_weights and time_weights are Frank-Wolfe warm-start initializations on each draw, not unused parameters retained for signature parity. The in-loop narrative comment at the warm-start site was already correct; the top-of-function docstring was stale from the pre-warm-start commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0 or P1 findings remain in the changed PR surface. The core SDID refit-bootstrap path is aligned with the updated methodology registry, and the prior re-review blocker around stale SDID survey tests appears resolved. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
- Replace placeholder commit SHAs in TODO.md and REGISTRY.md deferred-work breadcrumbs with the actual fixed-weight-removal commit hash (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful refit"). The follow-up implementer can now directly run `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu scaffolding (rw_control/rw_treated split, degenerate-retry check, treated-mean weighting). Also cite PR #351 for stability against commit-hash churn.
- Rename tests/test_survey_phase5.py::test_placebo_full_vs_pweight_only_att_match to test_placebo_with_pweight_only_full_design_stripped_att_match and promote the body from a single-fit smoke to the ATT-equivalence check the docstring claims: fit with two equivalent pweight-only SurveyDesign constructions and assert ATT matches bit-for-bit (abs=1e-12). The intent is to confirm strata/psu/fpc columns sitting on the DataFrame are not tacitly read unless the SurveyDesign references them.
- Fix stale "3 DGPs × 4 methods" docstring in tests/test_methodology_sdid.py::TestCoverageMCArtifact to read "3 DGPs × 3 methods", matching the 3-method schema the test now enforces (placebo / bootstrap / jackknife; bootstrap_refit and the old fixed-weight 'bootstrap' enum are both gone).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment No unmitigated P0 or P1 findings in the changed PR surface. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Verification note: runtime execution was not possible here because this environment is missing |
The previous rewrite compared two semantically identical pweight-only SurveyDesign constructions, which can't catch the contract the docstring claimed — if a future change silently picked up `stratum` or `psu` by name, both fits would pick them up identically and the test would still pass. Rewrite to compare a fit on the original DataFrame (with `stratum` / `psu` columns present) against a fit on the same data with those columns physically dropped. If the estimator ever silently reads those columns by naming convention the two fits would diverge and the abs=1e-12 ATT check would fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings in the changed PR surface. The remaining issues are P2/P3 only. Runtime verification was static-only because this environment does not have Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence via ``warnings.catch_warnings``, but the Rust FW entry point is silent on ``max_iter`` exhaustion (only the pure-NumPy path called ``warn_if_not_converged``). On the default Rust backend the aggregate warning at the end of the bootstrap loop therefore never fired, even when draws did not converge: a silent failure.
Fix: thread an explicit convergence bool out of the Rust solver.
Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``)
- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- Existing ``sc_weight_fw`` pyfunction destructures and drops the bool, preserving its ABI for the rank-selection heuristic in ``prep.py`` and for any third-party consumer.
- New pyfunction ``sc_weight_fw_with_convergence`` returns the ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` / ``compute_sdid_unit_weights_internal`` destructure the inner calls and still return ``Array1<f64>`` (their pyfunctions discard convergence; Python callers that need it use the Python two-pass dispatcher).
Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``)
- Import the new Rust entry point as ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a ``return_convergence=False`` kwarg. Default path is unchanged; with the flag set, return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the same kwarg and propagate the AND of the two FW passes (pre-sparsify + main). When the flag is set, the Rust top-level fast-path is skipped (it is silent on non-convergence) in favor of the Python two-pass dispatcher; inner FW calls still dispatch to the Rust solver via ``sc_weight_fw_with_convergence``.
Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``)
- Replace the ``warnings.catch_warnings`` dance with an explicit ``return_convergence=True`` call on both helpers. ``fw_nonconvergence_count`` now increments whenever the AND of ``omega_converged`` and ``lambda_converged`` is false for a draw, which works under both Rust and pure-Python backends.
Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::test_bootstrap_fw_nonconvergence_warning_fires_under_rust``)
- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always report ``converged=False`` and asserts the aggregate FW non-convergence ``UserWarning`` fires on an otherwise-nominal SDID bootstrap fit. Skipped when the Rust backend is unavailable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
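The idea of returning a convergence flag from the solver and AND-ing it across the two passes can be sketched with a toy Frank-Wolfe routine. Everything below is illustrative: `fw_simplex` and `two_pass_weights` are hypothetical names, the objective is a plain least-squares fit on the simplex without the library's regularization, and the sparsification rule (zero out entries at or below a quarter of the maximum, then renormalize) is an assumption about the convention rather than a copy of the library code.

```python
def fw_simplex(A, b, max_iter=1000, min_decrease=1e-6, init=None):
    """Minimize ||A w - b||^2 over the simplex; return (weights, converged)."""
    n_rows, n_cols = len(A), len(A[0])
    w = list(init) if init is not None else [1.0 / n_cols] * n_cols
    res = [sum(A[r][c] * w[c] for c in range(n_cols)) - b[r] for r in range(n_rows)]
    prev_obj = sum(x * x for x in res)
    for _ in range(max_iter):
        grad = [2.0 * sum(A[r][c] * res[r] for r in range(n_rows)) for c in range(n_cols)]
        i = min(range(n_cols), key=grad.__getitem__)  # best simplex vertex
        d = [-wc for wc in w]
        d[i] += 1.0  # direction e_i - w
        Ad = [sum(A[r][c] * d[c] for c in range(n_cols)) for r in range(n_rows)]
        denom = sum(x * x for x in Ad)
        if denom == 0.0:
            return w, True  # Frank-Wolfe gap is zero: w is optimal
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = -sum(g * dc for g, dc in zip(grad, d)) / (2.0 * denom)
        step = min(1.0, max(0.0, step))
        w = [wc + step * dc for wc, dc in zip(w, d)]
        res = [rc + step * adc for rc, adc in zip(res, Ad)]
        obj = sum(x * x for x in res)
        if prev_obj - obj < min_decrease ** 2:
            return w, True  # min-decrease break
        prev_obj = obj
    return w, False  # max_iter exhausted: report it instead of staying silent


def two_pass_weights(A, b, init=None):
    """Pre-sparsify pass, sparsify, main pass; convergence is the AND of both."""
    w, first_ok = fw_simplex(A, b, max_iter=100, init=init)
    cutoff = max(w) / 4.0
    sparse = [wc if wc > cutoff else 0.0 for wc in w]
    total = sum(sparse)
    sparse = [wc / total for wc in sparse]
    w, second_ok = fw_simplex(A, b, max_iter=10_000, init=sparse)
    return w, first_ok and second_ok
```

Returning the flag as data, rather than relying on warnings raised inside the solver, is what lets a caller count failed draws the same way regardless of which backend ran the iterations.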
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings in the changed PR surface. On this re-review, the earlier Rust non-convergence warning issue and the BR inference-label concern both look addressed. The remaining issues are P2/P3 only. Executive Summary
Methodology Affected method: SyntheticDiD bootstrap variance. Cross-checking docs/methodology/REGISTRY.md:1497, diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and diff_diff/synthetic_did.py:644 against the cited SDID material and the current
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…W note
P2 (Documentation/Tests): ``test_bootstrap_p_value_null_calibration`` at ``tests/test_methodology_sdid.py:2604`` was still framed around the deleted fixed-weight regime and encoded an anti-conservative lower bound ``rejection_rate > 0.05`` that biased the test toward keeping over-rejection. Rename to ``test_bootstrap_p_value_null_dispersion`` and replace the directional calibration assertion with a calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10`` catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed under the old bug), and a loose band ``0.01 <= rejection_rate <= 0.40`` catches both SE-collapse and SE-explosion. Matches this session's commit c0d089b by removing the remaining lingering fixed-weight framing in the test suite. Tested locally: 100-seed refit bootstrap passes with comfortable margin.
P3 (newly identified): REGISTRY.md §Frank-Wolfe non-convergence edge case (line 1528) still said the Rust backend silently returns the final iterate and cited a follow-up. This session's commit c0d089b already landed the Rust-side convergence flag, Python ``return_convergence=True`` kwargs on the weight helpers, and the aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to describe the new behavior: bootstrap path now uses ``sc_weight_fw_with_convergence`` (Rust) / numpy ``return_convergence=True`` and aggregates a single summary warning above the 5% threshold; standalone callers that do not opt into convergence tracking retain the legacy silent-Rust / warning-numpy asymmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
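The dispersion guard from the P2 item can be sketched as a small standalone check. The helper `null_dispersion_ok` is hypothetical; it uses the population standard deviation from the standard library, which corresponds to the default behavior of ``np.std``, and the thresholds are the ones quoted in the commit message.

```python
import statistics


def null_dispersion_ok(p_values, alpha=0.05):
    """Calibration-agnostic guard on null p-values (hypothetical helper).

    Returns (spread, rejection_rate, ok). A spread near zero indicates
    p-values clustering at one value; a rejection rate outside the loose
    band indicates SE collapse or SE explosion.
    """
    spread = statistics.pstdev(p_values)  # population SD, like np.std's default
    rejection_rate = sum(p < alpha for p in p_values) / len(p_values)
    ok = spread > 0.10 and 0.01 <= rejection_rate <= 0.40
    return spread, rejection_rate, ok
```

Because neither condition has a preferred direction around the nominal level, the guard does not reward over-rejection the way the earlier ``rejection_rate > 0.05`` lower bound did.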
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings in the changed PR surface. The SDID refit-bootstrap implementation itself looks methodologically consistent with the cited SDID paper/official Executive Summary
Methodology Affected method: SDID variance estimation. The changed bootstrap path in diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059, diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558 re-estimates both
Code Quality No findings. Performance No findings. Maintainability
Tech Debt
Security No findings. Documentation/Tests
Five P3-only items from R8 CI review:
1. Correctly attribute R's default `vcov()` method:
- diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
"R's default". R's `synthdid::vcov()` actually defaults to
`method="bootstrap"`. Reword to describe placebo as the library default
with a rationale paragraph (survey availability, perf) and cross-
reference to the REGISTRY Note below.
- METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
to frame the default as a deliberate library deviation with the same
two-reason rationale.
2. Add a REGISTRY.md Note (default variance_method deviation from R)
that documents the rationale explicitly (survey availability + perf)
so the AI reviewer recognizes the deviation as documented rather than
as an accidental contradiction between the docstring and R's actual
default.
3. Soften the placebo-failure fallback guidance in
`_placebo_variance_se`: the previous strings recommended
`variance_method="bootstrap"`, which now raises `NotImplementedError`
on every survey design. Branch on `w_control is not None` (survey
fit) to recommend jackknife + adding controls for survey users and
keep the bootstrap + jackknife + more-controls recommendation for
non-survey users.
4. Fix the REGISTRY pointer to the slow dispersion guard: reference
`TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
current name) instead of the pre-rename
`test_bootstrap_p_value_null_calibration`, and describe the new
contract (calibration-agnostic dispersion + loose rejection-rate
band).
5. Rephrase the REGISTRY coverage-MC narrative about jackknife
anti-conservatism: the paper's AER §6.3 shows mixed jackknife
evidence (98% iid — slightly conservative; 93% AR(1) — slightly
anti-conservative). Our observed anti-conservatism aligns with the
AR(1) branch of the paper's evidence, not a uniform "in line"
pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
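Item 3 above (branching the placebo-failure guidance on whether the fit is a survey fit) might look like the following sketch. The function name and the message wording are hypothetical; only the `w_control is not None` branch condition and the set of recommended alternatives come from the commit message.

```python
def placebo_failure_guidance(w_control=None):
    """Hypothetical helper: pick fallback advice when placebo variance fails."""
    if w_control is not None:
        # Survey fit: at this commit, bootstrap raises NotImplementedError on
        # every survey design, so it must not be recommended here.
        return "Consider variance_method='jackknife' or adding more control units."
    # Non-survey fit: bootstrap remains a valid alternative.
    return (
        "Consider variance_method='bootstrap', variance_method='jackknife', "
        "or adding more control units."
    )


survey_msg = placebo_failure_guidance(w_control=[1.0, 2.0, 1.5])
plain_msg = placebo_failure_guidance()
```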
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official Code Quality No findings. Performance No code-level findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
|
Single actionable P3 from R9 CI review: user-facing runtime wording for refit bootstrap had diverged across surfaces, giving conflicting expectations about the cost of the new bootstrap path:
- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually observed wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).
Unify on "~5-30x slower than placebo (panel-size dependent)" for the per-fit slowdown (the warm-start plumbing closed the gap vs the pre-warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the coverage MC wall-clock. The CHANGELOG entry already notes the 10-100x figure as a historical "prior estimate"; it is left as-is so the release notes continue to explain the revision.
Also fix two tutorial surfaces that still called placebo "R's default" (tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is the library default per the REGISTRY Note added in 710f966. Reword to describe placebo as the library default with the rationale pointer.
Verified: 353 tests pass across test_methodology_sdid, test_power, test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift guards unaffected because the change is markdown-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Static review only; I could not execute tests in this workspace because Executive Summary
Methodology No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
|
The Pure Python Fallback CI job failed this test at rel-diff 0.5310 > 0.40 tolerance. Root cause is test infrastructure, not a correctness regression.
ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python mode to keep CI fast (see tests/conftest.py:210); the test's 0.40 tolerance was explicitly calibrated for B∈[100, 200] per its docstring comment. At B=49 the bootstrap SE is not yet converged to the placebo SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the same seed), so the failure is MC noise, not a regression.
The 15 Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3 Python versions) all run the test at the full B=200 and pass with comfortable margin, so the regression guard is still exercised on the default user install path.
Skip under pure-Python mode with an explicit rationale citing the min_n cap and the Rust-backed coverage that preserves the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Static diff review only; Executive Summary
Methodology No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through Code Quality
Performance No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
|
R10 CI review found two items on top of the previous ✅ Looks good.
P2 Code Quality: aggregate Frank-Wolfe non-convergence warning numerator/denominator mismatch. In ``_bootstrap_se``, ``fw_nonconvergence_count`` was incremented before the draw cleared the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND then produced non-finite τ would count toward the warning numerator while the denominator is ``n_successful`` (draws that cleared the finite-τ gate). That does not affect the reported SE, but it can overstate the documented "share of valid bootstrap draws" warning contract and cause the warning to over-trigger.
Fix: move the increment inside the ``if np.isfinite(tau)`` block so the numerator only counts draws that also contribute to the SE. A draw failing the finite-τ gate is retried upstream and should not inflate the non-convergence rate.
P3 Documentation (previously unresolved): two survey-cross-reference docs still advertised SyntheticDiD Rao-Wu bootstrap support, which the estimator now rejects at fit-time with NotImplementedError:
- ``docs/methodology/survey-theory.md:725``: rewrite the Rao-Wu bullet to exclude SDID explicitly, with a pointer to the REGISTRY sketch for the deferred weighted-FW composition and to pweight-only placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c: update the support-matrix table so SDID's row reads "pweight only (placebo / jackknife)" with bootstrap struck out, and add a "Note on SyntheticDiD" below explaining which methods accept pweight-only and why bootstrap rejects all survey designs (weighted-FW derivation tracked in TODO.md).
Test coverage unchanged: TestBootstrapSE ran the full 7 under Rust with 48-of-50 non-convergence warning still firing on the regression test, confirming the warning numerator still tallies correctly after the gate-order change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
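The gate-order fix can be illustrated with a small stand-alone tally. This is a sketch under stated assumptions, not the body of ``_bootstrap_se``: the function name and the `(tau, fw_converged)` input shape are hypothetical, while the counter name, the finite-τ gate, the `n_successful` denominator, and the 5% threshold follow the commit message.

```python
import numpy as np


def tally_fw_nonconvergence(draws, threshold=0.05):
    """Hypothetical helper: draws is an iterable of (tau, fw_converged) pairs."""
    taus = []
    fw_nonconvergence_count = 0
    for tau, fw_converged in draws:
        if np.isfinite(tau):
            taus.append(tau)
            # Incremented inside the finite-tau gate, so numerator and
            # denominator describe the same population of valid draws.
            if not fw_converged:
                fw_nonconvergence_count += 1
        # Non-finite draws are assumed to be retried upstream; they are
        # excluded from both the SE inputs and the non-convergence rate.
    n_successful = len(taus)
    rate = fw_nonconvergence_count / n_successful if n_successful else float("nan")
    should_warn = n_successful > 0 and rate > threshold
    return n_successful, fw_nonconvergence_count, rate, should_warn


example = tally_fw_nonconvergence(
    [(1.0, True), (float("nan"), False), (2.0, False), (3.0, True)]
)
```

In the example, the draw with a non-finite τ also failed FW convergence, but it is not counted: the tally is 1 non-converged draw out of 3 valid draws rather than 2 out of 3.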
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Static diff review only; Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R11 CI review flagged that compute_time_weights and compute_sdid_unit_weights describe the new return_convergence parameter but their Returns sections still advertise only np.ndarray, not the tuple return.
Update both docstrings to mirror _sc_weight_fw's union-return shape: Returns now read "np.ndarray or Tuple[np.ndarray, bool]" with an explicit description of the two-pass AND convergence flag (True iff both pre-sparsify and main FW passes hit the min-decrease break; False if either hit max_iter). Also promote the return_convergence kwarg description from the implementation-detail placement to the main Parameters block with the same contract note as _sc_weight_fw so standalone callers see the legacy-ABI preservation guarantee and the Rust top-level fast-path skip.
Docs-only change; no code or test changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
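The union-return contract described here ("np.ndarray or Tuple[np.ndarray, bool]", with the flag being the AND of the two passes) can be sketched as follows. The names `two_pass_weights`, `solve_pass`, and `_sparsify` are hypothetical, and the max/4 sparsification threshold is an assumption made only so the example runs; the real helpers wrap the Frank-Wolfe solver.

```python
from typing import Callable, Tuple, Union

import numpy as np


def _sparsify(w: np.ndarray) -> np.ndarray:
    # Assumed rule for illustration: drop entries at or below max/4, renormalize.
    kept = np.where(w <= w.max() / 4.0, 0.0, w)
    return kept / kept.sum()


def two_pass_weights(
    solve_pass: Callable[[np.ndarray], Tuple[np.ndarray, bool]],
    init: np.ndarray,
    return_convergence: bool = False,
) -> Union[np.ndarray, Tuple[np.ndarray, bool]]:
    """Hypothetical two-pass wrapper around a solver returning (weights, converged)."""
    w_first, conv_first = solve_pass(init)                  # pre-sparsify pass
    w_final, conv_final = solve_pass(_sparsify(w_first))    # main pass
    if return_convergence:
        # True only if both passes converged.
        return w_final, bool(conv_first and conv_final)
    return w_final                                          # legacy return shape


init = np.array([0.7, 0.2, 0.1])
legacy = two_pass_weights(lambda w: (w, True), init)
flags = iter([True, False])
w_opt_in, converged = two_pass_weights(
    lambda w: (w, next(flags)), init, return_convergence=True
)
```

Defaulting `return_convergence` to `False` keeps the bare-array return for existing callers, which is the legacy-ABI guarantee the docstring update spells out.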
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Static diff review only; Executive Summary
Methodology No findings. The changed estimator math, SE formula, retry-to- Code Quality No findings. Performance No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent. Maintainability No findings. Extracting Tech Debt
Security No findings. Documentation/Tests
|
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up to #351 which front-door rejected all survey designs). This commit adds the weighted-FW kernel + Python wrappers; the bootstrap integration lands in the next commit.
Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted` loop variants. Identical to the unweighted loops except for the regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in place of `eta*lam[j]`, and the FW step-size denominator uses the diag(reg_w)-weighted simplex direction norm `Σ_j reg_w[j]*d[j]²` (which simplifies to `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the unweighted internal when reg_weights is None (preserves the legacy numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and `sc_weight_fw_weighted_with_convergence`. Same call shape as the existing unweighted siblings plus a trailing `reg_weights` kwarg. Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
  * test_weighted_fw_reg_weights_none_delegates: bit-identity at rel=1e-14 against the unweighted internal.
  * test_weighted_fw_uniform_reg_weights_matches_unweighted: uniform rw=1 collapses to uniform regularization (rel=1e-12, allowing for ULP-scale drift from different float reduction orders).
  * test_weighted_fw_simplex_invariants: for arbitrary positive rw and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to 1 and is non-negative.
Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend (mirrors the shape added for _rust_sc_weight_fw_with_convergence in PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust path, dispatches to the new weighted pyfunctions; on the pure-Python path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control, Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by rw_control and passes rw_control as reg_weights so the FW solves the unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights. Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control, rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes no reg_weights (uniform reg on λ: λ is per-period, rw is per-control, no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of the two pass convergence flags, mirroring the contract added in PR #351 c0d089b.
Tests (tests/test_weighted_fw.py, new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at bit-identity; uniform reg matches unweighted at rel=1e-12; Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw; return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted helper, simplex invariants under arbitrary rw, shape-mismatch raises, return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7 for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).
ABI preservation: existing unweighted callers of _sc_weight_fw, compute_sdid_unit_weights, compute_time_weights are unaffected; the new kwarg defaults to None and dispatches to the legacy code path. The bit-identity check on TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit when the bootstrap integration lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
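The step-size denominator simplification quoted in this commit (`Σ_j reg_w[j]*d[j]²` equals `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` when `d = e_i - lam`) follows from expanding the i-th term: `(1 - lam[i])² = lam[i]² + 1 - 2*lam[i]`. The numerical check below is a sketch with hypothetical function names; it is not the Rust kernel, only a verification of the algebra.

```python
import numpy as np


def weighted_dir_norm_direct(lam, reg_w, i):
    """Direct form: build d = e_i - lam, then sum reg_w[j] * d[j]**2."""
    d = -lam.copy()
    d[i] += 1.0
    return float(np.sum(reg_w * d**2))


def weighted_dir_norm_closed_form(lam, reg_w, i):
    """Closed form from the commit message; avoids materializing d."""
    return float(np.sum(reg_w * lam**2) + reg_w[i] - 2.0 * reg_w[i] * lam[i])


rng = np.random.default_rng(0)
lam = rng.random(6)
lam /= lam.sum()                 # a point on the simplex
reg_w = rng.random(6) + 0.1      # arbitrary positive regularization weights
max_gap = max(
    abs(weighted_dir_norm_direct(lam, reg_w, i) - weighted_dir_norm_closed_form(lam, reg_w, i))
    for i in range(lam.size)
)
```

With `reg_w` set to all ones the closed form reduces to `Σ lam² + 1 - 2*lam[i]`, the usual unweighted squared norm of `e_i - lam`.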
…sition
PR #352 restores the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression. Pweight-only and full-design surveys now both succeed; placebo / jackknife continue to reject full designs (a separate methodology gap tracked in TODO.md).
`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-gated version that fires only for placebo / jackknife. Rationale + truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:
    method      pweight-only    strata/PSU/FPC
    bootstrap   ✓ (this PR)     ✓ Rao-Wu (this PR)
    placebo     ✓ unchanged     ✗ NotImpl (separate gap)
    jackknife   ✓ unchanged     ✗ NotImpl (separate gap)
- Delete the unconditional `bootstrap + any-survey` guard added in #351. Keep the `weight_type != "pweight"` validation (fweight/aweight still rejected).
`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`), also collapse the observation-level `resolved_survey` to a unit-level view via `collapse_survey_to_unit_level(...)` ordered as `[*control_units, *treated_units]`. The resulting `resolved_survey_unit` is what `_bootstrap_se` slices via `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.
`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for pweight-only, `resolved_survey=resolved_survey_unit` for full design, None/None for non-survey.
`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-only, default None, which preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns (NaN, []) since the bootstrap distribution is unidentified (resampling one PSU yields the same subset every draw). Recovered from the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs: ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the resampled units. Pweight-only path uses ``rw = w_control[boot_idx]`` (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)`` when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey`` helpers (PR #352 commit 1) which run the weighted-FW kernel; non-survey paths continue to call the unweighted helpers (bit-identity preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID estimator (which expects simplex weights). Degenerate-retry if ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b shape preserved, no copy change).
Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as succeeds-tests with explicit SE assertions:
  * ``test_full_design_bootstrap_succeeds`` (was ``_raises``): finite SE, populated survey_metadata.n_strata/n_psu, summary() includes Survey Design + Bootstrap replications blocks.
  * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``): finite SE, variance_method preserved (cross-surface guard).
  * New ``test_bootstrap_full_design_se_differs_from_pweight_only`` resurrects the PR #351 R3-deleted differs-from contract: ATT matches between paths (both compose ω_eff post-fit) but SE differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351 raises-tests as succeeds-tests, plus add the ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.
Verified: 308 tests pass across test_methodology_sdid / test_business_report SDID subset / test_rust_backend / test_survey_phase5 / test_weighted_fw / test_guides. Bit-identity check: the non-survey refit path goes through the unweighted helpers (no weighted-FW dispatch), so ``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]`` remains at rel=1e-14 (verified passing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
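The post-FW composition step, ``ω_eff = rw·ω / Σ(rw·ω)`` with a degenerate retry when ``Σ(rw·ω) <= 0``, can be sketched as below. The function name and the use of `None` as the retry signal are assumptions for illustration; the commit message does not say how ``_bootstrap_se`` signals a retry.

```python
import numpy as np


def compose_effective_weights(omega, rw):
    """Hypothetical helper: fold per-draw survey weights into simplex weights.

    Returns None when all FW mass sits on rw == 0 controls, so that the
    caller can redraw (the degenerate-retry case).
    """
    omega = np.asarray(omega, dtype=float)
    rw = np.asarray(rw, dtype=float)
    mass = float(np.sum(rw * omega))
    if mass <= 0.0:
        return None
    return rw * omega / mass      # back on the simplex: sums to 1


# Second control was not drawn (rw = 0), so its FW mass is dropped.
omega_eff = compose_effective_weights([0.5, 0.5, 0.0], [2.0, 0.0, 1.0])
# All FW mass on an undrawn control: degenerate, caller should retry.
degenerate = compose_effective_weights([0.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```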
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap composition and propagates the landed capability across the documentation surfaces.
Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/stratum) null-treatment panel with moderate weight variation and modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post periods). Converts per-observation ``treated`` to a unit-level ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory`` callable that returns ``(SurveyDesign, supported_methods_tuple)``. For ``stratified_survey``: bootstrap only. Placebo / jackknife reject strata/PSU/FPC at fit-time, so the harness skips them rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls the factory once per seed (DataFrame contents don't affect columns) and gates methods on the supported set.
Regenerated ``benchmarks/data/sdid_coverage.json`` via ``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend); aer63 remains the long tail at 2237 s, stratified_survey adds only 33 s.
Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the bootstrap is slightly conservative (overestimates empirical sampling SD by ~25%), the safer direction under Rao-Wu rescaling with only 4 PSUs total. Validates the weighted-FW + Rao-Wu composition end-to-end.
REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a paragraph under it documenting the calibration verdict, the conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR #352 shape); add a ``Note (survey + bootstrap composition)`` documenting the weighted-FW objective (unit and time forms), the ω_eff composition, the argmin-set caveat, the per-draw rw dispatch (pweight-only vs Rao-Wu), and the single-PSU short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to say "4 representative null-panel DGPs" and flag stratified_survey as bootstrap-only.
User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix table row for SyntheticDiD switches from "pweight only (placebo/jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC"; "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet includes SDID in the Rao-Wu list with the weighted-FW formula.
CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as "restored in PR #352"; add new Added/Changed PR #352 entries documenting the weighted-FW kernel, survey helpers, _bootstrap_se Rao-Wu composition, and the new coverage MC row.
TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced with a narrower follow-up for placebo/jackknife + strata/PSU/FPC (Low priority, no concrete sketch yet).
Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including ``stratified_survey``); new explicit assertions that the stratified_survey bootstrap row has ≥100 successful fits and α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows n_successful_fits == 0 (strata/PSU/FPC rejection contract).
Verified: TestCoverageMCArtifact passes against the regenerated artifact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
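The two calibration metrics reported for each DGP × method cell (rejection rate under H0 and mean SE / true SD) could be computed along these lines. This is a sketch, not the harness in ``benchmarks/python/coverage_sdid.py``: the function name, the two-sided normal p-value, and the output keys other than ``n_successful_fits`` are assumptions, and "true SD" is taken to mean the empirical SD of the point estimates across seeds.

```python
from math import erfc, sqrt

import numpy as np


def coverage_summary(att, se, alphas=(0.01, 0.05, 0.10)):
    """Hypothetical per-cell summary over seeds of null-panel fits."""
    att = np.asarray(att, dtype=float)
    se = np.asarray(se, dtype=float)
    ok = np.isfinite(att) & np.isfinite(se) & (se > 0)
    n_ok = int(ok.sum())
    if n_ok < 2:
        # Methods skipped for a DGP leave nothing to summarize.
        return {"n_successful_fits": n_ok, "rejection_rate": {}, "se_over_true_sd": float("nan")}
    z = np.abs(att[ok] / se[ok])
    # Assumed analytical p-value: two-sided normal, p = 2 * (1 - Phi(|z|)).
    p = np.array([erfc(v / sqrt(2.0)) for v in z])
    return {
        "n_successful_fits": n_ok,
        "rejection_rate": {a: float(np.mean(p < a)) for a in alphas},
        # Ratio above 1 means the SE overstates the sampling spread (conservative).
        "se_over_true_sd": float(se[ok].mean() / att[ok].std(ddof=1)),
    }


demo = coverage_summary([0.0, 1.0, -3.0, 0.5], [1.0, 1.0, 1.0, 1.0])
in_band = 0.02 <= 0.042 <= 0.10   # the reported stratified_survey x bootstrap rate
```

Under this reading, the reported ``mean SE / true SD = 1.25`` corresponds to a ratio above 1, which is the conservative direction noted in the commit message.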
Tracker had fallen ~9 estimators behind the library. Audit against __init__.py __all__, docs/methodology/REGISTRY.md, docs/methodology/papers/, and tests/test_methodology_*.py surfaced four "Not Started" entries that are stale and ~10 missing entries entirely.
Changes:
- Reorganized Review Status Summary into seven categories (Core, Staggered, Continuous & Universal-Treatment, Triple-Difference, Counterfactual, Diagnostics, Cross-Cutting Inference Features).
- Added "What 'Complete' means" tier definition (Complete / In Progress / Not Started) so the bar is explicit.
- Added In-Progress entries for ImputationDiD, TwoStageDiD, WooldridgeDiD (ETWFE), EfficientDiD, ContinuousDiD, ChaisemartinDHaultfoeuille (DCDH), HeterogeneousAdoptionDiD (HAD), TROP, StaggeredTripleDifference, ConleySpatialHAC, Survey Data Support, PlaceboTests. Each "In Progress" block lists what's already in place (REGISTRY section, paper review, methodology test file, parity fixtures, test counts) and what's needed to promote to Complete.
- Updated SyntheticDiD last-review date to 2026-04-23 (PR igerber#351 bootstrap-refit landing date) to reflect the warm-start FW corrections.
- Refreshed methodology-test counts on the existing Complete entries to match current `grep` output (CallawaySantAnna 61, HonestDiD 27, TripleDifference 45, DifferenceInDifferences 51, HonestDiD unit 72).
- Updated Priority Order: BaconDecomposition flagged as next substantive review (chosen during this session); In-Progress promotion ladder documented (HAD largest, DCDH closest to ready, etc.).
No source code changes. No status flips on existing Complete entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `variance_method="bootstrap_refit"` for SyntheticDiD: Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains `"placebo"`.
- Cross-surface allow-list extensions for the new `variance_method`: `results.py:960` summary line, `business_report.py:602` BR inference-label, `synthetic_did.py:695` `n_bootstrap` result population, `power.py` SDID guidance strings, `SyntheticDiD.__init__` docstring, `diff_diff/guides/llms-full.txt`. Survey + refit raises `NotImplementedError` upstream in `fit()`; the Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
- Coverage MC study (`benchmarks/python/coverage_sdid.py`) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes `benchmarks/data/sdid_coverage.json`. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into `REGISTRY.md` §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
- `SyntheticDiD.set_params` validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extract a shared `_validate_config` helper; make `set_params` transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.
Methodology references (required if estimator / math changes)
- `variance_method="bootstrap_refit"` variant.
- `Synthdid.jl::src/vcov.jl:96-103` is the only existing refit-bootstrap implementation; R `synthdid::vcov(method="bootstrap")` and Stata `sdid.ado:1033-1037` both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in `docs/methodology/REGISTRY.md` §SyntheticDiD.
- Survey designs with `bootstrap_refit` raise `NotImplementedError`: Rao-Wu rescaled weights composed with FW re-estimation needs its own derivation (paper is un-survey; R has no survey support). Documented in REGISTRY.md and tracked in TODO.md. FW non-convergence warnings are aggregated into a single summary `UserWarning` at end-of-loop if the rate exceeds 5% (same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ per-fit spam. No deviations from the paper's estimator math, SE formula (`sqrt((r-1)/r) × sd(ddof=1)`, unchanged), or p-value dispatch (analytical from bootstrap SE, unchanged from PR #349, "Fix SyntheticDiD bootstrap p-value dispatch and SE formula").
Validation
- `tests/test_methodology_sdid.py::TestBootstrapRefitSE`: 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
- `tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical`: mirror of the fixed-weight bootstrap regression guard
- `tests/test_methodology_sdid.py::TestGetSetParams`: 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
- `tests/test_methodology_sdid.py::TestCoverageMCArtifact`: schema smoke test on the MC JSON (guarded with `pytest.skip` if absent per `feedback_golden_file_pytest_skip.md`)
- `tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel`: cross-surface guard that BR emits `"bootstrap_refit variance"` on alpha-override, not the analytical fallback label
- PR #349's 1e-10 R-parity bit-identity gate (`TestJackknifeSERParity::test_bootstrap_se_matches_r`) still passes: fixed-weight path byte-identical to baseline despite the `_bootstrap_se` signature expansion
- `benchmarks/data/sdid_coverage.json` (500 seeds × B=200 × 3 DGPs × 4 methods, generated on M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in `REGISTRY.md` §SyntheticDiD.
Security / privacy
Generated with Claude Code