Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study by igerber · Pull Request #351 · igerber/diff-diff

igerber · 2026-04-22T15:58:02Z

Summary

Ship paper-faithful variance_method="bootstrap_refit" for SyntheticDiD — Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains "placebo".
Cross-surface allow-list extensions so the new enum flows through all surfaces that read variance_method: results.py:960 summary line, business_report.py:602 BR inference-label, synthetic_did.py:695 n_bootstrap result population, power.py SDID guidance strings, SyntheticDiD.__init__ docstring, diff_diff/guides/llms-full.txt. Survey + refit raises NotImplementedError upstream in fit() — Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
Coverage Monte Carlo study (benchmarks/python/coverage_sdid.py) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes benchmarks/data/sdid_coverage.json. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
SyntheticDiD.set_params validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extract a shared _validate_config helper; make set_params transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.

Methodology references (required if estimator / math changes)

Method name(s): Synthetic Difference-in-Differences bootstrap variance — new variance_method="bootstrap_refit" variant.
Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, Wager (2021), "Synthetic Difference-in-Differences," AER 111(12). New variant implements Algorithm 2 step 2 verbatim (re-estimate ω̂_b and λ̂_b per draw via Frank-Wolfe). Cross-language anchor: Julia Synthdid.jl::src/vcov.jl:96-103 is the only existing refit-bootstrap implementation; R synthdid::vcov(method="bootstrap") and Stata sdid.ado:1033-1037 both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in docs/methodology/REGISTRY.md §SyntheticDiD.
Any intentional deviations from the source (and why): Survey designs (including pweight-only) composed with bootstrap_refit raise NotImplementedError, because Rao-Wu rescaled weights composed with FW re-estimation need their own derivation (the paper does not cover survey designs, and R has no survey support). Documented in REGISTRY.md and tracked in TODO.md. FW non-convergence warnings are aggregated into a single summary UserWarning at end-of-loop if the rate exceeds 5% (same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ warnings per fit. No deviations from the paper's estimator math, SE formula (sqrt((r-1)/r) × sd(ddof=1), unchanged), or p-value dispatch (analytical from bootstrap SE, unchanged from PR #349, "Fix SyntheticDiD bootstrap p-value dispatch and SE formula").
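The SE formula and analytical p-value dispatch referenced above can be sketched as follows. This is a minimal illustration, not the library's `_bootstrap_se`: the function names `bootstrap_se` and `analytical_p_value` are hypothetical, and the use of a two-sided normal reference distribution is an assumption made for the sketch.

```python
import math

import numpy as np


def bootstrap_se(replicates):
    """SE = sqrt((r - 1) / r) * sd(ddof=1) over the r finite bootstrap estimates."""
    taus = np.asarray(replicates, dtype=float)
    taus = taus[np.isfinite(taus)]  # drop non-finite draws before counting r
    r = taus.size
    if r < 2:
        return float("nan")
    return math.sqrt((r - 1) / r) * float(np.std(taus, ddof=1))


def analytical_p_value(att, se):
    """Two-sided p-value from the point estimate and SE (normal reference: an assumption)."""
    if not (math.isfinite(se) and se > 0):
        return float("nan")  # NaN-safe: no inference without a usable SE
    z = abs(att) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))
```

Note that sqrt((r-1)/r) × sd(ddof=1) is algebraically the same as sd(ddof=0), so the correction factor simply converts the sample standard deviation into the population form.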

Validation

Tests added/updated:
- tests/test_methodology_sdid.py::TestBootstrapRefitSE — 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
- tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical — mirror of the fixed-weight bootstrap regression guard
- tests/test_methodology_sdid.py::TestGetSetParams — 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
- tests/test_methodology_sdid.py::TestCoverageMCArtifact — schema smoke test on the MC JSON (guarded with pytest.skip if absent per feedback_golden_file_pytest_skip.md)
- tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel — cross-surface guard that BR emits "bootstrap_refit variance" on alpha-override, not the analytical fallback label
- PR #349's 1e-10 R-parity bit-identity test (TestJackknifeSERParity::test_bootstrap_se_matches_r) still passes: the fixed-weight path is byte-identical to baseline despite the _bootstrap_se signature expansion
Backtest / simulation / notebook evidence (if applicable): Coverage MC artifact committed at benchmarks/data/sdid_coverage.json (500 seeds × B=200 × 3 DGPs × 4 methods, generated on M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in REGISTRY.md §SyntheticDiD:

DGP	method	α=0.01	α=0.05	α=0.10	mean SE / true SD
balanced	placebo	0.016	0.060	0.086	1.05
balanced	bootstrap	0.106	0.160	0.230	0.85
balanced	bootstrap_refit	0.028	0.078	0.116	1.05
balanced	jackknife	0.066	0.112	0.154	1.08
unbalanced	placebo	0.006	0.032	0.070	1.08
unbalanced	bootstrap	0.036	0.098	0.140	0.94
unbalanced	bootstrap_refit	0.008	0.038	0.080	1.11
unbalanced	jackknife	0.024	0.076	0.120	0.99
AER §6.3	placebo	0.018	0.058	0.086	0.99
AER §6.3	bootstrap	0.034	0.092	0.162	0.88
AER §6.3	bootstrap_refit	0.010	0.040	0.078	1.05
AER §6.3	jackknife	0.030	0.080	0.150	0.90
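As a quick check on the "~1.8–3.2×" over-rejection figure, the ratios can be recomputed from the fixed-weight `bootstrap` rows of the table at α=0.05. The snippet only transcribes numbers from the table; the variable names are illustrative.

```python
NOMINAL = 0.05

# Fixed-weight "bootstrap" rejection rates at alpha = 0.05, transcribed from the table.
fixed_weight_rej = {"balanced": 0.160, "unbalanced": 0.098, "AER §6.3": 0.092}

# Ratio of the observed rejection rate to the nominal level, per DGP.
over_rejection = {dgp: round(rate / NOMINAL, 2) for dgp, rate in fixed_weight_rej.items()}

for dgp, ratio in over_rejection.items():
    print(f"{dgp}: {ratio}x nominal")
```

The ratios come out at 3.2, 1.96 and 1.84, which is the range quoted in the summary.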

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Default remains "placebo".

Cross-surface allow-list extensions land in one PR per feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt

Survey + bootstrap_refit raises NotImplementedError upstream in fit() (covers both pweight-only and full-design); the Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.

Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal calibration across all 3 DGPs; fixed-weight over-rejects by roughly 1.8–3.2× on smaller panels, consistent with the SE under-estimate from ignoring weight-estimation uncertainty.

Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical in TestPValueSemantics + TestCoverageMCArtifact schema smoke test (guarded with pytest.skip per feedback_golden_file_pytest_skip.md) + cross-surface BR inference-label test. PR #349's 1e-10 R-parity bit-identity gate still passes.

Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the refit loop and aggregated into a single summary warning at end-of-loop if the rate exceeds 5%, the same threshold the retry-exhaustion guard uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AI review caught that the sklearn-style setter path bypassed the constructor's enum/coherence checks, so users could ``set_params(variance_method='not_a_method')`` after construction and slip past the __init__ validation added for ``bootstrap_refit``.

Extract the existing checks into a private ``_validate_config()`` helper and call from both ``__init__`` and ``set_params`` so both paths enforce the same contract. Constant-hoist the valid-methods tuple onto the class as ``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a single source.

Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to the review's P2 finding: if ``_validate_config`` rejects the post-update state in a multi-attribute ``set_params`` call, the instance was left with partially-applied (invalid) values after the raised ``ValueError``. Snapshot originals before any setattr and restore them in an except handler so the raise leaves the object consistent with its pre-call configuration. Regression test asserts post-raise state matches the pre-call state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
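The snapshot-and-restore pattern described in this commit can be sketched on a toy class. `ToyEstimator` is hypothetical and only mirrors the names mentioned in the thread (`_validate_config`, `_VALID_VARIANCE_METHODS`, `set_params`); the real `SyntheticDiD` has more parameters and its exact messages and checks may differ.

```python
class ToyEstimator:
    """Hypothetical stand-in for an sklearn-style estimator with validated config."""

    # Enum surface as of this commit (assumed for the sketch).
    _VALID_VARIANCE_METHODS = ("placebo", "bootstrap", "bootstrap_refit", "jackknife")

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        # Shared by __init__ and set_params so both paths enforce one contract.
        if self.variance_method not in self._VALID_VARIANCE_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.variance_method != "jackknife" and self.n_bootstrap < 2:
            raise ValueError("n_bootstrap must be >= 2 unless variance_method='jackknife'")

    def set_params(self, **params):
        for name in params:
            if name not in ("variance_method", "n_bootstrap"):
                raise ValueError(f"unknown parameter: {name!r}")
        # Snapshot originals before any setattr so a failure can be rolled back.
        snapshot = {name: getattr(self, name) for name in params}
        try:
            for name, value in params.items():
                setattr(self, name, value)
            self._validate_config()
        except ValueError:
            # Restore the pre-call configuration, then re-raise.
            for name, value in snapshot.items():
                setattr(self, name, value)
            raise
        return self
```

With this shape, a rejected multi-attribute call such as `set_params(variance_method="bootstrap", n_bootstrap=1)` raises and leaves both attributes at their previous values.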

github-actions · 2026-04-22T16:05:09Z

Overall Assessment

✅ Looks good

Executive Summary

No unmitigated P0/P1 issues found. The new bootstrap_refit path re-estimates both ω and λ per draw and keeps the documented SE / p-value semantics aligned with the Methodology Registry.
The bootstrap_refit + survey-design exclusion is explicitly documented in the registry and tracked in TODO.md, so it is informational rather than blocking.
P3: the aggregated Frank-Wolfe non-convergence warning is counted on raw solver warnings rather than draw-level failures, so it can fire earlier than the registry text implies.
P3: a few in-code docstrings still describe the pre-PR variance-method surface.
P3: the existing scale-equivariance regression suite was not extended to bootstrap_refit, so the new normalization-sensitive path lacks a direct guard.

Methodology

No unmitigated findings. The refit path in diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071 re-estimates both weights per draw, and the non-placebo inference dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes in docs/methodology/REGISTRY.md:L1505 and docs/methodology/REGISTRY.md:L1555.

Severity P3. Impact: bootstrap_refit still rejects any survey design, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103, so it is mitigated and does not block approval. Concrete fix: none for this PR; the follow-up Rao-Wu + refit derivation is already recorded.

Code Quality

Severity P3. Impact: the refit non-convergence summary warning counts raw solver warnings in diff_diff/synthetic_did.py:L1092 and compares them to successful draws in diff_diff/synthetic_did.py:L1173, while the registry text says the warning should trigger when the non-convergence rate exceeds 5% of draws in docs/methodology/REGISTRY.md:L1513. That can emit the warning below the documented draw-level threshold when both omega and lambda warn on the same draw. Concrete fix: count “any non-convergence on this draw” as a boolean, or normalize against 2 * n_successful if the intended unit is solver calls.
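The difference between counting raw solver warnings and counting draws can be illustrated with a small sketch. Everything here is hypothetical: `toy_solver` stands in for a Frank-Wolfe call that may warn, and each draw is represented as a tuple of per-solve convergence flags.

```python
import warnings


def toy_solver(converged):
    """Hypothetical solver call: emits a UserWarning when it fails to converge."""
    if not converged:
        warnings.warn("Frank-Wolfe did not converge", UserWarning)


def count_nonconvergence(draws):
    """Return (raw warning count, number of draws with at least one warning)."""
    raw_warnings = 0
    draws_with_warning = 0
    for solver_outcomes in draws:
        # Capture warnings per draw so they are not emitted one by one.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            for converged in solver_outcomes:
                toy_solver(converged)
        raw_warnings += len(caught)
        draws_with_warning += 1 if caught else 0  # boolean per draw
    return raw_warnings, draws_with_warning


# 20 draws; on one of them both solves warn.
draws = [(True, True)] * 19 + [(False, False)]
raw, per_draw = count_nonconvergence(draws)
raw_rate = raw / len(draws)            # 2 / 20
per_draw_rate = per_draw / len(draws)  # 1 / 20
```

In this toy case the raw count gives a rate of 0.10, which exceeds a 5% threshold, while the per-draw rate is exactly 0.05 and does not, which is the kind of discrepancy the finding describes.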

Performance

No findings. The extra cost of bootstrap_refit is opt-in and documented in diff_diff/synthetic_did.py:L56 and docs/methodology/REGISTRY.md:L1513.

Maintainability

No unmitigated findings.

Tech Debt

Severity P3. Impact: placebo_effects now carries placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values, so the name is increasingly misleading after this PR. This is already tracked in TODO.md:L129. Concrete fix: none required for approval; follow the tracked rename to a neutral name such as variance_effects.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the in-code result/API docs still describe only the old variance-method surface in diff_diff/results.py:L813 and diff_diff/results.py:L815, and the fit() docstring does not describe the new NotImplementedError survey restriction in diff_diff/synthetic_did.py:L234 and diff_diff/synthetic_did.py:L247. Concrete fix: update those docstrings to include bootstrap_refit, its per-iteration payload semantics, and the survey-support limitation.
Severity P3. Impact: the existing normalization regression suite still excludes bootstrap_refit in tests/test_methodology_sdid.py:L2496, tests/test_methodology_sdid.py:L2510, and tests/test_methodology_sdid.py:L2542, even though the new path depends on the same Y_scale / zeta / Y_scale contract documented in docs/methodology/REGISTRY.md:L1579. Concrete fix: add bootstrap_refit to those parametrizations or add a dedicated smaller scale-equivariance regression for the refit branch.

Execution note: I could not run the test suite in this sandbox because pytest and numpy are not installed here.

…ression

Three P3 items from the CI AI review, all under Documentation/Tests and Code Quality (no methodology change):

1. **FW warning counting**: the refit loop counted raw solver warnings (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. REGISTRY text describes the rate per valid bootstrap draw, so the two could diverge by up to 3×. Change counts "any non-convergence on this draw" as boolean 1 and tighten the REGISTRY wording to call out the per-draw convention explicitly (each draw runs Frank-Wolfe once for ω and once for λ; either can trip the draw).
2. **Docstring gaps**: `fit()` now documents the new survey + refit NotImplementedError under the `survey_design` parameter and in the Raises block. `SyntheticDiDResults.variance_method` and `SyntheticDiDResults.placebo_effects` fields now list `bootstrap_refit` and describe the four-way semantics of `placebo_effects` under each variance method.
3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the `variance_method` parametrizations in `TestScaleEquivariance` (baseline-parity, scale-equivariance across ~15 orders of magnitude, and extreme-scale-detection). Capture a bit-identity baseline for the refit path on `_make_panel(seed=42)`, so the normalization-sensitive branch has its own drift guard, not inferred from sibling methods.

All 37 SDID targeted tests + 361 broader sweep tests pass; fixed-weight 1e-10 R-parity bit-identity gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T16:29:12Z

/ai-review

github-actions · 2026-04-22T16:40:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a033267682aa63852637ca5206c4eab06808c80

Overall Assessment

⚠️ Needs changes

Executive Summary

[Newly identified] The PR now codifies the mapping “bootstrap = R-compatible fixed-weight shortcut, bootstrap_refit = paper-faithful refit” across the registry, public docstrings, changelog, and new regression tests, but the official synthdid source currently accessible does not support that blanket “matching R” claim. It stores update.omega / update.lambda in attr(estimate, "opts"), passes those opts back through bootstrap_sample(), and notes the supplied bootstrap weights are used only for initialization. (raw.githubusercontent.com)
Previous non-blocking findings from the last review look addressed: the BusinessReport allow-list now includes bootstrap_refit, SyntheticDiDResults.summary() shows bootstrap replications for refit, and the scale-equivariance suite now covers the new variance path.
The bootstrap_refit survey-design exclusion is documented in the methodology registry and tracked in TODO.md, so it is informational rather than blocking.
I could not execute the test suite in this sandbox because pytest, numpy, and pandas are not installed.

Methodology

Severity P1 [Newly identified]. Impact: The changed method description in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12 says the existing variance_method="bootstrap" is the R-compatible fixed-weight bootstrap and positions bootstrap_refit as the new paper-faithful variant. The official synthdid source currently accessible contradicts that as written: synthdid_estimate() persists update.omega / update.lambda in attr(estimate, "opts"), bootstrap_sample() feeds those opts back into the estimator, and the package source says the supplied weights are used only for initialization in bootstrap/placebo SEs. That makes the current fixed-weight path an undocumented deviation from the cited reference implementation, and the new fixed-vs-refit regression in tests/test_methodology_sdid.py:L568-L596 now hard-codes that mismatch into the test surface. Concrete fix: either (1) document the current bootstrap path in REGISTRY.md as a **Note (deviation from R):** and remove the “matching R” / “R-compatible” language everywhere, or (2) retarget variance_method="bootstrap" to the official synthdid bootstrap semantics and rename the fixed-weight shortcut; in both cases, update the new regression/parity language to match. (raw.githubusercontent.com)
Severity P3. Impact: bootstrap_refit still rejects any survey design, but that limitation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103. Concrete fix: none required for approval.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: placebo_effects now spans placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values; the name is misleading but already tracked in TODO.md:L129. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the methodology-labeling issue above. The prior doc/test gaps from the last review appear addressed in the changed files.

Path to Approval

Resolve the public method mapping for SDID bootstrap variance: either document the current fixed-weight bootstrap path as a REGISTRY.md deviation from official synthdid, or change bootstrap to match the official source and rename the fixed-weight shortcut.
Remove or rewrite the new “matching R” / “R-compatible” statements in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12.
Update the new fixed-vs-refit regression and any parity messaging so the tests enforce the approved semantics instead of encoding the current unsupported “R-compatible fixed-weight bootstrap” claim.

… deviation

Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that R's default synthdid::vcov(method="bootstrap") rebinds attr(estimate, "opts"), which includes update.omega=TRUE from the original fit, back into synthdid_estimate inside its do.call, so the renormalized ω is used only as Frank-Wolfe initialization and ω and λ are re-estimated per draw. R's default bootstrap is refit, not fixed-weight. The sum_normalize helper in R's source explicitly comments that the supplied weights "are used only for initialization" in bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no FW re-run). It is therefore a deliberate deviation from R's default. Our PR #349 fixture generator at benchmarks/R/... is a manual fixed-weight invocation: it omits the opts rebind, which defaults update.omega to FALSE given non-null weights. The 1e-10 parity test anchors our fixed-weight path to that manual R invocation, not to R's real vcov behavior.

Documentation-only fix across all claim sites; no methodology or code behavior changes:
- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as "Alternative: Bootstrap at unit level (fixed-weight shortcut)"; add explicit **Note (deviation from R)** citing the vcov.R / synthdid.R opts-rebind mechanism; call out bootstrap_refit as matching R's default vcov. Requirements checklist entries and R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se method docstring drop the "matching R" framing on the fixed-weight path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc fixed (I introduced the "R-compatible fixed-weight shortcut" misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that bootstrap_refit matches R's default and the existing fixed-weight bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment calls out the non-default invocation shape (no opts rebind → runs fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r docstring: rewritten to scope the parity check correctly (manual R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor (Julia Synthdid.jl or R via the real vcov path) to make the missing anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical output change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T17:04:53Z

/ai-review

github-actions · 2026-04-22T17:11:50Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3fa6d8d91ca8a4472d953b775af5f280a8e225d8

Overall Assessment
Looks good.

Executive Summary

The previous P1 around SDID bootstrap/R semantics looks materially resolved: fixed variance_method="bootstrap" is now documented as a deliberate deviation from R default, and the paper-faithful refit path is split out cleanly.
I did not find a new P0/P1 in the changed estimator path. The refit branch re-estimates both weight sets per draw, unsupported survey composition is rejected up front, and inference still routes through the existing NaN-safe path.
The new deferred items are properly tracked in TODO.md, so they are informational rather than blocking.
Residual issues are P3-only: a small amount of wording still overstates exact R-default parity or uses the old “R-compatible shortcut” framing.
I could not execute pytest here because the sandbox does not have it installed.

Methodology
Previous P1 about fixed bootstrap being mislabeled as R-default-faithful appears addressed.

Severity P3. Impact: the public wording still slightly over-claims exact R-default parity for bootstrap_refit. The docs say or imply that the new path matches R’s default bootstrap, but the implementation re-runs the library’s standard fresh-start two-pass Frank-Wolfe helpers rather than seeding unit-weight optimization with the renormalized original ω described in the registry’s own R-trace note. That is a numerical implementation choice, not a paper-methodology defect, so this is informational only, especially because the missing direct refit parity anchor is already tracked. Concrete fix: either weaken the wording to “same refit methodology as R default” or thread the R-style initialization into the refit branch and close the tracked parity follow-up. Refs: docs/methodology/REGISTRY.md:L1505, diff_diff/synthetic_did.py:L61, diff_diff/synthetic_did.py:L1098, diff_diff/utils.py:L1573, TODO.md:L104
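The fresh-start versus warm-start distinction raised here can be illustrated with a generic Frank-Wolfe solver on the simplex. This is a simplified sketch, not the library's `compute_sdid_unit_weights` or R's solver: it omits the intercept and the two-pass sparsification, the objective ||A w - b||² + eta·||w||² is an assumed stand-in, and `frank_wolfe_simplex` with its `init` parameter is a hypothetical name.

```python
import numpy as np


def frank_wolfe_simplex(A, b, eta=0.0, init=None, max_iter=1000, tol=1e-10):
    """Minimize ||A w - b||^2 + eta * ||w||^2 over the probability simplex.

    init=None starts from uniform weights (fresh start); passing a weight
    vector starts from its renormalized value (warm start).
    """
    n = A.shape[1]
    if init is None:
        w = np.full(n, 1.0 / n)
    else:
        w = np.asarray(init, dtype=float)
        w = w / w.sum()  # renormalize so the start lies on the simplex
    for iteration in range(max_iter):
        grad = 2.0 * (A.T @ (A @ w - b)) + 2.0 * eta * w
        i = int(np.argmin(grad))      # best simplex vertex e_i
        d = -w.copy()
        d[i] += 1.0                   # direction e_i - w
        gap = float(-(grad @ d))      # Frank-Wolfe gap, non-negative
        if gap <= tol:
            return w, iteration
        Ad = A @ d
        curvature = 2.0 * float(Ad @ Ad + eta * (d @ d))
        step = 1.0 if curvature <= 0.0 else min(1.0, gap / curvature)  # exact line search
        w = w + step * d
    return w, max_iter


def objective(A, b, eta, w):
    return float(np.sum((A @ w - b) ** 2) + eta * np.sum(w ** 2))


rng = np.random.default_rng(0)
A = rng.normal(size=(12, 5))
b = rng.normal(size=12)
w_fresh, _ = frank_wolfe_simplex(A, b, eta=0.1)               # fresh start
w_warm, _ = frank_wolfe_simplex(A, b, eta=0.1, init=w_fresh)  # warm start
```

Because the objective is convex, both starts target the same minimizer; with a finite iteration budget they can still stop at slightly different points, which is why exact numerical parity with a warm-started reference is a separate question from methodological agreement.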

Code Quality
No findings.

Performance
No findings.

Maintainability

Severity P3. Impact: a few internal comments and test docstrings still describe fixed-weight bootstrap as an “R-compatible” or “R’s vcov shortcut,” which now conflicts with the corrected registry note and can mislead future reviewers. Concrete fix: rename this wording to “fixed-weight shortcut” everywhere and reserve R-default wording for bootstrap_refit. Refs: diff_diff/synthetic_did.py:L601, diff_diff/synthetic_did.py:L1135, tests/test_methodology_sdid.py:L549

Tech Debt

Severity P3. Impact: the two real follow-ups introduced by this PR are properly documented and tracked, so they do not block approval: survey-design composition for refit and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the follow-ups open. Refs: TODO.md:L103, TODO.md:L104

Security
No findings.

Documentation/Tests
No unmitigated findings. Verification note: I could not run the new tests in this environment because pytest is not installed.

variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021 Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap") behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so the renormalized ω serves only as Frank-Wolfe initialization). The previously-shipped fixed-weight shortcut is removed entirely; the "bootstrap_refit" enum value briefly added in earlier commits of this PR is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight "bootstrap" matched neither the paper (which prescribes refit) nor R's default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored fixed-weight Python against a manual R invocation that omitted the opts rebind; both sides were wrong in the same direction. Coverage MC at benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the new "bootstrap" tracks placebo near-nominal across the three representative DGPs; the old fixed-weight column over-rejected at α=0.05 at rates 0.16 / 0.098 / 0.092 (1.8–3.2× nominal).

Capability regression: SDID + survey designs (pweight-only AND strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight bootstrap was the only SDID variance method that supported strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside _bootstrap_se). Pweight-only users can switch to variance_method="placebo" or "jackknife"; strata/PSU/FPC users have no SDID variance option on this release. Rao-Wu rescaled weights composed with paper-faithful Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's "Note (deferred survey + bootstrap composition)" and TODO.md. The deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via `git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite _bootstrap_se body.

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum shipped in earlier commits of this PR rippled through results.py:960 summary gating, business_report.py:602 inference-label allow-list, power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults field docstrings. All of those are now back to a 3-value surface ("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey, raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture (tests/data/sdid_bootstrap_indices_r.json) and generator (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R); they anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the refit values (4.6033, 0.21424970..., 2.10890881e-102, 200), bit-identical to the captured "bootstrap_refit" baseline since the new bootstrap path is the same code as the old refit path. Tolerance tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap"; test_set_params_accepts_bootstrap_refit deleted (redundant with constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed to drop "refit" suffix; assertion checks for "bootstrap variance".

The benchmarks/data/sdid_coverage.json artifact is updated transitionally in this commit (fixed-weight column dropped; refit column renamed to bootstrap) so the schema test stays green; a follow-up commit regenerates from a fresh 500-seed MC re-run with the new code path. The REGISTRY coverage table cells are TBD pending that re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Doc-only follow-up to the previous commit's bootstrap rewrite. Updates every user-facing surface that referenced the (now-removed) fixed-weight bootstrap or the additive bootstrap_refit option:
- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the SDID survey-support row (no SDID variance method supports strata/PSU/FPC anymore); rewrites the misdirecting note steering users to bootstrap for full survey designs; updates the inference summary table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to reflect the regression matrix (pweight-only works with placebo / jackknife; strata/PSU/FPC has no SDID variance option in this release; bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's timing expectation note (bootstrap is now ~10-100x slower per fit than the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19 (inference methods description) and 29 (summary); bootstrap is now paper-faithful refit matching R's default vcov, not the prior fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the bootstrap-vs-placebo description (cell t18-cell-028); softens the stakeholder narrative claim "the two methods agree" to acknowledge that on small panels with non-exchangeable factor structure the SE magnitudes can differ while both methods still agree on significance and CI direction (cell t18-cell-033); re-executes the comparison cell so the output reflects the new bootstrap SE = 4.50 (was 4.26 under fixed-weight). The drift-guard asserts at cell t18-cell-026 only pin ATT / conf_int / pre-fit RMSE, none of which change, so no guard updates needed.
- diff_diff/synthetic_did.py: fit() docstring's survey_design parameter description is now consistent with the actual guards (no bootstrap_refit references; explicit pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: --help text drops the bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that bootstrap matches R's bootstrap_sample is replaced with an honest description of the corrected refit semantics, plus a parenthetical historical note about the prior fixed-weight shortcut.

All 351 targeted tests pass; no methodology or numerical change in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife"). Total wall-clock ~40 min on M-series Mac, Rust backend.

Numbers match the transitional rename in commit 1 exactly (rej@0.05: balanced=0.078, unbalanced=0.038, aer63=0.040). Bit-identity is expected since the new "bootstrap" path is the same refit code as the previous "bootstrap_refit" path, and the MC uses identical seeds (range(500)). Confirms the rewrite didn't introduce numerical drift.

Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040) versus nominal 0.05. Under a binomial approximation the 2σ MC band at 500 seeds is roughly 0.05 ± 0.019, so unbalanced and AER §6.3 sit inside it while balanced (0.078) sits somewhat above it.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.

REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_ placeholders with the transcribed numbers. The narrative under the table (`Reading:` paragraph) was already rewritten in commit 1 for the 3-method surface; no further edit needed.

The transitional commit-1 artifact (where I renamed bootstrap_refit → bootstrap in-place without running the MC) carried a `transitional_note` field in the metadata; the fresh run naturally drops that field since it's written from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
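For readers who want to check the Monte Carlo band themselves, one way to compute it is shown below. The sketch assumes rejections are independent Bernoulli draws at the nominal rate, so the standard error of a rejection rate over 500 seeds is sqrt(α(1-α)/n); this binomial approximation is an assumption of the sketch, not something stated in the artifact.

```python
import math

N_SEEDS = 500
ALPHA = 0.05

# Binomial standard error of a rejection rate estimated from N_SEEDS replications.
sigma = math.sqrt(ALPHA * (1.0 - ALPHA) / N_SEEDS)
band = (ALPHA - 2.0 * sigma, ALPHA + 2.0 * sigma)

# Bootstrap (refit) rejection rates at alpha = 0.05, transcribed from the commit message.
rates = {"balanced": 0.078, "unbalanced": 0.038, "aer63": 0.040}

# Distance from nominal in units of sigma, and whether each rate lies inside the band.
z_scores = {dgp: (rate - ALPHA) / sigma for dgp, rate in rates.items()}
inside = {dgp: band[0] <= rate <= band[1] for dgp, rate in rates.items()}

for dgp in rates:
    print(f"{dgp}: rate={rates[dgp]:.3f} z={z_scores[dgp]:+.2f} inside_2sigma={inside[dgp]}")
```

The same calculation applies to the placebo and jackknife columns by swapping in their rates.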

igerber · 2026-04-22T20:49:48Z

/ai-review

github-actions · 2026-04-22T20:55:14Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 15481d9a859f1da8fa6f935d3ea5b3d6c8edbca3

Overall Assessment

✅ Looks good

Executive Summary

The previous substantive methodology concern is resolved: variance_method="bootstrap" now re-estimates both ω and λ per draw, keeps the documented retry-to-B contract, and still routes inference through safe_inference().
I did not find a new unmitigated P0/P1 in the changed estimator path.
The survey-support regression is explicit in code and documentation and is properly tracked in TODO.md, so it is informational rather than blocking.
Remaining issues are P3-only: one stale replicate-weight fallback message, and wording that still overstates verified parity with R’s default bootstrap.
I could not run pytest here because pytest is not installed in this environment.

Methodology

Severity P3. Impact: the refit bootstrap implementation fresh-starts each Frank-Wolfe solve via compute_sdid_unit_weights(...) / compute_time_weights(...), but the changed docs still state that the path matches R’s default vcov(method="bootstrap"), whose own rationale in the registry depends on the renormalized ω being threaded in as initialization. That is an implementation-choice / verification issue, not a demonstrated numerical defect, and it is already tracked in TODO.md, so it should not block approval. Concrete fix: either soften the wording to “paper-faithful refit bootstrap, intended to match R default” or add the TODO-tracked cross-language parity harness. Refs: diff_diff/synthetic_did.py:L924-L935, docs/methodology/REGISTRY.md:L1506-L1508, METHODOLOGY_REVIEW.md:L504-L508, CHANGELOG.md:L15-L18, TODO.md:L104-L104
No other findings. On the changed estimator path, the code now matches the updated registry on the load-bearing methodology points: it re-estimates both weight sets per bootstrap draw, retries degenerate/non-finite draws until n_bootstrap valid replicates or the bounded budget, preserves the sqrt((r-1)/r) * sd(ddof=1) SE formula, and routes bootstrap/jackknife inference through safe_inference() rather than inline t/p/CI code. Refs: diff_diff/synthetic_did.py:L583-L650, diff_diff/synthetic_did.py:L823-L1024, docs/methodology/REGISTRY.md:L1497-L1552
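The SE aggregation named above, sqrt((r-1)/r) * sd(ddof=1), can be sketched with the standard library alone. The function name `bootstrap_se` and the NaN return for fewer than two replicates are assumptions for illustration, not the library's code.

```python
import math
import statistics


def bootstrap_se(estimates):
    """sqrt((r-1)/r) * sd(ddof=1) over r valid bootstrap replicates."""
    r = len(estimates)
    if r < 2:
        # Assumed convention: the SE is undefined with fewer than two replicates.
        return float("nan")
    # statistics.stdev uses the ddof=1 (sample) denominator.
    return math.sqrt((r - 1) / r) * statistics.stdev(estimates)


draws = [1.2, 0.8, 1.5, 1.1, 0.9]  # toy replicate estimates
se = bootstrap_se(draws)
print(se)
```

Algebraically the rescaling turns the ddof=1 standard deviation into the ddof=0 one, so the result equals `statistics.pstdev(draws)`.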

Code Quality

No findings.

Performance

No findings. The documented slowdown is the expected cost of correcting the methodology from fixed-weight bootstrap to per-draw refitting, not an accidental regression. Refs: CHANGELOG.md:L15-L21

Maintainability

Severity P3. Impact: the replicate-weight guard still tells users to “Use a TSL-based survey design (strata/psu/fpc),” but this PR simultaneously removes SDID support for strata/PSU/FPC on all variance methods. Users hitting that path are now sent to an unsupported alternative. Concrete fix: update the exception text to reflect the new contract, e.g. “replicate weights are not supported; pweight-only works with placebo / jackknife, and full survey designs have no SDID variance path in this release.” Refs: diff_diff/synthetic_did.py:L286-L290, diff_diff/synthetic_did.py:L299-L330, docs/methodology/REGISTRY.md:L1549-L1550

Tech Debt

Severity P3. Impact: the two real follow-ups introduced by this change are properly tracked, so they do not block approval: SDID+survey bootstrap composition and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the TODOs open. Refs: TODO.md:L103-L104

Security

No findings.

Documentation/Tests

No unmitigated findings in the changed docs/tests. The added tests cover the main regression surfaces this PR touched: survey rejection, setter validation, summary labeling, business-report labeling, and the committed coverage-artifact schema. Refs: tests/test_business_report.py:L4659-L4700, tests/test_methodology_sdid.py:L501-L638, tests/test_methodology_sdid.py:L1199-L1237, tests/test_methodology_sdid.py:L2848-L2903
Verification note: I could not run the tests here because pytest is not installed in this environment.

CI review on commit 15481d9 flagged the docs as overclaiming parity with R's default synthdid::vcov(method="bootstrap"): R warm-starts Frank-Wolfe from the renormalized fit-time ω per draw (and keeps fit-time λ as FW init for the λ re-estimation), while our Python port was cold-starting from uniform. On the strictly-convex FW objective with simplex constraint, warm- and cold-start converge to the same global minimum given enough iterations, but the 100-iter pre-sparsify pass may not fully converge on some draws, and then sparsification is path-dependent on the init.

Port the warm-start shape:

- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the first pass. When None (default), preserves the Rust top-level fast-path unchanged. When provided, falls through to the Python two-pass dispatcher; inner FW calls still dispatch to Rust via _sc_weight_fw, so the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread warm-start per draw. boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) (same shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])). boot_lambda_init = time_weights (fit-time λ unchanged, matching R's weights.boot$lambda = weights$lambda).

Baseline drift is small (about 1e-4 relative): TestScaleEquivariance._BASELINE["bootstrap"] SE shifts from 0.21424970247101688 (cold-start) to 0.21427381053829253 (warm-start), a handful of the 200 bootstrap draws converging to a different sparsification pattern than the cold-start pass. rel=1e-14 bit-identity on the warm-start capture holds.

Coverage MC: NOT re-run. User course-correction ("not sure we need to run the full MC with every pass here"). Math argument: strictly-convex FW → unique global minimum; aggregated rejection rates shift well under 2σ MC noise (≈0.02–0.05 per cell at B=200 × 500 seeds). The committed benchmarks/data/sdid_coverage.json (generated under cold-start in commit 15481d9) stays accurate for the warm-start path within MC noise.

REGISTRY.md coverage-MC Note now documents the artifact regeneration cadence explicitly: regenerate on methodology changes that materially shift per-draw numerics, not on refactors that preserve the FW global optimum (warm-start, backend migration, renames, docstring fixes). Per-seed bit-identity on the captured fixture is the cheaper, stricter parity check.

Other changes in this commit:

- diff_diff/synthetic_did.py replicate-weight guard message rewritten (R2 P3 maintainability): the old message told users to switch to a TSL-based survey design with strata/psu/fpc, which is no longer supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; bootstrap SE displayed in the comparison cell is now 4.44 (warm-start) vs 4.50 (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape and drop the "cross-language bit parity" implication from earlier wording.
- REGISTRY.md bootstrap section describes the warm-start mechanism including the FW strict-convexity argument for why warm and cold converge to the same answer but warm matches R's procedure at the algorithm level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
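The per-draw warm-start initialization described in this commit can be illustrated with plain lists. This is a sketch only: `sum_normalize` here is a stand-in for the private `_sum_normalize` helper, the uniform fallback for an all-zero selection is an assumption, and the weights and indices are toy values.

```python
def sum_normalize(weights):
    """Rescale non-negative weights to sum to one."""
    total = sum(weights)
    if total != 0:
        return [w / total for w in weights]
    # Assumed fallback: uniform weights when the resampled units carry no mass.
    return [1.0 / len(weights)] * len(weights)


unit_weights = [0.5, 0.0, 0.3, 0.2]         # toy fit-time omega over 4 control units
boot_control_idx = sorted([3, 0, 1, 0])     # one bootstrap draw of control indices, with replacement
boot_omega_init = sum_normalize([unit_weights[i] for i in boot_control_idx])
print(boot_omega_init)
```

The resulting vector only seeds the first Frank-Wolfe pass on that draw; the weights are still re-estimated from it rather than held fixed.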

igerber · 2026-04-22T21:28:40Z

/ai-review

github-actions · 2026-04-22T21:36:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 105cf73a28c6e8aae698945be5cea1020fb79e5b

Overall Assessment

⚠️ Needs changes

Executive Summary

The refit SDID bootstrap implementation itself looks methodologically aligned with the updated registry: it re-estimates both ω and λ per draw, keeps the sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference().
The prior user-facing survey-message problem appears resolved: the current guard text no longer points users toward unsupported full-design SDID survey paths.
P1 [Newly identified]: the survey-bootstrap removal was not propagated to the legacy SDID survey tests. The new guards in diff_diff/synthetic_did.py:L311-L337 now reject exactly the combinations that tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322 still assert should succeed.
P2: several changed documentation surfaces still describe the old SDID bootstrap contract (fixed weights / Rao-Wu), so the project’s own methodology cross-references are internally inconsistent.
The deferred survey+refit composition work and the cross-language parity anchor are both properly tracked in TODO.md:L103-L104, so they are informational rather than blocking.
I could not run pytest here because this environment is missing runtime test dependencies (pytest, numpy).

Methodology

No unmitigated findings. The changed estimator path in diff_diff/synthetic_did.py:L590-L657 and diff_diff/synthetic_did.py:L830-L1052 matches the updated SDID registry at docs/methodology/REGISTRY.md:L1497-L1552 on the load-bearing points: per-draw refitting of ω̂_b/λ̂_b, retry-to-B semantics, unchanged sqrt((r-1)/r) SE formula, and NaN-safe analytical bootstrap/jackknife inference via diff_diff/utils.py:L174-L208. The survey-bootstrap capability regression is explicitly documented in the registry and tracked in TODO.md, so it is informational rather than a methodology defect.
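The retry-to-B semantics mentioned above (keep drawing until B valid replicates exist or a bounded budget is spent) might look like the following sketch. The helper names, the toy panel, the placeholder statistic, and the budget of 2000 attempts are all assumptions for illustration.

```python
import math
import random


def collect_valid_draws(draw_fn, n_bootstrap, max_attempts, rng):
    """Call draw_fn until n_bootstrap finite values are collected or the budget runs out."""
    draws, attempts = [], 0
    while len(draws) < n_bootstrap and attempts < max_attempts:
        attempts += 1
        value = draw_fn(rng)
        if value is None or not math.isfinite(value):
            continue  # degenerate or non-finite draw: retry
        draws.append(value)
    return draws, attempts


treated = [False] * 6 + [True] * 2  # toy panel: 6 control units, 2 treated units


def toy_draw(rng):
    """Resample units with replacement; None marks a draw lacking treated or control units."""
    idx = [rng.randrange(len(treated)) for _ in treated]
    n_treated = sum(treated[i] for i in idx)
    if n_treated == 0 or n_treated == len(idx):
        return None
    return n_treated / len(idx)  # placeholder standing in for the refit estimate


draws, attempts = collect_valid_draws(toy_draw, n_bootstrap=200, max_attempts=2000, rng=random.Random(0))
print(len(draws), attempts)
```

If the budget is exhausted first, the caller receives fewer than `n_bootstrap` draws and has to decide how to report that.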

Code Quality

No findings.

Performance

No findings. The slowdown is the expected cost of replacing the old fixed-weight shortcut with per-draw Frank-Wolfe refitting, and that tradeoff is documented.

Maintainability

Severity P3. _bootstrap_se()’s docstring says unit_weights / time_weights “are not used inside the loop,” but the implementation immediately uses both as warm-start initializations. Impact: this gives maintainers the wrong mental model of what is fixed versus merely reused as initialization. Concrete fix: rewrite diff_diff/synthetic_did.py:L858-L862 to say the original weights are not reused as fixed estimator weights, only as Frank-Wolfe warm starts, matching diff_diff/synthetic_did.py:L873-L877 and diff_diff/synthetic_did.py:L925-L962.

Tech Debt

Severity P3 informational. The two real follow-ups introduced here are explicitly tracked: survey+refit composition and cross-language refit parity in TODO.md:L103-L104. Impact: none for approval. Concrete fix: none in this PR; keep those TODOs open.

Security

No findings.

Documentation/Tests

Severity P1 [Newly identified]. The new survey guards in diff_diff/synthetic_did.py:L311-L337 now make bootstrap + pweight-only survey and bootstrap + full survey design raise NotImplementedError, but the legacy SDID survey tests still assert the old success contract in tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322. Impact: the branch leaves the SDID survey contract internally contradictory and will keep the test suite red once dependencies are present. Concrete fix: rewrite those tests to assert the new NotImplementedError behavior, and replace the old Rao-Wu comparison with positive pweight-only placebo / jackknife cases.
Severity P2. Several changed documentation surfaces still state the old SDID bootstrap story: docs/methodology/REGISTRY.md:L2699-L2699 still says Unit-level bootstrap (fixed weights), docs/methodology/REGISTRY.md:L2889-L2895 still lists SyntheticDiD under Rao-Wu survey bootstrap, docs/survey-roadmap.md:L52-L54 still lists SyntheticDiD among the Rao-Wu survey-bootstrap estimators, and diff_diff/guides/llms-full.txt:L1674-L1676 still says survey-aware bootstrap includes SyntheticDiD via Rao-Wu. Impact: reviewers and downstream doc consumers get conflicting methodology guidance from the project’s own source material. Concrete fix: update those cross-reference summaries to say SDID bootstrap is the paper-faithful refit path for non-survey fits only, and that SDID has no survey-bootstrap support in this release.

Path to Approval

Update the stale SDID survey tests in tests/test_survey_phase5.py so they match the new public contract: variance_method="bootstrap" with any survey_design should raise NotImplementedError, while pweight-only positive coverage should move to placebo / jackknife.
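The contract those tests need to assert can be summarized in a small hypothetical guard. The function name, its boolean arguments, and the message strings are assumptions made for this sketch; the real guards live in diff_diff/synthetic_did.py and may be structured differently.

```python
def check_sdid_survey_support(variance_method, has_weights, has_design):
    """Hypothetical guard mirroring the survey contract described in this review.

    has_weights: the survey design supplies pweights.
    has_design: the survey design supplies strata, PSU, or FPC.
    """
    if not (has_weights or has_design):
        return  # non-survey fit: every variance method is allowed
    if variance_method == "bootstrap":
        raise NotImplementedError("bootstrap does not support survey designs in this release")
    if has_design:
        raise NotImplementedError("strata/PSU/FPC designs have no SDID variance path in this release")
    # pweight-only with placebo or jackknife falls through and is allowed


def raises_not_implemented(**kwargs):
    """Return True when the guard rejects the given combination."""
    try:
        check_sdid_survey_support(**kwargs)
    except NotImplementedError:
        return True
    return False


print(raises_not_implemented(variance_method="bootstrap", has_weights=True, has_design=False))
```

The negative cases (bootstrap with any survey design, full design with any method) and the positive cases (pweight-only with placebo or jackknife) each map to one assertion.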

Addresses CI review R3 findings on PR #351:

P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py that still asserted the old fixed-weight + Rao-Wu success contract so they match the new upstream guards (bootstrap + any survey design and full strata/PSU/FPC with any variance method now raise NotImplementedError). test_full_design_bootstrap_smoke -> test_full_design_bootstrap_raises; test_full_design_se_differs_from_weights_only is replaced by a positive placebo pweight-only smoke test since the old Rao-Wu vs pweight-only SE comparison no longer has two valid paths to compare; test_bootstrap_with_survey -> test_bootstrap_with_pweight_only_raises plus a new test_jackknife_with_pweight_only for positive coverage. Also align the pre-existing placebo/jackknife guard tests with the updated guard wording (same stale-regex class of issue the P1 flagged).

P2: remove SyntheticDiD from cross-reference Rao-Wu Rescaled Bootstrap sections that survived the bootstrap rewrite. REGISTRY comparison table row now describes the alternative as paper-faithful refit bootstrap (not fixed weights); the Rao-Wu section lists only SunAbraham and TROP with an explicit Note explaining why SDID is excluded (refit + Rao-Wu composition requires weighted-FW derivation, tracked in TODO.md and sketched in REGISTRY SDID survey-regression Note). survey-roadmap.md Phase 6 bullet and llms-full.txt survey-aware bootstrap bullet updated with the same framing.

P3: rewrite _bootstrap_se docstring (diff_diff/synthetic_did.py) to say the fit-time unit_weights and time_weights are Frank-Wolfe warm-start initializations on each draw, not unused parameters retained for signature parity. The in-loop narrative comment at the warm-start site was already correct; the top-of-function docstring was stale from the pre-warm-start commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T21:43:19Z

/ai-review

github-actions · 2026-04-22T21:49:28Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f49a19f8cdb71d7cac7da51f37c5738c58b21543

Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings remain in the changed PR surface. The core SDID refit-bootstrap path is aligned with the updated methodology registry, and the prior re-review blocker around stale SDID survey tests appears resolved.

Executive Summary

The changed SyntheticDiD bootstrap path is methodologically aligned with the updated registry: it re-estimates ω̂_b and λ̂_b per draw, preserves sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference() for all-or-nothing NaN gating. See diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1057, diff_diff/utils.py:L174-L206, docs/methodology/REGISTRY.md:L1497-L1551.
The previous P1 around SDID survey tests is resolved. The changed tests now assert the new NotImplementedError contract for bootstrap and full-design survey paths. See tests/test_survey_phase5.py:L179-L216.
The cross-surface result/reporting surfaces look consistent in the current tree: bootstrap replications still render in summary(), and BusinessReport still labels SDID alpha overrides as bootstrap variance instead of falling through to analytical wording. See diff_diff/results.py:L963-L966, diff_diff/business_report.py:L593-L604, tests/test_business_report.py:L4658-L4702.
Remaining issues are minor P3 cleanup only: one deferred-work breadcrumb still contains placeholder commit IDs, one new survey test no longer checks the ATT-equivalence contract stated in its name/docstring, and one new coverage-artifact docstring still says "4 methods" although the artifact/test now use 3.
I could not run pytest here because this environment is missing numpy; verification was limited to static inspection plus compile() parsing of the changed Python files.

Methodology

No findings. The estimator and inference changes in diff_diff/synthetic_did.py:L590-L657 and diff_diff/synthetic_did.py:L830-L1057 match the updated SDID registry in docs/methodology/REGISTRY.md:L1497-L1551, including the documented survey-support regression and the unchanged analytical bootstrap/jackknife inference contract via diff_diff/utils.py:L174-L206.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the deferred survey-bootstrap follow-up is properly tracked, but the newly added breadcrumbs are not directly actionable because they still contain placeholder SHAs (<sdid-bootstrap-refit-removal-sha> and <removal-commit>). Concrete fix: replace those placeholders with the actual commit hash, or rewrite the note to point to a stable PR number / git log --grep lookup instead. See TODO.md:L103-L104, docs/methodology/REGISTRY.md:L1550-L1550.

Security

No findings.

Documentation/Tests

Severity P3. Impact: test_placebo_full_vs_pweight_only_att_match is named and documented as an ATT-equivalence regression check, but its body now only asserts that the pweight-only fit is finite. That means the claimed "stripping strata/PSU should not change ATT" contract is no longer actually tested. Concrete fix: either compare the pweight-only ATT against the same data with strata/PSU/FPC columns ignored/dropped, or rename the test/docstring to describe it as a smoke test. See tests/test_survey_phase5.py:L227-L252.
Severity P3. Impact: the new coverage-artifact test docstring still says the MC study covers "4 methods," while the artifact schema and assertions immediately below now enforce exactly 3 methods (placebo, bootstrap, jackknife). That leaves stale methodology wording in the new regression surface. Concrete fix: update the docstring to say "3 methods." See tests/test_methodology_sdid.py:L2865-L2907.

- Replace placeholder commit SHAs in TODO.md and REGISTRY.md deferred-work breadcrumbs with the actual fixed-weight-removal commit hash (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful refit"). The follow-up implementer can now directly run `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu scaffolding (rw_control/rw_treated split, degenerate-retry check, treated-mean weighting). Also cite PR #351 for stability against commit-hash churn.
- Rename tests/test_survey_phase5.py::test_placebo_full_vs_pweight_only_att_match to test_placebo_with_pweight_only_full_design_stripped_att_match and promote the body from a single-fit smoke to the ATT-equivalence check the docstring claims: fit with two equivalent pweight-only SurveyDesign constructions and assert ATT matches bit-for-bit (abs=1e-12). The intent is to confirm strata/psu/fpc columns sitting on the DataFrame are not tacitly read unless the SurveyDesign references them.
- Fix stale "3 DGPs × 4 methods" docstring in tests/test_methodology_sdid.py::TestCoverageMCArtifact to read "3 DGPs × 3 methods", matching the 3-method schema the test now enforces (placebo / bootstrap / jackknife; bootstrap_refit and the old fixed-weight 'bootstrap' enum are both gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T21:53:07Z

/ai-review

github-actions · 2026-04-22T22:01:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ec92d39b9fbf6d16f5e7642bf534b278a6d35a83

Overall Assessment

Looks good

No unmitigated P0 or P1 findings in the changed PR surface.

Executive Summary

The refit bootstrap implementation re-estimates omega_b and lambda_b on each draw, preserves the sqrt((r-1)/r) SE aggregation, and still routes bootstrap and jackknife inference through safe_inference(), consistent with the updated registry. diff_diff/synthetic_did.py#L590 diff_diff/synthetic_did.py#L842 docs/methodology/REGISTRY.md#L1497
The SDID survey-support regression is explicit rather than silent: full survey designs are rejected for all SDID variance methods, and any survey design is rejected for variance_method='bootstrap'. Because that behavior is documented in the registry and tracked in TODO, it is P3 informational only. diff_diff/synthetic_did.py#L306 docs/methodology/REGISTRY.md#L1549 TODO.md#L103
Cross-surface reporting still looks coherent: SDID summaries retain the bootstrap replications line, and BusinessReport's alpha-override path still labels SDID fits as bootstrap variance instead of falling through to the analytical label. diff_diff/results.py#L963 diff_diff/business_report.py#L602 tests/test_business_report.py#L4658
The sklearn-style set_params() validation gap appears fixed and rollback-covered. diff_diff/synthetic_did.py#L1447 tests/test_methodology_sdid.py#L1199
The prior minor re-review issues are resolved: the TODO breadcrumb now uses a real commit hash, and the coverage-artifact test now correctly describes a 3-method artifact. TODO.md#L103 tests/test_methodology_sdid.py#L2865
One minor test-quality issue remains: the new survey regression test still does not actually exercise the contract its docstring describes. tests/test_survey_phase5.py#L227
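For readers unfamiliar with the rollback behavior referenced for set_params(), the pattern can be sketched on a toy class. `ToyEstimator`, its `_PARAMS` tuple, and the `n_bootstrap >= 2` rule are hypothetical; only the shared-validator-plus-rollback shape is taken from the PR description.

```python
class ToyEstimator:
    """Hypothetical stand-in; only the transactional setter pattern is of interest."""

    _PARAMS = ("variance_method", "n_bootstrap")
    _VALID_METHODS = ("placebo", "bootstrap", "jackknife")

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        """Shared validation used by both the constructor and set_params."""
        if self.variance_method not in self._VALID_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.n_bootstrap < 2:  # assumed rule, for illustration only
            raise ValueError("n_bootstrap must be at least 2")

    def set_params(self, **params):
        unknown = [key for key in params if key not in self._PARAMS]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        snapshot = {key: getattr(self, key) for key in params}
        try:
            for key, value in params.items():
                setattr(self, key, value)
            self._validate_config()
        except Exception:
            for key, value in snapshot.items():
                setattr(self, key, value)  # roll back every touched attribute
            raise
        return self


est = ToyEstimator()
try:
    est.set_params(n_bootstrap=500, variance_method="bootstrap_refit")
except ValueError:
    pass
print(est.variance_method, est.n_bootstrap)
```

After the failed call the instance still holds its original configuration, which is the property a rollback test would check.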

Methodology

No unmitigated findings. The changed estimator and variance code is internally consistent with the methodology registry on Algorithm 2 refit bootstrap, analytical bootstrap/jackknife p-values, and NaN-safe inference. diff_diff/synthetic_did.py#L644 diff_diff/synthetic_did.py#L842 docs/methodology/REGISTRY.md#L1497
Severity P3. Impact: SDID survey bootstrap and full-design survey support regresses in this release, but the deviation is explicitly documented and tracked, so it is informational only under the review policy. Concrete fix: no approval blocker; the weighted-Frank-Wolfe survey composition follow-up is already specified in TODO.md#L103.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: external parity for the new refit-bootstrap semantics is still deferred; confidence currently rests on same-library regression tests plus the committed MC artifact rather than a live R/Julia anchor. Concrete fix: implement the cross-language parity harness already tracked in TODO.md#L104.

Security

No findings.

Documentation/Tests

Severity P3. Impact: test_placebo_with_pweight_only_full_design_stripped_att_match compares two semantically identical pweight-only SurveyDesigns, so it cannot catch the regression its docstring claims to cover. A future silent pickup of stratum or psu columns would still pass this test. Concrete fix: compare the pweight-only fit against the same dataset after dropping or renaming stratum and psu, or against a copy where those columns are absent. tests/test_survey_phase5.py#L227

Verification note: runtime execution was not possible here because this environment is missing numpy and pytest; review was limited to static inspection and AST parsing.

The previous rewrite compared two semantically identical pweight-only SurveyDesign constructions, which can't catch the contract the docstring claimed — if a future change silently picked up `stratum` or `psu` by name, both fits would pick them up identically and the test would still pass. Rewrite to compare a fit on the original DataFrame (with `stratum` / `psu` columns present) against a fit on the same data with those columns physically dropped. If the estimator ever silently reads those columns by naming convention the two fits would diverge and the abs=1e-12 ATT check would fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
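The shape of that comparison can be shown on a toy estimator. The dict-of-lists "frame" and the `weighted_mean` function below are hypothetical stand-ins for the DataFrame and the SDID fit; the column names follow the commit message.

```python
def weighted_mean(frame, value_col, weight_col):
    """Toy estimator that reads only the columns it is told to read."""
    values, weights = frame[value_col], frame[weight_col]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


frame = {
    "y": [1.0, 2.0, 4.0, 3.0],
    "pweight": [1.0, 2.0, 1.0, 2.0],
    "stratum": [0, 0, 1, 1],
    "psu": [10, 11, 12, 13],
}
# Physically drop the design columns rather than building an equivalent design object.
stripped = {key: col for key, col in frame.items() if key not in ("stratum", "psu")}

att_full = weighted_mean(frame, "y", "pweight")
att_stripped = weighted_mean(stripped, "y", "pweight")
assert abs(att_full - att_stripped) <= 1e-12
```

An estimator that looked up `stratum` or `psu` by name would either raise on the stripped frame or produce a different value, and the check would fail in both cases.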

igerber · 2026-04-22T22:19:52Z

/ai-review

github-actions · 2026-04-22T22:30:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 826b1a8e722960a0936da30785fbdf5f4ed8cf97

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The remaining issues are P2/P3 only. Runtime verification was static-only because this environment does not have numpy, pandas, or pytest.

Executive Summary

The core SDID estimator change looks methodologically coherent: the bootstrap path now refits both weight vectors per draw, preserves the existing SE aggregation, and keeps NaN-safe inference dispatch in the fitted results surface. diff_diff/synthetic_did.py:L590-L657 diff_diff/synthetic_did.py:L830-L1057
The survey-support regression is explicit rather than silent, and it is properly documented in the methodology registry and tracked in TODO.md, so it is informational under this rubric. diff_diff/synthetic_did.py:L283-L336 docs/methodology/REGISTRY.md:L1549-L1550 TODO.md:L103-L104
The prior informational survey-test issue from the last review appears resolved: the pweight-only regression now compares a DataFrame with stratum/psu columns against one where those columns are actually dropped. tests/test_survey_phase5.py:L227-L270
Two non-blockers remain: the new aggregate Frank-Wolfe warning path does not actually surface on the Rust-backed solver, and one retained null-calibration test still encodes the deleted fixed-weight bootstrap behavior. diff_diff/synthetic_did.py:L953-L977 diff_diff/synthetic_did.py:L1035-L1049 tests/test_methodology_sdid.py:L2547-L2589

Methodology

Severity P3. Impact: No unmitigated methodology defect. The affected method is SyntheticDiD bootstrap variance. The refit path in diff_diff/synthetic_did.py:L830-L1057 refits both ω_b and λ_b, keeps the sqrt((r-1)/r) aggregation, and still routes bootstrap/jackknife inference through safe_inference() in diff_diff/synthetic_did.py:L643-L657, matching the local registry at docs/methodology/REGISTRY.md:L1497-L1551. Upstream synthdid also documents bootstrap/jackknife/placebo as Algorithms 2/3/4, and its bootstrap source calls back into synthdid_estimate() while synthdid_estimate() defaults update.omega / update.lambda to refitting-from-initialization when weights are supplied. Concrete fix: none. (synth-inference.github.io)
Severity P3. Impact: The SDID survey-bootstrap/full-design regression is explicit and mitigated by documentation/tracking: any survey + variance_method="bootstrap" fit and any full-design SDID fit now raises NotImplementedError, and that deviation is both documented and tracked. Concrete fix: none for approval; the weighted-Frank-Wolfe survey-bootstrap follow-up is already queued. diff_diff/synthetic_did.py:L283-L336 docs/methodology/REGISTRY.md:L1549-L1550 TODO.md:L103-L104

Code Quality

Severity P2. Impact: The new aggregate Frank-Wolfe non-convergence warning is effectively disabled on Rust-backed execution. _bootstrap_se() only counts captured Python UserWarnings in diff_diff/synthetic_did.py:L953-L977, but the refit helpers still route both FW passes through _sc_weight_fw() in diff_diff/utils.py:L1499-L1523 and diff_diff/utils.py:L1597-L1615, and _sc_weight_fw() returns directly to the Rust solver whenever HAS_RUST_BACKEND is true in diff_diff/utils.py:L1338-L1353. The registry already notes that the Rust solver is silent on non-convergence, so the summary at diff_diff/synthetic_did.py:L1039-L1049 can stay at zero even when draws hit max_iter. Concrete fix: thread convergence status out of _rust_sc_weight_fw through the weight helpers, or route this warning path through _sc_weight_fw_numpy, and add a regression test for the Rust-backed case. docs/methodology/REGISTRY.md:L1528-L1528

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: External parity anchoring for the refit bootstrap remains deferred. The old fixed-weight R fixture is deleted, and the replacement Julia/R parity anchor is only tracked in TODO.md, so this is informational rather than blocking. Concrete fix: land the queued cross-language parity harness already listed in TODO.md:L104-L104.

Security

No findings.

Documentation/Tests

Severity P2. Impact: [Newly identified] test_bootstrap_p_value_null_calibration still documents and asserts the removed fixed-weight bootstrap behavior. Its docstring says the test is for the “fixed-weight regime” and its lower-bound assertion rejection_rate > 0.05 assumes systematic over-rejection, but the new registry and committed coverage artifact now describe near-nominal refit-bootstrap calibration. That leaves the test description wrong today and can make the assertion flaky or fail on correct code. Concrete fix: rewrite the test around its actual regression target, namely “not collapsed near 0 and not catastrophically large,” and update the docstring/comments to the refit-bootstrap semantics. tests/test_methodology_sdid.py:L2547-L2589 docs/methodology/REGISTRY.md:L1552-L1570
No other findings. The prior informational survey-test issue appears addressed in tests/test_survey_phase5.py:L227-L270.
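A two-sided bound of the kind suggested for the null-calibration test ("not collapsed near 0 and not catastrophically large") could be expressed as below. The helper names and the bounds 0.01 and 0.20 are illustrative assumptions, not values taken from the test suite.

```python
def rejection_rate(p_values, alpha=0.05):
    """Share of p-values strictly below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def check_null_calibration(p_values, alpha=0.05, lower=0.01, upper=0.20):
    """Return (ok, rate); ok is True when lower < rate < upper (illustrative bounds)."""
    rate = rejection_rate(p_values, alpha)
    return lower < rate < upper, rate


# Deterministic stand-in for well-calibrated null p-values: an even grid on (0, 1).
grid = [(i + 0.5) / 100 for i in range(100)]
ok, rate = check_null_calibration(grid)
print(ok, rate)
```

A collapsed rate (no rejections at all) and a runaway rate (every fit rejecting) both fail this check, while a near-nominal rate passes without assuming systematic over-rejection.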

Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence via ``warnings.catch_warnings``, but the Rust FW entry point is silent on ``max_iter`` exhaustion (only the pure-NumPy path called ``warn_if_not_converged``). On the default Rust backend the aggregate warning at the end of the bootstrap loop therefore never fired, even when draws did not converge: a silent failure.

Fix: thread an explicit convergence bool out of the Rust solver.

Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``)

- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- Existing ``sc_weight_fw`` pyfunction destructures and drops the bool, preserving its ABI for the rank-selection heuristic in ``prep.py`` and for any third-party consumer.
- New pyfunction ``sc_weight_fw_with_convergence`` returns the ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` / ``compute_sdid_unit_weights_internal`` destructure the inner calls and still return ``Array1<f64>`` (their pyfunctions discard convergence; Python callers that need it use the Python two-pass dispatcher).

Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``)

- Import the new Rust entry point as ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a ``return_convergence=False`` kwarg. Default path is unchanged; with the flag set, return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the same kwarg and propagate the AND of the two FW passes (pre-sparsify + main). When the flag is set, the Rust top-level fast-path is skipped (it is silent on non-convergence) in favor of the Python two-pass dispatcher; inner FW calls still dispatch to the Rust solver via ``sc_weight_fw_with_convergence``.

Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``)

- Replace the ``warnings.catch_warnings`` dance with an explicit ``return_convergence=True`` call on both helpers. ``fw_nonconvergence_count`` now increments on any draw where the AND of ``omega_converged`` and ``lambda_converged`` is false, and works under both Rust and pure-Python backends.

Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::test_bootstrap_fw_nonconvergence_warning_fires_under_rust``)

- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always report ``converged=False`` and asserts the aggregate FW non-convergence ``UserWarning`` fires on an otherwise-nominal SDID bootstrap fit. Skipped when the Rust backend is unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
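The idea of returning a convergence flag from the solver and AND-ing it across the two passes can be sketched in pure Python. This is a simplified illustration, not the library's solver: the objective omits the ridge penalty and intercept, the sparsify rule (zero out weights at or below a quarter of the maximum) is an assumption, and every function name is hypothetical.

```python
def frank_wolfe_simplex(A, b, init=None, max_iter=1000, min_decrease=1e-6):
    """Minimize ||A w - b||^2 over the simplex; return (weights, converged)."""
    n = len(A[0])
    w = list(init) if init is not None else [1.0 / n] * n

    def residual(v):
        return [sum(a * x for a, x in zip(row, v)) - t for row, t in zip(A, b)]

    res = residual(w)
    prev = sum(e * e for e in res)
    for _ in range(max_iter):
        grad = [2.0 * sum(A[t][j] * res[t] for t in range(len(A))) for j in range(n)]
        i = min(range(n), key=grad.__getitem__)              # best simplex vertex
        d = [(1.0 if j == i else 0.0) - w[j] for j in range(n)]
        Ad = [sum(a * x for a, x in zip(row, d)) for row in A]
        denom = 2.0 * sum(e * e for e in Ad)
        gd = sum(g * x for g, x in zip(grad, d))
        step = 0.0 if denom == 0.0 else min(1.0, max(0.0, -gd / denom))  # exact line search
        w = [wj + step * dj for wj, dj in zip(w, d)]
        res = residual(w)
        cur = sum(e * e for e in res)
        if prev - cur < min_decrease ** 2:
            return w, True                                   # min-decrease break
        prev = cur
    return w, False                                          # max_iter exhausted


def sparsify(w):
    """Zero out weights at or below max(w)/4, then renormalize (assumed rule)."""
    cutoff = max(w) / 4.0
    kept = [x if x > cutoff else 0.0 for x in w]
    total = sum(kept)
    return [x / total for x in kept]


def two_pass_weights(A, b, first_pass_iter=100, max_iter=10000):
    """Pre-sparsify pass, then main pass; converged only if both passes converged."""
    w1, ok1 = frank_wolfe_simplex(A, b, max_iter=first_pass_iter)
    w2, ok2 = frank_wolfe_simplex(A, b, init=sparsify(w1), max_iter=max_iter)
    return w2, ok1 and ok2


A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0]
_, converged_full = two_pass_weights(A, b)
_, converged_short = two_pass_weights(A, b, first_pass_iter=1)  # starve the first pass
fw_nonconvergence_count = sum(not ok for ok in (converged_full, converged_short))
print(fw_nonconvergence_count)
```

Because the flag is an ordinary return value, the caller can count non-converged draws the same way regardless of which backend ran the inner solve.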

igerber · 2026-04-22T22:46:50Z

/ai-review

github-actions · 2026-04-22T22:55:28Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c0d089bf9d6344c079fbba7f7a00717888d1f463

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. On this re-review, the earlier Rust non-convergence warning issue and the BR inference-label concern both look addressed. The remaining issues are P2/P3 only.

Executive Summary

The affected method is SyntheticDiD bootstrap variance. The implementation now refits both ω and λ per bootstrap draw, keeps the established sqrt((r-1)/r) × sd(...) SE aggregation, and still routes analytical inference through safe_inference(). I did not find an unmitigated methodology defect.
The previous Rust-warning concern appears resolved by the new convergence-return plumbing in diff_diff/utils.py:1350, diff_diff/synthetic_did.py:958, and rust/src/weights.rs:518.
The previous BR allow-list concern also appears resolved; unchanged diff_diff/business_report.py:602 already recognizes SDID variance_method values, and the new regression test at tests/test_business_report.py:4689 covers the alpha-override path.
One previous P2 remains unresolved: tests/test_methodology_sdid.py:2604 still documents and bounds bootstrap null calibration as if the deleted fixed-weight bootstrap were still the live behavior.
Static-only review here; I could not run pytest because the command is unavailable in this environment.

Methodology

Affected method: SyntheticDiD bootstrap variance. Cross-checking docs/methodology/REGISTRY.md:1497, diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and diff_diff/synthetic_did.py:644 against the cited SDID material and the current synthdid bootstrap flow did not reveal an unmitigated methodology defect.

Severity P3. Impact: The survey-bootstrap capability regression is explicitly documented and tracked in docs/methodology/REGISTRY.md:1549 and TODO.md:107, so under this rubric it is informational rather than blocking. Concrete fix: none for approval.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: The deleted cross-language parity fixture leaves an external parity anchor as follow-up work, but that follow-up is explicitly tracked in TODO.md:108, so it is not a blocker. Concrete fix: none for approval.

Security

No findings.

Documentation/Tests

Severity P2. Impact: [Previous finding unresolved] tests/test_methodology_sdid.py:2604 still describes bootstrap as the removed fixed-weight regime and hard-codes rejection_rate > 0.05 at α=0.05. The updated calibration table in docs/methodology/REGISTRY.md:1557 and docs/methodology/REGISTRY.md:1566 now says refit bootstrap is near nominal, so this test can fail on correct behavior or bias future changes toward anti-conservative output. Concrete fix: rewrite the docstring/comments to the refit-bootstrap contract and replace the lower bound with a calibration-agnostic regression guard, e.g. “not collapsed near 0 and not catastrophically high,” or a dispersion/quantile check that specifically catches the old dispatch bug.
Severity P3. Impact: [Newly identified] The Methodology Registry still says the Rust backend is silent on Frank-Wolfe non-convergence in docs/methodology/REGISTRY.md:1528, but this PR now threads explicit convergence status through Rust in rust/src/weights.rs:518, diff_diff/utils.py:1350, and emits the aggregate warning in diff_diff/synthetic_did.py:1041. Since REGISTRY.md is the methodology source of truth, that note is now stale. Concrete fix: update the edge-case bullet to reflect the new Rust convergence flag and the aggregated bootstrap warning behavior.

…W note

P2 (Documentation/Tests): ``test_bootstrap_p_value_null_calibration`` at ``tests/test_methodology_sdid.py:2604`` was still framed around the deleted fixed-weight regime and encoded an anti-conservative lower bound ``rejection_rate > 0.05`` that biased the test toward keeping over-rejection. Rename to ``test_bootstrap_p_value_null_dispersion`` and replace the directional calibration assertion with a calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10`` catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed under the old bug), and a loose band ``0.01 <= rejection_rate <= 0.40`` catches both SE-collapse and SE-explosion. Matches this session's commit c0d089b by removing the remaining lingering fixed-weight framing in the test suite. Tested locally: 100-seed refit bootstrap passes with comfortable margin.

P3 (newly identified): REGISTRY.md §Frank-Wolfe non-convergence edge case (line 1528) still said the Rust backend silently returns the final iterate and cited a follow-up. This session's commit c0d089b already landed the Rust-side convergence flag, Python ``return_convergence=True`` kwargs on the weight helpers, and the aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to describe the new behavior: bootstrap path now uses ``sc_weight_fw_with_convergence`` (Rust) / numpy ``return_convergence=True`` and aggregates a single summary warning above the 5% threshold; standalone callers that do not opt into convergence tracking retain the legacy silent-Rust / warning-numpy asymmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
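The calibration-agnostic guard described in this commit can be sketched as a small helper. The function name ``null_dispersion_guard`` and its signature are hypothetical; the thresholds (standard deviation above 0.10, rejection rate within 0.01 to 0.40 at α = 0.05) are the ones quoted in the commit message. ``statistics.pstdev`` is used because it matches ``np.std``'s default population formula.

```python
import statistics


def null_dispersion_guard(p_values, alpha=0.05, min_std=0.10, band=(0.01, 0.40)):
    """Hypothetical helper: True if null p-values look dispersed and sanely calibrated.

    - spread > min_std catches p-values clustering at one value (e.g. p ~ 0.5 everywhere)
    - the loose rejection-rate band catches SE collapse (too many rejections)
      and SE explosion (almost no rejections)
    """
    spread = statistics.pstdev(p_values)  # population SD, like np.std default
    rejection_rate = sum(p < alpha for p in p_values) / len(p_values)
    return spread > min_std and band[0] <= rejection_rate <= band[1]
```

The guard deliberately says nothing about whether the rejection rate sits above or below α, so it does not favour either over- or under-rejection.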

igerber · 2026-04-22T22:57:56Z

/ai-review

github-actions · 2026-04-22T23:09:30Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9280cddab051623d6a9b5d30c9eb006cecf81397

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The SDID refit-bootstrap implementation itself looks methodologically consistent with the cited SDID paper/official synthdid variance surface, and the previous re-review items around the null-calibration test and the Rust non-convergence note appear addressed. Static review only; I did not run pytest or the coverage Monte Carlo harness here.

Executive Summary

The affected method is SyntheticDiD variance estimation, specifically the refit variance_method="bootstrap" path; I did not find a silent correctness bug in the changed estimator/SE/inference flow.
The prior test-related concern looks resolved: the slow bootstrap null-calibration guard is now calibration-agnostic instead of baking fixed-weight over-rejection into the expected behavior.
The survey-bootstrap capability regression is explicitly documented in REGISTRY.md and tracked in TODO.md, so it is informational rather than blocking under this rubric.
Remaining findings are P3-only documentation/UX drift: incorrect R-default wording, one source-paper overstatement in the new coverage narrative, one stale registry test reference, and one outdated survey warning string.

Methodology

Affected method: SDID variance estimation. The changed bootstrap path in diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059, diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558 re-estimates both ω and λ on each bootstrap draw, keeps the established sqrt((r-1)/r) * sd(...) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference(). That is consistent with the SDID source reference and the official synthdid variance implementation, which documents bootstrap/jackknife/placebo as Algorithms 2/3/4 and implements bootstrap by resampling all units, renormalizing omega, and re-entering synthdid_estimate with stored opts. (aeaweb.org)
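As a worked illustration of the sqrt((r-1)/r) * sd(...) aggregation mentioned above, the sketch below assumes that sd is the sample standard deviation (n-1 denominator, as in R's ``sd``); under that assumption the scaled quantity equals the population standard deviation of the bootstrap estimates. The helper name ``bootstrap_se`` is hypothetical and is not taken from the PR.

```python
import math
import statistics


def bootstrap_se(estimates):
    """Hypothetical sketch: SE = sqrt((r - 1) / r) * sd(estimates).

    Assumes sd is the sample SD (n - 1 denominator), so the result
    coincides with the population SD of the r bootstrap estimates.
    """
    r = len(estimates)
    if r < 2:
        raise ValueError("need at least two bootstrap estimates")
    return math.sqrt((r - 1) / r) * statistics.stdev(estimates)
```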

Severity P3. Impact: diff_diff/synthetic_did.py:L50-L61 and METHODOLOGY_REVIEW.md:L517-L519 now make the variance defaults internally inconsistent: the placebo bullet/review note still say placebo is “R’s default,” while the same PR also describes bootstrap as matching “R’s default,” and upstream synthdid docs/source show vcov() defaults to bootstrap. This does not change runtime behavior, but it misstates the reference implementation and obscures that the library’s placebo default is a divergence from R rather than a match. Concrete fix: change those lines to say placebo is the library default, or add a REGISTRY.md Note (deviation from R) if that divergence is intentional. (synth-inference.github.io)
Severity P3. Impact: docs/methodology/REGISTRY.md:L1566-L1570 says the PR’s smaller-panel jackknife over-rejection is “in line” with Arkhangelsky et al. §6.3’s “98% / 93% coverage pattern.” The cited paper preview presents mixed evidence instead: the iid design is slightly conservative at 98% coverage, while the dependent-error design is 93% coverage. The current sentence overstates what the cited source supports. Concrete fix: rephrase this as mixed jackknife calibration in the paper, rather than uniformly anti-conservative support. (aeaweb.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity P3. Impact: diff_diff/synthetic_did.py:L1128-L1133 and diff_diff/synthetic_did.py:L1220-L1224 still tell users to “Consider using variance_method='bootstrap'” from placebo failure paths. After this PR, bootstrap rejects every survey design, so that guidance is now wrong on the newly restricted survey-weighted path. Concrete fix: branch those warning strings on survey presence and suggest jackknife / more controls for survey users.
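A minimal sketch of the suggested branching follows. The function name ``placebo_failure_hint``, its boolean argument, and the exact message wording are assumptions for illustration; only the idea (do not recommend bootstrap to survey users) comes from the finding above.

```python
def placebo_failure_hint(is_survey_fit: bool) -> str:
    """Hypothetical helper: pick fallback guidance after a placebo-variance failure."""
    if is_survey_fit:
        # Bootstrap rejects every survey design, so do not suggest it here.
        return "Consider using variance_method='jackknife' or adding more control units."
    return (
        "Consider using variance_method='bootstrap' or 'jackknife', "
        "or adding more control units."
    )
```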

Tech Debt

Severity P3. Impact: The SDID survey-bootstrap capability regression is explicitly documented in docs/methodology/REGISTRY.md:L1549-L1550 and tracked in TODO.md:L107-L108, so under this rubric it is properly deferred rather than blocking. Concrete fix: none for approval.

Security

No findings.

Documentation/Tests

Severity P3. Impact: docs/methodology/REGISTRY.md:L1570 still points readers to TestPValueSemantics::test_bootstrap_p_value_null_calibration, but the PR renamed the actual regression guard to tests/test_methodology_sdid.py:L2604-L2657. The registry now references a non-existent test when describing the calibration safeguard. Concrete fix: update the registry reference to test_bootstrap_p_value_null_dispersion.

Four P3-only items from R8 CI review:

1. Correctly attribute R's default `vcov()` method:
   - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was "R's default". R's `synthdid::vcov()` actually defaults to `method="bootstrap"`. Reword to describe placebo as the library default with a rationale paragraph (survey availability, perf) and cross-reference to the REGISTRY Note below.
   - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite to frame the default as a deliberate library deviation with the same two-reason rationale.
2. Add a REGISTRY.md Note (default variance_method deviation from R) that documents the rationale explicitly (survey availability + perf) so the AI reviewer recognizes the deviation as documented rather than as an accidental contradiction between the docstring and R's actual default.
3. Soften the placebo-failure fallback guidance in `_placebo_variance_se`: the previous strings recommended `variance_method="bootstrap"`, which now raises `NotImplementedError` on every survey design. Branch on `w_control is not None` (survey fit) to recommend jackknife + adding controls for survey users and keep the bootstrap + jackknife + more-controls recommendation for non-survey users.
4. Fix the REGISTRY pointer to the slow dispersion guard: reference `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the current name) instead of the pre-rename `test_bootstrap_p_value_null_calibration`, and describe the new contract (calibration-agnostic dispersion + loose rejection-rate band).
5. Rephrase the REGISTRY coverage-MC narrative about jackknife anti-conservatism: the paper's AER §6.3 shows mixed jackknife evidence (98% iid, slightly conservative; 93% AR(1), slightly anti-conservative). Our observed anti-conservatism aligns with the AR(1) branch of the paper's evidence, not a uniform "in line" pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T23:16:41Z

/ai-review

github-actions · 2026-04-22T23:25:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 710f96654d443e7a6546febe650e0e7cc97e7578

Overall Assessment

✅ Looks good

Executive Summary

No unmitigated P0/P1 findings in the changed SDID estimator / SE / inference path.
The refit bootstrap implementation now matches the documented methodology surface and the official synthdid variance contract, rather than the old fixed-weight shortcut.
The prior re-review P3s appear addressed: default-vs-R wording, the jackknife coverage narrative, the stale registry test reference, and the survey-aware placebo fallback messaging are all updated in the changed files.
The remaining items are P3-only: one user-facing runtime-guidance inconsistency, plus tracked tech-debt items around survey-bootstrap composition and restoring an external parity anchor.
This was a static review only; I could not execute tests in this workspace because numpy and pytest are not installed.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official synthdid surface: vcov() documents bootstrap/jackknife/placebo as Algorithms 2/3/4, defaults to bootstrap, and the R bootstrap code renormalizes omega then re-enters synthdid_estimate() with stored opts, while synthdid_estimate() treats update.omega / update.lambda as re-estimation from the passed weights as initializations. (synth-inference.github.io)

Code Quality

No findings.

Performance

No code-level findings.

Maintainability

No findings.

Tech Debt

Severity P3 (tracked). Impact: diff_diff/synthetic_did.py:L315-L341, docs/methodology/REGISTRY.md:L1549-L1551, and TODO.md:L107-L108 explicitly document that paper-faithful bootstrap currently rejects all survey designs; full-design survey users therefore have no SDID variance path in this release, and pweight-only users must use placebo/jackknife. Concrete fix: none for approval; the weighted-Frank-Wolfe / Rao-Wu composition follow-up is already tracked.
Severity P3 (tracked). Impact: the new external parity anchor is deferred in TODO.md:L108-L108, and the registry now explicitly avoids claiming bit-level cross-language bootstrap parity in docs/methodology/REGISTRY.md:L1506-L1508. Concrete fix: none for approval; land the tracked R/Julia parity harness later.

Security

No findings.

Documentation/Tests

Severity P3. Impact: user-facing runtime guidance for refit bootstrap is internally inconsistent. The changed surfaces now say ~5–30× in CHANGELOG.md:L15-L15, ~5–30× in diff_diff/synthetic_did.py:L54-L57, ~10–100× in diff_diff/power.py:L720-L721, and ~2–4 hours for the coverage run in benchmarks/python/coverage_sdid.py:L13-L17, while the registry says ~15–40 min in docs/methodology/REGISTRY.md:L1569-L1571. This is not a correctness bug, but it will give users conflicting expectations about the cost of the new bootstrap path. Concrete fix: standardize all surfaces on one benchmark-backed wording, or collapse them to “order-of-magnitude slower; see performance docs.”

Single actionable P3 from R9 CI review: user-facing runtime wording for refit bootstrap had diverged across surfaces, giving conflicting expectations about the cost of the new bootstrap path:

- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the per-fit slowdown (the warm-start plumbing closed the gap vs the pre-warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the coverage MC wall-clock. The CHANGELOG entry already notes the 10-100x figure as a historical "prior estimate"; it is left as-is so the release notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default" (tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is the library default per the REGISTRY Note added in 710f966. Reword to describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power, test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift guards unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T23:45:58Z

/ai-review

github-actions · 2026-04-22T23:53:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: dc2045fb781e7d16c6c3f768d9f6a4535d51d925

Overall Assessment

✅ Looks good

Static review only; I could not execute tests in this workspace because numpy, pandas, and pytest are not installed.

Executive Summary

No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / SE / inference path.
The prior P3 on inconsistent slowdown/runtime guidance appears addressed across the changed surfaces.
The survey-bootstrap capability regression and the loss of the old cross-language parity fixture are both explicitly documented/tracked, so they remain P3-only.
[Newly identified] Two unchanged survey docs still advertise the removed SDID Rao-Wu bootstrap path; this is a P3 documentation cleanup.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official synthdid docs map bootstrap / jackknife / placebo to Algorithms 2 / 3 / 4 and default vcov() to bootstrap; bootstrap_sample() passes renormalized omega plus stored opts back into synthdid_estimate(); and synthdid_estimate() treats supplied weights as initializations while update.omega / update.lambda remain enabled.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3 (tracked). Impact: the survey-support regression is explicit and documented: diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334 now reject full designs and any bootstrap-plus-survey combination, while docs/methodology/REGISTRY.md:1549 and TODO.md:107 track the weighted-Frank-Wolfe / Rao-Wu follow-up. Concrete fix: none for approval; land the tracked weighted-FW survey-bootstrap work later.
Severity P3 (tracked). Impact: the old fixed-weight R fixture was removed, so the refit bootstrap no longer has an external parity anchor in-tree; the follow-up is explicitly tracked in TODO.md:108. Concrete fix: none for approval; add the planned Julia/R parity harness later.

Security

No findings.

Documentation/Tests

Severity P3 [Newly identified]. Impact: docs/tutorials/16_survey_did.ipynb:1109 and docs/methodology/survey-theory.md:725 still tell users that SyntheticDiD has Rao-Wu survey bootstrap support, but the new guards in diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334 now raise NotImplementedError for that path. Users following the survey tutorial will hit a removed code path. Concrete fix: update both docs surfaces to remove SyntheticDiD from the Rao-Wu support list and redirect readers to pweight-only variance_method='placebo' / 'jackknife' plus the tracked survey-bootstrap TODO.

The Pure Python Fallback CI job failed this test at rel-diff 0.5310 > 0.40 tolerance. Root cause is test-infrastructure, not a correctness regression.

ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python mode to keep CI fast (see tests/conftest.py:210); the test's 0.40 tolerance was explicitly calibrated for B∈[100, 200] per its docstring comment. At B=49 the bootstrap SE is not yet converged to the placebo SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the same seed), so the failure is MC-noise, not a regression.

The 15 Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3 Python versions) all run the test at the full B=200 and pass with comfortable margin, so the regression guard is still exercised on the default user install path.

Skip under pure-Python mode with an explicit rationale citing the min_n cap and the Rust-backed coverage that preserves the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-23T01:19:36Z

/ai-review

github-actions · 2026-04-23T01:29:24Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b123c2b763ec0d92584c8e5f1130a92140e3d343

Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / SE / inference path. The refit bootstrap implementation in diff_diff/synthetic_did.py:846 matches the updated contract in docs/methodology/REGISTRY.md:1497 and METHODOLOGY_REVIEW.md:504.
The PR repurposes variance_method="bootstrap" to mean the paper-faithful refit bootstrap rather than adding a new enum. That is a real behavior change, but it is explicitly documented in CHANGELOG.md:15 and docs/methodology/REGISTRY.md:1497, so it is informational rather than blocking.
Severity P2: the new aggregate Frank-Wolfe non-convergence warning can count discarded draws in the numerator while dividing by valid draws, so the documented 5% threshold can be overstated. See diff_diff/synthetic_did.py:982, diff_diff/synthetic_did.py:993, and diff_diff/synthetic_did.py:1045.
The new tests cover the right surfaces for this change set: bootstrap refit behavior, survey rejection guards, set_params() rollback, BR labeling, and the coverage artifact schema in tests/test_methodology_sdid.py:501, tests/test_survey_phase5.py:174, and tests/test_business_report.py:4656.
[Previously identified, unresolved] Severity P3: survey docs outside the changed files still advertise SyntheticDiD Rao-Wu bootstrap support, conflicting with the new runtime guards.

Methodology

No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through safe_inference() at diff_diff/synthetic_did.py:648, so I did not find a new P0/P1 NaN/inference regression.

Code Quality

Severity P2. Impact: In _bootstrap_se, fw_nonconvergence_count is incremented before the draw is accepted into bootstrap_estimates, so a draw that later fails the np.isfinite(tau) gate can still count toward the warning numerator while the denominator is n_successful valid draws. That does not change the reported SE, but it can over-trigger or overstate the documented “share of valid bootstrap draws” warning contract. Concrete fix: increment the counter only after the draw passes the finite-tau gate, or keep a separate fw_nonconvergence_valid_count for the warning path. References: diff_diff/synthetic_did.py:982, diff_diff/synthetic_did.py:993, diff_diff/synthetic_did.py:1045, docs/methodology/REGISTRY.md:1508.
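The accounting fix suggested above can be sketched as follows. This is a toy loop, not the body of ``_bootstrap_se``: the function name ``tally_draws`` and the tuple layout of each draw are assumptions for illustration. The point is only the ordering, where the non-convergence counter is touched after the draw clears the finite-τ gate, so numerator and denominator both refer to valid draws.

```python
import math


def tally_draws(draws):
    """Hypothetical sketch: draws is an iterable of (tau, omega_converged, lambda_converged).

    Returns (valid_estimates, nonconverged_valid_count, nonconverged_share).
    """
    estimates = []
    nonconverged_valid = 0
    for tau, omega_ok, lambda_ok in draws:
        if not math.isfinite(tau):
            continue  # discarded draw: must not touch the warning numerator
        estimates.append(tau)
        if not (omega_ok and lambda_ok):
            nonconverged_valid += 1  # counted only for draws that feed the SE
    share = nonconverged_valid / len(estimates) if estimates else 0.0
    return estimates, nonconverged_valid, share
```

With one discarded non-converged draw and one valid non-converged draw out of three valid draws, the share is 1/3; counting before the gate would have reported 2/3.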

Performance

No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15.

Maintainability

No findings.

Tech Debt

Severity P3 (tracked). Impact: SDID survey-bootstrap support is intentionally removed and front-door rejected in this release; that is a capability regression, but it is documented and tracked rather than silently wrong. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu composition follow-up tracked in TODO.md:107 and described in docs/methodology/REGISTRY.md:1551.
Severity P3 (tracked). Impact: deleting the old R bootstrap fixture removes the only external cross-language parity anchor for SDID bootstrap math; protection now relies on internal baselines and characterization tests. Concrete fix: none for approval; add the planned Julia/R parity harness tracked in TODO.md:108.

Security

No findings.

Documentation/Tests

Severity P3 [Previously identified, unresolved]. Impact: docs/methodology/survey-theory.md:725 and docs/tutorials/16_survey_did.ipynb:1109 still say SyntheticDiD has Rao-Wu survey-bootstrap support, but the estimator now raises NotImplementedError for that path at diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334. Concrete fix: remove SyntheticDiD from both Rao-Wu support lists and redirect readers to pweight-only variance_method="placebo" / "jackknife" plus the tracked TODO.

R10 CI review found two items on top of the previous ✅ Looks good.

P2 Code Quality: aggregate Frank-Wolfe non-convergence warning numerator/denominator mismatch. In ``_bootstrap_se``, ``fw_nonconvergence_count`` was incremented before the draw cleared the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND then produced non-finite τ would count toward the warning numerator while the denominator is ``n_successful`` (draws that cleared the finite-τ gate). That does not affect the reported SE, but it can overstate the documented "share of valid bootstrap draws" warning contract and cause the warning to over-trigger. Fix: move the increment inside the ``if np.isfinite(tau)`` block so the numerator only counts draws that also contribute to the SE. A draw failing the finite-τ gate is retried upstream and should not inflate the non-convergence rate.

P3 Documentation (previously unresolved): two survey-cross-reference docs still advertised SyntheticDiD Rao-Wu bootstrap support, which the estimator now rejects at fit-time with NotImplementedError:

- ``docs/methodology/survey-theory.md:725``: rewrite the Rao-Wu bullet to exclude SDID explicitly, with a pointer to the REGISTRY sketch for the deferred weighted-FW composition and to pweight-only placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c: update the support-matrix table so SDID's row reads "pweight only (placebo / jackknife)" with bootstrap struck out, and add a "Note on SyntheticDiD" below explaining which methods accept pweight-only and why bootstrap rejects all survey designs (weighted-FW derivation tracked in TODO.md).

Test coverage unchanged: TestBootstrapSE ran the full 7 under Rust with 48-of-50 non-convergence warning still firing on the regression test, confirming the warning numerator still tallies correctly after the gate-order change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-23T21:25:32Z

/ai-review

github-actions · 2026-04-23T21:33:52Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5eadcb6dbd1c11aca9a70aeb5a7d54c0ff09296b

Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

No unmitigated P0/P1 findings in the changed SDID estimator / inference path.
The refit bootstrap implementation in diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 now re-estimates ω and λ on each bootstrap draw, which aligns with Algorithm 2’s “compute the SDID estimator based on (Y^(b), W^(b))” requirement and with R’s bootstrap_sample() / synthdid_estimate() warm-start contract. (nber.org)
The previous P2 on Frank-Wolfe non-convergence accounting appears addressed: fw_nonconvergence_count now increments only after a finite draw is accepted (diff_diff/synthetic_did.py:L986-L998), so numerator and denominator both reference valid draws.
The prior survey-doc mismatch also appears resolved in the changed docs (docs/methodology/survey-theory.md:L725-L735, docs/tutorials/16_survey_did.ipynb:L1090, docs/choosing_estimator.rst:L781-L821).
The removed external bootstrap parity fixture is not a blocker because the follow-up is explicitly tracked in TODO.md:L108; regression coverage on the changed surfaces is otherwise solid (tests/test_methodology_sdid.py:L500-L710, tests/test_survey_phase5.py:L179-L360, tests/test_business_report.py:L4658-L4702).

Methodology

No unmitigated P0/P1 findings.
Severity P3 (informational). Impact: variance_method="bootstrap" is now a materially different user-facing method than before; existing fits will get different SE / p-value / CI outputs. This is explicitly documented in CHANGELOG.md:L15-L21 and docs/methodology/REGISTRY.md:L1497-L1552, so it is not a defect. Concrete fix: none for approval. The implementation at diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 matches the paper / R contract for recomputing SDID on each bootstrap draw. (nber.org)

Code Quality

No findings.

Performance

No findings. The slowdown is explicitly documented as an intentional consequence of replacing the removed fixed-weight shortcut with the refit path (CHANGELOG.md:L15-L21, docs/methodology/REGISTRY.md:L1508-L1509).

Maintainability

No findings. Extracting _validate_config() and making set_params() transactional improves constructor / setter contract parity (diff_diff/synthetic_did.py:L188-L203, diff_diff/synthetic_did.py:L1479-L1511).
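A transactional setter of the kind described above can be sketched as follows. ``ToyEstimator`` is a hypothetical stand-in, not ``SyntheticDiD``: the allowed ``variance_method`` values and the ``n_bootstrap >= 2`` rule are illustrative assumptions. What the sketch shows is the pattern: the constructor and ``set_params`` share one ``_validate_config`` helper, and a failed validation restores every attribute the call touched.

```python
class ToyEstimator:
    """Hypothetical estimator illustrating shared validation and rollback."""

    _PARAMS = ("variance_method", "n_bootstrap")
    _ALLOWED_METHODS = ("placebo", "bootstrap", "jackknife")  # illustrative only

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        # Single source of truth for constructor and setter validation.
        if self.variance_method not in self._ALLOWED_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.n_bootstrap < 2:  # assumed rule, for illustration
            raise ValueError("n_bootstrap must be at least 2")

    def set_params(self, **params):
        for key in params:
            if key not in self._PARAMS:
                raise ValueError(f"Unknown parameter: {key}")
        snapshot = {key: getattr(self, key) for key in params}
        try:
            for key, value in params.items():
                setattr(self, key, value)
            self._validate_config()
        except Exception:
            # Roll back every touched attribute, then re-raise.
            for key, value in snapshot.items():
                setattr(self, key, value)
            raise
        return self
```

Without the rollback, a call that sets a valid ``n_bootstrap`` and an invalid ``variance_method`` would leave the instance holding the new ``n_bootstrap`` and the invalid method after the exception.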

Tech Debt

Severity P3 (tracked in TODO.md). Impact: deleting benchmarks/R/generate_sdid_bootstrap_parity_fixture.R and tests/data/sdid_bootstrap_indices_r.json removes the only external cross-language parity anchor for SDID bootstrap in this PR. Concrete fix: none for approval; the replacement parity harness is already tracked in TODO.md:L108.
Severity P3 (tracked in TODO.md). Impact: SDID bootstrap no longer supports any survey design, and full-design SDID has no variance path this release. That is a real capability regression, but it is front-door rejected, documented in docs/methodology/REGISTRY.md:L1549-L1551, and tracked in TODO.md:L107, so it is not a blocker. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu follow-up.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new return_convergence=True helper contract is only partially documented; compute_time_weights() and compute_sdid_unit_weights() explain the new parameter but their Returns sections still advertise only np.ndarray, not the tuple return path. Concrete fix: update diff_diff/utils.py:L1462-L1510 and diff_diff/utils.py:L1598-L1643 to mirror _sc_weight_fw’s union-return documentation.
Otherwise no findings. The changed tests cover the main regression surfaces: bootstrap refit behavior / non-convergence (tests/test_methodology_sdid.py:L500-L710), survey rejection and pweight-only fallbacks (tests/test_survey_phase5.py:L179-L360), BR inference labeling (tests/test_business_report.py:L4658-L4702), and the coverage artifact schema (tests/test_methodology_sdid.py:L2976-L3005).

R11 CI review flagged that compute_time_weights and compute_sdid_unit_weights describe the new return_convergence parameter but their Returns sections still advertise only np.ndarray, not the tuple return.

Update both docstrings to mirror _sc_weight_fw's union-return shape: Returns now read "np.ndarray or Tuple[np.ndarray, bool]" with an explicit description of the two-pass AND convergence flag (True iff both pre-sparsify and main FW passes hit the min-decrease break; False if either hit max_iter).

Also promote the return_convergence kwarg description from the implementation-detail placement to the main Parameters block with the same contract note as _sc_weight_fw so standalone callers see the legacy-ABI preservation guarantee and the Rust top-level fast-path skip.

Docs-only change; no code or test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-23T21:50:06Z

/ai-review

github-actions · 2026-04-23T21:57:33Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ec34beb8c5f98db72001cfb534ac4d5446e9be60

Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests or the Monte Carlo script.

Executive Summary

No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / inference path.
The refit bootstrap now re-estimates both ω and λ on each pairs-bootstrap draw, and bootstrap/jackknife inference still routes through safe_inference(), which matches the current Methodology Registry contract for Arkhangelsky et al. Algorithm 2 / R-default vcov(method="bootstrap"). diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552
Previous re-review concerns appear addressed: Frank-Wolfe non-convergence is now surfaced through explicit Rust/Python convergence flags instead of warning capture, and the tuple-return helper contracts are now documented. diff_diff/synthetic_did.py:L952-L998, diff_diff/utils.py:L1301-L1382, diff_diff/utils.py:L1462-L1667, rust/src/weights.rs:L518-L555
Survey support is now internally consistent across code, docs, and tests: full designs are front-door rejected on all SDID variance methods, and pweight-only surveys are limited to placebo/jackknife. This regression is documented and tracked, so it is informational only. diff_diff/synthetic_did.py:L315-L342, docs/choosing_estimator.rst:L785-L823, tests/test_survey_phase5.py:L179-L359, TODO.md:L107-L108
Remaining issues are P3 only: one stale fallback message on the power-analysis surface, and weak provenance metadata on the committed coverage artifact.

Methodology

No findings. The changed estimator math, SE formula, retry-to-B contract, survey guards, and bootstrap/jackknife p-value dispatch match the current Registry description. diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552
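For readers unfamiliar with the pattern under review, here is a minimal sketch of a pairs bootstrap with a retry-to-B loop. It is not the library's `_bootstrap_se`: `toy_did`, the `Unit` tuple layout, and `max_attempts` are illustrative assumptions, and the `1/B` variance normalization is assumed rather than taken from this thread. In a refit bootstrap, the per-draw re-estimation of ω and λ would happen inside the `estimate` callback.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

# (is_treated, post-minus-pre change): a toy stand-in for one unit of a panel.
Unit = Tuple[bool, float]


def toy_did(draw: Sequence[Unit]) -> float:
    """Stand-in estimator with uniform weights; a refit variant would re-solve weights here."""
    treated = [d for t, d in draw if t]
    control = [d for t, d in draw if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def pairs_bootstrap_se(
    units: Sequence[Unit],
    estimate: Callable[[Sequence[Unit]], float],
    n_boot: int,
    rng: random.Random,
    max_attempts: Optional[int] = None,
) -> float:
    """Resample units with replacement until n_boot valid draws are collected."""
    max_attempts = max_attempts or 20 * n_boot
    taus: List[float] = []
    attempts = 0
    while len(taus) < n_boot and attempts < max_attempts:
        attempts += 1
        draw = [units[rng.randrange(len(units))] for _ in units]
        n_treated = sum(1 for t, _ in draw if t)
        if n_treated == 0 or n_treated == len(draw):
            continue  # degenerate draw (all treated or all control): retry
        taus.append(estimate(draw))
    if len(taus) < 2:
        return float("nan")
    mean = sum(taus) / len(taus)
    return math.sqrt(sum((t - mean) ** 2 for t in taus) / len(taus))
```

Retrying degenerate draws, rather than counting them toward B, keeps the number of successful replications equal to the requested B whenever valid draws are reachable.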

Code Quality

No findings.

Performance

No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent. CHANGELOG.md:L15-L21, docs/performance-scenarios.md:L237-L249

Maintainability

No findings. Extracting _validate_config() and making set_params() transactional improves constructor/setter parity. diff_diff/synthetic_did.py:L186-L203, diff_diff/synthetic_did.py:L1479-L1511
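The constructor/setter parity noted here can be sketched as follows. This is a simplified illustration and not the code in `diff_diff/synthetic_did.py`: the `ToyEstimator` class, its allow-list, and its two parameters are assumptions chosen to show a shared validator and a setter that rolls back on failure.

```python
class ToyEstimator:
    """Illustrative class: one validator shared by __init__ and set_params."""

    _ALLOWED_METHODS = ("placebo", "jackknife", "bootstrap")  # illustrative allow-list

    def __init__(self, variance_method="placebo", n_bootstrap=200):
        self.variance_method = variance_method
        self.n_bootstrap = n_bootstrap
        self._validate_config()

    def _validate_config(self):
        if self.variance_method not in self._ALLOWED_METHODS:
            raise ValueError(f"unknown variance_method: {self.variance_method!r}")
        if self.n_bootstrap < 2:
            raise ValueError("n_bootstrap must be at least 2")

    def set_params(self, **params):
        snapshot = {}
        try:
            for name, value in params.items():
                if name not in vars(self):
                    raise ValueError(f"unknown parameter: {name}")
                snapshot[name] = getattr(self, name)  # remember the old value
                setattr(self, name, value)
            self._validate_config()
        except Exception:
            # Roll back every attribute touched so far, then re-raise.
            for name, old in snapshot.items():
                setattr(self, name, old)
            raise
        return self
```

Validating after all assignments, and restoring from the snapshot on any failure, means a rejected call leaves the instance exactly as it was before the call.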

Tech Debt

Severity P3 (tracked in TODO.md). Impact: SDID no longer has any survey-aware variance path for strata/PSU/FPC, and even pweight-only survey fits cannot use bootstrap in this release. This is front-door rejected and documented, so it is not a blocker. Concrete fix: none for approval; implement the tracked weighted-Frank-Wolfe + Rao-Wu composition follow-up. diff_diff/synthetic_did.py:L315-L342, docs/methodology/REGISTRY.md:L1549-L1551, TODO.md:L107-L107
Severity P3 (tracked in TODO.md). Impact: deleting the old R fixture leaves no external cross-language parity anchor for the new refit bootstrap; validation is currently same-library only. Concrete fix: none for approval; add the tracked R/Julia parity harness. TODO.md:L108-L108

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new placebo-feasibility error text on the power-analysis surface still recommends variance_method='bootstrap' even on survey-weighted custom-DGP paths, but SyntheticDiD.fit() now rejects bootstrap for every survey design. Concrete fix: branch the fallback guidance in _check_sdid_placebo_data() and the registry-path check so survey-weighted SDID only suggests jackknife (or adding controls). diff_diff/power.py:L714-L721, diff_diff/power.py:L2043-L2051, diff_diff/synthetic_did.py:L334-L342
Severity P3. Impact: the committed coverage artifact records "library_version": "3.2.0" even though it documents unreleased refit-bootstrap behavior, which weakens provenance for future methodology audits/regenerations. Concrete fix: emit a git SHA or explicit dev/unreleased marker in the artifact metadata, then regenerate benchmarks/data/sdid_coverage.json. benchmarks/python/coverage_sdid.py:L343-L374, benchmarks/data/sdid_coverage.json:L2-L8
Otherwise no findings. The changed tests cover the main regression surfaces: refit-bootstrap behavior/non-convergence, survey rejection, setter rollback, result/report labeling, and coverage-artifact schema. tests/test_methodology_sdid.py:L500-L715, tests/test_methodology_sdid.py:L1234-L1311, tests/test_methodology_sdid.py:L2968-L3010, tests/test_business_report.py:L4658-L4700, tests/test_survey_phase5.py:L179-L359
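One possible shape for the provenance fix suggested above, as a hedged sketch: the `"library_version"` key comes from the artifact, while `provenance_metadata`, the `"git_sha"` and `"released"` keys, and the `+dev` suffix are assumptions made for illustration, not the harness's actual schema.

```python
import subprocess
from typing import Optional


def current_git_sha() -> Optional[str]:
    """Return the HEAD commit SHA, or None when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def provenance_metadata(library_version: str, released: bool) -> dict:
    """Build artifact metadata that marks unreleased behavior explicitly."""
    version = library_version if released else f"{library_version}+dev"
    return {
        "library_version": version,
        "git_sha": current_git_sha(),  # hypothetical key
        "released": released,          # hypothetical key
    }
```

Recording the commit alongside the version string lets a later audit tie the committed numbers to the exact code that produced them, even when the version has not been bumped yet.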

Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up to #351 which front-door rejected all survey designs). This commit adds the weighted-FW kernel + Python wrappers; the bootstrap integration lands in the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted` loop variants. Identical to the unweighted loops except for the regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in place of `eta*lam[j]`, and the FW step-size denominator uses the diag(reg_w)-weighted simplex direction norm `Σ_j reg_w[j]*d[j]²` (which simplifies to `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the unweighted internal when reg_weights is None (preserves the legacy numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and `sc_weight_fw_weighted_with_convergence`. Same call shape as the existing unweighted siblings plus a trailing `reg_weights` kwarg. Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
  * test_weighted_fw_reg_weights_none_delegates: bit-identity at rel=1e-14 against the unweighted internal.
  * test_weighted_fw_uniform_reg_weights_matches_unweighted: uniform rw=1 collapses to uniform regularization (rel=1e-12, allowing for ULP-scale drift from different float reduction orders).
  * test_weighted_fw_simplex_invariants: for arbitrary positive rw and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to 1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend (mirrors the shape added for _rust_sc_weight_fw_with_convergence in PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust path, dispatches to the new weighted pyfunctions; on the pure-Python path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control, Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by rw_control and passes rw_control as reg_weights so the FW solves the unit-weight survey-bootstrap objective

      min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i·ω_i²

  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights. Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control, rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes no reg_weights (uniform reg on λ; λ is per-period, rw is per-control, no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of the two pass convergence flags, mirroring the contract added in PR #351 c0d089b.

Tests (tests/test_weighted_fw.py; new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at bit-identity; uniform reg matches unweighted at rel=1e-12; Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw; return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted helper, simplex invariants under arbitrary rw, shape-mismatch raises, return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7 for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw, compute_sdid_unit_weights, compute_time_weights are unaffected; the new kwarg defaults to None and dispatches to the legacy code path. The bit-identity check on TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
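The closed form quoted for the weighted step-size denominator can be checked numerically. The sketch below is pure Python and independent of the Rust kernel; the function names are hypothetical, and only the two formulas from the commit message (the weighted direction norm for d = e_i - lam and the `eta*reg_w[j]*lam[j]` regularization term) are taken from the source.

```python
from typing import List


def weighted_norm_direct(lam: List[float], reg_w: List[float], i: int) -> float:
    """Compute sum_j reg_w[j] * d[j]^2 explicitly for d = e_i - lam."""
    d = [(1.0 if j == i else 0.0) - lam_j for j, lam_j in enumerate(lam)]
    return sum(w * dj * dj for w, dj in zip(reg_w, d))


def weighted_norm_closed_form(lam: List[float], reg_w: List[float], i: int) -> float:
    """Closed form: sum_j reg_w[j]*lam[j]^2 + reg_w[i] - 2*reg_w[i]*lam[i]."""
    return sum(w * l * l for w, l in zip(reg_w, lam)) + reg_w[i] - 2.0 * reg_w[i] * lam[i]


def weighted_reg_half_grad(lam: List[float], reg_w: List[float], eta: float) -> List[float]:
    """Regularization part of the half-gradient: eta * reg_w[j] * lam[j]."""
    return [eta * w * l for w, l in zip(reg_w, lam)]


lam = [0.2, 0.5, 0.3]
reg_w = [1.0, 2.5, 0.5]
for i in range(len(lam)):
    gap = abs(weighted_norm_direct(lam, reg_w, i) - weighted_norm_closed_form(lam, reg_w, i))
    assert gap < 1e-12
```

The identity follows from expanding `reg_w[i] * (1 - lam[i])²` for the i-th coordinate and `reg_w[j] * lam[j]²` for the rest. With `reg_w` all ones, both expressions reduce to the unweighted forms.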

…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression. Pweight-only and full-design surveys now both succeed; placebo / jackknife continue to reject full designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-gated version that fires only for placebo / jackknife. Rationale + truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method      pweight-only    strata/PSU/FPC
      bootstrap   ✓ (this PR)     ✓ Rao-Wu (this PR)
      placebo     ✓ unchanged     ✗ NotImpl (separate gap)
      jackknife   ✓ unchanged     ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351. Keep the `weight_type != "pweight"` validation (fweight/aweight still rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`), also collapse the observation-level `resolved_survey` to a unit-level view via `collapse_survey_to_unit_level(...)` ordered as `[*control_units, *treated_units]`. The resulting `resolved_survey_unit` is what `_bootstrap_se` slices via `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for pweight-only, `resolved_survey=resolved_survey_unit` for full design, None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-only, default None; preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns (NaN, []) since the bootstrap distribution is unidentified (resampling one PSU yields the same subset every draw). Recovered from the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs: ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the resampled units. Pweight-only path uses ``rw = w_control[boot_idx]`` (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)`` when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey`` helpers (PR #352 commit 1) which run the weighted-FW kernel; non-survey paths continue to call the unweighted helpers (bit-identity preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID estimator (which expects simplex weights). Degenerate-retry if ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as succeeds-tests with explicit SE assertions:
  * ``test_full_design_bootstrap_succeeds`` (was ``_raises``): finite SE, populated survey_metadata.n_strata/n_psu, summary() includes Survey Design + Bootstrap replications blocks.
  * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``): finite SE, variance_method preserved (cross-surface guard).
  * New ``test_bootstrap_full_design_se_differs_from_pweight_only`` resurrects the PR #351 R3-deleted differs-from contract: ATT matches between paths (both compose ω_eff post-fit) but SE differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351 raises-tests as succeeds-tests, plus add the ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid / test_business_report SDID subset / test_rust_backend / test_survey_phase5 / test_weighted_fw / test_guides. Bit-identity check: the non-survey refit path goes through the unweighted helpers (no weighted-FW dispatch), so ``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]`` remains at rel=1e-14, verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
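The post-FW composition step in this commit message, ``ω_eff = rw·ω / Σ(rw·ω)`` with a retry when the denominator is non-positive, can be sketched in a few lines. `compose_omega_eff` is a hypothetical name, plain lists stand in for arrays, and returning `None` to signal "retry this draw" is an assumption about how a caller might handle the degenerate case.

```python
from typing import List, Optional


def compose_omega_eff(omega: List[float], rw: List[float]) -> Optional[List[float]]:
    """Compose simplex weights omega with per-unit rescaled weights rw.

    Returns omega_eff = rw*omega / sum(rw*omega), or None when the total is
    non-positive (all FW mass sits on controls with rw == 0), so the caller
    can discard the draw and retry.
    """
    if len(omega) != len(rw):
        raise ValueError("omega and rw must have the same length")
    weighted = [r * w for r, w in zip(rw, omega)]
    total = sum(weighted)
    if total <= 0:
        return None
    return [x / total for x in weighted]


# Half the FW mass sits on a control that this draw dropped (rw == 0).
omega_eff = compose_omega_eff([0.5, 0.5, 0.0], [2.0, 0.0, 3.0])
assert omega_eff == [1.0, 0.0, 0.0]
```

Renormalizing after multiplying by `rw` puts the result back on the simplex, which is what the estimator expects.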

Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap composition and propagates the landed capability across the documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/stratum) null-treatment panel with moderate weight variation and modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post periods). Converts per-observation ``treated`` to a unit-level ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory`` callable that returns ``(SurveyDesign, supported_methods_tuple)``. For ``stratified_survey``: bootstrap only. Placebo / jackknife reject strata/PSU/FPC at fit-time, so the harness skips them rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls the factory once per seed (DataFrame contents don't affect columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via ``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend); aer63 remains the long tail at 2237 s, stratified_survey adds only 33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the bootstrap is slightly conservative (overestimates empirical sampling SD by ~25%), the safer direction under Rao-Wu rescaling with only 4 PSUs total. Validates the weighted-FW + Rao-Wu composition end-to-end.

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a paragraph under it documenting the calibration verdict, the conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR #352 shape); add a ``Note (survey + bootstrap composition)`` documenting the weighted-FW objective (unit and time forms), the ω_eff composition, the argmin-set caveat, the per-draw rw dispatch (pweight-only vs Rao-Wu), and the single-PSU short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to say "4 representative null-panel DGPs" and flag stratified_survey as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix table row for SyntheticDiD switches from "pweight only (placebo/jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC"; "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as "restored in PR #352"; add new Added/Changed PR #352 entries documenting the weighted-FW kernel, survey helpers, _bootstrap_se Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced with a narrower follow-up for placebo/jackknife + strata/PSU/FPC (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including ``stratified_survey``); new explicit assertions that the stratified_survey bootstrap row has ≥100 successful fits and α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
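The two calibration quantities reported in this commit message (rejection rate under H0 and the mean-SE / true-SD ratio) can be computed as in the sketch below. This is not the harness code: `calibration_summary` and `passes_gate` are hypothetical names, the test is assumed to be a two-sided normal-approximation test (which may differ from the library's `safe_inference()` dispatch), and "true SD" is taken to be the sample standard deviation of the estimates across seeds.

```python
import statistics
from statistics import NormalDist
from typing import Sequence, Tuple


def calibration_summary(
    estimates: Sequence[float],
    std_errors: Sequence[float],
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Return (rejection rate at alpha, mean SE / empirical SD) for H0: effect = 0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = sum(
        1 for est, se in zip(estimates, std_errors) if abs(est / se) > z_crit
    )
    rate = rejections / len(estimates)
    ratio = statistics.mean(std_errors) / statistics.stdev(estimates)
    return rate, ratio


def passes_gate(rate: float, low: float = 0.02, high: float = 0.10) -> bool:
    """Check the alpha=0.05 rejection rate against the calibration band."""
    return low <= rate <= high


# The reported stratified_survey x bootstrap rate of 0.042 falls inside the band.
assert passes_gate(0.042)
```

A ratio above 1 means the reported standard errors exceed the empirical spread of the estimates on average, which is the conservative direction described above.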

Tracker had fallen ~9 estimators behind the library. Audit against __init__.py __all__, docs/methodology/REGISTRY.md, docs/methodology/papers/, and tests/test_methodology_*.py surfaced four "Not Started" entries that are stale and ~10 missing entries entirely.

Changes:
- Reorganized Review Status Summary into seven categories (Core, Staggered, Continuous & Universal-Treatment, Triple-Difference, Counterfactual, Diagnostics, Cross-Cutting Inference Features).
- Added "What 'Complete' means" tier definition (Complete / In Progress / Not Started) so the bar is explicit.
- Added In-Progress entries for ImputationDiD, TwoStageDiD, WooldridgeDiD (ETWFE), EfficientDiD, ContinuousDiD, ChaisemartinDHaultfoeuille (DCDH), HeterogeneousAdoptionDiD (HAD), TROP, StaggeredTripleDifference, ConleySpatialHAC, Survey Data Support, PlaceboTests. Each "In Progress" block lists what's already in place (REGISTRY section, paper review, methodology test file, parity fixtures, test counts) and what's needed to promote to Complete.
- Updated SyntheticDiD last-review date to 2026-04-23 (PR igerber#351 bootstrap-refit landing date) to reflect the warm-start FW corrections.
- Refreshed methodology-test counts on the existing Complete entries to match current `grep` output (CallawaySantAnna 61, HonestDiD 27, TripleDifference 45, DifferenceInDifferences 51, HonestDiD unit 72).
- Updated Priority Order: BaconDecomposition flagged as next substantive review (chosen during this session); In-Progress promotion ladder documented (HAD largest, DCDH closest to ready, etc.).

No source code changes. No status flips on existing Complete entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 3 commits April 22, 2026 11:56

igerber and others added 3 commits April 22, 2026 16:07

igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026

igerber removed the ready-for-ci Triggers CI test workflows label Apr 23, 2026

igerber added the ready-for-ci Triggers CI test workflows label Apr 23, 2026

igerber merged commit 172d1d8 into main Apr 23, 2026
23 of 24 checks passed

igerber deleted the sdid-bootstrap-refit branch April 23, 2026 23:14

igerber mentioned this pull request Apr 24, 2026

Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355

Merged

igerber mentioned this pull request Apr 25, 2026

Release 3.3.0: HAD estimator, profile_panel, dCDH by_path, SDID survey complete #368

Merged

igerber mentioned this pull request May 15, 2026

Refresh METHODOLOGY_REVIEW.md to reflect current estimator catalog #448

Merged
