Skip to content

Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study#351

Merged
igerber merged 19 commits into
mainfrom
sdid-bootstrap-refit
Apr 23, 2026
Merged

Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study#351
igerber merged 19 commits into
mainfrom
sdid-bootstrap-refit

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 22, 2026

Summary

  • Ship paper-faithful variance_method="bootstrap_refit" for SyntheticDiD — Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains "placebo".
  • Cross-surface allow-list extensions so the new enum flows through all surfaces that read variance_method: results.py:960 summary line, business_report.py:602 BR inference-label, synthetic_did.py:695 n_bootstrap result population, power.py SDID guidance strings, SyntheticDiD.__init__ docstring, diff_diff/guides/llms-full.txt. Survey + refit raises NotImplementedError upstream in fit() — Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
  • Coverage Monte Carlo study (benchmarks/python/coverage_sdid.py) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes benchmarks/data/sdid_coverage.json. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
  • SyntheticDiD.set_params validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extract a shared _validate_config helper; make set_params transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.

Methodology references (required if estimator / math changes)

  • Method name(s): Synthetic Difference-in-Differences bootstrap variance — new variance_method="bootstrap_refit" variant.
  • Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, Wager (2021), "Synthetic Difference-in-Differences," AER 111(12). New variant implements Algorithm 2 step 2 verbatim (re-estimate ω̂_b and λ̂_b per draw via Frank-Wolfe). Cross-language anchor: Julia Synthdid.jl::src/vcov.jl:96-103 is the only existing refit-bootstrap implementation; R synthdid::vcov(method="bootstrap") and Stata sdid.ado:1033-1037 both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in docs/methodology/REGISTRY.md §SyntheticDiD.
  • Any intentional deviations from the source (and why): Survey designs (including pweight-only) composed with bootstrap_refit raise NotImplementedError — Rao-Wu rescaled weights composed with FW re-estimation needs its own derivation (paper is un-survey; R has no survey support). Documented in REGISTRY.md and tracked in TODO.md. FW non-convergence warnings are aggregated into a single summary UserWarning at end-of-loop if the rate exceeds 5% (same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ per-fit spam. No deviations from the paper's estimator math, SE formula (sqrt((r-1)/r) × sd(ddof=1) — unchanged), or p-value dispatch (analytical from bootstrap SE — unchanged from PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349).

Validation

  • Tests added/updated:
    • tests/test_methodology_sdid.py::TestBootstrapRefitSE — 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
    • tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical — mirror of the fixed-weight bootstrap regression guard
    • tests/test_methodology_sdid.py::TestGetSetParams — 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
    • tests/test_methodology_sdid.py::TestCoverageMCArtifact — schema smoke test on the MC JSON (guarded with pytest.skip if absent per feedback_golden_file_pytest_skip.md)
    • tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel — cross-surface guard that BR emits "bootstrap_refit variance" on alpha-override, not the analytical fallback label
    • PR Fix SyntheticDiD bootstrap p-value dispatch and SE formula #349's 1e-10 R-parity bit-identity test (TestJackknifeSERParity::test_bootstrap_se_matches_r) still passes — fixed-weight path byte-identical to baseline despite the _bootstrap_se signature expansion
  • Backtest / simulation / notebook evidence (if applicable): Coverage MC artifact committed at benchmarks/data/sdid_coverage.json (500 seeds × B=200 × 3 DGPs × 4 methods, generated on M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in REGISTRY.md §SyntheticDiD:
DGP method α=0.01 α=0.05 α=0.10 mean SE / true SD
balanced placebo 0.016 0.060 0.086 1.05
balanced bootstrap 0.106 0.160 0.230 0.85
balanced bootstrap_refit 0.028 0.078 0.116 1.05
balanced jackknife 0.066 0.112 0.154 1.08
unbalanced placebo 0.006 0.032 0.070 1.08
unbalanced bootstrap 0.036 0.098 0.140 0.94
unbalanced bootstrap_refit 0.008 0.038 0.080 1.11
unbalanced jackknife 0.024 0.076 0.120 0.99
AER §6.3 placebo 0.018 0.058 0.086 0.99
AER §6.3 bootstrap 0.034 0.092 0.162 0.88
AER §6.3 bootstrap_refit 0.010 0.040 0.078 1.05
AER §6.3 jackknife 0.030 0.080 0.150 0.90

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 22, 2026 11:56
Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in
variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified
Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-
scale zeta. Default remains "placebo".

Cross-surface allow-list extensions land in one PR per
feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt

Survey + bootstrap_refit raises NotImplementedError upstream in fit()
(covers both pweight-only and full-design) — the Rao-Wu rescaled-weight
composition is tracked as a follow-up TODO.

Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds ×
B=200 × 3 DGPs × 4 methods under H0 and writes
benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈
{0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into
REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal
calibration across all 3 DGPs; fixed-weight over-rejects by roughly
1.8–3.2× on smaller panels, consistent with the SE under-estimate from
ignoring weight-estimation uncertainty.

Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical
in TestPValueSemantics + TestCoverageMCArtifact schema smoke test
(guarded with pytest.skip per feedback_golden_file_pytest_skip.md) +
cross-surface BR inference-label test. PR #349's 1e-10 R-parity
bit-identity gate still passes.

Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside
the refit loop and aggregated into a single summary warning at end-of-
loop if the rate exceeds 5% — the same threshold the retry-exhaustion
guard uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review caught that the sklearn-style setter path bypassed the
constructor's enum/coherence checks, so users could
``set_params(variance_method='not_a_method')`` after construction and
slip past the __init__ validation added for ``bootstrap_refit``.

Extract the existing checks into a private ``_validate_config()`` helper
and call from both ``__init__`` and ``set_params`` so both paths enforce
the same contract. Constant-hoist the valid-methods tuple onto the class
as ``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a
single source.

Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the review's P2 finding: if ``_validate_config`` rejects the
post-update state in a multi-attribute ``set_params`` call, the instance
was left with partially-applied (invalid) values after the raised
``ValueError``. Snapshot originals before any setattr and restore them
in an except handler so the raise leaves the object consistent with its
pre-call configuration.

Regression test asserts post-raise state matches the pre-call state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found. The new bootstrap_refit path re-estimates both ω and λ per draw and keeps the documented SE / p-value semantics aligned with the Methodology Registry.
  • The bootstrap_refit + survey-design exclusion is explicitly documented in the registry and tracked in TODO.md, so it is informational rather than blocking.
  • P3: the aggregated Frank-Wolfe non-convergence warning is counted on raw solver warnings rather than draw-level failures, so it can fire earlier than the registry text implies.
  • P3: a few in-code docstrings still describe the pre-PR variance-method surface.
  • P3: the existing scale-equivariance regression suite was not extended to bootstrap_refit, so the new normalization-sensitive path lacks a direct guard.

Methodology

No unmitigated findings. The refit path in diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071 re-estimates both weights per draw, and the non-placebo inference dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes in docs/methodology/REGISTRY.md:L1505 and docs/methodology/REGISTRY.md:L1555.

  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103, so it is mitigated and does not block approval. Concrete fix: none for this PR; the follow-up Rao-Wu + refit derivation is already recorded.

Code Quality

  • Severity P3. Impact: the refit non-convergence summary warning counts raw solver warnings in diff_diff/synthetic_did.py:L1092 and compares them to successful draws in diff_diff/synthetic_did.py:L1173, while the registry text says the warning should trigger when the non-convergence rate exceeds 5% of draws in docs/methodology/REGISTRY.md:L1513. That can emit the warning below the documented draw-level threshold when both omega and lambda warn on the same draw. Concrete fix: count “any non-convergence on this draw” as a boolean, or normalize against 2 * n_successful if the intended unit is solver calls.

Performance

No findings. The extra cost of bootstrap_refit is opt-in and documented in diff_diff/synthetic_did.py:L56 and docs/methodology/REGISTRY.md:L1513.

Maintainability

No unmitigated findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now carries placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values, so the name is increasingly misleading after this PR. This is already tracked in TODO.md:L129. Concrete fix: none required for approval; follow the tracked rename to a neutral name such as variance_effects.

Security

No findings.

Documentation/Tests

Execution note: I could not run the test suite in this sandbox because pytest and numpy are not installed here.

…ression

Three P3 items from the CI AI review, all under Documentation/Tests and
Code Quality (no methodology change):

1. **FW warning counting**: the refit loop counted raw solver warnings
   (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. REGISTRY
   text describes the rate per valid bootstrap draw, so the two could
   diverge by up to 3×. Change counts "any non-convergence on this draw"
   as boolean 1 and tighten the REGISTRY wording to call out the per-draw
   convention explicitly (each draw runs Frank-Wolfe once for ω and once
   for λ; either can trip the draw).

2. **Docstring gaps**: `fit()` now documents the new survey + refit
   NotImplementedError under the `survey_design` parameter and in the
   Raises block. `SyntheticDiDResults.variance_method` and
   `SyntheticDiDResults.placebo_effects` fields now list `bootstrap_refit`
   and describe the four-way semantics of `placebo_effects` under each
   variance method.

3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the
   `variance_method` parametrizations in `TestScaleEquivariance`
   (baseline-parity, scale-equivariance across ~15 orders of magnitude,
   and extreme-scale-detection). Capture a bit-identity baseline for the
   refit path on `_make_panel(seed=42)`, so the normalization-sensitive
   branch has its own drift guard, not inferred from sibling methods.

All 37 SDID targeted tests + 361 broader sweep tests pass; fixed-weight
1e-10 R-parity bit-identity gate preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a033267682aa63852637ca5206c4eab06808c80


Overall Assessment

⚠️ Needs changes

Executive Summary

  • [Newly identified] The PR now codifies the mapping “bootstrap = R-compatible fixed-weight shortcut, bootstrap_refit = paper-faithful refit” across the registry, public docstrings, changelog, and new regression tests, but the official synthdid source currently accessible does not support that blanket “matching R” claim. It stores update.omega / update.lambda in attr(estimate, "opts"), passes those opts back through bootstrap_sample(), and notes the supplied bootstrap weights are used only for initialization. (raw.githubusercontent.com)
  • Previous non-blocking findings from the last review look addressed: the BusinessReport allow-list now includes bootstrap_refit, SyntheticDiDResults.summary() shows bootstrap replications for refit, and the scale-equivariance suite now covers the new variance path.
  • The bootstrap_refit survey-design exclusion is documented in the methodology registry and tracked in TODO.md, so it is informational rather than blocking.
  • I could not execute the test suite in this sandbox because pytest, numpy, and pandas are not installed.

Methodology

  • Severity P1 [Newly identified]. Impact: The changed method description in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12 says the existing variance_method="bootstrap" is the R-compatible fixed-weight bootstrap and positions bootstrap_refit as the new paper-faithful variant. The official synthdid source currently accessible contradicts that as written: synthdid_estimate() persists update.omega / update.lambda in attr(estimate, "opts"), bootstrap_sample() feeds those opts back into the estimator, and the package source says the supplied weights are used only for initialization in bootstrap/placebo SEs. That makes the current fixed-weight path an undocumented deviation from the cited reference implementation, and the new fixed-vs-refit regression in tests/test_methodology_sdid.py:L568-L596 now hard-codes that mismatch into the test surface. Concrete fix: either (1) document the current bootstrap path in REGISTRY.md as a **Note (deviation from R):** and remove the “matching R” / “R-compatible” language everywhere, or (2) retarget variance_method="bootstrap" to the official synthdid bootstrap semantics and rename the fixed-weight shortcut; in both cases, update the new regression/parity language to match. (raw.githubusercontent.com)
  • Severity P3. Impact: bootstrap_refit still rejects any survey design, but that limitation is explicitly documented in docs/methodology/REGISTRY.md:L1513 and tracked in TODO.md:L103. Concrete fix: none required for approval.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: placebo_effects now spans placebo draws, fixed-bootstrap draws, refit-bootstrap draws, and jackknife LOO values; the name is misleading but already tracked in TODO.md:L129. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the methodology-labeling issue above. The prior doc/test gaps from the last review appear addressed in the changed files.

Path to Approval

  1. Resolve the public method mapping for SDID bootstrap variance: either document the current fixed-weight bootstrap path as a REGISTRY.md deviation from official synthdid, or change bootstrap to match the official source and rename the fixed-weight shortcut.
  2. Remove or rewrite the new “matching R” / “R-compatible” statements in docs/methodology/REGISTRY.md:L1497-L1513, diff_diff/synthetic_did.py:L54-L59, diff_diff/synthetic_did.py:L912-L936, diff_diff/results.py:L813-L819, and CHANGELOG.md:L11-L12.
  3. Update the new fixed-vs-refit regression and any parity messaging so the tests enforce the approved semantics instead of encoding the current unsupported “R-compatible fixed-weight bootstrap” claim.

… deviation

Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that
R's default synthdid::vcov(method="bootstrap") rebinds
attr(estimate, "opts") — which includes update.omega=TRUE from the
original fit — back into synthdid_estimate inside its do.call, so the
renormalized ω is used only as Frank-Wolfe initialization and ω and λ
are re-estimated per draw. R's default bootstrap is refit, not fixed-
weight. The sum_normalize helper in R's source explicitly comments
that the supplied weights "are used only for initialization" in
bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no
FW re-run). It is therefore a deliberate deviation from R's default.
Our PR #349 fixture generator at benchmarks/R/... is a manual
fixed-weight invocation — it omits the opts rebind, which defaults
update.omega to FALSE given non-null weights. The 1e-10 parity test
anchors our fixed-weight path to that manual R invocation, not to R's
real vcov behavior.

Documentation-only fix across all claim sites; no methodology or code
behavior changes:

- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as
  "Alternative: Bootstrap at unit level — fixed-weight shortcut";
  add explicit **Note (deviation from R)** citing the vcov.R /
  synthdid.R opts-rebind mechanism; call out bootstrap_refit as
  matching R's default vcov. Requirements checklist entries and
  R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se
  method docstring drop the "matching R" framing on the fixed-weight
  path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc
  fixed (I introduced the "R-compatible fixed-weight shortcut"
  misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that
  bootstrap_refit matches R's default and the existing fixed-weight
  bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment
  calls out the non-default invocation shape (no opts rebind → runs
  fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r
  docstring: rewritten to scope the parity check correctly (manual
  R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor
  (Julia Synthdid.jl or R via the real vcov path) to make the missing
  anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical
output change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3fa6d8d91ca8a4472d953b775af5f280a8e225d8


Overall Assessment
Looks good.

Executive Summary

  • The previous P1 around SDID bootstrap/R semantics looks materially resolved: fixed variance_method="bootstrap" is now documented as a deliberate deviation from R default, and the paper-faithful refit path is split out cleanly.
  • I did not find a new P0/P1 in the changed estimator path. The refit branch re-estimates both weight sets per draw, unsupported survey composition is rejected up front, and inference still routes through the existing NaN-safe path.
  • The new deferred items are properly tracked in TODO.md, so they are informational rather than blocking.
  • Residual issues are P3-only: a small amount of wording still overstates exact R-default parity or uses the old “R-compatible shortcut” framing.
  • I could not execute pytest here because the sandbox does not have it installed.

Methodology
Previous P1 about fixed bootstrap being mislabeled as R-default-faithful appears addressed.

  • Severity P3. Impact: the public wording still slightly over-claims exact R-default parity for bootstrap_refit. The docs say or imply that the new path matches R’s default bootstrap, but the implementation re-runs the library’s standard fresh-start two-pass Frank-Wolfe helpers rather than seeding unit-weight optimization with the renormalized original ω described in the registry’s own R-trace note. That is a numerical implementation choice, not a paper-methodology defect, so this is informational only, especially because the missing direct refit parity anchor is already tracked. Concrete fix: either weaken the wording to “same refit methodology as R default” or thread the R-style initialization into the refit branch and close the tracked parity follow-up. Refs: docs/methodology/REGISTRY.md:L1505, diff_diff/synthetic_did.py:L61, diff_diff/synthetic_did.py:L1098, diff_diff/utils.py:L1573, TODO.md:L104

Code Quality
No findings.

Performance
No findings.

Maintainability

Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this PR are properly documented and tracked, so they do not block approval: survey-design composition for refit and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the follow-ups open. Refs: TODO.md:L103, TODO.md:L104

Security
No findings.

Documentation/Tests
No unmitigated findings. Verification note: I could not run the new tests in this environment because pytest is not installed.

igerber and others added 3 commits April 22, 2026 16:07
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021
Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap")
behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so
the renormalized ω serves only as Frank-Wolfe initialization). The
previously-shipped fixed-weight shortcut is removed entirely; the
"bootstrap_refit" enum value briefly added in earlier commits of this
PR is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight
"bootstrap" matched neither the paper (which prescribes refit) nor R's
default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored
fixed-weight Python against a manual R invocation that omitted the opts
rebind — both sides were wrong in the same direction. Coverage MC at
benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the
new "bootstrap" tracks placebo near-nominal across the three
representative DGPs; the old fixed-weight column over-rejected at α=0.05
at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal).

Capability regression: SDID + survey designs (pweight-only AND
strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight
bootstrap was the only SDID variance method that supported
strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside
_bootstrap_se). Pweight-only users can switch to variance_method="placebo"
or "jackknife"; strata/PSU/FPC users have no SDID variance option on
this release. Rao-Wu rescaled weights composed with paper-faithful
Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and
reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's
"Note (deferred survey + bootstrap composition)" and TODO.md. The
deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via
`git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite
_bootstrap_se body.

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum
shipped in earlier commits of this PR rippled through results.py:960
summary gating, business_report.py:602 inference-label allow-list,
power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults
field docstrings. All of those are now back to a 3-value surface
("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into
  TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey,
  raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture
  (tests/data/sdid_bootstrap_indices_r.json) and generator
  (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they
  anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as
  duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the
  refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) — bit-
  identical to the captured "bootstrap_refit" baseline since the new
  bootstrap path is the same code as the old refit path. Tolerance
  tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap";
  test_set_params_accepts_bootstrap_refit deleted (redundant with
  constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to
  ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed
  to drop "refit" suffix; assertion checks for "bootstrap variance".

The benchmarks/data/sdid_coverage.json artifact is updated transitionally
in this commit (fixed-weight column dropped; refit column renamed to
bootstrap) so the schema test stays green; a follow-up commit
regenerates from a fresh 500-seed MC re-run with the new code path.
The REGISTRY coverage table cells are TBD pending that re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only follow-up to the previous commit's bootstrap rewrite. Updates
every user-facing surface that referenced the (now-removed) fixed-weight
bootstrap or the additive bootstrap_refit option:

- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the
  SDID survey-support row (no SDID variance method supports
  strata/PSU/FPC anymore); rewrites the misdirecting note steering users
  to bootstrap for full survey designs; updates the inference summary
  table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to
  reflect the regression matrix (pweight-only works with placebo /
  jackknife; strata/PSU/FPC has no SDID variance option in this release;
  bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's
  timing expectation note (bootstrap is now ~10-100x slower per fit
  than the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19
  (inference methods description) and 29 (summary) — bootstrap is now
  paper-faithful refit matching R's default vcov, not the prior
  fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the bootstrap-vs-
  placebo description (cell t18-cell-028); softens the stakeholder
  narrative claim "the two methods agree" to acknowledge that on small
  panels with non-exchangeable factor structure the SE magnitudes can
  differ while both methods still agree on significance and CI direction
  (cell t18-cell-033); re-executes the comparison cell so the output
  reflects the new bootstrap SE = 4.50 (was 4.26 under fixed-weight).
  The drift-guard asserts at cell t18-cell-026 only pin ATT / conf_int /
  pre-fit RMSE — none of which change — so no guard updates needed.
- diff_diff/synthetic_did.py: fit() docstring's survey_design parameter
  description is now consistent with the actual guards (no bootstrap_refit
  references; explicit pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: --help text drops the
  bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that
  bootstrap matches R's bootstrap_sample is replaced with an honest
  description of the corrected refit semantics, plus a parenthetical
  historical note about the prior fixed-weight shortcut.

All 351 targeted tests pass; no methodology or numerical change in this
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with
the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife").
Total wall-clock ~40 min on M-series Mac, Rust backend.

Numbers match the transitional rename in commit 1 exactly (rej@0.05:
balanced=0.078, unbalanced=0.038, aer63=0.040) — expected bit-identity
since the new "bootstrap" path is the same refit code as the previous
"bootstrap_refit" path, and the MC uses identical seeds (range(500)).
Confirms the rewrite didn't introduce numerical drift.

Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040)
  versus nominal 0.05; well within 2σ MC band at 500 seeds.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on
  balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.

REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_
placeholders with the transcribed numbers. The narrative under the
table (`Reading:` paragraph) was already rewritten in commit 1 for the
3-method surface; no further edit needed.

The transitional commit-1 artifact (where I renamed bootstrap_refit →
bootstrap in-place without running the MC) carried a `transitional_note`
field in the metadata; the fresh run naturally drops that field since
it's written from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 15481d9a859f1da8fa6f935d3ea5b3d6c8edbca3


Overall Assessment

✅ Looks good

Executive Summary

  • The previous substantive methodology concern is resolved: variance_method="bootstrap" now re-estimates both ω and λ per draw, keeps the documented retry-to-B contract, and still routes inference through safe_inference().
  • I did not find a new unmitigated P0/P1 in the changed estimator path.
  • The survey-support regression is explicit in code and documentation and is properly tracked in TODO.md, so it is informational rather than blocking.
  • Remaining issues are P3-only: one stale replicate-weight fallback message, and wording that still overstates verified parity with R’s default bootstrap.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

  • Severity P3. Impact: the refit bootstrap implementation fresh-starts each Frank-Wolfe solve via compute_sdid_unit_weights(...) / compute_time_weights(...), but the changed docs still state that the path matches R’s default vcov(method="bootstrap"), whose own rationale in the registry depends on the renormalized ω being threaded in as initialization. That is an implementation-choice / verification issue, not a demonstrated numerical defect, and it is already tracked in TODO.md, so it should not block approval. Concrete fix: either soften the wording to “paper-faithful refit bootstrap, intended to match R default” or add the TODO-tracked cross-language parity harness. Refs: diff_diff/synthetic_did.py:L924-L935, docs/methodology/REGISTRY.md:L1506-L1508, METHODOLOGY_REVIEW.md:L504-L508, CHANGELOG.md:L15-L18, TODO.md:L104-L104
  • No other findings. On the changed estimator path, the code now matches the updated registry on the load-bearing methodology points: it re-estimates both weight sets per bootstrap draw, retries degenerate/non-finite draws until n_bootstrap valid replicates or the bounded budget, preserves the sqrt((r-1)/r) * sd(ddof=1) SE formula, and routes bootstrap/jackknife inference through safe_inference() rather than inline t/p/CI code. Refs: diff_diff/synthetic_did.py:L583-L650, diff_diff/synthetic_did.py:L823-L1024, docs/methodology/REGISTRY.md:L1497-L1552

Code Quality

  • No findings.

Performance

  • No findings. The documented slowdown is the expected cost of correcting the methodology from fixed-weight bootstrap to per-draw refitting, not an accidental regression. Refs: CHANGELOG.md:L15-L21

Maintainability

  • Severity P3. Impact: the replicate-weight guard still tells users to “Use a TSL-based survey design (strata/psu/fpc),” but this PR simultaneously removes SDID support for strata/PSU/FPC on all variance methods. Users hitting that path are now sent to an unsupported alternative. Concrete fix: update the exception text to reflect the new contract, e.g. “replicate weights are not supported; pweight-only works with placebo / jackknife, and full survey designs have no SDID variance path in this release.” Refs: diff_diff/synthetic_did.py:L286-L290, diff_diff/synthetic_did.py:L299-L330, docs/methodology/REGISTRY.md:L1549-L1550

Tech Debt

  • Severity P3. Impact: the two real follow-ups introduced by this change are properly tracked, so they do not block approval: SDID+survey bootstrap composition and a direct cross-language refit parity anchor. Concrete fix: none required in this PR; keep the TODOs open. Refs: TODO.md:L103-L104

Security

  • No findings.

Documentation/Tests

CI review on commit 15481d9 flagged the docs as overclaiming parity with
R's default synthdid::vcov(method="bootstrap"): R warm-starts Frank-Wolfe
from the renormalized fit-time ω per draw (and keeps fit-time λ as FW
init for the λ re-estimation), while our Python port was cold-starting
from uniform. On the strictly-convex FW objective with simplex constraint,
warm- and cold-start converge to the same global minimum given enough
iterations — but the 100-iter pre-sparsify pass may not fully converge
on some draws, and then sparsification is path-dependent on the init.

Port the warm-start shape:

- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights
  gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the
  first pass. When None (default), preserves the Rust top-level fast-path
  unchanged. When provided, falls through to the Python two-pass
  dispatcher; inner FW calls still dispatch to Rust via _sc_weight_fw, so
  the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread warm-start per draw.
  boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) (same
  shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])).
  boot_lambda_init = time_weights (fit-time λ unchanged, matching R's
  weights.boot$lambda = weights$lambda).

Baseline drift is ULP-level: TestScaleEquivariance._BASELINE["bootstrap"]
SE shifts from 0.21424970247101688 (cold-start) to 0.21427381053829253
(warm-start), a handful of the 200 bootstrap draws converging to a
different sparsification pattern than the cold-start pass. rel=1e-14
bit-identity on the warm-start capture holds.

Coverage MC: NOT re-run. User course-correction ("not sure we need to
run the full MC with every pass here"). Math argument: strictly-convex
FW → unique global minimum; aggregated rejection rates shift well under
2σ MC noise (≈0.02–0.05 per cell at B=200 × 500 seeds). The committed
benchmarks/data/sdid_coverage.json (generated under cold-start in commit
15481d9) stays accurate for the warm-start path within MC noise.
REGISTRY.md coverage-MC Note now documents the artifact regeneration
cadence explicitly: regenerate on methodology changes that materially
shift per-draw numerics, not on refactors that preserve the FW global
optimum (warm-start, backend migration, renames, docstring fixes).
Per-seed bit-identity on the captured fixture is the cheaper, stricter
parity check.

Other changes in this commit:
- diff_diff/synthetic_did.py replicate-weight guard message rewritten
  (R2 P3 maintainability): the old message told users to switch to a
  TSL-based survey design with strata/psu/fpc, which is no longer
  supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; bootstrap SE
  displayed in the comparison cell is now 4.44 (warm-start) vs 4.50
  (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape
  and drop the "cross-language bit parity" implication from earlier
  wording.
- REGISTRY.md bootstrap section describes the warm-start mechanism
  including the FW strict-convexity argument for why warm and cold
  converge to the same answer but warm matches R's procedure at the
  algorithm level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 105cf73a28c6e8aae698945be5cea1020fb79e5b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The refit SDID bootstrap implementation itself looks methodologically aligned with the updated registry: it re-estimates both ω and λ per draw, keeps the sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference().
  • The prior user-facing survey-message problem appears resolved: the current guard text no longer points users toward unsupported full-design SDID survey paths.
  • P1 [Newly identified]: the survey-bootstrap removal was not propagated to the legacy SDID survey tests. The new guards in diff_diff/synthetic_did.py:L311-L337 now reject exactly the combinations that tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322 still assert should succeed.
  • P2: several changed documentation surfaces still describe the old SDID bootstrap contract (fixed weights / Rao-Wu), so the project’s own methodology cross-references are internally inconsistent.
  • The deferred survey+refit composition work and the cross-language parity anchor are both properly tracked in TODO.md:L103-L104, so they are informational rather than blocking.
  • I could not run pytest here because this environment is missing runtime test dependencies (pytest, numpy).

Methodology

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is the expected cost of replacing the old fixed-weight shortcut with per-draw Frank-Wolfe refitting, and that tradeoff is documented.

Maintainability

  • Severity P3. _bootstrap_se()’s docstring says unit_weights / time_weights “are not used inside the loop,” but the implementation immediately uses both as warm-start initializations. Impact: this gives maintainers the wrong mental model of what is fixed versus merely reused as initialization. Concrete fix: rewrite diff_diff/synthetic_did.py:L858-L862 to say the original weights are not reused as fixed estimator weights, only as Frank-Wolfe warm starts, matching diff_diff/synthetic_did.py:L873-L877 and diff_diff/synthetic_did.py:L925-L962.

Tech Debt

  • Severity P3 informational. The two real follow-ups introduced here are explicitly tracked: survey+refit composition and cross-language refit parity in TODO.md:L103-L104. Impact: none for approval. Concrete fix: none in this PR; keep those TODOs open.

Security

  • No findings.

Documentation/Tests

  • Severity P1 [Newly identified]. The new survey guards in diff_diff/synthetic_did.py:L311-L337 now make bootstrap + pweight-only survey and bootstrap + full survey design raise NotImplementedError, but the legacy SDID survey tests still assert the old success contract in tests/test_survey_phase5.py:L179-L196, tests/test_survey_phase5.py:L226-L253, and tests/test_survey_phase5.py:L309-L322. Impact: the branch leaves the SDID survey contract internally contradictory and will keep the test suite red once dependencies are present. Concrete fix: rewrite those tests to assert the new NotImplementedError behavior, and replace the old Rao-Wu comparison with positive pweight-only placebo / jackknife cases.
  • Severity P2. Several changed documentation surfaces still state the old SDID bootstrap story: docs/methodology/REGISTRY.md:L2699-L2699 still says Unit-level bootstrap (fixed weights), docs/methodology/REGISTRY.md:L2889-L2895 still lists SyntheticDiD under Rao-Wu survey bootstrap, docs/survey-roadmap.md:L52-L54 still lists SyntheticDiD among the Rao-Wu survey-bootstrap estimators, and diff_diff/guides/llms-full.txt:L1674-L1676 still says survey-aware bootstrap includes SyntheticDiD via Rao-Wu. Impact: reviewers and downstream doc consumers get conflicting methodology guidance from the project’s own source material. Concrete fix: update those cross-reference summaries to say SDID bootstrap is the paper-faithful refit path for non-survey fits only, and that SDID has no survey-bootstrap support in this release.

Path to Approval

  1. Update the stale SDID survey tests in tests/test_survey_phase5.py so they match the new public contract: variance_method="bootstrap" with any survey_design should raise NotImplementedError, while pweight-only positive coverage should move to placebo / jackknife.

Addresses CI review R3 findings on PR #351:

P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py
that still asserted the old fixed-weight + Rao-Wu success contract so they
match the new upstream guards (bootstrap + any survey design and full
strata/PSU/FPC with any variance method now raise NotImplementedError).
test_full_design_bootstrap_smoke -> test_full_design_bootstrap_raises;
test_full_design_se_differs_from_weights_only is replaced by a positive
placebo pweight-only smoke test since the old Rao-Wu vs pweight-only SE
comparison no longer has two valid paths to compare; test_bootstrap_with
_survey -> test_bootstrap_with_pweight_only_raises plus a new
test_jackknife_with_pweight_only for positive coverage. Also align the
pre-existing placebo/jackknife guard tests with the updated guard wording
(same stale-regex class of issue the P1 flagged).

P2: remove SyntheticDiD from cross-reference Rao-Wu Rescaled Bootstrap
sections that survived the bootstrap rewrite. REGISTRY comparison table
row now describes the alternative as paper-faithful refit bootstrap (not
fixed weights); the Rao-Wu section lists only SunAbraham and TROP with an
explicit Note explaining why SDID is excluded (refit + Rao-Wu composition
requires weighted-FW derivation, tracked in TODO.md and sketched in
REGISTRY SDID survey-regression Note). survey-roadmap.md Phase 6 bullet
and llms-full.txt survey-aware bootstrap bullet updated with the same
framing.

P3: rewrite _bootstrap_se docstring (diff_diff/synthetic_did.py) to say
the fit-time unit_weights and time_weights are Frank-Wolfe warm-start
initializations on each draw, not unused parameters retained for
signature parity. The in-loop narrative comment at the warm-start site
was already correct; the top-of-function docstring was stale from the
pre-warm-start commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f49a19f8cdb71d7cac7da51f37c5738c58b21543


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings remain in the changed PR surface. The core SDID refit-bootstrap path is aligned with the updated methodology registry, and the prior re-review blocker around stale SDID survey tests appears resolved.

Executive Summary

  • The changed SyntheticDiD bootstrap path is methodologically aligned with the updated registry: it re-estimates ω̂_b and λ̂_b per draw, preserves sqrt((r-1)/r) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference() for all-or-nothing NaN gating. See diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1057, diff_diff/utils.py:L174-L206, docs/methodology/REGISTRY.md:L1497-L1551.
  • The previous P1 around SDID survey tests is resolved. The changed tests now assert the new NotImplementedError contract for bootstrap and full-design survey paths. See tests/test_survey_phase5.py:L179-L216.
  • The cross-surface result/reporting surfaces look consistent in the current tree: bootstrap replications still render in summary(), and BusinessReport still labels SDID alpha overrides as bootstrap variance instead of falling through to analytical wording. See diff_diff/results.py:L963-L966, diff_diff/business_report.py:L593-L604, tests/test_business_report.py:L4658-L4702.
  • Remaining issues are minor P3 cleanup only: one deferred-work breadcrumb still contains placeholder commit IDs, one new survey test no longer checks the ATT-equivalence contract stated in its name/docstring, and one new coverage-artifact docstring still says "4 methods" although the artifact/test now use 3.
  • I could not run pytest here because this environment is missing numpy; verification was limited to static inspection plus compile() parsing of the changed Python files.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the deferred survey-bootstrap follow-up is properly tracked, but the newly added breadcrumbs are not directly actionable because they still contain placeholder SHAs (<sdid-bootstrap-refit-removal-sha> and <removal-commit>). Concrete fix: replace those placeholders with the actual commit hash, or rewrite the note to point to a stable PR number / git log --grep lookup instead. See TODO.md:L103-L104, docs/methodology/REGISTRY.md:L1550-L1550.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_full_vs_pweight_only_att_match is named and documented as an ATT-equivalence regression check, but its body now only asserts that the pweight-only fit is finite. That means the claimed "stripping strata/PSU should not change ATT" contract is no longer actually tested. Concrete fix: either compare the pweight-only ATT against the same data with strata/PSU/FPC columns ignored/dropped, or rename the test/docstring to describe it as a smoke test. See tests/test_survey_phase5.py:L227-L252.
  • Severity P3. Impact: the new coverage-artifact test docstring still says the MC study covers "4 methods," while the artifact schema and assertions immediately below now enforce exactly 3 methods (placebo, bootstrap, jackknife). That leaves stale methodology wording in the new regression surface. Concrete fix: update the docstring to say "3 methods." See tests/test_methodology_sdid.py:L2865-L2907.

- Replace placeholder commit SHAs in TODO.md and REGISTRY.md deferred-
  work breadcrumbs with the actual fixed-weight-removal commit hash
  (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful
  refit"). The follow-up implementer can now directly run
  `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu
  scaffolding (rw_control/rw_treated split, degenerate-retry check,
  treated-mean weighting). Also cite PR #351 for stability against
  commit-hash churn.

- Rename tests/test_survey_phase5.py::test_placebo_full_vs_pweight_only
  _att_match to test_placebo_with_pweight_only_full_design_stripped_att
  _match and promote the body from a single-fit smoke to the ATT-
  equivalence check the docstring claims: fit with two equivalent
  pweight-only SurveyDesign constructions and assert ATT matches bit-
  for-bit (abs=1e-12). The intent is to confirm strata/psu/fpc columns
  sitting on the DataFrame are not tacitly read unless the SurveyDesign
  references them.

- Fix stale "3 DGPs × 4 methods" docstring in
  tests/test_methodology_sdid.py::TestCoverageMCArtifact to read
  "3 DGPs × 3 methods", matching the 3-method schema the test now
  enforces (placebo / bootstrap / jackknife; bootstrap_refit and the
  old fixed-weight 'bootstrap' enum are both gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ec92d39b9fbf6d16f5e7642bf534b278a6d35a83


Overall Assessment
Looks good

No unmitigated P0 or P1 findings in the changed PR surface.

Executive Summary

Methodology

  • No unmitigated findings. The changed estimator and variance code is internally consistent with the methodology registry on Algorithm 2 refit bootstrap, analytical bootstrap/jackknife p-values, and NaN-safe inference. diff_diff/synthetic_did.py#L644 diff_diff/synthetic_did.py#L842 docs/methodology/REGISTRY.md#L1497
  • Severity P3. Impact: SDID survey bootstrap and full-design survey support regresses in this release, but the deviation is explicitly documented and tracked, so it is informational only under the review policy. Concrete fix: no approval blocker; the weighted-Frank-Wolfe survey composition follow-up is already specified in TODO.md#L103.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: external parity for the new refit-bootstrap semantics is still deferred; confidence currently rests on same-library regression tests plus the committed MC artifact rather than a live R/Julia anchor. Concrete fix: implement the cross-language parity harness already tracked in TODO.md#L104.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_placebo_with_pweight_only_full_design_stripped_att_match compares two semantically identical pweight-only SurveyDesigns, so it cannot catch the regression its docstring claims to cover. A future silent pickup of stratum or psu columns would still pass this test. Concrete fix: compare the pweight-only fit against the same dataset after dropping or renaming stratum and psu, or against a copy where those columns are absent. tests/test_survey_phase5.py#L227

Verification note: runtime execution was not possible here because this environment is missing numpy and pytest; review was limited to static inspection and AST parsing.

The previous rewrite compared two semantically identical pweight-only
SurveyDesign constructions, which can't catch the contract the docstring
claimed — if a future change silently picked up `stratum` or `psu` by
name, both fits would pick them up identically and the test would still
pass.

Rewrite to compare a fit on the original DataFrame (with `stratum` /
`psu` columns present) against a fit on the same data with those columns
physically dropped. If the estimator ever silently reads those columns
by naming convention the two fits would diverge and the abs=1e-12 ATT
check would fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 826b1a8e722960a0936da30785fbdf5f4ed8cf97


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The remaining issues are P2/P3 only. Runtime verification was static-only because this environment does not have numpy, pandas, or pytest.

Executive Summary

Methodology

  • Severity P3. Impact: No unmitigated methodology defect. The affected method is SyntheticDiD bootstrap variance. The refit path in diff_diff/synthetic_did.py:L830-L1057 refits both ω_b and λ_b, keeps the sqrt((r-1)/r) aggregation, and still routes bootstrap/jackknife inference through safe_inference() in diff_diff/synthetic_did.py:L643-L657, matching the local registry at docs/methodology/REGISTRY.md:L1497-L1551. Upstream synthdid also documents bootstrap/jackknife/placebo as Algorithms 2/3/4, and its bootstrap source calls back into synthdid_estimate() while synthdid_estimate() defaults update.omega / update.lambda to refitting-from-initialization when weights are supplied. Concrete fix: none. (synth-inference.github.io)
  • Severity P3. Impact: The SDID survey-bootstrap/full-design regression is explicit and mitigated by documentation/tracking: any survey + variance_method="bootstrap" fit and any full-design SDID fit now raises NotImplementedError, and that deviation is both documented and tracked. Concrete fix: none for approval; the weighted-Frank-Wolfe survey-bootstrap follow-up is already queued. diff_diff/synthetic_did.py:L283-L336 docs/methodology/REGISTRY.md:L1549-L1550 TODO.md:L103-L104

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: External parity anchoring for the refit bootstrap remains deferred. The old fixed-weight R fixture is deleted, and the replacement Julia/R parity anchor is only tracked in TODO.md, so this is informational rather than blocking. Concrete fix: land the queued cross-language parity harness already listed in TODO.md:L104-L104.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Newly identified] test_bootstrap_p_value_null_calibration still documents and asserts the removed fixed-weight bootstrap behavior. Its docstring says the test is for the “fixed-weight regime” and its lower-bound assertion rejection_rate > 0.05 assumes systematic over-rejection, but the new registry and committed coverage artifact now describe near-nominal refit-bootstrap calibration. That leaves the test description wrong today and can make the assertion flaky or fail on correct code. Concrete fix: rewrite the test around its actual regression target, namely “not collapsed near 0 and not catastrophically large,” and update the docstring/comments to the refit-bootstrap semantics. tests/test_methodology_sdid.py:L2547-L2589 docs/methodology/REGISTRY.md:L1552-L1570
  • No other findings. The prior informational survey-test issue appears addressed in tests/test_survey_phase5.py:L227-L270.

Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence via
``warnings.catch_warnings``, but the Rust FW entry point is silent on
``max_iter`` exhaustion (only the pure-NumPy path called
``warn_if_not_converged``). On the default Rust backend the aggregate
warning at the end of the bootstrap loop therefore never fired, even
when draws did not converge — a silent failure.

Fix: thread an explicit convergence bool out of the Rust solver.

Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``)
- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return
  ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- Existing ``sc_weight_fw`` pyfunction destructures and drops the bool,
  preserving its ABI for the rank-selection heuristic in ``prep.py``
  and for any third-party consumer.
- New pyfunction ``sc_weight_fw_with_convergence`` returns the
  ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` /
  ``compute_sdid_unit_weights_internal`` destructure the inner calls
  and still return ``Array1<f64>`` (their pyfunctions discard
  convergence — Python callers that need it use the Python two-pass
  dispatcher).

Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``)
- Import the new Rust entry point as
  ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a
  ``return_convergence=False`` kwarg. Default path is unchanged; with
  the flag set, return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the
  same kwarg and propagate the AND of the two FW passes (pre-sparsify
  + main). When the flag is set, the Rust top-level fast-path is
  skipped (it is silent on non-convergence) in favor of the Python
  two-pass dispatcher; inner FW calls still dispatch to the Rust
  solver via ``sc_weight_fw_with_convergence``.

Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``)
- Replace the ``warnings.catch_warnings`` dance with an explicit
  ``return_convergence=True`` call on both helpers. ``fw_nonconvergence
  _count`` now increments based on the AND of ``omega_converged`` and
  ``lambda_converged`` per draw — works under both Rust and pure-
  Python backends.

Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::
test_bootstrap_fw_nonconvergence_warning_fires_under_rust``)
- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always
  report ``converged=False`` and asserts the aggregate FW non-
  convergence ``UserWarning`` fires on an otherwise-nominal SDID
  bootstrap fit. Skipped when the Rust backend is unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c0d089bf9d6344c079fbba7f7a00717888d1f463


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. On this re-review, the earlier Rust non-convergence warning issue and the BR inference-label concern both look addressed. The remaining issues are P2/P3 only.

Executive Summary

  • The affected method is SyntheticDiD bootstrap variance. The implementation now refits both ω and λ per bootstrap draw, keeps the established sqrt((r-1)/r) × sd(...) SE aggregation, and still routes analytical inference through safe_inference(). I did not find an unmitigated methodology defect.
  • The previous Rust-warning concern appears resolved by the new convergence-return plumbing in diff_diff/utils.py:1350, diff_diff/synthetic_did.py:958, and rust/src/weights.rs:518.
  • The previous BR allow-list concern also appears resolved; unchanged diff_diff/business_report.py:602 already recognizes SDID variance_method values, and the new regression test at tests/test_business_report.py:4689 covers the alpha-override path.
  • One previous P2 remains unresolved: tests/test_methodology_sdid.py:2604 still documents and bounds bootstrap null calibration as if the deleted fixed-weight bootstrap were still the live behavior.
  • Static-only review here; I could not run pytest because the command is unavailable in this environment.

Methodology

Affected method: SyntheticDiD bootstrap variance. Cross-checking docs/methodology/REGISTRY.md:1497, diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and diff_diff/synthetic_did.py:644 against the cited SDID material and the current synthdid bootstrap flow did not reveal an unmitigated methodology defect. citeturn0search0turn3view0turn3view1

  • Severity P3. Impact: The survey-bootstrap capability regression is explicitly documented and tracked in docs/methodology/REGISTRY.md:1549 and TODO.md:107, so under this rubric it is informational rather than blocking. Concrete fix: none for approval.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: The deleted cross-language parity fixture leaves an external parity anchor as follow-up work, but that follow-up is explicitly tracked in TODO.md:108, so it is not a blocker. Concrete fix: none for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: [Previous finding unresolved] tests/test_methodology_sdid.py:2604 still describes bootstrap as the removed fixed-weight regime and hard-codes rejection_rate > 0.05 at α=0.05. The updated calibration table in docs/methodology/REGISTRY.md:1557 and docs/methodology/REGISTRY.md:1566 now says refit bootstrap is near nominal, so this test can fail on correct behavior or bias future changes toward anti-conservative output. Concrete fix: rewrite the docstring/comments to the refit-bootstrap contract and replace the lower bound with a calibration-agnostic regression guard, e.g. “not collapsed near 0 and not catastrophically high,” or a dispersion/quantile check that specifically catches the old dispatch bug.
  • Severity P3. Impact: [Newly identified] The Methodology Registry still says the Rust backend is silent on Frank-Wolfe non-convergence in docs/methodology/REGISTRY.md:1528, but this PR now threads explicit convergence status through Rust in rust/src/weights.rs:518, diff_diff/utils.py:1350, and emits the aggregate warning in diff_diff/synthetic_did.py:1041. Since REGISTRY.md is the methodology source of truth, that note is now stale. Concrete fix: update the edge-case bullet to reflect the new Rust convergence flag and the aggregated bootstrap warning behavior.

…W note

P2 (Documentation/Tests) — ``test_bootstrap_p_value_null_calibration``
at ``tests/test_methodology_sdid.py:2604`` was still framed around the
deleted fixed-weight regime and encoded an anti-conservative lower
bound ``rejection_rate > 0.05`` that biased the test toward keeping
over-rejection. Rename to ``test_bootstrap_p_value_null_dispersion``
and replace the directional calibration assertion with a
calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10``
catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed
under the old bug), and a loose band ``0.01 <= rejection_rate <=
0.40`` catches both SE-collapse and SE-explosion. Matches this
session's commit c0d089b by removing the remaining lingering
fixed-weight framing in the test suite. Tested locally: 100-seed
refit bootstrap passes with comfortable margin.

P3 (newly identified) — REGISTRY.md §Frank-Wolfe non-convergence edge
case (line 1528) still said the Rust backend silently returns the
final iterate and cited a follow-up. This session's commit c0d089b
already landed the Rust-side convergence flag, Python
``return_convergence=True`` kwargs on the weight helpers, and the
aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to
describe the new behavior: bootstrap path now uses
``sc_weight_fw_with_convergence`` (Rust) / numpy
``return_convergence=True`` and aggregates a single summary warning
above the 5% threshold; standalone callers that do not opt into
convergence tracking retain the legacy silent-Rust / warning-numpy
asymmetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9280cddab051623d6a9b5d30c9eb006cecf81397


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed PR surface. The SDID refit-bootstrap implementation itself looks methodologically consistent with the cited SDID paper/official synthdid variance surface, and the previous re-review items around the null-calibration test and the Rust non-convergence note appear addressed. Static review only; I did not run pytest or the coverage Monte Carlo harness here.

Executive Summary

  • The affected method is SyntheticDiD variance estimation, specifically the refit variance_method="bootstrap" path; I did not find a silent correctness bug in the changed estimator/SE/inference flow.
  • The prior test-related concern looks resolved: the slow bootstrap null-calibration guard is now calibration-agnostic instead of baking fixed-weight over-rejection into the expected behavior.
  • The survey-bootstrap capability regression is explicitly documented in REGISTRY.md and tracked in TODO.md, so it is informational rather than blocking under this rubric.
  • Remaining findings are P3-only documentation/UX drift: incorrect R-default wording, one source-paper overstatement in the new coverage narrative, one stale registry test reference, and one outdated survey warning string.

Methodology

Affected method: SDID variance estimation. The changed bootstrap path in diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059, diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558 re-estimates both ω and λ on each bootstrap draw, keeps the established sqrt((r-1)/r) * sd(...) SE aggregation, and still routes bootstrap/jackknife inference through safe_inference(). That is consistent with the SDID source reference and the official synthdid variance implementation, which documents bootstrap/jackknife/placebo as Algorithms 2/3/4 and implements bootstrap by resampling all units, renormalizing omega, and re-entering synthdid_estimate with stored opts. (aeaweb.org)

  • Severity P3. Impact: diff_diff/synthetic_did.py:L50-L61 and METHODOLOGY_REVIEW.md:L517-L519 now make the variance defaults internally inconsistent: the placebo bullet/review note still say placebo is “R’s default,” while the same PR also describes bootstrap as matching “R’s default,” and upstream synthdid docs/source show vcov() defaults to bootstrap. This does not change runtime behavior, but it misstates the reference implementation and obscures that the library’s placebo default is a divergence from R rather than a match. Concrete fix: change those lines to say placebo is the library default, or add a REGISTRY.md Note (deviation from R) if that divergence is intentional. (synth-inference.github.io)
  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1566-L1570 says the PR’s smaller-panel jackknife over-rejection is “in line” with Arkhangelsky et al. §6.3’s “98% / 93% coverage pattern.” The cited paper preview presents mixed evidence instead: the iid design is slightly conservative at 98% coverage, while the dependent-error design is 93% coverage. The current sentence overstates what the cited source supports. Concrete fix: rephrase this as mixed jackknife calibration in the paper, rather than uniformly anti-conservative support. (aeaweb.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: diff_diff/synthetic_did.py:L1128-L1133 and diff_diff/synthetic_did.py:L1220-L1224 still tell users to “Consider using variance_method='bootstrap'” from placebo failure paths. After this PR, bootstrap rejects every survey design, so that guidance is now wrong on the newly restricted survey-weighted path. Concrete fix: branch those warning strings on survey presence and suggest jackknife / more controls for survey users.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: docs/methodology/REGISTRY.md:L1570 still points readers to TestPValueSemantics::test_bootstrap_p_value_null_calibration, but the PR renamed the actual regression guard to tests/test_methodology_sdid.py:L2604-L2657. The registry now references a non-existent test when describing the calibration safeguard. Concrete fix: update the registry reference to test_bootstrap_p_value_null_dispersion.

Four P3-only items from R8 CI review:

1. Correctly attribute R's default `vcov()` method:
   - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
     "R's default". R's `synthdid::vcov()` actually defaults to
     `method="bootstrap"`. Reword to describe placebo as the library default
     with a rationale paragraph (survey availability, perf) and cross-
     reference to the REGISTRY Note below.
   - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
     to frame the default as a deliberate library deviation with the same
     two-reason rationale.

2. Add a REGISTRY.md Note (default variance_method deviation from R)
   that documents the rationale explicitly (survey availability + perf)
   so the AI reviewer recognizes the deviation as documented rather than
   as an accidental contradiction between the docstring and R's actual
   default.

3. Soften the placebo-failure fallback guidance in
   `_placebo_variance_se`: the previous strings recommended
   `variance_method="bootstrap"`, which now raises `NotImplementedError`
   on every survey design. Branch on `w_control is not None` (survey
   fit) to recommend jackknife + adding controls for survey users and
   keep the bootstrap + jackknife + more-controls recommendation for
   non-survey users.

4. Fix the REGISTRY pointer to the slow dispersion guard: reference
   `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
   current name) instead of the pre-rename
   `test_bootstrap_p_value_null_calibration`, and describe the new
   contract (calibration-agnostic dispersion + loose rejection-rate
   band).

5. Rephrase the REGISTRY coverage-MC narrative about jackknife
   anti-conservatism: the paper's AER §6.3 shows mixed jackknife
   evidence (98% iid — slightly conservative; 93% AR(1) — slightly
   anti-conservative). Our observed anti-conservatism aligns with the
   AR(1) branch of the paper's evidence, not a uniform "in line"
   pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 710f96654d443e7a6546febe650e0e7cc97e7578


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / SE / inference path.
  • The refit bootstrap implementation now matches the documented methodology surface and the official synthdid variance contract, rather than the old fixed-weight shortcut.
  • The prior re-review P3s appear addressed: default-vs-R wording, the jackknife coverage narrative, the stale registry test reference, and the survey-aware placebo fallback messaging are all updated in the changed files.
  • The remaining items are P3-only: one user-facing runtime-guidance inconsistency, plus tracked tech-debt items around survey-bootstrap composition and restoring an external parity anchor.
  • This was a static review only; I could not execute tests in this workspace because numpy and pytest are not installed.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official synthdid surface: vcov() documents bootstrap/jackknife/placebo as Algorithms 2/3/4, defaults to bootstrap, and the R bootstrap code renormalizes omega then re-enters synthdid_estimate() with stored opts, while synthdid_estimate() treats update.omega / update.lambda as re-estimation from the passed weights as initializations. (synth-inference.github.io)

Code Quality

No findings.

Performance

No code-level findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: diff_diff/synthetic_did.py:L315-L341, docs/methodology/REGISTRY.md:L1549-L1551, and TODO.md:L107-L108 explicitly document that paper-faithful bootstrap currently rejects all survey designs; full-design survey users therefore have no SDID variance path in this release, and pweight-only users must use placebo/jackknife. Concrete fix: none for approval; the weighted-Frank-Wolfe / Rao-Wu composition follow-up is already tracked.
  • Severity P3 (tracked). Impact: the new external parity anchor is deferred in TODO.md:L108-L108, and the registry now explicitly avoids claiming bit-level cross-language bootstrap parity in docs/methodology/REGISTRY.md:L1506-L1508. Concrete fix: none for approval; land the tracked R/Julia parity harness later.

Security

No findings.

Documentation/Tests

Single actionable P3 from R9 CI review: user-facing runtime wording for
refit bootstrap had diverged across surfaces, giving conflicting
expectations about the cost of the new bootstrap path:

- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and
  docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes
  ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed
  wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the
per-fit slowdown (the warm-start plumbing closed the gap vs the pre-
warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the
coverage MC wall-clock. The CHANGELOG entry already notes the 10-100x
figure as a historical "prior estimate" — left as-is so the release
notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default"
(tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is
the library default per the REGISTRY Note added in 710f966. Reword to
describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power,
test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift
guards unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: dc2045fb781e7d16c6c3f768d9f6a4535d51d925


Overall Assessment

✅ Looks good

Static review only; I could not execute tests in this workspace because numpy, pandas, and pytest are not installed.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / SE / inference path.
  • The prior P3 on inconsistent slowdown/runtime guidance appears addressed across the changed surfaces.
  • The survey-bootstrap capability regression and the loss of the old cross-language parity fixture are both explicitly documented/tracked, so they remain P3-only.
  • [Newly identified] Two unchanged survey docs still advertise the removed SDID Rao-Wu bootstrap path; this is a P3 documentation cleanup.

Methodology

No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official synthdid docs map bootstrap / jackknife / placebo to Algorithms 2 / 3 / 4 and default vcov() to bootstrap; bootstrap_sample() passes renormalized omega plus stored opts back into synthdid_estimate(); and synthdid_estimate() treats supplied weights as initializations while update.omega / update.lambda remain enabled. citeturn1view0turn1view1turn2view0

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: the survey-support regression is explicit and documented: diff_diff/synthetic_did.py:315 and diff_diff/synthetic_did.py:334 now reject full designs and any bootstrap-plus-survey combination, while docs/methodology/REGISTRY.md:1549 and TODO.md:107 track the weighted-Frank-Wolfe / Rao-Wu follow-up. Concrete fix: none for approval; land the tracked weighted-FW survey-bootstrap work later.
  • Severity P3 (tracked). Impact: the old fixed-weight R fixture was removed, so the refit bootstrap no longer has an external parity anchor in-tree; the follow-up is explicitly tracked in TODO.md:108. Concrete fix: none for approval; add the planned Julia/R parity harness later.

Security

No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
The Pure Python Fallback CI job failed this test at rel-diff 0.5310 >
0.40 tolerance. Root cause is test-infrastructure, not a correctness
regression.

ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python
mode to keep CI fast (see tests/conftest.py:210); the test's 0.40
tolerance was explicitly calibrated for B∈[100, 200] per its docstring
comment. At B=49 the bootstrap SE is not yet converged to the placebo
SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the
same seed), so the failure is MC-noise, not a regression.

The 15 Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3
Python versions) all run the test at the full B=200 and pass with
comfortable margin — the regression guard is still exercised on the
default user install path.

Skip under pure-Python mode with an explicit rationale citing the
min_n cap and the Rust-backed coverage that preserves the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b123c2b763ec0d92584c8e5f1130a92140e3d343


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

Methodology

No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through safe_inference() at diff_diff/synthetic_did.py:648, so I did not find a new P0/P1 NaN/inference regression.

Code Quality

  • Severity P2. Impact: In _bootstrap_se, fw_nonconvergence_count is incremented before the draw is accepted into bootstrap_estimates, so a draw that later fails the np.isfinite(tau) gate can still count toward the warning numerator while the denominator is n_successful valid draws. That does not change the reported SE, but it can over-trigger or overstate the documented “share of valid bootstrap draws” warning contract. Concrete fix: increment the counter only after the draw passes the finite-tau gate, or keep a separate fw_nonconvergence_valid_count for the warning path. References: diff_diff/synthetic_did.py:982, diff_diff/synthetic_did.py:993, diff_diff/synthetic_did.py:1045, docs/methodology/REGISTRY.md:1508.

Performance

No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked). Impact: SDID survey-bootstrap support is intentionally removed and front-door rejected in this release; that is a capability regression, but it is documented and tracked rather than silently wrong. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu composition follow-up tracked in TODO.md:107 and described in docs/methodology/REGISTRY.md:1551.
  • Severity P3 (tracked). Impact: deleting the old R bootstrap fixture removes the only external cross-language parity anchor for SDID bootstrap math; protection now relies on internal baselines and characterization tests. Concrete fix: none for approval; add the planned Julia/R parity harness tracked in TODO.md:108.

Security

No findings.

Documentation/Tests

@igerber igerber removed the ready-for-ci Triggers CI test workflows label Apr 23, 2026
R10 CI review found two items on top of the previous ✅ Looks good.

P2 Code Quality — aggregate Frank-Wolfe non-convergence warning
numerator/denominator mismatch. In ``_bootstrap_se``,
``fw_nonconvergence_count`` was incremented before the draw cleared
the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND
then produced non-finite τ would count toward the warning numerator
while the denominator is ``n_successful`` (draws that cleared the
finite-τ gate). That does not affect the reported SE, but it can
overstate the documented "share of valid bootstrap draws" warning
contract and cause the warning to over-trigger.

Fix: move the increment inside the ``if np.isfinite(tau)`` block so
the numerator only counts draws that also contribute to the SE. A
draw failing the finite-τ gate is retried upstream and should not
inflate the non-convergence rate.

P3 Documentation (previously unresolved) — two survey-cross-reference
docs still advertised SyntheticDiD Rao-Wu bootstrap support, which
the estimator now rejects at fit-time with NotImplementedError:

- ``docs/methodology/survey-theory.md:725`` — rewrite the Rao-Wu bullet
  to exclude SDID explicitly, with a pointer to the REGISTRY sketch
  for the deferred weighted-FW composition and to pweight-only
  placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c — update the
  support-matrix table so SDID's row reads
  "pweight only (placebo / jackknife)" with bootstrap struck out, and
  add a "Note on SyntheticDiD" below explaining which methods accept
  pweight-only and why bootstrap rejects all survey designs (weighted-
  FW derivation tracked in TODO.md).

Test coverage unchanged: TestBootstrapSE ran the full 7 under Rust
with 48-of-50 non-convergence warning still firing on the regression
test, confirming the warning numerator still tallies correctly after
the gate-order change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 5eadcb6dbd1c11aca9a70aeb5a7d54c0ff09296b


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SDID estimator / inference path.
  • The refit bootstrap implementation in diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 now re-estimates ω and λ on each bootstrap draw, which aligns with Algorithm 2’s “compute the SDID estimator based on (Y^(b), W^(b))” requirement and with R’s bootstrap_sample() / synthdid_estimate() warm-start contract. (nber.org)
  • The previous P2 on Frank-Wolfe non-convergence accounting appears addressed: fw_nonconvergence_count now increments only after a finite draw is accepted (diff_diff/synthetic_did.py:L986-L998), so numerator and denominator both reference valid draws.
  • The prior survey-doc mismatch also appears resolved in the changed docs (docs/methodology/survey-theory.md:L725-L735, docs/tutorials/16_survey_did.ipynb:L1090, docs/choosing_estimator.rst:L781-L821).
  • The removed external bootstrap parity fixture is not a blocker because the follow-up is explicitly tracked in TODO.md:L108; regression coverage on the changed surfaces is otherwise solid (tests/test_methodology_sdid.py:L500-L710, tests/test_survey_phase5.py:L179-L360, tests/test_business_report.py:L4658-L4702).

Methodology

  • No unmitigated P0/P1 findings.
  • Severity P3 (informational). Impact: variance_method="bootstrap" is now a materially different user-facing method than before; existing fits will get different SE / p-value / CI outputs. This is explicitly documented in CHANGELOG.md:L15-L21 and docs/methodology/REGISTRY.md:L1497-L1552, so it is not a defect. Concrete fix: none for approval. The implementation at diff_diff/synthetic_did.py:L594-L608 and diff_diff/synthetic_did.py:L846-L1065 matches the paper / R contract for recomputing SDID on each bootstrap draw. (nber.org)

Code Quality

  • No findings.

Performance

  • No findings. The slowdown is explicitly documented as an intentional consequence of replacing the removed fixed-weight shortcut with the refit path (CHANGELOG.md:L15-L21, docs/methodology/REGISTRY.md:L1508-L1509).

Maintainability

  • No findings. Extracting _validate_config() and making set_params() transactional improves constructor / setter contract parity (diff_diff/synthetic_did.py:L188-L203, diff_diff/synthetic_did.py:L1479-L1511).

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: deleting benchmarks/R/generate_sdid_bootstrap_parity_fixture.R and tests/data/sdid_bootstrap_indices_r.json removes the only external cross-language parity anchor for SDID bootstrap in this PR. Concrete fix: none for approval; the replacement parity harness is already tracked in TODO.md:L108.
  • Severity P3 (tracked in TODO.md). Impact: SDID bootstrap no longer supports any survey design, and full-design SDID has no variance path this release. That is a real capability regression, but it is front-door rejected, documented in docs/methodology/REGISTRY.md:L1549-L1551, and tracked in TODO.md:L107, so it is not a blocker. Concrete fix: none for approval; implement the weighted-Frank-Wolfe + Rao-Wu follow-up.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new return_convergence=True helper contract is only partially documented; compute_time_weights() and compute_sdid_unit_weights() explain the new parameter but their Returns sections still advertise only np.ndarray, not the tuple return path. Concrete fix: update diff_diff/utils.py:L1462-L1510 and diff_diff/utils.py:L1598-L1643 to mirror _sc_weight_fw’s union-return documentation.
  • Otherwise no findings. The changed tests cover the main regression surfaces: bootstrap refit behavior / non-convergence (tests/test_methodology_sdid.py:L500-L710), survey rejection and pweight-only fallbacks (tests/test_survey_phase5.py:L179-L360), BR inference labeling (tests/test_business_report.py:L4658-L4702), and the coverage artifact schema (tests/test_methodology_sdid.py:L2976-L3005).

R11 CI review flagged that compute_time_weights and compute_sdid_unit
_weights describe the new return_convergence parameter but their
Returns sections still advertise only np.ndarray, not the tuple return.

Update both docstrings to mirror _sc_weight_fw's union-return shape:
Returns now read "np.ndarray or Tuple[np.ndarray, bool]" with an
explicit description of the two-pass AND convergence flag (True iff
both pre-sparsify and main FW passes hit the min-decrease break; False
if either hit max_iter). Also promote the return_convergence kwarg
description from the implementation-detail placement to the main
Parameters block with the same contract note as _sc_weight_fw so
standalone callers see the legacy-ABI preservation guarantee and the
Rust top-level fast-path skip.

Docs-only change; no code or test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 23, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ec34beb8c5f98db72001cfb534ac4d5446e9be60


Overall Assessment

✅ Looks good

Static diff review only; numpy, pandas, and pytest are not installed in this workspace, so I could not execute the updated tests or the Monte Carlo script.

Executive Summary

  • No unmitigated P0/P1 findings in the changed SyntheticDiD estimator / inference path.
  • The refit bootstrap now re-estimates both ω and λ on each pairs-bootstrap draw, and bootstrap/jackknife inference still routes through safe_inference(), which matches the current Methodology Registry contract for Arkhangelsky et al. Algorithm 2 / R-default vcov(method="bootstrap"). diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552
  • Previous re-review concerns appear addressed: Frank-Wolfe non-convergence is now surfaced through explicit Rust/Python convergence flags instead of warning capture, and the tuple-return helper contracts are now documented. diff_diff/synthetic_did.py:L952-L998, diff_diff/utils.py:L1301-L1382, diff_diff/utils.py:L1462-L1667, rust/src/weights.rs:L518-L555
  • Survey support is now internally consistent across code, docs, and tests: full designs are front-door rejected on all SDID variance methods, and pweight-only surveys are limited to placebo/jackknife. This regression is documented and tracked, so it is informational only. diff_diff/synthetic_did.py:L315-L342, docs/choosing_estimator.rst:L785-L823, tests/test_survey_phase5.py:L179-L359, TODO.md:L107-L108
  • Remaining issues are P3 only: one stale fallback message on the power-analysis surface, and weak provenance metadata on the committed coverage artifact.

Methodology

No findings. The changed estimator math, SE formula, retry-to-B contract, survey guards, and bootstrap/jackknife p-value dispatch match the current Registry description. diff_diff/synthetic_did.py:L594-L661, diff_diff/synthetic_did.py:L846-L1067, docs/methodology/REGISTRY.md:L1497-L1552

Code Quality

No findings.

Performance

No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent. CHANGELOG.md:L15-L21, docs/performance-scenarios.md:L237-L249

Maintainability

No findings. Extracting _validate_config() and making set_params() transactional improves constructor/setter parity. diff_diff/synthetic_did.py:L186-L203, diff_diff/synthetic_did.py:L1479-L1511

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: SDID no longer has any survey-aware variance path for strata/PSU/FPC, and even pweight-only survey fits cannot use bootstrap in this release. This is front-door rejected and documented, so it is not a blocker. Concrete fix: none for approval; implement the tracked weighted-Frank-Wolfe + Rao-Wu composition follow-up. diff_diff/synthetic_did.py:L315-L342, docs/methodology/REGISTRY.md:L1549-L1551, TODO.md:L107-L107
  • Severity P3 (tracked in TODO.md). Impact: deleting the old R fixture leaves no external cross-language parity anchor for the new refit bootstrap; validation is currently same-library only. Concrete fix: none for approval; add the tracked R/Julia parity harness. TODO.md:L108-L108

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new placebo-feasibility error text on the power-analysis surface still recommends variance_method='bootstrap' even on survey-weighted custom-DGP paths, but SyntheticDiD.fit() now rejects bootstrap for every survey design. Concrete fix: branch the fallback guidance in _check_sdid_placebo_data() and the registry-path check so survey-weighted SDID only suggests jackknife (or adding controls). diff_diff/power.py:L714-L721, diff_diff/power.py:L2043-L2051, diff_diff/synthetic_did.py:L334-L342
  • Severity P3. Impact: the committed coverage artifact records "library_version": "3.2.0" even though it documents unreleased refit-bootstrap behavior, which weakens provenance for future methodology audits/regenerations. Concrete fix: emit a git SHA or explicit dev/unreleased marker in the artifact metadata, then regenerate benchmarks/data/sdid_coverage.json. benchmarks/python/coverage_sdid.py:L343-L374, benchmarks/data/sdid_coverage.json:L2-L8
  • Otherwise no findings. The changed tests cover the main regression surfaces: refit-bootstrap behavior/non-convergence, survey rejection, setter rollback, result/report labeling, and coverage-artifact schema. tests/test_methodology_sdid.py:L500-L715, tests/test_methodology_sdid.py:L1234-L1311, tests/test_methodology_sdid.py:L2968-L3010, tests/test_business_report.py:L4658-L4700, tests/test_survey_phase5.py:L179-L359

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 23, 2026
@igerber igerber merged commit 172d1d8 into main Apr 23, 2026
23 of 24 checks passed
@igerber igerber deleted the sdid-bootstrap-refit branch April 23, 2026 23:14
igerber added a commit that referenced this pull request Apr 24, 2026
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on TestScaleEquivariance::test_baseline_parity_small
_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit
when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-
door rejected as a known regression. Pweight-only and full-design surveys
now both succeed; placebo / jackknife continue to reject full designs (a
separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap
composition and propagates the landed capability across the
documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses
  ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/
  stratum) null-treatment panel with moderate weight variation and
  modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post
  periods). Converts per-observation ``treated`` to a unit-level
  ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory``
  callable that returns ``(SurveyDesign, supported_methods_tuple)``.
  For ``stratified_survey``: bootstrap only — placebo / jackknife
  reject strata/PSU/FPC at fit-time, so the harness skips them
  rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed
  through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls
  the factory once per seed (DataFrame contents don't affect
  columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via
``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap
200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend);
aer63 remains the long tail at 2237 s, stratified_survey adds only
33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at
α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration
band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the
bootstrap is slightly conservative (overestimates empirical
sampling SD by ~25%) — the safer direction under Rao-Wu rescaling
with only 4 PSUs total. Validates the weighted-FW + Rao-Wu
composition end-to-end.

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a
  paragraph under it documenting the calibration verdict, the
  conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR
  #352 shape); add a ``Note (survey + bootstrap composition)``
  documenting the weighted-FW objective (unit and time forms), the
  ω_eff composition, the argmin-set caveat, the per-draw rw
  dispatch (pweight-only vs Rao-Wu), and the single-PSU
  short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to
  drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to
  say "4 representative null-panel DGPs" and flag stratified_survey
  as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu
  Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect
  full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet
  restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix
  table row for SyntheticDiD switches from "pweight only (placebo/
  jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC";
  "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap
  bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet
  includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as
  "restored in PR #352"; add new Added/Changed PR #352 entries
  documenting the weighted-FW kernel, survey helpers, _bootstrap_se
  Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced
  with a narrower follow-up for placebo/jackknife + strata/PSU/FPC
  (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including
  ``stratified_survey``); new explicit assertions that the
  stratified_survey bootstrap row has ≥100 successful fits and
  α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows
  n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 18, 2026
Tracker had fallen ~9 estimators behind the library. Audit against
__init__.py __all__, docs/methodology/REGISTRY.md, docs/methodology/
papers/, and tests/test_methodology_*.py surfaced four "Not Started"
entries that are stale and ~10 missing entries entirely.

Changes:

- Reorganized Review Status Summary into seven categories (Core,
  Staggered, Continuous & Universal-Treatment, Triple-Difference,
  Counterfactual, Diagnostics, Cross-Cutting Inference Features).
- Added "What 'Complete' means" tier definition (Complete / In Progress
  / Not Started) so the bar is explicit.
- Added In-Progress entries for ImputationDiD, TwoStageDiD, WooldridgeDiD
  (ETWFE), EfficientDiD, ContinuousDiD, ChaisemartinDHaultfoeuille (DCDH),
  HeterogeneousAdoptionDiD (HAD), TROP, StaggeredTripleDifference,
  ConleySpatialHAC, Survey Data Support, PlaceboTests. Each "In Progress"
  block lists what's already in place (REGISTRY section, paper review,
  methodology test file, parity fixtures, test counts) and what's needed
  to promote to Complete.
- Updated SyntheticDiD last-review date to 2026-04-23 (PR igerber#351
  bootstrap-refit landing date) to reflect the warm-start FW corrections.
- Refreshed methodology-test counts on the existing Complete entries to
  match current `grep` output (CallawaySantAnna 61, HonestDiD 27,
  TripleDifference 45, DifferenceInDifferences 51, HonestDiD unit 72).
- Updated Priority Order: BaconDecomposition flagged as next substantive
  review (chosen during this session); In-Progress promotion ladder
  documented (HAD largest, DCDH closest to ready, etc.).

No source code changes. No status flips on existing Complete entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant