
SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482

Merged
igerber merged 2 commits into main from spillover-conley-wave-e3-subpopulation
May 21, 2026

@igerber
Owner

@igerber igerber commented May 21, 2026

Summary

  • Wave E.3 closes the user-facing P3 limitation documented at REGISTRY.md:3249 since Wave E.1: SurveyDesign.subpopulation()-derived designs AND warn-and-drop fits now preserve the full-domain resolved survey design (n_psu / n_strata / df_survey / Binder TSL per-stratum centering all reflect the FULL domain rather than the post-finite_mask fit sample).
  • Documented synthesis (library-convention adoption, NOT new methodology): adopts the R survey::svyrecvar(subset()) zero-pad convention (Lumley 2010 §2.5) already established in imputation.py:2175-2183 and prep.py:1401-1432. Propagates the same convention to SpilloverDiD's Wave E.1 Binder TSL × Wave D Gardner GMM × Wave E.2/follow-up stratified-Conley + serial Bartlett meat.
  • New score_pad_mask kwarg on _compute_gmm_corrected_meat: gamma_hat/Psi are built on survey_finite_mask = finite_mask & (survey_weights > 0) (the R6 fix preserves FE drop-first column-space invariance against zero-weight subpop rows that pd.factorize compaction would otherwise reorder); Psi is zero-padded back to full panel length inside the helper before kernel dispatch; kernel-dispatch arrays (cluster_ids, conley_coords, conley_time, conley_unit, resolved_survey) are passed at FULL length so meat helpers see full-domain PSU/strata/centroid/time geometry.
  • count_mask invariant: res.n_obs / res.n_treated / res.n_control / res.n_far_away_obs + event-study per-cell n_obs all reflect survey_finite_mask on the survey path (matches the effective weighted estimation sample).
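The drop-first concern in the score_pad_mask bullet can be illustrated with a small pure-Python sketch. It assumes only that codes are assigned in order of first appearance, which is the default ordering of pd.factorize; the helper factorize_first_seen and the toy panel are hypothetical and do not reproduce _build_butts_fe_design_csr.

```python
def factorize_first_seen(labels):
    """Assign integer codes in order of first appearance (pd.factorize's default ordering)."""
    codes, seen = [], {}
    for lab in labels:
        # setdefault returns the existing code, or registers the next free code.
        codes.append(seen.setdefault(lab, len(seen)))
    return codes, list(seen)

# Hypothetical toy panel: unit "c" appears first but carries zero survey weight.
units = ["c", "a", "a", "b", "b"]
weights = [0.0, 1.0, 1.0, 2.0, 2.0]

# Factorizing every row makes "c" the code-0 unit, i.e. the one a drop-first rule removes.
_, uniques_all = factorize_first_seen(units)

# Factorizing only positive-weight rows makes "a" the code-0 unit instead.
active_units = [u for u, w in zip(units, weights) if w > 0]
_, uniques_active = factorize_first_seen(active_units)

dropped_all = uniques_all[0]
dropped_active = uniques_active[0]
```

Because the reference unit differs between the two factorizations, the fixed-effect column space seen by stage 1 depends on whether zero-weight rows are included, which is the reason given above for building gamma_hat/Psi on survey_finite_mask only.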

Methodology references (required if estimator / math changes)

  • Method name(s): SpilloverDiD survey-design analytical variance under Wave E.3 full-design retention (extending Wave E.1 Binder TSL + Wave D Gardner GMM + Wave E.2/follow-up stratified-Conley + serial Bartlett)
  • Paper / source link(s):
    • Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. §2.5 Domains and subpopulations. https://doi.org/10.1002/9780470580066
    • In-library precedents: diff_diff/imputation.py:2175-2183 (PreTrendsImputation lead regression), diff_diff/prep.py:1401-1432 (DCDH cell variance)
    • Inherited synthesis: Binder (1983), Gerber (2026) Prop 1, Butts (2021) §3.1, Gardner (2022) §4, Conley (1999), Newey-West (1987)
  • Any intentional deviations from the source: None. Library-convention adoption — Wave E.3 propagates an existing in-library zero-pad pattern to the SpilloverDiD survey path; no new methodology beyond aligning with R svyrecvar(subset()) semantics.

Validation

  • Tests added/updated: tests/test_spillover.py — new TestSpilloverDiDWaveE3SubpopulationFullDesign (14 tests) + TestSpilloverDiDWaveE3SubpopulationFullDesignEventStudy (5 tests). 316 spillover regression tests pass on both DIFF_DIFF_BACKEND=python and DIFF_DIFF_BACKEND=rust. Pre-existing Wave E.1 test_p2_finite_mask_forces_drop_under_survey assertion flipped from n_psu=8 (subset) to n_psu=10 (full domain) to reflect the new contract.
  • Backtest / simulation / notebook evidence: N/A — analytical methodology change.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…tion

Closes the user-facing limitation documented at REGISTRY.md:3249 since
Wave E.1: SurveyDesign.subpopulation()-derived designs and warn-and-drop
fits now preserve the full-domain resolved survey design. n_psu /
n_strata / df_survey / Binder TSL per-stratum centering reflect the
FULL domain rather than the post-finite_mask fit sample.
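To see why full-domain retention changes the variance, here is a minimal sketch using the textbook stratified between-PSU form (no FPC, scalar score). The function stratified_psu_meat and the toy design are hypothetical; this is not the library's Binder TSL code, only an illustration of how n_h and per-stratum centering differ between the full domain and a subset.

```python
import numpy as np

def stratified_psu_meat(psu_totals, strata):
    """Sum over strata of n_h/(n_h-1) * sum_i (z_hi - zbar_h)^2 for scalar PSU totals."""
    meat = 0.0
    for h in np.unique(strata):
        z = psu_totals[strata == h]
        n_h = z.size
        if n_h < 2:
            continue  # single-PSU strata contribute nothing in this sketch
        meat += n_h / (n_h - 1) * np.sum((z - z.mean()) ** 2)
    return meat

# Hypothetical design: 2 strata x 3 PSUs; the PSU at index 2 lies entirely outside the domain,
# so its score total is zero under the zero-pad convention.
strata = np.array([0, 0, 0, 1, 1, 1])
psu_totals = np.array([1.0, -1.0, 0.0, 2.0, 0.5, -2.5])
in_domain = np.array([True, True, False, True, True, True])

# Full-domain design: stratum 0 keeps n_h = 3.
meat_full = stratified_psu_meat(psu_totals, strata)
# Subsetted design: stratum 0 shrinks to n_h = 2, changing the n_h/(n_h-1) factor.
meat_subset = stratified_psu_meat(psu_totals[in_domain], strata[in_domain])
```

In this toy case the two designs give different values (18.75 versus 19.75), and the PSU count entering any degrees-of-freedom calculation differs as well (6 versus 5).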

Documented synthesis (library-convention adoption, not new methodology):
adopts the R survey::svyrecvar(subset()) zero-pad convention (Lumley
2010 §2.5) already established in imputation.py:2175-2183 and
prep.py:1401-1432. Propagates the same convention to SpilloverDiD's
Wave E.1 Binder TSL × Wave D Gardner GMM × Wave E.2/follow-up
stratified-Conley + serial Bartlett meat.

Mechanical realization (one new _compute_gmm_corrected_meat kwarg):
the gamma_hat / Psi build stays on the SURVEY-FINITE-MASK subset of
fit-sample inputs (= finite_mask & (survey_weights > 0)) so the
drop-first stage-1 FE column space is INVARIANT to zero-weight subpop
rows (critical because _build_butts_fe_design_csr re-factorizes via
pd.factorize and drops the first unit/time code; including zero-weight
rows would silently shift gamma_hat). _compute_gmm_corrected_meat
gains a new optional kwarg score_pad_mask: when supplied, the helper
zero-pads the survey-finite-mask Psi to full panel length AFTER
construction but BEFORE kernel dispatch via Psi_padded[score_pad_mask]
= Psi. Kernel-dispatch arrays (cluster_ids, conley_coords, conley_time,
conley_unit, resolved_survey) are passed at FULL length so meat
helpers see the full-domain PSU / strata / centroid / time geometry.
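The zero-pad step described above can be sketched as follows. The helper zero_pad_scores and the toy arrays are hypothetical; only the assignment Psi_padded[score_pad_mask] = Psi is taken from the description, and the real _compute_gmm_corrected_meat signature is not reproduced here.

```python
import numpy as np

def zero_pad_scores(psi_active, score_pad_mask):
    """Scatter active-sample score rows into a full-length array; excluded rows stay zero."""
    score_pad_mask = np.asarray(score_pad_mask, dtype=bool)
    if psi_active.shape[0] != int(score_pad_mask.sum()):
        raise ValueError("psi_active needs one row per True entry in score_pad_mask")
    psi_padded = np.zeros((score_pad_mask.size, psi_active.shape[1]), dtype=psi_active.dtype)
    psi_padded[score_pad_mask] = psi_active  # same assignment as in the commit message
    return psi_padded

# Hypothetical 5-row panel: row 2 is non-finite, row 1 has zero survey weight.
finite_mask = np.array([True, True, False, True, True])
survey_weights = np.array([1.0, 0.0, 2.0, 1.5, 1.0])
survey_finite_mask = finite_mask & (survey_weights > 0)

psi = np.arange(6, dtype=float).reshape(3, 2)  # scores built on the 3 active rows
psi_padded = zero_pad_scores(psi, survey_finite_mask)
```

The padded array has the same length as the full-length cluster, coordinate, and survey-design arrays, so a downstream kernel can index all of them row by row without any re-alignment.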

The stage-2 OLS solve still runs on X_2_kept / y_tilde_fit (active
sample); only the meat-helper boundary sees full-length arrays.

count_mask invariant: top-level res.n_obs / res.n_treated /
res.n_control / res.n_far_away_obs AND event-study
event_study_meta["n_obs_per_col"] / att_dynamic.n_obs /
event_study_effects[k]["n_obs"] / spillover_effects.n_obs all use
count_mask (= survey_finite_mask on the survey path) so the reported
counts match the effective weighted sample.
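A minimal sketch of the counting rule, assuming a binary treatment indicator D_it and toy inputs (all hypothetical). It covers only n_obs and n_treated; the library's definitions of n_control and n_far_away_obs are not reproduced.

```python
import numpy as np

# Hypothetical 6-row panel.
D_it = np.array([1, 1, 0, 0, 1, 0])
finite_mask = np.array([True, True, True, False, True, True])
survey_weights = np.array([1.0, 0.0, 2.0, 1.0, 0.5, 0.0])

# On the survey path, count_mask equals survey_finite_mask.
count_mask = finite_mask & (survey_weights > 0)

n_obs = int(count_mask.sum())            # rows in the effective weighted sample
n_treated = int(D_it[count_mask].sum())  # treated rows within that sample
```

The parentheses around survey_weights > 0 matter: in Python, & binds more tightly than >, so the unparenthesized form would apply & to finite_mask and survey_weights before the comparison.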

Cross-surface n_psu consistency: top-level res.n_psu reads from
len(resolved_survey_fit.weights) on the implicit-PSU branch (was
int(finite_mask.sum())), so res.n_psu == res.survey_metadata.n_psu
on weights-only / strata-only survey designs under warn-and-drop.

A2 invariant (locked in _scratch/wave_e3_smoke.py): warn-and-drop and
SurveyDesign.subpopulation() drops apply the same zero-pad mechanism —
both produce identical meat output for identical row-level exclusions.

Restrictions inherited: replicate-weight variance + subpopulation
continues to raise NotImplementedError at the Wave E.1 gate.
TwoStageDiD's analogous finite_mask + design-subset pattern at
two_stage.py:567-601 has NOT yet adopted the Wave E.3 convention;
a separate parity follow-up is tracked in TODO.md.

Tests: 19 new tests in TestSpilloverDiDWaveE3SubpopulationFullDesign
(+EventStudy mirror): pre-E.3 baseline parity via pinned goldens,
n_psu cross-surface consistency on implicit-PSU branch, zero-pad
mechanics via mock-spy, cluster-as-PSU + subpop parity, conley +
lag>0 + subpop × {explicit PSU / cluster injection / weights-only
NotImplementedError}, unit with BOTH zero weight AND no Omega_0
support, gamma_hat-build sample excludes zero-weight rows (R6 P1
regression), n_obs metadata excludes zero-weight rows, n_far_away_obs
excludes zero-weight rows, warn-drop SE drift golden, ATT bit-equality
under PSU-last-sort exclusion, exact event-study n_obs propagation
across att_dynamic / event_study_effects / spillover_effects under
treated-PSU exclusion, event-study on both is_staggered branches with
analytical + conley+lag variants. Pre-existing Wave E.1
test_p2_finite_mask_forces_drop_under_survey assertion flipped from
n_psu=8 (subset) to n_psu=10 (full domain) to reflect the new contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The Wave E.3 estimator changes match the Methodology Registry: gamma_hat/Psi are built on survey_finite_mask, score_pad_mask zero-pads only at the meat boundary, and full-design PSU/strata/df bookkeeping is retained as documented in docs/methodology/REGISTRY.md:L3270-L3352, diff_diff/spillover.py:L2876-L3332, and diff_diff/two_stage.py:L211-L355.
  • I did not find new inference anti-patterns: the touched paths still use safe_inference(), and I did not see new partial-NaN or inline t = effect / se regressions in the modified estimator code.
  • The changed count surfaces appear internally consistent with the new active-sample contract: n_obs, n_treated, n_control, n_far_away_obs, and event-study n_obs_per_col all now derive from the same count_mask/survey_finite_mask logic in diff_diff/spillover.py:L3019-L3035 and diff_diff/spillover.py:L3387-L3566.
  • Test coverage in the diff is targeted at the new risk surfaces: subpopulation zero-padding, warn-drop retention, explicit PSU vs cluster injection, no-effective-PSU fail-closed Conley, and both event-study branches are all covered in tests/test_spillover.py:L7521-L8555.
  • Remaining follow-ups are explicitly tracked in TODO.md, so they are non-blocking for this PR. See TODO.md:L148-L149.

Methodology

  • Severity P3 — Informational. Impact: none. Concrete fix: none. The only notable source-material extension is treating warn-and-drop rows the same as subpopulation-excluded rows for zero-padding/full-design retention, but that behavior is explicitly documented in the registry, so it is not a defect. See docs/methodology/REGISTRY.md:L3270-L3352.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3 — Informational. Impact: low drift risk only. Concrete fix: later extract the duplicated raw_w_for_meta construction into the shared helper already tracked in TODO.md:L149.

Tech Debt

  • Severity P3 — Informational. Impact: SpilloverDiD now has Wave E.3 parity, but TwoStageDiD still does not on the analogous subpopulation full-design-retention path. Concrete fix: implement the same zero-pad/full-design pattern in TwoStageDiD.fit() and add the targeted regression called out in TODO.md:L148.

Security

  • No findings.

Documentation/Tests

  • No findings in the diff. The docs and registry updates align with the shipped behavior, and the test additions target the right edge cases. I could not execute the suite locally because this review environment is missing pytest and numpy, so this review is static-only.

@igerber
Owner Author

igerber commented May 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3f075ba34a70101f20ee99a355659f83488c4019


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • [Newly identified] The Wave E.3 survey path now defines the effective estimation sample as survey_finite_mask, but treated-support validation still runs on full-domain D_it, so a SurveyDesign.subpopulation() that zeros out every treated row can pass the front door and only fail later via generic rank-deficiency handling.
  • The core Wave E.3 methodology changes themselves look aligned with the registry: gamma_hat/Psi are built on survey_finite_mask, score_pad_mask zero-pads at the meat boundary, and full-design PSU/strata bookkeeping is retained.
  • I did not find new inline-inference or partial-NaN regressions in the modified estimator code.
  • The remaining follow-ups in this diff are properly tracked in TODO.md and are non-blocking.
  • Static-only review: this environment is missing pytest and pandas, so I could not execute the new tests.

Path to Approval

  1. Add an active-sample treatment-support check after survey_finite_mask is computed in SpilloverDiD.fit; raise ValueError when the weighted subpopulation has zero treated observations (D_it[survey_finite_mask].sum() == 0).
  2. Add a regression in tests/test_spillover.py where SurveyDesign.subpopulation() excludes all treated rows and assert the new front-door error.

Methodology

  • Severity P1 [Newly identified]. Impact: Wave E.3’s registry note defines the survey estimating sample as finite_mask & (survey_weights > 0), but SpilloverDiD.fit() still checks treatment support on full-domain D_it.sum() before that mask is applied. A subpopulation can therefore remove every treated observation from the actual estimating sample, yet the fit still passes the identification guard and falls through to the generic rank-deficient OLS path instead of failing with a clear assumption error. Concrete fix: validate treated support on survey_finite_mask (or the later count_mask) once the survey-domain mask is known, and add a regression for the all-treated-removed subpopulation case. References: docs/methodology/REGISTRY.md:L3332-L3335, diff_diff/spillover.py:L2550-L2566, diff_diff/spillover.py:L2727-L2741, diff_diff/spillover.py:L3512-L3534, diff_diff/linalg.py:L1011-L1048.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: TwoStageDiD still lacks the analogous Wave E.3 full-design-retention path, and the metadata-resolution helper remains duplicated, but both items are explicitly tracked and are non-blocking for this PR. Concrete fix: none in this PR; keep the follow-ups in TODO.md. References: TODO.md:L148-L149.

Security

  • No findings.

Documentation/Tests

  • No additional findings. The Wave E.3 test additions are broad, but they do not cover the no-treated-active-subpopulation case above. Static-only review: pytest and pandas are unavailable in this environment, so I could not run the suite. References: tests/test_spillover.py:L7526-L8680.

…ubpop

Issue: the front-door treatment-support gate at spillover.py:2556
runs on full-domain D_it.sum() BEFORE survey_finite_mask is computed.
A SurveyDesign.subpopulation() mask that zeros out every treated row
passes that gate (because full-domain D_it.sum() > 0) but lands on a
rank-deficient stage-2 OLS solve downstream, surfacing as a generic
rank-deficiency warning instead of a clean assumption-violation
error.

Fix at spillover.py:~2745 (immediately after survey_finite_mask is
built): raise ValueError when resolved_survey is not None AND
D_it[survey_finite_mask].sum() == 0. The error message names the
survey_finite_mask construction + Wave E.3 contract + suggests
remediation (expand the mask or verify the weight column). This
matches the documented R svyrecvar(subset()) convention — domain
estimation requires the domain to contain identifying variation.
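The guard described above might look roughly like the sketch below. The function name check_active_treated_support and most of the message wording are hypothetical; only the condition D_it[survey_finite_mask].sum() == 0 and the phrase "removes EVERY treated observation" come from this commit message.

```python
import numpy as np

def check_active_treated_support(D_it, survey_finite_mask):
    """Raise if the active survey sample contains no treated observations."""
    D_it = np.asarray(D_it)
    mask = np.asarray(survey_finite_mask, dtype=bool)
    if D_it[mask].sum() == 0:
        # Message wording is illustrative apart from the quoted phrase.
        raise ValueError(
            "survey_finite_mask removes EVERY treated observation; "
            "expand the subpopulation mask or verify the weight column."
        )

# Passes: one treated row (index 0) survives the mask.
check_active_treated_support([1, 0, 1, 0], [True, True, False, False])

# Fails: both treated rows (indices 0 and 2) are masked out.
try:
    check_active_treated_support([1, 0, 1, 0], [False, True, False, True])
    guard_fired = False
except ValueError:
    guard_fired = True
```

Failing at this point gives the user an error tied to the subpopulation mask, instead of a generic rank-deficiency warning from the stage-2 solve.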

Added test_q4_subpop_excludes_all_treated_raises: subpopulation that
excludes all ever-treated units raises ValueError matching
"removes EVERY treated observation". Pre-fix would have fallen
through to rank-deficient OLS.

All 317 spillover tests pass + lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d2d660e2e9f7c17c4eb30b55c7fc0f9d6f02e569


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review outcome: the prior P1 is addressed. SpilloverDiD.fit() now checks treated support on the active survey sample (survey_finite_mask) and raises before falling through to generic rank-deficiency handling when a subpopulation removes all treated rows. diff_diff/spillover.py:L2727-L2766, tests/test_spillover.py:L8352-L8390
  • Methodology affected: SpilloverDiD’s survey-design analytical variance path (Wave E.1 Binder TSL + Wave D Gardner GMM + Wave E.2/E.2-follow-up survey Conley) plus the shared _compute_gmm_corrected_meat() helper. The Wave E.3 zero-pad implementation matches the newly documented REGISTRY contract and the cited Lumley §2.5 domain-estimation convention. diff_diff/spillover.py:L3106-L3358, diff_diff/two_stage.py:L211-L355, docs/methodology/REGISTRY.md:L3328-L3352
  • I did not find new inline-inference or partial-NaN regressions in the modified estimator path; changed inference still routes through safe_inference(). diff_diff/spillover.py:L783-L840, diff_diff/spillover.py:L3484-L3503
  • Remaining follow-up work is properly tracked in TODO.md and is non-blocking for this PR. TODO.md:L148-L149
  • Static-only review: this environment lacks numpy, pandas, scipy, and pytest, so I could not execute the new tests.

Methodology

  • No findings. The affected method is the Wave E.3 SpilloverDiD survey-variance/full-design-retention path, and the code now implements the documented “build on survey_finite_mask, then zero-pad at the meat boundary” contract without an undocumented deviation from the REGISTRY entry. diff_diff/spillover.py:L2727-L2766, diff_diff/spillover.py:L3106-L3358, diff_diff/two_stage.py:L330-L478, docs/methodology/REGISTRY.md:L3328-L3352

Code Quality

  • No findings. The modified paths continue to centralize inference through safe_inference() rather than adding new inline t-stat / p-value / CI computation. diff_diff/spillover.py:L783-L840, diff_diff/spillover.py:L3484-L3503

Performance

  • No findings from static review. The new zero-pad step is localized and linear in the number of rows. diff_diff/two_stage.py:L330-L355

Maintainability

  • No findings. score_pad_mask is optional and backward-compatible, so the helper change is contained to the SpilloverDiD caller in this PR. diff_diff/two_stage.py:L62-L80, diff_diff/spillover.py:L3340-L3358

Tech Debt

  • Severity P3. Impact: TwoStageDiD still lacks the analogous Wave E.3 full-design-retention path, so survey-subpopulation behavior remains intentionally divergent across estimators. Concrete fix: leave this as the already-tracked parity follow-up in TODO.md. TODO.md:L148
  • Severity P3. Impact: compute_survey_metadata(...) assembly remains duplicated in SpilloverDiD after the Wave E.3 edits, which raises drift risk but does not change behavior today. Concrete fix: follow the tracked helper-extraction item in TODO.md. TODO.md:L149

Security

  • No findings.

Documentation/Tests

  • No findings. The docs/registry updates are consistent with the shipped behavior, and the re-review target is covered by a dedicated regression for the “subpopulation excludes all treated rows” case. docs/methodology/REGISTRY.md:L3328-L3352, docs/api/spillover.rst:L412-L483, tests/test_spillover.py:L8352-L8390
  • Static-only limitation: I could not run the suite in this environment because numpy, pandas, scipy, and pytest are unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 21, 2026
@igerber igerber merged commit fb1c6c9 into main May 21, 2026
33 of 34 checks passed
@igerber igerber deleted the spillover-conley-wave-e3-subpopulation branch May 21, 2026 21:32
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
…n survey design

Mechanical transfer of the Wave E.3 SpilloverDiD invariant (PR igerber#482, merge
24de906) to TwoStageDiD. When the always-treated handler drops units from
the OLS sample, the resolved survey design retains its FULL-DOMAIN n_psu /
n_strata / df_survey / strata / fpc / psu arrays instead of being subsetted
via replace(...). Per-cluster scores aggregate at fit-length then zero-pad
onto the full-domain unique-PSU list via two new optional kwargs on
_compute_gmm_variance: score_pad_mask and cluster_ids_full. PSUs containing
only always-treated rows get zero score rows but still count toward G_full
for n_psu / df_survey accounting.
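The per-cluster aggregation and padding can be sketched as below. The helper pad_cluster_scores and the toy data are hypothetical; the names score_pad_mask, cluster_ids_full, keep_mask, and G_full come from the description above, but the real _compute_gmm_variance signature is not reproduced.

```python
import numpy as np

def pad_cluster_scores(scores_fit, cluster_ids_fit, cluster_ids_full):
    """Sum fit-sample scores by cluster, then place them on the full-domain cluster list."""
    full_clusters = np.unique(cluster_ids_full)  # sorted full-domain PSU labels
    padded = np.zeros((full_clusters.size, scores_fit.shape[1]))
    # Assumes every fit-sample cluster also appears in the full-domain list.
    pos = np.searchsorted(full_clusters, cluster_ids_fit)
    np.add.at(padded, pos, scores_fit)  # unbuffered add accumulates repeated clusters
    return full_clusters, padded

# Hypothetical 6-row panel with 3 PSUs; PSU 20 contains only dropped rows.
cluster_ids_full = np.array([10, 10, 20, 20, 30, 30])
keep_mask = np.array([True, True, False, False, True, True])
scores_fit = np.array([[1.0], [2.0], [3.0], [4.0]])  # one row per kept observation

full_clusters, padded = pad_cluster_scores(
    scores_fit, cluster_ids_full[keep_mask], cluster_ids_full
)
G_full = full_clusters.size
```

PSU 20 ends up with a zero score row but still counts toward G_full, which matches the accounting described for PSUs that contain only always-treated rows.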

Documented synthesis (library-convention adoption): adopts the canonical
R survey::svyrecvar(subset()) convention (Lumley 2010 §2.5), already
established at imputation.py:2175-2183 (PreTrendsImputation),
prep.py:1401-1432 (DCDH cell variance), and spillover.py (PR igerber#482).

Implementation:
- diff_diff/two_stage.py: delete L1485-1525 design-subset block; promote
  keep_mask to fit()-level scope (always defined; defaults all-True);
  cluster injection sources cluster_ids_raw from FULL-DOMAIN data[cluster]
  (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against
  resolved_survey.strata stays length-aligned; df["_survey_cluster"]
  aligned to post-drop length via resolved_survey.psu[keep_mask.values];
  _compute_gmm_variance + 3 inner _stage2_* methods gain score_pad_mask /
  cluster_ids_full kwargs; zero-pad expansion after per-cluster
  aggregation; strata/fpc obs_idx lookups use cluster_ids_full under
  padding; G < 2 unidentified gate fires on G_full when padding active.
- diff_diff/two_stage.py: _refit_ts callback subsets each replicate
  weight w_r via keep_mask.values before threading into _fit_untreated_model
  and _stage2_*, matching the keep_mask-subsetting applied to survey_weights
  in the main fit (otherwise solve_ols rejects the length mismatch and
  compute_replicate_refit_variance swallows the ValueError so replicate
  inference NaNs out).
- Always-treated warning text updated to reflect the new contract: weights
  are subsetted for OLS, design retained for variance.
- Replicate variance + always-treated: existing path now works correctly
  (score_pad_mask_arg stays None on _uses_replicate_ts paths; per-replicate
  refit handles resampling separately).

Tests (tests/test_two_stage.py):
- New TestTwoStageDiDWaveE3ParityAlwaysTreated class with 8 tests:
  (a) no-always-treated baseline, (b) full-domain df_survey preservation
  under drop, (c) full-domain n_psu reporting, (d) per-cluster zero-pad
  mock-spy on _compute_stratified_meat_from_psu_scores, (e) subpopulation
  + always-treated composition, (f) cluster-as-PSU + always-treated,
  (g) no-survey path unchanged, (h) PSU entirely-always-treated.

Tests (tests/test_replicate_weight_expansion.py):
- Strengthened test_two_stage_always_treated to assert finite overall_se /
  overall_p_value / overall_conf_int (was only asserting finite ATT,
  missing the replicate-SE regression class).
- New test_two_stage_always_treated_event_study_and_group_replicate
  exercising the event-study + group replicate refit branches end-to-end
  under always-treated drop with aggregate="all"; asserts finite se +
  p_value + conf_int on non-reference horizons and cohorts.

Docs:
- docs/methodology/REGISTRY.md: TwoStageDiD section gains "documented
  synthesis — Wave E.3 parity" note; SpilloverDiD Wave E.3 follow-up note
  updated to mark TwoStageDiD parity as shipped.
- CHANGELOG.md: Unreleased Added entry leading with documented-synthesis
  framing.
- TODO.md: drop parity follow-up row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>