igerber · igerber · May 20, 2026 · May 20, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/README.md b/README.md
@@ -106,7 +106,7 @@ Full guide: `diff_diff.get_llm_guide("practitioner")`.
 - [SunAbraham](https://diff-diff.readthedocs.io/en/stable/api/staggered.html) - Sun & Abraham (2021) interaction-weighted estimator for heterogeneity-robust event studies
 - [ImputationDiD](https://diff-diff.readthedocs.io/en/stable/api/imputation.html) - Borusyak, Jaravel & Spiess (2024) imputation estimator, most efficient under homogeneous effects
 - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html) - Gardner (2022) two-stage estimator with GMM sandwich variance
-- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html) - Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover on near-control units; handles non-staggered and staggered timing; supports survey-design variance under `survey_design=` for HC1/CR1
+- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html) - Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover on near-control units; handles non-staggered and staggered timing; supports survey-design variance under `survey_design=` for HC1 / CR1 (Wave E.1 Binder TSL) and Conley (Wave E.2 panel-aware stratified-Conley sandwich on per-period PSU totals; `conley_lag_cutoff=0` only — serial Bartlett HAC composition queued as follow-up)
 - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html) - Synthetic DiD combining standard DiD and synthetic control for few treated units
 - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html) - triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
 - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html) - Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves

diff --git a/TODO.md b/TODO.md
@@ -134,9 +134,9 @@ Deferred items from PR reviews that were not addressed before merge.
 | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |
 | Conley + survey weights / `survey_design`. Score-reweighted meat `s_i = w_i · X_i · ε_i` is mechanical, but PSU clustering interaction with the spatial kernel and replicate-weights variance under spatial correlation are non-trivial (Bertanha-Imbens 2014 covers cluster-sample but not the explicit Conley case). Phase 5 of the spillover-conley initiative; paper review prerequisite. Currently raises `NotImplementedError` at the linalg validator. | `linalg.py::_validate_vcov_args` | Phase 5 (spillover-conley) | Medium |
 | `SyntheticDiD(vcov_type="conley")` support. Currently raises `TypeError` at `__init__` because SyntheticDiD uses `variance_method ∈ {bootstrap, jackknife, placebo}` rather than the analytical sandwich that Conley plugs into. Wiring would require either reimplementing an analytical sandwich path for SyntheticDiD or designing a spatial-block bootstrap (new methodology, Politis-Romano 1994 territory). | `synthetic_did.py::SyntheticDiD` | follow-up (spillover-conley) | Low |
-| `SpilloverDiD(vcov_type="conley", survey_design=...)` composition (Wave E.2). Wave E.1 ships HC1 / CR1 + survey via Binder TSL (Gerber 2026 Prop 1 + Wave D GMM). Wave E.2 needs the novel within-stratum Conley sandwich on PSU totals — no reference software combines Conley spatial-HAC with Binder TSL on a two-stage IF. Methodologically: aggregate Psi to PSU totals first, demean within stratum, then apply within-stratum Conley sandwich. Strata become an exact independence partition (no kernel weight crosses stratum boundaries by sampling design). | `spillover.py::SpilloverDiD.fit`, `two_stage.py::_compute_gmm_corrected_meat` | follow-up (Wave E.2) | Medium |
 | `SpilloverDiD(survey_design=...)` replicate-weight variance (BRR / Fay / JK1 / JKn / SDR). Wave E.1 ships Taylor-linearization only. Per Gerber (2026) Appendix A, the IF-reweighting shortcut does NOT apply to TwoStageDiD-class estimators because `gamma_hat` is weight-sensitive; correct support requires per-replicate full re-fit of stage 1 and stage 2 (200+ LoC of test surface beyond E.1). | `spillover.py::SpilloverDiD.fit`, `survey.py::compute_replicate_refit_variance` | follow-up | Low |
 | `SpilloverDiD(survey_design=...)` subpopulation preservation (Wave E.3). Wave E.1's `finite_mask` block physically removes zero-weight rows that lose stage-1 FE support, so `SurveyDesign.subpopulation()`-derived designs see `n_psu` / `df_survey` / Binder centering recomputed on the reduced fit sample rather than the full domain design. Standard domain-estimation practice (R `survey::svyrecvar` on a `subset()` design) preserves the original PSU/strata counts and treats out-of-domain rows as zero-score padding. Fix requires separating fit-sample alignment (Psi array) from design-level bookkeeping: preserve a full-design `resolved_survey` for inference metadata + zero-pad dropped zero-weight rows' IF contribution. Add `SurveyDesign.subpopulation()` regression test to lock the contract. | `spillover.py::SpilloverDiD.fit`, `two_stage.py::_compute_binder_tsl_meat` | follow-up (Wave E.3) | Medium |
+| `SpilloverDiD(vcov_type="conley", conley_lag_cutoff > 0, survey_design=...)` serial Bartlett HAC composition. Wave E.2 ships the panel-aware `conley_lag_cutoff = 0` case ("within-period spatial only" — `sum_t sum_h M_h_t` per `tests/test_spillover.py::TestSpilloverDiDWaveE2ConleySurveyDesign::test_b_panel_aware_per_period_sum_invariant`) and raises `NotImplementedError` upfront at `spillover.py:fit` on `conley_lag_cutoff > 0`. The serial Bartlett component (within-unit / within-PSU temporal HAC at lag ≤ L) needs to compose with the panel-aware stratified-Conley spatial sandwich — the natural addition is `meat_serial = sum_g sum_{|t-s|<=L, t!=s} (1 - |t-s|/(L+1)) * (S_psu_t[g] - S_bar_h_t)(S_psu_s[g] - S_bar_h_s)'` per PSU, summed across all PSUs in each stratum, with appropriate Binder FPC scaling — plus a methodology call on whether to include cross-period spatial pairs in the serial term. Regression goldens vs the cross-sectional limit (lag=0, which is now the shipped path). | `spillover.py::SpilloverDiD.fit`, `two_stage.py::_compute_stratified_conley_meat` | follow-up (Wave E.2 follow-up) | Medium |
 | `SpilloverDiD(ring_method="count")` extension. Currently only the nearest-treated-ring specification is exposed. Count-of-treated-in-ring (paper Section 3.2 end) is methodologically supported by Butts but re-introduces functional-form dependence; expose with an explicit kwarg gate and documentation warning. | `spillover.py::SpilloverDiD.fit` | follow-up | Low |
 | `SpilloverDiD` data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight cross-validation). | `spillover.py::SpilloverDiD` | follow-up | Low |
 | `SpilloverDiD` T22 TVA tutorial (`docs/tutorials/22_spillover_did.ipynb`): synthetic TVA-style DGP reproducing Butts (2021) Section 4 Table 1 Panel A bias-correction direction (~40% understatement). Split from the methodology PR per user-confirmed scope split (2026-05-15). | `docs/tutorials/`, `tests/test_t22_*_drift.py` | follow-up (Wave B) | Medium |

diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt
@@ -58,7 +58,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
 - [SunAbraham](https://diff-diff.readthedocs.io/en/stable/api/staggered.html): Sun & Abraham (2021) interaction-weighted estimator for heterogeneity-robust event studies
 - [ImputationDiD](https://diff-diff.readthedocs.io/en/stable/api/imputation.html): Borusyak, Jaravel & Spiess (2024) imputation estimator — most efficient under homogeneous effects
 - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance
-- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=<col>` for CR1 via Gerber (2026) Binder TSL composed with Wave D Gardner GMM correction (Conley × survey + replicate weights queued as follow-up)
+- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=<col>` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2; `conley_lag_cutoff=0` only — serial Bartlett HAC composition queued as follow-up), both composed with the Wave D Gardner GMM correction (replicate weights queued as follow-up)
 - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units
 - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
 - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves

diff --git a/diff_diff/spillover.py b/diff_diff/spillover.py
@@ -2190,22 +2190,43 @@ def fit(
         heteroskedasticity-robust SE with the GMM correction.
         """
         # Wave E.1: lift the Wave B/C/D upfront survey_design rejection.
-        # The full resolution block (pweight gate, replicate gate, unit-constant
+        # Wave E.2 (this PR): conley × survey is now supported via a
+        # stratified-Conley sandwich on PSU totals (composition of Conley
+        # 1999 + Gerber 2026 Prop 1 Binder TSL + Wave D Gardner GMM). The
+        # full resolution block (pweight gate, replicate gate, unit-constant
         # check, cluster-vs-PSU warn) runs AFTER `_validate_spillover_inputs`
         # below so it sees the panel columns the validator guarantees.
         #
-        # The conley × survey composition is genuinely novel methodology
-        # (no reference software combines spatial-HAC + Binder TSL on a
-        # two-stage IF) and ships separately as Wave E.2. Reject upfront so
-        # users get the pointer without waiting through stage 1 / 2 work.
-        if survey_design is not None and self.vcov_type == "conley":
+        # Wave E.2 scope-limit (upfront, before resolution / panel work):
+        # the panel-block Conley HAC (`conley_lag_cutoff > 0`) is NOT
+        # composed with the survey path in this PR. The stratified-Conley
+        # helper applies a cross-sectional kernel on PSU-aggregated totals;
+        # composing the within-unit serial Bartlett HAC with the within-
+        # stratum cross-PSU spatial kernel requires carrying PSU-by-time
+        # scores into the meat construction, which is a separate Wave E.x
+        # follow-up tracked in TODO.md. Reject upfront with a clear pointer
+        # so users running `survey_design=` + `conley_lag_cutoff > 0` get
+        # the error before stage-1 / 2 work (per `feedback_no_silent_failures`).
+        if (
+            survey_design is not None
+            and self.vcov_type == "conley"
+            and self.conley_lag_cutoff is not None
+            and self.conley_lag_cutoff > 0
+        ):
             raise NotImplementedError(
-                "SpilloverDiD does not yet support vcov_type='conley' "
-                "combined with survey_design=. Wave E.2 (planned) will "
-                "compose Conley spatial-HAC with within-stratum Conley "
-                "sandwich on PSU totals; see TODO.md for the planned PR. "
-                "For Wave E.1, use vcov_type='hc1' (with optional "
-                "cluster=<col> for CR1) plus survey_design=."
+                "SpilloverDiD(vcov_type='conley', conley_lag_cutoff > 0) "
+                "combined with survey_design= is not supported in Wave E.2. "
+                "The Wave E.2 stratified-Conley sandwich aggregates Psi to "
+                "PSU totals before applying the cross-sectional Conley "
+                "kernel; the panel-block decomposition (within-unit serial "
+                "Bartlett HAC over time) would require carrying PSU-by-time "
+                "scores and composing the serial kernel with the within-"
+                "stratum cross-PSU spatial kernel. This composition is "
+                "queued as a follow-up (see TODO.md). For Wave E.2, use "
+                "conley_lag_cutoff=0 (cross-sectional Conley) with "
+                "survey_design=, or use survey_design= with "
+                "vcov_type='hc1' (+ cluster=<col> for CR1) for the full "
+                "Wave E.1 path."
             )
         # Validate `anticipation` up front: must be a non-negative integer.
         # Accepting fractional or negative values would silently shift

diff --git a/diff_diff/survey.py b/diff_diff/survey.py
@@ -1898,6 +1898,160 @@ def _compute_stratified_meat_from_psu_scores(
     return meat, _variance_computed, legitimate_zero_count
 
 
+def _compute_stratified_conley_meat_from_psu_scores(
+    psu_scores: np.ndarray,
+    psu_strata: np.ndarray,
+    psu_coords: np.ndarray,
+    *,
+    cutoff: float,
+    metric,
+    kernel: str,
+    fpc_per_psu: "Optional[np.ndarray]" = None,
+    lonely_psu: str = "remove",
+) -> Tuple[np.ndarray, bool, int]:
+    """Wave E.2 stratified-Conley meat on PSU-aggregated scores.
+
+    Composes Conley (1999) spatial-HAC with Gerber (2026, arXiv:2605.04124)
+    Proposition 1 Binder TSL (the Wave E.1 foundation) and the Wave D
+    Gardner GMM first-stage uncertainty correction (Butts 2021 ss3.1 +
+    Gardner 2022 ss4). Used by SpilloverDiD's Wave E.2 GMM sandwich when
+    ``vcov_type="conley"`` is combined with ``survey_design=``.
+
+    Per-stratum loop: demean PSU scores within the stratum, apply the
+    cross-sectional Conley kernel between PSU centroids in that stratum,
+    scale by the Binder finite-population correction
+    ``(1 - f_h) * n_h/(n_h-1)``. Cross-stratum kernel weights are zero by
+    sampling design (strata are exact independence partitions); total meat
+    is the sum across strata.
+
+    Parameters
+    ----------
+    psu_scores : np.ndarray
+        Score matrix of shape (G, k) — one row per PSU.
+    psu_strata : np.ndarray
+        Stratum assignment per PSU, shape (G,).
+    psu_coords : np.ndarray
+        Per-PSU spatial centroid coordinates, shape (G, 2). Typically the
+        mean of per-observation ``conley_coords`` within each PSU.
+    cutoff : float
+        Conley spatial-HAC bandwidth in the same units as ``psu_coords``
+        (km when ``metric="haversine"``).
+    metric : str or callable
+        Distance metric; ``"haversine"`` / ``"euclidean"`` / callable per
+        :mod:`diff_diff.conley` (``ConleyMetric``).
+    kernel : str
+        Spatial kernel: ``"bartlett"`` or ``"uniform"``.
+    fpc_per_psu : np.ndarray, optional
+        FPC population size per PSU, shape (G,). All PSUs in the same
+        stratum should share the same FPC value (first occurrence used).
+    lonely_psu : str
+        How to handle singleton strata: ``"remove"``, ``"certainty"``, or
+        ``"adjust"``. Matches the existing
+        :func:`_compute_stratified_meat_from_psu_scores` behaviour exactly,
+        including the ``"adjust"`` branch's ``continue`` that skips FPC
+        scaling (with ``n_h=1`` the scale ``n_h/(n_h-1)`` would divide by
+        zero).
+
+    Returns
+    -------
+    meat : np.ndarray
+        Meat matrix of shape (k, k).
+    variance_computed : bool
+        Whether any actual variance computation happened.
+    legitimate_zero_count : int
+        Number of strata that legitimately contribute zero variance.
+
+    Notes
+    -----
+    Reduction semantics (load-bearing for tests):
+
+    - bandwidth -> 0 (Bartlett: ``K(d/tiny) = 0`` for ``d > 0`` and
+      ``K(0) = 1`` on the diagonal so K is the identity matrix): the
+      within-stratum sandwich ``sum_{j,k} K_jk c_j c_k' = sum_j c_j c_j'
+      = centered.T @ centered``, which is precisely Binder's formula at
+      :func:`_compute_stratified_meat_from_psu_scores`.
+    - Single stratum (H = 1, FPC = inf): reduces to ordinary Conley
+      sandwich on PSU totals via :func:`diff_diff.conley._compute_conley_meat`.
+
+    No reference software combines all three ingredients (Conley
+    spatial-HAC + Binder TSL + Gardner GMM correction) on a two-stage
+    influence function.
+    """
+    from diff_diff.conley import _compute_conley_meat
+
+    if psu_scores.ndim == 1:
+        psu_scores = psu_scores[:, np.newaxis]
+    k = psu_scores.shape[1]
+    meat = np.zeros((k, k))
+
+    unique_strata = np.unique(psu_strata)
+    _variance_computed = False
+    legitimate_zero_count = 0
+
+    _global_psu_mean = None
+    if lonely_psu == "adjust":
+        _global_psu_mean = psu_scores.mean(axis=0, keepdims=True)
+
+    for h in unique_strata:
+        mask_h = psu_strata == h
+        scores_h = psu_scores[mask_h]
+        coords_h = psu_coords[mask_h]
+        n_psu_h = scores_h.shape[0]
+
+        if n_psu_h < 2:
+            if lonely_psu == "remove":
+                continue
+            elif lonely_psu == "certainty":
+                legitimate_zero_count += 1
+                continue
+            elif lonely_psu == "adjust":
+                # Degenerate one-PSU kernel K = [[K(0)]] = [[1.0]] for both
+                # Bartlett and uniform; equivalent to centered.T @ centered.
+                # MUST `continue` to skip the FPC block below — with n_h = 1
+                # the scale n_h/(n_h-1) divides by zero. Mirrors the Binder
+                # helper's singleton-adjust branch exactly.
+                centered = scores_h - _global_psu_mean
+                with np.errstate(invalid="ignore", over="ignore"):
+                    meat += centered.T @ centered
+                _variance_computed = True
+                continue
+
+        f_h = 0.0
+        if fpc_per_psu is not None:
+            N_h = fpc_per_psu[mask_h][0]
+            if N_h < n_psu_h:
+                raise ValueError(
+                    f"FPC ({N_h}) is less than the number of PSUs "
+                    f"({n_psu_h}) in stratum. FPC must be >= n_PSU."
+                )
+            f_h = n_psu_h / N_h
+            if f_h >= 1.0:
+                legitimate_zero_count += 1
+
+        psu_mean_h = scores_h.mean(axis=0, keepdims=True)
+        centered = scores_h - psu_mean_h
+
+        # Within-stratum Conley sandwich on PSU-centered scores. Pass
+        # ``cluster_ids=None`` explicitly: after PSU aggregation every PSU
+        # is its own cluster, so a cluster product kernel would zero all
+        # cross-PSU pairs. See Wave E.2 plan Chunk 3 step 4.
+        conley_meat_h = _compute_conley_meat(
+            centered,
+            coords_h,
+            cutoff,
+            metric,
+            kernel,
+            cluster_ids=None,
+        )
+
+        adjustment = (1.0 - f_h) * (n_psu_h / (n_psu_h - 1))
+        with np.errstate(invalid="ignore", over="ignore"):
+            meat += adjustment * conley_meat_h
+        _variance_computed = True
+
+    return meat, _variance_computed, legitimate_zero_count
+
+
 def compute_survey_vcov(
     X: np.ndarray,
     residuals: np.ndarray,