Add fixed effects and absorb parameters to DifferenceInDifferences #2

Merged
igerber merged 1 commit into main from claude/init-did-library-pvNmf
Jan 1, 2026

Conversation

@igerber (Owner) commented Jan 1, 2026

  • Add fixed_effects parameter for low-dimensional categorical FE (dummy variables)
  • Add absorb parameter for high-dimensional FE (within-transformation)
  • Properly adjust degrees of freedom for absorbed fixed effects
  • Add comprehensive test suite for fixed effects functionality (8 new tests)
  • Update README with fixed effects usage examples and API documentation

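The within-transformation behind an `absorb`-style parameter can be sketched in a few lines. The following is an illustrative numpy/pandas sketch, not the library's implementation: by the Frisch-Waugh-Lovell theorem, demeaning by a single absorbed factor reproduces the dummy-variable coefficient exactly, and the absorbed levels must then be subtracted from the residual degrees of freedom.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "unit": rng.integers(0, 20, n),                # categorical FE to absorb
    "treat": rng.integers(0, 2, n).astype(float),
})
df["y"] = 1.5 * df["treat"] + 0.3 * df["unit"] + rng.normal(size=n)

# Within-transformation: demean outcome and regressor by the absorbed
# factor. One pass is exact for a single factor (Frisch-Waugh-Lovell).
g = df.groupby("unit")
y_w = df["y"] - g["y"].transform("mean")
x_w = df["treat"] - g["treat"].transform("mean")
beta_within = float((x_w @ y_w) / (x_w @ x_w))

# Equivalent low-dimensional route: explicit dummy variables.
X = np.column_stack([df["treat"], pd.get_dummies(df["unit"]).to_numpy(float)])
beta_dummies = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][0]

# Absorbed FE consume one residual degree of freedom per level.
dof = n - df["unit"].nunique() - 1  # minus slope and absorbed dummies
```

The two coefficients agree to numerical precision, which is why the within route can replace dummies when the factor has many levels.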
@igerber igerber merged commit 860f8c8 into main Jan 1, 2026
igerber added a commit that referenced this pull request Apr 16, 2026
- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and
  runs survey-aware WLS + Binder TSL IF when survey_design is active.
  Point estimate via solve_ols(weights=W_elig, weight_type='pweight');
  group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded
  to obs-level via w_i/W_g ratio, then compute_survey_if_variance for
  stratified/PSU variance. safe_inference uses df_survey.
  Rank-deficiency short-circuits to NaN to avoid point-estimate/IF
  mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When
  provided, resolves weights via _resolve_survey_for_fit and passes
  them to _validate_and_aggregate_to_cells, restoring fit-vs-helper
  parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey;
  heterogeneity Note documents the TSL IF mechanics and library
  extension disclaimer; checklist line-651 lists survey-aware
  surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py:
  TestSurveyHeterogeneity (uniform-weights match, non-uniform beta
  change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper
  match, non-pweight rejection).
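The w_i/W_g expansion described above can be illustrated in isolation. Below is a minimal numpy sketch with hypothetical variable names, not the library's code: splitting each group's influence value across its observations in proportion to weight leaves the group totals, and hence any variance built from group sums, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(5), 4)          # 5 groups x 4 observations
w = rng.uniform(0.5, 2.0, size=groups.size)  # observation-level pweights
psi_g = rng.normal(size=5)                   # group-level influence values

# Expand the group-level IF to observations via the w_i / W_g share, so
# each group's observations split its IF mass in proportion to weight.
W_g = np.bincount(groups, weights=w)
psi_i = psi_g[groups] * (w / W_g[groups])

# Summing the obs-level IF within groups recovers psi_g exactly, so a
# stratified/PSU variance computed from group sums is unaffected.
recovered = np.bincount(groups, weights=psi_i)
```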

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 16, 2026
- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when
  available, else n_gt) for FE regressions, the normalization
  denominator, contribution weights, and the Corollary 1 observation
  shares. On survey-backed inputs the outputs now match the
  observation-level pweighted TWFE estimand; non-survey path is
  byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in
  _validate_and_aggregate_to_cells when weights are provided, so that
  d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight
  subpopulation rows from tripping the fuzzy-DiD guard or inflating
  downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py —
  TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols
  verifies beta_fe matches an observation-level pweighted OLS under
  survey (would fail if n_gt was still used), and
  TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation
  verifies an injected zero-weight row with opposite treatment value
  doesn't trip the within-cell constancy check.
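The zero-weight filtering fix is easy to reproduce on a toy cell. A hedged sketch (toy column names, not the library's aggregation code): a zero-weight row carrying the opposite treatment value makes the naive cell aggregate look fuzzy, while filtering first restores the effective sample.

```python
import pandas as pd

cells = pd.DataFrame({
    "g": [1, 1, 1],
    "t": [0, 0, 0],
    "d": [1.0, 1.0, 0.0],  # last row carries the opposite treatment value
    "w": [2.0, 3.0, 0.0],  # ...but zero weight: outside the subpopulation
})

# Aggregating all rows sees d_min != d_max and would trip a fuzzy guard.
naive = cells.groupby(["g", "t"])["d"].agg(["min", "max"])

# Dropping zero-weight rows before the groupby restores the effective
# sample, so d_min == d_max within the cell.
effective = cells[cells["w"] > 0].groupby(["g", "t"])["d"].agg(["min", "max"])
```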

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 17, 2026
- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.
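The within-group constancy check reduces to a one-line groupby. A sketch with hypothetical names (`varies_within_group` is illustrative, not the library helper): strata/PSU must be constant within group, while weights may vary.

```python
import pandas as pd

def varies_within_group(df, group_col, col):
    """True if `col` takes more than one distinct value inside any group."""
    return bool(df.groupby(group_col)[col].nunique().gt(1).any())

panel_bad = pd.DataFrame({
    "g":   [1, 1, 2, 2],
    "psu": ["a", "a", "b", "c"],  # group 2 spans two PSUs: reject
})
panel_ok = pd.DataFrame({
    "g":   [1, 1, 2, 2],
    "psu": ["a", "a", "b", "b"],  # PSU constant within group: accept
    "w":   [1.0, 2.0, 1.0, 1.5],  # weights may still vary within group
})
```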

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 18, 2026
TwoStageDiD and ImputationDiD each run two iterative alternating-projection
solvers (_iterative_fe, _iterative_demean) whose convergence loop exited
silently on max_iter exhaustion, returning the current iterate as if
converged. This matches the silent-failure pattern audited under axis B of
the silent-failures initiative (findings #2-#5).

Adds a shared warn_if_not_converged helper in diff_diff.utils and calls it
from all four alternating-projection loops on non-convergence. Pattern
mirrors the existing logistic + Poisson IRLS convergence warnings in
linalg.py (lines 1329-1376). Warning-only: no new public parameter, no
behavior change on inputs that already converge.
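The warn-on-exhaustion pattern can be sketched with a toy loop. `warn_if_not_converged` below mirrors the helper's described role, but the body and the solver are illustrative, not the library's code:

```python
import warnings

def warn_if_not_converged(name, converged, max_iter):
    """Emit a warning when an iterative solver exhausts max_iter."""
    if not converged:
        warnings.warn(
            f"{name} did not converge within {max_iter} iterations; "
            "results may be inaccurate. Consider raising max_iter or tol.",
            RuntimeWarning,
        )

def iterative_demean(x, tol=1e-10, max_iter=5):
    # Toy alternating-projection stand-in: halve x until it drops
    # below tol, flagging whether the loop actually converged.
    converged = False
    for _ in range(max_iter):
        x = x / 2.0
        if abs(x) < tol:
            converged = True
            break
    warn_if_not_converged("iterative_demean", converged, max_iter)
    return x
```

The key point is warning-only: the current iterate is still returned, so inputs that already converge are byte-identical.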

Updates REGISTRY.md entries for ImputationDiD and TwoStageDiD with Note
labels describing the new signal.

Axis-B regression-lint baseline: 10 silent range(max_iter) loops -> 6
remaining (Frank-Wolfe and TROP addressed in follow-up PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 18, 2026
Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups,
not the post-filter eligible-group set used to build the bootstrap
PSU map. Panels where upstream dCDH filtering drops groups that
share PSUs with kept groups could emit a misleading "PSU coarser
than group" warning even when the effective bootstrap is one group
per PSU.

Fix: count PSUs and groups from `_eligible_group_ids` (the same set
feeding `group_id_to_psu_code_bootstrap`), preserving the within-
group-constant-PSU invariant by taking each eligible group's first
positive-weight PSU label.

P3 #2: Two docstrings said the bootstrap is "clustered at the group
level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring:
  extended to note PSU-level Hall-Mammen wild clustering under
  `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module
  docstring: documents the identity-map fast path (auto-inject
  `psu=group`), the PSU-level broadcast when PSU is strictly
  coarser, and points to REGISTRY.md for the full contract.
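For reference, the two-point Mammen distribution typically used for Hall-Mammen wild cluster bootstraps, with draws made once per PSU and broadcast to groups sharing that PSU, can be sketched as follows (illustrative names, not the library's code):

```python
import numpy as np

def mammen_weights(n, rng):
    """Two-point Mammen distribution: mean 0, variance 1, third moment 1."""
    s5 = np.sqrt(5.0)
    lo, hi = -(s5 - 1) / 2, (s5 + 1) / 2
    p_lo = (s5 + 1) / (2 * s5)
    return rng.choice(np.array([lo, hi]), size=n, p=[p_lo, 1 - p_lo])

rng = np.random.default_rng(0)
psu_of_group = np.array([0, 0, 1, 2, 2])  # groups 0,1 share a PSU; so do 3,4
v_psu = mammen_weights(psu_of_group.max() + 1, rng)
v_group = v_psu[psu_of_group]             # broadcast PSU draws to groups
```

When PSU equals group (the auto-inject identity map), the broadcast is a no-op and this reduces to ordinary group-level wild clustering.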

Full regression: 318 passing.
igerber added a commit that referenced this pull request Apr 18, 2026
Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the
group level" — which was true when the PR was written but is now
incomplete given the PSU-level Hall-Mammen wild bootstrap path
under `survey_design`. Updated to distinguish user-specified
`cluster=` (still unsupported, raises NotImplementedError) from
automatic PSU-level clustering (takes over under `survey_design`
with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) —
  rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the
  automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor
  docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503`
  `cluster=` error messages — removed "Phase 1" phrasing, added
  PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated
  to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will
add multiplier-bootstrap support for placebo and bootstrap covers
`DID_M`, `DID_+`, `DID_-` only — both stale after this PR's
L_max >= 1 placebo and event-study bootstrap paths. Rewrote to
scope the NaN-SE contract to `L_max=None` only and describe the
full bootstrap coverage (overall, joiners, leavers, per-horizon
event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map
plumbing flagged by CI review.

**P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.**
When any variance-eligible group had no positive-weight cells (all-
sentinel row in psu_codes_per_cell), the old code set valid_map=False
and left BOTH group_id_to_psu_code_bootstrap AND
psu_codes_per_cell_bootstrap as None. The bootstrap then silently
dropped to unclustered group-level instead of excluding only that
group's empty row. Fix: always populate psu_codes_per_cell_bootstrap
once the tensor is built; the cell-level path already masks out -1
cells at unroll time. Always populate group_id_to_psu_code_bootstrap
with a per-group code (use placeholder 0 for all-sentinel rows since
those groups have no IF mass and the multiplier they receive is
irrelevant on either the legacy or the cell-level path).

**P0 #2 - dense PSU codes factorized over non-eligible subset.**
`np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU
labels from groups that were filtered out of _eligible_group_ids
(e.g., singleton-baseline-excluded groups). The excluded groups'
PSUs contributed dense codes that formed gaps in the eligible
subset's map. Downstream `_generate_psu_or_group_weights` computes
`n_psu = max(code) + 1` and triggers the identity fast path when
`n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]`
silently activated independent-draws clustering for eligible groups
that should have shared a multiplier. Fix: restrict the np.unique
factorization to the eligible-subset positive-weight obs only
(`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >=
0)`), so the dense code domain exactly matches the PSUs actually
used by variance-eligible groups.
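The gapped-codes failure mode is easy to reproduce in isolation. An illustrative numpy sketch (toy labels, not the library's variables): factorizing over all groups and then subsetting leaves gaps, while factorizing over the eligible subset yields contiguous dense codes.

```python
import numpy as np

psu_labels = np.array([10, 10, 30, 30, 20])             # per-group PSU labels
eligible = np.array([True, True, True, True, False])    # last group filtered

# Factorizing over ALL groups, then subsetting, leaves a gap once the
# excluded group's PSU (20) drops out: codes come back as [0, 0, 2, 2],
# so max(code) + 1 = 3 overstates the 2 PSUs actually in use.
codes_all = np.unique(psu_labels, return_inverse=True)[1]
gapped = codes_all[eligible]

# Factorizing over the eligible subset only yields contiguous codes.
dense = np.unique(psu_labels[eligible], return_inverse=True)[1]
```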

Tests:
- `test_bootstrap_zero_weight_group_equivalent_to_removing_it`:
  fit with vs without an all-zero-weight eligible group must
  produce byte-identical bootstrap SE at the same seed (byte-
  identity would have failed before P0 #1 fix because valid_map
  flipped the PSU-aware path off for the with-zero-group fit).
- `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`:
  spies on the group_id_to_psu_code dict passed to
  `_compute_dcdh_bootstrap` under a fixture with an always-treated
  singleton-baseline group and strictly-coarser PSU among eligible
  groups. Asserts the dict's values form a contiguous `[0,
  n_unique-1]` range (no gaps from the excluded group's PSU), and
  that eligible groups sharing a PSU label receive the same dense
  code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…= workaround text

**P3 #1 (warning predicate inconsistent with "strictly coarser PSU"
contract):** the new bootstrap warning block's comment said the
warning fires only on strictly-coarser PSU designs, but the
predicate `n_psu_eff_warn < n_groups_eff_warn` could also fire on
supported varying-PSU designs whose eligible groups happened to
share PSU labels across groups. Detect within-group-varying PSU
explicitly (`.groupby("g")["p"].nunique().gt(1).any()`) and
suppress the warning in that regime. Under auto-inject PSU=group
and under within-group-varying PSU the warning now stays silent,
matching the stated contract.

**P3 #2 (`_unroll_target_to_cells` suggested `psu=<group_col>` as a
bootstrap workaround):** the Registry / CHANGELOG already clarified
that `psu=<group_col>` is ONLY a Binder TSL workaround; the cell-
level wild PSU bootstrap has no allocator fallback. The helper's
docstring and `ValueError` message still advertised it as a
bootstrap-path workaround. Dropped that suggestion and explicitly
clarified: the varying-PSU bootstrap IS the cell-level path, so
there is no legacy-allocator alternative to fall back to —
pre-processing the panel is the only workaround on the bootstrap
side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1 (methodology): mse_optimal_bandwidth now rejects boundary > d.min()
with a clear ValueError. The Phase 1b wrapper is scoped to the HAD
lower-boundary case (Design 1' with d_0 = 0 or Design 1 continuous-near-
d_lower with d_0 = min D_2). Interior or upper-boundary inputs would
silently run the boundary selector with a symmetric kernel and return
a bandwidth incompatible with the one-sided fitter. The port remains
available for interior / broader surface via
_nprobust_port.lpbwselect_mse_dpi.

P1 #2 (code quality): lprobust_bw validates in-window observation
counts at each of the three local-poly fits before calling qrXXinv:
  - variance: n_V >= o+1
  - B1: n_B1 >= o_B+1
  - B2: n_B2 >= o_B+2
Each guard raises a targeted ValueError naming the failing stage, the
bandwidth, and suggested remediation. Previously these failed with
opaque LinAlgError from Cholesky on under-determined designs.

P3 (doc): local_linear.py module docstring updated to say Phase 1b
"ships" instead of "will add"; tiny-sample test now asserts the new
ValueError contract instead of accepting any non-IndexError failure.

New behavioral tests:
- test_interior_boundary_rejected: boundary=0.5 on U(0,1) rejected
- test_upper_boundary_rejected: boundary=d.max() rejected
- test_boundary_equal_to_min_d_accepted: boundary=min(d) accepted
  (Design 1 continuous-near-d_lower path)
- test_boundary_below_min_d_accepted: boundary=0 with d.min()>0
  accepted (Design 1' path)
- test_bwcheck_none_on_tiny_sample_raises_valueerror: upgraded from
  "catch anything non-IndexError" to pytest.raises(ValueError,
  match="lprobust_bw").

153 tests pass (up from 149).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1 (methodology): mse_optimal_bandwidth now rejects Design 1
mass-point designs. When boundary > 0 and the modal fraction at
d.min() exceeds the REGISTRY-specified 2% threshold, raise
NotImplementedError pointing to the 2SLS sample-average estimator
per de Chaisemartin et al. (2026) Section 3.2.4. Design 1' with
untreated units at d=0 (boundary=0) is still accepted per Garrett
et al. (2020) application precedent.

P1 #2 (code quality): qrXXinv now catches np.linalg.LinAlgError from
Cholesky and re-raises as ValueError with a targeted message naming
the failing dimension and suggesting remediation. Duplicate-support
windows or other rank-deficient designs now fail with a clear error
instead of leaking LinAlgError out of the port.
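The guard pattern can be sketched generically. `qrxxinv_like` below is a hypothetical stand-in, assuming only that the port inverts X'X via Cholesky, as the commit describes:

```python
import numpy as np

def qrxxinv_like(x):
    """Invert X'X via Cholesky; re-raise rank deficiency as ValueError.
    Illustrative stand-in for the port's qrXXinv helper."""
    xtx = x.T @ x
    try:
        chol = np.linalg.cholesky(xtx)
    except np.linalg.LinAlgError as exc:
        raise ValueError(
            f"qrXXinv: X'X of dimension {xtx.shape[0]} is not positive "
            "definite (rank-deficient design, e.g. duplicate-support "
            "windows). Widen the bandwidth or drop collinear columns."
        ) from exc
    inv_l = np.linalg.inv(chol)        # A = L L', so inv(A) = L'^-1 L^-1
    return inv_l.T @ inv_l
```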

P3 (tests): Added TestStageDiagnosticsParity::test_R_parity covering
all four stages. Previously only V/B1/B2 were pinned; R (BWreg) was
only trivially checked for stage_d1 (scale=0 -> R=0). Now stage_b
and stage_h R values are explicitly parity-tested at 1% against R
nprobust.

New behavioral tests:
- test_mass_point_design_rejected: 10% mass at 0.1 -> NotImplementedError
- test_continuous_near_d_lower_accepted: uniform(0.1, 1.0) passes
- test_untreated_at_zero_accepted: 15% at d=0 with boundary=0 passes
- test_rank_deficient_design_raises_valueerror: rank-1 X -> ValueError
- R parity on all four stages across 3 DGPs (12 new parametrized cases)

169 tests pass (up from 153).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Reviewer correctly flagged that the 1%-of-median rule is a Phase 2
design="auto" heuristic, not Phase 1b. Backed off that over-reach.

P1 #1: Removed the min(d)/median(d) < 0.01 check. The mass-point
guard now applies uniformly (whenever d.min() > 0 and modal fraction
at d.min() > 2%) and does not gate on boundary. This still catches
the original concern (silently routing mass-point data through the
nonparametric branch) without rejecting valid Design 1' samples like
Beta(2,2) where d.min() is strictly positive but small.

P1 #2: Tightened boundary validation. The wrapper now accepts only
boundary ~ 0 (Design 1') or boundary ~ d.min() (Design 1 continuous-
near-d_lower) within float tolerance. Off-support values -- including
the previously-allowed "boundary < d.min()" path -- are rejected with
a targeted error message.

P3: Added a public-wrapper duplicate-support regression that drives a
rank-deficient X'X through the full selector stack (boundary =
d.min(), unique minimum, only 4 distinct d values) and asserts a
specific "qrXXinv" ValueError, not LinAlgError.

Test updates:
- Removed test_boundary_zero_with_positive_d_min_rejected: the case
  it modeled is now accepted (no mass point).
- Added test_boundary_zero_thin_boundary_density_accepted: Beta(2,2)
  Design 1' with vanishing boundary density now passes.
- Added test_off_support_boundary_rejected: boundary=0.5 on U(1,2).
- Added test_negative_boundary_rejected: boundary<0 rejected.
- Updated test_nonzero_boundary: uses boundary=float(d.min()), not
  boundary=1.0 (which is off the realized support of U(1,2)).

175 tests pass (up from 172).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1: boundary=0 now enforces a Design 1' support plausibility
heuristic: d.min() <= 5% * median(|d|). Samples with d.min()
substantially positive (e.g. U(0.5, 1)) are rejected with ValueError
directing the caller to boundary=float(d.min()). Threshold chosen
at 5% (not REGISTRY's 1%) so the paper's thin-boundary-density
DGPs (Beta(2,2), d.min/median ~ 3%) still pass. Reordered so the
mass-point check (NotImplementedError, paper Section 3.2.4) fires
before the support-check -- mass-point data should be redirected
to 2SLS regardless of the boundary the caller picked.

P1 #2: Empty-input front-door guard. d.size == 0 raises ValueError
with a targeted "must be non-empty" message instead of leaking
the NumPy reduction error from d.min().
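The modal-fraction and empty-input guards reduce to a few lines. A hedged sketch with an assumed helper name, not the library's function:

```python
import numpy as np

THRESHOLD = 0.02  # the 2% modal-fraction criterion described above

def modal_fraction_at_min(d):
    """Share of observations sitting exactly at the support minimum."""
    d = np.asarray(d, dtype=float)
    if d.size == 0:
        raise ValueError("dose array d must be non-empty")
    return float(np.mean(d == d.min()))

rng = np.random.default_rng(0)
d_continuous = rng.uniform(0.1, 1.0, 1000)          # continuous near d_lower
d_mass_point = np.concatenate([np.full(100, 0.1),   # 10% mass at the minimum
                               rng.uniform(0.1, 1.0, 900)])
```

A continuous sample falls well under the threshold, while the 10%-mass sample clears it and would be redirected to the 2SLS branch.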

P3 (docstring sync): _nprobust_port module docstring no longer says
weighted data can be handled by the public wrapper -- the wrapper
explicitly raises NotImplementedError. Docstring now matches the
actual contract.

P3 (deferred, same as last round): tri/uni/shifted-boundary golden
parity extension.

REGISTRY.md Phase 1b note expanded to document the full input
contract (nonnegativity, boundary applicability, Design 1' support
heuristic, mass-point redirection) so the public API surface is
fully specified in the methodology registry.

178 tests pass (up from 177).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
CI re-review P3 items, all documentation-only:

- Scenario 3 operation chain: said "analytical TSL via strata + PSU",
  but aggregate_survey()'s returned second-stage design is pweight
  with geographic PSU clustering and no stage-2 strata. Reworded to
  match the actual second-stage design surface being benchmarked.
- ImputationDiD "consistently dominant" claim in scaling finding #2
  and hotspot table row #2: at Rust medium SunAbraham clearly leads
  (0.353s vs 0.214s). Both claims narrowed to "Python all scales +
  Rust small/large" with the Rust-medium SunAbraham exception called
  out explicitly; the "together ~70-80% of the chain" framing
  preserves the optimization recommendation.
- SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo
  are the two largest at every scale/backend, but at Rust small
  bootstrap_variance slightly edges both (at sub-50ms totals, per-
  phase fixed overhead dominates ranking). Qualified to Python all
  scales + Rust medium/large.

Docs-only. No script or baseline changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…race shifts

CI re-review P1: bench_dose_response.py inherited the CDiD generator's
default cohort [2], not the documented period 3. The fallback that
would have set first_treat=3 never ran (generator already populates
first_treat), so the committed baselines measured a different cohort
onset than the scenario doc. The binarized DiD phase also hardcoded
post >= 3, which further desynced it from the actual CDiD treatment
start under the default DGP.

Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future
  DGP changes that break the single-cohort contract will fail loudly
  instead of drifting silently.
- Binarized phase now derives its post cutoff from the actual
  first_treat in the data, not a hardcoded period number. No
  opportunity to desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.
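The derive-from-data pattern for the post cutoff can be sketched as follows (hypothetical column names mirroring the description above):

```python
import pandas as pd

panel = pd.DataFrame({
    "unit":        [1, 1, 1, 2, 2, 2],
    "period":      [1, 2, 3, 1, 2, 3],
    "first_treat": [3, 3, 3, 0, 0, 0],  # 0 marks never-treated units
})

# Derive the binarized post cutoff from the realized data rather than
# hardcoding a period; assert the single-cohort contract loudly.
onsets = sorted(panel.loc[panel["first_treat"] > 0, "first_treat"].unique())
assert len(onsets) == 1, "single-cohort contract violated"
cutoff = onsets[0]
panel["post"] = (panel["period"] >= cutoff).astype(int)
```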

Structural narrative hardening:

Prior CI rounds have repeatedly re-flagged the same drift pattern:
the staggered campaign and reversible dCDH narratives make phase-
order claims at close-race cells (staggered Rust medium, dCDH at
this shape) that can flip on rerun because the two contenders are
within a few percentage points of each other. The underlying ranking
is not the right level of abstraction for narrative; the phase-share
table is. This commit rewrites both narratives to describe the
aggregate share pattern and defer per-cell ordering to the
generator-produced table. Scaling finding #2 and hotspot table row
#2 get the same treatment. Net effect: narrative claims are now
robust to rerun noise at close-race cells.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples**

When a user explicitly forced design="continuous_near_d_lower" on a
sample that actually satisfies the >2% modal-fraction mass-point
criterion, the downstream regressor shift (D - d_lower) would move the
support minimum to zero on the shifted scale. Phase 1c's mass-point
rejection guard only fires when d.min() > 0 (_validate_had_inputs), so
the silent coercion ran the nonparametric local-linear estimator on a
sample the paper (Section 3.2.4) requires to use the 2SLS branch,
producing the wrong estimand.

Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction
check on the ORIGINAL (unshifted) d_arr when the user explicitly
selects design="continuous_near_d_lower". If the fraction at d.min()
exceeds 2%, the fit raises ValueError pointing to design="mass_point"
or design="auto". design="auto" is unaffected (_detect_design already
correctly resolves such samples to mass_point).

**P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic**

The previous validator called `.astype(np.float64)` and `int(v)` on
grouped first_treat values, which crashed on otherwise-supported
string-labelled two-period panels (period in {"A","B"}, first_treat
in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value
set-membership against `{0, t_post}` with no numeric coercion.

**P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type**

When cluster was supplied, `_fit_mass_point_2sls` unconditionally
switches to the CR1 cluster-robust sandwich, but the result object
stored the REQUESTED family ("hc1" or "classical") as `vcov_type`.
`summary()` rendered correctly via the cluster_name branch, but
`to_dict()` and downstream programmatic consumers saw the stale
requested label. Fixed: when cluster is supplied, `vcov_type` is
stored as `"cr1"` regardless of the requested family. Renamed the
local variable from `vcov_effective` to `vcov_requested` to separate
the input from the effective family. Updated the
`HeterogeneousAdoptionDiDResults.summary()` branch so the cluster
rendering still works with the new stored value.

**Tests added (+8 regression):**
- TestValidateHadPanel.test_first_treat_col_with_string_periods
- TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string
- TestContinuousPathRejectsMassPoint (2 tests)
- TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base
  family when unclustered, classical+cluster collapses to cr1,
  to_dict shows effective family)

Targeted regression: 126 HAD tests + 505 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
…x dCDH headline_attribute

R1 surfaced three P1s, all legitimate:

1. StackedDiD wording mismatch. Claimed ``overall_att`` is a
   treated-share-weighted aggregate across sub-experiments; actual
   implementation (``stacked_did.py`` ~line 541) computes
   ``overall_att`` as the simple average of post-treatment event-
   study coefficients ``delta_h`` with delta-method SE. Per-horizon
   ``delta_h`` is the paper's ``theta_kappa^e`` cross-event
   aggregate, but the headline is an equally-weighted average over
   those per-horizon coefficients, not a separate cross-event
   weighting at the ATT level. Definition rewritten to describe the
   actual estimand.

2. Dead ``TwoWayFixedEffectsResults`` branch. ``TwoWayFixedEffects``
   is a subclass of ``DifferenceInDifferences`` and its ``fit()``
   returns ``DiDResults`` — there is no separate TWFE result class,
   so the ``type(results).__name__ == "TwoWayFixedEffectsResults"``
   dispatch branch was unreachable on any real fit. Removed the
   dead branch and rewrote the ``DiDResults`` branch to cover both
   2x2 DiD and TWFE interpretations explicitly (both estimators
   route here). Follow-up for future PR: persist estimator
   provenance on ``DiDResults`` (or return a dedicated TWFE result
   class) so the branch can split again; documented inline.

3. dCDH ``headline_attribute="att"``. Both dCDH branches (``DID_M``
   for ``L_max=None``, ``DID_l``/derivatives for ``L_max >= 1``)
   named ``"att"`` as the headline attribute, but
   ``ChaisemartinDHaultfoeuilleResults`` stores the headline in
   ``overall_att`` (``chaisemartin_dhaultfoeuille_results.py:357``).
   Fixed both branches to ``"overall_att"``; downstream consumers
   using the machine-readable contract now point at the correct
   attribute.
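
The corrected StackedDiD definition in item 1 (simple average of the post-treatment event-study coefficients `delta_h`, with a delta-method SE) can be sketched with illustrative numbers — `delta_h` and `V` below stand in for the estimated coefficients and their covariance, not the library's actual fit output:

```python
import numpy as np

# Equal-weight average of post-treatment event-study coefficients
# delta_h, with a delta-method SE sqrt(w' V w). Values are illustrative.
delta_h = np.array([0.8, 1.1, 0.9])            # per-horizon coefficients
V = np.diag([0.04, 0.05, 0.04])                # their covariance (assumed)
w = np.full(delta_h.size, 1.0 / delta_h.size)  # equal weights, no treated-share weighting
overall_att = float(w @ delta_h)
se = float(np.sqrt(w @ V @ w))
```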

Tests: new ``TestTargetParameterRealFitIntegration`` covers the
gap R1 P2 flagged — prior coverage was stub-based and would not
have caught any of the three P1s. Four new real-fit tests:

- ``TwoWayFixedEffects().fit(...)`` returns ``DiDResults``; target-
  parameter block uses the shared DiD/TWFE branch.
- ``StackedDiD(...).fit(...)`` on a staggered panel; the
  ``headline_attribute`` matches the actual real attribute and the
  definition names the event-study-coefficient estimand.
- ``ChaisemartinDHaultfoeuille().fit(...)`` on a reversible-
  treatment panel (both ``DID_M`` and ``DID_l`` regimes);
  ``headline_attribute == "overall_att"`` and the named attribute
  actually exists on the real fit object.

Existing stub-based dispatch tests updated: the ``test_twfe_results``
test is now ``test_did_results_mentions_twfe`` (asserts the DiD
branch describes both estimators). The dCDH stub tests now also
assert ``headline_attribute == "overall_att"``.

All 323 BR/DR tests pass (319 prior + 4 new real-fit integration).

Out of scope (plan-review MEDIUM #2 — centralizing report metadata
in a single registry shared by estimator outputs and reporting
helpers): queued as a separate PR. Current approach (string dispatch
on ``type(results).__name__`` + REGISTRY.md references) is working
but brittle; a centralized registry is the principled fix for the
TWFE-dispatch-dead-code class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
R2 surfaced one P1 methodology finding: the dCDH dynamic branch
flattened every ``L_max >= 1`` into a generic ``DID_l`` estimand,
but the library's actual ``overall_att`` contract is:

- ``L_max = None`` -> ``DID_M`` (Phase 1 per-period aggregate).
- ``L_max = 1`` -> ``DID_1`` (single-horizon per-group estimand,
  Equation 3 of the dynamic companion paper — NOT the generic
  ``DID_l``).
- ``L_max >= 2`` (no ``trends_linear``) -> ``delta`` (cost-benefit
  cross-horizon aggregate, Lemma 4;
  ``chaisemartin_dhaultfoeuille.py:2602-2634``).
- ``trends_linear = True`` AND ``L_max >= 2`` -> ``overall_att`` is
  intentionally NaN by design
  (``chaisemartin_dhaultfoeuille.py:2828-2834``). No scalar
  aggregate; per-horizon level effects live on
  ``linear_trends_effects[l]``.

Fix: ``describe_target_parameter()`` now mirrors the result class's
own ``_estimand_label()`` at
``chaisemartin_dhaultfoeuille_results.py:454-490``. New aggregation
tags: ``DID_1`` / ``DID_1_x`` / ``DID_1_fd`` / ``DID_1_x_fd`` for
single-horizon, ``delta`` / ``delta_x`` for cost-benefit, and
``no_scalar_headline`` for the trends+L_max>=2 suppression case.
On the no-scalar case, ``headline_attribute`` is ``None`` so
downstream consumers do not point at a field whose value is NaN
by design.
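
The dispatch contract above can be sketched as a small lookup (simplified: the `_x` / `_fd` covariate and first-difference variants are folded into their base tags, and the tag strings are illustrative):

```python
# Hedged sketch of the dCDH headline dispatch described above:
# returns (aggregation_tag, headline_attribute).
def dcdh_headline(l_max, trends_linear):
    if l_max is None:
        return ("DID_M", "overall_att")          # Phase 1 per-period aggregate
    if trends_linear and l_max >= 2:
        return ("no_scalar_headline", None)      # overall_att NaN by design
    if l_max == 1:
        return ("DID_1", "overall_att")          # single-horizon estimand
    return ("delta", "overall_att")              # cost-benefit aggregate
```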

Tests: added stub-based branches for every new case (``DID_1``,
``DID_1^X``, ``delta``, ``delta^X``, trends + L_max>=2 no-scalar,
trends + L_max=1 still-has-scalar) and split the real-fit
integration test into ``L_max=1`` and ``L_max=2`` real-panel
cases so the contract is enforced end-to-end per R2 P2. The
parameterized ``test_dcdh_config_branches_tag`` now covers 10 cases
and also asserts ``headline_attribute`` flips to ``None`` only on
the no-scalar case.

Docs: ``REPORTING.md`` dCDH section rewritten to match the
corrected dispatch, including the ``no_scalar_headline`` case and
the L_max=None/1/>=2 contract.

332 BR/DR tests pass.

Out of scope (still open from R1): centralizing report metadata
in a single registry shared by estimator outputs and reporting
helpers (plan-review MEDIUM #2 / R1 P2 maintainability). The
current string dispatch on ``type(results).__name__`` + explicit
REGISTRY.md citations is source-faithful but requires manual
mirroring of result-class contracts; a centralized registry is
the principled fix. Tracked for a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…full.txt schema

Two P3 cleanups from R6.

P3 #1: the StackedDiD ``target_parameter.definition`` embedded an
internal implementation line reference (``stacked_did.py`` around
line 541). That pointer is not methodology source material and
will go stale under routine estimator edits even when the estimand
itself is unchanged. Removed the reference; definition now stands
on paper/registry terms alone.

P3 #2: ``diff_diff/guides/llms-full.txt`` listed the pre-PR BR/DR
schema top-level keys and omitted ``target_parameter``, so agent-
facing documentation disagreed with the runtime schema. Added
``target_parameter`` to both schema-key lists (BR around line 1779
and DR around line 1844). Documented the field shape
(``name`` / ``definition`` / ``aggregation`` /
``headline_attribute`` / ``reference``), the dispatch tag set, and
the ``headline_attribute=None`` / ``aggregation="no_scalar_headline"``
edge case for the dCDH ``trends_linear=True, L_max>=2`` fit. Also
noted the ``headline.status="no_scalar_by_design"`` value so
guide-driven agents can dispatch correctly. UTF-8 fingerprint
preserved per ``feedback_llms_guide_utf8_fingerprint.md``
(``tests/test_guides.py`` passes).

354 BR/DR + guide tests pass (337 BR/DR + 17 guide). Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…y vs ES_avg note

Two P1 findings from R7, both addressed.

P1 #1 (schema version bump): the new ``headline.status`` /
``headline_metric.status`` value ``"no_scalar_by_design"`` added
in R4 for the dCDH ``trends_linear=True, L_max>=2`` configuration
is a breaking change per REPORTING.md stability policy (new
status-enum values are breaking — agents doing exhaustive match
will break on unknown enums). Bumped
``BUSINESS_REPORT_SCHEMA_VERSION`` and
``DIAGNOSTIC_REPORT_SCHEMA_VERSION`` from ``"1.0"`` to ``"2.0"``,
updated the in-tree schema-version tests (one explicit
``== "1.0"`` assertion and six ``"schema_version": "1.0"`` stub
dicts in BR / DR test files), added a REPORTING.md "Schema
version 2.0" note, and documented the bump in the CHANGELOG
Unreleased entry. The schemas remain marked experimental so the
formal deprecation policy does not yet apply.

P1 #2 (EfficientDiD library vs paper estimand): both
EfficientDiD branches now explicitly state that BR/DR's headline
``overall_att`` is the library's cohort-size-weighted average
over post-treatment ``(g, t)`` cells, NOT the paper's ``ES_avg``
uniform event-time average. The regime (PT-All / PT-Post)
describes identification; the aggregation choice is a separate
library-level policy that REGISTRY.md Sec. EfficientDiD
documents. Added ``cohort-size-weighted`` + ``ES_avg`` /
``post-treatment`` assertions to ``test_efficient_did_pt_all``
and ``test_efficient_did_pt_post`` so the wording is pinned.

354 BR/DR + guide + target-parameter tests pass. Black and ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…n tests

Both P3 cleanups from R8.

P3 #1 (TROP wording in rst): ``business_report.rst`` summary listed
TROP's target parameter as "factor-model residual" — which does
not match the helper / REGISTRY definition. Both say the TROP
target parameter is a factor-model-adjusted weighted average
over treated cells (not a residual). Fixed the rst wording to
"factor-model-adjusted ATT".

P3 #2 (Bacon branch untested): the exhaustiveness guard iterates
``_APPLICABILITY``, but ``BaconDecompositionResults`` is a
diagnostic read-out on the DR side and is NOT listed in
``_APPLICABILITY`` (BR rejects it with a TypeError). The helper
branch for Bacon therefore slipped through the 16-class guard.
Added two regressions:

- ``test_bacon_decomposition`` (unit-level, direct helper call):
  asserts aggregation / headline_attribute / definition wording
  / Goodman-Bacon reference.
- ``test_dr_with_bacon_result_emits_target_parameter``
  (integration): passes a real ``BaconDecompositionResults``
  from ``bacon_decompose`` on a staggered panel through DR,
  asserts the ``target_parameter`` block propagates into DR's
  schema, and confirms the named ``headline_attribute``
  (``twfe_estimate``) exists on the real fit object.

356 BR/DR + guide + target-parameter tests pass. Black and ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
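
The tie-collapse step can be sketched as follows (a minimal NumPy sketch of the described fix; the function name is illustrative, not the library's internals). After the cumsum, every row in a tie block is overwritten with the block's last cumulative value, so the result no longer depends on row order within ties:

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    # c_G(d) = sum of eps over D <= d; tied doses must share the
    # post-tie cumulative sum, not partial within-tie sums.
    c = np.cumsum(eps_sorted)
    _, first_idx, counts = np.unique(d_sorted, return_index=True,
                                     return_counts=True)
    last_idx = first_idx + counts - 1        # last row of each tie block
    return np.repeat(c[last_idx], counts)    # broadcast block-last cumsum

d = np.array([1.0, 2.0, 2.0, 3.0])
eps = np.array([0.5, -0.2, 0.3, 0.1])
# tie block at d=2.0 shares the value 0.5 - 0.2 + 0.3 = 0.6
```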

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.
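
The scale-relative short-circuit can be sketched like this (a minimal sketch under the stated tolerance; function name illustrative). The point is that the residual sum of squares is compared to the outcome's own scale, so IEEE rounding noise in an exactly linear `dy = 1 + 2*d` cannot defeat the check:

```python
import numpy as np

def effectively_exact_linear(d, dy, rel_tol=1e-24):
    # OLS fit of dy on (1, d); declare exact linearity when
    # sum(eps^2) <= rel_tol * sum(dy^2).
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    rss = float(np.sum((dy - X @ beta) ** 2))
    return rss <= rel_tol * float(np.sum(dy ** 2))

d = np.linspace(0.0, 1.0, 50)
exact = effectively_exact_linear(d, 1.0 + 2.0 * d)               # FP noise only
nonlinear = effectively_exact_linear(d, 1.0 + 2.0 * d + np.sin(5.0 * d))
```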

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass` was the plain negation of the three tests'
`reject` flags, so it could be True while `verdict` said
"inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
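
The gated contract reduces to this (a hedged sketch; the pairs stand in for the Stute / Yatchew / QUG results, and the function name is illustrative):

```python
import math

def compose_all_pass(results):
    # results: (p_value, reject) pairs; all_pass is True only when
    # every p-value is finite AND no test rejects.
    return all(math.isfinite(p) and not reject for p, reject in results)
```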

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.
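
The front-door guard amounts to (a hedged sketch; message wording and function name are illustrative):

```python
import numpy as np

def check_nonnegative_doses(d):
    # Reject negative doses up front, before the d > 0 filter can
    # silently fold them into the n_excluded_zero counter.
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError("HAD doses must be non-negative (paper Section 2); "
                         "check the dose column.")
    return d
```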

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.
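
The partition guarantee relied on here: with `kth=1`, the second-smallest element lands at index 1 and every element before it is no larger, so index 0 necessarily holds the minimum — hence `partitioned[0] = D_(1)` and `partitioned[1] = D_(2)` in O(G):

```python
import numpy as np

d_nz = np.array([3.0, 0.7, 5.2, 0.9, 2.1])
part = np.partition(d_nz, 1)   # linear-time selection, not a full sort
d1, d2 = part[0], part[1]      # two smallest positive doses, in order
```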

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive":
The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both
Stute and Yatchew linearity tests NaN" BEFORE checking whether any
conclusive test had rejected. The reviewer's example: G=2 with QUG
rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates —
the workflow emitted "inconclusive - both linearity NaN", hiding a
real assumption failure.

The paper's rule is one-way: TWFE is admissible only if NO test rejects.
A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
  1. Collect rejections from conclusive tests first. If any, that is the
     primary verdict, and unresolved-step notes are APPENDED via
     "; additional steps unresolved: ..." rather than replacing the
     rejection.
  2. Only when NO conclusive rejection exists AND a required step is
     unresolved do we return a pure "inconclusive - ..." verdict.
  3. Otherwise fall through to the partial-workflow fail-to-reject
     verdict (with "(Yatchew NaN - skipped)" suffix if applicable).
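
The reordered priority can be sketched as (a simplified, hedged sketch: the verdict strings and the `(name, p, reject)` tuples are illustrative, and the partial-workflow Yatchew-skipped suffix is omitted):

```python
def compose_verdict(tests):
    # tests: (name, p_value_or_None, reject); p None = unresolved step.
    rejections = [n for n, p, r in tests if p is not None and r]
    unresolved = [n for n, p, r in tests if p is None]
    if rejections:
        # Conclusive rejection dominates; unresolved steps are appended.
        verdict = "reject: " + ", ".join(rejections)
        if unresolved:
            verdict += "; additional steps unresolved: " + ", ".join(unresolved)
        return verdict
    if unresolved:
        return "inconclusive - " + ", ".join(unresolved) + " unresolved"
    return "fail to reject (TWFE admissible)"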

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses:
qug_test and _validate_had_panel both front-door-reject d < 0 (paper
Section 2 HAD support restriction), but the new linearity helpers only
validated shape + NaN. Negative doses are outside the method's stated
scope and could silently produce conclusive-looking output.

Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test
now raise ValueError on any d < 0 with a message directing users to
pre-process or check the dose column. Docstrings updated to list the
new contract in the Raises section.

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

R6 P2 — Docstrings / REGISTRY sync:
HADPretestReport.verdict docstring rewritten to describe the new
"rejection-first, unresolved-suffix" priority. REGISTRY Phase 3
workflow checkbox updated to document the conclusive-rejection-not-
hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.
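
The field's definition can be sketched directly (a hedged sketch of the stated rule, not the library's exact implementation):

```python
import pandas as pd

def treatment_varies_within_unit(df, unit_col, treat_col):
    # True iff any unit has more than one distinct non-NaN
    # treatment value across its observed rows.
    nunique = df.groupby(unit_col)[treat_col].nunique(dropna=True)
    return bool((nunique > 1).any())

panel = pd.DataFrame({
    "unit": [1, 1, 2, 2],
    "dose": [0.5, 0.5, 0.3, 0.7],  # unit 2's dose changes over time
})
```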

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
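
The five-condition heuristic can be sketched as follows (a minimal NumPy re-implementation of the stated rule, with moment-based skewness; the library's exact computation may differ):

```python
import numpy as np

def is_count_like(y):
    # integer-valued AND has zeros AND right-skewed AND >2 distinct
    # values AND non-negative support (the new guard).
    y = np.asarray(y, dtype=float)
    m, s = y.mean(), y.std()
    skew = float(((y - m) ** 3).mean() / s ** 3) if s > 0 else 0.0
    return bool(
        np.all(y == np.round(y))
        and np.mean(y == 0) > 0
        and skew > 0.5
        and np.unique(y).size > 2
        and y.min() >= 0.0
    )

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, 500).astype(float)
```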

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…ctive-support guard

P1 #1 (FPC validator in SurveyDesign.resolve fires on placebo with
explicit psu):
The R10 fix gated the in-fit implicit-PSU FPC validator on
bootstrap/jackknife only, but ``SurveyDesign.resolve()`` itself
enforces ``FPC >= n_PSU`` design-validity (survey.py:349-368) before
``synthetic_did.fit()`` even sees the resolved object. So a placebo
fit with explicit ``psu`` and low ``fpc`` would still raise — same
parameter-interaction problem one layer earlier in resolution.

Fix: when ``variance_method == "placebo"`` and
``survey_design.fpc is not None``, construct an FPC-stripped copy of
the SurveyDesign (``dataclasses.replace(survey_design, fpc=None)``)
BEFORE calling ``_resolve_survey_for_fit``. Emit the FPC no-op
``UserWarning`` at the same time. The original ``survey_design``
object is preserved (caller's reference unchanged); the resolved
unit-level survey design carries no FPC on placebo, so the in-fit
validators (and the downstream FPC-related dispatch flags) all
correctly skip FPC handling.

The duplicate downstream FPC no-op warning (added in R8 keyed on
``resolved_survey_unit.fpc``) becomes unreachable on placebo and is
removed.
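
The FPC-stripping step can be sketched like this (`SurveyDesign` below is a hedged stand-in with only the relevant fields, not the library's real class):

```python
import dataclasses
import warnings
from typing import Optional

@dataclasses.dataclass(frozen=True)
class SurveyDesign:
    psu: Optional[str] = None
    fpc: Optional[float] = None

def resolve_for_placebo(design):
    # Strip FPC BEFORE resolution when variance_method == "placebo",
    # emitting the no-op warning; the caller's object is untouched.
    if design.fpc is not None:
        warnings.warn("fpc is a no-op under variance_method='placebo'",
                      UserWarning)
        return dataclasses.replace(design, fpc=None)
    return design

design = SurveyDesign(psu="county", fpc=2.0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stripped = resolve_for_placebo(design)
```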

New regression
``test_placebo_low_fpc_with_explicit_psu_skips_resolve_validator``:
asserts (a) placebo with explicit psu + ``fpc < n_PSU`` succeeds
+ emits no-op warning, (b) SE matches the no-FPC fit at ``rel=1e-12``,
(c) bootstrap on the same low-FPC design still raises
``"FPC (2.0) is less than the number of PSUs"`` from
``SurveyDesign.resolve()`` — validator-skip is correctly variance-
method-gated.

P1 #2 (Case D missed effective single-support):
The Case D guard for placebo degeneracy keyed on raw control counts
(``n_c_h > n_t_h`` for at least one stratum). It missed the case
where ``n_c_h_positive < 2`` for every treated stratum: rows allow
multiple subsets, but every successful pseudo-treated mean reduces
to the unique positive-weight control's outcome (zero-weight
cohabitants contribute 0 to numerator and denominator, R11 P1).
The placebo null collapses to a single point and SE = FP noise.

Fix: extend the non-degeneracy invariant to require **both**
``n_c_h > n_t_h`` AND ``n_c_h_positive >= 2`` for at least one
treated stratum. The classical Case D shape (raw exact-count
``n_c_h == n_t_h``) and the new "effective single-support" shape
(positive-weight controls < 2 even with extra zero-weight rows) both
trigger Case D. Updated the Case D error message to enumerate
``n_c_positive`` alongside ``n_c`` / ``n_t`` per stratum.
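
The extended invariant can be sketched as (field names are illustrative, not the estimator's internals): the fit is non-degenerate only if at least one treated stratum has strictly more controls than treated units AND at least two positive-weight controls.

```python
def placebo_nondegenerate(strata):
    # strata: per-treated-stratum dicts with n_t, n_c, n_c_positive.
    return any(s["n_c"] > s["n_t"] and s["n_c_positive"] >= 2
               for s in strata)
```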

New regression
``test_placebo_full_design_raises_on_effective_single_support``:
constructs a fixture with 1 treated unit + 1 positive-weight
control + 9 zero-weight controls in stratum 0; raw guards (B/C/E)
pass but Case D fires with the new "single distinct positive-mass
pseudo-treated mean" message.

Updated existing
``test_placebo_full_design_raises_on_exact_count_stratum`` regex
to match the new message (same Case D path, slightly different
wording).

REGISTRY §SyntheticDiD Case enumeration updated: Case D now
documents both the classical (``n_c == n_t``) and effective single-
support (``n_c_positive < 2``) shapes, with the combined non-
degeneracy invariant.

Verification: 98 passed (2 new regressions; existing Case B/C/E/D-
classical guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
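The five-condition gate can be restated as a small predicate. An illustrative re-statement, not the library's implementation:

```python
import numpy as np

def is_count_like(y):
    # Illustrative version of the heuristic described above.
    y = np.asarray(y, dtype=float)
    s = y.std()
    skew = float(np.mean(((y - y.mean()) / s) ** 3)) if s > 0 else 0.0
    return bool(
        np.all(y == np.round(y))       # is_integer_valued
        and np.mean(y == 0) > 0        # pct_zeros > 0
        and skew > 0.5                 # right-skewed
        and np.unique(y).size > 2      # n_distinct_values > 2
        and y.min() >= 0.0             # value_min >= 0 (the new guard)
    )

rng = np.random.default_rng(0)
counts = rng.poisson(0.8, size=500)
spliced = counts.copy()
spliced[:10] = -1                      # small share of negative integers
```

With the guard, `spliced` no longer classifies as count-like even though the other four conditions fire.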

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R1 P0 — Stute survey path silently accepted zero-weight units, which
leak into the dose-variation check + CvM cusum + bootstrap refit while
contributing zero population mass. Extreme case: only zero-weight units
carry dose variation -> spurious finite test statistic with no warning.
Fix: strictly-positive guards on every survey-aware Stute / Yatchew /
workflow entry point (the weights= shortcut already had this; survey=
branch was the gap).
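The shape of the guard is simple; a minimal sketch (function name is an assumption, not the library's helper):

```python
import numpy as np

def require_strictly_positive(weights):
    # Reject zero, negative, or non-finite weights before they can
    # leak into dose-variation checks, the CvM cusum, or bootstrap
    # refits while carrying zero population mass.
    w = np.asarray(weights, dtype=float)
    if not np.all(np.isfinite(w)) or np.any(w <= 0):
        raise ValueError("weights must be strictly positive")
    return w
```

A zero-weight unit then fails fast instead of contributing a spurious finite test statistic.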

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only
formulas silently (the variance components are derived assuming pweight
sandwich semantics). Fix: weight_type='pweight' guards added in
_resolve_pretest_unit_weights and on every direct-helper survey= branch
(stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit
guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — workflow's row-level weights= crashed on staggered event-
study panels because _validate_multi_period_panel filters to last
cohort but the joint wrappers re-aggregate with the original full-
panel weights array. Fix: subset joint_weights to data_filtered's
rows via data.index.get_indexer(data_filtered.index) BEFORE passing
to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional-
index pattern. Survey= path is unaffected (column references resolve
internally on data_filtered).
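The positional-subsetting fix can be sketched as follows (column names and data are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"unit": [1, 1, 2, 2, 3, 3], "cohort": [2, 2, 2, 2, 3, 3]},
    index=[10, 11, 12, 13, 14, 15],   # non-default index on purpose
)
weights = np.array([1.0, 1.0, 2.0, 2.0, 0.5, 0.5])  # row-aligned

# Staggered filter keeps only the last cohort:
data_filtered = data[data["cohort"] == data["cohort"].max()]

# Subset the row-level weights POSITIONALLY before handing them on:
pos = data.index.get_indexer(data_filtered.index)
joint_weights = weights[pos]
```

Without the `get_indexer` step, the full-length array reaches a helper expecting the filtered row count and crashes.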

R1 P3 — REGISTRY C0 note still said "the same gate applies to
did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both
are stale post-C. Updated to clarify (a) workflow gate was temporary
and is now closed by C, (b) qug_test direct-helper gate remains
permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT
Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey
on stute_test / stute_joint_pretest / workflow; aweight rejection on
stute_test / workflow; fweight rejection on yatchew_hr_test; staggered
event-study workflow with weights= (catches the length-mismatch crash).
165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test
direct calls still crashed on staggered panels because the staggered-
weights subset fix from R1 was only applied at the workflow level. The
wrappers run their own _validate_had_panel_event_study() and may filter
to data_filtered, then passed the original full-panel weights array to
_resolve_pretest_unit_weights(data_filtered, ...) which expects the
filtered row count. Fix: subset row-level weights to data_filtered.index
positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights,
mirroring the workflow fix.

R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap
perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does
`dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper
Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows
through the OLS refit and the weighted CvM, not through the perturbation.
Adding `* w` would over-weight by w². Fix: update REGISTRY note to
remove the spurious `* w` and clarify the canonical form. Add a
regression that pins (a) bit-exact cvm_stat reduction at uniform weights,
(b) bootstrap p-value distributional agreement within Monte-Carlo noise.
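The two perturbation forms can be contrasted directly (illustrative arrays, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
fitted = rng.normal(size=n)
eps = rng.normal(size=n)                 # unweighted residuals
w = rng.uniform(0.5, 2.0, size=n)        # survey weights
eta_obs = rng.choice([-1.0, 1.0], n)     # multiplier draw

dy_canonical = fitted + eps * eta_obs        # weighting enters only via
                                             # the weighted refit / CvM
dy_spurious = fitted + eps * w * eta_obs     # would over-weight by w**2
```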

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled
  bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped).
  Updated to reflect the correct mechanism.
- HADPretestReport.all_pass docstring described the unweighted contract
  only; survey/weights path drops the QUG-conclusiveness gate
  (linearity-conditional admissibility per C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the
  perturbation form via cvm_stat bit-exact reduction + p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved
weights/survey on the FULL panel before _validate_multi_period_panel
applied the staggered last-cohort filter. Because
_resolve_pretest_unit_weights enforces strictly-positive per-unit
weights / pweight type / etc. on whatever data it sees, zero or
otherwise-invalid weights on the soon-to-be-dropped cohort would abort
an otherwise-valid event-study run.

Fix: defer resolution to per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path
  presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel
  (no cohort filtering on this path; original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint
  wrappers (joint_pretrends_test / joint_homogeneity_test) own
  resolution and already see data_filtered (post staggered filter).
  Row-level weights= passed through with the existing positional
  subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage
gap. Existing tests covered overall-workflow + trivial/no-PSU smokes;
the PSU-aware multiplier-bootstrap path (the core new methodology)
was unpinned for joint_homogeneity_test and the event-study workflow.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial
  SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study
  dispatch under PSU/strata clustering: validate_multi_period_panel +
  resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1
  fix regression: panel where the dropped early cohort has zero
  weights succeeds on the surviving last cohort; pre-fix this crashed
  with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity-
family pretests now accept survey=/weights= keyword-only kwargs:

- stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test,
  joint_homogeneity_test, did_had_pretest_workflow.

Stute family: PSU-level Mammen multiplier bootstrap via
generate_survey_multiplier_weights_batch. Each replicate draws
(B, n_psu) Mammen multipliers, broadcast to per-obs perturbation
eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new
_cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix
across horizons within each replicate, preserving both vector-valued
empirical-process unit-level dependence (Delgado 1993; Escanciano 2006)
AND PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling --
multiplier bootstrap is a different mechanism.
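A sketch of the PSU-level Mammen draw and the per-obs broadcast (variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_psu = 3, 4
psu_of_obs = np.array([0, 0, 1, 1, 2, 3])     # obs -> PSU map

# Two-point Mammen distribution: mean 0, variance 1, third moment 1.
a = -(np.sqrt(5) - 1) / 2                     # ~ -0.618
b = (np.sqrt(5) + 1) / 2                      # ~  1.618
p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))       # P(eta == a)

eta_psu = np.where(rng.random((B, n_psu)) < p, a, b)
eta_obs = eta_psu[:, psu_of_obs]              # one draw per PSU, shared
                                              # by all obs in that PSU
```

Sharing `eta_psu` across horizons within a replicate, rather than redrawing per horizon, is what preserves the vector-valued unit-level dependence noted above.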

Yatchew: closed-form weighted OLS + pweight-sandwich variance components
(no bootstrap):
  sigma2_lin  = sum(w * eps^2) / sum(w)
  sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w))   [Reviewer CRITICAL #2]
  sigma4_W    = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)
  T_hr        = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W
where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997 Section 3).
All three components reduce bit-exactly to existing unweighted formulas
at w=ones(G); locked at atol=1e-14 by direct helper test.
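The components and their w = 1 reduction can be sketched with synthetic inputs (residuals and differences are illustrative, not a real Yatchew fit):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 50
y = rng.normal(size=G)             # unit-level outcomes, illustrative
eps = rng.normal(size=G)           # residuals from a weighted linear fit
diff = np.diff(y)                  # adjacent differences, length G - 1
w = np.ones(G)                     # uniform weights -> unweighted case

w_avg = (w[1:] + w[:-1]) / 2.0     # w_avg_g = (w_g + w_{g-1}) / 2
sigma2_lin = np.sum(w * eps**2) / np.sum(w)
sigma2_diff = np.sum(w_avg * diff**2) / (2 * np.sum(w))
sigma4_W = np.sum(w_avg * eps[1:]**2 * eps[:-1]**2) / np.sum(w_avg)
```

At w = 1 each expression collapses to the familiar unweighted moment.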

Workflow under survey/weights: skips the QUG step with UserWarning (per
C0 deferral), sets qug=None on the report, dispatches the linearity
family with survey-aware mechanism, appends "linearity-conditional
verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the
verdict. all_pass drops the QUG-conclusiveness gate (one less
precondition). HADPretestReport.qug retyped from QUGTestResults to
Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to
None-tolerant rendering.

Pweight shortcut routing: weights= passes through a synthetic trivial
ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the
same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern
on HAD sup-t.

Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise
NotImplementedError at every entry point (defense in depth, reciprocal-
guard discipline). The per-replicate weight-ratio rescaling for the
OLS-on-residuals refit step is not covered by the multiplier-bootstrap
composition; deferred to a parallel follow-up.

Per-row weights= / survey=col aggregated to per-unit via existing HAD
helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey;
constant-within-unit invariant enforced) through new
_resolve_pretest_unit_weights helper. Strictly-positive weights required
on Yatchew (the adjacent-difference variance is undefined under
contiguous-zero blocks).

Stability invariants preserved:
- Unweighted code paths bit-exact pre-PR (the new survey/weights branch
  is a separate if arm; existing 138 pretest tests pass unchanged).
- Yatchew weighted variance components reduce to unweighted at w=1 at
  atol=1e-14 (locked by TestYatchewHRTestSurvey).
- HADPretestReport schema bit-exact on the unweighted path; qug=None
  triggers the new None-tolerant rendering only on the survey path.

20 new tests across TestHADPretestWorkflowSurveyGuards (revised from
C0 rejection-only to C functional + 2 mutex/replicate-weight retained),
TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey
(5). Full pretest suite: 158 tests pass.

Patch-level addition (additive on stable surfaces). See
docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for
the full methodology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC" but
R10 narrowed Stute-family support to pweight + PSU + FPC only
(stratified rejected with NotImplementedError pending derivation).
Updated to reflect the actual support surface and consolidated the
stratified-Stute follow-up alongside replicate-weight pretests as the
two known Phase 4.5 C follow-ups.

R12 P3 #2 -- the new survey test matrix covered pweight-only and
PSU-only smokes but no FPC-only case. The bootstrap helper applies
sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned
by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with
  ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with
  SurveyDesign(weights=, fpc=) column reference.
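The scaling being pinned is a one-liner; a sketch with an illustrative sampling fraction and multiplier family:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 0.4                                          # sampling fraction n / N
eta = rng.choice([-1.0, 1.0], size=(200, 10))    # unit-variance multipliers

eta_fpc = np.sqrt(1.0 - f) * eta                 # FPC shrinks bootstrap
                                                 # variance by (1 - f)
```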

193 pretest tests pass (was 191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…erage

P3 #1: ``to_dataframe`` method docstring at
``chaisemartin_dhaultfoeuille_results.py:1375-1379`` listed the
pre-change ``level="by_path"`` schema (no ``cband_*`` columns) even
though the implementation now returns them. Updated the bullet to
include ``cband_lower / cband_upper``, document the negative-horizon
placebo convention, and document the NaN-on-absent-band behavior.

P3 #2: ``TestByPathSupTBands::test_path_sup_t_seed_reproducibility``
only exercised the default ``rademacher`` weight family. Parameterized
over ``["rademacher", "mammen", "webb"]`` to pin that the per-path
sup-t branch correctly threads ``self.bootstrap_weights`` through
``_generate_psu_or_group_weights`` for all three multiplier families
the feature advertises. The existing OVERALL machinery handles all
three uniformly, but the per-path surface lacked direct coverage.
Each variant must produce a finite, reproducible crit on the standard
3-path fixture.

17 tests pass on TestByPathSupTBands (was 15: +2 new parameterized
variants on the existing seed_reproducibility test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R2 P1: extended dispatch-matrix coverage on the new survey_design= front
door. Added 3 test classes covering paths that PR #376 fronted but didn't
directly test:

- TestHADFitMassPointSurveyDesign: design='mass_point' + survey_design=
  smoke + legacy-alias att-parity (vcov_type='hc1' required by the Phase
  4.5 B mass-point + survey deviation).
- TestHADFitEventStudySurveyDesign: aggregate='event_study' + cband=True +
  survey_design= smoke + legacy survey= parity (full bit-equality on att,
  se under same seed + design).
- TestDidHadPretestWorkflowEventStudySurveyDesign: workflow event-study
  smoke via survey_design=, plus legacy survey= and weights= parity. The
  weights= parity test also locks the R2 P3 nested-warning suppression
  (asserts exactly ONE DeprecationWarning fires from the workflow front
  door, not three from cascading joint wrappers).

R2 P3 #1: workflow's event-study `weights=` path was emitting up to 3
DeprecationWarnings (one at workflow front door + one each from the
joint wrappers' internal weights= path). Wrap the internal joint wrapper
calls in `warnings.catch_warnings() + simplefilter("ignore",
DeprecationWarning)` since the user-facing warning has already fired at
the workflow front door. Joint wrappers can't accept ResolvedSurveyDesign
(their `_resolve_pretest_unit_weights` requires a SurveyDesign with
.resolve()), so converting weights= to survey_design= via
make_pweight_design isn't an option here. Locked by the new
test_legacy_alias_parity_weights assertion `n_dep_warnings == 1`.
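The suppression pattern is standard-library machinery; a sketch with stand-in functions (not the library's wrappers):

```python
import warnings

def joint_wrapper(weights):
    # Stand-in for a joint wrapper's internal weights= path.
    warnings.warn("weights= is deprecated", DeprecationWarning)

def workflow(weights):
    warnings.warn("weights= is deprecated", DeprecationWarning)  # front door
    # The user-facing warning has already fired; silence nested copies.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        joint_wrapper(weights)
        joint_wrapper(weights)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    workflow(weights=[1.0])

n_dep_warnings = sum(
    issubclass(c.category, DeprecationWarning) for c in caught
)
```

The caller observes exactly one DeprecationWarning regardless of how many wrappers fire internally.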

R2 P3 #2: qug_test mutex error pointed users to
`survey_design=make_pweight_design(arr)` as a migration target via the
shared HAD_DUAL_KNOB_MUTEX_MSG_ARRAY_IN constant, but qug_test
permanently rejects ALL survey_design/survey/weights inputs (Phase 4.5 C0
deferral). Replaced with a qug-specific mutex message that says "no
migration path; see NotImplementedError below" instead of suggesting
make_pweight_design.

545 tests pass (was 538 + 7 new dispatch-matrix tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R9 P3 #1 (helper error message canonical-kwarg consistency):
`_resolve_pretest_unit_weights`'s TypeError on non-`SurveyDesign`-like
input still said `survey=` must be a SurveyDesign — but on the data-in
wrappers (workflow / joint_pretrends_test / joint_homogeneity_test) the
canonical kwarg is now `survey_design=`. Updated the message to name
`survey_design=` (with `survey=` flagged as the deprecated alias) and
to point pre-resolved-design users to the array-in pretest helpers,
mirroring HAD.fit's data-in guard.

R9 P3 #2 (legacy-vs-canonical parity coverage on data-in pretests):
Added 3 parity tests (test_legacy_alias_parity_survey on
joint_pretrends_test + joint_homogeneity_test, plus
test_legacy_alias_parity_survey_overall on did_had_pretest_workflow
overall path). Locks the rebinding contract on the data-in surfaces
that previously only had smoke / warning / mutex coverage.

558 tests pass (was 555 + 3 new R9 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R10 P3 #1 (qug_test deprecation warning text): qug_test was using the
shared array-in deprecation messages that point users to migrate to
`survey_design=` / `make_pweight_design(arr)`, but qug_test permanently
rejects ALL survey-aware kwargs (Phase 4.5 C0 deferral). Replaced with
qug-specific warning text that says the aliases are deprecated AND
that survey-aware QUG remains unsupported, pointing users to
`did_had_pretest_workflow(..., survey_design=...)` for the survey-aware
linearity family instead.

R10 P3 #2 (weights= parity tests on data-in wrappers): the previous
round added survey= parity for joint_pretrends_test,
joint_homogeneity_test, and did_had_pretest_workflow(aggregate='overall')
but left the weights= rebinding paths warning-only with no numerical
parity lock. Added 3 new tests:
test_legacy_alias_parity_weights (joint_pretrends_test +
joint_homogeneity_test) and test_legacy_alias_parity_weights_overall
(workflow). Each asserts that `weights=np.ones(n)` and
`survey_design=SurveyDesign(weights="w")` (a uniform 1.0 column) produce
identical numerical output, locking the rebinding contract.

561 tests pass (was 558 + 3 new R10 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…scope

P3 #1 (Methodology): qualified the "exact R match" claim across
docstring / REGISTRY / CHANGELOG / R-generator comment / parity test
docstring with a cross-reference to the existing DID^X cell-weighting
deviation (Python's first-stage uses equal cell weights, R weights
by N_gt). The two coincide on one-observation-per-(g,t) panels (the
common cell-aggregated regime that the parity scenario uses). The
multi-observation-per-cell deviation is independent of the by_path
lift and was already documented in REGISTRY's "Note (Phase 3 DID^X
covariate adjustment)".

P3 #2 (Maintainability): narrowed the Step 7b header comment in
chaisemartin_dhaultfoeuille.py:1465-1473 to spell out that DID^X
residualization applies to the per-group multi-horizon path
(event_study_effects, overall_att, joiners/leavers, by_path,
placebos, sup-t bands) but intentionally excludes per_period_effects
which stays on raw outcomes per the existing "Note (Phase 3 DID^X
covariate adjustment)" contract. Documentation-only fix; no runtime
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 26, 2026
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer:
benchmarks/R/requirements.R installed whatever CRAN currently
served via install.packages, while the generator and parity test
hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust ==
0.5.0. A fresh R environment regenerating the goldens would have
the generator's stopifnot(packageVersion == "X.Y.Z") immediately
abort.

Fix: add `install_pinned_version()` helper using
remotes::install_version with `upgrade = "never"`, run it after
the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent
when the correct version is already installed. Bump procedure
documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times:
_zip_r_python() previously asserted only that R-mapped horizons
were a SUBSET of Python's event_times (missing-in-python check).
Tighten to FULL SET EQUALITY: also reject horizons present in
Python but absent from R's requested set ("extra_in_python"). This
catches future event_study horizon-selection regressions in both
directions — e.g. if our effects/placebo cap drifts and Python
emits an extra row R didn't request.
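The tightened check amounts to full set equality with direction-labeled failures. A minimal sketch (helper name and error labels follow the description above; the exact implementation is an assumption):

```python
def assert_event_times_match(r_event_times, python_event_times):
    """Reject horizon-set drift in either direction."""
    r_set, py_set = set(r_event_times), set(python_event_times)
    missing_in_python = sorted(r_set - py_set)
    extra_in_python = sorted(py_set - r_set)
    assert not missing_in_python, f"missing_in_python: {missing_in_python}"
    assert not extra_in_python, f"extra_in_python: {extra_in_python}"

# Equal sets pass; either a dropped or an extra horizon now fails.
assert_event_times_match([-2, -1, 0, 1], [-2, -1, 0, 1])
```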

Stats: 540 tests pass, 0 regressions. Still no estimator changes
— all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request Apr 27, 2026
Closes BR/DR foundation gap igerber#6 from project_br_dr_foundation.md:
BusinessReport and DiagnosticReport now name what the headline
scalar actually represents as an estimand, for each of the 16
result classes. Baker et al. (2025) Step 2 ("define the target
parameter") was previously in BR's next_steps list but not done
by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema
change; experimental per REPORTING.md stability policy):

  {
    "name": str,               # stakeholder-facing name
    "definition": str,         # plain-English description
    "aggregation": str,        # machine-readable dispatch tag
    "headline_attribute": str, # which raw result attribute
    "reference": str,          # REGISTRY.md citation pointer
  }

Schema placement: top-level block (user preference, selected via
AskUserQuestion in planning). Aggregation tags include "simple",
"event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd",
"staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x",
"l_fd", "l_x_fd", "dose_overall", "pt_all_combined",
"pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new
``diff_diff/_reporting_helpers.py::describe_target_parameter``
(own module rather than business_report / diagnostic_report to
avoid circular-import risk — plan-review LOW igerber#7). All 17 result
classes covered (16 from _APPLICABILITY + BaconDecompositionResults);
exhaustiveness locked in by
TestTargetParameterCoversEveryResultClass.

Fit-time config reads:

- ``EfficientDiDResults.pt_assumption`` branches the aggregation
  tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause
  (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` +
  ``covariate_residuals`` + ``linear_trends_effects`` branches
  the dCDH estimand between DID_M / DID_l / DID^X_l /
  DID^{fd}_l / DID^{X,fd}_l.
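One of these fit-time reads can be sketched to show the dispatch shape. The branching condition and the `"all"` sentinel value are simplifying assumptions; only the class/attribute names and aggregation tags come from the description above:

```python
class EfficientDiDResults:
    """Stand-in result object; only pt_assumption matters for this sketch."""
    def __init__(self, pt_assumption: str):
        self.pt_assumption = pt_assumption

def describe_target_parameter(result) -> dict:
    """Map a result object to its target_parameter block (simplified)."""
    cls = type(result).__name__
    if cls == "EfficientDiDResults":
        # pt_assumption branches the aggregation tag, per the commit text.
        agg = ("pt_all_combined" if result.pt_assumption == "all"
               else "pt_post_single_baseline")
        return {
            "name": "Efficient overall ATT",
            "definition": ("Efficiency-weighted overall ATT under the "
                           f"{result.pt_assumption!r} parallel-trends regime."),
            "aggregation": agg,
            "headline_attribute": "overall_att",
            "reference": "REGISTRY.md",
        }
    # Fallback for classes this sketch does not model.
    return {"name": cls, "definition": "", "aggregation": "unknown",
            "headline_attribute": "overall_att", "reference": "REGISTRY.md"}
```

The real helper covers all 17 result classes; an exhaustiveness test over the registry keeps the dispatch from silently falling through to "unknown".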

Fixed-tag branches (per plan-review CRITICAL igerber#1 and igerber#2):

- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` /
  ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not
  change the ``overall_att`` scalar — it only populates
  additional horizon / group tables on the result object.
  Disambiguating those tables in prose is tracked under gap igerber#9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level
  assumption, not a library setting. Emits a single
  "dose_overall" tag with disjunctive definition naming both
  regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:

- BR ``_render_summary``: emits "Target parameter: <name>."
  after the headline sentence (short name only; full definition
  lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section
  between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section
  with name, definition, aggregation tag, headline attribute,
  and reference.

Cross-surface parity: both BR and DR consume the same helper
(the single source of truth), so their ``target_parameter``
blocks are byte-identical (verified by
TestTargetParameterCrossSurfaceParity).
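The parity invariant follows directly from the single-source-of-truth design and can be checked by comparing serialized blocks. The report-builder functions below are stand-ins, not the library's API:

```python
import json

def describe_target_parameter(result) -> dict:
    # Single helper both report surfaces consume (simplified stand-in).
    return {"name": "Overall ATT", "aggregation": "simple",
            "headline_attribute": "att", "reference": "REGISTRY.md"}

def business_report(result) -> dict:
    return {"target_parameter": describe_target_parameter(result)}

def diagnostic_report(result) -> dict:
    return {"target_parameter": describe_target_parameter(result)}

# Byte-identical across surfaces once serialized deterministically.
result = object()
br_block = json.dumps(business_report(result)["target_parameter"], sort_keys=True)
dr_block = json.dumps(diagnostic_report(result)["target_parameter"], sort_keys=True)
assert br_block == dr_block
```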

Tests: 37 new (TestTargetParameterPerEstimator +
TestTargetParameterFitConfigReads +
TestTargetParameterCoversEveryResultClass +
TestTargetParameterCrossSurfaceParity +
TestTargetParameterProseRendering). Existing BR/DR top-level-key
contract tests updated to include ``target_parameter``. Total
319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section
documenting the per-estimator dispatch and schema shape.
business_report.rst and diagnostic_report.rst note the new
field with a pointer to REPORTING.md. CHANGELOG entry under
Unreleased.

Out of scope: REGISTRY.md per-estimator "Target parameter"
sub-sections (plan-review additional-note); the reporting-layer
doc in REPORTING.md is the current source of truth. A follow-up
docs PR can land those sub-sections if maintainers want the
registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>