Add initial diff-diff library implementation #1
Merged
Implement difference-in-differences (DiD) library with:
- DifferenceInDifferences estimator with sklearn-like API (fit, get_params, set_params)
- DiDResults class with statsmodels-style output (summary tables, coefficients, p-values)
- Support for formula interface (R-style) and column name interface
- Heteroskedasticity-robust (HC1) and cluster-robust standard errors
- TwoWayFixedEffects estimator for panel data
- Utility functions for parallel trends testing
- Comprehensive test suite (16 tests)
- pyproject.toml for modern Python packaging
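For orientation, the canonical 2x2 interaction regression that such an estimator computes can be sketched in plain numpy on synthetic data (this is a hedged illustration of the method, not the library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Synthetic 2x2 setup: treatment-group and post-period indicators
# with a known effect of 1.5 on the interaction.
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
true_effect = 1.5
y = (
    0.5                              # baseline mean
    + 1.0 * treated                  # time-invariant group difference
    + 0.3 * post                     # common time trend
    + true_effect * treated * post   # treatment effect (the DiD target)
    + rng.normal(0.0, 1.0, n)
)

# OLS on [1, treated, post, treated*post]; the DiD estimate is the
# coefficient on the interaction term.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_estimate = beta[3]
```

The estimator wraps exactly this kind of fit behind `fit()` and returns inference (robust/cluster SEs, summary tables) through the results object.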
igerber pushed a commit that referenced this pull request on Jan 4, 2026

Review fixes:
- Add edge case validation in _compute_flci (se > 0, 0 < alpha < 1)
- Improve significance_stars docstring explaining partial identification
- Standardize error messages to include parameter values (M, Mbar, alpha)
- Make LP solver method configurable in _solve_bounds_lp
- Add clarifying comment about constraint matrix design for pre+post periods
- Improve CallawaySantAnna error message with actionable guidance

Notes:
- #4 (sensitivity_plot export) was verified as valid - function exists at honest_did.py:1437
- #1 (pre-period effects) verified correct - LP optimization covers all periods but only post-periods contribute to objective function
igerber pushed a commit that referenced this pull request on Jan 4, 2026
igerber added a commit that referenced this pull request on Apr 16, 2026

- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and runs survey-aware WLS + Binder TSL IF when survey_design is active. Point estimate via solve_ols(weights=W_elig, weight_type='pweight'); group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded to obs-level via w_i/W_g ratio, then compute_survey_if_variance for stratified/PSU variance. safe_inference uses df_survey. Rank-deficiency short-circuits to NaN to avoid point-estimate/IF mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When provided, resolves weights via _resolve_survey_for_fit and passes them to _validate_and_aggregate_to_cells, restoring fit-vs-helper parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey; heterogeneity Note documents the TSL IF mechanics and library extension disclaimer; checklist line-651 lists survey-aware surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py: TestSurveyHeterogeneity (uniform-weights match, non-uniform beta change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper match, non-pweight rejection).

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
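The group-to-observation IF expansion described above can be sketched with illustrative arrays (stand-ins for the quantities named in the commit, not the library's internals): the key property is that the expansion preserves group totals, so obs-level variance machinery recovers the group-level target.

```python
import numpy as np

# Illustrative inputs: per-obs weights w_i, group labels, and a
# group-level influence value psi_g for each group.
w = np.array([1.0, 3.0, 2.0, 2.0, 4.0])
group = np.array([0, 0, 1, 1, 1])
psi_g = np.array([0.8, -0.5])

# Expand group-level IF to obs level: psi_i = psi_g[g(i)] * w_i / W_g,
# where W_g is the group weight total.
W_g = np.bincount(group, weights=w)
psi_i = psi_g[group] * w / W_g[group]

# Summing psi_i within a group recovers psi_g, so feeding psi_i into
# a stratified/PSU variance routine matches the group-level variance.
group_totals = np.bincount(group, weights=psi_i)
```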
igerber added a commit that referenced this pull request on Apr 16, 2026

- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when available, else n_gt) for FE regressions, the normalization denominator, contribution weights, and the Corollary 1 observation shares. On survey-backed inputs the outputs now match the observation-level pweighted TWFE estimand; non-survey path is byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in _validate_and_aggregate_to_cells when weights are provided, so that d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight subpopulation rows from tripping the fuzzy-DiD guard or inflating downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py — TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols verifies beta_fe matches an observation-level pweighted OLS under survey (would fail if n_gt was still used), and TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation verifies an injected zero-weight row with opposite treatment value doesn't trip the within-cell constancy check.

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
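A minimal pandas sketch of the zero-weight-drop fix (illustrative column names and data, not the library's `_validate_and_aggregate_to_cells`): the zero-weight row carries the opposite treatment value, so aggregating without the filter would make its cell look fuzzy.

```python
import pandas as pd

# Illustrative panel: the w=0 subpopulation row in cell (g=1, t=0)
# has d=1 while the effective row has d=0.
df = pd.DataFrame({
    "g": [1, 1, 1, 2, 2],
    "t": [0, 0, 1, 0, 1],
    "d": [0, 1, 1, 0, 0],
    "w": [2.0, 0.0, 2.0, 1.0, 1.0],
})

# Drop zero-weight rows BEFORE the groupby so d_min/d_max/n_gt
# reflect the effective sample only.
eff = df[df["w"] > 0]
cells = eff.groupby(["g", "t"])["d"].agg(d_min="min", d_max="max", n_gt="size")

# Within-cell treatment constancy now holds for every cell, so the
# fuzzy-DiD guard stays quiet.
constant = bool((cells["d_min"] == cells["d_max"]).all())
```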
igerber added a commit that referenced this pull request on Apr 16, 2026

…vey branch tests

- P1 #1: The R5 zero-weight filter only ran inside the cell aggregation step, after the NaN/coercion checks for group/time/treatment/outcome. Moved the filter to the very top of _validate_and_aggregate_to_cells so validation only sees the effective sample. fit()'s controls, trends_nonparam, and heterogeneity blocks now also scope their NaN/time-invariance checks to positive-weight rows when survey_weights is active. Legitimate SurveyDesign.subpopulation() inputs with NaN in excluded rows now fit cleanly. TSL variance path is unchanged (zero-weight obs still contribute zero psi).
- P2: 5 new regression tests in test_survey_dcdh.py — TestZeroWeightSubpopulation now covers NaN outcome and NaN het columns in excluded rows; new TestSurveyTrendsLinear / TestSurveyTrendsNonparam / TestSurveyDesign2 classes exercise survey_design combined with those previously-untested branches.

All 262 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 17, 2026

- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call it from fit() after the weight_type/replicate-weights checks. The dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as the effective sampling unit; when strata or PSU vary within group it silently spreads horizon-specific IF mass across observations in different PSUs, contaminating the stratified-PSU variance. Walk back the overstated claim at the old line 669 comment to match. Within-group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before np.unique/np.bincount so NaN / non-comparable group IDs on excluded subpopulation rows cannot crash SE factorization. psi stays full-length with zeros in excluded positions to preserve alignment with resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the within-group-constant strata/PSU requirement and the within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects varying PSU, rejects varying strata, accepts varying weights, and ignores zero-weight rows during the constancy check) plus TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
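The within-group constancy check can be sketched as a small pandas helper (an illustrative stand-in for `_validate_group_constant_strata_psu`, not the library's code): only positive-weight rows count, so zero-weight subpopulation rows cannot trip it.

```python
import pandas as pd

def validate_group_constant(df, group_col, cols, weight_col):
    """Reject designs where a column varies within a group,
    checking positive-weight rows only (illustrative sketch)."""
    pos = df[df[weight_col] > 0]
    for col in cols:
        varying = pos.groupby(group_col)[col].nunique().gt(1)
        if varying.any():
            bad = varying[varying].index.tolist()
            raise ValueError(f"{col} varies within group(s) {bad}")

df = pd.DataFrame({
    "g":      [1, 1, 2, 2],
    "strata": ["s1", "s1", "s1", "s1"],
    "psu":    ["a", "b", "c", "zzz"],   # varies in group 1; group 2's
    "w":      [1.0, 1.0, 1.0, 0.0],     # odd PSU label sits on a w=0 row
})
err = None
try:
    validate_group_constant(df, "g", ["strata", "psu"], "w")
except ValueError as e:
    err = str(e)

# Group 2 alone passes: its varying PSU label is on a zero-weight row.
ok_df = df[df["g"] == 2]
validate_group_constant(ok_df, "g", ["strata", "psu"], "w")
```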
igerber added a commit that referenced this pull request on Apr 18, 2026

Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups, not the post-filter eligible-group set used to build the bootstrap PSU map. Panels where upstream dCDH filtering drops groups that share PSUs with kept groups could emit a misleading "PSU coarser than group" warning even when the effective bootstrap is one group per PSU. Fix: count PSUs and groups from `_eligible_group_ids` (the same set feeding `group_id_to_psu_code_bootstrap`), preserving the within-group-constant-PSU invariant by taking each eligible group's first positive-weight PSU label.

P3 #2: Two docstrings said the bootstrap is "clustered at the group level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring: extended to note PSU-level Hall-Mammen wild clustering under `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module docstring: documents the identity-map fast path (auto-inject `psu=group`), the PSU-level broadcast when PSU is strictly coarser, and points to REGISTRY.md for the full contract.

Full regression: 318 passing.
igerber added a commit that referenced this pull request on Apr 18, 2026

Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the group level" — which was true when the PR was written but is now incomplete given the PSU-level Hall-Mammen wild bootstrap path under `survey_design`. Updated to distinguish user-specified `cluster=` (still unsupported, raises NotImplementedError) from automatic PSU-level clustering (takes over under `survey_design` with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) — rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503` `cluster=` error messages — removed "Phase 1" phrasing, added PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will add multiplier-bootstrap support for placebo and bootstrap covers `DID_M`, `DID_+`, `DID_-` only — both stale after this PR's L_max >= 1 placebo and event-study bootstrap paths. Rewrote to scope the NaN-SE contract to `L_max=None` only and describe the full bootstrap coverage (overall, joiners, leavers, per-horizon event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
igerber added a commit that referenced this pull request on Apr 18, 2026

_sc_weight_fw_numpy ran its iterative Frank-Wolfe loop up to max_iter (R's default: 10000) and silently returned the final iterate when the convergence check vals[t-1] - vals[t] < min_decrease^2 never triggered early exit. This matches the silent-failure pattern audited under axis B of the silent-failures initiative (finding #1); REGISTRY:1499 previously documented this as "No warning emitted".

Adds a converged flag to the numpy path and calls the shared warn_if_not_converged helper on exhaustion. Updates the REGISTRY entry to describe the new signal. Rust-backend path is unchanged; the Rust FFI function signature currently returns weights only and would need to thread a convergence status — left as an axis-G backend-parity follow-up (tracked in the Phase 2 findings).

Warning-only: no new public parameter, no change to returned weights on inputs that already converge. Axis-B regression-lint baseline: 6 -> 5 silent range(max_iter) loops remaining (TROP global outer + inner + local).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
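The converged-flag pattern described above can be sketched generically (illustrative helper, not the library's `_sc_weight_fw_numpy` or `warn_if_not_converged`): track whether the decrease criterion ever fired, and warn on loop exhaustion instead of silently returning the last iterate.

```python
import warnings

def fw_minimize(step, x0, max_iter, min_decrease):
    """Iterate until the objective decrease falls below
    min_decrease**2; warn on exhaustion (illustrative sketch)."""
    x, vals, converged = x0, [], False
    for t in range(max_iter):
        x, val = step(x)
        vals.append(val)
        if t > 0 and vals[t - 1] - vals[t] < min_decrease ** 2:
            converged = True
            break
    if not converged:
        warnings.warn(
            f"did not converge within max_iter={max_iter}; "
            "returning the final iterate", RuntimeWarning,
        )
    return x, converged

# A geometrically decreasing objective converges well before max_iter...
_, ok = fw_minimize(lambda x: (x / 2, x / 2), 1.0, 100, 1e-3)

# ...while a constant-rate decrease exhausts the budget and warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _, exhausted_ok = fw_minimize(lambda x: (x - 0.01, x - 0.01), 1.0, 5, 1e-3)
```

Warning-only by design: the returned iterate is unchanged on inputs that already converge.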
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map plumbing flagged by CI review.

**P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.** When any variance-eligible group had no positive-weight cells (all-sentinel row in psu_codes_per_cell), the old code set valid_map=False and left BOTH group_id_to_psu_code_bootstrap AND psu_codes_per_cell_bootstrap as None. The bootstrap then silently dropped to unclustered group-level instead of excluding only that group's empty row. Fix: always populate psu_codes_per_cell_bootstrap once the tensor is built; the cell-level path already masks out -1 cells at unroll time. Always populate group_id_to_psu_code_bootstrap with a per-group code (use placeholder 0 for all-sentinel rows since those groups have no IF mass and the multiplier they receive is irrelevant on either the legacy or the cell-level path).

**P0 #2 - dense PSU codes factorized over non-eligible subset.** `np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU labels from groups that were filtered out of _eligible_group_ids (e.g., singleton-baseline-excluded groups). The excluded groups' PSUs contributed dense codes that formed gaps in the eligible subset's map. Downstream `_generate_psu_or_group_weights` computes `n_psu = max(code) + 1` and triggers the identity fast path when `n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]` silently activated independent-draws clustering for eligible groups that should have shared a multiplier. Fix: restrict the np.unique factorization to the eligible-subset positive-weight obs only (`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >= 0)`), so the dense code domain exactly matches the PSUs actually used by variance-eligible groups.

Tests:
- `test_bootstrap_zero_weight_group_equivalent_to_removing_it`: fit with vs without an all-zero-weight eligible group must produce byte-identical bootstrap SE at the same seed (byte-identity would have failed before P0 #1 fix because valid_map flipped the PSU-aware path off for the with-zero-group fit).
- `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`: spies on the group_id_to_psu_code dict passed to `_compute_dcdh_bootstrap` under a fixture with an always-treated singleton-baseline group and strictly-coarser PSU among eligible groups. Asserts the dict's values form a contiguous `[0, n_unique-1]` range (no gaps from the excluded group's PSU), and that eligible groups sharing a PSU label receive the same dense code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
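The gapped-code failure mode in P0 #2 is easy to reproduce with `np.unique` on illustrative arrays (values chosen to mirror the example above, not taken from the library):

```python
import numpy as np

# PSU labels per observation; the first obs belongs to a group that
# filtering excluded from the eligible set.
obs_psu = np.array([0, 1, 1, 2])
eligible = np.array([False, True, True, True])

# Factorizing over ALL obs leaks the excluded group's PSU into the
# code domain, leaving a gap for the eligible subset:
_, codes_all = np.unique(obs_psu, return_inverse=True)
gapped = codes_all[eligible]          # [1, 1, 2]
# max(code) + 1 overstates the PSU count -- here 3 instead of 2 --
# which can trip an identity fast path even though two obs share a PSU.
n_psu_gapped = gapped.max() + 1       # 3

# The fix: factorize over the eligible subset only, so codes are
# contiguous and max(code) + 1 equals the true PSU count.
_, codes_elig = np.unique(obs_psu[eligible], return_inverse=True)
n_psu = codes_elig.max() + 1          # 2
```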
igerber added a commit that referenced this pull request on Apr 19, 2026

…rying-PSU equivalence test

Addresses the three P2 findings from the CI re-review (all P0s cleared).

1. **Warning prepass assumed one PSU per group** (`chaisemartin_dhaultfoeuille.py:2111-2148`). The old code collected `labels[0]` per eligible group, so a within-group-varying PSU design was mis-counted as having one PSU per group and emitted a misleading "strictly-coarser PSU" UserWarning. Rewrite counts unique PSU labels across all positive-weight obs of eligible groups (not just the first label per group); under PSU=group unchanged, under varying-PSU no false warning.

2. **REGISTRY heterogeneity Note still claimed NotImplementedError** (`REGISTRY.md:618`, "Combining heterogeneity= with n_bootstrap > 0 and within-group-varying PSU still raises NotImplementedError"). That gate was removed in the current PR. Update to clarify that heterogeneity inference stays analytical when bootstrap runs on the main ATT surfaces — the two inference paths are independent.

3. **Zero-weight-equivalence test used `psu=group`** (`test_bootstrap_zero_weight_group_equivalent_to_removing_it`). Under PSU=group both the buggy and correct code paths collapse to the same identity-draw structure, so the test didn't actually exercise the P0 #1 regression. Switch the fixture to within-group-varying PSU (period parity per group) so the cell-level dispatcher is invoked and the before-fix silent-dropback bug would fail this test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses the four CI review findings:
- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was labeling them as BRR, which uses a different variance formula. Fixed script, phase label, scenario doc data-shape text, and example code snippet.
- Exit-code propagation. run_scenario now records a module-level failure flag; an atexit handler os._exit(1)s if any phase recorded ok=False. run_all.py's subprocess return-code check now reliably surfaces phase failures. Verified with a forced-failure harness test.
- Path references. bench_shared.py and run_all.py docstrings plus performance-plan.md prose normalized to benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs are gitignored and regenerated per run.

Adds memory measurement:
- psutil background RSS sampler (10ms) in run_scenario writes a memory field to every scenario JSON: start, peak, growth-during-run. Zero timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution for the BRFSS-1M scenario. Separate from the timing harness so its 2-5x overhead does not contaminate wall-clock baselines.

Memory findings extend the optimization priority list without changing the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M rows grows only 23 MB of working memory (vs 46 MB input), and tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is pure CPU - confirms the precompute-scaffolding fix is low-risk and fits in any deployment target including 512 MB Lambda.

Secondary finding: staggered CS chain allocates 252-322 MB at 1,500 units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
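The background-sampler pattern can be sketched with the standard library (tracemalloc stands in for psutil RSS here so the sketch runs anywhere; the class name, interval, and fields are illustrative, not the harness's code):

```python
import threading
import tracemalloc

class MemSampler:
    """Daemon thread that polls a memory probe on a short interval
    and records start and peak (illustrative sketch of the
    background-sampler idea; the real harness samples process RSS
    via psutil)."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.start_bytes = 0
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            current, _ = tracemalloc.get_traced_memory()
            self.peak = max(self.peak, current)
            self._stop.wait(self.interval)   # one cheap sample per tick

    def __enter__(self):
        tracemalloc.start()
        self.start_bytes = tracemalloc.get_traced_memory()[0]
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # Final sample so short-lived runs still record their peak.
        self.peak = max(self.peak, tracemalloc.get_traced_memory()[0])
        tracemalloc.stop()

with MemSampler() as sampler:
    buf = [bytearray(1_000_000) for _ in range(20)]   # ~20 MB held live
growth = sampler.peak - sampler.start_bytes
```

Because the sampler does one cheap call per tick on a background thread, it adds no measurable overhead to the timed phases.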
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses the second-round CI review findings:
- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: dose_response extract phase tried to pull event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and site-packages before writing the committed txt.

Drift-prevention layer:
- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):
- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:
- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low - the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
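The marker-delimited table rewrite can be sketched with a regex substitution (illustrative helper and document, not `gen_findings_tables.py` itself, which derives the new rows from the JSON baselines):

```python
import re

def rewrite_table(doc, table_id, new_rows):
    """Swap the content between <!-- TABLE:start <id> --> and
    <!-- TABLE:end <id> --> markers, leaving hand-written narrative
    untouched (illustrative sketch of the mechanism)."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # Lambda replacement avoids backslash escaping issues in new_rows.
    return pattern.sub(lambda m: m.group(1) + new_rows + "\n" + m.group(2), doc)

doc = (
    "Narrative prose stays hand-written.\n"
    "<!-- TABLE:start brfss_timings -->\n"
    "| scenario | old |\n"
    "<!-- TABLE:end brfss_timings -->\n"
)
out = rewrite_table(doc, "brfss_timings", "| scenario | new |")
```

Keeping the markers in the committed document means a rerun can always locate and regenerate the numbers without touching surrounding prose.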
igerber added a commit that referenced this pull request on Apr 19, 2026

…= workaround text

**P3 #1 (warning predicate inconsistent with "strictly coarser PSU" contract):** the new bootstrap warning block's comment said the warning fires only on strictly-coarser PSU designs, but the predicate `n_psu_eff_warn < n_groups_eff_warn` could also fire on supported varying-PSU designs whose eligible groups happened to share PSU labels across groups. Detect within-group-varying PSU explicitly (`.groupby("g")["p"].nunique().gt(1).any()`) and suppress the warning in that regime. Under auto-inject PSU=group and under within-group-varying PSU the warning now stays silent, matching the stated contract.

**P3 #2 (`_unroll_target_to_cells` suggested `psu=<group_col>` as a bootstrap workaround):** the Registry / CHANGELOG already clarified that `psu=<group_col>` is ONLY a Binder TSL workaround; the cell-level wild PSU bootstrap has no allocator fallback. The helper's docstring and `ValueError` message still advertised it as a bootstrap-path workaround. Dropped that suggestion and explicitly clarified: the varying-PSU bootstrap IS the cell-level path, so there is no legacy-allocator alternative to fall back to — pre-processing the panel is the only workaround on the bootstrap side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1 (methodology): mse_optimal_bandwidth now rejects boundary > d.min() with a clear ValueError. The Phase 1b wrapper is scoped to the HAD lower-boundary case (Design 1' with d_0 = 0 or Design 1 continuous-near-d_lower with d_0 = min D_2). Interior or upper-boundary inputs would silently run the boundary selector with a symmetric kernel and return a bandwidth incompatible with the one-sided fitter. The port remains available for interior / broader surface via _nprobust_port.lpbwselect_mse_dpi.

P1 #2 (code quality): lprobust_bw validates in-window observation counts at each of the three local-poly fits before calling qrXXinv:
- variance: n_V >= o+1
- B1: n_B1 >= o_B+1
- B2: n_B2 >= o_B+2

Each guard raises a targeted ValueError naming the failing stage, the bandwidth, and suggested remediation. Previously these failed with opaque LinAlgError from Cholesky on under-determined designs.

P3 (doc): local_linear.py module docstring updated to say Phase 1b "ships" instead of "will add"; tiny-sample test now asserts the new ValueError contract instead of accepting any non-IndexError failure.

New behavioral tests:
- test_interior_boundary_rejected: boundary=0.5 on U(0,1) rejected
- test_upper_boundary_rejected: boundary=d.max() rejected
- test_boundary_equal_to_min_d_accepted: boundary=min(d) accepted (Design 1 continuous-near-d_lower path)
- test_boundary_below_min_d_accepted: boundary=0 with d.min()>0 accepted (Design 1' path)
- test_bwcheck_none_on_tiny_sample_raises_valueerror: upgraded from "catch anything non-IndexError" to pytest.raises(ValueError, match="lprobust_bw").

153 tests pass (up from 149).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1 (methodology): mse_optimal_bandwidth now rejects Design 1 mass-point designs. When boundary > 0 and the modal fraction at d.min() exceeds the REGISTRY-specified 2% threshold, raise NotImplementedError pointing to the 2SLS sample-average estimator per de Chaisemartin et al. (2026) Section 3.2.4. Design 1' with untreated units at d=0 (boundary=0) is still accepted per Garrett et al. (2020) application precedent.

P1 #2 (code quality): qrXXinv now catches np.linalg.LinAlgError from Cholesky and re-raises as ValueError with a targeted message naming the failing dimension and suggesting remediation. Duplicate-support windows or other rank-deficient designs now fail with a clear error instead of leaking LinAlgError out of the port.

P3 (tests): Added TestStageDiagnosticsParity::test_R_parity covering all four stages. Previously only V/B1/B2 were pinned; R (BWreg) was only trivially checked for stage_d1 (scale=0 -> R=0). Now stage_b and stage_h R values are explicitly parity-tested at 1% against R nprobust.

New behavioral tests:
- test_mass_point_design_rejected: 10% mass at 0.1 -> NotImplementedError
- test_continuous_near_d_lower_accepted: uniform(0.1, 1.0) passes
- test_untreated_at_zero_accepted: 15% at d=0 with boundary=0 passes
- test_rank_deficient_design_raises_valueerror: rank-1 X -> ValueError
- R parity on all four stages across 3 DGPs (12 new parametrized cases)

169 tests pass (up from 153).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
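The LinAlgError-to-ValueError re-raise in P1 #2 follows a standard pattern, sketched here on a Cholesky-based X'X inverse (an illustrative stand-in for qrXXinv; the real message also suggests remediation):

```python
import numpy as np

def qr_xx_inv(X):
    """Invert X'X via Cholesky, re-raising rank deficiency as a
    targeted ValueError instead of leaking LinAlgError (sketch)."""
    XtX = X.T @ X
    try:
        L = np.linalg.cholesky(XtX)
    except np.linalg.LinAlgError as e:
        raise ValueError(
            f"qrXXinv: {XtX.shape[0]}x{XtX.shape[1]} X'X is not positive "
            "definite (duplicate-support window or rank-deficient design)"
        ) from e
    Linv = np.linalg.inv(L)
    return Linv.T @ Linv          # (L L')^-1 = L'^-T ... = L'^-1 @ L^-1

# A full-rank design inverts cleanly ...
X_ok = np.column_stack([np.ones(5), np.arange(5.0)])
inv_ok = qr_xx_inv(X_ok)

# ... while a rank-1 design (duplicated column) raises the new error.
X_bad = np.column_stack([np.ones(5), np.ones(5)])
err = None
try:
    qr_xx_inv(X_bad)
except ValueError as e:
    err = str(e)
```

Chaining with `from e` preserves the original LinAlgError in the traceback while giving callers a catchable, named failure.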
igerber added a commit that referenced this pull request on Apr 19, 2026

Reviewer correctly flagged that the 1%-of-median rule is a Phase 2 design="auto" heuristic, not Phase 1b. Backed off that over-reach.

P1 #1: Removed the min(d)/median(d) < 0.01 check. The mass-point guard now applies uniformly (whenever d.min() > 0 and modal fraction at d.min() > 2%) and does not gate on boundary. This still catches the original concern (silently routing mass-point data through the nonparametric branch) without rejecting valid Design 1' samples like Beta(2,2) where d.min() is strictly positive but small.

P1 #2: Tightened boundary validation. The wrapper now accepts only boundary ~ 0 (Design 1') or boundary ~ d.min() (Design 1 continuous-near-d_lower) within float tolerance. Off-support values -- including the previously-allowed "boundary < d.min()" path -- are rejected with a targeted error message.

P3: Added a public-wrapper duplicate-support regression that drives a rank-deficient X'X through the full selector stack (boundary = d.min(), unique minimum, only 4 distinct d values) and asserts a specific "qrXXinv" ValueError, not LinAlgError.

Test updates:
- Removed test_boundary_zero_with_positive_d_min_rejected: the case it modeled is now accepted (no mass point).
- Added test_boundary_zero_thin_boundary_density_accepted: Beta(2,2) Design 1' with vanishing boundary density now passes.
- Added test_off_support_boundary_rejected: boundary=0.5 on U(1,2).
- Added test_negative_boundary_rejected: boundary<0 rejected.
- Updated test_nonzero_boundary: uses boundary=float(d.min()), not boundary=1.0 (which is off the realized support of U(1,2)).

175 tests pass (up from 172).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1: boundary=0 now enforces a Design 1' support plausibility heuristic: d.min() <= 5% * median(|d|). Samples with d.min() substantially positive (e.g. U(0.5, 1)) are rejected with ValueError directing the caller to boundary=float(d.min()). Threshold chosen at 5% (not REGISTRY's 1%) so the paper's thin-boundary-density DGPs (Beta(2,2), d.min/median ~ 3%) still pass. Reordered so the mass-point check (NotImplementedError, paper Section 3.2.4) fires before the support check -- mass-point data should be redirected to 2SLS regardless of the boundary the caller picked.

P1 #2: Empty-input front-door guard. d.size == 0 raises ValueError with a targeted "must be non-empty" message instead of leaking the NumPy reduction error from d.min().

P3 (docstring sync): _nprobust_port module docstring no longer says weighted data can be handled by the public wrapper -- the wrapper explicitly raises NotImplementedError. Docstring now matches the actual contract.

P3 (deferred, same as last round): tri/uni/shifted-boundary golden parity extension. REGISTRY.md Phase 1b note expanded to document the full input contract (nonnegativity, boundary applicability, Design 1' support heuristic, mass-point redirection) so the public API surface is fully specified in the methodology registry.

178 tests pass (up from 177).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
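The boundary contract that emerged across these review rounds can be sketched as one validation function (helper name, messages, and exact ordering are illustrative; the 5% threshold and the empty-input guard follow the commit text above):

```python
import numpy as np

def validate_boundary(d, boundary, support_frac=0.05, tol=1e-12):
    """Accept boundary ~ 0 (Design 1') only when d.min() is plausibly
    at the support edge, accept boundary ~ d.min() (Design 1
    continuous-near-d_lower), reject everything else (sketch)."""
    d = np.asarray(d, dtype=float)
    if d.size == 0:
        raise ValueError("d must be non-empty")
    if boundary < -tol:
        raise ValueError(f"boundary={boundary} must be nonnegative")
    d_min = float(d.min())
    if abs(boundary) <= tol:
        # Design 1' plausibility heuristic: d.min() <= 5% * median(|d|).
        if d_min > support_frac * float(np.median(np.abs(d))):
            raise ValueError(
                f"boundary=0 but d.min()={d_min:.3g} is far from 0; "
                "for Design 1 pass boundary=float(d.min())"
            )
    elif abs(boundary - d_min) > tol:
        raise ValueError(f"boundary={boundary} is off the realized support")

rng = np.random.default_rng(0)
d = rng.uniform(0.5, 1.0, 500)          # support well away from 0

err = None
try:
    validate_boundary(d, boundary=0.0)  # implausible Design 1' claim
except ValueError as e:
    err = str(e)

validate_boundary(d, boundary=float(d.min()))   # Design 1: accepted
```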
igerber
added a commit
that referenced
this pull request
Apr 19, 2026
The per-cell Taylor-series variance inside aggregate_survey previously rebuilt stratum-PSU scaffolding (np.unique, per-stratum pandas groupby, stratum FPC lookup) on every output cell. At BRFSS scale (50 states x 10 years = 500 cells, 20 strata, 1M microdata rows) this meant ~10K pandas groupby constructions, each summing a mostly-zero psi vector and paying full pandas setup cost — the entire chain's runtime.

This PR adds a frozen _PsuScaffolding dataclass plus private _precompute_psu_scaffolding(resolved) and _compute_if_variance_fast(psi, scf) helpers in diff_diff/survey.py. aggregate_survey builds scaffolding once per design and threads it through _cell_mean_variance via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized np.bincount passes (psi → PSU sums, PSU sums → per-stratum first and second moments) plus a closed-form meat = sum_h adjustment_h * centered_ss_h.

Scope is deliberately localized: _compute_stratified_psu_meat and compute_survey_if_variance are unchanged, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight designs continue to route through compute_replicate_if_variance unchanged.

Measured impact (benchmarks/speed_review/run_all.py, 1M rows BRFSS):
- Large: 24.4s → 1.33s (Python), 24.9s → 1.32s (Rust) [18.4-19.0x]
- Medium: 6.1s → 0.49s [12.5-13.2x]
- Small: 1.6s → 0.17s [7.6-10x]

No regression in any other scenario (all within run-to-run noise).

Numerical equivalence: new TestAggregateSurveyScaffolding asserts assert_allclose(atol=1e-14, rtol=1e-14) between fast and legacy paths across seven design cases — stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three lonely_psu modes (remove / certainty / adjust) — plus structural tests on the scaffolding itself. On the actual BRFSS-large 1M-row panel, y_mean is bit-identical and y_se / y_precision drift at ~1 ULP (max relative diff 4.6e-16).

Existing coverage unchanged: all 43 TestAggregateSurvey tests green on the fast path (new default); all 129 test_survey.py tests green.

Documentation:
- docs/performance-plan.md: finding #1 rewritten ("practitioner-fast at every scale"), BRFSS bullet updated, hotspots row #1 marked LANDED, memory finding updated, priority table item #1 marked LANDED, new "Optimization landed" subsection, bottom line updated ("no practitioner-perceptible bottleneck remains"). Auto-tables regenerated via gen_findings_tables.py.
- CHANGELOG.md: new Performance entry under [Unreleased].

No user-facing API change. Methodology docs (REGISTRY.md, survey-theory.md) are deliberately not touched: this is a pure internal performance optimization with numerics preserved to sub-ULP tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
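A minimal sketch of the two-pass np.bincount idea described above (the function name and signature are hypothetical stand-ins, not the library's actual `_compute_if_variance_fast` helper):

```python
import numpy as np

def stratified_psu_variance_fast(psi, psu_idx, stratum_of_psu, adjustment):
    """Vectorized Taylor-series 'meat' for a stratified PSU design.

    psi            : per-observation influence values
    psu_idx        : integer PSU code per observation (0..n_psu-1)
    stratum_of_psu : integer stratum code per PSU (0..n_strata-1)
    adjustment     : per-stratum factor, e.g. n_h/(n_h-1) times any FPC
    """
    # Pass 1: sum psi within each PSU (replaces the per-stratum groupby).
    psu_sums = np.bincount(psu_idx, weights=psi)
    # Pass 2: per-stratum first and second moments of the PSU sums.
    n_psu_h = np.bincount(stratum_of_psu)
    s1 = np.bincount(stratum_of_psu, weights=psu_sums)
    s2 = np.bincount(stratum_of_psu, weights=psu_sums ** 2)
    # Centered sum of squares per stratum: sum z^2 - n * (mean z)^2.
    centered_ss = s2 - s1 ** 2 / n_psu_h
    # Closed-form meat = sum_h adjustment_h * centered_ss_h.
    return float(np.sum(adjustment * centered_ss))
```

Replacing the groupby loop with bincount is what makes the cost per cell O(n) array passes with no pandas setup overhead.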
igerber
added a commit
that referenced
this pull request
Apr 20, 2026
**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples**

When a user explicitly forced design="continuous_near_d_lower" on a sample that actually satisfies the >2% modal-fraction mass-point criterion, the downstream regressor shift (D - d_lower) would move the support minimum to zero on the shifted scale. Phase 1c's mass-point rejection guard only fires when d.min() > 0 (_validate_had_inputs), so the silent coercion ran the nonparametric local-linear estimator on a sample the paper (Section 3.2.4) requires to use the 2SLS branch, producing the wrong estimand.

Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction check on the ORIGINAL (unshifted) d_arr when the user explicitly selects design="continuous_near_d_lower". If the fraction at d.min() exceeds 2%, the fit raises ValueError pointing to design="mass_point" or design="auto". design="auto" is unaffected (_detect_design already correctly resolves such samples to mass_point).

**P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic**

The previous validator called `.astype(np.float64)` and `int(v)` on grouped first_treat values, which crashed on otherwise-supported string-labelled two-period panels (period in {"A","B"}, first_treat in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value set-membership against `{0, t_post}` with no numeric coercion.

**P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type**

When cluster was supplied, `_fit_mass_point_2sls` unconditionally switches to the CR1 cluster-robust sandwich, but the result object stored the REQUESTED family ("hc1" or "classical") as `vcov_type`. `summary()` rendered correctly via the cluster_name branch, but `to_dict()` and downstream programmatic consumers saw the stale requested label. Fixed: when cluster is supplied, `vcov_type` is stored as `"cr1"` regardless of the requested family. Renamed the local variable from `vcov_effective` to `vcov_requested` to separate the input from the effective family. Updated the `HeterogeneousAdoptionDiDResults.summary()` branch so the cluster rendering still works with the new stored value.

**Tests added (+8 regression):**
- TestValidateHadPanel.test_first_treat_col_with_string_periods
- TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string
- TestContinuousPathRejectsMassPoint (2 tests)
- TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base family when unclustered, classical+cluster collapses to cr1, to_dict shows effective family)

Targeted regression: 126 HAD tests + 505 total across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
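The P1 #1 modal-fraction check can be sketched roughly as follows (hypothetical helper name and standalone signature; the library performs this inside `HeterogeneousAdoptionDiD.fit()`, and the 2% threshold is the paper's criterion as quoted above):

```python
import numpy as np

def check_mass_point_at_lower(d, threshold=0.02):
    """Reject design='continuous_near_d_lower' when the modal fraction
    at d.min() exceeds the mass-point threshold.

    Must run on the ORIGINAL, unshifted dose array: after the
    (D - d_lower) shift, d.min() becomes 0 and a min-positivity
    guard can no longer detect the mass point.
    """
    d = np.asarray(d, dtype=float)
    frac_at_min = float(np.mean(d == d.min()))
    if frac_at_min > threshold:
        raise ValueError(
            f"{frac_at_min:.1%} of doses sit at d.min()={d.min():g}, "
            "which satisfies the mass-point criterion; use "
            "design='mass_point' or design='auto' instead."
        )
```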
igerber
added a commit
that referenced
this pull request
Apr 22, 2026
**P1 #1 — Stute tie-safe CvM**

Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each observation's dose, so tied observations share the post-tie cumulative sum. My naive cumsum over sorted residuals produced partial within-tie sums that were row-order-dependent. Fix: after cumsum, replace within-tie-block values with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now accepts `d_sorted` and collapses tie blocks before squaring. Regression test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses` validates the end-to-end stute_test contract.

**P1 #2 — Exact-linear fit must fail-to-reject (not return NaN)**

For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix: dropped the var(eps) degeneracy branch from stute_test (the bootstrap naturally produces p=1 when eps=0 exactly). Added a scale-relative short-circuit (sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat the short-circuit by producing non-zero but tiny OLS residuals. Yatchew exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan, TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.

**P1 #3 — HADPretestReport.all_pass contract**

Previously `all_pass = not (reject or reject or reject)` could be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every constituent p-value being finite AND no test rejecting. Updated docstring. Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.

**P2 #1 — QUG negative-dose guard**

HAD doses must be non-negative (paper Section 2). The raw qug_test API was silently folding d < 0 rows into the n_excluded_zero counter (filter was `d > 0`). Fix: front-door ValueError on any d < 0. Regression: TestQUGTest.test_negative_dose_raises.

**P3 #1 — QUG np.partition**

REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤ partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight closed-form parity at atol=1e-12 still holds.

**P3 #2 — REGISTRY n_bootstrap default**

REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated REGISTRY to match code and added a note about the n_bootstrap >= 99 front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
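The P1 #1 tie-collapse step can be sketched as below (standalone function, hypothetical name; the library folds this into `_cvm_statistic`). The key point is that every observation in a tie block gets the block's FINAL cumulative sum, which makes the statistic invariant to the within-tie row order:

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """c_G(D_g) = sum of eps over {D <= D_g}: tied doses must all
    share the cumulative sum at the END of their tie block."""
    c = np.cumsum(eps_sorted)
    # d_sorted is sorted, so np.unique's first-occurrence indices plus
    # counts give each tie block's span; last index = start + count - 1.
    _, starts, counts = np.unique(
        d_sorted, return_index=True, return_counts=True
    )
    block_last = starts + counts - 1
    # Broadcast each block's final cumsum back over the whole block.
    return np.repeat(c[block_last], counts)
```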
igerber
added a commit
that referenced
this pull request
Apr 22, 2026
**R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive"**

The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" BEFORE checking whether any conclusive test had rejected. The reviewer's example: G=2 with QUG rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates — the workflow emitted "inconclusive - both linearity NaN", hiding a real assumption failure. The paper's rule is one-way: TWFE is admissible only if NO test rejects. A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
1. Collect rejections from conclusive tests first. If any, that is the primary verdict, and unresolved-step notes are APPENDED via "; additional steps unresolved: ..." rather than replacing the rejection.
2. Only when NO conclusive rejection exists AND a required step is unresolved do we return a pure "inconclusive - ..." verdict.
3. Otherwise fall through to the partial-workflow fail-to-reject verdict (with "(Yatchew NaN - skipped)" suffix if applicable).

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

**R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses**

qug_test and _validate_had_panel both front-door-reject d < 0 (paper Section 2 HAD support restriction), but the new linearity helpers only validated shape + NaN. Negative doses are outside the method's stated scope and could silently produce conclusive-looking output. Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test now raise ValueError on any d < 0 with a message directing users to pre-process or check the dose column. Docstrings updated to list the new contract in the Raises section.

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

**R6 P2 — Docstrings / REGISTRY sync**

HADPretestReport.verdict docstring rewritten to describe the new "rejection-first, unresolved-suffix" priority. REGISTRY Phase 3 workflow checkbox updated to document the conclusive-rejection-not-hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
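The three-step reordering can be sketched as a tiny pure function (hypothetical standalone signature; the real _compose_verdict works off the report's test results and emits the library's exact wording):

```python
def compose_verdict(rejections, unresolved):
    """Rejection-first verdict priority.

    rejections : names of tests that conclusively rejected
    unresolved : required steps that returned NaN / were skipped
    """
    # Step 1: a conclusive rejection is the primary verdict; unresolved
    # steps are appended as a suffix, never allowed to mask it.
    if rejections:
        verdict = "reject - " + ", ".join(rejections)
        if unresolved:
            verdict += "; additional steps unresolved: " + ", ".join(unresolved)
        return verdict
    # Step 2: only with no rejection does "inconclusive" surface.
    if unresolved:
        return "inconclusive - " + ", ".join(unresolved) + " unresolved"
    # Step 3: everything conclusive and nothing rejected.
    return "fail to reject"
```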
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
**ContinuousDiD staggered support (P1 #1):** the matrix marked staggered=✗, but the method natively supports staggered adoption via the `first_treat` column (continuous_did.py:159-169, 919-925; REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

**Time-invariant dose requirement (P1 #2):** ContinuousDiD.fit() requires dose to be time-invariant per unit (continuous_did.py:222-228; docs/methodology/continuous-did.md:L70-75), but profile_panel() did not expose this, so time-varying-dose continuous panels were routed to ContinuousDiD only to hard-fail at fit time. Added `PanelProfile.treatment_varies_within_unit: bool` — True iff any unit has more than one distinct non-NaN treatment value across its observed rows. Computed unconditionally for numeric (non-bool) treatment columns; False for categorical. `to_dict()` exposes it. Guide §2 documents the field; §4.7's ContinuousDiD bullet lists two eligibility prerequisites: P(D=0) > 0 AND treatment_varies_within_unit == False.

**Tests (P2):**
- test_continuous_treatment_with_time_varying_dose: random-per-row continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose -> treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓; guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
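The new field's contract ("True iff any unit has more than one distinct non-NaN treatment value") reduces to a one-line groupby; a sketch with a hypothetical standalone helper name:

```python
import pandas as pd

def treatment_varies_within_unit(df, unit_col, treat_col):
    """True iff any unit has >1 distinct non-NaN treatment value
    across its observed rows (pandas nunique skips NaN by default,
    matching the non-NaN part of the contract)."""
    return bool(df.groupby(unit_col)[treat_col].nunique().gt(1).any())
```

Units whose dose is constant (or observed only once) contribute nunique <= 1 and never trip the flag.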
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
…corrected scope; cover new exports in import-surface test

**P3 #1 (ROADMAP wording drift):** ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1).

**P3 #2 (import-surface test coverage):** `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

**P1 #1 (Wooldridge Poisson estimand wording):** The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description.

**P1 #2 (`is_count_like` non-negativity guard):** The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: the Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing.

**P2 (test coverage for both P1s):** Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
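The five-condition heuristic quoted above can be sketched as a standalone predicate (hypothetical free function; the library computes this inside its outcome-shape profiling, and the moment-coefficient skewness here is one reasonable reading of "skewness > 0.5"):

```python
import numpy as np

def is_count_like(y):
    """is_integer_valued AND pct_zeros > 0 AND skewness > 0.5
    AND n_distinct_values > 2 AND value_min >= 0."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    skew = np.mean(centered ** 3) / (np.std(y) ** 3)
    return bool(
        np.all(y == np.round(y))      # integer-valued
        and np.any(y == 0)            # has zeros
        and skew > 0.5                # right-skewed
        and np.unique(y).size > 2     # > 2 distinct values
        and y.min() >= 0.0            # NEW: non-negative support
    )
```

The last clause is the fix: without it, a right-skewed integer outcome with zeros and a few negatives would pass, then hard-fail in the Poisson branch.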
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…ctive-support guard

**P1 #1 (FPC validator in SurveyDesign.resolve fires on placebo with explicit psu):** The R10 fix gated the in-fit implicit-PSU FPC validator on bootstrap/jackknife only, but ``SurveyDesign.resolve()`` itself enforces ``FPC >= n_PSU`` design-validity (survey.py:349-368) before ``synthetic_did.fit()`` even sees the resolved object. So a placebo fit with explicit ``psu`` and low ``fpc`` would still raise — the same parameter-interaction problem, one layer earlier in resolution.

Fix: when ``variance_method == "placebo"`` and ``survey_design.fpc is not None``, construct an FPC-stripped copy of the SurveyDesign (``dataclasses.replace(survey_design, fpc=None)``) BEFORE calling ``_resolve_survey_for_fit``. Emit the FPC no-op ``UserWarning`` at the same time. The original ``survey_design`` object is preserved (caller's reference unchanged); the resolved unit-level survey design carries no FPC on placebo, so the in-fit validators (and the downstream FPC-related dispatch flags) all correctly skip FPC handling. The duplicate downstream FPC no-op warning (added in R8, keyed on ``resolved_survey_unit.fpc``) becomes unreachable on placebo and is removed.

New regression ``test_placebo_low_fpc_with_explicit_psu_skips_resolve_validator`` asserts:
(a) placebo with explicit psu + ``fpc < n_PSU`` succeeds and emits the no-op warning,
(b) the SE matches the no-FPC fit at ``rel=1e-12``,
(c) bootstrap on the same low-FPC design still raises ``"FPC (2.0) is less than the number of PSUs"`` from ``SurveyDesign.resolve()`` — the validator skip is correctly variance-method-gated.

**P1 #2 (Case D missed effective single-support):** The Case D guard for placebo degeneracy keyed on raw control counts (``n_c_h > n_t_h`` for at least one stratum). It missed the case where ``n_c_h_positive < 2`` for every treated stratum: the rows allow multiple subsets, but every successful pseudo-treated mean reduces to the unique positive-weight control's outcome (zero-weight cohabitants contribute 0 to numerator and denominator, R11 P1). The placebo null collapses to a single point and the SE is FP noise.

Fix: extend the non-degeneracy invariant to require **both** ``n_c_h > n_t_h`` AND ``n_c_h_positive >= 2`` for at least one treated stratum. The classical Case D shape (raw exact-count ``n_c_h == n_t_h``) and the new "effective single-support" shape (positive-weight controls < 2 even with extra zero-weight rows) both trigger Case D. Updated the Case D error message to enumerate ``n_c_positive`` alongside ``n_c`` / ``n_t`` per stratum.

New regression ``test_placebo_full_design_raises_on_effective_single_support`` constructs a fixture with 1 treated unit + 1 positive-weight control + 9 zero-weight controls in stratum 0; the raw guards (B/C/E) pass but Case D fires with the new "single distinct positive-mass pseudo-treated mean" message. Updated the existing ``test_placebo_full_design_raises_on_exact_count_stratum`` regex to match the new message (same Case D path, slightly different wording).

REGISTRY §SyntheticDiD Case enumeration updated: Case D now documents both the classical (``n_c == n_t``) and effective single-support (``n_c_positive < 2``) shapes, with the combined non-degeneracy invariant.

Verification: 98 passed (2 new regressions; existing Case B/C/E/D-classical guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
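The P1 #1 copy-before-resolve pattern is essentially this (the dataclass here is a minimal hypothetical stand-in for the library's SurveyDesign, kept only to make the `dataclasses.replace` mechanics concrete):

```python
import dataclasses
import warnings
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SurveyDesignSketch:
    """Minimal stand-in for the library's SurveyDesign (hypothetical)."""
    psu: Optional[str] = None
    fpc: Optional[str] = None

def strip_fpc_for_placebo(design, variance_method):
    """On placebo variance, drop the FPC BEFORE resolution so the
    FPC >= n_PSU validator never fires; other variance methods pass
    the design through untouched."""
    if variance_method == "placebo" and design.fpc is not None:
        warnings.warn(
            "FPC is a no-op under placebo variance; ignoring it.",
            UserWarning,
        )
        # Frozen dataclass: replace() returns a modified copy, so the
        # caller's original design object is left intact.
        return dataclasses.replace(design, fpc=None)
    return design
```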
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to: - Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path. - Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio. - Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar. Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description. P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. 
Verified against `wooldridge.py:1105-1109`: Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing. P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can be derived post hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts the §4.11 and §5.3 blocks, asserts the forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description.

P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: the Poisson method hard-rejects `y < 0` with `ValueError`. Without a `value_min >= 0` guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement.

Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
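The five-condition gate can be restated as a compact predicate. The sketch below is illustrative (the helper name `looks_count_like` and the skewness estimator are assumptions for this sketch, not the library's internals); it shows how the new `value_min >= 0` term vetoes the signal for outcomes with negative support:

```python
import numpy as np

# Illustrative restatement of the five-condition count-like gate. Helper
# name and skewness estimator are assumptions, not the library's internals.
def looks_count_like(y) -> bool:
    y = np.asarray(y, dtype=float)
    is_integer_valued = bool(np.all(np.isclose(y, np.round(y))))
    pct_zeros = float(np.mean(y == 0.0))
    centered = y - y.mean()
    skewness = float(np.mean(centered**3) / np.std(y) ** 3)  # biased Fisher skew
    n_distinct = int(np.unique(y).size)
    value_min = float(y.min())
    return (
        is_integer_valued
        and pct_zeros > 0
        and skewness > 0.5
        and n_distinct > 2
        and value_min >= 0.0  # aligns the signal with Poisson's hard y >= 0 gate
    )

rng = np.random.default_rng(0)
poisson_y = rng.poisson(1.5, size=500).astype(float)
spliced = np.append(poisson_y, [-1.0, -2.0])  # negative integers spliced in

print(looks_count_like(poisson_y))  # True: all five conditions fire
print(looks_count_like(spliced))    # False: value_min < 0 vetoes the signal
```

The spliced case mirrors the regression test described above: the first four conditions still hold, and only the non-negativity term flips the result.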
igerber added a commit that referenced this pull request on Apr 25, 2026
R1 P0 — Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; the survey= branch was the gap).

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors the HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — the workflow's row-level weights= crashed on staggered event-study panels because _validate_multi_period_panel filters to the last cohort but the joint wrappers re-aggregate with the original full-panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors the HeterogeneousAdoptionDiD.fit positional-index pattern. The survey= path is unaffected (column references resolve internally on data_filtered).

R1 P3 — the REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify that (a) the workflow gate was temporary and is now closed by C, (b) the qug_test direct-helper gate remains permanent, and (c) C uses a PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash).

165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
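The positional-subsetting fix can be sketched in isolation. A minimal illustration of the pattern (toy column names; the real workflow operates on event-study panels), showing why positions from `get_indexer` are used rather than label-based indexing on a weights array that has no index of its own:

```python
import numpy as np
import pandas as pd

# Row-level weights supplied for the FULL panel must be subset to the
# filtered panel's rows by POSITION before any helper that expects
# len(weights) == len(data_filtered).
data = pd.DataFrame(
    {"unit": [1, 1, 2, 2, 3, 3], "cohort": [2004, 2004, 2006, 2006, 2006, 2006]},
    index=[10, 11, 12, 13, 14, 15],  # non-default index, as in real panels
)
weights = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # aligned to data's rows

# e.g. a staggered filter keeping only the last cohort
data_filtered = data[data["cohort"] == 2006]

# positional subsetting, mirroring data.index.get_indexer(data_filtered.index)
pos = data.index.get_indexer(data_filtered.index)
weights_filtered = weights[pos]
print(weights_filtered)  # [2. 2. 3. 3.]
```

Passing the full six-element `weights` array to a helper that sees only the four filtered rows is exactly the length-mismatch crash the new regression test pins.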
igerber added a commit that referenced this pull request on Apr 25, 2026
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered-weights subset fix from R1 was only applied at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...), which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix.

R2 P1 #2 (Methodology) -- the REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). The code is correct: the paper's Appendix D wild bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update the REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights and (b) bootstrap p-value distributional agreement within Monte Carlo noise.

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- The qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (a PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism.
- The HADPretestReport.all_pass docstring described the unweighted contract only; the survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per the C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + a p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
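The pinned perturbation form is easy to demonstrate with stand-in arrays. A hedged sketch (all names and data are illustrative, not the library's) of why the weight stays out of the perturbation and enters only through the weighted statistic:

```python
import numpy as np

# The wild bootstrap perturbs UNWEIGHTED residuals; the survey weight w
# enters only through the weighted refit / weighted statistic downstream.
rng = np.random.default_rng(42)
n = 200
fitted = rng.normal(size=n)               # stand-in fitted values
eta_obs = rng.normal(scale=0.5, size=n)   # stand-in OLS residuals
w = rng.uniform(0.5, 2.0, size=n)         # survey weights (downstream only)

eps = rng.choice([-1.0, 1.0], size=n)     # Rademacher multipliers, one per obs
dy_b = fitted + eps * eta_obs             # correct: no `* w` in the perturbation

# Weighting flows through the weighted objective instead, e.g. a weighted
# mean of the bootstrap outcome; putting w in the perturbation too would
# count it twice (w^2) in the weighted statistic.
weighted_stat = float(np.sum(w * dy_b) / np.sum(w))
```

The over-weighting claim falls out directly: a weighted sum of `w * (eps * w * eta)` terms carries `w**2` against each residual, which is not what the sandwich/CvM derivation assumes.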
igerber added a commit that referenced this pull request on Apr 25, 2026
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved weights/survey on the FULL panel before _validate_multi_period_panel applied the staggered last-cohort filter. Because _resolve_pretest_unit_weights enforces strictly-positive per-unit weights / pweight type / etc. on whatever data it sees, zero or otherwise-invalid weights on the soon-to-be-dropped cohort would abort an otherwise-valid event-study run. Fix: defer resolution to the per-aggregate branches.
- Top level: only the survey/weights mutex check + use_survey_path presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel (no cohort filtering on this path; the original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint wrappers (joint_pretrends_test / joint_homogeneity_test) own resolution and already see data_filtered (post staggered filter). Row-level weights= passed through with the existing positional subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage gap. Existing tests covered overall-workflow + trivial/no-PSU smokes; the PSU-aware multiplier-bootstrap path (the core new methodology) was unpinned for joint_homogeneity_test and the event-study workflow. 3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study dispatch under PSU/strata clustering: validate_multi_period_panel + resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (regression for the R6 P1 #1 fix: a panel where the dropped early cohort has zero weights succeeds on the surviving last cohort; pre-fix this crashed with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC", but R10 narrowed Stute-family support to pweight + PSU + FPC only (stratified is rejected with NotImplementedError pending derivation). Updated to reflect the actual support surface and consolidated the stratified-Stute follow-up alongside replicate-weight pretests as the two known Phase 4.5 C follow-ups.

R12 P3 #2 -- the new survey test matrix covered pweight-only and PSU-only smokes but no FPC-only case. The bootstrap helper applies sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with SurveyDesign(weights=, fpc=) column reference.

193 pretest tests pass (was 191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
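The sqrt(1 - f) scaling the new smokes pin down is a one-liner. A sketch under stated assumptions (Rademacher multipliers shown for simplicity; per the commit above, the shipped path uses Mammen draws at the PSU level):

```python
import numpy as np

# Finite-population correction applied to bootstrap multipliers: as the
# sampling fraction f = n/N approaches 1 (a census), perturbations shrink
# toward zero because there is no sampling variance left to mimic.
rng = np.random.default_rng(7)
n, N = 50, 200                      # sample size and population size
f = n / N                           # sampling fraction from the FPC inputs
multipliers = rng.choice([-1.0, 1.0], size=n)
scaled = np.sqrt(1.0 - f) * multipliers

print(round(float(np.sqrt(1.0 - f)), 4))  # 0.866
```

With f = 0.25 the perturbations shrink by about 13%; an FPC-only regression has to observe this scaling directly, since the unscaled and scaled statistics otherwise agree up to a constant factor.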
igerber added a commit that referenced this pull request on Apr 25, 2026
…erage

P3 #1: The ``to_dataframe`` method docstring at ``chaisemartin_dhaultfoeuille_results.py:1375-1379`` listed the pre-change ``level="by_path"`` schema (no ``cband_*`` columns) even though the implementation now returns them. Updated the bullet to include ``cband_lower / cband_upper``, document the negative-horizon placebo convention, and document the NaN-on-absent-band behavior.

P3 #2: ``TestByPathSupTBands::test_path_sup_t_seed_reproducibility`` only exercised the default ``rademacher`` weight family. Parameterized over ``["rademacher", "mammen", "webb"]`` to pin that the per-path sup-t branch correctly threads ``self.bootstrap_weights`` through ``_generate_psu_or_group_weights`` for all three multiplier families the feature advertises. The existing OVERALL machinery handles all three uniformly, but the per-path surface lacked direct coverage. Each variant must produce a finite, reproducible crit on the standard 3-path fixture.

17 tests pass on TestByPathSupTBands (was 15: +2 new parameterized variants on the existing seed_reproducibility test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
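For reference, the three multiplier families named above have standard mean-0 / variance-1 definitions. The sketch below is a generic restatement under those textbook definitions (the function name is hypothetical, not the library's `_generate_psu_or_group_weights`):

```python
import numpy as np

# Generic mean-0 / variance-1 multiplier families commonly used for wild /
# multiplier bootstraps. Function name is illustrative.
def draw_multipliers(family, size, rng):
    if family == "rademacher":
        return rng.choice([-1.0, 1.0], size=size)
    if family == "mammen":
        phi = (1.0 + np.sqrt(5.0)) / 2.0    # golden ratio
        p = phi / np.sqrt(5.0)              # P(v = 1 - phi), about 0.7236
        return rng.choice([1.0 - phi, phi], size=size, p=[p, 1.0 - p])
    if family == "webb":
        vals = [-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                np.sqrt(0.5), 1.0, np.sqrt(1.5)]
        return rng.choice(vals, size=size)  # six points, equal probability
    raise ValueError(f"unknown multiplier family: {family}")

rng = np.random.default_rng(0)
draws = {f: draw_multipliers(f, 100_000, rng)
         for f in ("rademacher", "mammen", "webb")}
for f, d in draws.items():
    print(f, round(float(d.mean()), 2), round(float(d.var()), 2))  # each ~0.0, ~1.0
```

Because all three families share the first two moments, a seed-reproducibility test parameterized over them primarily checks the plumbing (that the chosen family actually reaches the draw site), not the statistic's value.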
igerber added a commit that referenced this pull request on Apr 25, 2026
R2 P1: extended dispatch-matrix coverage on the new survey_design= front door. Added 3 test classes covering paths that PR #376 fronted but didn't directly test:
- TestHADFitMassPointSurveyDesign: design='mass_point' + survey_design= smoke + legacy-alias att parity (vcov_type='hc1' required by the Phase 4.5 B mass-point + survey deviation).
- TestHADFitEventStudySurveyDesign: aggregate='event_study' + cband=True + survey_design= smoke + legacy survey= parity (full bit-equality on att, se under the same seed + design).
- TestDidHadPretestWorkflowEventStudySurveyDesign: workflow event-study smoke via survey_design=, plus legacy survey= and weights= parity. The weights= parity test also locks the R2 P3 nested-warning suppression (asserts exactly ONE DeprecationWarning fires from the workflow front door, not three from cascading joint wrappers).

R2 P3 #1: the workflow's event-study `weights=` path was emitting up to 3 DeprecationWarnings (one at the workflow front door + one each from the joint wrappers' internal weights= path). Wrap the internal joint-wrapper calls in `warnings.catch_warnings() + simplefilter("ignore", DeprecationWarning)` since the user-facing warning has already fired at the workflow front door. The joint wrappers can't accept ResolvedSurveyDesign (their `_resolve_pretest_unit_weights` requires a SurveyDesign with .resolve()), so converting weights= to survey_design= via make_pweight_design isn't an option here. Locked by the new test_legacy_alias_parity_weights assertion `n_dep_warnings == 1`.

R2 P3 #2: the qug_test mutex error pointed users to `survey_design=make_pweight_design(arr)` as a migration target via the shared HAD_DUAL_KNOB_MUTEX_MSG_ARRAY_IN constant, but qug_test permanently rejects ALL survey_design/survey/weights inputs (Phase 4.5 C0 deferral). Replaced with a qug-specific mutex message that says "no migration path; see NotImplementedError below" instead of suggesting make_pweight_design.

545 tests pass (was 538 + 7 new dispatch-matrix tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
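The single-warning contract locked by `n_dep_warnings == 1` relies on the standard `catch_warnings` suppression idiom. A self-contained sketch (function names and warning text are illustrative):

```python
import warnings

# The front door fires ONE user-facing DeprecationWarning, then silences the
# identical warnings its internal helpers would re-emit.
def _inner_helper():
    warnings.warn("weights= is deprecated", DeprecationWarning, stacklevel=2)

def front_door():
    warnings.warn("weights= is deprecated", DeprecationWarning, stacklevel=2)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        _inner_helper()  # would otherwise fire a second, redundant warning
        _inner_helper()  # ...and a third

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    front_door()

n_dep = sum(issubclass(c.category, DeprecationWarning) for c in caught)
print(n_dep)  # 1 -- exactly one warning reaches the caller
```

`catch_warnings()` saves and restores the filter state, so the inner "ignore" filter cannot leak out and suppress unrelated warnings elsewhere in the call stack.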
igerber added a commit that referenced this pull request on Apr 25, 2026
R9 P3 #1 (helper error-message canonical-kwarg consistency): `_resolve_pretest_unit_weights`'s TypeError on non-`SurveyDesign`-like input still said `survey=` must be a SurveyDesign — but on the data-in wrappers (workflow / joint_pretrends_test / joint_homogeneity_test) the canonical kwarg is now `survey_design=`. Updated the message to name `survey_design=` (with `survey=` flagged as the deprecated alias) and to point pre-resolved-design users to the array-in pretest helpers, mirroring HAD.fit's data-in guard.

R9 P3 #2 (legacy-vs-canonical parity coverage on data-in pretests): Added 3 parity tests (test_legacy_alias_parity_survey on joint_pretrends_test + joint_homogeneity_test, plus test_legacy_alias_parity_survey_overall on the did_had_pretest_workflow overall path). Locks the rebinding contract on the data-in surfaces that previously only had smoke / warning / mutex coverage.

558 tests pass (was 555 + 3 new R9 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
R10 P3 #1 (qug_test deprecation warning text): qug_test was using the shared array-in deprecation messages that point users to migrate to `survey_design=` / `make_pweight_design(arr)`, but qug_test permanently rejects ALL survey-aware kwargs (Phase 4.5 C0 deferral). Replaced with qug-specific warning text saying the aliases are deprecated AND that survey-aware QUG remains unsupported, pointing users to `did_had_pretest_workflow(..., survey_design=...)` for the survey-aware linearity family instead.

R10 P3 #2 (weights= parity tests on data-in wrappers): the previous round added survey= parity for joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(aggregate='overall') but left the weights= rebinding paths warning-only with no numerical parity lock. Added 3 new tests: test_legacy_alias_parity_weights (joint_pretrends_test + joint_homogeneity_test) and test_legacy_alias_parity_weights_overall (workflow). Each asserts `weights=np.ones(n)` ≡ `survey_design=SurveyDesign(weights="w")` (uniform 1.0 column) on identical numerical output, locking the rebinding contract.

561 tests pass (was 558 + 3 new R10 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
…scope

P3 #1 (Methodology): qualified the "exact R match" claim across the docstring / REGISTRY / CHANGELOG / R-generator comment / parity-test docstring with a cross-reference to the existing DID^X cell-weighting deviation (Python's first stage uses equal cell weights; R weights by N_gt). The two coincide on one-observation-per-(g,t) panels (the common cell-aggregated regime that the parity scenario uses). The multi-observation-per-cell deviation is independent of the by_path lift and was already documented in REGISTRY's "Note (Phase 3 DID^X covariate adjustment)".

P3 #2 (Maintainability): narrowed the Step 7b header comment in chaisemartin_dhaultfoeuille.py:1465-1473 to spell out that DID^X residualization applies to the per-group multi-horizon path (event_study_effects, overall_att, joiners/leavers, by_path, placebos, sup-t bands) but intentionally excludes per_period_effects, which stays on raw outcomes per the existing "Note (Phase 3 DID^X covariate adjustment)" contract.

Documentation-only fix; no runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
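The one-observation-per-cell coincidence is easy to verify directly. An illustrative check with toy data (column names are not the library's API): equal cell weights and N_gt cell-size weights agree exactly when every (g, t) cell holds a single row, and diverge as soon as one cell gains a second row.

```python
import numpy as np
import pandas as pd

# One observation per (g, t) cell: equal weights == N_gt weights.
df = pd.DataFrame({"g": [1, 1, 2, 2], "t": [0, 1, 0, 1],
                   "y": [1.0, 3.0, 2.0, 5.0]})
cells = df.groupby(["g", "t"])["y"].agg(["mean", "size"]).reset_index()
equal_w = cells["mean"].mean()                             # equal cell weights
ngt_w = np.average(cells["mean"], weights=cells["size"])   # N_gt weights
print(equal_w == ngt_w)  # True: one observation per cell

# Add a second row to cell (1, 0) and the two weightings diverge.
df2 = pd.concat([df, pd.DataFrame({"g": [1], "t": [0], "y": [9.0]})],
                ignore_index=True)
cells2 = df2.groupby(["g", "t"])["y"].agg(["mean", "size"]).reset_index()
equal_w2 = cells2["mean"].mean()
ngt_w2 = np.average(cells2["mean"], weights=cells2["size"])
print(equal_w2 == ngt_w2)  # False: multi-observation cell breaks the coincidence
```

This is exactly why the parity scenario's cell-aggregated fixture cannot distinguish the Python and R weighting conventions, while a raw multi-row panel could.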
igerber added a commit that referenced this pull request on Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust: The parity contract runs through nprobust numerical paths (DIDHAD's local-linear bandwidth + bias-correction calls), so a fresh regeneration could drift if CRAN serves a newer nprobust. Pin nprobust == 0.5.0 in both the R generator's stopifnot guard and the parity test's metadata assertion, alongside DIDHAD and YatchewTest.

P3 #2 — workflow docstring: did_had_pretest_workflow's top-level docstring still said "Eq 18 linear-trend detrending is a Phase 4 follow-up", which contradicts the shipped trends_lin behavior. Updated to describe the forwarding contract (trends_lin → joint_pretrends_test + joint_homogeneity_test, consumed-placebo skip path on minimal panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity-test horizon-shape assertions: Added an explicit "missing in Python" assertion in _zip_r_python: every R-mapped event time must be present in Python's event_times (catches future horizon-shape regressions where Python silently drops a horizon R requested). Added an effects+placebo row-count sanity check in test_yatchew_t_stat_parity (uses the previously-unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 26, 2026
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer: benchmarks/R/requirements.R installed whatever CRAN currently served via install.packages, while the generator and parity test hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust == 0.5.0. In a fresh R environment regenerating the goldens, the generator's stopifnot(packageVersion == "X.Y.Z") would immediately abort. Fix: add an `install_pinned_version()` helper using remotes::install_version with `upgrade = "never"`, run after the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent when the correct version is already installed. Bump procedure documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times: _zip_r_python() previously asserted only that R-mapped horizons were a SUBSET of Python's event_times (the missing-in-python check). Tightened to FULL SET EQUALITY: also reject horizons present in Python but absent from R's requested set ("extra_in_python"). This catches future event_study horizon-selection regressions in both directions — e.g. if our effects/placebo cap drifts and Python emits an extra row R didn't request.

Stats: 540 tests pass, 0 regressions. Still no estimator changes — all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request on Apr 27, 2026
Closes BR/DR foundation gap igerber#6 from project_br_dr_foundation.md: BusinessReport and DiagnosticReport now name what the headline scalar actually represents as an estimand, for each of the 16 result classes. Baker et al. (2025) Step 2 ("define the target parameter") was previously in BR's next_steps list but not done by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema change; experimental per the REPORTING.md stability policy):

{
  "name": str,                # stakeholder-facing name
  "definition": str,          # plain-English description
  "aggregation": str,         # machine-readable dispatch tag
  "headline_attribute": str,  # which raw result attribute
  "reference": str,           # REGISTRY.md citation pointer
}

Schema placement: top-level block (user preference, selected via AskUserQuestion in planning). Aggregation tags include "simple", "event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd", "staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x", "l_fd", "l_x_fd", "dose_overall", "pt_all_combined", "pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (its own module rather than business_report / diagnostic_report, to avoid circular-import risk — plan-review LOW igerber#7). All 17 result classes are covered (16 from _APPLICABILITY + BaconDecompositionResults); exhaustiveness is locked in by TestTargetParameterCoversEveryResultClass.

Fit-time config reads:
- ``EfficientDiDResults.pt_assumption`` branches the aggregation tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` + ``covariate_residuals`` + ``linear_trends_effects`` branch the dCDH estimand between DID_M / DID_l / DID^X_l / DID^{fd}_l / DID^{X,fd}_l.

Fixed-tag branches (per plan-review CRITICAL igerber#1 and igerber#2):
- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` / ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not change the ``overall_att`` scalar — it only populates additional horizon / group tables on the result object. Disambiguating those tables in prose is tracked under gap igerber#9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level assumption, not a library setting. Emits a single "dose_overall" tag with a disjunctive definition naming both regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:
- BR ``_render_summary``: emits "Target parameter: <name>." after the headline sentence (short name only; the full definition lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section with name, definition, aggregation tag, headline attribute, and reference.

Cross-surface parity: both BR and DR consume the same helper (the single source of truth), so their ``target_parameter`` blocks are byte-identical (verified by TestTargetParameterCrossSurfaceParity).

Tests: 37 new (TestTargetParameterPerEstimator + TestTargetParameterFitConfigReads + TestTargetParameterCoversEveryResultClass + TestTargetParameterCrossSurfaceParity + TestTargetParameterProseRendering). Existing BR/DR top-level-key contract tests updated to include ``target_parameter``. Total 319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section documenting the per-estimator dispatch and schema shape. business_report.rst and diagnostic_report.rst note the new field with a pointer to REPORTING.md. CHANGELOG entry under Unreleased.

Out of scope: REGISTRY.md per-estimator "Target parameter" sub-sections (plan-review additional note); the reporting-layer doc in REPORTING.md is the current source of truth. A follow-up docs PR can land those sub-sections if maintainers want the registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>