Add initial diff-diff library implementation #1
Merged
Implement difference-in-differences (DiD) library with:
- DifferenceInDifferences estimator with sklearn-like API (fit, get_params, set_params)
- DiDResults class with statsmodels-style output (summary tables, coefficients, p-values)
- Support for formula interface (R-style) and column name interface
- Heteroskedasticity-robust (HC1) and cluster-robust standard errors
- TwoWayFixedEffects estimator for panel data
- Utility functions for parallel trends testing
- Comprehensive test suite (16 tests)
- pyproject.toml for modern Python packaging
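For orientation, the canonical 2x2 interaction regression that such an estimator computes can be sketched in plain numpy on synthetic data (this is a hedged illustration of the method, not the library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Synthetic 2x2 setup: treatment-group and post-period indicators
# with a known effect of 1.5 on the interaction.
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
true_effect = 1.5
y = (
    0.5                              # baseline mean
    + 1.0 * treated                  # time-invariant group difference
    + 0.3 * post                     # common time trend
    + true_effect * treated * post   # treatment effect (the DiD target)
    + rng.normal(0.0, 1.0, n)
)

# OLS on [1, treated, post, treated*post]; the DiD estimate is the
# coefficient on the interaction term.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_estimate = beta[3]
```

The estimator wraps exactly this kind of fit behind `fit()` and returns inference (robust/cluster SEs, summary tables) through the results object.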
igerber pushed a commit that referenced this pull request on Jan 4, 2026

Review fixes:
- Add edge case validation in _compute_flci (se > 0, 0 < alpha < 1)
- Improve significance_stars docstring explaining partial identification
- Standardize error messages to include parameter values (M, Mbar, alpha)
- Make LP solver method configurable in _solve_bounds_lp
- Add clarifying comment about constraint matrix design for pre+post periods
- Improve CallawaySantAnna error message with actionable guidance

Notes:
- #4 (sensitivity_plot export) was verified as valid - function exists at honest_did.py:1437
- #1 (pre-period effects) verified correct - LP optimization covers all periods but only post-periods contribute to objective function
igerber pushed a commit that referenced this pull request on Jan 4, 2026
igerber added a commit that referenced this pull request on Apr 16, 2026

- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and runs survey-aware WLS + Binder TSL IF when survey_design is active. Point estimate via solve_ols(weights=W_elig, weight_type='pweight'); group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded to obs-level via w_i/W_g ratio, then compute_survey_if_variance for stratified/PSU variance. safe_inference uses df_survey. Rank-deficiency short-circuits to NaN to avoid point-estimate/IF mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When provided, resolves weights via _resolve_survey_for_fit and passes them to _validate_and_aggregate_to_cells, restoring fit-vs-helper parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey; heterogeneity Note documents the TSL IF mechanics and library extension disclaimer; checklist line-651 lists survey-aware surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py: TestSurveyHeterogeneity (uniform-weights match, non-uniform beta change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper match, non-pweight rejection).

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
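The group-to-observation IF expansion described above can be sketched with illustrative arrays (stand-ins for the quantities named in the commit, not the library's internals): the key property is that the expansion preserves group totals, so obs-level variance machinery recovers the group-level target.

```python
import numpy as np

# Illustrative inputs: per-obs weights w_i, group labels, and a
# group-level influence value psi_g for each group.
w = np.array([1.0, 3.0, 2.0, 2.0, 4.0])
group = np.array([0, 0, 1, 1, 1])
psi_g = np.array([0.8, -0.5])

# Expand group-level IF to obs level: psi_i = psi_g[g(i)] * w_i / W_g,
# where W_g is the group weight total.
W_g = np.bincount(group, weights=w)
psi_i = psi_g[group] * w / W_g[group]

# Summing psi_i within a group recovers psi_g, so feeding psi_i into
# a stratified/PSU variance routine matches the group-level variance.
group_totals = np.bincount(group, weights=psi_i)
```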
igerber added a commit that referenced this pull request on Apr 16, 2026

- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when available, else n_gt) for FE regressions, the normalization denominator, contribution weights, and the Corollary 1 observation shares. On survey-backed inputs the outputs now match the observation-level pweighted TWFE estimand; non-survey path is byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in _validate_and_aggregate_to_cells when weights are provided, so that d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight subpopulation rows from tripping the fuzzy-DiD guard or inflating downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py — TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols verifies beta_fe matches an observation-level pweighted OLS under survey (would fail if n_gt was still used), and TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation verifies an injected zero-weight row with opposite treatment value doesn't trip the within-cell constancy check.

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
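A minimal pandas sketch of the zero-weight-drop fix (illustrative column names and data, not the library's `_validate_and_aggregate_to_cells`): the zero-weight row carries the opposite treatment value, so aggregating without the filter would make its cell look fuzzy.

```python
import pandas as pd

# Illustrative panel: the w=0 subpopulation row in cell (g=1, t=0)
# has d=1 while the effective row has d=0.
df = pd.DataFrame({
    "g": [1, 1, 1, 2, 2],
    "t": [0, 0, 1, 0, 1],
    "d": [0, 1, 1, 0, 0],
    "w": [2.0, 0.0, 2.0, 1.0, 1.0],
})

# Drop zero-weight rows BEFORE the groupby so d_min/d_max/n_gt
# reflect the effective sample only.
eff = df[df["w"] > 0]
cells = eff.groupby(["g", "t"])["d"].agg(d_min="min", d_max="max", n_gt="size")

# Within-cell treatment constancy now holds for every cell, so the
# fuzzy-DiD guard stays quiet.
constant = bool((cells["d_min"] == cells["d_max"]).all())
```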
igerber added a commit that referenced this pull request on Apr 16, 2026

…vey branch tests

- P1 #1: The R5 zero-weight filter only ran inside the cell aggregation step, after the NaN/coercion checks for group/time/treatment/outcome. Moved the filter to the very top of _validate_and_aggregate_to_cells so validation only sees the effective sample. fit()'s controls, trends_nonparam, and heterogeneity blocks now also scope their NaN/time-invariance checks to positive-weight rows when survey_weights is active. Legitimate SurveyDesign.subpopulation() inputs with NaN in excluded rows now fit cleanly. TSL variance path is unchanged (zero-weight obs still contribute zero psi).
- P2: 5 new regression tests in test_survey_dcdh.py — TestZeroWeightSubpopulation now covers NaN outcome and NaN het columns in excluded rows; new TestSurveyTrendsLinear / TestSurveyTrendsNonparam / TestSurveyDesign2 classes exercise survey_design combined with those previously-untested branches.

All 262 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 17, 2026

- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call it from fit() after the weight_type/replicate-weights checks. The dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as the effective sampling unit; when strata or PSU vary within group it silently spreads horizon-specific IF mass across observations in different PSUs, contaminating the stratified-PSU variance. Walk back the overstated claim at the old line 669 comment to match. Within-group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before np.unique/np.bincount so NaN / non-comparable group IDs on excluded subpopulation rows cannot crash SE factorization. psi stays full-length with zeros in excluded positions to preserve alignment with resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the within-group-constant strata/PSU requirement and the within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects varying PSU, rejects varying strata, accepts varying weights, and ignores zero-weight rows during the constancy check) plus TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
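The within-group constancy check can be sketched as a small pandas helper (an illustrative stand-in for `_validate_group_constant_strata_psu`, not the library's code): only positive-weight rows count, so zero-weight subpopulation rows cannot trip it.

```python
import pandas as pd

def validate_group_constant(df, group_col, cols, weight_col):
    """Reject designs where a column varies within a group,
    checking positive-weight rows only (illustrative sketch)."""
    pos = df[df[weight_col] > 0]
    for col in cols:
        varying = pos.groupby(group_col)[col].nunique().gt(1)
        if varying.any():
            bad = varying[varying].index.tolist()
            raise ValueError(f"{col} varies within group(s) {bad}")

df = pd.DataFrame({
    "g":      [1, 1, 2, 2],
    "strata": ["s1", "s1", "s1", "s1"],
    "psu":    ["a", "b", "c", "zzz"],   # varies in group 1; group 2's
    "w":      [1.0, 1.0, 1.0, 0.0],     # odd PSU label sits on a w=0 row
})
err = None
try:
    validate_group_constant(df, "g", ["strata", "psu"], "w")
except ValueError as e:
    err = str(e)

# Group 2 alone passes: its varying PSU label is on a zero-weight row.
ok_df = df[df["g"] == 2]
validate_group_constant(ok_df, "g", ["strata", "psu"], "w")
```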
igerber added a commit that referenced this pull request on Apr 18, 2026

Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups, not the post-filter eligible-group set used to build the bootstrap PSU map. Panels where upstream dCDH filtering drops groups that share PSUs with kept groups could emit a misleading "PSU coarser than group" warning even when the effective bootstrap is one group per PSU. Fix: count PSUs and groups from `_eligible_group_ids` (the same set feeding `group_id_to_psu_code_bootstrap`), preserving the within-group-constant-PSU invariant by taking each eligible group's first positive-weight PSU label.

P3 #2: Two docstrings said the bootstrap is "clustered at the group level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring: extended to note PSU-level Hall-Mammen wild clustering under `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module docstring: documents the identity-map fast path (auto-inject `psu=group`), the PSU-level broadcast when PSU is strictly coarser, and points to REGISTRY.md for the full contract.

Full regression: 318 passing.
igerber added a commit that referenced this pull request on Apr 18, 2026

Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the group level" — which was true when the PR was written but is now incomplete given the PSU-level Hall-Mammen wild bootstrap path under `survey_design`. Updated to distinguish user-specified `cluster=` (still unsupported, raises NotImplementedError) from automatic PSU-level clustering (takes over under `survey_design` with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) — rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503` `cluster=` error messages — removed "Phase 1" phrasing, added PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will add multiplier-bootstrap support for placebo and bootstrap covers `DID_M`, `DID_+`, `DID_-` only — both stale after this PR's L_max >= 1 placebo and event-study bootstrap paths. Rewrote to scope the NaN-SE contract to `L_max=None` only and describe the full bootstrap coverage (overall, joiners, leavers, per-horizon event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
igerber added a commit that referenced this pull request on Apr 18, 2026

_sc_weight_fw_numpy ran its iterative Frank-Wolfe loop up to max_iter (R's default: 10000) and silently returned the final iterate when the convergence check vals[t-1] - vals[t] < min_decrease^2 never triggered early exit. This matches the silent-failure pattern audited under axis B of the silent-failures initiative (finding #1); REGISTRY:1499 previously documented this as "No warning emitted".

Adds a converged flag to the numpy path and calls the shared warn_if_not_converged helper on exhaustion. Updates the REGISTRY entry to describe the new signal. Rust-backend path is unchanged; the Rust FFI function signature currently returns weights only and would need to thread a convergence status — left as an axis-G backend-parity follow-up (tracked in the Phase 2 findings).

Warning-only: no new public parameter, no change to returned weights on inputs that already converge. Axis-B regression-lint baseline: 6 -> 5 silent range(max_iter) loops remaining (TROP global outer + inner + local).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
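The converged-flag pattern described above can be sketched generically (illustrative helper, not the library's `_sc_weight_fw_numpy` or `warn_if_not_converged`): track whether the decrease criterion ever fired, and warn on loop exhaustion instead of silently returning the last iterate.

```python
import warnings

def fw_minimize(step, x0, max_iter, min_decrease):
    """Iterate until the objective decrease falls below
    min_decrease**2; warn on exhaustion (illustrative sketch)."""
    x, vals, converged = x0, [], False
    for t in range(max_iter):
        x, val = step(x)
        vals.append(val)
        if t > 0 and vals[t - 1] - vals[t] < min_decrease ** 2:
            converged = True
            break
    if not converged:
        warnings.warn(
            f"did not converge within max_iter={max_iter}; "
            "returning the final iterate", RuntimeWarning,
        )
    return x, converged

# A geometrically decreasing objective converges well before max_iter...
_, ok = fw_minimize(lambda x: (x / 2, x / 2), 1.0, 100, 1e-3)

# ...while a constant-rate decrease exhausts the budget and warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _, exhausted_ok = fw_minimize(lambda x: (x - 0.01, x - 0.01), 1.0, 5, 1e-3)
```

Warning-only by design: the returned iterate is unchanged on inputs that already converge.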
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map plumbing flagged by CI review.

**P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.** When any variance-eligible group had no positive-weight cells (all-sentinel row in psu_codes_per_cell), the old code set valid_map=False and left BOTH group_id_to_psu_code_bootstrap AND psu_codes_per_cell_bootstrap as None. The bootstrap then silently dropped to unclustered group-level instead of excluding only that group's empty row. Fix: always populate psu_codes_per_cell_bootstrap once the tensor is built; the cell-level path already masks out -1 cells at unroll time. Always populate group_id_to_psu_code_bootstrap with a per-group code (use placeholder 0 for all-sentinel rows since those groups have no IF mass and the multiplier they receive is irrelevant on either the legacy or the cell-level path).

**P0 #2 - dense PSU codes factorized over non-eligible subset.** `np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU labels from groups that were filtered out of _eligible_group_ids (e.g., singleton-baseline-excluded groups). The excluded groups' PSUs contributed dense codes that formed gaps in the eligible subset's map. Downstream `_generate_psu_or_group_weights` computes `n_psu = max(code) + 1` and triggers the identity fast path when `n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]` silently activated independent-draws clustering for eligible groups that should have shared a multiplier. Fix: restrict the np.unique factorization to the eligible-subset positive-weight obs only (`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >= 0)`), so the dense code domain exactly matches the PSUs actually used by variance-eligible groups.

Tests:
- `test_bootstrap_zero_weight_group_equivalent_to_removing_it`: fit with vs without an all-zero-weight eligible group must produce byte-identical bootstrap SE at the same seed (byte-identity would have failed before P0 #1 fix because valid_map flipped the PSU-aware path off for the with-zero-group fit).
- `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`: spies on the group_id_to_psu_code dict passed to `_compute_dcdh_bootstrap` under a fixture with an always-treated singleton-baseline group and strictly-coarser PSU among eligible groups. Asserts the dict's values form a contiguous `[0, n_unique-1]` range (no gaps from the excluded group's PSU), and that eligible groups sharing a PSU label receive the same dense code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
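The gapped-code failure mode in P0 #2 is easy to reproduce with `np.unique` on illustrative arrays (values chosen to mirror the example above, not taken from the library):

```python
import numpy as np

# PSU labels per observation; the first obs belongs to a group that
# filtering excluded from the eligible set.
obs_psu = np.array([0, 1, 1, 2])
eligible = np.array([False, True, True, True])

# Factorizing over ALL obs leaks the excluded group's PSU into the
# code domain, leaving a gap for the eligible subset:
_, codes_all = np.unique(obs_psu, return_inverse=True)
gapped = codes_all[eligible]          # [1, 1, 2]
# max(code) + 1 overstates the PSU count -- here 3 instead of 2 --
# which can trip an identity fast path even though two obs share a PSU.
n_psu_gapped = gapped.max() + 1       # 3

# The fix: factorize over the eligible subset only, so codes are
# contiguous and max(code) + 1 equals the true PSU count.
_, codes_elig = np.unique(obs_psu[eligible], return_inverse=True)
n_psu = codes_elig.max() + 1          # 2
```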
igerber added a commit that referenced this pull request on Apr 19, 2026

…rying-PSU equivalence test

Addresses the three P2 findings from the CI re-review (all P0s cleared).

1. **Warning prepass assumed one PSU per group** (`chaisemartin_dhaultfoeuille.py:2111-2148`). The old code collected `labels[0]` per eligible group, so a within-group-varying PSU design was mis-counted as having one PSU per group and emitted a misleading "strictly-coarser PSU" UserWarning. Rewrite counts unique PSU labels across all positive-weight obs of eligible groups (not just the first label per group); under PSU=group unchanged, under varying-PSU no false warning.

2. **REGISTRY heterogeneity Note still claimed NotImplementedError** (`REGISTRY.md:618`, "Combining heterogeneity= with n_bootstrap > 0 and within-group-varying PSU still raises NotImplementedError"). That gate was removed in the current PR. Update to clarify that heterogeneity inference stays analytical when bootstrap runs on the main ATT surfaces — the two inference paths are independent.

3. **Zero-weight-equivalence test used `psu=group`** (`test_bootstrap_zero_weight_group_equivalent_to_removing_it`). Under PSU=group both the buggy and correct code paths collapse to the same identity-draw structure, so the test didn't actually exercise the P0 #1 regression. Switch the fixture to within-group-varying PSU (period parity per group) so the cell-level dispatcher is invoked and the before-fix silent-dropback bug would fail this test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses the four CI review findings:
- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was labeling them as BRR, which uses a different variance formula. Fixed script, phase label, scenario doc data-shape text, and example code snippet.
- Exit-code propagation. run_scenario now records a module-level failure flag; an atexit handler os._exit(1)s if any phase recorded ok=False. run_all.py's subprocess return-code check now reliably surfaces phase failures. Verified with a forced-failure harness test.
- Path references. bench_shared.py and run_all.py docstrings plus performance-plan.md prose normalized to benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs are gitignored and regenerated per run.

Adds memory measurement:
- psutil background RSS sampler (10ms) in run_scenario writes a memory field to every scenario JSON: start, peak, growth-during-run. Zero timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution for the BRFSS-1M scenario. Separate from the timing harness so its 2-5x overhead does not contaminate wall-clock baselines.

Memory findings extend the optimization priority list without changing the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M rows grows only 23 MB of working memory (vs 46 MB input), and tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is pure CPU - confirms the precompute-scaffolding fix is low-risk and fits in any deployment target including 512 MB Lambda.

Secondary finding: staggered CS chain allocates 252-322 MB at 1,500 units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
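The background-sampler pattern can be sketched with the standard library (tracemalloc stands in for psutil RSS here so the sketch runs anywhere; the class name, interval, and fields are illustrative, not the harness's code):

```python
import threading
import tracemalloc

class MemSampler:
    """Daemon thread that polls a memory probe on a short interval
    and records start and peak (illustrative sketch of the
    background-sampler idea; the real harness samples process RSS
    via psutil)."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.start_bytes = 0
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            current, _ = tracemalloc.get_traced_memory()
            self.peak = max(self.peak, current)
            self._stop.wait(self.interval)   # one cheap sample per tick

    def __enter__(self):
        tracemalloc.start()
        self.start_bytes = tracemalloc.get_traced_memory()[0]
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # Final sample so short-lived runs still record their peak.
        self.peak = max(self.peak, tracemalloc.get_traced_memory()[0])
        tracemalloc.stop()

with MemSampler() as sampler:
    buf = [bytearray(1_000_000) for _ in range(20)]   # ~20 MB held live
growth = sampler.peak - sampler.start_bytes
```

Because the sampler does one cheap call per tick on a background thread, it adds no measurable overhead to the timed phases.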
igerber added a commit that referenced this pull request on Apr 19, 2026

Addresses the second-round CI review findings:
- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: dose_response extract phase tried to pull event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and site-packages before writing the committed txt.

Drift-prevention layer:
- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):
- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:
- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low - the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
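The marker-delimited table rewrite can be sketched with a regex substitution (illustrative helper and document, not `gen_findings_tables.py` itself, which derives the new rows from the JSON baselines):

```python
import re

def rewrite_table(doc, table_id, new_rows):
    """Swap the content between <!-- TABLE:start <id> --> and
    <!-- TABLE:end <id> --> markers, leaving hand-written narrative
    untouched (illustrative sketch of the mechanism)."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # Lambda replacement avoids backslash escaping issues in new_rows.
    return pattern.sub(lambda m: m.group(1) + new_rows + "\n" + m.group(2), doc)

doc = (
    "Narrative prose stays hand-written.\n"
    "<!-- TABLE:start brfss_timings -->\n"
    "| scenario | old |\n"
    "<!-- TABLE:end brfss_timings -->\n"
)
out = rewrite_table(doc, "brfss_timings", "| scenario | new |")
```

Keeping the markers in the committed document means a rerun can always locate and regenerate the numbers without touching surrounding prose.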
igerber added a commit that referenced this pull request on Apr 19, 2026

…= workaround text

**P3 #1 (warning predicate inconsistent with "strictly coarser PSU" contract):** the new bootstrap warning block's comment said the warning fires only on strictly-coarser PSU designs, but the predicate `n_psu_eff_warn < n_groups_eff_warn` could also fire on supported varying-PSU designs whose eligible groups happened to share PSU labels across groups. Detect within-group-varying PSU explicitly (`.groupby("g")["p"].nunique().gt(1).any()`) and suppress the warning in that regime. Under auto-inject PSU=group and under within-group-varying PSU the warning now stays silent, matching the stated contract.

**P3 #2 (`_unroll_target_to_cells` suggested `psu=<group_col>` as a bootstrap workaround):** the Registry / CHANGELOG already clarified that `psu=<group_col>` is ONLY a Binder TSL workaround; the cell-level wild PSU bootstrap has no allocator fallback. The helper's docstring and `ValueError` message still advertised it as a bootstrap-path workaround. Dropped that suggestion and explicitly clarified: the varying-PSU bootstrap IS the cell-level path, so there is no legacy-allocator alternative to fall back to — pre-processing the panel is the only workaround on the bootstrap side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1 (methodology): mse_optimal_bandwidth now rejects boundary > d.min() with a clear ValueError. The Phase 1b wrapper is scoped to the HAD lower-boundary case (Design 1' with d_0 = 0 or Design 1 continuous-near-d_lower with d_0 = min D_2). Interior or upper-boundary inputs would silently run the boundary selector with a symmetric kernel and return a bandwidth incompatible with the one-sided fitter. The port remains available for interior / broader surface via _nprobust_port.lpbwselect_mse_dpi.

P1 #2 (code quality): lprobust_bw validates in-window observation counts at each of the three local-poly fits before calling qrXXinv:
- variance: n_V >= o+1
- B1: n_B1 >= o_B+1
- B2: n_B2 >= o_B+2

Each guard raises a targeted ValueError naming the failing stage, the bandwidth, and suggested remediation. Previously these failed with opaque LinAlgError from Cholesky on under-determined designs.

P3 (doc): local_linear.py module docstring updated to say Phase 1b "ships" instead of "will add"; tiny-sample test now asserts the new ValueError contract instead of accepting any non-IndexError failure.

New behavioral tests:
- test_interior_boundary_rejected: boundary=0.5 on U(0,1) rejected
- test_upper_boundary_rejected: boundary=d.max() rejected
- test_boundary_equal_to_min_d_accepted: boundary=min(d) accepted (Design 1 continuous-near-d_lower path)
- test_boundary_below_min_d_accepted: boundary=0 with d.min()>0 accepted (Design 1' path)
- test_bwcheck_none_on_tiny_sample_raises_valueerror: upgraded from "catch anything non-IndexError" to pytest.raises(ValueError, match="lprobust_bw").

153 tests pass (up from 149).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1 (methodology): mse_optimal_bandwidth now rejects Design 1 mass-point designs. When boundary > 0 and the modal fraction at d.min() exceeds the REGISTRY-specified 2% threshold, raise NotImplementedError pointing to the 2SLS sample-average estimator per de Chaisemartin et al. (2026) Section 3.2.4. Design 1' with untreated units at d=0 (boundary=0) is still accepted per Garrett et al. (2020) application precedent.

P1 #2 (code quality): qrXXinv now catches np.linalg.LinAlgError from Cholesky and re-raises as ValueError with a targeted message naming the failing dimension and suggesting remediation. Duplicate-support windows or other rank-deficient designs now fail with a clear error instead of leaking LinAlgError out of the port.

P3 (tests): Added TestStageDiagnosticsParity::test_R_parity covering all four stages. Previously only V/B1/B2 were pinned; R (BWreg) was only trivially checked for stage_d1 (scale=0 -> R=0). Now stage_b and stage_h R values are explicitly parity-tested at 1% against R nprobust.

New behavioral tests:
- test_mass_point_design_rejected: 10% mass at 0.1 -> NotImplementedError
- test_continuous_near_d_lower_accepted: uniform(0.1, 1.0) passes
- test_untreated_at_zero_accepted: 15% at d=0 with boundary=0 passes
- test_rank_deficient_design_raises_valueerror: rank-1 X -> ValueError
- R parity on all four stages across 3 DGPs (12 new parametrized cases)

169 tests pass (up from 153).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
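The LinAlgError-to-ValueError re-raise in P1 #2 follows a standard pattern, sketched here on a Cholesky-based X'X inverse (an illustrative stand-in for qrXXinv; the real message also suggests remediation):

```python
import numpy as np

def qr_xx_inv(X):
    """Invert X'X via Cholesky, re-raising rank deficiency as a
    targeted ValueError instead of leaking LinAlgError (sketch)."""
    XtX = X.T @ X
    try:
        L = np.linalg.cholesky(XtX)
    except np.linalg.LinAlgError as e:
        raise ValueError(
            f"qrXXinv: {XtX.shape[0]}x{XtX.shape[1]} X'X is not positive "
            "definite (duplicate-support window or rank-deficient design)"
        ) from e
    Linv = np.linalg.inv(L)
    return Linv.T @ Linv          # (L L')^-1 = L'^-T ... = L'^-1 @ L^-1

# A full-rank design inverts cleanly ...
X_ok = np.column_stack([np.ones(5), np.arange(5.0)])
inv_ok = qr_xx_inv(X_ok)

# ... while a rank-1 design (duplicated column) raises the new error.
X_bad = np.column_stack([np.ones(5), np.ones(5)])
err = None
try:
    qr_xx_inv(X_bad)
except ValueError as e:
    err = str(e)
```

Chaining with `from e` preserves the original LinAlgError in the traceback while giving callers a catchable, named failure.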
igerber added a commit that referenced this pull request on Apr 19, 2026

Reviewer correctly flagged that the 1%-of-median rule is a Phase 2 design="auto" heuristic, not Phase 1b. Backed off that over-reach.

P1 #1: Removed the min(d)/median(d) < 0.01 check. The mass-point guard now applies uniformly (whenever d.min() > 0 and modal fraction at d.min() > 2%) and does not gate on boundary. This still catches the original concern (silently routing mass-point data through the nonparametric branch) without rejecting valid Design 1' samples like Beta(2,2) where d.min() is strictly positive but small.

P1 #2: Tightened boundary validation. The wrapper now accepts only boundary ~ 0 (Design 1') or boundary ~ d.min() (Design 1 continuous-near-d_lower) within float tolerance. Off-support values -- including the previously-allowed "boundary < d.min()" path -- are rejected with a targeted error message.

P3: Added a public-wrapper duplicate-support regression that drives a rank-deficient X'X through the full selector stack (boundary = d.min(), unique minimum, only 4 distinct d values) and asserts a specific "qrXXinv" ValueError, not LinAlgError.

Test updates:
- Removed test_boundary_zero_with_positive_d_min_rejected: the case it modeled is now accepted (no mass point).
- Added test_boundary_zero_thin_boundary_density_accepted: Beta(2,2) Design 1' with vanishing boundary density now passes.
- Added test_off_support_boundary_rejected: boundary=0.5 on U(1,2).
- Added test_negative_boundary_rejected: boundary<0 rejected.
- Updated test_nonzero_boundary: uses boundary=float(d.min()), not boundary=1.0 (which is off the realized support of U(1,2)).

175 tests pass (up from 172).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 19, 2026

P1 #1: boundary=0 now enforces a Design 1' support plausibility heuristic: d.min() <= 5% * median(|d|). Samples with d.min() substantially positive (e.g. U(0.5, 1)) are rejected with ValueError directing the caller to boundary=float(d.min()). Threshold chosen at 5% (not REGISTRY's 1%) so the paper's thin-boundary-density DGPs (Beta(2,2), d.min/median ~ 3%) still pass. Reordered so the mass-point check (NotImplementedError, paper Section 3.2.4) fires before the support check -- mass-point data should be redirected to 2SLS regardless of the boundary the caller picked.

P1 #2: Empty-input front-door guard. d.size == 0 raises ValueError with a targeted "must be non-empty" message instead of leaking the NumPy reduction error from d.min().

P3 (docstring sync): _nprobust_port module docstring no longer says weighted data can be handled by the public wrapper -- the wrapper explicitly raises NotImplementedError. Docstring now matches the actual contract.

P3 (deferred, same as last round): tri/uni/shifted-boundary golden parity extension. REGISTRY.md Phase 1b note expanded to document the full input contract (nonnegativity, boundary applicability, Design 1' support heuristic, mass-point redirection) so the public API surface is fully specified in the methodology registry.

178 tests pass (up from 177).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
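The boundary contract that emerged across these review rounds can be sketched as one validation function (helper name, messages, and exact ordering are illustrative; the 5% threshold and the empty-input guard follow the commit text above):

```python
import numpy as np

def validate_boundary(d, boundary, support_frac=0.05, tol=1e-12):
    """Accept boundary ~ 0 (Design 1') only when d.min() is plausibly
    at the support edge, accept boundary ~ d.min() (Design 1
    continuous-near-d_lower), reject everything else (sketch)."""
    d = np.asarray(d, dtype=float)
    if d.size == 0:
        raise ValueError("d must be non-empty")
    if boundary < -tol:
        raise ValueError(f"boundary={boundary} must be nonnegative")
    d_min = float(d.min())
    if abs(boundary) <= tol:
        # Design 1' plausibility heuristic: d.min() <= 5% * median(|d|).
        if d_min > support_frac * float(np.median(np.abs(d))):
            raise ValueError(
                f"boundary=0 but d.min()={d_min:.3g} is far from 0; "
                "for Design 1 pass boundary=float(d.min())"
            )
    elif abs(boundary - d_min) > tol:
        raise ValueError(f"boundary={boundary} is off the realized support")

rng = np.random.default_rng(0)
d = rng.uniform(0.5, 1.0, 500)          # support well away from 0

err = None
try:
    validate_boundary(d, boundary=0.0)  # implausible Design 1' claim
except ValueError as e:
    err = str(e)

validate_boundary(d, boundary=float(d.min()))   # Design 1: accepted
```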
igerber
added a commit
that referenced
this pull request
Apr 19, 2026
The per-cell Taylor-series variance inside aggregate_survey previously rebuilt stratum-PSU scaffolding (np.unique, per-stratum pandas groupby, stratum FPC lookup) on every output cell. At BRFSS scale (50 states x 10 years = 500 cells, 20 strata, 1M microdata rows) this meant ~10K pandas groupby constructions, each summing a mostly-zero psi vector and paying full pandas setup cost — the entire chain's runtime.

This PR adds a frozen _PsuScaffolding dataclass plus private _precompute_psu_scaffolding(resolved) and _compute_if_variance_fast(psi, scf) helpers in diff_diff/survey.py. aggregate_survey builds scaffolding once per design and threads it through _cell_mean_variance via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized np.bincount passes (psi → PSU sums, PSU sums → per-stratum first and second moments) plus a closed-form meat = sum_h adjustment_h * centered_ss_h.

Scope is deliberately localized: _compute_stratified_psu_meat and compute_survey_if_variance are unchanged, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight designs continue to route through compute_replicate_if_variance unchanged.

Measured impact (benchmarks/speed_review/run_all.py, 1M rows BRFSS):
- Large: 24.4s → 1.33s (Python), 24.9s → 1.32s (Rust) [18.4-19.0x]
- Medium: 6.1s → 0.49s [12.5-13.2x]
- Small: 1.6s → 0.17s [7.6-10x]

No regression in any other scenario (all within run-to-run noise).

Numerical equivalence: new TestAggregateSurveyScaffolding asserts assert_allclose(atol=1e-14, rtol=1e-14) between fast and legacy paths across seven design cases — stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three lonely_psu modes (remove / certainty / adjust) — plus structural tests on the scaffolding itself. On the actual BRFSS-large 1M-row panel, y_mean is bit-identical and y_se / y_precision drift at ~1 ULP (max relative diff 4.6e-16).

Existing coverage unchanged: all 43 TestAggregateSurvey tests green on the fast path (new default); all 129 test_survey.py tests green.

Documentation:
- docs/performance-plan.md: finding #1 rewritten ("practitioner-fast at every scale"), BRFSS bullet updated, hotspots row #1 marked LANDED, memory finding updated, priority table item #1 marked LANDED, new "Optimization landed" subsection, bottom line updated ("no practitioner-perceptible bottleneck remains"). Auto-tables regenerated via gen_findings_tables.py.
- CHANGELOG.md: new Performance entry under [Unreleased].

No user-facing API change. Methodology docs (REGISTRY.md, survey-theory.md) are deliberately not touched: this is a pure internal performance optimization with numerics preserved to sub-ULP tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
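A minimal sketch of the two-pass np.bincount idea described above (the function name and signature are hypothetical stand-ins, not the library's actual `_compute_if_variance_fast` helper):

```python
import numpy as np

def stratified_psu_variance_fast(psi, psu_idx, stratum_of_psu, adjustment):
    """Vectorized Taylor-series 'meat' for a stratified PSU design.

    psi            : per-observation influence values
    psu_idx        : integer PSU code per observation (0..n_psu-1)
    stratum_of_psu : integer stratum code per PSU (0..n_strata-1)
    adjustment     : per-stratum factor, e.g. n_h/(n_h-1) times any FPC
    """
    # Pass 1: sum psi within each PSU (replaces the per-stratum groupby).
    psu_sums = np.bincount(psu_idx, weights=psi)
    # Pass 2: per-stratum first and second moments of the PSU sums.
    n_psu_h = np.bincount(stratum_of_psu)
    s1 = np.bincount(stratum_of_psu, weights=psu_sums)
    s2 = np.bincount(stratum_of_psu, weights=psu_sums ** 2)
    # Centered sum of squares per stratum: sum z^2 - n * (mean z)^2.
    centered_ss = s2 - s1 ** 2 / n_psu_h
    # Closed-form meat = sum_h adjustment_h * centered_ss_h.
    return float(np.sum(adjustment * centered_ss))
```

Replacing the groupby loop with bincount is what makes the cost per cell O(n) array passes with no pandas setup overhead.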
igerber
added a commit
that referenced
this pull request
Apr 20, 2026
**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples**

When a user explicitly forced design="continuous_near_d_lower" on a sample that actually satisfies the >2% modal-fraction mass-point criterion, the downstream regressor shift (D - d_lower) would move the support minimum to zero on the shifted scale. Phase 1c's mass-point rejection guard only fires when d.min() > 0 (_validate_had_inputs), so the silent coercion ran the nonparametric local-linear estimator on a sample the paper (Section 3.2.4) requires to use the 2SLS branch, producing the wrong estimand.

Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction check on the ORIGINAL (unshifted) d_arr when the user explicitly selects design="continuous_near_d_lower". If the fraction at d.min() exceeds 2%, the fit raises ValueError pointing to design="mass_point" or design="auto". design="auto" is unaffected (_detect_design already correctly resolves such samples to mass_point).

**P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic**

The previous validator called `.astype(np.float64)` and `int(v)` on grouped first_treat values, which crashed on otherwise-supported string-labelled two-period panels (period in {"A","B"}, first_treat in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value set-membership against `{0, t_post}` with no numeric coercion.

**P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type**

When cluster was supplied, `_fit_mass_point_2sls` unconditionally switches to the CR1 cluster-robust sandwich, but the result object stored the REQUESTED family ("hc1" or "classical") as `vcov_type`. `summary()` rendered correctly via the cluster_name branch, but `to_dict()` and downstream programmatic consumers saw the stale requested label. Fixed: when cluster is supplied, `vcov_type` is stored as `"cr1"` regardless of the requested family. Renamed the local variable from `vcov_effective` to `vcov_requested` to separate the input from the effective family. Updated the `HeterogeneousAdoptionDiDResults.summary()` branch so the cluster rendering still works with the new stored value.

**Tests added (+8 regression):**
- TestValidateHadPanel.test_first_treat_col_with_string_periods
- TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string
- TestContinuousPathRejectsMassPoint (2 tests)
- TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base family when unclustered, classical+cluster collapses to cr1, to_dict shows effective family)

Targeted regression: 126 HAD tests + 505 total across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
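The P1 #1 modal-fraction check can be sketched roughly as follows (hypothetical helper name and standalone signature; the library performs this inside `HeterogeneousAdoptionDiD.fit()`, and the 2% threshold is the paper's criterion as quoted above):

```python
import numpy as np

def check_mass_point_at_lower(d, threshold=0.02):
    """Reject design='continuous_near_d_lower' when the modal fraction
    at d.min() exceeds the mass-point threshold.

    Must run on the ORIGINAL, unshifted dose array: after the
    (D - d_lower) shift, d.min() becomes 0 and a min-positivity
    guard can no longer detect the mass point.
    """
    d = np.asarray(d, dtype=float)
    frac_at_min = float(np.mean(d == d.min()))
    if frac_at_min > threshold:
        raise ValueError(
            f"{frac_at_min:.1%} of doses sit at d.min()={d.min():g}, "
            "which satisfies the mass-point criterion; use "
            "design='mass_point' or design='auto' instead."
        )
```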
igerber
added a commit
that referenced
this pull request
Apr 22, 2026
**P1 #1 — Stute tie-safe CvM**

Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each observation's dose, so tied observations share the post-tie cumulative sum. My naive cumsum over sorted residuals produced partial within-tie sums that were row-order-dependent. Fix: after cumsum, replace within-tie-block values with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now accepts `d_sorted` and collapses tie blocks before squaring. Regression test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses` validates the end-to-end stute_test contract.

**P1 #2 — Exact-linear fit must fail-to-reject (not return NaN)**

For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix: dropped the var(eps) degeneracy branch from stute_test (the bootstrap naturally produces p=1 when eps=0 exactly). Added a scale-relative short-circuit (sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat the short-circuit by producing non-zero but tiny OLS residuals. Yatchew exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan, TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.

**P1 #3 — HADPretestReport.all_pass contract**

Previously `all_pass = not (reject or reject or reject)` could be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every constituent p-value being finite AND no test rejecting. Updated docstring. Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.

**P2 #1 — QUG negative-dose guard**

HAD doses must be non-negative (paper Section 2). The raw qug_test API was silently folding d < 0 rows into the n_excluded_zero counter (filter was `d > 0`). Fix: front-door ValueError on any d < 0. Regression: TestQUGTest.test_negative_dose_raises.

**P3 #1 — QUG np.partition**

REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤ partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight closed-form parity at atol=1e-12 still holds.

**P3 #2 — REGISTRY n_bootstrap default**

REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated REGISTRY to match code and added a note about the n_bootstrap >= 99 front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
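The P1 #1 tie-collapse step can be sketched as below (standalone function, hypothetical name; the library folds this into `_cvm_statistic`). The key point is that every observation in a tie block gets the block's FINAL cumulative sum, which makes the statistic invariant to the within-tie row order:

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """c_G(D_g) = sum of eps over {D <= D_g}: tied doses must all
    share the cumulative sum at the END of their tie block."""
    c = np.cumsum(eps_sorted)
    # d_sorted is sorted, so np.unique's first-occurrence indices plus
    # counts give each tie block's span; last index = start + count - 1.
    _, starts, counts = np.unique(
        d_sorted, return_index=True, return_counts=True
    )
    block_last = starts + counts - 1
    # Broadcast each block's final cumsum back over the whole block.
    return np.repeat(c[block_last], counts)
```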
igerber
added a commit
that referenced
this pull request
Apr 22, 2026
**R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive"**

The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" BEFORE checking whether any conclusive test had rejected. The reviewer's example: G=2 with QUG rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates — the workflow emitted "inconclusive - both linearity NaN", hiding a real assumption failure. The paper's rule is one-way: TWFE is admissible only if NO test rejects. A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
1. Collect rejections from conclusive tests first. If any, that is the primary verdict, and unresolved-step notes are APPENDED via "; additional steps unresolved: ..." rather than replacing the rejection.
2. Only when NO conclusive rejection exists AND a required step is unresolved do we return a pure "inconclusive - ..." verdict.
3. Otherwise fall through to the partial-workflow fail-to-reject verdict (with "(Yatchew NaN - skipped)" suffix if applicable).

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

**R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses**

qug_test and _validate_had_panel both front-door-reject d < 0 (paper Section 2 HAD support restriction), but the new linearity helpers only validated shape + NaN. Negative doses are outside the method's stated scope and could silently produce conclusive-looking output. Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test now raise ValueError on any d < 0 with a message directing users to pre-process or check the dose column. Docstrings updated to list the new contract in the Raises section.

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

**R6 P2 — Docstrings / REGISTRY sync**

HADPretestReport.verdict docstring rewritten to describe the new "rejection-first, unresolved-suffix" priority. REGISTRY Phase 3 workflow checkbox updated to document the conclusive-rejection-not-hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
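The three-step reordering can be sketched as a tiny pure function (hypothetical standalone signature; the real _compose_verdict works off the report's test results and emits the library's exact wording):

```python
def compose_verdict(rejections, unresolved):
    """Rejection-first verdict priority.

    rejections : names of tests that conclusively rejected
    unresolved : required steps that returned NaN / were skipped
    """
    # Step 1: a conclusive rejection is the primary verdict; unresolved
    # steps are appended as a suffix, never allowed to mask it.
    if rejections:
        verdict = "reject - " + ", ".join(rejections)
        if unresolved:
            verdict += "; additional steps unresolved: " + ", ".join(unresolved)
        return verdict
    # Step 2: only with no rejection does "inconclusive" surface.
    if unresolved:
        return "inconclusive - " + ", ".join(unresolved) + " unresolved"
    # Step 3: everything conclusive and nothing rejected.
    return "fail to reject"
```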
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
**ContinuousDiD staggered support (P1 #1):** the matrix marked staggered=✗, but the method natively supports staggered adoption via the `first_treat` column (continuous_did.py:159-169, 919-925; REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

**Time-invariant dose requirement (P1 #2):** ContinuousDiD.fit() requires dose to be time-invariant per unit (continuous_did.py:222-228; docs/methodology/continuous-did.md:L70-75), but profile_panel() did not expose this, so time-varying-dose continuous panels were routed to ContinuousDiD only to hard-fail at fit time. Added `PanelProfile.treatment_varies_within_unit: bool` — True iff any unit has more than one distinct non-NaN treatment value across its observed rows. Computed unconditionally for numeric (non-bool) treatment columns; False for categorical. `to_dict()` exposes it. Guide §2 documents the field; §4.7's ContinuousDiD bullet lists two eligibility prerequisites: P(D=0) > 0 AND treatment_varies_within_unit == False.

**Tests (P2):**
- test_continuous_treatment_with_time_varying_dose: random-per-row continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose -> treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓; guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
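The new field's contract ("True iff any unit has more than one distinct non-NaN treatment value") reduces to a one-line groupby; a sketch with a hypothetical standalone helper name:

```python
import pandas as pd

def treatment_varies_within_unit(df, unit_col, treat_col):
    """True iff any unit has >1 distinct non-NaN treatment value
    across its observed rows (pandas nunique skips NaN by default,
    matching the non-NaN part of the contract)."""
    return bool(df.groupby(unit_col)[treat_col].nunique().gt(1).any())
```

Units whose dose is constant (or observed only once) contribute nunique <= 1 and never trip the flag.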
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
…corrected scope; cover new exports in import-surface test

**P3 #1 (ROADMAP wording drift):** ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1).

**P3 #2 (import-surface test coverage):** `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

**P1 #1 (Wooldridge Poisson estimand wording):** The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description.

**P1 #2 (`is_count_like` non-negativity guard):** The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: the Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing.

**P2 (test coverage for both P1s):** Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
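The five-condition heuristic quoted above can be sketched as a standalone predicate (hypothetical free function; the library computes this inside its outcome-shape profiling, and the moment-coefficient skewness here is one reasonable reading of "skewness > 0.5"):

```python
import numpy as np

def is_count_like(y):
    """is_integer_valued AND pct_zeros > 0 AND skewness > 0.5
    AND n_distinct_values > 2 AND value_min >= 0."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    skew = np.mean(centered ** 3) / (np.std(y) ** 3)
    return bool(
        np.all(y == np.round(y))      # integer-valued
        and np.any(y == 0)            # has zeros
        and skew > 0.5                # right-skewed
        and np.unique(y).size > 2     # > 2 distinct values
        and y.min() >= 0.0            # NEW: non-negative support
    )
```

The last clause is the fix: without it, a right-skewed integer outcome with zeros and a few negatives would pass, then hard-fail in the Poisson branch.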
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…ctive-support guard

**P1 #1 (FPC validator in SurveyDesign.resolve fires on placebo with explicit psu):** The R10 fix gated the in-fit implicit-PSU FPC validator on bootstrap/jackknife only, but ``SurveyDesign.resolve()`` itself enforces ``FPC >= n_PSU`` design-validity (survey.py:349-368) before ``synthetic_did.fit()`` even sees the resolved object. So a placebo fit with explicit ``psu`` and low ``fpc`` would still raise — the same parameter-interaction problem, one layer earlier in resolution.

Fix: when ``variance_method == "placebo"`` and ``survey_design.fpc is not None``, construct an FPC-stripped copy of the SurveyDesign (``dataclasses.replace(survey_design, fpc=None)``) BEFORE calling ``_resolve_survey_for_fit``. Emit the FPC no-op ``UserWarning`` at the same time. The original ``survey_design`` object is preserved (caller's reference unchanged); the resolved unit-level survey design carries no FPC on placebo, so the in-fit validators (and the downstream FPC-related dispatch flags) all correctly skip FPC handling. The duplicate downstream FPC no-op warning (added in R8, keyed on ``resolved_survey_unit.fpc``) becomes unreachable on placebo and is removed.

New regression ``test_placebo_low_fpc_with_explicit_psu_skips_resolve_validator`` asserts:
(a) placebo with explicit psu + ``fpc < n_PSU`` succeeds and emits the no-op warning,
(b) the SE matches the no-FPC fit at ``rel=1e-12``,
(c) bootstrap on the same low-FPC design still raises ``"FPC (2.0) is less than the number of PSUs"`` from ``SurveyDesign.resolve()`` — the validator skip is correctly variance-method-gated.

**P1 #2 (Case D missed effective single-support):** The Case D guard for placebo degeneracy keyed on raw control counts (``n_c_h > n_t_h`` for at least one stratum). It missed the case where ``n_c_h_positive < 2`` for every treated stratum: the rows allow multiple subsets, but every successful pseudo-treated mean reduces to the unique positive-weight control's outcome (zero-weight cohabitants contribute 0 to numerator and denominator, R11 P1). The placebo null collapses to a single point and the SE is FP noise.

Fix: extend the non-degeneracy invariant to require **both** ``n_c_h > n_t_h`` AND ``n_c_h_positive >= 2`` for at least one treated stratum. The classical Case D shape (raw exact-count ``n_c_h == n_t_h``) and the new "effective single-support" shape (positive-weight controls < 2 even with extra zero-weight rows) both trigger Case D. Updated the Case D error message to enumerate ``n_c_positive`` alongside ``n_c`` / ``n_t`` per stratum.

New regression ``test_placebo_full_design_raises_on_effective_single_support`` constructs a fixture with 1 treated unit + 1 positive-weight control + 9 zero-weight controls in stratum 0; the raw guards (B/C/E) pass but Case D fires with the new "single distinct positive-mass pseudo-treated mean" message. Updated the existing ``test_placebo_full_design_raises_on_exact_count_stratum`` regex to match the new message (same Case D path, slightly different wording).

REGISTRY §SyntheticDiD Case enumeration updated: Case D now documents both the classical (``n_c == n_t``) and effective single-support (``n_c_positive < 2``) shapes, with the combined non-degeneracy invariant.

Verification: 98 passed (2 new regressions; existing Case B/C/E/D-classical guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
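The P1 #1 copy-before-resolve pattern is essentially this (the dataclass here is a minimal hypothetical stand-in for the library's SurveyDesign, kept only to make the `dataclasses.replace` mechanics concrete):

```python
import dataclasses
import warnings
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SurveyDesignSketch:
    """Minimal stand-in for the library's SurveyDesign (hypothetical)."""
    psu: Optional[str] = None
    fpc: Optional[str] = None

def strip_fpc_for_placebo(design, variance_method):
    """On placebo variance, drop the FPC BEFORE resolution so the
    FPC >= n_PSU validator never fires; other variance methods pass
    the design through untouched."""
    if variance_method == "placebo" and design.fpc is not None:
        warnings.warn(
            "FPC is a no-op under placebo variance; ignoring it.",
            UserWarning,
        )
        # Frozen dataclass: replace() returns a modified copy, so the
        # caller's original design object is left intact.
        return dataclasses.replace(design, fpc=None)
    return design
```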
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to: - Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path. - Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio. - Mention that a proportional / percent-change interpretation can be derived post-hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar. Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description. P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. 
Verified against `wooldridge.py:1105-1109`: Poisson method hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0 guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement. Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing. P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…corrected scope; cover new exports in import-surface test P3 #1 (ROADMAP wording drift): ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE / ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance", which contradicted the round-1 corrections to TreatmentDoseShape's docstring + autonomous guide §2 + §5.2. Reworded to match: the new fields add descriptive distributional context only; `outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD QMLE judgment, and the authoritative ContinuousDiD pre-fit gates remain `has_never_treated`, `treatment_varies_within_unit`, and `is_balanced`. "Time-invariance" wording removed (the field was dropped in round 1). P3 #2 (import-surface test coverage): `test_top_level_import_surface()` previously only verified `profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the two new public exports `OutcomeShape` and `TreatmentDoseShape`, asserting both their importability and their presence in `diff_diff.__all__`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording): The guide §4.11 and §5.3 worked example described `WooldridgeDiD(method="poisson")`'s `overall_att` as a "multiplicative effect" / "log-link effect" / "proportional change" to be reported. Verified against `wooldridge.py:1225` (`att = _avg(mu_1 - mu_0, cell_mask)`) and `_reporting_helpers.py:262-281` (registered estimand: "ASF-based average from Wooldridge ETWFE ... average-structural-function (ASF) contrast between treated and counterfactual untreated outcomes ... on the natural outcome scale"): the actual quantity is `E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a multiplicative ratio. An agent following the previous wording would misreport the headline scalar. Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference, citing `wooldridge.py:1225` and Wooldridge (2023) + REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can be derived post hoc as `overall_att / E[Y_0]` but is not the estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand` in `tests/test_guides.py`: extracts the §4.11 and §5.3 blocks, asserts the forbidden phrases ("multiplicative effect under qmle", "estimates the multiplicative effect", "multiplicative (log-link) effect", "report the multiplicative effect", "report the multiplicative") do NOT appear, and asserts §5.3 explicitly contains "ASF" and "outcome scale" so future edits cannot silently weaken the description.

P1 #2 (`is_count_like` non-negativity guard): The `is_count_like` heuristic gated on integer-valued + has-zeros + right-skewed + > 2 distinct values, but did NOT exclude negative support. Verified against `wooldridge.py:1105-1109`: the Poisson method hard-rejects `y < 0` with `ValueError`. Without a `value_min >= 0` guard, a right-skewed integer outcome with zeros and some negatives would set `is_count_like=True` and steer an agent toward an estimator that then refuses to fit. Added `value_min >= 0.0` to the heuristic and explained the non-negativity gate in the docstring + autonomous guide §2 field reference (now reads "is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0"). The guide also notes that the gate exists specifically to align the routing signal with WooldridgeDiD Poisson's hard non-negativity requirement.

Added `test_outcome_shape_count_like_excludes_negative_support` in `tests/test_profile_panel.py` covering a Poisson-distributed outcome with a small share of negative integers spliced in: asserts `is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s): Both regressions above guard the new contracts. The guide test guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
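The five-condition gate can be restated as a compact predicate. The sketch below is illustrative (the helper name `looks_count_like` and the skewness estimator are assumptions for this sketch, not the library's internals); it shows how the new `value_min >= 0` term vetoes the signal for outcomes with negative support:

```python
import numpy as np

# Illustrative restatement of the five-condition count-like gate. Helper
# name and skewness estimator are assumptions, not the library's internals.
def looks_count_like(y) -> bool:
    y = np.asarray(y, dtype=float)
    is_integer_valued = bool(np.all(np.isclose(y, np.round(y))))
    pct_zeros = float(np.mean(y == 0.0))
    centered = y - y.mean()
    skewness = float(np.mean(centered**3) / np.std(y) ** 3)  # biased Fisher skew
    n_distinct = int(np.unique(y).size)
    value_min = float(y.min())
    return (
        is_integer_valued
        and pct_zeros > 0
        and skewness > 0.5
        and n_distinct > 2
        and value_min >= 0.0  # aligns the signal with Poisson's hard y >= 0 gate
    )

rng = np.random.default_rng(0)
poisson_y = rng.poisson(1.5, size=500).astype(float)
spliced = np.append(poisson_y, [-1.0, -2.0])  # negative integers spliced in

print(looks_count_like(poisson_y))  # True: all five conditions fire
print(looks_count_like(spliced))    # False: value_min < 0 vetoes the signal
```

The spliced case mirrors the regression test described above: the first four conditions still hold, and only the non-negativity term flips the result.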
igerber added a commit that referenced this pull request on Apr 25, 2026
R1 P0 — Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; the survey= branch was the gap).

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors the HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — the workflow's row-level weights= crashed on staggered event-study panels because _validate_multi_period_panel filters to the last cohort but the joint wrappers re-aggregate with the original full-panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors the HeterogeneousAdoptionDiD.fit positional-index pattern. The survey= path is unaffected (column references resolve internally on data_filtered).

R1 P3 — the REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify that (a) the workflow gate was temporary and is now closed by C, (b) the qug_test direct-helper gate remains permanent, and (c) C uses a PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash).

165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
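The positional-subsetting fix can be sketched in isolation. A minimal illustration of the pattern (toy column names; the real workflow operates on event-study panels), showing why positions from `get_indexer` are used rather than label-based indexing on a weights array that has no index of its own:

```python
import numpy as np
import pandas as pd

# Row-level weights supplied for the FULL panel must be subset to the
# filtered panel's rows by POSITION before any helper that expects
# len(weights) == len(data_filtered).
data = pd.DataFrame(
    {"unit": [1, 1, 2, 2, 3, 3], "cohort": [2004, 2004, 2006, 2006, 2006, 2006]},
    index=[10, 11, 12, 13, 14, 15],  # non-default index, as in real panels
)
weights = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # aligned to data's rows

# e.g. a staggered filter keeping only the last cohort
data_filtered = data[data["cohort"] == 2006]

# positional subsetting, mirroring data.index.get_indexer(data_filtered.index)
pos = data.index.get_indexer(data_filtered.index)
weights_filtered = weights[pos]
print(weights_filtered)  # [2. 2. 3. 3.]
```

Passing the full six-element `weights` array to a helper that sees only the four filtered rows is exactly the length-mismatch crash the new regression test pins.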
igerber added a commit that referenced this pull request on Apr 25, 2026
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered-weights subset fix from R1 was only applied at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...), which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix.

R2 P1 #2 (Methodology) -- the REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). The code is correct: the paper's Appendix D wild bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update the REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights and (b) bootstrap p-value distributional agreement within Monte Carlo noise.

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- The qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (a PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism.
- The HADPretestReport.all_pass docstring described the unweighted contract only; the survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per the C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + a p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
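The pinned perturbation form is easy to demonstrate with stand-in arrays. A hedged sketch (all names and data are illustrative, not the library's) of why the weight stays out of the perturbation and enters only through the weighted statistic:

```python
import numpy as np

# The wild bootstrap perturbs UNWEIGHTED residuals; the survey weight w
# enters only through the weighted refit / weighted statistic downstream.
rng = np.random.default_rng(42)
n = 200
fitted = rng.normal(size=n)               # stand-in fitted values
eta_obs = rng.normal(scale=0.5, size=n)   # stand-in OLS residuals
w = rng.uniform(0.5, 2.0, size=n)         # survey weights (downstream only)

eps = rng.choice([-1.0, 1.0], size=n)     # Rademacher multipliers, one per obs
dy_b = fitted + eps * eta_obs             # correct: no `* w` in the perturbation

# Weighting flows through the weighted objective instead, e.g. a weighted
# mean of the bootstrap outcome; putting w in the perturbation too would
# count it twice (w^2) in the weighted statistic.
weighted_stat = float(np.sum(w * dy_b) / np.sum(w))
```

The over-weighting claim falls out directly: a weighted sum of `w * (eps * w * eta)` terms carries `w**2` against each residual, which is not what the sandwich/CvM derivation assumes.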
igerber added a commit that referenced this pull request on Apr 25, 2026
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved weights/survey on the FULL panel before _validate_multi_period_panel applied the staggered last-cohort filter. Because _resolve_pretest_unit_weights enforces strictly-positive per-unit weights / pweight type / etc. on whatever data it sees, zero or otherwise-invalid weights on the soon-to-be-dropped cohort would abort an otherwise-valid event-study run. Fix: defer resolution to the per-aggregate branches.
- Top level: only the survey/weights mutex check + use_survey_path presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel (no cohort filtering on this path; the original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint wrappers (joint_pretrends_test / joint_homogeneity_test) own resolution and already see data_filtered (post staggered filter). Row-level weights= passed through with the existing positional subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage gap. Existing tests covered overall-workflow + trivial/no-PSU smokes; the PSU-aware multiplier-bootstrap path (the core new methodology) was unpinned for joint_homogeneity_test and the event-study workflow. 3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study dispatch under PSU/strata clustering: validate_multi_period_panel + resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (regression for the R6 P1 #1 fix: a panel where the dropped early cohort has zero weights succeeds on the surviving last cohort; pre-fix this crashed with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC", but R10 narrowed Stute-family support to pweight + PSU + FPC only (stratified is rejected with NotImplementedError pending derivation). Updated to reflect the actual support surface and consolidated the stratified-Stute follow-up alongside replicate-weight pretests as the two known Phase 4.5 C follow-ups.

R12 P3 #2 -- the new survey test matrix covered pweight-only and PSU-only smokes but no FPC-only case. The bootstrap helper applies sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with SurveyDesign(weights=, fpc=) column reference.

193 pretest tests pass (was 191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
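The sqrt(1 - f) scaling the new smokes pin down is a one-liner. A sketch under stated assumptions (Rademacher multipliers shown for simplicity; per the commit above, the shipped path uses Mammen draws at the PSU level):

```python
import numpy as np

# Finite-population correction applied to bootstrap multipliers: as the
# sampling fraction f = n/N approaches 1 (a census), perturbations shrink
# toward zero because there is no sampling variance left to mimic.
rng = np.random.default_rng(7)
n, N = 50, 200                      # sample size and population size
f = n / N                           # sampling fraction from the FPC inputs
multipliers = rng.choice([-1.0, 1.0], size=n)
scaled = np.sqrt(1.0 - f) * multipliers

print(round(float(np.sqrt(1.0 - f)), 4))  # 0.866
```

With f = 0.25 the perturbations shrink by about 13%; an FPC-only regression has to observe this scaling directly, since the unscaled and scaled statistics otherwise agree up to a constant factor.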
igerber added a commit that referenced this pull request on Apr 25, 2026
…erage

P3 #1: The ``to_dataframe`` method docstring at ``chaisemartin_dhaultfoeuille_results.py:1375-1379`` listed the pre-change ``level="by_path"`` schema (no ``cband_*`` columns) even though the implementation now returns them. Updated the bullet to include ``cband_lower / cband_upper``, document the negative-horizon placebo convention, and document the NaN-on-absent-band behavior.

P3 #2: ``TestByPathSupTBands::test_path_sup_t_seed_reproducibility`` only exercised the default ``rademacher`` weight family. Parameterized over ``["rademacher", "mammen", "webb"]`` to pin that the per-path sup-t branch correctly threads ``self.bootstrap_weights`` through ``_generate_psu_or_group_weights`` for all three multiplier families the feature advertises. The existing OVERALL machinery handles all three uniformly, but the per-path surface lacked direct coverage. Each variant must produce a finite, reproducible crit on the standard 3-path fixture.

17 tests pass on TestByPathSupTBands (was 15: +2 new parameterized variants on the existing seed_reproducibility test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
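For reference, the three multiplier families named above have standard mean-0 / variance-1 definitions. The sketch below is a generic restatement under those textbook definitions (the function name is hypothetical, not the library's `_generate_psu_or_group_weights`):

```python
import numpy as np

# Generic mean-0 / variance-1 multiplier families commonly used for wild /
# multiplier bootstraps. Function name is illustrative.
def draw_multipliers(family, size, rng):
    if family == "rademacher":
        return rng.choice([-1.0, 1.0], size=size)
    if family == "mammen":
        phi = (1.0 + np.sqrt(5.0)) / 2.0    # golden ratio
        p = phi / np.sqrt(5.0)              # P(v = 1 - phi), about 0.7236
        return rng.choice([1.0 - phi, phi], size=size, p=[p, 1.0 - p])
    if family == "webb":
        vals = [-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                np.sqrt(0.5), 1.0, np.sqrt(1.5)]
        return rng.choice(vals, size=size)  # six points, equal probability
    raise ValueError(f"unknown multiplier family: {family}")

rng = np.random.default_rng(0)
draws = {f: draw_multipliers(f, 100_000, rng)
         for f in ("rademacher", "mammen", "webb")}
for f, d in draws.items():
    print(f, round(float(d.mean()), 2), round(float(d.var()), 2))  # each ~0.0, ~1.0
```

Because all three families share the first two moments, a seed-reproducibility test parameterized over them primarily checks the plumbing (that the chosen family actually reaches the draw site), not the statistic's value.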
igerber added a commit that referenced this pull request on Apr 25, 2026
R2 P1: extended dispatch-matrix coverage on the new survey_design= front door. Added 3 test classes covering paths that PR #376 fronted but didn't directly test:
- TestHADFitMassPointSurveyDesign: design='mass_point' + survey_design= smoke + legacy-alias att parity (vcov_type='hc1' required by the Phase 4.5 B mass-point + survey deviation).
- TestHADFitEventStudySurveyDesign: aggregate='event_study' + cband=True + survey_design= smoke + legacy survey= parity (full bit-equality on att, se under the same seed + design).
- TestDidHadPretestWorkflowEventStudySurveyDesign: workflow event-study smoke via survey_design=, plus legacy survey= and weights= parity. The weights= parity test also locks the R2 P3 nested-warning suppression (asserts exactly ONE DeprecationWarning fires from the workflow front door, not three from cascading joint wrappers).

R2 P3 #1: the workflow's event-study `weights=` path was emitting up to 3 DeprecationWarnings (one at the workflow front door + one each from the joint wrappers' internal weights= path). Wrap the internal joint-wrapper calls in `warnings.catch_warnings() + simplefilter("ignore", DeprecationWarning)` since the user-facing warning has already fired at the workflow front door. The joint wrappers can't accept ResolvedSurveyDesign (their `_resolve_pretest_unit_weights` requires a SurveyDesign with .resolve()), so converting weights= to survey_design= via make_pweight_design isn't an option here. Locked by the new test_legacy_alias_parity_weights assertion `n_dep_warnings == 1`.

R2 P3 #2: the qug_test mutex error pointed users to `survey_design=make_pweight_design(arr)` as a migration target via the shared HAD_DUAL_KNOB_MUTEX_MSG_ARRAY_IN constant, but qug_test permanently rejects ALL survey_design/survey/weights inputs (Phase 4.5 C0 deferral). Replaced with a qug-specific mutex message that says "no migration path; see NotImplementedError below" instead of suggesting make_pweight_design.

545 tests pass (was 538 + 7 new dispatch-matrix tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
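The single-warning contract locked by `n_dep_warnings == 1` relies on the standard `catch_warnings` suppression idiom. A self-contained sketch (function names and warning text are illustrative):

```python
import warnings

# The front door fires ONE user-facing DeprecationWarning, then silences the
# identical warnings its internal helpers would re-emit.
def _inner_helper():
    warnings.warn("weights= is deprecated", DeprecationWarning, stacklevel=2)

def front_door():
    warnings.warn("weights= is deprecated", DeprecationWarning, stacklevel=2)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        _inner_helper()  # would otherwise fire a second, redundant warning
        _inner_helper()  # ...and a third

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    front_door()

n_dep = sum(issubclass(c.category, DeprecationWarning) for c in caught)
print(n_dep)  # 1 -- exactly one warning reaches the caller
```

`catch_warnings()` saves and restores the filter state, so the inner "ignore" filter cannot leak out and suppress unrelated warnings elsewhere in the call stack.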
igerber added a commit that referenced this pull request on Apr 25, 2026
R9 P3 #1 (helper error-message canonical-kwarg consistency): `_resolve_pretest_unit_weights`'s TypeError on non-`SurveyDesign`-like input still said `survey=` must be a SurveyDesign — but on the data-in wrappers (workflow / joint_pretrends_test / joint_homogeneity_test) the canonical kwarg is now `survey_design=`. Updated the message to name `survey_design=` (with `survey=` flagged as the deprecated alias) and to point pre-resolved-design users to the array-in pretest helpers, mirroring HAD.fit's data-in guard.

R9 P3 #2 (legacy-vs-canonical parity coverage on data-in pretests): Added 3 parity tests (test_legacy_alias_parity_survey on joint_pretrends_test + joint_homogeneity_test, plus test_legacy_alias_parity_survey_overall on the did_had_pretest_workflow overall path). Locks the rebinding contract on the data-in surfaces that previously only had smoke / warning / mutex coverage.

558 tests pass (was 555 + 3 new R9 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
R10 P3 #1 (qug_test deprecation warning text): qug_test was using the shared array-in deprecation messages that point users to migrate to `survey_design=` / `make_pweight_design(arr)`, but qug_test permanently rejects ALL survey-aware kwargs (Phase 4.5 C0 deferral). Replaced with qug-specific warning text saying the aliases are deprecated AND that survey-aware QUG remains unsupported, pointing users to `did_had_pretest_workflow(..., survey_design=...)` for the survey-aware linearity family instead.

R10 P3 #2 (weights= parity tests on data-in wrappers): the previous round added survey= parity for joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(aggregate='overall') but left the weights= rebinding paths warning-only with no numerical parity lock. Added 3 new tests: test_legacy_alias_parity_weights (joint_pretrends_test + joint_homogeneity_test) and test_legacy_alias_parity_weights_overall (workflow). Each asserts `weights=np.ones(n)` ≡ `survey_design=SurveyDesign(weights="w")` (uniform 1.0 column) on identical numerical output, locking the rebinding contract.

561 tests pass (was 558 + 3 new R10 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
…scope

P3 #1 (Methodology): qualified the "exact R match" claim across the docstring / REGISTRY / CHANGELOG / R-generator comment / parity-test docstring with a cross-reference to the existing DID^X cell-weighting deviation (Python's first stage uses equal cell weights; R weights by N_gt). The two coincide on one-observation-per-(g,t) panels (the common cell-aggregated regime that the parity scenario uses). The multi-observation-per-cell deviation is independent of the by_path lift and was already documented in REGISTRY's "Note (Phase 3 DID^X covariate adjustment)".

P3 #2 (Maintainability): narrowed the Step 7b header comment in chaisemartin_dhaultfoeuille.py:1465-1473 to spell out that DID^X residualization applies to the per-group multi-horizon path (event_study_effects, overall_att, joiners/leavers, by_path, placebos, sup-t bands) but intentionally excludes per_period_effects, which stays on raw outcomes per the existing "Note (Phase 3 DID^X covariate adjustment)" contract.

Documentation-only fix; no runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
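The one-observation-per-cell coincidence is easy to verify directly. An illustrative check with toy data (column names are not the library's API): equal cell weights and N_gt cell-size weights agree exactly when every (g, t) cell holds a single row, and diverge as soon as one cell gains a second row.

```python
import numpy as np
import pandas as pd

# One observation per (g, t) cell: equal weights == N_gt weights.
df = pd.DataFrame({"g": [1, 1, 2, 2], "t": [0, 1, 0, 1],
                   "y": [1.0, 3.0, 2.0, 5.0]})
cells = df.groupby(["g", "t"])["y"].agg(["mean", "size"]).reset_index()
equal_w = cells["mean"].mean()                             # equal cell weights
ngt_w = np.average(cells["mean"], weights=cells["size"])   # N_gt weights
print(equal_w == ngt_w)  # True: one observation per cell

# Add a second row to cell (1, 0) and the two weightings diverge.
df2 = pd.concat([df, pd.DataFrame({"g": [1], "t": [0], "y": [9.0]})],
                ignore_index=True)
cells2 = df2.groupby(["g", "t"])["y"].agg(["mean", "size"]).reset_index()
equal_w2 = cells2["mean"].mean()
ngt_w2 = np.average(cells2["mean"], weights=cells2["size"])
print(equal_w2 == ngt_w2)  # False: multi-observation cell breaks the coincidence
```

This is exactly why the parity scenario's cell-aggregated fixture cannot distinguish the Python and R weighting conventions, while a raw multi-row panel could.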
igerber added a commit that referenced this pull request on Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust: The parity contract runs through nprobust numerical paths (DIDHAD's local-linear bandwidth + bias-correction calls), so a fresh regeneration could drift if CRAN serves a newer nprobust. Pin nprobust == 0.5.0 in both the R generator's stopifnot guard and the parity test's metadata assertion, alongside DIDHAD and YatchewTest.

P3 #2 — workflow docstring: did_had_pretest_workflow's top-level docstring still said "Eq 18 linear-trend detrending is a Phase 4 follow-up", which contradicts the shipped trends_lin behavior. Updated to describe the forwarding contract (trends_lin → joint_pretrends_test + joint_homogeneity_test, consumed-placebo skip path on minimal panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity-test horizon-shape assertions: Added an explicit "missing in Python" assertion in _zip_r_python: every R-mapped event time must be present in Python's event_times (catches future horizon-shape regressions where Python silently drops a horizon R requested). Added an effects+placebo row-count sanity check in test_yatchew_t_stat_parity (uses the previously-unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 26, 2026
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer: benchmarks/R/requirements.R installed whatever CRAN currently served via install.packages, while the generator and parity test hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust == 0.5.0. In a fresh R environment regenerating the goldens, the generator's stopifnot(packageVersion == "X.Y.Z") would immediately abort. Fix: add an `install_pinned_version()` helper using remotes::install_version with `upgrade = "never"`, run after the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent when the correct version is already installed. Bump procedure documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times: _zip_r_python() previously asserted only that R-mapped horizons were a SUBSET of Python's event_times (the missing-in-python check). Tightened to FULL SET EQUALITY: also reject horizons present in Python but absent from R's requested set ("extra_in_python"). This catches future event_study horizon-selection regressions in both directions — e.g. if our effects/placebo cap drifts and Python emits an extra row R didn't request.

Stats: 540 tests pass, 0 regressions. Still no estimator changes — all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request on Apr 27, 2026
Closes BR/DR foundation gap igerber#6 from project_br_dr_foundation.md: BusinessReport and DiagnosticReport now name what the headline scalar actually represents as an estimand, for each of the 16 result classes. Baker et al. (2025) Step 2 ("define the target parameter") was previously in BR's next_steps list but not done by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema change; experimental per the REPORTING.md stability policy):

{
  "name": str,                # stakeholder-facing name
  "definition": str,          # plain-English description
  "aggregation": str,         # machine-readable dispatch tag
  "headline_attribute": str,  # which raw result attribute
  "reference": str,           # REGISTRY.md citation pointer
}

Schema placement: top-level block (user preference, selected via AskUserQuestion in planning). Aggregation tags include "simple", "event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd", "staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x", "l_fd", "l_x_fd", "dose_overall", "pt_all_combined", "pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new ``diff_diff/_reporting_helpers.py::describe_target_parameter`` (its own module rather than business_report / diagnostic_report, to avoid circular-import risk — plan-review LOW igerber#7). All 17 result classes are covered (16 from _APPLICABILITY + BaconDecompositionResults); exhaustiveness is locked in by TestTargetParameterCoversEveryResultClass.

Fit-time config reads:
- ``EfficientDiDResults.pt_assumption`` branches the aggregation tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` + ``covariate_residuals`` + ``linear_trends_effects`` branch the dCDH estimand between DID_M / DID_l / DID^X_l / DID^{fd}_l / DID^{X,fd}_l.

Fixed-tag branches (per plan-review CRITICAL igerber#1 and igerber#2):
- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` / ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not change the ``overall_att`` scalar — it only populates additional horizon / group tables on the result object. Disambiguating those tables in prose is tracked under gap igerber#9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level assumption, not a library setting. Emits a single "dose_overall" tag with a disjunctive definition naming both regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:
- BR ``_render_summary``: emits "Target parameter: <name>." after the headline sentence (short name only; the full definition lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section with name, definition, aggregation tag, headline attribute, and reference.

Cross-surface parity: both BR and DR consume the same helper (the single source of truth), so their ``target_parameter`` blocks are byte-identical (verified by TestTargetParameterCrossSurfaceParity).

Tests: 37 new (TestTargetParameterPerEstimator + TestTargetParameterFitConfigReads + TestTargetParameterCoversEveryResultClass + TestTargetParameterCrossSurfaceParity + TestTargetParameterProseRendering). Existing BR/DR top-level-key contract tests updated to include ``target_parameter``. Total 319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section documenting the per-estimator dispatch and schema shape. business_report.rst and diagnostic_report.rst note the new field with a pointer to REPORTING.md. CHANGELOG entry under Unreleased.

Out of scope: REGISTRY.md per-estimator "Target parameter" sub-sections (plan-review additional note); the reporting-layer doc in REPORTING.md is the current source of truth. A follow-up docs PR can land those sub-sections if maintainers want the registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>