Add robust parallel trends testing with Wasserstein distance #3
Merged
- Add check_parallel_trends_robust() using Wasserstein (Earth Mover's) distance for distributional comparison of pre-treatment outcome changes
- Include permutation-based p-value for statistical inference
- Add Kolmogorov-Smirnov test as complementary distributional test
- Add equivalence_test_trends() using TOST procedure
- Compute normalized Wasserstein and variance ratio diagnostics
- Add 9 new tests for robust parallel trends functionality
- Update README with usage examples for all three approaches

The Wasserstein distance is more robust than simple slope comparisons because it captures differences in the full distribution shape, not just means, making it better suited for heterogeneous effects.
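A minimal sketch of the permutation-based Wasserstein check described above, using SciPy. The function name and inputs (arrays of pre-treatment outcome changes for treated and control units) are illustrative, not the package's actual check_parallel_trends_robust() signature:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_permutation_test(treated, control, n_perm=999, seed=0):
    """Permutation p-value for the Wasserstein distance between two
    samples of pre-treatment outcome changes (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(treated, control)
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # in-place permutation of group assignment
        if wasserstein_distance(pooled[:n_t], pooled[n_t:]) >= observed:
            exceed += 1
    # add-one smoothing keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)

treated = np.array([0.10, 0.20, 0.15, 0.30])
control = np.array([0.12, 0.18, 0.16, 0.28])
dist, p = wasserstein_permutation_test(treated, control, n_perm=199)
```

Under the permutation null the group labels are exchangeable, so a small p-value indicates the pre-trend distributions differ by more than relabeling noise.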
igerber pushed a commit that referenced this pull request on Jan 4, 2026
igerber added a commit that referenced this pull request on Apr 17, 2026
- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call it from fit() after the weight_type/replicate-weights checks. The dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as the effective sampling unit; when strata or PSU vary within group it silently spreads horizon-specific IF mass across observations in different PSUs, contaminating the stratified-PSU variance. Walk back the overstated claim at the old line 669 comment to match. Within-group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before np.unique/np.bincount so NaN / non-comparable group IDs on excluded subpopulation rows cannot crash SE factorization. psi stays full-length with zeros in excluded positions to preserve alignment with resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the within-group-constant strata/PSU requirement and the within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects varying PSU, rejects varying strata, accepts varying weights, and ignores zero-weight rows during the constancy check) plus TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id. All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
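As an illustration (hypothetical helper names, not the package's internals), the two fixes amount to a within-group constancy check plus a zero-weight filter ahead of np.unique/np.bincount:

```python
import numpy as np

def validate_group_constant(group, values, name="strata"):
    """Raise if `values` takes more than one value within any group
    (sketch of the within-group-constant strata/PSU requirement)."""
    for g in np.unique(group):
        uniq = np.unique(values[group == g])
        if len(uniq) > 1:
            raise ValueError(f"{name} must be constant within group {g!r}")

def group_if_mass(psi, weights, group):
    """Sum IF contributions by group, skipping zero-weight rows so their
    (possibly NaN) group IDs never reach np.unique/np.bincount."""
    keep = weights > 0
    groups, idx = np.unique(group[keep], return_inverse=True)
    return groups, np.bincount(idx, weights=psi[keep] * weights[keep])

group = np.array([0, 0, 1, 1, 1])
weights = np.array([1.0, 1.0, 1.0, 0.0, 1.0])  # fourth row excluded
psi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
validate_group_constant(group, np.array([1, 1, 2, 2, 2]))  # passes
groups, mass = group_if_mass(psi, weights, group)
```

Keeping psi full-length and only masking at aggregation time is what preserves row alignment with the strata/PSU vectors.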
igerber added a commit that referenced this pull request on Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: the dose_response extract phase tried to pull the event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, the repo root, and site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger (Rust large crosses the 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low - the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
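The marker-based table regeneration can be sketched like this (a hypothetical simplification; gen_findings_tables.py's real logic may differ):

```python
import re

def replace_table(markdown: str, table_id: str, new_table: str) -> str:
    """Rewrite only the span between the TABLE:start/TABLE:end markers,
    leaving hand-written narrative outside the markers untouched."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # use a callable replacement so backslashes in new_table stay literal
    return pattern.sub(lambda m: m.group(1) + new_table + "\n" + m.group(2),
                       markdown)

doc = ("Narrative stays.\n"
       "<!-- TABLE:start timings -->\nstale rows\n<!-- TABLE:end timings -->\n")
out = replace_table(doc, "timings", "| phase | Py | Rust |")
```

Keeping the markers in the committed file is what makes the rewrite idempotent across reruns.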
igerber added a commit that referenced this pull request on Apr 22, 2026
- P1 #1 — Stute tie-safe CvM: Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each observation's dose, so tied observations share the post-tie cumulative sum. My naive cumsum over sorted residuals produced partial within-tie sums that were row-order-dependent. Fix: after cumsum, replace within-tie-block values with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now accepts `d_sorted` and collapses tie blocks before squaring. Regression test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses` validates the end-to-end stute_test contract.
- P1 #2 — Exact-linear fit must fail-to-reject (not return NaN): For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix: dropped the var(eps) degeneracy branch from stute_test (the bootstrap naturally produces p=1 when eps=0 exactly). Added a scale-relative short-circuit (sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat the short-circuit by producing non-zero but tiny OLS residuals. Yatchew exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan, TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.
- P1 #3 — HADPretestReport.all_pass contract: Previously `all_pass = not (reject or reject or reject)` could be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every constituent p-value being finite AND no test rejecting. Updated docstring. Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
- P2 #1 — QUG negative-dose guard: HAD doses must be non-negative (paper Section 2). The raw qug_test API was silently folding d < 0 rows into the n_excluded_zero counter (the filter was `d > 0`). Fix: front-door ValueError on any d < 0. Regression: TestQUGTest.test_negative_dose_raises.
- P3 #1 — QUG np.partition: REGISTRY claims O(G) via np.partition, but the code was using np.sort. Switched qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤ partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight closed-form parity at atol=1e-12 still holds.
- P3 #2 — REGISTRY n_bootstrap default: REGISTRY said "Default n_bootstrap = 499" but the code ships 999. Updated REGISTRY to match the code and added a note about the n_bootstrap >= 99 front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
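A plain-NumPy sketch of the tie-block collapse and the np.partition change (helper name is illustrative, not the package's `_cvm_statistic` internals):

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """c_G evaluated at each observation's dose: tied doses share the
    cumulative sum taken at the *end* of their tie block, so the result
    is invariant to row order among duplicates."""
    c = np.cumsum(eps_sorted)
    # first index of each tie block (d_sorted is already sorted)
    _, first_idx = np.unique(d_sorted, return_index=True)
    last_idx = np.r_[first_idx[1:] - 1, len(d_sorted) - 1]
    block_sizes = np.diff(np.r_[first_idx, len(d_sorted)])
    # spread each block's final cumsum across the whole block
    return np.repeat(c[last_idx], block_sizes)

c = tie_safe_cumsum(np.array([1.0, 1.0, 2.0]), np.array([0.5, 0.5, 1.0]))

# second-smallest positive dose via partial sort: O(G), not O(G log G)
d_nz = np.array([3.0, 1.0, 5.0, 2.0])
partitioned = np.partition(d_nz, 1)
d1, d2 = partitioned[0], partitioned[1]  # D_(1), D_(2)
```

With kth=1, np.partition guarantees the element at index 1 is in its sorted position and the single element before it is no larger, which is exactly the D_(1) ≤ D_(2) property the QUG statistic needs.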
igerber added a commit that referenced this pull request on Apr 25, 2026
dCDH by_path + placebo: per-path backward-horizon placebos (Wave 2 #3)
igerber added a commit that referenced this pull request on Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

- P3 #1 — exact-pin nprobust: The parity contract runs through nprobust numerical paths (DIDHAD's local-linear bandwidth + bias-correction calls), so a fresh regeneration could drift if CRAN serves a newer nprobust. Pin nprobust == 0.5.0 in both the R generator's stopifnot guard and the parity test's metadata assertion, alongside DIDHAD and YatchewTest.
- P3 #2 — workflow docstring: did_had_pretest_workflow's top-level docstring still said "Eq 18 linear-trend detrending is a Phase 4 follow-up", which contradicts the shipped trends_lin behavior. Updated to describe the forwarding contract (trends_lin → joint_pretrends_test + joint_homogeneity_test, consumed-placebo skip path on minimal panels). Same fix on the StuteJointResult class docstring.
- P3 #3 — parity test horizon-shape assertions: Added an explicit "missing in Python" assertion in _zip_r_python: every R-mapped event time must be present in Python's event_times (catches future horizon-shape regressions where Python silently drops a horizon R requested). Added an effects+placebo row-count sanity check in test_yatchew_t_stat_parity (uses the previously-unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>