Add robust parallel trends testing with Wasserstein distance#3

Merged
igerber merged 1 commit into main from
claude/init-did-library-pvNmf
Jan 1, 2026
Conversation


@igerber igerber commented Jan 1, 2026

No description provided.

- Add check_parallel_trends_robust() using Wasserstein (Earth Mover's) distance
  for distributional comparison of pre-treatment outcome changes
- Include permutation-based p-value for statistical inference
- Add Kolmogorov-Smirnov test as complementary distributional test
- Add equivalence_test_trends() using TOST procedure
- Compute normalized Wasserstein and variance ratio diagnostics
- Add 9 new tests for robust parallel trends functionality
- Update README with usage examples for all three approaches

The Wasserstein distance is more robust than simple slope comparisons
because it captures differences in the full distribution shape, not
just means, making it better suited for heterogeneous effects.
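The distributional comparison described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name `wasserstein_permutation_test` and its signature are hypothetical; only `scipy.stats.wasserstein_distance` and the permutation scheme come from the commit description.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_permutation_test(treated_changes, control_changes,
                                 n_permutations=999, seed=0):
    """Compare the distributions of pre-treatment outcome changes via
    Wasserstein (Earth Mover's) distance, with a permutation p-value.
    Hypothetical sketch; not the library's check_parallel_trends_robust()."""
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(treated_changes, control_changes)
    pooled = np.concatenate([treated_changes, control_changes])
    n_t = len(treated_changes)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of treated/control
        if wasserstein_distance(pooled[:n_t], pooled[n_t:]) >= observed:
            exceed += 1
    # add-one correction keeps the p-value strictly positive
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value
```

A large observed distance with a small permutation p-value indicates the pre-treatment change distributions differ, not merely their means.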
@igerber igerber merged commit d86b108 into main Jan 1, 2026
@igerber igerber deleted the claude/init-did-library-pvNmf branch January 3, 2026 12:52
igerber pushed a commit that referenced this pull request Jan 4, 2026
Revised review reflects:
- #1, #4 verified as non-issues (correct by design)
- #3, #5, #6, #8, #13 addressed in commit e40d6b4
- Updated recommendation to approve and merge
- Remaining items are low-priority style suggestions for future PRs
igerber added a commit that referenced this pull request Apr 17, 2026
- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.
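The within-group constancy check described above can be sketched as follows. This is a hedged illustration, assuming a pandas DataFrame layout; the function name `validate_group_constant` and its column arguments are hypothetical stand-ins for `_validate_group_constant_strata_psu()`.

```python
import pandas as pd

def validate_group_constant(df, group_col, check_cols, weight_col):
    """Require strata/PSU to be constant within each group, ignoring
    zero-weight (excluded-subpopulation) rows, so the group-level IF
    expansion does not spread mass across PSUs. Hypothetical sketch."""
    active = df[df[weight_col] > 0]  # zero-weight rows are excluded
    for col in check_cols:
        n_unique = active.groupby(group_col)[col].nunique()
        bad = n_unique[n_unique > 1]
        if len(bad):
            raise ValueError(
                f"{col!r} varies within group(s) {list(bad.index)}; "
                f"within-group-constant {col} is required "
                "(within-group-varying weights remain supported)."
            )
```

Note that weights are deliberately not checked: only strata and PSU must be group-constant.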

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
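The marker-delimited rewrite can be sketched as below. The marker format comes from the commit message; the function name and exact regex are illustrative assumptions about how gen_findings_tables.py might do it.

```python
import re

def rewrite_table(md_text, table_id, new_table):
    """Replace everything between <!-- TABLE:start <id> --> and
    <!-- TABLE:end <id> -->, preserving the markers so the table can be
    re-derived from data on every rerun. Hypothetical sketch."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n).*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # lambda avoids backslash-escape surprises in the replacement text
    return pattern.sub(
        lambda m: m.group(1) + new_table + "\n" + m.group(2), md_text
    )
```

Because only the span between markers is regenerated, the hand-written narrative around each table is untouched.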

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
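The tie-block collapse via np.unique + np.repeat can be sketched as below, under the assumption that doses arrive pre-sorted with residuals in the matching order; the function name is hypothetical.

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """Collapse within-tie-block cumsum values to each block's last
    (post-tie) cumulative sum, so c_G(d) = sum 1{D <= d} * eps is
    order-invariant under duplicate doses. Sketch of the fix."""
    csum = np.cumsum(eps_sorted)
    # on a sorted array, np.unique yields each tie block's start index
    _, start, counts = np.unique(
        d_sorted, return_index=True, return_counts=True
    )
    last = start + counts - 1             # last element of each block
    return np.repeat(csum[last], counts)  # all rows share the block sum
```

Permuting residuals within a tie block now leaves the output unchanged, which is what the order-invariance regression test pins down.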

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.
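The scale-relative short-circuit amounts to a one-line predicate; a minimal sketch, with a hypothetical name:

```python
import numpy as np

def is_exact_linear(dy, eps, rtol=1e-24):
    """Treat the fit as exactly linear when residual energy is
    negligible relative to the outcome scale, so IEEE rounding noise
    (eps ~ 1e-16) cannot defeat the check. Hypothetical sketch."""
    return np.sum(eps**2) <= rtol * np.sum(dy**2)
```

Scaling the tolerance by sum(dy²) rather than comparing to an absolute epsilon is what makes the check robust across outcome magnitudes.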

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass = not (reject or reject or reject)` could be True
while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
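The corrected gating is simple enough to show inline; a sketch under the assumption that the report carries lists of p-values and reject flags (names hypothetical):

```python
import numpy as np

def compute_all_pass(p_values, rejects):
    """all_pass requires every constituent p-value to be finite AND no
    test to reject, so a NaN p-value can never coexist with
    all_pass=True. Hypothetical sketch of the contract."""
    return all(np.isfinite(p) for p in p_values) and not any(rejects)
```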

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.
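The np.partition retrieval of the two smallest nonzero doses can be sketched as (function name hypothetical):

```python
import numpy as np

def two_smallest_doses(d_nz):
    """O(G) retrieval of D_(1) and D_(2): after np.partition at kth=1,
    position 1 holds the second-smallest value and everything before it
    is <= that value, so position 0 is the smallest. Sketch."""
    part = np.partition(d_nz, 1)
    return part[0], part[1]
```

Unlike np.sort's O(G log G), np.partition only guarantees the kth element's final position plus the <=/>= split around it, which is all this use case needs.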

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
dCDH by_path + placebo: per-path backward-horizon placebos (Wave 2 #3)
igerber added a commit that referenced this pull request Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>