Skip to content

PR-C: PreTrendsPower R pretrends parity goldens at commit 122731d082#471

Merged
igerber merged 5 commits into
mainfrom
feature/pretrends-r-parity-pr-c
May 19, 2026
Merged

PR-C: PreTrendsPower R pretrends parity goldens at commit 122731d082#471
igerber merged 5 commits into
mainfrom
feature/pretrends-r-parity-pr-c

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 19, 2026

Summary

  • PR-C closes PR-B's deferred R-parity row for PreTrendsPower (Roth 2022 methodology audit). Generates JSON goldens at benchmarks/data/r_pretrends_golden.json from the committed benchmarks/R/generate_pretrends_golden.R against jonathandroth/pretrends commit 122731d082 (package version 0.1.0, R 4.5.2), and activates TestPretrendsParityR in tests/test_methodology_pretrends.py with 4 concrete tests covering 4 fixtures × NIS power + γ_p MDV at atol=1e-4.
  • End-to-end public-API parity on the irregular grid [-5,-3,-1] fixture: a synthetic MultiPeriodDiDResults constructed from the fixture's β̂ + full Σ + interaction_indices is passed through pt.fit(), locking the full fit() → _extract_pre_period_params → _get_violation_weights (γ-unit linear path) → _compute_mdv_nis chain against R's slope_for_power(). This is the regression test for PR-B Step 4's γ-unit linear-weight fix at the public-API boundary.
  • K=1 three-way cross-check on the single_pre_period_closed_form fixture (Roth Proposition 2): Python _compute_power_nis ≡ analytical truncated-normal 1 - Φ(z - γ/σ) + Φ(-z - γ/σ) at atol=1e-7 (effectively exact — same scalar path); both within atol=1e-4 of R. Strongest parity claim in the suite.
  • Tracker promotion: METHODOLOGY_REVIEW.md PreTrendsPower row flipped from **Complete** (R parity pending)**Complete**. REGISTRY.md Requirements checklist row checked. Roth (2022) paper review's R pretrends package version pin (provisional) Gaps bullet struck. TODO.md PR-C row deleted. CHANGELOG.md [Unreleased] entry added.

Tolerance rationale (atol=1e-4 on both tiers). R hardcodes thresholdTstat.Pretest = 1.96 while Python uses scipy.stats.norm.ppf(0.975) = 1.959963984540054 (dz ≈ 3.6e-5); on top of that, mvtnorm::pmvnorm (R) and scipy.stats.multivariate_normal.cdf (Python) use Genz-Bretz randomized-lattice rules with different absolute-error defaults (abseps ≈ 1e-3 vs 1e-5). The empirical NIS-power gap is bounded by ~5e-5 on the K=4 anticipation fixture (smaller for K∈{1,3}). For the inverse path (γ_p), R slope_for_power uses uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4) versus Python brentq(xtol=2e-12); the inverse-solver tolerance gap dominates. atol=1e-4 is the realistic ceiling on both tiers without tightening either solver.

Methodology references (required if estimator / math changes)

  • Method name(s): PreTrendsPower (Roth 2022 NIS box probability; γ_p MDV).
  • Paper / source link(s):
  • Any intentional deviations from the source (and why): None new in PR-C. PR-B-shipped deviations (NIS as default pretest_form='nis', full-Σ_22 routing on CS/SA event-study adapters, γ-unit linear weights, power_at(custom) persistence) are unchanged. The R-vs-Python tolerance differences documented above are not deviations but solver/algorithm-floor consequences (R hardcoded thresholdTstat, R uniroot default tol, Genz-Bretz randomization).

Validation

  • Tests added/updated: tests/test_methodology_pretrends.pyTestPretrendsParityR activated with 4 concrete tests + _extract_python_params helper (np.atleast_1d defensive parsing).
    • test_nis_power_matches_r_pretrends (16 comparisons: 4 fixtures × 4 γ values).
    • test_mdv_gamma_p_matches_r_slope_for_power (8 comparisons: 4 fixtures × 2 target_power values).
    • test_irregular_grid_gamma_unit_end_to_end_matches_r (end-to-end fit() through MultiPeriodDiDResults with interaction_indices).
    • test_k1_matches_r_and_closed_form (3-way: Python ≡ closed form at atol=1e-7, both within 1e-4 of R).
  • Module + class docstrings updated to past tense referencing commit 122731d082.
  • Full pretrends suite: 131 tests pass (tests/test_pretrends.py + tests/test_pretrends_event_study.py + tests/test_methodology_pretrends.py).
  • Local AI review: ran /ai-review-local --backend codex 2 rounds. R1 surfaced 3 P2/P3 items (paper-review stale prose, atol contract mismatch in script header, K=1 schema unboxing); all addressed in commit a75d4ed8. R2 verdict: ✅ no findings at any severity.
  • Backtest / simulation / notebook evidence: N/A (parity test, not a behavior change).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes.

Generated with Claude Code

igerber and others added 3 commits May 19, 2026 13:10
Closes the deferred R-parity row from PR-B (PreTrendsPower implementation
audit, Roth 2022). Generates JSON goldens at
`benchmarks/data/r_pretrends_golden.json` from the committed R script
against `jonathandroth/pretrends` commit `122731d082` (package version
0.1.0, R 4.5.2), and activates `TestPretrendsParityR` in
`tests/test_methodology_pretrends.py`.

Four fixtures (regular K=3, irregular K=3 `[-5,-3,-1]`, anticipation-shifted
K=4, K=1 closed form) × NIS power + γ_p MDV at `atol=1e-4`. K=1 also
asserts three-way cross-check against Roth Proposition 2's analytical
truncated-normal expression `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7`
(Python/closed-form effectively exact; both within `atol=1e-4` of R).

End-to-end irregular-grid `fit()` parity test exercises the full
`fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis`
chain through the public API, locking PR-B Step 4's γ-unit linear-weight
fix.

Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python
uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054`; R `slope_for_power`
uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` vs Python
`brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and
`mvtnorm::pmvnorm` vs `scipy.stats.multivariate_normal.cdf` Genz-Bretz
randomized-lattice differences bound the K=4 NIS power gap at ~5e-5.

Cross-surface tracker promotion:
- `METHODOLOGY_REVIEW.md` PreTrendsPower row: `**Complete** (R parity
  pending)` → `**Complete**`; Last Review `2026-05-18` → `2026-05-19`.
- REGISTRY.md Requirements checklist: `[x] R \`pretrends\` parity at
  commit \`122731d082\` (PR-C, 2026-05-19)`.
- `roth-2022-review.md` "R `pretrends` package version pin (provisional)"
  Gaps bullet struck.
- `TODO.md` PR-C row deleted.
- CHANGELOG.md `[Unreleased]` Added entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R1 codex local review verdict: no P0/P1. Three actionable items addressed:

- P2 (Methodology): `docs/methodology/papers/roth-2022-review.md` Gaps
  bullets at L282 (Joint Wald) and L288 (Heteroskedastic Sigma) contained
  stale provisional claims — "specific commit not pinned" and "current
  `diff_diff/pretrends.py` does Wald-test power/MDV only". Both are
  superseded by PR-B (NIS as default) and PR-C (pinned commit
  `122731d082`). Updated to cite the pinned commit and the current
  NIS+Wald state.

- P2 (Documentation/Tests): the R script header and JSON `meta.description`
  advertised NIS power parity at `atol=1e-5`, but the active tests use
  `atol=1e-4` (matching empirical Genz-Bretz randomization gap of ~5e-5
  on K=4 anticipation). Aligned the script comments + JSON description
  to `atol=1e-4` and added the Genz-Bretz rationale.

- P3 (Maintainability): the K=1 fixture serialized `pre_periods` and
  `post_periods` as scalars due to jsonlite `auto_unbox=TRUE` (e.g.,
  `pre_periods: -1`). Wrapped singleton vector fields in `I()` so they
  round-trip as length-1 arrays uniformly across all 4 fixtures. The
  `np.atleast_1d` defensive compensation in
  `_extract_python_params` is harmless and retained as defense-in-depth.

Re-ran `Rscript generate_pretrends_golden.R` to regenerate the JSON
(numerics unchanged — only structural changes from `I()` wrapping).
All 4 `TestPretrendsParityR` tests + 131 pretrends suite tests pass;
black + ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rebase onto main after release 3.4.0 textually applied the PR-C
CHANGELOG hunk under the [3.4.0] section because the surrounding context
(`### Added` heading + agent-discoverability bullet) had been copied from
the pre-release [Unreleased] into [3.4.0] by the release prep. PR-C is a
post-3.4.0 change, so its CHANGELOG entry belongs in [Unreleased].

Moves the single bullet; release [3.4.0] content unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Methods affected: PreTrendsPower only, and this PR does not change diff_diff/pretrends.py; it adds pinned R-parity artifacts, tests, and tracker/doc updates.
  • I did not find an undocumented methodology deviation, missing assumption check, or variance/SE regression relative to the registry and Roth (2022).
  • The new parity suite correctly targets the documented NIS-side R surface (pretrends(), slope_for_power()) and the atol=1e-4 rationale is coherent with the existing implementation.
  • P3: the new anticipation_shifted fixture does not actually exercise the anticipation-window extraction/filter path it claims to cover.
  • P3: the R generator hardcodes commit metadata into the JSON without enforcing that the loaded pretrends install is that exact pinned revision.

Methodology

Code Quality

  • No findings.

Performance

  • No findings. The PR is benchmark-data/test/documentation heavy and does not change runtime estimator code.

Maintainability

  • Severity P3. Impact: benchmarks/R/generate_pretrends_golden.R (L61-L63, L219-L221) writes pretrends_commit = "122731d082" into the JSON, but only checks packageVersion("pretrends") >= "0.1.0". A future rerun could silently regenerate goldens from a different installed revision while still stamping them as commit 122731d082. Concrete fix: fail closed on provenance before writing the JSON, e.g. exact packageVersion("pretrends") == "0.1.0" plus packageDescription("pretrends")$RemoteSha == PRETRENDS_COMMIT when available.

Tech Debt

  • No findings. Removing the resolved PreTrendsPower TODO row is consistent with TODO.md (L97-L100).

Security

  • No findings. I did not see secrets, credentials, or PII in the committed JSON/script/docs.

Documentation/Tests

  • Severity P3. Impact: the new anticipation_shifted fixture is described as validating anticipation-window filtering, but the fixture itself omits the t=-1 anticipation window and the parity assertions consume prefiltered Sigma_22/weights directly, so this PR does not add an R-parity lock on the actual _extract_pre_period_params(..., anticipation=1) path. See benchmarks/R/generate_pretrends_golden.R (L189-L211) and tests/test_methodology_pretrends.py (L1160-L1283). Concrete fix: either rename/comment the fixture as a shifted-grid parity case only, or add a fit()-level parity test with an explicit -1 anticipation-window coefficient and an anticipation=1 results object.
  • Diff-based review only; I did not re-run pytest or the R generator in this sandbox.

…ation

CI R1 codex review verdict: ✅ no P0/P1. Two P3 findings addressed:

- P3 (Maintainability): the R generator hardcoded
  `pretrends_commit = "122731d082"` into the JSON but only verified
  `packageVersion("pretrends") >= "0.1.0"`. A future rerun could
  silently regenerate goldens from a drifted revision while still
  stamping the artifact with the original commit. Fix: replace the
  loose version gate with an exact `packageVersion == "0.1.0"` check
  plus a `startsWith(packageDescription("pretrends")$RemoteSha,
  PRETRENDS_COMMIT)` provenance assertion that fails closed with a
  reinstall instruction if the installed revision drifts. Verified
  via positive (RemoteSha = `122731d082a5990e274f57fd9af0968e44977e7a`)
  and negative (synthetic `deadbeef` prefix) checks.

- P3 (Documentation/Tests): the `anticipation_shifted` fixture's
  comment described it as validating anticipation-window filtering,
  but the fixture omits the `t=-1` anticipation window and the parity
  assertions consume prefiltered `Sigma_22` / weights directly — the
  CS/SA-level `_extract_pre_period_params` anticipation filter
  (`if t < _pre_cutoff` in `pretrends.py`) is NOT R-parity-locked by
  this fixture. Fix: rename the comment / R `cat()` print / JSON
  meta.description to "K=4 shifted-grid case", and document the
  non-coverage explicitly in the file-header comment with a forward
  reference to the existing PR-B MC-based and full-VCV coverage in
  `TestPretrendsPropositions` / `TestPretrendsCovarianceSource`,
  plus a deferred follow-up for a CS/SA-level
  `anticipation=1 + R-parity` test (would need a synthetic
  `CallawaySantAnnaResults` with a t=-1 entry that gets filtered
  before reaching `_compute_power_nis`). Test class docstring
  tolerance-rationale prose flipped "K=4 anticipation fixture" →
  "K=4 shifted-grid fixture" to match.

The fixture's JSON key (`anticipation_shifted`) is unchanged to
preserve the test-side reference; only the prose contract is
clarified.

All 4 parity tests still pass; black + ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 5b90559d19429e729d491186a6921371ff4aeafb


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • Severity P3. Impact: the generator now explicitly says that a true CS/SA anticipation=1 R-parity test is still deferred benchmarks/R/generate_pretrends_golden.R:L45-L58, but the PR also removes the only PreTrendsPower TODO row and leaves no replacement tracker entry TODO.md:L97. That makes the remaining coverage gap easy to lose even though the script itself calls it out as follow-up work. Concrete fix: add a low-priority TODO.md row for a synthetic CS/SA anticipation=1 parity fixture, or remove the “Deferred to a follow-up” language if the team does not intend to track it.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the generated metadata string says Python uses qnorm(0.975), but qnorm is the R function name; elsewhere this PR correctly refers to scipy.stats.norm.ppf(0.975) benchmarks/R/generate_pretrends_golden.R:L253-L260, benchmarks/data/r_pretrends_golden.json:L2-L7. This does not affect the golden values or assertions, but it muddies the audit trail in the committed parity artifact. Concrete fix: change the description string to scipy.stats.norm.ppf(0.975) or neutral wording such as “the exact 0.975 normal quantile.”

CI R2 codex review verdict: ✅ no P0/P1 (both prior P3s addressed).
Two new P3 findings (both consequences of R1 fixes):

- P3 (Documentation/Tests): the JSON `meta.description` string says
  Python uses `qnorm(0.975)`, but `qnorm` is the R function name; the
  rest of the PR correctly refers to `scipy.stats.norm.ppf(0.975)`.
  Fix: change the R script's `description = paste(...)` block from
  "qnorm(0.975)" to "scipy.stats.norm.ppf(0.975)" so the committed
  parity artifact's audit trail matches the language used in REGISTRY,
  the file-header comment, and the test class docstring.

- P3 (Tech Debt): the R generator's file-header comment now explicitly
  defers a CS/SA `anticipation=1` R-parity test to a follow-up (R1
  P3 #2 fix), but the PR removed the only PreTrendsPower TODO row and
  did not add a replacement tracker. Fix: add a low-priority TODO.md
  row describing the deferred work: build a synthetic
  `CallawaySantAnnaResults` (or `SunAbrahamResults`) with
  `anticipation=1` and a t=-1 entry that gets filtered by
  `_extract_pre_period_params` before reaching `_compute_power_nis`,
  then assert the resulting γ_p matches R's `slope_for_power()` on
  the K=4 shifted-grid fixture.

All 4 parity tests still pass; the regenerated JSON only differs in
the description string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 738ae8fcce029f851788eada26b76c97b5e88d37


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Method in scope is PreTrendsPower only. This PR adds pinned R-parity artifacts, tests, and methodology tracker/doc updates; it does not modify estimator math in diff_diff/pretrends.py.
  • The new parity surface is aligned with the current implementation and registry contract for γ-unit linear weights, pre-period covariance extraction, NIS power, and NIS MDV in diff_diff/pretrends.py:L876-L955, diff_diff/pretrends.py:L996-L1170, and diff_diff/pretrends.py:L1300-L1510.
  • The prior re-review P3s are addressed: the generator/artifact now use scipy.stats.norm.ppf(0.975) wording in benchmarks/R/generate_pretrends_golden.R:L253-L260 and benchmarks/data/r_pretrends_golden.json:L2-L7, and the remaining CS/SA anticipation=1 parity gap is now explicitly tracked in TODO.md:L97-L98.
  • The new parity tests cover four useful surfaces at the documented tolerance: direct NIS power, inverse gamma_p MDV, one public-API irregular-grid fit() path, and a K=1 closed-form cross-check in tests/test_methodology_pretrends.py:L1160-L1332.
  • I did not rerun pytest or the R generator in this sandbox; project runtime deps are incomplete here (scipy is missing) and no R package environment is available, so this is a static re-review.

Methodology

  • No findings. The PR’s parity claims and the promotion to “Complete” are consistent with the pinned-reference prose in docs/methodology/REGISTRY.md:L2842-L2854, docs/methodology/papers/roth-2022-review.md:L280-L291, METHODOLOGY_REVIEW.md:L1047-L1071, the generator’s provenance-checked extraction of pretrends() / slope_for_power() outputs in benchmarks/R/generate_pretrends_golden.R:L72-L186, and the current implementation paths in diff_diff/pretrends.py:L876-L955, diff_diff/pretrends.py:L996-L1170, and diff_diff/pretrends.py:L1300-L1510.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The fail-closed provenance check and explicit array-preservation for K=1 fixtures improve auditability and schema stability in benchmarks/R/generate_pretrends_golden.R:L74-L90 and benchmarks/R/generate_pretrends_golden.R:L167-L186.

Tech Debt

  • Severity P3 (informational, tracked). Impact: the new goldens still do not directly lock the CS/SA anticipation=1 pre-period filter against R because the R package has no anticipation parameter, but this limitation is now explicitly documented and tracked, so it is not a blocker for this PR. Concrete fix: no action required in this PR; follow the tracked item by adding the synthetic CallawaySantAnnaResults/SunAbrahamResults parity fixture described in TODO.md:L97-L98. References: benchmarks/R/generate_pretrends_golden.R:L45-L58, TODO.md:L97-L98.

Security

  • No findings.

Documentation/Tests

  • No findings. The prior metadata wording issue is fixed in benchmarks/R/generate_pretrends_golden.R:L253-L260 and benchmarks/data/r_pretrends_golden.json:L2-L7, and the parity suite now asserts the documented four-surface contract in tests/test_methodology_pretrends.py:L1160-L1332.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026
@igerber igerber merged commit 703ad76 into main May 19, 2026
31 of 32 checks passed
@igerber igerber deleted the feature/pretrends-r-parity-pr-c branch May 19, 2026 19:45
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request May 22, 2026
Release notes consolidate 8 PRs since 3.4.0 (2026-05-19):

Public-surface variance lifts:
- SpilloverDiD survey_design on HC1/CR1 via Binder TSL (Wave E.1, igerber#468)
- SpilloverDiD vcov_type=conley + survey_design via stratified-Conley
  on PSU totals (Wave E.2, igerber#474) + lag_cutoff>0 follow-up (igerber#477)
- SunAbraham vcov_type ∈ {classical, hc1, hc2, hc2_bm} (Phase 1b 1/8, igerber#472)
- WLS-CR2 Bell-McCaffrey gates lifted via clubSandwich port (igerber#475)

Methodology-review-tracker promotions (mostly docs/tests):
- PreTrendsPower R pretrends parity goldens (PR-C, igerber#471)
- HAD methodology-review-tracker promotion (igerber#473)
- ContinuousDiD methodology-review-tracker promotion (igerber#476)

All changes additive; bit-equal defaults preserved across the affected
estimators. No new estimators (patch-level per semver convention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant