PR-C: PreTrendsPower R `pretrends` parity goldens at commit `122731d082` by igerber · Pull Request #471 · igerber/diff-diff

igerber · 2026-05-19T17:12:46Z

Summary

PR-C closes PR-B's deferred R-parity row for PreTrendsPower (Roth 2022 methodology audit). Generates JSON goldens at benchmarks/data/r_pretrends_golden.json from the committed benchmarks/R/generate_pretrends_golden.R against jonathandroth/pretrends commit 122731d082 (package version 0.1.0, R 4.5.2), and activates TestPretrendsParityR in tests/test_methodology_pretrends.py with 4 concrete tests covering 4 fixtures × NIS power + γ_p MDV at atol=1e-4.
End-to-end public-API parity on the irregular grid [-5,-3,-1] fixture: a synthetic MultiPeriodDiDResults constructed from the fixture's β̂ + full Σ + interaction_indices is passed through pt.fit(), locking the full fit() → _extract_pre_period_params → _get_violation_weights (γ-unit linear path) → _compute_mdv_nis chain against R's slope_for_power(). This is the regression test for PR-B Step 4's γ-unit linear-weight fix at the public-API boundary.
K=1 three-way cross-check on the single_pre_period_closed_form fixture (Roth Proposition 2): Python _compute_power_nis ≡ analytical truncated-normal 1 - Φ(z - γ/σ) + Φ(-z - γ/σ) at atol=1e-7 (effectively exact — same scalar path); both within atol=1e-4 of R. Strongest parity claim in the suite.
Tracker promotion: METHODOLOGY_REVIEW.md PreTrendsPower row flipped from **Complete** (R parity pending) → **Complete**. REGISTRY.md Requirements checklist row checked. Roth (2022) paper review's R pretrends package version pin (provisional) Gaps bullet struck. TODO.md PR-C row deleted. CHANGELOG.md [Unreleased] entry added.

Tolerance rationale (atol=1e-4 on both tiers). R hardcodes thresholdTstat.Pretest = 1.96 while Python uses scipy.stats.norm.ppf(0.975) = 1.959963984540054 (dz ≈ 3.6e-5); on top of that, mvtnorm::pmvnorm (R) and scipy.stats.multivariate_normal.cdf (Python) use Genz-Bretz randomized-lattice rules with different absolute-error defaults (abseps ≈ 1e-3 vs 1e-5). The empirical NIS-power gap is bounded by ~5e-5 on the K=4 anticipation fixture (smaller for K∈{1,3}). For the inverse path (γ_p), R slope_for_power uses uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4) versus Python brentq(xtol=2e-12); the inverse-solver tolerance gap dominates. atol=1e-4 is the realistic ceiling on both tiers without tightening either solver.

Methodology references (required if estimator / math changes)

Method name(s): PreTrendsPower (Roth 2022 NIS box probability; γ_p MDV).
Paper / source link(s):
- Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. American Economic Review: Insights, 4(3), 305-322. https://doi.org/10.1257/aeri.20210236
- R pretrends package: https://github.com/jonathandroth/pretrends at commit 122731d082 (package version 0.1.0).
- Paper review on file: docs/methodology/papers/roth-2022-review.md.
- REGISTRY entry: docs/methodology/REGISTRY.md ## PreTrendsPower.
Any intentional deviations from the source (and why): None new in PR-C. PR-B-shipped deviations (NIS as default pretest_form='nis', full-Σ_22 routing on CS/SA event-study adapters, γ-unit linear weights, power_at(custom) persistence) are unchanged. The R-vs-Python tolerance differences documented above are not deviations but solver/algorithm-floor consequences (R hardcoded thresholdTstat, R uniroot default tol, Genz-Bretz randomization).

Validation

Tests added/updated: tests/test_methodology_pretrends.py — TestPretrendsParityR activated with 4 concrete tests + _extract_python_params helper (np.atleast_1d defensive parsing).
- test_nis_power_matches_r_pretrends (16 comparisons: 4 fixtures × 4 γ values).
- test_mdv_gamma_p_matches_r_slope_for_power (8 comparisons: 4 fixtures × 2 target_power values).
- test_irregular_grid_gamma_unit_end_to_end_matches_r (end-to-end fit() through MultiPeriodDiDResults with interaction_indices).
- test_k1_matches_r_and_closed_form (3-way: Python ≡ closed form at atol=1e-7, both within 1e-4 of R).
Module + class docstrings updated to past tense referencing commit 122731d082.
Full pretrends suite: 131 tests pass (tests/test_pretrends.py + tests/test_pretrends_event_study.py + tests/test_methodology_pretrends.py).
Local AI review: ran /ai-review-local --backend codex 2 rounds. R1 surfaced 3 P2/P3 items (paper-review stale prose, atol contract mismatch in script header, K=1 schema unboxing); all addressed in commit a75d4ed8. R2 verdict: ✅ no findings at any severity.
Backtest / simulation / notebook evidence: N/A (parity test, not a behavior change).

Security / privacy

Confirm no secrets/PII in this PR: Yes.

Generated with Claude Code

Closes the deferred R-parity row from PR-B (PreTrendsPower implementation audit, Roth 2022). Generates JSON goldens at `benchmarks/data/r_pretrends_golden.json` from the committed R script against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0, R 4.5.2), and activates `TestPretrendsParityR` in `tests/test_methodology_pretrends.py`. Four fixtures (regular K=3, irregular K=3 `[-5,-3,-1]`, anticipation-shifted K=4, K=1 closed form) × NIS power + γ_p MDV at `atol=1e-4`. K=1 also asserts three-way cross-check against Roth Proposition 2's analytical truncated-normal expression `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7` (Python/closed-form effectively exact; both within `atol=1e-4` of R). End-to-end irregular-grid `fit()` parity test exercises the full `fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis` chain through the public API, locking PR-B Step 4's γ-unit linear-weight fix. Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054`; R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` vs Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and `mvtnorm::pmvnorm` vs `scipy.stats.multivariate_normal.cdf` Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5. Cross-surface tracker promotion: - `METHODOLOGY_REVIEW.md` PreTrendsPower row: `**Complete** (R parity pending)` → `**Complete**`; Last Review `2026-05-18` → `2026-05-19`. - REGISTRY.md Requirements checklist: `[x] R \`pretrends\` parity at commit \`122731d082\` (PR-C, 2026-05-19)`. - `roth-2022-review.md` "R `pretrends` package version pin (provisional)" Gaps bullet struck. - `TODO.md` PR-C row deleted. - CHANGELOG.md `[Unreleased]` Added entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R1 codex local review verdict: no P0/P1. Three actionable items addressed: - P2 (Methodology): `docs/methodology/papers/roth-2022-review.md` Gaps bullets at L282 (Joint Wald) and L288 (Heteroskedastic Sigma) contained stale provisional claims — "specific commit not pinned" and "current `diff_diff/pretrends.py` does Wald-test power/MDV only". Both are superseded by PR-B (NIS as default) and PR-C (pinned commit `122731d082`). Updated to cite the pinned commit and the current NIS+Wald state. - P2 (Documentation/Tests): the R script header and JSON `meta.description` advertised NIS power parity at `atol=1e-5`, but the active tests use `atol=1e-4` (matching empirical Genz-Bretz randomization gap of ~5e-5 on K=4 anticipation). Aligned the script comments + JSON description to `atol=1e-4` and added the Genz-Bretz rationale. - P3 (Maintainability): the K=1 fixture serialized `pre_periods` and `post_periods` as scalars due to jsonlite `auto_unbox=TRUE` (e.g., `pre_periods: -1`). Wrapped singleton vector fields in `I()` so they round-trip as length-1 arrays uniformly across all 4 fixtures. The `np.atleast_1d` defensive compensation in `_extract_python_params` is harmless and retained as defense-in-depth. Re-ran `Rscript generate_pretrends_golden.R` to regenerate the JSON (numerics unchanged — only structural changes from `I()` wrapping). All 4 `TestPretrendsParityR` tests + 131 pretrends suite tests pass; black + ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The rebase onto main after release 3.4.0 textually applied the PR-C CHANGELOG hunk under the [3.4.0] section because the surrounding context (`### Added` heading + agent-discoverability bullet) had been copied from the pre-release [Unreleased] into [3.4.0] by the release prep. PR-C is a post-3.4.0 change, so its CHANGELOG entry belongs in [Unreleased]. Moves the single bullet; release [3.4.0] content unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T17:18:16Z

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methods affected: PreTrendsPower only, and this PR does not change diff_diff/pretrends.py; it adds pinned R-parity artifacts, tests, and tracker/doc updates.
I did not find an undocumented methodology deviation, missing assumption check, or variance/SE regression relative to the registry and Roth (2022).
The new parity suite correctly targets the documented NIS-side R surface (pretrends(), slope_for_power()) and the atol=1e-4 rationale is coherent with the existing implementation.
P3: the new anticipation_shifted fixture does not actually exercise the anticipation-window extraction/filter path it claims to cover.
P3: the R generator hardcodes commit metadata into the JSON without enforcing that the loaded pretrends install is that exact pinned revision.

Methodology

No findings. The updated methodology docs remain consistent with the documented PreTrendsPower contract in docs/methodology/REGISTRY.md (L2842-L2854), docs/methodology/papers/roth-2022-review.md (L282-L288), and METHODOLOGY_REVIEW.md (L1047-L1071).

Code Quality

No findings.

Performance

No findings. The PR is benchmark-data/test/documentation heavy and does not change runtime estimator code.

Maintainability

Severity P3. Impact: benchmarks/R/generate_pretrends_golden.R (L61-L63, L219-L221) writes pretrends_commit = "122731d082" into the JSON, but only checks packageVersion("pretrends") >= "0.1.0". A future rerun could silently regenerate goldens from a different installed revision while still stamping them as commit 122731d082. Concrete fix: fail closed on provenance before writing the JSON, e.g. exact packageVersion("pretrends") == "0.1.0" plus packageDescription("pretrends")$RemoteSha == PRETRENDS_COMMIT when available.

Tech Debt

No findings. Removing the resolved PreTrendsPower TODO row is consistent with TODO.md (L97-L100).

Security

No findings. I did not see secrets, credentials, or PII in the committed JSON/script/docs.

Documentation/Tests

Severity P3. Impact: the new anticipation_shifted fixture is described as validating anticipation-window filtering, but the fixture itself omits the t=-1 anticipation window and the parity assertions consume prefiltered Sigma_22/weights directly, so this PR does not add an R-parity lock on the actual _extract_pre_period_params(..., anticipation=1) path. See benchmarks/R/generate_pretrends_golden.R (L189-L211) and tests/test_methodology_pretrends.py (L1160-L1283). Concrete fix: either rename/comment the fixture as a shifted-grid parity case only, or add a fit()-level parity test with an explicit -1 anticipation-window coefficient and an anticipation=1 results object.
Diff-based review only; I did not re-run pytest or the R generator in this sandbox.

…ation CI R1 codex review verdict: ✅ no P0/P1. Two P3 findings addressed: - P3 (Maintainability): the R generator hardcoded `pretrends_commit = "122731d082"` into the JSON but only verified `packageVersion("pretrends") >= "0.1.0"`. A future rerun could silently regenerate goldens from a drifted revision while still stamping the artifact with the original commit. Fix: replace the loose version gate with an exact `packageVersion == "0.1.0"` check plus a `startsWith(packageDescription("pretrends")$RemoteSha, PRETRENDS_COMMIT)` provenance assertion that fails closed with a reinstall instruction if the installed revision drifts. Verified via positive (RemoteSha = `122731d082a5990e274f57fd9af0968e44977e7a`) and negative (synthetic `deadbeef` prefix) checks. - P3 (Documentation/Tests): the `anticipation_shifted` fixture's comment described it as validating anticipation-window filtering, but the fixture omits the `t=-1` anticipation window and the parity assertions consume prefiltered `Sigma_22` / weights directly — the CS/SA-level `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py`) is NOT R-parity-locked by this fixture. Fix: rename the comment / R `cat()` print / JSON meta.description to "K=4 shifted-grid case", and document the non-coverage explicitly in the file-header comment with a forward reference to the existing PR-B MC-based and full-VCV coverage in `TestPretrendsPropositions` / `TestPretrendsCovarianceSource`, plus a deferred follow-up for a CS/SA-level `anticipation=1 + R-parity` test (would need a synthetic `CallawaySantAnnaResults` with a t=-1 entry that gets filtered before reaching `_compute_power_nis`). Test class docstring tolerance-rationale prose flipped "K=4 anticipation fixture" → "K=4 shifted-grid fixture" to match. The fixture's JSON key (`anticipation_shifted`) is unchanged to preserve the test-side reference; only the prose contract is clarified. All 4 parity tests still pass; black + ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T17:38:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5b90559d19429e729d491186a6921371ff4aeafb

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Method in scope: PreTrendsPower only. This PR does not change diff_diff/pretrends.py; it adds pinned R-parity artifacts, tests, and methodology tracker/doc updates.
The new claims are consistent with the Methodology Registry and current implementation of NIS power, γ-unit linear weights, and pre-period covariance extraction in docs/methodology/REGISTRY.md:L2842-L2854, diff_diff/pretrends.py:L876-L955, diff_diff/pretrends.py:L996-L1216, and diff_diff/pretrends.py:L1300-L1515.
The two prior AI-review P3s are addressed: the R generator now fails closed on provenance benchmarks/R/generate_pretrends_golden.R:L72-L90, and anticipation_shifted is no longer described as an actual anticipation-filter parity test benchmarks/R/generate_pretrends_golden.R:L45-L58.
The parity suite now locks four useful surfaces at the documented tolerance: direct NIS power, inverse γ_p MDV, one public-API irregular-grid fit() path, and the K=1 closed form tests/test_methodology_pretrends.py:L1097-L1328.
P3 only: one deferred follow-up is documented in the generator comments but no longer tracked in TODO.md.
I did not re-run pytest or the R generator in this sandbox; the bare environment here lacks the project’s Python deps and R package setup.

Methodology

No findings. The PR’s registry/review updates remain aligned with the documented NIS-based R reference surface and do not introduce an undocumented change to estimator math, weighting, variance/SE, assumptions, or defaults docs/methodology/REGISTRY.md:L2842-L2854, docs/methodology/papers/roth-2022-review.md:L280-L291, METHODOLOGY_REVIEW.md:L1047-L1071.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings beyond the tech-debt item below. The provenance guard added in the R generator is the main maintainability improvement in this re-review benchmarks/R/generate_pretrends_golden.R:L72-L90.

Tech Debt

Severity P3. Impact: the generator now explicitly says that a true CS/SA anticipation=1 R-parity test is still deferred benchmarks/R/generate_pretrends_golden.R:L45-L58, but the PR also removes the only PreTrendsPower TODO row and leaves no replacement tracker entry TODO.md:L97. That makes the remaining coverage gap easy to lose even though the script itself calls it out as follow-up work. Concrete fix: add a low-priority TODO.md row for a synthetic CS/SA anticipation=1 parity fixture, or remove the “Deferred to a follow-up” language if the team does not intend to track it.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the generated metadata string says Python uses qnorm(0.975), but qnorm is the R function name; elsewhere this PR correctly refers to scipy.stats.norm.ppf(0.975) benchmarks/R/generate_pretrends_golden.R:L253-L260, benchmarks/data/r_pretrends_golden.json:L2-L7. This does not affect the golden values or assertions, but it muddies the audit trail in the committed parity artifact. Concrete fix: change the description string to scipy.stats.norm.ppf(0.975) or neutral wording such as “the exact 0.975 normal quantile.”

CI R2 codex review verdict: ✅ no P0/P1 (both prior P3s addressed). Two new P3 findings (both consequences of R1 fixes): - P3 (Documentation/Tests): the JSON `meta.description` string says Python uses `qnorm(0.975)`, but `qnorm` is the R function name; the rest of the PR correctly refers to `scipy.stats.norm.ppf(0.975)`. Fix: change the R script's `description = paste(...)` block from "qnorm(0.975)" to "scipy.stats.norm.ppf(0.975)" so the committed parity artifact's audit trail matches the language used in REGISTRY, the file-header comment, and the test class docstring. - P3 (Tech Debt): the R generator's file-header comment now explicitly defers a CS/SA `anticipation=1` R-parity test to a follow-up (R1 P3 #2 fix), but the PR removed the only PreTrendsPower TODO row and did not add a replacement tracker. Fix: add a low-priority TODO.md row describing the deferred work: build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 entry that gets filtered by `_extract_pre_period_params` before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. All 4 parity tests still pass; the regenerated JSON only differs in the description string. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T17:57:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 738ae8fcce029f851788eada26b76c97b5e88d37

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Method in scope is PreTrendsPower only. This PR adds pinned R-parity artifacts, tests, and methodology tracker/doc updates; it does not modify estimator math in diff_diff/pretrends.py.
The new parity surface is aligned with the current implementation and registry contract for γ-unit linear weights, pre-period covariance extraction, NIS power, and NIS MDV in diff_diff/pretrends.py:L876-L955, diff_diff/pretrends.py:L996-L1170, and diff_diff/pretrends.py:L1300-L1510.
The prior re-review P3s are addressed: the generator/artifact now use scipy.stats.norm.ppf(0.975) wording in benchmarks/R/generate_pretrends_golden.R:L253-L260 and benchmarks/data/r_pretrends_golden.json:L2-L7, and the remaining CS/SA anticipation=1 parity gap is now explicitly tracked in TODO.md:L97-L98.
The new parity tests cover four useful surfaces at the documented tolerance: direct NIS power, inverse gamma_p MDV, one public-API irregular-grid fit() path, and a K=1 closed-form cross-check in tests/test_methodology_pretrends.py:L1160-L1332.
I did not rerun pytest or the R generator in this sandbox; project runtime deps are incomplete here (scipy is missing) and no R package environment is available, so this is a static re-review.

Methodology

No findings. The PR’s parity claims and the promotion to “Complete” are consistent with the pinned-reference prose in docs/methodology/REGISTRY.md:L2842-L2854, docs/methodology/papers/roth-2022-review.md:L280-L291, METHODOLOGY_REVIEW.md:L1047-L1071, the generator’s provenance-checked extraction of pretrends() / slope_for_power() outputs in benchmarks/R/generate_pretrends_golden.R:L72-L186, and the current implementation paths in diff_diff/pretrends.py:L876-L955, diff_diff/pretrends.py:L996-L1170, and diff_diff/pretrends.py:L1300-L1510.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The fail-closed provenance check and explicit array-preservation for K=1 fixtures improve auditability and schema stability in benchmarks/R/generate_pretrends_golden.R:L74-L90 and benchmarks/R/generate_pretrends_golden.R:L167-L186.

Tech Debt

Severity P3 (informational, tracked). Impact: the new goldens still do not directly lock the CS/SA anticipation=1 pre-period filter against R because the R package has no anticipation parameter, but this limitation is now explicitly documented and tracked, so it is not a blocker for this PR. Concrete fix: no action required in this PR; follow the tracked item by adding the synthetic CallawaySantAnnaResults/SunAbrahamResults parity fixture described in TODO.md:L97-L98. References: benchmarks/R/generate_pretrends_golden.R:L45-L58, TODO.md:L97-L98.

Security

No findings.

Documentation/Tests

No findings. The prior metadata wording issue is fixed in benchmarks/R/generate_pretrends_golden.R:L253-L260 and benchmarks/data/r_pretrends_golden.json:L2-L7, and the parity suite now asserts the documented four-surface contract in tests/test_methodology_pretrends.py:L1160-L1332.

Release notes consolidate 8 PRs since 3.4.0 (2026-05-19): Public-surface variance lifts: - SpilloverDiD survey_design on HC1/CR1 via Binder TSL (Wave E.1, igerber#468) - SpilloverDiD vcov_type=conley + survey_design via stratified-Conley on PSU totals (Wave E.2, igerber#474) + lag_cutoff>0 follow-up (igerber#477) - SunAbraham vcov_type ∈ {classical, hc1, hc2, hc2_bm} (Phase 1b 1/8, igerber#472) - WLS-CR2 Bell-McCaffrey gates lifted via clubSandwich port (igerber#475) Methodology-review-tracker promotions (mostly docs/tests): - PreTrendsPower R pretrends parity goldens (PR-C, igerber#471) - HAD methodology-review-tracker promotion (igerber#473) - ContinuousDiD methodology-review-tracker promotion (igerber#476) All changes additive; bit-equal defaults preserved across the affected estimators. No new estimators (patch-level per semver convention). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 3 commits May 19, 2026 13:10

igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026

igerber merged commit 703ad76 into main May 19, 2026
31 of 32 checks passed

igerber deleted the feature/pretrends-r-parity-pr-c branch May 19, 2026 19:45

igerber mentioned this pull request May 21, 2026

Release 3.4.1: SpilloverDiD survey + Conley lifts, SunAbraham vcov_type, WLS-CR2 BM, methodology-tracker promotions #480

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR-C: PreTrendsPower R `pretrends` parity goldens at commit `122731d082`#471

PR-C: PreTrendsPower R `pretrends` parity goldens at commit `122731d082`#471
igerber merged 5 commits into
mainfrom
feature/pretrends-r-parity-pr-c

igerber commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 19, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant