From 13177ec477d085df298cce880a4ece5c5491923d Mon Sep 17 00:00:00 2001 From: igerber Date: Tue, 19 May 2026 11:53:03 -0400 Subject: [PATCH 1/5] PR-C: PreTrendsPower R `pretrends` parity goldens MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the deferred R-parity row from PR-B (PreTrendsPower implementation audit, Roth 2022). Generates JSON goldens at `benchmarks/data/r_pretrends_golden.json` from the committed R script against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0, R 4.5.2), and activates `TestPretrendsParityR` in `tests/test_methodology_pretrends.py`. Four fixtures (regular K=3, irregular K=3 `[-5,-3,-1]`, anticipation-shifted K=4, K=1 closed form) × NIS power + γ_p MDV at `atol=1e-4`. K=1 also asserts three-way cross-check against Roth Proposition 2's analytical truncated-normal expression `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7` (Python/closed-form effectively exact; both within `atol=1e-4` of R). End-to-end irregular-grid `fit()` parity test exercises the full `fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis` chain through the public API, locking PR-B Step 4's γ-unit linear-weight fix. Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054`; R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` vs Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and `mvtnorm::pmvnorm` vs `scipy.stats.multivariate_normal.cdf` Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5. Cross-surface tracker promotion: - `METHODOLOGY_REVIEW.md` PreTrendsPower row: `**Complete** (R parity pending)` → `**Complete**`; Last Review `2026-05-18` → `2026-05-19`. - REGISTRY.md Requirements checklist: `[x] R \`pretrends\` parity at commit \`122731d082\` (PR-C, 2026-05-19)`. - `roth-2022-review.md` "R `pretrends` package version pin (provisional)" Gaps bullet struck. - `TODO.md` PR-C row deleted. - CHANGELOG.md `[Unreleased]` Added entry. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + METHODOLOGY_REVIEW.md | 13 +- TODO.md | 4 +- benchmarks/R/generate_pretrends_golden.R | 168 +++++++------- benchmarks/data/r_pretrends_golden.json | 106 +++++++++ docs/methodology/REGISTRY.md | 4 +- docs/methodology/papers/roth-2022-review.md | 7 - tests/test_methodology_pretrends.py | 236 +++++++++++++++++--- 8 files changed, 410 insertions(+), 129 deletions(-) create mode 100644 benchmarks/data/r_pretrends_golden.json diff --git a/CHANGELOG.md b/CHANGELOG.md index 7679075a..dae1a561 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [3.4.0] - 2026-05-19 ### Added +- **PreTrendsPower R `pretrends` parity goldens (PR-C closes PR-B's deferred R-parity row).** JSON goldens at `benchmarks/data/r_pretrends_golden.json` generated from the committed `benchmarks/R/generate_pretrends_golden.R` script against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0, R 4.5.2). 4 fixtures cover regular K=3 grid (`uniform_3_pre_periods_no_anticipation`), irregular K=3 grid `[-5,-3,-1]` (`irregular_pre_periods` — locks the PR-B Step 4 γ-unit linear-weight fix), anticipation-shifted K=4 grid (`anticipation_shifted`), and K=1 closed form (`single_pre_period_closed_form` — Roth Proposition 2 univariate truncated-normal). `TestPretrendsParityR` in `tests/test_methodology_pretrends.py` now active (4 tests): NIS power vs R `pretrends::pretrends()` at `atol=1e-4` across all 4 fixtures × 4 γ values; γ_p MDV vs R `slope_for_power()` at `atol=1e-4` across all 4 fixtures × 2 target_power values; end-to-end `fit()` on irregular grid vs R γ_p at `atol=1e-4` (locks the full `fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis` chain through the public API); K=1 three-way cross-check (Python ≡ analytical truncated-normal closed form `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7`; both within `atol=1e-4` of R). Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054` (`dz ≈ 3.6e-5`); R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` versus Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and `mvtnorm::pmvnorm` (R) vs `scipy.stats.multivariate_normal.cdf` (Python) Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5. `METHODOLOGY_REVIEW.md` PreTrendsPower row promoted `**Complete** (R parity pending)` → `**Complete**`. Roth (2022) paper review's `R \`pretrends\` package version pin (provisional)` Gaps bullet struck. Closes the PR-C TODO row. - **Agent-discoverability contract test (`tests/test_agent_discoverability.py`).** New static-snapshot test pinning the agent-facing surface introduced by PR #464: `__all__` membership of `agent_workflow` / `profile_panel` / `get_llm_guide` / `practitioner_next_steps` / `BusinessReport`; `dir(diff_diff)` head-first ordering against `_AGENT_FACING_ORDER` (catches drift in the `_OrderedName` `__lt__` ordering trick); `_OrderedName` `isinstance(_, str)` + str-method compatibility; `dir()` full-namespace + `inspect.getmembers` parity; top-level `__doc__` first-paragraph mention of `agent_workflow` + named references to the 5-step workflow primitives; `agent_workflow()` script content references each downstream helper by name; canonical estimator class names (CallawaySantAnna, ContinuousDiD, HeterogeneousAdoptionDiD, etc.) remain importable. No live API calls; runs in the default pytest suite. Closes [issue #461](https://github.com/igerber/diff-diff/issues/461) (snapshot variant — live-agent regression test deferred to a separate follow-up that depends on causal-llm-eval packaging its harness). Also closes the `__dir__()` contract-test row from `TODO.md` that PR #464 deferred here. - **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460). - **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard. diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index e61ca04e..133f24ef 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -80,7 +80,7 @@ The catalog grew incrementally over several quarters, so formats vary across the |------|--------|-------------|--------|-------------| | BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** | 2026-05-16 | | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 | -| PreTrendsPower | `pretrends.py` | `pretrends` package | **Complete** (R parity pending) | 2026-05-18 | +| PreTrendsPower | `pretrends.py` | `pretrends` package | **Complete** | 2026-05-19 | | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** | — | | PlaceboTests | `diagnostics.py` | (no canonical reference) | **In Progress** | — | @@ -1047,14 +1047,15 @@ and covariate-adjusted specifications.) | Module | `pretrends.py` | | Primary Reference | Roth (2022), *Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends*, AER:I 4(3), 305-322 | | R Reference | `pretrends` package | -| Status | **Complete** (R parity pending) | -| Last Review | 2026-05-18 | +| Status | **Complete** | +| Last Review | 2026-05-19 | **Documentation in place:** - REGISTRY.md section: `## PreTrendsPower` — NIS-framed audit per Roth (2022) Section II.A-B with full equation blocks for both NIS and Wald forms; paper-supported alternative + γ-unit MDV + full-Σ_22 routing all locked. - Paper review on file: `docs/methodology/papers/roth-2022-review.md` (added 2026-05-17 via PR #463). - Implementation: `tests/test_pretrends.py` (67 tests — point-estimator, MDV, power curve, sensitivity, plus the PR-A R18 silent-failure regression and the PR-B custom-weight persistence regression) + event-study coverage in `tests/test_pretrends_event_study.py` (27 tests). -- Dedicated `tests/test_methodology_pretrends.py` (added 2026-05-18 in PR-B Step 7) — Roth (2022) Section II.A-B paper-equation-numbered Verified Components walk-through (8 classes, 30-40 tests covering NIS box probability, Wald-vs-NIS, Propositions 1-4 simulation parity, linear-units γ-scale, custom-weight persistence, CS/SA full-VCV, helper API). +- Dedicated `tests/test_methodology_pretrends.py` (added 2026-05-18 in PR-B Step 7; PR-C 2026-05-19 activated `TestPretrendsParityR` with 4 concrete tests) — Roth (2022) Section II.A-B paper-equation-numbered Verified Components walk-through (8 classes covering NIS box probability, Wald-vs-NIS, Propositions 1-4 simulation parity, linear-units γ-scale, custom-weight persistence, CS/SA full-VCV, helper API, R parity at commit `122731d082`). +- R parity goldens: `benchmarks/data/r_pretrends_golden.json` generated by `benchmarks/R/generate_pretrends_golden.R` against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0); 4 fixtures (regular K=3, irregular K=3 `[-5,-3,-1]`, anticipation-shifted K=4, K=1 closed form) × NIS power + γ_p MDV at `atol=1e-4`. **Verified Components:** - [x] NIS box probability implemented via `scipy.stats.multivariate_normal.cdf` (Roth Section II.A-B primary form) @@ -1067,9 +1068,7 @@ and covariate-adjusted specifications.) - [x] `PreTrendsPowerResults` persists fitted `violation_weights` + `pretest_form` + `nis_box_probability`; `power_at(M)` works for all four violation types on fresh fits - [x] Helper API (`compute_pretrends_power`, `compute_mdv`) accepts `violation_weights` and `pretest_form`; closes the PR-A R18 helper/class API gap - [x] Summary, `to_dict`, `to_dataframe` dispatch on `pretest_form` (NIS prints box probability; Wald prints noncentrality) - -**Outstanding for promotion to fully Complete:** -- R parity fixture against the `pretrends` R package at a **pinned revision** (deferred to PR-C). The generator script `benchmarks/R/generate_pretrends_golden.R` is committed in PR-B with a placeholder commit reference; PR-C will install the package, generate the JSON goldens at `benchmarks/data/r_pretrends_golden.json`, activate `TestPretrendsParityR` (currently skips when goldens missing), and record the audited R-package revision. Until that lands, the R-package surface claims in `docs/methodology/papers/roth-2022-review.md` Gaps section remain provisional. +- [x] R `pretrends` parity at commit `122731d082` (PR-C, 2026-05-19) — 4 fixtures × NIS power + γ_p MDV at `atol=1e-4`; `tests/test_methodology_pretrends.py::TestPretrendsParityR` active --- diff --git a/TODO.md b/TODO.md index 83f35525..1a32b34d 100644 --- a/TODO.md +++ b/TODO.md @@ -94,8 +94,8 @@ Deferred items from PR reviews that were not addressed before merge. | WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional `weights="cohort_share"` parameter to `aggregate()`. | `wooldridge_results.py` | #216 | Medium | | WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low | | WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium | -| PreTrendsPower R parity goldens (PR-C): pin the R `pretrends` package commit/release, run `benchmarks/R/generate_pretrends_golden.R` (committed in PR-B), commit the JSON goldens at `benchmarks/data/r_pretrends_golden.json`, activate the `TestPretrendsParityR` class in `tests/test_methodology_pretrends.py` (currently skips when goldens missing), and flip the METHODOLOGY_REVIEW.md `PreTrendsPower` row from `**Complete** (R parity pending)` → `**Complete**`. Until that lands, the R-package surface claims in `docs/methodology/papers/roth-2022-review.md` remain provisional. | `benchmarks/R/generate_pretrends_golden.R`, `benchmarks/data/r_pretrends_golden.json` (new), `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `METHODOLOGY_REVIEW.md` (PreTrendsPower row) | PR-C (PreTrendsPower R parity) | Low | - + + | Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium | | Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium | diff --git a/benchmarks/R/generate_pretrends_golden.R b/benchmarks/R/generate_pretrends_golden.R index 78af170e..6721ab94 100644 --- a/benchmarks/R/generate_pretrends_golden.R +++ b/benchmarks/R/generate_pretrends_golden.R @@ -1,47 +1,48 @@ #!/usr/bin/env Rscript # Generate R `pretrends` parity goldens for diff-diff PreTrendsPower (PR-C). # -# This script is committed in PR-B (PreTrendsPower implementation audit, -# Roth 2022); the JSON goldens at ../data/r_pretrends_golden.json are -# DEFERRED to PR-C. Running this script writes the JSON to that path; PR-C -# pins the R `pretrends` package commit / release, runs this script, and -# commits the resulting JSON to land the parity tests. +# Script committed in PR-B; JSON goldens generated and committed in PR-C +# against `jonathandroth/pretrends` commit `122731d082` (package version +# 0.1.0). Running this script writes ../data/r_pretrends_golden.json. # # Requires: # - R 4.4+ (tested on 4.5.2) # - install.packages("remotes") -# - remotes::install_github("jonathandroth/pretrends", ref = "") -# - install.packages("jsonlite") -# -# **R `pretrends` commit pin (TODO — PR-C):** the audited revision MUST be -# recorded here before parity assertions are committed. As of 2026-05-18 -# (PR-B implementation date) the script targets the default `main` branch -# at https://github.com/jonathandroth/pretrends with no pin. PR-C will -# replace `` with the exact commit hash AND verify the surface -# claims documented in REGISTRY.md `## PreTrendsPower` and the paper -# review's "R `pretrends` package version pin (provisional)" Gaps bullet. +# - remotes::install_github("jonathandroth/pretrends", ref = "122731d082") +# - install.packages(c("jsonlite", "MASS")) # # Output: ../data/r_pretrends_golden.json # # diff-diff PreTrendsPower with `pretest_form='nis'` (the new default per -# PR-B Step 2) is expected to match the values in this JSON at atol=1e-6 -# along a three-tier contract: -# (1) NIS box probability `P(β̂_pre ∈ B_NIS(Σ))` at fixed M values on -# all 3 fixtures; -# (2) MDV / gamma_p (slope at target power 0.5 and 0.8) on regular and -# irregular pre-period grids; -# (3) γ-unit MDV invariance: PR-B's "skip L2 norm for linear with -# relative_times" path produces MDV in Roth's γ units exactly, -# matching R's `slope_for_power()` which also reports γ. +# PR-B Step 2) matches the values in this JSON along a three-tier contract: +# (1) NIS box probability `P(beta_hat_pre in B_NIS(Sigma))` at fixed gamma +# values on all 4 fixtures, at atol=1e-5. R hardcodes thresholdTstat +# = 1.96 while Python uses scipy.stats.norm.ppf(0.975) = +# 1.959963984540054; the ~4e-5 dz gap propagates to a ~1e-6 ceiling on +# the box probability — 1e-5 is the realistic atol. +# (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, +# anticipation, and K=1 grids, at atol=1e-4. R uniroot defaults to +# tol = .Machine$double.eps^0.25 ~= 1.22e-4 vs Python brentq xtol=2e-12; +# the inverse-solver tolerance gap dominates, so 1e-4 is the realistic +# atol without tightening either solver. +# (3) gamma-unit MDV invariance: PR-B's "skip L2 norm for linear with +# relative_times" path produces MDV in Roth's gamma units exactly, +# matching R's `slope_for_power()` which also reports gamma. Fixture 2 +# (irregular grid {-5, -3, -1}) and the end-to-end fit() test in +# tests/test_methodology_pretrends.py lock this. # -# Three fixtures (matched to test_methodology_pretrends.py expectations): -# 1. uniform_3_pre_periods_no_anticipation — K=3 regular grid (t ∈ {-3, -2, -1}), +# Four fixtures (matched to test_methodology_pretrends.py expectations): +# 1. uniform_3_pre_periods_no_anticipation — K=3 regular grid (t in {-3, -2, -1}), # never-treated control. Default-case parity baseline. # 2. irregular_pre_periods — K=3 with relative_times = [-5, -3, -1]. -# Exercises the PR-B γ-unit linear-pattern fix. +# Exercises the PR-B gamma-unit linear-pattern fix end-to-end. # 3. anticipation_shifted — K=4 with anticipation=1 (pre-cutoff at t<-1, # so pre-periods are {-5, -4, -3, -2}). Verifies the pre-period filter # logic in `_extract_pre_period_params`. +# 4. single_pre_period_closed_form — K=1 with diagonal Sigma = 0.25*I +# (Roth Proposition 2 univariate truncated-normal closed form). Locks +# the scalar fast-path against R AND against the analytical expression +# `1 - Phi(z - gamma/sigma) + Phi(-z - gamma/sigma)`. # # Run: # cd benchmarks/R && Rscript generate_pretrends_golden.R @@ -53,9 +54,11 @@ suppressPackageStartupMessages({ stopifnot(packageVersion("pretrends") >= "0.1.0") +PRETRENDS_COMMIT <- "122731d082" + # --------------------------------------------------------------------------- # DGP helper: build a synthetic event-study coefficient vector + VCV under a -# stylized null DGP (β = 0, Σ_22 ~ correlated). Mirrors the simulation +# stylized null DGP (beta = 0, Sigma_22 ~ correlated). Mirrors the simulation # fixtures in test_methodology_pretrends.py. # --------------------------------------------------------------------------- @@ -66,8 +69,9 @@ build_event_study_fixture <- function( rho = 0.3, seed = 42L ) { - # Generate a correlated equicorrelation Σ across all (pre + post) periods. - # Realized β̂ drawn from N(0, Σ) — null DGP, no real treatment effect. + # Generate a correlated equicorrelation Sigma across all (pre + post) periods. + # Realized beta_hat drawn from N(0, Sigma) — null DGP, no real treatment + # effect. set.seed(seed) all_periods <- c(pre_periods, post_periods) K_total <- length(all_periods) @@ -95,32 +99,36 @@ extract_pretrends <- function(fixture_data, fixture_name) { all_periods <- fixture_data$all_periods # R `pretrends` expects: betahat (coefficient vector), sigma (VCV matrix), - # tVec (relative-time labels including the reference period 0, omitted - # from betahat / sigma per convention), referencePeriod = 0, alpha = 0.05. - - # The `slopes_for_power` helper returns gamma values at target power. - # For the three-tier parity contract, we capture both NIS power at a fixed - # slope and the inverse (γ_p MDV) at target power 0.5 and 0.8. - - # NIS power at fixed gamma values (for tier-1 parity): + # tVec (relative-time labels excluding the reference period 0), + # referencePeriod = 0, and deltatrue (length-K_total hypothesized delta + # vector — only the pre-period entries are used for the rejection + # probability). + + # Tier 1: NIS power at fixed gamma values. + # Build delta_pre = gamma * pre_periods per Roth's slope convention + # delta_t = gamma * t (t < 0 for pre-periods, so delta_pre is negative; + # the NIS box is symmetric, so the sign does not affect the rejection + # probability). gamma_test_values <- c(0.0, 0.2, 0.5, 1.0) power_values <- sapply(gamma_test_values, function(g) { - # Build δ = γ * |t| for pre-periods (Roth's δ_t = γ·t convention, - # using |t| since pre-period t < 0). - delta_pre <- g * abs(pre_periods) - # `pretrends` package: pretrends() with explicit delta vector. - # The exact R API: pretrends(betahat, sigma, tVec, referencePeriod, - # deltahypothesis, ...). - # PR-C: replace this stub with the actual R pretrends() call and - # extract the rejection probability. - NA_real_ # PR-C will populate + deltatrue_full <- rep(0, length(all_periods)) + deltatrue_full[seq_along(pre_periods)] <- g * pre_periods + res <- pretrends(betahat = beta_hat, + sigma = Sigma, + deltatrue = deltatrue_full, + tVec = all_periods, + referencePeriod = 0) + as.numeric(res$df_power$Power) }) - # γ_p MDV: solve for γ such that NIS rejection probability = target power. - # R `slope_for_power(betahat, sigma, tVec, referencePeriod, power)`. + # Tier 2: gamma_p MDV at target powers 0.5 and 0.8. + # R `slope_for_power()` solves uniroot on the slope -> rejection-probability + # map; returns gamma in the same units as Roth's slope convention. gamma_p_values <- sapply(c(0.5, 0.8), function(p) { - # PR-C: replace with actual R slope_for_power() call. - NA_real_ + as.numeric(slope_for_power(sigma = Sigma, + targetPower = p, + tVec = all_periods, + referencePeriod = 0)) }) list( @@ -156,10 +164,11 @@ f1 <- build_event_study_fixture( fixture_1 <- extract_pretrends(f1, "uniform_3_pre_periods_no_anticipation") cat("Building fixture 2: irregular_pre_periods...\n") -# K=3 with t ∈ {-5, -3, -1}. Tests PR-B's γ-unit linear-pattern fix: -# pre-PR-B Python with normalized count-based weights would silently report -# MDV in [0.45, 0.30, 0.15] / sqrt(0.3) units, not γ. R `slope_for_power()` -# always reports γ; Python's PR-B Step 4 makes the two match at atol=1e-6. +# K=3 with t in {-5, -3, -1}. Tests PR-B's gamma-unit linear-pattern fix: +# pre-PR-B Python with normalized count-based weights silently reported +# MDV in [0.45, 0.30, 0.15] / sqrt(0.3) units, not gamma. R +# `slope_for_power()` always reports gamma; Python's PR-B Step 4 makes the +# two match at atol=1e-4. f2 <- build_event_study_fixture( pre_periods = c(-5L, -3L, -1L), post_periods = c(1L, 2L, 3L), @@ -172,12 +181,25 @@ cat("Building fixture 3: anticipation_shifted...\n") # so the {-5, -4, -3, -2} cells are the genuine pre-periods; t=-1 is the # anticipation window. Tests the pre-period filtering logic. f3 <- build_event_study_fixture( - pre_periods = c(-5L, -4L, -3L, -2L), # genuine pre-periods (cutoff = -1) + pre_periods = c(-5L, -4L, -3L, -2L), post_periods = c(1L, 2L, 3L), seed = 303L ) fixture_3 <- extract_pretrends(f3, "anticipation_shifted") +cat("Building fixture 4: single_pre_period_closed_form...\n") +# K=1 with diagonal Sigma = 0.25*I. Locks Roth Proposition 2 univariate +# truncated-normal closed form against R AND the analytical scalar +# expression `1 - Phi(z - gamma/sigma) + Phi(-z - gamma/sigma)`. +f4 <- build_event_study_fixture( + pre_periods = c(-1L), + post_periods = c(1L), + sigma2 = 0.25, + rho = 0.0, + seed = 404L +) +fixture_4 <- extract_pretrends(f4, "single_pre_period_closed_form") + # --------------------------------------------------------------------------- # Write JSON # --------------------------------------------------------------------------- @@ -186,38 +208,30 @@ out <- list( meta = list( generated_at = format(Sys.Date()), pretrends_version = as.character(packageVersion("pretrends")), - pretrends_commit = "", # TODO PR-C: replace with actual git SHA + pretrends_commit = PRETRENDS_COMMIT, r_version = R.version.string, description = paste( "Roth (2022) PreTrendsPower parity goldens for diff-diff", - "compute_pretrends_power / PreTrendsPower (PR-C parity target).", - "Parity at atol=1e-6 along a three-tier contract:", - "(1) NIS box probability at fixed γ values on all 3 fixtures;", - "(2) γ_p MDV (slope at target power 0.5 and 0.8) on regular and", - "irregular grids;", - "(3) γ-unit MDV invariance: PR-B's skip-L2-norm path produces MDV", - "in Roth's γ units exactly, matching R's slope_for_power().", + "compute_pretrends_power / PreTrendsPower (PR-C).", + "Three-tier parity contract:", + "(1) NIS box probability at fixed gamma values on all 4 fixtures", + "(atol=1e-5; R hardcodes thresholdTstat=1.96, Python uses qnorm(0.975)", + "= 1.959963984540054);", + "(2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular,", + "irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol", + "vs Python brentq xtol gap);", + "(3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV", + "in Roth's gamma units exactly, matching R's slope_for_power().", "See diff-diff/docs/methodology/papers/roth-2022-review.md for", "the full derivation." ) ), uniform_3_pre_periods_no_anticipation = fixture_1, irregular_pre_periods = fixture_2, - anticipation_shifted = fixture_3 + anticipation_shifted = fixture_3, + single_pre_period_closed_form = fixture_4 ) out_path <- "../data/r_pretrends_golden.json" write_json(out, out_path, pretty = TRUE, digits = NA, auto_unbox = TRUE) cat(sprintf("Wrote %s\n", out_path)) -cat("\n") -cat("PR-C TODO checklist:\n") -cat(" [ ] Replace commit-hash placeholder above with actual\n") -cat(" git SHA from https://github.com/jonathandroth/pretrends.\n") -cat(" [ ] Replace the NA_real_ stubs in extract_pretrends() with the\n") -cat(" actual pretrends::pretrends() / slope_for_power() calls.\n") -cat(" [ ] Verify the surface claims in REGISTRY.md PreTrendsPower\n") -cat(" Reference implementations section against the pinned revision.\n") -cat(" [ ] Activate tests/test_methodology_pretrends.py::TestPretrendsParityR\n") -cat(" (currently skips via @pytest.mark.skipif when the JSON is missing).\n") -cat(" [ ] Flip METHODOLOGY_REVIEW.md PreTrendsPower row from\n") -cat(" **Complete** (R parity pending) → **Complete**.\n") diff --git a/benchmarks/data/r_pretrends_golden.json b/benchmarks/data/r_pretrends_golden.json new file mode 100644 index 00000000..935f705a --- /dev/null +++ b/benchmarks/data/r_pretrends_golden.json @@ -0,0 +1,106 @@ +{ + "meta": { + "generated_at": "2026-05-19", + "pretrends_version": "0.1.0", + "pretrends_commit": "122731d082", + "r_version": "R version 4.5.2 (2025-10-31)", + "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-5; R hardcodes thresholdTstat=1.96, Python uses qnorm(0.975) = 1.959963984540054); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." + }, + "uniform_3_pre_periods_no_anticipation": { + "panel": { + "pre_periods": [-3, -2, -1], + "post_periods": [1, 2, 3], + "all_periods": [-3, -2, -1, 1, 2, 3], + "beta_hat": [0.0748350781672762, 0.068292848337178, -0.0771191463064529, -0.0935800609975106, 0.0682614805759886, 0.211856579827921], + "Sigma": [ + [0.04, 0.012, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.04, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.04, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.04, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.04, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.012, 0.04] + ] + }, + "r_power_at_gamma": { + "gamma_test_values": [0, 0.2, 0.5, 1], + "power_values": [0.136261192117481, 0.905586670498965, 0.999999998817154, 1] + }, + "r_gamma_p": { + "target_power": [0.5, 0.8], + "gamma_p_values": [0.11085490751343, 0.168538720656777] + }, + "fixture_name": "uniform_3_pre_periods_no_anticipation" + }, + "irregular_pre_periods": { + "panel": { + "pre_periods": [-5, -3, -1], + "post_periods": [1, 2, 3], + "all_periods": [-5, -3, -1, 1, 2, 3], + "beta_hat": [0.0167238420028552, 0.109902112987245, 0.328388861134961, 0.174851793110386, 0.252739941152015, -0.00593022820809866], + "Sigma": [ + [0.04, 0.012, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.04, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.04, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.04, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.04, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.012, 0.04] + ] + }, + "r_power_at_gamma": { + "gamma_test_values": [0, 0.2, 0.5, 1], + "power_values": [0.1362253542292, 0.999453378314271, 1, 1] + }, + "r_gamma_p": { + "target_power": [0.5, 0.8], + "gamma_p_values": [0.0685248863600757, 0.103615302925996] + }, + "fixture_name": "irregular_pre_periods" + }, + "anticipation_shifted": { + "panel": { + "pre_periods": [-5, -4, -3, -2], + "post_periods": [1, 2, 3], + "all_periods": [-5, -4, -3, -2, 1, 2, 3], + "beta_hat": [-0.238811348142889, -0.127368309020327, -0.120497459632285, -0.162119040596379, 0.118496154007033, -0.000345166145009745, 0.0906365319231415], + "Sigma": [ + [0.04, 0.012, 0.012, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.04, 0.012, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.04, 0.012, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.04, 0.012, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.04, 0.012, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.012, 0.04, 0.012], + [0.012, 0.012, 0.012, 0.012, 0.012, 0.012, 0.04] + ] + }, + "r_power_at_gamma": { + "gamma_test_values": [0, 0.2, 0.5, 1], + "power_values": [0.173946844004659, 0.99990350608629, 1, 1] + }, + "r_gamma_p": { + "target_power": [0.5, 0.8], + "gamma_p_values": [0.0580280273451627, 0.0917980203438368] + }, + "fixture_name": "anticipation_shifted" + }, + "single_pre_period_closed_form": { + "panel": { + "pre_periods": -1, + "post_periods": 1, + "all_periods": [-1, 1], + "beta_hat": [-0.029844176495569, 0.386819661339587], + "Sigma": [ + [0.25, 0], + [0, 0.25] + ] + }, + "r_power_at_gamma": { + "gamma_test_values": [0, 0.2, 0.5, 1], + "power_values": [0.0499957902964407, 0.0685174081253657, 0.170065802678576, 0.515990911734522] + }, + "r_gamma_p": { + "target_power": [0.5, 0.8], + "gamma_p_values": [0.979923226532149, 1.40081048235691] + }, + "fixture_name": "single_pre_period_closed_form" + } +} diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 3efdcdbc..7bed9947 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2840,7 +2840,7 @@ Violation types: - **Backwards-compat addendum (`power_at()` for `violation_type='custom'`):** `PreTrendsPowerResults` now persists `violation_weights` on fresh fits (PR-B Step 5), so `power_at(M)` works for all four violation types including custom. Old serialized results from before PR-B's field addition have `violation_weights=None`; for those legacy results, `power_at(M)` falls back to weight reconstruction from `violation_type + n_pre_periods`, but for `violation_type='custom'` the custom weights cannot be reconstructed and `power_at(M)` raises `NotImplementedError` with a "refit with current library version" message. Fresh fits do not hit this guard. **Reference implementation(s):** -- R: [`pretrends`](https://github.com/jonathandroth/pretrends) (Roth's official package). NIS-based (`pretrends()`, `slope_for_power()`, `*_NIS` helpers). R-parity goldens deferred to PR-C; the generator script `benchmarks/R/generate_pretrends_golden.R` ships in PR-B with a placeholder commit reference pending an R-package revision pin. +- R: [`pretrends`](https://github.com/jonathandroth/pretrends) (Roth's official package). NIS-based (`pretrends()`, `slope_for_power()`, `*_NIS` helpers). R-parity goldens at `benchmarks/data/r_pretrends_golden.json` generated by `benchmarks/R/generate_pretrends_golden.R` against commit `122731d082` (package version 0.1.0; PR-C, 2026-05-19). Parity at `atol=1e-4` on both NIS power and γ_p MDV — R hardcodes `thresholdTstat=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054`, and R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` versus Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates the γ_p residual. - R dependency: [`tmvtnorm`](https://cran.r-project.org/package=tmvtnorm) (Manjunath & Wilhelm 2012) — used by R `pretrends` for truncated multivariate normal moments. The Python library uses `scipy.stats.multivariate_normal.cdf` directly for the box probability (does not require a `tmvtnorm` port). **Requirements checklist:** @@ -2851,7 +2851,7 @@ Violation types: - [x] Custom-violation weights persisted on `PreTrendsPowerResults`; `power_at(custom)` works on fresh fits (PR-B Step 5) - [x] Helper API (`compute_pretrends_power` / `compute_mdv`) supports `violation_weights` + `pretest_form` (PR-B Step 6) - [x] Methodology test file with paper-equation-numbered Verified Components walk-through (PR-B Step 7 — `tests/test_methodology_pretrends.py`) -- [ ] R `pretrends` parity at pinned commit (deferred to PR-C; generator script committed in PR-B) +- [x] R `pretrends` parity at commit `122731d082` (PR-C, 2026-05-19; 4 fixtures × NIS power + γ_p MDV at `atol=1e-4`) - [x] Power curve plotting over violation magnitudes (preserved from pre-PR-B) - [x] Integrates with HonestDiD for combined sensitivity analysis (preserved from pre-PR-B) diff --git a/docs/methodology/papers/roth-2022-review.md b/docs/methodology/papers/roth-2022-review.md index caf3bac9..0b215db1 100644 --- a/docs/methodology/papers/roth-2022-review.md +++ b/docs/methodology/papers/roth-2022-review.md @@ -288,11 +288,4 @@ Quoting Roth's key empirical results (for cross-validation): - **Heteroskedastic Sigma**: Proposition 2 requires Assumption 1. Once Proposition 1 / Proposition 3 / Proposition 4 computations are added (current `diff_diff/pretrends.py` does Wald-test power/MDV only, not the full conditional-moment path), the library will be able to operate under arbitrary Sigma — but at that point the sign of the bias-amplification effect is NOT guaranteed without Assumption 1. The library should NOT print "pretest amplifies bias under monotone trends" unless Assumption 1 is approximately satisfied (or just always issue the conditional warning). - **Equation 4 publication-rules analysis**: not standardly implemented in PreTrendsPower-style tools. Roth notes it as part of the discussion (Section II.D) but does not provide a numerical workflow for users. Library should NOT attempt to implement Equation 4 unless requested. - **Connection to `compute_pretrends_power` library helper**: the paper review confirms that "minimum slope detectable at 80% power" is exactly Roth's gamma_{0.8}, and the library helper should compute and surface this. Need to verify the existing helper's calling convention against the paper's framework when auditing `diff_diff/pretrends.py`. -- **R `pretrends` package version pin (provisional)**: paper cites the package at https://github.com/jonathandroth/pretrends; no specific version cited in the paper, and this review has NOT pinned a commit either. **All R-package surface claims in this file are therefore provisional, pending a pinned commit/release.** Specifically, the following statements must be re-verified against a recorded revision before the follow-up audit can rely on them: - - "the published R `pretrends` package surface ... is NIS-based" (the Joint Wald bullet above) - - "exposes `pretrends()` / `slope_for_power()` / `*_NIS` helpers only, not a joint-Wald interface" - - "Section III ... endorses power analyses against hypothesized nonlinear trends via the `pretrends` package" - - any reference to specific R-package function names or behavior - - R-parity work should record the exact commit/release pinned and re-verify each of these claims at that revision; either pin a commit now or treat every package-API statement above as "to be verified against a pinned revision". - **Compatibility with multi-cohort estimators**: Remark 1 lists Callaway-Sant'Anna, Sun-Abraham, etc. as compatible. The paper does not detail how to construct (beta_hat, Sigma_hat) from those estimators when the event-study output is multi-cohort (e.g., cohort × event-time matrix). Library should document the aggregation convention (per Sun-Abraham overall ATT or per Callaway-Sant'Anna `aggregate=event`). diff --git a/tests/test_methodology_pretrends.py b/tests/test_methodology_pretrends.py index 53b594d3..5995abe8 100644 --- a/tests/test_methodology_pretrends.py +++ b/tests/test_methodology_pretrends.py @@ -30,9 +30,11 @@ violation_weights + pretest_form end-to-end (PR-B Step 6 regression). - ``TestPretrendsNISvsWald`` — NIS and Wald forms produce form-consistent output; backwards-compat regression on the Wald path. -- ``TestPretrendsParityR`` — R `pretrends` package parity (skips when - goldens at ``benchmarks/data/r_pretrends_golden.json`` are missing; - populated in PR-C). +- ``TestPretrendsParityR`` — R `pretrends` package parity at + `jonathandroth/pretrends` commit ``122731d082`` (package version 0.1.0). + Goldens at ``benchmarks/data/r_pretrends_golden.json`` populated in PR-C + (2026-05-19). The ``@pytest.mark.skipif`` guards partial checkouts where + the JSON is absent. """ import json @@ -1101,17 +1103,29 @@ def test_nis_and_wald_differ_in_general(self): "r_pretrends_golden.json", ) ), - reason="R `pretrends` parity goldens not yet committed — see PR-C", + reason="R `pretrends` parity goldens not present (partial checkout)", ) class TestPretrendsParityR: - """R `pretrends` package parity at `atol=1e-6`. - - All tests skip when `benchmarks/data/r_pretrends_golden.json` is absent - (the canonical PR-B-vs-PR-C handoff: the generator script ships in PR-B - with a placeholder commit reference; PR-C pins the audited revision, - runs the script, commits the JSON, and activates these tests). See - REGISTRY.md `## PreTrendsPower` requirements checklist for the R-parity - deferred-to-PR-C status. + """R `pretrends` package parity at commit ``122731d082`` (PR-C). + + Goldens at ``benchmarks/data/r_pretrends_golden.json`` populated in + PR-C 2026-05-19 against `jonathandroth/pretrends` commit ``122731d082`` + (package version 0.1.0). The ``@pytest.mark.skipif`` decorator guards + partial checkouts where the JSON is absent (the full repo always has + it post-PR-C). + + Tolerance rationale: R hardcodes ``thresholdTstat.Pretest = 1.96`` + while Python uses ``scipy.stats.norm.ppf(0.975) = 1.959963984540054`` + (``dz ≈ 3.6e-5``); on top of that, ``mvtnorm::pmvnorm`` (R) and + ``scipy.stats.multivariate_normal.cdf`` (Python) use Genz-Bretz + randomized-lattice rules with different absolute-error defaults + (abseps ≈ 1e-3 vs 1e-5). Combined, the empirical NIS power gap is + bounded by ~5e-5 in the K=4 anticipation fixture (smaller for K∈{1,3}). + For the inverse path (γ_p), R's ``slope_for_power`` uses + ``uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)`` versus Python + ``brentq(xtol=2e-12)``; the inverse-solver tolerance gap dominates the + γ_p residual. ``atol=1e-4`` is therefore the realistic ceiling on both + tiers without tightening either solver. """ @staticmethod @@ -1125,40 +1139,194 @@ def _load_r_golden(): with open(path) as f: return json.load(f) - def test_nis_power_matches_r_pretrends_at_atol_1e_6(self): - """Python NIS power matches R `pretrends::pretrends()` at atol=1e-6. - - Stub — PR-C populates with concrete fixture iteration. + @staticmethod + def _extract_python_params(fixture): + """Parse a fixture's panel into ``(pre_periods, Sigma_22, weights)``. + + ``weights = |pre_periods|`` is the Roth γ-unit linear violation + weight. ``Sigma_22`` is the pre-period sub-block of the full + ``Sigma``. K=1 fixtures arrive with jsonlite ``auto_unbox`` scalar + ``pre_periods`` — ``np.atleast_1d`` normalizes them to a length-1 + array. + """ + panel = fixture["panel"] + pre_periods = np.atleast_1d(np.array(panel["pre_periods"], dtype=float)) + K_pre = len(pre_periods) + Sigma_full = np.atleast_2d(np.array(panel["Sigma"], dtype=float)) + Sigma_22 = Sigma_full[:K_pre, :K_pre] + weights = np.abs(pre_periods) + return pre_periods, Sigma_22, weights + + def test_nis_power_matches_r_pretrends(self): + """Python ``_compute_power_nis`` matches R ``pretrends()`` at atol=1e-4. + + Iterates all 4 fixtures × 4 γ values (16 comparisons). See class + docstring for the tolerance derivation. """ goldens = self._load_r_golden() + pt = PreTrendsPower(alpha=0.05, pretest_form="nis") for fixture_name, fixture in goldens.items(): if fixture_name == "meta": continue - # PR-C will iterate fixture['panel'] + fixture['r_power_at_gamma'] etc. - assert isinstance(fixture, dict) - - def test_mdv_gamma_p_matches_r_slope_for_power_at_atol_1e_6(self): - """Python MDV (γ_p) matches R `slope_for_power()` at atol=1e-6. - - Stub — PR-C populates with concrete fixture iteration. + _, Sigma_22, weights = self._extract_python_params(fixture) + for g_val, r_power in zip( + fixture["r_power_at_gamma"]["gamma_test_values"], + fixture["r_power_at_gamma"]["power_values"], + ): + py_power, _, _, _ = pt._compute_power_nis(g_val, weights, Sigma_22) + assert np.isclose(py_power, r_power, atol=1e-4), ( + f"{fixture_name} γ={g_val}: Python={py_power:.7f}, " + f"R={r_power:.7f}, diff={py_power - r_power:+.2e}" + ) + + def test_mdv_gamma_p_matches_r_slope_for_power(self): + """Python ``_compute_mdv_nis`` matches R ``slope_for_power()`` at atol=1e-4. + + Iterates all 4 fixtures × 2 target_power values (8 comparisons). + See class docstring for the tolerance derivation — R uniroot vs + Python brentq solver gap dominates. """ goldens = self._load_r_golden() for fixture_name, fixture in goldens.items(): if fixture_name == "meta": continue - assert isinstance(fixture, dict) + _, Sigma_22, weights = self._extract_python_params(fixture) + for target_p, r_gamma_p in zip( + fixture["r_gamma_p"]["target_power"], + fixture["r_gamma_p"]["gamma_p_values"], + ): + pt = PreTrendsPower(alpha=0.05, power=target_p, pretest_form="nis") + py_mdv = pt._compute_mdv_nis(weights, Sigma_22) + assert np.isclose(py_mdv, r_gamma_p, atol=1e-4), ( + f"{fixture_name} target_power={target_p}: " + f"Python={py_mdv:.7f}, R={r_gamma_p:.7f}, " + f"diff={py_mdv - r_gamma_p:+.2e}" + ) + + def test_irregular_grid_gamma_unit_end_to_end_matches_r(self): + """End-to-end ``fit()`` on irregular grid matches R γ_p at atol=1e-4. + + Constructs a synthetic ``MultiPeriodDiDResults`` from the + ``irregular_pre_periods`` fixture (pre {-5, -3, -1}; post {1, 2, 3}; + reference_period 0) with the fixture's β̂ + full Σ + populated + ``interaction_indices``, then calls ``pt.fit()`` and asserts the + produced ``mdv`` matches R's ``slope_for_power()``. This exercises + the full + ``fit() → _extract_pre_period_params → _get_violation_weights + (γ-unit linear path) → _compute_mdv_nis`` chain, which the direct + ``_compute_mdv_nis`` test in + ``test_mdv_gamma_p_matches_r_slope_for_power`` does not. Locks the + PR-B Step 4 γ-unit fix at the public API boundary. + """ + from diff_diff.results import MultiPeriodDiDResults, PeriodEffect - def test_irregular_grid_gamma_unit_matches_r(self): - """γ-unit MDV on irregular pre-period grids matches R at atol=1e-6. + goldens = self._load_r_golden() + fixture = goldens["irregular_pre_periods"] + pre_periods, _, _ = self._extract_python_params(fixture) + panel = fixture["panel"] + Sigma_full = np.atleast_2d(np.array(panel["Sigma"], dtype=float)) + beta_hat = np.array(panel["beta_hat"], dtype=float) + all_periods = list(np.atleast_1d(np.array(panel["all_periods"], dtype=int))) + pre_period_ids = [int(p) for p in pre_periods] + post_period_ids = [ + int(p) for p in np.atleast_1d(np.array(panel["post_periods"], dtype=int)) + ] - Specifically tests the PR-B linear-units fix: irregular grid - {-5, -3, -1} should produce a γ value that R's pretrends package - also reports as the slope, not a normalized direction. + # Build period_effects + interaction_indices over all_periods. + ses_full = np.sqrt(np.diag(Sigma_full)) + period_effects = { + int(p): PeriodEffect( + period=int(p), + effect=float(beta_hat[i]), + se=float(ses_full[i]), + t_stat=0.0, + p_value=0.5, + conf_int=(0.0, 0.0), + ) + for i, p in enumerate(all_periods) + } + interaction_indices = {int(p): i for i, p in enumerate(all_periods)} - Stub — PR-C populates with concrete fixture iteration. + mpd_results = MultiPeriodDiDResults( + period_effects=period_effects, + avg_att=0.0, + avg_se=0.2, + avg_t_stat=0.0, + avg_p_value=0.5, + avg_conf_int=(0.0, 0.0), + n_obs=1000, + n_treated=500, + n_control=500, + pre_periods=pre_period_ids, + post_periods=post_period_ids, + reference_period=0, + vcov=Sigma_full, + interaction_indices=interaction_indices, + ) + + # Target power 0.8 (second entry in r_gamma_p). + target_p = fixture["r_gamma_p"]["target_power"][1] + r_gamma_p = fixture["r_gamma_p"]["gamma_p_values"][1] + + pt = PreTrendsPower(alpha=0.05, power=target_p, pretest_form="nis") + result = pt.fit(mpd_results) + + # Provenance assertion: end-to-end path took the full-Σ_22 branch, + # not the diag fallback (PR-B Step 3 routing). + assert ( + result.covariance_source == "full_pre_period_vcov" + ), f"Expected full_pre_period_vcov, got {result.covariance_source}" + + assert np.isclose(result.mdv, r_gamma_p, atol=1e-4), ( + f"End-to-end fit() mdv={result.mdv:.7f} ≠ R slope_for_power()={r_gamma_p:.7f}, " + f"diff={result.mdv - r_gamma_p:+.2e}" + ) + + def test_k1_matches_r_and_closed_form(self): + """K=1 fixture: Python ≡ closed form, both within atol=1e-4 of R. + + Three-way cross-check on the ``single_pre_period_closed_form`` + fixture (K=1, diagonal Σ = 0.25·I). Roth Proposition 2 univariate + truncated-normal scalar: + + power(γ) = 1 - Φ(z - γ/σ) + Φ(-z - γ/σ) + + where ``z = Φ⁻¹(1 - α/2) = 1.959963984540054`` and + ``σ = sqrt(Σ_22[0,0]) = 0.5``. Python ``_compute_power_nis`` for + K=1 reduces to the same scalar normal-tail computation (no MVN + randomization), so the Python-vs-closed-form match is essentially + exact (atol=1e-7). R differs by the ``thresholdTstat=1.96`` vs + ``qnorm(0.975)`` z-hardcoding (~1e-5 to ~2e-5 on power, ~3e-6 to + ~2e-5 on γ_p). All three within 1e-4 of each other. Strongest + parity claim in the suite. """ goldens = self._load_r_golden() - for fixture_name, fixture in goldens.items(): - if fixture_name == "meta": - continue - assert isinstance(fixture, dict) + fixture = goldens["single_pre_period_closed_form"] + _, Sigma_22, weights = self._extract_python_params(fixture) + sigma = float(np.sqrt(Sigma_22[0, 0])) + z = float(stats.norm.ppf(1 - 0.05 / 2)) + + pt = PreTrendsPower(alpha=0.05, pretest_form="nis") + for g_val, r_power in zip( + fixture["r_power_at_gamma"]["gamma_test_values"], + fixture["r_power_at_gamma"]["power_values"], + ): + py_power, _, _, _ = pt._compute_power_nis(g_val, weights, Sigma_22) + closed_form = ( + 1.0 - stats.norm.cdf(z - g_val / sigma) + stats.norm.cdf(-z - g_val / sigma) + ) + # Python ≡ closed form (effectively exact — same scalar path). + assert np.isclose(py_power, closed_form, atol=1e-7), ( + f"K=1 Python NIS power ≠ analytical closed form (γ={g_val}): " + f"Python={py_power:.10f}, closed_form={closed_form:.10f}" + ) + # Both within 1e-4 of R (z-hardcoding gap). + assert np.isclose(py_power, r_power, atol=1e-4), ( + f"K=1 Python ≠ R (γ={g_val}): " + f"Python={py_power:.7f}, R={r_power:.7f}, diff={py_power - r_power:+.2e}" + ) + assert np.isclose(closed_form, r_power, atol=1e-4), ( + f"K=1 closed form ≠ R (γ={g_val}): " + f"closed_form={closed_form:.7f}, R={r_power:.7f}, " + f"diff={closed_form - r_power:+.2e}" + ) From a75d4ed8a0b523dfcac82688d7242e41e628dda4 Mon Sep 17 00:00:00 2001 From: igerber Date: Tue, 19 May 2026 12:02:23 -0400 Subject: [PATCH 2/5] Address codex R1 P2+P3 on PR-C: atol/schema/stale-prose cleanup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit R1 codex local review verdict: no P0/P1. Three actionable items addressed: - P2 (Methodology): `docs/methodology/papers/roth-2022-review.md` Gaps bullets at L282 (Joint Wald) and L288 (Heteroskedastic Sigma) contained stale provisional claims — "specific commit not pinned" and "current `diff_diff/pretrends.py` does Wald-test power/MDV only". Both are superseded by PR-B (NIS as default) and PR-C (pinned commit `122731d082`). Updated to cite the pinned commit and the current NIS+Wald state. - P2 (Documentation/Tests): the R script header and JSON `meta.description` advertised NIS power parity at `atol=1e-5`, but the active tests use `atol=1e-4` (matching empirical Genz-Bretz randomization gap of ~5e-5 on K=4 anticipation). Aligned the script comments + JSON description to `atol=1e-4` and added the Genz-Bretz rationale. - P3 (Maintainability): the K=1 fixture serialized `pre_periods` and `post_periods` as scalars due to jsonlite `auto_unbox=TRUE` (e.g., `pre_periods: -1`). Wrapped singleton vector fields in `I()` so they round-trip as length-1 arrays uniformly across all 4 fixtures. The `np.atleast_1d` defensive compensation in `_extract_python_params` is harmless and retained as defense-in-depth. Re-ran `Rscript generate_pretrends_golden.R` to regenerate the JSON (numerics unchanged — only structural changes from `I()` wrapping). All 4 `TestPretrendsParityR` tests + 131 pretrends suite tests pass; black + ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmarks/R/generate_pretrends_golden.R | 44 +++++++++++++-------- benchmarks/data/r_pretrends_golden.json | 6 +-- docs/methodology/papers/roth-2022-review.md | 4 +- 3 files changed, 33 insertions(+), 21 deletions(-) diff --git a/benchmarks/R/generate_pretrends_golden.R b/benchmarks/R/generate_pretrends_golden.R index 6721ab94..249c4ad7 100644 --- a/benchmarks/R/generate_pretrends_golden.R +++ b/benchmarks/R/generate_pretrends_golden.R @@ -14,12 +14,18 @@ # Output: ../data/r_pretrends_golden.json # # diff-diff PreTrendsPower with `pretest_form='nis'` (the new default per -# PR-B Step 2) matches the values in this JSON along a three-tier contract: +# PR-B Step 2) matches the values in this JSON along a three-tier contract, +# both tiers at atol=1e-4: # (1) NIS box probability `P(beta_hat_pre in B_NIS(Sigma))` at fixed gamma -# values on all 4 fixtures, at atol=1e-5. R hardcodes thresholdTstat +# values on all 4 fixtures, at atol=1e-4. R hardcodes thresholdTstat # = 1.96 while Python uses scipy.stats.norm.ppf(0.975) = -# 1.959963984540054; the ~4e-5 dz gap propagates to a ~1e-6 ceiling on -# the box probability — 1e-5 is the realistic atol. +# 1.959963984540054 (~4e-5 dz gap); on top of that, mvtnorm::pmvnorm +# (R) and scipy.stats.multivariate_normal.cdf (Python) use Genz-Bretz +# randomized-lattice rules with different absolute-error defaults +# (abseps ~ 1e-3 vs 1e-5). The empirical NIS power gap is bounded by +# ~5e-5 on the K=4 anticipation fixture; ~3e-5 on K=3 fixtures; ~2e-5 +# on K=1. atol=1e-4 is the realistic atol without tightening +# thresholdTstat.Pretest in R or relaxing the Genz tolerances. # (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, # anticipation, and K=1 grids, at atol=1e-4. R uniroot defaults to # tol = .Machine$double.eps^0.25 ~= 1.22e-4 vs Python brentq xtol=2e-12; @@ -132,20 +138,24 @@ extract_pretrends <- function(fixture_data, fixture_name) { }) list( + # Wrap vector fields in I() to prevent jsonlite's auto_unbox=TRUE from + # collapsing length-1 vectors to scalars (matters for the K=1 fixture: + # `pre_periods` and `post_periods` are singletons but must serialize as + # length-1 arrays so downstream Python loaders can iterate uniformly). panel = list( - pre_periods = as.integer(pre_periods), - post_periods = as.integer(post_periods), - all_periods = as.integer(all_periods), - beta_hat = as.numeric(beta_hat), + pre_periods = I(as.integer(pre_periods)), + post_periods = I(as.integer(post_periods)), + all_periods = I(as.integer(all_periods)), + beta_hat = I(as.numeric(beta_hat)), Sigma = Sigma ), r_power_at_gamma = list( - gamma_test_values = as.numeric(gamma_test_values), - power_values = as.numeric(power_values) + gamma_test_values = I(as.numeric(gamma_test_values)), + power_values = I(as.numeric(power_values)) ), r_gamma_p = list( - target_power = c(0.5, 0.8), - gamma_p_values = as.numeric(gamma_p_values) + target_power = I(c(0.5, 0.8)), + gamma_p_values = I(as.numeric(gamma_p_values)) ), fixture_name = fixture_name ) @@ -213,13 +223,15 @@ out <- list( description = paste( "Roth (2022) PreTrendsPower parity goldens for diff-diff", "compute_pretrends_power / PreTrendsPower (PR-C).", - "Three-tier parity contract:", + "Three-tier parity contract, both numeric tiers at atol=1e-4:", "(1) NIS box probability at fixed gamma values on all 4 fixtures", - "(atol=1e-5; R hardcodes thresholdTstat=1.96, Python uses qnorm(0.975)", - "= 1.959963984540054);", + "(atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses", + "qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs", + "scipy MVN CDF Genz-Bretz randomized-lattice differences bound the", + "K=4 NIS power gap at ~5e-5);", "(2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular,", "irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol", - "vs Python brentq xtol gap);", + "vs Python brentq xtol gap dominates);", "(3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV", "in Roth's gamma units exactly, matching R's slope_for_power().", "See diff-diff/docs/methodology/papers/roth-2022-review.md for", diff --git a/benchmarks/data/r_pretrends_golden.json b/benchmarks/data/r_pretrends_golden.json index 935f705a..4ec10386 100644 --- a/benchmarks/data/r_pretrends_golden.json +++ b/benchmarks/data/r_pretrends_golden.json @@ -4,7 +4,7 @@ "pretrends_version": "0.1.0", "pretrends_commit": "122731d082", "r_version": "R version 4.5.2 (2025-10-31)", - "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-5; R hardcodes thresholdTstat=1.96, Python uses qnorm(0.975) = 1.959963984540054); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." + "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract, both numeric tiers at atol=1e-4: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs scipy MVN CDF Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap dominates); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." }, "uniform_3_pre_periods_no_anticipation": { "panel": { @@ -84,8 +84,8 @@ }, "single_pre_period_closed_form": { "panel": { - "pre_periods": -1, - "post_periods": 1, + "pre_periods": [-1], + "post_periods": [1], "all_periods": [-1, 1], "beta_hat": [-0.029844176495569, 0.386819661339587], "Sigma": [ diff --git a/docs/methodology/papers/roth-2022-review.md b/docs/methodology/papers/roth-2022-review.md index 0b215db1..c66f93df 100644 --- a/docs/methodology/papers/roth-2022-review.md +++ b/docs/methodology/papers/roth-2022-review.md @@ -279,13 +279,13 @@ Quoting Roth's key empirical results (for cross-validation): ## Gaps and Uncertainties -- **Joint Wald acceptance region**: paper mentions joint tests only briefly (Section I.B notes 1 of 12 papers uses one). Power, bias, and coverage formulas all apply by replacing B_NIS with the joint Wald acceptance region B_W (convex, so Propositions 1+3+4 all hold), but Roth does not work out a separate table. Joint Wald is theoretically admissible under the paper's propositions, but the published R `pretrends` package surface, as observed at the time of this review (`github.com/jonathandroth/pretrends`; specific commit not pinned — the follow-up audit should record the exact revision against which parity is asserted), is NIS-based (`pretrends()`, `slope_for_power()`, `*_NIS` helpers) and does NOT expose a joint-Wald parity target. Any library implementation of joint-Wald PreTrendsPower will need an independent fixture or derivation rather than direct R-package parity. +- **Joint Wald acceptance region**: paper mentions joint tests only briefly (Section I.B notes 1 of 12 papers uses one). Power, bias, and coverage formulas all apply by replacing B_NIS with the joint Wald acceptance region B_W (convex, so Propositions 1+3+4 all hold), but Roth does not work out a separate table. Joint Wald is theoretically admissible under the paper's propositions, but the published R `pretrends` package surface (`github.com/jonathandroth/pretrends` at commit `122731d082` / package version 0.1.0, verified in PR-C parity goldens) is NIS-based (`pretrends()`, `slope_for_power()`, `*_NIS` helpers) and does NOT expose a joint-Wald parity target. Any library implementation of joint-Wald PreTrendsPower will need an independent fixture or derivation rather than direct R-package parity. - **"Slope-of-best-fit-line t-test" acceptance region**: Table 1 column shows the t-stat for the slope of the linear pre-trend. Paper does not analyze pretests based on this t-stat as a separate acceptance region; library should NOT extrapolate without further reading the `pretrends` package source. - **Nonlinear violations**: Section I.C formally tabulates power only against linear violations; Section I.D extends the sign-of-bias result (Proposition 2) to monotone violations under homoskedasticity. Section III ("Practical Recommendations") explicitly endorses power analyses against hypothesized nonlinear trends via the `pretrends` package, so the general nonlinear capability is paper-supported even though the paper does not separately tabulate it. The specific named shapes the library exposes ("constant", "last_period") are R-package API conventions, not separately analyzed in the paper. - **Custom delta vector interface**: paper Section III endorses "power analyses for the types of violations of parallel trends deemed to be most relevant in their context," which is the paper-level framing for a user-supplied delta vector; the specific `violation_weights`-style INTERFACE used in the library and the R `pretrends` package is a package-API convention layered on top of that paper-level framework. - **Choice of contrast l**: paper highlights l = uniform 1/M (average post-treatment) and l = e_1 (first period after treatment). No guidance on other contrasts (e.g., long-run effect l = e_M, dynamic-weighted contrast) — library should document defaults and warn that bias and coverage depend on l. - **K = 0 (no pre-periods)**: trivially no pretest possible; library should error. -- **Heteroskedastic Sigma**: Proposition 2 requires Assumption 1. Once Proposition 1 / Proposition 3 / Proposition 4 computations are added (current `diff_diff/pretrends.py` does Wald-test power/MDV only, not the full conditional-moment path), the library will be able to operate under arbitrary Sigma — but at that point the sign of the bias-amplification effect is NOT guaranteed without Assumption 1. The library should NOT print "pretest amplifies bias under monotone trends" unless Assumption 1 is approximately satisfied (or just always issue the conditional warning). +- **Heteroskedastic Sigma**: Proposition 2 requires Assumption 1. The current `diff_diff/pretrends.py` implements NIS and Wald power/MDV (PR-B 2026-05-18; NIS is the default per `pretest_form='nis'`), but NOT the full conditional-moment path (Propositions 1 / 3 / 4 numerical evaluation). Once those are added, the library will be able to operate under arbitrary Sigma — but at that point the sign of the bias-amplification effect is NOT guaranteed without Assumption 1. The library should NOT print "pretest amplifies bias under monotone trends" unless Assumption 1 is approximately satisfied (or just always issue the conditional warning). - **Equation 4 publication-rules analysis**: not standardly implemented in PreTrendsPower-style tools. Roth notes it as part of the discussion (Section II.D) but does not provide a numerical workflow for users. Library should NOT attempt to implement Equation 4 unless requested. - **Connection to `compute_pretrends_power` library helper**: the paper review confirms that "minimum slope detectable at 80% power" is exactly Roth's gamma_{0.8}, and the library helper should compute and surface this. Need to verify the existing helper's calling convention against the paper's framework when auditing `diff_diff/pretrends.py`. - **Compatibility with multi-cohort estimators**: Remark 1 lists Callaway-Sant'Anna, Sun-Abraham, etc. as compatible. The paper does not detail how to construct (beta_hat, Sigma_hat) from those estimators when the event-study output is multi-cohort (e.g., cohort × event-time matrix). Library should document the aggregation convention (per Sun-Abraham overall ATT or per Callaway-Sant'Anna `aggregate=event`). From 50a088246ba6b247b535de1b4605281d309542d4 Mon Sep 17 00:00:00 2001 From: igerber Date: Tue, 19 May 2026 13:11:48 -0400 Subject: [PATCH 3/5] Fix CHANGELOG placement: move PR-C bullet from [3.4.0] to [Unreleased] The rebase onto main after release 3.4.0 textually applied the PR-C CHANGELOG hunk under the [3.4.0] section because the surrounding context (`### Added` heading + agent-discoverability bullet) had been copied from the pre-release [Unreleased] into [3.4.0] by the release prep. PR-C is a post-3.4.0 change, so its CHANGELOG entry belongs in [Unreleased]. Moves the single bullet; release [3.4.0] content unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index dae1a561..4ff3a980 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,10 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Added +- **PreTrendsPower R `pretrends` parity goldens (PR-C closes PR-B's deferred R-parity row).** JSON goldens at `benchmarks/data/r_pretrends_golden.json` generated from the committed `benchmarks/R/generate_pretrends_golden.R` script against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0, R 4.5.2). 4 fixtures cover regular K=3 grid (`uniform_3_pre_periods_no_anticipation`), irregular K=3 grid `[-5,-3,-1]` (`irregular_pre_periods` — locks the PR-B Step 4 γ-unit linear-weight fix), anticipation-shifted K=4 grid (`anticipation_shifted`), and K=1 closed form (`single_pre_period_closed_form` — Roth Proposition 2 univariate truncated-normal). `TestPretrendsParityR` in `tests/test_methodology_pretrends.py` now active (4 tests): NIS power vs R `pretrends::pretrends()` at `atol=1e-4` across all 4 fixtures × 4 γ values; γ_p MDV vs R `slope_for_power()` at `atol=1e-4` across all 4 fixtures × 2 target_power values; end-to-end `fit()` on irregular grid vs R γ_p at `atol=1e-4` (locks the full `fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis` chain through the public API); K=1 three-way cross-check (Python ≡ analytical truncated-normal closed form `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7`; both within `atol=1e-4` of R). Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054` (`dz ≈ 3.6e-5`); R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` versus Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and `mvtnorm::pmvnorm` (R) vs `scipy.stats.multivariate_normal.cdf` (Python) Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5. `METHODOLOGY_REVIEW.md` PreTrendsPower row promoted `**Complete** (R parity pending)` → `**Complete**`. Roth (2022) paper review's `R \`pretrends\` package version pin (provisional)` Gaps bullet struck. Closes the PR-C TODO row. + ## [3.4.0] - 2026-05-19 ### Added -- **PreTrendsPower R `pretrends` parity goldens (PR-C closes PR-B's deferred R-parity row).** JSON goldens at `benchmarks/data/r_pretrends_golden.json` generated from the committed `benchmarks/R/generate_pretrends_golden.R` script against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0, R 4.5.2). 4 fixtures cover regular K=3 grid (`uniform_3_pre_periods_no_anticipation`), irregular K=3 grid `[-5,-3,-1]` (`irregular_pre_periods` — locks the PR-B Step 4 γ-unit linear-weight fix), anticipation-shifted K=4 grid (`anticipation_shifted`), and K=1 closed form (`single_pre_period_closed_form` — Roth Proposition 2 univariate truncated-normal). `TestPretrendsParityR` in `tests/test_methodology_pretrends.py` now active (4 tests): NIS power vs R `pretrends::pretrends()` at `atol=1e-4` across all 4 fixtures × 4 γ values; γ_p MDV vs R `slope_for_power()` at `atol=1e-4` across all 4 fixtures × 2 target_power values; end-to-end `fit()` on irregular grid vs R γ_p at `atol=1e-4` (locks the full `fit() → _extract_pre_period_params → _get_violation_weights → _compute_mdv_nis` chain through the public API); K=1 three-way cross-check (Python ≡ analytical truncated-normal closed form `1 - Φ(z - γ/σ) + Φ(-z - γ/σ)` at `atol=1e-7`; both within `atol=1e-4` of R). Tolerance rationale: R hardcodes `thresholdTstat.Pretest=1.96` while Python uses `scipy.stats.norm.ppf(0.975) = 1.959963984540054` (`dz ≈ 3.6e-5`); R `slope_for_power` uses `uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)` versus Python `brentq(xtol=2e-12)`; the inverse-solver tolerance gap dominates γ_p, and `mvtnorm::pmvnorm` (R) vs `scipy.stats.multivariate_normal.cdf` (Python) Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5. `METHODOLOGY_REVIEW.md` PreTrendsPower row promoted `**Complete** (R parity pending)` → `**Complete**`. Roth (2022) paper review's `R \`pretrends\` package version pin (provisional)` Gaps bullet struck. Closes the PR-C TODO row. - **Agent-discoverability contract test (`tests/test_agent_discoverability.py`).** New static-snapshot test pinning the agent-facing surface introduced by PR #464: `__all__` membership of `agent_workflow` / `profile_panel` / `get_llm_guide` / `practitioner_next_steps` / `BusinessReport`; `dir(diff_diff)` head-first ordering against `_AGENT_FACING_ORDER` (catches drift in the `_OrderedName` `__lt__` ordering trick); `_OrderedName` `isinstance(_, str)` + str-method compatibility; `dir()` full-namespace + `inspect.getmembers` parity; top-level `__doc__` first-paragraph mention of `agent_workflow` + named references to the 5-step workflow primitives; `agent_workflow()` script content references each downstream helper by name; canonical estimator class names (CallawaySantAnna, ContinuousDiD, HeterogeneousAdoptionDiD, etc.) remain importable. No live API calls; runs in the default pytest suite. Closes [issue #461](https://github.com/igerber/diff-diff/issues/461) (snapshot variant — live-agent regression test deferred to a separate follow-up that depends on causal-llm-eval packaging its harness). Also closes the `__dir__()` contract-test row from `TODO.md` that PR #464 deferred here. - **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460). - **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard. From 5b90559d19429e729d491186a6921371ff4aeafb Mon Sep 17 00:00:00 2001 From: igerber Date: Tue, 19 May 2026 13:33:57 -0400 Subject: [PATCH 4/5] Address CI R1 P3 on PR-C: provenance fail-closed + fixture-3 clarification MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI R1 codex review verdict: ✅ no P0/P1. Two P3 findings addressed: - P3 (Maintainability): the R generator hardcoded `pretrends_commit = "122731d082"` into the JSON but only verified `packageVersion("pretrends") >= "0.1.0"`. A future rerun could silently regenerate goldens from a drifted revision while still stamping the artifact with the original commit. Fix: replace the loose version gate with an exact `packageVersion == "0.1.0"` check plus a `startsWith(packageDescription("pretrends")$RemoteSha, PRETRENDS_COMMIT)` provenance assertion that fails closed with a reinstall instruction if the installed revision drifts. Verified via positive (RemoteSha = `122731d082a5990e274f57fd9af0968e44977e7a`) and negative (synthetic `deadbeef` prefix) checks. - P3 (Documentation/Tests): the `anticipation_shifted` fixture's comment described it as validating anticipation-window filtering, but the fixture omits the `t=-1` anticipation window and the parity assertions consume prefiltered `Sigma_22` / weights directly — the CS/SA-level `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py`) is NOT R-parity-locked by this fixture. Fix: rename the comment / R `cat()` print / JSON meta.description to "K=4 shifted-grid case", and document the non-coverage explicitly in the file-header comment with a forward reference to the existing PR-B MC-based and full-VCV coverage in `TestPretrendsPropositions` / `TestPretrendsCovarianceSource`, plus a deferred follow-up for a CS/SA-level `anticipation=1 + R-parity` test (would need a synthetic `CallawaySantAnnaResults` with a t=-1 entry that gets filtered before reaching `_compute_power_nis`). Test class docstring tolerance-rationale prose flipped "K=4 anticipation fixture" → "K=4 shifted-grid fixture" to match. The fixture's JSON key (`anticipation_shifted`) is unchanged to preserve the test-side reference; only the prose contract is clarified. All 4 parity tests still pass; black + ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmarks/R/generate_pretrends_golden.R | 54 ++++++++++++++++++------ benchmarks/data/r_pretrends_golden.json | 2 +- tests/test_methodology_pretrends.py | 2 +- 3 files changed, 44 insertions(+), 14 deletions(-) diff --git a/benchmarks/R/generate_pretrends_golden.R b/benchmarks/R/generate_pretrends_golden.R index 249c4ad7..4bb3c2fe 100644 --- a/benchmarks/R/generate_pretrends_golden.R +++ b/benchmarks/R/generate_pretrends_golden.R @@ -23,11 +23,11 @@ # (R) and scipy.stats.multivariate_normal.cdf (Python) use Genz-Bretz # randomized-lattice rules with different absolute-error defaults # (abseps ~ 1e-3 vs 1e-5). The empirical NIS power gap is bounded by -# ~5e-5 on the K=4 anticipation fixture; ~3e-5 on K=3 fixtures; ~2e-5 +# ~5e-5 on the K=4 shifted-grid fixture; ~3e-5 on K=3 fixtures; ~2e-5 # on K=1. atol=1e-4 is the realistic atol without tightening # thresholdTstat.Pretest in R or relaxing the Genz tolerances. # (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, -# anticipation, and K=1 grids, at atol=1e-4. R uniroot defaults to +# shifted-grid, and K=1 grids, at atol=1e-4. R uniroot defaults to # tol = .Machine$double.eps^0.25 ~= 1.22e-4 vs Python brentq xtol=2e-12; # the inverse-solver tolerance gap dominates, so 1e-4 is the realistic # atol without tightening either solver. @@ -42,9 +42,20 @@ # never-treated control. Default-case parity baseline. # 2. irregular_pre_periods — K=3 with relative_times = [-5, -3, -1]. # Exercises the PR-B gamma-unit linear-pattern fix end-to-end. -# 3. anticipation_shifted — K=4 with anticipation=1 (pre-cutoff at t<-1, -# so pre-periods are {-5, -4, -3, -2}). Verifies the pre-period filter -# logic in `_extract_pre_period_params`. +# 3. anticipation_shifted — K=4 shifted-grid case (pre-periods at +# {-5, -4, -3, -2}, leaving a notional t=-1 anticipation gap to the +# reference period 0). Locks R parity at a larger K than the K=3 +# fixtures and on a non-adjacent-to-reference grid. NOTE: the R-side +# `pretrends` package has no anticipation parameter, so this fixture +# does NOT exercise the Python-side `_extract_pre_period_params` +# CS/SA anticipation filter (`if t < _pre_cutoff` in pretrends.py) +# against R goldens — that filter is exercised by the existing +# `TestPretrendsCovarianceSource` suite (PR-B Step 3) and by the +# MC-based `TestPretrendsPropositions` tests, but a CS/SA-level +# anticipation+R-parity test would need a synthetic +# CallawaySantAnnaResults with `anticipation=1` and a t=-1 entry +# that gets filtered before reaching `_compute_power_nis`. Deferred +# to a follow-up. # 4. single_pre_period_closed_form — K=1 with diagonal Sigma = 0.25*I # (Roth Proposition 2 univariate truncated-normal closed form). Locks # the scalar fast-path against R AND against the analytical expression @@ -58,10 +69,26 @@ suppressPackageStartupMessages({ library(jsonlite) }) -stopifnot(packageVersion("pretrends") >= "0.1.0") - PRETRENDS_COMMIT <- "122731d082" +# Provenance fail-closed: refuse to regenerate goldens unless the installed +# pretrends matches the pinned (version, commit) pair. The JSON stamps the +# pinned commit string into meta.pretrends_commit, so without this check a +# future rerun could silently regenerate goldens from a drifted revision +# while still labeling the artifact with the original commit. +stopifnot(packageVersion("pretrends") == "0.1.0") +.installed_sha <- packageDescription("pretrends")$RemoteSha +if (is.null(.installed_sha) || !startsWith(.installed_sha, PRETRENDS_COMMIT)) { + stop(sprintf( + "pretrends provenance mismatch: expected RemoteSha to start with '%s' but got '%s'. Reinstall with: remotes::install_github('jonathandroth/pretrends', ref = '%s')", + PRETRENDS_COMMIT, + if (is.null(.installed_sha)) "" else .installed_sha, + PRETRENDS_COMMIT + )) +} +cat("pretrends provenance verified: version 0.1.0, RemoteSha =", + .installed_sha, "\n") + # --------------------------------------------------------------------------- # DGP helper: build a synthetic event-study coefficient vector + VCV under a # stylized null DGP (beta = 0, Sigma_22 ~ correlated). Mirrors the simulation @@ -186,10 +213,13 @@ f2 <- build_event_study_fixture( ) fixture_2 <- extract_pretrends(f2, "irregular_pre_periods") -cat("Building fixture 3: anticipation_shifted...\n") -# K=4 pre-periods with anticipation=1. Real pre-treatment cutoff is t < -1, -# so the {-5, -4, -3, -2} cells are the genuine pre-periods; t=-1 is the -# anticipation window. Tests the pre-period filtering logic. +cat("Building fixture 3: anticipation_shifted (K=4 shifted-grid)...\n") +# K=4 pre-periods at {-5, -4, -3, -2} — a shifted-grid case (gap at t=-1 +# between the last pre-period and the reference period 0). Locks R parity +# at a larger K and on a non-adjacent-to-reference grid. Note: does NOT +# exercise the Python-side `_extract_pre_period_params` CS/SA +# anticipation-filter path against R goldens — see the file-header +# comment for the rationale and deferred follow-up. f3 <- build_event_study_fixture( pre_periods = c(-5L, -4L, -3L, -2L), post_periods = c(1L, 2L, 3L), @@ -230,7 +260,7 @@ out <- list( "scipy MVN CDF Genz-Bretz randomized-lattice differences bound the", "K=4 NIS power gap at ~5e-5);", "(2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular,", - "irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol", + "irregular, shifted-grid (K=4), and K=1 grids (atol=1e-4; R uniroot tol", "vs Python brentq xtol gap dominates);", "(3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV", "in Roth's gamma units exactly, matching R's slope_for_power().", diff --git a/benchmarks/data/r_pretrends_golden.json b/benchmarks/data/r_pretrends_golden.json index 4ec10386..1e1eebc4 100644 --- a/benchmarks/data/r_pretrends_golden.json +++ b/benchmarks/data/r_pretrends_golden.json @@ -4,7 +4,7 @@ "pretrends_version": "0.1.0", "pretrends_commit": "122731d082", "r_version": "R version 4.5.2 (2025-10-31)", - "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract, both numeric tiers at atol=1e-4: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs scipy MVN CDF Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, anticipation, and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap dominates); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." + "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract, both numeric tiers at atol=1e-4: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs scipy MVN CDF Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, shifted-grid (K=4), and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap dominates); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." }, "uniform_3_pre_periods_no_anticipation": { "panel": { diff --git a/tests/test_methodology_pretrends.py b/tests/test_methodology_pretrends.py index 5995abe8..ae4cb4dd 100644 --- a/tests/test_methodology_pretrends.py +++ b/tests/test_methodology_pretrends.py @@ -1120,7 +1120,7 @@ class TestPretrendsParityR: ``scipy.stats.multivariate_normal.cdf`` (Python) use Genz-Bretz randomized-lattice rules with different absolute-error defaults (abseps ≈ 1e-3 vs 1e-5). Combined, the empirical NIS power gap is - bounded by ~5e-5 in the K=4 anticipation fixture (smaller for K∈{1,3}). + bounded by ~5e-5 in the K=4 shifted-grid fixture (smaller for K∈{1,3}). For the inverse path (γ_p), R's ``slope_for_power`` uses ``uniroot(tol = .Machine$double.eps^0.25 ≈ 1.22e-4)`` versus Python ``brentq(xtol=2e-12)``; the inverse-solver tolerance gap dominates the From 738ae8fcce029f851788eada26b76c97b5e88d37 Mon Sep 17 00:00:00 2001 From: igerber Date: Tue, 19 May 2026 13:50:59 -0400 Subject: [PATCH 5/5] Address CI R2 P3 on PR-C: qnorm/norm.ppf typo + TODO follow-up row MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI R2 codex review verdict: ✅ no P0/P1 (both prior P3s addressed). Two new P3 findings (both consequences of R1 fixes): - P3 (Documentation/Tests): the JSON `meta.description` string says Python uses `qnorm(0.975)`, but `qnorm` is the R function name; the rest of the PR correctly refers to `scipy.stats.norm.ppf(0.975)`. Fix: change the R script's `description = paste(...)` block from "qnorm(0.975)" to "scipy.stats.norm.ppf(0.975)" so the committed parity artifact's audit trail matches the language used in REGISTRY, the file-header comment, and the test class docstring. - P3 (Tech Debt): the R generator's file-header comment now explicitly defers a CS/SA `anticipation=1` R-parity test to a follow-up (R1 P3 #2 fix), but the PR removed the only PreTrendsPower TODO row and did not add a replacement tracker. Fix: add a low-priority TODO.md row describing the deferred work: build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 entry that gets filtered by `_extract_pre_period_params` before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. All 4 parity tests still pass; the regenerated JSON only differs in the description string. Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 1 + benchmarks/R/generate_pretrends_golden.R | 2 +- benchmarks/data/r_pretrends_golden.json | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/TODO.md b/TODO.md index 1a32b34d..f9ac9824 100644 --- a/TODO.md +++ b/TODO.md @@ -95,6 +95,7 @@ Deferred items from PR reviews that were not addressed before merge. | WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low | | WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium | +| PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low | | Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium | diff --git a/benchmarks/R/generate_pretrends_golden.R b/benchmarks/R/generate_pretrends_golden.R index 4bb3c2fe..1881fa80 100644 --- a/benchmarks/R/generate_pretrends_golden.R +++ b/benchmarks/R/generate_pretrends_golden.R @@ -256,7 +256,7 @@ out <- list( "Three-tier parity contract, both numeric tiers at atol=1e-4:", "(1) NIS box probability at fixed gamma values on all 4 fixtures", "(atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses", - "qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs", + "scipy.stats.norm.ppf(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs", "scipy MVN CDF Genz-Bretz randomized-lattice differences bound the", "K=4 NIS power gap at ~5e-5);", "(2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular,", diff --git a/benchmarks/data/r_pretrends_golden.json b/benchmarks/data/r_pretrends_golden.json index 1e1eebc4..77e7d333 100644 --- a/benchmarks/data/r_pretrends_golden.json +++ b/benchmarks/data/r_pretrends_golden.json @@ -4,7 +4,7 @@ "pretrends_version": "0.1.0", "pretrends_commit": "122731d082", "r_version": "R version 4.5.2 (2025-10-31)", - "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract, both numeric tiers at atol=1e-4: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses qnorm(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs scipy MVN CDF Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, shifted-grid (K=4), and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap dominates); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." + "description": "Roth (2022) PreTrendsPower parity goldens for diff-diff compute_pretrends_power / PreTrendsPower (PR-C). Three-tier parity contract, both numeric tiers at atol=1e-4: (1) NIS box probability at fixed gamma values on all 4 fixtures (atol=1e-4; R hardcodes thresholdTstat=1.96 while Python uses scipy.stats.norm.ppf(0.975) = 1.959963984540054, and mvtnorm::pmvnorm vs scipy MVN CDF Genz-Bretz randomized-lattice differences bound the K=4 NIS power gap at ~5e-5); (2) gamma_p MDV (slope at target power 0.5 and 0.8) on regular, irregular, shifted-grid (K=4), and K=1 grids (atol=1e-4; R uniroot tol vs Python brentq xtol gap dominates); (3) gamma-unit MDV invariance: PR-B's skip-L2-norm path produces MDV in Roth's gamma units exactly, matching R's slope_for_power(). See diff-diff/docs/methodology/papers/roth-2022-review.md for the full derivation." }, "uniform_3_pre_periods_no_anticipation": { "panel": {