Add fixed effects and absorb parameters to DifferenceInDifferences #2

Merged
igerber merged 1 commit into main from claude/init-did-library-pvNmf
Jan 1, 2026

Conversation

@igerber (Owner) commented Jan 1, 2026

  • Add fixed_effects parameter for low-dimensional categorical FE (dummy variables)
  • Add absorb parameter for high-dimensional FE (within-transformation)
  • Properly adjust degrees of freedom for absorbed fixed effects
  • Add comprehensive test suite for fixed effects functionality (8 new tests)
  • Update README with fixed effects usage examples and API documentation

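The within-transformation behind an `absorb`-style parameter can be sketched in a few lines. The following is an illustrative numpy/pandas sketch, not the library's implementation: by the Frisch-Waugh-Lovell theorem, demeaning by a single absorbed factor reproduces the dummy-variable coefficient exactly, and the absorbed levels must then be subtracted from the residual degrees of freedom.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "unit": rng.integers(0, 20, n),                # categorical FE to absorb
    "treat": rng.integers(0, 2, n).astype(float),
})
df["y"] = 1.5 * df["treat"] + 0.3 * df["unit"] + rng.normal(size=n)

# Within-transformation: demean outcome and regressor by the absorbed
# factor. One pass is exact for a single factor (Frisch-Waugh-Lovell).
g = df.groupby("unit")
y_w = df["y"] - g["y"].transform("mean")
x_w = df["treat"] - g["treat"].transform("mean")
beta_within = float((x_w @ y_w) / (x_w @ x_w))

# Equivalent low-dimensional route: explicit dummy variables.
X = np.column_stack([df["treat"], pd.get_dummies(df["unit"]).to_numpy(float)])
beta_dummies = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][0]

# Absorbed FE consume one residual degree of freedom per level.
dof = n - df["unit"].nunique() - 1  # minus slope and absorbed dummies
```

The two coefficients agree to numerical precision, which is why the within route can replace dummies when the factor has many levels.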
@igerber igerber merged commit 860f8c8 into main Jan 1, 2026
igerber added a commit that referenced this pull request Apr 16, 2026
- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and
  runs survey-aware WLS + Binder TSL IF when survey_design is active.
  Point estimate via solve_ols(weights=W_elig, weight_type='pweight');
  group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded
  to obs-level via w_i/W_g ratio, then compute_survey_if_variance for
  stratified/PSU variance. safe_inference uses df_survey.
  Rank-deficiency short-circuits to NaN to avoid point-estimate/IF
  mismatch between solve_ols's R-style drop and pinv's minimum-norm.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When
  provided, resolves weights via _resolve_survey_for_fit and passes
  them to _validate_and_aggregate_to_cells, restoring fit-vs-helper
  parity under survey-backed inputs. fweight/aweight rejected.
- P3: REGISTRY updates — TWFE parity sentence now includes survey;
  heterogeneity Note documents the TSL IF mechanics and library
  extension disclaimer; checklist line-651 lists survey-aware
  surfaces; new survey+bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py:
  TestSurveyHeterogeneity (uniform-weights match, non-uniform beta
  change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper
  match, non-pweight rejection).
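The w_i/W_g expansion described above can be illustrated in isolation. Below is a minimal numpy sketch with hypothetical variable names, not the library's code: splitting each group's influence value across its observations in proportion to weight leaves the group totals, and hence any variance built from group sums, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(5), 4)          # 5 groups x 4 observations
w = rng.uniform(0.5, 2.0, size=groups.size)  # observation-level pweights
psi_g = rng.normal(size=5)                   # group-level influence values

# Expand the group-level IF to observations via the w_i / W_g share, so
# each group's observations split its IF mass in proportion to weight.
W_g = np.bincount(groups, weights=w)
psi_i = psi_g[groups] * (w / W_g[groups])

# Summing the obs-level IF within groups recovers psi_g exactly, so a
# stratified/PSU variance computed from group sums is unaffected.
recovered = np.bincount(groups, weights=psi_i)
```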

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 16, 2026
- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when
  available, else n_gt) for FE regressions, the normalization
  denominator, contribution weights, and the Corollary 1 observation
  shares. On survey-backed inputs the outputs now match the
  observation-level pweighted TWFE estimand; non-survey path is
  byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in
  _validate_and_aggregate_to_cells when weights are provided, so that
  d_min/d_max/n_gt reflect the effective sample. Prevents zero-weight
  subpopulation rows from tripping the fuzzy-DiD guard or inflating
  downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py —
  TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols
  verifies beta_fe matches an observation-level pweighted OLS under
  survey (would fail if n_gt was still used), and
  TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation
  verifies an injected zero-weight row with opposite treatment value
  doesn't trip the within-cell constancy check.
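The zero-weight filtering fix is easy to reproduce on a toy cell. A hedged sketch (toy column names, not the library's aggregation code): a zero-weight row carrying the opposite treatment value makes the naive cell aggregate look fuzzy, while filtering first restores the effective sample.

```python
import pandas as pd

cells = pd.DataFrame({
    "g": [1, 1, 1],
    "t": [0, 0, 0],
    "d": [1.0, 1.0, 0.0],  # last row carries the opposite treatment value
    "w": [2.0, 3.0, 0.0],  # ...but zero weight: outside the subpopulation
})

# Aggregating all rows sees d_min != d_max and would trip a fuzzy guard.
naive = cells.groupby(["g", "t"])["d"].agg(["min", "max"])

# Dropping zero-weight rows before the groupby restores the effective
# sample, so d_min == d_max within the cell.
effective = cells[cells["w"] > 0].groupby(["g", "t"])["d"].agg(["min", "max"])
```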

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 17, 2026
- P1 #1/#2: Add _validate_group_constant_strata_psu() helper and call
  it from fit() after the weight_type/replicate-weights checks. The
  dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as
  the effective sampling unit; when strata or PSU vary within group it
  silently spreads horizon-specific IF mass across observations in
  different PSUs, contaminating the stratified-PSU variance. Walk back
  the overstated claim at the old line 669 comment to match. Within-
  group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before
  np.unique/np.bincount so NaN / non-comparable group IDs on excluded
  subpopulation rows cannot crash SE factorization. psi stays full-
  length with zeros in excluded positions to preserve alignment with
  resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line 652 Note updated: explicitly states the
  within-group-constant strata/PSU requirement and the
  within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests — rejects
  varying PSU, rejects varying strata, accepts varying weights, and
  ignores zero-weight rows during the constancy check) plus
  TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.
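The within-group constancy check reduces to a one-line groupby. A sketch with hypothetical names (`varies_within_group` is illustrative, not the library helper): strata/PSU must be constant within group, while weights may vary.

```python
import pandas as pd

def varies_within_group(df, group_col, col):
    """True if `col` takes more than one distinct value inside any group."""
    return bool(df.groupby(group_col)[col].nunique().gt(1).any())

panel_bad = pd.DataFrame({
    "g":   [1, 1, 2, 2],
    "psu": ["a", "a", "b", "c"],  # group 2 spans two PSUs: reject
})
panel_ok = pd.DataFrame({
    "g":   [1, 1, 2, 2],
    "psu": ["a", "a", "b", "b"],  # PSU constant within group: accept
    "w":   [1.0, 2.0, 1.0, 1.5],  # weights may still vary within group
})
```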

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 18, 2026
TwoStageDiD and ImputationDiD each run two iterative alternating-projection
solvers (_iterative_fe, _iterative_demean) whose convergence loop exited
silently on max_iter exhaustion, returning the current iterate as if
converged. This matches the silent-failure pattern audited under axis B of
the silent-failures initiative (findings #2-#5).

Adds a shared warn_if_not_converged helper in diff_diff.utils and calls it
from all four alternating-projection loops on non-convergence. Pattern
mirrors the existing logistic + Poisson IRLS convergence warnings in
linalg.py (lines 1329-1376). Warning-only: no new public parameter, no
behavior change on inputs that already converge.
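The warn-on-exhaustion pattern can be sketched with a toy loop. `warn_if_not_converged` below mirrors the helper's described role, but the body and the solver are illustrative, not the library's code:

```python
import warnings

def warn_if_not_converged(name, converged, max_iter):
    """Emit a warning when an iterative solver exhausts max_iter."""
    if not converged:
        warnings.warn(
            f"{name} did not converge within {max_iter} iterations; "
            "results may be inaccurate. Consider raising max_iter or tol.",
            RuntimeWarning,
        )

def iterative_demean(x, tol=1e-10, max_iter=5):
    # Toy alternating-projection stand-in: halve x until it drops
    # below tol, flagging whether the loop actually converged.
    converged = False
    for _ in range(max_iter):
        x = x / 2.0
        if abs(x) < tol:
            converged = True
            break
    warn_if_not_converged("iterative_demean", converged, max_iter)
    return x
```

The key point is warning-only: the current iterate is still returned, so inputs that already converge are byte-identical.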

Updates REGISTRY.md entries for ImputationDiD and TwoStageDiD with Note
labels describing the new signal.

Axis-B regression-lint baseline: 10 silent range(max_iter) loops -> 6
remaining (Frank-Wolfe and TROP addressed in follow-up PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 18, 2026
Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups,
not the post-filter eligible-group set used to build the bootstrap
PSU map. Panels where upstream dCDH filtering drops groups that
share PSUs with kept groups could emit a misleading "PSU coarser
than group" warning even when the effective bootstrap is one group
per PSU.

Fix: count PSUs and groups from `_eligible_group_ids` (the same set
feeding `group_id_to_psu_code_bootstrap`), preserving the within-
group-constant-PSU invariant by taking each eligible group's first
positive-weight PSU label.

P3 #2: Two docstrings said the bootstrap is "clustered at the group
level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring:
  extended to note PSU-level Hall-Mammen wild clustering under
  `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module
  docstring: documents the identity-map fast path (auto-inject
  `psu=group`), the PSU-level broadcast when PSU is strictly
  coarser, and points to REGISTRY.md for the full contract.
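For reference, the two-point Mammen distribution typically used for Hall-Mammen wild cluster bootstraps, with draws made once per PSU and broadcast to groups sharing that PSU, can be sketched as follows (illustrative names, not the library's code):

```python
import numpy as np

def mammen_weights(n, rng):
    """Two-point Mammen distribution: mean 0, variance 1, third moment 1."""
    s5 = np.sqrt(5.0)
    lo, hi = -(s5 - 1) / 2, (s5 + 1) / 2
    p_lo = (s5 + 1) / (2 * s5)
    return rng.choice(np.array([lo, hi]), size=n, p=[p_lo, 1 - p_lo])

rng = np.random.default_rng(0)
psu_of_group = np.array([0, 0, 1, 2, 2])  # groups 0,1 share a PSU; so do 3,4
v_psu = mammen_weights(psu_of_group.max() + 1, rng)
v_group = v_psu[psu_of_group]             # broadcast PSU draws to groups
```

When PSU equals group (the auto-inject identity map), the broadcast is a no-op and this reduces to ordinary group-level wild clustering.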

Full regression: 318 passing.
igerber added a commit that referenced this pull request Apr 18, 2026
Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the
group level" — which was true when the PR was written but is now
incomplete given the PSU-level Hall-Mammen wild bootstrap path
under `survey_design`. Updated to distinguish user-specified
`cluster=` (still unsupported, raises NotImplementedError) from
automatic PSU-level clustering (takes over under `survey_design`
with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) —
  rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the
  automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor
  docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503`
  `cluster=` error messages — removed "Phase 1" phrasing, added
  PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated
  to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will
add multiplier-bootstrap support for placebo and bootstrap covers
`DID_M`, `DID_+`, `DID_-` only — both stale after this PR's
L_max >= 1 placebo and event-study bootstrap paths. Rewrote to
scope the NaN-SE contract to `L_max=None` only and describe the
full bootstrap coverage (overall, joiners, leavers, per-horizon
event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses two P0 correctness regressions in the PR-4 bootstrap PSU-map
plumbing flagged by CI review.

**P0 #1 - valid_map gate discarded the per-cell tensor too eagerly.**
When any variance-eligible group had no positive-weight cells (all-
sentinel row in psu_codes_per_cell), the old code set valid_map=False
and left BOTH group_id_to_psu_code_bootstrap AND
psu_codes_per_cell_bootstrap as None. The bootstrap then silently
dropped to unclustered group-level instead of excluding only that
group's empty row. Fix: always populate psu_codes_per_cell_bootstrap
once the tensor is built; the cell-level path already masks out -1
cells at unroll time. Always populate group_id_to_psu_code_bootstrap
with a per-group code (use placeholder 0 for all-sentinel rows since
those groups have no IF mass and the multiplier they receive is
irrelevant on either the legacy or the cell-level path).

**P0 #2 - dense PSU codes factorized over non-eligible subset.**
`np.unique(obs_psu_codes[pos_mask_boot])` previously included PSU
labels from groups that were filtered out of _eligible_group_ids
(e.g., singleton-baseline-excluded groups). The excluded groups'
PSUs contributed dense codes that formed gaps in the eligible
subset's map. Downstream `_generate_psu_or_group_weights` computes
`n_psu = max(code) + 1` and triggers the identity fast path when
`n_psu >= n_groups_target`. A gapped map like `[1, 1]` or `[0, 2, 2]`
silently activated independent-draws clustering for eligible groups
that should have shared a multiplier. Fix: restrict the np.unique
factorization to the eligible-subset positive-weight obs only
(`elig_obs_mask = pos_mask_boot & (g_idx_arr >= 0) & (t_idx_arr >=
0)`), so the dense code domain exactly matches the PSUs actually
used by variance-eligible groups.
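The gapped-codes failure mode is easy to reproduce in isolation. An illustrative numpy sketch (toy labels, not the library's variables): factorizing over all groups and then subsetting leaves gaps, while factorizing over the eligible subset yields contiguous dense codes.

```python
import numpy as np

psu_labels = np.array([10, 10, 30, 30, 20])             # per-group PSU labels
eligible = np.array([True, True, True, True, False])    # last group filtered

# Factorizing over ALL groups, then subsetting, leaves a gap once the
# excluded group's PSU (20) drops out: codes come back as [0, 0, 2, 2],
# so max(code) + 1 = 3 overstates the 2 PSUs actually in use.
codes_all = np.unique(psu_labels, return_inverse=True)[1]
gapped = codes_all[eligible]

# Factorizing over the eligible subset only yields contiguous codes.
dense = np.unique(psu_labels[eligible], return_inverse=True)[1]
```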

Tests:
- `test_bootstrap_zero_weight_group_equivalent_to_removing_it`:
  fit with vs without an all-zero-weight eligible group must
  produce byte-identical bootstrap SE at the same seed (byte-
  identity would have failed before P0 #1 fix because valid_map
  flipped the PSU-aware path off for the with-zero-group fit).
- `test_bootstrap_dense_codes_under_singleton_baseline_excluded_group`:
  spies on the group_id_to_psu_code dict passed to
  `_compute_dcdh_bootstrap` under a fixture with an always-treated
  singleton-baseline group and strictly-coarser PSU among eligible
  groups. Asserts the dict's values form a contiguous `[0,
  n_unique-1]` range (no gaps from the excluded group's PSU), and
  that eligible groups sharing a PSU label receive the same dense
  code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…= workaround text

**P3 #1 (warning predicate inconsistent with "strictly coarser PSU"
contract):** the new bootstrap warning block's comment said the
warning fires only on strictly-coarser PSU designs, but the
predicate `n_psu_eff_warn < n_groups_eff_warn` could also fire on
supported varying-PSU designs whose eligible groups happened to
share PSU labels across groups. Detect within-group-varying PSU
explicitly (`.groupby("g")["p"].nunique().gt(1).any()`) and
suppress the warning in that regime. Under auto-inject PSU=group
and under within-group-varying PSU the warning now stays silent,
matching the stated contract.

**P3 #2 (`_unroll_target_to_cells` suggested `psu=<group_col>` as a
bootstrap workaround):** the Registry / CHANGELOG already clarified
that `psu=<group_col>` is ONLY a Binder TSL workaround; the cell-
level wild PSU bootstrap has no allocator fallback. The helper's
docstring and `ValueError` message still advertised it as a
bootstrap-path workaround. Dropped that suggestion and explicitly
clarified: the varying-PSU bootstrap IS the cell-level path, so
there is no legacy-allocator alternative to fall back to —
pre-processing the panel is the only workaround on the bootstrap
side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1 (methodology): mse_optimal_bandwidth now rejects boundary > d.min()
with a clear ValueError. The Phase 1b wrapper is scoped to the HAD
lower-boundary case (Design 1' with d_0 = 0 or Design 1 continuous-near-
d_lower with d_0 = min D_2). Interior or upper-boundary inputs would
silently run the boundary selector with a symmetric kernel and return
a bandwidth incompatible with the one-sided fitter. The port remains
available for interior / broader surface via
_nprobust_port.lpbwselect_mse_dpi.

P1 #2 (code quality): lprobust_bw validates in-window observation
counts at each of the three local-poly fits before calling qrXXinv:
  - variance: n_V >= o+1
  - B1: n_B1 >= o_B+1
  - B2: n_B2 >= o_B+2
Each guard raises a targeted ValueError naming the failing stage, the
bandwidth, and suggested remediation. Previously these failed with
opaque LinAlgError from Cholesky on under-determined designs.

P3 (doc): local_linear.py module docstring updated to say Phase 1b
"ships" instead of "will add"; tiny-sample test now asserts the new
ValueError contract instead of accepting any non-IndexError failure.

New behavioral tests:
- test_interior_boundary_rejected: boundary=0.5 on U(0,1) rejected
- test_upper_boundary_rejected: boundary=d.max() rejected
- test_boundary_equal_to_min_d_accepted: boundary=min(d) accepted
  (Design 1 continuous-near-d_lower path)
- test_boundary_below_min_d_accepted: boundary=0 with d.min()>0
  accepted (Design 1' path)
- test_bwcheck_none_on_tiny_sample_raises_valueerror: upgraded from
  "catch anything non-IndexError" to pytest.raises(ValueError,
  match="lprobust_bw").

153 tests pass (up from 149).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1 (methodology): mse_optimal_bandwidth now rejects Design 1
mass-point designs. When boundary > 0 and the modal fraction at
d.min() exceeds the REGISTRY-specified 2% threshold, raise
NotImplementedError pointing to the 2SLS sample-average estimator
per de Chaisemartin et al. (2026) Section 3.2.4. Design 1' with
untreated units at d=0 (boundary=0) is still accepted per Garrett
et al. (2020) application precedent.

P1 #2 (code quality): qrXXinv now catches np.linalg.LinAlgError from
Cholesky and re-raises as ValueError with a targeted message naming
the failing dimension and suggesting remediation. Duplicate-support
windows or other rank-deficient designs now fail with a clear error
instead of leaking LinAlgError out of the port.
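The guard pattern can be sketched generically. `qrxxinv_like` below is a hypothetical stand-in, assuming only that the port inverts X'X via Cholesky, as the commit describes:

```python
import numpy as np

def qrxxinv_like(x):
    """Invert X'X via Cholesky; re-raise rank deficiency as ValueError.
    Illustrative stand-in for the port's qrXXinv helper."""
    xtx = x.T @ x
    try:
        chol = np.linalg.cholesky(xtx)
    except np.linalg.LinAlgError as exc:
        raise ValueError(
            f"qrXXinv: X'X of dimension {xtx.shape[0]} is not positive "
            "definite (rank-deficient design, e.g. duplicate-support "
            "windows). Widen the bandwidth or drop collinear columns."
        ) from exc
    inv_l = np.linalg.inv(chol)        # A = L L', so inv(A) = L'^-1 L^-1
    return inv_l.T @ inv_l
```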

P3 (tests): Added TestStageDiagnosticsParity::test_R_parity covering
all four stages. Previously only V/B1/B2 were pinned; R (BWreg) was
only trivially checked for stage_d1 (scale=0 -> R=0). Now stage_b
and stage_h R values are explicitly parity-tested at 1% against R
nprobust.

New behavioral tests:
- test_mass_point_design_rejected: 10% mass at 0.1 -> NotImplementedError
- test_continuous_near_d_lower_accepted: uniform(0.1, 1.0) passes
- test_untreated_at_zero_accepted: 15% at d=0 with boundary=0 passes
- test_rank_deficient_design_raises_valueerror: rank-1 X -> ValueError
- R parity on all four stages across 3 DGPs (12 new parametrized cases)

169 tests pass (up from 153).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Reviewer correctly flagged that the 1%-of-median rule is a Phase 2
design="auto" heuristic, not Phase 1b. Backed off that over-reach.

P1 #1: Removed the min(d)/median(d) < 0.01 check. The mass-point
guard now applies uniformly (whenever d.min() > 0 and modal fraction
at d.min() > 2%) and does not gate on boundary. This still catches
the original concern (silently routing mass-point data through the
nonparametric branch) without rejecting valid Design 1' samples like
Beta(2,2) where d.min() is strictly positive but small.

P1 #2: Tightened boundary validation. The wrapper now accepts only
boundary ~ 0 (Design 1') or boundary ~ d.min() (Design 1 continuous-
near-d_lower) within float tolerance. Off-support values -- including
the previously-allowed "boundary < d.min()" path -- are rejected with
a targeted error message.

P3: Added a public-wrapper duplicate-support regression that drives a
rank-deficient X'X through the full selector stack (boundary =
d.min(), unique minimum, only 4 distinct d values) and asserts a
specific "qrXXinv" ValueError, not LinAlgError.

Test updates:
- Removed test_boundary_zero_with_positive_d_min_rejected: the case
  it modeled is now accepted (no mass point).
- Added test_boundary_zero_thin_boundary_density_accepted: Beta(2,2)
  Design 1' with vanishing boundary density now passes.
- Added test_off_support_boundary_rejected: boundary=0.5 on U(1,2).
- Added test_negative_boundary_rejected: boundary<0 rejected.
- Updated test_nonzero_boundary: uses boundary=float(d.min()), not
  boundary=1.0 (which is off the realized support of U(1,2)).

175 tests pass (up from 172).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
P1 #1: boundary=0 now enforces a Design 1' support plausibility
heuristic: d.min() <= 5% * median(|d|). Samples with d.min()
substantially positive (e.g. U(0.5, 1)) are rejected with ValueError
directing the caller to boundary=float(d.min()). Threshold chosen
at 5% (not REGISTRY's 1%) so the paper's thin-boundary-density
DGPs (Beta(2,2), d.min/median ~ 3%) still pass. Reordered so the
mass-point check (NotImplementedError, paper Section 3.2.4) fires
before the support-check -- mass-point data should be redirected
to 2SLS regardless of the boundary the caller picked.

P1 #2: Empty-input front-door guard. d.size == 0 raises ValueError
with a targeted "must be non-empty" message instead of leaking
the NumPy reduction error from d.min().
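The modal-fraction and empty-input guards reduce to a few lines. A hedged sketch with an assumed helper name, not the library's function:

```python
import numpy as np

THRESHOLD = 0.02  # the 2% modal-fraction criterion described above

def modal_fraction_at_min(d):
    """Share of observations sitting exactly at the support minimum."""
    d = np.asarray(d, dtype=float)
    if d.size == 0:
        raise ValueError("dose array d must be non-empty")
    return float(np.mean(d == d.min()))

rng = np.random.default_rng(0)
d_continuous = rng.uniform(0.1, 1.0, 1000)          # continuous near d_lower
d_mass_point = np.concatenate([np.full(100, 0.1),   # 10% mass at the minimum
                               rng.uniform(0.1, 1.0, 900)])
```

A continuous sample falls well under the threshold, while the 10%-mass sample clears it and would be redirected to the 2SLS branch.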

P3 (docstring sync): _nprobust_port module docstring no longer says
weighted data can be handled by the public wrapper -- the wrapper
explicitly raises NotImplementedError. Docstring now matches the
actual contract.

P3 (deferred, same as last round): tri/uni/shifted-boundary golden
parity extension.

REGISTRY.md Phase 1b note expanded to document the full input
contract (nonnegativity, boundary applicability, Design 1' support
heuristic, mass-point redirection) so the public API surface is
fully specified in the methodology registry.

178 tests pass (up from 177).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
CI re-review P3 items, all documentation-only:

- Scenario 3 operation chain: said "analytical TSL via strata + PSU",
  but aggregate_survey()'s returned second-stage design is pweight
  with geographic PSU clustering and no stage-2 strata. Reworded to
  match the actual second-stage design surface being benchmarked.
- ImputationDiD "consistently dominant" claim in scaling finding #2
  and hotspot table row #2: at Rust medium SunAbraham clearly leads
  (0.353s vs 0.214s). Both claims narrowed to "Python all scales +
  Rust small/large" with the Rust-medium SunAbraham exception called
  out explicitly; the "together ~70-80% of the chain" framing
  preserves the optimization recommendation.
- SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo
  are the two largest at every scale/backend, but at Rust small
  bootstrap_variance slightly edges both (at sub-50ms totals, per-
  phase fixed overhead dominates ranking). Qualified to Python all
  scales + Rust medium/large.

Docs-only. No script or baseline changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…race shifts

CI re-review P1: bench_dose_response.py inherited the CDiD generator's
default cohort [2], not the documented period 3. The fallback that
would have set first_treat=3 never ran (generator already populates
first_treat), so the committed baselines measured a different cohort
onset than the scenario doc. The binarized DiD phase also hardcoded
post >= 3, which further desynced it from the actual CDiD treatment
start under the default DGP.

Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future
  DGP changes that break the single-cohort contract will fail loudly
  instead of drifting silently.
- Binarized phase now derives its post cutoff from the actual
  first_treat in the data, not a hardcoded period number. No
  opportunity to desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.
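The derive-from-data pattern for the post cutoff can be sketched as follows (hypothetical column names mirroring the description above):

```python
import pandas as pd

panel = pd.DataFrame({
    "unit":        [1, 1, 1, 2, 2, 2],
    "period":      [1, 2, 3, 1, 2, 3],
    "first_treat": [3, 3, 3, 0, 0, 0],  # 0 marks never-treated units
})

# Derive the binarized post cutoff from the realized data rather than
# hardcoding a period; assert the single-cohort contract loudly.
onsets = sorted(panel.loc[panel["first_treat"] > 0, "first_treat"].unique())
assert len(onsets) == 1, "single-cohort contract violated"
cutoff = onsets[0]
panel["post"] = (panel["period"] >= cutoff).astype(int)
```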

Structural narrative hardening:

Prior CI rounds have repeatedly re-flagged the same drift pattern:
the staggered campaign and reversible dCDH narratives make phase-
order claims at close-race cells (staggered Rust medium, dCDH at
this shape) that can flip on rerun because the two contenders are
within a few percentage points of each other. The underlying ranking
is not the right level of abstraction for narrative; the phase-share
table is. This commit rewrites both narratives to describe the
aggregate share pattern and defer per-cell ordering to the
generator-produced table. Scaling finding #2 and hotspot table row
#2 get the same treatment. Net effect: narrative claims are now
robust to rerun noise at close-race cells.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples**

When a user explicitly forced design="continuous_near_d_lower" on a
sample that actually satisfies the >2% modal-fraction mass-point
criterion, the downstream regressor shift (D - d_lower) would move the
support minimum to zero on the shifted scale. Phase 1c's mass-point
rejection guard only fires when d.min() > 0 (_validate_had_inputs), so
the silent coercion ran the nonparametric local-linear estimator on a
sample the paper (Section 3.2.4) requires to use the 2SLS branch,
producing the wrong estimand.

Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction
check on the ORIGINAL (unshifted) d_arr when the user explicitly
selects design="continuous_near_d_lower". If the fraction at d.min()
exceeds 2%, the fit raises ValueError pointing to design="mass_point"
or design="auto". design="auto" is unaffected (_detect_design already
correctly resolves such samples to mass_point).

**P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic**

The previous validator called `.astype(np.float64)` and `int(v)` on
grouped first_treat values, which crashed on otherwise-supported
string-labelled two-period panels (period in {"A","B"}, first_treat
in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value
set-membership against `{0, t_post}` with no numeric coercion.

**P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type**

When cluster was supplied, `_fit_mass_point_2sls` unconditionally
switches to the CR1 cluster-robust sandwich, but the result object
stored the REQUESTED family ("hc1" or "classical") as `vcov_type`.
`summary()` rendered correctly via the cluster_name branch, but
`to_dict()` and downstream programmatic consumers saw the stale
requested label. Fixed: when cluster is supplied, `vcov_type` is
stored as `"cr1"` regardless of the requested family. Renamed the
local variable from `vcov_effective` to `vcov_requested` to separate
the input from the effective family. Updated the
`HeterogeneousAdoptionDiDResults.summary()` branch so the cluster
rendering still works with the new stored value.

**Tests added (+8 regression):**
- TestValidateHadPanel.test_first_treat_col_with_string_periods
- TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string
- TestContinuousPathRejectsMassPoint (2 tests)
- TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base
  family when unclustered, classical+cluster collapses to cr1,
  to_dict shows effective family)

Targeted regression: 126 HAD tests + 505 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
…x dCDH headline_attribute

R1 surfaced three P1s, all legitimate:

1. StackedDiD wording mismatch. Claimed ``overall_att`` is a
   treated-share-weighted aggregate across sub-experiments; actual
   implementation (``stacked_did.py`` ~line 541) computes
   ``overall_att`` as the simple average of post-treatment event-
   study coefficients ``delta_h`` with delta-method SE. Per-horizon
   ``delta_h`` is the paper's ``theta_kappa^e`` cross-event
   aggregate, but the headline is an equally-weighted average over
   those per-horizon coefficients, not a separate cross-event
   weighting at the ATT level. Definition rewritten to describe the
   actual estimand.

2. Dead ``TwoWayFixedEffectsResults`` branch. ``TwoWayFixedEffects``
   is a subclass of ``DifferenceInDifferences`` and its ``fit()``
   returns ``DiDResults`` — there is no separate TWFE result class,
   so the ``type(results).__name__ == "TwoWayFixedEffectsResults"``
   dispatch branch was unreachable on any real fit. Removed the
   dead branch and rewrote the ``DiDResults`` branch to cover both
   2x2 DiD and TWFE interpretations explicitly (both estimators
   route here). Follow-up for future PR: persist estimator
   provenance on ``DiDResults`` (or return a dedicated TWFE result
   class) so the branch can split again; documented inline.

3. dCDH ``headline_attribute="att"``. Both dCDH branches (``DID_M``
   for ``L_max=None``, ``DID_l``/derivatives for ``L_max >= 1``)
   named ``"att"`` as the headline attribute, but
   ``ChaisemartinDHaultfoeuilleResults`` stores the headline in
   ``overall_att`` (``chaisemartin_dhaultfoeuille_results.py:357``).
   Fixed both branches to ``"overall_att"``; downstream consumers
   using the machine-readable contract now point at the correct
   attribute.
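
The corrected StackedDiD definition in item 1 (simple average of the post-treatment event-study coefficients `delta_h`, with a delta-method SE) can be sketched with illustrative numbers — `delta_h` and `V` below stand in for the estimated coefficients and their covariance, not the library's actual fit output:

```python
import numpy as np

# Equal-weight average of post-treatment event-study coefficients
# delta_h, with a delta-method SE sqrt(w' V w). Values are illustrative.
delta_h = np.array([0.8, 1.1, 0.9])            # per-horizon coefficients
V = np.diag([0.04, 0.05, 0.04])                # their covariance (assumed)
w = np.full(delta_h.size, 1.0 / delta_h.size)  # equal weights, no treated-share weighting
overall_att = float(w @ delta_h)
se = float(np.sqrt(w @ V @ w))
```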

Tests: new ``TestTargetParameterRealFitIntegration`` covers the
gap R1 P2 flagged — prior coverage was stub-based and would not
have caught any of the three P1s. Four new real-fit tests:

- ``TwoWayFixedEffects().fit(...)`` returns ``DiDResults``; target-
  parameter block uses the shared DiD/TWFE branch.
- ``StackedDiD(...).fit(...)`` on a staggered panel; the
  ``headline_attribute`` matches the actual real attribute and the
  definition names the event-study-coefficient estimand.
- ``ChaisemartinDHaultfoeuille().fit(...)`` on a reversible-
  treatment panel (both ``DID_M`` and ``DID_l`` regimes);
  ``headline_attribute == "overall_att"`` and the named attribute
  actually exists on the real fit object.

Existing stub-based dispatch tests updated: the ``test_twfe_results``
test is now ``test_did_results_mentions_twfe`` (asserts the DiD
branch describes both estimators). The dCDH stub tests now also
assert ``headline_attribute == "overall_att"``.

All 323 BR/DR tests pass (319 prior + 4 new real-fit integration).

Out of scope (plan-review MEDIUM #2 — centralizing report metadata
in a single registry shared by estimator outputs and reporting
helpers): queued as a separate PR. Current approach (string dispatch
on ``type(results).__name__`` + REGISTRY.md references) is working
but brittle; a centralized registry is the principled fix for the
TWFE-dispatch-dead-code class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
R2 surfaced one P1 methodology finding: the dCDH dynamic branch
flattened every ``L_max >= 1`` into a generic ``DID_l`` estimand,
but the library's actual ``overall_att`` contract is:

- ``L_max = None`` -> ``DID_M`` (Phase 1 per-period aggregate).
- ``L_max = 1`` -> ``DID_1`` (single-horizon per-group estimand,
  Equation 3 of the dynamic companion paper — NOT the generic
  ``DID_l``).
- ``L_max >= 2`` (no ``trends_linear``) -> ``delta`` (cost-benefit
  cross-horizon aggregate, Lemma 4;
  ``chaisemartin_dhaultfoeuille.py:2602-2634``).
- ``trends_linear = True`` AND ``L_max >= 2`` -> ``overall_att`` is
  intentionally NaN by design
  (``chaisemartin_dhaultfoeuille.py:2828-2834``). No scalar
  aggregate; per-horizon level effects live on
  ``linear_trends_effects[l]``.

Fix: ``describe_target_parameter()`` now mirrors the result class's
own ``_estimand_label()`` at
``chaisemartin_dhaultfoeuille_results.py:454-490``. New aggregation
tags: ``DID_1`` / ``DID_1_x`` / ``DID_1_fd`` / ``DID_1_x_fd`` for
single-horizon, ``delta`` / ``delta_x`` for cost-benefit, and
``no_scalar_headline`` for the trends+L_max>=2 suppression case.
On the no-scalar case, ``headline_attribute`` is ``None`` so
downstream consumers do not point at a field whose value is NaN
by design.
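
The dispatch contract above can be sketched as a small lookup (simplified: the `_x` / `_fd` covariate and first-difference variants are folded into their base tags, and the tag strings are illustrative):

```python
# Hedged sketch of the dCDH headline dispatch described above:
# returns (aggregation_tag, headline_attribute).
def dcdh_headline(l_max, trends_linear):
    if l_max is None:
        return ("DID_M", "overall_att")          # Phase 1 per-period aggregate
    if trends_linear and l_max >= 2:
        return ("no_scalar_headline", None)      # overall_att NaN by design
    if l_max == 1:
        return ("DID_1", "overall_att")          # single-horizon estimand
    return ("delta", "overall_att")              # cost-benefit aggregate
```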

Tests: added stub-based branches for every new case (``DID_1``,
``DID_1^X``, ``delta``, ``delta^X``, trends + L_max>=2 no-scalar,
trends + L_max=1 still-has-scalar) and split the real-fit
integration test into ``L_max=1`` and ``L_max=2`` real-panel
cases so the contract is enforced end-to-end per R2 P2. The
parameterized ``test_dcdh_config_branches_tag`` now covers 10 cases
and also asserts ``headline_attribute`` flips to ``None`` only on
the no-scalar case.

Docs: ``REPORTING.md`` dCDH section rewritten to match the
corrected dispatch, including the ``no_scalar_headline`` case and
the L_max=None/1/>=2 contract.

332 BR/DR tests pass.

Out of scope (still open from R1): centralizing report metadata
in a single registry shared by estimator outputs and reporting
helpers (plan-review MEDIUM #2 / R1 P2 maintainability). The
current string dispatch on ``type(results).__name__`` + explicit
REGISTRY.md citations is source-faithful but requires manual
mirroring of result-class contracts; a centralized registry is
the principled fix. Tracked for a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…full.txt schema

Two P3 cleanups from R6.

P3 #1: the StackedDiD ``target_parameter.definition`` embedded an
internal implementation line reference (``stacked_did.py`` around
line 541). That pointer is not methodology source material and
will go stale under routine estimator edits even when the estimand
itself is unchanged. Removed the reference; definition now stands
on paper/registry terms alone.

P3 #2: ``diff_diff/guides/llms-full.txt`` listed the pre-PR BR/DR
schema top-level keys and omitted ``target_parameter``, so agent-
facing documentation disagreed with the runtime schema. Added
``target_parameter`` to both schema-key lists (BR around line 1779
and DR around line 1844). Documented the field shape
(``name`` / ``definition`` / ``aggregation`` /
``headline_attribute`` / ``reference``), the dispatch tag set, and
the ``headline_attribute=None`` / ``aggregation="no_scalar_headline"``
edge case for the dCDH ``trends_linear=True, L_max>=2`` fit. Also
noted the ``headline.status="no_scalar_by_design"`` value so
guide-driven agents can dispatch correctly. UTF-8 fingerprint
preserved per ``feedback_llms_guide_utf8_fingerprint.md``
(``tests/test_guides.py`` passes).

354 BR/DR + guide tests pass (337 BR/DR + 17 guide). Black clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…y vs ES_avg note

Two P1 findings from R7, both addressed.

P1 #1 (schema version bump): the new ``headline.status`` /
``headline_metric.status`` value ``"no_scalar_by_design"`` added
in R4 for the dCDH ``trends_linear=True, L_max>=2`` configuration
is a breaking change per REPORTING.md stability policy (new
status-enum values are breaking — agents doing exhaustive match
will break on unknown enums). Bumped
``BUSINESS_REPORT_SCHEMA_VERSION`` and
``DIAGNOSTIC_REPORT_SCHEMA_VERSION`` from ``"1.0"`` to ``"2.0"``,
updated the in-tree schema-version tests (one explicit
``== "1.0"`` assertion and six ``"schema_version": "1.0"`` stub
dicts in BR / DR test files), added a REPORTING.md "Schema
version 2.0" note, and documented the bump in the CHANGELOG
Unreleased entry. The schemas remain marked experimental so the
formal deprecation policy does not yet apply.

P1 #2 (EfficientDiD library vs paper estimand): both
EfficientDiD branches now explicitly state that BR/DR's headline
``overall_att`` is the library's cohort-size-weighted average
over post-treatment ``(g, t)`` cells, NOT the paper's ``ES_avg``
uniform event-time average. The regime (PT-All / PT-Post)
describes identification; the aggregation choice is a separate
library-level policy that REGISTRY.md Sec. EfficientDiD
documents. Added ``cohort-size-weighted`` + ``ES_avg`` /
``post-treatment`` assertions to ``test_efficient_did_pt_all``
and ``test_efficient_did_pt_post`` so the wording is pinned.

354 BR/DR + guide + target-parameter tests pass. Black and ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 21, 2026
…n tests

Both P3 cleanups from R8.

P3 #1 (TROP wording in rst): ``business_report.rst`` summary listed
TROP's target parameter as "factor-model residual" — which does
not match the helper / REGISTRY definition. Both say the TROP
target parameter is a factor-model-adjusted weighted average
over treated cells (not a residual). Fixed the rst wording to
"factor-model-adjusted ATT".

P3 #2 (Bacon branch untested): the exhaustiveness guard iterates
``_APPLICABILITY``, but ``BaconDecompositionResults`` is a
diagnostic read-out on the DR side and is NOT listed in
``_APPLICABILITY`` (BR rejects it with a TypeError). The helper
branch for Bacon therefore slipped through the 16-class guard.
Added two regressions:

- ``test_bacon_decomposition`` (unit-level, direct helper call):
  asserts aggregation / headline_attribute / definition wording
  / Goodman-Bacon reference.
- ``test_dr_with_bacon_result_emits_target_parameter``
  (integration): passes a real ``BaconDecompositionResults``
  from ``bacon_decompose`` on a staggered panel through DR,
  asserts the ``target_parameter`` block propagates into DR's
  schema, and confirms the named ``headline_attribute``
  (``twfe_estimate``) exists on the real fit object.

356 BR/DR + guide + target-parameter tests pass. Black and ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
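
The tie-collapse step can be sketched as follows (a minimal NumPy sketch of the described fix; the function name is illustrative, not the library's internals). After the cumsum, every row in a tie block is overwritten with the block's last cumulative value, so the result no longer depends on row order within ties:

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    # c_G(d) = sum of eps over D <= d; tied doses must share the
    # post-tie cumulative sum, not partial within-tie sums.
    c = np.cumsum(eps_sorted)
    _, first_idx, counts = np.unique(d_sorted, return_index=True,
                                     return_counts=True)
    last_idx = first_idx + counts - 1        # last row of each tie block
    return np.repeat(c[last_idx], counts)    # broadcast block-last cumsum

d = np.array([1.0, 2.0, 2.0, 3.0])
eps = np.array([0.5, -0.2, 0.3, 0.1])
# tie block at d=2.0 shares the value 0.5 - 0.2 + 0.3 = 0.6
```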

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.
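
The scale-relative short-circuit can be sketched like this (a minimal sketch under the stated tolerance; function name illustrative). The point is that the residual sum of squares is compared to the outcome's own scale, so IEEE rounding noise in an exactly linear `dy = 1 + 2*d` cannot defeat the check:

```python
import numpy as np

def effectively_exact_linear(d, dy, rel_tol=1e-24):
    # OLS fit of dy on (1, d); declare exact linearity when
    # sum(eps^2) <= rel_tol * sum(dy^2).
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    rss = float(np.sum((dy - X @ beta) ** 2))
    return rss <= rel_tol * float(np.sum(dy ** 2))

d = np.linspace(0.0, 1.0, 50)
exact = effectively_exact_linear(d, 1.0 + 2.0 * d)               # FP noise only
nonlinear = effectively_exact_linear(d, 1.0 + 2.0 * d + np.sin(5.0 * d))
```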

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass` was the plain negation of the three tests'
`reject` flags, so it could be True while `verdict` said
"inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
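
The gated contract reduces to this (a hedged sketch; the pairs stand in for the Stute / Yatchew / QUG results, and the function name is illustrative):

```python
import math

def compose_all_pass(results):
    # results: (p_value, reject) pairs; all_pass is True only when
    # every p-value is finite AND no test rejects.
    return all(math.isfinite(p) and not reject for p, reject in results)
```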

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.
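
The front-door guard amounts to (a hedged sketch; message wording and function name are illustrative):

```python
import numpy as np

def check_nonnegative_doses(d):
    # Reject negative doses up front, before the d > 0 filter can
    # silently fold them into the n_excluded_zero counter.
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError("HAD doses must be non-negative (paper Section 2); "
                         "check the dose column.")
    return d
```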

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.
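
The partition guarantee relied on here: with `kth=1`, the second-smallest element lands at index 1 and every element before it is no larger, so index 0 necessarily holds the minimum — hence `partitioned[0] = D_(1)` and `partitioned[1] = D_(2)` in O(G):

```python
import numpy as np

d_nz = np.array([3.0, 0.7, 5.2, 0.9, 2.1])
part = np.partition(d_nz, 1)   # linear-time selection, not a full sort
d1, d2 = part[0], part[1]      # two smallest positive doses, in order
```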

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive":
The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both
Stute and Yatchew linearity tests NaN" BEFORE checking whether any
conclusive test had rejected. The reviewer's example: G=2 with QUG
rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates —
the workflow emitted "inconclusive - both linearity NaN", hiding a
real assumption failure.

The paper's rule is one-way: TWFE is admissible only if NO test rejects.
A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
  1. Collect rejections from conclusive tests first. If any, that is the
     primary verdict, and unresolved-step notes are APPENDED via
     "; additional steps unresolved: ..." rather than replacing the
     rejection.
  2. Only when NO conclusive rejection exists AND a required step is
     unresolved do we return a pure "inconclusive - ..." verdict.
  3. Otherwise fall through to the partial-workflow fail-to-reject
     verdict (with "(Yatchew NaN - skipped)" suffix if applicable).
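
The reordered priority can be sketched as (a simplified, hedged sketch: the verdict strings and the `(name, p, reject)` tuples are illustrative, and the partial-workflow Yatchew-skipped suffix is omitted):

```python
def compose_verdict(tests):
    # tests: (name, p_value_or_None, reject); p None = unresolved step.
    rejections = [n for n, p, r in tests if p is not None and r]
    unresolved = [n for n, p, r in tests if p is None]
    if rejections:
        # Conclusive rejection dominates; unresolved steps are appended.
        verdict = "reject: " + ", ".join(rejections)
        if unresolved:
            verdict += "; additional steps unresolved: " + ", ".join(unresolved)
        return verdict
    if unresolved:
        return "inconclusive - " + ", ".join(unresolved) + " unresolved"
    return "fail to reject (TWFE admissible)"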

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses:
qug_test and _validate_had_panel both front-door-reject d < 0 (paper
Section 2 HAD support restriction), but the new linearity helpers only
validated shape + NaN. Negative doses are outside the method's stated
scope and could silently produce conclusive-looking output.

Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test
now raise ValueError on any d < 0 with a message directing users to
pre-process or check the dose column. Docstrings updated to list the
new contract in the Raises section.

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

R6 P2 — Docstrings / REGISTRY sync:
HADPretestReport.verdict docstring rewritten to describe the new
"rejection-first, unresolved-suffix" priority. REGISTRY Phase 3
workflow checkbox updated to document the conclusive-rejection-not-
hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.
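
The field's definition can be sketched directly (a hedged sketch of the stated rule, not the library's exact implementation):

```python
import pandas as pd

def treatment_varies_within_unit(df, unit_col, treat_col):
    # True iff any unit has more than one distinct non-NaN
    # treatment value across its observed rows.
    nunique = df.groupby(unit_col)[treat_col].nunique(dropna=True)
    return bool((nunique > 1).any())

panel = pd.DataFrame({
    "unit": [1, 1, 2, 2],
    "dose": [0.5, 0.5, 0.3, 0.7],  # unit 2's dose changes over time
})
```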

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
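
The five-condition heuristic can be sketched as follows (a minimal NumPy re-implementation of the stated rule, with moment-based skewness; the library's exact computation may differ):

```python
import numpy as np

def is_count_like(y):
    # integer-valued AND has zeros AND right-skewed AND >2 distinct
    # values AND non-negative support (the new guard).
    y = np.asarray(y, dtype=float)
    m, s = y.mean(), y.std()
    skew = float(((y - m) ** 3).mean() / s ** 3) if s > 0 else 0.0
    return bool(
        np.all(y == np.round(y))
        and np.mean(y == 0) > 0
        and skew > 0.5
        and np.unique(y).size > 2
        and y.min() >= 0.0
    )

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, 500).astype(float)
```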

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…ctive-support guard

P1 #1 (FPC validator in SurveyDesign.resolve fires on placebo with
explicit psu):
The R10 fix gated the in-fit implicit-PSU FPC validator on
bootstrap/jackknife only, but ``SurveyDesign.resolve()`` itself
enforces ``FPC >= n_PSU`` design-validity (survey.py:349-368) before
``synthetic_did.fit()`` even sees the resolved object. So a placebo
fit with explicit ``psu`` and low ``fpc`` would still raise — same
parameter-interaction problem one layer earlier in resolution.

Fix: when ``variance_method == "placebo"`` and
``survey_design.fpc is not None``, construct an FPC-stripped copy of
the SurveyDesign (``dataclasses.replace(survey_design, fpc=None)``)
BEFORE calling ``_resolve_survey_for_fit``. Emit the FPC no-op
``UserWarning`` at the same time. The original ``survey_design``
object is preserved (caller's reference unchanged); the resolved
unit-level survey design carries no FPC on placebo, so the in-fit
validators (and the downstream FPC-related dispatch flags) all
correctly skip FPC handling.

The duplicate downstream FPC no-op warning (added in R8 keyed on
``resolved_survey_unit.fpc``) becomes unreachable on placebo and is
removed.
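
The FPC-stripping step can be sketched like this (`SurveyDesign` below is a hedged stand-in with only the relevant fields, not the library's real class):

```python
import dataclasses
import warnings
from typing import Optional

@dataclasses.dataclass(frozen=True)
class SurveyDesign:
    psu: Optional[str] = None
    fpc: Optional[float] = None

def resolve_for_placebo(design):
    # Strip FPC BEFORE resolution when variance_method == "placebo",
    # emitting the no-op warning; the caller's object is untouched.
    if design.fpc is not None:
        warnings.warn("fpc is a no-op under variance_method='placebo'",
                      UserWarning)
        return dataclasses.replace(design, fpc=None)
    return design

design = SurveyDesign(psu="county", fpc=2.0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stripped = resolve_for_placebo(design)
```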

New regression
``test_placebo_low_fpc_with_explicit_psu_skips_resolve_validator``:
asserts (a) placebo with explicit psu + ``fpc < n_PSU`` succeeds
+ emits no-op warning, (b) SE matches the no-FPC fit at ``rel=1e-12``,
(c) bootstrap on the same low-FPC design still raises
``"FPC (2.0) is less than the number of PSUs"`` from
``SurveyDesign.resolve()`` — validator-skip is correctly variance-
method-gated.

P1 #2 (Case D missed effective single-support):
The Case D guard for placebo degeneracy keyed on raw control counts
(``n_c_h > n_t_h`` for at least one stratum). It missed the case
where ``n_c_h_positive < 2`` for every treated stratum: rows allow
multiple subsets, but every successful pseudo-treated mean reduces
to the unique positive-weight control's outcome (zero-weight
cohabitants contribute 0 to numerator and denominator, R11 P1).
The placebo null collapses to a single point and SE = FP noise.

Fix: extend the non-degeneracy invariant to require **both**
``n_c_h > n_t_h`` AND ``n_c_h_positive >= 2`` for at least one
treated stratum. The classical Case D shape (raw exact-count
``n_c_h == n_t_h``) and the new "effective single-support" shape
(positive-weight controls < 2 even with extra zero-weight rows) both
trigger Case D. Updated the Case D error message to enumerate
``n_c_positive`` alongside ``n_c`` / ``n_t`` per stratum.
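
The extended invariant can be sketched as (field names are illustrative, not the estimator's internals): the fit is non-degenerate only if at least one treated stratum has strictly more controls than treated units AND at least two positive-weight controls.

```python
def placebo_nondegenerate(strata):
    # strata: per-treated-stratum dicts with n_t, n_c, n_c_positive.
    return any(s["n_c"] > s["n_t"] and s["n_c_positive"] >= 2
               for s in strata)
```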

New regression
``test_placebo_full_design_raises_on_effective_single_support``:
constructs a fixture with 1 treated unit + 1 positive-weight
control + 9 zero-weight controls in stratum 0; raw guards (B/C/E)
pass but Case D fires with the new "single distinct positive-mass
pseudo-treated mean" message.

Updated existing
``test_placebo_full_design_raises_on_exact_count_stratum`` regex
to match the new message (same Case D path, slightly different
wording).

REGISTRY §SyntheticDiD Case enumeration updated: Case D now
documents both the classical (``n_c == n_t``) and effective single-
support (``n_c_positive < 2``) shapes, with the combined non-
degeneracy invariant.

Verification: 98 passed (2 new regressions; existing Case B/C/E/D-
classical guards still fire on their fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…corrected scope; cover new exports in import-surface test

P3 #1 (ROADMAP wording drift):
ROADMAP.md still said the new fields "gate WooldridgeDiD QMLE /
ContinuousDiD prerequisites pre-fit" and mentioned "time-invariance",
which contradicted the round-1 corrections to TreatmentDoseShape's
docstring + autonomous guide §2 + §5.2. Reworded to match: the new
fields add descriptive distributional context only;
`outcome_shape.is_count_like` informs (not gates) the WooldridgeDiD
QMLE judgment, and the authoritative ContinuousDiD pre-fit gates
remain `has_never_treated`, `treatment_varies_within_unit`, and
`is_balanced`. "Time-invariance" wording removed (the field was
dropped in round 1).

P3 #2 (import-surface test coverage):
`test_top_level_import_surface()` previously only verified
`profile_panel`, `PanelProfile`, `Alert`. Extended to also cover the
two new public exports `OutcomeShape` and `TreatmentDoseShape`,
asserting both their importability and their presence in
`diff_diff.__all__`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…n estimand wording + is_count_like non-negativity guard

P1 #1 (Wooldridge Poisson estimand wording):
The guide §4.11 and §5.3 worked example described
`WooldridgeDiD(method="poisson")`'s `overall_att` as a
"multiplicative effect" / "log-link effect" / "proportional change"
to be reported. Verified against `wooldridge.py:1225`
(`att = _avg(mu_1 - mu_0, cell_mask)`) and
`_reporting_helpers.py:262-281` (registered estimand: "ASF-based
average from Wooldridge ETWFE ... average-structural-function (ASF)
contrast between treated and counterfactual untreated outcomes ...
on the natural outcome scale"): the actual quantity is
`E[exp(η_1)] - E[exp(η_0)]`, an outcome-scale DIFFERENCE, not a
multiplicative ratio. An agent following the previous wording would
misreport the headline scalar.

Rewrote both surfaces to:
- Describe the estimand as an ASF-based outcome-scale difference,
  citing `wooldridge.py:1225` and Wooldridge (2023) +
  REGISTRY.md §WooldridgeDiD nonlinear / ASF path.
- Explicitly note the headline `overall_att` is a difference on the
  natural outcome scale, NOT a multiplicative ratio.
- Mention that a proportional / percent-change interpretation can
  be derived post-hoc as `overall_att / E[Y_0]` but is not the
  estimator's reported scalar.

Added `test_autonomous_count_outcome_uses_asf_outcome_scale_estimand`
in `tests/test_guides.py`: extracts §4.11 and §5.3 blocks, asserts
forbidden phrases ("multiplicative effect under qmle", "estimates
the multiplicative effect", "multiplicative (log-link) effect",
"report the multiplicative effect", "report the multiplicative")
do NOT appear, and asserts §5.3 explicitly contains "ASF" and
"outcome scale" so future edits cannot silently weaken the
description.

P1 #2 (`is_count_like` non-negativity guard):
The `is_count_like` heuristic gated on integer-valued + has-zeros +
right-skewed + > 2 distinct values, but did NOT exclude negative
support. Verified against `wooldridge.py:1105-1109`: Poisson method
hard-rejects `y < 0` with `ValueError`. Without a value_min >= 0
guard, a right-skewed integer outcome with zeros and some negatives
would set `is_count_like=True` and steer an agent toward an
estimator that then refuses to fit.

Added `value_min >= 0.0` to the heuristic and explained the
non-negativity gate in the docstring + autonomous guide §2 field
reference (now reads
"is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND
n_distinct_values > 2 AND value_min >= 0"). The guide also notes
that the gate exists specifically to align the routing signal with
WooldridgeDiD Poisson's hard non-negativity requirement.
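The five-condition gate can be restated as a small predicate. An illustrative re-statement, not the library's implementation:

```python
import numpy as np

def is_count_like(y):
    # Illustrative version of the heuristic described above.
    y = np.asarray(y, dtype=float)
    s = y.std()
    skew = float(np.mean(((y - y.mean()) / s) ** 3)) if s > 0 else 0.0
    return bool(
        np.all(y == np.round(y))       # is_integer_valued
        and np.mean(y == 0) > 0        # pct_zeros > 0
        and skew > 0.5                 # right-skewed
        and np.unique(y).size > 2      # n_distinct_values > 2
        and y.min() >= 0.0             # value_min >= 0 (the new guard)
    )

rng = np.random.default_rng(0)
counts = rng.poisson(0.8, size=500)
spliced = counts.copy()
spliced[:10] = -1                      # small share of negative integers
```

With the guard, `spliced` no longer classifies as count-like even though the other four conditions fire.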

Added `test_outcome_shape_count_like_excludes_negative_support` in
`tests/test_profile_panel.py` covering a Poisson-distributed outcome
with a small share of negative integers spliced in: asserts
`is_count_like=False` despite the other four conditions firing.

P2 (test coverage for both P1s):
Both regressions above guard the new contracts. The guide test
guards the wording surface; the profile test guards the heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R1 P0 — Stute survey path silently accepted zero-weight units, which
leak into the dose-variation check + CvM cusum + bootstrap refit while
contributing zero population mass. Extreme case: only zero-weight units
carry dose variation -> spurious finite test statistic with no warning.
Fix: strictly-positive guards on every survey-aware Stute / Yatchew /
workflow entry point (the weights= shortcut already had this; survey=
branch was the gap).
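The shape of the guard is simple; a minimal sketch (function name is an assumption, not the library's helper):

```python
import numpy as np

def require_strictly_positive(weights):
    # Reject zero, negative, or non-finite weights before they can
    # leak into dose-variation checks, the CvM cusum, or bootstrap
    # refits while carrying zero population mass.
    w = np.asarray(weights, dtype=float)
    if not np.all(np.isfinite(w)) or np.any(w <= 0):
        raise ValueError("weights must be strictly positive")
    return w
```

A zero-weight unit then fails fast instead of contributing a spurious finite test statistic.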

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only
formulas silently (the variance components are derived assuming pweight
sandwich semantics). Fix: weight_type='pweight' guards added in
_resolve_pretest_unit_weights and on every direct-helper survey= branch
(stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit
guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — workflow's row-level weights= crashed on staggered event-
study panels because _validate_multi_period_panel filters to last
cohort but the joint wrappers re-aggregate with the original full-
panel weights array. Fix: subset joint_weights to data_filtered's
rows via data.index.get_indexer(data_filtered.index) BEFORE passing
to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional-
index pattern. Survey= path is unaffected (column references resolve
internally on data_filtered).
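The positional-subsetting fix can be sketched as follows (column names and data are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"unit": [1, 1, 2, 2, 3, 3], "cohort": [2, 2, 2, 2, 3, 3]},
    index=[10, 11, 12, 13, 14, 15],   # non-default index on purpose
)
weights = np.array([1.0, 1.0, 2.0, 2.0, 0.5, 0.5])  # row-aligned

# Staggered filter keeps only the last cohort:
data_filtered = data[data["cohort"] == data["cohort"].max()]

# Subset the row-level weights POSITIONALLY before handing them on:
pos = data.index.get_indexer(data_filtered.index)
joint_weights = weights[pos]
```

Without the `get_indexer` step, the full-length array reaches a helper expecting the filtered row count and crashes.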

R1 P3 — REGISTRY C0 note still said "the same gate applies to
did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both
are stale post-C. Updated to clarify (a) workflow gate was temporary
and is now closed by C, (b) qug_test direct-helper gate remains
permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT
Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey
on stute_test / stute_joint_pretest / workflow; aweight rejection on
stute_test / workflow; fweight rejection on yatchew_hr_test; staggered
event-study workflow with weights= (catches the length-mismatch crash).
165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test
direct calls still crashed on staggered panels because the staggered-
weights subset fix from R1 was only applied at the workflow level. The
wrappers run their own _validate_had_panel_event_study() and may filter
to data_filtered, then passed the original full-panel weights array to
_resolve_pretest_unit_weights(data_filtered, ...) which expects the
filtered row count. Fix: subset row-level weights to data_filtered.index
positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights,
mirroring the workflow fix.

R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap
perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does
`dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper
Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows
through the OLS refit and the weighted CvM, not through the perturbation.
Adding `* w` would over-weight by w². Fix: update REGISTRY note to
remove the spurious `* w` and clarify the canonical form. Add a
regression that pins (a) bit-exact cvm_stat reduction at uniform weights,
(b) bootstrap p-value distributional agreement within Monte-Carlo noise.
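The two perturbation forms can be contrasted directly (illustrative arrays, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
fitted = rng.normal(size=n)
eps = rng.normal(size=n)                 # unweighted residuals
w = rng.uniform(0.5, 2.0, size=n)        # survey weights
eta_obs = rng.choice([-1.0, 1.0], n)     # multiplier draw

dy_canonical = fitted + eps * eta_obs        # weighting enters only via
                                             # the weighted refit / CvM
dy_spurious = fitted + eps * w * eta_obs     # would over-weight by w**2
```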

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled
  bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped).
  Updated to reflect the correct mechanism.
- HADPretestReport.all_pass docstring described the unweighted contract
  only; survey/weights path drops the QUG-conclusiveness gate
  (linearity-conditional admissibility per C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the
  perturbation form via cvm_stat bit-exact reduction + p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved
weights/survey on the FULL panel before _validate_multi_period_panel
applied the staggered last-cohort filter. Because
_resolve_pretest_unit_weights enforces strictly-positive per-unit
weights / pweight type / etc. on whatever data it sees, zero or
otherwise-invalid weights on the soon-to-be-dropped cohort would abort
an otherwise-valid event-study run.

Fix: defer resolution to per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path
  presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel
  (no cohort filtering on this path; original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint
  wrappers (joint_pretrends_test / joint_homogeneity_test) own
  resolution and already see data_filtered (post staggered filter).
  Row-level weights= passed through with the existing positional
  subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage
gap. Existing tests covered overall-workflow + trivial/no-PSU smokes;
the PSU-aware multiplier-bootstrap path (the core new methodology)
was unpinned for joint_homogeneity_test and the event-study workflow.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial
  SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study
  dispatch under PSU/strata clustering: validate_multi_period_panel +
  resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1
  fix regression: panel where the dropped early cohort has zero
  weights succeeds on the surviving last cohort; pre-fix this crashed
  with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity-
family pretests now accept survey=/weights= keyword-only kwargs:

- stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test,
  joint_homogeneity_test, did_had_pretest_workflow.

Stute family: PSU-level Mammen multiplier bootstrap via
generate_survey_multiplier_weights_batch. Each replicate draws
(B, n_psu) Mammen multipliers, broadcast to per-obs perturbation
eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new
_cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix
across horizons within each replicate, preserving both vector-valued
empirical-process unit-level dependence (Delgado 1993; Escanciano 2006)
AND PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling --
multiplier bootstrap is a different mechanism.
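A sketch of the PSU-level Mammen draw and the per-obs broadcast (variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_psu = 3, 4
psu_of_obs = np.array([0, 0, 1, 1, 2, 3])     # obs -> PSU map

# Two-point Mammen distribution: mean 0, variance 1, third moment 1.
a = -(np.sqrt(5) - 1) / 2                     # ~ -0.618
b = (np.sqrt(5) + 1) / 2                      # ~  1.618
p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))       # P(eta == a)

eta_psu = np.where(rng.random((B, n_psu)) < p, a, b)
eta_obs = eta_psu[:, psu_of_obs]              # one draw per PSU, shared
                                              # by all obs in that PSU
```

Sharing `eta_psu` across horizons within a replicate, rather than redrawing per horizon, is what preserves the vector-valued unit-level dependence noted above.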

Yatchew: closed-form weighted OLS + pweight-sandwich variance components
(no bootstrap):
  sigma2_lin  = sum(w * eps^2) / sum(w)
  sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w))   [Reviewer CRITICAL #2]
  sigma4_W    = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)
  T_hr        = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W
where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997 Section 3).
All three components reduce bit-exactly to existing unweighted formulas
at w=ones(G); locked at atol=1e-14 by direct helper test.
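The components and their w = 1 reduction can be sketched with synthetic inputs (residuals and differences are illustrative, not a real Yatchew fit):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 50
y = rng.normal(size=G)             # unit-level outcomes, illustrative
eps = rng.normal(size=G)           # residuals from a weighted linear fit
diff = np.diff(y)                  # adjacent differences, length G - 1
w = np.ones(G)                     # uniform weights -> unweighted case

w_avg = (w[1:] + w[:-1]) / 2.0     # w_avg_g = (w_g + w_{g-1}) / 2
sigma2_lin = np.sum(w * eps**2) / np.sum(w)
sigma2_diff = np.sum(w_avg * diff**2) / (2 * np.sum(w))
sigma4_W = np.sum(w_avg * eps[1:]**2 * eps[:-1]**2) / np.sum(w_avg)
```

At w = 1 each expression collapses to the familiar unweighted moment.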

Workflow under survey/weights: skips the QUG step with UserWarning (per
C0 deferral), sets qug=None on the report, dispatches the linearity
family with survey-aware mechanism, appends "linearity-conditional
verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the
verdict. all_pass drops the QUG-conclusiveness gate (one less
precondition). HADPretestReport.qug retyped from QUGTestResults to
Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to
None-tolerant rendering.

Pweight shortcut routing: weights= passes through a synthetic trivial
ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the
same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern
on HAD sup-t.

Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise
NotImplementedError at every entry point (defense in depth, reciprocal-
guard discipline). The per-replicate weight-ratio rescaling for the
OLS-on-residuals refit step is not covered by the multiplier-bootstrap
composition; deferred to a parallel follow-up.

Per-row weights= / survey=col aggregated to per-unit via existing HAD
helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey;
constant-within-unit invariant enforced) through new
_resolve_pretest_unit_weights helper. Strictly-positive weights required
on Yatchew (the adjacent-difference variance is undefined under
contiguous-zero blocks).

Stability invariants preserved:
- Unweighted code paths bit-exact pre-PR (the new survey/weights branch
  is a separate if arm; existing 138 pretest tests pass unchanged).
- Yatchew weighted variance components reduce to unweighted at w=1 at
  atol=1e-14 (locked by TestYatchewHRTestSurvey).
- HADPretestReport schema bit-exact on the unweighted path; qug=None
  triggers the new None-tolerant rendering only on the survey path.

20 new tests across TestHADPretestWorkflowSurveyGuards (revised from
C0 rejection-only to C functional + 2 mutex/replicate-weight retained),
TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey
(5). Full pretest suite: 158 tests pass.

Patch-level addition (additive on stable surfaces). See
docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for
the full methodology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC" but
R10 narrowed Stute-family support to pweight + PSU + FPC only
(stratified rejected with NotImplementedError pending derivation).
Updated to reflect the actual support surface and consolidated the
stratified-Stute follow-up alongside replicate-weight pretests as the
two known Phase 4.5 C follow-ups.

R12 P3 #2 -- the new survey test matrix covered pweight-only and
PSU-only smokes but no FPC-only case. The bootstrap helper applies
sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned
by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with
  ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with
  SurveyDesign(weights=, fpc=) column reference.
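The scaling being pinned is a one-liner; a sketch with an illustrative sampling fraction and multiplier family:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 0.4                                          # sampling fraction n / N
eta = rng.choice([-1.0, 1.0], size=(200, 10))    # unit-variance multipliers

eta_fpc = np.sqrt(1.0 - f) * eta                 # FPC shrinks bootstrap
                                                 # variance by (1 - f)
```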

193 pretest tests pass (was 191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…erage

P3 #1: ``to_dataframe`` method docstring at
``chaisemartin_dhaultfoeuille_results.py:1375-1379`` listed the
pre-change ``level="by_path"`` schema (no ``cband_*`` columns) even
though the implementation now returns them. Updated the bullet to
include ``cband_lower / cband_upper``, document the negative-horizon
placebo convention, and document the NaN-on-absent-band behavior.

P3 #2: ``TestByPathSupTBands::test_path_sup_t_seed_reproducibility``
only exercised the default ``rademacher`` weight family. Parameterized
over ``["rademacher", "mammen", "webb"]`` to pin that the per-path
sup-t branch correctly threads ``self.bootstrap_weights`` through
``_generate_psu_or_group_weights`` for all three multiplier families
the feature advertises. The existing OVERALL machinery handles all
three uniformly, but the per-path surface lacked direct coverage.
Each variant must produce a finite, reproducible crit on the standard
3-path fixture.

17 tests pass on TestByPathSupTBands (was 15: +2 new parameterized
variants on the existing seed_reproducibility test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R2 P1: extended dispatch-matrix coverage on the new survey_design= front
door. Added 3 test classes covering paths that PR #376 fronted but didn't
directly test:

- TestHADFitMassPointSurveyDesign: design='mass_point' + survey_design=
  smoke + legacy-alias att-parity (vcov_type='hc1' required by the Phase
  4.5 B mass-point + survey deviation).
- TestHADFitEventStudySurveyDesign: aggregate='event_study' + cband=True +
  survey_design= smoke + legacy survey= parity (full bit-equality on att,
  se under same seed + design).
- TestDidHadPretestWorkflowEventStudySurveyDesign: workflow event-study
  smoke via survey_design=, plus legacy survey= and weights= parity. The
  weights= parity test also locks the R2 P3 nested-warning suppression
  (asserts exactly ONE DeprecationWarning fires from the workflow front
  door, not three from cascading joint wrappers).

R2 P3 #1: workflow's event-study `weights=` path was emitting up to 3
DeprecationWarnings (one at workflow front door + one each from the
joint wrappers' internal weights= path). Wrap the internal joint wrapper
calls in `warnings.catch_warnings() + simplefilter("ignore",
DeprecationWarning)` since the user-facing warning has already fired at
the workflow front door. Joint wrappers can't accept ResolvedSurveyDesign
(their `_resolve_pretest_unit_weights` requires a SurveyDesign with
.resolve()), so converting weights= to survey_design= via
make_pweight_design isn't an option here. Locked by the new
test_legacy_alias_parity_weights assertion `n_dep_warnings == 1`.
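The suppression pattern is standard-library machinery; a sketch with stand-in functions (not the library's wrappers):

```python
import warnings

def joint_wrapper(weights):
    # Stand-in for a joint wrapper's internal weights= path.
    warnings.warn("weights= is deprecated", DeprecationWarning)

def workflow(weights):
    warnings.warn("weights= is deprecated", DeprecationWarning)  # front door
    # The user-facing warning has already fired; silence nested copies.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        joint_wrapper(weights)
        joint_wrapper(weights)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    workflow(weights=[1.0])

n_dep_warnings = sum(
    issubclass(c.category, DeprecationWarning) for c in caught
)
```

The caller observes exactly one DeprecationWarning regardless of how many wrappers fire internally.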

R2 P3 #2: qug_test mutex error pointed users to
`survey_design=make_pweight_design(arr)` as a migration target via the
shared HAD_DUAL_KNOB_MUTEX_MSG_ARRAY_IN constant, but qug_test
permanently rejects ALL survey_design/survey/weights inputs (Phase 4.5 C0
deferral). Replaced with a qug-specific mutex message that says "no
migration path; see NotImplementedError below" instead of suggesting
make_pweight_design.

545 tests pass (was 538 + 7 new dispatch-matrix tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R9 P3 #1 (helper error message canonical-kwarg consistency):
`_resolve_pretest_unit_weights`'s TypeError on non-`SurveyDesign`-like
input still said `survey=` must be a SurveyDesign — but on the data-in
wrappers (workflow / joint_pretrends_test / joint_homogeneity_test) the
canonical kwarg is now `survey_design=`. Updated the message to name
`survey_design=` (with `survey=` flagged as the deprecated alias) and
to point pre-resolved-design users to the array-in pretest helpers,
mirroring HAD.fit's data-in guard.

R9 P3 #2 (legacy-vs-canonical parity coverage on data-in pretests):
Added 3 parity tests (test_legacy_alias_parity_survey on
joint_pretrends_test + joint_homogeneity_test, plus
test_legacy_alias_parity_survey_overall on did_had_pretest_workflow
overall path). Locks the rebinding contract on the data-in surfaces
that previously only had smoke / warning / mutex coverage.

558 tests pass (was 555 + 3 new R9 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
R10 P3 #1 (qug_test deprecation warning text): qug_test was using the
shared array-in deprecation messages that point users to migrate to
`survey_design=` / `make_pweight_design(arr)`, but qug_test permanently
rejects ALL survey-aware kwargs (Phase 4.5 C0 deferral). Replaced with
qug-specific warning text that says the aliases are deprecated AND
that survey-aware QUG remains unsupported, pointing users to
`did_had_pretest_workflow(..., survey_design=...)` for the survey-aware
linearity family instead.

R10 P3 #2 (weights= parity tests on data-in wrappers): the previous
round added survey= parity for joint_pretrends_test,
joint_homogeneity_test, and did_had_pretest_workflow(aggregate='overall')
but left the weights= rebinding paths warning-only with no numerical
parity lock. Added 3 new tests:
test_legacy_alias_parity_weights (joint_pretrends_test +
joint_homogeneity_test) and test_legacy_alias_parity_weights_overall
(workflow). Each asserts that `weights=np.ones(n)` and
`survey_design=SurveyDesign(weights="w")` (a uniform 1.0 column) produce
identical numerical output, locking the rebinding contract.

561 tests pass (was 558 + 3 new R10 P3 parity tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…scope

P3 #1 (Methodology): qualified the "exact R match" claim across
docstring / REGISTRY / CHANGELOG / R-generator comment / parity test
docstring with a cross-reference to the existing DID^X cell-weighting
deviation (Python's first-stage uses equal cell weights, R weights
by N_gt). The two coincide on one-observation-per-(g,t) panels (the
common cell-aggregated regime that the parity scenario uses). The
multi-observation-per-cell deviation is independent of the by_path
lift and was already documented in REGISTRY's "Note (Phase 3 DID^X
covariate adjustment)".

P3 #2 (Maintainability): narrowed the Step 7b header comment in
chaisemartin_dhaultfoeuille.py:1465-1473 to spell out that DID^X
residualization applies to the per-group multi-horizon path
(event_study_effects, overall_att, joiners/leavers, by_path,
placebos, sup-t bands) but intentionally excludes per_period_effects
which stays on raw outcomes per the existing "Note (Phase 3 DID^X
covariate adjustment)" contract. Documentation-only fix; no runtime
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 26, 2026
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer:
benchmarks/R/requirements.R installed whatever CRAN currently
served via install.packages, while the generator and parity test
hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust ==
0.5.0. A fresh R environment regenerating the goldens would have
the generator's stopifnot(packageVersion == "X.Y.Z") immediately
abort.

Fix: add `install_pinned_version()` helper using
remotes::install_version with `upgrade = "never"`, run it after
the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent
when the correct version is already installed. Bump procedure
documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times:
_zip_r_python() previously asserted only that R-mapped horizons
were a SUBSET of Python's event_times (missing-in-python check).
Tighten to FULL SET EQUALITY: also reject horizons present in
Python but absent from R's requested set ("extra_in_python"). This
catches future event_study horizon-selection regressions in both
directions — e.g. if our effects/placebo cap drifts and Python
emits an extra row R didn't request.
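The tightened check amounts to full set equality with direction-labeled failures. A minimal sketch (helper name and error labels follow the description above; the exact implementation is an assumption):

```python
def assert_event_times_match(r_event_times, python_event_times):
    """Reject horizon-set drift in either direction."""
    r_set, py_set = set(r_event_times), set(python_event_times)
    missing_in_python = sorted(r_set - py_set)
    extra_in_python = sorted(py_set - r_set)
    assert not missing_in_python, f"missing_in_python: {missing_in_python}"
    assert not extra_in_python, f"extra_in_python: {extra_in_python}"

# Equal sets pass; either a dropped or an extra horizon now fails.
assert_event_times_match([-2, -1, 0, 1], [-2, -1, 0, 1])
```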

Stats: 540 tests pass, 0 regressions. Still no estimator changes
— all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request Apr 27, 2026
Closes BR/DR foundation gap igerber#6 from project_br_dr_foundation.md:
BusinessReport and DiagnosticReport now name what the headline
scalar actually represents as an estimand, for each of the 16
result classes. Baker et al. (2025) Step 2 ("define the target
parameter") was previously in BR's next_steps list but not done
by BR itself — this PR closes that gap.

New top-level ``target_parameter`` block (additive schema
change; experimental per REPORTING.md stability policy):

  {
    "name": str,               # stakeholder-facing name
    "definition": str,         # plain-English description
    "aggregation": str,        # machine-readable dispatch tag
    "headline_attribute": str, # which raw result attribute
    "reference": str,          # REGISTRY.md citation pointer
  }

Schema placement: top-level block (user preference, selected via
AskUserQuestion in planning). Aggregation tags include "simple",
"event_study", "group", "2x2", "twfe", "iw", "stacked", "ddd",
"staggered_ddd", "synthetic", "factor_model", "M", "l", "l_x",
"l_fd", "l_x_fd", "dose_overall", "pt_all_combined",
"pt_post_single_baseline", "unknown".

Per-estimator dispatch lives in the new
``diff_diff/_reporting_helpers.py::describe_target_parameter``
(own module rather than business_report / diagnostic_report to
avoid circular-import risk — plan-review LOW igerber#7). All 17 result
classes covered (16 from _APPLICABILITY + BaconDecompositionResults);
exhaustiveness locked in by
TestTargetParameterCoversEveryResultClass.

Fit-time config reads:

- ``EfficientDiDResults.pt_assumption`` branches the aggregation
  tag between pt_all_combined and pt_post_single_baseline.
- ``StackedDiDResults.clean_control`` varies the definition clause
  (never_treated / strict / not_yet_treated).
- ``ChaisemartinDHaultfoeuilleResults.L_max`` +
  ``covariate_residuals`` + ``linear_trends_effects`` branches
  the dCDH estimand between DID_M / DID_l / DID^X_l /
  DID^{fd}_l / DID^{X,fd}_l.
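One of these fit-time reads can be sketched to show the dispatch shape. The branching condition and the `"all"` sentinel value are simplifying assumptions; only the class/attribute names and aggregation tags come from the description above:

```python
class EfficientDiDResults:
    """Stand-in result object; only pt_assumption matters for this sketch."""
    def __init__(self, pt_assumption: str):
        self.pt_assumption = pt_assumption

def describe_target_parameter(result) -> dict:
    """Map a result object to its target_parameter block (simplified)."""
    cls = type(result).__name__
    if cls == "EfficientDiDResults":
        # pt_assumption branches the aggregation tag, per the commit text.
        agg = ("pt_all_combined" if result.pt_assumption == "all"
               else "pt_post_single_baseline")
        return {
            "name": "Efficient overall ATT",
            "definition": ("Efficiency-weighted overall ATT under the "
                           f"{result.pt_assumption!r} parallel-trends regime."),
            "aggregation": agg,
            "headline_attribute": "overall_att",
            "reference": "REGISTRY.md",
        }
    # Fallback for classes this sketch does not model.
    return {"name": cls, "definition": "", "aggregation": "unknown",
            "headline_attribute": "overall_att", "reference": "REGISTRY.md"}
```

The real helper covers all 17 result classes; an exhaustiveness test over the registry keeps the dispatch from silently falling through to "unknown".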

Fixed-tag branches (per plan-review CRITICAL igerber#1 and igerber#2):

- ``CallawaySantAnna`` / ``ImputationDiD`` / ``TwoStageDiD`` /
  ``WooldridgeDiD``: the fit-time ``aggregate`` kwarg does not
  change the ``overall_att`` scalar — it only populates
  additional horizon / group tables on the result object.
  Disambiguating those tables in prose is tracked under gap igerber#9.
- ``ContinuousDiDResults``: the PT-vs-SPT regime is a user-level
  assumption, not a library setting. Emits a single
  "dose_overall" tag with disjunctive definition naming both
  regime readings (ATT^loc under PT, ATT^glob under SPT).

Prose rendering:

- BR ``_render_summary``: emits "Target parameter: <name>."
  after the headline sentence (short name only; full definition
  lives in the full_report and schema).
- BR ``_render_full_report``: "## Target Parameter" section
  between "## Headline" and "## Identifying Assumption".
- DR ``_render_overall_interpretation``: mirror sentence.
- DR ``_render_dr_full_report``: "## Target Parameter" section
  with name, definition, aggregation tag, headline attribute,
  and reference.

Cross-surface parity: both BR and DR consume the same helper
(the single source of truth), so their ``target_parameter``
blocks are byte-identical (verified by
TestTargetParameterCrossSurfaceParity).
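The parity invariant follows directly from the single-source-of-truth design and can be checked by comparing serialized blocks. The report-builder functions below are stand-ins, not the library's API:

```python
import json

def describe_target_parameter(result) -> dict:
    # Single helper both report surfaces consume (simplified stand-in).
    return {"name": "Overall ATT", "aggregation": "simple",
            "headline_attribute": "att", "reference": "REGISTRY.md"}

def business_report(result) -> dict:
    return {"target_parameter": describe_target_parameter(result)}

def diagnostic_report(result) -> dict:
    return {"target_parameter": describe_target_parameter(result)}

# Byte-identical across surfaces once serialized deterministically.
result = object()
br_block = json.dumps(business_report(result)["target_parameter"], sort_keys=True)
dr_block = json.dumps(diagnostic_report(result)["target_parameter"], sort_keys=True)
assert br_block == dr_block
```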

Tests: 37 new (TestTargetParameterPerEstimator +
TestTargetParameterFitConfigReads +
TestTargetParameterCoversEveryResultClass +
TestTargetParameterCrossSurfaceParity +
TestTargetParameterProseRendering). Existing BR/DR top-level-key
contract tests updated to include ``target_parameter``. Total
319 tests pass (282 prior + 37 new).

Docs: REPORTING.md gains a "Target parameter" section
documenting the per-estimator dispatch and schema shape.
business_report.rst and diagnostic_report.rst note the new
field with a pointer to REPORTING.md. CHANGELOG entry under
Unreleased.

Out of scope: REGISTRY.md per-estimator "Target parameter"
sub-sections (plan-review additional-note); the reporting-layer
doc in REPORTING.md is the current source of truth. A follow-up
docs PR can land those sub-sections if maintainers want the
registry to own the canonical wording directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>