
Add aggregate_survey() for survey microdata-to-panel aggregation #282

Merged
igerber merged 16 commits into main from survey-aggregate
Apr 7, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 7, 2026

Summary

  • Add aggregate_survey() to diff_diff/prep.py — aggregates individual-level survey microdata to geographic-period cells with design-based precision estimates
  • Returns (DataFrame, SurveyDesign) tuple: panel with cell means/SEs/precision weights, plus pre-configured aweight + geographic clustering design for second-stage DiD
  • Computes Horvitz-Thompson influence functions per cell, routing to TSL or replicate variance via existing survey infrastructure
  • SRS fallback (flagged via srs_fallback column) when within-cell design is too thin

Methodology references (required if estimator / math changes)

  • Method name(s): Design-weighted cell mean estimation with Taylor Series Linearization variance, following R survey::svyby() pattern
  • Paper / source link(s): Lumley (2004) "Analysis of Complex Survey Samples", JSS 9(8); Binder (1983) "On the Variances of Asymptotically Normal Estimators from Complex Surveys", ISR 51(3)
  • Any intentional deviations from the source (and why): SRS fallback when design-based variance is unidentifiable within a cell (documented in REGISTRY.md)

Validation

  • Tests added/updated: tests/test_prep.py::TestAggregateSurvey — 17 tests covering core functionality, edge cases (SRS fallback, zero variance, missing data, singleton PSU), correctness (SRS equivalence, design effect, equal weights), and full integration pipeline (microdata → aggregate → DiD estimation)
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Adds a new prep utility that aggregates individual-level survey data
(BRFSS, ACS, CPS) to geographic-period cells with design-based
precision estimates. Returns a (DataFrame, SurveyDesign) tuple ready
for second-stage DiD estimation with inverse-variance weights and
geographic clustering.

Computes Horvitz-Thompson influence functions per cell, routing to
TSL or replicate variance via existing survey infrastructure. Falls
back to SRS variance when within-cell design is too thin.
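
The first-stage output described above can be sketched with plain pandas. This is illustrative only: it mimics the shape of the returned panel (hypothetical column names like smoking_mean follow the *_mean/*_se/*_n/*_precision pattern from the summary) and substitutes a Kish effective-sample-size SRS formula where the library uses design-based TSL/replicate variance.

```python
import numpy as np
import pandas as pd

# Toy microdata standing in for BRFSS-style records.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "state":   rng.choice(["CA", "TX"], size=n),
    "year":    rng.choice([2019, 2020], size=n),
    "weight":  rng.uniform(0.5, 2.0, size=n),
    "smoking": rng.binomial(1, 0.2, size=n).astype(float),
})

rows = []
for (state, year), g in data.groupby(["state", "year"]):
    w, y = g["weight"].to_numpy(), g["smoking"].to_numpy()
    mean = np.average(y, weights=w)
    # Kish effective sample size; a design-based SE would come from
    # TSL or replicate weights instead of this SRS-style formula.
    n_eff = w.sum() ** 2 / (w ** 2).sum()
    var = np.average((y - mean) ** 2, weights=w) / max(n_eff - 1.0, 1.0)
    rows.append({
        "state": state, "year": year,
        "smoking_mean": mean, "smoking_se": np.sqrt(var),
        "smoking_n": len(y), "smoking_precision": 1.0 / var,
    })
panel = pd.DataFrame(rows).sort_values(["state", "year"]).reset_index(drop=True)
```

The *_precision column is what the real function hands to the second-stage design as an analytic weight.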

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2994d4961c9b0e19b756c30c7ce20c9c4cbd1906


Overall Assessment

⛔ Blocker

Highest unmitigated finding is a P0 methodology bug in aggregate_survey(): cell-level variance and precision are computed on a cell-subset design rather than the library’s documented full-design subpopulation/domain estimator.

Executive Summary

  • aggregate_survey() currently computes cell variance on a physically subsetted design, which conflicts with the existing subpopulation/domain methodology and can silently distort *_se and *_precision outputs; because *_precision is returned as second-stage aweight, this can also shift downstream DiD point estimates.
  • The new min_n parameter is documented and validated but never applied, so callers can request small-cell fallback behavior and get unchanged results.
  • Empty inputs are not handled; a filtered-away dataset can fall through to an internal sort error instead of a defined public contract.
  • Docs/tests need follow-up: the new example does not run as written, and the new tests do not cover the blocking domain-estimation case, min_n semantics, empty input, or replicate-weight aggregation.

Methodology

Affected method: design-weighted cell mean estimation with TSL/replicate variance and second-stage inverse-variance weighting in aggregate_survey().

  • Severity P0 | Impact: aggregate_survey() physically subsets the resolved design per cell at diff_diff/prep.py:L1497 and rebuilds cell-local PSU/strata counts at diff_diff/prep.py:L1511 via diff_diff/prep.py:L1514. That conflicts with the library’s own domain-estimation contract, which says subpopulations must preserve the full design with zeroed weights outside the domain at diff_diff/survey.py:L427 and docs/methodology/REGISTRY.md:L2490. In cells supported by only some PSUs within a stratum, this changes lonely-PSU handling, FPC adjustment, and PSU centering, so *_se and *_precision are wrong. Because the returned second-stage design uses *_precision as aweight at diff_diff/prep.py:L1608, the bug can also alter downstream DiD point estimates with no warning. Concrete fix: compute each cell as a full-design domain estimator by zero-padding psi outside the cell or by using SurveyDesign.subpopulation(...), preserve full-sample PSU/strata/FPC counts, and add a regression where a domain appears in only one of two sampled PSUs in a stratum.
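
The proposed fix — zero-padding influence values over the full design rather than physically subsetting — can be illustrated with a minimal with-replacement TSL sketch. The tsl_var helper and the exact variance formula here are assumptions for illustration, not the library's implementation.

```python
import numpy as np

# Two strata, two sampled PSUs each. The target cell (domain) has
# respondents in PSUs a, b, c but none in PSU d.
stratum = np.array(["s1", "s1", "s2", "s2"])
psu     = np.array(["a",  "b",  "c",  "d"])
w       = np.array([1.0,  1.0,  1.0,  1.0])
y       = np.array([2.0,  4.0,  9.0,  5.0])
in_cell = np.array([True, True, True, False])

def tsl_var(z, stratum, psu):
    """With-replacement TSL: per stratum, k/(k-1) times the sum of
    squared deviations of PSU totals from the stratum mean total."""
    var = 0.0
    for s in np.unique(stratum):
        m = stratum == s
        totals = np.array([z[m & (psu == p)].sum() for p in np.unique(psu[m])])
        k = len(totals)
        if k < 2:
            raise ValueError(f"lonely PSU in stratum {s}")
        var += k / (k - 1) * ((totals - totals.mean()) ** 2).sum()
    return var

# Full-design domain estimator: zero the weights outside the cell but
# keep every PSU (including d) in the variance formula.
wd = np.where(in_cell, w, 0.0)
mu = (wd * y).sum() / wd.sum()          # domain mean = 5.0
z  = wd * (y - mu) / wd.sum()           # zero-padded influence values
var_domain = tsl_var(z, stratum, psu)   # defined: PSU d contributes a 0 total

# Naive subsetting drops PSU d entirely, leaving stratum s2 with a
# lonely PSU: tsl_var(z[in_cell], stratum[in_cell], psu[in_cell]) raises.
```

This is exactly the regression the finding asks for: a stratum with two sampled PSUs where only one contributes to the cell.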

  • Severity P1 | Impact: min_n is documented to force SRS fallback below a respondent threshold at diff_diff/prep.py:L1421 and validated only for <1 at diff_diff/prep.py:L1477, but it is never consulted in _cell_mean_variance() or the aggregation loop at diff_diff/prep.py:L1343 and diff_diff/prep.py:L1543. A caller can set min_n=10 and still get the same design-based variance path as the default, which silently violates the advertised API/default-behavior contract. Concrete fix: once n_valid is known, route n_valid < min_n to the documented fallback formula at diff_diff/prep.py:L1375 and return NaN when n_valid < 2; add a regression test, because current coverage only checks bad-input validation at tests/test_prep.py:L2396.
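
A hedged sketch of the routing this finding asks for. choose_variance_path and its SRS formula are hypothetical; the library's actual fallback formula at the cited line may differ.

```python
import numpy as np

def choose_variance_path(y, w, min_n=1):
    """Illustrative routing: n_valid counts non-missing observations;
    cells below min_n (or with fewer than 2) skip the design-based path."""
    valid = ~np.isnan(y)
    n_valid = int(valid.sum())
    if n_valid < 2:
        return "undefined", np.nan          # SE not identifiable
    if n_valid < min_n:
        # Assumed SRS fallback: simple variance of the mean.
        yv = y[valid]
        mu = np.average(yv, weights=w[valid])
        var = ((yv - mu) ** 2).sum() / (n_valid * (n_valid - 1))
        return "srs_fallback", var
    return "design_based", None  # caller runs the TSL/replicate path

path, var = choose_variance_path(np.array([1.0, 2.0, 3.0]), np.ones(3), min_n=10)
```

The point is that min_n=10 and min_n=1 must produce observably different behavior on the same data, which is what the requested regression test pins down.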

Code Quality

  • Severity P1 | Impact: the new public path has no empty-input contract. After resolving/grouping at diff_diff/prep.py:L1485 and diff_diff/prep.py:L1492, an empty dataset falls through to panel_df = pd.DataFrame(rows) and unconditional sort_values(by_cols) at diff_diff/prep.py:L1600 and diff_diff/prep.py:L1604. That is exactly the empty-result-set edge case the review checklist calls out as P1 for new code paths. Concrete fix: add an explicit if data.empty: guard before resolve()/groupby and define one public contract (ValueError("data must be non-empty") or schemaful empty return), then test aggregate_survey(data.iloc[0:0], ...).
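
The failure mode and the suggested guard, as a minimal sketch (check_nonempty is a hypothetical helper name):

```python
import pandas as pd

rows = []                      # no cells survived filtering
panel_df = pd.DataFrame(rows)  # empty frame with no columns at all

try:
    panel_df.sort_values(["state", "year"])   # what the caller sees today
except KeyError:
    leaked = True  # internal pandas error, not a defined public contract

# The explicit public contract the review asks for:
def check_nonempty(data: pd.DataFrame) -> None:
    if data.empty:
        raise ValueError("data must be non-empty")
```

The guard belongs before resolve()/groupby so the ValueError is the only error an empty input can produce.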

Performance

No material findings in the changed code.

Maintainability

  • Severity P3 | Impact: the new registry entry describes variance behavior “after cell subsetting” and FPC on “cell subsets” at docs/methodology/REGISTRY.md:L2323 and docs/methodology/REGISTRY.md:L2327, while the existing Subpopulation Analysis section says naive subsetting is not the methodologically correct approach for domain estimation at docs/methodology/REGISTRY.md:L2490 and docs/methodology/REGISTRY.md:L2493. That internal contradiction will make future survey-method reviews harder. Concrete fix: after fixing the code, rewrite the new registry entry to use the same subpopulation/domain language, or add an explicit Deviation note if a different method is truly intended.

Tech Debt

No separate tech-debt-only finding. The P0/P1 issues above are not tracked in TODO.md, so they remain unmitigated and cannot be deferred under the stated review policy.

Security

No findings. I did not see secrets, unsafe command construction, or new data-exposure paths in the changed files.

Documentation/Tests

  • Severity P3 | Impact: the new example does not run as written. It aggregates by ["state", "year"] and then calls DifferenceInDifferences().fit(..., time="post") without creating a post column at docs/api/prep.rst:L274, docs/api/prep.rst:L287, diff_diff/prep.py:L1443, and diff_diff/prep.py:L1447. Concrete fix: either aggregate on a binary post column or add an explicit line creating panel["post"].

  • Severity P3 | Impact: the new tests exercise happy-path aggregation and input validation, but they do not cover the blocking cases above: no full-design domain/subpopulation regression, no non-default min_n behavior, no empty input, and no replicate-weight aggregation coverage. See tests/test_prep.py:L2087, tests/test_prep.py:L2319, and tests/test_prep.py:L2396. Concrete fix: add targeted regressions for each of those four cases.

Path to Approval

  1. Rework aggregate_survey() to compute each cell as a full-design subpopulation/domain estimator, not a cell-subset design. Add a regression where a stratum has two sampled PSUs but only one contributes to the target cell, and assert the cell SE/precision against the full-design domain formula.
  2. Make min_n operative. Add a test where min_n=3 and min_n=1 produce different fallback behavior on the same data, and define the n_valid < 2 contract explicitly.
  3. Add an explicit empty-data guard with a tested public contract so filtered-away inputs do not crash inside sort_values().

Rework aggregate_survey() to use full-design domain estimation:
zero-pad influence functions outside each cell, preserving full
strata/PSU structure for variance computation per Lumley (2004)
Section 3.4 and the library's subpopulation() methodology.

Also fix: min_n parameter now operative (forces SRS fallback),
empty-input guard added, docstring examples corrected, REGISTRY.md
rewritten with domain estimation language, 4 new tests added
(domain regression, min_n behavior, empty input, replicate weights).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: d330d36187bbbca1dbc13834d4df3f6389f5b979


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a [Newly identified] P1 edge-case bug in aggregate_survey(): missing values in grouping columns are neither validated nor handled, so pandas can silently drop those microdata rows from aggregation and, in the all-dropped case, fall through to an internal output-assembly error.

Executive Summary

Methodology

Code Quality

  • [Newly identified] Severity P1 | Impact: aggregate_survey() checks that grouping columns exist, but never checks that they are non-missing before data.groupby(by_cols, sort=True) at diff_diff/prep.py:1523. Pandas drops NaN group keys by default, so rows with missing geography/time are silently excluded from the computed means/SEs; if every row is dropped, the function falls through to DataFrame(rows) plus sort_values(by_cols) at diff_diff/prep.py:1621 and diff_diff/prep.py:1624, yielding an internal pandas error instead of a defined public contract. Concrete fix: validate data[by_cols] for missing values before grouping and raise an explicit ValueError, or opt into dropna=False and document that NaN forms its own cell; in either case add a post-grouping guard for the “no valid cells” path.
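
The pandas behavior behind this finding, and both proposed contracts, in a few lines of toy data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"state": ["CA", "CA", np.nan], "y": [1.0, 2.0, 3.0]})
by_cols = ["state"]

# Default groupby silently drops the NaN-keyed row: only 2 rows survive.
n_default = data.groupby(by_cols).size().sum()

# Option 1 (reject): validate grouping columns before grouping.
has_missing_keys = bool(data[by_cols].isna().any(axis=None))
# if has_missing_keys: raise ValueError("missing values in grouping columns")

# Option 2 (opt in): deliberately keep NaN as its own cell.
n_dropna_false = data.groupby(by_cols, dropna=False).size().sum()
```

Either contract is defensible; the defect is that the current code picks neither, so rows vanish with no warning.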

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate tech-debt-only findings. The remaining P1 is not tracked in TODO.md, so it is unmitigated and cannot be deferred under the stated review policy.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: The new example still fails as written in both the API docs and the function docstring. The aggregated panel only contains smoking_rate_mean, age_mean, and income_mean, but the example calls DifferenceInDifferences().fit(... treatment="treated", time="year"); treated is absent, and year is not a binary post indicator under the estimator’s validation rules. See docs/api/prep.rst:274, diff_diff/prep.py:1465, diff_diff/estimators.py:223, and diff_diff/utils.py:47. Concrete fix: aggregate or derive a treatment indicator on the panel, create a binary post column, and use time="post" in both examples.
  • Severity P3 | Impact: The new test block is much stronger, but it still does not cover missing by values, which is why the remaining P1 is easy to miss. See tests/test_prep.py:1970. Concrete fix: add one regression with a partially missing grouping key and one where grouping would otherwise yield zero valid cells, asserting the chosen public contract.

Path to Approval

  1. Add an explicit public contract for missing grouping keys in aggregate_survey() before groupby runs: either reject missing by values with a clear ValueError, or support them deliberately with dropna=False.
  2. Guard the “no valid cells after grouping” path so the function never fails inside sort_values()/output assembly.
  3. Add targeted regressions for the chosen missing-key contract: one partially missing-key input and one all-missing-key input.

Address remaining P1/P3 from AI review:
- Reject NaN in by columns before groupby (P1)
- Make docstring/RST examples illustrative pseudocode (P3)
- Add tests for partial and all-NaN grouping keys (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: fe77d84a1d54bda99fd9e817bd05ea74c68d4118


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a [Newly identified] P1 edge-case bug in aggregate_survey(): it reconstructs cell membership from row labels (diff_diff/prep.py:L1540-L1546), so duplicate DataFrame indices are not handled safely and can break or mis-map aggregation on otherwise valid pandas inputs.

Executive Summary

Methodology

Code Quality

  • Severity P1 [Newly identified] | Impact: aggregate_survey() converts each grouped cell back to a full-sample mask by looking up cell_df.index inside data.index.get_indexer(...) diff_diff/prep.py:L1540-L1546. That reconstruction is label-based rather than position-based, so duplicate DataFrame indices are not safely supported by construction. On valid pandas inputs with repeated row labels, this can either raise or mis-assign rows to a cell. | Concrete fix: build cell membership from stable row positions instead of index labels. The safest pattern is to create a temporary integer row-position vector before grouping (or reset to an internal RangeIndex), then use those positions directly when populating cell_mask.
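
The two reconstruction strategies can be contrasted directly on toy data (the real code operates on cell_df frames produced by the aggregation loop):

```python
import numpy as np
import pandas as pd

# Valid pandas input with duplicate row labels.
data = pd.DataFrame({"state": ["CA", "TX"], "y": [1.0, 2.0]}, index=[0, 0])
cell_df = data[data["state"] == "CA"]

# Label-based reconstruction (the flagged pattern) breaks: get_indexer
# refuses to look up labels against a non-unique index.
try:
    data.index.get_indexer(cell_df.index)
except pd.errors.InvalidIndexError:
    label_lookup_failed = True

# Position-based reconstruction: groupby(...).indices maps each group
# key to integer row positions, independent of the index labels.
cell_mask = np.zeros(len(data), dtype=bool)
cell_mask[data.groupby("state").indices["CA"]] = True
```

With duplicate labels spanning cells, label lookup could also silently over-select rows instead of raising, which is the mis-mapping case the finding mentions.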

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The P1 above is not tracked under TODO.md:L51-L94, so it remains unmitigated and is not deferrable under the stated review policy.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: the new test block closes the prior missing-group-key gap, but it still does not cover duplicate-index inputs, so the remaining edge-case bug is unguarded tests/test_prep.py:L1970-L2569. | Concrete fix: add one regression with duplicated row labels inside the same cell and one with duplicated labels spanning different cells, and assert that aggregation matches the same data after reset_index(drop=True).

Path to Approval

  1. Replace label-based cell-membership reconstruction in aggregate_survey() with position-based tracking so duplicate DataFrame indices cannot break or distort aggregation diff_diff/prep.py:L1540-L1546.
  2. Add duplicate-index regression tests covering both same-cell and cross-cell duplicate labels tests/test_prep.py:L1970-L2569.

Replace label-based index lookup with stable positional row tracking
via _row_pos column, so duplicate DataFrame indices cannot break or
mis-map cell aggregation. Add regression test verifying identical
results with duplicated vs clean indices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 1bbdfa882b74e05891df20a065eef582dbe6f20e


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a [Newly identified] P1 correctness bug in aggregate_survey(): the duplicate-index fix now injects a hard-coded _row_pos column before grouping, so a valid input that already has a real _row_pos grouping column will be silently regrouped by internal row numbers instead of the user’s data, producing incorrect cell means/SEs/precision with no warning. See diff_diff/prep.py:L1535-L1543.

Executive Summary

  • The prior duplicate-index blocker is addressed for ordinary repeated DataFrame indices: cell membership is now reconstructed from stable row positions rather than index labels, and the new regression compares duplicate-index vs reset-index outputs. See diff_diff/prep.py:L1535-L1547 and tests/test_prep.py:L2455-L2485.
  • The prior missing-group-key blocker is also addressed: aggregate_survey() now rejects NaNs in grouping columns before grouping, with regressions for both partial and all-missing grouping keys. See diff_diff/prep.py:L1514-L1521 and tests/test_prep.py:L2424-L2453.
  • Methodologically, the new estimator path is aligned with the Methodology Registry: _cell_mean_variance() uses zero-padded full-design domain IFs and routes them through the existing TSL/replicate variance helpers, matching the documented Lumley-style domain-estimation approach. See diff_diff/prep.py:L1381-L1402 and docs/methodology/REGISTRY.md:L2303-L2326.
  • [Newly identified] The replacement duplicate-index fix introduces a new edge-case bug: data.assign(_row_pos=...) can overwrite a legitimate user _row_pos column and silently change the grouping partition. See diff_diff/prep.py:L1535-L1543.
  • The earlier docs/example issue appears fixed in both the docstring and API docs: the example now clearly says treatment/post indicators must be added at the panel level before fitting DiD. See diff_diff/prep.py:L1462-L1475 and docs/api/prep.rst:L266-L293.
  • I could not execute the tests locally because this sandbox is missing pytest and pandas.

Methodology

  • Severity P3 | Impact: No unmitigated methodology defect found. The affected method is aggregate_survey()’s design-weighted cell mean / SE computation, and its implementation matches the new registry entry: full-design domain estimation via zero-padded IFs, documented SRS fallback, and documented NaN precision for zero-variance cells. | Concrete fix: None required. Keep the registry entry and helper docstring aligned as behavior evolves. See diff_diff/prep.py:L1414-L1424, diff_diff/prep.py:L1381-L1402, and docs/methodology/REGISTRY.md:L2303-L2326.

Code Quality

  • Severity P1 [Newly identified] | Impact: aggregate_survey() now depends on data.assign(_row_pos=row_positions).groupby(by_cols, ...) for safe duplicate-index handling. If a caller already has a real _row_pos column and includes it in by, that user column is overwritten in the grouping frame, so the function silently groups by internal row numbers instead of the caller’s requested key. That corrupts the cell partition and therefore the estimated means, SEs, counts, and precision weights on otherwise valid inputs. | Concrete fix: stop injecting a hard-coded temporary column into the data being grouped. Either generate a collision-proof sentinel name not present in data.columns, or better, keep row positions external to the DataFrame entirely and use groupby(...).indices / positional bookkeeping to build cell_mask. See diff_diff/prep.py:L1535-L1543.
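
The collision can be reproduced in a few lines (toy data; len() of a groupby counts its groups):

```python
import numpy as np
import pandas as pd

# A caller legitimately groups by a column named "_row_pos".
data = pd.DataFrame({"_row_pos": ["a", "a", "b"], "y": [1.0, 2.0, 3.0]})
by_cols = ["_row_pos"]

# Flagged pattern: injecting an internal column of the same name silently
# replaces the user's grouping key with row numbers (3 "cells", not 2).
n_groups_bad = len(data.assign(_row_pos=np.arange(len(data))).groupby(by_cols))

# Collision-free alternative: groupby(...).indices never mutates the
# grouped frame and already yields integer row positions per group.
n_groups_good = len(data.groupby(by_cols).indices)
```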

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The P1 above is not tracked under TODO.md:L51-L93, so it remains unmitigated and is not deferrable under the stated review rules.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: The new regression coverage proves the ordinary duplicate-index case, but it does not cover the _row_pos collision path introduced by the fix, so the new blocker is currently unguarded by tests. | Concrete fix: add a regression where the input already contains a real _row_pos column with nontrivial values and by includes that column; assert aggregation matches the same data after renaming the user column or after switching to collision-free positional tracking. See tests/test_prep.py:L2455-L2485 and diff_diff/prep.py:L1535-L1543.

Path to Approval

  1. Replace the hard-coded _row_pos temporary-column strategy in aggregate_survey() with collision-free positional tracking so user-supplied columns can never be overwritten. See diff_diff/prep.py:L1535-L1543.
  2. Add a regression for an input DataFrame that already has _row_pos in by, and verify results are unchanged relative to the same data after renaming that column. The existing duplicate-index regression is the right place to extend. See tests/test_prep.py:L2455-L2485.

Replace _row_pos column injection with groupby().indices which returns
positional indices directly. No temporary columns are added to the
user's data, eliminating any column name collision risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 4788166cc653a4155aae2f9b5ac9ac3da841f3c3


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a P1 SE/counting bug in aggregate_survey(): cell validity is defined as ~np.isnan(y_cell) rather than positive-weight, non-missing observations, so zero-weight rows can inflate {outcome}_n, bypass the n_valid < 2 / min_n guards, and emit se=0.0 where the effective positive-weight sample is too thin for a design-based SE. See diff_diff/prep.py:1354, diff_diff/prep.py:1360, diff_diff/prep.py:1376.

Executive Summary

  • The prior duplicate-index blocker is resolved. aggregate_survey() now uses position-based groupby().indices membership and no longer injects _row_pos into user data. See diff_diff/prep.py:1535 and tests/test_prep.py:2455.
  • The prior missing-group-key blocker is also resolved: grouping columns are validated for NaN before grouping, with regressions for partial and all-missing cases. See diff_diff/prep.py:1514 and tests/test_prep.py:2424.
  • Methodologically, the core design is sound: the new path implements the registry’s full-design domain-estimation approach by zero-padding cell IFs and routing them through the existing TSL/replicate variance helpers. The documented SRS fallback and zero-variance precision behavior are therefore not defects. See diff_diff/prep.py:1381 and docs/methodology/REGISTRY.md:2303.
  • P1: zero-weight rows are still treated as valid cell observations. That conflicts with the library’s own positive-weight handling elsewhere and can silently produce incorrect {outcome}_n and degenerate se=0.0 instead of NaN on valid survey inputs with zero weights. See diff_diff/prep.py:1354, diff_diff/linalg.py:1051, and diff_diff/survey.py:838.
  • The new test suite is strong on duplicate indices, missing grouping keys, domain estimation, SRS fallback, and replicate weights, but it does not cover zero full-sample weights in aggregate_survey(). See tests/test_prep.py:2087, tests/test_prep.py:2487, and tests/test_prep.py:2572.
  • I could not execute the tests locally because this sandbox lacks pytest, numpy, and pandas.

Methodology

  • Severity P1 | Impact: _cell_mean_variance() counts every non-missing row as valid, even when its survey weight is exactly zero. Because n_valid then drives {outcome}_n, the n_valid < 2 undefined-SE branch, and the min_n SRS-fallback gate, a cell with only one positive-weight observation plus zero-weight padding can be reported as having n_valid >= 2 and y_se = 0.0 instead of an unidentified SE. That is an incorrect variance/SE on a valid survey-weight input. See diff_diff/prep.py:1354, diff_diff/prep.py:1360, diff_diff/prep.py:1376, diff_diff/linalg.py:1051, and diff_diff/survey.py:838.
    Concrete fix: define cell validity for SE/count purposes as positive-weight and non-missing, use that positive-weight count for {outcome}_n, the < 2 undefined-SE guard, and the min_n fallback decision, and keep zero-weight rows from manufacturing effective information.
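
The two validity definitions side by side, on the one-real-observation-plus-padding cell this finding describes:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])
w = np.array([2.0, 0.0, 0.0])   # one positive-weight row, two zero-weight rows

# Flagged definition: every non-missing row counts, so the cell
# passes the n_valid < 2 guard despite having one effective observation.
n_valid_old = int((~np.isnan(y)).sum())

# Proposed definition: positive weight AND non-missing. The cell has
# n_valid == 1, so its SE/precision should be NaN, not a finite 0.0.
valid = (~np.isnan(y)) & (w > 0)
n_valid_new = int(valid.sum())
```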

  • Severity P3 | Impact: No unmitigated mismatch in the core domain-estimation design. The zero-padded IF construction, TSL/replicate routing, documented SRS fallback, and precision-NaN handling for zero-variance cells are aligned with the new registry entry. See diff_diff/prep.py:1381, diff_diff/prep.py:1396, diff_diff/prep.py:1583, and docs/methodology/REGISTRY.md:2321.
    Concrete fix: None.

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate finding. The P1 above is not tracked under TODO.md:51, and it is not deferrable under the stated review rules because it affects correctness of reported SEs/counts.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: The new TestAggregateSurvey coverage does not exercise zero full-sample weights, so the P1 path above is currently unguarded by regression tests. See tests/test_prep.py:2087, tests/test_prep.py:2455, tests/test_prep.py:2487, and tests/test_prep.py:2572.
    Concrete fix: add a regression where a cell has one positive-weight observation plus one or more zero-weight rows; assert {outcome}_n counts only positive-weight observations and se/precision are NaN when the positive-weight sample is effectively size one.

  • No unmitigated documentation blocker. The earlier example issue appears fixed in both the docstring and API docs: treatment/time indicators are now added at the panel level before fitting DiD. See diff_diff/prep.py:1469 and docs/api/prep.rst:286.

Verification note: I could not run the tests in this environment because pytest, numpy, and pandas are unavailable.

Path to Approval

  1. Change _cell_mean_variance() so “valid” means positive-weight and non-missing, and use that same positive-weight count for {outcome}_n, the n_valid < 2 branch, and the min_n SRS-fallback threshold. See diff_diff/prep.py:1354.
  2. Add a regression for a weights-only design where a cell contains one positive-weight observation and one or more zero-weight rows; assert y_n == 1, y_se is NaN, and y_precision is NaN.
  3. Add the same regression for a clustered/PSU design so zero-weight rows cannot manufacture a finite design-based SE through the domain-variance path.

Define validity as non-NaN AND positive weight so zero-weight padding
rows don't inflate {outcome}_n or bypass n_valid < 2 / min_n guards.
Add regression test for cell with 1 real + 9 zero-weight observations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 703a7fe74b5c528bdcb01607bf76e4bb774c6a00


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a P1 edge-case/API-contract bug: aggregate_survey() can return a (panel, stage2) pair that is not actually fit-ready when any cell has non-finite first-stage outputs, because the returned SurveyDesign points at a precision column that the same code path intentionally fills with NaN for zero/undefined-variance cells. See diff_diff/prep.py:L1358, diff_diff/prep.py:L1584, diff_diff/prep.py:L1641, diff_diff/survey.py:L166, and diff_diff/estimators.py:L696.

Executive Summary

Methodology

  • Severity P3 | Impact: Informational only. The zero-padded full-design IF construction in aggregate_survey() is aligned with Lumley’s domain/subpopulation guidance and the survey package’s subset() / svyby() behavior that preserves the original design information for domain SEs; Binder is the general linearization reference for asymptotically normal estimators under complex designs. The PR’s SRS fallback and zero-variance-precision behavior are already documented in docs/methodology/REGISTRY.md:L2303, so they are not blockers. Concrete fix: None. (jstatsoft.org)

Code Quality

  • Severity P1 | Impact: aggregate_survey() promises a panel plus pre-configured second-stage design, but it retains first-stage-nonestimable cells and still points stage2 at a weight column that can contain NaN. That means valid inputs with all-missing cells, one-effective-observation cells, or zero-variance cells fail at second-stage fit time instead of being filtered or zero-weighted. Concrete fix: before returning the pair, either drop rows whose first outcome has non-finite *_mean / *_precision, or create a separate sanitized stage-two weight column (for example, map non-finite precision to 0.0) and point stage2 at that column while preserving the diagnostic *_precision output. See diff_diff/prep.py:L1358, diff_diff/prep.py:L1584, diff_diff/prep.py:L1641, diff_diff/survey.py:L166, and diff_diff/estimators.py:L696.
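
The recommended sanitized-weight pattern, as a minimal sketch (column names follow the {outcome}_precision/{outcome}_weight convention discussed here; smoking is a hypothetical outcome):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    # NaN: undefined SE (thin cell); inf: zero-variance cell.
    "smoking_precision": [4.0, np.nan, np.inf],
})

# Keep *_precision untouched as a diagnostic column; hand the
# second-stage design a finite weight column where non-estimable
# cells get weight 0.0 instead of breaking fit().
prec = panel["smoking_precision"].to_numpy()
panel["smoking_weight"] = np.where(np.isfinite(prec), prec, 0.0)
```

Zero-weighting keeps the cell visible in the panel while excluding it from the second-stage estimate, which is the split between diagnostics and fitting that the fix recommends.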

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining P1 is not tracked under TODO.md:L51 and is not deferrable under the stated review rubric.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: The new tests independently cover zero-variance aggregation and the happy-path stage-two pipeline, but they do not cover the handoff case where aggregate_survey() returns non-finite first-stage outputs and the caller immediately uses the returned stage2 design. That leaves the P1 above unguarded. Concrete fix: add regressions that call DifferenceInDifferences.fit(..., survey_design=stage2) on panels containing (1) a zero-variance cell and (2) a one-effective-observation or all-missing cell, and assert the chosen filtering/zero-weighting behavior. See tests/test_prep.py:L2160, tests/test_prep.py:L2355, and docs/api/prep.rst:L273.

Path to Approval

  1. Make the returned (panel, stage2) pair internally consistent for cells with undefined first-stage outputs. Recommended: keep *_precision as the diagnostic column, add a finite stage-two weight column, and point the returned SurveyDesign at that sanitized column.
  2. Ensure cells with non-finite first-outcome means are either removed from the returned panel or otherwise excluded before second-stage fitting, with an explicit warning.
  3. Add end-to-end regressions covering zero-variance, n_valid == 1, and all-missing cells using the returned stage2 design.
  4. Update the aggregate_survey() docstring / API example to state exactly how non-estimable cells are handled in the returned panel/design pair.

The returned SurveyDesign now points at a {outcome}_weight column
where NaN/Inf precision values are mapped to 0.0, so downstream
fit() never rejects missing weights. Diagnostic *_precision column
is preserved as-is. Add stage2-handoff test with single-observation
cell (NaN precision → zero weight → fit succeeds).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: f04fe0d18cf4fae93cc33bd3facc9a55605b80c8


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a P1 stage-two handoff bug: aggregate_survey() now sanitizes non-finite precision into *_weight, but cells with n_valid == 0 still keep *_mean = NaN, and DifferenceInDifferences.fit() rejects missing outcomes before zero weights can help. See diff_diff/prep.py:L1358, diff_diff/prep.py:L1641, and diff_diff/estimators.py:L696.

Executive Summary

  • The prior re-review blockers on zero-weight padding, missing grouping keys, and duplicate indices look addressed.
  • Methodologically, the new first-stage cell estimator is aligned with the cited survey/domain-estimation pattern, and the SRS fallback plus zero-variance precision handling are explicitly documented in the registry, so those are not defects. (jstatsoft.org)
  • P1: the returned (panel, stage2) pair is still not fit-ready when a cell has no positive-weight, non-missing observations.
  • P1 [Newly identified]: if an entire geographic unit ends up with zero total second-stage weight, the returned PSU-based survey path still counts that unit in variance/df calculations.
  • Tests and docs improved materially, but they still miss the n_valid == 0 and all-zero-weight-PSU handoff cases. I could not run the suite locally because this sandbox lacks NumPy.

Methodology

  • Severity P3 | Impact: Informational only. The core first-stage construction matches the survey-methodology references: preserve the full design, zero-pad out-of-domain contributions, and estimate variance by linearization or replicate weights. The survey docs also state that subset() preserves original strata/clusters and yields the same domain estimates/SEs as svyby/svyglm, and svyby notes that exact zero-weight observations are counted as missing in subset-style counts. Binder remains the canonical linearization reference for asymptotically normal estimators under complex designs. The PR’s SRS fallback and zero-variance precision behavior are explicitly documented in docs/methodology/REGISTRY.md:L2303, so they are mitigated and not defects. Concrete fix: None. (jstatsoft.org)
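The full-design domain pattern this bullet endorses can be sketched with synthetic data. This is a generic illustration of zero-padded influence functions for a design-weighted domain mean, assuming nothing about the package's internals; the helper name is hypothetical.

```python
import numpy as np

def domain_mean_if(y, w, in_domain):
    """Zero-padded influence function for a design-weighted domain mean.

    The influence values are defined over the FULL sample: observations
    outside the domain contribute exact zeros, so the original
    strata/cluster structure is preserved for variance estimation.
    """
    d = np.asarray(in_domain, dtype=float)
    w_dom = np.sum(w * d)
    ybar = np.sum(w * d * y) / w_dom
    z = d * w * (y - ybar) / w_dom  # identically zero outside the domain
    return ybar, z

rng = np.random.default_rng(0)
y = rng.normal(size=10)
w = rng.uniform(1, 3, size=10)
in_dom = np.arange(10) < 6  # first six observations form the cell
ybar, z = domain_mean_if(y, w, in_dom)
print(np.allclose(z[6:], 0.0))  # True: out-of-domain rows carry zero influence
```

The zero rows stay in the sample, which is exactly why domain SEs are not determined only by data inside the domain.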

Code Quality

  • Severity P1 | Impact: aggregate_survey() still returns NaN means when a cell has no positive-weight, non-missing observations, via the n_valid == 0 branch at diff_diff/prep.py:L1358. The new stage-two handoff only sanitizes the weight column at diff_diff/prep.py:L1641, while DifferenceInDifferences.fit() rejects missing outcomes up front at diff_diff/estimators.py:L696. So an all-missing or all-zero-weight cell still breaks the advertised “panel + pre-configured stage2” contract. Concrete fix: before returning, drop or otherwise exclude rows whose first outcome mean is non-finite, and emit a warning listing the affected cells.
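The concrete fix above can be sketched as a post-processing step; this is an assumed-shape helper (function name, `state`/`year` columns, and the `{outcome}_mean` naming are illustrative), not the package's implementation.

```python
import warnings
import numpy as np
import pandas as pd

def drop_non_estimable_cells(panel: pd.DataFrame, outcome: str) -> pd.DataFrame:
    """Drop rows whose first-outcome mean is non-finite, with a warning,
    so the returned panel never hands fit() a NaN outcome."""
    finite = np.isfinite(panel[f"{outcome}_mean"].to_numpy(dtype=float))
    if not finite.all():
        dropped = panel.loc[~finite, ["state", "year"]].to_records(index=False)
        warnings.warn(f"Dropping non-estimable cells: {list(dropped)}")
    return panel.loc[finite].reset_index(drop=True)

panel = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "year": [2000, 2001, 2000],
    "y_mean": [0.4, np.nan, 0.5],  # all-missing cell -> NaN mean
})
clean = drop_non_estimable_cells(panel, "y")
print(len(clean))  # 2
```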

  • Severity P1 [Newly identified] | Impact: the new *_weight = 0.0 path at diff_diff/prep.py:L1646 can produce a geographic unit whose total second-stage weight is zero. The returned design always uses unstratified PSU clustering at diff_diff/prep.py:L1652. In that path, survey df are still n_psu - 1 at diff_diff/survey.py:L619, and PSU variance aggregation still counts every PSU label present in the panel at diff_diff/survey.py:L1219. That is inconsistent with the package’s generic weighted cluster path, which explicitly excludes zero-total-weight clusters at diff_diff/linalg.py:L1087. If an entire state/region is non-estimable at stage 1, second-stage SEs and t-based inference can be skewed silently. Concrete fix: either prune all-zero-weight PSUs before creating stage2, or update the aweight survey path to exclude zero-total-weight PSUs in both variance computation and df_survey.
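The "prune all-zero-weight PSUs before creating stage2" option can be sketched in a few lines; column names and the helper are illustrative, not the package's API.

```python
import pandas as pd

def prune_zero_weight_psus(panel: pd.DataFrame, psu_col: str,
                           weight_col: str) -> pd.DataFrame:
    """Drop whole geographic units whose total second-stage weight is
    zero, so they cannot inflate PSU counts, variance, or df."""
    totals = panel.groupby(psu_col)[weight_col].transform("sum")
    return panel.loc[totals > 0].reset_index(drop=True)

panel = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY"],
    "y_weight": [2.0, 1.5, 0.0, 0.0],  # NY is entirely non-estimable
})
pruned = prune_zero_weight_psus(panel, "state", "y_weight")
print(sorted(pruned["state"].unique()))  # ['CA']
```

After pruning, the PSU count seen by the survey path matches the effective sample, mirroring what the generic weighted cluster path already does.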

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. Neither P1 above is tracked under TODO.md:L51, so both remain unmitigated.

Security

No findings.

Documentation/Tests

  • Severity P3 | Impact: the new tests cover zero-weight padding, duplicate indices, zero variance, and the n_valid == 1 handoff case at tests/test_prep.py:L2455, but they still do not cover n_valid == 0 or an all-zero-weight PSU. The API example also still lists panel columns without mentioning the returned *_weight column or how non-estimable cells are handled at docs/api/prep.rst:L282. Concrete fix: add end-to-end regressions for those two cases and update the docs/example to describe *_weight plus the exact row/cluster exclusion behavior.

Path to Approval

  1. Make aggregate_survey() return a genuinely fit-ready pair for n_valid == 0 cells by excluding rows whose first outcome mean is non-finite before returning (panel, stage2).
  2. Ensure the returned stage-two design excludes zero-total-weight PSUs from PSU counts, variance adjustments, and df_survey, or equivalently remove all-zero-weight PSUs before constructing stage2.
  3. Add regressions for an all-missing/all-zero-weight cell and for a state/region whose every returned row has *_weight == 0.
  4. Update the aggregate_survey() docs to mention the returned *_weight column and the precise handling of non-estimable cells and PSUs.

Cells with non-finite outcome mean (n_valid==0, all-missing, all-zero-
weight) are dropped from the panel with a warning before constructing
the stage2 SurveyDesign. This ensures fit() never encounters NaN
outcomes and eliminates all-zero-weight PSUs from second-stage
variance/df calculations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f84169d87f06600c23b923b705d6ad15e16020f


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding is a P1: the new aggregate_survey() handoff still allows an all-zero-weight second-stage PSU to survive into survey variance/df calculation, which can skew second-stage SEs and t-based inference. The prior n_valid == 0 blocker does look fixed via the non-estimable-cell drop in diff_diff/prep.py:L1641-L1657 and the new regression at tests/test_prep.py:L2455-L2510.

Executive Summary

  • The earlier n_valid == 0 / NaN-outcome handoff issue is addressed: non-estimable cells are now dropped before returning (panel, stage2).
  • Methodologically, the first-stage cell estimator matches the new registry entry and the cited domain-estimation references; the documented SRS fallback and zero-variance precision behavior are not defects.
  • P1: a geography whose every retained cell has *_weight == 0 can still be counted as a PSU in second-stage survey df/meat calculations.
  • P1 [Newly identified]: if every cell is dropped as non-estimable, aggregate_survey() returns an empty (panel, stage2) pair instead of failing fast with a clear error.
  • Tests/docs improved materially, but they still miss regressions for those two remaining handoff cases, and the API example still omits the returned *_weight column plus the row-dropping behavior.
  • I could not run pytest here because pytest is not installed in this sandbox.

Methodology

  • Severity P3 | Impact: Informational only. Affected methods: _cell_mean_variance() and aggregate_survey(). The first-stage construction is aligned with the registry and the cited survey-methodology sources: full-design domain estimation via zero-padded influence functions, with Taylor linearization or replicate-weight variance, is the right pattern here. The PR’s SRS fallback and zero-variance precision handling are explicitly documented in docs/methodology/REGISTRY.md:L2303-L2326, so those are mitigated deviations rather than defects. Concrete fix: None.

  • Severity P1 | Impact: The prior all-zero-weight-PSU inference bug is still unresolved. aggregate_survey() now drops only non-finite-mean cells diff_diff/prep.py:L1641-L1657, but a whole geography can still end up with zero total second-stage weight when every cell has a finite mean yet precision becomes NaN/0 because of singleton support or zero variance diff_diff/prep.py:L1361-L1363, diff_diff/prep.py:L1586-L1593, diff_diff/prep.py:L1662-L1667. The returned stage-two design still counts every PSU label in resolve() diff_diff/survey.py:L286-L323, uses that count in df_survey diff_diff/survey.py:L619-L622, and includes zero-score PSUs in the survey meat calculation diff_diff/survey.py:L1219-L1241, diff_diff/survey.py:L1257-L1280. That is especially problematic because the generic weighted cluster path already treats zero-weight clusters as outside the effective sample count diff_diff/linalg.py:L1051-L1093. Concrete fix: before constructing stage2, drop any by_cols[0] groups with zero total weight_col, or update the aweight survey path to exclude zero-total-weight PSUs from PSU counts, centering, and df_survey; add a regression where one state/region has only zero-precision cells across all periods.

Code Quality

  • Severity P1 [Newly identified] | Impact: Empty post-drop output is still unhandled. After dropping non-estimable cells diff_diff/prep.py:L1641-L1657, the function immediately constructs stage2 diff_diff/prep.py:L1659-L1675. If every cell has n_valid == 0, callers get an empty “panel-ready” pair instead of a clear failure at the aggregation boundary. The new tests cover empty input tests/test_prep.py:L2413-L2422 and a single dropped cell tests/test_prep.py:L2455-L2510, but not the “all cells dropped” case. Concrete fix: after the drop step, raise ValueError when panel_df.empty, with a message like “No estimable cells remain after aggregation”; add a regression with all outcomes missing within every cell.
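The fail-fast guard suggested above is tiny; a sketch (hypothetical helper name, message wording taken from the suggestion):

```python
import pandas as pd

def check_estimable(panel: pd.DataFrame) -> pd.DataFrame:
    """Raise at the aggregation boundary instead of returning an empty
    'panel-ready' (panel, stage2) pair."""
    if panel.empty:
        raise ValueError("No estimable cells remain after aggregation")
    return panel

try:
    check_estimable(pd.DataFrame(columns=["y_mean"]))
except ValueError as exc:
    print(exc)  # No estimable cells remain after aggregation
```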

Performance

No findings.

Maintainability

No separate findings.

Tech Debt

No separate findings. I checked TODO.md:L51-L120; neither remaining P1 is tracked there, so both remain unmitigated.

Security

No findings.

Documentation/Tests

  • Severity P3 | Impact: Coverage is much better, but still incomplete around the remaining handoff edges. The suite now covers SRS fallback tests/test_prep.py:L2090-L2122, zero-variance cells tests/test_prep.py:L2142-L2170, and partial non-estimable-cell dropping tests/test_prep.py:L2455-L2510, but it still lacks regressions for “all cells dropped” and “entire geography has zero total *_weight”. The API example also still lists panel columns without the returned *_weight column and does not mention that non-estimable cells are removed before return docs/api/prep.rst:L273-L293. Concrete fix: add those two regressions and update the example/comments to mention *_weight plus dropped non-estimable cells.

Path to Approval

  1. After dropping non-estimable cells, raise a clear ValueError when no estimable cells remain instead of returning an empty (panel, stage2) pair.
  2. Exclude zero-total-weight geographies from second-stage aweight survey variance/df calculations, either by pruning them before building stage2 or by teaching the survey aweight path to ignore zero-total-weight PSUs.
  3. Add regressions for:
    1. all cells dropped because every cell has n_valid == 0;
    2. one geography whose every retained row has *_weight == 0 from zero variance or singleton support.
  4. Update the API example to show the returned *_weight column and the row-dropping behavior for non-estimable cells.

Geographic units where every cell has zero weight are pruned before
constructing stage2, preventing inflated survey df/variance. If all
cells are dropped, raise ValueError with clear message. Add regressions
for zero-weight PSU pruning and all-cells-dropped cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: fa6c73874b9552404edbdba5db7a000786b07a62


Overall Assessment

⚠️ Needs changes

The two prior blocker-class handoff issues do look fixed: non-estimable cells are now dropped before return at diff_diff/prep.py:1641, zero-total-weight geographies are pruned before the stage-two design is built at diff_diff/prep.py:1669, and the empty-post-drop case now raises at diff_diff/prep.py:1687, with matching regressions at tests/test_prep.py:2481, tests/test_prep.py:2512, and tests/test_prep.py:2550. The remaining unmitigated issue is a new P1 in the multi-outcome contract: the returned panel is globally filtered on the first outcome only, so valid cells for secondary outcomes can be silently removed.

Executive Summary

  • The previous P1s from the last review appear resolved in the current diff: non-estimable cells are dropped, zero-total-weight geographies are removed, and the all-cells-dropped case now fails fast.
  • Methodology adherence looks good for the core estimator: _cell_mean_variance() uses full-design domain estimation with zero-padded IFs and routes to TSL or replicate-weight variance, matching the registry entry and cited survey-methodology sources.
  • Severity P1: aggregate_survey() advertises multi-outcome output, but after computing all outcomes it drops rows using only outcomes[0]; that can silently discard valid secondary-outcome and covariate aggregates.
  • The new test block substantially improves edge-case coverage, but it does not cover asymmetric estimability across multiple outcomes.
  • Documentation is still incomplete: the API example omits the returned *_weight column and still does not explain row-dropping behavior.
  • I could not execute pytest in this sandbox because both pytest and numpy are unavailable.

Methodology

  • Severity P3 | Impact: Informational only. The affected method is survey cell aggregation in _cell_mean_variance() / aggregate_survey(). The implementation at diff_diff/prep.py:1315 follows the right design-based pattern: keep the full sample, zero-pad the domain influence function outside the cell, and estimate variance through Taylor linearization or replicate weights. That is consistent with Lumley’s subpopulation guidance and the Binder/Lumley linearization framework, and the SRS fallback / zero-variance behavior is explicitly documented in docs/methodology/REGISTRY.md:2303. Concrete fix: None. (researchgate.net)

Code Quality

  • Severity P1 | Impact: aggregate_survey() silently applies first-outcome support to every returned column. The function computes all requested outcomes in the per-cell loop at diff_diff/prep.py:1573, but the post-processing filter is built only from first_outcome at diff_diff/prep.py:1645. So if outcomes=["y1","y2"] and a cell has no valid y1 but valid y2, that row is removed from the shared panel anyway, taking y2_mean, y2_se, and any covariate means with it. The public contract only documents that the returned SurveyDesign uses the first outcome’s weights at diff_diff/prep.py:1460; it does not document that all rows are also filtered on the first outcome. The only multi-outcome regression, tests/test_prep.py:2046, checks column presence only and would not catch this. Concrete fix: enforce a clear multi-outcome contract by either raising when requested outcomes have different estimable support, or redesigning the return shape so per-outcome support/designs are preserved instead of globally filtering on outcomes[0].
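The silent loss described here is easy to reproduce in miniature. A sketch with made-up data (column names illustrative) showing how filtering the shared panel on the first outcome's support discards a valid secondary-outcome cell:

```python
import numpy as np
import pandas as pd

# Cell (NY, 2001) has no valid y1 but a valid y2. Filtering the shared
# panel on y1 support removes the y2 estimate along with it.
panel = pd.DataFrame({
    "state":   ["CA", "NY"],
    "year":    [2001, 2001],
    "y1_mean": [0.4, np.nan],
    "y2_mean": [1.2, 0.9],
})
filtered = panel.loc[np.isfinite(panel["y1_mean"])]
print("rows kept:", len(filtered))                    # 1
print("cells with valid y2:", int(panel["y2_mean"].notna().sum()))  # 2
```

One valid y2 cell out of two survives the first-outcome filter, which is exactly the asymmetric-support case the existing column-presence test cannot catch.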

Performance

  • No findings.

Maintainability

  • No findings beyond the P1 contract ambiguity above.

Tech Debt

  • No separate findings. There is no mitigating entry in TODO.md:51 covering the multi-outcome filtering contract, so the P1 above remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity P3 | Impact: The new API docs still under-describe the returned object. The example comment in docs/api/prep.rst:282 omits the returned smoking_rate_weight column even though it is part of the public return contract at diff_diff/prep.py:1453, and it still does not mention that non-estimable cells may be dropped before return. Concrete fix: update the example/comments to include *_weight and the row-dropping semantics, especially if multi-outcome support remains exposed.
  • I could not run the tests in this environment because pytest and numpy are not installed.

Path to Approval

  1. Make multi-outcome aggregation non-silent: either raise a clear ValueError when requested outcomes have different estimable support, or change the return contract so secondary outcomes are not dropped based on outcomes[0].
  2. Add a regression where one cell has all y1 missing but valid y2, using outcomes=["y1","y2"], and assert the chosen contract.
  3. Update the aggregate_survey() docstring/API docs to spell out the row-filtering contract and include the returned *_weight column.

Multi-outcome filtering is based on the first outcome (consistent with
the returned SurveyDesign using the first outcome's weights). Docstring
now explicitly states this contract. Warning emitted when dropped cells
had valid data for secondary outcomes. Test added for the contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: d6abd0651a08d79e47839e6cadc5c8778b4390d0


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 on silent multi-outcome row loss appears addressed: first-outcome filtering is now part of the public contract, surfaced with a warning, and regression-tested at diff_diff/prep.py:L1438-L1442, diff_diff/prep.py:L1655-L1671, and tests/test_prep.py:L2062-L2086.
  • Methodology alignment looks good for the new estimator path: cell means are handled as full-design domain estimates with TSL/replicate dispatch, and the SRS fallback is documented in the Methodology Registry.
  • Severity P2: the new public API still lacks explicit validation for empty by / outcomes lists and will fall through to internal errors instead of a clear ValueError.
  • Severity P3: the new replicate-weight regression uses JK1-generated columns but declares replicate_method="BRR", so it does not really validate method-specific replicate scaling.
  • Severity P3: the API example comment still omits the returned *_weight column and the row-dropping behavior.

Methodology

Code Quality

  • Severity P2 | Impact: aggregate_survey() still does not explicitly reject empty by / outcomes lists. outcomes=[] flows through normalization and only fails later when the code indexes outcome_cols[0], while by=[] falls through to groupby(by_cols) instead of surfacing a clear API contract error at validation time (diff_diff/prep.py:L1488-L1492, diff_diff/prep.py:L1544-L1545, diff_diff/prep.py:L1649-L1650). Concrete fix: add explicit if not by_cols / if not outcome_cols guards alongside the existing validation and add regressions for both cases.
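The suggested guards can be sketched as boundary validation; the helper name and normalization convention are assumptions, not the package's actual signature.

```python
def validate_aggregation_args(by, outcomes):
    """Explicit guards so empty lists fail at the API boundary rather
    than deep inside groupby() or outcome_cols[0] indexing."""
    by_cols = [by] if isinstance(by, str) else list(by)
    outcome_cols = [outcomes] if isinstance(outcomes, str) else list(outcomes)
    if not by_cols:
        raise ValueError("`by` must name at least one grouping column")
    if not outcome_cols:
        raise ValueError("`outcomes` must name at least one outcome column")
    return by_cols, outcome_cols

print(validate_aggregation_args("state", ["y"]))  # (['state'], ['y'])
```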

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity P3 | Impact: The new replicate-weight regression is method-mismatched. generate_survey_did_data(include_replicate_weights=True) documents and generates JK1 delete-one-PSU replicates, but the new test instantiates SurveyDesign(..., replicate_method="BRR") (diff_diff/prep_dgp.py:L1246-L1248, diff_diff/prep_dgp.py:L1651-L1667, tests/test_prep.py:L2740-L2756). Because the assertion only checks for finite positive SEs, it would not catch a method-specific scaling bug in the replicate path. Concrete fix: change the test to replicate_method="JK1" for this fixture, or generate true BRR-compatible replicate columns before testing BRR.
  • Severity P3 | Impact: The example block in the API docs still under-describes the returned panel. It lists smoking_rate_mean, smoking_rate_se, smoking_rate_n, and smoking_rate_precision but omits the fit-ready smoking_rate_weight, and it still does not mention that first-outcome filtering can drop rows before return (docs/api/prep.rst:L282-L287, diff_diff/prep.py:L1438-L1442, diff_diff/prep.py:L1457-L1467). Concrete fix: update the example comment to include smoking_rate_weight and a brief note that multi-outcome support is keyed off the first outcome.
  • I could not run pytest -q tests/test_prep.py -k aggregate_survey in this sandbox because pytest, numpy, and pandas are not installed.
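Why the replicate-method mismatch above matters can be seen from the textbook scale factors (a generic sketch; exact conventions vary by package, and the helper is hypothetical): JK1 and BRR both yield finite positive SEs, so a finite-positive assertion cannot distinguish them.

```python
import numpy as np

def replicate_variance(theta_full, theta_reps, method):
    """Textbook replicate-variance scale factors for R replicates:
    JK1 uses (R-1)/R, BRR uses 1/R, applied to the same squared
    deviations of replicate estimates around the full-sample estimate."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    R = theta_reps.size
    dev2 = np.sum((theta_reps - theta_full) ** 2)
    if method == "JK1":
        return (R - 1) / R * dev2
    if method == "BRR":
        return dev2 / R
    raise ValueError(f"unknown replicate method: {method}")

reps = [0.9, 1.1, 1.05, 0.95]
v_jk1 = replicate_variance(1.0, reps, "JK1")
v_brr = replicate_variance(1.0, reps, "BRR")
print(v_jk1 > 0 and v_brr > 0)        # True for both methods
print(np.isclose(v_jk1 / v_brr, 3))   # True: they differ by a factor of R-1
```

A method-specific scaling bug would change the ratio, not the sign or finiteness, which is why the test needs to match the fixture's actual replicate design.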

- Validate empty by/outcomes lists with clear ValueError (P2)
- Fix replicate test to use JK1 matching DGP output (P3)
- Update RST example with *_weight column and filtering note (P3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 15b3d4a4b458967a16545fda2b3cad64bef2371f


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review findings are addressed: aggregate_survey() now explicitly rejects empty by / outcomes inputs at diff_diff/prep.py:1495, with regressions at tests/test_prep.py:2441 and tests/test_prep.py:2446.
  • The replicate-weight regression now matches the generator’s JK1 design: the fixture generates JK1 delete-one-PSU replicates at diff_diff/prep_dgp.py:1651, and the test now uses replicate_method="JK1" at tests/test_prep.py:2767.
  • The docs gap from the last review is fixed: the API docs now mention the fit-ready *_weight column, automatic dropping of non-estimable / zero-weight geos, and the first-outcome multi-outcome contract at docs/api/prep.rst:282.
  • Methodology alignment looks good: the new path treats each cell as a full-design domain, zero-pads influence values outside the cell, and hands variance off to the existing Taylor-linearization / replicate helpers, which is the standard complex-survey domain/subpopulation pattern described by Lumley and Binder. See diff_diff/prep.py:1315, diff_diff/survey.py:422, diff_diff/survey.py:1463, and diff_diff/survey.py:1681. (r-survey.r-forge.r-project.org)
  • Severity P2: aggregate_survey() still relies on a raw astype(np.float64) cast for outcomes/covariates at diff_diff/prep.py:1544, so non-numeric inputs will fail with a low-level conversion error instead of a clear API-level ValueError.
  • I could not execute the tests in this sandbox because numpy, pandas, and pytest are not installed.

Methodology

  • Severity P3 | Impact: Informational only. Affected methods: survey domain/cell-mean estimation, Taylor-linearized / replicate variance for those cell means, and second-stage inverse-variance weighting. The implementation in _cell_mean_variance() and aggregate_survey() matches the registry entry at docs/methodology/REGISTRY.md:2303: zero-padded full-design domain IFs, variance via the existing survey helpers, and 1 / V(ȳ_g) handed off as aweight for stage two. The only non-primary branch is the SRS fallback at diff_diff/prep.py:1377 and diff_diff/prep.py:1398, and that behavior is explicitly documented as a **Note:** in the registry at docs/methodology/REGISTRY.md:2321, so it is mitigated rather than blocking. Concrete fix: None. (r-survey.r-forge.r-project.org)

Code Quality

  • Severity P2 | Impact: Non-numeric outcomes or covariates currently surface a generic NumPy conversion failure from data[var].values.astype(np.float64) at diff_diff/prep.py:1544 rather than a clear contract error from aggregate_survey(). On a new public API, that makes bad inputs harder to diagnose and is inconsistent with the function’s otherwise explicit validation. Concrete fix: add explicit numeric-dtype checks for outcome_cols and cov_cols before building y_arrays, and add targeted regression tests next to the existing validation cases around tests/test_prep.py:2408.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

Non-numeric columns now raise a clear ValueError instead of failing
inside astype(np.float64). Add regression test for string column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8ac222222b7a00e1bd1a010b8dd8037b88a54c2b


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Methodology looks aligned: aggregate_survey() treats each cell as a survey domain, preserves full-design variance structure, and routes variance through the existing linearization/replicate helpers; the SRS fallback and zero-variance handling are explicitly documented in the registry rather than being undocumented deviations. (r-survey.r-forge.r-project.org)
  • The current diff fixes the earlier obvious validation/docs gaps visible here: empty by / outcomes guards were added at diff_diff/prep.py:L1495-L1498, and the API/methodology contract is now documented at docs/api/prep.rst:L253-L297 and docs/methodology/REGISTRY.md:L2303-L2326.
  • Severity P1: the new numeric validator at diff_diff/prep.py:L1544-L1548 still mishandles pandas nullable numeric dtypes (Int64, Float64) because it uses np.issubdtype(data[v].dtype, np.number). That leaves a common survey-data edge case failing with a raw type error on the new public API.
  • I could not execute the tests in this sandbox because pytest is not installed.

Methodology

  • Severity P3 | Impact: Informational only. The affected methods are cell-level design-weighted mean estimation and cell-level variance estimation in diff_diff/prep.py:L1315-L1403, plus the stage-two inverse-variance handoff in diff_diff/prep.py:L1685-L1724. This is consistent with the survey-methodology sources: subpopulation/domain estimates should preserve the original strata/cluster structure rather than naively subsetting, domain SEs are not determined only by data inside the domain, and replicate-weight variance is based on evaluating the statistic under the sampling and replicate weights. Binder is also the cited basis for variance estimation of asymptotically normal estimators under complex sampling. The SRS fallback and zero-variance-to-NaN precision branches are explicitly documented in docs/methodology/REGISTRY.md:L2303-L2326, so they are mitigated rather than defects. Concrete fix: None. (r-survey.r-forge.r-project.org)

Code Quality

  • Severity P1 | Impact: diff_diff/prep.py:L1544-L1548 uses np.issubdtype(data[v].dtype, np.number) for the new pre-validation step. That works for plain NumPy dtypes but not for pandas nullable extension dtypes, so a DataFrame with numeric Int64 / Float64 outcome or covariate columns can still fail before aggregation with a low-level exception. This is a production-facing edge-case break in the new API. Concrete fix: replace that check with pd.api.types.is_numeric_dtype(data[v]), which is already the file-local pattern at diff_diff/prep.py:L466-L469 and diff_diff/prep.py:L697-L700, and keep the explicit ValueError only for truly non-numeric inputs.
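The dtype mismatch is easy to demonstrate in isolation. A sketch comparing the two checks on a pandas nullable column (the behavior of `np.issubdtype` on extension dtypes is the point of the finding; it typically surfaces as a raw TypeError):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # pandas nullable integer

# pandas-aware check: recognizes nullable extension dtypes as numeric.
assert pd.api.types.is_numeric_dtype(s)

# NumPy-only check: cannot interpret the extension dtype, so a valid
# numeric column is rejected before aggregation.
numpy_check_ok = False
try:
    numpy_check_ok = bool(np.issubdtype(s.dtype, np.number))
except TypeError:
    pass
print(numpy_check_ok)  # False: the np.issubdtype path mishandles Int64
```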

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find a TODO.md entry that mitigates the P1 above, and under the review policy this kind of edge-case production error is not deferrable anyway.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Replace the np.issubdtype(...) validator in diff_diff/prep.py:L1544-L1548 with pandas-aware numeric detection (pd.api.types.is_numeric_dtype or equivalent) so nullable numeric dtypes are accepted.
  2. Add regression tests in tests/test_prep.py:L2454-L2464 for nullable Int64 and Float64 outcome/covariate columns, asserting aggregate_survey() accepts them and still raises a clean ValueError only for genuinely non-numeric inputs.

Replace np.issubdtype with pd.api.types.is_numeric_dtype so pandas
nullable extension dtypes (Int64, Float64) are accepted as numeric.
Add regression test with Float64 outcome column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 50ab3bdf372e8093e7c09fe97025d666788d80ea


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the prior nullable-dtype P1 is resolved. The validator now uses pandas-aware numeric detection in diff_diff/prep.py:1544, and there is a regression for nullable numeric input in tests/test_prep.py:2466.
  • Methodology is largely aligned: aggregate_survey() implements full-design domain estimation with zero-padded influence functions, which matches the registry entry in docs/methodology/REGISTRY.md:2303 and survey-package guidance that domain SEs depend on the full design rather than only the in-domain rows. The documented SRS fallback and zero-variance precision behavior are therefore informational, not defects. (jstatsoft.org)
  • Severity P1: the new min_n fallback path is only tested on the plain non-replicate path, while replicate-weight designs are only exercised on the default non-fallback path. That leaves a new parameter × inference-backend interaction unverified in code that directly determines returned SEs and precision weights.
  • I could not run the test suite in this sandbox because pytest and the project’s Python dependencies are unavailable.

Methodology

  • Severity: P3 | Impact: Informational only. The affected methods are cell-level design-weighted mean estimation, cell-level variance estimation through compute_survey_if_variance() / compute_replicate_if_variance(), and second-stage inverse-variance weighting in diff_diff/prep.py:1315 and diff_diff/prep.py:1720. This matches the registry contract in docs/methodology/REGISTRY.md:2303: Lumley describes complex-survey analysis via linearisation or replicate weights, the survey package’s domain-estimation docs say subpopulation SEs do not depend only on subpopulation rows, and svyCprod explicitly adds zero rows when fewer PSUs remain in a domain. The SRS fallback and zero-variance-to-NaN precision rules are documented in the registry, so they are mitigated rather than defects. Concrete fix: None. (jstatsoft.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. TODO.md does not contain a mitigating entry for the P1 below, and under the stated rubric this issue would not be downgraded by TODO tracking anyway.

Security

  • No findings.

Documentation/Tests

  • Severity: P1 | Impact: _cell_mean_variance()’s fallback branch uses full_resolved.weights directly when n_valid < min_n or when design-based variance is unidentifiable (diff_diff/prep.py:1377, diff_diff/prep.py:1398). The new tests cover min_n only on a plain weights-only design (tests/test_prep.py:2738), while replicate-weight designs are tested only on the default non-fallback path (tests/test_prep.py:2778). That matters because SurveyDesign.resolve() intentionally skips mean-1 normalization for replicate designs and preserves raw full-sample weight scale (diff_diff/survey.py:189, diff_diff/survey.py:238). Under the review rubric, a new parameter that is not tested across an existing inference path is P1. Concrete fix: add replicate-weight regressions that force both fallback triggers and verify that the fallback SE/precision output is scale-invariant under constant rescaling of full and replicate weights, or normalize/document that branch explicitly.
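The scale-invariance property the fix asks to regression-test can be sketched directly. This is a hypothetical SRS-style fallback SE for a weighted cell mean, written so that the weights only enter as ratios, the property a correct fallback should satisfy under w → c·w; it is not the package's actual fallback formula.

```python
import numpy as np

def fallback_se(y, w):
    """Weighted-mean SE where weights enter only as ratios, so the
    result is invariant under constant rescaling of the weights."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    ybar = np.sum(w * y) / np.sum(w)
    n = y.size
    var = n / (n - 1) * np.sum((w * (y - ybar)) ** 2) / np.sum(w) ** 2
    return np.sqrt(var)

rng = np.random.default_rng(1)
y = rng.normal(size=8)
w = rng.uniform(1, 4, size=8)
se1 = fallback_se(y, w)
se2 = fallback_se(y, 10.0 * w)  # constant rescaling of all weights
print(np.isclose(se1, se2))  # True: a correct fallback should not move
```

A fallback that uses raw unnormalized weights additively rather than as ratios would fail this check, which is exactly what the suggested rescaling regression is designed to catch.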

Path to Approval

  1. Add a replicate-weight regression where a cell has n_valid < min_n, and assert that aggregate_survey() sets srs_fallback=True with stable *_se and *_precision.
  2. Add a second replicate-weight regression that forces the “design-based variance not estimable” fallback branch, then rerun it after multiplying both full-sample and replicate weights by a constant. The fallback result should not change.
  3. If the scale-invariance check fails, normalize the weights used in the fallback formula or document the intended replicate-path fallback behavior in docs/methodology/REGISTRY.md:2303.

Exercises the SRS fallback path under a JK1 replicate-weight design,
verifying that fallback SEs are finite/positive and differ from the
replicate-based SEs. Covers the min_n × replicate interaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: bb5d6609ca2dc61f1cc6ed7582c919b7a985a502


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the prior nullable-dtype issue is resolved by the pandas-aware numeric check in diff_diff/prep.py:L1544-L1549 plus the new regression in tests/test_prep.py:L2466-L2479.
  • The main domain-estimation path in aggregate_survey() is methodology-consistent: zero-padded influence functions preserve the full design, which matches the new registry entry and official survey package guidance for subpopulation/domain estimation rather than naive subsetting. (faculty.washington.edu)
  • Severity P1: the SRS fallback branch still produces scale-dependent SEs and precision weights for replicate-weight designs. In diff_diff/prep.py:L1377-L1380 and diff_diff/prep.py:L1398-L1400, the fallback uses raw full_resolved.weights, while diff_diff/survey.py:L189-L240 explicitly skips normalization whenever replicate weights are present. Under w -> c w, that fallback variance becomes V/c, so *_se and downstream *_precision change even though the design has not. Official survey docs treat weight rescaling in replicate workflows as a numerical option, not a methodological change; by implication, a correct fallback should be invariant to constant rescaling. (r-survey.r-forge.r-project.org)
  • The new replicate fallback test in tests/test_prep.py:L2809-L2857 closes the old branch-coverage complaint, but it does not catch the remaining bug because it never rescales the full and replicate weights. That is especially notable because the library already encodes this invariant for replicate IF variance in tests/test_survey_phase6.py:L767-L811.
  • I could not run the test suite in this sandbox because numpy and pytest are unavailable.

Methodology

  • Severity: P1 | Impact: aggregate_survey()’s affected methods are cell-level design-weighted mean estimation, cell-level variance estimation, and second-stage inverse-variance weighting. The full-design IF path is fine, but both fallback branches in diff_diff/prep.py:L1377-L1380 and diff_diff/prep.py:L1398-L1400 silently make replicate-design SEs depend on arbitrary weight scale because they use raw full_resolved.weights and sum_w. Since replicate designs intentionally preserve raw full and replicate weight scale in diff_diff/survey.py:L189-L240, two equivalent encodings of the same replicate design can return different *_se, *_precision, and therefore different second-stage aweights. Concrete fix: in both fallback branches, normalize the positive full-sample weights locally before applying the documented SRS formula, and add rescaling regressions for both min_n-forced fallback and “design variance unidentifiable” fallback. (r-survey.r-forge.r-project.org)
  • Severity: P3 | Impact: Informational only. The non-fallback path in diff_diff/prep.py:L1382-L1395 matches the registry contract in docs/methodology/REGISTRY.md:L2303-L2326: domain estimates preserve the full design, and the documented SRS fallback plus zero-variance-to-NaN precision behavior are explicitly noted rather than undocumented deviations. Concrete fix: None. (faculty.washington.edu)
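A minimal sketch of the flagged behavior, assuming the fallback divides a weighted variance by the raw weight sum (an illustrative helper, not the actual diff_diff code): without mean-1 normalization the SE shrinks under w -> c w, with it the SE is scale-invariant.

```python
import numpy as np

def srs_fallback_se(y, w, normalize=False):
    # Illustrative SRS-style fallback SE: weighted variance divided by the
    # weight sum. Hypothetical helper, not the diff_diff implementation.
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    if normalize:
        w = w / w.mean()  # mean-1 normalization: sum(w) == n regardless of scale
    ybar = np.average(y, weights=w)
    var = np.average((y - ybar) ** 2, weights=w)  # scale-invariant
    return np.sqrt(var / w.sum())  # scale-DEPENDENT denominator unless normalized

y = np.array([1.0, 2.0, 4.0, 3.0])
w = np.array([1.0, 2.0, 1.0, 2.0])

raw = srs_fallback_se(y, w)
rescaled = srs_fallback_se(y, 5.0 * w)       # same design, w -> 5w: SE shrinks
norm = srs_fallback_se(y, w, normalize=True)
norm_rescaled = srs_fallback_se(y, 5.0 * w, normalize=True)  # unchanged
```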

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. TODO.md does not track the P1 above, and under the review rubric this is not deferrable tech debt.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 | Impact: Informational only. The docs/API exposure are in good shape via diff_diff/__init__.py:L80-L84, diff_diff/__init__.py:L326-L334, and docs/api/prep.rst:L253-L297. The prior test-gap finding is partially addressed by tests/test_prep.py:L2809-L2857, but that regression does not mitigate the P1 because it only checks that fallback executes, not that fallback output is invariant to constant rescaling of full and replicate weights. Concrete fix: extend the new test with a duplicated dataset/design whose full and replicate weights are multiplied by a constant, and assert identical *_se and *_precision in both fallback modes.

Path to Approval

  1. Normalize weights inside both SRS fallback branches in diff_diff/prep.py:L1377-L1380 and diff_diff/prep.py:L1398-L1400 so replicate-design fallback SEs are invariant to arbitrary constant rescaling of full and replicate weights.
  2. Extend tests/test_prep.py:L2809-L2857 to rerun the same replicate-weight min_n fallback after multiplying both full and replicate weights by a constant, and assert unchanged outcome_se and outcome_precision.
  3. Add a second replicate-weight regression that forces the “design-based variance not estimable” fallback branch, applies the same weight rescaling, and asserts unchanged *_se/*_precision.

Both SRS fallback branches now normalize positive weights to mean=1
before computing variance, ensuring SEs are invariant to constant
weight rescaling (important for replicate designs that preserve raw
weight scale). Add scale-invariance regression test with 5x rescaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 7, 2026

/ai-review

@github-actions

github-actions bot commented Apr 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: e7eefa18f83a72ba783b27ef1a115e4ea32eb6bf


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in this re-review. The new aggregate_survey() path is methodology-consistent with full-design domain/subpopulation estimation and with variance estimation via linearization or replicate weights; the only remaining issue I found is a non-blocking regression-test gap around replicate-weight rescaling in the fallback branch. (jstatsoft.org)

Executive Summary

Methodology

  • Severity: P3 | Impact: Informational only. aggregate_survey() changes cell-level survey mean estimation, cell-level variance estimation, and second-stage precision weighting. The implementation at diff_diff/prep.py:1389 and diff_diff/prep.py:1399 matches the registry at docs/methodology/REGISTRY.md:2303 by treating each cell as a domain of the full design, zero-padding the IF outside the cell, and using Taylor linearization or replicate weighting rather than naive physical subsetting. The SRS fallback is explicitly documented in the registry, so under the review rubric it is not a defect. Concrete fix: None. (jstatsoft.org)
  • Severity: P3 | Impact: Informational only. Excluding zero-weight rows from n_valid via diff_diff/prep.py:1355 is consistent with the survey package’s documented svyby(..., unwtd.count) behavior that zero-weight observations are treated as missing for subset counts. Concrete fix: None. (faculty.washington.edu)
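The zero-padding approach referenced above can be sketched as follows (illustrative names, a single-stage design with each unit its own PSU; this is not the diff_diff internals): the domain mean is unchanged by zero-padding, but the influence function spans the full sample so every observation contributes to the variance estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
y = rng.normal(size=n)
w = rng.uniform(1.0, 3.0, size=n)
in_cell = np.arange(n) % 3 == 0   # the "cell" is a domain of the full design

# Domain mean: identical whether we subset or zero-pad
sw = np.sum(w * in_cell)
ybar = np.sum(w * in_cell * y) / sw

# Influence function over the FULL sample, zero outside the cell, so every
# observation (and PSU) still enters the variance estimate.
z = np.where(in_cell, (y - ybar) / sw, 0.0)

# With-replacement linearization variance of the weighted IF total
t = w * z
var = n / (n - 1) * np.sum((t - t.mean()) ** 2)
se = np.sqrt(var)
```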

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 | Impact: The new test class is broad, but the exact regression needed to lock down the old replicate-fallback bug is still indirect. tests/test_prep.py:2809 exercises replicate fallback without rescaling, while tests/test_prep.py:2859 proves rescaling invariance only for non-replicate weights. Because survey workflows treat weight rescaling as a numerical-stability option rather than a methodological change, this is a regression-gap issue, not a current correctness defect. Concrete fix: add replicate-weight tests that multiply both full and replicate weights by a constant and assert unchanged fallback *_se/*_precision in both the min_n branch and the “design-based variance not estimable” branch. (r-survey.r-forge.r-project.org)

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 7, 2026
The RST example references an undefined 'microdata' variable, matching
the existing wide_to_long skip pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit a26e8d5 into main Apr 7, 2026
16 checks passed
@igerber igerber deleted the survey-aggregate branch April 7, 2026 21:10