F18: fix AHCI CI-level completion drain #307

Open
ryanbreen wants to merge 3 commits into diagnostic-fix/f17-local-wake from probe/f18-ahci-ci-loop

Conversation

@ryanbreen
Owner

Summary

  • Replace AHCI single-shot PORT_IS completion handling with a bounded CI-level drain loop.
  • Defer slot-0 wake publication until the sampled PORT_IS is acknowledged and the port is stable, preventing the next command from being issued while the prior AHCI interrupt remains asserted.
  • Document the F18 Linux audit, final 5/5 Parallels sweep, and cleanup recommendation.

Validation

  • Clean AArch64 build: no warning/error lines in logs/breenix-parallels-cpu0/f18-ahci-ci-loop/build-final.log.
  • Five runs of ./run.sh --parallels --test 60, judged by serial-log criteria:
    • run1 through run5 each reached [init] bsshd started (PID 2)
    • ahci_timeouts=0
    • corruption_markers=0

Note: run.sh exits 1 because the Parallels screenshot helper cannot find the generated VM window; serial logs are the validation source, consistent with previous F-series sweeps.

ryanbreen and others added 3 commits April 16, 2026 11:29
Audit: the F17 Breenix handler was edge-sensitive. It read PORT_IS, acknowledged PORT_IS/HBA_IS through ack_port_interrupt(), then read PORT_CI once and completed at most the slot implied by that single interrupt-status sample. A completion that cleared PORT_CI around that one-shot status sample could leave PORT_CI clear while no waiter was woken.

Linux v6.8 uses the level-sensitive model in drivers/ata/libahci.c: ahci_port_intr() acknowledges PORT_IRQ_STAT, ahci_handle_port_interrupt() delegates command completion to ahci_qc_complete(), and ahci_qc_complete() reads PORT_CMD_ISSUE/PORT_SCR_ACT into qc_active before calling ata_qc_complete_multiple(). That derives completion from hardware-active state rather than relying on a single interrupt edge; SERR/error handling remains separate before normal command completion.
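A minimal sketch of the level-sensitive idea described above, for reviewers who want it spelled out: completion is derived by comparing the slots software believes are in flight against the slots hardware still reports as active. The function name `done_mask` and the choice to reject hardware-active bits that software never issued are assumptions made for illustration; this is neither the Linux code nor the Breenix handler.

```rust
/// Derive completed slots from hardware-active state instead of an interrupt edge.
///
/// `sw_active`: slots software has issued and not yet completed (hypothetical bookkeeping).
/// `hw_active`: a fresh sample of the hardware-active register (PORT_CI-style bitmask).
///
/// Returns `Ok(mask)` with the slots that finished, or `Err(mask)` with the
/// offending bits if hardware reports a slot active that software never issued.
pub fn done_mask(sw_active: u32, hw_active: u32) -> Result<u32, u32> {
    // Any bit that differs between the two views has changed state.
    let changed = sw_active ^ hw_active;

    // A changed bit that is still set in hardware was never issued by software.
    let illegal = changed & hw_active;
    if illegal != 0 {
        return Err(illegal);
    }

    // Every remaining changed bit was issued by software and cleared by hardware.
    Ok(changed)
}
```

Because the result depends only on the two bitmasks, a completion that lands between two interrupt-status samples is still picked up on the next call.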

Fix: loop each active AHCI port up to eight times, compute completed slots as PORT_ACTIVE_MASK & !PORT_CI, clear active bits atomically, acknowledge sampled PORT_IS, then re-read PORT_IS and PORT_CI and continue if the port reasserted or another active slot has cleared. Slot-0 wake publication is deferred until after the port is stable, preventing the woken waiter from issuing the next command while the prior AHCI interrupt line remains asserted. The existing single-active-slot interrupt fallback is preserved, and CI loop iterations are emitted as AHCI_RING site=CI_LOOP with token=<iteration>.
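The drain loop described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `PortRegs` trait, `ScriptedPort`, `DrainResult`, and `drain_port` are hypothetical names, the register model is a simulation, the `AHCI_RING` trace is omitted, and what the real handler does when the bound is hit without stabilizing is not stated here, so the sketch just reports `stable = false`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Bound taken from the commit message: at most eight passes per port.
const MAX_CI_LOOPS: u32 = 8;

/// Hypothetical register access for one port.
pub trait PortRegs {
    fn read_is(&self) -> u32;
    fn read_ci(&self) -> u32;
    /// Write-1-to-clear acknowledgement of the sampled interrupt status.
    fn ack_is(&mut self, bits: u32);
}

pub struct DrainResult {
    /// Slots completed during this drain; the caller wakes waiters only after return.
    pub completed: u32,
    pub iterations: u32,
    /// True if the port showed no pending status and no newly cleared slot on exit.
    pub stable: bool,
}

pub fn drain_port<R: PortRegs>(regs: &mut R, active: &AtomicU32) -> DrainResult {
    let (mut completed, mut iterations, mut stable) = (0u32, 0u32, false);
    while iterations < MAX_CI_LOOPS {
        iterations += 1;
        let is = regs.read_is();
        let ci = regs.read_ci();
        // Completed = issued by software and no longer pending in hardware.
        let done = active.load(Ordering::Acquire) & !ci;
        // Clear the active bits atomically; count only bits that were really set.
        let prev = active.fetch_and(!done, Ordering::AcqRel);
        completed |= prev & done;
        // Acknowledge exactly the status bits that were sampled.
        regs.ack_is(is);
        // Re-read: continue if the port reasserted or another active slot cleared.
        let reasserted = regs.read_is() != 0;
        let more_done = active.load(Ordering::Acquire) & !regs.read_ci() != 0;
        if !reasserted && !more_done {
            stable = true;
            break;
        }
    }
    DrainResult { completed, iterations, stable }
}

/// Simulated port: each acknowledgement lets one scripted late completion land.
pub struct ScriptedPort {
    pub is: u32,
    pub ci: u32,
    pub late: Vec<u32>,
}

impl PortRegs for ScriptedPort {
    fn read_is(&self) -> u32 {
        self.is
    }
    fn read_ci(&self) -> u32 {
        self.ci
    }
    fn ack_is(&mut self, bits: u32) {
        self.is &= !bits;
        if let Some(mask) = self.late.pop() {
            self.ci &= !mask; // hardware finishes these slots
            self.is |= 1; // and reasserts interrupt status
        }
    }
}
```

In this sketch wake publication is deferred simply by returning the `completed` mask: the caller wakes waiters only after `drain_port` returns, so a woken waiter cannot issue the next command while the loop is still acknowledging status.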

Co-authored-by: Ryan Breen <ryan@ryanbreen.com>

Co-authored-by: Claude Code <noreply@anthropic.com>

Five final ./run.sh --parallels --test 60 runs reached bsshd with zero AHCI timeouts and zero corruption markers by serial-log criteria. The run.sh process still exits 1 because the Parallels screenshot helper cannot find the generated VM window, matching prior F-series sweeps where serial output is the validation source.

Co-authored-by: Ryan Breen <ryan@ryanbreen.com>

Co-authored-by: Claude Code <noreply@anthropic.com>

Record the F18 audit, Linux AHCI reference, CI-level completion fix, final 5/5 Parallels sweep, and cleanup recommendation. The exit report is preserved with the run artifacts for validator handoff.

Co-authored-by: Ryan Breen <ryan@ryanbreen.com>

Co-authored-by: Claude Code <noreply@anthropic.com>