feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead by julienrbrt · Pull Request #3262 · evstack/ev-node

julienrbrt · 2026-04-16T17:38:55Z

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP; I do not like what the LLM has produced.


coderabbitai · 2026-04-16T17:39:05Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31ea96d4-173d-4e8d-b3f9-747b6fd620c3


github-actions · 2026-04-16T17:39:09Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Apr 16, 2026, 5:40 PM

claude · 2026-04-16T17:39:12Z

Claude finished @julienrbrt's task in 3m 35s

Code Review: feat: add automatic DA retriever walkback when P2P stalls

Summary

This PR adds a self-correction mechanism: when P2P sync stalls and the DA follower has advanced to blocks far ahead of the node's committed height, the DA follower walks backwards through DA heights until it finds blocks contiguous with the node's state. The design is sound at a high level, but there are a few bugs to address before merging.
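To make the mechanism in the summary concrete, here is a minimal sketch of the walkback search. It is not the PR's code: `walkbackTarget`, the `blocksAt` map, and the contiguity rule (a block height of at most `nodeHeight + 1`) are assumptions chosen for illustration.

```go
package main

import "fmt"

// walkbackTarget is a hypothetical helper, not part of ev-node. It steps down
// from daHeight until it reaches a DA height that holds a block contiguous
// with nodeHeight (assumed here to mean block <= nodeHeight+1), and stops at
// floor if no such height exists.
func walkbackTarget(blocksAt map[uint64][]uint64, daHeight, floor, nodeHeight uint64) uint64 {
	for h := daHeight; h > floor; h-- {
		for _, b := range blocksAt[h] {
			if b <= nodeHeight+1 {
				return h
			}
		}
	}
	return floor
}

func main() {
	// DA height 100 carries block 50, DA height 95 carries block 41.
	blocksAt := map[uint64][]uint64{100: {50}, 95: {41}}
	// The node has committed block 40, so block 41 is the next one it needs.
	fmt.Println(walkbackTarget(blocksAt, 100, 1, 40))
}
```

The real follower rewinds one DA height per `HandleCatchup` call rather than searching in a loop; the loop here only compresses that behaviour into one function.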

Bugs

Critical: Walkback tests fail because `p2pStalledFn` is never set

In da_follower_test.go, three of the five walkback tests construct a daFollower struct directly without setting p2pStalledFn. This matters because HandleCatchup evaluates:

p2pStalled := f.p2pStalledFn != nil && f.p2pStalledFn()
if p2pStalled && f.nodeHeightFn != nil && daHeight > f.startDAHeight {
    // walkback logic — NEVER ENTERED when p2pStalledFn is nil
} else if !p2pStalled {
    f.walkbackActive.Store(false)  // ← always runs when p2pStalledFn is nil
}

When p2pStalledFn == nil, p2pStalled = false, so the walkback block is never entered and walkbackActive is always cleared.

Affected tests and expected failures:

Test	Assertion	Actual result
`rewinds_when_gap_detected`	`walkbackActive == true`, `LocalDAHeight() == 99`	`false`, `100`
`keeps_walking_back_on_empty_height`	`walkbackActive == true`, `LocalDAHeight() == 98`	`false`, `100`
`stops_walkback_when_contiguous`	`walkbackActive == false`	`false` ✓ (but for wrong reason)

The fix is to add p2pStalledFn: func() bool { return true } to those three test cases. The stops_walkback_when_contiguous test also needs it to actually verify the contiguous-block stopping condition rather than just the !p2pStalled fallthrough.
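The effect of the missing field can be reproduced with a stripped-down model of the gate quoted above. The `follower` type and `shouldWalkback` method below are hypothetical stand-ins for `daFollower` and the relevant part of `HandleCatchup`; only the condition itself is taken from the review.

```go
package main

import "fmt"

// follower is a hypothetical stand-in for daFollower that keeps only the
// fields the walkback gate reads.
type follower struct {
	p2pStalledFn   func() bool
	nodeHeightFn   func() uint64
	startDAHeight  uint64
	walkbackActive bool
}

// shouldWalkback mirrors the quoted gate: a nil p2pStalledFn counts as
// "not stalled", which skips the walkback and clears walkbackActive.
func (f *follower) shouldWalkback(daHeight uint64) bool {
	p2pStalled := f.p2pStalledFn != nil && f.p2pStalledFn()
	if p2pStalled && f.nodeHeightFn != nil && daHeight > f.startDAHeight {
		return true
	}
	if !p2pStalled {
		f.walkbackActive = false
	}
	return false
}

func main() {
	nodeHeight := func() uint64 { return 40 }

	// As in the failing tests: p2pStalledFn left unset.
	unset := &follower{nodeHeightFn: nodeHeight, startDAHeight: 1, walkbackActive: true}
	fmt.Println(unset.shouldWalkback(100), unset.walkbackActive)

	// With the suggested fix applied.
	fixed := &follower{
		p2pStalledFn:  func() bool { return true },
		nodeHeightFn:  nodeHeight,
		startDAHeight: 1,
	}
	fmt.Println(fixed.shouldWalkback(100))
}
```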


Logic Issues

Events piped before walkback decision

In HandleCatchup, step 2 pipes events at daHeight unconditionally, and step 3 then decides to rewind:

// step 2 — already piped
events, err := f.fetchAndPipeHeight(ctx, daHeight)

// step 3 — now rewind
f.subscriber.RewindTo(daHeight - 1)

After the rewind, runCatchup will re-process daHeight and pipe those same events again. The duplicate suppression in PipeEvent (cache.IsHeaderSeen) handles this gracefully, but it is worth noting. Consider adding a comment clarifying that double-piping is intentional and handled upstream by deduplication.
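As a rough illustration of why the second pipe is harmless, the sketch below models a seen-set in front of the event sink. The `pipe` type and `pipeEvent` method are hypothetical; the actual `PipeEvent` and `cache.IsHeaderSeen` may track more than the block height.

```go
package main

import "fmt"

// pipe is a hypothetical model of an event sink guarded by a seen-set,
// loosely standing in for PipeEvent plus its header cache.
type pipe struct {
	seen    map[uint64]bool
	applied []uint64
}

// pipeEvent forwards a block height the first time it is offered and
// reports false for any repeat.
func (p *pipe) pipeEvent(blockHeight uint64) bool {
	if p.seen[blockHeight] {
		return false
	}
	p.seen[blockHeight] = true
	p.applied = append(p.applied, blockHeight)
	return true
}

func main() {
	p := &pipe{seen: map[uint64]bool{}}
	fmt.Println(p.pipeEvent(50)) // first fetch of the DA height
	fmt.Println(p.pipeEvent(50)) // same DA height re-fetched after the rewind
	fmt.Println(len(p.applied))
}
```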

Potential oscillation

Consider this scenario:

1. nodeHeight = 40, DA height 100 has block 50 → walkback starts, rewinds to 99
2. Heights 99–96 are empty → keeps walking back to 95
3. DA height 95 has block 41 → gap closed, walkbackActive.Store(false), pipe block 41
4. Catchup continues: 96–99 empty (fine), 100 re-fetched → block 50 again → gap detected (nodeHeight still needs 42–49)
5. Walkback restarts

If blocks 42–49 exist somewhere in DA between heights 95 and 100, the walkback will find them. If they only existed in P2P and were never submitted to DA, this becomes an infinite re-walk. It might be worth adding a maximum walkback depth (startDAHeight already acts as a floor, but a configurable depth or a log warning after repeated oscillation would help operability).
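One possible shape for such a guard is sketched below. Everything in it is an assumption: `walkbackGuard`, its `maxRestarts` limit, and the idea of resetting the counter when the node height advances are illustrative and do not appear in the PR.

```go
package main

import "fmt"

// walkbackGuard is a hypothetical limiter that bounds how often a walkback
// may restart while the node height is not advancing.
type walkbackGuard struct {
	maxRestarts    int
	lastNodeHeight uint64
	restarts       int
}

// allow reports whether another walkback may start at the given node height.
// Any progress (a higher node height) resets the restart counter.
func (g *walkbackGuard) allow(nodeHeight uint64) bool {
	if nodeHeight > g.lastNodeHeight {
		g.lastNodeHeight = nodeHeight
		g.restarts = 0
	}
	if g.restarts >= g.maxRestarts {
		return false // caller could log a warning here instead of rewinding
	}
	g.restarts++
	return true
}

func main() {
	g := &walkbackGuard{maxRestarts: 2}
	fmt.Println(g.allow(41)) // first walkback at node height 41
	fmt.Println(g.allow(41)) // second walkback, still stuck at 41
	fmt.Println(g.allow(41)) // limit reached, refuse to re-walk
	fmt.Println(g.allow(42)) // node advanced, counter resets
}
```

A counter keyed on node height distinguishes a genuine oscillation from a walkback that is making slow progress, which a fixed depth limit alone would not.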

Code Quality

Misleading no-op in test

// da_follower_test.go
sub := newTestSubscriber(100)
sub.LocalDAHeight() // ensure initialized

NewSubscriber always initializes localDAHeight via s.localDAHeight.Store(cfg.StartHeight), so this call is a no-op and its comment is misleading. Remove both the call and the comment.

`atomic.Bool` for `walkbackActive` is over-engineered

HandleCatchup is called sequentially by catchupLoop — there is no concurrent writer to walkbackActive. A plain bool field would be sufficient. That said, atomic.Bool is harmless and gives a visibility guarantee across goroutine restarts, so this is a minor nit.

Context usage in `NodeHeight` closure

// syncer.go:228
NodeHeight: func() uint64 {
    h, _ := s.store.Height(s.ctx)
    return h
},

When s.ctx is cancelled (syncer shutting down), s.store.Height presumably returns an error, which this closure discards, so the follower sees nodeHeight = 0 and may trigger a spurious walkback. In practice the DA follower is also stopping at that point, so this is benign, but it is worth a comment.
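If a comment is not considered enough, one alternative is to surface the failure instead of collapsing it into 0. The sketch below assumes a store whose `Height` method has the signature implied by the quoted closure; `heightStore`, `fakeStore`, `nodeHeight`, and `contexts` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// heightStore is a hypothetical interface matching the Height call used in
// the quoted closure.
type heightStore interface {
	Height(ctx context.Context) (uint64, error)
}

// fakeStore is a test double that fails once its context is cancelled.
type fakeStore struct{ height uint64 }

func (s fakeStore) Height(ctx context.Context) (uint64, error) {
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	return s.height, nil
}

// nodeHeight returns ok=false when the store errors, so the caller can skip
// the walkback check rather than compare against a misleading 0.
func nodeHeight(ctx context.Context, s heightStore) (height uint64, ok bool) {
	h, err := s.Height(ctx)
	if err != nil {
		return 0, false
	}
	return h, true
}

// contexts returns one live and one already-cancelled context for the demo.
func contexts() (live, done context.Context) {
	live = context.Background()
	done, cancel := context.WithCancel(live)
	cancel()
	return live, done
}

func main() {
	live, done := contexts()
	fmt.Println(nodeHeight(live, fakeStore{height: 40}))
	fmt.Println(nodeHeight(done, fakeStore{height: 40}))
}
```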

Minor

The RewindTo docstring could note that it is safe to call concurrently and is a no-op when daHeight >= localDAHeight.
Priority heights fetched in step 1 of HandleCatchup don't participate in the walkback gap check (their events are discarded via if _, err := ...). This is intentional, but a one-line comment in the code would make it clearer.
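For reference, the semantics suggested for the RewindTo docstring can be sketched with a compare-and-swap loop over an atomic counter. This is an illustration only: the `subscriber` type and the lowercase `rewindTo` (including its boolean return value) are assumptions, not the signature used in the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// subscriber is a hypothetical stand-in holding only the local DA height.
type subscriber struct {
	localDAHeight atomic.Uint64
}

// rewindTo lowers localDAHeight to target and reports whether it changed
// anything. It is a no-op when target >= the current height, and the
// compare-and-swap loop keeps it safe under concurrent callers.
func (s *subscriber) rewindTo(target uint64) bool {
	for {
		cur := s.localDAHeight.Load()
		if target >= cur {
			return false
		}
		if s.localDAHeight.CompareAndSwap(cur, target) {
			return true
		}
		// Another goroutine changed the height in between; reload and retry.
	}
}

func main() {
	s := &subscriber{}
	s.localDAHeight.Store(100)
	fmt.Println(s.rewindTo(99), s.localDAHeight.Load())
	fmt.Println(s.rewindTo(120), s.localDAHeight.Load())
}
```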

Positive Notes

The CAS loop in RewindTo is correct and race-free.
Gating walkback behind all three conditions (p2pStalledFn reporting a stall, nodeHeightFn != nil, and daHeight > startDAHeight) prevents spurious rewinds during normal catchup, which is good defensive design.
The p2pStalled atomic in Syncer is cleanly owned by p2pWorkerLoop (write) and read via a closure by the DA follower, with no shared mutable state — good separation.
Test coverage for the boundary cases (no_walkback_without_nodeHeightFn, no_walkback_at_startDAHeight) is appreciated.

feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead (1929851)

github-actions bot assigned julienrbrt Apr 16, 2026

julienrbrt changed the title ~~feat: add automatic DA retriever walkback when P2P stalls and DA blocks~~ feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead Apr 16, 2026

julienrbrt added 2 commits April 16, 2026 19:40

add cl (9b7ecce)

Merge branch 'main' into julien/edge-case-p2p-stall (e9591c6)

julienrbrt wants to merge 3 commits into main from julien/edge-case-p2p-stall
