Skip to content

feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead#3262

Draft
julienrbrt wants to merge 3 commits intomainfrom
julien/edge-case-p2p-stall
Draft

feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead#3262
julienrbrt wants to merge 3 commits intomainfrom
julien/edge-case-p2p-stall

Conversation

@julienrbrt
Copy link
Copy Markdown
Member

@julienrbrt julienrbrt commented Apr 16, 2026

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP, i do not like what the LLM has produced.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31ea96d4-173d-4e8d-b3f9-747b6fd620c3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/edge-case-p2p-stall

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedApr 16, 2026, 5:40 PM

@julienrbrt julienrbrt changed the title feat: add automatic DA retriever walkback when P2P stalls and DA blocks feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead Apr 16, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 16, 2026

Claude finished @julienrbrt's task in 3m 35s —— View job


Code Review: feat: add automatic DA retriever walkback when P2P stalls

  • Gather context and understand the PR
  • Read changed files
  • Review DA follower logic and walkback mechanism
  • Review subscriber changes
  • Review syncer changes
  • Review tests
  • Post comprehensive review

Summary

This PR adds a self-correction mechanism: when P2P sync stalls and the DA follower has advanced to blocks far ahead of the node's committed height, the DA follower walks backwards through DA heights until it finds blocks contiguous with the node's state. The design is sound at a high level, but there are a few bugs to address before merging.


Bugs

Critical: Walkback tests fail because p2pStalledFn is never set

In da_follower_test.go, three of the five walkback tests construct a daFollower struct directly without setting p2pStalledFn. This matters because HandleCatchup evaluates:

p2pStalled := f.p2pStalledFn != nil && f.p2pStalledFn()
if p2pStalled && f.nodeHeightFn != nil && daHeight > f.startDAHeight {
    // walkback logic — NEVER ENTERED when p2pStalledFn is nil
} else if !p2pStalled {
    f.walkbackActive.Store(false)  // ← always runs when p2pStalledFn is nil
}

When p2pStalledFn == nil, p2pStalled = false, so the walkback block is never entered and walkbackActive is always cleared.

Affected tests and expected failures:

Test Assertion Actual result
rewinds_when_gap_detected walkbackActive == true, LocalDAHeight() == 99 false, 100
keeps_walking_back_on_empty_height walkbackActive == true, LocalDAHeight() == 98 false, 100
stops_walkback_when_contiguous walkbackActive == false false ✓ (but for wrong reason)

The fix is to add p2pStalledFn: func() bool { return true } to those three test cases. The stops_walkback_when_contiguous test also needs it to actually verify the contiguous-block stopping condition rather than just the !p2pStalled fallthrough.

Fix this →


Logic Issues

Events piped before walkback decision

In HandleCatchup, step 2 pipes events at daHeight unconditionally, and step 3 then decides to rewind:

// step 2 — already piped
events, err := f.fetchAndPipeHeight(ctx, daHeight)

// step 3 — now rewind
f.subscriber.RewindTo(daHeight - 1)

After the rewind, runCatchup will re-process daHeight and pipe those same events again. The duplicate suppression in PipeEvent (cache.IsHeaderSeen) handles this gracefully, but it is worth noting. Consider adding a comment clarifying that double-piping is intentional and handled upstream by deduplication.

Potential oscillation

Consider this scenario:

  1. nodeHeight = 40, DA height 100 has block 50 → walkback starts, rewinds to 99
  2. Heights 99–96 are empty → keeps walking back to 95
  3. DA height 95 has block 41 → gap closed, walkbackActive.Store(false), pipe block 41
  4. Catchup continues: 96–99 empty (fine), 100 re-fetched → block 50 again → gap detected (nodeHeight still needs 42–49)
  5. Walkback restarts

If blocks 42–49 exist somewhere in DA between heights 95 and 100, the walkback will find them. If they only existed in P2P and were never submitted to DA, this becomes an infinite re-walk. It might be worth adding a maximum walkback depth (startDAHeight already acts as a floor, but a configurable depth or a log warning after repeated oscillation would help operability).


Code Quality

Misleading no-op in test

// da_follower_test.go
sub := newTestSubscriber(100)
sub.LocalDAHeight() // ensure initialized

NewSubscriber always initializes localDAHeight via s.localDAHeight.Store(cfg.StartHeight), so this call is a no-op. The comment is misleading; remove it.

atomic.Bool for walkbackActive is over-engineered

HandleCatchup is called sequentially by catchupLoop — there is no concurrent writer to walkbackActive. A plain bool field would be sufficient. That said, atomic.Bool is harmless and gives a visibility guarantee across goroutine restarts, so this is a minor nit.

Context usage in NodeHeight closure

// syncer.go:228
NodeHeight: func() uint64 {
    h, _ := s.store.Height(s.ctx)
    return h
},

When s.ctx is cancelled (syncer shutting down), s.store.Height returns 0, which makes nodeHeight = 0, potentially triggering a spurious walkback. In practice the DA follower is also stopping at that point, so this is benign — but worth a comment.


Minor

  • The RewindTo docstring could note that it is safe to call concurrently and is a no-op when daHeight >= localDAHeight.
  • Priority heights fetched in step 1 of HandleCatchup don't participate in the walkback gap check (their events are discarded via if _, err := ...). This is intentional, but a one-line comment in the code would make it clearer.

Positive Notes

  • The CAS loop in RewindTo is correct and race-free.
  • Gating walkback behind both p2pStalledFn and nodeHeightFn != nil and daHeight > startDAHeight prevents spurious rewinds during normal catchup — good defensive design.
  • The p2pStalled atomic in Syncer is cleanly owned by p2pWorkerLoop (write) and read via a closure by the DA follower, with no shared mutable state — good separation.
  • Test coverage for the boundary cases (no_walkback_without_nodeHeightFn, no_walkback_at_startDAHeight) is appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant