feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead#3262
julienrbrt wants to merge 3 commits into main
Conversation
Claude finished @julienrbrt's task in 3m 35s. Code Review: feat: add automatic DA retriever walkback when P2P stalls
Summary

This PR adds a self-correction mechanism: when P2P sync stalls and the DA follower has advanced to blocks far ahead of the node's committed height, the DA follower walks backwards through DA heights until it finds blocks contiguous with the node's state. The design is sound at a high level, but there are a few bugs to address before merging.

Bugs

Critical: the walkback tests fail because `p2pStalledFn` is never set in the test fixtures, so the walkback path is unreachable:
| Test | Assertion | Actual result |
|---|---|---|
| `rewinds_when_gap_detected` | `walkbackActive == true`, `LocalDAHeight() == 99` | `false`, `100` |
| `keeps_walking_back_on_empty_height` | `walkbackActive == true`, `LocalDAHeight() == 98` | `false`, `100` |
| `stops_walkback_when_contiguous` | `walkbackActive == false` | `false` ✓ (but for the wrong reason) |
The fix is to add `p2pStalledFn: func() bool { return true }` to those three test cases. The `stops_walkback_when_contiguous` test also needs it to actually verify the contiguous-block stopping condition rather than just the `!p2pStalled` fallthrough.
Logic Issues
Events piped before walkback decision
In `HandleCatchup`, step 2 pipes events at `daHeight` unconditionally, and step 3 then decides to rewind:

```go
// step 2 — already piped
events, err := f.fetchAndPipeHeight(ctx, daHeight)
// step 3 — now rewind
f.subscriber.RewindTo(daHeight - 1)
```

After the rewind, `runCatchup` will re-process `daHeight` and pipe those same events again. The duplicate suppression in `PipeEvent` (`cache.IsHeaderSeen`) handles this gracefully, but it is worth noting. Consider adding a comment clarifying that double-piping is intentional and handled upstream by deduplication.
Potential oscillation
Consider this scenario:
- `nodeHeight = 40`, DA height 100 has block 50 → walkback starts, rewinds to 99
- Heights 99–96 are empty → keeps walking back to 95
- DA height 95 has block 41 → gap closed, `walkbackActive.Store(false)`, pipe block 41
- Catchup continues: 96–99 empty (fine), 100 re-fetched → block 50 again → gap detected (`nodeHeight` still needs 42–49)
- Walkback restarts
If blocks 42–49 exist somewhere in DA between heights 95 and 100, the walkback will find them. If they only existed in P2P and were never submitted to DA, this becomes an infinite re-walk. It might be worth adding a maximum walkback depth (startDAHeight already acts as a floor, but a configurable depth or a log warning after repeated oscillation would help operability).
Code Quality
Misleading no-op in test
```go
// da_follower_test.go
sub := newTestSubscriber(100)
sub.LocalDAHeight() // ensure initialized
```

`NewSubscriber` always initializes `localDAHeight` via `s.localDAHeight.Store(cfg.StartHeight)`, so this call is a no-op. The comment is misleading; remove it.
`atomic.Bool` for `walkbackActive` is over-engineered
`HandleCatchup` is called sequentially by `catchupLoop` — there is no concurrent writer to `walkbackActive`. A plain `bool` field would be sufficient. That said, `atomic.Bool` is harmless and gives a visibility guarantee across goroutine restarts, so this is a minor nit.
Context usage in NodeHeight closure
```go
// syncer.go:228
NodeHeight: func() uint64 {
	h, _ := s.store.Height(s.ctx)
	return h
},
```

When `s.ctx` is cancelled (syncer shutting down), `s.store.Height` returns 0, which makes `nodeHeight = 0`, potentially triggering a spurious walkback. In practice the DA follower is also stopping at that point, so this is benign — but worth a comment.
Minor
- The `RewindTo` docstring could note that it is safe to call concurrently and is a no-op when `daHeight >= localDAHeight`.
- Priority heights fetched in step 1 of `HandleCatchup` don't participate in the walkback gap check (their events are discarded via `if _, err := ...`). This is intentional, but a one-line comment in the code would make it clearer.
Positive Notes
- The CAS loop in `RewindTo` is correct and race-free.
- Gating walkback behind `p2pStalledFn`, `nodeHeightFn != nil`, and `daHeight > startDAHeight` prevents spurious rewinds during normal catchup — good defensive design.
- The `p2pStalled` atomic in `Syncer` is cleanly owned by `p2pWorkerLoop` (write) and read via a closure by the DA follower, with no shared mutable state — good separation.
- Test coverage for the boundary cases (`no_walkback_without_nodeHeightFn`, `no_walkback_at_startDAHeight`) is appreciated.
Overview
Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead
Early WIP; I do not like what the LLM has produced.