fix(puller): non-blocking peer disconnect and sync error backoff#5423
misaakidis wants to merge 2 commits into master from
Conversation
When onChange held syncPeersMtx and called disconnectPeer, the inner peer.stop() cancelled per-bin goroutine contexts and then called peer.wg.Wait(). Live goroutines blocked in Sync() → ReadMsgWithContext and only unblocked after pageTimeout (1s) per stream. For N peers disconnecting during a radius decrease, the outer lock was held for up to N×1s, stalling all queued topology-change notifications for the same duration.

When syncer.Sync returned a non-fatal error (connection reset, protocol error, stream timeout), the goroutine fell through to limiter.WaitN with count=0 and looped immediately. Any persistent non-fatal error caused a tight CPU spin until the peer disconnected or the context was cancelled.

Fix 1: split syncPeer.stop() into cancelBins() (cancel all per-bin contexts, clear the map, no wait) and stop() (cancel + wait, used only in Close()). disconnectPeer now calls cancelBins(): the peer is removed from the sync map immediately and its goroutines drain in the background. Close() already calls p.wg.Wait(), so shutdown correctness is unchanged.

Fix 2: add syncRetryBackoff (1s) with a ctx.Done() escape after any non-fatal sync error before the next retry. This bounds the retry rate to ≤ 1/s per goroutine under persistent errors.
I think bug 2 is probably going to be easier to reason about. It would be beneficial to split changes into small PRs rather than cramming them together - it makes reviews more difficult, especially for core protocol. The same goes for the other PR.
I've noted the preference for atomic PRs for the future. For this specific case, would it be okay to leave this PR as is? These liveness bugs are somewhat coupled, and validating them together ensures the fix is cohesive. Btw, I have closed the other PR #5424, as the pullsync interval advancement behavior is correct in the setting of pulling from all peers (as per v2.7.1).
Yeah all good leave it as it is, however it needs a rebase as there are merge conflicts. |
```go
// TestSyncErrorBackoff verifies that a non-fatal sync error is followed by a
// backoff before the next retry, bounding the retry rate to roughly 1/s.
func TestSyncErrorBackoff(t *testing.T) {
	t.Parallel()
```
I wonder if we could use synctest here instead of the sleeps, spinlock etc.
```
github.com/caddyserver/certmagic v0.21.6
github.com/coreos/go-semver v0.3.0
github.com/ethereum/go-ethereum v1.15.11
github.com/ethersphere/batch-archive v0.0.6
```
How is this related to this PR?
I see, yes. Not sure why this commit is on this branch. It would be a good habit to work straight off master so that you can then rebase easily.
Force-pushed from 8dd0d97 to 8f23985
Created a separate validation branch for the CI experiment: #ci/radius-decrease-validation . This PR now stays focused on the puller fix and its direct test coverage only.
Description
Two liveness bugs in the puller are fixed.
Bug 1 — topology change holds syncPeersMtx for seconds

onChange holds syncPeersMtx while calling disconnectPeer, which called peer.stop(). Sync() calls ReadMsgWithContext, which only unblocks after the stream times out (pageTimeout = 1s per bin). For a node with B active bins syncing to a peer that is being disconnected during a radius decrease, disconnectPeer held the lock for up to B seconds. Any subsequent topology-change notification queued behind the same manage() loop was blocked for that entire duration, causing the sync map to diverge from the live topology.

Fix: split peer.stop() into two methods:

- cancelBins(): cancels all per-bin contexts without waiting; called by disconnectPeer
- stop(): cancelBins() + peer.wg.Wait(); called only by Close() (shutdown)

disconnectPeer now calls cancelBins(): the peer is removed from the sync map immediately and its goroutines drain in the background. Close() still calls p.wg.Wait() across all peers, so shutdown correctness is unchanged.
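The split above can be condensed into a runnable sketch. The struct fields (binCancelFuncs, the inner mutex) are assumptions based on this description, not the puller's actual layout; the point is that cancelBins returns immediately while stop blocks until the goroutines drain.

```go
package main

import (
	"fmt"
	"sync"
)

// syncPeer is a minimal stand-in for the puller's per-peer state: one cancel
// func per syncing bin, and a WaitGroup tracking the per-bin goroutines.
type syncPeer struct {
	mtx            sync.Mutex
	binCancelFuncs map[uint8]func() // assumed name: per-bin context cancels
	wg             sync.WaitGroup
}

// cancelBins cancels every per-bin context and clears the map, but does NOT
// wait for the goroutines: safe to call while holding an outer lock.
func (p *syncPeer) cancelBins() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for bin, cancel := range p.binCancelFuncs {
		cancel()
		delete(p.binCancelFuncs, bin)
	}
}

// stop cancels and then waits; used only on shutdown (Close), where blocking
// until the goroutines drain is the desired behaviour.
func (p *syncPeer) stop() {
	p.cancelBins()
	p.wg.Wait()
}

func main() {
	p := &syncPeer{binCancelFuncs: map[uint8]func(){}}
	for bin := uint8(0); bin < 4; bin++ {
		stop := make(chan struct{})
		p.binCancelFuncs[bin] = func() { close(stop) }
		p.wg.Add(1)
		go func() { defer p.wg.Done(); <-stop }()
	}
	p.cancelBins() // returns immediately; goroutines drain in the background
	p.wg.Wait()    // what Close() does after cancelling: block until drained
	fmt.Println("drained") // prints: drained
}
```

Under the old code, the wg.Wait() call sat inside the same method as the cancels, so a caller holding syncPeersMtx paid the full drain time; with the split, only shutdown pays it.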
Bug 2 — tight CPU spin on non-fatal sync errors

When Sync() returns a non-fatal error (stream reset, protocol error, timeout), the goroutine logged the error and fell through to limiter.WaitN(ctx, count) with count=0. WaitN(ctx, 0) returns immediately, so the goroutine looped back and retried with no delay. Any persistent non-fatal error caused a continuous CPU spin until the peer disconnected or the context was cancelled.

Fix: add a syncRetryBackoff = 1s sleep with a ctx.Done() escape after any non-fatal sync error. This bounds the retry rate to ≤ 1 call/s per goroutine under persistent errors.
Related Issue
Both bugs were surfaced during pull-sync optimisation work. The topology-freeze bug is most visible at depth transitions where many peers disconnect in rapid succession. See also: fix/pullsync-interval-advancement.