fix(puller): non-blocking peer disconnect and sync error backoff #5423

Open

misaakidis wants to merge 2 commits into master from fix/puller-disconnect-backoff
Conversation

@misaakidis (Member) commented Apr 5, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

This PR fixes two liveness bugs in the puller.

Bug 1 — topology change holds syncPeersMtx for seconds

onChange held syncPeersMtx while calling disconnectPeer, which in turn called peer.stop():

onChange()
  lock(syncPeersMtx)
  → disconnectPeer(peer)
    → peer.stop()
      → cancel per-bin contexts
      → peer.wg.Wait()          ← blocks until all Sync() calls return

Sync() calls ReadMsgWithContext, which only unblocks after the stream times out (pageTimeout = 1s per bin). For a node with B active bins syncing to a peer that is being disconnected during a radius decrease, disconnectPeer held the lock for up to B seconds. Any subsequent topology-change notification queued behind the same manage() loop was blocked for that entire duration, causing the sync map to diverge from the live topology.

Fix: split peer.stop() into two methods:

| Method         | Behaviour                                           | Used by              |
| -------------- | --------------------------------------------------- | -------------------- |
| `cancelBins()` | cancel all per-bin contexts, clear the map, no wait | `disconnectPeer`     |
| `stop()`       | `cancelBins()` + `peer.wg.Wait()`                   | `Close()` (shutdown) |

disconnectPeer now calls cancelBins(): the peer is removed from the sync map immediately and its goroutines drain in the background. Close() still calls p.wg.Wait() across all peers, so shutdown correctness is unchanged.

Bug 2 — tight CPU spin on non-fatal sync errors

When Sync() returns a non-fatal error (stream reset, protocol error, timeout), the goroutine logged the error and fell through to limiter.WaitN(ctx, count) with count=0. WaitN(ctx, 0) returns immediately, so the goroutine looped back and retried with no delay. Any persistent non-fatal error caused a continuous CPU spin until the peer disconnected or the context was cancelled.

Fix: add a syncRetryBackoff = 1s sleep with a ctx.Done() escape after any non-fatal sync error:

select {
case <-time.After(syncRetryBackoff):
case <-ctx.Done():
    return
}

This bounds the retry rate to ≤ 1 call/s per goroutine under persistent errors.

Related Issue

Both bugs were surfaced during pull-sync optimisation work. The topology-freeze bug is most visible at depth transitions where many peers disconnect in rapid succession. See also: fix/pullsync-interval-advancement.

gacevicljubisa and others added 2 commits March 18, 2026 08:35
When onChange held syncPeersMtx and called disconnectPeer, the inner
peer.stop() cancelled per-bin goroutine contexts and then called
peer.wg.Wait(). Live goroutines blocked in Sync() → ReadMsgWithContext
and only unblocked after pageTimeout (1s) per stream. For N peers
disconnecting during a radius decrease, the outer lock was held for up
to N×1s, stalling all queued topology-change notifications for the same
duration.

When syncer.Sync returned a non-fatal error (connection reset, protocol
error, stream timeout), the goroutine fell through to limiter.WaitN with
count=0 and looped immediately. Any persistent non-fatal error caused a
tight CPU spin until the peer disconnected or context was cancelled.

Split syncPeer.stop() into cancelBins() (cancel all per-bin contexts,
clear the map, no wait) and stop() (cancel + wait, used only in
Close()). disconnectPeer now calls cancelBins(): the peer is removed
from the sync map immediately and its goroutines drain in the
background. Close() already calls p.wg.Wait(), so shutdown correctness
is unchanged.

Add syncRetryBackoff (1s) with a ctx.Done()-escape after any non-fatal
sync error before the next retry. This bounds the retry rate to ≤1/s
per goroutine under persistent errors.
@acud (Contributor) commented Apr 6, 2026

I think bug 2 is probably going to be easier to reason about. It would be beneficial to split changes into small PRs rather than cramming them together; large PRs make reviews more difficult, especially for core protocol. The same goes for the other PR.

@misaakidis (Member Author) commented Apr 8, 2026

I've noted the preference for atomic PRs for the future. For this specific case, would it be okay to leave this PR as is? These liveness bugs are somewhat coupled and validating them together ensures the fix is cohesive.

Btw, I have closed the other PR #5424 as the pullsync interval advancement behavior is correct in the setting of pulling from all peers (as per v2.7.1).

@acud (Contributor) commented Apr 8, 2026

> I've noted the preference for atomic PRs for the future. For this specific case, would it be okay to leave this PR as is? These liveness bugs are somewhat coupled and validating them together ensures the fix is cohesive.
>
> Btw, I have closed the other PR #5424 as the pullsync interval advancement behavior is correct in the setting of pulling from all peers (as per v2.7.1.)

Yeah, all good, leave it as it is; however, it needs a rebase as there are merge conflicts.
Thanks for closing the other PR. I'll try to get to this one tomorrow.

Comment thread pkg/puller/puller_test.go
// TestSyncErrorBackoff verifies that a non-fatal sync error is followed by a
// backoff before the next retry, bounding the retry rate to roughly 1/s.
func TestSyncErrorBackoff(t *testing.T) {
t.Parallel()
Contributor:
I wonder if we could use synctest here instead of the sleeps, spinlock, etc.

Comment thread go.mod
github.com/caddyserver/certmagic v0.21.6
github.com/coreos/go-semver v0.3.0
github.com/ethereum/go-ethereum v1.15.11
github.com/ethersphere/batch-archive v0.0.6
Contributor:
How is this related to this PR?

Member Author:

This dependency bump is not related to the puller/backoff changes in this PR.
It came from another commit: 61fab37 (chore: update postage snapshot to v0.0.6 (#5401)), which is the commit tagged v2.7.1.

Contributor:

I see, yes. Not sure why this commit is on this branch. It would be a good habit to work straight off master so that you can then rebase easily.

Comment thread .github/workflows/beekeeper.yml (resolved)
@misaakidis misaakidis force-pushed the fix/puller-disconnect-backoff branch 2 times, most recently from 8dd0d97 to 8f23985 Compare April 15, 2026 13:17
@misaakidis (Member Author) commented Apr 15, 2026

Created a separate validation branch for the CI experiment: ci/radius-decrease-validation. This PR now stays focused on the puller fix and its direct test coverage only.
