fix(puller): non-blocking peer disconnect and sync error backoff #5423

Open

misaakidis wants to merge 2 commits into master from fix/puller-disconnect-backoff
Conversation

@misaakidis (Member) commented Apr 5, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

This PR fixes two liveness bugs in the puller.

Bug 1 — topology change holds syncPeersMtx for seconds

onChange held syncPeersMtx while calling disconnectPeer, which in turn called peer.stop():

onChange()
  lock(syncPeersMtx)
  → disconnectPeer(peer)
    → peer.stop()
      → cancel per-bin contexts
      → peer.wg.Wait()          ← blocks until all Sync() calls return

Sync() calls ReadMsgWithContext, which only unblocks after the stream times out (pageTimeout = 1s per bin). For a node with B active bins syncing to a peer that is being disconnected during a radius decrease, disconnectPeer held the lock for up to B seconds. Any subsequent topology-change notification queued behind the same manage() loop was blocked for that entire duration, causing the sync map to diverge from the live topology.

Fix: split peer.stop() into two methods:

| Method         | Behaviour                                           | Used by              |
| -------------- | --------------------------------------------------- | -------------------- |
| `cancelBins()` | cancel all per-bin contexts, clear the map, no wait | `disconnectPeer`     |
| `stop()`       | `cancelBins()` + `peer.wg.Wait()`                   | `Close()` (shutdown) |

disconnectPeer now calls cancelBins(): the peer is removed from the sync map immediately and its goroutines drain in the background. Close() still calls p.wg.Wait() across all peers, so shutdown correctness is unchanged.

Bug 2 — tight CPU spin on non-fatal sync errors

When Sync() returns a non-fatal error (stream reset, protocol error, timeout), the goroutine logged the error and fell through to limiter.WaitN(ctx, count) with count=0. WaitN(ctx, 0) returns immediately, so the goroutine looped back and retried with no delay. Any persistent non-fatal error caused a continuous CPU spin until the peer disconnected or the context was cancelled.

Fix: add a syncRetryBackoff = 1s sleep with a ctx.Done() escape after any non-fatal sync error:

select {
case <-time.After(syncRetryBackoff):
case <-ctx.Done():
    return
}

This bounds the retry rate to ≤ 1 call/s per goroutine under persistent errors.

Related Issue

Both bugs were surfaced during pull-sync optimisation work. The topology-freeze bug is most visible at depth transitions where many peers disconnect in rapid succession. See also: fix/pullsync-interval-advancement.

gacevicljubisa and others added 2 commits March 18, 2026 08:35
When onChange held syncPeersMtx and called disconnectPeer, the inner
peer.stop() cancelled per-bin goroutine contexts and then called
peer.wg.Wait(). Live goroutines blocked in Sync() → ReadMsgWithContext
and only unblocked after pageTimeout (1s) per stream. For N peers
disconnecting during a radius decrease, the outer lock was held for up
to N×1s, stalling all queued topology-change notifications for the same
duration.

When syncer.Sync returned a non-fatal error (connection reset, protocol
error, stream timeout), the goroutine fell through to limiter.WaitN with
count=0 and looped immediately. Any persistent non-fatal error caused a
tight CPU spin until the peer disconnected or context was cancelled.

Split syncPeer.stop() into cancelBins() (cancel all per-bin contexts,
clear the map, no wait) and stop() (cancel + wait, used only in
Close()). disconnectPeer now calls cancelBins(): the peer is removed
from the sync map immediately and its goroutines drain in the
background. Close() already calls p.wg.Wait(), so shutdown correctness
is unchanged.

Add syncRetryBackoff (1s) with a ctx.Done()-escape after any non-fatal
sync error before the next retry. This bounds the retry rate to ≤1/s
per goroutine under persistent errors.
@acud (Contributor) commented Apr 6, 2026

I think bug 2 is probably going to be easier to reason about. It would be beneficial to split changes into small PRs rather than cramming them together; large PRs make reviews more difficult, especially for core protocol. The same goes for the other PR.

@misaakidis (Member Author) commented Apr 8, 2026

I've noted the preference for atomic PRs for the future. For this specific case, would it be okay to leave this PR as is? These liveness bugs are somewhat coupled and validating them together ensures the fix is cohesive.

Btw, I have closed the other PR #5424 as the pullsync interval advancement behavior is correct in the setting of pulling from all peers (as per v2.7.1).

@acud (Contributor) commented Apr 8, 2026

> I've noted the preference for atomic PRs for the future. For this specific case, would it be okay to leave this PR as is? These liveness bugs are somewhat coupled and validating them together ensures the fix is cohesive.
>
> Btw, I have closed the other PR #5424 as the pullsync interval advancement behavior is correct in the setting of pulling from all peers (as per v2.7.1.)

Yeah, all good, leave it as it is; however, it needs a rebase as there are merge conflicts.
Thanks for closing the other PR. I'll try to get to this one tomorrow.

Comment thread pkg/puller/puller_test.go
// TestSyncErrorBackoff verifies that a non-fatal sync error is followed by a
// backoff before the next retry, bounding the retry rate to roughly 1/s.
func TestSyncErrorBackoff(t *testing.T) {
t.Parallel()
Contributor:
I wonder if we could use synctest here instead of the sleeps, spinlock, etc.

Comment thread go.mod
github.com/caddyserver/certmagic v0.21.6
github.com/coreos/go-semver v0.3.0
github.com/ethereum/go-ethereum v1.15.11
github.com/ethersphere/batch-archive v0.0.6
Contributor:
How is this related to this PR?

Member Author:

This dependency bump is not related to the puller/backoff changes in this PR.
It came from another commit: 61fab37 (chore: update postage snapshot to v0.0.6 (#5401)), which is the commit tagged v2.7.1.

Contributor:

I see, yes. Not sure why this commit is on this branch. It would be a good habit to work straight off master so that you can then rebase easily.

Comment thread .github/workflows/beekeeper.yml (resolved)
@misaakidis misaakidis force-pushed the fix/puller-disconnect-backoff branch 2 times, most recently from 8dd0d97 to 8f23985 Compare April 15, 2026 13:17
@misaakidis (Member Author) commented Apr 15, 2026

Created a separate validation branch for the CI experiment: ci/radius-decrease-validation. This PR now stays focused on the puller fix and its direct test coverage only.
