chore: stabilize flaky tests and remove flaky runner #5429
sbackend123 wants to merge 16 commits into master
Conversation
Keep flaky-test ci job for one remaining flaky test
err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
// Retrying avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
maybe just sleep once (1s) and then try once?
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
for range 3 {
	err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
here, if the first command returns an application-level error, this loop will just swallow it and try again. The retry might make the application-level error disappear, obfuscating the real failure.
func waitChanClosed(t *testing.T, ch <-chan struct{}) {
	t.Helper()

	err := spinlock.Wait(spinLockWaitTime, func() bool {
i would really be happy to get rid of this spinlock functionality at some point. it adds functionality that is just shorter to express using normal golang idioms:

func waitChanClosed(t *testing.T, ch <-chan struct{}) {
	t.Helper()
	select {
	case <-ch:
		return
	case <-time.After(timeWait):
		t.Fatal("timed out")
	}
}

done.
@sbackend123, thanks for this. what about removing the flaky test runner from the github workflows & makefile?
Already removed; we also need to remove it from the required checks (I do not have rights for that)
// Waiting avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
time.Sleep(2 * time.Second)
Would it be better to replace this with a condition-based wait/retry, rather than a hardcoded sleep, to ensure more deterministic behavior across slower CI environments?
@akrem-chabchoub there was something else here before that was a retry... see my previous comment here. if you have any suggestions, welcome (but a code example would also be good)
It is highly likely related to #5429 (comment)
@sbackend123 the job requirement has been removed |
ctx := r.Context()
- ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), redundancy.DefaultLevel)
+ ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), rLevel)
I think all CI/CD tests should run 10-20 times at least to make sure flaky tests do not occur
I tried on Linux about 50 times, but it would be cool if somebody with macOS would try the same.


Fix CI
Checklist
Description
Goal: stabilize flaky tests and remove the CI runner for "flaky" tests.
Cross-cutting change: tests use `rand` with an explicit source so failures are easier to reproduce and debug.

Per test:

- TestBzzUploadDownloadWithRedundancy — It was consistently failing on the master branch. The reason: it wasn't reading `api.SwarmRedundancyLevelHeader`, so it was using the default (PARANOID) level.
- TestFinder — Stable on Linux on `master`; only renamed.
- TestDBNuke — Fails on Windows. Cause: LevelDB sits under a cache layer, which sits under the store, etc. On `Close()`, the cache layer should close the underlying DB but did not, so the file stayed open and the test could not remove its temp directory. Fix: close the DB from the cache's `Close()`, and adjust the cache test so LevelDB is not closed twice.
- TestGetterRACE — Stable on Linux on `master`; renamed and given `rand` + source for reproducibility.
- TestPushChunkToNextClosest — With `origin=true`, `pushToClosest` may deliver to several nearest peers in parallel; order and outcomes are nondeterministic. Old assertions assumed a fixed successful peer and strict pivot-vs-peer bookkeeping, and hard-coded which peer "fails" and which "succeeds". Fix: any of the nearest candidates may succeed; assert push stream activity toward both nearest candidates and that exactly one ends up with a positive balance vs the pivot.
- TestMakeInclusionProofs — Stable on Linux on `master`; only renamed.
- TestAddressBookQuickPrune — With `storageRadius = 2`, the "good" and "bad" peers can share bin 1. `connectNeighbours` skips that bin (PO < depth), while `connectBalanced` can treat the slot as satisfied by the already-connected "good" peer and never dial the "bad" one — flaky behavior depending on random overlay geometry. Fix: avoid that collision (e.g. don't put a connected peer in the same balanced bin as the bad peer) and align assertions with the implicit dial from `AddPeers` plus explicit `Trigger`s (e.g. wait for at least `MaxConnAttempts` failed connects, then assert the address book prune).
- TestAnnounceBgBroadcast — Assertions relied on timing instead of the background goroutine actually running. `cancel()` could run right after `Announce`, before `BroadcastPeers` blocked on `<-ctx.Done()`. After `Close()`, a `select` with a fixed `100ms` timeout could fail on slow CI. Fix: `bgStarted` is closed on first real entry into the background `BroadcastPeers`, and a more generous wait for shutdown on slow CI.
- TestSnapshot — Expected an almost immediate snapshot while Kademlia updates asynchronously. Fix: poll / wait up to a timeout for the snapshot.
- TestStart (non-empty addressbook subtest) — Besides the three address-book peers, bootstrap dialing adds more `Connect` calls, so the total is > 3, not exactly 3.

Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
#5418
Screenshots (if appropriate):
AI Disclosure