
chore: stabilize flaky tests and remove flaky runner#5429

Open
sbackend123 wants to merge 16 commits into master from fix/flaky-tests

Conversation

Contributor

@sbackend123 sbackend123 commented Apr 8, 2026

Fix CI

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Goal: stabilize the flaky tests and remove the dedicated CI runner for "flaky" tests.
Cross-cutting change: tests use rand with an explicit seed source so failures are easier to reproduce and debug.

Per test

  • TestBzzUploadDownloadWithRedundancy — It was consistently failing on the master branch. The reason: it wasn’t reading api.SwarmRedundancyLevelHeader, so it was using the default (PARANOID) level.

  • TestFinder — Stable on Linux on master; only renamed.

  • TestDBNuke — Fails on Windows. Cause: LevelDB sits under a cache layer, which sits under the store, etc. On Close(), the cache layer should close the underlying DB but did not, so the file stayed open and the test could not remove its temp directory. Fix: close the DB from the cache’s Close(), and adjust the cache test so LevelDB is not closed twice.

  • TestGetterRACE — Stable on Linux on master; renamed and given rand + source for reproducibility.

  • TestPushChunkToNextClosest — With origin=true, pushToClosest may deliver to several nearest peers in parallel; order and outcomes are nondeterministic. Old assertions assumed a fixed successful peer and strict pivot vs peer bookkeeping, and hard-coded which peer “fails” and which “succeeds”. Fix: any of the nearest candidates may succeed; assert push stream activity toward both nearest candidates and that exactly one ends up with a positive balance vs the pivot.

  • TestMakeInclusionProofs — Stable on Linux on master; only renamed.

  • TestAddressBookQuickPrune — With storageRadius = 2, the “good” and “bad” peers can share bin 1. connectNeighbours skips that bin (PO < depth), while connectBalanced can treat the slot as satisfied by the already-connected “good” peer and never dial the “bad” one — flaky behavior depending on random overlay geometry. Fix: avoid that collision (e.g. don’t put a connected peer in the same balanced bin as the bad peer) and align assertions with the implicit dial from AddPeers plus explicit Triggers (e.g. wait for at least MaxConnAttempts failed connects, then assert address book prune).

  • TestAnnounceBgBroadcast — Assertions relied on timing instead of the background goroutine actually running. cancel() could run right after Announce before BroadcastPeers blocked on <-ctx.Done(). After Close(), a select with a fixed 100ms timeout could fail on slow CI. Fix: bgStarted closed on first real entry into the background BroadcastPeers, and a more generous wait for shutdown on slow CI.

  • TestSnapshot — Expected an almost immediate snapshot while Kademlia updates asynchronously. Fix: poll / wait up to a timeout for the snapshot.

  • TestStart (non-empty addressbook subtest) — Besides the three address-book peers, bootstrap dialing adds more Connect calls, so the total is > 3, not exactly 3.
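The TestDBNuke fix can be sketched with a minimal mock (the type names below are illustrative; the real types live under pkg/storage/cache): the wrapping layer's Close() must propagate to the wrapped store, otherwise the underlying file handle stays open and Windows cannot delete the temp directory.

```go
package main

import "fmt"

// Closer mirrors io.Closer for this sketch.
type Closer interface{ Close() error }

// fakeDB stands in for the LevelDB layer; not the actual type.
type fakeDB struct{ closed bool }

func (db *fakeDB) Close() error { db.closed = true; return nil }

// cache wraps an underlying store. The bug: its Close() returned without
// closing the layer below, leaving the DB file open.
type cache struct{ under Closer }

func (c *cache) Close() error {
	// ...flush any cache state here...
	return c.under.Close() // the previously missing call
}

func main() {
	db := &fakeDB{}
	c := &cache{under: db}
	_ = c.Close()
	fmt.Println(db.closed) // true
}
```

With this propagation in place, the cache test then has to be adjusted so the DB is not closed twice, as the PR description notes.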
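The TestSnapshot (and, in spirit, TestAnnounceBgBroadcast) style of fix — wait for a condition up to a deadline instead of asserting immediately or sleeping a fixed time — can be sketched as follows; waitUntil is a hypothetical helper, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitUntil polls cond every interval until it returns true or the
// timeout elapses. This trades a fixed sleep for a bounded wait, which
// tolerates slow CI machines without slowing down fast ones.
func waitUntil(timeout, interval time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("condition not met before timeout")
}

func main() {
	start := time.Now()
	// Simulate Kademlia updating its snapshot asynchronously ~50ms in.
	err := waitUntil(time.Second, 10*time.Millisecond, func() bool {
		return time.Since(start) > 50*time.Millisecond
	})
	fmt.Println(err) // <nil> once the condition holds within the deadline
}
```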

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

#5418

Screenshots (if appropriate):

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

Comment thread cmd/bee/cmd/db_test.go Outdated
err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
// Retrying avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
Contributor


maybe just sleep once then (1s) and try once?

Comment thread cmd/bee/cmd/db_test.go Outdated
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
for range 3 {
err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
Contributor


here, if the first command returns an application-level error, this loop will just swallow it and retry, which might make the error disappear on a later attempt and so obfuscate it.

Comment thread pkg/storage/cache/cache_test.go
Comment thread pkg/topology/kademlia/kademlia_test.go Outdated
func waitChanClosed(t *testing.T, ch <-chan struct{}) {
t.Helper()

err := spinlock.Wait(spinLockWaitTime, func() bool {
Contributor


I would really be happy to get rid of this spinlock functionality at some point. It adds functionality that is shorter to express using normal Go idioms:

func waitChanClosed(t *testing.T, ch <-chan struct{}) {
	t.Helper()

	select {
	case <-ch:
		return
	case <-time.After(timeWait):
		t.Fatal("timed out")
	}
}

done.

Contributor

acud commented Apr 14, 2026

@sbackend123, thanks for this. what about removing the flaky test runner from the github workflows & makefile?

@sbackend123
Contributor Author

@sbackend123, thanks for this. what about removing the flaky test runner from the github workflows & makefile?

Already removed; we also need to remove it from the required checks (I do not have the rights for that).

@sbackend123 sbackend123 requested a review from acud April 14, 2026 11:48
@sbackend123 sbackend123 changed the title chore: remove randomness in flaky tests chore: stabilize flaky tests and remove flaky runner Apr 14, 2026
@sbackend123 sbackend123 marked this pull request as ready for review April 14, 2026 11:48
Contributor

@akrem-chabchoub akrem-chabchoub left a comment


The Test (flaky) workflow is in freeze mode.
Although it is removed from go.yml, I am not sure why it's still left here in CI.


Comment thread cmd/bee/cmd/db_test.go

// Waiting avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
time.Sleep(2 * time.Second)
Contributor

@akrem-chabchoub akrem-chabchoub Apr 14, 2026


Would it be better to replace this with a condition-based wait/retry, to ensure more deterministic behavior across slower CI environments, rather than a hardcoded sleep?

Contributor


@akrem-chabchoub there was something else here before that was a retry... see my previous comment here. If you have any suggestions, they are welcome (a code example would also be good).

@sbackend123
Contributor Author

The Test (flaky) workflow is in freeze mode. Although it is removed from go.yml, I am not sure why it's still left here in CI.

It is highly likely related to #5429 (comment)

@acud
Contributor

acud commented Apr 15, 2026

@sbackend123 the job requirement has been removed

Comment thread pkg/api/bzz.go

ctx := r.Context()
ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), redundancy.DefaultLevel)
ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), rLevel)
Contributor


Good catch

@martinconic
Contributor

I think all CI/CD tests should run 10-20 times at least to make sure flaky tests do not occur

@sbackend123
Contributor Author

I think all CI/CD tests should run 10-20 times at least to make sure flaky tests do not occur

I tried on Linux about 50 times, but it would be cool if somebody on macOS would try the same.
