chore: stabilize flaky tests and remove flaky runner #5429
sbackend123 wants to merge 16 commits into master
Conversation
Keep flaky-test ci job for one remaining flaky test
err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
// Retrying avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
maybe just sleep once (1s) and then try once?
// may still be getting released and early removal can fail on some platforms.
backoff := 50 * time.Millisecond
for range 3 {
	err = newCommand(t, cmd.WithArgs("db", "nuke", "--data-dir", dataDir)).Execute()
here, if the first command returns an application-level error, this loop will just swallow it and try again. The retry might make the application-level error disappear, obfuscating the real failure.
func waitChanClosed(t *testing.T, ch <-chan struct{}) {
	t.Helper()

	err := spinlock.Wait(spinLockWaitTime, func() bool {
i would really be happy to get rid of this spinlock functionality at some point. it adds functionality that is just shorter to express using normal golang idioms:

func waitChanClosed(t *testing.T, ch <-chan struct{}) {
	t.Helper()
	select {
	case <-ch:
		return
	case <-time.After(timeWait):
		t.Fatal("timed out")
	}
}

done.
@sbackend123, thanks for this. what about removing the flaky test runner from the github workflows & makefile?
Already removed; we also need to remove it from the required checks (I do not have rights for that)
// Waiting avoids a short OS-level race after db.Close(), where file handles
// may still be getting released and early removal can fail on some platforms.
time.Sleep(2 * time.Second)
Would it be better to replace this with a condition-based wait/retry, rather than a hardcoded sleep, to ensure more deterministic behavior across slower CI environments?
@akrem-chabchoub there was something else here before that was a retry... see my previous comment here. if you have any suggestions, welcome (but a code example would also be good)
It is highly likely related to #5429 (comment)
@sbackend123 the job requirement has been removed |
ctx := r.Context()
- ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), redundancy.DefaultLevel)
+ ls := loadsave.NewReadonly(s.storer.Download(cache), s.storer.Cache(), rLevel)
I think all CI/CD tests should run 10-20 times at least to make sure flaky tests do not occur
I tried on Linux about 50 times, but it would be cool if somebody with macOS would try the same.


Fix CI
Checklist
Description
Goal: stabilize flaky tests and remove the CI runner for "flaky" tests.
Cross-cutting change: tests use `rand` with an explicit source so failures are easier to reproduce and debug.

Per test:

- TestBzzUploadDownloadWithRedundancy — It was consistently failing on the master branch. The reason: it wasn't reading `api.SwarmRedundancyLevelHeader`, so it was using the default (PARANOID) level.
- TestFinder — Stable on Linux on `master`; only renamed.
- TestDBNuke — Fails on Windows. Cause: LevelDB sits under a cache layer, which sits under the store, etc. On `Close()`, the cache layer should close the underlying DB but did not, so the file stayed open and the test could not remove its temp directory. Fix: close the DB from the cache's `Close()`, and adjust the cache test so LevelDB is not closed twice.
- TestGetterRACE — Stable on Linux on `master`; renamed and given `rand` + source for reproducibility.
- TestPushChunkToNextClosest — With `origin=true`, `pushToClosest` may deliver to several nearest peers in parallel; order and outcomes are nondeterministic. Old assertions assumed a fixed successful peer and strict pivot-vs-peer bookkeeping, and hard-coded which peer "fails" and which "succeeds". Fix: any of the nearest candidates may succeed; assert push stream activity toward both nearest candidates and that exactly one ends up with a positive balance vs the pivot.
- TestMakeInclusionProofs — Stable on Linux on `master`; only renamed.
- TestAddressBookQuickPrune — With `storageRadius = 2`, the "good" and "bad" peers can share bin 1. `connectNeighbours` skips that bin (PO < depth), while `connectBalanced` can treat the slot as satisfied by the already-connected "good" peer and never dial the "bad" one — flaky behavior depending on random overlay geometry. Fix: avoid that collision (e.g. don't put a connected peer in the same balanced bin as the bad peer) and align assertions with the implicit dial from `AddPeers` plus explicit `Trigger`s (e.g. wait for at least `MaxConnAttempts` failed connects, then assert the address book prune).
- TestAnnounceBgBroadcast — Assertions relied on timing instead of the background goroutine actually running. `cancel()` could run right after `Announce`, before `BroadcastPeers` blocked on `<-ctx.Done()`. After `Close()`, a `select` with a fixed `100ms` timeout could fail on slow CI. Fix: `bgStarted` is closed on first real entry into the background `BroadcastPeers`, and a more generous wait for shutdown on slow CI.
- TestSnapshot — Expected an almost immediate snapshot while Kademlia updates asynchronously. Fix: poll / wait up to a timeout for the snapshot.
- TestStart (non-empty addressbook subtest) — Besides the three address-book peers, bootstrap dialing adds more `Connect` calls, so the total is > 3, not exactly 3.

Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
#5418
Screenshots (if appropriate):
AI Disclosure