
fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230

Draft
auricom wants to merge 26 commits into main from fix/3229-raft-re-election

Conversation


@auricom auricom commented Apr 7, 2026

Summary

Hardens Raft HA behavior across several failure modes observed in SIGTERM cycling tests.

Leader fencing on SIGTERM

  • Added ResignLeader(ctx context.Context) error to raft.Node and FullNode (via LeaderResigner interface)
  • SIGTERM handler in run_node.go calls resigner.ResignLeader(resignCtx) directly with a 3s deadline — no wrapper goroutine needed; context.DeadlineExceeded vs other errors are logged separately
  • ResignLeader runs leadershipTransfer in a goroutine and selects on the result vs ctx.Done(), so the caller is unblocked at the deadline (the inner transfer exits on its own, bounded by ElectionTimeout)
  • ShutdownTimeout field (default 5s) added to raft.Config so Stop() drains committed logs with a meaningful timeout

Raft log compaction

  • Added SnapshotThreshold (default 500) and TrailingLogs (default 200) to RaftConfig and wired them into the hashicorp/raft config — previously using library defaults (8192 / 10240, i.e. snapshots every ~2.25h at 1 blk/s)
  • Fixed SnapCount default 0 → 3 (0 caused no snapshot files to be retained on disk)

Election timeout config

  • Added ElectionTimeout to RaftConfig (default 1s)
  • Validate() rejects negative values explicitly (< 0 is invalid; 0 remains valid as "use library default"); also validates >= HeartbeatTimeout when non-zero to prevent hashicorp/raft panicking at startup

Follower crash on restart (EVM ahead of stale snapshot)

  • RecoverFromRaft now verifies the local block hash at raftState.Height; a hash match safely skips the unnecessary recovery, while a mismatch returns an explicit error indicating a fork
  • waitForMsgsLanded replaces its repeating deadline ticker with a one-shot time.NewTimer and adds a final check in the deadline branch — this prevents a race where the poll and deadline tickers fired simultaneously and the timeout won even when AppliedIndex >= CommitIndex

FSM data race

  • FSM.Apply and SetApplyCallback are now guarded by applyMu sync.RWMutex to prevent the raft library calling Apply concurrently while SetApplyCallback(nil) clears the channel pointer during shutdown

BoltDB log noise

  • A one-time stdlib log filter (sync.Once in NewNode) silently drops "tx closed" lines from raft-boltdb — these are emitted on every successful commit due to defer tx.Rollback() and have no actionable meaning
  • Filter wraps log.Writer() instead of hard-coding os.Stderr so any existing stdlib logger redirection is preserved

Block provenance audit log

  • RaftApplyMsg carries a Term field; FSM.Apply() logs raft_term and raft_index alongside block height and hash for replication debugging

Test plan

  • go test ./pkg/raft/... ./pkg/config/... ./node/... -count=1 — all pass
  • TestNodeResignLeader_NilNoop / TestNodeResignLeader_NotLeaderNoop — nil/non-leader no-op paths
  • TestFullNode_ResignLeader_Noop — table-driven nil and non-leader cases
  • TestNewNode_SnapshotConfigApplied — snapshot config wiring (table-driven)
  • TestSyncerStop_CallsRaftRetrieverStop — Syncer.Stop() calls raftRetriever.Stop()
  • TestRecoverFromRaft_GetHeaderFailure — hash mismatch error path
  • "negative election timeout rejected" — Validate() rejects ElectionTimeout < 0
  • HA cycling test: verify leader fencing reduces unconfirmed-block window on SIGTERM
  • HA cycling test: verify rebooted nodes resync within cycle window after snapshot config change

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added leadership resignation capability for graceful leader transition during node shutdown
    • New Raft configuration options for fine-tuning cluster behavior (election timeout, snapshot threshold, trailing logs)
  • Bug Fixes

    • Improved shutdown sequence with proper cleanup of internal raft state
    • Enhanced recovery logic to gracefully handle stale cluster snapshots
    • Nodes now abdicate leadership when significantly behind the cluster state
  • Configuration

    • New Raft configuration parameters with sensible defaults


coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d1f347c-97d4-42c5-94c7-d4e1c127007b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR adds Raft leadership resignation capability, extends configuration with election and snapshot tuning parameters, improves FSM apply concurrency safety with mutex guards, handles ahead-of-raft recovery scenarios gracefully, and implements leader abdication when the local store falls significantly behind the Raft state.

Changes

Cohort / File(s) Summary
Raft Core Configuration & Node
pkg/raft/node.go, pkg/raft/types.go
Extends Config with ShutdownTimeout, ElectionTimeout, SnapshotThreshold, and TrailingLogs; adds ResignLeader method for context-aware leadership transfer; makes FSM apply callback wiring concurrency-safe with RWMutex; updates waitForMsgsLanded loop with one-shot deadline timer; adds Term field to RaftApplyMsg.
Raft Leadership & Election
pkg/raft/election.go
Adds leader abdication check: if store lags raft state by more than 1 block, node logs warning, attempts immediate leadership transfer, and skips leader start.
Block Syncing
block/internal/syncing/syncer.go, block/internal/syncing/raft_retriever.go
Extends shutdown to stop raftRetriever; improves RecoverFromRaft to load and compare local block hash when local height exceeds raft snapshot height, enabling graceful skip of recovery when hashes match.
Full Node Implementation
node/full.go, node/node.go
Adds LeaderResigner interface declaring ResignLeader(ctx context.Context) error; implements method on FullNode forwarding to raftNode when available; configures raft with new timing/retention fields.
Configuration Management
pkg/config/config.go, pkg/config/defaults.go
Adds ElectionTimeout, SnapshotThreshold, and TrailingLogs fields to RaftConfig; extends validation to check ElectionTimeout >= HeartbeatTimeout; sets sensible defaults (ElectionTimeout: 1000ms, SnapshotThreshold: 500, TrailingLogs: 200, SnapCount: 3).
Startup/Shutdown Coordination
pkg/cmd/run_node.go
Integrates best-effort Raft leadership resignation on SIGINT/SIGTERM shutdown via type assertion to LeaderResigner interface with 3-second timeout; logs success or distinguishes timeout vs. other errors.
Test Coverage
block/internal/syncing/syncer_test.go, node/full_node_test.go, pkg/config/config_test.go, pkg/raft/node_test.go, pkg/raft/election_test.go
Adds tests for callback clearing during shutdown, ResignLeader noop behavior, config validation for election timeouts, buildRaftConfig with new parameters, and leader abdication when store lags raft state.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (SIGINT/SIGTERM)
    participant Runner as Run Control
    participant Node as FullNode
    participant RaftNode as RaftNode
    participant FSM as FSM/Apply
    
    CLI->>Runner: Signal received
    Runner->>Node: Type-assert LeaderResigner
    alt LeaderResigner supported
        Runner->>Node: ResignLeader(ctx, 3s timeout)
        Node->>RaftNode: ResignLeader(ctx)
        RaftNode->>RaftNode: Transfer leadership
        RaftNode-->>Node: Success/Error
        Node-->>Runner: Return result
        Runner->>Runner: Log outcome
    else Not supported
        Runner->>Runner: Skip resignation
    end
    Runner->>Runner: Call cancel()
    Runner->>Runner: Wait for shutdown (10s)
sequenceDiagram
    participant Election as Election Loop
    participant Store as Store (Local)
    participant Runnable as Runnable
    participant Leader as Leader Sync
    participant Transfer as Leadership Transfer
    
    Election->>Runnable: IsSynced(raftState)
    Runnable-->>Election: diff = -5 (lag > 1 block)
    alt Store significantly behind
        Election->>Election: Log warning with lag size
        Election->>Transfer: leadershipTransfer()
        Transfer-->>Election: Initiated
        Election->>Election: Skip leader start (continue)
    else Store within tolerance
        Election->>Leader: Proceed with leader startup
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • alpe
  • julienrbrt
  • tac0turtle

Poem

🐰 A Raft of changes, carefully planned,
Leadership resignation, delegation at hand,
Timeouts and thresholds in harmony play,
Safe shutdowns, abdications save the day! 🎩✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 26.09%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title accurately summarizes the main changes; leader fencing, log compaction, election timeout config, and audit logging are all substantive parts of the changeset.
  • Description check ✅ Passed: the description is comprehensive and well-structured, covering all major changes, the rationale, test coverage, and the pending integration tests.




claude bot commented Apr 7, 2026

Claude finished @auricom's task in 3m 51s.


Code Review — Raft HA Production Hardening

This is a well-motivated PR with solid safety rationale throughout. The changes address real production failure modes observed in SIGTERM cycling tests. Below are my findings organized by severity.


Tasks

  • Read AGENTS.md for repo guidelines
  • Review pkg/raft/node.go — FSM mutex, ShutdownTimeout, ResignLeader
  • Review pkg/raft/election.go — abdication + waitForBlockStoreSync
  • Review pkg/raft/types.go, config, defaults
  • Review pkg/cmd/run_node.go — SIGTERM handler
  • Review node/full.go, node/node.go — LeaderResigner interface
  • Review block/internal/syncing/syncer.go, raft_retriever.go
  • Review test coverage
  • Synthesize and post review

What's Good

  • FSM data race fix (node.go:394-404): The applyMu RWMutex correctly guards applyCh access — write lock in SetApplyCallback, read lock in Apply. The pattern of reading the channel under the lock, then sending outside it, is correct and avoids holding the lock during a blocking send.
  • waitForMsgsLanded one-shot timer: Replacing the repeating ticker with a time.NewTimer deadline eliminates the race where both poll and deadline could fire simultaneously. Clean fix.
  • waitForBlockStoreSync addresses the hot-potato abdication problem correctly: if all nodes restart simultaneously with lagging stores, they wait for sync before abdicating rather than immediately passing leadership around. The leadership-lost abort (leaderID != NodeID) inside the poll loop is the right escape hatch.
  • RecoverFromRaft hash verification: The "local ahead of stale snapshot" branch now verifies hash identity before skipping recovery. The explicit fork error gives operators a clear signal.
  • Config defaults: SnapCount 0 → 3, SnapshotThreshold 500, TrailingLogs 200 are well-reasoned and commented in defaults.go.
  • Context-aware ResignLeader: Correctly unblocks the caller at deadline while the inner transfer drains to completion (bounded by ElectionTimeout). The goroutine leak is intentional and bounded — acceptable.

Issues

Medium — Worth Addressing Before Merge

1. SnapCount not validated in RaftConfig.Validate()
(pkg/config/config.go)

raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr) requires retain > 0. A zero value silently creates a store that retains no snapshots — nodes would lose compaction on restart. The fix was flagged by CodeRabbit but not applied. The default is now 3, but an operator could override to 0.

// Add to RaftConfig.Validate():
if c.SnapCount == 0 {
    multiErr = errors.Join(multiErr, fmt.Errorf("snap_count must be > 0"))
}


2. waitForBlockStoreSync repurposes ShutdownTimeout semantically
(pkg/raft/election.go:296-299)

ShutdownTimeout is used here as the "how long to wait for block-store sync before abdicating" budget. With the default of 5 * SendTimeout = 1s, nodes have 1 second to resync their block store after being elected leader. This may be too short under real conditions (P2P sync latency, slow storage). The field name is also misleading in this context.

Consider adding a dedicated BlockStoreSyncTimeout config field, or at least documenting that ShutdownTimeout doubles as the sync-wait budget.

3. Duplicate log message in election.go
(pkg/raft/election.go:123 and :179)

// line ~123
d.logger.Info().Msg("became leader, stopping follower operations")
// ... 50+ lines of sync/recovery logic ...
// line ~179
d.logger.Info().Msg("became leader, stopping follower operations")
workerCancel()
wg.Wait()

Both messages are identical. The first fires before any sync checks have run (nothing has stopped yet), which is misleading in logs. The second is accurate. The first should be removed or changed to something like "became leader, checking sync state before stopping follower".

4. Stop() may attempt leadership transfer twice
(pkg/raft/node.go:222-226)

When shutdown is signalled, the call sequence is:

  1. ResignLeader(3s) (via SIGTERM handler in run_node.go)
  2. cancel() → node context cancelled
  3. Eventually: raftNode.Stop() which calls leadershipTransfer() again if IsLeader() returns true

If ResignLeader returned successfully, IsLeader() should return false and the second transfer is skipped. But there is a narrow window where the raft state hasn't propagated yet. This is low-probability but worth a comment noting the double-transfer is intentionally guarded by IsLeader().


Minor

5. RecoverFromRaft silent return nil for multi-block gap
(block/internal/syncing/syncer.go:1267)

When raftState.Height > currentState.LastBlockHeight + 1 (local more than 1 block behind raft but not "ahead"), the function falls through all branches to return nil with no action and no log. This is intentional (the caller handles the gap via waitForBlockStoreSync + abdication), but a comment would prevent future confusion:

// Local block store is more than 1 block behind raft state. Recovery can only
// apply one block; the gap will be closed by p2p sync after the caller allows
// normal operations.
return nil


6. IsSyncedWithRaft uses s.ctx instead of caller's context
(block/internal/syncing/syncer.go:1163)

s.store.GetState(s.ctx) uses the syncer's internal context. If Stop() has been called and s.ctx is cancelled, any IsSyncedWithRaft call from the election loop would return an error. This would cause waitForBlockStoreSync to never report sync (any err != nil result reads as not synced). Consider passing the context from the caller. This is pre-existing, but the new waitForBlockStoreSync loop now exercises this path more frequently.

7. os.Stderr hardcoded in transport and snapshot store
(pkg/raft/node.go:117,127)

Both raft.NewFileSnapshotStore and raft.NewTCPTransport use os.Stderr directly. This was pre-existing and the boltdb noise filter was removed per reviewer feedback, but these two still bypass any zerolog routing. Low priority for this PR.

8. election_test.go "abdicate" scenario uses time.After(200ms)
(pkg/raft/election_test.go:269-274)

select {
case <-leaderStarted:
    t.Error("leader should not start...")
case <-time.After(200 * time.Millisecond):
    // leadership transferred without starting leader — expected
}

This works because the test config uses a 2ms timeout so the wait expires well before 200ms. But the 200ms wall-clock dependency makes the test slow in CI and could flake on heavily-loaded runners. A channel-based signal from the leadership transfer mock would be more deterministic. Not a blocker but something to tighten in a follow-up.


Summary

The PR is production-ready. The FSM mutex, one-shot timer, hash verification in recovery, and hot-potato abdication fix are all correct and well-tested. The main gap is the missing SnapCount > 0 validation (item 1 above) — an operator who sets snap_count: 0 would get silent broken behavior at runtime instead of a startup error. The rest are documentation/clarity issues.


github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build ✅ passed · Format ⏩ skipped · Lint ✅ passed · Breaking ✅ passed · Updated (UTC): Apr 17, 2026, 2:58 PM


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@block/internal/syncing/raft_retriever.go`:
- Line 77: The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply
because Apply reads/sends on applyCh while the raft node may concurrently invoke
the callback; fix by adding a mutex to the raft node to guard access to the
apply callback: protect the callback field and its setter Get/SetApplyCallback
(or SetApplyCallback and any internal invocation sites) with the new mutex so
that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil
or changing callback mid-invocation; update the raft node's invocation path that
calls the callback (where Apply is invoked) to acquire the same mutex (or use a
read lock) when reading the callback and release it immediately after obtaining
the pointer, then call the callback outside the lock if needed to avoid
long-held locks.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

Comment thread block/internal/syncing/raft_retriever.go
@auricom auricom marked this pull request as draft April 7, 2026 15:24

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 60.43165% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.50%. Comparing base (2865d6d) to head (5897ef3).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/raft/node.go 39.62% 32 Missing ⚠️
pkg/cmd/run_node.go 0.00% 10 Missing ⚠️
pkg/raft/election.go 78.57% 6 Missing and 3 partials ⚠️
node/full.go 50.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   62.33%   62.50%   +0.17%     
==========================================
  Files         122      122              
  Lines       12873    13002     +129     
==========================================
+ Hits         8024     8127     +103     
- Misses       3968     3990      +22     
- Partials      881      885       +4     
Flag Coverage Δ
combined 62.50% <60.43%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown.


auricom and others added 2 commits April 8, 2026 17:03
Bug A: RecoverFromRaft crashed with "invalid block height" when the node
restarted after SIGTERM and the EVM state (persisted before kill) was ahead
of the raft FSM snapshot (which hadn't finished log replay yet). The function
now verifies the hash of the local block at raftState.Height — if it matches
the snapshot hash the EVM history is correct and recovery is safely skipped;
a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective
period (SendTimeout/2 poll, SendTimeout timeout), so both could fire
simultaneously in select and the timeout would win even when AppliedIndex >=
CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer,
added a final check in the deadline branch, and reduced poll interval to
min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test
cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with
FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine
exited, but the raft library can still invoke Apply concurrently.

Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and
read lock in Apply before reading the channel pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title Fix raft leader re-election delays after SIGTERM fix: raft leader re-election delays after SIGTERM Apr 8, 2026
auricom and others added 6 commits April 9, 2026 15:14
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…GTERM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to RaftConfig; fix SnapCount default 0→3

Add three new Raft config parameters:
  - ElectionTimeout: timeout for candidate to wait for votes (defaults to 1s)
  - SnapshotThreshold: outstanding log entries that trigger snapshot (defaults to 500)
  - TrailingLogs: log entries to retain after snapshot (defaults to 200)

Fix critical default: SnapCount was 0 (broken, retains no snapshots) → 3

This enables control over Raft's snapshot frequency and recovery behavior to prevent
resync debt from accumulating unbounded during normal operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nto hashicorp/raft config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r block provenance audit

Add Term field to RaftApplyMsg struct to track the raft term in which each
block was committed. Update FSM.Apply() debug logging to include both
raft_term and raft_index fields alongside block height and hash. This
enables better audit trails and debugging of replication issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title fix: raft leader re-election delays after SIGTERM fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log Apr 9, 2026
auricom and others added 6 commits April 9, 2026 16:01
…gering tests

The gci formatter requires single space before inline comments (not aligned
double-space). Also removed TestNodeResignLeader_NotLeaderNoop and
TestNewNode_SnapshotConfigApplied which create real boltdb-backed raft nodes:
boltdb@v1.3.1 has an unsafe pointer alignment issue that panics under
Go 1.25's -checkptr. The nil-receiver test (TestNodeResignLeader_NilNoop)
is retained as it exercises the same guard without touching boltdb.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hashicorp/raft-boltdb uses defer tx.Rollback() as a safety net on every
transaction. When Commit() succeeds, the deferred Rollback() returns
bolt.ErrTxClosed and raft-boltdb logs it as an error — even though it is
the expected outcome of every successful read or write. The message has no
actionable meaning and floods logs at high block rates.

Add a one-time stdlib log filter (sync.Once in NewNode) that silently drops
lines containing 'tx closed' and forwards everything else to stderr.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ocs, tests

- Call raftRetriever.Stop() in Syncer.Stop() so SetApplyCallback(nil) is
  actually reached and the goroutine is awaited before wg.Wait()
- Log leadershipTransfer error at warn level in Node.Stop() instead of
  discarding it silently
- Fix SnapCount comments in config.go: it retains snapshot files on disk
  (NewFileSnapshotStore retain param), not log-entry frequency
- Extract buildRaftConfig helper from NewNode to enable config wiring tests
- Add TestNodeResignLeader_NotLeaderNoop (non-nil node, nil raft → noop)
- Add TestNewNode_SnapshotConfigApplied (table-driven, verifies
  SnapshotThreshold and TrailingLogs wiring with custom and zero values)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, election validation

- Add ShutdownTimeout field (default 5s) to raft Config so Stop() drains
  committed logs with a meaningful timeout instead of the 200ms SendTimeout
- Wrap ResignLeader() in a 3s goroutine+select fence in the SIGTERM handler
  so a hung leadership transfer cannot block graceful shutdown indefinitely
- Validate ElectionTimeout >= HeartbeatTimeout in RaftConfig.Validate() to
  prevent hashicorp/raft panicking at startup with an invalid config
- Fix stale "NewNode creates a new raft node" comment that had migrated onto
  buildRaftConfig after the function was extracted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gofmt/gci requires minimal alignment; excessive spaces in the
TestNewNode_SnapshotConfigApplied struct literal caused a lint failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add unit tests for lines flagged by Codecov:
- boltTxClosedFilter.Write: filter drops "tx closed", forwards others
- buildRaftConfig: ElectionTimeout > 0 applied, zero uses default
- FullNode.ResignLeader: nil raftNode no-op; non-leader raftNode no-op
- Syncer.Stop: raftRetriever.Stop is called when raftRetriever is set
- Syncer.RecoverFromRaft: GetHeader failure when local state is ahead of
  stale raft snapshot returns "cannot verify hash" error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom marked this pull request as ready for review April 10, 2026 16:24

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
node/full_node_test.go (1)

87-97: Prefer a table-driven test for the ResignLeader no-op cases.

These two tests are the same shape and are easier to extend/maintain as one table-driven test.

♻️ Proposed refactor
-func TestFullNode_ResignLeader_NilRaftNode(t *testing.T) {
-	n := &FullNode{} // raftNode is nil
-	assert.NoError(t, n.ResignLeader())
-}
-
-func TestFullNode_ResignLeader_NonLeaderRaftNode(t *testing.T) {
-	// Empty *raftpkg.Node has nil raft field so IsLeader() returns false;
-	// ResignLeader() is a no-op and returns nil.
-	n := &FullNode{raftNode: &raftpkg.Node{}}
-	assert.NoError(t, n.ResignLeader())
-}
+func TestFullNode_ResignLeader_NoOpCases(t *testing.T) {
+	tests := []struct {
+		name string
+		node *FullNode
+	}{
+		{name: "nil raft node", node: &FullNode{}},
+		{name: "non-leader raft node", node: &FullNode{raftNode: &raftpkg.Node{}}},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			assert.NoError(t, tt.node.ResignLeader())
+		})
+	}
+}

As per coding guidelines, **/*_test.go: "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node/full_node_test.go` around lines 87 - 97, Combine the two ResignLeader
no-op tests into a single table-driven test: create a slice of cases with
descriptive names for the nil-raftNode case and the non-leader raftNode case,
each providing a FullNode value (e.g., &FullNode{} and &FullNode{raftNode:
&raftpkg.Node{}}) and the expected outcome (no error); iterate over cases using
t.Run(case.name, func(t *testing.T) { assert.NoError(t,
case.node.ResignLeader()) }), keeping references to FullNode.ResignLeader and
raftpkg.Node to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cmd/run_node.go`:
- Around line 231-245: The current shutdown uses a 3s resignCtx but launches
resigner.ResignLeader() without cancellation so the resign can keep running;
change the API to accept a context (e.g., LeaderResigner.ResignLeader(ctx
context.Context) error), call resigner.ResignLeader(resignCtx) directly (or in a
goroutine and select on its returned error and resignCtx.Done()) and
propagate/observe the context to abort the resign operation; update the call
site in run_node.go to pass resignCtx and handle the returned error or context
timeout to avoid leaking a background goroutine or leaving an unbounded resign
in progress.

In `@pkg/config/config.go`:
- Around line 454-456: The validation must reject negative election timeouts:
add an explicit check in the same validation block to treat c.ElectionTimeout <
0 as invalid (join an error like "election timeout (%v) must be >= 0" into
multiErr), keeping 0 as the "use default" sentinel; ensure this complements the
existing comparison with c.HeartbeatTimeout and aligns with buildRaftConfig's
behavior (which only applies values > 0) so negative values fail fast instead of
being silently ignored.

In `@pkg/raft/node.go`:
- Around line 95-97: The code currently calls
log.SetOutput(&boltTxClosedFilter{w: os.Stderr}) inside suppressBoltNoise.Do(),
which clobbers the process-wide stdlib logger; instead capture the current
stdlib logger writer and wrap it so existing configuration is preserved: obtain
prev := log.Writer() and call log.SetOutput(&boltTxClosedFilter{w: prev}) (still
guarded by suppressBoltNoise.Do()). Update the usage in the same block where
suppressBoltNoise.Do and boltTxClosedFilter are referenced (e.g., in NewNode
initialization) so you no longer hard-code os.Stderr.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bff68b93-3da9-4c95-9c3b-ae84e849204f

📥 Commits

Reviewing files that changed from the base of the PR and between 2d28b20 and a2a2599.

📒 Files selected for processing (12)
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • node/full.go
  • node/full_node_test.go
  • node/node.go
  • pkg/cmd/run_node.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/config/defaults.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/raft/types.go

@julienrbrt julienrbrt self-requested a review April 10, 2026 17:20
auricom and others added 2 commits April 10, 2026 19:22
A negative ElectionTimeout was silently ignored (buildRaftConfig only
applies values > 0), allowing a misconfigured node to start with the
library default instead of failing fast.  Add an explicit < 0 check
that returns an error; 0 remains valid as the "use library default"
sentinel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x through ResignLeader

- suppressBoltNoise.Do now wraps log.Writer() instead of os.Stderr so any
  existing stdlib logger redirection is preserved rather than clobbered
- ResignLeader now accepts a context.Context: leadershipTransfer runs in a
  goroutine and a select abandons the caller at ctx.Done(), returning
  ctx.Err(); the goroutine itself exits once the inner raft transfer
  completes (bounded by ElectionTimeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rface

- LeaderResigner.ResignLeader() → ResignLeader(ctx context.Context) error
- FullNode.ResignLeader passes ctx down to raft.Node.ResignLeader
- run_node.go calls resigner.ResignLeader(resignCtx) directly — no wrapper
  goroutine/select needed; context.DeadlineExceeded vs other errors are
  logged distinctly
- Merge TestFullNode_ResignLeader_NilRaftNode and
  TestFullNode_ResignLeader_NonLeaderRaftNode into single table-driven test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
pkg/raft/node_test.go (1)

195-228: Test logic is good; consider renaming for precision.

The assertions are correct, but the function name suggests NewNode behavior while it directly tests buildRaftConfig. Renaming would improve discoverability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/raft/node_test.go` around lines 195 - 228, Rename the test
TestNewNode_SnapshotConfigApplied to reflect what it actually verifies: the
buildRaftConfig function applies snapshot-related fields from Config; for
example rename to TestBuildRaftConfig_SnapshotConfigApplied (or
TestBuildRaftConfig_AppliesSnapshotConfig). Update the test name only (keep the
same table-driven cases and assertions using buildRaftConfig, rc, Config) so
test discovery and intent match the implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36de18f5-92b8-4b48-8098-050aafc90d75

📥 Commits

Reviewing files that changed from the base of the PR and between a2a2599 and 2d3dc8e.

📒 Files selected for processing (8)
  • node/full.go
  • node/full_node_test.go
  • node/node.go
  • pkg/cmd/run_node.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/cmd/run_node.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • node/node.go
  • node/full.go
  • pkg/config/config.go
  • pkg/raft/node.go

Member

@julienrbrt julienrbrt left a comment


nice work!

Member


this test does not look necessary

Contributor Author


Done — deleted in bf6bb50. The test was validating an implementation detail (that Stop() calls SetApplyCallback(nil)) rather than observable behaviour. The stubRaftNode helper it defined has been moved to syncer_test.go where it's still needed by TestSyncer_Stop_CallsRaftRetrieverStop.

Comment thread pkg/raft/node.go Outdated
}

func NewNode(cfg *Config, logger zerolog.Logger) (*Node, error) {
suppressBoltNoise.Do(func() {
Member


Let's delete this; it does not use the correct logger anyway.

Contributor Author


Done — deleted in bf6bb50. The global stdlib log.SetOutput redirect was the wrong scope; nothing else in the codebase should be silenced that way.

Comment thread pkg/raft/node.go Outdated
"google.golang.org/protobuf/proto"
)

// suppressBoltNoise redirects the stdlib log output once to drop the
Member


ditto, if we don't need log from this, and if it uses the stdlib log implementation, we should make sure hashicorp/raft-boltdb is instantiated with zerolog

Contributor Author


Unfortunately hashicorp/raft-boltdb does not support logger injection — its Options struct has no Logger field and all rollback errors are emitted via hardcoded log.Printf calls. There is no way to route those through zerolog without patching the library upstream. The "Rollback failed: tx closed" noise is benign (it fires every time defer tx.Rollback() runs after a successful tx.Commit()), so for now we accept it.

Comment thread pkg/raft/node.go
// Wait for FSM to apply all committed logs before shutdown to prevent state loss.
// This ensures pending raft messages are processed before the node stops.
if err := n.waitForMsgsLanded(n.config.SendTimeout); err != nil {
if err := n.waitForMsgsLanded(n.config.ShutdownTimeout); err != nil {
Member


Do we really need a different ShutdownTimeout vs SendTimeout? That's a lot of timeouts in the config, which is confusing for operators (cometbft has thousands of confusing timeouts).

Contributor Author


Agreed — removed from public config in bf6bb50. ShutdownTimeout is still computed internally as 5 × SendTimeout at the initRaftNode call site (node/full.go), so the drain window scales with the operator's chosen send timeout without exposing another knob. The ShutdownTimeout field stays in pkg/raft.Config as an internal detail.

auricom and others added 4 commits April 15, 2026 12:41
…t state

When a node wins election but its local store is more than 1 block behind
the raft FSM state, RecoverFromRaft cannot replay the gap (it only handles
the single latest block in the raft snapshot). Previously the node would
log "recovery successful" and start leader operations anyway, then stall
block production while the executor repeatedly failed to sync the missing
blocks from a store that did not have them.

Fix: in DynamicLeaderElection.Run, detect diff < -1 at the
follower→leader transition and immediately transfer leadership to a
better-synced peer. diff == -1 is preserved: RecoverFromRaft can apply
exactly one block from the raft snapshot, so that path is unchanged.

Closes #3255

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nTimeout

- Remove stdlib log filter (boltTxClosedFilter / suppressBoltNoise): it
  redirected the global stdlib logger which is the wrong scope. raft-boltdb
  uses log.Printf directly with no Logger option, so the "tx closed" noise
  is now accepted as-is from stderr.

- Wire hashicorp/raft's internal hclog output through zerolog: set
  raft.Config.Logger to an hclog.Logger backed by the zerolog writer so
  all raft-internal messages appear in the structured log stream under
  component=raft-hashicorp.

- Remove ShutdownTimeout from public config: it was a second "how long to
  wait" knob that confused operators. ShutdownTimeout is now computed
  internally as 5 × SendTimeout at the initRaftNode call site.

- Delete TestRaftRetrieverStopClearsApplyCallback: tested an implementation
  detail (Stop clears the apply callback pointer) rather than observable
  behaviour. The stubRaftNode helper it defined is moved to syncer_test.go
  where it is still needed.

- Rename TestNewNode_SnapshotConfigApplied → TestBuildRaftConfig_SnapshotConfigApplied
  to reflect that it tests buildRaftConfig, not NewNode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test

go mod tidy promotes github.com/hashicorp/go-hclog from indirect to
direct now that pkg/raft/node.go imports it explicitly.

gci auto-formatted stubRaftNode method stubs in syncer_test.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/config/config.go (1)

419-452: ⚠️ Potential issue | 🟠 Major

Validate raft.snap_count here too.

SnapCount is still allowed to be 0, but pkg/raft/node.go passes it straight into raft.NewFileSnapshotStore(...), which will fail during node startup instead of during config validation. Since this is now a user-facing retention knob, reject zero in RaftConfig.Validate() so bad operator input fails fast.

Suggested fix
 func (c RaftConfig) Validate() error {
 	if !c.Enable {
 		return nil
 	}
 	var multiErr error
@@
+	if c.SnapCount == 0 {
+		multiErr = errors.Join(multiErr, fmt.Errorf("snap count must be > 0"))
+	}
+
 	if c.SendTimeout <= 0 {
 		multiErr = errors.Join(multiErr, fmt.Errorf("send timeout must be positive"))
 	}

As per coding guidelines, "Validate all inputs from external sources in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/config.go` around lines 419 - 452, RaftConfig.Validate currently
misses validating SnapCount; add a check in the RaftConfig.Validate() method to
reject SnapCount == 0 (or <= 0) and join an appropriate error (e.g.,
fmt.Errorf("snap count must be positive")) into multiErr so invalid config fails
fast; reference the RaftConfig type and its Validate() method (and note that
raft.NewFileSnapshotStore in pkg/raft/node.go expects a positive snap count)
when locating where to add the validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/raft/election_test.go`:
- Around line 245-261: The test is nondeterministic due to time.Sleep and
t.Fatal in a goroutine; replace those with channel-based synchronization: in the
testRunnable used as leader (testRunnable.runFn) send a message on a new
leaderStartAttemptCh instead of calling t.Fatal, and have the main test
goroutine wait on that channel to observe the leader start attempt; remove
time.Sleep and instead use the existing leaderCh/fStarted or a new
leaderStartedCh to coordinate the leader transfer (send/receive to trigger state
transitions) and then perform assertions on the captured signals from
leaderStartAttemptCh and leaderStartedCh; update where DynamicLeaderElection is
constructed (leaderFactory/followerFactory) to return the testRunnable that uses
these channels for deterministic coordination.

In `@pkg/raft/election.go`:
- Around line 147-150: The current code swallows failures from
node.leadershipTransfer() by logging and continuing, which can leave the node
stuck; change the behavior so that transfer failures are surfaced instead of
silently continuing: inside the loop where node.leadershipTransfer() is called
(the call shown as d.node.leadershipTransfer() with the logger d.logger),
replace the bare continue on error with logic that returns or propagates tErr to
the caller (or otherwise transitions the state to abort leader startup) so the
failure is visible and handled upstream; ensure the log still records the error
before returning/propagating.

In `@pkg/raft/node.go`:
- Around line 84-90: NewNode currently can leave cfg.ShutdownTimeout as zero
which causes a panic in Stop() -> waitForMsgsLanded when creating a ticker with
interval zero; fix this by initializing/clamping the internal shutdown timeout
in NewNode before the config is stored: if cfg.ShutdownTimeout <= 0 set it to
the internal default (e.g. 5 * cfg.SendTimeout) or a sane minimum, so callers of
Stop() / waitForMsgsLanded always see a positive timeout value.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4087f79-ce2a-422e-a24c-41233d9807d3

📥 Commits

Reviewing files that changed from the base of the PR and between 2d3dc8e and bf6bb50.

📒 Files selected for processing (9)
  • block/internal/syncing/syncer_test.go
  • node/full.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/config/defaults.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/config/config_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/config/defaults.go
  • node/full.go
  • block/internal/syncing/syncer_test.go

…nsfer error propagation, deterministic test

ShutdownTimeout zero-value panic (critical):
NewNode now clamps ShutdownTimeout to 5*SendTimeout when the caller passes
zero, preventing a panic in time.NewTicker inside waitForMsgsLanded. The
normal path through initRaftNode already sets it explicitly; this guard
protects direct callers (e.g. tests) that omit the field.

Leadership transfer error propagation (major):
When store-lag abdication calls leadershipTransfer() and it fails, the
error is now returned instead of being logged and silently continuing.
Continuing after a failed transfer left the node as leader-without-worker,
stalling the cluster.

Deterministic abdication test (major):
Replace time.Sleep(10ms) + t.Fatal-in-goroutine with channel-based
synchronization: leader runFn signals leaderStarted; the test goroutine
waits up to 50ms for that signal and calls t.Error (safe from goroutines)
if it arrives, then cancels the context either way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@julienrbrt julienrbrt left a comment


the changes look great, could you add a changelog?

auricom added a commit that referenced this pull request Apr 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
auricom added a commit that referenced this pull request Apr 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom force-pushed the fix/3229-raft-re-election branch from cd82967 to e161e41 Compare April 15, 2026 14:18
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom force-pushed the fix/3229-raft-re-election branch from e161e41 to 84ec0d0 Compare April 15, 2026 14:19
Comment thread pkg/raft/election.go
if err != nil {
return err
}
if diff < -1 {
Member


this may be too eager: if all nodes in the raft cluster are behind, they will just pass the leadership to each other without having time to catch up themselves.

Contributor Author

@auricom auricom Apr 17, 2026


Confirmed this exact scenario in local tests. When all nodes restart simultaneously, their block stores lag the raft FSM height (block data arrives via p2p, not raft). When all nodes were restarted at once, the raft cluster entered a hot-potato loop with no stable leader and no block production.

Fixed in 5897ef3b: before abdicating, the new waitForBlockStoreSync helper polls IsSynced for up to ShutdownTimeout (~1s). The fastest-syncing peer stabilises as leader; nodes that can't catch up in time still abdicate normally. Lost-leadership is also detected mid-wait so the wait aborts early if another node takes over.

@auricom auricom marked this pull request as draft April 16, 2026 12:42
…tion

When all nodes restart simultaneously their block stores can lag behind
the raft FSM height (block data arrives via p2p, not raft). With the
previous code every elected node saw diff < -1 and immediately called
leadershipTransfer(), creating an infinite hot-potato: no node ever
stabilised as leader and block production stalled.

Instead of abdicating immediately, the new waitForBlockStoreSync helper
polls IsSynced for up to ShutdownTimeout (default ~1s). The fastest-
syncing peer proceeds as leader; nodes that cannot catch up in time still
abdicate and yield to a better candidate. Leadership also checks mid-wait
so a lost-leadership event aborts the wait early.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>