
Speed up historical MVCC reads #3266

Open
Kbhat1 wants to merge 16 commits into main from kartik/perf-mvcc

Conversation

Contributor

@Kbhat1 Kbhat1 commented Apr 16, 2026

Describe your changes and provide context

  • Faster historical reads: flip the PebbleDB version encoding to descending, so the newest version sorts first
  • IMPORTANT: open-time compatibility gate: pebbledb stamps an s/_mvcc_descending sentinel on fresh DBs and refuses to open legacy ascending-version databases
  • Adds an MVCC ascending flag
  • Read session for traces: one Pebble snapshot plus reusable iterators per request

Roughly another ~2x speedup on debug_trace

Testing performed to validate your change

  • Verified end-to-end on a node

MVCC PebbleDB changes:
- Invert version ordering so newest versions sort first per key, making
  historical lookups a direct SeekGE instead of an unbounded reverse seek.
- Introduce a per-request historical read session that holds one Pebble
  snapshot plus reusable iterators per store prefix, eliminating the
  per-read iterator create/destroy overhead.
- Add a request-scoped (module, key, version) read cache so repeated
  hot-key reads inside a single trace skip Pebble entirely.
- Add a sentinel-versioned latest-value fast path (math.MaxInt64 per
  key) usable as a bloom-filter accelerated db.Get when tracing at or
  past the latest committed height.
- Harden batch writes: sort MVCC batches before commit, guard against
  empty store keys/keys to keep state-sync imports safe, and commit the
  latest-version metadata key separately so it never violates Pebble's
  strictly-increasing batch invariant.
- Extend iterator navigation to drive MVCC seeks by logical key under
  the new descending-version encoding, with forward/reverse helpers and
  defensive skips for the sentinel pointer.

Tracing hooks:
- Add ReadTraceCollector / ReadTraceEvent / TraceableStateStore types in
  db_engine/types so callers can optionally observe low-level reads. The
  MVCC Database implements TraceableStateStore as a no-op when no
  collector is attached, so this change is zero-cost in the default
  configuration; the actual consumers live in the follow-up profiling
  PR.

evmrpc:
- Add a block-level replay cache keyed by block hash + tx index so a
  debug trace request that replays txs [0..N) against the historical
  state reuses the post-tx context from earlier calls in the same block
  instead of replaying from scratch.

No migration is required: the new on-disk version encoding is applied
on fresh state-synced data, and the latest-value sentinel is additive.

Made-with: Cursor

github-actions bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build     Format    Lint      Breaking  Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  Apr 17, 2026, 8:08 PM


codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 38.68195% with 428 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.23%. Comparing base (b57aeb8) to head (39ae166).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
...i-db/db_engine/pebbledb/mvcc/iterator_ascending.go 0.00% 180 Missing ⚠️
sei-db/db_engine/pebbledb/mvcc/db_ascending.go 18.23% 122 Missing and 8 partials ⚠️
sei-db/db_engine/pebbledb/mvcc/db.go 68.15% 38 Missing and 19 partials ⚠️
sei-db/db_engine/pebbledb/mvcc/batch.go 53.12% 26 Missing and 4 partials ⚠️
sei-db/db_engine/pebbledb/mvcc/iterator.go 69.76% 21 Missing and 5 partials ⚠️
sei-db/db_engine/pebbledb/mvcc/comparator.go 83.33% 3 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3266      +/-   ##
==========================================
- Coverage   59.30%   59.23%   -0.08%     
==========================================
  Files        2071     2073       +2     
  Lines      169811   170250     +439     
==========================================
+ Hits       100713   100843     +130     
- Misses      60327    60623     +296     
- Partials     8771     8784      +13     
Flag | Coverage Δ
sei-chain-pr 48.64% <38.68%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
sei-db/db_engine/pebbledb/mvcc/comparator.go 68.42% <83.33%> (+4.61%) ⬆️
sei-db/db_engine/pebbledb/mvcc/iterator.go 66.36% <69.76%> (+10.34%) ⬆️
sei-db/db_engine/pebbledb/mvcc/batch.go 52.63% <53.12%> (+10.77%) ⬆️
sei-db/db_engine/pebbledb/mvcc/db.go 66.08% <68.15%> (-0.13%) ⬇️
sei-db/db_engine/pebbledb/mvcc/db_ascending.go 18.23% <18.23%> (ø)
...i-db/db_engine/pebbledb/mvcc/iterator_ascending.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


@Kbhat1 Kbhat1 changed the title perf: speed up historical MVCC reads and trace replays Speed up historical MVCC reads and trace replays Apr 17, 2026
Kbhat1 added 2 commits April 16, 2026 20:24
ReplayTransactionTillIndex was storing sdk.Context values whose MultiStore
field still pointed at the live replay state. The -1 checkpoint was cached
before replay started but then advanced in place as transactions were
applied, and later cache hits reused the stored checkpoint directly.

Store a branched CacheMultiStore snapshot when caching a checkpoint, and
branch again when resuming from one, so each replay starts from an
immutable saved state instead of mutating the cached checkpoint itself.

Made-with: Cursor
sei-db prune: the existing NextPrefix guard was written for ascending
version order. Under the new descending encoding the newest version of
a key is seen first, so whenever the newest version was above the prune
height the iterator skipped to the next logical key and every older
version (including ones below the prune height) was leaked on disk.
Rewrite the loop around per-logical-key state: on entering a new key
with newest > prune, seek straight to the first version <= prune, then
keep one version below the prune height (when KeepLastVersion=true) and
delete the rest. The fast-path seek preserves the old optimization of
skipping above-prune versions.

evmrpc replay cache: replayStateCache grew one entry per (block, tx)
forever with no eviction, each holding a branched CacheMultiStore. Wrap
it in an expirable LRU bounded by block count (32) and TTL (10m); keep
per-block mutexes so inner-map updates stay safe under concurrent
traces.

Regression tests:
- sei-db/.../prune_test.go exercises the descending bug directly by
  reading raw pebble entries after Prune, covering keepLast true/false,
  mixed above/below-prune keys, and sentinel preservation. They fail on
  the pre-fix code.
- evmrpc/simulate_cache_test.go covers get/put, LRU eviction past the
  cap, and concurrent access under -race.
Kbhat1 and others added 10 commits April 17, 2026 00:08
- Change replayStateCacheMu to *sync.Mutex so Backend copies stay safe
  under copylocks (Backend has value-receiver methods and tracers.go
  dereferences the pointer into a local).
- Annotate two int64->uint64 conversions on non-negative block
  heights with nolint:gosec, matching existing conventions in this
  package.
- Wrap recordReadTrace defer calls in closures so time.Since is
  evaluated at defer fire time, not at defer registration.
- Drop redundant Database selector from tracedDatabase iterator calls.
Drops the reserved math.MaxInt64 latest-pointer entry and returns the
historical read path to the inverted-order SeekGE lookup plus the
request-scoped read cache and reusable iterator session.

Benchmarks showed the sentinel pointer made traces slower, not faster:
the extra db.Get per read plus the write amplification from emitting
a pointer entry per set outweighed the bloom-filter-accelerated hit.
This keeps the other MVCC wins while removing the sentinel.

Also drops the sentinel-skip branches in Prune/RawIterate/iterator and
the corresponding sentinel-specific prune test.
- Drop unused batchOp.order field; sortBatchOps sorts by key only.
- Drop unused NewReadTraceEvent helper and its time import.
- Fix misleading Reverse:true trace events in getMVCCSlice and
  getMVCCSliceWithSession; these paths are forward SeekGE + First.
- Drop stale "reverse iteration is NOT supported" comment on iterator.
- Rename shadowed builtin cap to size in replay-cache eviction test.
- Move latestVersionKey Set into the pebble batch so data and
  version metadata commit atomically, restoring the invariant that
  crash recovery observes a consistent version pointer.
- Extract decodeMVCCEntry to share the post-seek validate/decode/clone
  logic between getMVCCSlice and getMVCCSliceWithSession.
- Drop the fine-grained per-step trace events on the Get hot path and
  route the remaining top-level timing through traceGetMVCCSlice, which
  short-circuits when no collector is attached (no slices.Clone in the
  default zero-collector path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the ReadTrace types, per-op trace hooks, tracedDatabase wrapper,
and historicalReadSession read-cache. This branch is scoped to MVCC
perf changes; the tracing plumbing belongs with the debug trace-profile
endpoint PR and can be re-introduced there once it lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reject legacy ascending-version databases on open. On a populated DB
lacking the s/_mvcc_descending marker, return an error that points
operators at state sync; on an empty DB, write the marker so future
opens fast-path. Protects against silent version mis-decoding after
an in-place upgrade to the new encoding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Archive nodes that cannot state-sync need to keep reading and writing
their existing ascending-version MVCC databases. Rather than forcing a
migration, detect the on-disk encoding at open time and dispatch to the
matching code path.

- OpenDB sets Database.descending from the s/_mvcc_descending sentinel:
  present or empty DB -> descending (fast path, marker written on
  fresh DBs); marker absent on a populated DB -> ascending legacy mode,
  no error, no marker write.
- Public Get/Has/Iterator/ReverseIterator/Prune dispatch to *Descending
  or *Ascending implementations. Batch threads the mode through writes
  (NewBatchWithMode) so ApplyChangeset / Import / DeleteKeysAtVersion
  persist in the DB's native encoding.
- File layout makes the split obvious for review:
  - db_descending.go, iterator_descending.go -- perf fast path (this
    branch's rewrite)
  - db_ascending.go, iterator_ascending.go   -- verbatim port of main's
    legacy implementation, only adjusted to use MVCCEncodeAscending and
    the ascendingIterator type
  - comparator.go                            -- both encode/decode
    variants plus a parameterized MVCCEncode(key, v, descending) for
    shared call sites
- Tests cover all three open paths (fresh-writes-marker, legacy-opens-
  ascending, marked-roundtrip) including a write+read+reopen cycle on a
  seeded legacy DB to confirm the ascending path is end-to-end correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
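The three open paths can be sketched as a small decision function. The sentinel name mirrors the s/_mvcc_descending marker described above, but the map-backed storage here is a hypothetical stand-in for pebble:

```go
package main

import "fmt"

// descendingSentinel marks databases written with the new encoding.
const descendingSentinel = "s/_mvcc_descending"

// detectMode sketches the three open paths the commit describes:
// marker present -> descending; empty DB -> descending, stamping the
// marker; populated DB without marker -> legacy ascending, no write.
func detectMode(db map[string][]byte) (descending bool) {
	if _, ok := db[descendingSentinel]; ok {
		return true // already marked: fast path
	}
	if len(db) == 0 {
		db[descendingSentinel] = []byte{1} // stamp fresh DBs
		return true
	}
	return false // populated legacy DB: ascending mode, no marker write
}

func main() {
	fresh := map[string][]byte{}
	legacy := map[string][]byte{"s/k1": []byte("v")}
	marked := map[string][]byte{descendingSentinel: {1}}
	fmt.Println(detectMode(fresh), detectMode(legacy), detectMode(marked)) // true false true
}
```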
Drop three ornamental wrappers flagged by review:

- Batch.appendSet / Batch.appendDelete / RawBatch.appendSet were each
  five-line clone-and-append helpers called 1-3 times; inline at the
  call sites.
- NewBatch / NewRawBatch were zero-argument wrappers that always
  forwarded with descending=true to NewBatchWithMode. Drop the wrappers,
  rename NewBatchWithMode -> NewBatch and NewRawBatchWithMode ->
  NewRawBatch so the mode is required at the one kind of call site that
  exists.
- snapshotReplayState was a one-liner ctx.WithMultiStore(ctx.MultiStore()
  .CacheMultiStore()) wrapper with two call sites. Inline it.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The block-level replay-state LRU only paid off when trace requests
clustered on the same block in quick succession. evmrpc handles one
call at a time and can't observe whether callers are clustering;
workload-aware prefix reuse belongs above this layer (in a block
explorer, indexer, or tracing harness that actually knows its access
pattern). Baking it into the RPC added stateful fields on Backend, an
LRU size / TTL we picked without data, and a double-lock put path that
existed only to serve a workload the RPC can't verify.

Drop the cache entirely:
- blockReplayState, replayStateCache, replayStateCacheMu,
  replayStateCacheBlocks, replayStateCacheTTL from evmrpc/simulate.go
- getReplayState, putReplayState methods
- the cache check + two putReplayState calls in
  ReplayTransactionTillIndex; unconditionally replay from tx 0
- evmrpc/simulate_cache_test.go (no longer meaningful)
- the golang-lru/v2/expirable import (still used elsewhere in evmrpc)

Per-call perf is unaffected: the descending-version MVCC encoding
still makes each replayed read cheap, which is the right layer for a
per-call optimization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kbhat1 Kbhat1 changed the title Speed up historical MVCC reads and trace replays Speed up historical MVCC reads Apr 17, 2026
The three-way file split (db.go / db_descending.go / db_ascending.go)
implied two equal modes when the intent is one canonical path plus a
backstop. Fold the descending implementation back into db.go so the
canonical Database lives in one place, and keep db_ascending.go and
iterator_ascending.go as the clearly-labeled legacy carveouts. Rename
iterator_descending.go -> iterator.go for the same reason: it is the
iterator; iterator_ascending.go is the compat variant.

File layout after:
- db.go            - Database, OpenDB, dispatchers, writes, descending
                     mode implementation, shared helpers, metrics
- db_ascending.go  - legacy ascending-mode carveout for pre-migration DBs
- iterator.go      - descending iterator (the iterator)
- iterator_ascending.go - legacy iterator carveout

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kbhat1 Kbhat1 requested a review from yzang2019 April 17, 2026 18:28
Kbhat1 and others added 2 commits April 17, 2026 15:27
Two identical code paths were waiting for a single kernel:

- retrieveLatestVersion and retrieveEarliestVersion differed only in
  the metadata key name. Fold both into retrieveVersionKey(db, key).
- Batch.Write and RawBatch.Write share the otel metrics defer, the
  pebble batch open/close, sortBatchOps, and the ops-application loop.
  The only delta is the latestVersionKey stamp in Batch.Write. Extract
  writeBatchOps(storage, ops, beforeCommit) and pass the stamp as a
  hook from Batch.Write; RawBatch.Write passes nil.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Leftover from the earlier replay-cache revert — the one-line whitespace
delta was the only difference from main in this file. Match main exactly
so simulate.go drops out of the PR's diff stat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
