
Supervoxel splitting with base+fork support and locks#534

Open
akhileshh wants to merge 18 commits into pcgv3 from
akhilesh/sv-splitting-locks

Conversation

@akhileshh
Contributor

@akhileshh akhileshh commented Apr 23, 2026

Summary

Supervoxel splits in OCDBT-backed segmentations are now atomic under worker failure, with an operator tool for recovering stuck ops.

Locks

Two indefinite locks, both holding their cells on exception:

  • Root lock wraps multicut → split detection → voxel-level split → commit.
  • L2 chunk lock nests inside, scoped to rewritten chunks + 1-chunk margin for the edge-routing shell.

On unclean exit, both hold and the L2 scope is written to the op-log row.
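
The hold-on-exception behavior can be sketched with a minimal context manager. `IndefiniteLock` and the in-memory `held` dict are illustrative stand-ins for the real bigtable-cell locks, not the project's API:

```python
class IndefiniteLock:
    """Sketch: holds its cell on exception; releases only on clean exit."""

    def __init__(self, name, held, op_id="op-1"):
        self.name = name
        self.held = held      # shared dict: lock name -> holding op id
        self.op_id = op_id

    def __enter__(self):
        self.held[self.name] = self.op_id
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Clean exit: release the cell.
            del self.held[self.name]
        # Exception propagating: keep the cell held and re-raise,
        # so subsequent ops refuse at lock-acquire.
        return False
```

Nesting an `IndefiniteLock("l2", ...)` inside an `IndefiniteLock("root", ...)` then reproduces the described behavior: a crash anywhere in the inner scope leaves both cells held.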

Stuck-op signal

A non-empty scope field on the op-log row. This covers both the CREATED (worker crash) and FAILED (caught exception) paths; status alone missed FAILED. A clean exit clears it.

Cleanup-then-replay

Concurrent ops on non-overlapping chunks advance OCDBT during the outage, so recovery can't use one pinned view.

  • Cleanup: read pre-op voxels from an OCDBT handle pinned at the op's timestamp, write them back unpinned at the same chunk keys. Chunks outside the scope are untouched.
  • Replay: reruns via the privileged-repair path against latest state. Indefinite-lock exit releases cells by value-match on the original op ID.
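
The cleanup step can be sketched as follows, assuming hypothetical `open_seg` and chunk-keyed indexing helpers in place of the real tensorstore/OCDBT handles:

```python
def cleanup_partial_writes(scope_chunks, op_timestamp, open_seg):
    """Sketch of cleanup: revert scope chunks to their pre-op voxels.

    open_seg(pinned_at=T) is assumed to return a read-only view of the
    segmentation as of T; open_seg(pinned_at=None) the live, writable one.
    """
    pinned = open_seg(pinned_at=op_timestamp)   # pre-op view, read-only
    live = open_seg(pinned_at=None)             # unpinned, writable
    for chunk_key in scope_chunks:
        # Write pre-op voxels back at the same chunk keys; chunks
        # outside the recorded scope are never touched.
        live[chunk_key] = pinned[chunk_key]
```

This is why recovery does not need one pinned view of the whole dataset: only the scoped chunks are read pinned, and the write-back goes against latest state.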

OCDBT time-travel

A version field in the tensorstore spec (integer generation, or an ISO-8601 Z timestamp interpreted as commit_time ≤ T), threaded through as pinned_at. Pinned handles are read-only.
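
A rough sketch of how a `pinned_at` value might be folded into the spec; the spec shape here is illustrative and not the exact output of the project's spec builder:

```python
def apply_pin(spec: dict, pinned_at) -> dict:
    """Sketch: fold pinned_at into an (assumed) OCDBT spec dict.

    An int is treated as a generation; a string as an ISO-8601 Z
    timestamp, meaning the latest commit with commit_time <= T.
    """
    if pinned_at is None:
        return spec
    pinned = dict(spec)
    pinned["version"] = pinned_at
    pinned["writable"] = False  # pinned handles are read-only
    return pinned
```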

Reader contract

dataset_info now publishes the full kvstore spec (kvstack layers + OCDBT config + data prefixes) instead of path fragments, so readers pass it verbatim to tensorstore. The multi-scale open asserts fork presence, failing with a clear message instead of a tensorstore internal.

Drop the inline propagate_to_coarser_scales call from write_seg; coarser
mips are now the async downsample worker's responsibility. write_seg is
back to a single base-scale tensorstore write, so SV splits no longer
block on the full pyramid update. TestWriteSeg updated to assert the
coarser scales stay zero after write_seg (propagation tested separately
via TestPropagateToCoarserScales).
split_supervoxel now returns its base-resolution bbox. split_with_sv_splits
collects one per call and attaches the list to Result.seg_bbox (new
optional field). publish_edit includes the list in the payload and sets
downsample="true" so the worker only runs on edits that touched base seg.

The list is kept unmerged, which lets the worker skip tiles outside the
actual change region.
workers/downsample_worker.py consumes edits-exchange messages flagged
downsample="true" and writes each non-base mip within the SV-split
bbox. graph/downsample.py splits the region into pyramid_blocks (sized
so no two blocks share a storage chunk at any mip), then downsamples
each block either with a single tinybrain call (fast path, typical
small edits) or per-mip (fallback when the base read exceeds the memory
budget). Write filtering keeps the OCDBT delta proportional to the
actual change.

DownsampleBlockLock serializes overlapping jobs via kvdbclient's
row-key lock API; 26-byte hash-prefixed keys avoid tablet hot-spots.
Depends on kvdbclient lock_by_row_key / unlock_by_row_key /
renew_lock_by_row_key landing first.
Used by the async downsample worker as the mip-pyramid kernel.
Closes a concurrency gap where two SV splits on overlapping L2 chunks
but distinct roots can't be serialized by root locks — they acquire
disjoint root-lock sets and race on seg state. L2ChunkLock serializes
them via the kvdbclient row-key lock primitive.

Row key = 2-byte blake2b hash + 8-byte uint64 chunk_id (10 bytes).
Hash prefix keeps spatially-clustered L2 chunks from hot-spotting a
single bigtable tablet under concurrent load.
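
The 10-byte row key is concrete enough to sketch directly; only the byte order of the uint64 is an assumption here:

```python
import hashlib
import struct

def l2_lock_row_key(chunk_id: int) -> bytes:
    """Sketch of the 10-byte key: 2-byte blake2b prefix + 8-byte uint64.

    The hash prefix scatters spatially-clustered chunk ids across the
    keyspace so concurrent load doesn't hot-spot one tablet. Big-endian
    packing is an assumption, not confirmed by the PR.
    """
    chunk_bytes = struct.pack(">Q", chunk_id)
    prefix = hashlib.blake2b(chunk_bytes, digest_size=2).digest()
    return prefix + chunk_bytes  # 2 + 8 = 10 bytes
```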

Primitive only — callers land separately.

RowKeyLockRegistry helper moved to tests/helpers.py so L2ChunkLock and
DownsampleBlockLock tests share it instead of duplicating.
Callers now get Cut(atomic_edges) | PreviewCut(ccs, illegal_split) |
SvSplitRequired(sv_remapping) instead of unwrapping-by-convention or
catching SupervoxelSplitRequiredError. The exception still unwinds
inside LocalMincutGraph — cheapest way to bail out of deep path code —
but it's caught once at the run_multicut boundary and never escapes,
so callers don't use raise/catch for control flow.
MulticutOperation._apply dispatches on the tagged multicut result and,
when an SV split is needed, calls the new edits_sv.split_supervoxels
under its surrounding RootLock, refreshes source/sink SV IDs from seg,
and retries multicut against the post-split graph. Root lock spans the
whole critical section; L2ChunkLock held only around the split loop.
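
The tagged result and its dispatch might look like this; the field types and the stub actions are simplified assumptions:

```python
from dataclasses import dataclass

@dataclass
class Cut:
    atomic_edges: list

@dataclass
class PreviewCut:
    ccs: list
    illegal_split: bool

@dataclass
class SvSplitRequired:
    sv_remapping: dict

def dispatch(result):
    """Sketch of the _apply-style dispatch on the tagged multicut result."""
    if isinstance(result, Cut):
        return ("commit", result.atomic_edges)
    if isinstance(result, PreviewCut):
        return ("preview", result.ccs, result.illegal_split)
    if isinstance(result, SvSplitRequired):
        # Real caller: split supervoxels under the held RootLock,
        # refresh source/sink IDs, retry multicut.
        return ("split-then-retry", result.sv_remapping)
    raise TypeError(f"unexpected multicut result: {result!r}")
```

The payoff described above falls out of the shape: callers branch on the type instead of unwrapping by convention or catching `SupervoxelSplitRequiredError`.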

This closes two races that existed when split_with_sv_splits handled
the flow outside any lock: same-root (root lock now never released
between multicut and commit) and cross-root (L2ChunkLock serializes
overlapping split regions). split_with_sv_splits is deleted;
handle_split calls cg.remove_edges directly.
Pre-compute each rep's bbox from the chunk coords of its CC members in
sv_remapping (no coord-padding, no resolution-axis assumption).
split_supervoxels builds the union lock set across reps — sparse chunks
plus one L2-chunk margin for update_edges's 1-voxel overlap read —
acquires once, then loops per-rep splits.

_update_chunks surfaces the change_chunks that actually got new SV IDs;
write_seg_chunks fires one tensorstore future per change-chunk and
awaits together, so only chunks with real label changes hit OCDBT.
Gap chunks between CC pieces and neighbor chunks read for the overlap
never get rewritten, keeping the delta proportional to the edit.
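
The fan-out/await-together pattern is a minimal sketch away, using a thread pool in place of tensorstore futures; `write_chunk` is a stand-in for the per-chunk tensorstore write:

```python
from concurrent.futures import ThreadPoolExecutor

def write_seg_chunks(change_chunks, write_chunk):
    """Sketch: one write per change-chunk, all awaited together.

    Only chunks with real label changes appear in change_chunks, so
    gap chunks and overlap-read neighbors never get rewritten.
    """
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(write_chunk, key, data)
                   for key, data in change_chunks]
        return [f.result() for f in futures]  # await together
```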

split_supervoxels also threads back the fresh source/sink SV IDs from
the in-memory new_seg block (same bytes that just landed on storage),
so the retry multicut sees current IDs without an extra seg read.

Drops _get_whole_sv (dead since the sv_remapping switch).

Adds a high-level architecture doc covering the end-to-end flow,
concurrency design, and durable invariants.
Enable opening a CG's OCDBT at a prior commit via the driver's `version`
spec field (int generation or ISO-8601 commit_time upper bound).
Groundwork for operator-driven replay of failed SV splits, which needs
clean pre-op reads against append-only storage.

OCDBT stamps commits from absl::Now() with no caller-override hook, so
pins will use OperationTimeStamp captured under the L2 chunk lock rather
than aligning OCDBT commit times to operation time.

split_supervoxels is now a pure planner returning a SplitResult
(seg_bboxes, source_ids_fresh, sink_ids_fresh, seg_writes,
bigtable_rows). No lock acquisition, no writes. The caller
(MulticutOperation._apply) holds the L2 chunk locks and fires
the consolidated persist — OCDBT chunks + bigtable rows — inside
an inner lock scope.

seg_writes is a flat list of (voxel_slices, data) pairs across all
reps so write_seg_chunks fires every chunk write as one parallel
tensorstore batch. Removes the per-rep serialization in the old
write_seg_chunks loop.
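
The planner's return contract could be sketched as a dataclass; the field names follow the text above, while the types and the flattening helper are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SplitResult:
    """Sketch of the pure-planner output: no locks taken, no writes done."""
    seg_bboxes: list
    source_ids_fresh: list
    sink_ids_fresh: list
    seg_writes: list       # flat (voxel_slices, data) pairs across all reps
    bigtable_rows: list

def flatten_rep_writes(per_rep_writes):
    # seg_writes is kept flat across reps so the caller can fire every
    # chunk write as one parallel batch instead of per-rep loops.
    return [w for rep in per_rep_writes for w in rep]
```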

get_seg_source_and_destination_ocdbt gains a pinned_at kwarg, forwarded
to build_cg_ocdbt_spec — used later by the recovery path.
New L2 chunk counterpart to IndefiniteRootLock, keyed by chunk row.
L2ChunkLock now acquires via lock_by_row_key_with_indefinite so a
temporal acquire sees a crashed op's indefinite cell and refuses.
IndefiniteL2ChunkLock records its chunk scope on the op-log row's
L2ChunkLockScope column at __enter__ and clears it on clean exit,
giving recovery a durable scope without a bigtable-wide scan.

Both indefinite locks (root and L2 chunk) now short-circuit __exit__
when an exception is propagating: cells stay held, scope stays set.
Partial writes may exist after an exception; leaving the cells forces
subsequent ops to refuse at lock-acquire and the operator to run
recovery explicitly.

privileged_mode=True on either lock is the operator recovery escape
hatch: skips acquire, pre-populates acquired_keys so __exit__'s
value-matched release deletes the crashed op's cells.
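
The acquire/refuse, hold-on-exception, and privileged value-matched release behaviors fit in one compact sketch, with an in-memory dict standing in for the per-chunk bigtable lock cells:

```python
class IndefiniteL2ChunkLock:
    """Sketch of the lock described above; `cells` maps chunk -> op id."""

    def __init__(self, op_id, chunks, cells, privileged_mode=False):
        self.op_id, self.chunks, self.cells = op_id, chunks, cells
        self.privileged_mode = privileged_mode
        self.acquired_keys = []

    def __enter__(self):
        if self.privileged_mode:
            # Operator recovery escape hatch: skip acquire, pre-populate
            # acquired_keys so __exit__ releases the crashed op's cells.
            self.acquired_keys = list(self.chunks)
            return self
        for chunk in self.chunks:
            if chunk in self.cells:
                raise RuntimeError(f"chunk {chunk} held by {self.cells[chunk]}")
            self.cells[chunk] = self.op_id
            self.acquired_keys.append(chunk)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False  # exception: cells stay held, scope stays set
        for key in self.acquired_keys:
            # Value-matched release: only delete cells this op id owns.
            if self.cells.get(key) == self.op_id:
                del self.cells[key]
        return False
```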

RowKeyLockRegistry (test helper) gains the three new kvdbclient
primitives.
Operator recovery for SV-split ops that crashed mid-write. A worker
death inside IndefiniteL2ChunkLock leaves per-chunk indefinite cells
set and records the chunk scope on the op-log row. Recovery reverts
partial OCDBT writes using a version-pinned read of pre-op voxels,
then replays the op normally.

list_stuck scans OperationLogs for ops still at CREATED past a
min-age threshold. replay(cg, op_id) runs cleanup_partial_writes
followed by repair.edits.repair_operation(..., unlock=True);
IndefiniteL2ChunkLock's privileged-mode __exit__ deletes the crashed
op's pre-existing cells after the replay's writes land.

Architecture-level operator guide at docs/sv_splitting_recovery.md,
linked from docs/sv_splitting.md's Concurrency section.
Pass the op's timestamp from execute → _apply → split_supervoxels →
split_supervoxel → copy_parents_and_add_lineage / add_new_edges so
every new-SV bigtable mutation lands at the op's logical write time.
Gets atomic visibility under a parent_ts filter and makes replay's
override_ts actually control what time-filtered readers see after a
repair_operation. Parent-copy and Child-list writes deliberately keep
the old cell's timestamp so pre-op readers still see the old
hierarchy.

Replace the seven-tuple in/out soup in split_supervoxels with named
dataclasses: SvSplitTask (plan_sv_splits → split_supervoxel input)
and SvSplitOutcome (split_supervoxel's per-task output). Drop the
two unused return fields on split_supervoxel.
list_stuck filter switches from Status==CREATED to "L2ChunkLockScope
set and Status != SUCCESS past min_age". The authoritative signal for
stuck-ness is "scope recorded, not cleared" — worker crash (Status
stays CREATED) and Python exception during persist (Status=EXCEPTION
after Fix 1) both fall under it. Ops without scope aren't blocking
other ops and are outside stuck_ops' concern.
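
The revised filter reduces to a small predicate; the row shape and field names here are assumptions:

```python
def is_stuck(row: dict, now: float, min_age_s: float) -> bool:
    """Sketch of the list_stuck filter: scope recorded and not cleared,
    status not SUCCESS, older than min_age."""
    return (
        bool(row.get("L2ChunkLockScope"))   # scope set, never cleared
        and row.get("Status") != "SUCCESS"  # covers CREATED and FAILED alike
        and now - row["timestamp"] > min_age_s
    )
```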

replay now verifies each chunk in the recorded scope actually has
Concurrency.IndefiniteLock held by this op_id before running cleanup
or repair. If cells are missing or held by a different op, raises
with a clear error. Protects against double-replay (first run already
released cells) and out-of-band clearing (manual bigtable edit,
buggy release path) — both would have cleanup_partial_writes revert
chunks that aren't ours.
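
The pre-check is a straightforward ownership scan; `cells` is an illustrative stand-in for the per-chunk indefinite-lock cells:

```python
def verify_scope_held(scope, cells, op_id):
    """Sketch of the replay pre-check: every chunk in the recorded scope
    must have its indefinite-lock cell held by this op_id, else abort."""
    for chunk in scope:
        holder = cells.get(chunk)
        if holder != op_id:
            raise RuntimeError(
                f"chunk {chunk}: expected lock held by {op_id}, "
                f"found {holder!r}; refusing cleanup/replay")
```

This is what makes double-replay safe: a first replay releases the cells, so a second attempt fails the scan instead of reverting chunks another op now owns.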
fork_base_manifest is now an explicit step — invoked from the ingest
CLI's --ocdbt path or the seg_ocdbt notebook — rather than being
auto-triggered on first ws_ocdbt_scales access. ws_ocdbt_scales
asserts fork_exists() so a missing fork fails with a clear pointer
instead of a tensorstore mismatch/not-found error.
Stuck-op detection keys off L2ChunkLockScope being populated, not
Status=CREATED — that filter also covers caught-exception paths
(Status=FAILED) where cells are held but the row isn't CREATED.

Recovery now verifies each chunk's IndefiniteLock is actually held
by the op before cleaning up, so a stale scope can't have us revert
chunks another op owns. Reflect both in docs and the list CLI help.
Drop "mode-downsample" from the SV-split diagram — tinybrain owns
the algorithm; the doc shouldn't pin it.
@akhileshh akhileshh changed the title sv splitting locks Supervoxel splitting with locks Apr 24, 2026
@akhileshh akhileshh changed the title Supervoxel splitting with locks Supervoxel splitting with base+fork support and locks Apr 24, 2026
@akhileshh akhileshh requested review from fcollman and sdorkenw April 24, 2026 02:29
