Supervoxel splitting with base+fork support and locks #534
Open
Conversation
Drop the inline propagate_to_coarser_scales call from write_seg; coarser mips are now the async downsample worker's responsibility. write_seg is back to a single base-scale tensorstore write, so SV splits no longer block on the full pyramid update. TestWriteSeg updated to assert the coarser scales stay zero after write_seg (propagation tested separately via TestPropagateToCoarserScales).
split_supervoxel now returns its base-resolution bbox. split_with_sv_splits collects one per call and attaches the list to Result.seg_bbox (new optional field). publish_edit includes the list in the payload and sets downsample="true" so the worker only runs on edits that touched base seg. List kept unmerged — lets the worker skip tiles outside the actual change region.
workers/downsample_worker.py consumes edits-exchange messages flagged downsample="true" and writes each non-base mip within the SV-split bbox. graph/downsample.py splits the region into pyramid_blocks (sized so no two blocks share a storage chunk at any mip), then downsamples either with a single tinybrain call (fast path, typical small edits) or per mip (fallback when the base read exceeds the memory budget). Write filtering keeps the OCDBT delta proportional to the actual change. DownsampleBlockLock serializes overlapping jobs via kvdbclient's row-key lock API; 26-byte hash-prefixed keys avoid tablet hot-spots. Depends on kvdbclient lock_by_row_key / unlock_by_row_key / renew_lock_by_row_key landing first.
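The block-sizing constraint ("no two blocks share a storage chunk at any mip") can be illustrated with a small sketch. This assumes factor-2 downsampling and cubic storage chunks; `pyramid_block_edge` is a hypothetical helper, not the actual graph/downsample.py API:

```python
def pyramid_block_edge(storage_chunk_edge: int, num_mips: int) -> int:
    """Smallest base-resolution block edge such that, under factor-2
    downsampling, a block maps to a whole number of storage chunks at
    every mip 0..num_mips-1 -- so no two blocks can share a chunk at any mip.
    """
    return storage_chunk_edge * (2 ** (num_mips - 1))


# With 64-voxel chunks and 4 mips, a 512-voxel base block shrinks to
# 256, 128, 64 at mips 1..3 -- each still a multiple of the chunk edge.
```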
Used by the async downsample worker as the mip-pyramid kernel.
Closes a concurrency gap where two SV splits on overlapping L2 chunks but distinct roots can't be serialized by root locks — they acquire disjoint root-lock sets and race on seg state. L2ChunkLock serializes them via the kvdbclient row-key lock primitive. Row key = 2-byte blake2b hash + 8-byte uint64 chunk_id (10 bytes). Hash prefix keeps spatially-clustered L2 chunks from hot-spotting a single bigtable tablet under concurrent load. Primitive only — callers land separately. RowKeyLockRegistry helper moved to tests/helpers.py so L2ChunkLock and DownsampleBlockLock tests share it instead of duplicating.
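The 10-byte row-key scheme described above can be sketched as follows (the helper name and exact byte packing are assumptions; only the 2-byte blake2b prefix + 8-byte uint64 layout comes from the description):

```python
import hashlib
import struct


def l2_chunk_lock_row_key(chunk_id: int) -> bytes:
    """Sketch of the L2ChunkLock row key: a 2-byte blake2b hash of the
    chunk id, prepended to the 8-byte big-endian uint64 chunk id.
    The hash prefix scatters spatially-clustered chunk ids across the
    keyspace so one bigtable tablet doesn't absorb all concurrent locks.
    """
    chunk_bytes = struct.pack(">Q", chunk_id)  # 8-byte uint64
    prefix = hashlib.blake2b(chunk_bytes, digest_size=2).digest()
    return prefix + chunk_bytes
```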
Callers now get Cut(atomic_edges) | PreviewCut(ccs, illegal_split) | SvSplitRequired(sv_remapping) instead of unwrapping-by-convention or catching SupervoxelSplitRequiredError. The exception still unwinds inside LocalMincutGraph — cheapest way to bail out of deep path code — but it's caught once at the run_multicut boundary and never escapes, so callers don't use raise/catch for control flow.
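A minimal sketch of the tagged-result shape, assuming plain dataclasses and isinstance dispatch (field types and the `handle` helper are illustrative, not the real API):

```python
from dataclasses import dataclass


@dataclass
class Cut:
    atomic_edges: list


@dataclass
class PreviewCut:
    ccs: list
    illegal_split: bool


@dataclass
class SvSplitRequired:
    sv_remapping: dict


def handle(result) -> str:
    # Callers branch on the tag instead of unwrapping by convention
    # or catching SupervoxelSplitRequiredError for control flow.
    if isinstance(result, Cut):
        return f"commit {len(result.atomic_edges)} edges"
    if isinstance(result, PreviewCut):
        return "illegal preview" if result.illegal_split else "preview ok"
    if isinstance(result, SvSplitRequired):
        return "needs sv split"
    raise TypeError(f"unexpected multicut result: {result!r}")
```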
MulticutOperation._apply dispatches on the tagged multicut result and, when an SV split is needed, calls the new edits_sv.split_supervoxels under its surrounding RootLock, refreshes source/sink SV IDs from seg, and retries multicut against the post-split graph. Root lock spans the whole critical section; L2ChunkLock held only around the split loop. This closes two races that existed when split_with_sv_splits handled the flow outside any lock: same-root (root lock now never released between multicut and commit) and cross-root (L2ChunkLock serializes overlapping split regions). split_with_sv_splits is deleted; handle_split calls cg.remove_edges directly.
Pre-compute each rep's bbox from the chunk coords of its CC members in sv_remapping (no coord-padding, no resolution-axis assumption). split_supervoxels builds the union lock set across reps — sparse chunks plus one L2-chunk margin for update_edges's 1-voxel overlap read — acquires once, then loops per-rep splits. _update_chunks surfaces the change_chunks that actually got new SV IDs; write_seg_chunks fires one tensorstore future per change-chunk and awaits together, so only chunks with real label changes hit OCDBT. Gap chunks between CC pieces and neighbor chunks read for the overlap never get rewritten, keeping the delta proportional to the edit. split_supervoxels also threads back the fresh source/sink SV IDs from the in-memory new_seg block (same bytes that just landed on storage), so the retry multicut sees current IDs without an extra seg read. Drops _get_whole_sv (dead since the sv_remapping switch). Adds a high-level architecture doc covering the end-to-end flow, concurrency design, and durable invariants.
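The bbox pre-computation from CC members' chunk coordinates can be sketched like this (axis count and chunk-size handling are assumptions; the real code works off sv_remapping):

```python
def bbox_from_chunk_coords(coords, chunk_size):
    """coords: iterable of (x, y, z) chunk indices for one rep's CC members;
    chunk_size: (sx, sy, sz) voxels per chunk.
    Returns (start, stop) voxel corners of the tight bbox covering all
    listed chunks -- no coord padding, no resolution-axis assumption.
    """
    coords = list(coords)
    start = tuple(min(c[a] for c in coords) * chunk_size[a] for a in range(3))
    stop = tuple((max(c[a] for c in coords) + 1) * chunk_size[a] for a in range(3))
    return start, stop
```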
Enable opening a CG's OCDBT at a prior commit via the driver's `version` spec field (int generation or ISO-8601 commit_time upper bound). Groundwork for operator-driven replay of failed SV splits, which needs clean pre-op reads against append-only storage. OCDBT stamps commits from absl::Now() with no caller-override hook, so pins will use OperationTimeStamp captured under the L2 chunk lock rather than aligning OCDBT commit times to operation time.
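A sketch of threading the pin into a spec, purely illustrative: the key names and nesting below are assumptions, not the driver's actual spec layout.

```python
def pinned_ocdbt_spec(base_spec: dict, version) -> dict:
    """Return a copy of base_spec with the kvstore pinned at `version`:
    either an int generation or an ISO-8601 string, read as an upper
    bound on commit_time. The pinned handle is read-only by contract.
    """
    spec = dict(base_spec)
    kvstore = dict(spec.get("kvstore", {}))
    kvstore["version"] = version
    spec["kvstore"] = kvstore
    return spec
```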
split_supervoxels is now a pure planner returning a SplitResult (seg_bboxes, source_ids_fresh, sink_ids_fresh, seg_writes, bigtable_rows). No lock acquisition, no writes. The caller (MulticutOperation._apply) holds the L2 chunk locks and fires the consolidated persist — OCDBT chunks + bigtable rows — inside an inner lock scope. seg_writes is a flat list of (voxel_slices, data) pairs across all reps so write_seg_chunks fires every chunk write as one parallel tensorstore batch. Removes the per-rep serialization in the old write_seg_chunks loop. get_seg_source_and_destination_ocdbt gains a pinned_at kwarg, forwarded to build_cg_ocdbt_spec — used later by the recovery path.
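A sketch of the planner's result shape, with exact field types inferred from the description (assumptions, not the real class), plus a fold showing how per-rep plans flatten into one batch:

```python
from dataclasses import dataclass, field


@dataclass
class SplitResult:
    """Pure plan: no lock acquisition, no writes. The caller persists
    seg_writes and bigtable_rows inside its own lock scope."""
    seg_bboxes: list = field(default_factory=list)
    source_ids_fresh: list = field(default_factory=list)
    sink_ids_fresh: list = field(default_factory=list)
    seg_writes: list = field(default_factory=list)  # flat (voxel_slices, data) pairs
    bigtable_rows: list = field(default_factory=list)

    def merge(self, other: "SplitResult") -> "SplitResult":
        # Per-rep plans fold into one flat result so every chunk write
        # can be fired as a single parallel tensorstore batch.
        return SplitResult(
            self.seg_bboxes + other.seg_bboxes,
            self.source_ids_fresh + other.source_ids_fresh,
            self.sink_ids_fresh + other.sink_ids_fresh,
            self.seg_writes + other.seg_writes,
            self.bigtable_rows + other.bigtable_rows,
        )
```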
New L2 chunk counterpart to IndefiniteRootLock, keyed by chunk row. L2ChunkLock now acquires via lock_by_row_key_with_indefinite so a temporal acquire sees a crashed op's indefinite cell and refuses. IndefiniteL2ChunkLock records its chunk scope on the op-log row's L2ChunkLockScope column at __enter__ and clears it on clean exit, giving recovery a durable scope without a bigtable-wide scan. Both indefinite locks (root and L2 chunk) now short-circuit __exit__ when an exception is propagating: cells stay held, scope stays set. Partial writes may exist after an exception; leaving the cells forces subsequent ops to refuse at lock-acquire and the operator to run recovery explicitly. privileged_mode=True on either lock is the operator recovery escape hatch: skips acquire, pre-populates acquired_keys so __exit__'s value-matched release deletes the crashed op's cells. RowKeyLockRegistry (test helper) gains the three new kvdbclient primitives.
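The hold-on-exception contract can be sketched as a context manager, assuming a kvdbclient-like object with lock/unlock by row key (names and the value-matched release are simplified here):

```python
class IndefiniteLockSketch:
    """Minimal sketch of the __exit__ short-circuit: on a propagating
    exception, cells stay held so subsequent ops refuse at lock-acquire
    and an operator must run recovery explicitly."""

    def __init__(self, client, keys, privileged_mode=False):
        self.client = client
        self.keys = keys
        # privileged_mode is the operator recovery escape hatch: skip
        # acquire but pre-populate acquired_keys so a clean __exit__
        # releases the crashed op's pre-existing cells.
        self.acquired_keys = list(keys) if privileged_mode else []
        self.privileged_mode = privileged_mode

    def __enter__(self):
        if not self.privileged_mode:
            for k in self.keys:
                self.client.lock_by_row_key(k)
                self.acquired_keys.append(k)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False  # exception propagating: keep cells held
        for k in self.acquired_keys:
            self.client.unlock_by_row_key(k)
        return False
```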
Operator recovery for SV-split ops that crashed mid-write. A worker death inside IndefiniteL2ChunkLock leaves per-chunk indefinite cells set and records the chunk scope on the op-log row. Recovery reverts partial OCDBT writes using a version-pinned read of pre-op voxels, then replays the op normally. list_stuck scans OperationLogs for ops still at CREATED past a min-age threshold. replay(cg, op_id) runs cleanup_partial_writes followed by repair.edits.repair_operation(..., unlock=True); IndefiniteL2ChunkLock's privileged-mode __exit__ deletes the crashed op's pre-existing cells after the replay's writes land. Architecture-level operator guide at docs/sv_splitting_recovery.md, linked from docs/sv_splitting.md's Concurrency section.
Pass the op's timestamp from execute → _apply → split_supervoxels → split_supervoxel → copy_parents_and_add_lineage / add_new_edges so every new-SV bigtable mutation lands at the op's logical write time. Gets atomic visibility under a parent_ts filter and makes replay's override_ts actually control what time-filtered readers see after a repair_operation. Parent-copy and Child-list writes deliberately keep the old cell's timestamp so pre-op readers still see the old hierarchy. Replace the seven-tuple in/out soup in split_supervoxels with named dataclasses: SvSplitTask (plan_sv_splits → split_supervoxel input) and SvSplitOutcome (split_supervoxel's per-task output). Drop the two unused return fields on split_supervoxel.
list_stuck filter switches from Status==CREATED to "L2ChunkLockScope set and Status != SUCCESS past min_age". The authoritative signal for stuck-ness is "scope recorded, not cleared" — worker crash (Status stays CREATED) and Python exception during persist (Status=EXCEPTION after Fix 1) both fall under it. Ops without scope aren't blocking other ops and are outside stuck_ops' concern. replay now verifies each chunk in the recorded scope actually has Concurrency.IndefiniteLock held by this op_id before running cleanup or repair. If cells are missing or held by a different op, raises with a clear error. Protects against double-replay (first run already released cells) and out-of-band clearing (manual bigtable edit, buggy release path) — both would have cleanup_partial_writes revert chunks that aren't ours.
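The revised stuck-op predicate can be sketched as below; the row field names are assumptions standing in for the real op-log schema:

```python
def is_stuck(row: dict, now: float, min_age: float) -> bool:
    """Stuck = scope recorded but never cleared, op not successful, and
    old enough. Covers both worker crash (Status stays CREATED) and a
    caught Python exception during persist; Status alone misses the
    latter. Ops without a scope aren't blocking anyone."""
    scope_set = bool(row.get("L2ChunkLockScope"))
    not_success = row.get("Status") != "SUCCESS"
    old_enough = (now - row.get("created_at", now)) >= min_age
    return scope_set and not_success and old_enough
```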
fork_base_manifest is now an explicit step — invoked from the ingest CLI's --ocdbt path or the seg_ocdbt notebook — rather than being auto-triggered on first ws_ocdbt_scales access. ws_ocdbt_scales asserts fork_exists() so a missing fork fails with a clear pointer instead of a tensorstore mismatch/not-found error.
Stuck-op detection keys off L2ChunkLockScope being populated, not Status=CREATED — that filter also covers caught-exception paths (Status=FAILED) where cells are held but the row isn't CREATED. Recovery now verifies each chunk's IndefiniteLock is actually held by the op before cleaning up, so a stale scope can't have us revert chunks another op owns. Reflect both in docs and the list CLI help. Drop "mode-downsample" from the SV-split diagram — tinybrain owns the algorithm; the doc shouldn't pin it.
Summary
Supervoxel splits in OCDBT-backed segmentations are now atomic under worker failure, with an operator tool for recovering stuck ops.
Locks
Two indefinite locks, IndefiniteRootLock and IndefiniteL2ChunkLock, both holding their cells on exception:
On unclean exit, both hold and the L2 scope is written to the op-log row.
Stuck-op signal
Non-empty scope field on the op-log row. Covers both `CREATED` (worker crash) and `FAILED` (caught exception) paths; status alone missed `FAILED`. Clean exit clears it.
Cleanup-then-replay
Concurrent ops on non-overlapping chunks advance OCDBT during the outage, so recovery can't use one pinned view.
OCDBT time-travel
Tensorstore `version` field in the spec (integer generation or ISO-8601 `Z` timestamp, interpreted as `commit_time ≤ T`). Threaded as `pinned_at`. Pinned handles are read-only.
Reader contract
`dataset_info` now publishes the full kvstore spec (kvstack layers + OCDBT config + data prefixes) instead of path fragments, so readers pass it verbatim to tensorstore. The multi-scale open asserts fork presence, failing with a clear message instead of a tensorstore internal.