
refactor: SVS shared thread pool with rental model (MOD-9881)#925

Open
meiravgri wants to merge 15 commits into main from meiravg_adjustable_thpool

Conversation


@meiravgri meiravgri commented Mar 31, 2026

Overview

Replaces per-index SVS thread pools with a single shared thread pool that can be physically resized at runtime, and introduces a rental-based execution model for parallel_for. This eliminates over-provisioning (previously K indexes × N threads = K×N OS threads) and enables the SVS pool to actually grow/shrink when the external pool capacity changes — something the old design could not do.

Also adds a new C API VecSim_UpdateThreadPoolSize() that combines write-mode toggling with pool resizing, and deprecates SVSParams::num_threads.

Key Guarantees and Assumptions

The fundamental scheduling invariant

The shared SVS thread pool has the same size as the RediSearch worker thread pool (search-workers). Every SVS multi-threaded operation (update, GC) is dispatched via SVSMultiThreadJob, which submits N reserve jobs to the RediSearch worker pool — one per worker thread. Only the threads that actually check in (are not busy with other work) participate in the SVS operation. This means:

  • The sum of all concurrent rent() requests can never exceed the pool capacity, because each rented SVS thread corresponds to a RediSearch worker that checked in via a reserve job. If a worker is busy, it doesn't check in, so it's never counted in availableThreads and never requested from the SVS pool.
  • setParallelism(n) is always called with n = min(availableThreads, problemSize), where availableThreads ≤ poolSize(). This is why the assertion n <= pool->size() is safe.
  • Shrink while threads are rented is safe by design — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and its OS thread) alive until ~RentedThreads releases it. Resize never blocks on in-flight operations.

Other invariants

  • Pool is always valid — the shared pool singleton is initialized with size 1 on first access (VecSimSVSThreadPoolImpl::instance()). There is no nullptr state; write-in-place mode uses a pool of size 1 (0 worker threads, calling thread only).
  • parallelism_ >= 1 — the calling thread always participates, so parallelism can never be 0.
  • Rental is safe during resize — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and OS thread) alive until ~RentedThreads.
  • No cross-index state leakage — each index has its own shared_ptr<atomic<size_t>> parallelism_; setParallelism() on one index never affects another.

Relationship with RediSearch

  • RediSearch owns the scheduling — it decides when to dispatch SVS jobs and how many workers are available.
  • VecSim owns the threads — the SVS thread pool is a separate set of OS threads inside VecSim, distinct from RediSearch workers.
  • RediSearch's only responsibility toward the SVS pool is calling VecSim_UpdateThreadPoolSize(N) when the worker count changes, so VecSim can create/destroy the matching number of OS threads.
  • The two pools are always the same size: RediSearch workers = N, SVS threads = N. This 1:1 sizing is what makes the "sum of rents ≤ capacity" guarantee work.

Changes from Old Implementation

| Aspect | Old (per-index pools) | New (shared pool + rental) |
|---|---|---|
| Pool ownership | Each SVS index creates its own `VecSimSVSThreadPoolImpl` | Singleton `VecSimSVSThreadPoolImpl` shared across all indexes via `shared_ptr` |
| Thread lifecycle | Pre-allocated at max capacity; `resize()` clamped a logical counter | Physical grow/shrink: `resize()` spawns/destroys OS threads |
| Pool could be null | Yes (write-in-place = no pool) | Never null; write-in-place = pool with size 1 |
| `parallel_for` threading | Threads used directly from the per-index pool | Threads rented from shared pool under mutex, released lock-free via `atomic<bool>` |
| Concurrent index operations | Each index had dedicated threads (K×N total) | Indexes rent disjoint subsets from N shared threads |
| Per-index parallelism control | `setNumThreads(n)` / `resize(n)` — mutated the pool | `setParallelism(n)` — sets a per-index `atomic<size_t>`, pool unchanged |
| Pool size query | `capacity()` (max pre-allocated) | `poolSize()` (current shared pool size) |
| Error handling | `manage_exception_during_run()` threw unconditionally | `manage_workers_after_run()` only throws if an error occurred |

API Changes

Renamed methods

| Old | New | Scope |
|---|---|---|
| `setNumThreads(n)` | `setParallelism(n)` | `SVSIndex`, `SVSIndexBase` virtual interface |
| `getNumThreads()` | `getParallelism()` | `SVSIndex`, `SVSIndexBase` virtual interface |
| `getThreadPoolCapacity()` | `getPoolSize()` | `SVSIndex`, `SVSIndexBase` virtual interface |

Removed methods

| Method | Reason |
|---|---|
| `VecSimSVSThreadPool::capacity()` | No pre-allocation; use `poolSize()` for shared pool size |
| `VecSimSVSThreadPool::resize()` | Wrappers don't own threads; use `setParallelism()` for per-index control, static `VecSimSVSThreadPool::resize(size_t)` for the global pool |

New C API

```c
// Combines write-mode toggling + physical SVS pool resize.
// new_size == 0 → WriteInPlace mode, pool resized to 1
// new_size >  0 → WriteAsync mode, pool resized to new_size
void VecSim_UpdateThreadPoolSize(size_t new_size);
```

New internal APIs

| API | Purpose |
|---|---|
| `VecSimSVSThreadPoolImpl::instance()` | Static singleton accessor; returns `shared_ptr<VecSimSVSThreadPoolImpl>` by value |
| `VecSimSVSThreadPool(void* log_ctx)` | Constructor — uses shared pool singleton internally |
| `VecSimSVSThreadPool::resize(size_t)` | Static method — resizes the shared pool singleton |
| `VecSimSVSThreadPool::poolSize()` | Returns shared pool size (for scheduling reserve jobs) |
| `VecSimSVSThreadPoolImpl::rent(count, log_ctx)` | Rents worker threads with RAII guard (`RentedThreads`) |

Deprecated

| API | Replacement | Notes |
|---|---|---|
| `VecSim_SetWriteMode()` | `VecSim_UpdateThreadPoolSize()` | Kept for internal VecSim usage (called inside `VecSim_UpdateThreadPoolSize`) and unit tests. Not called by RediSearch — RediSearch will use only `VecSim_UpdateThreadPoolSize` |
| `SVSParams::num_threads` | Shared pool singleton | Deprecated and ignored. Setting it to a non-zero value emits a deprecation warning log. Kept for backward compatibility with existing callers/tests. Pool size is controlled globally via `VecSim_UpdateThreadPoolSize()` |

New types

  • ThreadSlot — wraps svs::threads::Thread + atomic<bool> occupied; non-copyable, non-movable
  • RentedThreads — RAII guard; move-only; releases slots via lock-free atomic stores in destructor

Behavioral Changes

| Area | Old behavior | New behavior |
|---|---|---|
| `setParallelism(0)` | `resize(0)` silently clamped to 1 | Asserts (debug) / undefined (release) — reserving 0 threads is a bug |
| `setParallelism(n > poolSize)` | Grew the per-index pool beyond search-workers | Asserts `n <= pool->size()` — scheduling bug |
| `n > parallelism` in `parallel_for` | Threw `ThreadingException` if `n > size_` | Asserts `n <= parallelism_` on the wrapper — bug in SVS partitioning |
| `parallelism_` initial value | Implicitly matched pool size at construction | Starts at 1 — safe for immediate write-in-place use |
| Fewer threads rented than requested | N/A (no rental) | Logs a warning via `logCallback`, asserts in debug, uses `rented.count()` for graceful degradation in release |
| Exception in `parallel_for` | `manage_exception_during_run()` always threw | `manage_workers_after_run()` only throws if an error actually occurred |
| `updateSVSIndexWrapper` | Always locked backend even if no vectors to move | Skips lock when `labels_to_move` is empty |
| `SVSParams::num_threads` set to non-zero | Used as per-index pool size | Ignored; logs a deprecation warning |

Public info API fields — unchanged

The debug info field names numThreads and lastReservedThreads (and their string keys NUM_THREADS, LAST_RESERVED_NUM_THREADS) are unchanged. Their semantics shift slightly:

  • numThreads → now reports getPoolSize() (shared pool size, not per-index capacity)
  • lastReservedThreads → now reports getParallelism() (per-index parallelism for last operation)

What Does NOT Change

  • ScalableVectorSearch (deps/ScalableVectorSearch/) — no modifications
  • SVSMultiThreadJob pattern — still submits N reserve jobs to the external worker pool
  • HNSW tiered indexes — unaffected
  • SVS Thread state machine — used as-is; lifecycle managed via shared_ptr

Files Changed

| File | Changes |
|---|---|
| `src/VecSim/algorithms/svs/svs_utils.h` | Core refactor: `ThreadSlot`, `RentedThreads`, rental-based `VecSimSVSThreadPoolImpl`, per-index `VecSimSVSThreadPool` wrapper |
| `src/VecSim/algorithms/svs/svs.h` | Singleton accessor, API renames, constructor takes shared pool, `num_threads` deprecation warning |
| `src/VecSim/algorithms/svs/svs_tiered.h` | API renames, empty-labels guard in `updateSVSIndexWrapper` |
| `src/VecSim/vec_sim.h` / `vec_sim.cpp` | New `VecSim_UpdateThreadPoolSize()` C API |
| `src/VecSim/vec_sim_common.h` | `SVSParams::num_threads` marked as deprecated in comments |
| `tests/unit/test_svs_tiered.cpp` | Adapted to shared pool: `VecSim_UpdateThreadPoolSize` in setup, removed `num_threads` from params, updated assertions |
| `tests/unit/test_svs_fp16.cpp` | Same test adaptations |
| `tests/unit/test_svs.cpp` | `numThreads` info field expects default; new `NumThreadsParamIgnored` test |
| `tests/benchmark/bm_utils.h` / `bm_vecsim_svs.h` | API renames in benchmark helpers |

Testing

  • All existing SVS tiered tests updated to use the shared pool model
  • testThreadPool rewritten to cover: default parallelism, setParallelism boundary assertions (ASSERT_DEATH), parallel_for with n < parallelism, write-in-place mode, exception handling
  • TestDebugInfoThreadCount / TestDebugInfoThreadCountWriteInPlace updated: lastReservedThreads now starts at 1 (not num_threads)
  • ThreadsReservation test uses tieredIndexMock(num_threads) to properly size the shared pool
  • New: NumThreadsParamIgnored — verifies that setting SVSParams::num_threads emits a deprecation warning, does not affect the shared pool size, and that omitting it produces no warning
  • Tests that previously set num_threads = 1 in SVSParams now use tieredIndexMock(1) to resize the shared pool instead

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

High Risk
High risk because it rewrites SVS threading/concurrency primitives (shared pool, rental, resize semantics) and introduces a new global C API that changes write-mode behavior; bugs here could cause deadlocks, performance regressions, or incorrect parallel execution under load.

Overview
Refactors SVS to use a single shared, physically resizable thread pool with a rental-based parallel_for, replacing per-index thread pools and renaming the public per-index controls from get/setNumThreads to get/setParallelism (and getThreadPoolCapacity to getPoolSize).

Adds VecSim_UpdateThreadPoolSize() to globally resize the shared SVS pool and toggle write mode (0 → in-place, >0 → async), and deprecates/ignores SVSParams::num_threads while emitting a warning when set.

Updates tiered SVS scheduling to use pool size for job creation, sets per-operation parallelism before backend updates/GC, skips backend locking when there’s nothing to move, and adjusts debug-info semantics/tests; adds extensive new unit tests covering pool resize/rental/concurrency and the deprecation behavior.

Reviewed by Cursor Bugbot for commit b90ac4d.

…l terminology**

Rename `setNumThreads`/`getNumThreads` → `setParallelism`/`getParallelism` and `getThreadPoolCapacity` → `getPoolSize` across VectorSimilarity. Public info API fields (`numThreads`, `lastReservedThreads`, `NUM_THREADS`, `LAST_RESERVED_NUM_THREADS`) remain unchanged. No behavioral changes.
@jit-ci

jit-ci bot commented Mar 31, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR



@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 93.39623% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.90%. Comparing base (f177c40) to head (b90ac4d).

Files with missing lines Patch % Lines
src/VecSim/algorithms/svs/svs_utils.h 91.56% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   96.98%   96.90%   -0.09%     
==========================================
  Files         129      129              
  Lines        7574     7628      +54     
==========================================
+ Hits         7346     7392      +46     
- Misses        228      236       +8     

thpool tests
fix fp16 tests

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 2163e56.

@meiravgri meiravgri changed the title SVS Shared Thread Pool Refactor (MOD-9881) refactor: SVS shared thread pool with rental model (MOD-9881) Apr 7, 2026
rfsaliev
rfsaliev previously approved these changes Apr 8, 2026
Collaborator

@rfsaliev rfsaliev left a comment


LGTM, but I have some suggestions.

```cpp
// Singleton accessor for the shared SVS thread pool.
// Always valid — initialized with size 1 (write-in-place mode: 0 worker threads,
// only the calling thread participates). Resized on VecSim_UpdateThreadPoolSize() calls.
static std::shared_ptr<VecSimSVSThreadPoolImpl> getSharedThreadPool() {
```
Collaborator


It seems like VecSimSVSThreadPoolImpl is expected to always be used as a singleton.
If that assumption holds, I would suggest using the more conventional idiom:

```cpp
class VecSimSVSThreadPoolImpl {
...
private:
    explicit VecSimSVSThreadPoolImpl(size_t num_threads = 1) {...}
public:
    static std::shared_ptr<VecSimSVSThreadPoolImpl> instance() {...}
...
};
```
  1. Such an idiom prevents non-singleton usage.
  2. It moves thread-pool lifetime management out of the SVSIndex class.

Moreover, I would move full ownership to VecSimSVSThreadPool, possibly hiding VecSimSVSThreadPoolImpl in the details namespace:

```cpp
class VecSimSVSThreadPool {
...
public:
    explicit VecSimSVSThreadPool(void *log_ctx = nullptr)
        : pool_(details::VecSimSVSThreadPoolImpl::instance()),
          parallelism_(std::make_shared<std::atomic<size_t>>(1)),
          log_ctx_(log_ctx) {
        assert(pool_ && "Pool must not be null");
    }

    static auto resize(size_t new_size) {
        return details::VecSimSVSThreadPoolImpl::instance()->resize(new_size);
    }
...
};
```

Comment on lines +529 to +530
```cpp
void manage_workers_after_run(const std::string &main_thread_error, RentedThreads &rented,
                              size_t rented_count) {
```
Collaborator


Why are there 3 arguments, if rented.count() == rented_count here?

Collaborator Author


no good reason :)

Collaborator

@rfsaliev rfsaliev left a comment


LGTM


```cpp
// Move-only
RentedThreads(RentedThreads &&other) noexcept : slots_(std::move(other.slots_)) {}
RentedThreads &operator=(RentedThreads &&other) noexcept {
```
Collaborator


Is the move operator being used?

```cpp
// to keep slots alive even if the pool shrinks while the rental is active.
class RentedThreads {
public:
    RentedThreads() = default;
```
Collaborator


Should this be private, or is it also meant to be used outside of rent()?


```cpp
size_t count() const { return slots_.size(); }

svs::threads::Thread &operator[](size_t i) { return slots_[i]->thread; }
```
Collaborator


Should we add an assertion on the i value, or is the outside assertion enough?

Comment on lines +513 to +521
```cpp
for (auto &slot : slots_) {
    if (rented.count() >= count) {
        break;
    }
    bool expected = false;
    if (slot->occupied.compare_exchange_strong(expected, true, std::memory_order_acq_rel)) {
        rented.add(slot);
    }
}
```
Collaborator


This can be checked once outside the loop, with a local counter for a faster loop body:

```cpp
if (rented.count() < count) {
    int rent = 0;
    for (auto &slot : slots_) {
        bool expected = false;
        if (slot->occupied.compare_exchange_strong(expected, true, std::memory_order_acq_rel)) {
            rented.add(slot);
            if (++rent >= count) {
                break;
            }
        }
    }
}
```
