Create (grv|commit)ProxyLoadBalance helper coroutines#12918
Create (grv|commit)ProxyLoadBalance helper coroutines#12918tclinkenbeard-oai wants to merge 10 commits intoapple:mainfrom
(grv|commit)ProxyLoadBalance helper coroutines#12918Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
gxglass
left a comment
There was a problem hiding this comment.
In principle this seems pretty straightforward.
Here's what Claude said. What do you make of the non-cancellation?
PR #12918 Analysis
Goal
This PR refactors the "proxy load-balance with retry on proxy change" pattern from the legacy FoundationDB actor-compiler idiom (ACTOR + loop { choose { when(...) } }) to modern C++20 coroutines (co_await, co_return, race()).
The repeated boilerplate:
loop {
choose {
when(wait(cx->onProxiesChanged())) {}
when(Reply rep = wait(basicLoadBalance(cx->getProxies(...), &Interface::method, Request(...)))) {
// use rep
}
}
}
is replaced by a single call to a new helper:
Reply rep = co_await commitProxyLoadBalance(cx, makeReqBuilder(...), &Interface::method);
A new header fdbclient/ProxyLoadBalance.h provides two template functions (commitProxyLoadBalance / grvProxyLoadBalance) and a ReqBuilder class that stores constructor args so the request can be rebuilt on each retry iteration.
Four call sites are converted:
- getHealthMetricsActor (GRV proxy)
- waitDataDistributionMetricsList (commit proxy)
- snapCreate (commit proxy)
- checkSafeExclusions (commit proxy)
Risks and Bugs
- BUG: checkSafeExclusions uses wait() instead of co_await
In the diff (lines 155-158), checkSafeExclusions remains an ACTOR (it still has the ACTOR macro and uses state variables and wait() later in the function), but the new code uses wait() on the result of commitProxyLoadBalance. The problem is commitProxyLoadBalance is a coroutine
(uses co_await/co_return), and it should work fine as it returns Future — wait() works on any Future. However, the function is not fully converted like the others. It still uses ACTOR, state, and wait() in the rest of the body. This is technically valid (mixing actor-compiled
wait() with a callee that happens to be a coroutine), but it's inconsistent — every other converted call site dropped the ACTOR macro and switched to co_await. This looks like an incomplete conversion, likely unintentional.
- Semantic difference: non-winning futures are detached, not cancelled
The old choose/when pattern (implemented by the actor compiler) cancels the losing future when the winning branch is taken. The new race() implementation detaches non-winning futures rather than actively cancelling them. This means that when proxies change, the in-flight
basicLoadBalance call from the old iteration continues running in the background until it completes or errors, rather than being immediately cancelled. This could cause:
- Stale requests arriving at old proxies
- Minor resource waste (network, memory)
- In the AtMostOnce::True case (snapCreate), if the old request actually succeeds on a stale proxy while the retry also succeeds on a new proxy, the operation could execute twice, violating the at-most-once guarantee.
- snapCreate at-most-once semantics risk
snapCreate passes AtMostOnce::True. With the old choose/when, if proxies changed, the in-flight request was cancelled. With the new race(), the old request is detached and could still complete successfully on the old proxy, while a new request is sent to the new proxy. This is a
correctness risk — snapshot creation could be triggered more than once.
- ReqBuilder copies arguments by value on each build() call
build() is const and constructs a new Req from copied args each iteration. If any request argument is expensive to copy (e.g., Standalone in snapCreate, or KeyRange in waitDataDistributionMetricsList), this adds overhead per retry. The old pattern constructed the
request once. This is a minor efficiency concern, not a bug.
- ExplicitVoid default parameter purpose is unclear
Both commitProxyLoadBalance and grvProxyLoadBalance have a trailing ExplicitVoid = {} parameter. This is a coroutine infrastructure marker that causes co_await Future to produce Void instead of void. It's correct but uncommented — could confuse future maintainers.
- Header placement: private header in fdbclient/ root
ProxyLoadBalance.h is placed in fdbclient/ (not fdbclient/include/fdbclient/) and accessed via a PRIVATE include directory. This is intentional (it's an implementation detail, not a public API), but it introduces a new convention — most fdbclient headers live under the include/
subdirectory. Minor style concern.
Summary
┌────────────────────────────────────────────────────────────────┬──────────────────────────────────┐
│ Issue │ Severity │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ checkSafeExclusions not fully converted (still ACTOR + wait()) │ Low (works but inconsistent) │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ race() detaches rather than cancels losing futures │ Medium — behavioral change │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ snapCreate at-most-once violation risk due to detached futures │ High — potential correctness bug │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ ReqBuilder::build() copies args each retry │ Low (performance) │
└────────────────────────────────────────────────────────────────┴──────────────────────────────────┘
The most serious concern is #3: the snapCreate at-most-once guarantee may be violated because race() detaches the old basicLoadBalance future instead of cancelling it when proxies change. The old choose/when pattern cancelled the losing branch, which was important for
AtMostOnce::True semantics.
That would always have a crash before reply type of potential. A way to solve it would be for the client to provide some kind of idempotency ID to attach to the thing being created, but I don't see anything like that here. So I don't think this is a real risk. |
A common pattern in client code is to load balance requests across proxies and retry requests if a proxy change is detected:
Converting this code to standard coroutines can result in verbose code. This PR creates
(grv|commit)ProxyLoadBalancehelper functions that encapsulate this pattern, provided by a new privateProxyLoadBalance.hheader file.There are several more places in
NativeAPI.actor.cppwhere the new helpers can be used, but they're performance-sensitive, so not updated yet in this PR.Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branchormainif this is the youngest branch)