Skip to content

Create (grv|commit)ProxyLoadBalance helper coroutines#12918

Open
tclinkenbeard-oai wants to merge 10 commits intoapple:mainfrom
tclinkenbeard-oai:dev/tclinkenbeard/resend-on-proxies-change
Open

Create (grv|commit)ProxyLoadBalance helper coroutines#12918
tclinkenbeard-oai wants to merge 10 commits intoapple:mainfrom
tclinkenbeard-oai:dev/tclinkenbeard/resend-on-proxies-change

Conversation

@tclinkenbeard-oai
Copy link
Copy Markdown
Collaborator

A common pattern in client code is to load balance requests across proxies and retry requests if a proxy change is detected:

loop choose {
  when (cx->onProxiesChanged()) {}
  when (ReplyType reply = basicLoadBalance(...)) {
    // callback
  }
}

Converting this code to standard coroutines can result in verbose code. This PR creates (grv|commit)ProxyLoadBalance helper functions that encapsulate this pattern, provided by a new private ProxyLoadBalance.h header file.

There are several more places in NativeAPI.actor.cpp where the new helpers can be used, but they're performance-sensitive, so not updated yet in this PR.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 72cd510
  • Duration 0:03:53
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 72cd510
  • Duration 0:04:18
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 72cd510
  • Duration 0:04:24
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 72cd510
  • Duration 0:04:24
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 72cd510
  • Duration 0:04:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 9b69b41
  • Duration 0:29:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 9b69b41
  • Duration 0:35:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 9b69b41
  • Duration 0:46:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 9b69b41
  • Duration 0:53:40
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 9b69b41
  • Duration 1:07:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 9b69b41
  • Duration 2:19:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Copy link
Copy Markdown
Collaborator

@gxglass gxglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle this seems pretty straightforward.

Here's what Claude said. What do you make of the non-cancellation?


PR #12918 Analysis

Goal

This PR refactors the "proxy load-balance with retry on proxy change" pattern from the legacy FoundationDB actor-compiler idiom (ACTOR + loop { choose { when(...) } }) to modern C++20 coroutines (co_await, co_return, race()).

The repeated boilerplate:
loop {
choose {
when(wait(cx->onProxiesChanged())) {}
when(Reply rep = wait(basicLoadBalance(cx->getProxies(...), &Interface::method, Request(...)))) {
// use rep
}
}
}
is replaced by a single call to a new helper:
Reply rep = co_await commitProxyLoadBalance(cx, makeReqBuilder(...), &Interface::method);

A new header fdbclient/ProxyLoadBalance.h provides two template functions (commitProxyLoadBalance / grvProxyLoadBalance) and a ReqBuilder class that stores constructor args so the request can be rebuilt on each retry iteration.

Four call sites are converted:

  1. getHealthMetricsActor (GRV proxy)
  2. waitDataDistributionMetricsList (commit proxy)
  3. snapCreate (commit proxy)
  4. checkSafeExclusions (commit proxy)

Risks and Bugs

  1. BUG: checkSafeExclusions uses wait() instead of co_await

In the diff (lines 155-158), checkSafeExclusions remains an ACTOR (it still has the ACTOR macro and uses state variables and wait() later in the function), but the new code uses wait() on the result of commitProxyLoadBalance. The problem is commitProxyLoadBalance is a coroutine
(uses co_await/co_return), and it should work fine as it returns Future — wait() works on any Future. However, the function is not fully converted like the others. It still uses ACTOR, state, and wait() in the rest of the body. This is technically valid (mixing actor-compiled
wait() with a callee that happens to be a coroutine), but it's inconsistent — every other converted call site dropped the ACTOR macro and switched to co_await. This looks like an incomplete conversion, likely unintentional.

  1. Semantic difference: non-winning futures are detached, not cancelled

The old choose/when pattern (implemented by the actor compiler) cancels the losing future when the winning branch is taken. The new race() implementation detaches non-winning futures rather than actively cancelling them. This means that when proxies change, the in-flight
basicLoadBalance call from the old iteration continues running in the background until it completes or errors, rather than being immediately cancelled. This could cause:

  • Stale requests arriving at old proxies
  • Minor resource waste (network, memory)
  • In the AtMostOnce::True case (snapCreate), if the old request actually succeeds on a stale proxy while the retry also succeeds on a new proxy, the operation could execute twice, violating the at-most-once guarantee.
  1. snapCreate at-most-once semantics risk

snapCreate passes AtMostOnce::True. With the old choose/when, if proxies changed, the in-flight request was cancelled. With the new race(), the old request is detached and could still complete successfully on the old proxy, while a new request is sent to the new proxy. This is a
correctness risk — snapshot creation could be triggered more than once.

  1. ReqBuilder copies arguments by value on each build() call

build() is const and constructs a new Req from copied args each iteration. If any request argument is expensive to copy (e.g., Standalone in snapCreate, or KeyRange in waitDataDistributionMetricsList), this adds overhead per retry. The old pattern constructed the
request once. This is a minor efficiency concern, not a bug.

  1. ExplicitVoid default parameter purpose is unclear

Both commitProxyLoadBalance and grvProxyLoadBalance have a trailing ExplicitVoid = {} parameter. This is a coroutine infrastructure marker that causes co_await Future to produce Void instead of void. It's correct but uncommented — could confuse future maintainers.

  1. Header placement: private header in fdbclient/ root

ProxyLoadBalance.h is placed in fdbclient/ (not fdbclient/include/fdbclient/) and accessed via a PRIVATE include directory. This is intentional (it's an implementation detail, not a public API), but it introduces a new convention — most fdbclient headers live under the include/
subdirectory. Minor style concern.

Summary

┌────────────────────────────────────────────────────────────────┬──────────────────────────────────┐
│ Issue │ Severity │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ checkSafeExclusions not fully converted (still ACTOR + wait()) │ Low (works but inconsistent) │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ race() detaches rather than cancels losing futures │ Medium — behavioral change │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ snapCreate at-most-once violation risk due to detached futures │ High — potential correctness bug │
├────────────────────────────────────────────────────────────────┼──────────────────────────────────┤
│ ReqBuilder::build() copies args each retry │ Low (performance) │
└────────────────────────────────────────────────────────────────┴──────────────────────────────────┘

The most serious concern is #3: the snapCreate at-most-once guarantee may be violated because race() detaches the old basicLoadBalance future instead of cancelling it when proxies change. The old choose/when pattern cancelled the losing branch, which was important for
AtMostOnce::True semantics.

@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 7, 2026

The most serious concern is #3: the snapCreate at-most-once guarantee may be violated because race() detaches the old basicLoadBalance future instead of cancelling it when proxies change. The old choose/when pattern cancelled the losing branch, which was important for AtMostOnce::True semantics.

That would always have a crash before reply type of potential. A way to solve it would be for the client to provide some kind of idempotency ID to attach to the thing being created, but I don't see anything like that here. So I don't think this is a real risk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants