Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #7938      +/-   ##
    ==========================================
    - Coverage   85.06%   85.03%   -0.03%
    ==========================================
      Files         629      629
      Lines       40859    40981     +122
      Branches     4748     4764      +16
    ==========================================
    + Hits        34757    34850      +93
    - Misses       5029     5050      +21
    - Partials     1073     1081       +8
Rate Limiter: Smart Sleep-to-Boundary Fix
This is a well-targeted fix for a real bug — the old 30 s hardcoded timeout was shorter than a MINUTE bucket period (60 s), causing RateLimiterTimeoutException on connectors like Okta and SurveyMonkey when a breach occurred early in the bucket. The sleep-to-boundary approach is the right solution and avoids the previous busy-poll of 0.1 s intervals.
Strengths
- The dynamic timeout formula correctly gives MINUTE-period callers a full 65 s window, fixing the regression.
- The 120 s cap is the right safety valve to prevent Celery workers from sleeping until the next day bucket.
- `seconds_until_next_bucket` is a clean, testable helper with good edge-case coverage (boundary values, all four period types).
- Test coverage is thorough — the freeze_time approach avoids wall-clock waits while using real Redis for state.
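The bucket-remainder math behind the helper can be sketched as follows. This is a hypothetical reconstruction based on the review's description of fixed-window buckets; the `Period` enum and function signature are illustrative, not the actual Fides code.

```python
from enum import Enum

class Period(Enum):
    # Assumed period lengths in seconds for the four period types.
    SECOND = 1
    MINUTE = 60
    HOUR = 3600
    DAY = 86400

def seconds_until_next_bucket(current_seconds: int, period_seconds: int) -> int:
    """Seconds remaining until the current fixed-window bucket rolls over."""
    return period_seconds - (current_seconds % period_seconds)

# At an exact boundary a full period remains; one second in, one less remains.
assert seconds_until_next_bucket(120, Period.MINUTE.value) == 60
assert seconds_until_next_bucket(121, Period.MINUTE.value) == 59
assert seconds_until_next_bucket(86399, Period.DAY.value) == 1
```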
Concerns
Behaviour change for SECOND-period callers (medium): The new dynamic default yields min(1 + 5, 120) = 6 s for SECOND-period requests, down from the previous 30 s. Both authenticated_client.py and okta_http_client.py call RateLimiter().limit() with no explicit timeout_seconds, so they inherit this new default. SaaS connectors configured with period: second under sustained contention will hit RateLimiterTimeoutException five times sooner. Worth a quick audit of SaaS connector YAMLs to confirm this doesn't regress anything, or explicitly documenting the intentional trade-off.
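To make the trade-off concrete, the new default can be tabulated as a function of the longest period in the request list. This is an illustrative sketch of the formula the review describes, `min(longest_period_seconds + 5, 120)`, not the actual implementation:

```python
def dynamic_default_timeout(longest_period_seconds: int) -> int:
    # Sketch of the dynamic default: one full bucket rollover plus a
    # 5 s margin, capped at 120 s.
    return min(longest_period_seconds + 5, 120)

assert dynamic_default_timeout(1) == 6        # SECOND: down from the old 30 s
assert dynamic_default_timeout(60) == 65      # MINUTE: covers a full rollover
assert dynamic_default_timeout(3600) == 120   # HOUR: capped
assert dynamic_default_timeout(86400) == 120  # DAY: capped
```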
current_seconds staleness in sleep calculation (low): See inline comment at line 190. The value captured at loop start is stale by the time seconds_until_next_bucket is called (after two Redis pipeline round-trips). The practical impact is minor (sub-second over-sleep), but refreshing now = int(time.time()) immediately before the max(...) call would be cleaner.
Test assertion gaps (low): See inline comments on test_minute_period_breach_waits_for_rollover (no positive assertion) and test_dynamic_timeout_capped_for_day_limits (one-sided bound). Minor improvements that would make these stronger regression guards.
Test class placement (nit): TestRateLimiterRedisFailure and TestSecondsUntilNextBucket don't need real Redis but live in integration_tests/. No marker is needed (they should run everywhere), but a short comment explaining the intentional placement would help future readers.
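For readers skimming the diff fragments below, the overall breach-handling flow the review describes can be sketched roughly as follows. All names and structure here are assumptions drawn from the review, not the actual `rate_limiter.py` implementation:

```python
import time

BUFFER_SECONDS = 0.05  # small cushion so we wake just past the boundary

def seconds_until_next_bucket(now: int, period_seconds: int) -> int:
    # Remaining time in the current fixed-window bucket.
    return period_seconds - (now % period_seconds)

def sleep_past_breach(breached_period_seconds: list) -> float:
    # Refresh the clock immediately before computing, so Redis round-trip
    # latency earlier in the loop cannot make the estimate stale.
    now = int(time.time())
    sleep_seconds = max(
        seconds_until_next_bucket(now, p) for p in breached_period_seconds
    ) + BUFFER_SECONDS
    time.sleep(sleep_seconds)
    return sleep_seconds
```

The key property is that the sleep targets the earliest moment at which every breached bucket has rolled over, rather than polling in fixed 0.1 s intervals.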
    if requests
    else self.MIN_DEFAULT_TIMEOUT_SECONDS,
    self.MAX_DEFAULT_TIMEOUT_SECONDS,
    )
src/fides/api/service/connectors/limiter/rate_limiter.py:146-152
The dynamic timeout logic is sound for MINUTE+ periods, but for SECOND-period requests it silently reduces the timeout from the old hardcoded 30 s to min(1 + 5, 120) = 6 s. Any SaaS connector that configures period: second under sustained load will now hit RateLimiterTimeoutException five times faster than before.
The Okta and SaaS authenticated_client callers pass no explicit timeout_seconds, so they'll pick up this new default. Worth validating that no active SaaS connector YAML relies on the old 30 s behaviour for second-period limits (or documenting the intentional change).
    sleep_seconds = max(
        self.seconds_until_next_bucket(current_seconds, r)
        for r in breached_requests
    )
src/fides/api/service/connectors/limiter/rate_limiter.py:190-193
current_seconds was captured at the top of the loop (int(time.time())) before the Redis pipeline round-trips for both increment_usage and decrement_usage. By the time we call seconds_until_next_bucket(current_seconds, r) here, real clock time has advanced (typically a few milliseconds, but up to hundreds on a loaded Redis). The computed sleep_seconds is therefore slightly over-estimated — we'll wake up a bit past the true bucket boundary and then pay an extra increment_usage / branch iteration.
For practical purposes this is harmless (the 0.05 s buffer and the remaining cap absorb it), but snapshotting the time again here or passing the actual elapsed time to seconds_until_next_bucket would make the intent clearer:
    now = int(time.time())
    sleep_seconds = max(
        self.seconds_until_next_bucket(now, r)
        for r in breached_requests
    )

    side_effect=advancing_sleep,
    ):
        limiter.limit(requests=[request])  # fills the single slot
        limiter.limit(requests=[request])  # breach -> sleep to boundary -> succeed
tests/ops/integration_tests/limiter/test_rate_limiter.py:253-254
The test passes by not raising, which is correct, but there is no positive assertion to confirm the limiter actually waited for the bucket rollover. If, say, freeze_time failed to intercept time.sleep and the second limit() call returned immediately (e.g., due to a Redis key collision cleaning itself up), the test would still pass.
A small guard like checking that the frozen clock advanced by roughly the expected sleep duration would make the regression protection more robust:
    limiter.limit(requests=[request])  # fills the single slot
    before = time.time()
    limiter.limit(requests=[request])  # breach -> sleep to boundary -> succeed
    assert time.time() - before >= 50  # slept at least 50 s into the new bucket
    # Total mocked sleep must reflect the 120s cap, not the 86405s
    # uncapped value.
    assert sleep_total[0] < 130
tests/ops/integration_tests/limiter/test_rate_limiter.py:301
assert sleep_total[0] < 130 allows up to 10 s of slack above the 120 s cap. That's fine for flakiness tolerance, but it wouldn't catch a regression that, say, doubled the cap to 240 s. Adding a lower bound tightens this as a regression guard:
    assert 110 <= sleep_total[0] < 130  # should be ~120 s, not 86400 s
    class TestRateLimiterRedisFailure:
        """Unit tests for RateLimiter.limit() when Redis is unavailable."""
tests/ops/integration_tests/limiter/test_rate_limiter.py:362
TestRateLimiterRedisFailure and TestSecondsUntilNextBucket are pure unit tests — they mock all external deps or do pure computation — but they live in integration_tests/ without a @pytest.mark.integration marker on their methods. This means they run in all test modes (unit + integration), which is actually desirable. Just worth a brief comment explaining the intentional placement so a future reader doesn't add an unnecessary marker or move the class to a unit test file.
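The freeze_time/advancing-sleep pattern praised above can be illustrated with a minimal self-contained sketch. The `FakeClock` helper and names here are illustrative only, not the test file's actual fixtures:

```python
from unittest import mock
import time

class FakeClock:
    """Advance a fake clock on sleep() instead of blocking the test."""
    def __init__(self, start: float = 1_000_000.0) -> None:
        self.now = start
    def time(self) -> float:
        return self.now
    def sleep(self, seconds: float) -> None:
        # Jump the clock forward rather than waiting in real time.
        self.now += seconds

clock = FakeClock()
with mock.patch("time.time", side_effect=clock.time), \
     mock.patch("time.sleep", side_effect=clock.sleep):
    before = time.time()
    time.sleep(55)          # returns instantly; only the fake clock advances
    elapsed = time.time() - before
    assert elapsed == 55    # the positive assertion the review recommends
```

This is also why the review asks for a positive assertion on elapsed frozen time: without it, a test that never actually slept would still pass.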
Ticket: [ENG-3459](https://ethyca.atlassian.net/browse/ENG-3459)
Description Of Changes
The rate limiter had a hardcoded `timeout_seconds=30` default that was shorter than a MINUTE-period bucket (60s). When a breach occurred more than 30s before the next bucket boundary, the limiter would raise `RateLimiterTimeoutException` instead of waiting, affecting minute rate limits on SaaS integrations. Two root causes are fixed:

- On a breach, the limiter now sleeps to the next bucket boundary instead of busy-polling at 0.1s intervals
- `timeout_seconds` is now `min(max(period.factor) + 5, 120)` — enough time for at least one full bucket rollover, capped at 120s so HOUR/DAY limits fail fast instead of blocking a Celery worker for hours

Code Changes
- Added `RateLimiter.seconds_until_next_bucket()` to compute remaining time in the current bucket for a given request
- Changed the `limit()` `timeout_seconds` default from hardcoded `30` to a dynamic value based on the longest period in the request list, capped at 120s

Steps to Confirm
Pre-Merge Checklist
Pre-Merge Checklist
- `CHANGELOG.md` updated and pointing to `main`
- `downgrade()` migration is correct and works