Fix 3459 rate limit #7938

Open
Vagoasdf wants to merge 9 commits into main from fix-3459-rate-limit
Conversation

@Vagoasdf
Contributor

@Vagoasdf Vagoasdf commented Apr 15, 2026

Ticket: [ENG-3459](https://ethyca.atlassian.net/browse/ENG-3459)

Description Of Changes

The rate limiter had a hardcoded timeout_seconds=30 default that was shorter than a MINUTE-period bucket (60s). When a breach occurred more than 30s before the next bucket boundary, the limiter would raise RateLimiterTimeoutException instead of waiting, which broke minute-period rate limits on SaaS integrations.

Two root causes are fixed:

  1. Busy-wait replaced with sleep-to-boundary: instead of retrying every 100ms, the limiter now sleeps until the next bucket boundary of the longest-period breached request, plus a 50ms buffer to avoid landing exactly on the edge.
  2. Dynamic timeout replaces the hardcoded 30s: the default timeout_seconds is now min(max(period.factor) + 5, 120), which allows at least one full bucket rollover while capping at 120s so HOUR/DAY limits fail fast instead of blocking a Celery worker for hours.
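The dynamic-timeout formula above can be sketched as follows. This is a minimal illustration, not the actual diff: the enum name RateLimiterPeriod and its factor values are assumed from the description (only MAX_DEFAULT_TIMEOUT_SECONDS appears in the quoted diff context below), so the real code may differ in naming.

```python
from enum import Enum


class RateLimiterPeriod(Enum):
    """Assumed period enum; factor is the bucket length in seconds."""

    SECOND = 1
    MINUTE = 60
    HOUR = 3600
    DAY = 86400

    @property
    def factor(self) -> int:
        return self.value


MAX_DEFAULT_TIMEOUT_SECONDS = 120  # cap so HOUR/DAY limits fail fast


def default_timeout_seconds(periods: list[RateLimiterPeriod]) -> int:
    """min(max(period.factor) + 5, 120): enough time for at least one
    full bucket rollover of the longest period, capped at two minutes."""
    longest = max(p.factor for p in periods)
    return min(longest + 5, MAX_DEFAULT_TIMEOUT_SECONDS)
```

For a MINUTE-period request this yields 65s (one full rollover plus buffer), while a DAY-period request is clamped to the 120s cap rather than blocking a worker for hours.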

Code Changes

  • Added RateLimiter.seconds_until_next_bucket() to compute remaining time in the current bucket for a given request
  • Changed limit() timeout_seconds default from hardcoded 30 to a dynamic value based on the longest period in the request list, capped at 120s
  • Replaced the 100ms busy-wait sleep with a sleep-to-boundary approach
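The sleep-to-boundary computation described above can be sketched like this. The function name matches the PR's seconds_until_next_bucket, but the body is an assumed fixed-window implementation (buckets aligned to multiples of the period since the epoch); the actual helper takes a request object rather than a raw period.

```python
BOUNDARY_BUFFER_SECONDS = 0.05  # 50ms buffer so we wake just past the edge


def seconds_until_next_bucket(current_seconds: int, period_seconds: int) -> float:
    """Remaining time in the current fixed-window bucket, plus a small
    buffer to avoid landing exactly on the bucket boundary."""
    elapsed_in_bucket = current_seconds % period_seconds
    return (period_seconds - elapsed_in_bucket) + BOUNDARY_BUFFER_SECONDS


# On a breach, the limiter sleeps to the boundary of the longest-period
# breached request, e.g.:
#   sleep_seconds = max(seconds_until_next_bucket(now, p) for p in breached)
```

With a 60s period and a breach 30s into the bucket, this sleeps ~30.05s in one shot instead of polling every 100ms.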

Steps to Confirm

  1. Configure a SaaS connector with a per-minute rate limit, like SurveyMonkey
  2. Trigger enough requests to breach the limit (or edit the limit to a lower bound)
  3. Confirm the connector waits for the minute to roll over

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration label to the entry if your change includes a database migration
    • Add a high-risk label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.03%. Comparing base (b3048ee) to head (0c2c7a0).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
...des/api/service/connectors/limiter/rate_limiter.py 66.66% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7938      +/-   ##
==========================================
- Coverage   85.06%   85.03%   -0.03%     
==========================================
  Files         629      629              
  Lines       40859    40981     +122     
  Branches     4748     4764      +16     
==========================================
+ Hits        34757    34850      +93     
- Misses       5029     5050      +21     
- Partials     1073     1081       +8     


@Vagoasdf Vagoasdf marked this pull request as ready for review April 16, 2026 21:20
@Vagoasdf Vagoasdf requested a review from a team as a code owner April 16, 2026 21:20
@Vagoasdf Vagoasdf requested review from adamsachs and removed request for a team April 16, 2026 21:20

@claude claude bot left a comment


Rate Limiter: Smart Sleep-to-Boundary Fix

This is a well-targeted fix for a real bug — the old 30 s hardcoded timeout was shorter than a MINUTE bucket period (60 s), causing RateLimiterTimeoutException on connectors like Okta and SurveyMonkey when a breach occurred early in the bucket. The sleep-to-boundary approach is the right solution and avoids the previous busy-poll of 0.1 s intervals.

Strengths

  • The dynamic timeout formula correctly gives MINUTE-period callers a full 65 s window, fixing the regression.
  • The 120 s cap is the right safety valve to prevent Celery workers from sleeping until the next day bucket.
  • seconds_until_next_bucket is a clean, testable helper with good edge-case coverage (boundary values, all four period types).
  • Test coverage is thorough — the freeze_time approach avoids wall-clock waits while using real Redis for state.

Concerns

Behaviour change for SECOND-period callers (medium): The new dynamic default yields min(1 + 5, 120) = 6 s for SECOND-period requests, down from the previous 30 s. Both authenticated_client.py and okta_http_client.py call RateLimiter().limit() with no explicit timeout_seconds, so they inherit this new default. SaaS connectors configured with period: second under sustained contention will hit RateLimiterTimeoutException five times sooner. Worth a quick audit of SaaS connector YAMLs to confirm this doesn't regress anything, or explicitly documenting the intentional trade-off.

current_seconds staleness in sleep calculation (low): See inline comment at line 190. The value captured at loop start is stale by the time seconds_until_next_bucket is called (after two Redis pipeline round-trips). The practical impact is minor (sub-second over-sleep), but refreshing now = int(time.time()) immediately before the max(...) call would be cleaner.

Test assertion gaps (low): See inline comments on test_minute_period_breach_waits_for_rollover (no positive assertion) and test_dynamic_timeout_capped_for_day_limits (one-sided bound). Minor improvements that would make these stronger regression guards.

Test class placement (nit): TestRateLimiterRedisFailure and TestSecondsUntilNextBucket don't need real Redis but live in integration_tests/. No marker is needed (they should run everywhere), but a short comment explaining the intentional placement would help future readers.



if requests
else self.MIN_DEFAULT_TIMEOUT_SECONDS,
self.MAX_DEFAULT_TIMEOUT_SECONDS,
)

src/fides/api/service/connectors/limiter/rate_limiter.py:146-152

The dynamic timeout logic is sound for MINUTE+ periods, but for SECOND-period requests it silently reduces the timeout from the old hardcoded 30 s to min(1 + 5, 120) = 6 s. Any SaaS connector that configures period: second under sustained load will now hit RateLimiterTimeoutException five times faster than before.

The Okta and SaaS authenticated_client callers pass no explicit timeout_seconds, so they'll pick up this new default. Worth validating that no active SaaS connector YAML relies on the old 30 s behaviour for second-period limits (or documenting the intentional change).

sleep_seconds = max(
self.seconds_until_next_bucket(current_seconds, r)
for r in breached_requests
)

src/fides/api/service/connectors/limiter/rate_limiter.py:190-193

current_seconds was captured at the top of the loop (int(time.time())) before the Redis pipeline round-trips for both increment_usage and decrement_usage. By the time we call seconds_until_next_bucket(current_seconds, r) here, real clock time has advanced (typically a few milliseconds, but up to hundreds on a loaded Redis). The computed sleep_seconds is therefore slightly over-estimated — we'll wake up a bit past the true bucket boundary and then pay an extra increment_usage / branch iteration.

For practical purposes this is harmless (the 0.05 s buffer and the remaining cap absorb it), but snapshotting the time again here or passing the actual elapsed time to seconds_until_next_bucket would make the intent clearer:

now = int(time.time())
sleep_seconds = max(
    self.seconds_until_next_bucket(now, r)
    for r in breached_requests
)

side_effect=advancing_sleep,
):
limiter.limit(requests=[request]) # fills the single slot
limiter.limit(requests=[request]) # breach -> sleep to boundary -> succeed

tests/ops/integration_tests/limiter/test_rate_limiter.py:253-254

The test passes by not raising, which is correct, but there is no positive assertion to confirm the limiter actually waited for the bucket rollover. If, say, freeze_time failed to intercept time.sleep and the second limit() call returned immediately (e.g., due to a Redis key collision cleaning itself up), the test would still pass.

A small guard like checking that the frozen clock advanced by roughly the expected sleep duration would make the regression protection more robust:

limiter.limit(requests=[request])  # fills the single slot
before = time.time()
limiter.limit(requests=[request])  # breach -> sleep to boundary -> succeed
assert time.time() - before >= 50  # slept at least 50 s into the new bucket


# Total mocked sleep must reflect the 120s cap, not the 86405s
# uncapped value.
assert sleep_total[0] < 130

tests/ops/integration_tests/limiter/test_rate_limiter.py:301

assert sleep_total[0] < 130 allows up to 10 s of slack above the 120 s cap. That's fine for flakiness tolerance, but it wouldn't catch a regression that, say, doubled the cap to 240 s. Adding a lower bound tightens this as a regression guard:

assert 110 <= sleep_total[0] < 130  # should be ~120 s, not 86400 s



class TestRateLimiterRedisFailure:
"""Unit tests for RateLimiter.limit() when Redis is unavailable."""

tests/ops/integration_tests/limiter/test_rate_limiter.py:362

TestRateLimiterRedisFailure and TestSecondsUntilNextBucket are pure unit tests — they mock all external deps or do pure computation — but they live in integration_tests/ without a @pytest.mark.integration marker on their methods. This means they run in all test modes (unit + integration), which is actually desirable. Just worth a brief comment explaining the intentional placement so a future reader doesn't add an unnecessary marker or move the class to a unit test file.

@Vagoasdf (Contributor, Author)


Thanks!

@Vagoasdf Vagoasdf requested review from galvana and removed request for adamsachs April 17, 2026 13:28