Fix 3459 rate limit #7938

Open
Vagoasdf wants to merge 9 commits into main from fix-3459-rate-limit
Conversation

@Vagoasdf
Contributor

@Vagoasdf Vagoasdf commented Apr 15, 2026

Ticket: [ENG-3459](https://ethyca.atlassian.net/browse/ENG-3459)

Description Of Changes

The rate limiter had a hardcoded timeout_seconds=30 default that was shorter than a MINUTE-period bucket (60s). When a breach occurred more than 30s before the next bucket boundary, the limiter would raise RateLimiterTimeoutException instead of waiting, which broke minute-period rate limits on SaaS integrations.

Two root causes are fixed:

  1. Busy-wait replaced with sleep-to-boundary: instead of retrying every 100ms, the limiter now sleeps until the next bucket boundary of the longest-period breached request, plus a 50ms buffer to avoid landing exactly on the edge.
  2. Dynamic timeout replaces the hardcoded 30s: the default timeout_seconds is now min(max(period.factor) + 5, 120), which allows at least one full bucket rollover while capping at 120s so HOUR/DAY limits fail fast instead of blocking a Celery worker for hours.
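The dynamic-timeout formula above can be sketched as follows. This is a minimal illustration, not the actual diff: the enum name RateLimiterPeriod and its factor values are assumed from the description (only MAX_DEFAULT_TIMEOUT_SECONDS appears in the quoted diff context below), so the real code may differ in naming.

```python
from enum import Enum


class RateLimiterPeriod(Enum):
    """Assumed period enum; factor is the bucket length in seconds."""

    SECOND = 1
    MINUTE = 60
    HOUR = 3600
    DAY = 86400

    @property
    def factor(self) -> int:
        return self.value


MAX_DEFAULT_TIMEOUT_SECONDS = 120  # cap so HOUR/DAY limits fail fast


def default_timeout_seconds(periods: list[RateLimiterPeriod]) -> int:
    """min(max(period.factor) + 5, 120): enough time for at least one
    full bucket rollover of the longest period, capped at two minutes."""
    longest = max(p.factor for p in periods)
    return min(longest + 5, MAX_DEFAULT_TIMEOUT_SECONDS)
```

For a MINUTE-period request this yields 65s (one full rollover plus buffer), while a DAY-period request is clamped to the 120s cap rather than blocking a worker for hours.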

Code Changes

  • Added RateLimiter.seconds_until_next_bucket() to compute remaining time in the current bucket for a given request
  • Changed limit() timeout_seconds default from hardcoded 30 to a dynamic value based on the longest period in the request list, capped at 120s
  • Replaced the 100ms busy-wait sleep with a sleep-to-boundary approach
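The sleep-to-boundary computation described above can be sketched like this. The function name matches the PR's seconds_until_next_bucket, but the body is an assumed fixed-window implementation (buckets aligned to multiples of the period since the epoch); the actual helper takes a request object rather than a raw period.

```python
BOUNDARY_BUFFER_SECONDS = 0.05  # 50ms buffer so we wake just past the edge


def seconds_until_next_bucket(current_seconds: int, period_seconds: int) -> float:
    """Remaining time in the current fixed-window bucket, plus a small
    buffer to avoid landing exactly on the bucket boundary."""
    elapsed_in_bucket = current_seconds % period_seconds
    return (period_seconds - elapsed_in_bucket) + BOUNDARY_BUFFER_SECONDS


# On a breach, the limiter sleeps to the boundary of the longest-period
# breached request, e.g.:
#   sleep_seconds = max(seconds_until_next_bucket(now, p) for p in breached)
```

With a 60s period and a breach 30s into the bucket, this sleeps ~30.05s in one shot instead of polling every 100ms.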

Steps to Confirm

  1. Configure a SaaS connector with a per-minute rate limit, like SurveyMonkey
  2. Trigger enough requests to breach the limit (or edit the limit to a lower bound)
  3. Confirm the connector waits for the minute to roll over

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration label to the entry if your change includes a database migration
    • Add a high-risk label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.03%. Comparing base (b3048ee) to head (0c2c7a0).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
...des/api/service/connectors/limiter/rate_limiter.py 66.66% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7938      +/-   ##
==========================================
- Coverage   85.06%   85.03%   -0.03%     
==========================================
  Files         629      629              
  Lines       40859    40981     +122     
  Branches     4748     4764      +16     
==========================================
+ Hits        34757    34850      +93     
- Misses       5029     5050      +21     
- Partials     1073     1081       +8     


@Vagoasdf Vagoasdf marked this pull request as ready for review April 16, 2026 21:20
@Vagoasdf Vagoasdf requested a review from a team as a code owner April 16, 2026 21:20
@Vagoasdf Vagoasdf requested review from adamsachs and removed request for a team April 16, 2026 21:20

@claude claude bot left a comment


Rate Limiter: Smart Sleep-to-Boundary Fix

This is a well-targeted fix for a real bug — the old 30 s hardcoded timeout was shorter than a MINUTE bucket period (60 s), causing RateLimiterTimeoutException on connectors like Okta and SurveyMonkey when a breach occurred early in the bucket. The sleep-to-boundary approach is the right solution and avoids the previous busy-poll of 0.1 s intervals.

Strengths

  • The dynamic timeout formula correctly gives MINUTE-period callers a full 65 s window, fixing the regression.
  • The 120 s cap is the right safety valve to prevent Celery workers from sleeping until the next day bucket.
  • seconds_until_next_bucket is a clean, testable helper with good edge-case coverage (boundary values, all four period types).
  • Test coverage is thorough — the freeze_time approach avoids wall-clock waits while using real Redis for state.

Concerns

Behaviour change for SECOND-period callers (medium): The new dynamic default yields min(1 + 5, 120) = 6 s for SECOND-period requests, down from the previous 30 s. Both authenticated_client.py and okta_http_client.py call RateLimiter().limit() with no explicit timeout_seconds, so they inherit this new default. SaaS connectors configured with period: second under sustained contention will hit RateLimiterTimeoutException five times sooner. Worth a quick audit of SaaS connector YAMLs to confirm this doesn't regress anything, or explicitly documenting the intentional trade-off.

current_seconds staleness in sleep calculation (low): See inline comment at line 190. The value captured at loop start is stale by the time seconds_until_next_bucket is called (after two Redis pipeline round-trips). The practical impact is minor (sub-second over-sleep), but refreshing now = int(time.time()) immediately before the max(...) call would be cleaner.

Test assertion gaps (low): See inline comments on test_minute_period_breach_waits_for_rollover (no positive assertion) and test_dynamic_timeout_capped_for_day_limits (one-sided bound). Minor improvements that would make these stronger regression guards.

Test class placement (nit): TestRateLimiterRedisFailure and TestSecondsUntilNextBucket don't need real Redis but live in integration_tests/. No marker is needed (they should run everywhere), but a short comment explaining the intentional placement would help future readers.



if requests
else self.MIN_DEFAULT_TIMEOUT_SECONDS,
self.MAX_DEFAULT_TIMEOUT_SECONDS,
)

src/fides/api/service/connectors/limiter/rate_limiter.py:146-152

The dynamic timeout logic is sound for MINUTE+ periods, but for SECOND-period requests it silently reduces the timeout from the old hardcoded 30 s to min(1 + 5, 120) = 6 s. Any SaaS connector that configures period: second under sustained load will now hit RateLimiterTimeoutException five times faster than before.

The Okta and SaaS authenticated_client callers pass no explicit timeout_seconds, so they'll pick up this new default. Worth validating that no active SaaS connector YAML relies on the old 30 s behaviour for second-period limits (or documenting the intentional change).

sleep_seconds = max(
self.seconds_until_next_bucket(current_seconds, r)
for r in breached_requests
)

src/fides/api/service/connectors/limiter/rate_limiter.py:190-193

current_seconds was captured at the top of the loop (int(time.time())) before the Redis pipeline round-trips for both increment_usage and decrement_usage. By the time we call seconds_until_next_bucket(current_seconds, r) here, real clock time has advanced (typically a few milliseconds, but up to hundreds on a loaded Redis). The computed sleep_seconds is therefore slightly over-estimated — we'll wake up a bit past the true bucket boundary and then pay an extra increment_usage / branch iteration.

For practical purposes this is harmless (the 0.05 s buffer and the remaining cap absorb it), but snapshotting the time again here or passing the actual elapsed time to seconds_until_next_bucket would make the intent clearer:

now = int(time.time())
sleep_seconds = max(
    self.seconds_until_next_bucket(now, r)
    for r in breached_requests
)

side_effect=advancing_sleep,
):
limiter.limit(requests=[request]) # fills the single slot
limiter.limit(requests=[request]) # breach -> sleep to boundary -> succeed

tests/ops/integration_tests/limiter/test_rate_limiter.py:253-254

The test passes by not raising, which is correct, but there is no positive assertion to confirm the limiter actually waited for the bucket rollover. If, say, freeze_time failed to intercept time.sleep and the second limit() call returned immediately (e.g., due to a Redis key collision cleaning itself up), the test would still pass.

A small guard like checking that the frozen clock advanced by roughly the expected sleep duration would make the regression protection more robust:

limiter.limit(requests=[request])  # fills the single slot
before = time.time()
limiter.limit(requests=[request])  # breach -> sleep to boundary -> succeed
assert time.time() - before >= 50  # slept at least 50 s into the new bucket


# Total mocked sleep must reflect the 120s cap, not the 86405s
# uncapped value.
assert sleep_total[0] < 130

tests/ops/integration_tests/limiter/test_rate_limiter.py:301

assert sleep_total[0] < 130 allows up to 10 s of slack above the 120 s cap. That's fine for flakiness tolerance, but it wouldn't catch a regression that, say, doubled the cap to 240 s. Adding a lower bound tightens this as a regression guard:

assert 110 <= sleep_total[0] < 130  # should be ~120 s, not 86400 s



class TestRateLimiterRedisFailure:
"""Unit tests for RateLimiter.limit() when Redis is unavailable."""

tests/ops/integration_tests/limiter/test_rate_limiter.py:362

TestRateLimiterRedisFailure and TestSecondsUntilNextBucket are pure unit tests — they mock all external deps or do pure computation — but they live in integration_tests/ without a @pytest.mark.integration marker on their methods. This means they run in all test modes (unit + integration), which is actually desirable. Just worth a brief comment explaining the intentional placement so a future reader doesn't add an unnecessary marker or move the class to a unit test file.

@Vagoasdf (Contributor, Author)


Thanks!

@Vagoasdf Vagoasdf requested review from galvana and removed request for adamsachs April 17, 2026 13:28