Skip to content

fix(kb): disable connectors after repeated sync failures#4046

Merged
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/fix-google-docs-deferred
Apr 8, 2026
Merged

fix(kb): disable connectors after repeated sync failures#4046
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/fix-google-docs-deferred

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

@waleedlatif1 waleedlatif1 commented Apr 8, 2026

Summary

  • Connectors are automatically disabled after 10 consecutive sync failures, stopping the cron from scheduling further retries
  • UI shows an amber alert triangle on the connector icon and a warning banner with failure count and "Reconnect" button
  • Re-enabling a connector (via play button or reconnect) resets failure count and schedules immediate sync
  • Improved error logging in resolveAccessToken to surface the actual root cause of token failures instead of a generic message

Test plan

  • Verify connector with >10 consecutive failures shows "Disabled" badge and amber warning banner
  • Verify "Reconnect" button opens OAuth modal
  • Verify play/resume button re-enables a disabled connector (resets failures, schedules sync)
  • Verify sync button is disabled for disabled connectors
  • Verify cron endpoint skips disabled connectors
  • Verify manual sync still works for disabled connectors via API

waleedlatif1 and others added 3 commits April 8, 2026 09:50
The generic "Failed to obtain access token" error hid the actual root cause.
Now logs credentialId, userId, authMode, and provider to help diagnose
token refresh failures in trigger.dev.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Connectors that fail 10 times in a row are set to 'disabled' status,
stopping the cron from scheduling further syncs. The UI shows an alert
triangle with a reconnect banner. Users can re-enable via the play
button or by reconnecting their account, which resets failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… variant

Sync button should be disabled when connector is in disabled state to
guide users toward reconnecting first. Badge variant changed from red
to amber to match the warning banner styling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 8, 2026

PR Summary

Medium Risk
Changes connector sync failure handling and introduces a new disabled status that stops scheduled retries; incorrect status transitions could prevent syncing or cause unexpected re-enables. UI/UX changes are low risk but depend on backend state consistency.

Overview
Adds a new connector terminal state, disabled, which is automatically applied after MAX_CONSECUTIVE_FAILURES (10) sync failures; once disabled, nextSyncAt is cleared so the cron scheduler no longer retries it.

Updates the sync engine to treat missing OAuth refresh tokens as a hard error with better logging, and to record a clearer error message when a connector is disabled.

Extends the workspace connectors UI to surface disabled connectors (badge, alert icon, warning banner with failure count and reconnect CTA), disables manual “Sync now” for disabled connectors, and treats the play/resume action as re-enabling.

Updates the connector PATCH API so re-activating a connector resets failure counters/last error and schedules an immediate or interval-based next sync.

Reviewed by Cursor Bugbot for commit c3ea8e1. Configure here.

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 8, 2026 5:31pm

Request Review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR implements a circuit-breaker for knowledge-base connectors: after 10 consecutive sync failures the connector is automatically set to disabled, nextSyncAt is cleared so the cron scheduler skips it, and the UI shows an amber alert with a failure count. Re-enabling is done via the play button (PATCH status: active), which resets consecutiveFailures and schedules an immediate sync. The previous review comments (falsy nextSyncAt check, Reconnect button for non-OAuth connectors) are correctly addressed in this iteration.

Confidence Score: 5/5

Safe to merge — circuit-breaker logic is correct, prior review comments are addressed, and the only remaining finding is a P2 copy clarification.

All P0/P1 concerns from previous review threads are resolved. The only open finding is a P2 UX copy issue where the Reconnect banner text implies single-step re-enable while the actual flow requires a second Resume click. No data-integrity, security, or runtime correctness issues remain.

connectors-section.tsx — minor banner copy clarification needed

Vulnerabilities

No security concerns identified. The PATCH endpoint uses the existing write-access guard and only accepts 'active' | 'paused' for status, preventing callers from injecting arbitrary status values. The consecutiveFailures reset is only reachable through the authenticated write-access path.

Important Files Changed

Filename Overview
apps/sim/lib/knowledge/connectors/sync-engine.ts Adds circuit-breaker: increments consecutiveFailures on failure, disables connector and clears nextSyncAt after 10 failures, with exponential backoff up to 24 h. Success path resets counter. resolveAccessToken is now inside the try/catch so token failures count toward the circuit-breaker. Logic is correct.
apps/sim/app/api/knowledge/[id]/connectors/[connectorId]/route.ts PATCH handler resets consecutiveFailures and lastSyncError when status is set to active, and correctly uses === undefined (not falsy check) to avoid overriding an explicit null from the syncIntervalMinutes: 0 path.
apps/sim/hooks/queries/kb/connectors.ts Adds consecutiveFailures: number and 'disabled' to ConnectorData type. No logic changes to query/mutation hooks.
apps/sim/app/workspace/[workspaceId]/knowledge/[id]/components/connectors-section/connectors-section.tsx Adds disabled-state UI: amber badge, AlertTriangle icon overlay, warning banner with failure count. Reconnect button is now correctly gated on serviceId && providerId. Minor UX gap: the copy implies OAuth reconnect alone resumes syncing, but the user still needs to press Resume afterward.

Sequence Diagram

sequenceDiagram
    participant Cron
    participant SyncEngine
    participant DB
    participant UI

    Cron->>DB: Query status IN ['active','error'] AND nextSyncAt <= now
    DB-->>Cron: connectors due

    loop For each connector
        Cron->>SyncEngine: dispatchSync(connectorId)
        SyncEngine->>DB: SET status='syncing' WHERE status != 'syncing'
        alt Sync success
            SyncEngine->>DB: SET status='active', consecutiveFailures=0, nextSyncAt=now+interval
        else Sync failure (< 10)
            SyncEngine->>DB: SET status='error', consecutiveFailures+=1, nextSyncAt=now+backoff
        else Sync failure >= 10
            SyncEngine->>DB: SET status='disabled', consecutiveFailures=10, nextSyncAt=NULL
            DB-->>UI: Connector marked disabled
            UI->>UI: Show amber banner + Reconnect (OAuth) or Resume prompt (API key)
        end
    end

    UI->>UI: User clicks Resume (play button)
    UI->>DB: PATCH status='active' resets consecutiveFailures=0, nextSyncAt=now
    DB-->>Cron: Connector eligible again on next cron tick
Loading

Reviews (4): Last reviewed commit: "fix(kb): remove dead interval branch whe..." | Re-trigger Greptile

- Use `=== undefined` instead of falsy check for nextSyncAt to preserve
  explicit null (manual sync only) when syncIntervalMinutes is 0
- Gate Reconnect button on serviceId/providerId so it only renders for
  OAuth connectors; show appropriate copy for API key connectors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

… coverage

Token resolution failures (e.g. revoked OAuth tokens) were thrown before
the try/catch block, bypassing consecutiveFailures tracking entirely.
Also removes dead `if (refreshed)` guards at mid-sync refresh sites since
resolveAccessToken now always returns a string or throws.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c3ea8e1. Configure here.

When `updates.nextSyncAt === undefined`, syncIntervalMinutes was not in
the request, so `parsed.data.syncIntervalMinutes` is always undefined.
Simplify to just schedule an immediate sync — the sync engine sets the
proper nextSyncAt based on the connector's DB interval after completion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor revew

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 8, 2026

Perfect! Let me create a comprehensive review summary. All the issues flagged by the automated reviewers have been addressed:

PR Review Summary

I've completed a comprehensive review of the PR. All previously flagged issues have been successfully resolved:

✅ Fixed Issues

  1. Dead code from resolveAccessToken return type change (commit c3ea8e1)

    • Removed the obsolete if (refreshed) guards at lines 368 and 507
    • resolveAccessToken now returns Promise<string> and throws on failure instead of returning null
    • Moved resolveAccessToken inside the try/catch block so OAuth token failures properly increment consecutiveFailures
  2. Falsy check conflating null and undefined (commit 780ac55)

    • Changed if (!updates.nextSyncAt) to if (updates.nextSyncAt === undefined) in the PATCH route
    • This correctly distinguishes "field not in request" from "explicitly set to null"
  3. Dead interval branch (commit 23a44de)

    • Removed unreachable code that tried to read syncIntervalMinutes from the request when it wasn't provided
    • Simplified to just schedule an immediate sync when re-enabling
  4. Reconnect button no-op for API key connectors (commit 780ac55)

    • Button now only renders for OAuth connectors (serviceId && providerId)
    • Different copy for API key connectors directing users to the resume button

✅ Core Feature Implementation

Circuit Breaker Logic:

  • After 10 consecutive failures (MAX_CONSECUTIVE_FAILURES), connector status → 'disabled'
  • nextSyncAt set to null so cron/scheduler skips it
  • Clear error message: "Connector disabled after repeated sync failures. Please reconnect."

UI Indicators:

  • Amber alert triangle icon overlay on connector icon
  • Amber warning banner with failure count and "Reconnect" button (OAuth only)
  • "Disabled" badge with amber variant
  • Sync button disabled for disabled connectors
  • Play/Resume button re-enables and resets failure counters

Re-enabling Flow:

  • PATCH with status: 'active' resets consecutiveFailures to 0
  • Clears lastSyncError
  • Schedules immediate sync (nextSyncAt = new Date())

Code Quality

  • All TypeScript types properly updated
  • Proper error logging with context
  • Follows repository conventions (absolute imports, TSDoc, no console.log)
  • Clean separation of concerns
  • No security issues identified

Remaining Note

The Greptile review mentioned a narrow TOCTOU race where consecutiveFailures is read before acquiring the sync lock and written back with +1. This is a theoretical edge case that would only occur if a sync is in-flight at the exact moment a user presses Resume. The impact is minimal (counter increments one extra time instead of resetting), and fixing it would require more invasive changes to the locking mechanism. This is acceptable for the current implementation.

Status: Ready to merge

The PR successfully implements the circuit-breaker pattern for connector sync failures with proper UI feedback and re-enabling flow. All P0/P1 review comments have been addressed.

Open in Web Open in Cursor 

@waleedlatif1 waleedlatif1 merged commit 04c9057 into staging Apr 8, 2026
10 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/fix-google-docs-deferred branch April 8, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant