OCPBUGS-81476: Fix race condition in PinnedImages GC test #30962

Draft
isabella-janssen wants to merge 1 commit into openshift:main from isabella-janssen:ocpbugs-81476

Conversation

@isabella-janssen
Member

@isabella-janssen isabella-janssen commented Apr 6, 2026

Note: This PR was generated with Claude.

This fixes a race condition in the "All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection" test that caused nodes to get stuck in a degraded state because of a missing MachineConfig.

The Problem:
The test registered its defers in the wrong order. Go runs deferred calls in last-in, first-out order, so cleanup happened like this:

  1. Delete KubeletConfig
  2. Delete PinnedImageSet (triggers rendered-custom deletion)
  3. Unlabel node (triggers transition to worker pool)
  4. Wait for worker config

When step 3 triggered the transition, the node would reboot to apply the worker config. However, because the rendered-custom config had already been deleted in step 2, the node would come back up with a reference to a non-existent config on disk and get stuck in a degraded state:

currentConfig: rendered-custom-d356ed29481f2de2bb31c6443e1d29ca
desiredConfig: rendered-worker-82faad7319f9e10715adbfd98a4b67ba
state: Degraded
reason: "machineconfig 'rendered-custom-d356ed29481f2de2bb31c6443e1d29ca' not found"

The Fix:
Changed cleanup order to:

  1. Unlabel node (triggers transition)
  2. Wait for worker config transition to complete
  3. Delete KubeletConfig
  4. Delete PinnedImageSet

This ensures the node successfully transitions back to the worker pool BEFORE we delete any configs, eliminating the race condition.

Changes:

  • Removed defers for unlabelNode, waitTillNodeReadyWithConfig, deletePIS, and deleteKC
  • Added explicit cleanup after GCPISTest completes that performs operations in the correct order
  • Added logging to track cleanup progress
  • Removed defer deleteKC from GCPISTest function

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Apr 6, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Apr 6, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b31fc3b5-2fdd-4080-8cb4-603d2b823d78

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 6, 2026
@isabella-janssen
Member Author

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d7be16f0-31ca-11f1-9d47-a6fb2a91cc22-0

@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Apr 6, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Apr 6, 2026

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
