OCPBUGS-81476: Fix race condition in PinnedImages GC test #30962
isabella-janssen wants to merge 1 commit into openshift:main
This commit fixes a race condition in the "All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection" test that caused nodes to get stuck in degraded state with missing MachineConfig.

The Problem:
The test was using defers in the wrong order, causing cleanup to happen like this:
1. Delete KubeletConfig
2. Delete PinnedImageSet (triggers rendered-custom deletion)
3. Unlabel node (triggers transition to worker pool)
4. Wait for worker config

When step 3 triggered the transition, the node would reboot to apply the worker config. However, because the rendered-custom config was already deleted in step 2, the node would come back up with a reference to a non-existent config on disk and get stuck in degraded state:

    currentConfig: rendered-custom-d356ed29481f2de2bb31c6443e1d29ca
    desiredConfig: rendered-worker-82faad7319f9e10715adbfd98a4b67ba
    state: Degraded
    reason: "machineconfig 'rendered-custom-d356ed29481f2de2bb31c6443e1d29ca' not found"

The Fix:
Changed cleanup order to:
1. Unlabel node (triggers transition)
2. Wait for worker config transition to complete
3. Delete KubeletConfig
4. Delete PinnedImageSet

This ensures the node successfully transitions back to the worker pool BEFORE we delete any configs, eliminating the race condition.

Changes:
- Removed defers for unlabelNode, waitTillNodeReadyWithConfig, deletePIS, and deleteKC
- Added explicit cleanup after GCPISTest completes that performs operations in the correct order
- Added logging to track cleanup progress
- Removed defer deleteKC from GCPISTest function

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: isabella-janssen
/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d7be16f0-31ca-11f1-9d47-a6fb2a91cc22-0

/jira refresh

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Note: This PR was generated with Claude.