ARO-24544: migrate admin API endpoint for control plane VM resize #4733
tuxerrante wants to merge 14 commits into master from
Conversation
Pull request overview
Adds a new admin API endpoint to orchestrate control plane VM resizes directly within ARO-RP, migrating the resize workflow from the external Geneva Action into the RP so it can reuse existing validation and be unit-tested.
Changes:
- Adds a `POST /admin/.../resizecontrolplane` route guarded by `UnplannedMaintenanceSignal`.
- Implements synchronous, sequential master resize orchestration with best-effort recovery and kube metadata/label updates.
- Adds unit tests for orchestration helpers and minimal E2E validations for request rejection cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/e2e/adminapi_resize_controlplane.go | Adds E2E coverage for basic 400 validation cases (unsupported size, missing vmSize). |
| pkg/frontend/frontend.go | Wires the new /resizecontrolplane admin route with maintenance signaling middleware. |
| pkg/frontend/admin_openshiftcluster_resize_controlplane.go | New resize orchestration implementation (cordon/drain/stop/resize/start/wait/uncordon + metadata updates + recovery). |
| pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go | Adds unit tests for CPMS gating, readiness checks, drain retries, full sequence ordering, and recovery paths. |
mociarain left a comment
LGTM, but there are a bunch of Copilot suggestions. I'm going to treat these as if the PR were still a draft, i.e. when they're resolved I'll give it a more thorough review... Is this a good or bad heuristic?
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from KubeGet; return 500 on other errors instead of silently proceeding. Also return errors on JSON unmarshal failure and NestedString errors instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors for the creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep with a ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases for a non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733
Made-with: Cursor
Add a new admin API POST endpoint `resizecontrolplane` that performs sequential in-place resizing of all control plane VMs in an ARO cluster. The operation follows a safe rolling approach: for each master node (processed in reverse order starting from the highest-numbered, least critical node), it cordons, drains, stops, resizes, starts the VM, waits for the node to become Ready, uncordons, and updates the Machine object and Node labels to reflect the new VM size. Key design decisions: - Pre-flight validation via _getPreResizeControlPlaneVMsValidation checks API server health, etcd health, service principal validity, VM SKU availability, and compute quota before starting. - CPMS (ControlPlaneMachineSet) guard ensures the operator is not Active, preventing conflicts with automated machine management. - Drain uses retries with context-aware delays (moved to kubeActions interface as DrainNodeWithRetries for reusability). - Node readiness polling uses wait.PollImmediate for idiomatic k8s wait patterns. - Machine object updates use typed machinev1beta1.Machine structs instead of Unstructured for type safety. - Node instance-type labels use shared constants (nodeLabelInstanceType, nodeLabelBetaInstanceType). - Recovery logic intentionally omitted — will be added as a separate PR for easier review and revert if needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When resizeControlPlaneNode fails before the VM SKU has been changed (drain, stop, or resize step), attempt to restore the node to a schedulable state:

- If the node is still running (drain/stop failed): uncordon it.
- If the VM was stopped (resize failed): start the VM, wait for node Ready, then uncordon.
- If the node does not become Ready after the recovery start, leave it cordoned per SOP; SRE should verify health before re-enabling scheduling.

The original error is always returned with the recovery outcome appended so SREs can see both the failure reason and the cluster state.

Ref: #4723 (comment)
Made-with: Cursor
- Remove recovery code (bestEffortUncordon, bestEffortRecoverVM, resizeRecoveryError) per reviewer request, to keep it as a separate PR
- Use wait.PollImmediateUntilWithContext for waitForNodeReady
- Revert doUpdateMachineVMSize to the typed Machine object approach
- Move checkCPMSNotActive to the prevalidation pipeline
- Remove recovery-related tests, add simple failure tests

Made-with: Cursor
… sort cleanup

- Make validateAPIServerHealth a synchronous gate before the parallel checks so kube-unreachable errors are reported once instead of N times
- Add cordonNode/uncordonNode wrappers around CordonNode(bool) to make intent explicit at every call site
- Add a getControlPlaneMachines wrapper to clarify that the returned map only contains master machines (getClusterMachines already filters)
- Replace sort.Sort(sort.Reverse(sort.StringSlice(…))) with slices.SortedFunc(maps.Keys(…)) and add a comment explaining why we process in reverse order
- Add a CPMS mock to allKubeChecksHealthyMock in the pre-validation tests

Made-with: Cursor
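The `slices.SortedFunc(maps.Keys(…))` replacement mentioned in this commit can be sketched as follows (illustrative function name and map shape; requires Go 1.23+ for the iterator-based `maps.Keys`):

```go
package main

import (
	"cmp"
	"fmt"
	"maps"
	"slices"
)

// reverseSortedNames collects the machine names and sorts them in
// descending order, so masters are processed highest-numbered (least
// critical) first, as the PR description explains.
func reverseSortedNames(machines map[string]string) []string {
	return slices.SortedFunc(maps.Keys(machines), func(a, b string) int {
		return cmp.Compare(b, a) // reversed comparison → descending order
	})
}

func main() {
	m := map[string]string{"master-0": "D8s_v3", "master-1": "D8s_v3", "master-2": "D8s_v3"}
	fmt.Println(reverseSortedNames(m)) // [master-2 master-1 master-0]
}
```

Compared with `sort.Sort(sort.Reverse(sort.StringSlice(keys)))`, this avoids the intermediate interface wrappers and keeps the reverse-ordering intent in one comparison function.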
Force-pushed fece767 to 31e9d5e
Pull request overview
Copilot reviewed 40 out of 44 changed files in this pull request and generated 11 comments.
Comments suppressed due to low confidence (1)
pkg/util/azureclient/azuresdk/common/options.go:51
- shouldRetry() no longer reads the response body correctly: `var b []byte; resp.Body.Read(b)` reads 0 bytes, so the retry checks for AADSTS/AuthorizationFailed will never match. It also doesn't restore/close the body, which can break downstream SDK error handling and leak connections. Please revert to fully reading the body (e.g., io.ReadAll), close the original body, and restore resp.Body so it can be read again; keep/restore the unit tests that validated this behavior.

```go
// Check if the body contains the certain strings that can be retried.
var b []byte
_, err = resp.Body.Read(b)
if err != nil {
	return true
}
body := string(b)
return strings.Contains(body, ErrCodeInvalidClientSecretProvided) ||
	strings.Contains(body, ErrCodeMissingRequiredParameters) ||
	strings.Contains(body, AuthorizationFailed)
```
…ature

Remove all changes unrelated to the control plane VM resize endpoint that were accidentally included in this PR: ACR token SDK revert (acrtoken, armcontainerregistry, mgmt/containerregistry), deploy generator script minification changes, changefeed test modifications, version const downgrade, options.go retry regression, and dependency changes. The PR now only contains the resize-cp feature files.

Made-with: Cursor
Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier. Made-with: Cursor
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Validate deallocateVM strictly to prevent silent behavior changes from invalid values and refactor control-plane preflight validation to reuse already-fetched docs and clients in the resize handler. Add a happy-path resize handler test to cover the shared prevalidation helper end-to-end. Made-with: Cursor
Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review. Made-with: Cursor
Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics. Made-with: Cursor
Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior. Made-with: Cursor
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior. Made-with: Cursor
```go
machine := machines[name]
if machine.size == desiredVMSize {
	log.Infof("%s is already running %s, skipping", name, desiredVMSize)
	continue
}
```
I think this has the potential to move forward to the next machine resize even if the first machine's node is NotReady.
If the resize GA is retried, the machine can have status Running even if the node is not Ready. A good check would be to also confirm node readiness as well as cordon status. We don't want to resize the next master until the first one is fully working.
Great point. I have added a pre-loop gate before any skip/resize decision: ensureControlPlaneNodesReadyAndSchedulable now verifies every control-plane node is Ready and schedulable.
If any node is NotReady or unschedulable, the operation fails with 409 and does not move to the next master. This closes the scenario where a node already at target size could be skipped while unhealthy. We still keep waitForNodeReady after each VM restart for per-node post-resize stabilization.
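The pre-loop gate described in this reply can be sketched as a pure check; `nodeState` and `ensureReadyAndSchedulable` are stand-ins for the real Node objects and `ensureControlPlaneNodesReadyAndSchedulable`:

```go
package main

import "fmt"

// nodeState is a minimal stand-in for the fields the gate inspects.
type nodeState struct {
	ready         bool // Node condition Ready == True
	unschedulable bool // node is cordoned
}

// ensureReadyAndSchedulable fails if ANY control plane node is NotReady
// or cordoned; the handler maps this failure to HTTP 409 so the resize
// loop never reaches the skip/resize decision for the next master.
func ensureReadyAndSchedulable(nodes map[string]nodeState) error {
	for name, n := range nodes {
		if !n.ready {
			return fmt.Errorf("control plane node %s is not Ready", name)
		}
		if n.unschedulable {
			return fmt.Errorf("control plane node %s is cordoned", name)
		}
	}
	return nil
}

func main() {
	nodes := map[string]nodeState{
		"master-0": {ready: true},
		"master-1": {ready: true},
		// e.g. left cordoned by a previously interrupted resize:
		"master-2": {ready: true, unschedulable: true},
	}
	fmt.Println(ensureReadyAndSchedulable(nodes)) // non-nil: master-2 is cordoned
}
```

This is what closes the retry scenario from the comment above: a machine already at the target size but with an unhealthy node now fails the gate instead of being skipped.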
Prevent the resize loop from continuing when any control plane node is NotReady or still unschedulable, even if a machine already matches the desired VM size. This closes the skip-path safety gap raised in review and adds test coverage for the new pre-loop guard. Made-with: Cursor
Align prevalidation flow with related resize work by gating kube-apiserver pod checks behind the ClusterOperator health check, reducing redundant failures and future merge conflicts. Extend unit coverage for pod-level API server validation and update resize handler mocks to reflect the new preflight behavior. Made-with: Cursor
```go
remainingRetries := drainMaxAttempts - attempt - 1
k.log.Infof("Drain attempt %d failed for %s: %v. Retrying %d more times.", attempt+1, nodeName, err, remainingRetries)
```
Perhaps use k.log throughout, and remove the "log" import?
Good catch. This stdlib log usage is pre-existing in adminactions and not introduced by this PR.
To keep this already-large PR scoped to control-plane resize behavior, I’d prefer to track log consistency cleanup (k.log everywhere / remove stdlib log) in a small follow-up PR.
```go
const (
	nodeReadyPollTimeout = 30 * time.Minute
```
This can lead to up to 90 minutes with three masters.
AFAIK Geneva Actions doesn't enforce any timeout limit; what are you suggesting here?
I'm not sure. We need data from real resizes I suppose, and come up with a realistic timeout.
```go
conditions, found, err := unstructured.NestedSlice(node.Object, "status", "conditions")
if err != nil || !found {
	return false, false, nil
}
```
If an error occurs, nil will be returned instead of err.
Maybe split the two conditions.
```go
if len(machines) == 0 {
	return fmt.Errorf("no control plane machines found")
}
```
Maybe use an api.NewCloudError() here? I think an HTTP/500 error is too generic here, maybe use an http.StatusConflict?
Which issue this PR addresses:
Fixes ARO-24544
What this PR does / why we need it:
Migrates the control plane VM resize orchestration from the external C# Geneva Action (`ResizeControlPlaneVMsOperation.cs`) into the ARO-RP as a native admin API endpoint. This eliminates split-brain logic between the Geneva Action and the RP, gives the operation access to the RP's existing validation infrastructure, and makes the orchestration testable with standard Go tooling.

The new `POST /admin/.../resizecontrolplane?vmSize=<sku>&deallocateVM=<bool>` endpoint performs pre-flight health checks and then sequentially resizes each master node through the full lifecycle: cordon, drain (with retry), stop VM, resize VM, start VM, wait for Node Ready, uncordon, update Machine object metadata, and update Node instance-type labels.

Design choices and deviations from the original C# implementation
- The route follows the pattern of the existing `stopvm`, `startvm`, `redeployvm` admin actions. The `UnplannedMaintenanceSignal` middleware signals maintenance state.
- `deallocateVM` defaults to `true` because Azure requires deallocation for most cross-family resizes. Callers can pass `deallocateVM=false` explicitly.
- If CPMS is Active, the operation is rejected with `409 Conflict`. CPMS-aware resize (patching the CPMS spec) is planned for a follow-up.
- Reverse ordering (`master-2` → `master-1` → `master-0`) minimises etcd leader elections, matching the C# behaviour.
- The `_getPreResizeControlPlaneVMsValidation` endpoint checks API server health (synchronous gate), etcd health, service principal validity, VM SKU/quota, and CPMS state.
- Reuses `validateAdminMasterVMSize`, `getClusterMachines` (via the `getControlPlaneMachines` wrapper), and health checks from the pre-resize validation endpoint.
- Machine object updates use typed `machinev1beta1.Machine` objects for safety.
- `cordonNode()` and `uncordonNode()` wrap `CordonNode(bool)` to make intent visible at every call site.

Test plan for issue:
Unit tests (`pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go`):

- `TestCheckCPMSNotActive`: CPMS not found, Inactive, Active (blocked), empty state, non-NotFound error (fails closed), invalid JSON (fails closed)
- `TestIsNodeReady`: node ready, not ready, not found
- `TestResizeControlPlane`: all-nodes-already-at-size no-op, single-node full sequence (verifies exact call ordering via `gomock.InOrder`), no machines found, drain/stop/resize failure cases
- `TestUpdateMachineVMSize`: success, retry on conflict
- `TestUpdateNodeInstanceTypeLabels`: success, retry on conflict
- `TestAdminResizeControlPlane`: full HTTP integration: invalid VM size (400), cluster not found (404), subscription not found (400)
E2E tests (`test/e2e/adminapi_resize_controlplane.go`):

- unsupported `vmSize` (400)
- missing `vmSize` parameter (400)
Prerequisites: Docker (or Podman on Linux),
envfile at repo root (fromenv.example),secrets/directory with valid credentials.vmSize400 InvalidParametervmSize400 InvalidParametervmSize, nonexistent cluster404 ResourceNotFoundIs there any documentation that needs to be updated for this PR?
The admin API endpoint documentation in `docs/deploy-development-rp.md` should be updated to include an example `curl` command for the new `/resizecontrolplane` endpoint. This can be done in a follow-up.
The `UnplannedMaintenanceSignal` middleware, CPMS-active check, and pre-flight health checks (apiserver gate + etcd + SP + CPMS) prevent conflicting or unsafe operations.