
ARO-24544: migrate admin API endpoint for control plane VM resize #4733

Open
tuxerrante wants to merge 14 commits into master from resize-cp-subagent

Conversation

Collaborator

@tuxerrante tuxerrante commented Mar 31, 2026

Which issue this PR addresses:

Fixes ARO-24544

What this PR does / why we need it:

Migrates the control plane VM resize orchestration from the external C# Geneva Action (ResizeControlPlaneVMsOperation.cs) into the ARO-RP as a native admin API endpoint. This eliminates split-brain logic between the Geneva Action and the RP, gives the operation access to the RP's existing validation infrastructure, and makes the orchestration testable with standard Go tooling.

The new POST /admin/.../resizecontrolplane?vmSize=<sku>&deallocateVM=<bool> endpoint performs pre-flight health checks and then sequentially resizes each master node through the full lifecycle: cordon, drain (with retry), stop VM, resize VM, start VM, wait for Node Ready, uncordon, update Machine object metadata, and update Node instance-type labels.

Design choices and deviations from the original C# implementation

  • Synchronous admin operation. Matches the pattern used by stopvm, startvm, redeployvm. The UnplannedMaintenanceSignal middleware signals maintenance state.
  • deallocateVM defaults to true because Azure requires deallocation for most cross-family resizes. Callers can pass deallocateVM=false explicitly.
  • CPMS-active blocks the operation with 409 Conflict. CPMS-aware resize (patching the CPMS spec) is planned for a follow-up.
  • Reverse name-order processing (master-2 → master-1 → master-0) minimises etcd leader elections, matching the C# behaviour.
  • Pre-flight health validation via the shared _getPreResizeControlPlaneVMsValidation endpoint, which checks API server health (synchronous gate), etcd health, service principal validity, VM SKU/quota, and CPMS state.
  • API server health as a synchronous gate: if the kube-apiserver is unreachable, we fail immediately instead of spawning parallel kube-based checks that would all fail with connection errors.
  • Reuse of existing helpers: validateAdminMasterVMSize, getClusterMachines (via getControlPlaneMachines wrapper), and health checks from the pre-resize validation endpoint.
  • Machine/Node metadata updates with retry (up to 3 attempts, ctx-aware delays). The Machine update uses typed machinev1beta1.Machine objects for safety.
  • Explicit cordon/uncordon wrappers: cordonNode() and uncordonNode() wrap CordonNode(bool) to make intent visible at every call site.

Test plan for issue:

  • Unit tests (pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go):

    • TestCheckCPMSNotActive — CPMS not found, Inactive, Active (blocked), empty state, non-NotFound error (fails closed), invalid JSON (fails closed)
    • TestIsNodeReady — node ready, not ready, not found
    • TestResizeControlPlane — all-nodes-already-at-size no-op, single-node full sequence (verifies exact call ordering via gomock.InOrder), no machines found, drain/stop/resize failure cases
    • TestUpdateMachineVMSize — success, retry on conflict
    • TestUpdateNodeInstanceTypeLabels — success, retry on conflict
    • TestAdminResizeControlPlane — full HTTP integration: invalid VM size (400), cluster not found (404), subscription not found (400)
  • E2E tests (test/e2e/adminapi_resize_controlplane.go):

    • Rejects unsupported VM size (400)
    • Rejects missing vmSize parameter (400)
    • (Full resize E2E requires a live cluster and is validated manually in dev environments)
  • Local integration testing against containerized RP:

    Prerequisites: Docker (or Podman on Linux), env file at repo root (from env.example), secrets/ directory with valid credentials.

    # Build the dev container image (first time or after Dockerfile changes)
    make dev-env-build
    
    # Start the containerized RP (first run compiles from source, ~2-5 min)
    make dev-env-start
    
    # Wait for the RP to become healthy
    until curl -ksSf https://localhost:8443/healthz/ready 2>/dev/null; do sleep 5; done
    
    # Test: invalid vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?vmSize=Standard_Invalid_Fake&deallocateVM=true"
    
    # Test: missing vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?deallocateVM=true"
    
    # Test: valid vmSize, nonexistent cluster → 404
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/nonexistent/resizecontrolplane?vmSize=Standard_D8s_v3&deallocateVM=true"
    
    # Stop the containerized RP
    make dev-env-stop
    Test case                           Expected
    Invalid vmSize                      400 InvalidParameter
    Missing vmSize                      400 InvalidParameter
    Valid vmSize, nonexistent cluster   404 ResourceNotFound

Is there any documentation that needs to be updated for this PR?

The admin API endpoint documentation in docs/deploy-development-rp.md should be updated to include an example curl command for the new /resizecontrolplane endpoint. This can be done in a follow-up.

How do you know this will function as expected in production?

  • The operation is 1:1 with the proven C# Geneva Action sequence — same steps, same order.
  • Structured logging at every step provides full observability.
  • Failure modes are handled explicitly: drain retries (3×2s), kube update retries (3×1s), node-ready polling (30min×5s), context cancellation respected throughout.
  • The UnplannedMaintenanceSignal middleware, CPMS-active check, and pre-flight health checks (apiserver gate + etcd + SP + CPMS) prevent conflicting or unsafe operations.
  • Best-effort node recovery is planned as a separate follow-up PR for easier review and revertability.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new admin API endpoint to orchestrate control plane VM resizes directly within ARO-RP, migrating the resize workflow from the external Geneva Action into the RP so it can reuse existing validation and be unit-tested.

Changes:

  • Adds POST /admin/.../resizecontrolplane route guarded by UnplannedMaintenanceSignal.
  • Implements synchronous, sequential master resize orchestration with best-effort recovery and kube metadata/label updates.
  • Adds unit tests for orchestration helpers and minimal E2E validations for request rejection cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Changed files:

  • test/e2e/adminapi_resize_controlplane.go: adds E2E coverage for basic 400 validation cases (unsupported size, missing vmSize).
  • pkg/frontend/frontend.go: wires the new /resizecontrolplane admin route with maintenance signaling middleware.
  • pkg/frontend/admin_openshiftcluster_resize_controlplane.go: new resize orchestration implementation (cordon/drain/stop/resize/start/wait/uncordon + metadata updates + recovery).
  • pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go: adds unit tests for CPMS gating, readiness checks, drain retries, full sequence ordering, and recovery paths.


Collaborator

@mociarain mociarain left a comment


LGTM, but there are a bunch of Copilot suggestions. I'm going to treat these as if the PR were still a draft, i.e. once they're resolved I'll give it a more thorough review... Is this a good or bad heuristic?

tuxerrante added a commit that referenced this pull request Mar 31, 2026
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from
  KubeGet; return 500 on other errors instead of silently proceeding.
  Also return errors on JSON unmarshal failure and NestedString errors
  instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors
  for creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep
  with ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases
  for non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733

Made-with: Cursor
tuxerrante and others added 5 commits April 1, 2026 16:30
Add a new admin API POST endpoint `resizecontrolplane` that performs
sequential in-place resizing of all control plane VMs in an ARO cluster.

The operation follows a safe rolling approach: for each master node
(processed in reverse order starting from the highest-numbered, least
critical node), it cordons, drains, stops, resizes, starts the VM,
waits for the node to become Ready, uncordons, and updates the Machine
object and Node labels to reflect the new VM size.

Key design decisions:
- Pre-flight validation via _getPreResizeControlPlaneVMsValidation
  checks API server health, etcd health, service principal validity,
  VM SKU availability, and compute quota before starting.
- CPMS (ControlPlaneMachineSet) guard ensures the operator is not
  Active, preventing conflicts with automated machine management.
- Drain uses retries with context-aware delays (moved to kubeActions
  interface as DrainNodeWithRetries for reusability).
- Node readiness polling uses wait.PollImmediate for idiomatic k8s
  wait patterns.
- Machine object updates use typed machinev1beta1.Machine structs
  instead of Unstructured for type safety.
- Node instance-type labels use shared constants
  (nodeLabelInstanceType, nodeLabelBetaInstanceType).
- Recovery logic intentionally omitted — will be added as a separate
  PR for easier review and revert if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When resizeControlPlaneNode fails before the VM SKU has been changed
(drain, stop, or resize step), attempt to restore the node to a
schedulable state:

- If the node is still running (drain/stop failed): uncordon it.
- If the VM was stopped (resize failed): start VM, wait for node
  Ready, then uncordon.
- If the node does not become Ready after recovery start, leave it
  cordoned per SOP — SRE should verify health before re-enabling
  scheduling.

The original error is always returned with recovery outcome appended
so SREs can see both the failure reason and the cluster state.

Ref: #4723 (comment)
Made-with: Cursor
- Remove recovery code (bestEffortUncordon, bestEffortRecoverVM,
  resizeRecoveryError) per reviewer request to keep as separate PR
- Use wait.PollImmediateUntilWithContext for waitForNodeReady
- Revert doUpdateMachineVMSize to typed Machine object approach
- Move checkCPMSNotActive to prevalidation pipeline
- Remove recovery-related tests, add simple failure tests

Made-with: Cursor
… sort cleanup

- Make validateAPIServerHealth a synchronous gate before parallel checks
  so kube-unreachable errors are reported once instead of N times
- Add cordonNode/uncordonNode wrappers around CordonNode(bool) to make
  intent explicit at every call site
- Add getControlPlaneMachines wrapper to clarify that the returned map
  only contains master machines (getClusterMachines already filters)
- Replace sort.Sort(sort.Reverse(sort.StringSlice(…))) with
  slices.SortedFunc(maps.Keys(…)) and add comment explaining why we
  process in reverse order
- Add CPMS mock to allKubeChecksHealthyMock in pre-validation tests

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 1, 2026 15:36
@tuxerrante tuxerrante force-pushed the resize-cp-subagent branch from fece767 to 31e9d5e Compare April 1, 2026 15:36
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 40 out of 44 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

pkg/util/azureclient/azuresdk/common/options.go:51

  • shouldRetry() no longer reads the response body correctly: var b []byte; resp.Body.Read(b) reads 0 bytes, so the retry checks for AADSTS/AuthorizationFailed will never match. It also doesn’t restore/close the body, which can break downstream SDK error handling and leak connections. Please revert to fully reading the body (e.g., io.ReadAll), close the original body, and restore resp.Body so it can be read again; keep/restore the unit tests that validated this behavior.
	// Check if the body contains the certain strings that can be retried.
	var b []byte
	_, err = resp.Body.Read(b)
	if err != nil {
		return true
	}
	body := string(b)
	return strings.Contains(body, ErrCodeInvalidClientSecretProvided) ||
		strings.Contains(body, ErrCodeMissingRequiredParameters) ||
		strings.Contains(body, AuthorizationFailed)


…ature

Remove all changes unrelated to the control plane VM resize endpoint
that were accidentally included in this PR: ACR token SDK revert
(acrtoken, armcontainerregistry, mgmt/containerregistry), deploy
generator script minification changes, changefeed test modifications,
version const downgrade, options.go retry regression, and dependency
changes. The PR now only contains the resize-cp feature files.

Made-with: Cursor
Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 2, 2026 09:01
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.



Validate deallocateVM strictly to prevent silent behavior changes from invalid values and refactor control-plane preflight validation to reuse already-fetched docs and clients in the resize handler. Add a happy-path resize handler test to cover the shared prevalidation helper end-to-end.

Made-with: Cursor
Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review.

Made-with: Cursor
Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics.

Made-with: Cursor
Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 2, 2026 15:09
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior.

Made-with: Cursor
Comment on lines +136 to +140
machine := machines[name]
if machine.size == desiredVMSize {
log.Infof("%s is already running %s, skipping", name, desiredVMSize)
continue
}
Collaborator


I think this has the potential to move on to the next machine resize even if the first machine's node is NotReady.
If the resize GA is retried, the machine can have status Running even though the node is not ready. A good check would be to also confirm node readiness and cordon status. We don't want to resize the next master until the first one is fully working.

Collaborator Author


Great point. I have added a pre-loop gate before any skip/resize decision: ensureControlPlaneNodesReadyAndSchedulable now verifies every control-plane node is Ready and schedulable.

If any node is NotReady or unschedulable, the operation fails with 409 and does not move to the next master. This closes the scenario where a node already at target size could be skipped while unhealthy. We still keep waitForNodeReady after each VM restart for per-node post-resize stabilization.
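The pre-loop gate described here can be sketched as follows. This is a simplified stand-in: the `node` struct replaces the real Node objects, and the actual implementation derives readiness from Node conditions and schedulability from `spec.unschedulable`.

```go
package main

import (
	"fmt"
)

// node is a minimal stand-in for the fields the gate inspects.
type node struct {
	name          string
	ready         bool
	unschedulable bool
}

// ensureControlPlaneNodesReadyAndSchedulable refuses to start (or continue)
// a resize unless every control plane node is Ready and schedulable, so a
// retried operation cannot skip past an unhealthy node that already matches
// the target VM size.
func ensureControlPlaneNodesReadyAndSchedulable(nodes []node) error {
	for _, n := range nodes {
		if !n.ready {
			return fmt.Errorf("node %s is not Ready; refusing to resize", n.name)
		}
		if n.unschedulable {
			return fmt.Errorf("node %s is cordoned; refusing to resize", n.name)
		}
	}
	return nil
}

func main() {
	nodes := []node{
		{name: "master-0", ready: true},
		{name: "master-1", ready: true, unschedulable: true},
	}
	fmt.Println(ensureControlPlaneNodesReadyAndSchedulable(nodes))
}
```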

Prevent the resize loop from continuing when any control plane node is NotReady or still unschedulable, even if a machine already matches the desired VM size. This closes the skip-path safety gap raised in review and adds test coverage for the new pre-loop guard.

Made-with: Cursor
Align prevalidation flow with related resize work by gating kube-apiserver pod checks behind the ClusterOperator health check, reducing redundant failures and future merge conflicts. Extend unit coverage for pod-level API server validation and update resize handler mocks to reflect the new preflight behavior.

Made-with: Cursor
}

remainingRetries := drainMaxAttempts - attempt - 1
k.log.Infof("Drain attempt %d failed for %s: %v. Retrying %d more times.", attempt+1, nodeName, err, remainingRetries)
Collaborator


Perhaps use k.log throughout, and remove the "log" import?

Collaborator Author


Good catch. This stdlib log usage is pre-existing in adminactions and not introduced by this PR.
To keep this already-large PR scoped to control-plane resize behavior, I’d prefer to track log consistency cleanup (k.log everywhere / remove stdlib log) in a small follow-up PR.

)

const (
nodeReadyPollTimeout = 30 * time.Minute
Collaborator


This can lead to up to 90 minutes with three masters.

Collaborator Author


AFAIK Geneva Actions doesn't enforce any timeout limit. What are you suggesting here?

Collaborator


I'm not sure. We need data from real resizes I suppose, and come up with a realistic timeout.


conditions, found, err := unstructured.NestedSlice(node.Object, "status", "conditions")
if err != nil || !found {
return false, false, nil
Collaborator


If an error occurs here, nil is returned instead of err.
Maybe split the two conditions.
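A sketch of the suggested split: propagate `err` and treat "not found" separately as "not ready, no error". To stay self-contained, `nestedSlice` below is a stand-in for `unstructured.NestedSlice`, and the condition-matching logic is simplified; only the control flow of the suggestion matters here.

```go
package main

import (
	"errors"
	"fmt"
)

// nestedSlice stands in for unstructured.NestedSlice: it can fail with an
// error (value has the wrong type) or simply not find the path.
func nestedSlice(obj map[string]any, path string) ([]any, bool, error) {
	v, ok := obj[path]
	if !ok {
		return nil, false, nil
	}
	s, ok := v.([]any)
	if !ok {
		return nil, false, errors.New("value is not a slice")
	}
	return s, true, nil
}

// isNodeReady splits the two conditions as suggested above: an error is
// surfaced to the caller, while an absent conditions list just means the
// node is not ready yet.
func isNodeReady(obj map[string]any) (bool, error) {
	conditions, found, err := nestedSlice(obj, "conditions")
	if err != nil {
		return false, err // surface the error instead of returning nil
	}
	if !found {
		return false, nil // no conditions yet: not ready, but not an error
	}
	for _, c := range conditions {
		if m, ok := c.(map[string]any); ok && m["type"] == "Ready" && m["status"] == "True" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	obj := map[string]any{"conditions": []any{
		map[string]any{"type": "Ready", "status": "True"},
	}}
	fmt.Println(isNodeReady(obj))
}
```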

}

if len(machines) == 0 {
return fmt.Errorf("no control plane machines found")
Collaborator


Maybe use an api.NewCloudError() here? A plain error surfaces as an HTTP 500, which is too generic for this case; maybe http.StatusConflict instead?

