
ARO-24544: migrate admin API endpoint for control plane VM resize #4733

Open
tuxerrante wants to merge 14 commits into master from resize-cp-subagent

Conversation

Collaborator

@tuxerrante tuxerrante commented Mar 31, 2026

Which issue this PR addresses:

Fixes ARO-24544

What this PR does / why we need it:

Migrates the control plane VM resize orchestration from the external C# Geneva Action (ResizeControlPlaneVMsOperation.cs) into the ARO-RP as a native admin API endpoint. This eliminates split-brain logic between the Geneva Action and the RP, gives the operation access to the RP's existing validation infrastructure, and makes the orchestration testable with standard Go tooling.

The new POST /admin/.../resizecontrolplane?vmSize=<sku>&deallocateVM=<bool> endpoint performs pre-flight health checks and then sequentially resizes each master node through the full lifecycle: cordon, drain (with retry), stop VM, resize VM, start VM, wait for Node Ready, uncordon, update Machine object metadata, and update Node instance-type labels.

Design choices and deviations from the original C# implementation

  • Synchronous admin operation. Matches the pattern used by stopvm, startvm, redeployvm. The UnplannedMaintenanceSignal middleware signals maintenance state.
  • deallocateVM defaults to true because Azure requires deallocation for most cross-family resizes. Callers can pass deallocateVM=false explicitly.
  • CPMS-active blocks the operation with 409 Conflict. CPMS-aware resize (patching the CPMS spec) is planned for a follow-up.
  • Reverse name-order processing (master-2 → master-1 → master-0) minimises etcd leader elections, matching the C# behaviour.
  • Pre-flight health validation via the shared _getPreResizeControlPlaneVMsValidation endpoint, which checks API server health (synchronous gate), etcd health, service principal validity, VM SKU/quota, and CPMS state.
  • API server health as a synchronous gate: if the kube-apiserver is unreachable, we fail immediately instead of spawning parallel kube-based checks that would all fail with connection errors.
  • Reuse of existing helpers: validateAdminMasterVMSize, getClusterMachines (via getControlPlaneMachines wrapper), and health checks from the pre-resize validation endpoint.
  • Machine/Node metadata updates with retry (up to 3 attempts, ctx-aware delays). The Machine update uses typed machinev1beta1.Machine objects for safety.
  • Explicit cordon/uncordon wrappers: cordonNode() and uncordonNode() wrap CordonNode(bool) to make intent visible at every call site.

Test plan for issue:

  • Unit tests (pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go):

    • TestCheckCPMSNotActive — CPMS not found, Inactive, Active (blocked), empty state, non-NotFound error (fails closed), invalid JSON (fails closed)
    • TestIsNodeReady — node ready, not ready, not found
    • TestResizeControlPlane — all-nodes-already-at-size no-op, single-node full sequence (verifies exact call ordering via gomock.InOrder), no machines found, drain/stop/resize failure cases
    • TestUpdateMachineVMSize — success, retry on conflict
    • TestUpdateNodeInstanceTypeLabels — success, retry on conflict
    • TestAdminResizeControlPlane — full HTTP integration: invalid VM size (400), cluster not found (404), subscription not found (400)
  • E2E tests (test/e2e/adminapi_resize_controlplane.go):

    • Rejects unsupported VM size (400)
    • Rejects missing vmSize parameter (400)
    • (Full resize E2E requires a live cluster and is validated manually in dev environments)
  • Local integration testing against containerized RP:

    Prerequisites: Docker (or Podman on Linux), env file at repo root (from env.example), secrets/ directory with valid credentials.

    # Build the dev container image (first time or after Dockerfile changes)
    make dev-env-build
    
    # Start the containerized RP (first run compiles from source, ~2-5 min)
    make dev-env-start
    
    # Wait for the RP to become healthy
    until curl -ksSf https://localhost:8443/healthz/ready 2>/dev/null; do sleep 5; done
    
    # Test: invalid vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?vmSize=Standard_Invalid_Fake&deallocateVM=true"
    
    # Test: missing vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?deallocateVM=true"
    
    # Test: valid vmSize, nonexistent cluster → 404
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/nonexistent/resizecontrolplane?vmSize=Standard_D8s_v3&deallocateVM=true"
    
    # Stop the containerized RP
    make dev-env-stop
    Test case                           Expected
    Invalid vmSize                      400 InvalidParameter
    Missing vmSize                      400 InvalidParameter
    Valid vmSize, nonexistent cluster   404 ResourceNotFound

Is there any documentation that needs to be updated for this PR?

The admin API endpoint documentation in docs/deploy-development-rp.md should be updated to include an example curl command for the new /resizecontrolplane endpoint. This can be done in a follow-up.

How do you know this will function as expected in production?

  • The operation is 1:1 with the proven C# Geneva Action sequence — same steps, same order.
  • Structured logging at every step provides full observability.
  • Failure modes are handled explicitly: drain retries (3×2s), kube update retries (3×1s), node-ready polling (30min×5s), context cancellation respected throughout.
  • The UnplannedMaintenanceSignal middleware, CPMS-active check, and pre-flight health checks (apiserver gate + etcd + SP + CPMS) prevent conflicting or unsafe operations.
  • Best-effort node recovery is planned as a separate follow-up PR for easier review and revertability.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new admin API endpoint to orchestrate control plane VM resizes directly within ARO-RP, migrating the resize workflow from the external Geneva Action into the RP so it can reuse existing validation and be unit-tested.

Changes:

  • Adds POST /admin/.../resizecontrolplane route guarded by UnplannedMaintenanceSignal.
  • Implements synchronous, sequential master resize orchestration with best-effort recovery and kube metadata/label updates.
  • Adds unit tests for orchestration helpers and minimal E2E validations for request rejection cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Changed files:

  • test/e2e/adminapi_resize_controlplane.go: adds E2E coverage for basic 400 validation cases (unsupported size, missing vmSize).
  • pkg/frontend/frontend.go: wires the new /resizecontrolplane admin route with maintenance signaling middleware.
  • pkg/frontend/admin_openshiftcluster_resize_controlplane.go: new resize orchestration implementation (cordon/drain/stop/resize/start/wait/uncordon + metadata updates + recovery).
  • pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go: adds unit tests for CPMS gating, readiness checks, drain retries, full sequence ordering, and recovery paths.


Collaborator

@mociarain mociarain left a comment


LGTM, but there are a bunch of Copilot suggestions. I'm going to treat these as if the PR were still a draft, i.e. once they're resolved I'll give it a more thorough review... Is this a good or bad heuristic?

tuxerrante added a commit that referenced this pull request Mar 31, 2026
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from
  KubeGet; return 500 on other errors instead of silently proceeding.
  Also return errors on JSON unmarshal failure and NestedString errors
  instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors
  for creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep
  with ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases
  for non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733

Made-with: Cursor
tuxerrante and others added 5 commits April 1, 2026 16:30
Add a new admin API POST endpoint `resizecontrolplane` that performs
sequential in-place resizing of all control plane VMs in an ARO cluster.

The operation follows a safe rolling approach: for each master node
(processed in reverse order starting from the highest-numbered, least
critical node), it cordons, drains, stops, resizes, starts the VM,
waits for the node to become Ready, uncordons, and updates the Machine
object and Node labels to reflect the new VM size.

Key design decisions:
- Pre-flight validation via _getPreResizeControlPlaneVMsValidation
  checks API server health, etcd health, service principal validity,
  VM SKU availability, and compute quota before starting.
- CPMS (ControlPlaneMachineSet) guard ensures the operator is not
  Active, preventing conflicts with automated machine management.
- Drain uses retries with context-aware delays (moved to kubeActions
  interface as DrainNodeWithRetries for reusability).
- Node readiness polling uses wait.PollImmediate for idiomatic k8s
  wait patterns.
- Machine object updates use typed machinev1beta1.Machine structs
  instead of Unstructured for type safety.
- Node instance-type labels use shared constants
  (nodeLabelInstanceType, nodeLabelBetaInstanceType).
- Recovery logic intentionally omitted — will be added as a separate
  PR for easier review and revert if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When resizeControlPlaneNode fails before the VM SKU has been changed
(drain, stop, or resize step), attempt to restore the node to a
schedulable state:

- If the node is still running (drain/stop failed): uncordon it.
- If the VM was stopped (resize failed): start VM, wait for node
  Ready, then uncordon.
- If the node does not become Ready after recovery start, leave it
  cordoned per SOP — SRE should verify health before re-enabling
  scheduling.

The original error is always returned with recovery outcome appended
so SREs can see both the failure reason and the cluster state.

Ref: #4723 (comment)
Made-with: Cursor
- Remove recovery code (bestEffortUncordon, bestEffortRecoverVM,
  resizeRecoveryError) per reviewer request to keep as separate PR
- Use wait.PollImmediateUntilWithContext for waitForNodeReady
- Revert doUpdateMachineVMSize to typed Machine object approach
- Move checkCPMSNotActive to prevalidation pipeline
- Remove recovery-related tests, add simple failure tests

Made-with: Cursor
… sort cleanup

- Make validateAPIServerHealth a synchronous gate before parallel checks
  so kube-unreachable errors are reported once instead of N times
- Add cordonNode/uncordonNode wrappers around CordonNode(bool) to make
  intent explicit at every call site
- Add getControlPlaneMachines wrapper to clarify that the returned map
  only contains master machines (getClusterMachines already filters)
- Replace sort.Sort(sort.Reverse(sort.StringSlice(…))) with
  slices.SortedFunc(maps.Keys(…)) and add comment explaining why we
  process in reverse order
- Add CPMS mock to allKubeChecksHealthyMock in pre-validation tests

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 1, 2026 15:36
@tuxerrante tuxerrante force-pushed the resize-cp-subagent branch from fece767 to 31e9d5e Compare April 1, 2026 15:36
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 40 out of 44 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

pkg/util/azureclient/azuresdk/common/options.go:51

  • shouldRetry() no longer reads the response body correctly: var b []byte; resp.Body.Read(b) reads 0 bytes, so the retry checks for AADSTS/AuthorizationFailed will never match. It also doesn’t restore/close the body, which can break downstream SDK error handling and leak connections. Please revert to fully reading the body (e.g., io.ReadAll), close the original body, and restore resp.Body so it can be read again; keep/restore the unit tests that validated this behavior.
	// Check if the body contains the certain strings that can be retried.
	var b []byte
	_, err = resp.Body.Read(b)
	if err != nil {
		return true
	}
	body := string(b)
	return strings.Contains(body, ErrCodeInvalidClientSecretProvided) ||
		strings.Contains(body, ErrCodeMissingRequiredParameters) ||
		strings.Contains(body, AuthorizationFailed)


…ature

Remove all changes unrelated to the control plane VM resize endpoint
that were accidentally included in this PR: ACR token SDK revert
(acrtoken, armcontainerregistry, mgmt/containerregistry), deploy
generator script minification changes, changefeed test modifications,
version const downgrade, options.go retry regression, and dependency
changes. The PR now only contains the resize-cp feature files.

Made-with: Cursor
Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 2, 2026 09:01
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.



Validate deallocateVM strictly to prevent silent behavior changes from invalid values and refactor control-plane preflight validation to reuse already-fetched docs and clients in the resize handler. Add a happy-path resize handler test to cover the shared prevalidation helper end-to-end.

Made-with: Cursor
Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review.

Made-with: Cursor
Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics.

Made-with: Cursor
Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 2, 2026 15:09
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior.

Made-with: Cursor
Comment on lines +136 to +140
machine := machines[name]
if machine.size == desiredVMSize {
log.Infof("%s is already running %s, skipping", name, desiredVMSize)
continue
}
Collaborator


I think this has the potential to move on to the next machine resize even if the first machine's node is NotReady.
If the resize GA is retried, the machine can have status Running even though the node is not ready. A good check would be to also confirm node readiness and cordon status. We don't want to resize the next master until the first one is fully working.

Collaborator Author


Great point. I have added a pre-loop gate before any skip/resize decision: ensureControlPlaneNodesReadyAndSchedulable now verifies every control-plane node is Ready and schedulable.

If any node is NotReady or unschedulable, the operation fails with 409 and does not move to the next master. This closes the scenario where a node already at target size could be skipped while unhealthy. We still keep waitForNodeReady after each VM restart for per-node post-resize stabilization.
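The pre-loop gate described here can be sketched as follows. This is a simplified stand-in: the `node` struct replaces the real Node objects, and the actual implementation derives readiness from Node conditions and schedulability from `spec.unschedulable`.

```go
package main

import (
	"fmt"
)

// node is a minimal stand-in for the fields the gate inspects.
type node struct {
	name          string
	ready         bool
	unschedulable bool
}

// ensureControlPlaneNodesReadyAndSchedulable refuses to start (or continue)
// a resize unless every control plane node is Ready and schedulable, so a
// retried operation cannot skip past an unhealthy node that already matches
// the target VM size.
func ensureControlPlaneNodesReadyAndSchedulable(nodes []node) error {
	for _, n := range nodes {
		if !n.ready {
			return fmt.Errorf("node %s is not Ready; refusing to resize", n.name)
		}
		if n.unschedulable {
			return fmt.Errorf("node %s is cordoned; refusing to resize", n.name)
		}
	}
	return nil
}

func main() {
	nodes := []node{
		{name: "master-0", ready: true},
		{name: "master-1", ready: true, unschedulable: true},
	}
	fmt.Println(ensureControlPlaneNodesReadyAndSchedulable(nodes))
}
```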

Prevent the resize loop from continuing when any control plane node is NotReady or still unschedulable, even if a machine already matches the desired VM size. This closes the skip-path safety gap raised in review and adds test coverage for the new pre-loop guard.

Made-with: Cursor
Align prevalidation flow with related resize work by gating kube-apiserver pod checks behind the ClusterOperator health check, reducing redundant failures and future merge conflicts. Extend unit coverage for pod-level API server validation and update resize handler mocks to reflect the new preflight behavior.

Made-with: Cursor
}

remainingRetries := drainMaxAttempts - attempt - 1
k.log.Infof("Drain attempt %d failed for %s: %v. Retrying %d more times.", attempt+1, nodeName, err, remainingRetries)
Collaborator


Perhaps use k.log throughout, and remove the "log" import?

Collaborator Author


Good catch. This stdlib log usage is pre-existing in adminactions and not introduced by this PR.
To keep this already-large PR scoped to control-plane resize behavior, I’d prefer to track log consistency cleanup (k.log everywhere / remove stdlib log) in a small follow-up PR.

)

const (
nodeReadyPollTimeout = 30 * time.Minute
Collaborator


This can lead to up to 90 minutes with three masters.

Collaborator Author


AFAIK Geneva Actions doesn't enforce any timeout limit. What are you suggesting here?

Collaborator


I'm not sure. We need data from real resizes I suppose, and come up with a realistic timeout.


conditions, found, err := unstructured.NestedSlice(node.Object, "status", "conditions")
if err != nil || !found {
return false, false, nil
Collaborator


If an error occurs here, nil is returned instead of err.
Maybe split the two conditions.
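A sketch of the suggested split: propagate `err` and treat "not found" separately as "not ready, no error". To stay self-contained, `nestedSlice` below is a stand-in for `unstructured.NestedSlice`, and the condition-matching logic is simplified; only the control flow of the suggestion matters here.

```go
package main

import (
	"errors"
	"fmt"
)

// nestedSlice stands in for unstructured.NestedSlice: it can fail with an
// error (value has the wrong type) or simply not find the path.
func nestedSlice(obj map[string]any, path string) ([]any, bool, error) {
	v, ok := obj[path]
	if !ok {
		return nil, false, nil
	}
	s, ok := v.([]any)
	if !ok {
		return nil, false, errors.New("value is not a slice")
	}
	return s, true, nil
}

// isNodeReady splits the two conditions as suggested above: an error is
// surfaced to the caller, while an absent conditions list just means the
// node is not ready yet.
func isNodeReady(obj map[string]any) (bool, error) {
	conditions, found, err := nestedSlice(obj, "conditions")
	if err != nil {
		return false, err // surface the error instead of returning nil
	}
	if !found {
		return false, nil // no conditions yet: not ready, but not an error
	}
	for _, c := range conditions {
		if m, ok := c.(map[string]any); ok && m["type"] == "Ready" && m["status"] == "True" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	obj := map[string]any{"conditions": []any{
		map[string]any{"type": "Ready", "status": "True"},
	}}
	fmt.Println(isNodeReady(obj))
}
```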

}

if len(machines) == 0 {
return fmt.Errorf("no control plane machines found")
Collaborator


Maybe use an api.NewCloudError() here? A plain error surfaces as an HTTP 500, which is too generic for this case; maybe http.StatusConflict instead?

