
fix(cluster): declare openshell namespace via k3s auto-manifest #871

Open
latenighthackathon wants to merge 1 commit into NVIDIA:main from latenighthackathon:fix/k8s-namespace-race-1974

@latenighthackathon latenighthackathon commented Apr 17, 2026

Summary

reconcile_pki calls wait_for_namespace("openshell") with a ~115 s budget before the PKI phase can read or write secrets. Today the namespace is created only by the k3s Helm controller reconciling openshell-helmchart.yaml with createNamespace: true. On slow networks, cold boots, or stalled chart downloads the Helm controller can exceed that budget, causing the gateway to fail with:

Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server
    (NotFound): namespaces "openshell" not found

Declaring the namespace as a standalone auto-applied manifest makes k3s create it within seconds of the API server becoming ready — decoupled from Helm controller latency.
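
For a sense of where the ~115 s figure comes from: the commit message for this PR describes the wait as 60 attempts with 200 ms→2 s backoff. The sketch below assumes the delay doubles after each attempt until it reaches the 2 s cap and that one sleep follows every attempt; `total_backoff_budget` is a hypothetical helper for illustration, not the actual `wait_for_namespace` code.

```rust
use std::time::Duration;

/// Sum the sleeps of a capped, doubling backoff over `attempts` tries.
/// Assumption (not confirmed by the source): the delay doubles after each
/// attempt until it reaches `cap`, and every attempt is followed by one sleep.
fn total_backoff_budget(attempts: u32, initial: Duration, cap: Duration) -> Duration {
    let mut delay = initial;
    let mut total = Duration::ZERO;
    for _ in 0..attempts {
        total += delay;
        // Double the delay, but never exceed the cap.
        delay = (delay * 2).min(cap);
    }
    total
}
```

Under those assumptions the sleeps are 0.2 s, 0.4 s, 0.8 s, 1.6 s, then 56 sleeps of 2 s each, which sums to 3 s + 112 s = 115 s and agrees with the stated budget.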

Related Issue

Closes NVIDIA/NemoClaw#1974

Changes

  • Add deploy/kube/manifests/openshell-namespace.yaml — a minimal kind: Namespace manifest with SPDX header. k3s auto-applies everything in /var/lib/rancher/k3s/server/manifests/ on startup, before Helm reconciliation.
  • Update crates/openshell-vm/scripts/build-rootfs.sh to include the new file in its explicit manifest copy list. The docker path in cluster-entrypoint.sh uses a *.yaml glob and picks it up automatically.
  • Unit test in crates/openshell-bootstrap/src/lib.rs — compile-time embeds the manifest via include_str! and asserts apiVersion, kind, and metadata.name. Fails the build if the file is deleted/renamed; fails the test if any of the three fields wait_for_namespace depends on drift.
  • E2E test e2e/rust/tests/namespace_bootstrap.rs — against a healthy gateway, asserts kubectl get namespace openshell returns namespace/openshell and that status.phase == Active. The phase check rejects a Terminating namespace from a tear-down or an empty response from a transient API error.
  • architecture/gateway-single-node.md — lists the new manifest in the bundled-manifests section and explains why it exists independently of the HelmChart CR.
  • createNamespace: true on the HelmChart is retained as an idempotent fallback — Helm's --create-namespace coexists with pre-existing namespaces without error.
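
A rough sketch of the kind of string check the unit test performs. The real test embeds the manifest with include_str!; here a hypothetical inline copy stands in so the example is self-contained, and the SPDX line is a placeholder rather than the actual header. `manifest_declares_namespace` is an illustrative name, not a function from the PR.

```rust
/// Hypothetical stand-in for deploy/kube/manifests/openshell-namespace.yaml.
/// The license identifier below is a placeholder, not the real header.
const NAMESPACE_MANIFEST: &str = "\
# SPDX-License-Identifier: PLACEHOLDER
apiVersion: v1
kind: Namespace
metadata:
  name: openshell
";

/// Return true when the manifest contains the three fields the PR says
/// wait_for_namespace depends on: apiVersion, kind, and metadata.name.
/// This is a plain line match, not a YAML parse.
fn manifest_declares_namespace(manifest: &str, name: &str) -> bool {
    let has_line = |expected: &str| manifest.lines().any(|l| l.trim_end() == expected);
    has_line("apiVersion: v1")
        && has_line("kind: Namespace")
        && has_line(&format!("  name: {name}"))
}
```

A line-based check like this is deliberately strict: renaming the namespace or changing the kind makes it fail, which is the drift the test is meant to catch.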

Testing

  • cargo test -p openshell-bootstrap --lib — 109 passed / 0 failed, including the new openshell_namespace_manifest_is_present_and_well_formed
  • cargo check --tests --features e2e (in e2e/rust/) — new e2e suite compiles cleanly
  • Live e2e against rancher/k3s:v1.29.8-k3s1. Dropped the new manifest into /var/lib/rancher/k3s/server/manifests/ on a running k3s container; k3s applied it within 6 ms (ApplyingManifest → AppliedManifest per the addon controller events), and kubectl get namespace openshell -o name returned namespace/openshell with status.phase == Active. This mirrors exactly what cluster-entrypoint.sh's cp "$manifest" "$K3S_MANIFESTS/" step does, using a stock k3s image.
  • mise run license:check — SPDX headers present on all new files
  • mise run helm:lint — no regression on the openshell chart
  • mise run docs — architecture + Fern docs validate
  • bash -n crates/openshell-vm/scripts/build-rootfs.sh — syntax OK
  • YAML parses and matches apiVersion: v1 / kind: Namespace / metadata.name: openshell
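
The two conditions the e2e test asserts can be sketched as a pure function over the command outputs. This is an illustration only: `namespace_is_active` is a hypothetical name, and it assumes the phase was captured as a bare string (for example via a jsonpath query on `.status.phase`), which the PR text does not spell out.

```rust
/// Decide readiness from the two outputs the e2e test inspects.
/// `name_output` is what `kubectl get namespace <ns> -o name` printed;
/// `phase_output` is assumed to hold the bare value of status.phase.
fn namespace_is_active(name_output: &str, phase_output: &str, namespace: &str) -> bool {
    // The resource must be reported under the expected name...
    name_output.trim() == format!("namespace/{namespace}")
        // ...and must be Active, which rules out Terminating or an empty reply.
        && phase_output.trim() == "Active"
}
```

Checking the phase in addition to the name is what lets the test reject a namespace that still exists but is being torn down.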

Checklist

@latenighthackathon latenighthackathon requested a review from a team as a code owner April 17, 2026 04:53
copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@latenighthackathon latenighthackathon force-pushed the fix/k8s-namespace-race-1974 branch from 20584f9 to 5764450 April 17, 2026 05:26
reconcile_pki calls wait_for_namespace("openshell") with a ~115s budget
(60 attempts, 200ms→2s backoff) before the PKI phase can read or write
secrets. Today the namespace is created only by the k3s Helm controller
reconciling openshell-helmchart.yaml with createNamespace: true. On slow
networks, cold boots, or when the chart tarball download stalls, the
Helm controller can easily exceed that budget, producing:

  Error: × K8s namespace not ready
  ╰─▶ timed out waiting for namespace 'openshell' to exist: Error from
      server (NotFound): namespaces "openshell" not found

k3s auto-applies every YAML in /var/lib/rancher/k3s/server/manifests/
as soon as its API server is ready, before any Helm reconciliation.
A standalone Namespace manifest guarantees the namespace exists within
seconds of cluster startup, decoupled from Helm controller latency.

createNamespace: true on the HelmChart stays as an idempotent fallback
— Helm's --create-namespace coexists with pre-existing namespaces
without error.

Also updates the openshell-vm rootfs builder to include the new manifest
in its explicit copy list; the docker cluster-entrypoint picks it up
automatically via its *.yaml glob.

Docs:

- architecture/gateway-single-node.md lists the new manifest and
  explains why it exists independently of the HelmChart CR.

Tests:

- Unit test in openshell-bootstrap compile-time embeds the manifest via
  include_str! and asserts apiVersion/kind/metadata.name. include_str!
  fails the build if the file is deleted or moved; the string checks
  catch drift in the fields wait_for_namespace depends on.
- E2E test asserts `kubectl get namespace openshell` returns
  `namespace/openshell` and that `status.phase == Active` against a
  healthy gateway, rejecting a Terminating namespace or a transient
  empty API response that would pass a bare existence check.

Closes NVIDIA/NemoClaw#1974

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@latenighthackathon latenighthackathon force-pushed the fix/k8s-namespace-race-1974 branch from 5764450 to f5118b6 April 17, 2026 05:29