From 8cd2b3c3ff54c5f871f031102a78208c13128e9a Mon Sep 17 00:00:00 2001
From: Joshua Gilman
Date: Thu, 16 Apr 2026 15:38:36 -0700
Subject: [PATCH] docs: design network device backups

---
 README.md                           |   1 +
 docs/docs/architecture.md           |  22 ++-
 docs/docs/index.md                  |   1 +
 docs/docs/network-device-backups.md | 201 ++++++++++++++++++++++++++++
 4 files changed, 224 insertions(+), 1 deletion(-)
 create mode 100644 docs/docs/network-device-backups.md

diff --git a/README.md b/README.md
index 0fe04fa..957352e 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,7 @@ moon run docs:start
 - [`docs/docs/index.md`](docs/docs/index.md): docs landing page
 - [`docs/docs/architecture.md`](docs/docs/architecture.md): architecture overview
 - [`docs/docs/hardware.md`](docs/docs/hardware.md): hardware inventory
+- [`docs/docs/network-device-backups.md`](docs/docs/network-device-backups.md): RouterOS backup design for the future platform cluster
 
 ## Support
 
diff --git a/docs/docs/architecture.md b/docs/docs/architecture.md
index 4f2d40c..fbeb964 100644
--- a/docs/docs/architecture.md
+++ b/docs/docs/architecture.md
@@ -53,6 +53,7 @@ This cluster is intended to own the following responsibilities:
 - `Argo CD`: GitOps for the platform cluster itself, and potentially for downstream cluster registration and sync
 - `AWX`: Ansible orchestration for infrastructure tasks that are still better handled through playbooks
 - `TerraKube`: optional Terraform-based automation for future non-node-bootstrap workflows
+- network-device backup services for RouterOS configuration history and encrypted recovery artifacts
 
 This machine is not being used as a general-purpose compute node. Its purpose is to act as the lab control plane.
 
@@ -173,6 +174,24 @@ The intended trust boundary is deliberately split:
 This design accepts that routing, internal DNS, and the online intermediate CA share the `VP6630` failure domain.
 That is an intentional trade for the homelab: a single edge host keeps the bootstrap path simple, while the root CA remains outside that host's routine operating privileges.
 
+### Network Device Backups
+
+Network-device backup collection belongs in the platform cluster once that
+cluster is online.
+
+The first target devices are the MikroTik `CRS309` lab switch and `CCR2004` home
+router. The durable flow should use `Oxidized` for RouterOS collection and a
+small SOPS-aware writer that commits only encrypted backup artifacts into the
+private `secrets` repo.
+
+This is intentionally not a `VP6630` container responsibility. RouterOS backups
+are operational recovery support, not a bootstrap dependency like DNS or PKI.
+Keeping the backup stack in the platform cluster keeps the router focused on
+routing, internal DNS, and certificate issuance while the platform cluster owns
+automation and Git-backed operational services.
+
+The design is documented in [Network device backups](./network-device-backups.md).
+
 ## Control Flow
 
 ### 1. Platform Bootstrap
@@ -295,6 +314,7 @@ At minimum, that includes:
 - platform cluster applications
 - platform cluster infrastructure controllers
 - provisioning stack configuration
+- platform-owned operational services such as network-device backups
 
 The current design keeps `Argo CD` scoped to the platform cluster itself.
 
@@ -340,4 +360,4 @@ As the design firms up, the next useful additions to this document are likely:
 - storage model
 - network model
 - downstream cluster lifecycle
-- backup and disaster recovery boundaries
+- restore drills and disaster recovery procedures
diff --git a/docs/docs/index.md b/docs/docs/index.md
index 9f59ce1..46256fe 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -12,5 +12,6 @@ Start with:
 
 - [Architecture overview](./architecture.md)
 - [Hardware reference](./hardware.md)
+- [Network device backups](./network-device-backups.md)
 
 More runbooks, decisions, and operating guides will live here as the lab grows.
diff --git a/docs/docs/network-device-backups.md b/docs/docs/network-device-backups.md
new file mode 100644
index 0000000..f2fc526
--- /dev/null
+++ b/docs/docs/network-device-backups.md
@@ -0,0 +1,201 @@
+---
+title: Network Device Backups
+description: Future design for RouterOS configuration backup and encrypted Git storage.
+---
+
+# Network Device Backups
+
+This document defines the intended backup process for managed network devices.
+
+The first scope is the MikroTik RouterOS devices:
+
+- `CRS309-1G-8S+IN`: lab switch
+- `CCR2004`: home router
+
+The `VP6630` runs VyOS and is managed separately through the `infra` repo's
+VyOS configuration and Ansible flow. It may be added to the same visibility
+surface later, but the first backup process should stay focused on RouterOS.
+
+## Placement
+
+The durable implementation belongs in the platform cluster, not on the `VP6630`.
+
+The backup service is operational plumbing rather than a bootstrap dependency.
+Running it in the platform cluster keeps the router focused on routing, DNS, and
+PKI duties, while the platform cluster owns automation, GitOps-managed services,
+and recovery helpers.
+
+Until the platform cluster exists, ad hoc manual exports are acceptable. Do not
+add a long-lived RouterOS backup container to the VyOS router as the default
+design.
+
+## Desired Flow
+
+The target flow is:
+
+1. `Oxidized` polls each RouterOS device on a schedule.
+2. Oxidized writes the latest fetched config to a private staging volume.
+3. A small `backup-sync` job or sidecar reads the staged export.
+4. `backup-sync` compares the plaintext export with the decrypted current SOPS
+   backup.
+5. If the plaintext changed, `backup-sync` writes a structured SOPS-encrypted
+   backup into the private `secrets` repo.
+6. `backup-sync` commits and pushes only encrypted files.
+7. Health checks report the last successful backup time per device.
+
+Oxidized should use file output for the handoff to `backup-sync`, not its native
+Git output. Oxidized's Git backend commits plaintext configs, and its encrypted
+Git option uses `git-crypt`, not SOPS.
+
+The encrypted Git writer should be deliberately small. It only needs to compare,
+encrypt, commit, and push. Oxidized should remain responsible for device polling,
+connection handling, and RouterOS model support.
+
+## Secret Boundary
+
+Encrypted backup payloads belong in the private `secrets` repo.
+
+Public repos may contain:
+
+- the platform workload manifests
+- the list of expected devices
+- config templates
+- references to secret names and paths
+- the `backup-sync` source code, if a custom tool is needed
+
+Public repos must not contain:
+
+- RouterOS credentials
+- Git deploy keys
+- SOPS age identities
+- encrypted backup payloads
+- raw RouterOS exports
+
+The expected private repo layout is:
+
+```text
+network/
+  mikrotik/
+    backup-credentials.sops.yaml
+    backup-writer.sops.yaml
+    backups/
+      ccr2004.sops.yaml
+      crs309.sops.yaml
+```
+
+`backup-credentials.sops.yaml` should hold the RouterOS backup user material.
+`backup-writer.sops.yaml` should hold the Git and SOPS identity material needed
+by the platform-cluster writer.
+
+## Backup Format
+
+The reviewable backup artifact should be a SOPS-encrypted YAML envelope, not a
+bare `.rsc` file.
+
+The plaintext form before encryption should look like:
+
+```yaml
+device: crs309
+kind: routeros-export
+captured_at: "2026-04-16T00:00:00Z"
+routeros_version: "7.16.2"
+source: oxidized
+export: |
+  /interface ethernet
+  set [ find default-name=sfp-sfpplus1 ] name=to-vyos
+```
+
+This keeps metadata available to tooling while still letting SOPS encrypt the
+actual configuration content. The exact schema can grow, but it should remain
+simple enough to inspect with `sops -d`.
+
+Do not blindly re-encrypt and commit every poll. SOPS encryption output can
+change even when the plaintext has not changed, so `backup-sync` must compare
+plaintext before writing a new encrypted file.
+
+## Export Policy
+
+Start with plain text RouterOS exports.
+
+The initial command should prefer a terse export rather than verbose output.
+Verbose exports include more default and built-in state, which makes them harder
+to review and more fragile as restore input.
+
+Treat text exports as the primary review and change-history artifact. They are
+not automatically a complete bare-metal recovery guarantee.
+
+RouterOS text exports do not include every sensitive or device-local artifact,
+including system user passwords, installed certificates, SSH keys, Dude data, or
+User Manager databases. Future implementation should explicitly decide whether
+to add same-device binary backups for disaster recovery. If binary backups are
+added, they must also be SOPS-encrypted before commit and their restore path must
+be tested.
+
+## Access Model
+
+Create a dedicated RouterOS backup identity per device or per backup domain.
+
+The backup user should have only the policies needed for the selected export
+method. Do not reuse the day-to-day administrator identity. If the export process
+eventually needs sensitive values, grant that deliberately and document why.
+
+The platform cluster needs network reachability from the backup namespace to:
+
+- `crs309.mgmt.lab.gilman.io` or `10.10.10.2`
+- the home `CCR2004` management address
+
+The implementation session must add the minimum firewall and service-access
+rules needed for those connections.
+
+## GitOps Shape
+
+The backup stack should be deployed by Argo CD as a platform-owned application.
+
+The public desired state should define:
+
+- namespace
+- Oxidized deployment
+- `backup-sync` job or sidecar
+- persistent or ephemeral staging volume
+- network policy, if the cluster network plugin supports it
+- health checks and alerting hooks
+- secret references, not secret payloads
+
+The private `secrets` repo supplies credentials and stores the encrypted backup
+artifacts. The backup writer therefore needs both read access to its deployment
+secrets and write access to the backup destination path.
+
+## Restore Expectations
+
+Every backup mechanism must be paired with a restore drill.
+
+The first implementation is not complete until it proves:
+
+- a current export can be decrypted from the `secrets` repo
+- the export can be inspected by an operator
+- a non-destructive import dry run or lab-device restore test has been performed
+- the known gaps in text exports are documented
+
+Do not assume a RouterOS `.rsc` export can be applied blindly to a wiped or
+replacement device. RouterOS imports are sensitive to version, hardware, default
+objects, interface naming, certificates, keys, and users.
+
+## Future Implementation Checklist
+
+- Choose whether to deploy stock Oxidized plus a custom `backup-sync` container
+  or a single custom collector for the first two devices.
+- Add RouterOS backup users for the `CRS309` and `CCR2004`.
+- Add encrypted RouterOS credentials under `secrets/network/mikrotik/`.
+- Add the platform-cluster Git writer credentials under
+  `secrets/network/mikrotik/`.
+- Create the Argo CD application and manifests for the backup stack.
+- Confirm platform-cluster network reachability to both devices.
+- Run the first backup and verify only SOPS-encrypted files are committed.
+- Test restore behavior against a lab-safe target before relying on the backups
+  for disaster recovery.
+
+## References
+
+- [Oxidized](https://github.com/ytti/oxidized)
+- [RouterOS Configuration Management](https://help.mikrotik.com/docs/spaces/ROS/pages/328155/Configuration%2BManagement)
+- [SOPS](https://github.com/getsops/sops)
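
Note on the compare-before-encrypt rule in the "Desired Flow" and "Backup Format" sections: the patch describes `backup-sync` only in prose, so the following is a minimal Python sketch of steps 4 and 5 under stated assumptions, not the project's implementation. The function names (`normalize`, `decrypt_current`, `export_changed`, `build_envelope`) are hypothetical. It also assumes `sops` is on `PATH` with the age identity available, and that asking SOPS for JSON output is acceptable so the standard library can parse the decrypted envelope.

```python
import json
import os
import subprocess
from datetime import datetime, timezone
from typing import Optional


def normalize(export: str) -> str:
    """Strip trailing whitespace and unify line endings (hypothetical helper).

    Cosmetic differences should not count as a configuration change.
    """
    return "\n".join(line.rstrip() for line in export.splitlines()).strip()


def decrypt_current(path: str) -> Optional[dict]:
    """Return the decrypted envelope at `path`, or None if no backup exists yet.

    Assumption: `sops` is installed and can reach the age identity.
    A decryption failure raises instead of returning None, so a missing key
    is never mistaken for "no previous backup".
    """
    if not os.path.exists(path):
        return None
    result = subprocess.run(
        ["sops", "-d", "--output-type", "json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def export_changed(staged_export: str, current: Optional[dict]) -> bool:
    """Compare plaintext only; SOPS ciphertext differs between runs."""
    if current is None:
        return True
    return normalize(staged_export) != normalize(current.get("export", ""))


def build_envelope(
    device: str,
    export: str,
    captured_at: Optional[str] = None,
    routeros_version: Optional[str] = None,
) -> dict:
    """Build the plaintext envelope using the field names from the design doc."""
    if captured_at is None:
        captured_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    envelope = {
        "device": device,
        "kind": "routeros-export",
        "captured_at": captured_at,
    }
    if routeros_version is not None:
        envelope["routeros_version"] = routeros_version
    envelope["source"] = "oxidized"
    envelope["export"] = export
    return envelope
```

A caller would run `export_changed(staged, decrypt_current(path))` and only build, encrypt, and commit a new envelope when it returns `True`. Encryption and the Git push are deliberately left out of the sketch, because the patch does not fix how the writer invokes them.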