
From kubectl apply to sleeping well at night

By Omer Atagun

I had this cluster running for a while: single-node k3s, a couple of apps, deployments done through CI pipelines that would kubectl apply manifests directly into the cluster. It worked. Until it didn't feel right anymore.

Every time I thought about "what if this machine dies", the answer was a shrug and a prayer. Secrets were created manually on the cluster. Database backups? Didn't exist. If that VM went down, I'd be rebuilding everything from memory and scattered YAML files across repositories.

"If your disaster recovery plan is 'I'll remember', you won't."

So I decided to fix it properly. What started as "let me just set up ArgoCD" turned into a full infrastructure overhaul that I'm honestly proud of. Here's the journey.


Chapter 1: ArgoCD and the GitOps Mindset

The first thing was getting ArgoCD running and moving all deployments under its control. The idea is simple — one Git repository is the source of truth for everything running in the cluster. Push to main, ArgoCD syncs.

I created softastik-infra, an infrastructure repository that holds:

  • All Kubernetes manifests (deployments, services, ingresses)
  • Helm values for any chart-based apps
  • ArgoCD Application CRDs that point to their respective directories
  • SealedSecret manifests for encrypted credentials

The CI pipelines in app repositories now just build an image, push it, and update the image tag in the infra repo. That's it. ArgoCD picks up the change and rolls it out.

# ArgoCD Application — one per app
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenexengineer
  namespace: argocd
spec:
  project: softastik
  source:
    repoURL: https://github.com/softastik-org/softastik-infra.git
    path: apps/tenexengineer
  destination:
    server: https://kubernetes.default.svc
    namespace: tenex
  syncPolicy:
    automated:
      selfHeal: true
      prune: true

selfHeal: true means if someone manually changes something on the cluster, ArgoCD reverts it. prune: true means if I delete a manifest from Git, the resource gets deleted from the cluster. Git is the boss now.

Sealed Secrets — because credentials in Git sounds terrible

Obviously you can't commit plain secrets to a Git repo. Bitnami Sealed Secrets solves this. You encrypt secrets with a public key, commit the encrypted version, and only the controller running in the cluster can decrypt them.
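
The day-to-day workflow is roughly this (secret and namespace names are illustrative; kubeseal fetches the public key from the in-cluster controller):

```shell
# Render a regular Secret locally without ever applying it, then seal it
kubectl create secret generic tenex-db \
  --namespace tenex \
  --from-literal=DATABASE_PASSWORD='s3cret' \
  --dry-run=client -o yaml \
| kubeseal --controller-namespace kube-system --format yaml \
  > apps/tenexengineer/tenex-db-sealed.yaml

# The sealed file is safe to commit; only the controller can decrypt it
git add apps/tenexengineer/tenex-db-sealed.yaml
```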

The critical thing: back up the encryption key. Without it, a new cluster can't decrypt your existing sealed secrets. That backup file is the one piece that lives outside of Git, and it's the foundation of the entire recovery process.

# This one command is your insurance policy
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > ~/sealed-secrets-key-backup.yaml
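
On a fresh cluster, the restore is the mirror image. A sketch, assuming a standard controller install (the label selector can differ depending on how it was installed):

```shell
# Re-import the old private key, then restart the controller so it loads it
kubectl apply -f ~/sealed-secrets-key-backup.yaml
kubectl delete pod -n kube-system -l app.kubernetes.io/name=sealed-secrets
```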

Chapter 2: Shared Postgres — One Database to Rule Them All

Each app was running its own Postgres instance. On a home cluster with limited resources, that's wasteful. More importantly, it means N separate backup strategies.

I consolidated everything into a shared Postgres instance running pgvector/pgvector:pg17. Each app gets its own database and user, created by an init script:

CREATE USER tenex WITH PASSWORD '${TENEX_PASSWORD}';
CREATE DATABASE tenexengineer OWNER tenex;
\c tenexengineer
CREATE EXTENSION IF NOT EXISTS vector;

Adding a new app means adding a few lines to the init ConfigMap. One Postgres pod, one backup job, one restore process.

Automated Backups

A CronJob runs pg_dumpall every night at 3 AM, gzips it, uploads to S3, and cleans up backups older than 7 days:

schedule: "0 3 * * *"

That's it. ArgoCD deploys the CronJob, and it just runs. No cron on the host machine, no scripts to forget about.

Backup Verification

Here's the thing about backups — they're worthless if you've never tested a restore. So I added a weekly verification job that downloads the latest backup, restores it into a temporary database, checks that tables and data exist, then drops it:

=== Backup Verification PASSED ===
  File: all_databases_2026-04-13_0300.sql.gz
  Size: 847293 bytes
  Tables: 24

If it fails, the existing alert rules catch it and Slack yells at me.
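
The verify job is roughly the inverse of the backup. A sketch with illustrative names; note that a pg_dumpall dump contains CREATE DATABASE statements, so this must point at a throwaway Postgres instance, never the production one:

```shell
#!/usr/bin/env bash
# Weekly verify sketch: fetch the latest dump, restore it into a scratch
# Postgres instance, and sanity-check that tables actually came back.
set -euo pipefail

latest=$(aws s3 ls s3://softastik-backups/postgres/ | sort | tail -n1 | awk '{print $4}')
aws s3 cp "s3://softastik-backups/postgres/${latest}" /tmp/latest.sql.gz

# Restore against the scratch instance (pg_dumpall restores go through psql)
gunzip -c /tmp/latest.sql.gz | psql -h verify-postgres -U postgres

tables=$(psql -h verify-postgres -U postgres -d tenexengineer -tAc \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'")

if [[ "$tables" -gt 0 ]]; then
  echo "=== Backup Verification PASSED ==="
  echo "  File: ${latest}"
  echo "  Tables: ${tables}"
else
  echo "=== Backup Verification FAILED ===" >&2
  exit 1
fi
```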


Chapter 3: The "What If" Moment — One Command Recovery

With ArgoCD managing deployments, Sealed Secrets handling credentials, and backups going to S3, I realized I was close to something powerful: a one-command cluster rebuild.

I wrote scripts/bootstrap-cluster.sh that does everything:

  1. Installs k3s
  2. Installs Helm
  3. Installs Sealed Secrets controller and restores the encryption key
  4. Installs ArgoCD
  5. Connects the private infra repo
  6. Applies all ArgoCD Applications
  7. Waits for everything to sync
  8. Restores the database from S3

# New machine? One command.
./scripts/bootstrap-cluster.sh ~/sealed-secrets-key-backup.yaml

That script is the answer to "what if this machine dies". The answer is: get a new machine, run one script, update DNS. Done.
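
The eight steps above condense into something like the following sketch; the real script has retries and health checks, and the chart names, paths, and the restore-db.sh helper are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail
KEY_BACKUP="${1:?usage: bootstrap-cluster.sh <sealed-secrets-key-backup.yaml>}"

# 1+2: k3s and Helm
curl -sfL https://get.k3s.io | sh -s - server --cluster-init
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# 3: Sealed Secrets controller, then restore the old private key
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system
kubectl apply -f "$KEY_BACKUP"
kubectl delete pod -n kube-system -l app.kubernetes.io/name=sealed-secrets

# 4-6: ArgoCD, repo credentials, Applications
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd --create-namespace
kubectl apply -f argocd/repo-credentials.yaml
kubectl apply -f argocd/applications/

# 7: wait until every Application reports Healthy
kubectl -n argocd wait application --all \
  --for=jsonpath='{.status.health.status}'=Healthy --timeout=15m

# 8: restore the database from the latest S3 dump
./scripts/restore-db.sh
```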


Chapter 4: Going HA — Because Single Points of Failure Are Boring

At some point I looked at this beautiful setup running on a single VM and thought... that's still one machine. One disk failure away from downtime.

So I got three servers and went full HA.

k3s with embedded etcd

k3s supports HA clustering with embedded etcd — no external database needed. Three server nodes give you quorum. If one dies, the cluster keeps running.

# First node initializes the cluster
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --tls-san 178.104.173.177 \
  --disable local-storage
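
Nodes two and three then join using the token from the first node (found at /var/lib/rancher/k3s/server/node-token):

```shell
# Run on each additional server node; the IP is the first node's address
curl -sfL https://get.k3s.io | K3S_TOKEN="<node-token>" sh -s - server \
  --server https://178.104.173.177:6443 \
  --disable local-storage
```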

Longhorn for distributed storage

With multiple nodes, you need storage that replicates across them. Longhorn does exactly that. Postgres writes to a Longhorn volume, Longhorn replicates the data to all three nodes. A node dies? The volume is still available from the other two.

NAME                STATE      ROBUSTNESS
shared-postgres     attached   healthy     # 3 replicas across 3 nodes

No more single disk as a point of failure.
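
The replica count lives on the StorageClass. A sketch of the relevant parameters (Longhorn's default install ships something very similar):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # one replica per node
  staleReplicaTimeout: "2880"  # minutes before a dead replica is cleaned up
allowVolumeExpansion: true
```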

ArgoCD itself runs HA

Two replicas for the API server, two for the repo server. The bootstrap script and Helm values handle this automatically.
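
In the argo-cd Helm chart that boils down to a few values. A sketch (key names follow the chart; the exact values here are illustrative):

```yaml
server:
  replicas: 2
repoServer:
  replicas: 2
controller:
  replicas: 1    # the application controller shards rather than load-balances
redis-ha:
  enabled: true  # HA Redis instead of the single-pod default
```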


Chapter 5: Observability — Because Guessing Is Not Monitoring

A cluster without observability is just a fancy way to run things you can't see. I added the full stack:

Prometheus — scrapes metrics from all cluster components and applications. 15 days retention, 10Gi persistent storage on Longhorn.

Grafana — dashboards for everything. I created four as code (ConfigMaps auto-discovered by the Grafana sidecar):

  • Kubernetes Apps Overview — CPU, memory, restarts, network I/O per pod
  • TenexEngineer — request rate, latency percentiles (p50/p95/p99), error rate, pipeline metrics
  • Shared Postgres — pod status, uptime, backup status, PVC usage, disk I/O
  • Uptime & Availability — endpoint status, response times, SSL expiry, 30-day uptime percentage

Alertmanager — fires alerts to Slack. Custom rules for the things that matter:

- alert: NodeDown          # server unreachable for 2 min
- alert: PodCrashLooping   # pod restarting repeatedly
- alert: HighMemoryUsage   # node memory above 90%
- alert: PostgresDown      # shared postgres not ready
- alert: BackupCronJobFailed  # nightly backup or weekly verify failed
- alert: LonghornVolumeUnhealthy  # storage replication degraded
- alert: EndpointDown      # public endpoint unreachable
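
Each of those is an ordinary PrometheusRule that the operator discovers by label. A sketch of the first one (the expression and label selector are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-cluster-alerts
  namespace: monitoring
  labels:
    release: prometheus-stack   # must match the operator's ruleSelector
spec:
  groups:
    - name: cluster
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} unreachable for 2 minutes"
```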

Blackbox Exporter — external uptime checks. Every 30 seconds it probes the blog, tenexengineer, ArgoCD, and Grafana from inside the cluster.
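
With the prometheus-operator in place, each uptime target can be expressed as a Probe object. A sketch with placeholder URLs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: public-endpoints
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  interval: 30s
  module: http_2xx                 # expect a 2xx after following redirects
  prober:
    url: blackbox-exporter.monitoring.svc:9115
  targets:
    staticConfig:
      static:
        - https://blog.example.com     # placeholder targets -- the real
        - https://argocd.example.com   # list probes the blog, tenexengineer,
        - https://grafana.example.com  # ArgoCD, and Grafana
```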

All of this is deployed through ArgoCD as Helm charts with multi-source Applications. Values live in the infra repo, charts come from the prometheus-community Helm repository. Push a change to alert rules? ArgoCD syncs it. Add a new blackbox target? Same thing.


Chapter 6: Application-Level Metrics

The cluster-level metrics were great, but I also wanted to see what's happening inside my apps. So I added a Prometheus middleware to TenexEngineer — it records request count, latency histograms, and in-flight requests for every endpoint:

// Three new metrics, automatically recorded for every request
HTTPRequestDuration   // histogram by method, path, status
HTTPRequestsTotal     // counter by method, path, status
HTTPRequestsInFlight  // gauge

A ServiceMonitor tells Prometheus to scrape it, and a Grafana dashboard visualizes it. Now I can see exactly how many webhook events TenexEngineer processes, how long the pipeline takes, and whether anything is erroring out.
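
The middleware itself is a handful of lines. A sketch in Go with the prometheus client library; the metric and label names mirror the ones above but are otherwise illustrative:

```go
package middleware

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "Request latency by method, path and status.",
	}, []string{"method", "path", "status"})

	httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Request count by method, path and status.",
	}, []string{"method", "path", "status"})

	httpRequestsInFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "http_requests_in_flight",
		Help: "Requests currently being served.",
	})
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Metrics wraps a handler and records all three metrics for every request.
func Metrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpRequestsInFlight.Inc()
		defer httpRequestsInFlight.Dec()

		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		labels := prometheus.Labels{
			"method": r.Method,
			"path":   r.URL.Path, // use a route template here to bound label cardinality
			"status": strconv.Itoa(rec.status),
		}
		httpRequestsTotal.With(labels).Inc()
		httpRequestDuration.With(labels).Observe(time.Since(start).Seconds())
	})
}
```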


What's Actually in Git Now

In Git (automatic)                    NOT in Git (manual setup)
All Kubernetes manifests              k3s installation
SealedSecret manifests                Sealed Secrets encryption key
ArgoCD Application CRDs               ArgoCD repo credentials
Helm values (monitoring, blackbox)    DNS records (Cloudflare)
Ingress rules                         Longhorn installation (HA)
Backup CronJob + Verify Job           GitHub repo secrets
Custom Prometheus alert rules
Grafana dashboards as code
Blackbox uptime targets

The "NOT in Git" column is the migration checklist. Everything else rebuilds itself.


The End Result

$ kubectl get applications -n argocd
NAME                HEALTH    SYNC
blog                Healthy   Synced
dashboards          Healthy   Synced
tenexengineer       Healthy   Synced
shared              Healthy   Synced
prometheus-stack    Healthy   Synced
blackbox-exporter   Healthy   Synced

Three nodes. Six apps. Automated backups with verification. Full observability with Slack alerting. Grafana dashboards for every component. One-command disaster recovery.

From kubectl apply and hoping for the best, to sleeping well at night knowing that if everything burns down, one script and a DNS update brings it all back.

Was it overkill for a home cluster? Absolutely. Did I learn a ton and have fun doing it? Also absolutely.

Till next time, keep your sealed secrets backed up.