
How I Built a Production-Grade Homelab with Kubernetes, Observability, and Full GitOps — And the AI That Helped Me Do It

When most people think “homelab,” they picture a dusty old PC running Plex and maybe a Pi-hole. What I built is something considerably different: a six-node RHEL 9 Kubernetes cluster with full GitOps CI/CD, enterprise-grade observability, split-horizon DNS, wildcard TLS, and a growing stack of self-hosted applications — all running on hardware I own, in my own environment.

And throughout the entire build, I had Claude as my copilot.

This post is a deep dive into what we built together, the problems we solved, and why this homelab is more “production infrastructure” than “weekend project.”


The Foundation: A Six-Node RHEL 9 Kubernetes Cluster

The cluster is built on Red Hat Enterprise Linux 9 — not Ubuntu, not Debian. RHEL 9 is what I run professionally, and I wanted my homelab to reflect real-world enterprise conditions. The node layout is straightforward but robust:

  • master01 (192.168.1.11), master02 (192.168.1.12) — control plane nodes
  • kube01–kube04 (192.168.1.13–192.168.1.16) — worker nodes
  • VIP: 192.168.1.30 — HA control plane endpoint via keepalived

The CNI is Cilium, chosen for its eBPF-based networking and observability capabilities. One early gotcha: RHEL 9’s kernel had an eBPF interaction issue with crun, requiring a switch to runc with CRI-O 1.32. Claude helped me track down the root cause and pivot quickly.

MetalLB handles bare-metal load balancing, with an IP pool spanning 192.168.1.30–50. Traefik v3 sits at the 192.168.1.30 VIP and serves as the cluster’s ingress controller, routing all external traffic into the cluster via hostname-based rules.
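
As a rough illustration of the pool described above, here is a minimal Python sketch that builds the MetalLB objects as JSON for kubectl apply. The pool name, the metallb-system namespace, and the use of L2 advertisement are assumptions for illustration; only the address range comes from this build.

```python
import ipaddress
import json


def pool_size(first: str, last: str) -> int:
    """Return the number of addresses in an inclusive first-last range."""
    start, end = ipaddress.ip_address(first), ipaddress.ip_address(last)
    if end < start:
        raise ValueError("pool end precedes pool start")
    return int(end) - int(start) + 1


def metallb_manifests(name: str, first: str, last: str,
                      namespace: str = "metallb-system") -> dict:
    """Build an IPAddressPool plus an L2Advertisement wrapped in a v1 List."""
    pool_size(first, last)  # validates the range before emitting anything
    pool = {
        "apiVersion": "metallb.io/v1beta1",
        "kind": "IPAddressPool",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"addresses": [f"{first}-{last}"]},
    }
    advert = {
        # Assumption: L2 mode. The post does not say whether L2 or BGP is used.
        "apiVersion": "metallb.io/v1beta1",
        "kind": "L2Advertisement",
        "metadata": {"name": f"{name}-l2", "namespace": namespace},
        "spec": {"ipAddressPools": [name]},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [pool, advert]}


# "homelab-pool" is a hypothetical name.
manifests = metallb_manifests("homelab-pool", "192.168.1.30", "192.168.1.50")
print(json.dumps(manifests, indent=2))
```

The printed JSON can be piped to kubectl apply -f -, since kubectl accepts JSON as well as YAML.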


Networking and DNS: Split-Horizon Done Right

Networking was one of the most architecturally interesting parts of the build. The environment runs a full split-horizon DNS setup:

  • burnedworm.com: External zone managed through Cloudflare, used for public-facing and cluster services accessible from outside
  • blackburn.lan: Internal zone managed by FreeIPA/IDM (with a replica at idm02.blackburn.lan), covering all internal VMs and infrastructure
  • k8s.burnedworm.com: A stub zone that resolves cluster services like traefik.k8s.burnedworm.com, elastic.k8s.burnedworm.com, and similar

One of the more subtle issues we worked through together was cert-manager’s DNS-01 challenge behavior in this environment. Before asking the CA to validate, cert-manager checks for itself that the challenge TXT record is visible. Because IDM intercepts internal DNS resolution, that self-check was getting answers from IDM that didn’t reflect the TXT records published in Cloudflare. The fix was to run cert-manager with --dns01-recursive-nameservers-only and to point --dns01-recursive-nameservers at external resolvers. Claude diagnosed this quickly and provided the exact flags and resolver configuration needed to make wildcard cert issuance work cleanly.

The result: a wildcard certificate for *.k8s.burnedworm.com issued automatically via cert-manager + Cloudflare DNS-01, renewed without human intervention.
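
To make the moving parts concrete, here is a hedged Python sketch of the three pieces involved: the extra controller flags, a ClusterIssuer with a Cloudflare DNS-01 solver, and the wildcard Certificate. The ACME server (Let's Encrypt), the resolver addresses, the email, and every object and secret name are assumptions for illustration, not the values used in this cluster.

```python
import json

# Extra arguments for the cert-manager controller. The resolver addresses
# are assumed examples; any external recursive resolvers would do.
CONTROLLER_ARGS = [
    "--dns01-recursive-nameservers-only",
    "--dns01-recursive-nameservers=1.1.1.1:53,8.8.8.8:53",
]


def cluster_issuer(name: str, email: str, token_secret: str) -> dict:
    """ClusterIssuer that solves DNS-01 challenges through Cloudflare."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {"acme": {
            # Assumption: Let's Encrypt production endpoint.
            "server": "https://acme-v02.api.letsencrypt.org/directory",
            "email": email,
            "privateKeySecretRef": {"name": f"{name}-account-key"},
            "solvers": [{"dns01": {"cloudflare": {
                "apiTokenSecretRef": {"name": token_secret, "key": "api-token"},
            }}}],
        }},
    }


def wildcard_certificate(zone: str, issuer: str, namespace: str) -> dict:
    """Certificate requesting *.<zone> from the given ClusterIssuer."""
    slug = "wildcard-" + zone.replace(".", "-")
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": slug, "namespace": namespace},
        "spec": {
            "secretName": f"{slug}-tls",
            "dnsNames": [f"*.{zone}"],
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


# Hypothetical names throughout; only the zone comes from the post.
issuer = cluster_issuer("letsencrypt-dns01", "admin@example.com", "cloudflare-api-token")
cert = wildcard_certificate("k8s.burnedworm.com", "letsencrypt-dns01", "traefik")
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [issuer, cert]}, indent=2))
```

With a Helm-based install, the two flags would typically be passed through the chart's extraArgs value.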


TLS Everywhere, Automatically

Every service in the cluster gets HTTPS automatically through the wildcard cert. Traefik handles termination, and cert-manager handles renewal. The Traefik dashboard itself is accessible at traefik.k8s.burnedworm.com over HTTPS.
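
One way this can be wired up in Traefik v3 is a default TLSStore pointing at the wildcard secret, plus one IngressRoute per hostname. The Python sketch below builds both objects; the secret name, namespaces, entry point name, and backend service are hypothetical, and the post does not confirm that this is the exact mechanism in use.

```python
import json


def default_tls_store(secret_name: str, namespace: str) -> dict:
    """TLSStore named 'default', which Traefik uses for routes without an explicit cert."""
    return {
        "apiVersion": "traefik.io/v1alpha1",
        "kind": "TLSStore",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"defaultCertificate": {"secretName": secret_name}},
    }


def ingress_route(name: str, namespace: str, host: str,
                  service: str, port: int) -> dict:
    """Hostname-based IngressRoute terminated with the default certificate."""
    return {
        "apiVersion": "traefik.io/v1alpha1",
        "kind": "IngressRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entryPoints": ["websecure"],  # assumed entry point name
            "routes": [{
                "match": f"Host(`{host}`)",
                "kind": "Rule",
                "services": [{"name": service, "port": port}],
            }],
            "tls": {},  # empty block: terminate TLS with the default store
        },
    }


# Hypothetical secret, namespace, and service names.
store = default_tls_store("wildcard-k8s-burnedworm-com-tls", "traefik")
route = ingress_route("kuma", "monitoring", "kuma.k8s.burnedworm.com", "uptime-kuma", 3001)
print(json.dumps([store, route], indent=2))
```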

For the Elastic Stack specifically, Elasticsearch 9.x defaults to HTTP/2 with TLS, which introduced some interesting Traefik compatibility considerations. Claude helped me work through the implications of HTTP/2 passthrough versus termination and get the stack talking cleanly.


Storage: NFS and OpenEBS for the Right Workloads

Storage is handled by multiple storage classes, each suited to different workload profiles:

  • nfs-hdd (default) — Backed by TrueNAS at /mnt/GP/kubernetes over NFS, suitable for most workloads
  • nfs-ssd — Higher I/O NFS storage for latency-sensitive applications
  • openebs-hostpath — Local node storage, used specifically for Elasticsearch to avoid NFS overhead on a search workload

Pinning Elasticsearch to specific nodes via openebs-hostpath was one of those decisions that required understanding both Kubernetes scheduling (node selectors, tolerations) and Elasticsearch’s shard allocation behavior. Claude walked through both layers with me to arrive at a configuration that’s stable and performant.
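
A sketch of what that pinning can look like in an ECK manifest, built here as a Python dict. The node label, node count, volume size, and object names are assumptions; the storage class and the 9.3.2 version are the ones mentioned in this post.

```python
import json


def elasticsearch_manifest(version: str, node_selector: dict,
                           count: int = 3, storage: str = "100Gi") -> dict:
    """ECK Elasticsearch resource with local volumes and a node selector."""
    return {
        "apiVersion": "elasticsearch.k8s.elastic.co/v1",
        "kind": "Elasticsearch",
        "metadata": {"name": "elastic", "namespace": "elastic"},  # hypothetical
        "spec": {
            "version": version,
            "nodeSets": [{
                "name": "default",
                "count": count,  # assumed; the post does not give a count
                # Only nodes carrying this label are eligible for these pods.
                "podTemplate": {"spec": {"nodeSelector": node_selector}},
                "volumeClaimTemplates": [{
                    # ECK expects the data volume claim to use this name.
                    "metadata": {"name": "elasticsearch-data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "storageClassName": "openebs-hostpath",
                        "resources": {"requests": {"storage": storage}},
                    },
                }],
            }],
        },
    }


# The label key and value are hypothetical.
es = elasticsearch_manifest("9.3.2", {"workload": "elasticsearch"})
print(json.dumps(es, indent=2))
```

Because hostpath volumes live on a single node, each Elasticsearch pod stays tied to the node where its volume was first provisioned, which is why replica placement at the shard level still matters.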


The Application Stack

Elastic Stack (ECK 3.3.1 / Elastic 9.x)

The Elastic Stack is a centerpiece of the environment. Deployed via ECK (Elastic Cloud on Kubernetes) operator, the stack includes Elasticsearch, Kibana, Fleet Server, and an Elastic Agent DaemonSet running on all nodes — including masters, via tolerations.
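
Running the agent on masters comes down to one toleration on the DaemonSet pod template. A minimal sketch, assuming the control plane nodes carry the standard kubeadm-style node-role.kubernetes.io/control-plane:NoSchedule taint (the post does not state the taint in use):

```python
import copy

# Assumed taint: the kubeadm default for control plane nodes.
CONTROL_PLANE_TOLERATION = {
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Exists",
    "effect": "NoSchedule",
}


def tolerate_control_plane(pod_spec: dict) -> dict:
    """Return a copy of pod_spec that tolerates the control plane taint."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's dict untouched
    tolerations = spec.setdefault("tolerations", [])
    if CONTROL_PLANE_TOLERATION not in tolerations:
        tolerations.append(dict(CONTROL_PLANE_TOLERATION))
    return spec


# In an ECK Agent resource this pod spec would sit under
# spec.daemonSet.podTemplate.spec; the container entry is a placeholder.
agent_pod_spec = tolerate_control_plane({"containers": [{"name": "agent"}]})
print(agent_pod_spec["tolerations"])
```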

The upgrade from Elastic 8.x to 9.3.2 was a significant undertaking. Key gotchas we documented and worked through:

  • logsdb mode auto-enabling on logs-* data streams in 9.x
  • HTTP/2 as the default transport with TLS (relevant to ingress configuration)
  • Kibana’s new requirement for xpack.encryptedSavedObjects.encryptionKey
  • Legacy index template conflicts requiring cleanup before upgrade
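
For the Kibana encryption key item, the value has to be at least 32 characters long. A small sketch of how such a key could be generated and placed into a Kibana settings mapping; how the key was actually generated and stored for this cluster is not described in the post.

```python
import secrets


def new_encryption_key(num_bytes: int = 32) -> str:
    """Generate a random hex key; 32 bytes yields 64 hex characters."""
    key = secrets.token_hex(num_bytes)
    if len(key) < 32:
        raise ValueError("Kibana requires a key of at least 32 characters")
    return key


def kibana_settings(key: str) -> dict:
    """Settings fragment suitable for a Kibana config block."""
    return {"xpack.encryptedSavedObjects.encryptionKey": key}


settings = kibana_settings(new_encryption_key())
print(sorted(settings))
```

The key should be kept stable across restarts and upgrades, since saved objects encrypted with one key cannot be decrypted with another.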

AWX (Upstream Project for Ansible Automation Platform)

AWX is deployed via the awx-operator, backed by an external PostgreSQL instance running on a dedicated database.blackburn.lan VM (192.168.1.187). Keeping the database external rather than in-cluster is a deliberate architectural choice — it simplifies backup, recovery, and cluster lifecycle management.

GitLab and GitOps CI/CD

A self-hosted GitLab instance runs at gitlab.blackburn.lan (192.168.1.156), and it’s the engine behind the GitOps workflow. CI/CD pipelines use bitnami/kubectl:latest with a manually provisioned kubeconfig built from KUBE_CA_CERT, KUBE_URL, and KUBE_TOKEN variables — intentionally explicit and auditable rather than relying on opaque operator magic.
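
The kubeconfig assembly can be sketched in Python as follows. This assumes KUBE_CA_CERT holds the PEM text of the cluster CA (not an already base64-encoded value); the cluster, user, and context names are hypothetical. The pipeline itself may well do this with kubectl config commands instead.

```python
import base64
import json
import os


def build_kubeconfig(url: str, ca_pem: str, token: str,
                     name: str = "homelab") -> dict:
    """Assemble a single-cluster, token-based kubeconfig."""
    # Assumption: ca_pem is raw PEM, so it is base64-encoded here.
    ca_b64 = base64.b64encode(ca_pem.encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": name,
            "cluster": {"server": url, "certificate-authority-data": ca_b64},
        }],
        "users": [{"name": f"{name}-ci", "user": {"token": token}}],
        "contexts": [{
            "name": name,
            "context": {"cluster": name, "user": f"{name}-ci"},
        }],
        "current-context": name,
    }


def kubeconfig_from_env() -> str:
    """Render the kubeconfig from the three CI/CD variables as JSON."""
    config = build_kubeconfig(
        os.environ["KUBE_URL"],
        os.environ["KUBE_CA_CERT"],
        os.environ["KUBE_TOKEN"],
    )
    return json.dumps(config, indent=2)
```

Writing the output of kubeconfig_from_env() to a file and pointing the KUBECONFIG variable at it is enough for kubectl, because JSON is valid YAML.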

Pipelines handle everything from building container images with Kaniko (no Docker daemon required in the runner) to deploying Kubernetes manifests with kubectl apply. Version variables are injected via sed substitution at pipeline time, keeping manifests clean and the CI config as the source of truth.
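
The substitution step is simple enough to show in a few lines. This Python sketch mirrors what a sed pass does, assuming placeholders of the form __NAME__; the actual placeholder syntax and variable names used in the manifests are not given in the post, so both are hypothetical here. Unlike a plain sed expression, it fails loudly when a placeholder has no value.

```python
import re

# Hypothetical placeholder syntax: __UPPER_SNAKE_CASE__
PLACEHOLDER = re.compile(r"__([A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*)__")


def render(manifest: str, variables: dict) -> str:
    """Replace every __NAME__ token with its value from variables."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value provided for placeholder {name}")
        return variables[name]

    return PLACEHOLDER.sub(substitute, manifest)


template = "image: registry.example.com/app:__APP_VERSION__"
print(render(template, {"APP_VERSION": "1.4.2"}))
```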

Logstash pipeline configurations are managed as Git-tracked files in a conf.d/ directory with numerically-prefixed files (10-beats-input.conf, 50-filters.conf, 90-elastic-output.conf), dynamically assembled into a ConfigMap by the CI pipeline at deploy time.
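
The assembly step could look roughly like this in Python. The ConfigMap name and namespace are hypothetical, and the real pipeline may use kubectl create configmap instead; the point is that each file becomes one key, and the numeric prefixes keep inputs, filters, and outputs in order when Logstash reads the directory.

```python
import json
from pathlib import Path


def read_conf_dir(path: str) -> dict:
    """Load every *.conf file in a directory as {filename: contents}."""
    return {p.name: p.read_text() for p in Path(path).glob("*.conf")}


def build_configmap(files: dict, name: str = "logstash-pipeline",
                    namespace: str = "elastic") -> dict:
    """Build a ConfigMap whose keys are the pipeline files, sorted by name."""
    if not files:
        raise ValueError("no pipeline files found")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # Sorting by filename preserves the 10-, 50-, 90- ordering.
        "data": dict(sorted(files.items())),
    }


# Placeholder contents; a pipeline would call read_conf_dir("conf.d") instead.
example = build_configmap({
    "90-elastic-output.conf": "output { }",
    "10-beats-input.conf": "input { }",
    "50-filters.conf": "filter { }",
})
print(json.dumps(example, indent=2))
```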

Cribl Stream

Cribl Stream 4.17.0 is deployed in Kubernetes as a critical piece of the observability pipeline. It sits between log sources and Elasticsearch, providing fan-out routing, parsing, and data enrichment capabilities. Getting Cribl running in Kubernetes required working through some non-obvious configuration details:

  • The correct environment variable for the config volume is CRIBL_VOLUME_DIR, not what the docs imply
  • The mount path must be /opt/cribl/config
  • The deployment strategy must be Recreate (not RollingUpdate) when using an RWO PVC, since two pods cannot mount the same RWO volume simultaneously

Claude was invaluable in diagnosing the PVC mount contention and identifying the deployment strategy as the fix — a classic “obvious in hindsight” problem.
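
Put together, the three settings above translate into a Deployment along these lines. This is a sketch under assumptions: the image reference, object names, and PVC name are hypothetical, and CRIBL_VOLUME_DIR is set to the mount path on the assumption that the two are meant to match.

```python
import json

CONFIG_PATH = "/opt/cribl/config"


def cribl_deployment(image: str, claim_name: str,
                     name: str = "cribl", namespace: str = "cribl") -> dict:
    """Single-replica Deployment that mounts an RWO PVC for Cribl config."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            # Recreate stops the old pod before starting the new one, so the
            # RWO volume is never claimed by two pods at once.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Assumption: the variable points at the mount path.
                        "env": [{"name": "CRIBL_VOLUME_DIR", "value": CONFIG_PATH}],
                        "volumeMounts": [{"name": "config", "mountPath": CONFIG_PATH}],
                    }],
                    "volumes": [{
                        "name": "config",
                        "persistentVolumeClaim": {"claimName": claim_name},
                    }],
                },
            },
        },
    }


# Image reference and claim name are assumed for illustration.
deployment = cribl_deployment("cribl/cribl:4.17.0", "cribl-config")
print(json.dumps(deployment, indent=2))
```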

Uptime Kuma

Monitoring is handled by Uptime Kuma, with custom monitors including:

  • Elasticsearch cluster health via JSONPath ($.status = green)
  • Kibana availability via $.status.overall.level = available
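
Those two checks amount to pulling one field out of a JSON response and comparing it to an expected value. The Python sketch below reproduces that logic for simple dotted paths only (it is not a full JSONPath implementation), assuming the monitors poll Elasticsearch's _cluster/health endpoint and Kibana's /api/status endpoint; the sample bodies are trimmed-down illustrations, not captured output.

```python
import json


def resolve(document, path: str):
    """Follow a simple '$.a.b.c' path through nested dicts."""
    if not path.startswith("$."):
        raise ValueError("only paths of the form $.key.subkey are supported")
    node = document
    for key in path[2:].split("."):
        node = node[key]
    return node


def is_healthy(body: str, path: str, expected: str) -> bool:
    """True when the JSON body has the expected value at the given path."""
    try:
        return resolve(json.loads(body), path) == expected
    except (KeyError, TypeError, json.JSONDecodeError):
        # Missing field or non-JSON response counts as unhealthy.
        return False


es_body = '{"cluster_name": "elastic", "status": "green"}'
kibana_body = '{"status": {"overall": {"level": "available"}}}'
print(is_healthy(es_body, "$.status", "green"))
print(is_healthy(kibana_body, "$.status.overall.level", "available"))
```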

Alerts are delivered via Resend (SMTP through smtp.resend.com) using the burnedworm.com domain.

Plex

Even Plex is part of the stack — running on TrueNAS directly, with the media library organized using MusicBrainz Picard for acoustic fingerprinting and beets for CLI-based pipeline automation.


What Made This Work: The AI Collaboration

Building infrastructure at this level of complexity means constantly context-switching between Kubernetes internals, Linux networking, and application-specific behavior. The value Claude provided wasn’t just answering individual questions — it was maintaining context across long, complex troubleshooting threads.

Some specific examples where Claude made a real difference:

cert-manager DNS-01 with IDM: Diagnosing why wildcard cert issuance was failing in a split-horizon DNS environment and identifying --dns01-recursive-nameservers-only as the fix required understanding three different systems simultaneously.

Cribl PVC mount contention: Understanding why Cribl’s pod was failing to start, tracing it to RWO PVC semantics and the RollingUpdate strategy, and pivoting to Recreate — all while also identifying CRIBL_VOLUME_DIR as the correct config volume environment variable.

Elastic 8→9 upgrade: Coordinating an upgrade across ECK operator, Elasticsearch, Kibana, Fleet, and Elastic Agent — with multiple breaking changes — required careful sequencing and a clear picture of what changed between versions.


The Architecture in Summary

Layer            Technology
OS               RHEL 9
Kubernetes       Self-managed, CRI-O 1.32 + runc
CNI              Cilium
Ingress          Traefik v3
Load Balancing   MetalLB
TLS              cert-manager + Cloudflare DNS-01
Storage          NFS (TrueNAS) + OpenEBS hostpath
DNS              FreeIPA/IDM + Cloudflare (split-horizon)
Observability    Elastic Stack 9.x (ECK), Logstash, Cribl Stream
Automation       AWX (Ansible)
CI/CD            GitLab + GitLab Runner + Kaniko
Monitoring       Uptime Kuma + Resend

Final Thoughts

This isn’t a homelab in the casual sense — it’s a production-equivalent environment that happens to run on hardware I own. Every architectural decision mirrors what you’d find in an enterprise environment: HA control plane, external databases, GitOps deployments, split-horizon DNS, automated TLS, and a real observability pipeline.

The collaboration with Claude throughout this build was genuinely useful. Not because it replaced the need to understand the systems — it didn’t — but because it compressed the time between “I see a problem” and “I understand the problem well enough to fix it.” For complex infrastructure work, that’s exactly what you want from an AI assistant.

If you’re building something similar and want to talk through the architecture, find me online. And if you’re wondering whether AI assistance is actually useful for serious infrastructure work — based on this experience, the answer is yes.


Jeremy Blackburn is a systems and infrastructure engineer and owner of BurnedWorm Productions LLC. This post reflects his personal homelab build.