CI workflows
The repo has three workflows in
.github/workflows/.
Two of them run on every relevant PR and gate the merge; the third
publishes this site on pushes to main that touch docs/**. None of them apply
infrastructure — tofu apply and argocd app sync remain operator
actions on a workstation with ADC. The point of CI in this repo is to
catch broken manifests and broken Terraform before a human runs the
apply.
| Workflow | Trigger | Runner | What it gates |
|---|---|---|---|
| `apps-lint` | PR touching `kubernetes/apps/**`, the workflow file, or `scripts/lint-apps.sh` | GitHub-hosted | Each affected addon renders cleanly and validates against Kubernetes + CRD schemas. |
| `terraform-plan` | PR touching `terraform/**` or the workflow file | GitHub-hosted | `tofu fmt`, `tofu validate`, and `tofu plan` succeed on every changed root; the plan is posted as a sticky PR comment. |
| `deploy-docs` | Push to main touching `docs/**` or the workflow file (plus manual `workflow_dispatch`) | GitHub-hosted | This site is rebuilt with Astro and published to GitHub Pages. |
Self-hosted ARC runners exist in the
cluster but no repo workflow currently targets them. The runners
are provisioned for personal projects under
RaptGroup and the operator’s
brazostech org; this homelab repo
deliberately stays on GitHub-hosted runners so a broken cluster cannot
break its own CI. See ARC integration below for
the contract that lets other repos opt in.
apps-lint
`.github/workflows/apps-lint.yml`
catches manifest-shape bugs before ArgoCD ever sees them. Two jobs:
- `detect` computes the set of affected addons from `git diff`, filtering paths under `kubernetes/apps/<name>/` (top-level files in `kubernetes/apps/`, i.e. the README, do not trigger the lint). If the workflow itself or `scripts/lint-apps.sh` changed, every addon is re-linted to prove the pipeline still works.
- `lint` fans out over the detected addons via a matrix and runs `scripts/lint-apps.sh <app>` per cell with `fail-fast: false`, so one broken addon doesn’t mask another.
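The path filtering that `detect` performs can be sketched in shell as follows. This is an illustration under stated assumptions, not the workflow's actual code: the function name `affected_addons`, reading changed paths from stdin, and the `__all__` sentinel are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads changed paths on stdin, one per line
# (for example from `git diff --name-only`), and prints what to lint.
affected_addons() {
  local changed
  changed="$(cat)"

  # A change to the pipeline itself means every addon is re-linted.
  # "__all__" is an illustrative sentinel, not something the workflow emits.
  if printf '%s\n' "$changed" |
     grep -qxE '\.github/workflows/apps-lint\.yml|scripts/lint-apps\.sh'; then
    echo "__all__"
    return 0
  fi

  # Keep only paths at least one directory below kubernetes/apps/, so
  # top-level files such as kubernetes/apps/README.md are ignored.
  printf '%s\n' "$changed" |
    sed -n 's#^kubernetes/apps/\([^/][^/]*\)/.*#\1#p' |
    sort -u
}
```

Each printed name would become one matrix cell for the `lint` job; an empty result means there is nothing to lint.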
What scripts/lint-apps.sh does
For each affected `kubernetes/apps/<name>/`:
- Read the chart coordinates from `application.yaml` (supporting both `spec.source` and `spec.sources[]` multi-source Applications).
- Run `helm template <name> <chart> --repo <url> --version <v> -f helm-values.yaml --include-crds` to render the manifests the cluster would see. OCI repos are folded into an `oci://<host>/<chart>` URL since `helm --repo` doesn’t accept them.
- Run `kubeconform -strict -summary` against the rendered output, using the datreeio CRD catalog for the CRDs the cluster ships (Cilium, cert-manager, Gateway API, External Secrets). A small `SKIP_KINDS` list covers resources whose schemas aren’t in the catalog (`Application`, `AppProject`, `AutoscalingRunnerSet`) or aren’t validated on principle (`CustomResourceDefinition`).
- If the addon ships raw manifests under `manifests/`, those are `kubeconform`-validated too.
- A `yq` pass scans rendered Gateways and LoadBalancer Services for the `lbipam.cilium.io/ips` annotation and fails if any resource lacks it. Without the pin, Cilium’s LB IPAM hands out the first free address from the pool and a new addon can silently steal an IP another addon depends on. The check is presence-only; pool membership and value format are Cilium’s concern.
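The OCI folding mentioned in the steps above could look roughly like this. It is a sketch, not the script's real implementation: the function name `helm_chart_args` is hypothetical, and it assumes the repository URL carries an explicit `oci://` scheme.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the chart reference and flags for `helm template`.
# Assumes OCI registries are recognisable by an oci:// prefix on the repo URL.
helm_chart_args() {
  local repo="$1" chart="$2" version="$3"
  case "$repo" in
    oci://*)
      # OCI registries take a single chart URL and no --repo flag.
      printf '%s --version %s\n' "${repo%/}/${chart}" "$version"
      ;;
    *)
      # Classic HTTP(S) chart repositories use chart name plus --repo.
      printf '%s --repo %s --version %s\n' "$chart" "$repo" "$version"
      ;;
  esac
}
```

The output is meant to be spliced after `helm template <name>`; the registry and chart names passed in are whatever `application.yaml` declares.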
Running it locally is identical to running it in CI:

    scripts/lint-apps.sh                       # all addons
    scripts/lint-apps.sh kubernetes/apps/foo   # one addon

It requires `helm`, `kubeconform`, and `yq` (mikefarah). No cluster
access is needed: `helm template` is offline once the chart is
cached.
terraform-plan
`.github/workflows/terraform-plan.yml`
runs on PRs that touch terraform/**. The structure mirrors
apps-lint: a detect job computes the affected roots (gcp,
bootstrap) and a matrix runs per root.
For each affected root the job runs tofu fmt -check, tofu init -lockfile=readonly, tofu validate, and tofu plan -detailed-exitcode.
The plan output is posted as a sticky PR
comment,
one per root, capped at ~60k chars to stay under GitHub’s PR comment
limit. Exit code 2 (changes detected) is the normal state on a
TF-touching PR and does not fail the job — the reviewer reads the
plan.
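The exit-code handling and the comment cap described above can be sketched as two small shell helpers. Both names (`plan_status`, `truncate_comment`) are hypothetical, and the cap here counts bytes rather than characters, which is an approximation of the ~60k limit.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a plan command and interpret -detailed-exitcode.
#   0 = no changes, 2 = changes present (both acceptable), other = failure.
plan_status() {
  local rc=0
  "$@" || rc=$?
  case "$rc" in
    0|2) return 0 ;;
    *)   return "$rc" ;;
  esac
}

# Hypothetical helper: cap a comment body read from stdin at a byte budget.
# head -c counts bytes, so a multi-byte character at the boundary may be cut.
truncate_comment() {
  local limit="${1:-60000}"
  head -c "$limit"
}
```

A step might call `plan_status tofu plan -detailed-exitcode` and pipe the captured plan text through `truncate_comment` before posting it.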
The bootstrap root needs special handling: its k8s/helm/kubectl
providers point at a local kubeconfig that doesn’t exist in CI, and
the cluster is LAN-only anyway. The workflow writes a stub kubeconfig
so provider validation passes and skips -refresh on plan; the
config-level diff is still useful for review.
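As a rough idea of what writing a stub kubeconfig involves, the helper below emits a minimal, syntactically valid file. Everything in it is an assumption for illustration: the function name, the `stub` entries, and the loopback server address are not taken from the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a placeholder kubeconfig to the given path.
# The server address and the "stub" names are illustrative only.
write_stub_kubeconfig() {
  local path="$1"
  mkdir -p "$(dirname "$path")"
  cat > "$path" <<'EOF'
apiVersion: v1
kind: Config
clusters:
  - name: stub
    cluster:
      server: https://127.0.0.1:6443
contexts:
  - name: stub
    context:
      cluster: stub
      user: stub
current-context: stub
users:
  - name: stub
    user: {}
EOF
}
```

The file only needs to parse; nothing in CI connects to the address it names. It would be written to whichever path the providers are configured to read.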
Authentication is OIDC-based — the workflow exchanges a GitHub-issued
token for a GCP service-account impersonation via Workload Identity
Federation. The pool, provider, and plan-only tf-ci-plan service
account are documented in
Cloud / Workload Identity Federation;
the trust is locked to RaptGroup/homelab on the provider’s
attribute_condition and the SA is roles/viewer only, structurally
incapable of applying.
CI-side wire-up: the repository variables GCP_WIF_PROVIDER /
GCP_CI_SA and the secret GCP_BILLING_ACCOUNT are set manually
after the first terraform/gcp/ apply. The workflow’s Verify CI configuration step fails loud with the exact tofu output commands
to run if any are missing.
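A fail-loud configuration check of this kind might be written as below. This is a sketch only: the function name `verify_ci_config` and the message wording are hypothetical, and it assumes the variables and the secret have been exported into the step's environment under the same names.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if every named variable is set and non-empty.
# Reports all missing names before failing, rather than stopping at the first.
verify_ci_config() {
  local missing=0 name value
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted input).
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing CI configuration: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Called as `verify_ci_config GCP_WIF_PROVIDER GCP_CI_SA GCP_BILLING_ACCOUNT`, it returns non-zero when any value is absent, which fails the step.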
deploy-docs
`.github/workflows/deploy-docs.yml`
publishes this site. It runs on every push to main that changes
`docs/**` (and on manual `workflow_dispatch`):
- `withastro/action@v6` builds `docs/` with Astro and uploads the artifact.
- `actions/deploy-pages@v5` deploys it to the `github-pages` environment.
Pages concurrency is grouped to pages with
cancel-in-progress: false so two rapid pushes don’t trample each
other’s deploys — the second waits for the first to finish. The
deployed URL is whatever GitHub Pages reports back via
steps.deployment.outputs.page_url (currently
raptgroup.github.io/homelab).
There is no preview deploy on PRs — the site is rebuilt on merge and docs PRs are reviewed against the rendered markdown in the diff.
ARC integration
The cluster’s ARC runners sit idle until
a workflow with a matching runs-on: lands in one of the orgs the
GitHub App is installed on. Two scale sets, two orgs:
| Scale set | Target org | Canonical runs-on: |
|---|---|---|
| `arc-runners-raptgroup` | RaptGroup | `[self-hosted, raptgroup]` |
| `arc-runners-brazostech` | brazostech | `[self-hosted, brazostech]` |
The full label set per pool and the alternative runs-on: forms are
on the ARC runners page.
Where the credentials come from
A workflow scheduled to a pool runs in an ephemeral pod whose lifetime is bounded by one job. The credential chain that lets the pool’s listener register that pod with GitHub is:

    Google Secret Manager (terraform/gcp/)
      │  arc-app-id, arc-app-private-key,
      │  arc-installation-id-{raptgroup,brazostech}
      ▼
    External Secrets Operator (ClusterSecretStore/gsm)
      │  ExternalSecret per pool → K8s Secret in the pool's namespace
      ▼
    gha-runner-scale-set chart (githubConfigSecret)
      │
      ▼
    listener pod → GitHub API → ephemeral runner pods

App ID and private key are shared across both pools; the installation ID is per-org and is what scopes a pool to one set of repos. The Cloud / Secret Manager table lists each container by ID; the External Secrets Operator page covers the GSM↔ESO bridge in general; the ARC runners page covers the App-and-installations model and why rotating either piece is a GSM update plus a listener restart.
The access boundary is server-side: the App is installed on
RaptGroup and brazostech, not on Scale Computing (the
operator’s employer). Even from inside a brazostech runner pod, an
installation token cannot reach a Scale Computing repo. There is no
client-side allow/deny list to maintain.
How another repo opts in
Three steps, from the perspective of a repo in RaptGroup or
brazostech:
1. Confirm the homelab GitHub App is installed on the target org (operator action, one-time).
2. Add the desired `runs-on:` to the workflow, e.g. `runs-on: [self-hosted, raptgroup]`.
3. Open the PR. On dispatch, the pool’s listener scales 0 → 1, the ephemeral pod runs the job, and the pod is torn down within a minute or two of completion.
maxRunners: 4 per pool is a deliberate ceiling. Workloads bursty
enough to need more concurrency are better served by paying for
GitHub-hosted runners than by sizing the homelab around CI peaks.
Adding a new workflow
The mental model for “what gates a PR” in this repo:
- `kubernetes/apps/<name>/` changes → `apps-lint` re-renders and re-validates that addon. New CRDs that aren’t in the datreeio catalog need a `SKIP_KINDS` entry in `scripts/lint-apps.sh`; new LB-bearing resources need the `lbipam.cilium.io/ips` annotation or the lint fails.
- `terraform/**` changes → `terraform-plan` posts a plan comment per affected root. A new root means a new entry in the `candidates` list in the workflow’s `detect` job; the matrix and PR comment shape flow from there automatically.
- `docs/**` changes → no PR gate; the site rebuilds on merge. Reviewers see the rendered markdown in the diff.
- Talos config, smoke scripts, READMEs outside `kubernetes/apps/` → no automated check. Reviewers verify by hand; changes that affect the cluster are applied by the operator.
Conventions to keep new workflows consistent with the existing three:
- Use `pull_request` with a tight `paths:` filter so unrelated PRs don’t trigger the workflow. The `detect` jobs above show the “compute affected subset from `git diff`, fan out via matrix” pattern when a workflow needs to lint multiple independent things.
- Set a `concurrency.group` keyed on the head ref with `cancel-in-progress: true` so a force-push supersedes the in-flight run. The `deploy-docs` workflow is the exception: Pages deploys are serialised, not cancelled.
- Default to GitHub-hosted runners. Targeting `self-hosted` on a repo in this homelab would create a circular dependency: a broken cluster would break its own CI. The ARC pools exist for other repos in `RaptGroup` and `brazostech`.
- If the workflow needs to authenticate to GCP, reuse the WIF pool: add a new plan-only service account with the narrowest roles the workflow needs and a `workloadIdentityUser` binding scoped to this repo. Do not mint long-lived JSON keys.
Related:
- `scripts/lint-apps.sh`: the script behind `apps-lint`.
- Cloud / Workload Identity Federation: the WIF pool, provider trust scope, and plan-only `tf-ci-plan` SA the `terraform-plan` workflow authenticates against.
- `terraform/gcp/README.md`: operator run book for the WIF pool, including how to extract the values for `GCP_WIF_PROVIDER` and `GCP_CI_SA`.
- ARC runners: the in-cluster pools themselves.