Observability for developers
You are landing an app in a lab.* namespace on rockingham. This
page is the one-page contract for making it observable: how to get
metrics, logs, alerts, traces, and dashboards, with no cluster-side
configuration and no ticket to the operator.
The stack behind this — Prometheus, Loki, Grafana, Alloy — is described on the Observability stack page and decided in ADR-0007. You do not need to read either to use what is below.
Everything here assumes mutual trust: every Grafana user sees every namespace’s metrics and logs. There is no per-tenant isolation — the operator and every developer on this cluster are investigating its health together.
Metrics: the ServiceMonitor contract
Section titled “Metrics: the ServiceMonitor contract”Expose your metrics in Prometheus format and Prometheus will discover them. Two steps:
- Expose
/metricson a port of yourService, in Prometheus exposition format. Most frameworks have a one-line library for this. - Drop a
ServiceMonitornext to yourService, in your own namespace. That is it — no namespace label, no scrape-config edit, no operator involvement.
Prometheus runs wide-open discovery: it scrapes any
ServiceMonitor in any namespace. A concrete example for an app
whose Service carries app.kubernetes.io/name: my-app and exposes
a port named metrics:
apiVersion: monitoring.coreos.com/v1kind: ServiceMonitormetadata: name: my-app namespace: my-app labels: app.kubernetes.io/name: my-appspec: # Selects the Service(s) to scrape — must match your Service's labels. selector: matchLabels: app.kubernetes.io/name: my-app # Each entry is a port on the selected Service to scrape. endpoints: - port: metrics # the *name* of the Service port, not the number path: /metrics interval: 30skubectl apply -f it. Within a scrape interval the target shows up in
Prometheus’s Status → Target health page, and the series are
queryable in Grafana. If the target never appears, the usual cause is
the selector not matching your Service’s labels, or port naming
a port the Service does not actually declare.
Use a PodMonitor instead if you are scraping pods directly with no
Service in front — same shape, spec.selector matches pod labels.
Logs: zero action
Section titled “Logs: zero action”Your pod’s stdout and stderr are collected automatically. There is nothing to configure — no sidecar, no annotation, no opt-in.
The Alloy DaemonSet tails every pod’s
log files on every node and ships them to Loki, labelled by
namespace, pod, and container. Write structured logs to stdout
and query them in Grafana’s Explore view with
LogQL:
{namespace="my-app"} # everything from your namespace{namespace="my-app", container="server"} # one container{namespace="my-app"} |= "error" # lines containing "error"{namespace="my-app"} | json | level="warn" # parse JSON, filter on a fieldLogs are retained for 14 days. The one thing worth doing on your
side: log as JSON to stdout, so | json in LogQL gives you
queryable fields instead of substring matching.
Alerting: PrometheusRule in your namespace
Section titled “Alerting: PrometheusRule in your namespace”Express your SLOs as code, alongside your app. A PrometheusRule in
your namespace is evaluated by Prometheus automatically — the same
wide-open discovery that picks up ServiceMonitors picks up rules.
apiVersion: monitoring.coreos.com/v1kind: PrometheusRulemetadata: name: my-app-slo namespace: my-app labels: app.kubernetes.io/name: my-appspec: groups: - name: my-app.rules rules: - alert: MyAppHighErrorRate # 5xx rate over total request rate, above 5% for 10 minutes. expr: | sum(rate(http_requests_total{namespace="my-app",code=~"5.."}[5m])) / sum(rate(http_requests_total{namespace="my-app"}[5m])) > 0.05 for: 10m labels: severity: warning annotations: summary: "my-app 5xx error rate above 5% for 10m" description: "Error ratio is {{ $value | humanizePercentage }}."kubectl apply -f it; the rule appears in Prometheus’s Alerts
page. A firing alert is delivered to Alertmanager. How Alertmanager
forwards it from there — to a Discord channel, a heartbeat check, or
nowhere — is operator-managed cluster config, not something you set
in your namespace. Set the severity label (warning / critical)
so the operator’s routing can act on it.
Tracing: OTLP endpoint (Phase 3)
Section titled “Tracing: OTLP endpoint (Phase 3)”Distributed tracing is Phase 3 — there is no trace store on the cluster yet. But the ingest endpoint is already live, so you can instrument now and not change anything later.
Alloy exposes an OTLP receiver, reachable cluster-wide:
| Protocol | Endpoint |
|---|---|
| OTLP/gRPC | alloy.observability.svc:4317 |
| OTLP/HTTP | alloy.observability.svc:4318 |
Point your OpenTelemetry SDK exporter at one of these. Be aware: spans sent today are accepted and dropped — there is no consumer until Phase 3 lands Tempo as the trace backend. The value of using the endpoint now is that it does not move: when Tempo arrives, your traces start being stored without a single change to your app’s exporter config.
Do not deploy your own collector for traces — this endpoint is the cluster’s shared one.
Dashboards: Scratch → code
Section titled “Dashboards: Scratch → code”Grafana shows two folders you can use:
- Scratch — writable. Throw together a dashboard for an investigation here. It is ephemeral and not in Git — treat anything left in Scratch as disposable.
- Homelab — repo-owned, read-only in the UI. Dashboards here
come from
kubernetes/apps/kube-prometheus-stack/dashboards/and the JSON in Git is the source of truth.
When a Scratch dashboard is worth keeping, promote it to code:
- In Grafana, open the dashboard → Settings → JSON Model; copy the JSON.
- Save it as
kubernetes/apps/kube-prometheus-stack/dashboards/<name>.json. Set"id": nulland give it a unique"uid"(e.g.homelab-<name>). - Add a
configMapGeneratorentry for it in that directory’skustomization.yaml. - Open a PR. On merge, Argo provisions it into the read-only Homelab folder — then delete the Scratch copy.
A dashboard only becomes durable by landing in dashboards/ via PR.
Where to look
Section titled “Where to look”One Grafana — https://grafana.lab.jackhall.dev — is the front door
for everything above:
- Dashboards — your
ServiceMonitormetrics, plus the bundled node and cluster-health boards. - Explore → Prometheus — ad-hoc PromQL against your metrics.
- Explore → Loki — LogQL against your logs.
- Alerts — the state of your
PrometheusRules.
You reach it on the LAN once your device uses AdGuard Home for DNS (Split-horizon DNS). Anonymous access is read-only Viewer — enough to browse dashboards and run Explore queries; editing needs the admin login.