In Part 1, I covered the LGTM stack - the backend that stores and visualises telemetry. But a backend without data is just an empty dashboard. In this post, I'll cover how metrics actually get collected from services and the cluster itself.
## The Collection Layer: Grafana k8s-monitoring
Instead of deploying Prometheus Operator, Node Exporter, and kube-state-metrics separately, I use Grafana's k8s-monitoring Helm chart. It bundles everything into one cohesive package based on Grafana Alloy (the successor to Grafana Agent).
```yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-monitoring
spec:
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: k8s-monitoring
    targetRevision: 3.4.0
    helm:
      values: |
        cluster:
          name: homelab-cluster
        destinations:
          - name: local-otlp
            type: otlp
            url: http://lgtm-simple.monitoring.svc.cluster.local:4318
            protocol: http
        clusterMetrics:
          enabled: true
        clusterEvents:
          enabled: true
        podLogs:
          enabled: true
        applicationObservability:
          enabled: true
```

This deploys several Alloy components:
| Component | Type | Purpose |
|---|---|---|
| alloy-metrics | Deployment | Scrapes Prometheus metrics |
| alloy-logs | DaemonSet | Collects pod logs |
| alloy-singleton | Deployment | Cluster-wide operations |
| alloy-receiver | Deployment | OTLP receiver for apps |
Everything flows to the LGTM stack via OTLP HTTP on port 4318. No Prometheus remote write, no separate Loki push - just OTLP everywhere.
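Once the chart syncs, a quick sanity check is to query Grafana for series the cluster collectors should always produce. These examples assume the standard cAdvisor and kube-state-metrics series that `clusterMetrics` enables:

```promql
# Per-namespace CPU usage, from cAdvisor metrics
sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))

# Pods stuck outside Running/Succeeded, from kube-state-metrics
sum by (namespace) (kube_pod_status_phase{phase=~"Pending|Failed"})
```

If both return data, metrics are making it through the OTLP pipeline end to end.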
## Kafka Metrics: The Dual Exporter Approach
Kafka is notorious for metric complexity. I use two exporters to get comprehensive coverage:
### JMX Exporter (Broker Internals)
Kafka exposes internal metrics via JMX. The JMX Exporter translates these to Prometheus format:
```yaml
# Sidecar container in the Kafka deployment
- name: jmx-exporter
  image: bitnami/jmx-exporter:0.20.0
  args:
    - "9308"
    - /config/jmx-config.yaml
  ports:
    - containerPort: 9308
      name: jmx-metrics
```

The JMX config whitelists specific MBeans:
```yaml
# ConfigMap: kafka-jmx-config
whitelistObjectNames:
  - "kafka.server:type=BrokerTopicMetrics,*"
  - "kafka.server:type=ReplicaManager,*"
  - "kafka.controller:type=KafkaController,*"
  - "kafka.network:type=RequestMetrics,*"
  - "java.lang:type=Memory"
  - "java.lang:type=GarbageCollector,*"
```

This gives you broker-level metrics:
- Messages in/out per second
- Bytes in/out per topic
- Request latency percentiles
- JVM memory and GC stats
- Under-replicated partitions
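Two queries I lean on from this set, sketched here with typical metric names — the exact names depend on the rename rules in your jmx-config.yaml, so verify against the exporter's /metrics output before copying:

```promql
# Broker throughput: messages in per second, per broker pod
sum by (pod) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

# Any under-replicated partitions is worth knowing about immediately
kafka_server_replicamanager_underreplicatedpartitions > 0
```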
### Kafka Exporter (Topic & Consumer Groups)
The JMX exporter doesn't know about consumer lag or topic-level details. That's where kafka-exporter comes in:
```yaml
- name: kafka-exporter
  image: danielqsj/kafka-exporter:v1.7.0
  args:
    - --kafka.server=kafka:9092
  ports:
    - containerPort: 9309
      name: kafka-metrics
```

This provides:
- Consumer group lag
- Topic partition offsets
- Per-topic message rates
- Consumer group state
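Consumer lag is the headline metric here. Using kafka-exporter's metric names (worth confirming against your /metrics output, as a sketch):

```promql
# Total lag per consumer group and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Lag growth over 10 minutes -- often a better alert signal than absolute lag
sum by (consumergroup) (delta(kafka_consumergroup_lag[10m])) > 0
```

A large but stable lag may be fine; lag that keeps growing means consumers can't keep up.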
Both exporters get ServiceMonitors for automatic discovery:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-jmx
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: jmx-metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: kafka-metrics
      interval: 30s
```

## PostgreSQL Exporter
PostgreSQL metrics come from the postgres_exporter:
```yaml
- name: postgres-exporter
  image: prometheuscommunity/postgres-exporter:v0.15.0
  env:
    - name: DATA_SOURCE_NAME
      value: "postgresql://user:pass@postgres:5432/mydb?sslmode=disable"
  ports:
    - containerPort: 9187
      name: pg-metrics
```

Key metrics collected:
- Active connections vs max connections
- Transaction rates (commits, rollbacks)
- Cache hit ratio
- Table and index sizes
- Slow query counts
- Replication lag (if applicable)
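Two of these I query constantly. The metric names come from postgres_exporter's default `pg_stat_database` and `pg_stat_activity` collectors, so check your exporter's output if yours differ:

```promql
# Cache hit ratio per database -- should sit well above 0.99 on a healthy instance
sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
/
(sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
 + sum by (datname) (rate(pg_stat_database_blks_read[5m])))

# Connection usage as a fraction of max_connections
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections)
```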
The DATA_SOURCE_NAME needs database access, but it shouldn't be a superuser. I create a dedicated monitoring user with limited permissions:
```sql
CREATE USER exporter WITH PASSWORD 'secret';
GRANT pg_monitor TO exporter;
```

## Redis Metrics
Redis is simpler - the Bitnami Helm chart has built-in metrics:
```yaml
# In redis-multisource.yaml ArgoCD Application
helm:
  values: |
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 30s
```

That's it. The chart deploys a redis-exporter sidecar automatically. Metrics include:
- Commands processed per second
- Connected clients
- Memory usage and fragmentation
- Key eviction rates
- Keyspace hits/misses
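The keyspace hit ratio is the quickest cache-health signal. With redis_exporter's counter names:

```promql
# Fraction of reads served from the keyspace rather than missing
rate(redis_keyspace_hits_total[5m])
/ (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```

A falling ratio usually means keys are being evicted faster than they're re-populated, or the working set has outgrown memory.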
## MinIO Metrics
MinIO exposes Prometheus metrics natively. Just point it at the LGTM stack:
```yaml
# MinIO Helm values
env:
  - name: MINIO_PROMETHEUS_URL
    value: "http://lgtm-simple.monitoring.svc.cluster.local:9090"
  - name: MINIO_PROMETHEUS_AUTH_TYPE
    value: "public"
```

Metrics available at `/minio/v2/metrics/cluster`:
- Bucket object counts and sizes
- S3 API request rates and latencies
- Disk usage per drive
- Healing status
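For capacity tracking, something like the following works — metric names vary between MinIO releases, so treat this as a sketch and confirm against your `/minio/v2/metrics/cluster` output:

```promql
# Fraction of usable cluster capacity still free
minio_cluster_capacity_usable_free_bytes
/ minio_cluster_capacity_usable_total_bytes
```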
## Blackbox Exporter: External Monitoring
Internal metrics are great, but what about "is this service actually reachable?" That's the blackbox exporter's job.
```yaml
# Helm chart deployment
source:
  repoURL: https://prometheus-community.github.io/helm-charts
  chart: prometheus-blackbox-exporter
  targetRevision: 11.3.1
  helm:
    values: |
      config:
        modules:
          http_2xx:
            prober: http
            timeout: 5s
            http:
              valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
              valid_status_codes: [200]
              fail_if_not_ssl: true
```

Then I define Probe custom resources for each service:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: core-infrastructure-probe
  namespace: monitoring
spec:
  jobName: core-infrastructure
  interval: 60s
  module: http_2xx
  prober:
    url: blackbox-exporter:9115
  targets:
    staticConfig:
      static:
        - https://grafana.homelab.example.com/api/health
        - https://argocd.homelab.example.com/healthz
        - https://prometheus.homelab.example.com/-/healthy
```

I have probes for:
| Probe | Services Checked |
|---|---|
| core-infrastructure | Grafana, ArgoCD, Prometheus |
| storage-infrastructure | MinIO health endpoints |
| database-infrastructure | pgAdmin ping |
| messaging-infrastructure | Kafka UI |
| kubernetes-infrastructure | Headlamp, Hubble UI |
| registry-external | Docker registry /v2/ |
The blackbox exporter produces two key metrics:
- `probe_success`: 1 if the probe succeeded, 0 if it failed
- `probe_duration_seconds`: how long the check took
These feed directly into alerts (more on that in Part 4).
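As a preview, the alert itself is simple: a PrometheusRule along these lines (the rule name, labels, and thresholds here are my own placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: EndpointDown
          expr: probe_success == 0
          for: 5m   # several failed 60s probes, to ride out transient blips
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```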
## Istio Control Plane Metrics
Istiod exposes Prometheus metrics for mesh health:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 30s
      path: /metrics
```

Key metrics:
- `pilot_xds_pushes`: Config pushes to proxies
- `pilot_proxy_convergence_time`: Time for config to reach all proxies
- `pilot_conflict_outbound_listener_http_over_current_tcp`: Config conflicts
- Certificate expiry times (critical for mTLS)
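Convergence time is a histogram, so the useful view is a quantile — assuming the standard `_bucket` series istiod exports:

```promql
# p99 time for a pushed config to converge across the mesh
histogram_quantile(0.99,
  sum by (le) (rate(pilot_proxy_convergence_time_bucket[5m])))
```

If this climbs as the mesh grows, istiod is struggling to push config fast enough.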
## Annotation-Based Autodiscovery
For services that don't have dedicated exporters, k8s-monitoring supports annotation-based discovery:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

Any pod with these annotations gets scraped automatically. Useful for custom applications that expose metrics but don't need a full ServiceMonitor.
## Scrape Interval Considerations
I use 30-second intervals for most metrics:
```yaml
scrape:
  interval: 30s
```

For a homelab, this balances:
- Granularity: 30s is fine for most troubleshooting
- Storage: Lower intervals mean more data points
- Load: More frequent scrapes = more CPU on exporters
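A rough back-of-envelope for the storage side, assuming ~1.5 bytes per sample after TSDB compression (a commonly cited ballpark, and the series count is a hypothetical round number):

```
100,000 active series × (86,400 s / 30 s)  =  288M samples/day
288M samples × ~1.5 bytes                  ≈  430 MB/day, before replication
```

Halving the interval to 15s doubles that, which is why I only tighten it for metrics I actually alert on.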
The blackbox probes use 60-second intervals - external availability doesn't change that fast.
## Lessons Learned
1. Kafka needs both exporters. JMX alone misses consumer lag. kafka-exporter alone misses broker internals. Use both.
2. ServiceMonitors beat annotation-based discovery. Explicit configuration is easier to debug than magic annotations.
3. Blackbox probes are underrated. Internal metrics can look healthy while the service is unreachable externally. Probe from outside.
4. Watch the cardinality. Kafka topics with many partitions can explode metric cardinality. Use relabeling to drop high-cardinality labels you don't need.
## What's Next
Metrics tell you what's happening. But when a request fails, you need traces to understand why. In Part 3, I'll cover distributed tracing with Istio and Tempo - how traces flow through the mesh, sampling strategies, and correlating traces with logs.
This is Part 2 of a 4-part series on homelab observability.