Metrics Collection - From Kafka JMX to Kubernetes Events

Architecture diagram: metrics collection from Kafka JMX to Kubernetes events

In Part 1, I covered the LGTM stack - the backend that stores and visualises telemetry. But a backend without data is just an empty dashboard. In this post, I'll cover how metrics actually get collected from services and the cluster itself.

The Collection Layer: Grafana k8s-monitoring

Instead of deploying Prometheus Operator, Node Exporter, and kube-state-metrics separately, I use Grafana's k8s-monitoring Helm chart. It bundles everything into one cohesive package based on Grafana Alloy (the successor to Grafana Agent).

yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-monitoring
spec:
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: k8s-monitoring
    targetRevision: 3.4.0
    helm:
      values: |
        cluster:
          name: homelab-cluster

        destinations:
          - name: local-otlp
            type: otlp
            url: http://lgtm-simple.monitoring.svc.cluster.local:4318
            protocol: http

        clusterMetrics:
          enabled: true
        clusterEvents:
          enabled: true
        podLogs:
          enabled: true
        applicationObservability:
          enabled: true

This deploys several Alloy components:

| Component | Type | Purpose |
| --- | --- | --- |
| alloy-metrics | Deployment | Scrapes Prometheus metrics |
| alloy-logs | DaemonSet | Collects pod logs |
| alloy-singleton | Deployment | Cluster-wide operations |
| alloy-receiver | Deployment | OTLP receiver for apps |

Everything flows to the LGTM stack via OTLP HTTP on port 4318. No Prometheus remote write, no separate Loki push - just OTLP everywhere.

Metrics collection architecture diagram (observability-metrics-collection)

Kafka Metrics: The Dual Exporter Approach

Kafka is notorious for metric complexity. I use two exporters to get comprehensive coverage:

JMX Exporter (Broker Internals)

Kafka exposes internal metrics via JMX. The JMX Exporter translates these to Prometheus format:

yaml
# Sidecar container in Kafka deployment
- name: jmx-exporter
  image: bitnami/jmx-exporter:0.20.0
  args:
    - "9308"
    - /config/jmx-config.yaml
  ports:
    - containerPort: 9308
      name: jmx-metrics

The JMX config whitelists specific MBeans:

yaml
# ConfigMap: kafka-jmx-config
whitelistObjectNames:
  - "kafka.server:type=BrokerTopicMetrics,*"
  - "kafka.server:type=ReplicaManager,*"
  - "kafka.controller:type=KafkaController,*"
  - "kafka.network:type=RequestMetrics,*"
  - "java.lang:type=Memory"
  - "java.lang:type=GarbageCollector,*"

This gives you broker-level metrics:

  • Messages in/out per second
  • Bytes in/out per topic
  • Request latency percentiles
  • JVM memory and GC stats
  • Under-replicated partitions
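The exact Prometheus names depend on how your jmx_exporter rules rewrite the MBeans, so treat these queries as a sketch; with typical lowercase snake_case rules, broker health checks look roughly like:

```promql
# Any broker reporting under-replicated partitions (metric name assumes
# snake_case jmx_exporter rules; verify against your /metrics output)
kafka_server_replicamanager_underreplicatedpartitions > 0

# Broker-wide incoming message rate over 5 minutes
sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))
```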

Kafka Exporter (Topic & Consumer Groups)

The JMX exporter doesn't know about consumer lag or topic-level details. That's where kafka-exporter comes in:

yaml
- name: kafka-exporter
  image: danielqsj/kafka-exporter:v1.7.0
  args:
    - --kafka.server=kafka:9092
  ports:
    - containerPort: 9309
      name: kafka-metrics

This provides:

  • Consumer group lag
  • Topic partition offsets
  • Per-topic message rates
  • Consumer group state
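Consumer lag is the metric I query most often. kafka-exporter labels its `kafka_consumergroup_lag` gauge by group, topic, and partition, so a per-group view is a simple aggregation:

```promql
# Total lag per consumer group and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)
```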

Both exporters get ServiceMonitors for automatic discovery:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-jmx
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: jmx-metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: kafka-metrics
      interval: 30s

PostgreSQL Exporter

PostgreSQL metrics come from the postgres_exporter:

yaml
- name: postgres-exporter
  image: prometheuscommunity/postgres-exporter:v0.15.0
  env:
    - name: DATA_SOURCE_NAME
      value: "postgresql://user:pass@postgres:5432/mydb?sslmode=disable"
  ports:
    - containerPort: 9187
      name: pg-metrics

Key metrics collected:

  • Active connections vs max connections
  • Transaction rates (commits, rollbacks)
  • Cache hit ratio
  • Table and index sizes
  • Slow query counts
  • Replication lag (if applicable)
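Assuming postgres_exporter's default collectors, the headline queries can be sketched like this:

```promql
# Connection saturation: active backends vs the configured ceiling
sum(pg_stat_activity_count) / max(pg_settings_max_connections)

# Cache hit ratio across all databases over 5 minutes
sum(rate(pg_stat_database_blks_hit[5m]))
  / (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m])))
```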

The user in DATA_SOURCE_NAME needs read access to the statistics views, so I create a dedicated monitoring user with limited permissions:

sql
CREATE USER exporter WITH PASSWORD 'secret';
GRANT pg_monitor TO exporter;
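Hard-coding the credentials in the pod spec, as in the container snippet above, leaks them into `kubectl describe`; a safer variant pulls the DSN from a Secret. A sketch, where the Secret name and key are my own choices:

```yaml
- name: postgres-exporter
  image: prometheuscommunity/postgres-exporter:v0.15.0
  env:
    - name: DATA_SOURCE_NAME
      valueFrom:
        secretKeyRef:
          name: postgres-exporter-dsn   # hypothetical Secret holding the DSN
          key: dsn                      # e.g. a postgresql://exporter:...@postgres:5432/mydb string
```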

Redis Metrics

Redis is simpler - the Bitnami Helm chart has built-in metrics:

yaml
# In redis-multisource.yaml ArgoCD Application
helm:
  values: |
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 30s

That's it. The chart deploys a redis-exporter sidecar automatically. Metrics include:

  • Commands processed per second
  • Connected clients
  • Memory usage and fragmentation
  • Key eviction rates
  • Keyspace hits/misses
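The hits/misses counters combine into the cache effectiveness number I actually watch; with redis-exporter's metric names:

```promql
# Keyspace hit ratio over 5 minutes
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```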

MinIO Metrics

MinIO exposes Prometheus metrics natively, no sidecar required. These environment variables point the MinIO Console at the Prometheus-compatible API of the LGTM stack (so its built-in dashboards can query history) and disable auth on the scrape endpoint:

yaml
# MinIO Helm values
env:
  - name: MINIO_PROMETHEUS_URL
    value: "http://lgtm-simple.monitoring.svc.cluster.local:9090"
  - name: MINIO_PROMETHEUS_AUTH_TYPE
    value: "public"

Metrics available at /minio/v2/metrics/cluster:

  • Bucket object counts and sizes
  • S3 API request rates and latencies
  • Disk usage per drive
  • Healing status
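Something still has to scrape that endpoint. A hedged ServiceMonitor sketch, where the label selector and port name are assumptions about your MinIO Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: minio
spec:
  selector:
    matchLabels:
      app: minio            # adjust to your Service's labels
  endpoints:
    - port: http            # port name on the MinIO Service
      path: /minio/v2/metrics/cluster
      interval: 30s
```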

Blackbox Exporter: External Monitoring

Internal metrics are great, but what about "is this service actually reachable?" That's the blackbox exporter's job.

yaml
# Helm chart deployment
source:
  repoURL: https://prometheus-community.github.io/helm-charts
  chart: prometheus-blackbox-exporter
  targetRevision: 11.3.1
  helm:
    values: |
      config:
        modules:
          http_2xx:
            prober: http
            timeout: 5s
            http:
              valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
              valid_status_codes: [200]
              fail_if_not_ssl: true

Then I define Probe custom resources for each service:

yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: core-infrastructure-probe
  namespace: monitoring
spec:
  jobName: core-infrastructure
  interval: 60s
  module: http_2xx
  prober:
    url: blackbox-exporter:9115
  targets:
    staticConfig:
      static:
        - https://grafana.homelab.example.com/api/health
        - https://argocd.homelab.example.com/healthz
        - https://prometheus.homelab.example.com/-/healthy

I have probes for:

| Probe | Services Checked |
| --- | --- |
| core-infrastructure | Grafana, ArgoCD, Prometheus |
| storage-infrastructure | MinIO health endpoints |
| database-infrastructure | pgAdmin ping |
| messaging-infrastructure | Kafka UI |
| kubernetes-infrastructure | Headlamp, Hubble UI |
| registry-external | Docker registry /v2/ |

The blackbox exporter produces two key metrics:

  • probe_success: 1 if the probe succeeded, 0 if failed
  • probe_duration_seconds: How long the check took
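These two metrics compose into alerts with almost no glue. A PrometheusRule sketch, where the thresholds and timings are my own choices:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-alerts
spec:
  groups:
    - name: availability
      rules:
        - alert: EndpointDown
          expr: probe_success == 0
          for: 5m              # ride out transient network blips
          labels:
            severity: critical
        - alert: EndpointSlow
          expr: probe_duration_seconds > 2
          for: 10m
```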

These feed directly into alerts (more on that in Part 4).

Istio Control Plane Metrics

Istiod exposes Prometheus metrics for mesh health:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 30s
      path: /metrics

Key metrics:

  • pilot_xds_pushes: Config pushes to proxies
  • pilot_proxy_convergence_time: Time for config to reach all proxies
  • pilot_conflict_outbound_listener_http_over_current_tcp: Config conflicts
  • Certificate expiry times (critical for mTLS)
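Two queries I find useful over these metrics (pilot_proxy_convergence_time is a histogram, hence the _bucket suffix):

```promql
# xDS push rate broken down by config type (CDS/EDS/LDS/RDS)
sum by (type) (rate(pilot_xds_pushes[5m]))

# p99 time for new config to reach all proxies
histogram_quantile(0.99, sum by (le) (rate(pilot_proxy_convergence_time_bucket[5m])))
```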

Annotation-Based Autodiscovery

For services that don't have dedicated exporters, k8s-monitoring supports annotation-based discovery:

yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Any pod with these annotations gets scraped automatically. Useful for custom applications that expose metrics but don't need a full ServiceMonitor.
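Under the hood this is the classic Kubernetes service-discovery relabeling pattern. In Prometheus-style configuration it looks roughly like the following (Alloy expresses the same logic in its discovery.relabel blocks):

```yaml
relabel_configs:
  # Only scrape pods that opt in via the annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Override the metrics path if the annotation sets one
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
```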

Scrape Interval Considerations

I use 30-second intervals for most metrics:

yaml
scrape:
  interval: 30s

For a homelab, this balances:

  • Granularity: 30s is fine for most troubleshooting
  • Storage: Lower intervals mean more data points
  • Load: More frequent scrapes = more CPU on exporters

The blackbox probes use 60-second intervals - external availability doesn't change that fast.

Lessons Learned

1. Kafka needs both exporters. JMX alone misses consumer lag. kafka-exporter alone misses broker internals. Use both.

2. ServiceMonitors beat annotation-based discovery. Explicit configuration is easier to debug than magic annotations.

3. Blackbox probes are underrated. Internal metrics can look healthy while the service is unreachable externally. Probe from outside.

4. Watch the cardinality. Kafka topics with many partitions can explode metric cardinality. Use relabeling to drop high-cardinality labels you don't need.
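For example, kafka-exporter's per-partition offset series can be dropped at scrape time if per-topic aggregates are enough; a sketch using ServiceMonitor metricRelabelings, where the regex is my assumption about which series you can live without:

```yaml
endpoints:
  - port: kafka-metrics
    interval: 30s
    metricRelabelings:
      # Drop per-partition offset series; keep topic-level aggregates
      - sourceLabels: [__name__]
        regex: kafka_topic_partition_.*
        action: drop
```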

What's Next

Metrics tell you what's happening. But when a request fails, you need traces to understand why. In Part 3, I'll cover distributed tracing with Istio and Tempo - how traces flow through the mesh, sampling strategies, and correlating traces with logs.


This is Part 2 of a 4-part series on homelab observability.

← Back to all posts