In Part 1, I covered the LGTM stack - the backend that stores and visualises telemetry. But a backend without data is just an empty dashboard. In this post, I'll cover how metrics actually get collected from services and the cluster itself.
## The Collection Layer: Grafana k8s-monitoring
Instead of deploying Prometheus Operator, Node Exporter, and kube-state-metrics separately, I use Grafana's k8s-monitoring Helm chart. It bundles everything into one cohesive package based on Grafana Alloy (the successor to Grafana Agent).
```yaml
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-monitoring
spec:
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: k8s-monitoring
    targetRevision: 3.4.0
    helm:
      values: |
        cluster:
          name: homelab-cluster
        destinations:
          - name: local-otlp
            type: otlp
            url: http://lgtm-simple.monitoring.svc.cluster.local:4318
            protocol: http
        clusterMetrics:
          enabled: true
        clusterEvents:
          enabled: true
        podLogs:
          enabled: true
        applicationObservability:
          enabled: true
```

This deploys several Alloy components:
| Component | Type | Purpose |
|---|---|---|
| alloy-metrics | Deployment | Scrapes Prometheus metrics |
| alloy-logs | DaemonSet | Collects pod logs |
| alloy-singleton | Deployment | Cluster-wide operations |
| alloy-receiver | Deployment | OTLP receiver for apps |
Everything flows to the LGTM stack via OTLP HTTP on port 4318. No Prometheus remote write, no separate Loki push - just OTLP everywhere.
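Once the chart syncs, a quick sanity check is to query Grafana for series the cluster collectors should always produce. These examples assume the standard cAdvisor and kube-state-metrics series that `clusterMetrics` enables:

```promql
# Per-namespace CPU usage, from cAdvisor metrics
sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))

# Pods stuck outside Running/Succeeded, from kube-state-metrics
sum by (namespace) (kube_pod_status_phase{phase=~"Pending|Failed"})
```

If both return data, metrics are making it through the OTLP pipeline end to end.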
## Kafka Metrics: The Dual Exporter Approach
Kafka is notorious for metric complexity. I use two exporters to get comprehensive coverage:
### JMX Exporter (Broker Internals)
Kafka exposes internal metrics via JMX. The JMX Exporter translates these to Prometheus format:
```yaml
# Sidecar container in the Kafka deployment
- name: jmx-exporter
  image: bitnami/jmx-exporter:0.20.0
  args:
    - "9308"
    - /config/jmx-config.yaml
  ports:
    - containerPort: 9308
      name: jmx-metrics
```

The JMX config whitelists specific MBeans:
```yaml
# ConfigMap: kafka-jmx-config
whitelistObjectNames:
  - "kafka.server:type=BrokerTopicMetrics,*"
  - "kafka.server:type=ReplicaManager,*"
  - "kafka.controller:type=KafkaController,*"
  - "kafka.network:type=RequestMetrics,*"
  - "java.lang:type=Memory"
  - "java.lang:type=GarbageCollector,*"
```

This gives you broker-level metrics:
- Messages in/out per second
- Bytes in/out per topic
- Request latency percentiles
- JVM memory and GC stats
- Under-replicated partitions
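Two queries I lean on from this set, sketched here with typical metric names — the exact names depend on the rename rules in your jmx-config.yaml, so verify against the exporter's /metrics output before copying:

```promql
# Broker throughput: messages in per second, per broker pod
sum by (pod) (rate(kafka_server_brokertopicmetrics_messagesin_total[5m]))

# Any under-replicated partitions is worth knowing about immediately
kafka_server_replicamanager_underreplicatedpartitions > 0
```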
### Kafka Exporter (Topic & Consumer Groups)
The JMX exporter doesn't know about consumer lag or topic-level details. That's where kafka-exporter comes in:
```yaml
- name: kafka-exporter
  image: danielqsj/kafka-exporter:v1.7.0
  args:
    - --kafka.server=kafka:9092
  ports:
    - containerPort: 9309
      name: kafka-metrics
```

This provides:
- Consumer group lag
- Topic partition offsets
- Per-topic message rates
- Consumer group state
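Consumer lag is the headline metric here. Using kafka-exporter's metric names (worth confirming against your /metrics output, as a sketch):

```promql
# Total lag per consumer group and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Lag growth over 10 minutes -- often a better alert signal than absolute lag
sum by (consumergroup) (delta(kafka_consumergroup_lag[10m])) > 0
```

A large but stable lag may be fine; lag that keeps growing means consumers can't keep up.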
Both exporters get ServiceMonitors for automatic discovery:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-jmx
  labels:
    app: kafka
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: jmx-metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
spec:
  selector:
    matchLabels:
      app: kafka
  endpoints:
    - port: kafka-metrics
      interval: 30s
```

## PostgreSQL Exporter
PostgreSQL metrics come from the postgres_exporter:
```yaml
- name: postgres-exporter
  image: prometheuscommunity/postgres-exporter:v0.15.0
  env:
    - name: DATA_SOURCE_NAME
      value: "postgresql://user:pass@postgres:5432/mydb?sslmode=disable"
  ports:
    - containerPort: 9187
      name: pg-metrics
```

Key metrics collected:
- Active connections vs max connections
- Transaction rates (commits, rollbacks)
- Cache hit ratio
- Table and index sizes
- Slow query counts
- Replication lag (if applicable)
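Two of these I query constantly. The metric names come from postgres_exporter's default `pg_stat_database` and `pg_stat_activity` collectors, so check your exporter's output if yours differ:

```promql
# Cache hit ratio per database -- should sit well above 0.99 on a healthy instance
sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
/
(sum by (datname) (rate(pg_stat_database_blks_hit[5m]))
 + sum by (datname) (rate(pg_stat_database_blks_read[5m])))

# Connection usage as a fraction of max_connections
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections)
```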
The DATA_SOURCE_NAME needs database access, but it shouldn't be a superuser. I create a dedicated monitoring user with limited permissions:
```sql
CREATE USER exporter WITH PASSWORD 'secret';
GRANT pg_monitor TO exporter;
```

## Redis Metrics
Redis is simpler - the Bitnami Helm chart has built-in metrics:
```yaml
# In redis-multisource.yaml ArgoCD Application
helm:
  values: |
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 30s
```

That's it. The chart deploys a redis-exporter sidecar automatically. Metrics include:
- Commands processed per second
- Connected clients
- Memory usage and fragmentation
- Key eviction rates
- Keyspace hits/misses
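The keyspace hit ratio is the quickest cache-health signal. With redis_exporter's counter names:

```promql
# Fraction of reads served from the keyspace rather than missing
rate(redis_keyspace_hits_total[5m])
/ (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```

A falling ratio usually means keys are being evicted faster than they're re-populated, or the working set has outgrown memory.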
## MinIO Metrics
MinIO exposes Prometheus metrics natively. Just point it at the LGTM stack:
```yaml
# MinIO Helm values
env:
  - name: MINIO_PROMETHEUS_URL
    value: "http://lgtm-simple.monitoring.svc.cluster.local:9090"
  - name: MINIO_PROMETHEUS_AUTH_TYPE
    value: "public"
```

Metrics available at `/minio/v2/metrics/cluster`:
- Bucket object counts and sizes
- S3 API request rates and latencies
- Disk usage per drive
- Healing status
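For capacity tracking, something like the following works — metric names vary between MinIO releases, so treat this as a sketch and confirm against your `/minio/v2/metrics/cluster` output:

```promql
# Fraction of usable cluster capacity still free
minio_cluster_capacity_usable_free_bytes
/ minio_cluster_capacity_usable_total_bytes
```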
## Blackbox Exporter: External Monitoring
Internal metrics are great, but what about "is this service actually reachable?" That's the blackbox exporter's job.
```yaml
# Helm chart deployment
source:
  repoURL: https://prometheus-community.github.io/helm-charts
  chart: prometheus-blackbox-exporter
  targetRevision: 11.3.1
  helm:
    values: |
      config:
        modules:
          http_2xx:
            prober: http
            timeout: 5s
            http:
              valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
              valid_status_codes: [200]
              fail_if_not_ssl: true
```

Then I define Probe custom resources for each service:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: core-infrastructure-probe
  namespace: monitoring
spec:
  jobName: core-infrastructure
  interval: 60s
  module: http_2xx
  prober:
    url: blackbox-exporter:9115
  targets:
    staticConfig:
      static:
        - https://grafana.homelab.example.com/api/health
        - https://argocd.homelab.example.com/healthz
        - https://prometheus.homelab.example.com/-/healthy
```

I have probes for:
| Probe | Services Checked |
|---|---|
| core-infrastructure | Grafana, ArgoCD, Prometheus |
| storage-infrastructure | MinIO health endpoints |
| database-infrastructure | pgAdmin ping |
| messaging-infrastructure | Kafka UI |
| kubernetes-infrastructure | Headlamp, Hubble UI |
| registry-external | Docker registry /v2/ |
The blackbox exporter produces two key metrics:
- `probe_success`: 1 if the probe succeeded, 0 if it failed
- `probe_duration_seconds`: how long the check took
These feed directly into alerts (more on that in Part 4).
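As a preview, the alert itself is simple: a PrometheusRule along these lines (the rule name, labels, and thresholds here are my own placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: EndpointDown
          expr: probe_success == 0
          for: 5m   # several failed 60s probes, to ride out transient blips
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```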
## Istio Control Plane Metrics
Istiod exposes Prometheus metrics for mesh health:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 30s
      path: /metrics
```

Key metrics:
- `pilot_xds_pushes`: Config pushes to proxies
- `pilot_proxy_convergence_time`: Time for config to reach all proxies
- `pilot_conflict_outbound_listener_http_over_current_tcp`: Config conflicts
- Certificate expiry times (critical for mTLS)
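Convergence time is a histogram, so the useful view is a quantile — assuming the standard `_bucket` series istiod exports:

```promql
# p99 time for a pushed config to converge across the mesh
histogram_quantile(0.99,
  sum by (le) (rate(pilot_proxy_convergence_time_bucket[5m])))
```

If this climbs as the mesh grows, istiod is struggling to push config fast enough.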
## Annotation-Based Autodiscovery
For services that don't have dedicated exporters, k8s-monitoring supports annotation-based discovery:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

Any pod with these annotations gets scraped automatically. Useful for custom applications that expose metrics but don't need a full ServiceMonitor.
## Scrape Interval Considerations
I use 30-second intervals for most metrics:
```yaml
scrape:
  interval: 30s
```

For a homelab, this balances:
- Granularity: 30s is fine for most troubleshooting
- Storage: Lower intervals mean more data points
- Load: More frequent scrapes = more CPU on exporters
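A rough back-of-envelope for the storage side, assuming ~1.5 bytes per sample after TSDB compression (a commonly cited ballpark, and the series count is a hypothetical round number):

```
100,000 active series × (86,400 s / 30 s)  =  288M samples/day
288M samples × ~1.5 bytes                  ≈  430 MB/day, before replication
```

Halving the interval to 15s doubles that, which is why I only tighten it for metrics I actually alert on.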
The blackbox probes use 60-second intervals - external availability doesn't change that fast.
## Lessons Learned
1. Kafka needs both exporters. JMX alone misses consumer lag. kafka-exporter alone misses broker internals. Use both.
2. ServiceMonitors beat annotation-based discovery. Explicit configuration is easier to debug than magic annotations.
3. Blackbox probes are underrated. Internal metrics can look healthy while the service is unreachable externally. Probe from outside.
4. Watch the cardinality. Kafka topics with many partitions can explode metric cardinality. Use relabeling to drop high-cardinality labels you don't need.
## What's Next
Metrics tell you what's happening. But when a request fails, you need traces to understand why. In Part 3, I'll cover distributed tracing with Istio and Tempo - how traces flow through the mesh, sampling strategies, and correlating traces with logs.
This is Part 2 of a 4-part series on homelab observability.