The LGTM All-in-One Stack - Unified Observability for Homelabs

Running a full observability stack usually means deploying Prometheus, Loki, Tempo, and Grafana separately. That's four Helm charts, four sets of configuration, and four things that can break independently. For a homelab, that felt like overkill.

Then I discovered Grafana's otel-lgtm image - an all-in-one container that bundles everything together. This is Part 1 of my observability series, covering the foundation of my monitoring setup.

What is LGTM?

LGTM stands for Loki, Grafana, Tempo, Mimir. It's Grafana's complete observability stack:

  • Loki: Log aggregation (like Elasticsearch, but simpler)
  • Grafana: Dashboards and visualisation
  • Tempo: Distributed tracing backend
  • Mimir: Long-term metrics storage (Prometheus-compatible)

The grafana/otel-lgtm image packages all of these (plus Pyroscope for profiling) into a single container. One deployment, one service, complete observability.
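The quickest way to kick the tyres is outside Kubernetes entirely. Per Grafana's own docs for the image, a single docker run is enough:

```shell
# Grafana UI on 3000, OTLP gRPC on 4317, OTLP HTTP on 4318
docker run -p 3000:3000 -p 4317:4317 -p 4318:4318 grafana/otel-lgtm
```

Grafana comes up on localhost:3000 once the container finishes starting its components.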

Why All-in-One?

For a homelab, the benefits are significant:

Resource efficiency: Instead of 4+ deployments with their own memory footprints, you get one container with shared resources. My LGTM stack runs happily with 512Mi-2Gi of memory.

Simpler configuration: One place to configure everything. No worrying about Prometheus scrape configs pointing to the right Loki endpoint.

Easier debugging: When something breaks, there's only one thing to look at.

Built-in OTLP support: The container includes an OpenTelemetry collector, so everything speaks the same protocol.

The trade-off is obvious: it's not production-grade. No high availability, no horizontal scaling. But for a homelab? Perfect.

[Diagram: the all-in-one LGTM stack]

The Deployment

Here's how I deploy the LGTM stack:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lgtm-simple
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lgtm-simple
  template:
    metadata:
      labels:
        app: lgtm-simple
    spec:
      containers:
        - name: lgtm
          image: grafana/otel-lgtm:latest
          ports:
            - containerPort: 3000   # Grafana
            - containerPort: 9090   # Prometheus
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          resources:
            requests:
              memory: "512Mi"
              cpu: "200m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          volumeMounts:
            - name: grafana-data
              mountPath: /data/grafana
            - name: loki-data
              mountPath: /data/loki
            - name: prometheus-data
              mountPath: /data/prometheus
            - name: tempo-data
              mountPath: /data/tempo
      volumes:
        # PVC names assumed to match the volume names above
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-data
        - name: loki-data
          persistentVolumeClaim:
            claimName: loki-data
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-data
        - name: tempo-data
          persistentVolumeClaim:
            claimName: tempo-data

Each component gets its own PersistentVolume for data retention:

Component  | Size  | Purpose
-----------|-------|----------------------------
Grafana    | 1Gi   | Dashboards, users, settings
Loki       | 2Gi   | Log storage
Prometheus | 2Gi   | Metrics TSDB
Tempo      | 1Gi   | Trace storage
Pyroscope  | 500Mi | Profiling data

Exposed Endpoints

The Service exposes multiple ports for different protocols:

yaml
apiVersion: v1
kind: Service
metadata:
  name: lgtm-simple
  namespace: monitoring
spec:
  selector:
    app: lgtm-simple
  ports:
    - name: grafana
      port: 3000
    - name: prometheus
      port: 9090
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318

  • Port 3000: Grafana UI - dashboards, alerting, exploration
  • Port 9090: Prometheus-compatible API - for tools that query metrics
  • Port 4317: OTLP gRPC - for high-throughput telemetry
  • Port 4318: OTLP HTTP - for simpler integrations
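For a quick look before any external access is wired up, port-forwarding against this Service works (names taken from the manifest above):

```shell
# Forward Grafana and both OTLP ports to localhost
kubectl -n monitoring port-forward svc/lgtm-simple 3000:3000 4317:4317 4318:4318
```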

Everything flows through OTLP. Metrics, logs, and traces all use the same protocol and endpoint.
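To make that concrete, here's a rough sketch of what a single log record looks like on the wire using the OTLP/HTTP JSON encoding. The service name, log body, and in-cluster endpoint are my assumptions; the payload shape follows the OTLP specification:

```python
import json
import time
import urllib.request

# Minimal OTLP/HTTP JSON payload for one log record
# (simplified; see the OTLP spec for the full schema).
payload = {
    "resourceLogs": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "demo-app"}}
        ]},
        "scopeLogs": [{
            "logRecords": [{
                "timeUnixNano": str(time.time_ns()),
                "severityText": "INFO",
                "body": {"stringValue": "hello from the homelab"},
            }]
        }],
    }]
}

# Endpoint assumes the in-cluster Service above; swap in
# localhost:4318 when port-forwarding.
req = urllib.request.Request(
    "http://lgtm-simple.monitoring:4318/v1/logs",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment when the endpoint is reachable
```

Metrics and traces use the same pattern against /v1/metrics and /v1/traces.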

External Access via Gateway API

Grafana and Prometheus get external access through HTTPRoutes:

yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana-route
  namespace: monitoring
spec:
  parentRefs:
    - name: main-gateway
      namespace: istio-ingress
  hostnames:
    - "grafana.homelab.example.com"
  rules:
    - backendRefs:
        - name: lgtm-simple
          port: 3000

Same pattern for Prometheus at prometheus.homelab.example.com. TLS termination happens at the gateway.

Grafana Configuration

The admin password comes from Infisical via External Secrets:

yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: lgtm-grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: infisical-store
    kind: ClusterSecretStore
  target:
    name: lgtm-grafana-admin
  data:
    - secretKey: GF_SECURITY_ADMIN_PASSWORD
      remoteRef:
        key: /lgtm/GF_SECURITY_ADMIN_PASSWORD

Key environment variables:

yaml
env:
  - name: GF_SECURITY_ADMIN_USER
    value: "admin"
  - name: GF_SECURITY_ADMIN_PASSWORD
    valueFrom:
      secretKeyRef:
        name: lgtm-grafana-admin
        key: GF_SECURITY_ADMIN_PASSWORD
  - name: GF_SERVER_ROOT_URL
    value: "https://grafana.homelab.example.com"
  - name: GF_AUTH_ANONYMOUS_ENABLED
    value: "false"
  # Use Loki for alert state history
  - name: GF_UNIFIED_ALERTING_STATE_HISTORY_ENABLED
    value: "true"
  - name: GF_UNIFIED_ALERTING_STATE_HISTORY_BACKEND
    value: "loki"

That last bit is nice - Grafana stores alert state history in Loki, so you can query alert history like any other logs.

The Data Flow

Everything speaks OTLP. The LGTM container's built-in collector routes:

  • Metrics → Prometheus/Mimir
  • Logs → Loki
  • Traces → Tempo
  • Profiles → Pyroscope

What You Get Out of the Box

With this single deployment, you immediately have:

Log exploration: Query logs with LogQL

{namespace="kafka"} |= "error"

Metrics queries: Standard PromQL

rate(http_requests_total[5m])

Trace exploration: Search spans by service, duration, error status

Correlated data: Click from a log line to see related traces, or from a trace to see metrics at that time
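The logs-to-traces jump is driven by Loki derived fields. The otel-lgtm image ships its own datasource wiring, but as a hedged sketch of what that provisioning looks like (the regex, datasource UID, and URL are assumptions to adapt to your log format):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://localhost:3100
    jsonData:
      derivedFields:
        # Assumed log line format: "... trace_id=abc123 ..."
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'   # $$ escapes env expansion in provisioning files
          datasourceUid: tempo     # assumes the Tempo datasource has uid "tempo"
```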

Resource Considerations

The otel-lgtm image is designed for development and testing. My production-ish homelab settings:

yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "200m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

This handles:

  • ~50 pods worth of metrics
  • Moderate log volume
  • 100% trace sampling (more on that in Part 3)

If you're seeing OOM kills or slow queries, the first lever to pull is memory limits. The second is reducing data retention periods.
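A couple of commands I'd reach for first when diagnosing this (namespace and label taken from the manifests above; `top` needs metrics-server):

```shell
# Live usage vs requests/limits
kubectl -n monitoring top pod -l app=lgtm-simple

# Was the last restart an OOMKill?
kubectl -n monitoring get pod -l app=lgtm-simple \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'
```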

When NOT to Use This

The all-in-one approach breaks down when:

  • You need high availability
  • You have multiple clusters sending telemetry
  • Your data volume exceeds what a single container can handle
  • You need separate scaling for metrics vs logs vs traces

At that point, deploy the components separately. But for a homelab? The simplicity wins.

What's Next

The LGTM stack is the backend. But where does the data come from? In Part 2, I'll cover the metrics collection layer - Grafana's k8s-monitoring chart, service exporters for Kafka and PostgreSQL, and the blackbox exporter for external monitoring.


This is Part 1 of a 4-part series on homelab observability.
