Aug 16, 2025

Building a Practical Observability Stack

How I think about logs, metrics, traces, dashboards, and alerts without overengineering the stack.

Observability, Grafana, Prometheus

Most teams talk about observability like it is a badge of maturity. In practice, a lot of setups are just expensive dashboards with no operational value.

A practical observability stack should help answer three questions fast:

  • what is failing
  • how bad is it
  • where should I look next

If the stack cannot help with that under pressure, it is decoration.
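As a rough illustration of those three questions, here is a minimal sketch in Python. It assumes per-service request and error counts for a recent window are already available from somewhere. The `ServiceWindow` type, the service names, and the 5% threshold are all hypothetical, picked only for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request and error counts for one service over a recent window (hypothetical shape)."""
    name: str
    requests: int
    errors: int

    @property
    def error_ratio(self) -> float:
        # Services with no traffic are treated as healthy rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def triage(windows: list[ServiceWindow], threshold: float = 0.05) -> list[ServiceWindow]:
    """Return services whose error ratio exceeds the threshold, worst first."""
    failing = [w for w in windows if w.error_ratio > threshold]
    return sorted(failing, key=lambda w: w.error_ratio, reverse=True)


if __name__ == "__main__":
    windows = [
        ServiceWindow("api", requests=1000, errors=12),
        ServiceWindow("worker", requests=200, errors=40),
        ServiceWindow("webhooks", requests=50, errors=5),
    ]
    for w in triage(windows):
        # Each row says what is failing and how bad it is;
        # the first row is the suggestion for where to look next.
        print(f"{w.name}: {w.error_ratio:.1%} errors")
```

The point is not the code itself but the shape of the answer: a short, ordered list is more useful mid-incident than a wall of graphs.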

The core layers

I prefer a simple starting stack:

  • Prometheus for metrics
  • Grafana for dashboards
  • Loki for logs
  • Tempo, or another OpenTelemetry-compatible tracing backend, if the system justifies it

This is enough for most internal tools, SaaS platforms, and reliability labs.

Metrics first, but not vanity metrics

Good metrics are tied to system behavior:

  • request rate
  • error rate
  • latency
  • queue depth
  • worker failures
  • webhook success and failure
  • DB pool saturation
  • retry counts
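To make a few of these concrete, here is a small standard-library-only sketch of counters and gauges rendered in the Prometheus text exposition format. It is an illustration under assumptions, not a replacement for an official Prometheus client library: the `Metrics` class and the metric names (`http_requests_total`, `job_retries_total`, `queue_depth`) are hypothetical, and histograms for latency are left out to keep it short.

```python
from collections import defaultdict


def _escape(value: str) -> str:
    # Label values must escape backslashes, double quotes, and newlines.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def _key(name: str, labels: dict) -> tuple:
    # Normalize labels so the same label set always maps to the same sample.
    return name, tuple(sorted((k, str(v)) for k, v in labels.items()))


class Metrics:
    """Tiny in-memory registry for counters and gauges (illustrative only)."""

    def __init__(self) -> None:
        self.counters = defaultdict(float)
        self.gauges = {}

    def inc(self, name: str, amount: float = 1.0, **labels) -> None:
        """Increase a counter, e.g. request, error, or retry counts."""
        self.counters[_key(name, labels)] += amount

    def set_gauge(self, name: str, value: float, **labels) -> None:
        """Record a point-in-time value, e.g. queue depth or pool usage."""
        self.gauges[_key(name, labels)] = value

    def render(self) -> str:
        """Render all samples as Prometheus text exposition lines."""
        lines = []
        for kind, samples in (("counter", self.counters), ("gauge", self.gauges)):
            seen = set()
            for (name, labels), value in sorted(samples.items()):
                if name not in seen:
                    # Emit one TYPE line per metric name.
                    lines.append(f"# TYPE {name} {kind}")
                    seen.add(name)
                label_str = ",".join(f'{k}="{_escape(v)}"' for k, v in labels)
                suffix = f"{{{label_str}}}" if label_str else ""
                lines.append(f"{name}{suffix} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    m = Metrics()
    m.inc("http_requests_total", status="200")
    m.inc("http_requests_total", status="500")
    m.inc("job_retries_total", 3)
    m.set_gauge("queue_depth", 17)
    print(m.render(), end="")
```

Request rate and error rate then fall out of the counters at query time, which is why it is usually enough to count events by outcome rather than precompute ratios in the application.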

A dashboard should not exist unless it helps someone decide something.

Final thought

Observability is not about collecting everything. It is about reducing ambiguity when systems misbehave.

Start with the questions you need to answer under pressure. Then build the metrics, logs, traces, dashboards, and alerts that make those answers obvious.