Aug 16, 2025
Building a Practical Observability Stack
How I think about logs, metrics, traces, dashboards, and alerts without overengineering the stack.
Most teams talk about observability like it is a badge of maturity. In practice, a lot of setups are just expensive dashboards with no operational value.
A practical observability stack should help answer three questions fast:
- what is failing
- how bad is it
- where should I look next
If the stack cannot help with that under pressure, it is decoration.
The core layers
I prefer a simple starting stack:
- Prometheus for metrics
- Grafana for dashboards
- Loki for logs
- Tempo (or another OpenTelemetry-compatible backend) for traces, if the system justifies it
This is enough for most internal tools, SaaS platforms, and reliability labs.
Metrics first, but not vanity metrics
Good metrics are tied to system behavior:
- request rate
- error rate
- latency
- queue depth
- worker failures
- webhook success and failure
- DB pool saturation
- retry counts
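To make the first few signals in that list concrete, here is a minimal sketch in plain Python. It is not a Prometheus client: `ServiceMetrics`, `observe_request`, `error_rate`, and `latency_percentile` are hypothetical names made up for illustration, and the percentile uses a simple nearest-rank method over raw samples. In a real stack a metrics client library would own the counters and histograms, and Prometheus would do the aggregation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ServiceMetrics:
    """Hypothetical in-process holder for request count, errors, and latency."""

    counts: Counter = field(default_factory=Counter)
    latencies_ms: list = field(default_factory=list)

    def observe_request(self, latency_ms: float, ok: bool) -> None:
        # Request rate and error rate both derive from these two counters.
        self.counts["requests"] += 1
        if not ok:
            self.counts["errors"] += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        # Fraction of observed requests that failed; 0.0 when nothing was seen.
        total = self.counts["requests"]
        return self.counts["errors"] / total if total else 0.0

    def latency_percentile(self, pct: float) -> float:
        # Nearest-rank percentile over the raw samples (an assumption made
        # for simplicity; real histograms use buckets instead).
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]


if __name__ == "__main__":
    metrics = ServiceMetrics()
    for latency_ms, ok in [(120, True), (95, True), (480, False), (110, True)]:
        metrics.observe_request(latency_ms, ok)
    print("error rate:", metrics.error_rate())
    print("p95 latency (ms):", metrics.latency_percentile(95))
```

The point of the sketch is that each number maps to one of the three questions: the error rate says how bad it is, and a high percentile shows the slow tail that an average would hide.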
A dashboard should not exist unless it helps someone decide something.
Final thought
Observability is not about collecting everything. It is about reducing ambiguity when systems misbehave.
Start with the questions you need to answer under pressure. Then build the metrics, logs, traces, dashboards, and alerts that make those answers obvious.