Observability in Distributed Systems: Beyond Logging and Monitoring

The 3 AM Page That Changed How We Monitor Systems

I got paged at 3 AM. "API is slow." That's all the alert said.

I opened our monitoring dashboard. CPU? Normal. Memory? Normal. Database? Normal. Everything looked fine.

But users were experiencing 5-second response times. Something was broken. I just had no idea what.

I spent 2 hours clicking through dashboards, checking logs, and guessing. I finally found it: one microservice was making 100 database queries per request instead of 1. A recent deploy had introduced an N+1 query bug.

That night I learned: monitoring tells you WHAT is broken. Observability tells you WHY.
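
For readers who haven't met an N+1 bug before, here is a minimal sketch of the pattern using an in-memory SQLite database. The tables, the CountingDb wrapper, and the function names are all hypothetical, made up for illustration; this is not the code from the service in the story.

```python
import sqlite3


class CountingDb:
    """Hypothetical wrapper that counts how many queries are sent."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.query_count = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.query_count += 1
        return self.conn.execute(sql, params).fetchall()


def build_demo_db(order_count: int) -> sqlite3.Connection:
    """Create an in-memory database with two items per order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT)"
    )
    for order_id in range(1, order_count + 1):
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        for suffix in ("a", "b"):
            conn.execute(
                "INSERT INTO items (order_id, name) VALUES (?, ?)",
                (order_id, f"item-{order_id}{suffix}"),
            )
    return conn


def load_items_n_plus_one(db: CountingDb) -> dict:
    """The bug: one query for the orders, then one more query per order."""
    result = {}
    for (order_id,) in db.query("SELECT id FROM orders"):
        rows = db.query(
            "SELECT name FROM items WHERE order_id = ? ORDER BY id", (order_id,)
        )
        result[order_id] = [name for (name,) in rows]
    return result


def load_items_batched(db: CountingDb) -> dict:
    """The fix: a single JOIN. Note it skips orders that have no items."""
    rows = db.query(
        "SELECT o.id, i.name FROM orders o "
        "JOIN items i ON i.order_id = o.id ORDER BY i.id"
    )
    result = {}
    for order_id, name in rows:
        result.setdefault(order_id, []).append(name)
    return result
```

Both functions return the same data. The difference is the query count: with N orders, the first sends N + 1 queries and the second sends one. CPU, memory, and database dashboards can all look healthy while this is happening, which is why it is hard to spot from monitoring alone.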

The Three Pillars of Observability

1. Logs (What Happened)

"User X tried to checkout at 3:15 AM. Payment failed with error: insufficient funds."

Logs are stories. They tell you which events occurred.

2. Metrics (How Many, How Often)

"API latency p99 is 2.5 seconds. Error rate is 5%. Requests per second: 1000."

Metrics are numbers over time. They tell you trends and patterns.
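
To make numbers like "p99" and "error rate" concrete, here is a small sketch that computes them from raw samples. It uses the nearest-rank percentile definition and treats any 5xx status as an error; both are assumptions made for this example. A metrics system such as Prometheus would normally derive these from counters and histograms rather than from a list of raw values.

```python
def percentile(values: list, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate(status_codes: list) -> float:
    """Fraction of responses with a 5xx status code (an assumed definition)."""
    if not status_codes:
        raise ValueError("need at least one response")
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)
```

For example, percentile(latencies_ms, 99) returns the latency that 99% of requests stayed at or under, which is usually more telling than an average because a few very slow requests barely move the mean.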

3. Traces (The Journey)

"Request came in → called Auth Service (50ms) → called Product Service (200ms) → called Database (2000ms) ← HERE'S THE SLOWDOWN."

Traces show the full path of a request across services. They tell you where time is spent.
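
A toy version of the idea, to show what a trace records. This is a sketch only: the Tracer and Span classes below are hypothetical and are not the OpenTelemetry API. It also assumes span names are unique within a request and that everything runs on a single thread.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the enclosing span, None for the root
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Tracer:
    """Records nested spans; the clock is injectable so it can be tested."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, parent=parent, start=self.clock())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = self.clock()
            self._stack.pop()
            self.finished.append(current)


def slowest_leaf(spans: List[Span]) -> Span:
    """Return the slowest span that has no children of its own."""
    parents = {s.parent for s in spans}
    leaves = [s for s in spans if s.name not in parents]
    return max(leaves, key=lambda s: s.duration_ms)
```

Wrapping each downstream call in "with tracer.span(...)" and then calling slowest_leaf(tracer.finished) points at the step where the time went, the same way the arrow in the example above points at the database.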

What We Actually Use

Logs: Elasticsearch + Kibana. Every service sends logs here. Searchable, filterable.

Metrics: Prometheus + Grafana. Tracks latency, error rates, throughput. Pretty dashboards.

Traces: OpenTelemetry + Jaeger. Shows request flow across microservices.

Together? I could have debugged that 3 AM issue in 5 minutes instead of 2 hours.

The Game-Changer: Correlation

The magic happens when you connect all three.

Alert fires: "High latency on /checkout endpoint."

  • Metrics dashboard → p99 latency spiked to 5s
  • Traces → shows Payment Service is slow
  • Logs → Payment Service logs show "Stripe API timeout"

Root cause found in 60 seconds: Stripe is having an outage. Not our fault. We add a status page note and go back to sleep.
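
The last hop, from a slow trace to the matching log lines, only works if the logs carry the same ID as the trace. Here is a rough sketch of that lookup over JSON log lines. The "request_id" field name and the sample records are assumptions for illustration; in practice this would be a filter in the log search tool, not a script.

```python
import json
from typing import Dict, List


def logs_for_request(log_lines: List[str], request_id: str) -> List[Dict]:
    """Return parsed JSON log records that carry the given request ID."""
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matches.append(record)
    return matches
```

Given the request ID from the slow trace, this returns only the log records for that one request, across every service that logged it.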

Start Simple

Don't build Netflix-level observability on day 1. Start with:

  1. Structured logging: Log in JSON. Include request ID in every log.
  2. Basic metrics: Track requests, errors, latency. Even just counting these helps.
  3. One trace: Instrument your slowest endpoint. See where time goes.
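
Step 1 can be sketched with nothing but the Python standard library. The field names, the request_id_var variable, and the build_json_logger helper are hypothetical choices for this example; a dedicated JSON logging library would work just as well.

```python
import contextvars
import json
import logging

# Holds the current request's ID; "-" means no request is active.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        return json.dumps(payload)


def build_json_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Set request_id_var once at the start of each request (for example in middleware) and every log line written while handling that request carries the ID, with no need to pass it through every function call.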

You'll catch issues faster. Sleep better. And when you get paged, you'll know WHY, not just WHAT.