System Design — Page 6
Monitoring & Observability
Metrics, logs, traces, golden signals, SLOs, alerting, and incident response — how engineering teams detect, diagnose, and resolve production issues in minutes, not hours.
Key stats covered below: Prometheus adoption on Kubernetes · OpenTelemetry production demand · MTTD reduction with tracing · Netflix log volume per day.
Building a system is the first half. Keeping it running is the second — and it's harder. In 2025, AWS went down for 15 hours due to a DNS race condition. Google Cloud was out for 7 hours because of a null pointer. Cloudflare took down ChatGPT, Claude, and X for 2 hours with a database permission change. Every one of these was detected by monitoring, diagnosed through logs and traces, and resolved through incident response.
This page covers the seven monitoring concepts that come up in every system design interview: the three pillars of observability, Google's golden signals, SLOs and error budgets, distributed tracing, alerting, log aggregation, and real incident response. Every diagram uses real data from production systems.
If you haven't seen Distributed Systems yet, start there — fault tolerance and circuit breakers are the patterns that monitoring detects and alerting acts on. And Security & Authentication covers the attacks that monitoring helps you catch.
The Three Pillars — Metrics, Logs, Traces
Every observability system is built on three types of telemetry data. Metrics tell you WHAT is wrong. Logs tell you WHY. Traces tell you WHERE. None is sufficient alone — you need all three to debug a distributed system.
Numerical measurements collected at regular intervals. CPU usage, request count, error rate, latency percentiles. Cheap to store, fast to query, ideal for dashboards and alerts. But they tell you WHAT is wrong, not WHY.
Time series: { timestamp, metric_name, value, labels }
Example: { t: 14:03:22, metric: "http_requests_total", value: 14523, method: "GET", status: "500" }
Dashboard shows http_error_rate jumped from 0.1% to 15% at 14:03. You know something broke — you don't know what. Metrics answer WHAT is wrong (error rate is high). Next step: check logs to find WHY.
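The time-series shape above can be sketched with a minimal in-memory labeled counter. This is illustrative only — real services use a client library such as prometheus_client — and the class and metric names are assumptions:

```python
import time
from collections import defaultdict

class Counter:
    """Minimal labeled counter emitting { timestamp, metric_name, value, labels } points."""
    def __init__(self, name):
        self.name = name
        self.values = defaultdict(int)  # frozenset of label pairs -> running count

    def inc(self, **labels):
        self.values[frozenset(labels.items())] += 1

    def samples(self):
        # One time-series point per unique label combination.
        now = time.time()
        return [
            {"t": now, "metric": self.name, "value": v, "labels": dict(k)}
            for k, v in self.values.items()
        ]

requests = Counter("http_requests_total")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="500")

for s in requests.samples():
    print(s["metric"], s["labels"], s["value"])
```

Scraping `samples()` every 15 seconds and storing the points is, in miniature, what a metrics backend does; the labels are what let a dashboard slice errors by `status="500"`.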
Same Incident, Three Lenses
A payment database goes down at 14:03. Click each pillar to see what it reveals.
http_error_rate: 0.1% → 15%
http_latency_p99: 45ms → 2,300ms
payment_success_rate: 99.8% → 0%
postgres_connections_active: 50 → 0
Metrics detected the problem in seconds. Alerts fired. But WHY did all payments fail?
Three types of data, one source of truth
The three pillars are useless in isolation. Metrics without logs is “something is wrong but I don't know why.” Logs without traces is “I found the error but I don't know which service caused it.” The magic happens when you correlate: a metric alert leads to a log search, which reveals a trace_id, which shows the full request path. OpenTelemetry unifies all three with a single trace_id — that's why 89% of production users demand it.
But even with perfect telemetry, you need to know what to watch. Google distilled decades of SRE experience into four numbers.
The Four Golden Signals
Google's SRE book defines four signals that every service should monitor. If you only have four dashboards, make them these. They've been the industry standard since 2014 because they catch 90% of production issues.
Latency
How long it takes to serve a request. Track both successful AND failed requests — a fast 500 error is still a problem, and a slow error wastes user time.
p50 (median), p95, p99 latency. The p99 matters most — it's what your worst 1% of users experience. If p99 = 3s but p50 = 50ms, 1% of users are having a terrible time.
p99 > 1s for API endpoints. p99 > 3s for page loads.
Stripe targets p99 < 200ms for payment APIs. At 3s+ latency, users assume the page is broken and retry — causing duplicate charges.
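A quick way to see why p99 matters more than the median: compute percentiles from a latency sample. This is a nearest-rank sketch, not a production histogram:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at position ceil(n * p / 100) in sorted order."""
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * p / 100)
    return ordered[rank - 1]

# 95 fast requests and 5 slow ones: the median looks healthy, the tail does not.
latencies_ms = [50] * 95 + [3000] * 5
print(percentile(latencies_ms, 50))  # 50   -> "everything is fine"
print(percentile(latencies_ms, 95))  # 50   -> still fine
print(percentile(latencies_ms, 99))  # 3000 -> 1% of users wait 3 seconds
```

Averages hide this entirely: the mean here is ~198 ms, which looks acceptable while 1 in 20 users stares at a spinner.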
Signals are meaningless without targets
“Latency is 200ms” — is that good or bad? It depends on your target. A 200ms p99 is excellent for a payment API and terrible for a CDN edge response. You need a number that defines “good enough” — a Service Level Objective. And you need a way to measure it — a Service Level Indicator. Get these wrong and you either over-invest in reliability nobody needs, or under-invest until users leave.
SLOs, SLIs, SLAs — Defining “Reliable Enough”
99.9% uptime sounds great until you realize it means 43 minutes of downtime per month. 99.99% means 4.3 minutes. The difference between three nines and four nines is a 10x engineering investment — and every system must decide how many nines it needs.
The Nines — How Much Downtime Can You Afford?
99.9% ("three nines") — the standard for most SaaS products and APIs — allows:
- 8.77 hours of downtime per year
- 43.8 minutes per month
- 10.1 minutes per week
The thin sliver on the right is your entire error budget. Use it wisely.
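The downtime figures above fall straight out of the arithmetic: allowed downtime is the unavailable fraction times the period. A sketch:

```python
def downtime_budget(nines_pct, period_hours):
    """Allowed downtime in minutes for an availability target over a period."""
    unavailable = 1 - nines_pct / 100
    return period_hours * 60 * unavailable

YEAR_H = 365.25 * 24       # 8,766 hours
MONTH_H = YEAR_H / 12      # ~730.5 hours
WEEK_H = 7 * 24

for target in (99.9, 99.99):
    print(f"{target}%: "
          f"{downtime_budget(target, YEAR_H) / 60:.2f} h/year, "
          f"{downtime_budget(target, MONTH_H):.1f} min/month, "
          f"{downtime_budget(target, WEEK_H):.1f} min/week")
# 99.9%: 8.77 h/year, 43.8 min/month, 10.1 min/week
# 99.99%: 0.88 h/year, 4.4 min/month, 1.0 min/week
```

Each extra nine divides the budget by 10 — which is why the jump from three nines to four nines is an engineering investment decision, not a config change.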
SLI → SLO → SLA — The Chain
A quantitative measurement of one aspect of service quality. The raw number. What you actually measure.
Percentage of HTTP requests that complete in < 200ms. Measured: 99.3% of requests were under 200ms last month.
Audience: Engineers
SLIs feed SLOs. SLOs feed SLAs. Measure → Target → Contract.
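The chain is mechanical: compute the SLI from raw request data, compare it to the SLO target, and what's left is your error budget. A sketch with hypothetical numbers:

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests completing under the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)

def budget_burned(sli, slo_target=0.999):
    """Error budget consumed: bad-event fraction vs. what the SLO allows.
    A value > 1.0 means the SLO is breached."""
    budget = 1 - slo_target   # e.g. 0.1% of requests may be slow
    burned = 1 - sli          # fraction actually slow
    return burned / budget

# 10,000 requests this window; 30 were slower than 200 ms.
latencies = [50] * 9970 + [450] * 30
sli = latency_sli(latencies)  # 0.997
print(f"SLI = {sli:.3%}, error budget burned = {budget_burned(sli):.0%}")
```

Here the SLI is 99.7% against a 99.9% SLO: 300% of the budget is gone, so under an error-budget policy the team stops shipping features and fixes reliability.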
When your SLO breaches, you need to find the bottleneck — fast
Your error budget is burning. The golden signals show latency spiked. Logs reveal a timeout error. But which of your 15 microservices is the culprit? This is where distributed tracing saves you. A single trace shows the full request path — every service, every database call, every external API — with millisecond timing. The bottleneck lights up like a red flag.
Distributed Tracing — Following a Request Across Services
When a checkout request touches 5 services, which one is slow? Distributed tracing follows the request hop by hop, timing each span. OpenTelemetry is the standard — the second most active CNCF project after Kubernetes, with 89% of production users demanding vendor compliance.
The Stripe API call dominates total latency. Optimization options: cache Stripe responses for idempotent requests, use Stripe's async payment intents, or add a circuit breaker for Stripe timeouts.
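The span model behind that diagnosis can be hand-rolled in a few lines. Real systems use OpenTelemetry; the `Span` class, service names, and timings below are illustrative assumptions. The key idea: every span carries the shared trace_id plus its own timing, so sorting spans by duration surfaces the bottleneck:

```python
import time
import uuid

class Span:
    """One timed hop in a request, tied to the request's shared trace_id."""
    def __init__(self, trace_id, name):
        self.trace_id, self.name = trace_id, name

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self.start) * 1000

trace_id = uuid.uuid4().hex
spans = []

# Simulated checkout: each hop records its own span under one trace_id.
for service, work_s in [("auth", 0.005), ("inventory", 0.010),
                        ("payment->stripe", 0.120), ("email", 0.008)]:
    with Span(trace_id, service) as span:
        time.sleep(work_s)  # stand-in for the real downstream call
    spans.append(span)

slowest = max(spans, key=lambda s: s.duration_ms)
print(f"trace {trace_id[:8]}: slowest span = {slowest.name} "
      f"({slowest.duration_ms:.0f} ms)")
```

In a real deployment the trace_id travels between services in a request header (W3C `traceparent`), which is what lets a backend like Jaeger stitch the hops back into one timeline.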
Traces find the problem. Alerts tell you it exists.
You can have the best dashboards in the world, but if nobody is looking at them at 3 AM, they're useless. Alerting bridges the gap between data and action. The challenge: too few alerts and you miss problems; too many and your team ignores them all.
Alerting — When to Page, When to Ignore
The worst monitoring setup isn't one with no alerts — it's one with too many. Alert fatigue kills incident response. PagerDuty reports that mature teams have MTTA under 5 minutes; teams without processes average 30+ minutes. Every alert must be actionable, or it's noise.
Should This Alert Page Someone?
Alerting Anti-Patterns
Too many alerts. Team ignores them. Mean time to acknowledge (MTTA) increases from 2 min to 30 min. Critical alerts get lost in noise.
Audit every alert. If nobody acted on it in 30 days, delete it. Target < 5 actionable alerts per on-call shift.
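One common guard against noise is requiring a threshold to hold for a sustained window before paging — the shape of a Prometheus `for:` clause. A minimal sketch (the class and thresholds are illustrative):

```python
from collections import deque

class SustainedAlert:
    """Page only when the error rate exceeds the threshold for N consecutive checks."""
    def __init__(self, threshold, sustained_checks):
        self.threshold = threshold
        self.window = deque(maxlen=sustained_checks)

    def observe(self, error_rate):
        self.window.append(error_rate > self.threshold)
        # Fire only if the whole window breached -- a one-sample blip stays silent.
        return len(self.window) == self.window.maxlen and all(self.window)

alert = SustainedAlert(threshold=0.05, sustained_checks=3)
readings = [0.01, 0.09, 0.02, 0.08, 0.09, 0.11]  # one blip, then a real incident
pages = [alert.observe(r) for r in readings]
print(pages)  # [False, False, False, False, False, True]
```

The 0.09 blip at the second reading never pages anyone; only three consecutive breaches do. The trade-off is detection delay — three 1-minute checks means roughly 3 minutes of added MTTD in exchange for far fewer false pages.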
Alerts fire. Now where do you look?
The alert says “payment error rate > 5%.” Your first move: check the logs. But at modern scale you can't just SSH into a server and grep — you need a log aggregation system that can ingest, index, and search billions of events in seconds.
Log Aggregation — Making Sense of a Billion Events
Netflix generates 1.3 petabytes of logs per day. You can't SSH into a server and grep anymore. Log aggregation tools collect, index, and search logs from thousands of services. The choice between ELK, Loki, Datadog, and Splunk comes down to cost, scale, and how much you want to manage.
Elasticsearch + Logstash + Kibana. The original log aggregation stack. Elasticsearch indexes logs for full-text search; Logstash ingests and transforms; Kibana visualizes.
Self-hosted, full control, complex search queries
Netflix, LinkedIn, and eBay run massive ELK deployments. Netflix indexes 1.3 PB/day.
Free (open source) but expensive to operate. Elasticsearch is memory-hungry — budget $0.50-2.00/GB/day for infrastructure.
Log Levels — When to Use Each
DEBUG — Detailed diagnostic info. Development only. Never in production (too noisy).
INFO — Normal operations. Request handled, user logged in, job completed.
WARN — Something unexpected but not broken. Retry succeeded, approaching a limit, deprecated API used.
ERROR — Something failed. Request returned 500, database query failed, external API timeout.
FATAL — System is crashing. Out of memory, can't connect to the database on startup, unrecoverable state.
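These levels map directly onto a logging framework's severity filter; setting the production level to INFO is what keeps DEBUG out. A sketch with Python's stdlib logging (the logger name and messages are illustrative):

```python
import logging

# Production config: anything below INFO is filtered out at the root.
logging.basicConfig(level=logging.INFO, force=True)
logger = logging.getLogger("payment-api")

logger.debug("raw gateway response body: ...")               # dropped: below INFO
logger.info("payment completed user_id=12345 amount=49.99")  # normal operation
logger.warning("retry 2/3 after gateway timeout")            # unexpected, not broken
logger.error("payment failed: connection timeout")           # a request failed
logger.critical("cannot connect to database on startup")     # system is going down
```

Flipping `level=logging.DEBUG` for a single deployment is the standard way to get diagnostic detail during an incident without paying the noise cost permanently.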
Structured vs Unstructured Logs
[2026-03-07 14:03:22] ERROR - Payment failed for user 12345, amount $49.99, reason: connection timeout to stripe
Good luck parsing this with regex at 1M logs/sec.
{
"timestamp": "2026-03-07T14:03:22Z",
"level": "ERROR",
"service": "payment-api",
"event": "payment_failed",
"user_id": "12345",
"amount": 49.99,
"reason": "connection_timeout",
"target": "stripe",
"trace_id": "abc123"
}
Structured logs are JSON. Every field is queryable. “Show me all payments > $100 that failed due to Stripe timeouts in the last hour” — one query, instant results.
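Emitting the structured form is one `json.dumps` away. The field names below mirror the example above; the helper function is a sketch, not a full logging pipeline:

```python
import json
import datetime

def log_event(level, service, event, **fields):
    """Emit one structured log line: every field is independently queryable."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                         .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "service": service,
        "event": event,
        **fields,
    }
    print(json.dumps(record))  # ship this line to the aggregator
    return record

rec = log_event("ERROR", "payment-api", "payment_failed",
                user_id="12345", amount=49.99,
                reason="connection_timeout", target="stripe",
                trace_id="abc123")

# The query "payments > $100 failing on stripe timeouts" becomes a field predicate:
matches = (rec["event"] == "payment_failed" and rec["amount"] > 100
           and rec["target"] == "stripe")
```

That last predicate is exactly what the aggregator evaluates over billions of records — possible only because `amount` is a number in its own field, not a substring to regex out of prose.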
Theory is great. Here's what it looks like when everything goes wrong.
On November 18, 2025, a single database permission change at Cloudflare broke a large part of the internet. Let's walk through the incident step by step — from automated detection to blameless postmortem — to see how everything on this page works in practice.
Real Incident — The Cloudflare Outage That Broke the Internet
On November 18, 2025, a single database permission change took down ChatGPT, Claude, X, Shopify, Indeed, and thousands more for ~2 hours. This is the anatomy of a real incident — from detection to postmortem — showing exactly how monitoring, alerting, and incident response work in practice.
Date: November 18, 2025
Duration: ~2 hours
Root cause: DB permission change
Impact: hundreds of millions of users
Incident Timeline
Cloudflare's monitoring detects significant failures delivering core network traffic. Error pages start appearing for end users. Internal dashboards show error rates spiking globally.
Continue the Series
Monitoring is the feedback loop that keeps everything else working. Next: learn how to present all of this in a 45-minute interview.
Frequently Asked Questions
What are the three pillars of observability?
What are the four golden signals?
What is the difference between SLI, SLO, and SLA?
What is an error budget and why does it matter?
What is distributed tracing?
What is OpenTelemetry and why should I use it?
How do I avoid alert fatigue?
ELK Stack vs Grafana Loki — which should I use?
How should I handle monitoring in a system design interview?
What happens during a real incident response?
Sources & References
- Google — Site Reliability Engineering: How Google Runs Production Systems (sre.google/sre-book)
- Cloudflare — November 18, 2025 Outage Post-Mortem (blog.cloudflare.com)
- Grafana Labs — Observability Survey Report 2025 (grafana.com/observability-survey/2025)
- CNCF — OpenTelemetry Adoption and Jaeger v2.0 (cncf.io/blog)
- PagerDuty — From Alert to Resolution: Incident Response Automation (pagerduty.com)
- Gartner — 70% Distributed Tracing Adoption Forecast for Microservices Organizations
- AWS, Google Cloud, Azure — 2025 Outage Reports and Postmortems
- Netflix Tech Blog — Log Pipeline Architecture (netflixtechblog.com)