Observability
The three pillars — logs, metrics, traces — what question each answers, and the tool concepts (Prometheus, Grafana, OpenTelemetry).
Observability is the ability to answer questions about your running system from the outside, without shipping new code to add a print statement. The classic framing is the three pillars — logs, metrics, and traces — and the senior skill is knowing which one answers which question, because reaching for the wrong pillar wastes an incident. Monitoring asks the questions you knew to ask in advance (a dashboard, an alert); observability lets you ask new questions of data you already collected, after the system surprises you.
The three pillars at a glance
Each pillar is a different shape of data answering a different kind of question. The art is matching the question to the pillar before you start digging.
The three pillars overlap, but each has a job it does best. Pick by the question you’re asking.
| Pillar | Answers | Shape | Cost / caveat |
|---|---|---|---|
| Metrics | Is something wrong, and what's the trend? | numeric time series, aggregated | cheap; but only what you pre-defined — and high cardinality is fatal |
| Logs | What exactly happened on this event? | discrete structured records | high detail; high volume and cost at scale |
| Traces | Where in the request path did the time/error go? | spans across services, sampled | needs context propagation; usually sampled, not 100% |
The healthy workflow chains them: a metric alert fires (error rate spiked), you pull up a
trace to see which downstream call is failing, then read the logs for that span to see
the exact error message and stack. Metrics tell you that, traces tell you where, logs tell
you what. The senior move is to glue the pillars together — stamp the same trace_id onto your
log lines so that, from a slow span in a trace, you can jump straight to the matching log records.
A trace makes the distributed bottleneck obvious
A single log line can’t show you the shape of a request that touched five services. A trace can — it breaks the wall-clock time into nested spans, one per unit of work, so the bottleneck is visually obvious.
A user reports a slow checkout. Latency metrics confirm p99 is up, but not why. A trace of one request shows the time broken down by span:
Trace: POST /checkout total 1240ms
├─ api-gateway [██] 40ms
├─ order-service [████████████████████████] 1150ms
│ ├─ validate-cart [█] 20ms
│ ├─ payment-service [██] 60ms
│ └─ inventory-service [███████████████████████] 1050ms ◀ here
│ └─ db: SELECT ... [██████████████████████] 1010ms ◀ N+1 query
└─ notification-service [█] 30ms
No single log line would have revealed this. The trace shows the request’s shape: about 81% of the
wall-clock time (1010 of 1240 ms) is one DB call inside inventory-service. Now you know exactly which service
and which span to fix, with an N+1 query or a missing index as the likely culprit. Metrics raised the alarm; the
trace localized it; the next step is to read that span’s logs.
The mechanism that makes this work is context propagation: a trace ID (and the parent span
ID) is passed in request headers from service to service, so the spans emitted by five different
services can be stitched into one tree. The W3C traceparent header is the standard carrier.
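As a rough sketch of what propagation involves, the snippet below parses an incoming traceparent value and builds the value a service would send on its own outgoing calls: same trace ID, new parent span ID. The helper names, and the choice to pass the service's own span ID in as an argument, are assumptions made for illustration; in practice a tracing SDK such as OpenTelemetry does this for you.

```javascript
// W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for a missing or malformed header.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || "");
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, flags };
}

// Hypothetical helper: the header this service sends downstream.
// The trace ID is carried over unchanged; this service's own span becomes the parent.
function outgoingTraceparent(incomingHeader, ownSpanId) {
  const ctx = parseTraceparent(incomingHeader);
  if (!ctx) return null; // a real tracer would start a new trace here
  return `00-${ctx.traceId}-${ownSpanId}-${ctx.flags}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
console.log(outgoingTraceparent(incoming, "b7ad6b7169203331"));
```

Because every hop reuses the trace ID and names its caller's span as the parent, a backend can rebuild the span tree without the services knowing about each other.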
Structured logs beat string logs
A log line is only useful if you can query it. A free-text line like User 42 failed login from 1.2.3.4 forces a regex to extract anything; a structured (JSON) line makes every field a
queryable key.
// Hard to query: fields are buried in a sentence
logger.info(`User ${userId} failed login from ${ip}`);
// Queryable: each field is a key your log backend can filter and aggregate on
logger.info({
  event: "login_failed",
  userId,
  ip,
  reason: "bad_password",
  trace_id: ctx.traceId, // lets you jump from a trace span to these logs
});
With the structured form you can answer “how many login_failed events from this IP in the last
hour?” with a filter, not a brittle regex, and the trace_id field links the log straight back
to the request’s trace.
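To make the "filter, not regex" point concrete, here is a minimal sketch that answers that question over a handful of JSON log lines. A real log backend would run the equivalent query for you; the countEvents helper and the ts field (epoch milliseconds) are assumptions added for the example.

```javascript
// Hypothetical helper: count matching records among JSON log lines.
// `ts` (epoch milliseconds) is an assumed field, not part of the example above.
function countEvents(lines, { event, ip, sinceMs }) {
  return lines
    .map((line) => JSON.parse(line))
    .filter((rec) => rec.event === event && rec.ip === ip && rec.ts >= sinceMs)
    .length;
}

const now = Date.UTC(2024, 0, 1, 12, 0, 0);
const HOUR_MS = 60 * 60 * 1000;
const logLines = [
  { event: "login_failed", ip: "1.2.3.4", userId: 42, ts: now - 10 * 60 * 1000 },
  { event: "login_failed", ip: "1.2.3.4", userId: 43, ts: now - 50 * 60 * 1000 },
  { event: "login_failed", ip: "1.2.3.4", userId: 42, ts: now - 3 * HOUR_MS }, // too old
  { event: "login_ok", ip: "1.2.3.4", userId: 42, ts: now - 5 * 60 * 1000 }, // other event
  { event: "login_failed", ip: "5.6.7.8", userId: 7, ts: now - 60 * 1000 }, // other IP
].map((rec) => JSON.stringify(rec));

const recentFailures = countEvents(logLines, {
  event: "login_failed",
  ip: "1.2.3.4",
  sinceMs: now - HOUR_MS,
});
console.log(recentFailures); // 2
```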
SLIs, SLOs, and error budgets
“Is it up?” is the wrong question — everything fails sometimes. The senior framing is how reliable, against a target you chose on purpose, and how much failure budget is left. Three terms, then the idea that ties them together:
- SLI (indicator) — a measured ratio that tracks user happiness: good events ÷ valid events (e.g. fraction of requests served successfully in under 300 ms).
- SLO (objective) — your internal target for that SLI over a window: 99.9% of requests succeed over 28 days. Set it from what users actually need, not 100%.
- SLA (agreement) — the contractual promise to customers, with penalties. Always looser than your SLO, so you trip your own alarm before you breach the contract.
| Term | What it is | Example |
|---|---|---|
| SLI | a measured reliability ratio | 99.95% of requests < 300ms this week |
| SLO | your target for the SLI | ≥ 99.9% success over 28 days |
| SLA | the customer contract (penalties) | ≥ 99.5% or service credits owed |
| Error budget | 100% − SLO — the failure you're allowed | 0.1% ≈ 43 min / 30 days |
The unifying idea is the error budget = 100% − SLO. A 99.9% SLO grants a 0.1% budget —
about 43 minutes of failure per 30 days. That budget is a currency: while it’s healthy, spend
it (ship features, take deploy risks); when it’s exhausted, the policy flips — freeze risky
changes and pour effort into reliability until you’re back in budget. It turns “dev velocity vs.
stability” from an argument into a number.
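The arithmetic behind those numbers is small enough to sketch. The function names below are made up for illustration, and the request-based variant assumes the SLI counts requests rather than minutes.

```javascript
// Allowed failure for an SLO expressed as a fraction (0.999 = 99.9%).
function errorBudget(slo, windowDays) {
  const budgetFraction = 1 - slo;
  const windowMinutes = windowDays * 24 * 60;
  return { budgetFraction, allowedMinutes: budgetFraction * windowMinutes };
}

// Hypothetical helper: share of a request-based budget still unspent
// (goes negative once the budget is overspent).
function budgetRemaining(slo, totalRequests, failedRequests) {
  const allowedFailures = (1 - slo) * totalRequests;
  return 1 - failedRequests / allowedFailures;
}

console.log(errorBudget(0.999, 30).allowedMinutes.toFixed(1)); // "43.2"
console.log(errorBudget(0.999, 28).allowedMinutes.toFixed(1)); // "40.3"
console.log(budgetRemaining(0.999, 1000000, 250).toFixed(2)); // "0.75"
```

Note that the window matters: the same 99.9% target allows about 43 minutes over 30 days but only about 40 minutes over a 28-day window.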
The tool concepts
Knowledge check
- Which pillar answers “what was the exact path of this one slow request across services?”
- An error budget is…
- The RED method's three signals for a request-driven service are…
Interview questions
What gets asked on this topic, with how to approach each question, the follow-ups, and the trap.
- What are the three pillars of observability, and what question does each one answer?
Logs are timestamped, discrete records — the narrative of *what happened* on one service. Best for forensic, after-the-fact debugging of a specific event.
Metrics are aggregated numbers over time (counters, gauges, histograms) — they answer *how much / how often / is the trend bad?* Cheap to store, great for dashboards and alerting thresholds.
Traces follow a single request across service boundaries — they answer *where did the time go / which hop failed?* in a distributed system.
The strong answer ties them together: a metric alert tells you something is wrong, a trace localizes which service, and logs from that service explain why.
Follow-ups they push on:
  - Which pillar is most expensive to store at scale, and why?
  - How do you correlate a log line with the trace it belongs to?
Red flag: Treating the three as interchangeable, or claiming logs alone give you observability. Logs do not show cross-service latency the way traces do.
source: Sematext, Three Pillars of Observability
- What's the difference between monitoring and observability?
Monitoring watches for *known* failure modes: you decide in advance what to measure, set thresholds, and alert when a line is crossed. It answers questions you predicted.
Observability is the property of a system that lets you ask *new* questions about its internal state from the outside, without shipping new code — to debug failures you did not anticipate.
The relationship: monitoring is a subset of what observable systems enable. You still need both — monitoring catches the predictable, observability lets you investigate the unknown-unknowns in complex distributed systems.
Follow-ups they push on:
  - What property of your telemetry makes a system observable rather than just monitored?
  - Why do microservices raise the bar for observability versus a monolith?
Red flag: Saying observability is 'just monitoring with more dashboards'. The distinction is exploring unknown-unknowns versus alerting on known thresholds.
source: TechTarget, The 3 pillars of observability
- Your Prometheus storage is exploding after a deploy. What's the most likely cause and the fix?
Almost always a high-cardinality label. Each unique combination of label values is a separate time series; adding an unbounded label like `user_id`, `request_id`, `email`, or a raw URL with IDs multiplies series count explosively.
Fix: drop the offending label, or replace it with a bounded one. Use `http_method`, status-code *class* (2xx/5xx), `route` *template* (`/users/:id`, not `/users/8123`), and `service`: values with a small, fixed set.
If you genuinely need per-user detail, that belongs in logs or traces (high cardinality there is fine), not in metric labels.
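A back-of-the-envelope sketch shows why one unbounded label is enough to cause the blow-up. The cardinalities below are invented, the product is a worst-case upper bound (not every combination occurs in practice), and routeTemplate is a hypothetical normalizer; web frameworks usually expose the matched route pattern directly, which is the better source.

```javascript
// Worst-case series count for one metric: the product of each label's distinct values.
function maxSeriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Hypothetical normalizer: replace purely numeric path segments with ":id".
function routeTemplate(path) {
  return path
    .split("/")
    .map((segment) => (/^\d+$/.test(segment) ? ":id" : segment))
    .join("/");
}

const bounded = { http_method: 5, status_class: 5, route: 20, service: 10 };
console.log(maxSeriesCount(bounded)); // 5000
console.log(maxSeriesCount({ ...bounded, user_id: 100000 })); // 500000000
console.log(routeTemplate("/users/8123")); // "/users/:id"
```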
Follow-ups they push on:
  - Why is high cardinality cheap in tracing but catastrophic in metrics?
  - How would you find which metric is the culprit?
Red flag: Putting unbounded identifiers (user IDs, request IDs, timestamps) into metric labels, the classic cardinality blow-up.
source: Sematext, Three Pillars of Observability (cardinality)
- What problem does OpenTelemetry solve?
OpenTelemetry (OTel) is a vendor-neutral standard — APIs, SDKs, and the Collector — for generating and exporting traces, metrics, and logs.
The problem it solves: before OTel, each backend (Datadog, Jaeger, New Relic, Prometheus) had its own agent and instrumentation library, so switching vendors meant re-instrumenting your code. With OTel you instrument *once* against a common API, then point the Collector at whatever backend you choose — no code change to switch or fan out to several.
It is a CNCF project, and its protocol, OTLP, has become the de facto wire format for telemetry.
Follow-ups they push on:
  - What does the OTel Collector do that an in-process SDK exporter doesn't?
  - How does context propagation let a trace span multiple services?
Red flag: Calling OpenTelemetry a 'monitoring tool' or a backend. It generates and ships telemetry; it does not store or visualize it (that's Prometheus, Grafana, Jaeger, etc.).
source: OpenTelemetry, What is OpenTelemetry?
- How do Prometheus and Grafana divide responsibilities in a typical stack?
Prometheus is the time-series database and collector: it *pulls* (scrapes) metrics from instrumented targets, stores them, and evaluates alerting rules. Querying is done with PromQL.
Grafana is the visualization/dashboard layer: it queries Prometheus (and many other sources) and renders graphs, tables, and alerts for humans.
The one-liner: Prometheus collects and stores the numbers; Grafana makes them legible. They are complementary, not competitors — you commonly run both together.
Follow-ups they push on:
  - Why does Prometheus prefer a pull model over push?
  - Where does Alertmanager fit relative to Prometheus?
Red flag: Thinking Grafana stores metrics. It is a query/visualization front-end over data sources, not a TSDB.
source: Grafana, Prometheus data source
- What are the RED and USE methods, and when would you use each?
RED (Rate, Errors, Duration) is request-centric — for services/endpoints, you watch request rate, error rate, and latency distribution. It answers 'is this service healthy from the caller's view?'
USE (Utilization, Saturation, Errors) is resource-centric — for every resource (CPU, disk, network, memory) you watch how busy it is, how much work is queued, and its error count. It answers 'is this machine/resource a bottleneck?'
Use RED for your request-serving services and USE for the infrastructure underneath them; they are complementary lenses.
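For a feel of what the RED signals look like in numbers, the sketch below derives them from a batch of request records. The record shape ({ status, durationMs }) and the redSummary helper are assumptions for illustration, and durations are taken over all requests for brevity. In the sample, the mean comes out at about 59.6 ms while p99 is 2000 ms, which is why the duration signal is usually tracked as a percentile.

```javascript
// Hypothetical helper: RED signals for one service over a fixed window.
function redSummary(requests, windowSeconds) {
  if (requests.length === 0) return null;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) => durations[Math.max(0, Math.ceil(p * durations.length) - 1)];
  const errorCount = requests.filter((r) => r.status >= 500).length;
  const meanMs = durations.reduce((sum, d) => sum + d, 0) / durations.length;
  return {
    ratePerSec: requests.length / windowSeconds, // Rate
    errorRatio: errorCount / requests.length, // Errors
    meanMs, // Duration (misleading on its own)
    p50Ms: percentile(0.5),
    p99Ms: percentile(0.99),
  };
}

// 100 requests in a 10-second window: 97 fast successes, 1 fast failure, 2 slow successes.
const sample = [
  ...Array.from({ length: 97 }, () => ({ status: 200, durationMs: 20 })),
  { status: 500, durationMs: 20 },
  ...Array.from({ length: 2 }, () => ({ status: 200, durationMs: 2000 })),
];
console.log(redSummary(sample, 10));
```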
Follow-ups they push on:
  - Why is a latency *percentile* (p99) more useful than a mean for the D in RED?
  - What are the four golden signals and how do they relate to RED?
Red flag: Alerting on averages instead of percentiles. A healthy mean hides a brutal p99 tail.
source: Grafana, RED method
- Define SLI, SLO, SLA, and error budget. How do they relate?
An SLI (Service Level *Indicator*) is a measured quantity of service health — e.g. the proportion of HTTP requests that succeed under 300ms.
An SLO (Service Level *Objective*) is the internal target for an SLI over a window — e.g. 99.9% of requests succeed over 28 days. It is what you *aim* for.
An SLA (Service Level *Agreement*) is a contract with customers that includes consequences (refunds, penalties) if you miss it. SLAs are looser than SLOs so you have headroom before you owe anyone money.
The error budget is `1 − SLO`: the allowed amount of unreliability (0.1% for a 99.9% SLO). It turns reliability into a currency: while budget remains you can ship fast and take risks; when it is exhausted you freeze risky launches and prioritize stability. It is the mechanism that lets dev and ops stop arguing about pace versus reliability.
What a strong answer covers:
  - SLI = the measurement; SLO = the internal target on that measurement; SLA = the externally-promised, consequence-bearing version.
  - Set the SLO tighter than the SLA so you get warning before breaching the contract.
  - Error budget = 1 − SLO: the explicit, spendable allowance of failure over the window.
  - 100% is the wrong reliability target: it is impossibly expensive and leaves no budget to ship features.
  - When the budget is spent, the policy is to halt risky releases until reliability recovers.
Quick self-check: Your SLO is 99.9% success over 28 days. What is the error budget?
  - Correct: the error budget is 1 − SLO, so 100% − 99.9% = 0.1% of requests can fail before the objective is breached.
  - Backwards: 99.9% is the success *target*, not the allowed failure.
  - Wrong on two counts: the SLO is an internal target (not the contract), and it explicitly tolerates 0.1% failure.
  - The error budget is derived purely from the SLO; the SLA's penalties are a separate, looser commitment.
Follow-ups they push on:
  - Why is targeting 100% availability the wrong goal?
  - What should happen operationally when the error budget is fully consumed?
  - Why is an SLA usually looser than the corresponding SLO?
Red flag: Conflating SLO and SLA, or setting them equal. The SLA must be looser than the SLO, and the error budget only makes sense as the gap below the SLO target.
source: Google SRE Book, Service Level Objectives
- What are the four golden signals, and why is each one worth alerting on?
Google SRE's four golden signals for a user-facing system are latency, traffic, errors, and saturation.
  - Latency: how long requests take. Crucially, track *successful* and *failed* latency separately, since a fast error can hide a problem.
  - Traffic: demand on the system (requests/sec, transactions/sec).
  - Errors: the rate of failed requests, including the sneaky ones that return 200 but are wrong.
  - Saturation: how 'full' the most constrained resource is (memory, I/O, CPU), the leading indicator of imminent degradation.
If you can only instrument four things, these give you the broadest coverage of user-visible health. RED is essentially the request-side subset (rate/errors/duration); saturation adds the resource-pressure dimension.
What a strong answer covers:
  - The four: latency, traffic, errors, saturation. Broad coverage from a minimal set.
  - Measure latency of failures separately from successes: a fast 500 skews the average and masks the outage.
  - Saturation is a *leading* indicator: it warns before latency and errors blow up.
  - RED (Rate/Errors/Duration) maps onto the request-facing three; saturation is the resource lens (the S in USE-style thinking).
Follow-ups they push on:
  - Why must you separate the latency of failed requests from successful ones?
  - How do the golden signals overlap with the RED method?
Red flag: Folding failed-request latency into your overall latency metric. A flood of instant errors makes p50 latency look great while users are seeing failures.
source: Google SRE Book, Monitoring Distributed Systems (Golden Signals)
- Why is structured logging preferred over plain-text logs, and what is a correlation/trace ID for?
Structured logging emits each log as machine-parseable key/value data (typically JSON), such as `{"level":"error","user_id":42,"latency_ms":910}`, instead of a free-text sentence. The payoff: you can index, filter, and aggregate on fields (`level=error AND service=checkout`) in a log platform, rather than writing fragile regexes against prose.
A correlation ID (a.k.a. request/trace ID) is a unique identifier generated at the edge and propagated through every service and log line for a single request. It lets you reconstruct the entire path of one request across many services by filtering on one value, turning scattered log lines into a coherent story and linking logs to the matching distributed trace.
Together they make logs queryable *and* joinable, which is what makes them useful at scale.
What a strong answer covers:
  - Structured logs are key/value (JSON): indexable and filterable on fields, not parsed from prose.
  - A correlation/trace ID is generated at the edge and propagated so all log lines for one request share it.
  - Filtering on the correlation ID reconstructs one request's journey across every service it touched.
  - It also bridges logs and traces: the same ID ties a log line to its span in a distributed trace.
Follow-ups they push on:
  - How does a correlation ID get propagated across an async message queue?
  - Why does free-text logging become unmanageable in a microservices fleet?
Red flag: Logging unstructured prose (or, worse, logging secrets/PII into those fields). It forces brittle text parsing and can leak sensitive data into the log store.
source: OpenTelemetry, Logs / Correlation
- What is tail-based sampling in distributed tracing, and why use it over head-based sampling?
Tracing every request at full volume is too expensive to store, so you sample. The question is *when* you decide.
Head-based sampling decides at the *start* of a trace — e.g. keep 1% of requests, chosen randomly at the root. It is cheap and simple, but blind: it might throw away the slow or errored traces, which are exactly the ones you want.
Tail-based sampling buffers the spans of a trace and decides *after* it completes, so it can keep traces based on outcome — every error, every request over 1s, plus a baseline sample of normal ones. You get the interesting traces without storing everything.
The tradeoff: tail-based needs to buffer complete traces (memory/coordination in the Collector) and is operationally heavier, but it captures the long tail that head-based sampling probabilistically discards.
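The difference in *when* the decision happens can be sketched in a few lines. This is an illustration of the idea, not how the OTel Collector implements it; the span shape ({ isRoot, durationMs, error }), the policy fields, and the injectable random function are all assumptions for the example.

```javascript
// Head-based: decide at the root, before anything about the outcome is known.
function headSample(rate, random = Math.random) {
  return random() < rate;
}

// Tail-based: decide once all spans of the trace have been buffered.
function tailSample(spans, { slowMs, baselineRate }, random = Math.random) {
  if (spans.some((span) => span.error)) return { keep: true, reason: "error" };
  const root = spans.find((span) => span.isRoot);
  if (root && root.durationMs > slowMs) return { keep: true, reason: "slow" };
  if (random() < baselineRate) return { keep: true, reason: "baseline" };
  return { keep: false, reason: "dropped" };
}

const policy = { slowMs: 1000, baselineRate: 0.01 };
const slowTrace = [
  { isRoot: true, durationMs: 1240, error: false },
  { isRoot: false, durationMs: 1010, error: false },
];
console.log(tailSample(slowTrace, policy)); // { keep: true, reason: 'slow' }
```

headSample never sees the spans, so it has no way to favor the slow or failed trace; tailSample can, at the price of holding every span until the trace is complete.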
What a strong answer covers:
  - Sampling exists because storing 100% of traces is prohibitively expensive at scale.
  - Head-based decides at trace start (cheap, stateless) but can discard the slow/errored traces you most need.
  - Tail-based decides after the trace finishes, so it can retain all errors and high-latency traces.
  - Tail-based costs more: it must buffer whole traces and coordinate spans before deciding.
Follow-ups they push on:
  - Why can't head-based sampling preferentially keep error traces?
  - What infrastructure does the OTel Collector need to do tail-based sampling?
Red flag: Using uniform head-based sampling and then being surprised that the rare production error has no trace. The random sample almost never captured it.
source: OpenTelemetry, Sampling
- What's the difference between a counter, a gauge, and a histogram in Prometheus, and when do you use each?
A counter only ever increases (or resets to zero on restart): total requests served, total errors. You don't read its raw value; you apply `rate()` to get per-second throughput. A counter answers 'how many, cumulatively?'
A gauge goes up and down: current memory usage, in-flight requests, queue depth, temperature. You read it directly; it answers 'what is the value *right now*?'
A histogram samples observations into configurable buckets (e.g. request durations) so you can compute quantiles like p95/p99 with `histogram_quantile()`. It answers 'what does the *distribution* look like?', which is essential for latency, where the mean lies.
Pick by the question: cumulative count → counter; point-in-time level → gauge; distribution/percentiles → histogram.
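To see why a counter is read through a rate and how a quantile falls out of buckets, here is a simplified sketch of both calculations. It is modeled loosely on what `rate()` and `histogram_quantile()` do, not a reimplementation: the real `rate()` also extrapolates to the window edges, and the sample shapes and function names here are assumptions. The quantile helper expects cumulative buckets sorted by upper bound, ending in a +Infinity bucket, and q in (0, 1].

```javascript
// Per-second increase across counter samples [{ t: seconds, v: value }].
// A drop in value is treated as a restart, so counting resumes from zero.
function counterRate(samples) {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : 0;
}

// Estimate a quantile from cumulative buckets [{ le: upperBound, count }],
// interpolating linearly inside the bucket that contains the target rank.
function bucketQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // fall back to the highest finite bound
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN;
}

const counterSamples = [{ t: 0, v: 100 }, { t: 30, v: 160 }, { t: 60, v: 20 }]; // restart before t=60
const latencyBuckets = [
  { le: 0.1, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1, count: 995 },
  { le: Infinity, count: 1000 },
];
console.log(counterRate(counterSamples).toFixed(2)); // "1.33"
console.log(bucketQuantile(0.99, latencyBuckets).toFixed(3)); // "0.833"
```

The estimate is only as precise as the bucket boundaries: here p99 is known to lie between 0.5 s and 1 s, and the interpolation picks a point inside that range.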
What a strong answer covers:
  - Counter = monotonically increasing; query with `rate()`, never read raw (it resets on restart).
  - Gauge = a value that rises and falls; read directly for current state.
  - Histogram = bucketed observations enabling quantiles (p95/p99) via `histogram_quantile()`.
  - Latency belongs in a histogram, not a gauge or an average: the tail is what hurts users.
Quick self-check: You want p99 request latency on a dashboard. Which metric type do you instrument?
  - Correct: histograms bucket observations so `histogram_quantile()` can compute p95/p99 latency.
  - A counter only tracks a monotonically increasing total (e.g. request count), not a distribution.
  - A gauge captures a single current value, not the spread needed for a percentile.
  - You cannot reconstruct a latency distribution from a plain counter; percentiles require bucketed observations.
Follow-ups they push on:
  - Why do you apply `rate()` to a counter instead of reading its value?
  - What's the difference between a Prometheus histogram and a summary?
Red flag: Using a gauge for an ever-growing total (so a restart silently resets it and breaks your dashboards), or averaging latency instead of using a histogram for percentiles.
source: Prometheus, Metric types
- What makes a good alert? Why do teams end up with alert fatigue, and how do you fix it?
A good alert is actionable, urgent, and user-impacting — it pages a human only when something needs a human to intervene *now*. The SRE guidance is to alert on symptoms (users are seeing errors / latency, the SLO is burning) rather than causes (CPU is at 80%), because a high CPU that isn't hurting anyone is not worth waking someone.
Alert fatigue sets in when too many alerts fire — noisy thresholds, alerts on causes that self-heal, duplicate pages for one incident — so on-call engineers start ignoring them, and the real page gets lost in the noise.
Fixes: alert on SLO burn rate rather than raw thresholds; route non-urgent signals to a dashboard or ticket instead of a page; deduplicate and group related alerts; and ruthlessly delete or tune any alert that consistently fires without requiring action. Every page should be reviewed: was it actionable?
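A burn-rate alert can be sketched as follows. The function names and input shape are assumptions for illustration; the example threshold (spending 2% of a 30-day budget within one hour, checked over a long and a short window) follows the pattern described in the Google SRE Workbook chapter cited for this question.

```javascript
// Burn rate: how fast the budget is being spent; 1.0 means exactly on budget.
function burnRate(errorRatio, slo) {
  return errorRatio / (1 - slo);
}

// Burn rate that would consume `budgetFraction` of the budget within `alertWindowHours`.
function burnRateThreshold(budgetFraction, sloWindowHours, alertWindowHours) {
  return (budgetFraction * sloWindowHours) / alertWindowHours;
}

// Page only if both a long and a short window are burning fast, so the alert
// fires on sustained problems and clears soon after recovery.
function shouldPage({ longWindowErrorRatio, shortWindowErrorRatio, slo, threshold }) {
  return (
    burnRate(longWindowErrorRatio, slo) >= threshold &&
    burnRate(shortWindowErrorRatio, slo) >= threshold
  );
}

const threshold = burnRateThreshold(0.02, 30 * 24, 1); // about 14.4
console.log(
  shouldPage({ longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.018, slo: 0.999, threshold })
); // true
```

Compared with a static "error rate above 1%" threshold, the same rule scales with the SLO: a 2% error ratio is a burn rate of about 20 against a 99.9% target but only about 2 against a 99% target.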
What a strong answer covers:
  - Page only on symptoms users feel (errors, latency, SLO burn), not on causes that may be harmless.
  - Every page must be actionable and urgent; if no human action is needed now, it shouldn't page.
  - Alert fatigue comes from noisy/duplicate/self-healing alerts; people then ignore the real one.
  - Fix it with burn-rate alerts, deduplication/grouping, ticket-not-page routing, and pruning useless alerts.
Follow-ups they push on:
  - What is multi-window, multi-burn-rate alerting and why is it better than a static threshold?
  - Why is paging on high CPU usually a bad idea?
Red flag: Alerting on every resource metric (cause-based alerting). It buries the few symptom-based pages that actually matter and trains on-call to dismiss notifications.
source: Google SRE Workbook, Alerting on SLOs