6.2.7 ★ core [J][A] 12 interview Q's

Observability

The three pillars — logs, metrics, traces — what question each answers, and the tool concepts (Prometheus, Grafana, OpenTelemetry).

Observability is the ability to answer questions about your running system from the outside, without shipping new code to add a print statement. The classic framing is the three pillars — logs, metrics, and traces — and the senior skill is knowing which one answers which question, because reaching for the wrong pillar wastes an incident. Monitoring asks the questions you knew to ask in advance (a dashboard, an alert); observability lets you ask new questions of data you already collected, after the system surprises you.

At a glance

Metrics: numbers over time → is it broken, what's the trend? cheap, pre-aggregated, alert on these
Logs: discrete events → what exactly happened on this one request? high detail, high volume
Traces: one request across services → where did the latency/error go? needs context propagation
Workflow: metric alerts → trace localizes → logs explain. that → where → what
Cardinality: distinct label combos. user_id as a metric label = OOM. per-request detail belongs in logs/traces
SLI / SLO / SLA: measured ratio / your target / the customer contract (looser than the SLO)
Error budget: 100% − SLO — spend it on velocity; freeze risky changes when it's gone
RED / USE: Rate-Errors-Duration (services) · Utilization-Saturation-Errors (resources)
Stack: Prometheus (metrics) · Grafana (dashboards) · OpenTelemetry (vendor-neutral instrumentation)

Key vocabulary

Logs: Timestamped, discrete event records — the narrative of what happened. Best as structured (JSON) lines so they're queryable. High detail, high volume.
Metrics: Numeric measurements aggregated over time — request rate, error rate, p99 latency, CPU. Cheap to store and ideal for trends, dashboards, and alerts, but they tell you that something changed, not why.
Traces: The end-to-end path of a single request as it fans out across services, broken into timed spans. Answers "where did the latency go?" in a distributed call.
Span: One timed unit of work within a trace (e.g. a DB call), carrying a start/end time, a parent span, and attributes. Spans nest to form the trace's tree.
Cardinality: The number of distinct label/tag combinations on a metric. High cardinality (e.g. a label per user_id) explodes storage and is the most common way to break a metrics backend.
Context propagation: Passing the trace ID and parent span ID across service boundaries (in request headers), so spans from many services stitch into one trace tree.
OpenTelemetry (OTel): A vendor-neutral standard and SDK for generating and exporting all three signals, so you instrument once and can switch backends without re-instrumenting.

The three pillars at a glance

Each pillar is a different shape of data answering a different kind of question. The art is matching the question to the pillar before you start digging.

FIG 1 · the three pillars. Metrics detect that something is wrong; traces locate where in the request path; logs explain what exactly happened.

The three pillars overlap, but each has a job it does best. Pick by the question you’re asking.

Pillar	Answers	Shape	Cost / caveat
Metrics	Is something wrong, and what's the trend?	numeric time series, aggregated	cheap; but only what you pre-defined — and high cardinality is fatal
Logs	What exactly happened on this event?	discrete structured records	high detail; high volume and cost at scale
Traces	Where in the request path did the time/error go?	spans across services, sampled	needs context propagation; usually sampled, not 100%

Metrics detect, logs explain a single event, traces locate it across services.

The healthy workflow chains them: a metric alert fires (error rate spiked), you pull up a trace to see which downstream call is failing, then read the logs for that span to see the exact error message and stack. Metrics tell you that, traces tell you where, logs tell you what. The senior move is to glue the pillars together — stamp the same trace_id onto your log lines so that, from a slow span in a trace, you can jump straight to the matching log records.

A trace makes the distributed bottleneck obvious

A single log line can’t show you the shape of a request that touched five services. A trace can — it breaks the wall-clock time into nested spans, one per unit of work, so the bottleneck is visually obvious.

FIG 2 · trace waterfall. One slow checkout, broken into spans. About 85% of the wall-clock time is inside inventory-service, almost all of it in a single DB call.

One slow checkout request

A user reports a slow checkout. Latency metrics confirm p99 is up, but not why. A trace of one request shows the time broken down by span:

Trace: POST /checkout                                 total 1240ms
├─ api-gateway            [██]                            40ms
├─ order-service          [████████████████████████]   1150ms
│   ├─ validate-cart      [█]                             20ms
│   ├─ payment-service    [██]                            60ms
│   └─ inventory-service  [███████████████████████]     1050ms  ◀ here
│       └─ db: SELECT ... [██████████████████████]      1010ms  ◀ N+1 query
└─ notification-service   [█]                             30ms

No single log line would have revealed this. The trace shows the request’s shape: about 85% of the wall-clock time (1050 of 1240 ms) is inside inventory-service, and 1010 ms of that is one DB call. Now you know exactly which service and which span to fix: likely a missing index, or an N+1 pattern if that span wraps many small queries. Metrics raised the alarm; the trace localized it; the next step is to read that span’s logs.
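One rough way to find the bottleneck programmatically is to compare each span's self time: its own duration minus the time spent in its children. The sketch below uses the durations from the waterfall above. The `SpanNode` shape and the `selfTimeMs` and `findBottleneck` helpers are hypothetical names for illustration, and the sketch assumes child spans run one after another rather than in parallel.

```typescript
// Hypothetical span shape for illustration; real SDKs record start/end timestamps and IDs.
interface SpanNode {
  label: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time = own duration minus time spent in child spans.
// Assumes children run sequentially; overlapping children would need interval math.
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, child) => sum + child.durationMs, 0);
  return span.durationMs - childTotal;
}

// Depth-first search for the span with the largest self time.
function findBottleneck(root: SpanNode): SpanNode {
  let worst = root;
  for (const child of root.children) {
    const candidate = findBottleneck(child);
    if (selfTimeMs(candidate) > selfTimeMs(worst)) {
      worst = candidate;
    }
  }
  return worst;
}

const leaf = (label: string, durationMs: number): SpanNode => ({ label, durationMs, children: [] });

// Durations copied from the waterfall above.
const checkoutTrace: SpanNode = {
  label: "POST /checkout",
  durationMs: 1240,
  children: [
    leaf("api-gateway", 40),
    {
      label: "order-service",
      durationMs: 1150,
      children: [
        leaf("validate-cart", 20),
        leaf("payment-service", 60),
        { label: "inventory-service", durationMs: 1050, children: [leaf("db: SELECT", 1010)] },
      ],
    },
    leaf("notification-service", 30),
  ],
};

const bottleneck = findBottleneck(checkoutTrace);
const sharePercent = Math.round((bottleneck.durationMs / checkoutTrace.durationMs) * 100);
console.log(`${bottleneck.label}: ${bottleneck.durationMs}ms (${sharePercent}% of the request)`);
// prints "db: SELECT: 1010ms (81% of the request)"
```

Tracing UIs do this visually; the point of the sketch is only that the tree structure, not any single record, is what exposes the slow span.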

The mechanism that makes this work is context propagation: a trace ID (and the parent span ID) is passed in request headers from service to service, so the spans emitted by five different services can be stitched into one tree. The W3C traceparent header is the standard carrier.
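As a minimal sketch of what propagation involves, the code below parses an incoming `traceparent` value and builds the outgoing one for a downstream call. It handles only version `00` of the header and keeps only the sampled flag. The helper names (`parseTraceparent`, `buildTraceparent`) and the example span ID are assumptions for illustration; in practice an OpenTelemetry SDK does this for you.

```typescript
// Fields of a version-00 W3C traceparent header: version-traceId-parentId-flags.
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters, shared by every span in the trace
  parentId: string; // 16 lowercase hex characters: the caller's span ID
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header);
  if (match === null) return null;
  const traceId = match[1];
  const parentId = match[2];
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(match[3], 16) & 1) === 1 };
}

// The outgoing header keeps the trace ID but names this service's span as the parent.
// Simplification: only the sampled flag is carried forward.
function buildTraceparent(incoming: TraceContext, ownSpanId: string): string {
  return `00-${incoming.traceId}-${ownSpanId}-${incoming.sampled ? "01" : "00"}`;
}

const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const incomingContext = parseTraceparent(incomingHeader);
const ownSpanId = "b7ad6b7169203331"; // hypothetical ID for the span this service starts
const outgoingHeader = incomingContext === null ? null : buildTraceparent(incomingContext, ownSpanId);
console.log(outgoingHeader);
// prints "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"
```

The trace ID is unchanged from hop to hop, which is what lets a backend stitch spans from different services into one tree.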

Structured logs beat string logs

A log line is only useful if you can query it. A free-text line like User 42 failed login from 1.2.3.4 forces a regex to extract anything; a structured (JSON) line makes every field a queryable key.

Free-text vs structured logging

// Hard to query: fields are buried in a sentence
logger.info(`User ${userId} failed login from ${ip}`);

// Queryable: each field is a key your log backend can filter and aggregate on
logger.info({
  event: "login_failed",
  userId,
  ip,
  reason: "bad_password",
  trace_id: ctx.traceId,   // lets you jump from a trace span to these logs
});

With the structured form you can answer “how many login_failed events from this IP in the last hour?” with a filter, not a brittle regex — and the trace_id field links the log straight back to the request’s trace.
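To illustrate the "filter, not regex" point, here is a small sketch that counts matching events from JSON log lines. The log lines, the `LoginLogEntry` shape, and the `countEvents` helper are made up for the example; a real log backend would run the equivalent filter server-side, and the time-range condition is omitted for brevity.

```typescript
// Hypothetical shape of one structured log line, mirroring the fields used above.
interface LoginLogEntry {
  event: string;
  userId: number;
  ip: string;
  reason: string;
  trace_id: string;
}

// JSON lines as a structured logger might emit them (all values are made up).
const logLines: string[] = [
  '{"event":"login_failed","userId":42,"ip":"1.2.3.4","reason":"bad_password","trace_id":"t-001"}',
  '{"event":"login_ok","userId":7,"ip":"5.6.7.8","reason":"","trace_id":"t-002"}',
  '{"event":"login_failed","userId":42,"ip":"1.2.3.4","reason":"bad_password","trace_id":"t-003"}',
  '{"event":"login_failed","userId":9,"ip":"9.9.9.9","reason":"locked","trace_id":"t-004"}',
];

// Every field is a key, so the query is a plain equality filter.
function countEvents(lines: string[], eventName: string, ip: string): number {
  return lines
    .map((line) => JSON.parse(line) as LoginLogEntry)
    .filter((entry) => entry.event === eventName && entry.ip === ip).length;
}

const failedFromIp = countEvents(logLines, "login_failed", "1.2.3.4");
console.log(failedFromIp); // prints 2
```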

SLIs, SLOs, and error budgets

“Is it up?” is the wrong question — everything fails sometimes. The senior framing is how reliable, against a target you chose on purpose, and how much failure budget is left. Three terms, then the idea that ties them together:

SLI (indicator) — a measured ratio that tracks user happiness: good events ÷ valid events (e.g. fraction of requests served successfully in under 300 ms).
SLO (objective) — your internal target for that SLI over a window: 99.9% of requests succeed over 28 days. Set it from what users actually need, not 100%.
SLA (agreement) — the contractual promise to customers, with penalties. Always looser than your SLO, so you trip your own alarm before you breach the contract.

Term	What it is	Example
SLI	a measured reliability ratio	99.95% of requests < 300ms this week
SLO	your target for the SLI	≥ 99.9% success over 28 days
SLA	the customer contract (penalties)	≥ 99.5% or service credits owed
Error budget	`100% − SLO` — the failure you're allowed	0.1% ≈ 43 min / 30 days

SLI measures, SLO targets, SLA promises; the error budget is what's left to spend.

The unifying idea is the error budget = 100% − SLO. A 99.9% SLO grants a 0.1% budget — about 43 minutes of failure per 30 days. That budget is a currency: while it’s healthy, spend it (ship features, take deploy risks); when it’s exhausted, the policy flips — freeze risky changes and pour effort into reliability until you’re back in budget. It turns “dev velocity vs. stability” from an argument into a number.
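The arithmetic behind the "43 minutes" figure can be sketched as below. `errorBudgetMinutes` is a hypothetical helper, and it treats the budget as pure downtime, which is a simplification: a request-based SLI spends budget per failed request rather than per minute.

```typescript
// Error budget expressed as minutes of total failure allowed in the window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 99.9% SLO leaves 0.1%
  const windowMinutes = windowDays * 24 * 60;
  return budgetFraction * windowMinutes;
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1));  // prints "43.2"
console.log(errorBudgetMinutes(99.9, 28).toFixed(1));  // prints "40.3"
console.log(errorBudgetMinutes(99.99, 30).toFixed(1)); // prints "4.3"
```

Note how the window matters: the same 99.9% target over 28 days allows about 40 minutes, and each extra nine cuts the budget by a factor of ten.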

The tool concepts

Prometheus scrapes and stores metrics and evaluates alerting rules, Grafana queries those metrics and renders dashboards, and OpenTelemetry is the vendor-neutral instrumentation layer that generates and exports all three signals.

Key points

Three pillars: logs = narrative of what happened, metrics = quantitative trends, traces = the request path across services.
Pick by question: metrics say that something’s wrong, traces say where, logs say what exactly — and chain them in that order during an incident.
Prefer structured (JSON) logs so they’re queryable; stamp a trace_id to link logs to traces.
Traces rely on context propagation (trace ID in headers) and are usually sampled, not captured 100%.
Tooling: Prometheus (metrics) + Grafana (dashboards) + OpenTelemetry (vendor-neutral instrumentation for all three).
Keep metric labels low-cardinality — per-request identifiers belong in logs/traces, never in metric labels — and never log secrets or PII.
Reliability is a budget: SLI (measured) vs SLO (target you chose, not 100%); the error budget (100% − SLO) gates how much risk/velocity you can spend. Measure health with RED (services) and USE (resources).

01 Learning objectives

02 Curated reading

OpenTelemetry — What is OpenTelemetry?
optional 12m — Vendor-neutral framing of the three pillars.

03 Knowledge check

3 questions · pass ≥ 70%

01 · medium
Which pillar answers “what was the exact path of this one slow request across services?”
02 · medium
An error budget is…
03 · medium
The RED method's three signals for a request-driven service are…

04 Interview questions

What gets asked on this topic, with how to approach each question, the follow-ups, and the trap. Company tags are best-effort & sourced.

Commonly asked · mid · concept · very common
What are the three pillars of observability, and what question does each one answer?
Logs are timestamped, discrete records — the narrative of *what happened* on one service. Best for forensic, after-the-fact debugging of a specific event.
Metrics are aggregated numbers over time (counters, gauges, histograms) — they answer *how much / how often / is the trend bad?* Cheap to store, great for dashboards and alerting thresholds.
Traces follow a single request across service boundaries — they answer *where did the time go / which hop failed?* in a distributed system.
The strong answer ties them together: a metric alert tells you something is wrong, a trace localizes which service, and logs from that service explain why.
Follow-ups they push on
- Which pillar is most expensive to store at scale, and why?
- How do you correlate a log line with the trace it belongs to?
Red flag: Treating the three as interchangeable, or claiming logs alone give you observability. Logs do not show cross-service latency the way traces do.
source: Sematext, Three Pillars of Observability

Commonly asked · senior · concept · common
What's the difference between monitoring and observability?
Monitoring watches for *known* failure modes: you decide in advance what to measure, set thresholds, and alert when a line is crossed. It answers questions you predicted.
Observability is the property of a system that lets you ask *new* questions about its internal state from the outside, without shipping new code — to debug failures you did not anticipate.
The relationship: monitoring is a subset of what observable systems enable. You still need both — monitoring catches the predictable, observability lets you investigate the unknown-unknowns in complex distributed systems.
Follow-ups they push on
- What property of your telemetry makes a system observable rather than just monitored?
- Why do microservices raise the bar for observability versus a monolith?
Red flag: Saying observability is 'just monitoring with more dashboards'. The distinction is exploring unknown-unknowns versus alerting on known thresholds.
source: TechTarget, The 3 pillars of observability

Commonly asked · senior · debug · common
Your Prometheus storage is exploding after a deploy. What's the most likely cause and the fix?
Almost always a high-cardinality label. Each unique combination of label values is a separate time series; adding an unbounded label like user_id, request_id, email, or a raw URL with IDs multiplies series count explosively.
Fix: drop the offending label, or replace it with a bounded one. Use http_method, status-code *class* (2xx/5xx), route *template* (/users/:id, not /users/8123), and service — values with a small, fixed set.
If you genuinely need per-user detail, that belongs in logs or traces (high cardinality there is fine), not in metric labels.
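A back-of-the-envelope sketch shows why one unbounded label is enough to blow up a metric, and how a route template keeps the `route` label bounded. The label counts are invented for illustration, and `maxSeriesCount` and `routeTemplate` are hypothetical helpers, not part of any Prometheus client library.

```typescript
// Upper bound on time series for one metric: the product of distinct values per label.
// (The real count is the number of combinations that actually occur, which is at most this.)
function maxSeriesCount(labelValueCounts: { [label: string]: number }): number {
  let total = 1;
  for (const label of Object.keys(labelValueCounts)) {
    total *= labelValueCounts[label];
  }
  return total;
}

// Collapse numeric path segments so /users/8123 and /users/77 share one label value.
function routeTemplate(path: string): string {
  return path.replace(/\/\d+(?=\/|$)/g, "/:id");
}

// Made-up label counts for a typical HTTP metric.
const boundedSeries = maxSeriesCount({ http_method: 5, status_class: 5, route: 40, service: 10 });
const withUserIdSeries = maxSeriesCount({ http_method: 5, status_class: 5, route: 40, service: 10, user_id: 100000 });

console.log(boundedSeries);                          // prints 10000
console.log(withUserIdSeries);                       // prints 1000000000
console.log(routeTemplate("/users/8123"));           // prints "/users/:id"
console.log(routeTemplate("/users/8123/orders/77")); // prints "/users/:id/orders/:id"
```

Most web frameworks expose the matched route pattern directly, which is preferable to a regex like this one when it is available.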
Follow-ups they push on
- Why is high cardinality cheap in tracing but catastrophic in metrics?
- How would you find which metric is the culprit?
Red flag: Putting unbounded identifiers (user IDs, request IDs, timestamps) into metric labels, the classic cardinality blow-up.
source: Sematext, Three Pillars of Observability (cardinality)

Commonly asked · mid · concept · common
What problem does OpenTelemetry solve?
OpenTelemetry (OTel) is a vendor-neutral standard — APIs, SDKs, and the Collector — for generating and exporting traces, metrics, and logs.
The problem it solves: before OTel, each backend (Datadog, Jaeger, New Relic, Prometheus) had its own agent and instrumentation library, so switching vendors meant re-instrumenting your code. With OTel you instrument *once* against a common API, then point the Collector at whatever backend you choose — no code change to switch or fan out to several.
It is a CNCF project, and its wire protocol (OTLP) has become the de-facto standard for shipping telemetry.
Follow-ups they push on
- What does the OTel Collector do that an in-process SDK exporter doesn't?
- How does context propagation let a trace span multiple services?
Red flag: Calling OpenTelemetry a 'monitoring tool' or a backend. It generates and ships telemetry; it does not store or visualize it (that's Prometheus, Grafana, Jaeger, etc.).
source: OpenTelemetry, What is OpenTelemetry?

Commonly asked · junior · concept · common
How do Prometheus and Grafana divide responsibilities in a typical stack?
Prometheus is the time-series database and collector: it *pulls* (scrapes) metrics from instrumented targets, stores them, and evaluates alerting rules. Querying is done with PromQL.
Grafana is the visualization/dashboard layer: it queries Prometheus (and many other sources) and renders graphs, tables, and alerts for humans.
The one-liner: Prometheus collects and stores the numbers; Grafana makes them legible. They are complementary, not competitors — you commonly run both together.
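To make the pull model concrete, the sketch below renders a counter in the Prometheus text exposition format, which is what an instrumented target serves (conventionally at `/metrics`) for Prometheus to scrape. The metric name, label names, and the `renderCounter` helper are assumptions for illustration; a real service would use a Prometheus client library, and this sketch does not escape special characters in label values.

```typescript
// One sample of a counter: a label set plus the current cumulative value.
interface CounterSample {
  labels: { [key: string]: string };
  value: number;
}

// Render HELP/TYPE comments followed by one line per label combination.
function renderCounter(metricName: string, help: string, samples: CounterSample[]): string {
  const lines: string[] = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const sample of samples) {
    const labelText = Object.keys(sample.labels)
      .map((key) => `${key}="${sample.labels[key]}"`)
      .join(",");
    lines.push(`${metricName}{${labelText}} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderCounter("http_requests_total", "Total HTTP requests handled.", [
  { labels: { method: "GET", status_class: "2xx" }, value: 1027 },
  { labels: { method: "GET", status_class: "5xx" }, value: 3 },
]);
console.log(exposition);
```

A Grafana panel would then query Prometheus with PromQL, for example `sum by (status_class) (rate(http_requests_total[5m]))`, rather than reading this text itself.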
Follow-ups they push on
- Why does Prometheus prefer a pull model over push?
- Where does Alertmanager fit relative to Prometheus?
Red flag: Thinking Grafana stores metrics. It is a query/visualization front-end over data sources, not a TSDB.
source: Grafana, Prometheus data source

Commonly asked · senior · concept · occasional
What are the RED and USE methods, and when would you use each?
RED (Rate, Errors, Duration) is request-centric — for services/endpoints, you watch request rate, error rate, and latency distribution. It answers 'is this service healthy from the caller's view?'
USE (Utilization, Saturation, Errors) is resource-centric — for every resource (CPU, disk, network, memory) you watch how busy it is, how much work is queued, and its error count. It answers 'is this machine/resource a bottleneck?'
Use RED for your request-serving services and USE for the infrastructure underneath them; they are complementary lenses.
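As a rough sketch of the RED side, the code below derives rate, error ratio, and a duration percentile from raw request records. In production these come from counters and histograms rather than raw samples; the `RequestRecord` shape, the helper names, and the sample data are all invented for the example.

```typescript
interface RequestRecord {
  durationMs: number;
  ok: boolean;
}

interface RedSummary {
  ratePerSecond: number;
  errorRatio: number;
  p99Ms: number;
  meanMs: number;
}

// Nearest-rank percentile over raw samples; p is in the range 0..100.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Rate, Errors, Duration for one observation window (assumes at least one request).
function redSummary(requests: RequestRecord[], windowSeconds: number): RedSummary {
  const durations = requests.map((r) => r.durationMs);
  const failedCount = requests.filter((r) => !r.ok).length;
  const totalMs = durations.reduce((sum, d) => sum + d, 0);
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorRatio: failedCount / requests.length,
    p99Ms: percentile(durations, 99),
    meanMs: totalMs / requests.length,
  };
}

// Made-up window: 98 fast successes and 2 slow failures over 10 seconds.
const sampleRequests: RequestRecord[] = [];
for (let i = 0; i < 98; i++) sampleRequests.push({ durationMs: 50, ok: true });
for (let i = 0; i < 2; i++) sampleRequests.push({ durationMs: 2000, ok: false });

const summary = redSummary(sampleRequests, 10);
console.log(summary);
// rate 10 req/s, error ratio 0.02, p99 2000 ms, mean 89 ms
```

The gap between the 89 ms mean and the 2000 ms p99 is the reason the Duration signal is usually tracked as a percentile.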
Follow-ups they push on
- Why is a latency *percentile* (p99) more useful than a mean for the D in RED?
- What are the four golden signals and how do they relate to RED?
Red flag: Alerting on averages instead of percentiles. A healthy mean hides a brutal p99 tail.
source: Grafana, RED method

★ must-know · Google · senior · concept · very common
Define SLI, SLO, SLA, and error budget. How do they relate?
An SLI (Service Level *Indicator*) is a measured quantity of service health — e.g. the proportion of HTTP requests that succeed under 300ms.
An SLO (Service Level *Objective*) is the internal target for an SLI over a window — e.g. 99.9% of requests succeed over 28 days. It is what you *aim* for.
An SLA (Service Level *Agreement*) is a contract with customers that includes consequences (refunds, penalties) if you miss it. SLAs are looser than SLOs so you have headroom before you owe anyone money.
The error budget is 1 − SLO — the allowed amount of unreliability (0.1% for a 99.9% SLO). It turns reliability into a currency: while budget remains you can ship fast and take risks; when it is exhausted you freeze risky launches and prioritize stability. It is the mechanism that lets dev and ops stop arguing about pace versus reliability.
What a strong answer covers
- SLI = the measurement; SLO = the internal target on that measurement; SLA = the externally-promised, consequence-bearing version.
- Set the SLO tighter than the SLA so you get warning before breaching the contract.
- Error budget = 1 − SLO — the explicit, spendable allowance of failure over the window.
- 100% is the wrong reliability target: it is impossibly expensive and leaves no budget to ship features.
- When the budget is spent, the policy is to halt risky releases until reliability recovers.
Quick self-check
Your SLO is 99.9% success over 28 days. What is the error budget?
Follow-ups they push on
- Why is targeting 100% availability the wrong goal?
- What should happen operationally when the error budget is fully consumed?
- Why is an SLA usually looser than the corresponding SLO?
Red flag: Conflating SLO and SLA, or setting them equal. The SLA should be looser than the SLO, and the error budget is the gap between the SLO target and 100%.
source: Google SRE Book, Service Level Objectives

Google · mid · concept · common
What are the four golden signals, and why is each one worth alerting on?
Google SRE's four golden signals for a user-facing system are latency, traffic, errors, and saturation.
Latency: how long requests take; crucially, track *successful* and *failed* latency separately, since a fast error can hide a problem.
Traffic: demand on the system (requests/sec, transactions/sec).
Errors: the rate of failed requests, including the sneaky ones that return 200 but are wrong.
Saturation: how 'full' the most constrained resource is (memory, I/O, CPU), the leading indicator of imminent degradation.
If you can only instrument four things, these give you the broadest coverage of user-visible health. RED is essentially the request-side subset (rate/errors/duration); saturation adds the resource-pressure dimension.
What a strong answer covers
- The four: latency, traffic, errors, saturation — broad coverage from a minimal set.
- Measure latency of failures separately from successes — a fast 500 skews the average and masks the outage.
- Saturation is a *leading* indicator: it warns before latency and errors blow up.
- RED (Rate/Errors/Duration) maps onto the request-facing three; saturation is the resource lens (the S in USE-style thinking).
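The point about separating failed-request latency can be seen with a small, made-up sample: when most requests fail instantly, the combined median looks excellent while the successful requests are actually slow. The `LatencySample` shape and the `medianMs` helper are hypothetical names for this sketch.

```typescript
interface LatencySample {
  durationMs: number;
  ok: boolean;
}

// Nearest-rank median of request durations.
function medianMs(samples: LatencySample[]): number {
  if (samples.length === 0) return NaN;
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(sorted.length / 2));
  return sorted[rank - 1];
}

// Made-up outage: 60 requests fail instantly, 40 succeed slowly.
const outageSamples: LatencySample[] = [];
for (let i = 0; i < 60; i++) outageSamples.push({ durationMs: 5, ok: false });
for (let i = 0; i < 40; i++) outageSamples.push({ durationMs: 400, ok: true });

const combinedP50 = medianMs(outageSamples);
const successP50 = medianMs(outageSamples.filter((s) => s.ok));
const failureP50 = medianMs(outageSamples.filter((s) => !s.ok));

console.log(combinedP50); // prints 5
console.log(successP50);  // prints 400
console.log(failureP50);  // prints 5
```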
Follow-ups they push on
- Why must you separate the latency of failed requests from successful ones?
- How do the golden signals overlap with the RED method?
Red flag: Folding failed-request latency into your overall latency metric. A flood of instant errors makes p50 latency look great while users are seeing failures.
source: Google SRE Book, Monitoring Distributed Systems (Golden Signals)

Commonly asked · mid · concept · common
Why is structured logging preferred over plain-text logs, and what is a correlation/trace ID for?
Structured logging emits each log as machine-parseable key/value data (typically JSON) — {"level":"error","user_id":42,"latency_ms":910} — instead of a free-text sentence. The payoff: you can index, filter, and aggregate on fields (level=error AND service=checkout) in a log platform, rather than writing fragile regexes against prose.
A correlation ID (a.k.a. request/trace ID) is a unique identifier generated at the edge and propagated through every service and log line for a single request. It lets you reconstruct the entire path of one request across many services by filtering on one value — turning scattered log lines into a coherent story, and linking logs to the matching distributed trace.
Together they make logs queryable *and* joinable, which is what makes them useful at scale.
What a strong answer covers
- Structured logs are key/value (JSON) — indexable and filterable on fields, not parsed from prose.
- A correlation/trace ID is generated at the edge and propagated so all log lines for one request share it.
- Filtering on the correlation ID reconstructs one request's journey across every service it touched.
- It also bridges logs and traces — the same ID ties a log line to its span in a distributed trace.
Follow-ups they push on
- How does a correlation ID get propagated across an async message queue?
- Why does free-text logging become unmanageable in a microservices fleet?
Red flag: Logging unstructured prose (or, worse, logging secrets/PII into those fields). It forces brittle text parsing and can leak sensitive data into the log store.
source: OpenTelemetry, Logs / Correlation

Commonly asked · senior · concept · occasional
What is tail-based sampling in distributed tracing, and why use it over head-based sampling?
Tracing every request at full volume is too expensive to store, so you sample. The question is *when* you decide.
Head-based sampling decides at the *start* of a trace — e.g. keep 1% of requests, chosen randomly at the root. It is cheap and simple, but blind: it might throw away the slow or errored traces, which are exactly the ones you want.
Tail-based sampling buffers the spans of a trace and decides *after* it completes, so it can keep traces based on outcome — every error, every request over 1s, plus a baseline sample of normal ones. You get the interesting traces without storing everything.
The tradeoff: tail-based needs to buffer complete traces (memory/coordination in the Collector) and is operationally heavier, but it captures the long tail that head-based sampling probabilistically discards.
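The difference in when the decision happens can be sketched as two small functions. This is an illustration under stated assumptions, not the algorithm of any particular SDK or Collector: the `FinishedSpan` shape, the way a keep/drop bucket is derived from the trace ID, and the thresholds are all invented for the example.

```typescript
// Minimal view of a completed span; real spans carry far more.
interface FinishedSpan {
  durationMs: number;
  error: boolean;
}

// Head-based: decide from the trace ID alone, before any work has happened.
// Deriving the decision from the ID (instead of Math.random) lets every
// service reach the same answer for the same trace.
function headSample(traceId: string, keepRatio: number): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000; // value in [0, 1)
  return bucket < keepRatio;
}

// Tail-based: decide after all spans of the trace have been buffered,
// so the outcome (errors, latency) can drive the decision.
function tailSample(traceId: string, spans: FinishedSpan[], slowMs: number, baselineRatio: number): boolean {
  const hasError = spans.some((span) => span.error);
  const isSlow = spans.some((span) => span.durationMs > slowMs);
  return hasError || isSlow || headSample(traceId, baselineRatio);
}

const exampleTraceId = "4bf92f3577b34da6a3ce929d0e0e4736";
const slowTrace: FinishedSpan[] = [{ durationMs: 1200, error: false }];
const normalTrace: FinishedSpan[] = [{ durationMs: 50, error: false }];

console.log(headSample(exampleTraceId, 0.01));                    // prints false
console.log(tailSample(exampleTraceId, slowTrace, 1000, 0.01));   // prints true
console.log(tailSample(exampleTraceId, normalTrace, 1000, 0.01)); // prints false
```

With a 1% head sample this particular trace is dropped regardless of how it turns out; the tail-based policy keeps it once a span exceeds the 1 s threshold.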
What a strong answer covers
- Sampling exists because storing 100% of traces is prohibitively expensive at scale.
- Head-based decides at trace start (cheap, stateless) but can discard the slow/errored traces you most need.
- Tail-based decides after the trace finishes, so it can retain all errors and high-latency traces.
- Tail-based costs more: it must buffer whole traces and coordinate spans before deciding.
Follow-ups they push on
- Why can't head-based sampling preferentially keep error traces?
- What infrastructure does the OTel Collector need to do tail-based sampling?
Red flag: Using uniform head-based sampling and then being surprised that the rare production error has no trace. The random sample almost never captures it.
source: OpenTelemetry, Sampling

Commonly asked · mid · concept · occasional
What's the difference between a counter, a gauge, and a histogram in Prometheus, and when do you use each?
A counter only ever increases (or resets to zero on restart): total requests served, total errors. You don't read its raw value — you apply rate() to get per-second throughput. A counter answers 'how many, cumulatively?'
A gauge goes up and down: current memory usage, in-flight requests, queue depth, temperature. You read it directly; it answers 'what is the value *right now*?'
A histogram samples observations into configurable buckets (e.g. request durations) so you can compute quantiles like p95/p99 with histogram_quantile(). It answers 'what does the *distribution* look like?' — essential for latency, where the mean lies.
Pick by the question: cumulative count → counter; point-in-time level → gauge; distribution/percentiles → histogram.
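To show why bucketed observations are enough to estimate a percentile, here is a simplified sketch in the spirit of `histogram_quantile()`: find the bucket containing the target rank and interpolate linearly inside it. It is not Prometheus's implementation; the `Bucket` shape, the `estimateQuantile` helper, and the bucket counts are assumptions for illustration.

```typescript
// One cumulative bucket: how many observations were <= upperBound ("le" in Prometheus).
interface Bucket {
  upperBound: number;
  cumulativeCount: number;
}

// Buckets must be sorted by upperBound and end with an Infinity bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let countBelow = 0;
  for (const bucket of buckets) {
    if (bucket.cumulativeCount >= rank) {
      // Nothing to interpolate towards inside the +Inf bucket.
      if (bucket.upperBound === Infinity) return lowerBound;
      const inBucket = bucket.cumulativeCount - countBelow;
      const fraction = (rank - countBelow) / inBucket;
      return lowerBound + (bucket.upperBound - lowerBound) * fraction;
    }
    lowerBound = bucket.upperBound;
    countBelow = bucket.cumulativeCount;
  }
  return lowerBound;
}

// Made-up request durations in seconds: 1000 observations in four buckets.
const durationBuckets: Bucket[] = [
  { upperBound: 0.1, cumulativeCount: 900 },
  { upperBound: 0.5, cumulativeCount: 980 },
  { upperBound: 1, cumulativeCount: 995 },
  { upperBound: Infinity, cumulativeCount: 1000 },
];

console.log(estimateQuantile(0.5, durationBuckets).toFixed(3));  // prints "0.056"
console.log(estimateQuantile(0.99, durationBuckets).toFixed(3)); // prints "0.833"
```

The estimate is only as precise as the bucket boundaries, which is why bucket layout should bracket the latencies you care about (for example, your SLO threshold).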
What a strong answer covers
- Counter = monotonically increasing; query with rate(), never read raw (it resets on restart).
- Gauge = a value that rises and falls; read directly for current state.
- Histogram = bucketed observations enabling quantiles (p95/p99) via histogram_quantile().
- Latency belongs in a histogram, not a gauge or an average — the tail is what hurts users.
Quick self-check
You want p99 request latency on a dashboard. Which metric type do you instrument?
Follow-ups they push on
- Why do you apply `rate()` to a counter instead of reading its value?
- What's the difference between a Prometheus histogram and a summary?
Red flag: Using a gauge for an ever-growing total (rate() only compensates for restart resets on counters, so a restart breaks your dashboards), or averaging latency instead of using a histogram for percentiles.
source: Prometheus, Metric types

Google · senior · concept · common
What makes a good alert? Why do teams end up with alert fatigue, and how do you fix it?
A good alert is actionable, urgent, and user-impacting — it pages a human only when something needs a human to intervene *now*. The SRE guidance is to alert on symptoms (users are seeing errors / latency, the SLO is burning) rather than causes (CPU is at 80%), because a high CPU that isn't hurting anyone is not worth waking someone.
Alert fatigue sets in when too many alerts fire — noisy thresholds, alerts on causes that self-heal, duplicate pages for one incident — so on-call engineers start ignoring them, and the real page gets lost in the noise.
Fixes: alert on SLO burn rate rather than raw thresholds; route non-urgent signals to a dashboard or ticket instead of a page; deduplicate and group related alerts; and ruthlessly delete or tune any alert that consistently fires without requiring action. Every page should be reviewed: was it actionable?
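Burn-rate alerting can be sketched in a few lines: the burn rate is the observed error ratio divided by the error budget, and a page fires only when both a long and a short window are burning fast. The 14.4 threshold below is the commonly cited example value for a one-hour window (it corresponds to spending 2% of a 30-day budget in one hour); the function names and error ratios are assumptions for illustration.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are arriving.
// A burn rate of 1 would spend the whole budget exactly at the end of the SLO window.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Page only if both windows exceed the threshold: the long window shows the
// problem is significant, the short window shows it is still happening.
function shouldPage(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  sloTarget: number,
  threshold: number
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

const sloTarget = 0.999;
const fastBurnThreshold = 14.4;

// Ongoing incident: 2% errors over the last hour, 1.8% over the last 5 minutes.
console.log(shouldPage(0.02, 0.018, sloTarget, fastBurnThreshold)); // prints true

// Already recovered: the hour looks bad, but the last 5 minutes are healthy.
console.log(shouldPage(0.02, 0.0005, sloTarget, fastBurnThreshold)); // prints false
```

The second case is what a static error-rate threshold on the hourly window would get wrong: it would keep paging after the incident has resolved itself.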
What a strong answer covers
- Page only on symptoms users feel (errors, latency, SLO burn) — not on causes that may be harmless.
- Every page must be actionable and urgent; if no human action is needed now, it shouldn't page.
- Alert fatigue comes from noisy/duplicate/self-healing alerts; people then ignore the real one.
- Fix it with burn-rate alerts, deduplication/grouping, ticket-not-page routing, and pruning useless alerts.
Follow-ups they push on
- What is multi-window, multi-burn-rate alerting and why is it better than a static threshold?
- Why is paging on high CPU usually a bad idea?
Red flag: Alerting on every resource metric (cause-based alerting). It buries the few symptom-based pages that actually matter and trains on-call to dismiss notifications.
source: Google SRE Workbook, Alerting on SLOs