Observability

The three pillars — logs, metrics, traces — what question each answers, and the tool concepts (Prometheus, Grafana, OpenTelemetry).

Observability is the ability to answer questions about your running system from the outside, without shipping new code to add a print statement. The classic framing is the three pillars — logs, metrics, and traces — and the senior skill is knowing which one answers which question, because reaching for the wrong pillar wastes an incident. Monitoring asks the questions you knew to ask in advance (a dashboard, an alert); observability lets you ask new questions of data you already collected, after the system surprises you.

The three pillars at a glance

Each pillar is a different shape of data answering a different kind of question. The art is matching the question to the pillar before you start digging.

FIG 1 · the three pillars. Three columns, each labelled with the question it answers and its data shape: METRICS (numeric time series) answer "is it broken? what's the trend?"; TRACES (spans across services) answer "where did the time go?"; LOGS (discrete event records) answer "what exactly happened here?". Metrics detect that something is wrong; traces locate where in the request path; logs explain what exactly happened.

The three pillars overlap, but each has a job it does best. Pick by the question you’re asking.

| Pillar | Answers | Shape | Cost / caveat |
|---|---|---|---|
| Metrics | Is something wrong, and what's the trend? | numeric time series, aggregated | cheap; but only what you pre-defined, and high cardinality is fatal |
| Logs | What exactly happened on this event? | discrete structured records | high detail; high volume and cost at scale |
| Traces | Where in the request path did the time/error go? | spans across services, sampled | needs context propagation; usually sampled, not 100% |
Metrics detect, logs explain a single event, traces locate it across services.
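
The cardinality caveat in the Metrics row can be made concrete with a little arithmetic. In a label-based time-series store, each distinct combination of label values is its own series, so the worst-case series count for one metric is the product of the distinct values per label. The sketch below is illustrative TypeScript: the label names, the counts, and the `seriesUpperBound` and `templateRoute` helpers are assumptions chosen for the example, not part of any metrics library.

```typescript
// Worst-case number of time series for one metric: the product of the
// number of distinct values each label can take.
function seriesUpperBound(distinctValuesPerLabel: Record<string, number>): number {
  let total = 1;
  for (const label of Object.keys(distinctValuesPerLabel)) {
    total *= distinctValuesPerLabel[label];
  }
  return total;
}

// Replace numeric path segments with a placeholder so the route label
// stays bounded: "/users/8123" becomes "/users/:id".
function templateRoute(path: string): string {
  return path.replace(/\/\d+(?=\/|$)/g, "/:id");
}

// Hypothetical label sets for an HTTP request counter.
const boundedLabels = { method: 5, status_class: 5, route: 30, service: 10 };
const unboundedLabels = { ...boundedLabels, user_id: 100000 };

console.log(seriesUpperBound(boundedLabels));        // 7500
console.log(seriesUpperBound(unboundedLabels));      // 750000000
console.log(templateRoute("/users/8123/orders/77")); // /users/:id/orders/:id
```

This is an upper bound, since not every combination occurs in practice, but it shows why one unbounded label turns thousands of series into hundreds of millions.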

The healthy workflow chains them: a metric alert fires (error rate spiked), you pull up a trace to see which downstream call is failing, then read the logs for that span to see the exact error message and stack. Metrics tell you that, traces tell you where, logs tell you what. The senior move is to glue the pillars together — stamp the same trace_id onto your log lines so that, from a slow span in a trace, you can jump straight to the matching log records.

A trace makes the distributed bottleneck obvious

A single log line can’t show you the shape of a request that touched five services. A trace can — it breaks the wall-clock time into nested spans, one per unit of work, so the bottleneck is visually obvious.

FIG 2 · trace waterfall. A waterfall of spans for POST /checkout (total 1240ms); the same span breakdown is listed in the text trace below. One slow checkout, broken into spans: about 85% of the wall-clock time is spent in inventory-service, nearly all of it (1010ms, about 81% of the total) in a single DB call.
One slow checkout request

A user reports a slow checkout. Latency metrics confirm p99 is up, but not why. A trace of one request shows the time broken down by span:

Trace: POST /checkout                                 total 1240ms
├─ api-gateway            [██]                            40ms
├─ order-service          [████████████████████████]   1150ms
│   ├─ validate-cart      [█]                             20ms
│   ├─ payment-service    [██]                            60ms
│   └─ inventory-service  [███████████████████████]     1050ms  ◀ here
│       └─ db: SELECT ... [██████████████████████]      1010ms  ◀ N+1 query
└─ notification-service   [█]                             30ms

No single log line would have revealed this. The trace shows the request’s shape: about 85% of the wall-clock time is spent in inventory-service, and nearly all of that (1010ms, about 81% of the total) is one DB call. Now you know exactly which service and which span to fix, most likely an N+1 query or a missing index. Metrics raised the alarm; the trace localized it; the next step is to read that span’s logs.
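
To make the percentages concrete, here is a minimal TypeScript sketch that computes each span's share of the request from the durations in the trace above. The `SpanTiming` type and the `shareOfTotal` helper are hypothetical names used only for illustration; a real tracing backend does this arithmetic for you.

```typescript
// Durations copied from the example trace above (milliseconds).
interface SpanTiming {
  label: string;
  durationMs: number;
}

const totalMs = 1240;
const checkoutSpans: SpanTiming[] = [
  { label: "api-gateway", durationMs: 40 },
  { label: "order-service", durationMs: 1150 },
  { label: "inventory-service", durationMs: 1050 },
  { label: "db: SELECT ...", durationMs: 1010 },
  { label: "notification-service", durationMs: 30 },
];

// Share of the request's wall-clock time, rounded to a whole percent.
function shareOfTotal(span: SpanTiming, total: number): number {
  return Math.round((span.durationMs / total) * 100);
}

for (const span of checkoutSpans) {
  console.log(`${span.label}: ${shareOfTotal(span, totalMs)}%`);
}
```

With these numbers, inventory-service accounts for about 85% of the request and its DB call alone for about 81%, while the gateway and notification spans are in the low single digits.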

The mechanism that makes this work is context propagation: a trace ID (and the parent span ID) is passed in request headers from service to service, so the spans emitted by five different services can be stitched into one tree. The W3C traceparent header is the standard carrier.
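
As a rough illustration of that propagation step, the sketch below parses an incoming `traceparent` value and builds the header a service would send downstream. It assumes version `00` of the W3C format (`version-traceId-parentId-flags`) and only preserves the sampled flag; the helper names (`parseTraceparent`, `formatTraceparent`, `childContext`) and the span ID are made up for the example. In practice a tracing SDK does this for you.

```typescript
// W3C Trace Context, version 00: traceparent = version-traceId-parentId-flags
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the caller's span ID
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const parentId = match[2];
  const flags = match[3];
  // All-zero trace or parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

// A service keeps the trace ID but substitutes its own span ID as the
// parent before calling the next service.
function childContext(incoming: TraceContext, ownSpanId: string): TraceContext {
  return { traceId: incoming.traceId, parentId: ownSpanId, sampled: incoming.sampled };
}

const incomingCtx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (incomingCtx !== null) {
  // Hypothetical span ID generated by this service for its own span.
  console.log(formatTraceparent(childContext(incomingCtx, "b7ad6b7169203331")));
}
```

Because the trace ID never changes along the way, a backend can group spans from all five services under one tree, and the parent ID tells it where each span hangs.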

Structured logs beat string logs

A log line is only useful if you can query it. A free-text line like User 42 failed login from 1.2.3.4 forces a regex to extract anything; a structured (JSON) line makes every field a queryable key.

Free-text vs structured logging
// Hard to query: fields are buried in a sentence
logger.info(`User ${userId} failed login from ${ip}`);

// Queryable: each field is a key your log backend can filter and aggregate on
logger.info({
  event: "login_failed",
  userId,
  ip,
  reason: "bad_password",
  trace_id: ctx.traceId,   // lets you jump from a trace span to these logs
});

With the structured form you can answer “how many login_failed events from this IP in the last hour?” with a filter, not a brittle regex — and the trace_id field links the log straight back to the request’s trace.
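
The "filter, not regex" point can be sketched in a few lines. The example below counts `login_failed` events from one IP in the last hour over an array of JSON log lines. It is only an illustration of the idea: the `ts` timestamp field (epoch milliseconds), the `countEvents` helper, and the sample data are assumptions, and a real log backend would run this as an indexed query rather than a loop.

```typescript
// One structured log record. `ts` (epoch milliseconds) is an assumed field;
// most loggers add a timestamp, but the field name varies by library.
interface LoginLogRecord {
  event: string;
  ip: string;
  ts: number;
}

// Count records of one event type from one IP at or after `sinceMs`.
function countEvents(jsonLines: string[], eventName: string, ip: string, sinceMs: number): number {
  let count = 0;
  for (const line of jsonLines) {
    const record = JSON.parse(line) as LoginLogRecord;
    if (record.event === eventName && record.ip === ip && record.ts >= sinceMs) {
      count += 1;
    }
  }
  return count;
}

const HOUR_MS = 60 * 60 * 1000;
const nowMs = 1700000000000; // fixed "now" so the example is reproducible

const sampleLogLines: string[] = [
  JSON.stringify({ event: "login_failed", userId: 42, ip: "1.2.3.4", ts: nowMs - 10 * 60 * 1000 }),
  JSON.stringify({ event: "login_failed", userId: 43, ip: "1.2.3.4", ts: nowMs - 50 * 60 * 1000 }),
  JSON.stringify({ event: "login_failed", userId: 42, ip: "1.2.3.4", ts: nowMs - 2 * HOUR_MS }), // older than an hour
  JSON.stringify({ event: "login_ok", userId: 42, ip: "1.2.3.4", ts: nowMs - 5 * 60 * 1000 }),   // different event
  JSON.stringify({ event: "login_failed", userId: 7, ip: "5.6.7.8", ts: nowMs - 60 * 1000 }),    // different IP
];

console.log(countEvents(sampleLogLines, "login_failed", "1.2.3.4", nowMs - HOUR_MS)); // 2
```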

SLIs, SLOs, and error budgets

“Is it up?” is the wrong question — everything fails sometimes. The senior framing is how reliable, against a target you chose on purpose, and how much failure budget is left. Three terms, then the idea that ties them together:

  • SLI (indicator) — a measured ratio that tracks user happiness: good events ÷ valid events (e.g. fraction of requests served successfully in under 300 ms).
  • SLO (objective) — your internal target for that SLI over a window: 99.9% of requests succeed over 28 days. Set it from what users actually need, not 100%.
  • SLA (agreement) — the contractual promise to customers, with penalties. Always looser than your SLO, so you trip your own alarm before you breach the contract.

| Term | What it is | Example |
|---|---|---|
| SLI | a measured reliability ratio | 99.95% of requests < 300ms this week |
| SLO | your target for the SLI | ≥ 99.9% success over 28 days |
| SLA | the customer contract (penalties) | ≥ 99.5% or service credits owed |
| Error budget | 100% − SLO: the failure you're allowed | 0.1% ≈ 43 min / 30 days |
SLI measures, SLO targets, SLA promises; the error budget is what's left to spend.

The unifying idea is the error budget = 100% − SLO. A 99.9% SLO grants a 0.1% budget — about 43 minutes of failure per 30 days. That budget is a currency: while it’s healthy, spend it (ship features, take deploy risks); when it’s exhausted, the policy flips — freeze risky changes and pour effort into reliability until you’re back in budget. It turns “dev velocity vs. stability” from an argument into a number.
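
The budget arithmetic is simple enough to check by hand, but a small sketch makes the dependence on the window explicit. The function names below (`errorBudgetMinutes`, `budgetRemaining`) and the request counts are illustrative assumptions, not a standard API.

```typescript
// Minutes of total failure allowed by an availability SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = (100 - sloPercent) / 100;
  return budgetFraction * windowDays * 24 * 60;
}

// Fraction of a request-based error budget still unspent
// (goes negative once the budget is overspent).
function budgetRemaining(sloPercent: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = totalRequests * ((100 - sloPercent) / 100);
  return 1 - failedRequests / allowedFailures;
}

console.log(errorBudgetMinutes(99.9, 30).toFixed(1));        // "43.2"
console.log(errorBudgetMinutes(99.9, 28).toFixed(1));        // "40.3"
console.log(budgetRemaining(99.9, 1000000, 250).toFixed(2)); // "0.75"
```

The same 99.9% target allows about 43 minutes over 30 days but only about 40 minutes over a 28-day window, so the window is part of the SLO, not a detail. In the last line, 250 failures out of a million requests have spent a quarter of the 1,000 failures the budget allows.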

The tool concepts

01 Learning objectives

02 Curated reading

03 Knowledge check

knowledge check · 3 questions · pass ≥ 70%

  1. (medium) Which pillar answers “what was the exact path of this one slow request across services?”

  2. (medium) An error budget is…

  3. (medium) The RED method's three signals for a request-driven service are…

04 Interview questions

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • [Commonly asked · mid · concept · very common] What are the three pillars of observability, and what question does each one answer?

    Logs are timestamped, discrete records — the narrative of *what happened* on one service. Best for forensic, after-the-fact debugging of a specific event.

    Metrics are aggregated numbers over time (counters, gauges, histograms) — they answer *how much / how often / is the trend bad?* Cheap to store, great for dashboards and alerting thresholds.

    Traces follow a single request across service boundaries — they answer *where did the time go / which hop failed?* in a distributed system.

    The strong answer ties them together: a metric alert tells you something is wrong, a trace localizes which service, and logs from that service explain why.

    Red flag Treating the three as interchangeable, or claiming logs alone give you observability — logs do not show cross-service latency the way traces do.

    source: Sematext, Three Pillars of Observability
  • [Commonly asked · senior · concept · common] What's the difference between monitoring and observability?

    Monitoring watches for *known* failure modes: you decide in advance what to measure, set thresholds, and alert when a line is crossed. It answers questions you predicted.

    Observability is the property of a system that lets you ask *new* questions about its internal state from the outside, without shipping new code — to debug failures you did not anticipate.

    The relationship: monitoring is a subset of what observable systems enable. You still need both — monitoring catches the predictable, observability lets you investigate the unknown-unknowns in complex distributed systems.

    Red flag Saying observability is 'just monitoring with more dashboards' — the distinction is exploring unknown-unknowns versus alerting on known thresholds.

    source: TechTarget, The 3 pillars of observability
  • [Commonly asked · senior · debug · common] Your Prometheus storage is exploding after a deploy. What's the most likely cause and the fix?

    Almost always a high-cardinality label. Each unique combination of label values is a separate time series; adding an unbounded label like user_id, request_id, email, or a raw URL with IDs multiplies series count explosively.

    Fix: drop the offending label, or replace it with a bounded one. Use http_method, status-code *class* (2xx/5xx), route *template* (/users/:id, not /users/8123), and service — values with a small, fixed set.

    If you genuinely need per-user detail, that belongs in logs or traces (high cardinality there is fine), not in metric labels.

    Red flag Putting unbounded identifiers (user IDs, request IDs, timestamps) into metric labels — the classic cardinality blow-up.

    source: Sematext, Three Pillars of Observability (cardinality)
  • [Commonly asked · mid · concept · common] What problem does OpenTelemetry solve?

    OpenTelemetry (OTel) is a vendor-neutral standard — APIs, SDKs, and the Collector — for generating and exporting traces, metrics, and logs.

    The problem it solves: before OTel, each backend (Datadog, Jaeger, New Relic, Prometheus) had its own agent and instrumentation library, so switching vendors meant re-instrumenting your code. With OTel you instrument *once* against a common API, then point the Collector at whatever backend you choose — no code change to switch or fan out to several.

    OpenTelemetry is a CNCF project, and its protocol, OTLP, has become the de-facto wire format for telemetry.

    Red flag Calling OpenTelemetry a 'monitoring tool' or a backend — it generates and ships telemetry; it does not store or visualize it (that's Prometheus, Grafana, Jaeger, etc.).

    source: OpenTelemetry, What is OpenTelemetry?
  • [Commonly asked · junior · concept · common] How do Prometheus and Grafana divide responsibilities in a typical stack?

    Prometheus is the time-series database and collector: it *pulls* (scrapes) metrics from instrumented targets, stores them, and evaluates alerting rules. Querying is done with PromQL.

    Grafana is the visualization/dashboard layer: it queries Prometheus (and many other sources) and renders graphs, tables, and alerts for humans.

    The one-liner: Prometheus collects and stores the numbers; Grafana makes them legible. They are complementary, not competitors — you commonly run both together.

    Red flag Thinking Grafana stores metrics — it is a query/visualization front-end over data sources, not a TSDB.

    source: Grafana, Prometheus data source
  • [Commonly asked · senior · concept · occasional] What are the RED and USE methods, and when would you use each?

    RED (Rate, Errors, Duration) is request-centric — for services/endpoints, you watch request rate, error rate, and latency distribution. It answers 'is this service healthy from the caller's view?'

    USE (Utilization, Saturation, Errors) is resource-centric — for every resource (CPU, disk, network, memory) you watch how busy it is, how much work is queued, and its error count. It answers 'is this machine/resource a bottleneck?'

    Use RED for your request-serving services and USE for the infrastructure underneath them; they are complementary lenses.

    Red flag Alerting on averages instead of percentiles — a healthy mean hides a brutal p99 tail.

    source: Grafana, RED method
  • [★ must-know · Google · senior · concept · very common] Define SLI, SLO, SLA, and error budget. How do they relate?

    An SLI (Service Level *Indicator*) is a measured quantity of service health — e.g. the proportion of HTTP requests that succeed under 300ms.

    An SLO (Service Level *Objective*) is the internal target for an SLI over a window — e.g. 99.9% of requests succeed over 28 days. It is what you *aim* for.

    An SLA (Service Level *Agreement*) is a contract with customers that includes consequences (refunds, penalties) if you miss it. SLAs are looser than SLOs so you have headroom before you owe anyone money.

    The error budget is 1 − SLO — the allowed amount of unreliability (0.1% for a 99.9% SLO). It turns reliability into a currency: while budget remains you can ship fast and take risks; when it is exhausted you freeze risky launches and prioritize stability. It is the mechanism that lets dev and ops stop arguing about pace versus reliability.

    What a strong answer covers
    • SLI = the measurement; SLO = the internal target on that measurement; SLA = the externally-promised, consequence-bearing version.

    • Set the SLO tighter than the SLA so you get warning before breaching the contract.

    • Error budget = 1 − SLO — the explicit, spendable allowance of failure over the window.

    • 100% is the wrong reliability target: it is impossibly expensive and leaves no budget to ship features.

    • When the budget is spent, the policy is to halt risky releases until reliability recovers.

    Quick self-check

    Your SLO is 99.9% success over 28 days. What is the error budget?

    Red flag Conflating SLO and SLA, or setting them equal — the SLA must be looser than the SLO, and the error budget only makes sense as the gap below the SLO target.

    source: Google SRE Book, Service Level Objectives
  • [Google · mid · concept · common] What are the four golden signals, and why is each one worth alerting on?

    Google SRE's four golden signals for a user-facing system are latency, traffic, errors, and saturation.

    Latency — how long requests take; crucially, track *successful* and *failed* latency separately, since a fast error can hide a problem. Traffic — demand on the system (requests/sec, transactions/sec). Errors — the rate of failed requests, including the sneaky ones that return 200 but are wrong. Saturation — how 'full' the most constrained resource is (memory, I/O, CPU), the leading indicator of imminent degradation.

    If you can only instrument four things, these give you the broadest coverage of user-visible health. RED is essentially the request-side subset (rate/errors/duration); saturation adds the resource-pressure dimension.

    What a strong answer covers
    • The four: latency, traffic, errors, saturation — broad coverage from a minimal set.

    • Measure latency of failures separately from successes — a fast 500 skews the average and masks the outage.

    • Saturation is a *leading* indicator: it warns before latency and errors blow up.

    • RED (Rate/Errors/Duration) maps onto the request-facing three; saturation is the resource lens (the S in USE-style thinking).

    Red flag Folding failed-request latency into your overall latency metric — a flood of instant errors makes p50 latency look great while users are seeing failures.

    source: Google SRE Book, Monitoring Distributed Systems (Golden Signals)
  • [Commonly asked · mid · concept · common] Why is structured logging preferred over plain-text logs, and what is a correlation/trace ID for?

    Structured logging emits each log as machine-parseable key/value data (typically JSON) — {"level":"error","user_id":42,"latency_ms":910} — instead of a free-text sentence. The payoff: you can index, filter, and aggregate on fields (level=error AND service=checkout) in a log platform, rather than writing fragile regexes against prose.

    A correlation ID (a.k.a. request/trace ID) is a unique identifier generated at the edge and propagated through every service and log line for a single request. It lets you reconstruct the entire path of one request across many services by filtering on one value — turning scattered log lines into a coherent story, and linking logs to the matching distributed trace.

    Together they make logs queryable *and* joinable, which is what makes them useful at scale.

    What a strong answer covers
    • Structured logs are key/value (JSON) — indexable and filterable on fields, not parsed from prose.

    • A correlation/trace ID is generated at the edge and propagated so all log lines for one request share it.

    • Filtering on the correlation ID reconstructs one request's journey across every service it touched.

    • It also bridges logs and traces — the same ID ties a log line to its span in a distributed trace.

    Red flag Logging unstructured prose (or, worse, logging secrets/PII into those fields) — it forces brittle text parsing and can leak sensitive data into the log store.

    source: OpenTelemetry, Logs / Correlation
  • [Commonly asked · senior · concept · occasional] What is tail-based sampling in distributed tracing, and why use it over head-based sampling?

    Tracing every request at full volume is too expensive to store, so you sample. The question is *when* you decide.

    Head-based sampling decides at the *start* of a trace — e.g. keep 1% of requests, chosen randomly at the root. It is cheap and simple, but blind: it might throw away the slow or errored traces, which are exactly the ones you want.

    Tail-based sampling buffers the spans of a trace and decides *after* it completes, so it can keep traces based on outcome — every error, every request over 1s, plus a baseline sample of normal ones. You get the interesting traces without storing everything.

    The tradeoff: tail-based needs to buffer complete traces (memory/coordination in the Collector) and is operationally heavier, but it captures the long tail that head-based sampling probabilistically discards.

    What a strong answer covers
    • Sampling exists because storing 100% of traces is prohibitively expensive at scale.

    • Head-based decides at trace start (cheap, stateless) but can discard the slow/errored traces you most need.

    • Tail-based decides after the trace finishes, so it can retain all errors and high-latency traces.

    • Tail-based costs more: it must buffer whole traces and coordinate spans before deciding.

    Red flag Using uniform head-based sampling and then being surprised that the rare production error has no trace — the random sample almost never captured it.

    source: OpenTelemetry, Sampling
  • [Commonly asked · mid · concept · occasional] What's the difference between a counter, a gauge, and a histogram in Prometheus, and when do you use each?

    A counter only ever increases (or resets to zero on restart): total requests served, total errors. You don't read its raw value — you apply rate() to get per-second throughput. A counter answers 'how many, cumulatively?'

    A gauge goes up and down: current memory usage, in-flight requests, queue depth, temperature. You read it directly; it answers 'what is the value *right now*?'

    A histogram samples observations into configurable buckets (e.g. request durations) so you can compute quantiles like p95/p99 with histogram_quantile(). It answers 'what does the *distribution* look like?' — essential for latency, where the mean lies.

    Pick by the question: cumulative count → counter; point-in-time level → gauge; distribution/percentiles → histogram.

    What a strong answer covers
    • Counter = monotonically increasing; query with rate(), never read raw (it resets on restart).

    • Gauge = a value that rises and falls; read directly for current state.

    • Histogram = bucketed observations enabling quantiles (p95/p99) via histogram_quantile().

    • Latency belongs in a histogram, not a gauge or an average — the tail is what hurts users.

    Quick self-check

    You want p99 request latency on a dashboard. Which metric type do you instrument?

    Red flag Using a gauge for an ever-growing total (a restart drops it back to zero, and without the reset handling that rate() applies to counters your dashboards show a false dip), or averaging latency instead of using a histogram for percentiles.

    source: Prometheus, Metric types
  • [Google · senior · concept · common] What makes a good alert? Why do teams end up with alert fatigue, and how do you fix it?

    A good alert is actionable, urgent, and user-impacting — it pages a human only when something needs a human to intervene *now*. The SRE guidance is to alert on symptoms (users are seeing errors / latency, the SLO is burning) rather than causes (CPU is at 80%), because a high CPU that isn't hurting anyone is not worth waking someone.

    Alert fatigue sets in when too many alerts fire — noisy thresholds, alerts on causes that self-heal, duplicate pages for one incident — so on-call engineers start ignoring them, and the real page gets lost in the noise.

    Fixes: alert on SLO burn rate rather than raw thresholds; route non-urgent signals to a dashboard or ticket instead of a page; deduplicate and group related alerts; and ruthlessly delete or tune any alert that consistently fires without requiring action. Every page should be reviewed: was it actionable?

    What a strong answer covers
    • Page only on symptoms users feel (errors, latency, SLO burn) — not on causes that may be harmless.

    • Every page must be actionable and urgent; if no human action is needed now, it shouldn't page.

    • Alert fatigue comes from noisy/duplicate/self-healing alerts; people then ignore the real one.

    • Fix it with burn-rate alerts, deduplication/grouping, ticket-not-page routing, and pruning useless alerts.

    Red flag Alerting on every resource metric (cause-based alerting) — it buries the few symptom-based pages that actually matter and trains on-call to dismiss notifications.

    source: Google SRE Workbook, Alerting on SLOs
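
The burn-rate alerting mentioned in the last answer can be sketched as follows. Burn rate is the observed error ratio divided by the ratio the SLO allows, so 1 means the budget lasts exactly the SLO window. The example pages only when a long and a short window both exceed a threshold. The threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours), a pairing used as an example in the SRE Workbook chapter cited above; the type and function names and the request counts are assumptions for illustration.

```typescript
// Request counts observed over one lookback window.
interface WindowCounts {
  total: number;
  failed: number;
}

// Burn rate: how fast the error budget is being spent relative to plan.
// 1 means the budget lasts exactly the SLO window; 10 means ten times too fast.
function burnRate(counts: WindowCounts, sloPercent: number): number {
  if (counts.total === 0) return 0;
  const errorRatio = counts.failed / counts.total;
  const budgetRatio = (100 - sloPercent) / 100;
  return errorRatio / budgetRatio;
}

// Page only when both windows exceed the threshold: the long window shows
// the burn is significant, the short one shows it is still happening.
function shouldPage(longWin: WindowCounts, shortWin: WindowCounts, sloPercent: number, threshold: number): boolean {
  return burnRate(longWin, sloPercent) > threshold && burnRate(shortWin, sloPercent) > threshold;
}

const lastHour: WindowCounts = { total: 100000, failed: 2000 };    // 2% errors
const lastFiveMin: WindowCounts = { total: 8000, failed: 150 };    // still failing
const recoveredFiveMin: WindowCounts = { total: 8000, failed: 1 }; // already recovered

console.log(shouldPage(lastHour, lastFiveMin, 99.9, 14.4));      // true
console.log(shouldPage(lastHour, recoveredFiveMin, 99.9, 14.4)); // false
```

The second case is the point of the short window: the hour still looks bad, but the problem has already stopped, so waking someone up would add noise rather than help.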