2.7 ★ core [A][I] 15 interview Q's

Distributed systems & scaling

Monolith vs microservices, API gateway/service discovery, CAP (CP vs AP and its scalability caveat), eventual consistency, and horizontal vs vertical scaling.

Distributed systems trade simplicity for scale and availability — and every one of those trades is governed by physics you can’t cheat: networks partition, machines fail independently, and clocks disagree. When a partition happens you must give something up. The senior-level skill is naming exactly what you give up, and — just as important — not over-applying CAP to questions it doesn’t answer.

At a glance

monolith: one deployable — start here; simpler ops, no network between modules
microservices: independently deployable services — team autonomy, but distributed failure + ops cost
API gateway: single front door: routing, auth, rate-limit, TLS
service discovery: how services find each other's moving addresses (registry / DNS)
vertical scaling: bigger box — simple, but a ceiling + a single point of failure
horizontal scaling: more boxes behind a load balancer — needs statelessness
CAP: under a partition: pick Consistency or Availability
CAP caveat: says nothing about scalability or latency
timeout: bound every call — a dead dependency must free the thread, not hang it
retry + backoff + jitter: retry transient + idempotent only; randomize delay to avoid a retry storm
circuit breaker: open after N failures → fail fast; half-open to test recovery
bulkhead: isolate a pool per dependency so one slow call can't sink everything

Key vocabulary

Monolith vs microservices: A monolith is one deployable; microservices split it into independently deployable services. Microservices buy team autonomy and independent scaling at the cost of network calls, distributed failure, and ops overhead — don't reach for them by default.
API gateway vs service discovery: A gateway is the single front door (routing, auth, rate-limiting, TLS). Service discovery is how services find each other's moving addresses (a registry like Consul/DNS).
Horizontal vs vertical scaling: Vertical = a bigger machine (simple, but a ceiling + a single point of failure). Horizontal = more machines behind a load balancer (near-unlimited, but requires statelessness).
Partition (network): A break in connectivity that splits the cluster into groups that can't reach each other. Not optional — links fail, switches reboot, regions disconnect — so you must plan for it, which is what CAP forces you to do.
Eventual consistency: Replicas may briefly disagree, but converge if writes stop. The price of staying available under partitions — accepted because strong consistency across a network is slow and fragile.

Monolith vs microservices

The default answer to “monolith or microservices?” is monolith — and that’s the senior answer, not the lazy one. A monolith is one deployable: function calls instead of network hops, one database transaction instead of distributed coordination, one thing to deploy and trace. Microservices buy team autonomy (squads ship independently) and independent scaling (scale the hot service alone), but you pay in distributed failure (every call can now time out or partially fail), operational overhead (N deploy pipelines, N dashboards, service discovery, distributed tracing), and data consistency headaches (no cross-service transaction). Split only when a real boundary demands it — a team that needs to deploy on its own cadence, or a component with a wildly different scaling profile — never by default or for résumé reasons.

Two supporting pieces show up the moment you have many services: an API gateway is the single front door that handles cross-cutting concerns (routing, auth, rate-limiting, TLS termination) so each service doesn’t reimplement them; service discovery is how services locate each other when instances come and go (a registry like Consul, or DNS-based discovery) — because in a dynamic cluster you can’t hardcode addresses.

CAP — and the caveat that earns you senior points

FIG 1 · the CAP triangle Under a partition (P is forced), you keep at most one of Consistency or Availability. CA only exists when the network never fails — i.e. not in the real world.

CAP says: when a network partition happens (and it will), you can keep Consistency (every read sees the latest write) or Availability (every request gets a non-error response), but not both. The crucial framing interviewers reward: there’s no choice to make when the network is healthy — CAP only bites during a partition. Calling a database “CP” or “AP” is shorthand for what it does when the cluster splits.

Choice	Means under partition	Examples	Gotcha
CP (consistency)	reject/stall requests rather than return stale data	MongoDB, HBase, ZooKeeper	you sacrifice availability — some requests fail
AP (availability)	always answer, even if data is stale; reconcile later	Cassandra, CouchDB, DynamoDB	clients may read stale values — needs conflict resolution

The choice only exists during a partition; the rest of the time you get both.

CP vs AP, concretely

Scenario: a network partition splits your replicas into two groups.
A client writes X=2 to group 1; another client reads X from group 2.

CP system (e.g. ZooKeeper):
  group 2 cannot confirm it has the latest value → it ERRORS / blocks the read.
  → no stale data, but the read is unavailable until the partition heals.

AP system (e.g. Cassandra):
  group 2 happily returns the old X=1.
  → always available, but the read is stale until replicas reconcile (eventual consistency).

Neither is “better” — you choose based on whether stale reads or failed reads hurt your domain more. A bank balance or an inventory count wants CP (a wrong number is worse than a brief outage); a social feed, a “last seen” timestamp, or a like-count wants AP (a slightly stale value is fine; an error is not). The senior move is to scope the choice per data type, not per whole system — most real architectures are CP for money and AP for everything else.

Scaling and statelessness

You scale vertically (a bigger box) until you hit the ceiling — the largest instance money can buy — or the single-point-of-failure risk gets unacceptable, then you scale horizontally (more boxes behind a load balancer), which is effectively unlimited. But horizontal scaling has a hard prerequisite: your services must be stateless — no per-request or per-session state stuck on one instance. Only then can any box serve any request and the load balancer route freely, add capacity by cloning, and survive a node death without dropping sessions.

The practical rule: push state into a shared store (Redis for sessions, a database for data, object storage for files) so the application tier stays disposable — any instance is interchangeable, scaling is “launch more clones,” and a crashed box loses nothing but in-flight requests. This is the same statelessness principle from HTTP (2.1) and API design (2.2), now paying off at the infrastructure layer.

Resilience patterns

A remote call can be slow, fail, or hang — and without guardrails one sick dependency cascades into a full outage as callers pile up waiting on it. Four patterns, layered, keep a partial failure partial:

Timeout — never wait forever. Bound every network call so a dead dependency frees the caller’s thread instead of hanging it. The most important and most-forgotten one.
Retry with exponential backoff + jitter — retry a transient failure, but wait longer each attempt (1s → 2s → 4s) and add random jitter, or every client retries in lockstep and you DDoS your own recovering service (a “retry storm”). Only retry idempotent operations.
Circuit breaker — after N consecutive failures, open the circuit and fail fast without even calling the dead dependency; after a cooldown go half-open to test one request, and close again on success. Stops you hammering a down service and gives it room to recover.
Bulkhead — isolate resources (a separate thread/connection pool per dependency) so one slow downstream can’t consume every thread and sink the whole service. Named for a ship’s watertight compartments.

FIG 2 · circuit breaker states Closed = calls flow. Too many failures → Open (fail fast, don't call). After a cooldown → Half-Open (test one). Success closes it; failure re-opens.

These compose: a timeout feeds the failure count, the circuit breaker trips so you stop calling a dead service, retries with backoff+jitter ride out the transient blips, and bulkheads keep the blast radius contained. Together they’re the difference between “one dependency got slow” and “the whole site went down.”

From the trenches Discord

Discord stores trillions of messages, and its evolution is a tour of these exact tradeoffs. It moved off MongoDB, ran on Cassandra, then migrated to ScyllaDB — all AP datastores. For a chat backlog, availability and horizontal scale matter more than strict consistency: a message appearing a beat late on one replica is acceptable; the service refusing to load your channel history is not. So they accepted eventual consistency and partitioned messages by channel to scale writes horizontally across the cluster.

The writeup also surfaces the operational realities CAP’s tidy diagram hides: hot partitions (one massive channel overwhelming a single node), the cost of compaction and tail latencies, and the engineering work to keep an AP store’s eventual consistency from surfacing as user-visible weirdness. It’s a concrete answer to the interview prompt “design a chat system at scale” — pick AP, partition by a natural key, and budget heavily for the operational long tail. Pair it with the AWS Builders’ Library on timeouts, retries, and backoff with jitter for the resilience patterns that keep distributed calls from amplifying a partial failure into an outage.

read the writeup ↗ discord.com

Key points

Microservices add network failure + ops cost; start with a monolith, split only when a real boundary (team cadence, scaling profile) demands it.
API gateway = front door (routing/auth/rate-limit/TLS); service discovery = how services locate each other’s moving addresses.
CAP: under a partition, pick Consistency (CP: MongoDB/HBase) or Availability (AP: Cassandra/CouchDB) — and it says nothing about scalability or latency.
Scope CP/AP per data type — CP for money/inventory, AP for feeds/presence — not per whole system.
Eventual consistency is the price of AP — replicas converge once writes settle.
Vertical scaling hits a ceiling; horizontal scaling is unlimited but requires statelessness — keep state in a shared store so the app tier is disposable.
Resilience patterns keep a partial failure partial: timeouts (free the thread), retries with backoff + jitter (idempotent only, no retry storm), circuit breakers (fail fast, then test recovery), bulkheads (isolate pools).

01 Learning objectives

0 / 6 done

02 Curated reading

The System Design Primer
essential 40m — The canonical free survey of scaling building blocks.
Discord — How we store trillions of messages
optional eng blog 20m — AP datastore (Cassandra→ScyllaDB) at real scale.
Amazon Builders' Library — Timeouts, retries, and backoff with jitter
optional eng blog 18m — Resilience patterns for distributed calls.

03 Knowledge check

knowledge check5 questions · pass ≥ 70%

01easy
Adding more machines behind a load balancer is:
02medium
The CAP theorem tells you how well a system scales.
03medium
An open circuit breaker…
04medium
Why add random jitter to retry backoff?
05hard
During a network partition, a strongly-consistent (CP) store will:

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

Commonly asked mid concept very common Explain the CAP theorem. Under a partition, what are you actually choosing between?
CAP says that when a network partition happens, a distributed data store must choose between Consistency (every read sees the latest write) and Availability (every request gets a non-error response). You can't have both during a partition; without a partition you get both.
So it's really a choice made *when partitioned*. CP systems (e.g. HBase, MongoDB in its default config) refuse or block to stay consistent; AP systems (e.g. Cassandra, CouchDB) keep serving and reconcile later (eventual consistency). Important caveat: CAP says nothing about latency or scalability — it's strictly about behavior under partition.
Follow-ups they push on
- Why is the 'pick 2 of 3' framing misleading?
- What does PACELC add to CAP?
Red flag Saying 'pick 2 of 3' as if you choose freely. Partition tolerance is mandatory in a distributed system; the real choice is C vs A only when a partition occurs.
source: system-design-primer — CAP theorem ↗
Commonly asked senior concept very common Monolith vs microservices — what are the real tradeoffs, and why not default to microservices?
A monolith is one deployable: simpler local dev, easy refactors across boundaries, in-process calls, one transaction — at the cost of coupled deploys and scaling the whole app together. Microservices give independent deploys, team autonomy, and targeted scaling — but you pay with distributed-systems tax: network failures, eventual consistency, distributed transactions/sagas, harder debugging and tracing, and heavy ops.
The seasoned answer: don't reach for microservices by default. Most teams should start with a well-modularized monolith and split out services only when a clear scaling, team-ownership, or deploy-cadence boundary justifies the added operational cost.
Follow-ups they push on
- What forces a split — scaling, team size, or deploy cadence?
- What is a 'distributed monolith' and why is it the worst of both?
Red flag Starting greenfield with microservices for resume-driven reasons, inheriting distributed-systems complexity before you have the scale or teams to need it.
source: Martin Fowler — Monolith First ↗
Commonly asked mid concept very common Horizontal vs vertical scaling — and why does statelessness matter for scaling out?
Vertical scaling means a bigger machine (more CPU/RAM) — simple but bounded by the largest box and a single point of failure. Horizontal scaling means more machines behind a load balancer — effectively unbounded and fault-tolerant, but only if requests can hit any node.
That's why statelessness matters: if a server keeps user session state in local memory, the load balancer must pin a user to one node (sticky sessions), which breaks failover and uneven load. Push state to a shared store (Redis/DB) so any node can serve any request, and horizontal scaling becomes trivial.
Follow-ups they push on
- What breaks if you keep sessions in local server memory?
- How do load balancers route — round robin, least connections, hashing?
Red flag Storing session state in process memory and then trying to scale horizontally — you're forced into sticky sessions, which undermine failover and balancing.
source: system-design-primer — Scalability ↗
Commonly asked mid concept common What is eventual consistency, and why do distributed systems accept it?
Eventual consistency means replicas may temporarily disagree, but if writes stop, they converge to the same value given enough time. AP systems accept it as the price of staying available and low-latency under partitions and across regions.
Why accept it: strong consistency requires coordination (consensus, quorums) on every write, which adds latency and reduces availability when nodes can't reach each other. For many features — a like count, a social feed, a shopping cart — a few seconds of staleness is fine, and the availability/latency win is worth it. For money movement you choose strong consistency instead.
Follow-ups they push on
- Give a feature where eventual consistency is fine and one where it's not.
- What is read-your-own-writes consistency?
Red flag Using eventual consistency for invariants that must hold immediately (e.g. account balances). Match the consistency level to the business need.
source: AWS — Eventual consistency ↗
Commonly asked mid concept common What roles do an API gateway and service discovery play in a microservices system?
An API gateway is the single entry point for clients: it routes to the right service and handles cross-cutting concerns — auth, rate limiting, TLS termination, request aggregation, and sometimes response shaping — so each service doesn't reimplement them and clients don't need to know the internal topology.
Service discovery lets services find each other's network locations as instances scale up/down and move. A registry (Consul, Eureka, or Kubernetes DNS/Services) maps a logical service name to current healthy instances, so callers resolve a name instead of hardcoding IPs. Together they decouple clients from the shifting set of backend instances.
Follow-ups they push on
- Client-side vs server-side discovery — what's the difference?
- What concerns belong in the gateway vs each service?
Red flag Putting business logic in the gateway. It handles routing and cross-cutting concerns; domain logic stays in the services.
source: microservices.io — API gateway & service discovery ↗
Commonly asked senior design common Why retry failed calls with exponential backoff AND jitter? What goes wrong without jitter?
Retries handle transient failures, but naive retries cause two problems. Exponential backoff (wait 1s, 2s, 4s…) stops a struggling service from being hammered every few milliseconds while it tries to recover.
Jitter (randomizing each wait) prevents a thundering herd: if many clients fail at the same instant and all back off by the exact same schedule, they retry in synchronized waves that keep knocking the service over. Adding randomness spreads the retries out. Pair this with retry budgets/circuit breakers and only retry idempotent or idempotency-keyed operations.
Follow-ups they push on
- Why only retry idempotent operations?
- What does a circuit breaker add on top of backoff?
Red flag Backoff without jitter — synchronized clients retry in lockstep, creating a self-reinforcing herd that prevents recovery.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
Commonly asked senior concept common What is a circuit breaker and how does it protect a distributed system?
A circuit breaker wraps calls to a dependency and tracks failures. In closed state calls pass through; once failures cross a threshold it opens and fails fast (returns an error or fallback immediately) instead of piling up calls on a sick service. After a cooldown it goes half-open, lets a trial request through, and closes again if it succeeds.
It prevents cascading failures: without it, a slow dependency exhausts the caller's threads/connections waiting on timeouts, which then takes the caller down, propagating upstream. Failing fast contains the blast radius and lets the dependency recover.
Follow-ups they push on
- Walk through closed → open → half-open transitions.
- How does this complement timeouts and bulkheads?
Red flag Relying on retries alone with no breaker — retries against a failing dependency amplify load and accelerate the cascade.
source: Martin Fowler — CircuitBreaker ↗
Commonly asked senior concept common What is consistent hashing and why do distributed caches and databases use it?
With plain hash(key) % N, changing the number of nodes N remaps almost every key — catastrophic for a cache (mass misses) or a sharded DB (mass data movement). Consistent hashing maps both keys and nodes onto a ring; a key belongs to the next node clockwise. Adding or removing a node only relocates the keys in that node's arc — about 1/N of keys — instead of nearly all of them.
Virtual nodes (each physical node placed at many ring positions) smooth out uneven distribution. This is why Cassandra, DynamoDB, and Memcached-style caches use it.
Follow-ups they push on
- What do virtual nodes solve?
- How much data moves when you add the (N+1)th node?
Red flag Using modulo hashing for a sharded cluster, so adding one node reshuffles nearly all keys and stampedes the backing store.
source: system-design-primer — Consistent hashing ↗
Commonly asked senior concept occasional Why are leader election and quorum used in distributed coordination?
Many tasks need exactly one node in charge (assigning work, ordering writes) to avoid conflicts — so the cluster elects a leader via a consensus protocol (Raft, Paxos, ZooKeeper/ZAB). If the leader dies, a new one is elected.
To agree despite failures, decisions use a quorum — a majority (N/2 + 1). Requiring a majority for writes and reads guarantees any two quorums overlap, so the system never commits two conflicting decisions and can tolerate a minority of nodes failing. This is the backbone of consistent distributed stores and coordination services.
Follow-ups they push on
- Why a majority specifically? (overlapping quorums prevent split decisions)
- What is split-brain and how does quorum prevent it?
Red flag Allowing writes without a majority quorum, enabling split-brain where two partitions both think they have a leader and diverge.
source: The Raft Consensus Algorithm ↗
Commonly asked mid concept common Compare load-balancing algorithms: round robin, least connections, and consistent hashing. When does each shine?
Round robin sends each request to the next server in rotation — simple and fine when requests are uniform and servers identical, but blind to actual load. Least connections routes to the server with the fewest active connections — better when request durations vary, since it adapts to real load instead of assuming uniformity.
Consistent hashing (hash the client/key to a server) keeps a given key/session on the same server — essential for cache affinity or sticky routing, and it minimizes remapping when servers are added/removed. Round robin for stateless uniform work; least connections for variable work; consistent hashing when affinity/locality matters.
What a strong answer covers
- Round robin: simple rotation, ignores load; good for uniform requests.
- Least connections: adapts to variable request durations.
- Consistent hashing: routes a key to a stable server (cache/session affinity).
- Weighted variants account for heterogeneous server capacity.
Quick self-check
Requests have highly variable processing times. Which LB algorithm adapts best to real server load?
Follow-ups they push on
- When does round robin distribute poorly?
- Why does consistent hashing help cache hit rates behind a load balancer?
Red flag Defaulting to round robin when request costs vary wildly — a few expensive requests pile onto one server while others idle. Least connections adapts better.
source: Cloudflare — What is load balancing? ↗
★ must-know Commonly asked senior concept common Why is an idempotency key essential for a client retry after a timeout, and what subtlety makes timeouts dangerous?
A timeout is ambiguous: when a client's request times out, it cannot tell whether the server never received it, processed it but the response was lost, or is still processing. So a retry might be a true retry or an accidental duplicate of a request that already succeeded.
That's why a non-idempotent operation (charge a card, place an order) needs an idempotency key: the client sends a stable key, the server dedupes on it, and a retry of an already-applied request returns the original result instead of re-applying it. Without the key, the safe-looking retry can double-charge. The subtlety: the failure you can see (timeout) hides whether the side effect happened.
What a strong answer covers
- A timeout doesn't tell you if the operation succeeded — it's inherently ambiguous.
- Retrying a non-idempotent op risks a duplicate side effect.
- An idempotency key lets the server dedupe and return the first result.
- Idempotent methods (GET/PUT/DELETE) are safe to retry without a key.
Follow-ups they push on
- Why can a request that 'timed out' have actually succeeded server-side?
- How does this connect to at-least-once delivery in messaging?
Red flag Treating a timeout as a definite failure and blindly retrying a charge/order. The request may have completed; without an idempotency key you double-apply it.
source: AWS Builders' Library — Making retries safe with idempotent APIs ↗
Commonly asked mid concept common What is the difference between latency and throughput, and why can optimizing one hurt the other?
Latency is how long a single operation takes (time per request); throughput is how many operations complete per unit time. They're related but distinct — a system can have high throughput and high latency at once.
They trade off because techniques that raise throughput often add per-request latency: batching many requests amortizes overhead (more throughput) but each request waits for the batch to fill (more latency); deep queues keep workers busy (throughput) but messages wait longer (latency). The discipline is to measure latency as a distribution (p50/p95/p99), not a mean, since tail latency is what users feel, and to choose the tradeoff per workload.
What a strong answer covers
- Latency = time per operation; throughput = operations per unit time.
- Batching/queuing raise throughput but add per-request latency.
- Report latency as percentiles (p95/p99), not averages — tails matter.
- Little's Law links them: concurrency ≈ throughput × latency.
Quick self-check
Why is p99 latency usually more informative than mean latency?
Follow-ups they push on
- Why report p99 instead of the mean?
- How does batching trade latency for throughput?
Red flag Reporting only average latency. A good mean can hide a terrible p99 that real users hit; and maxing throughput via batching can quietly wreck per-request latency.
source: Hello Interview — Latency vs throughput ↗
Commonly asked senior design common Design read scaling for a heavily-read database. How do replication and the read-your-writes problem interact?
For read-heavy load, add read replicas: writes go to the primary, which asynchronously replicates to replicas that serve reads, spreading read load and adding redundancy. The catch is replication lag — a replica may be milliseconds-to-seconds behind, so a user who just wrote can read a replica and not see their own change.
Fix the read-your-writes experience by routing a user's reads to the primary for a short window after they write, pinning their session to the primary, tracking a write timestamp/LSN and only reading replicas caught up past it, or using synchronous replication for the critical path (at a latency cost). Layer caching and, if writes also dominate, consider sharding.
Follow-ups they push on
- What is replication lag and how do you measure it?
- When would you shard instead of (or in addition to) adding replicas?
Red flag Sending a user's immediate post-write read to an async replica and showing them stale data ('I just saved it — where did it go?'). Route recent writers to the primary or track their write position.
source: system-design-primer — Replication & federation ↗
Commonly asked senior trick occasional Trick: a service adds aggressive client retries to improve reliability and the whole system gets less reliable under load. What happened?
This is a retry storm / metastable failure. When a dependency slows or briefly fails under load, every client retries — often multiplying traffic 3x or more right when the service is least able to handle it. The added load keeps the service overloaded, so it keeps failing, so clients keep retrying: a self-sustaining feedback loop that doesn't recover even after the original trigger passes.
Fixes: bound retries with a retry budget (cap retries as a fraction of traffic, not per-request), add exponential backoff with jitter, use circuit breakers to fail fast, and only retry idempotent operations. Retries help with isolated transient blips; unbounded retries under systemic load amplify the failure.
What a strong answer covers
- Mass retries multiply load exactly when the service is already struggling.
- Creates a self-sustaining (metastable) overload that outlasts the trigger.
- Fix: retry budgets, backoff + jitter, circuit breakers, retry only idempotent ops.
- Retries help isolated blips, not systemic overload.
Quick self-check
Aggressive unconditional retries make a system LESS reliable under load because:
Follow-ups they push on
- What is a retry budget and why cap retries as a fraction of total traffic?
- How does a circuit breaker break the feedback loop?
Red flag Adding per-request retries everywhere as a blanket reliability boost. Under correlated failure they amplify load into a retry storm — bound them with budgets, backoff, and breakers.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
Commonly asked senior concept common What is sharding (horizontal partitioning), and why is choosing a good shard key the hard part?
Sharding splits one logical dataset across multiple databases/nodes by a shard key, so each shard holds a subset and the system scales writes and storage beyond one machine. Reads/writes route to the shard owning the key.
The shard key is the hard part because a bad one creates hotspots — picking a low-cardinality or monotonically-increasing key (like a timestamp) funnels traffic to one shard, defeating the point. You want a key that spreads load evenly *and* keeps commonly-joined data co-located so you avoid expensive cross-shard queries. Cross-shard transactions and re-sharding as you grow are the recurring pains, which is why teams delay sharding until replicas and caching are exhausted.
What a strong answer covers
- Sharding = horizontal partitioning by a shard key across nodes.
- Scales writes/storage past a single machine.
- Bad shard key → hotspots (monotonic or low-cardinality keys are traps).
- Cross-shard joins/transactions and re-sharding are the ongoing costs.
Follow-ups they push on
- Why does a timestamp or auto-increment id make a poor shard key?
- How does consistent hashing reduce re-sharding pain?
Red flag Sharding on a monotonically increasing key (timestamp/sequence id) so all new writes hit the newest shard — a hotspot that recreates the single-node bottleneck.
source: MongoDB — Sharding and shard keys ↗