> cs·fundamentals
interview 0% 28m read
2.7 ★ core [A][I] 15 interview Q's

Distributed systems & scaling

Monolith vs microservices, API gateway/service discovery, CAP (CP vs AP and its scalability caveat), eventual consistency, and horizontal vs vertical scaling.

Distributed systems trade simplicity for scale and availability — and every one of those trades is governed by physics you can’t cheat: networks partition, machines fail independently, and clocks disagree. When a partition happens you must give something up. The senior-level skill is naming exactly what you give up, and — just as important — not over-applying CAP to questions it doesn’t answer.

Monolith vs microservices

The default answer to “monolith or microservices?” is monolith — and that’s the senior answer, not the lazy one. A monolith is one deployable: function calls instead of network hops, one database transaction instead of distributed coordination, one thing to deploy and trace. Microservices buy team autonomy (squads ship independently) and independent scaling (scale the hot service alone), but you pay in distributed failure (every call can now time out or partially fail), operational overhead (N deploy pipelines, N dashboards, service discovery, distributed tracing), and data consistency headaches (no cross-service transaction). Split only when a real boundary demands it — a team that needs to deploy on its own cadence, or a component with a wildly different scaling profile — never by default or for résumé reasons.

Two supporting pieces show up the moment you have many services: an API gateway is the single front door that handles cross-cutting concerns (routing, auth, rate-limiting, TLS termination) so each service doesn’t reimplement them; service discovery is how services locate each other when instances come and go (a registry like Consul, or DNS-based discovery) — because in a dynamic cluster you can’t hardcode addresses.

CAP — and the caveat that earns you senior points

A triangle with Consistency at top-left, Availability at top-right, and Partition tolerance at the bottom. Labels note that a partition forces a choice between C and A.CConsistencyAAvailabilityPPartition toleranceCPMongoDB · HBaseAPCassandra · CouchDBpartition forcesa choice ↙ ↘
FIG 1 · the CAP triangle Under a partition (P is forced), you keep at most one of Consistency or Availability. CA only exists when the network never fails — i.e. not in the real world.

CAP says: when a network partition happens (and it will), you can keep Consistency (every read sees the latest write) or Availability (every request gets a non-error response), but not both. The crucial framing interviewers reward: there’s no choice to make when the network is healthy — CAP only bites during a partition. Calling a database “CP” or “AP” is shorthand for what it does when the cluster splits.

ChoiceMeans under partitionExamplesGotcha
CP (consistency)reject/stall requests rather than return stale dataMongoDB, HBase, ZooKeeperyou sacrifice availability — some requests fail
AP (availability)always answer, even if data is stale; reconcile laterCassandra, CouchDB, DynamoDBclients may read stale values — needs conflict resolution
The choice only exists during a partition; the rest of the time you get both.
CP vs AP, concretely
Scenario: a network partition splits your replicas into two groups.
A client writes X=2 to group 1; another client reads X from group 2.

CP system (e.g. ZooKeeper):
  group 2 cannot confirm it has the latest value → it ERRORS / blocks the read.
  → no stale data, but the read is unavailable until the partition heals.

AP system (e.g. Cassandra):
  group 2 happily returns the old X=1.
  → always available, but the read is stale until replicas reconcile (eventual consistency).

Neither is “better” — you choose based on whether stale reads or failed reads hurt your domain more. A bank balance or an inventory count wants CP (a wrong number is worse than a brief outage); a social feed, a “last seen” timestamp, or a like-count wants AP (a slightly stale value is fine; an error is not). The senior move is to scope the choice per data type, not per whole system — most real architectures are CP for money and AP for everything else.

Scaling and statelessness

You scale vertically (a bigger box) until you hit the ceiling — the largest instance money can buy — or the single-point-of-failure risk gets unacceptable, then you scale horizontally (more boxes behind a load balancer), which is effectively unlimited. But horizontal scaling has a hard prerequisite: your services must be stateless — no per-request or per-session state stuck on one instance. Only then can any box serve any request and the load balancer route freely, add capacity by cloning, and survive a node death without dropping sessions.

The practical rule: push state into a shared store (Redis for sessions, a database for data, object storage for files) so the application tier stays disposable — any instance is interchangeable, scaling is “launch more clones,” and a crashed box loses nothing but in-flight requests. This is the same statelessness principle from HTTP (2.1) and API design (2.2), now paying off at the infrastructure layer.

Resilience patterns

A remote call can be slow, fail, or hang — and without guardrails one sick dependency cascades into a full outage as callers pile up waiting on it. Four patterns, layered, keep a partial failure partial:

  • Timeout — never wait forever. Bound every network call so a dead dependency frees the caller’s thread instead of hanging it. The most important and most-forgotten one.
  • Retry with exponential backoff + jitter — retry a transient failure, but wait longer each attempt (1s → 2s → 4s) and add random jitter, or every client retries in lockstep and you DDoS your own recovering service (a “retry storm”). Only retry idempotent operations.
  • Circuit breaker — after N consecutive failures, open the circuit and fail fast without even calling the dead dependency; after a cooldown go half-open to test one request, and close again on success. Stops you hammering a down service and gives it room to recover.
  • Bulkhead — isolate resources (a separate thread/connection pool per dependency) so one slow downstream can’t consume every thread and sink the whole service. Named for a ship’s watertight compartments.
A state machine: Closed transitions to Open on failures over threshold; Open transitions to Half-Open after a cooldown; Half-Open returns to Closed on a successful test or back to Open on failure.CLOSEDcalls flow normallyHALF-OPENtest one requestOPENfail fast, don’t callfailures ≥ threshold →after cooldowntest oktest fails → re-open
FIG 2 · circuit breaker states Closed = calls flow. Too many failures → Open (fail fast, don't call). After a cooldown → Half-Open (test one). Success closes it; failure re-opens.

These compose: a timeout feeds the failure count, the circuit breaker trips so you stop calling a dead service, retries with backoff+jitter ride out the transient blips, and bulkheads keep the blast radius contained. Together they’re the difference between “one dependency got slow” and “the whole site went down.”

01 Learning objectives

0 / 6 done

02 Curated reading

03 Knowledge check

knowledge check5 questions · pass ≥ 70%
  1. 01easy

    Adding more machines behind a load balancer is:

  2. 02medium

    The CAP theorem tells you how well a system scales.

  3. 03medium

    An open circuit breaker…

  4. 04medium

    Why add random jitter to retry backoff?

  5. 05hard

    During a network partition, a strongly-consistent (CP) store will:

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • Commonly asked mid concept very common Explain the CAP theorem. Under a partition, what are you actually choosing between?

    CAP says that when a network partition happens, a distributed data store must choose between Consistency (every read sees the latest write) and Availability (every request gets a non-error response). You can't have both during a partition; without a partition you get both.

    So it's really a choice made *when partitioned*. CP systems (e.g. HBase, MongoDB in its default config) refuse or block to stay consistent; AP systems (e.g. Cassandra, CouchDB) keep serving and reconcile later (eventual consistency). Important caveat: CAP says nothing about latency or scalability — it's strictly about behavior under partition.

    Red flag Saying 'pick 2 of 3' as if you choose freely. Partition tolerance is mandatory in a distributed system; the real choice is C vs A only when a partition occurs.

    source: system-design-primer — CAP theorem ↗
  • Commonly asked senior concept very common Monolith vs microservices — what are the real tradeoffs, and why not default to microservices?

    A monolith is one deployable: simpler local dev, easy refactors across boundaries, in-process calls, one transaction — at the cost of coupled deploys and scaling the whole app together. Microservices give independent deploys, team autonomy, and targeted scaling — but you pay with distributed-systems tax: network failures, eventual consistency, distributed transactions/sagas, harder debugging and tracing, and heavy ops.

    The seasoned answer: don't reach for microservices by default. Most teams should start with a well-modularized monolith and split out services only when a clear scaling, team-ownership, or deploy-cadence boundary justifies the added operational cost.

    Red flag Starting greenfield with microservices for resume-driven reasons, inheriting distributed-systems complexity before you have the scale or teams to need it.

    source: Martin Fowler — Monolith First ↗
  • Commonly asked mid concept very common Horizontal vs vertical scaling — and why does statelessness matter for scaling out?

    Vertical scaling means a bigger machine (more CPU/RAM) — simple but bounded by the largest box and a single point of failure. Horizontal scaling means more machines behind a load balancer — effectively unbounded and fault-tolerant, but only if requests can hit any node.

    That's why statelessness matters: if a server keeps user session state in local memory, the load balancer must pin a user to one node (sticky sessions), which breaks failover and uneven load. Push state to a shared store (Redis/DB) so any node can serve any request, and horizontal scaling becomes trivial.

    Red flag Storing session state in process memory and then trying to scale horizontally — you're forced into sticky sessions, which undermine failover and balancing.

    source: system-design-primer — Scalability ↗
  • Commonly asked mid concept common What is eventual consistency, and why do distributed systems accept it?

    Eventual consistency means replicas may temporarily disagree, but if writes stop, they converge to the same value given enough time. AP systems accept it as the price of staying available and low-latency under partitions and across regions.

    Why accept it: strong consistency requires coordination (consensus, quorums) on every write, which adds latency and reduces availability when nodes can't reach each other. For many features — a like count, a social feed, a shopping cart — a few seconds of staleness is fine, and the availability/latency win is worth it. For money movement you choose strong consistency instead.

    Red flag Using eventual consistency for invariants that must hold immediately (e.g. account balances). Match the consistency level to the business need.

    source: AWS — Eventual consistency ↗
  • Commonly asked mid concept common What roles do an API gateway and service discovery play in a microservices system?

    An API gateway is the single entry point for clients: it routes to the right service and handles cross-cutting concerns — auth, rate limiting, TLS termination, request aggregation, and sometimes response shaping — so each service doesn't reimplement them and clients don't need to know the internal topology.

    Service discovery lets services find each other's network locations as instances scale up/down and move. A registry (Consul, Eureka, or Kubernetes DNS/Services) maps a logical service name to current healthy instances, so callers resolve a name instead of hardcoding IPs. Together they decouple clients from the shifting set of backend instances.

    Red flag Putting business logic in the gateway. It handles routing and cross-cutting concerns; domain logic stays in the services.

    source: microservices.io — API gateway & service discovery ↗
  • Commonly asked senior design common Why retry failed calls with exponential backoff AND jitter? What goes wrong without jitter?

    Retries handle transient failures, but naive retries cause two problems. Exponential backoff (wait 1s, 2s, 4s…) stops a struggling service from being hammered every few milliseconds while it tries to recover.

    Jitter (randomizing each wait) prevents a thundering herd: if many clients fail at the same instant and all back off by the exact same schedule, they retry in synchronized waves that keep knocking the service over. Adding randomness spreads the retries out. Pair this with retry budgets/circuit breakers and only retry idempotent or idempotency-keyed operations.

    Red flag Backoff without jitter — synchronized clients retry in lockstep, creating a self-reinforcing herd that prevents recovery.

    source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
  • Commonly asked senior concept common What is a circuit breaker and how does it protect a distributed system?

    A circuit breaker wraps calls to a dependency and tracks failures. In closed state calls pass through; once failures cross a threshold it opens and fails fast (returns an error or fallback immediately) instead of piling up calls on a sick service. After a cooldown it goes half-open, lets a trial request through, and closes again if it succeeds.

    It prevents cascading failures: without it, a slow dependency exhausts the caller's threads/connections waiting on timeouts, which then takes the caller down, propagating upstream. Failing fast contains the blast radius and lets the dependency recover.

    Red flag Relying on retries alone with no breaker — retries against a failing dependency amplify load and accelerate the cascade.

    source: Martin Fowler — CircuitBreaker ↗
  • Commonly asked senior concept common What is consistent hashing and why do distributed caches and databases use it?

    With plain hash(key) % N, changing the number of nodes N remaps almost every key — catastrophic for a cache (mass misses) or a sharded DB (mass data movement). Consistent hashing maps both keys and nodes onto a ring; a key belongs to the next node clockwise. Adding or removing a node only relocates the keys in that node's arc — about 1/N of keys — instead of nearly all of them.

    Virtual nodes (each physical node placed at many ring positions) smooth out uneven distribution. This is why Cassandra, DynamoDB, and Memcached-style caches use it.

    Red flag Using modulo hashing for a sharded cluster, so adding one node reshuffles nearly all keys and stampedes the backing store.

    source: system-design-primer — Consistent hashing ↗
  • Commonly asked senior concept occasional Why are leader election and quorum used in distributed coordination?

    Many tasks need exactly one node in charge (assigning work, ordering writes) to avoid conflicts — so the cluster elects a leader via a consensus protocol (Raft, Paxos, ZooKeeper/ZAB). If the leader dies, a new one is elected.

    To agree despite failures, decisions use a quorum — a majority (N/2 + 1). Requiring a majority for writes and reads guarantees any two quorums overlap, so the system never commits two conflicting decisions and can tolerate a minority of nodes failing. This is the backbone of consistent distributed stores and coordination services.

    Red flag Allowing writes without a majority quorum, enabling split-brain where two partitions both think they have a leader and diverge.

    source: The Raft Consensus Algorithm ↗
  • Commonly asked mid concept common Compare load-balancing algorithms: round robin, least connections, and consistent hashing. When does each shine?

    Round robin sends each request to the next server in rotation — simple and fine when requests are uniform and servers identical, but blind to actual load. Least connections routes to the server with the fewest active connections — better when request durations vary, since it adapts to real load instead of assuming uniformity.

    Consistent hashing (hash the client/key to a server) keeps a given key/session on the same server — essential for cache affinity or sticky routing, and it minimizes remapping when servers are added/removed. Round robin for stateless uniform work; least connections for variable work; consistent hashing when affinity/locality matters.

    What a strong answer covers
    • Round robin: simple rotation, ignores load; good for uniform requests.

    • Least connections: adapts to variable request durations.

    • Consistent hashing: routes a key to a stable server (cache/session affinity).

    • Weighted variants account for heterogeneous server capacity.

    Quick self-check

    Requests have highly variable processing times. Which LB algorithm adapts best to real server load?

    Red flag Defaulting to round robin when request costs vary wildly — a few expensive requests pile onto one server while others idle. Least connections adapts better.

    source: Cloudflare — What is load balancing? ↗
  • ★ must-know Commonly asked senior concept common Why is an idempotency key essential for a client retry after a timeout, and what subtlety makes timeouts dangerous?

    A timeout is ambiguous: when a client's request times out, it cannot tell whether the server never received it, processed it but the response was lost, or is still processing. So a retry might be a true retry or an accidental duplicate of a request that already succeeded.

    That's why a non-idempotent operation (charge a card, place an order) needs an idempotency key: the client sends a stable key, the server dedupes on it, and a retry of an already-applied request returns the original result instead of re-applying it. Without the key, the safe-looking retry can double-charge. The subtlety: the failure you can see (timeout) hides whether the side effect happened.

    What a strong answer covers
    • A timeout doesn't tell you if the operation succeeded — it's inherently ambiguous.

    • Retrying a non-idempotent op risks a duplicate side effect.

    • An idempotency key lets the server dedupe and return the first result.

    • Idempotent methods (GET/PUT/DELETE) are safe to retry without a key.

    Red flag Treating a timeout as a definite failure and blindly retrying a charge/order. The request may have completed; without an idempotency key you double-apply it.

    source: AWS Builders' Library — Making retries safe with idempotent APIs ↗
  • Commonly asked mid concept common What is the difference between latency and throughput, and why can optimizing one hurt the other?

    Latency is how long a single operation takes (time per request); throughput is how many operations complete per unit time. They're related but distinct — a system can have high throughput and high latency at once.

    They trade off because techniques that raise throughput often add per-request latency: batching many requests amortizes overhead (more throughput) but each request waits for the batch to fill (more latency); deep queues keep workers busy (throughput) but messages wait longer (latency). The discipline is to measure latency as a distribution (p50/p95/p99), not a mean, since tail latency is what users feel, and to choose the tradeoff per workload.

    What a strong answer covers
    • Latency = time per operation; throughput = operations per unit time.

    • Batching/queuing raise throughput but add per-request latency.

    • Report latency as percentiles (p95/p99), not averages — tails matter.

    • Little's Law links them: concurrency ≈ throughput × latency.

    Quick self-check

    Why is p99 latency usually more informative than mean latency?

    Red flag Reporting only average latency. A good mean can hide a terrible p99 that real users hit; and maxing throughput via batching can quietly wreck per-request latency.

    source: Hello Interview — Latency vs throughput ↗
  • Commonly asked senior design common Design read scaling for a heavily-read database. How do replication and the read-your-writes problem interact?

    For read-heavy load, add read replicas: writes go to the primary, which asynchronously replicates to replicas that serve reads, spreading read load and adding redundancy. The catch is replication lag — a replica may be milliseconds-to-seconds behind, so a user who just wrote can read a replica and not see their own change.

    Fix the read-your-writes experience by routing a user's reads to the primary for a short window after they write, pinning their session to the primary, tracking a write timestamp/LSN and only reading replicas caught up past it, or using synchronous replication for the critical path (at a latency cost). Layer caching and, if writes also dominate, consider sharding.

    Red flag Sending a user's immediate post-write read to an async replica and showing them stale data ('I just saved it — where did it go?'). Route recent writers to the primary or track their write position.

    source: system-design-primer — Replication & federation ↗
  • Commonly asked senior trick occasional Trick: a service adds aggressive client retries to improve reliability and the whole system gets less reliable under load. What happened?

    This is a retry storm / metastable failure. When a dependency slows or briefly fails under load, every client retries — often multiplying traffic 3x or more right when the service is least able to handle it. The added load keeps the service overloaded, so it keeps failing, so clients keep retrying: a self-sustaining feedback loop that doesn't recover even after the original trigger passes.

    Fixes: bound retries with a retry budget (cap retries as a fraction of traffic, not per-request), add exponential backoff with jitter, use circuit breakers to fail fast, and only retry idempotent operations. Retries help with isolated transient blips; unbounded retries under systemic load amplify the failure.

    What a strong answer covers
    • Mass retries multiply load exactly when the service is already struggling.

    • Creates a self-sustaining (metastable) overload that outlasts the trigger.

    • Fix: retry budgets, backoff + jitter, circuit breakers, retry only idempotent ops.

    • Retries help isolated blips, not systemic overload.

    Quick self-check

    Aggressive unconditional retries make a system LESS reliable under load because:

    Red flag Adding per-request retries everywhere as a blanket reliability boost. Under correlated failure they amplify load into a retry storm — bound them with budgets, backoff, and breakers.

    source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
  • Commonly asked senior concept common What is sharding (horizontal partitioning), and why is choosing a good shard key the hard part?

    Sharding splits one logical dataset across multiple databases/nodes by a shard key, so each shard holds a subset and the system scales writes and storage beyond one machine. Reads/writes route to the shard owning the key.

    The shard key is the hard part because a bad one creates hotspots — picking a low-cardinality or monotonically-increasing key (like a timestamp) funnels traffic to one shard, defeating the point. You want a key that spreads load evenly *and* keeps commonly-joined data co-located so you avoid expensive cross-shard queries. Cross-shard transactions and re-sharding as you grow are the recurring pains, which is why teams delay sharding until replicas and caching are exhausted.

    What a strong answer covers
    • Sharding = horizontal partitioning by a shard key across nodes.

    • Scales writes/storage past a single machine.

    • Bad shard key → hotspots (monotonic or low-cardinality keys are traps).

    • Cross-shard joins/transactions and re-sharding are the ongoing costs.

    Red flag Sharding on a monotonically increasing key (timestamp/sequence id) so all new writes hit the newest shard — a hotspot that recreates the single-node bottleneck.

    source: MongoDB — Sharding and shard keys ↗