> cs·fundamentals
interview 0% 35m read
6.1 ★ core [A][I] 14 interview Q's

System design fundamentals

The 4-step interview framework, back-of-the-envelope estimation, the building blocks, and designing for scalability/availability/reliability while naming bottlenecks.

A system-design interview isn’t a quiz with a right answer — it’s a structured conversation where you demonstrate that you can scope a problem, reason about scale with rough numbers, and assemble well-understood building blocks into a design whose tradeoffs you can defend. Most candidates fail not on knowledge but on process: they jump to boxes-and-arrows before they know what they’re building. The 4-step framework below keeps you from freezing on a blank whiteboard and signals seniority — you scope, you estimate, you justify, you self-critique.

The 4-step framework

Walk these four steps out loud, in order, and spend real time on step 1 — the budget percentages below are roughly how a strong candidate allocates a 45-minute round.

StepWhat you doWhat the interviewer is checkingTime
1 — ClarifyPin down functional + non-functional requirements, scale, read/write ratio, what's out of scopeDo you scope before building? Do you ask about scale?~15%
2 — High-level designDraw the major components + data flow; get explicit buy-in before drillingCan you sketch a sane architecture end to end?~25%
3 — Deep divePick the 1–2 hardest parts (data model, the hot path, a bottleneck) and go deepDepth on the parts that actually matter~45%
4 — Wrap upName bottlenecks, SPOFs, and tradeoffs; how you'd scale further or what you'd monitorDo you know where your own design is weak?~15%
Spend the most time on the deep dive — but never skip step 1.

The non-functional requirements you should always tease out in step 1: expected scale (DAU, QPS), the read/write ratio, latency targets, consistency needs (is stale data acceptable?), and availability targets. These numbers drive every later decision — you can’t choose between a cache-aside Redis layer and a synchronously-replicated store until you know whether the workload is read-heavy and whether stale reads are tolerable. A good habit: state your assumptions aloud (“I’ll assume 100M DAU and a 100:1 read:write ratio — stop me if that’s wrong”) so the interviewer can steer you before you’ve sunk time into the wrong design.

Back-of-the-envelope estimation

You’re not after precision — you’re after the right order of magnitude, fast. Memorize a handful of anchor numbers and round aggressively (treat a day as ~100k seconds, since 86,400 ≈ 10⁵).

Sizing a Twitter-like timeline

Assume 200M daily active users, each posting on average 2 tweets/day, each tweet ~300 bytes of text plus metadata. Estimate write QPS and storage.

Writes/day   = 200M users × 2 tweets   = 400M tweets/day
Seconds/day  ≈ 100,000  (86,400 rounded up)
Avg write QPS = 400M / 100k            = 4,000 writes/sec
Peak ≈ 2–3×  avg                       ≈ 10,000 writes/sec

Storage/day  = 400M × 300 B            = 120 GB/day
Storage/year = 120 GB × 365            ≈ 44 TB/year (text only)

Reads dwarf writes on a social feed — a 100:1 read:write ratio puts you near 400k read QPS, which is the number that actually forces a CDN + cache + read-replica fan-out design. The write path (10k/s) a single sharded primary can absorb; the read path is what you architect around. The discipline that scores points: do the math out loud, round, and then immediately interpret the result — “44 TB/year of text means storage isn’t my problem; 400k read QPS is, so I’ll fan out reads to a cache and replicas.”

The building blocks

These are the LEGO bricks of every design. You should be able to say what each one buys you and what it costs — naming only the win is the junior tell; naming the new problem it introduces is the senior one.

A client connects to a CDN, then a load balancer, which fans out to three stateless app servers; the app tier reads from a cache, which falls back to a primary database that replicates to read replicas and shards.Clientbrowser/appCDNedge cacheLoad Bal.health-checkapp 1app 2statelessapp 3Cache (Redis)sub-ms readsDB primarywritesreplicas · shardsmissstatic hit → done
FIG 1 · request path through the building blocks A read request flows left to right: CDN absorbs static hits, the LB fans out to a stateless app tier, the cache shields the DB, and replicas/shards scale the data layer.
BlockWhat it gives youWhen to add itCost / caveat
Load balancerSpreads traffic across instances; health-checks out dead onesThe moment you have >1 app serverItself a SPOF unless run redundant (active/passive)
Cache (Redis/Memcached)Sub-ms reads, offloads the DB on read-heavy pathsHot keys, repeated reads, expensive queriesInvalidation is hard; risk of stale data + thundering herd
CDNServes static/edge content near the user; cuts latency + origin loadImages, JS/CSS, video, cacheable API responsesCache-busting + TTL tuning; cost per GB egress
DB replicationRead replicas scale reads + give HA via failoverRead-heavy workload; need read scaling/HAReplication lag → stale reads; doesn't scale writes
DB shardingSplits data across nodes → scales writes + storageOne primary can't hold the write/storage volumeCross-shard joins + rebalancing are painful; pick keys carefully
Message queueDecouples producers/consumers; absorbs spikes; async workSlow/bursty work (email, video encode, fan-out)At-least-once delivery → consumers must be idempotent
Rate limiterProtects you from abuse + overload (returns 429)Public APIs, login, expensive endpointsToken-bucket/sliding-window state must be shared across instances
Every block solves one problem and introduces a new one — name both.

Stateless services are the glue: keep the app tier stateless so the load balancer can route any request anywhere and you scale by adding identical instances. The corollary every interviewer wants to hear: session state, uploaded files, and in-memory caches must move out of the app process into a shared store (Redis, the DB, object storage) — the instant a user’s second request can land on a different box, anything kept locally is a bug. Consistent hashing is the trick that lets a sharded cache or datastore grow and shrink without remapping every key — when a node joins or leaves the ring, only its neighbor’s slice moves (1/N of keys), instead of the naive hash(key) % N scheme where changing N reshuffles almost everything.

Scalability, availability, reliability — and finding the SPOF

These three get conflated; keep them distinct. Scalability is handling more load (prefer horizontal — add boxes — over vertical — bigger boxes). Availability is the fraction of time the system is up, usually quoted in nines and bought with redundancy + failover. Reliability is doing the right thing despite failures — idempotent retries, timeouts, circuit breakers, graceful degradation.

NinesAvailabilityDowntime / yearRoughly
Two nines99%~3.65 daysa hobby project
Three nines99.9%~8.8 hoursa typical web service
Four nines99.99%~52 minutesa serious SaaS SLA
Five nines99.999%~5.3 minutestelecom / core infra (expensive)
Each extra nine costs roughly an order of magnitude more engineering — quote the target the requirement actually needs.

The senior move in step 4 is to walk the request path and ask “what happens if this dies?” at each hop. A single load balancer, a lone database primary, one cache node holding a hot key, a single availability zone — each is a SPOF until it has a redundant twin with automatic failover. The fix is always the same shape: more than one of it, plus a health check and failover so traffic reroutes automatically.

01 Learning objectives

0 / 4 done

02 Curated reading

03 Knowledge check

knowledge check3 questions · pass ≥ 70%
  1. 01easy

    A single point of failure (SPOF) is:

  2. 02medium

    The FIRST step of the system-design interview framework is:

  3. 03hard

    Consistent hashing is primarily used to:

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

  • AmazonGoogleMicrosoft senior design very common Design a URL shortener (TinyURL / bit.ly). Walk me through it.

    Clarify scope first: read-heavy (~100:1 reads:writes), so optimize the redirect path. Estimate: ~100M new URLs/day -> ~1.2K writes/s, ~120K reads/s; 7-char base62 = 62^7 ~= 3.5T codes, plenty for years. Storage ~500 bytes/row * 100M/day -> ~36TB over 2 years.

    Core: a key-gen service maps short->long. Two strategies: (a) base62-encode a globally unique counter (e.g. a range-allocator / Snowflake-style ID) — no collisions, but reveals volume; (b) hash the long URL (MD5/SHA) and take a prefix, then collision-check. Store mapping in a KV store / sharded SQL; reads go cache-first (Redis, LRU on hot links) then DB.

    Wrap up: use 301 only if you do not need per-click analytics (browser caches it), else 302; shard by hash of short code; push click analytics to a queue (Kafka) for async aggregation.

    Red flag Jumping straight to a schema before clarifying the read/write ratio and scale. Also picking 301 while still wanting click analytics — the cached redirect never hits your server again.

    source: ByteByteGo — Design A URL Shortener ↗
  • AmazonStripeCloudflare senior design very common Design a distributed rate limiter for a public API.

    Clarify: client-side vs server-side (server-side), what dimensions to limit (per-user, per-IP, per-endpoint, global), and the action on limit (drop, queue, return 429 with Retry-After). Pick an algorithm and justify it: token bucket (allows bursts, simple, most common), leaky bucket (smooths output), fixed window (cheap but boundary spikes), sliding-window log (accurate, memory-heavy), sliding-window counter (good accuracy/cost tradeoff).

    For a distributed fleet, counters must be shared: keep them in a central store like Redis, keyed by userId:window, incremented atomically (e.g. a Lua script / INCR + EXPIRE) so the read-modify-write is race-free. Put the limiter at the edge / API gateway so rejected traffic never reaches your services.

    Discuss tradeoffs: local in-memory counters are fast but let bursts through across nodes; Redis adds a network hop and a single point to scale; allow a small over-limit margin to tolerate Redis latency.

    Red flag Using a non-atomic get-then-set on the counter, which races under concurrency and lets requests slip past the limit. Also putting the limiter behind the app instead of at the gateway.

    source: ByteByteGo — Design A Rate Limiter ↗
  • Meta senior design very common Design a social media news feed (e.g. the Facebook/Twitter timeline).

    Clarify: feed of posts from people you follow, ranked (recency or relevance), heavy read load. The core decision is fanout-on-write (push) vs fanout-on-read (pull).

    Fanout-on-write: when a user posts, push the post id into every follower's precomputed feed cache. Reads are O(1) and fast — great for most users. But it explodes for celebrities with millions of followers (the hot-key / fanout problem).

    Fanout-on-read: build the feed at request time by pulling recent posts from everyone the user follows. No write amplification, but reads are expensive and slow.

    The standard answer is a hybrid: push for normal accounts, pull for a small set of high-follower accounts, then merge at read time. Cache assembled feeds (Redis), store posts in a sharded store, and rank with a separate scoring service.

    Red flag Committing to pure fanout-on-write without acknowledging the celebrity hot-key problem, or pure fanout-on-read and ignoring read latency at scale.

    source: ByteByteGo — Design A News Feed System ↗
  • AmazonMeta senior design common Design a distributed cache (like a multi-node Redis/Memcached layer).

    Clarify: read-heavy lookups, low latency, data too large for one node's RAM, so partition across nodes. The key technique is consistent hashing: map both nodes and keys onto a hash ring so that adding/removing a node only remaps ~1/N of keys instead of remapping everything (which a naive hash(key) % N would do).

    Discuss replication for availability (replica per shard, read from replicas), an eviction policy (LRU/LFU) since memory is bounded, and the write policy: write-through (write cache + DB together, consistent but slower) vs write-back (write cache, flush DB async, fast but risks loss) vs cache-aside (app reads cache, on miss reads DB and populates).

    Wrap up: name failure modes — cache stampede on a hot key expiring (mitigate with request coalescing / a short lock), and the thundering herd on cold start.

    Red flag Proposing `hash(key) % N` for sharding — adding one node reshuffles almost every key and cold-starts the whole cache.

    source: ByteByteGo — Distributed Cache (System Design Interview) ↗
  • Commonly asked senior design very common How do you do back-of-the-envelope estimation? Estimate QPS and storage for a service with 100M daily active users.

    The point is order-of-magnitude reasoning, not precision. Start from DAU and an action rate. Say each of 100M users does ~10 reads/day -> 1B reads/day. Divide by ~86,400 s/day (~10^5) -> ~12K reads/s average; multiply by a peak factor of ~2-3x -> ~30K peak QPS.

    Storage: rows/day * bytes/row * retention. If you store 1M new items/day at ~1KB each, that is ~1GB/day, ~365GB/year, ~1TB over 3 years — round freely.

    Keep a few anchors memorized: ~10^5 seconds/day, reads usually dwarf writes (often 100:1), a memory read is ~ns, SSD ~µs, network round-trip cross-region ~tens of ms. State your assumptions out loud and round to clean powers of ten so the interviewer can follow.

    Red flag Reaching for a calculator and false precision. The interviewer wants to see assumptions stated and powers-of-ten arithmetic, not 31,536,000 seconds.

    source: System Design Primer — Back-of-the-envelope ↗
  • Commonly asked senior design very common Walk me through the 4-step framework you use to attack any system design interview.

    (1) Understand the problem & scope: ask clarifying questions, separate functional from non-functional requirements (scale, latency, consistency, availability), and do capacity estimates. Pin down what is in and out of scope before drawing anything.

    (2) Propose a high-level design and get buy-in: sketch the major boxes — clients, API/gateway, services, datastores, cache, queue — and the data flow. Confirm the interviewer agrees before going deep.

    (3) Deep dive: pick the 1-2 components the interviewer cares about (the data model, the sharding strategy, the hot path) and go deep — algorithms, schema, partitioning, the actual bottleneck.

    (4) Wrap up: name bottlenecks, single points of failure, and tradeoffs; mention what you would monitor and how you would scale the next 10x. The discipline is to drive the conversation, not silently draw.

    Red flag Skipping step 1 and diving into a database schema before clarifying scale, latency, and consistency needs — the single most common reason candidates fail the round.

    source: ByteByteGo — A framework for system design interviews ↗
  • Commonly asked senior design common How do you identify bottlenecks and single points of failure in a design, and how do you remove them?

    Trace the request path and ask, at each hop, what happens if this one component dies or saturates. A single load balancer, a single primary database, a single cache node, or a single region are classic SPOFs.

    Remove SPOFs with redundancy + failover: run the load balancer in an active-passive pair, replicate the DB (primary + replicas, automatic promotion), spread services across multiple availability zones, and use health checks so traffic routes away from dead instances.

    For bottlenecks, find the component nearest its capacity ceiling: stateless app tier scales horizontally behind the LB; a write-bound DB needs sharding or a queue to absorb bursts; a read-bound DB needs replicas + a cache. The senior move is to quantify it (this shard does X writes/s, the limit is Y) rather than hand-wave 'add more servers'.

    Red flag Treating 'add a load balancer' as the whole answer while the load balancer itself remains a single point of failure, or scaling a stateful service horizontally without externalizing session state.

    source: System Design Primer — Availability patterns ↗
  • Google senior design common Design a web crawler that can crawl the public web.

    Clarify: scale (billions of pages), politeness (respect robots.txt and per-host rate limits), freshness, and what you extract. Core loop: a URL frontier (a queue, partitioned by host so one host's pages go to one worker for politeness) feeds a fleet of fetchers; fetched HTML goes to a parser that extracts links, which are de-duplicated and fed back into the frontier.

    Key components: a DNS cache (DNS resolution is a hidden bottleneck), a seen-URL set (Bloom filter / hash store) to avoid re-crawling, and content de-duplication (hash or Simhash of page content to skip near-duplicates). Store raw pages in object storage / a distributed file store.

    Wrap up: politeness is the subtle part — partition the frontier by domain and apply a per-host crawl delay so you do not hammer one site; add priority queues so important pages get crawled sooner.

    Red flag Forgetting politeness/robots.txt and a de-dup mechanism — an interviewer reads that as someone who would get the crawler IP-banned and stuck in cycles.

    source: ByteByteGo — Design A Web Crawler ↗
  • Commonly asked senior concept very common How do you choose between SQL and NoSQL for a system-design problem?

    Drive it from access patterns and requirements, not preference. Reach for SQL (Postgres/MySQL) when you need strong consistency and multi-row transactions (ACID), rich ad-hoc queries and joins, and a stable relational schema — payments, orders, anything where correctness beats raw write throughput.

    Reach for NoSQL when you need massive horizontal write scale, a flexible/evolving schema, or a specific access shape: a wide-column store (Cassandra/DynamoDB) for huge write volume and known key lookups, a document store (MongoDB) for nested aggregates, a KV store (Redis) for caching, a graph DB for relationship-heavy traversals.

    The senior framing is the tradeoff: most NoSQL stores trade joins and strong consistency for partition tolerance and horizontal scale, and you must model the table around the query up front. State your access pattern, then justify the store.

    What a strong answer covers
    • Choose from access patterns + consistency needs, never from familiarity.

    • SQL: ACID transactions, joins, ad-hoc queries, stable relational schema (orders, payments).

    • NoSQL: horizontal write scale, flexible schema, query-shaped models (Cassandra/DynamoDB, Mongo, Redis).

    • NoSQL usually trades joins + strong consistency for scale — you model around the query first.

    • It is not all-or-nothing: polyglot persistence — SQL for the core, Redis for cache, a search index alongside.

    Quick self-check

    A payments service needs multi-row transactions and strong consistency. Best default store?

    Red flag Declaring 'NoSQL because it scales' with no access pattern stated — many NoSQL stores need the schema modeled around the exact query, and you lose joins/transactions you may actually need.

    source: System Design Primer — SQL or NoSQL ↗
  • ★ must-know Commonly asked senior concept common Explain the CAP theorem and how it actually informs a design decision.

    CAP says that when a network partition happens, a distributed system can keep only two of Consistency, Availability, Partition tolerance — and since partitions are unavoidable in real networks, the real choice is C vs A during a partition.

    CP (consistency over availability): on a partition, refuse or block requests rather than serve stale/conflicting data — pick this when correctness is non-negotiable (a bank balance, inventory you can oversell). AP (availability over consistency): keep serving on both sides of the partition and reconcile later (eventual consistency) — pick this when staleness is tolerable and uptime matters more (a social feed, a shopping cart, DNS).

    The senior point: CAP only bites *during* a partition; the rest of the time you get both. And it is a spectrum — many stores let you tune consistency per request (e.g. quorum reads/writes), so you choose CP or AP per use case, not per company.

    What a strong answer covers
    • Partitions are inevitable, so the live tradeoff is Consistency vs Availability during a partition.

    • CP: refuse/block on partition to avoid stale data — banking, inventory, anything correctness-critical.

    • AP: stay up and reconcile later (eventual consistency) — feeds, carts, DNS.

    • CAP only constrains you during a partition; normally you get C and A both.

    • Often tunable per request (quorum reads/writes), so the choice is per use case, not absolute.

    Quick self-check

    During a network partition, an 'AP' system chooses to:

    Red flag Stating you 'pick two of three' as a permanent architecture choice — partition tolerance is mandatory, so the decision is C-vs-A only when a partition occurs, and it can be tuned per request.

    source: System Design Primer — CAP theorem ↗
  • Commonly asked senior concept common Why and how would you introduce a message queue between services? What does it buy you?

    A queue (SQS, RabbitMQ) or a log (Kafka) decouples a producer from a consumer: the producer drops a message and moves on, the consumer processes it on its own schedule. That buys you three things — async (the user-facing request returns immediately while slow work happens in the background), buffering (a traffic spike fills the queue instead of overwhelming the consumer), and resilience (if the consumer is down, messages wait instead of being lost).

    Use it for work that does not need a synchronous answer: sending email, generating thumbnails, fanning out notifications, ingesting events. You also gain independent scaling (add consumers to drain a backlog) and smoothing of bursty load.

    The tradeoffs you must name: added operational complexity, eventual rather than immediate results, and the need to handle idempotency because most queues guarantee at-least-once delivery (the same message can arrive twice).

    What a strong answer covers
    • Decouples producer/consumer: async work, buffering of spikes, resilience when a consumer is down.

    • Use it for fire-and-forget work: email, thumbnails, notification fanout, event ingestion.

    • Enables independent scaling — add consumers to drain a backlog.

    • Most queues are at-least-once, so consumers must be idempotent (dedupe on a key).

    • Cost: more moving parts, eventual results, and ordering is not free (often per-partition only).

    Red flag Assuming exactly-once delivery and writing a non-idempotent consumer — at-least-once redelivery then double-charges, double-sends, or double-processes on the inevitable retry.

    source: AWS — What is message queuing? ↗
  • Google senior design common Design a typeahead / search autocomplete service.

    Clarify: top-k suggestions per prefix, ranked by popularity, with very low latency (every keystroke fires a request) and read-heavy load. Two halves — serving and data-gathering.

    Serving: precompute the top-k completions for each prefix so a query is a single lookup, not a scan. A trie with the top-k cached at each node answers a prefix in O(prefix length); cache hot tries/results in Redis at the edge. Debounce on the client and cap suggestions so you do not hammer the backend.

    Data-gathering (offline): aggregate query logs to count frequencies, then rebuild/update the trie periodically (e.g. via a batch job) rather than on every search — autocomplete tolerates being slightly stale. Wrap up: shard the trie by prefix range, discuss freshness vs cost of rebuild cadence, and personalization/spell-correction as extensions.

    Red flag Querying the database for matching terms on every keystroke and sorting at request time — that does not survive the read volume; the win is precomputing top-k per prefix offline and serving from a cached trie.

    source: ByteByteGo — Design A Search Autocomplete System ↗
  • Commonly asked senior concept common What is consistent hashing, and what specific problem does it solve that modulo hashing does not?

    With naive hash(key) % N sharding, changing the node count N changes the modulus, so almost every key remaps to a different node — adding or removing one cache/storage node reshuffles the entire keyspace and cold-starts everything.

    Consistent hashing maps both nodes and keys onto the same hash ring (0…2^m). A key is owned by the next node clockwise. Now adding or removing a node only remaps the keys between that node and its neighbor — roughly 1/N of keys, not all of them.

    The refinement is virtual nodes: place each physical node at many points on the ring so load spreads evenly and removing a node redistributes its keys across many others instead of dumping them all on one neighbor. This is the standard partitioning scheme for distributed caches, Cassandra, and DynamoDB-style stores.

    What a strong answer covers
    • hash(key) % N remaps nearly all keys when N changes — catastrophic for a cache.

    • Consistent hashing puts nodes + keys on a ring; a key goes to the next node clockwise.

    • Adding/removing a node only remaps ~1/N of keys (those between it and its neighbor).

    • Virtual nodes spread each physical node across many ring points for even load + smooth rebalancing.

    • It's the backbone of distributed caches, Cassandra, and DynamoDB-style partitioning.

    Quick self-check

    You add one node to a cluster of N. Roughly what fraction of keys remap under consistent hashing vs `hash % N`?

    Red flag Saying consistent hashing 'avoids collisions' — it is about minimizing key movement when the node set changes, not about hash collisions; without virtual nodes load can still skew badly.

    source: System Design Primer — Consistent hashing / sharding ↗
  • Commonly asked senior concept common When and how do you add a cache to a read-heavy system, and what are the gotchas?

    Add a cache when reads dominate, the same data is read far more often than it changes, and the database is the bottleneck. The most common pattern is cache-aside (lazy loading): the app reads the cache first; on a miss it reads the database, populates the cache with a TTL, and returns. Writes update the database and invalidate or update the cached entry. Alternatives are read-through/write-through (the cache layer itself loads/writes the DB) and write-back (write cache now, flush to DB async — fast but risks loss).

    The gotchas are where seniority shows. Cache invalidation is the hard problem — stale data after a write if you forget to evict. Cache stampede / thundering herd: a hot key expires and thousands of requests hit the DB at once — mitigate with request coalescing, a short lock, or staggered TTLs. Cold start after a flush hammers the DB. And caching is for tolerable-staleness data — never cache something that must be strongly consistent (a bank balance) without care.

    What a strong answer covers
    • Add a cache when reads ≫ writes, data is reused, and the DB is the bottleneck.

    • Cache-aside: read cache → miss → read DB → populate with TTL; writes invalidate the entry.

    • Alternatives: read-through/write-through (cache fronts the DB) and write-back (async flush, risks loss).

    • Invalidation is the hard part — stale reads after a write if you forget to evict.

    • Guard against stampede (hot key expiry → DB flood): coalescing, locks, staggered TTLs.

    Quick self-check

    In the cache-aside pattern, what happens on a cache miss?

    Red flag Caching without an invalidation/TTL strategy — writes update the DB but leave stale entries in the cache, so users keep reading old data until the entry happens to expire.

    source: System Design Primer — Caching ↗