6.1 ★ core [A][I] 14 interview Q's

System design fundamentals

The 4-step interview framework, back-of-the-envelope estimation, the building blocks, and designing for scalability/availability/reliability while naming bottlenecks.

A system-design interview isn’t a quiz with a right answer — it’s a structured conversation where you demonstrate that you can scope a problem, reason about scale with rough numbers, and assemble well-understood building blocks into a design whose tradeoffs you can defend. Most candidates fail not on knowledge but on process: they jump to boxes-and-arrows before they know what they’re building. The 4-step framework below keeps you from freezing on a blank whiteboard and signals seniority — you scope, you estimate, you justify, you self-critique.

Key vocabulary

QPS: Queries per second — the request rate a system must serve. Distinguish average QPS from peak QPS (often 2–10× average); you size for peak, not average.
Stateless service: A service that keeps no per-client state in its own memory; any instance can serve any request. This is what makes horizontal scaling and load balancing trivial — push session/state into a shared store (Redis/DB).
Single point of failure (SPOF): A component whose loss takes the whole system down. The availability fix is redundancy: more than one of everything on the critical path, with automatic failover.
Consistent hashing: A hashing scheme that maps keys onto a ring so that adding/removing a node only remaps 1/N of keys instead of nearly all of them — the basis for sharding caches and databases that scale in and out.
Latency vs throughput: Latency is time-per-request (a single user's wait); throughput is requests-per-second the system sustains. You can trade one for the other (e.g. batching raises throughput but adds latency).
Horizontal vs vertical scaling: Horizontal = add more boxes (commodity, near-limitless, needs statelessness + a load balancer). Vertical = a bigger box (simpler, but a hard ceiling and a single failure domain). Prefer horizontal at scale.

The 4-step framework

Walk these four steps out loud, in order, and spend real time on step 1 — the budget percentages below are roughly how a strong candidate allocates a 45-minute round.

Step	What you do	What the interviewer is checking	Time
1 — Clarify	Pin down functional + non-functional requirements, scale, read/write ratio, what's out of scope	Do you scope before building? Do you ask about scale?	~15%
2 — High-level design	Draw the major components + data flow; get explicit buy-in before drilling	Can you sketch a sane architecture end to end?	~25%
3 — Deep dive	Pick the 1–2 hardest parts (data model, the hot path, a bottleneck) and go deep	Depth on the parts that actually matter	~45%
4 — Wrap up	Name bottlenecks, SPOFs, and tradeoffs; how you'd scale further or what you'd monitor	Do you know where your own design is weak?	~15%

Spend the most time on the deep dive — but never skip step 1.

The non-functional requirements you should always tease out in step 1: expected scale (DAU, QPS), the read/write ratio, latency targets, consistency needs (is stale data acceptable?), and availability targets. These numbers drive every later decision — you can’t choose between a cache-aside Redis layer and a synchronously-replicated store until you know whether the workload is read-heavy and whether stale reads are tolerable. A good habit: state your assumptions aloud (“I’ll assume 100M DAU and a 100:1 read:write ratio — stop me if that’s wrong”) so the interviewer can steer you before you’ve sunk time into the wrong design.

Back-of-the-envelope estimation

You’re not after precision — you’re after the right order of magnitude, fast. Memorize a handful of anchor numbers and round aggressively (treat a day as ~100k seconds, since 86,400 ≈ 10⁵).

Sizing a Twitter-like timeline

Assume 200M daily active users, each posting on average 2 tweets/day, each tweet ~300 bytes of text plus metadata. Estimate write QPS and storage.

Writes/day   = 200M users × 2 tweets   = 400M tweets/day
Seconds/day  ≈ 100,000  (86,400 rounded up)
Avg write QPS = 400M / 100k            = 4,000 writes/sec
Peak ≈ 2–3×  avg                       ≈ 10,000 writes/sec

Storage/day  = 400M × 300 B            = 120 GB/day
Storage/year = 120 GB × 365            ≈ 44 TB/year (text only)

Reads dwarf writes on a social feed — a 100:1 read:write ratio puts you near 400k read QPS, which is the number that actually forces a CDN + cache + read-replica fan-out design. The write path (10k/s) a single sharded primary can absorb; the read path is what you architect around. The discipline that scores points: do the math out loud, round, and then immediately interpret the result — “44 TB/year of text means storage isn’t my problem; 400k read QPS is, so I’ll fan out reads to a cache and replicas.”

The building blocks

These are the LEGO bricks of every design. You should be able to say what each one buys you and what it costs — naming only the win is the junior tell; naming the new problem it introduces is the senior one.

FIG 1 · request path through the building blocks A read request flows left to right: CDN absorbs static hits, the LB fans out to a stateless app tier, the cache shields the DB, and replicas/shards scale the data layer.

Block	What it gives you	When to add it	Cost / caveat
Load balancer	Spreads traffic across instances; health-checks out dead ones	The moment you have >1 app server	Itself a SPOF unless run redundant (active/passive)
Cache (Redis/Memcached)	Sub-ms reads, offloads the DB on read-heavy paths	Hot keys, repeated reads, expensive queries	Invalidation is hard; risk of stale data + thundering herd
CDN	Serves static/edge content near the user; cuts latency + origin load	Images, JS/CSS, video, cacheable API responses	Cache-busting + TTL tuning; cost per GB egress
DB replication	Read replicas scale reads + give HA via failover	Read-heavy workload; need read scaling/HA	Replication lag → stale reads; doesn't scale writes
DB sharding	Splits data across nodes → scales writes + storage	One primary can't hold the write/storage volume	Cross-shard joins + rebalancing are painful; pick keys carefully
Message queue	Decouples producers/consumers; absorbs spikes; async work	Slow/bursty work (email, video encode, fan-out)	At-least-once delivery → consumers must be idempotent
Rate limiter	Protects you from abuse + overload (returns 429)	Public APIs, login, expensive endpoints	Token-bucket/sliding-window state must be shared across instances

Every block solves one problem and introduces a new one — name both.

Stateless services are the glue: keep the app tier stateless so the load balancer can route any request anywhere and you scale by adding identical instances. The corollary every interviewer wants to hear: session state, uploaded files, and in-memory caches must move out of the app process into a shared store (Redis, the DB, object storage) — the instant a user’s second request can land on a different box, anything kept locally is a bug. Consistent hashing is the trick that lets a sharded cache or datastore grow and shrink without remapping every key — when a node joins or leaves the ring, only its neighbor’s slice moves (1/N of keys), instead of the naive hash(key) % N scheme where changing N reshuffles almost everything.

Scalability, availability, reliability — and finding the SPOF

These three get conflated; keep them distinct. Scalability is handling more load (prefer horizontal — add boxes — over vertical — bigger boxes). Availability is the fraction of time the system is up, usually quoted in nines and bought with redundancy + failover. Reliability is doing the right thing despite failures — idempotent retries, timeouts, circuit breakers, graceful degradation.

Nines	Availability	Downtime / year	Roughly
Two nines	99%	~3.65 days	a hobby project
Three nines	99.9%	~8.8 hours	a typical web service
Four nines	99.99%	~52 minutes	a serious SaaS SLA
Five nines	99.999%	~5.3 minutes	telecom / core infra (expensive)

Each extra nine costs roughly an order of magnitude more engineering — quote the target the requirement actually needs.

The senior move in step 4 is to walk the request path and ask “what happens if this dies?” at each hop. A single load balancer, a lone database primary, one cache node holding a hot key, a single availability zone — each is a SPOF until it has a redundant twin with automatic failover. The fix is always the same shape: more than one of it, plus a health check and failover so traffic reroutes automatically.

Designing for the failures you can't prevent Netflix

Netflix runs across thousands of microservices on AWS, and their core architectural bet is that failure is constant, not exceptional — instances die, zones degrade, and dependencies time out continuously at their scale. Rather than try to prevent every failure, they engineer the system to degrade gracefully: fallbacks return a cached or default response when a downstream service is unhealthy, and bulkheads/circuit breakers (popularized by their Hystrix library) stop one slow dependency from exhausting threads and cascading into a total outage. They famously run Chaos Monkey, which deliberately kills production instances during business hours, precisely to guarantee that no single instance is a SPOF — if killing one box takes the service down, that’s a bug to fix before it happens at 3am. The lesson for your interview wrap-up: name the SPOFs, then describe the redundancy and graceful-degradation strategy that survives them.

read the writeup ↗ netflixtechblog.com

Key points

Run the 4 steps in order: clarify → high-level → deep dive → wrap up. Most time on the deep dive; never skip clarifying scope.
Estimate with anchor numbers: ~100k sec/day, RAM ~100 ns, network ~ms. Size for peak QPS, and figure out the read:write ratio early — it drives the whole design.
Know each building block’s win and its cost: replication scales reads not writes; sharding scales writes but kills cross-shard joins; queues need idempotent consumers.
Keep the app tier stateless so it scales horizontally; use consistent hashing to grow/shrink sharded caches without remapping everything.
Distinguish scalability (more load), availability (uptime, bought with redundancy), reliability (correct despite failures); know the nines table.
In wrap-up, walk the path and name every SPOF and tradeoff — knowing where your design is weak is the senior signal.

01 Learning objectives

0 / 4 done

02 Curated reading

The System Design Primer
essential 60m — Free, comprehensive system-design study guide.
Netflix Tech Blog
optional eng blog 30m — Real architecture write-ups: microservices, resilience, caching at scale.

03 Knowledge check

knowledge check3 questions · pass ≥ 70%

01easy
A single point of failure (SPOF) is:
02medium
The FIRST step of the system-design interview framework is:
03hard
Consistent hashing is primarily used to:

04 Interview questions

browse all ↗

What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.

AmazonGoogleMicrosoft senior design very common Design a URL shortener (TinyURL / bit.ly). Walk me through it.
Clarify scope first: read-heavy (~100:1 reads:writes), so optimize the redirect path. Estimate: ~100M new URLs/day -> ~1.2K writes/s, ~120K reads/s; 7-char base62 = 62^7 ~= 3.5T codes, plenty for years. Storage ~500 bytes/row * 100M/day -> ~36TB over 2 years.
Core: a key-gen service maps short->long. Two strategies: (a) base62-encode a globally unique counter (e.g. a range-allocator / Snowflake-style ID) — no collisions, but reveals volume; (b) hash the long URL (MD5/SHA) and take a prefix, then collision-check. Store mapping in a KV store / sharded SQL; reads go cache-first (Redis, LRU on hot links) then DB.
Wrap up: use 301 only if you do not need per-click analytics (browser caches it), else 302; shard by hash of short code; push click analytics to a queue (Kafka) for async aggregation.
Follow-ups they push on
- How do you guarantee globally unique codes across shards without a single counter bottleneck?
- 301 vs 302 — which do you pick and what do you lose with each?
- How would custom aliases and link expiry change the design?
Red flag Jumping straight to a schema before clarifying the read/write ratio and scale. Also picking 301 while still wanting click analytics — the cached redirect never hits your server again.
source: ByteByteGo — Design A URL Shortener ↗
AmazonStripeCloudflare senior design very common Design a distributed rate limiter for a public API.
Clarify: client-side vs server-side (server-side), what dimensions to limit (per-user, per-IP, per-endpoint, global), and the action on limit (drop, queue, return 429 with Retry-After). Pick an algorithm and justify it: token bucket (allows bursts, simple, most common), leaky bucket (smooths output), fixed window (cheap but boundary spikes), sliding-window log (accurate, memory-heavy), sliding-window counter (good accuracy/cost tradeoff).
For a distributed fleet, counters must be shared: keep them in a central store like Redis, keyed by userId:window, incremented atomically (e.g. a Lua script / INCR + EXPIRE) so the read-modify-write is race-free. Put the limiter at the edge / API gateway so rejected traffic never reaches your services.
Discuss tradeoffs: local in-memory counters are fast but let bursts through across nodes; Redis adds a network hop and a single point to scale; allow a small over-limit margin to tolerate Redis latency.
Follow-ups they push on
- Token bucket vs sliding-window counter — when do you prefer each?
- How do you keep the counter consistent across many API servers?
- What happens if Redis goes down — fail open or fail closed?
Red flag Using a non-atomic get-then-set on the counter, which races under concurrency and lets requests slip past the limit. Also putting the limiter behind the app instead of at the gateway.
source: ByteByteGo — Design A Rate Limiter ↗
Meta senior design very common Design a social media news feed (e.g. the Facebook/Twitter timeline).
Clarify: feed of posts from people you follow, ranked (recency or relevance), heavy read load. The core decision is fanout-on-write (push) vs fanout-on-read (pull).
Fanout-on-write: when a user posts, push the post id into every follower's precomputed feed cache. Reads are O(1) and fast — great for most users. But it explodes for celebrities with millions of followers (the hot-key / fanout problem).
Fanout-on-read: build the feed at request time by pulling recent posts from everyone the user follows. No write amplification, but reads are expensive and slow.
The standard answer is a hybrid: push for normal accounts, pull for a small set of high-follower accounts, then merge at read time. Cache assembled feeds (Redis), store posts in a sharded store, and rank with a separate scoring service.
Follow-ups they push on
- How do you handle a celebrity with 50M followers under fanout-on-write?
- Where does ranking/ML scoring fit — write time or read time?
- How do you keep the feed cache from growing unbounded?
Red flag Committing to pure fanout-on-write without acknowledging the celebrity hot-key problem, or pure fanout-on-read and ignoring read latency at scale.
source: ByteByteGo — Design A News Feed System ↗
AmazonMeta senior design common Design a distributed cache (like a multi-node Redis/Memcached layer).
Clarify: read-heavy lookups, low latency, data too large for one node's RAM, so partition across nodes. The key technique is consistent hashing: map both nodes and keys onto a hash ring so that adding/removing a node only remaps ~1/N of keys instead of remapping everything (which a naive hash(key) % N would do).
Discuss replication for availability (replica per shard, read from replicas), an eviction policy (LRU/LFU) since memory is bounded, and the write policy: write-through (write cache + DB together, consistent but slower) vs write-back (write cache, flush DB async, fast but risks loss) vs cache-aside (app reads cache, on miss reads DB and populates).
Wrap up: name failure modes — cache stampede on a hot key expiring (mitigate with request coalescing / a short lock), and the thundering herd on cold start.
Follow-ups they push on
- Why consistent hashing instead of modulo hashing?
- How do you prevent a cache stampede when a hot key expires?
- Cache-aside vs write-through — what consistency do you give up?
Red flag Proposing `hash(key) % N` for sharding — adding one node reshuffles almost every key and cold-starts the whole cache.
source: ByteByteGo — Distributed Cache (System Design Interview) ↗
Commonly asked senior design very common How do you do back-of-the-envelope estimation? Estimate QPS and storage for a service with 100M daily active users.
The point is order-of-magnitude reasoning, not precision. Start from DAU and an action rate. Say each of 100M users does ~10 reads/day -> 1B reads/day. Divide by ~86,400 s/day (~10^5) -> ~12K reads/s average; multiply by a peak factor of ~2-3x -> ~30K peak QPS.
Storage: rows/day * bytes/row * retention. If you store 1M new items/day at ~1KB each, that is ~1GB/day, ~365GB/year, ~1TB over 3 years — round freely.
Keep a few anchors memorized: ~10^5 seconds/day, reads usually dwarf writes (often 100:1), a memory read is ~ns, SSD ~µs, network round-trip cross-region ~tens of ms. State your assumptions out loud and round to clean powers of ten so the interviewer can follow.
Follow-ups they push on
- What read:write ratio did you assume and why?
- How does the storage number change the database choice?
- What peak-to-average factor is reasonable, and why?
Red flag Reaching for a calculator and false precision. The interviewer wants to see assumptions stated and powers-of-ten arithmetic, not 31,536,000 seconds.
source: System Design Primer — Back-of-the-envelope ↗
Commonly asked senior design very common Walk me through the 4-step framework you use to attack any system design interview.
(1) Understand the problem & scope: ask clarifying questions, separate functional from non-functional requirements (scale, latency, consistency, availability), and do capacity estimates. Pin down what is in and out of scope before drawing anything.
(2) Propose a high-level design and get buy-in: sketch the major boxes — clients, API/gateway, services, datastores, cache, queue — and the data flow. Confirm the interviewer agrees before going deep.
(3) Deep dive: pick the 1-2 components the interviewer cares about (the data model, the sharding strategy, the hot path) and go deep — algorithms, schema, partitioning, the actual bottleneck.
(4) Wrap up: name bottlenecks, single points of failure, and tradeoffs; mention what you would monitor and how you would scale the next 10x. The discipline is to drive the conversation, not silently draw.
Follow-ups they push on
- How do you decide which component to deep-dive on?
- What non-functional requirements do you always ask about?
Red flag Skipping step 1 and diving into a database schema before clarifying scale, latency, and consistency needs — the single most common reason candidates fail the round.
source: ByteByteGo — A framework for system design interviews ↗
Commonly asked senior design common How do you identify bottlenecks and single points of failure in a design, and how do you remove them?
Trace the request path and ask, at each hop, what happens if this one component dies or saturates. A single load balancer, a single primary database, a single cache node, or a single region are classic SPOFs.
Remove SPOFs with redundancy + failover: run the load balancer in an active-passive pair, replicate the DB (primary + replicas, automatic promotion), spread services across multiple availability zones, and use health checks so traffic routes away from dead instances.
For bottlenecks, find the component nearest its capacity ceiling: stateless app tier scales horizontally behind the LB; a write-bound DB needs sharding or a queue to absorb bursts; a read-bound DB needs replicas + a cache. The senior move is to quantify it (this shard does X writes/s, the limit is Y) rather than hand-wave 'add more servers'.
Follow-ups they push on
- How does making services stateless help you scale horizontally?
- How do you decide between adding read replicas vs sharding?
Red flag Treating 'add a load balancer' as the whole answer while the load balancer itself remains a single point of failure, or scaling a stateful service horizontally without externalizing session state.
source: System Design Primer — Availability patterns ↗
Google senior design common Design a web crawler that can crawl the public web.
Clarify: scale (billions of pages), politeness (respect robots.txt and per-host rate limits), freshness, and what you extract. Core loop: a URL frontier (a queue, partitioned by host so one host's pages go to one worker for politeness) feeds a fleet of fetchers; fetched HTML goes to a parser that extracts links, which are de-duplicated and fed back into the frontier.
Key components: a DNS cache (DNS resolution is a hidden bottleneck), a seen-URL set (Bloom filter / hash store) to avoid re-crawling, and content de-duplication (hash or Simhash of page content to skip near-duplicates). Store raw pages in object storage / a distributed file store.
Wrap up: politeness is the subtle part — partition the frontier by domain and apply a per-host crawl delay so you do not hammer one site; add priority queues so important pages get crawled sooner.
Follow-ups they push on
- How do you avoid crawling the same URL (or near-duplicate content) twice?
- How do you stay polite to a single host while still being massively parallel?
- Why is DNS a bottleneck and how do you mitigate it?
Red flag Forgetting politeness/robots.txt and a de-dup mechanism — an interviewer reads that as someone who would get the crawler IP-banned and stuck in cycles.
source: ByteByteGo — Design A Web Crawler ↗
Commonly asked senior concept very common How do you choose between SQL and NoSQL for a system-design problem?
Drive it from access patterns and requirements, not preference. Reach for SQL (Postgres/MySQL) when you need strong consistency and multi-row transactions (ACID), rich ad-hoc queries and joins, and a stable relational schema — payments, orders, anything where correctness beats raw write throughput.
Reach for NoSQL when you need massive horizontal write scale, a flexible/evolving schema, or a specific access shape: a wide-column store (Cassandra/DynamoDB) for huge write volume and known key lookups, a document store (MongoDB) for nested aggregates, a KV store (Redis) for caching, a graph DB for relationship-heavy traversals.
The senior framing is the tradeoff: most NoSQL stores trade joins and strong consistency for partition tolerance and horizontal scale, and you must model the table around the query up front. State your access pattern, then justify the store.
What a strong answer covers
- Choose from access patterns + consistency needs, never from familiarity.
- SQL: ACID transactions, joins, ad-hoc queries, stable relational schema (orders, payments).
- NoSQL: horizontal write scale, flexible schema, query-shaped models (Cassandra/DynamoDB, Mongo, Redis).
- NoSQL usually trades joins + strong consistency for scale — you model around the query first.
- It is not all-or-nothing: polyglot persistence — SQL for the core, Redis for cache, a search index alongside.
Quick self-check
A payments service needs multi-row transactions and strong consistency. Best default store?
Follow-ups they push on
- Which store fits a write-heavy event log with known key lookups?
- What do you give up when you pick a wide-column store over Postgres?
- When would you run both a SQL store and a NoSQL store in the same system?
Red flag Declaring 'NoSQL because it scales' with no access pattern stated — many NoSQL stores need the schema modeled around the exact query, and you lose joins/transactions you may actually need.
source: System Design Primer — SQL or NoSQL ↗
★ must-know Commonly asked senior concept common Explain the CAP theorem and how it actually informs a design decision.
CAP says that when a network partition happens, a distributed system can keep only two of Consistency, Availability, Partition tolerance — and since partitions are unavoidable in real networks, the real choice is C vs A during a partition.
CP (consistency over availability): on a partition, refuse or block requests rather than serve stale/conflicting data — pick this when correctness is non-negotiable (a bank balance, inventory you can oversell). AP (availability over consistency): keep serving on both sides of the partition and reconcile later (eventual consistency) — pick this when staleness is tolerable and uptime matters more (a social feed, a shopping cart, DNS).
The senior point: CAP only bites *during* a partition; the rest of the time you get both. And it is a spectrum — many stores let you tune consistency per request (e.g. quorum reads/writes), so you choose CP or AP per use case, not per company.
What a strong answer covers
- Partitions are inevitable, so the live tradeoff is Consistency vs Availability during a partition.
- CP: refuse/block on partition to avoid stale data — banking, inventory, anything correctness-critical.
- AP: stay up and reconcile later (eventual consistency) — feeds, carts, DNS.
- CAP only constrains you during a partition; normally you get C and A both.
- Often tunable per request (quorum reads/writes), so the choice is per use case, not absolute.
Quick self-check
During a network partition, an 'AP' system chooses to:
Follow-ups they push on
- Give a concrete system you'd build CP and one you'd build AP, and why.
- How do quorum reads/writes let you tune where you sit on the spectrum?
- What does 'eventual consistency' actually promise the client?
Red flag Stating you 'pick two of three' as a permanent architecture choice — partition tolerance is mandatory, so the decision is C-vs-A only when a partition occurs, and it can be tuned per request.
source: System Design Primer — CAP theorem ↗
Commonly asked senior concept common Why and how would you introduce a message queue between services? What does it buy you?
A queue (SQS, RabbitMQ) or a log (Kafka) decouples a producer from a consumer: the producer drops a message and moves on, the consumer processes it on its own schedule. That buys you three things — async (the user-facing request returns immediately while slow work happens in the background), buffering (a traffic spike fills the queue instead of overwhelming the consumer), and resilience (if the consumer is down, messages wait instead of being lost).
Use it for work that does not need a synchronous answer: sending email, generating thumbnails, fanning out notifications, ingesting events. You also gain independent scaling (add consumers to drain a backlog) and smoothing of bursty load.
The tradeoffs you must name: added operational complexity, eventual rather than immediate results, and the need to handle idempotency because most queues guarantee at-least-once delivery (the same message can arrive twice).
What a strong answer covers
- Decouples producer/consumer: async work, buffering of spikes, resilience when a consumer is down.
- Use it for fire-and-forget work: email, thumbnails, notification fanout, event ingestion.
- Enables independent scaling — add consumers to drain a backlog.
- Most queues are at-least-once, so consumers must be idempotent (dedupe on a key).
- Cost: more moving parts, eventual results, and ordering is not free (often per-partition only).
Follow-ups they push on
- Why must queue consumers usually be idempotent?
- What's the difference between a queue (SQS/RabbitMQ) and a log (Kafka)?
- How does a dead-letter queue help, and when do messages land there?
Red flag Assuming exactly-once delivery and writing a non-idempotent consumer — at-least-once redelivery then double-charges, double-sends, or double-processes on the inevitable retry.
source: AWS — What is message queuing? ↗
Google senior design common Design a typeahead / search autocomplete service.
Clarify: top-k suggestions per prefix, ranked by popularity, with very low latency (every keystroke fires a request) and read-heavy load. Two halves — serving and data-gathering.
Serving: precompute the top-k completions for each prefix so a query is a single lookup, not a scan. A trie with the top-k cached at each node answers a prefix in O(prefix length); cache hot tries/results in Redis at the edge. Debounce on the client and cap suggestions so you do not hammer the backend.
Data-gathering (offline): aggregate query logs to count frequencies, then rebuild/update the trie periodically (e.g. via a batch job) rather than on every search — autocomplete tolerates being slightly stale. Wrap up: shard the trie by prefix range, discuss freshness vs cost of rebuild cadence, and personalization/spell-correction as extensions.
Follow-ups they push on
- Why precompute top-k per prefix instead of querying at request time?
- How do you keep the suggestions fresh without rebuilding the trie on every query?
- How would you shard the trie across nodes?
Red flag Querying the database for matching terms on every keystroke and sorting at request time — that does not survive the read volume; the win is precomputing top-k per prefix offline and serving from a cached trie.
source: ByteByteGo — Design A Search Autocomplete System ↗
Commonly asked senior concept common What is consistent hashing, and what specific problem does it solve that modulo hashing does not?
With naive hash(key) % N sharding, changing the node count N changes the modulus, so almost every key remaps to a different node — adding or removing one cache/storage node reshuffles the entire keyspace and cold-starts everything.
Consistent hashing maps both nodes and keys onto the same hash ring (0…2^m). A key is owned by the next node clockwise. Now adding or removing a node only remaps the keys between that node and its neighbor — roughly 1/N of keys, not all of them.
The refinement is virtual nodes: place each physical node at many points on the ring so load spreads evenly and removing a node redistributes its keys across many others instead of dumping them all on one neighbor. This is the standard partitioning scheme for distributed caches, Cassandra, and DynamoDB-style stores.
What a strong answer covers
- hash(key) % N remaps nearly all keys when N changes — catastrophic for a cache.
- Consistent hashing puts nodes + keys on a ring; a key goes to the next node clockwise.
- Adding/removing a node only remaps ~1/N of keys (those between it and its neighbor).
- Virtual nodes spread each physical node across many ring points for even load + smooth rebalancing.
- It's the backbone of distributed caches, Cassandra, and DynamoDB-style partitioning.
Quick self-check
You add one node to a cluster of N. Roughly what fraction of keys remap under consistent hashing vs `hash % N`?
Follow-ups they push on
- Why do virtual nodes improve load balance and rebalancing?
- Roughly what fraction of keys move when you add one node to a ring of N?
- How does this connect to designing a distributed cache?
Red flag Saying consistent hashing 'avoids collisions' — it is about minimizing key movement when the node set changes, not about hash collisions; without virtual nodes load can still skew badly.
source: System Design Primer — Consistent hashing / sharding ↗
Commonly asked senior concept common When and how do you add a cache to a read-heavy system, and what are the gotchas?
Add a cache when reads dominate, the same data is read far more often than it changes, and the database is the bottleneck. The most common pattern is cache-aside (lazy loading): the app reads the cache first; on a miss it reads the database, populates the cache with a TTL, and returns. Writes update the database and invalidate or update the cached entry. Alternatives are read-through/write-through (the cache layer itself loads/writes the DB) and write-back (write cache now, flush to DB async — fast but risks loss).
The gotchas are where seniority shows. Cache invalidation is the hard problem — stale data after a write if you forget to evict. Cache stampede / thundering herd: a hot key expires and thousands of requests hit the DB at once — mitigate with request coalescing, a short lock, or staggered TTLs. Cold start after a flush hammers the DB. And caching is for tolerable-staleness data — never cache something that must be strongly consistent (a bank balance) without care.
What a strong answer covers
- Add a cache when reads ≫ writes, data is reused, and the DB is the bottleneck.
- Cache-aside: read cache → miss → read DB → populate with TTL; writes invalidate the entry.
- Alternatives: read-through/write-through (cache fronts the DB) and write-back (async flush, risks loss).
- Invalidation is the hard part — stale reads after a write if you forget to evict.
- Guard against stampede (hot key expiry → DB flood): coalescing, locks, staggered TTLs.
Quick self-check
In the cache-aside pattern, what happens on a cache miss?
Follow-ups they push on
- Why is cache invalidation famously the hard part of caching?
- How do you prevent a cache stampede when a popular key expires?
- What data should you NOT cache, and why?
Red flag Caching without an invalidation/TTL strategy — writes update the DB but leave stale entries in the cache, so users keep reading old data until the entry happens to expire.
source: System Design Primer — Caching ↗