System design fundamentals
The 4-step interview framework, back-of-the-envelope estimation, the building blocks, and designing for scalability/availability/reliability while naming bottlenecks.
A system-design interview isn’t a quiz with a right answer — it’s a structured conversation where you demonstrate that you can scope a problem, reason about scale with rough numbers, and assemble well-understood building blocks into a design whose tradeoffs you can defend. Most candidates fail not on knowledge but on process: they jump to boxes-and-arrows before they know what they’re building. The 4-step framework below keeps you from freezing on a blank whiteboard and signals seniority — you scope, you estimate, you justify, you self-critique.
The 4-step framework
Walk these four steps out loud, in order, and spend real time on step 1 — the budget percentages below are roughly how a strong candidate allocates a 45-minute round.
| Step | What you do | What the interviewer is checking | Time |
|---|---|---|---|
| 1 — Clarify | Pin down functional + non-functional requirements, scale, read/write ratio, what's out of scope | Do you scope before building? Do you ask about scale? | ~15% |
| 2 — High-level design | Draw the major components + data flow; get explicit buy-in before drilling | Can you sketch a sane architecture end to end? | ~25% |
| 3 — Deep dive | Pick the 1–2 hardest parts (data model, the hot path, a bottleneck) and go deep | Depth on the parts that actually matter | ~45% |
| 4 — Wrap up | Name bottlenecks, SPOFs, and tradeoffs; how you'd scale further or what you'd monitor | Do you know where your own design is weak? | ~15% |
The non-functional requirements you should always tease out in step 1: expected scale (DAU, QPS), the read/write ratio, latency targets, consistency needs (is stale data acceptable?), and availability targets. These numbers drive every later decision — you can’t choose between a cache-aside Redis layer and a synchronously-replicated store until you know whether the workload is read-heavy and whether stale reads are tolerable. A good habit: state your assumptions aloud (“I’ll assume 100M DAU and a 100:1 read:write ratio — stop me if that’s wrong”) so the interviewer can steer you before you’ve sunk time into the wrong design.
Back-of-the-envelope estimation
You’re not after precision — you’re after the right order of magnitude, fast. Memorize a handful of anchor numbers and round aggressively (treat a day as ~100k seconds, since 86,400 ≈ 10⁵).
Assume 200M daily active users, each posting on average 2 tweets/day, each tweet ~300 bytes of text plus metadata. Estimate write QPS and storage.
Writes/day = 200M users × 2 tweets = 400M tweets/day
Seconds/day ≈ 100,000 (86,400 rounded up)
Avg write QPS = 400M / 100k = 4,000 writes/sec
Peak ≈ 2–3× avg ≈ 10,000 writes/sec
Storage/day = 400M × 300 B = 120 GB/day
Storage/year = 120 GB × 365 ≈ 44 TB/year (text only)Reads dwarf writes on a social feed — a 100:1 read:write ratio puts you near 400k read QPS, which is the number that actually forces a CDN + cache + read-replica fan-out design. The write path (10k/s) a single sharded primary can absorb; the read path is what you architect around. The discipline that scores points: do the math out loud, round, and then immediately interpret the result — “44 TB/year of text means storage isn’t my problem; 400k read QPS is, so I’ll fan out reads to a cache and replicas.”
The building blocks
These are the LEGO bricks of every design. You should be able to say what each one buys you and what it costs — naming only the win is the junior tell; naming the new problem it introduces is the senior one.
| Block | What it gives you | When to add it | Cost / caveat |
|---|---|---|---|
| Load balancer | Spreads traffic across instances; health-checks out dead ones | The moment you have >1 app server | Itself a SPOF unless run redundant (active/passive) |
| Cache (Redis/Memcached) | Sub-ms reads, offloads the DB on read-heavy paths | Hot keys, repeated reads, expensive queries | Invalidation is hard; risk of stale data + thundering herd |
| CDN | Serves static/edge content near the user; cuts latency + origin load | Images, JS/CSS, video, cacheable API responses | Cache-busting + TTL tuning; cost per GB egress |
| DB replication | Read replicas scale reads + give HA via failover | Read-heavy workload; need read scaling/HA | Replication lag → stale reads; doesn't scale writes |
| DB sharding | Splits data across nodes → scales writes + storage | One primary can't hold the write/storage volume | Cross-shard joins + rebalancing are painful; pick keys carefully |
| Message queue | Decouples producers/consumers; absorbs spikes; async work | Slow/bursty work (email, video encode, fan-out) | At-least-once delivery → consumers must be idempotent |
| Rate limiter | Protects you from abuse + overload (returns 429) | Public APIs, login, expensive endpoints | Token-bucket/sliding-window state must be shared across instances |
Stateless services are the glue: keep the app tier stateless so the load balancer can route any request anywhere and you scale by adding identical instances. The corollary every interviewer wants to hear: session state, uploaded files, and in-memory caches must move out of the app process into a shared store (Redis, the DB, object storage) — the instant a user’s second request can land on a different box, anything kept locally is a bug. Consistent hashing is the trick that lets a sharded cache or datastore grow and shrink without remapping every key — when a node joins or leaves the ring, only its neighbor’s slice moves (1/N of keys), instead of the naive hash(key) % N scheme where changing N reshuffles almost everything.
Scalability, availability, reliability — and finding the SPOF
These three get conflated; keep them distinct. Scalability is handling more load (prefer horizontal — add boxes — over vertical — bigger boxes). Availability is the fraction of time the system is up, usually quoted in nines and bought with redundancy + failover. Reliability is doing the right thing despite failures — idempotent retries, timeouts, circuit breakers, graceful degradation.
| Nines | Availability | Downtime / year | Roughly |
|---|---|---|---|
| Two nines | 99% | ~3.65 days | a hobby project |
| Three nines | 99.9% | ~8.8 hours | a typical web service |
| Four nines | 99.99% | ~52 minutes | a serious SaaS SLA |
| Five nines | 99.999% | ~5.3 minutes | telecom / core infra (expensive) |
The senior move in step 4 is to walk the request path and ask “what happens if this dies?” at each hop. A single load balancer, a lone database primary, one cache node holding a hot key, a single availability zone — each is a SPOF until it has a redundant twin with automatic failover. The fix is always the same shape: more than one of it, plus a health check and failover so traffic reroutes automatically.
01 Learning objectives
0 / 4 done02 Curated reading
03 Knowledge check
- 01easy
A single point of failure (SPOF) is:
- 02medium
The FIRST step of the system-design interview framework is:
- 03hard
Consistent hashing is primarily used to:
04 Interview questions
browse all ↗What gets asked on this topic — tap a card for how to approach it, the follow-ups, and the trap. Company tags are best-effort & sourced.
-
Design a URL shortener (TinyURL / bit.ly). Walk me through it.
Clarify scope first: read-heavy (~100:1 reads:writes), so optimize the redirect path. Estimate: ~100M new URLs/day -> ~1.2K writes/s, ~120K reads/s; 7-char base62 = 62^7 ~= 3.5T codes, plenty for years. Storage ~500 bytes/row * 100M/day -> ~36TB over 2 years.
Core: a key-gen service maps short->long. Two strategies: (a) base62-encode a globally unique counter (e.g. a range-allocator / Snowflake-style ID) — no collisions, but reveals volume; (b) hash the long URL (MD5/SHA) and take a prefix, then collision-check. Store mapping in a KV store / sharded SQL; reads go cache-first (Redis, LRU on hot links) then DB.
Wrap up: use
301only if you do not need per-click analytics (browser caches it), else302; shard by hash of short code; push click analytics to a queue (Kafka) for async aggregation.Follow-ups they push on- How do you guarantee globally unique codes across shards without a single counter bottleneck?
- 301 vs 302 — which do you pick and what do you lose with each?
- How would custom aliases and link expiry change the design?
Red flag Jumping straight to a schema before clarifying the read/write ratio and scale. Also picking 301 while still wanting click analytics — the cached redirect never hits your server again.
source: ByteByteGo — Design A URL Shortener ↗ -
Design a distributed rate limiter for a public API.
Clarify: client-side vs server-side (server-side), what dimensions to limit (per-user, per-IP, per-endpoint, global), and the action on limit (drop, queue, return
429withRetry-After). Pick an algorithm and justify it: token bucket (allows bursts, simple, most common), leaky bucket (smooths output), fixed window (cheap but boundary spikes), sliding-window log (accurate, memory-heavy), sliding-window counter (good accuracy/cost tradeoff).For a distributed fleet, counters must be shared: keep them in a central store like Redis, keyed by
userId:window, incremented atomically (e.g. a Lua script /INCR+EXPIRE) so the read-modify-write is race-free. Put the limiter at the edge / API gateway so rejected traffic never reaches your services.Discuss tradeoffs: local in-memory counters are fast but let bursts through across nodes; Redis adds a network hop and a single point to scale; allow a small over-limit margin to tolerate Redis latency.
Follow-ups they push on- Token bucket vs sliding-window counter — when do you prefer each?
- How do you keep the counter consistent across many API servers?
- What happens if Redis goes down — fail open or fail closed?
Red flag Using a non-atomic get-then-set on the counter, which races under concurrency and lets requests slip past the limit. Also putting the limiter behind the app instead of at the gateway.
source: ByteByteGo — Design A Rate Limiter ↗ -
Design a social media news feed (e.g. the Facebook/Twitter timeline).
Clarify: feed of posts from people you follow, ranked (recency or relevance), heavy read load. The core decision is fanout-on-write (push) vs fanout-on-read (pull).
Fanout-on-write: when a user posts, push the post id into every follower's precomputed feed cache. Reads are O(1) and fast — great for most users. But it explodes for celebrities with millions of followers (the hot-key / fanout problem).
Fanout-on-read: build the feed at request time by pulling recent posts from everyone the user follows. No write amplification, but reads are expensive and slow.
The standard answer is a hybrid: push for normal accounts, pull for a small set of high-follower accounts, then merge at read time. Cache assembled feeds (Redis), store posts in a sharded store, and rank with a separate scoring service.
Follow-ups they push on- How do you handle a celebrity with 50M followers under fanout-on-write?
- Where does ranking/ML scoring fit — write time or read time?
- How do you keep the feed cache from growing unbounded?
Red flag Committing to pure fanout-on-write without acknowledging the celebrity hot-key problem, or pure fanout-on-read and ignoring read latency at scale.
source: ByteByteGo — Design A News Feed System ↗ -
Design a distributed cache (like a multi-node Redis/Memcached layer).
Clarify: read-heavy lookups, low latency, data too large for one node's RAM, so partition across nodes. The key technique is consistent hashing: map both nodes and keys onto a hash ring so that adding/removing a node only remaps ~1/N of keys instead of remapping everything (which a naive
hash(key) % Nwould do).Discuss replication for availability (replica per shard, read from replicas), an eviction policy (LRU/LFU) since memory is bounded, and the write policy: write-through (write cache + DB together, consistent but slower) vs write-back (write cache, flush DB async, fast but risks loss) vs cache-aside (app reads cache, on miss reads DB and populates).
Wrap up: name failure modes — cache stampede on a hot key expiring (mitigate with request coalescing / a short lock), and the thundering herd on cold start.
Follow-ups they push on- Why consistent hashing instead of modulo hashing?
- How do you prevent a cache stampede when a hot key expires?
- Cache-aside vs write-through — what consistency do you give up?
Red flag Proposing `hash(key) % N` for sharding — adding one node reshuffles almost every key and cold-starts the whole cache.
source: ByteByteGo — Distributed Cache (System Design Interview) ↗ -
How do you do back-of-the-envelope estimation? Estimate QPS and storage for a service with 100M daily active users.
The point is order-of-magnitude reasoning, not precision. Start from DAU and an action rate. Say each of 100M users does ~10 reads/day -> 1B reads/day. Divide by ~86,400 s/day (~10^5) -> ~12K reads/s average; multiply by a peak factor of ~2-3x -> ~30K peak QPS.
Storage: rows/day * bytes/row * retention. If you store 1M new items/day at ~1KB each, that is ~1GB/day, ~365GB/year, ~1TB over 3 years — round freely.
Keep a few anchors memorized: ~10^5 seconds/day, reads usually dwarf writes (often 100:1), a memory read is ~ns, SSD ~µs, network round-trip cross-region ~tens of ms. State your assumptions out loud and round to clean powers of ten so the interviewer can follow.
Follow-ups they push on- What read:write ratio did you assume and why?
- How does the storage number change the database choice?
- What peak-to-average factor is reasonable, and why?
Red flag Reaching for a calculator and false precision. The interviewer wants to see assumptions stated and powers-of-ten arithmetic, not 31,536,000 seconds.
source: System Design Primer — Back-of-the-envelope ↗ -
Walk me through the 4-step framework you use to attack any system design interview.
(1) Understand the problem & scope: ask clarifying questions, separate functional from non-functional requirements (scale, latency, consistency, availability), and do capacity estimates. Pin down what is in and out of scope before drawing anything.
(2) Propose a high-level design and get buy-in: sketch the major boxes — clients, API/gateway, services, datastores, cache, queue — and the data flow. Confirm the interviewer agrees before going deep.
(3) Deep dive: pick the 1-2 components the interviewer cares about (the data model, the sharding strategy, the hot path) and go deep — algorithms, schema, partitioning, the actual bottleneck.
(4) Wrap up: name bottlenecks, single points of failure, and tradeoffs; mention what you would monitor and how you would scale the next 10x. The discipline is to drive the conversation, not silently draw.
Follow-ups they push on- How do you decide which component to deep-dive on?
- What non-functional requirements do you always ask about?
Red flag Skipping step 1 and diving into a database schema before clarifying scale, latency, and consistency needs — the single most common reason candidates fail the round.
source: ByteByteGo — A framework for system design interviews ↗ -
How do you identify bottlenecks and single points of failure in a design, and how do you remove them?
Trace the request path and ask, at each hop, what happens if this one component dies or saturates. A single load balancer, a single primary database, a single cache node, or a single region are classic SPOFs.
Remove SPOFs with redundancy + failover: run the load balancer in an active-passive pair, replicate the DB (primary + replicas, automatic promotion), spread services across multiple availability zones, and use health checks so traffic routes away from dead instances.
For bottlenecks, find the component nearest its capacity ceiling: stateless app tier scales horizontally behind the LB; a write-bound DB needs sharding or a queue to absorb bursts; a read-bound DB needs replicas + a cache. The senior move is to quantify it (this shard does X writes/s, the limit is Y) rather than hand-wave 'add more servers'.
Follow-ups they push on- How does making services stateless help you scale horizontally?
- How do you decide between adding read replicas vs sharding?
Red flag Treating 'add a load balancer' as the whole answer while the load balancer itself remains a single point of failure, or scaling a stateful service horizontally without externalizing session state.
source: System Design Primer — Availability patterns ↗ -
Design a web crawler that can crawl the public web.
Clarify: scale (billions of pages), politeness (respect robots.txt and per-host rate limits), freshness, and what you extract. Core loop: a URL frontier (a queue, partitioned by host so one host's pages go to one worker for politeness) feeds a fleet of fetchers; fetched HTML goes to a parser that extracts links, which are de-duplicated and fed back into the frontier.
Key components: a DNS cache (DNS resolution is a hidden bottleneck), a seen-URL set (Bloom filter / hash store) to avoid re-crawling, and content de-duplication (hash or Simhash of page content to skip near-duplicates). Store raw pages in object storage / a distributed file store.
Wrap up: politeness is the subtle part — partition the frontier by domain and apply a per-host crawl delay so you do not hammer one site; add priority queues so important pages get crawled sooner.
Follow-ups they push on- How do you avoid crawling the same URL (or near-duplicate content) twice?
- How do you stay polite to a single host while still being massively parallel?
- Why is DNS a bottleneck and how do you mitigate it?
Red flag Forgetting politeness/robots.txt and a de-dup mechanism — an interviewer reads that as someone who would get the crawler IP-banned and stuck in cycles.
source: ByteByteGo — Design A Web Crawler ↗ -
How do you choose between SQL and NoSQL for a system-design problem?
Drive it from access patterns and requirements, not preference. Reach for SQL (Postgres/MySQL) when you need strong consistency and multi-row transactions (ACID), rich ad-hoc queries and joins, and a stable relational schema — payments, orders, anything where correctness beats raw write throughput.
Reach for NoSQL when you need massive horizontal write scale, a flexible/evolving schema, or a specific access shape: a wide-column store (Cassandra/DynamoDB) for huge write volume and known key lookups, a document store (MongoDB) for nested aggregates, a KV store (Redis) for caching, a graph DB for relationship-heavy traversals.
The senior framing is the tradeoff: most NoSQL stores trade joins and strong consistency for partition tolerance and horizontal scale, and you must model the table around the query up front. State your access pattern, then justify the store.
What a strong answer coversChoose from access patterns + consistency needs, never from familiarity.
SQL: ACID transactions, joins, ad-hoc queries, stable relational schema (orders, payments).
NoSQL: horizontal write scale, flexible schema, query-shaped models (Cassandra/DynamoDB, Mongo, Redis).
NoSQL usually trades joins + strong consistency for scale — you model around the query first.
It is not all-or-nothing: polyglot persistence — SQL for the core, Redis for cache, a search index alongside.
Quick self-checkA payments service needs multi-row transactions and strong consistency. Best default store?
-
Correct — ACID transactions, joins, and strong consistency are exactly what relational stores guarantee.
-
Optimized for huge write throughput with eventual consistency — weak fit for cross-row transactional correctness.
-
Great as a cache, but not a durable system-of-record for transactional money movement.
-
Flexible schema, but you would be fighting it for multi-document ACID; not the default for payments.
Follow-ups they push on- Which store fits a write-heavy event log with known key lookups?
- What do you give up when you pick a wide-column store over Postgres?
- When would you run both a SQL store and a NoSQL store in the same system?
Red flag Declaring 'NoSQL because it scales' with no access pattern stated — many NoSQL stores need the schema modeled around the exact query, and you lose joins/transactions you may actually need.
source: System Design Primer — SQL or NoSQL ↗ -
Explain the CAP theorem and how it actually informs a design decision.
CAP says that when a network partition happens, a distributed system can keep only two of Consistency, Availability, Partition tolerance — and since partitions are unavoidable in real networks, the real choice is C vs A during a partition.
CP (consistency over availability): on a partition, refuse or block requests rather than serve stale/conflicting data — pick this when correctness is non-negotiable (a bank balance, inventory you can oversell). AP (availability over consistency): keep serving on both sides of the partition and reconcile later (eventual consistency) — pick this when staleness is tolerable and uptime matters more (a social feed, a shopping cart, DNS).
The senior point: CAP only bites *during* a partition; the rest of the time you get both. And it is a spectrum — many stores let you tune consistency per request (e.g. quorum reads/writes), so you choose CP or AP per use case, not per company.
What a strong answer coversPartitions are inevitable, so the live tradeoff is Consistency vs Availability during a partition.
CP: refuse/block on partition to avoid stale data — banking, inventory, anything correctness-critical.
AP: stay up and reconcile later (eventual consistency) — feeds, carts, DNS.
CAP only constrains you during a partition; normally you get C and A both.
Often tunable per request (quorum reads/writes), so the choice is per use case, not absolute.
Quick self-checkDuring a network partition, an 'AP' system chooses to:
-
Correct — AP favors availability, accepting temporary inconsistency that is resolved after the partition heals.
-
That is CP behavior — sacrificing availability to avoid divergence.
-
Not a CAP option; the theorem is about C vs A, not total shutdown.
-
Impossible during a partition — that is precisely what CAP rules out.
Follow-ups they push on- Give a concrete system you'd build CP and one you'd build AP, and why.
- How do quorum reads/writes let you tune where you sit on the spectrum?
- What does 'eventual consistency' actually promise the client?
Red flag Stating you 'pick two of three' as a permanent architecture choice — partition tolerance is mandatory, so the decision is C-vs-A only when a partition occurs, and it can be tuned per request.
source: System Design Primer — CAP theorem ↗ -
Why and how would you introduce a message queue between services? What does it buy you?
A queue (SQS, RabbitMQ) or a log (Kafka) decouples a producer from a consumer: the producer drops a message and moves on, the consumer processes it on its own schedule. That buys you three things — async (the user-facing request returns immediately while slow work happens in the background), buffering (a traffic spike fills the queue instead of overwhelming the consumer), and resilience (if the consumer is down, messages wait instead of being lost).
Use it for work that does not need a synchronous answer: sending email, generating thumbnails, fanning out notifications, ingesting events. You also gain independent scaling (add consumers to drain a backlog) and smoothing of bursty load.
The tradeoffs you must name: added operational complexity, eventual rather than immediate results, and the need to handle idempotency because most queues guarantee at-least-once delivery (the same message can arrive twice).
What a strong answer coversDecouples producer/consumer: async work, buffering of spikes, resilience when a consumer is down.
Use it for fire-and-forget work: email, thumbnails, notification fanout, event ingestion.
Enables independent scaling — add consumers to drain a backlog.
Most queues are at-least-once, so consumers must be idempotent (dedupe on a key).
Cost: more moving parts, eventual results, and ordering is not free (often per-partition only).
Follow-ups they push on- Why must queue consumers usually be idempotent?
- What's the difference between a queue (SQS/RabbitMQ) and a log (Kafka)?
- How does a dead-letter queue help, and when do messages land there?
Red flag Assuming exactly-once delivery and writing a non-idempotent consumer — at-least-once redelivery then double-charges, double-sends, or double-processes on the inevitable retry.
source: AWS — What is message queuing? ↗ -
Design a typeahead / search autocomplete service.
Clarify: top-k suggestions per prefix, ranked by popularity, with very low latency (every keystroke fires a request) and read-heavy load. Two halves — serving and data-gathering.
Serving: precompute the top-k completions for each prefix so a query is a single lookup, not a scan. A trie with the top-k cached at each node answers a prefix in O(prefix length); cache hot tries/results in Redis at the edge. Debounce on the client and cap suggestions so you do not hammer the backend.
Data-gathering (offline): aggregate query logs to count frequencies, then rebuild/update the trie periodically (e.g. via a batch job) rather than on every search — autocomplete tolerates being slightly stale. Wrap up: shard the trie by prefix range, discuss freshness vs cost of rebuild cadence, and personalization/spell-correction as extensions.
Follow-ups they push on- Why precompute top-k per prefix instead of querying at request time?
- How do you keep the suggestions fresh without rebuilding the trie on every query?
- How would you shard the trie across nodes?
Red flag Querying the database for matching terms on every keystroke and sorting at request time — that does not survive the read volume; the win is precomputing top-k per prefix offline and serving from a cached trie.
source: ByteByteGo — Design A Search Autocomplete System ↗ -
What is consistent hashing, and what specific problem does it solve that modulo hashing does not?
With naive
hash(key) % Nsharding, changing the node countNchanges the modulus, so almost every key remaps to a different node — adding or removing one cache/storage node reshuffles the entire keyspace and cold-starts everything.Consistent hashing maps both nodes and keys onto the same hash ring (0…2^m). A key is owned by the next node clockwise. Now adding or removing a node only remaps the keys between that node and its neighbor — roughly 1/N of keys, not all of them.
The refinement is virtual nodes: place each physical node at many points on the ring so load spreads evenly and removing a node redistributes its keys across many others instead of dumping them all on one neighbor. This is the standard partitioning scheme for distributed caches, Cassandra, and DynamoDB-style stores.
What a strong answer covershash(key) % Nremaps nearly all keys when N changes — catastrophic for a cache.Consistent hashing puts nodes + keys on a ring; a key goes to the next node clockwise.
Adding/removing a node only remaps ~1/N of keys (those between it and its neighbor).
Virtual nodes spread each physical node across many ring points for even load + smooth rebalancing.
It's the backbone of distributed caches, Cassandra, and DynamoDB-style partitioning.
Quick self-checkYou add one node to a cluster of N. Roughly what fraction of keys remap under consistent hashing vs `hash % N`?
-
Correct — consistent hashing moves only the keys near the new node; modulo changes the divisor and reshuffles almost everything.
-
Wrong for consistent hashing — its entire purpose is to bound key movement to ~1/N.
-
Wrong for `hash % N` — changing N changes the modulus, remapping nearly every key.
-
Adding capacity must move some keys onto the new node; zero movement is impossible.
Follow-ups they push on- Why do virtual nodes improve load balance and rebalancing?
- Roughly what fraction of keys move when you add one node to a ring of N?
- How does this connect to designing a distributed cache?
Red flag Saying consistent hashing 'avoids collisions' — it is about minimizing key movement when the node set changes, not about hash collisions; without virtual nodes load can still skew badly.
source: System Design Primer — Consistent hashing / sharding ↗ -
When and how do you add a cache to a read-heavy system, and what are the gotchas?
Add a cache when reads dominate, the same data is read far more often than it changes, and the database is the bottleneck. The most common pattern is cache-aside (lazy loading): the app reads the cache first; on a miss it reads the database, populates the cache with a TTL, and returns. Writes update the database and invalidate or update the cached entry. Alternatives are read-through/write-through (the cache layer itself loads/writes the DB) and write-back (write cache now, flush to DB async — fast but risks loss).
The gotchas are where seniority shows. Cache invalidation is the hard problem — stale data after a write if you forget to evict. Cache stampede / thundering herd: a hot key expires and thousands of requests hit the DB at once — mitigate with request coalescing, a short lock, or staggered TTLs. Cold start after a flush hammers the DB. And caching is for tolerable-staleness data — never cache something that must be strongly consistent (a bank balance) without care.
What a strong answer coversAdd a cache when reads ≫ writes, data is reused, and the DB is the bottleneck.
Cache-aside: read cache → miss → read DB → populate with TTL; writes invalidate the entry.
Alternatives: read-through/write-through (cache fronts the DB) and write-back (async flush, risks loss).
Invalidation is the hard part — stale reads after a write if you forget to evict.
Guard against stampede (hot key expiry → DB flood): coalescing, locks, staggered TTLs.
Quick self-checkIn the cache-aside pattern, what happens on a cache miss?
-
Correct — the application lazily loads on a miss and stores the result so subsequent reads hit the cache.
-
A miss is normal, not an error; the app should fall through to the source of truth.
-
That's read-through caching; in cache-aside the application does the DB read.
-
A miss is a read path, not a write; you populate the cache, you don't write the DB.
Follow-ups they push on- Why is cache invalidation famously the hard part of caching?
- How do you prevent a cache stampede when a popular key expires?
- What data should you NOT cache, and why?
Red flag Caching without an invalidation/TTL strategy — writes update the DB but leave stale entries in the cache, so users keep reading old data until the entry happens to expire.
source: System Design Primer — Caching ↗