interview prep · 696 questions

What actually gets asked

Real interview questions mapped to every topic in the course — grouped by module and chapter. Tap any card for how to approach it, what a strong answer covers, a quick self-check, the follow-ups, and the trap. Company tags are best-effort and sourced: a company is shown only when a public source names it; everything else reads “Commonly asked.” The ★ must-know set is the high-yield core — widely asked and easy to get wrong.

Module

Company

Level

Type

Focus

Showing 696 of 696 questions

01 Core CS / DSA 90 Q's

1.1 Big-O & complexity reasoning 15

★ must-know Commonly asked junior concept common Why do we drop constants and lower-order terms in Big-O?
Big-O describes asymptotic growth as n grows large, so the dominant term decides how the cost scales; constants and lower-order terms become negligible in that limit. 3n + 5 and n both grow linearly, so both are O(n). The point is to compare how algorithms scale, not to predict exact wall-clock time.
What a strong answer covers
- Big-O measures asymptotic growth as n → ∞, not real wall-clock time.
- The fastest-growing term dominates; constants and lower-order terms vanish in the limit.
- 3n + 5, n, and 500n are all O(n) — the same growth class.
- The goal is comparing how algorithms scale, not benchmarking a specific machine.
Quick self-check
Which expression is NOT O(n)?
Follow-ups they push on
- When do constants actually matter in practice?
- Is an O(n) algorithm always faster than an O(n log n) one?
Red flag Treating O(2n) or O(n + 100) as meaningfully different from O(n), or implying Big-O predicts real runtime rather than scaling.
source: Tech Interview Handbook — Algorithms / Complexity ↗
★ must-know Commonly asked mid concept common What's the time complexity of the naive recursive Fibonacci, and why is it so bad?
Naive fib(n) = fib(n-1) + fib(n-2) is O(2^n) (more precisely O(φ^n)) because each call spawns two more and the recursion tree's size roughly doubles per level, recomputing the same subproblems exponentially many times. Memoizing the results collapses it to O(n) time (each fib(k) computed once), and an iterative version is O(n) time with O(1) space. This is the canonical 'overlapping subproblems ⇒ use DP' example.
What a strong answer covers
- Branching factor 2 with depth n ⇒ ~2^n calls.
- Same subproblems recomputed repeatedly (overlapping subproblems).
- Memoization ⇒ O(n) time; iterative ⇒ O(n) time, O(1) space.
- Recursion tree size, not depth, drives the exponential cost.
Quick self-check
Why is naive recursive Fibonacci exponential while memoized is linear?
Follow-ups they push on
- How much space does the memoized version use?
- Can you compute Fibonacci faster than O(n)?
Red flag Estimating it as O(n) by counting recursion depth instead of the exponential number of nodes in the call tree.
source: Tech Interview Handbook — Dynamic Programming cheatsheet ↗
Commonly asked junior concept occasional An algorithm has two separate phases: one O(n) and one O(n^2). What's the overall complexity, and what if a third phase is O(m)?
Sequential phases add, then you keep the dominant term: O(n) + O(n^2) = O(n^2), because the quadratic term swamps the linear one as n grows. When a phase depends on a different input size m, you cannot fold it into n — the honest answer is O(n^2 + m), keeping both variables because either could dominate depending on the inputs.
What a strong answer covers
- Sequential (non-nested) phases add; nested phases multiply.
- After adding, drop dominated terms: O(n) + O(n²) = O(n²).
- Independent input sizes stay separate: O(n² + m), not O(n²).
- Only collapse m into n if you can prove m ≤ n (or similar).
Follow-ups they push on
- When is it wrong to assume m ≈ n in a graph problem (V vs E)?
- If the phases were nested instead of sequential, what changes?
Red flag Silently assuming a second input variable equals n, or multiplying sequential phases that should be added.
source: Big-O Cheat Sheet ↗
Commonly asked junior trick occasional You sort an array (O(n log n)) and then do a single linear scan. What's the combined complexity, and is the sort 'free' because the scan is O(n)?
Combined it's O(n log n) — the sort dominates the linear scan, so the scan is effectively absorbed, but the sort is certainly not free; it sets the overall complexity. A common mistake is to advertise 'an O(n) two-pointer solution' while quietly sorting first, which makes the real cost O(n log n). Always fold the preprocessing cost into the bound you quote.
What a strong answer covers
- O(n log n) + O(n) = O(n log n) — the larger term wins.
- The sort is the bottleneck, not 'free'.
- A two-pointer pass after a sort is an O(n log n) solution overall.
- Quote the cost including preprocessing, not just the hot loop.
Quick self-check
Sort (O(n log n)) followed by a separate O(n) scan — overall?
Follow-ups they push on
- When would an O(n)-space hash approach beat the sort-then-scan approach?
- If the array were already sorted, what would change?
Red flag Claiming an 'O(n) solution' that secretly sorts the input first — the honest bound is O(n log n).
source: Tech Interview Handbook — Algorithms / Sorting ↗
Commonly asked junior concept common Order these from fastest- to slowest-growing: O(n log n), O(1), O(n!), O(log n), O(n^2), O(n), O(2^n).
O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(2^n) < O(n!). The split that matters most in interviews is polynomial (everything up to O(n^2)) versus exponential/factorial (O(2^n), O(n!)), which become intractable for even modest n. Knowing where a candidate algorithm sits on this ladder is usually the first thing an interviewer wants.
Follow-ups they push on
- Give a concrete algorithm that lands in each class.
Red flag Putting O(n log n) above O(n^2), or thinking O(2^n) and O(n^2) are close because both 'have an n and a power'.
source: Big-O Cheat Sheet ↗
Commonly asked junior concept occasional If 100x more data slows an operation ~100x, what's its complexity? What if it slows ~10,000x?

~100x slowdown for 100x data is linear, O(n). A ~10,000x slowdown is 100^2, i.e. O(n^2) — the quadratic term means scaling the input scales the cost by the square. This back-of-envelope reasoning is exactly how you sanity-check whether a measured slowdown matches your assumed complexity.

Red flag Confusing the input multiplier with the time multiplier, or assuming any slowdown larger than linear must be exponential.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗
Commonly asked junior concept common A loop runs `for (i = 1; i < n; i *= 2)`. What's its time complexity, and why?
It's O(log n). Multiplying i by 2 each iteration means i takes the values 1, 2, 4, 8, …, so the loop body runs about log2(n) times before i reaches n. Any loop where the counter is multiplied or divided by a constant factor (rather than added to) is logarithmic — this is the same shape as binary search.
What a strong answer covers
- Counter multiplied/divided by a constant ⇒ logarithmic, not linear.
- 1, 2, 4, … n means ~log2(n) iterations.
- Contrast with i += 1 (or i += c), which is O(n).
- Nesting this inside an O(n) loop gives O(n log n).
Quick self-check
Time complexity of `for (i = n; i > 1; i /= 2)`?
Follow-ups they push on
- What's the complexity if the inner loop instead did `j *= 3`?
- What if you nest this logarithmic loop inside a `for i in 0..n`?
Red flag Calling it O(n) because it's 'a loop up to n', ignoring that the counter grows geometrically, not by one.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗
Commonly asked mid concept occasional What does Big-O actually bound — Big-O vs Big-Theta vs Big-Omega — and why do people say O(n) when they mean Θ(n)?
Big-O is an upper bound (grows no faster than), Big-Omega (Ω) is a lower bound (grows no slower than), and Big-Theta (Θ) is a tight bound (both at once). Strictly, an O(n) algorithm is also O(n^2) because O is only an upper bound, so the precise claim is usually Θ(n). In interviews people say 'O(n)' loosely to mean the tight bound; it's fine, but knowing the distinction signals rigor.
What a strong answer covers
- O = upper bound, Ω = lower bound, Θ = tight (both).
- An O(n) algorithm is technically also O(n^2) — O doesn't have to be tight.
- Θ(n) is the precise statement people usually intend by 'O(n)'.
- Worst-case Big-O is the common interview default unless stated otherwise.
Quick self-check
Which statement is technically correct for an algorithm that always does exactly n steps?
Follow-ups they push on
- Give an algorithm whose best and worst cases differ in Big-Theta.
- Is saying 'quicksort is O(n^2)' wrong?
Red flag Insisting O must be a tight bound, or conflating worst-case with Big-O (they're independent axes).
source: MIT OCW 6.006 — Asymptotic notation ↗
Commonly asked mid trick common A function loops `i` from 0 to n and, inside, loops `j` from 0 to `i`. Is that O(n^2)?
Yes, it's O(n^2) — even though the inner loop doesn't always run n times. The total iterations are 0 + 1 + 2 + … + (n-1) = n(n-1)/2, which is ~n^2/2; dropping the constant gives O(n^2). The lesson: a triangular nested loop is still quadratic, because half of a square is still proportional to n^2.
What a strong answer covers
- Total work is the arithmetic series 0+1+…+(n-1) = n(n-1)/2.
- That's ~n²/2 ⇒ O(n²) after dropping the constant.
- 'Inner loop shorter each time' does not save an order of magnitude.
- Same total as comparing all unique pairs of n items.
Quick self-check
Total iterations of the inner body across the whole run?
Follow-ups they push on
- How many distinct pairs `(i, j)` with `i < j` exist among n items?
- What if the inner loop ran to `i*i` instead of `i`?
Red flag Claiming it's O(n) or 'O(n²/2)' — the triangular shape is still Θ(n²) and constants are dropped.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗
Commonly asked mid concept common Explain best, average, and worst case. Which one does Big-O usually refer to, and why?
Best/average/worst describe how cost varies across different inputs of the same size. Interviewers usually mean worst-case Big-O because it's the guarantee that holds regardless of input, but average case matters for things like quicksort (avg O(n log n), worst O(n^2)) and hash maps (avg O(1), worst O(n)). Good practice: state worst-case first, then add the expected/average case with the assumption it relies on.
Follow-ups they push on
- What assumption makes a hash-map lookup average O(1)?
- Why might average case be the more honest number for quicksort?
Red flag Quoting average case as if it were a worst-case guarantee, especially for hashing or randomized algorithms.
source: Tech Interview Handbook — Algorithms / Complexity ↗
Commonly asked mid concept common What is amortized complexity? Why is appending to a dynamic array amortized O(1) if a resize is O(n)?
Amortized cost is the average cost per operation across a long sequence, even if individual operations vary. A dynamic array doubles capacity on resize, so a resize costs O(n) but only happens after ~n cheap appends; spreading that O(n) over the n appends gives O(1) per append on average. The doubling (geometric growth) is what makes the total work across n appends O(n), not O(n^2).
Follow-ups they push on
- What breaks if you grow the array by a fixed +1 instead of doubling?
- Is amortized O(1) the same as worst-case O(1)?
Red flag Calling a single resizing append O(1), or claiming linear (+1) growth still gives amortized O(1) appends.
source: InterviewPlus — Understanding Amortized Time Complexity ↗
Commonly asked mid concept occasional What's the time and space complexity of a recursive function that recurses on n/2 and does O(1) work per call?
Halving n each call with O(1) work per level gives O(log n) time (about log2(n) levels). Space is O(log n) too because the call stack holds one frame per level until the base case unwinds — a point candidates often miss when they say O(1) space. This is the binary-search recursion shape.
Follow-ups they push on
- How does an iterative version change the space complexity?
Red flag Reporting O(1) space for a recursive solution by forgetting the call stack costs O(depth).
source: InterviewPrep — Algorithm Complexity Interview Questions ↗
Commonly asked mid trick occasional Two nested loops over the same array of size n look O(n^2). When can nested loops still be O(n)?
Nesting doesn't automatically mean O(n^2) — what matters is total iterations. In a sliding-window or two-pointer pass, the inner pointer advances monotonically and never resets, so across the whole run it moves at most n times total: the two loops combined do O(n) work. Always count how many times the inner body actually runs, not how deeply the loops nest.
Follow-ups they push on
- What's the complexity of the sliding-window longest-substring solution?
Red flag Mechanically multiplying loop depths instead of bounding the total number of inner iterations.
source: GeeksforGeeks — Big-O Notation Interview Questions ↗
Commonly asked senior concept occasional When do constant factors and Big-O 'lie' in practice — i.e. when is a higher-Big-O algorithm actually faster?
Big-O hides constants and cache effects, so for small or medium n a higher-Big-O algorithm with a tiny constant often wins. Classic cases: insertion sort (O(n^2)) beats quicksort on tiny arrays — which is why Timsort/introsort fall back to it; linear scan of a contiguous array can beat a hash map for small n because of cache locality and no hashing overhead; and an O(n log n) algorithm with huge constants can lose to a well-tuned O(n^2) until n is large. The honest senior answer is 'Big-O tells you scaling behavior; profile to know the crossover point for your actual n.'
What a strong answer covers
- Big-O drops constants and ignores cache locality / memory hierarchy.
- Insertion sort beats quicksort for tiny n ⇒ hybrid sorts switch over.
- Contiguous linear scan can beat a hash map for small n (locality, no hashing).
- There's a crossover n; profile rather than assume the lower class always wins.
Follow-ups they push on
- Why does Timsort use insertion sort on small runs?
- How does cache locality favor arrays over linked lists despite equal Big-O?
Red flag Treating Big-O as a real-runtime ranking for all n, ignoring constants, locality, and the crossover point.
source: Tech Interview Handbook — Algorithms / Complexity ↗
Commonly asked senior concept occasional You can solve a problem in O(n) time with O(n) extra space, or O(n log n) time with O(1) space. How do you decide?
It's a time/space trade-off driven by constraints: if memory is tight (embedded, huge inputs, streaming) favour the O(1)-space version; if latency dominates and memory is cheap, take the O(n)-time version. The strong-signal answer names the constraints out loud, states the assumption (e.g. input fits in memory), and asks the interviewer about input size and environment rather than guessing.
Follow-ups they push on
- When would O(n) extra space be a non-starter even if it's faster?
Red flag Optimizing only time and never mentioning the space cost, or picking one without surfacing the trade-off.
source: InterviewPrep — Algorithm Analysis Interview Questions ↗

1.2 Linear structures — when to reach for each 15

★ must-know AmazonMicrosoftBloomberg mid concept common Why is a doubly linked list often paired with a hash map (e.g. in an LRU cache), and what does each part provide?
The hash map gives O(1) lookup from key to its node; the doubly linked list gives O(1) reordering — unlink a node from anywhere and move it to the front/back using its prev/next pointers. Neither alone suffices: a hash map has no order, and a list alone needs O(n) to find a node. Together they back an LRU cache where both get and put (including evicting the least-recently-used entry) are O(1). The doubly-linked part is essential because unlinking an interior node in O(1) requires knowing its predecessor, which only a backward pointer provides.
What a strong answer covers
- Hash map: O(1) find key → node. Doubly linked list: O(1) reorder/evict.
- Map stores pointers to list nodes, not values, so manipulation is direct.
- prev pointer is what makes interior unlink O(1) (a singly list can't).
- Move-to-front on access; evict from the tail (the LRU end).
Quick self-check
In an LRU cache, why a doubly linked list rather than a singly linked one?
Follow-ups they push on
- Why a doubly (not singly) linked list specifically?
- What stale-reference bug appears if you evict from the list but not the map?
Red flag Removing the evicted node from the list but leaving its key in the hash map, leaving a dangling stale reference.
source: LeetCode 146 — LRU Cache (company tags) ↗
Commonly asked junior concept very common Array vs linked list: compare index access, insert/delete, and memory. When would you choose each?
Arrays are contiguous: O(1) random index access but O(n) to insert/delete in the middle (shifting). Linked lists give O(1) insert/delete at a known node or the ends, but O(n) to find or index because you must walk the pointers. Reach for an array when you index a lot and the size is roughly known; reach for a linked list when you constantly add/remove at the front (or splice known nodes) and rarely index.
Follow-ups they push on
- Why is array access cache-friendly but linked-list traversal often isn't?
- What does a dynamic array (ArrayList/vector) change about this comparison?
Red flag Saying linked lists are 'faster for inserts' without the 'at a known position' caveat — finding the position is still O(n).
source: Tech Interview Handbook — Linked List cheatsheet ↗
AmazonMicrosoftAppleMeta junior coding very common Merge two sorted linked lists into one sorted list.
Walk both lists with a dummy head and a tail pointer: at each step append the smaller of the two current nodes and advance that list; when one list runs out, append the remainder of the other. O(n + m) time, O(1) extra space because you splice existing nodes rather than allocate new ones. The dummy node removes the special case for choosing the very first node.
What a strong answer covers
- Dummy head + tail pointer avoids first-node special-casing.
- Splice existing nodes ⇒ O(1) extra space.
- O(n + m) time, one comparison per node.
- Attach the leftover tail wholesale once one list empties.
Follow-ups they push on
- How does this become the merge step of merge sort on a list?
- Extend to merging k sorted lists efficiently.
Red flag Forgetting to attach the remaining nodes of the non-empty list after the loop ends.
source: LeetCode 21 — Merge Two Sorted Lists (company tags) ↗
Commonly asked junior concept occasional What's the difference between a singly and a doubly linked list, and what does the second pointer cost and buy you?
A singly linked list has only a next pointer per node (forward traversal only); a doubly linked list adds a prev pointer, enabling backward traversal and O(1) deletion of a node given only a reference to it. The cost is one extra pointer of memory per node plus more bookkeeping on every insert/delete (you must fix two links, not one). Choose doubly when you need to walk backward or splice out arbitrary nodes cheaply (LRU caches, browser history); choose singly to save memory when forward-only suffices.
What a strong answer covers
- Singly: next only, forward traversal; Doubly: prev + next.
- Doubly enables O(1) delete given just the node and backward walks.
- Cost: extra pointer per node + dual-link maintenance on every edit.
- Use doubly for LRU/history; singly when forward-only and memory-tight.
Quick self-check
What does the `prev` pointer in a doubly linked list primarily buy you?
Follow-ups they push on
- Can you delete a known node in O(1) in a singly linked list (with a trick)?
- Why do LRU caches specifically need the doubly-linked variant?
Red flag Updating only `next` (or only `prev`) on insert/delete and corrupting one direction of the list.
source: GeeksforGeeks — Doubly Linked List ↗
Commonly asked junior concept common Explain a stack vs a queue vs a deque in one sentence each, and give a real use for each.
A stack is LIFO — last in, first out — used for undo, call stacks, DFS, and expression parsing. A queue is FIFO — first in, first out — used for task/work queues and BFS. A deque is double-ended, O(1) push/pop at both ends, used when you need front-and-back access (and as a faster substitute for inserting at index 0 of an array).
Follow-ups they push on
- Which would you use for BFS, and why not the other?
Red flag Mixing up which end stays open, or claiming a stack is good for FIFO ordering.
source: Tech Interview Handbook — Stack cheatsheet ↗
AmazonMicrosoftAppleMeta junior coding very common Reverse a singly linked list.
Iterate with three pointers — prev, curr, next — and on each step reverse the link (curr.next = prev) then advance all three; return prev at the end. This is O(n) time, O(1) space. The recursive version is O(n) space due to the call stack, so mention the iterative one first.
Follow-ups they push on
- Now reverse only nodes between positions m and n.
- Reverse the list in groups of k.
Red flag Losing the rest of the list by overwriting `curr.next` before saving `next`.
source: LeetCode 206 — Reverse Linked List (company tags) ↗
AmazonMetaGoogleMicrosoftBloomberg junior coding very common Determine if a string of brackets ()[]{} is validly matched.
Push each opening bracket onto a stack; on a closing bracket, pop and check it matches the expected opener, failing fast on mismatch or empty stack. At the end the string is valid only if the stack is empty. O(n) time, O(n) space — the classic motivating example for why stacks exist.
Follow-ups they push on
- Handle the longest valid-parentheses substring.
- What if other characters are interleaved with the brackets?
Red flag Forgetting to check the stack is empty at the end, so unmatched openers like `(((` are wrongly accepted.
source: LeetCode 20 — Valid Parentheses (company tags) ↗
AmazonMicrosoftGoogle junior coding common Find the middle node of a singly linked list in one pass.
Use the fast/slow pointer trick: advance slow by one and fast by two each step; when fast reaches the end, slow sits at the middle. It's O(n) time, O(1) space, and finishes in a single pass — no need to first count the length and then walk halfway. For an even-length list, decide up front whether you return the first or second middle (the fast/fast.next loop condition controls this).
What a strong answer covers
- Two pointers, speeds 1 and 2 ⇒ slow lands at the middle in one pass.
- O(n) time, O(1) space; no length precomputation.
- Even length: loop condition picks first vs second middle.
- Same tortoise/hare machinery as cycle detection.
Follow-ups they push on
- For even length, how do you choose between the two middles?
- How does this generalize to finding the node n/k of the way through?
Red flag Looping while `fast != null` instead of checking `fast && fast.next`, dereferencing null on even-length lists.
source: LeetCode 876 — Middle of the Linked List (company tags) ↗
AmazonMetaMicrosoftGoogle mid coding common Remove the nth node from the end of a singly linked list in one pass.
Advance a fast pointer n nodes ahead, then move fast and slow together until fast hits the end — now slow is just before the node to remove, so you splice it out. Use a dummy head in front of the real head so removing the first node needs no special case. One pass, O(n) time, O(1) space.
What a strong answer covers
- Gap of n between fast and slow locates the target in one pass.
- Dummy node before head removes the edge case of deleting the head.
- O(n) time, O(1) space.
- Stop fast at the last node so slow lands on the predecessor.
Follow-ups they push on
- Why does the dummy node matter when n equals the list length?
- How would you do it in two passes, and why prefer one?
Red flag Skipping the dummy node and crashing (or returning the wrong head) when the node to remove is the head itself.
source: LeetCode 19 — Remove Nth Node From End of List (company tags) ↗
Commonly asked mid concept occasional How does a circular buffer (ring buffer) work, and where is it the right choice?
A ring buffer is a fixed-size array with head and tail indices that wrap around using modulo; you enqueue at tail and dequeue at head, both O(1), reusing slots instead of shifting. It's ideal for bounded producer/consumer streams — audio/IO buffering, recent-event logs, fixed-window rate limiting — where memory must be capped and old data can be overwritten. The classic subtlety is distinguishing full from empty when head == tail (track a size/count or leave one slot unused).
What a strong answer covers
- Fixed array + wrapping head/tail via modulo ⇒ O(1) enqueue/dequeue.
- No element shifting and no dynamic allocation after setup.
- Best for bounded streaming buffers (audio, logs, IO).
- full-vs-empty ambiguity at head == tail needs a count or a sacrificed slot.
Follow-ups they push on
- How do you tell a full buffer from an empty one?
- What happens to the oldest data when the buffer is full and you write?
Red flag Failing to disambiguate full from empty (both have head == tail), corrupting reads/writes.
source: GeeksforGeeks — Circular Queue ↗
AmazonGoogleMeta mid coding common Given daily temperatures, for each day return how many days until a warmer one (a monotonic-stack problem).
Use a monotonic decreasing stack of indices: scan left to right, and while the current temperature exceeds the temperature at the stack's top index, pop it and record the gap (current index − popped index) as its answer. Push the current index. Each index is pushed and popped at most once, so it's O(n) time, O(n) space — far better than the O(n^2) double loop. The stack pattern answers 'next greater element' style questions generally.
What a strong answer covers
- Stack holds indices awaiting a warmer day, kept decreasing by temperature.
- Pop and resolve each index when a warmer day arrives.
- Each index pushed/popped once ⇒ O(n) time.
- Generalizes to 'next greater/smaller element' problems.
Follow-ups they push on
- How does this generalize to 'next greater element' on a circular array?
- Why is the amortized cost O(n) despite the inner while-loop?
Red flag Storing temperatures instead of indices on the stack, losing the distance needed for the answer.
source: LeetCode 739 — Daily Temperatures (company tags) ↗
AmazonMicrosoftBloomberg mid coding common Detect whether a singly linked list has a cycle, using O(1) extra space.
Use Floyd's tortoise-and-hare: a slow pointer moves one step and a fast pointer two steps; if they ever meet there's a cycle, and if fast reaches null there isn't. O(n) time, O(1) space. A hash set of visited nodes also works but costs O(n) space, so lead with Floyd's.
Follow-ups they push on
- Return the node where the cycle begins.
- How do you find the cycle's length?
Red flag Advancing the fast pointer without null-checking both `fast` and `fast.next`, causing a crash on even-length lists.
source: LeetCode 141 — Linked List Cycle (company tags) ↗
AmazonBloombergGoogle mid coding common Design a stack that returns its minimum element in O(1) alongside push/pop/top.
Keep a second 'min stack' that records the running minimum in parallel with the main stack; on push you store min(value, currentMin), and on pop you pop both. Every operation stays O(1) time and the structure uses O(n) extra space. The key idea is that each level remembers the min as of when it was pushed, so popping restores the previous min for free.
Follow-ups they push on
- Reduce the extra space when many pushed values repeat.
Red flag Storing only a single min variable, which can't recover the previous minimum after the current min is popped.
source: LeetCode 155 — Min Stack (company tags) ↗
Commonly asked mid concept occasional Why is inserting at the front of a dynamic array O(n), and what should you use instead?
Inserting at index 0 forces every existing element to shift one slot right, which is O(n) per insert. If you frequently add/remove at the front, use a deque (or a linked list), which gives O(1) push/pop at both ends. This is a common hidden-quadratic bug: building a result by repeatedly inserting at the front of an array turns an O(n) loop into O(n^2).
Follow-ups they push on
- When is appending to the end of a dynamic array still cheap?
Red flag Reaching for `arr.unshift(...)`/insert-at-0 in a loop and not noticing it makes the whole loop quadratic.
source: MDN — JavaScript Array ↗
AmazonMicrosoftGoogle mid coding common Implement a FIFO queue using two LIFO stacks.
Keep an in stack for pushes and an out stack for pops; when out is empty, pour everything from in into out, which reverses the order and exposes the oldest element. Each element is moved at most once between stacks, so dequeue is amortized O(1) even though a single transfer is O(n). This is a clean test of whether a candidate understands LIFO-vs-FIFO and amortized cost.
Follow-ups they push on
- What's the worst-case (not amortized) cost of a single pop?
Red flag Transferring on every dequeue instead of only when `out` is empty, which makes it O(n) per op.
source: LeetCode 232 — Implement Queue using Stacks (company tags) ↗

1.3 Hashing structures 15

★ must-know AmazonGoogleMetaBloomberg mid coding common Find the length of the longest consecutive sequence of integers in an unsorted array, in O(n).
Put every number in a hash set for O(1) membership, then for each number x start counting a run only if x - 1 is absent (so x is a sequence start); from such a start, extend x, x+1, x+2, … while present and track the longest. Starting only at run-beginnings means each number is visited O(1) times overall, giving O(n) time, O(n) space — beating the O(n log n) sort-then-scan.
What a strong answer covers
- Set membership gives O(1) 'is this number present?' checks.
- Only begin counting where x-1 is missing (a run start).
- That guard bounds total work to O(n), not O(n²).
- Beats sorting (O(n log n)) by trading time for O(n) space.
Quick self-check
Why is the algorithm O(n) and not O(n²) despite the inner while-loop?
Follow-ups they push on
- Why does the 'only start where x-1 is absent' check keep it O(n)?
- What if duplicates are present in the input?
Red flag Extending a run from every element (O(n^2)) instead of only from numbers that begin a run.
source: LeetCode 128 — Longest Consecutive Sequence (company tags) ↗
★ must-know Commonly asked mid concept common Why must objects used as hash-map keys be effectively immutable, and what is the equals/hashCode contract?
A hash map places a key in a bucket derived from its hash; if you mutate a key after insertion so its hash changes, the entry is now in the 'wrong' bucket and lookups silently fail to find it. The equals/hashCode contract is the rule that ties them together: if two objects are equal they must have the same hash code, and equal objects must stay equal — so keys should be immutable (or at least their hash-relevant fields must be). Override hashCode whenever you override equals, or hash-based collections break.
What a strong answer covers
- Bucket is chosen from the key's hash at insert time.
- Mutating a key's hash-relevant fields strands the entry in the wrong bucket.
- Contract: equal objects ⇒ equal hash codes (not vice-versa).
- Override equals ⇒ you must override hashCode too.
Quick self-check
You override `equals` to compare two fields but leave the default `hashCode`. What breaks?
Follow-ups they push on
- What goes wrong if hashCode is constant for all keys?
- Why is using a mutable list as a key dangerous?
Red flag Overriding `equals` but not `hashCode` (or mutating a key in place), so lookups for present keys return nothing.
source: Oracle Java SE — Object.hashCode() contract ↗
Commonly asked junior concept very common How does a hash map achieve average O(1) lookup, and why is the worst case O(n)?
A hash function maps a key to a bucket index, so with a good hash and a reasonable load factor most buckets hold ~1 entry and lookup is average O(1). The worst case is O(n) when many keys collide into the same bucket (bad hash, adversarial keys, or everything hashing the same), degrading a bucket into a linear scan. The O(1) is therefore an expected/average bound, not a guarantee.
Follow-ups they push on
- What makes a hash function 'good'?
- How can an attacker force the worst case (hash flooding)?
Red flag Stating O(1) as a hard worst-case guarantee instead of an average/expected one.
source: Hirist — Top HashMap Interview Questions ↗
Commonly asked junior concept occasional When should you use a hash set vs a hash map?
A hash set stores keys only and answers 'have I seen this?' — use it for membership, dedup, and presence checks. A hash map stores key→value associations — use it when you also need data attached to each key (counts, indices, last-seen position). Both give average O(1) ops; a set is essentially a map whose values you don't care about. Reach for the map the moment you need to remember *something about* each key, not just *that* you saw it.
What a strong answer covers
- Set: membership / dedup / 'seen?' — keys only.
- Map: key → value — counts, indices, metadata per key.
- Both average O(1); a set is a valueless map.
- Two Sum needs a map (value→index); 'contains duplicate' needs only a set.
Quick self-check
You must return the index of a matching earlier element. Set or map?
Follow-ups they push on
- Which would you use for Two Sum, and why not the other?
- Which for 'does this array contain any duplicate'?
Red flag Using a set when you later need the associated value (e.g. an index), forcing an awkward rework.
source: AlgoArk — Hash Map Patterns for Interviews ↗
AmazonAppleGoogle junior coding common Determine whether an array contains any duplicate values.
Walk the array once, inserting each value into a hash set; if a value is already present, return true immediately, otherwise return false at the end. O(n) time, O(n) space. Alternatively sort first and check adjacent equal pairs for O(n log n) time and O(1) extra space — a clean time/space trade-off to mention.
What a strong answer covers
- Hash set: insert each, return true on the first repeat.
- O(n) time, O(n) space.
- Sort-and-scan alternative: O(n log n) time, O(1) extra space.
- Early exit on first duplicate; no need to finish the scan.
Follow-ups they push on
- What if duplicates only count when within k indices of each other?
- How would you do it with O(1) extra space?
Red flag Comparing all pairs with a double loop (O(n^2)) when a single hash-set pass is O(n).
source: LeetCode 217 — Contains Duplicate (company tags) ↗
AmazonGoogleMetaAppleMicrosoftBloomberg junior coding very common Given an array and a target, return indices of two numbers that sum to the target.
Walk the array once, and for each value x check a hash map for target - x; if present you've found the pair, otherwise store x -> index and continue. This is O(n) time, O(n) space — the canonical 'use a hash map to remember what you've seen' problem. The brute-force double loop is O(n^2); the hash map trades space for that speedup.
Follow-ups they push on
- What changes if the array is already sorted?
- How would you return all unique pairs (3Sum-style)?
Red flag Matching an element with itself by checking the map before inserting the current element incorrectly.
source: LeetCode 1 — Two Sum (company tags) ↗
AmazonBloombergAppleMicrosoftGoogle junior coding common Find the first non-repeating character in a string and return its index.
Make one pass to build a hash map (or 26-length array) of character counts, then a second pass over the string returning the index of the first character whose count is 1; return -1 if none. Two linear passes, O(n) time, O(1) space if the alphabet is fixed (at most 26/128 entries). The second pass must walk the original string order, not the map, because a map has no positional order.
What a strong answer covers
- Pass 1: count frequencies in a map/array.
- Pass 2: scan the string in order, return first index with count 1.
- O(n) time; O(1) space for a fixed alphabet.
- Iterate the string (ordered), not the map (unordered), in pass 2.
Follow-ups they push on
- Why can't you find the answer by iterating the hash map directly?
- How would you support a streaming version where characters arrive over time?
Red flag Iterating the map instead of the string in pass 2 and returning a non-first unique character because maps lack order.
source: LeetCode 387 — First Unique Character in a String (company tags) ↗
Commonly asked mid concept occasional Why does iterating a hash map give no guaranteed order, and what should you use if you need ordering?
Entries are placed by hash value into buckets, so iteration order reflects the internal bucket layout — which changes with the hash function, capacity, and resizes — not insertion or sort order. If you need a stable order, use an insertion-ordered map (Java's LinkedHashMap, Python's dict since 3.7) for insertion order, or a tree/sorted map (TreeMap, C++ std::map) for key-sorted order at O(log n) per op. Never rely on a plain hash map's iteration order; it's an implementation detail that can differ across runs or versions.
What a strong answer covers
- Iteration order follows bucket layout, not insertion or sort order.
- Order can change after a resize/rehash or across language versions.
- Need insertion order ⇒ LinkedHashMap / Python dict.
- Need sorted order ⇒ TreeMap / std::map (O(log n) ops).
Quick self-check
You need keys returned in sorted order on every iteration. Which structure?
Follow-ups they push on
- Python dicts preserve insertion order since 3.7 — is that the same as 'sorted'?
- What ordering does a TreeMap give, and at what cost?
Red flag Depending on a plain hash map's iteration order in tests or logic, then breaking when it changes.
source: AlgoArk — Hash Map Patterns for Interviews ↗
Commonly asked mid concept common Compare separate chaining and open addressing for collision handling.
Separate chaining stores colliding keys in a per-bucket list (or tree, as Java 8+ does past a threshold), so it tolerates high load factors but pays pointer/indirection overhead. Open addressing keeps everything in the array and probes for the next free slot (linear/quadratic probing, double hashing); it's cache-friendlier but degrades sharply as load factor approaches 1 and complicates deletion. The choice trades memory locality against sensitivity to load factor.
Follow-ups they push on
- Why does deletion need tombstones in open addressing?
- Why does Java convert long chains into trees?
Red flag Describing chaining and open addressing as interchangeable without noting their load-factor and deletion behaviour differs.
source: GetSDEReady — HashMap & HashSet Interview Questions ↗
Commonly asked mid concept common What is a load factor, and what happens when it's exceeded?
Load factor is entries divided by buckets — a measure of how full the table is (Java's HashMap defaults to 0.75). When it's exceeded the table resizes: capacity roughly doubles and every key is rehashed into the larger array, an O(n) operation that happens rarely, keeping amortized insert O(1). A higher load factor saves memory but raises collision rates and slows lookups; a lower one wastes space.
Follow-ups they push on
- Why double the capacity rather than grow by a constant?
Red flag Thinking each insert that crosses the threshold is cheap, or that resize never happens.
source: Hirist — Top HashMap Interview Questions ↗
AmazonMetaGoogleUber mid coding common Group a list of strings into anagrams.
Use a hash map keyed by a canonical form of each word and collect words sharing a key. The canonical key is either the sorted characters (O(k log k) per word) or a 26-length character-count signature (O(k) per word); the latter is faster for long strings. Total time is about O(n*k), space O(n*k). The trick the interviewer is probing is choosing a good collision-free key.
Follow-ups they push on
- Which key is better when words are long, and why?
Red flag Comparing every pair of words for the anagram relation (O(n^2 * k)) instead of bucketing by a canonical key.
source: LeetCode 49 — Group Anagrams (company tags) ↗
MetaAmazonGoogle mid coding common Count the number of contiguous subarrays whose sum equals k.
Track a running prefix sum and a hash map of how many times each prefix sum has occurred; at each index, the count of subarrays ending here equals the number of earlier prefix sums equal to prefixSum - k. Seed the map with {0: 1} to count subarrays starting at index 0. This is O(n) time, O(n) space, versus the O(n^2) brute force.
Follow-ups they push on
- Why must the map be seeded with prefix sum 0?
Red flag Forgetting the `{0:1}` seed, which drops every subarray that starts at index 0.
source: LeetCode 560 — Subarray Sum Equals K (company tags) ↗
Commonly asked mid concept common When is a hash map the wrong data structure? What do you reach for instead?
A hash map gives no ordering, so it's wrong when you need sorted iteration, the min/max, or range queries ('all keys between A and B'). For those, use an ordered/tree-based map (red-black tree, like Java's TreeMap or C++ std::map) giving O(log n) ordered operations, or a heap when you only need the extreme. Hash maps shine for pure key lookup, dedup, and frequency counting.
Follow-ups they push on
- What does a TreeMap give you that a HashMap can't?
Red flag Defaulting to a hash map for problems that need ordering or range scans and then bolting on a sort every query.
source: AlgoArk — Hash Map Patterns for Interviews ↗
Commonly asked senior concept occasional What makes a good hash function, and what is 'hash flooding' (algorithmic complexity attack)?
A good hash function distributes keys uniformly across buckets, is fast to compute, and is deterministic — minimizing collisions so buckets stay ~O(1). Hash flooding is a denial-of-service attack where an adversary crafts many keys that all hash to the same bucket, collapsing every lookup/insert to O(n) and the whole table to O(n^2) work — historically used to DoS web servers via crafted POST/query parameters. Defenses include per-process randomized/seeded hashing (SipHash) so an attacker can't predict the bucket, and converting long collision chains into balanced trees (Java 8+ does this).
What a strong answer covers
- Good hash: uniform, fast, deterministic ⇒ low collision rate.
- Hash flooding forces worst-case collisions ⇒ O(n) ops, O(n²) total (DoS).
- Defense 1: seeded/randomized hashing (e.g. SipHash) hides the mapping.
- Defense 2: treeify long chains (O(n) → O(log n) within a bucket).
Follow-ups they push on
- Why does a per-process random seed defeat the attack?
- How does treeifying long buckets bound the worst case?
Red flag Assuming worst-case collisions only happen by chance and ignoring that they can be deliberately induced.
source: GeeksforGeeks — Hash Functions and Hashing ↗
MetaAmazonGoogle senior coding occasional Design a structure with insert, delete, and getRandom all in average O(1).
Combine a dynamic array (for O(1) random access by index) with a hash map from value to its index in the array. Insert appends and records the index; delete swaps the target with the last element, pops the tail, and fixes the moved element's index; getRandom picks a random array index. The swap-with-last trick is what keeps delete O(1) instead of O(n).
Follow-ups they push on
- How do you support duplicate values?
Red flag Deleting by shifting the array (O(n)) instead of swapping the victim with the last element.
source: LeetCode 380 — Insert Delete GetRandom O(1) (company tags) ↗

1.4 Trees 15

★ must-know AmazonMetaMicrosoftGoogleLinkedIn mid coding very common Find the lowest common ancestor (LCA) of two nodes in a binary tree.
Recurse: if the current node is null or equals either target, return it; otherwise recurse left and right. If both sides return non-null, the current node is the LCA (the targets split here); if only one side does, propagate that side up. O(n) time, O(h) stack space. If it's specifically a BST, you can do better: walk down, going left when both targets are smaller and right when both are larger — the first node that splits them is the LCA, O(h) time.
What a strong answer covers
- General tree: both subtrees return non-null ⇒ this node is the LCA.
- Return the non-null side upward when only one target is found below.
- O(n) time, O(h) stack for the general case.
- BST shortcut: descend by comparing values, first split node is the LCA.
Follow-ups they push on
- How does the BST version beat the general O(n) approach?
- What changes if each node also stores a parent pointer?
Red flag Assuming both targets actually exist in the tree, or applying the BST descent on a non-BST.
source: LeetCode 236 — Lowest Common Ancestor of a Binary Tree (company tags) ↗
Commonly asked junior concept common What property defines a binary search tree, and what are its operation costs when balanced vs degenerate?
In a BST every node's left subtree holds only smaller keys and its right subtree only larger keys, so an in-order traversal yields sorted order. Search/insert/delete are O(log n) when the tree is balanced (height ~log n) but degrade to O(n) when it degenerates into a linked-list shape (e.g. inserting already-sorted data). That fragility is exactly why self-balancing variants exist.
Follow-ups they push on
- What insertion order produces a degenerate BST?
- How do you validate that a tree is a proper BST?
Red flag Claiming a BST is always O(log n) without the 'when balanced' qualifier.
source: GeeksforGeeks — Self-Balancing Binary Search Trees ↗
AmazonMicrosoftMetaBloomberg junior coding common Return the level-order traversal of a binary tree (values grouped by level).
Run BFS with a queue: at each step record the current queue size (that's one full level), then dequeue exactly that many nodes, collect their values into a level list, and enqueue their children. Repeat until the queue empties. O(n) time, O(width) space. Snapshotting the queue size per round is the trick that cleanly separates one level from the next.
What a strong answer covers
- BFS with a queue; snapshot the level size each round.
- Process exactly that many nodes to isolate one level.
- Enqueue children as you go for the next level.
- O(n) time, O(max width) space.
Follow-ups they push on
- Produce a zigzag (alternating left-right) level order.
- Return only the rightmost node of each level (right side view).
Red flag Not capturing the level size before the loop, so children enqueued mid-level bleed into the current level.
source: LeetCode 102 — Binary Tree Level Order Traversal (company tags) ↗
Commonly asked junior concept common What is a heap / priority queue, and what are the costs of peek, insert, and extract?
A binary heap is a complete tree (stored in an array) maintaining the heap property — each parent is <= (min-heap) or >= (max-heap) its children — so the extreme element sits at the root. Peek-min/max is O(1); insert and extract are O(log n) because you sift up/down one level at a time. It's the go-to for top-K, scheduling, Dijkstra, and merging K sorted streams.
Follow-ups they push on
- Why is building a heap from n items O(n) and not O(n log n)?
Red flag Confusing a heap with a BST, or thinking it keeps all elements fully sorted (it only orders the root).
source: CodeJeet — Heap / Priority Queue Interview Questions ↗
Commonly asked junior concept common Compare the four binary-tree traversals (preorder, inorder, postorder, level-order) and say when you'd use each.
Preorder (node, left, right) visits the root first — good for copying/serializing a tree. Inorder (left, node, right) yields sorted order in a BST — good for validation and producing ordered output. Postorder (left, right, node) visits children before the parent — good for deletion and bottom-up aggregates like subtree sums/heights. Level-order is BFS with a queue, processing tier by tier — good for shortest-depth and 'by level' problems. The first three are DFS (recursion or stack); level-order is BFS (queue).
What a strong answer covers
- Preorder: serialize/clone (root before children).
- Inorder: BST ⇒ sorted output; used for validation.
- Postorder: delete / bottom-up subtree aggregates.
- Level-order: BFS via queue; depth and per-level problems.
Quick self-check
Which traversal of a valid BST produces the keys in ascending sorted order?
Follow-ups they push on
- Which traversal reconstructs a BST's sorted sequence?
- Why is postorder natural for freeing/deleting a tree?
Red flag Mixing up the visit positions, or using a stack for level-order instead of a queue (that's DFS, not BFS).
source: GeeksforGeeks — Tree Traversals (Inorder, Preorder, Postorder) ↗
AmazonGoogleMetaMicrosoft mid coding common Compute the diameter of a binary tree (longest path between any two nodes).
Do a single postorder DFS that returns each node's height while updating a global max: at each node, the longest path *through* it is leftHeight + rightHeight (in edges), so track the maximum of that across all nodes and return 1 + max(leftHeight, rightHeight) to the parent. O(n) time, O(h) stack. The key insight is that the answer is a path that bends at some node, computed from its two subtree depths.
What a strong answer covers
- Postorder DFS returns height; a side variable tracks the best diameter.
- Path through a node = leftHeight + rightHeight (edge count).
- Return 1 + max(left, right) upward as the node's height.
- O(n) time, O(h) stack — one traversal, not one per node.
Follow-ups they push on
- Why compute height and diameter in the same pass instead of two?
- Should the diameter be measured in nodes or edges (be consistent)?
Red flag Recomputing height separately at every node (O(n^2)) instead of folding it into one postorder pass.
source: LeetCode 543 — Diameter of Binary Tree (company tags) ↗
Commonly asked mid concept occasional What advantage does a trie have over a hash map for storing strings, and what's the catch?
A trie answers prefix queries — 'all words starting with "pre"', autocomplete, longest-prefix matching — which a hash map cannot do without scanning every key, and it shares storage for common prefixes. Lookups are O(m) in the word length, independent of how many words are stored. The catch is memory: each node carries a child map/array (up to alphabet size), so a sparse trie can use far more memory than a hash set of the same words, and it's only worthwhile when prefix operations matter.
What a strong answer covers
- Trie supports prefix / autocomplete queries; a hash map can't, cheaply.
- Lookup is O(m) in word length, not in the number of stored words.
- Common prefixes are shared, but each node holds child links.
- Catch: high memory overhead; use only when prefixes matter.
Quick self-check
What can a trie do that a hash map of the same words fundamentally cannot do efficiently?
Follow-ups they push on
- How would you compress a sparse trie (radix/Patricia trie)?
- When is a plain hash set strictly better than a trie?
Red flag Reaching for a trie when only exact-match lookup is needed — a hash set is simpler and lighter there.
source: GeeksforGeeks — Trie Data Structure ↗
AmazonGoogleMicrosoftUber mid coding common Find the kth smallest element in a binary search tree.
Do an inorder traversal (which visits BST keys in ascending order) and stop at the kth visited node — you don't need to traverse the whole tree. An iterative inorder with an explicit stack lets you halt early at O(h + k) time. If the tree is queried for many different k values, augment each node with its left-subtree size so each query becomes O(h) by navigating directly.
What a strong answer covers
- Inorder visits BST keys ascending ⇒ the kth visited is the answer.
- Stop early at the kth node; no full traversal needed.
- Iterative stack-based inorder ⇒ O(h + k) time.
- For repeated queries, store subtree sizes ⇒ O(h) per query.
Follow-ups they push on
- How do subtree-size augmentations speed up many repeated queries?
- How would you find the kth largest instead?
Red flag Collecting the entire inorder list and indexing (O(n)) instead of stopping at the kth element.
source: LeetCode 230 — Kth Smallest Element in a BST (company tags) ↗
AmazonMetaMicrosoftBloomberg mid coding common Validate that a binary tree is a valid binary search tree.
Recurse with a valid (min, max) range for each node: the root is unbounded, the left child tightens the max to the parent's value and the right child tightens the min. A node fails if its value violates its range. O(n) time, O(h) stack space. Equivalently, an in-order traversal of a valid BST is strictly increasing, so you can check that the previous visited value is always smaller.
Follow-ups they push on
- Why isn't it enough to just compare each node to its two children?
Red flag Only comparing a node against its immediate children, which misses violations deeper in a subtree.
source: LeetCode 98 — Validate Binary Search Tree (company tags) ↗
Commonly asked mid concept occasional What is a self-balancing tree (AVL / red-black), and where are they used in real systems?
Self-balancing BSTs perform rotations on insert/delete to keep height O(log n), guaranteeing O(log n) operations regardless of input order. AVL trees keep height balance tighter (faster lookups, more rotations); red-black trees balance more loosely (fewer rotations, faster writes). They back ordered maps/sets such as Java's TreeMap and C++ std::map, and red-black trees appear in the Linux process scheduler.
Follow-ups they push on
- When would you prefer AVL's stricter balance over red-black?
Red flag Treating AVL and red-black as identical, or not knowing they guarantee O(log n) by construction.
source: AlgoCademy — Introduction to Self-Balancing BSTs ↗
AmazonGoogleMicrosoftMeta mid coding common Implement a trie (prefix tree) supporting insert, search, and startsWith.
Each node holds a map/array of child links and an isEnd flag; insert walks/creates a path of nodes one character at a time, search walks the path and checks isEnd, and startsWith walks the path without requiring isEnd. All three are O(m) for a word of length m, independent of how many words are stored. Tries shine for autocomplete, spellcheck, and prefix-heavy lookups where a hash map can't answer prefix queries.
Follow-ups they push on
- Add wildcard '.' matching.
- How would you support delete?
Red flag Conflating 'a word ends here' (`isEnd`) with 'a prefix exists here', which breaks exact-word search.
source: LeetCode 208 — Implement Trie (company tags) ↗
AmazonMetaGoogleMicrosoft mid coding common Find the kth largest element in an unsorted array.
Maintain a min-heap of size k: push each element, and whenever the heap exceeds k pop the smallest, so the heap ends holding the k largest with the kth largest at its root. That's O(n log k) time, O(k) space. Quickselect gives average O(n) by partitioning around a pivot and recursing into only the relevant side, with O(n^2) worst case — mention both and the trade-off.
Follow-ups they push on
- When is quickselect's O(n) average worth its O(n^2) worst case?
Red flag Sorting the whole array (O(n log n)) and indexing, or using a max-heap of size n when a size-k min-heap suffices.
source: LeetCode 215 — Kth Largest Element in an Array (company tags) ↗
Commonly asked senior concept occasional Why is building a heap from n elements O(n) and not O(n log n)? And how do you do an in-place heapsort?
Bottom-up heapify (sift-down from the last internal node up to the root) is O(n), not O(n log n), because most nodes sit near the leaves and sift down only a tiny distance — summing the work weighted by height gives a convergent series bounded by O(n). (Inserting one-by-one with sift-up is the O(n log n) way.) Heapsort then builds a max-heap in place, repeatedly swaps the root (the max) with the last unsorted element and sifts down the reduced heap — O(n log n) time, O(1) extra space, but not stable.
What a strong answer covers
- Bottom-up heapify is O(n): most nodes are shallow, work sums to O(n).
- Repeated sift-up inserts would be O(n log n) — the slower build.
- Heapsort: build max-heap, swap root to the end, shrink, sift down.
- Heapsort is O(n log n), O(1) space, not stable.
Quick self-check
Why is bottom-up heap construction O(n) rather than O(n log n)?
Follow-ups they push on
- Why is sift-down-from-the-bottom cheaper than n separate insertions?
- Why isn't heapsort stable, and when does that matter?
Red flag Claiming heap construction is always O(n log n), conflating the build phase with n individual insertions.
source: GeeksforGeeks — Time Complexity of Building a Heap ↗
Commonly asked senior concept occasional Why do relational databases use B-trees / B+ trees for indexes instead of a binary search tree?
B-trees are shallow and high-fanout — each node holds many keys, so the tree stays only a few levels deep even for millions of rows, which minimizes expensive disk seeks (disk I/O, not comparisons, is the bottleneck). A binary tree would be far taller and cost many more page reads. In a B+ tree all values live in the leaves, which are linked together, so range scans and ORDER BY can sweep the leaves sequentially without re-walking the tree.
Follow-ups they push on
- Why does fanout matter more than tree height in comparisons?
- How does the linked leaf layer of a B+ tree help range queries?
Red flag Justifying B-trees by comparison count rather than by minimizing disk page reads.
source: Use The Index, Luke — Anatomy of an Index (B-tree) ↗
AmazonMetaGoogleMicrosoft senior coding common Merge k sorted linked lists into one sorted list.
Push the head of each list into a min-heap keyed by node value; repeatedly pop the smallest, append it to the result, and push that node's successor. Each of the n total nodes is pushed/popped once at O(log k) cost, giving O(n log k) time and O(k) heap space. Divide-and-conquer pairwise merging hits the same O(n log k) without a heap.
Follow-ups they push on
- Compare the heap approach with pairwise divide-and-conquer merging.
Red flag Concatenating all lists and sorting (O(n log n)) instead of exploiting that each list is already sorted.
source: LeetCode 23 — Merge k Sorted Lists (company tags) ↗

1.5 Graphs 15

★ must-know Commonly asked mid concept common What is union-find (disjoint set union), and what do union by rank and path compression buy you?
Union-find tracks elements partitioned into disjoint sets via a parent-pointer forest, supporting find (which set/root an element belongs to) and union (merge two sets). Path compression flattens the tree by pointing visited nodes straight at the root during find, and union by rank/size always attaches the smaller tree under the larger; together they make each operation nearly O(1) — amortized O(α(n)), the inverse-Ackermann function, effectively constant. It's the tool for dynamic connectivity, counting connected components, cycle detection in undirected graphs, and Kruskal's MST.
What a strong answer covers
- Forest of parent pointers; find returns the set root, union merges.
- Path compression: repoint nodes to the root during find.
- Union by rank/size: attach smaller tree under larger.
- Together ⇒ amortized O(α(n)) ≈ constant per operation.
Quick self-check
With both path compression and union by rank, the amortized cost per operation is:
Follow-ups they push on
- Why is union-find better than BFS/DFS for *dynamic* connectivity queries?
- How does Kruskal's algorithm use union-find?
Red flag Implementing find/union without either optimization, degrading to O(n) per op on adversarial unions.
source: GeeksforGeeks — Disjoint Set (Union-Find) with Rank & Path Compression ↗
Commonly asked junior concept common Adjacency list vs adjacency matrix: compare space and edge-lookup cost, and say when to use each.
An adjacency list stores each node's neighbours, using O(V + E) space — efficient for sparse graphs, which is most real-world graphs. An adjacency matrix is a V x V grid giving O(1) edge-existence checks but O(V^2) space regardless of edge count, so it only pays off for dense graphs or when you constantly test specific edges. Default to the list unless the graph is dense.
Follow-ups they push on
- Which representation makes 'is there an edge u-v?' fastest?
Red flag Using a matrix for a large sparse graph and wasting O(V^2) memory on mostly-empty cells.
source: Tech Interview Handbook — Graph cheatsheet ↗
Commonly asked junior concept very common BFS vs DFS: how do they differ, and when do you pick each?
BFS explores level by level using a queue and finds the shortest path in an unweighted graph (fewest edges); DFS dives deep along one branch using recursion or an explicit stack and suits connectivity, cycle detection, and topological sort. BFS uses O(width) memory, DFS uses O(depth). Choose BFS when you need shortest hops or level order; choose DFS when you need to fully explore structure or order dependencies.
Follow-ups they push on
- Why does BFS, not DFS, give the shortest path in an unweighted graph?
- When does DFS risk a stack overflow?
Red flag Using DFS to find a shortest unweighted path, or forgetting a visited set and looping forever on cycles.
source: Tech Interview Handbook — Graph cheatsheet ↗
Commonly asked junior trick occasional Why must graph traversals track visited nodes, and what's the cost of forgetting?
Graphs can contain cycles and multiple paths to the same node, so without a visited set a traversal revisits nodes and, on a cycle, loops forever or explodes in work. A visited set makes both BFS and DFS O(V + E) by guaranteeing each node and edge is processed once. (Trees are the special case where you can skip it — they have no cycles.)
Follow-ups they push on
- Why is a visited set unnecessary when traversing a tree?
Red flag Copy-pasting tree-traversal code onto a graph and infinite-looping on the first cycle.
source: Tech Interview Handbook — Graph cheatsheet ↗
Commonly asked junior concept occasional Define directed vs undirected and weighted vs unweighted graphs, with an example of each.
In a directed graph edges have a direction (Twitter 'follows'); in an undirected graph they go both ways (Facebook 'friends'). Weighted edges carry a cost or distance (road network with mileage); unweighted edges just record a connection (a maze of equal steps). These two axes determine your algorithm choice — e.g. unweighted shortest path uses BFS, weighted uses Dijkstra.
Follow-ups they push on
- How does each property change which traversal/shortest-path algorithm you pick?
Red flag Modelling a one-way relationship (like 'follows') as an undirected edge and corrupting the graph's meaning.
source: Tech Interview Handbook — Graph cheatsheet ↗
AmazonGoogleMeta mid coding common Count the number of connected components in an undirected graph. Two ways?
Way 1 — traversal: loop over all nodes; each time you hit an unvisited node, increment the count and BFS/DFS to mark its whole component visited. O(V + E). Way 2 — union-find: start with V components and union the endpoints of every edge; each successful merge of two distinct sets drops the count by one. O(E·α(V)). Union-find shines when edges arrive incrementally or you also need connectivity queries; traversal is simplest for a static graph.
What a strong answer covers
- Traversal: count = number of BFS/DFS launches from unvisited nodes.
- Union-find: start at V, decrement on each cross-set union.
- Both are near-linear: O(V + E) vs O(E·α(V)).
- Prefer union-find for streaming edges / repeated connectivity queries.
Follow-ups they push on
- Which approach fits a stream of edges arriving over time, and why?
- How would you also report the size of the largest component?
Red flag Forgetting isolated (degree-0) vertices, which are components of their own and easy to miss.
source: LeetCode 323 — Number of Connected Components in an Undirected Graph (company tags) ↗
Commonly asked mid concept common How does Dijkstra's algorithm work, and why does it break with negative edge weights?
Dijkstra greedily grows a set of finalized shortest distances: a min-heap repeatedly pops the closest unfinalized node, finalizes its distance, and relaxes its outgoing edges. With a binary heap it's O((V + E) log V). It relies on the assumption that once you finalize a node, no later path can be shorter — true only with non-negative weights. A negative edge can make a 'longer-looking' path actually cheaper after the node is already finalized, breaking correctness; for negative edges use Bellman-Ford (O(V·E)), which also detects negative cycles.
What a strong answer covers
- Min-heap pops the nearest unfinalized node, then relaxes its edges.
- O((V + E) log V) with a binary heap.
- Correct only because finalized nodes can't be improved — needs non-negative weights.
- Negative edges ⇒ use Bellman-Ford (O(V·E)), which finds negative cycles.
Quick self-check
Why does Dijkstra fail on graphs with negative edge weights?
Follow-ups they push on
- What does Bellman-Ford do that Dijkstra can't?
- How does A* differ from Dijkstra?
Red flag Running Dijkstra on a graph with negative edges and trusting the (silently wrong) result.
source: Tech Interview Handbook — Graph cheatsheet ↗
Commonly asked mid concept common How do you detect a cycle in a graph, and why does the method differ between directed and undirected graphs?
In an undirected graph, DFS finds a cycle if it reaches an already-visited node that isn't the immediate parent (or union-find: an edge joining two nodes already in the same set). In a directed graph a plain visited set is insufficient — you must track nodes currently on the recursion stack (often three colors: white/unvisited, gray/in-progress, black/done); a back edge to a gray node means a cycle. The difference is that in directed graphs revisiting a finished node is fine (it's just a shared descendant), whereas an edge back to an *in-progress* ancestor is the cycle.
What a strong answer covers
- Undirected: visited neighbor that isn't the parent ⇒ cycle (or union-find).
- Directed: need a recursion-stack / gray marker, not just visited.
- Back edge to a gray (in-progress) node ⇒ directed cycle.
- Revisiting a finished (black) node in a digraph is not a cycle.
Quick self-check
Detecting a cycle in a DIRECTED graph requires tracking which of these beyond a visited set?
Follow-ups they push on
- Why isn't a simple visited set enough for directed cycle detection?
- How does topological sort also reveal a directed cycle?
Red flag Reusing the undirected approach (plain visited set) on a directed graph and reporting false cycles.
source: GeeksforGeeks — Detect Cycle in a Directed Graph ↗
AmazonMetaGoogleApple mid coding common Given a directed acyclic dependency graph, produce a valid build/task order (topological ordering via Kahn's algorithm).
Compute every node's in-degree, seed a queue with all in-degree-0 nodes (no dependencies), then repeatedly dequeue a node, append it to the order, and decrement its neighbors' in-degrees — enqueuing any that hit zero. O(V + E). If the emitted order contains fewer than V nodes, a cycle exists and no valid ordering is possible, so the same algorithm doubles as cycle detection. This is exactly Course Schedule II / dependency resolution.
What a strong answer covers
- Kahn's: start from in-degree-0 nodes, peel them off layer by layer.
- Decrement neighbors' in-degrees; enqueue when they reach 0.
- O(V + E) time and space.
- Output size < V ⇒ a cycle ⇒ no valid ordering.
Follow-ups they push on
- How does the same run tell you the graph has a cycle?
- How would you produce *all* valid topological orders?
Red flag Assuming an ordering always exists and not checking for the cycle case (output shorter than V).
source: LeetCode 210 — Course Schedule II (company tags) ↗
AmazonMetaGoogleMicrosoftBloomberg mid coding very common Count the number of islands in a grid of land ('1') and water ('0').
Scan every cell; when you hit unvisited land, increment the island count and flood-fill (BFS or DFS) all connected land, marking it visited so you don't recount it. The grid is an implicit graph where each cell connects to its 4 neighbours. O(rows * cols) time and space. The core insight is recognizing a 2D matrix as a graph traversal.
Follow-ups they push on
- How would you handle a grid too large to fit in memory?
- Count islands with diagonal connectivity.
Red flag Not marking visited cells (recounting the same island) or only checking diagonal instead of 4-directional neighbours.
source: LeetCode 200 — Number of Islands (company tags) ↗
AmazonMetaGoogleApple mid coding common Given course prerequisites, determine whether you can finish all courses.
Model courses as a directed graph and ask whether it has a cycle: if it does, the prerequisites are circular and you can't finish. Use Kahn's algorithm (BFS topological sort — repeatedly remove in-degree-0 nodes; if you can't remove them all, a cycle remains) or DFS cycle detection with a recursion-stack marker. O(V + E) time and space.
Follow-ups they push on
- Return a valid course ordering (Course Schedule II).
- BFS vs DFS for detecting the cycle?
Red flag Detecting a cycle with a simple visited set but no 'currently on the recursion stack' distinction, giving false positives.
source: LeetCode 207 — Course Schedule (company tags) ↗
Commonly asked mid concept common What is a topological sort, what graphs admit one, and how do you compute it?
A topological sort is a linear ordering of a directed graph's vertices where every edge u->v has u before v — it exists if and only if the graph is a DAG (no cycles). Compute it with Kahn's algorithm (repeatedly emit in-degree-0 nodes) or via DFS finish times reversed. It models dependency resolution: build systems, task scheduling, course prerequisites.
Follow-ups they push on
- How does the same algorithm also tell you the graph has a cycle?
Red flag Claiming any directed graph can be topologically sorted — cycles make it impossible.
source: AlgoMonster — Course Schedule (topological sort) ↗
Commonly asked mid concept common You need the shortest path in an unweighted graph. Which algorithm, and what changes if edges have weights?
Unweighted shortest path is plain BFS — the first time you reach a node is via the fewest edges, so it's optimal at O(V + E). With non-negative weights, BFS no longer works because fewer edges can cost more; switch to Dijkstra's algorithm, which uses a min-heap/priority queue to always expand the cheapest frontier node. The shift from a queue to a priority queue is the key recognition.
Follow-ups they push on
- Why does Dijkstra break with negative edge weights?
Red flag Reaching for Dijkstra on an unweighted graph (overkill) or using BFS when edges carry weights (wrong answer).
source: Tech Interview Handbook — Graph cheatsheet ↗
AmazonMetaGoogleBloomberg mid coding common Make a deep copy (clone) of a connected undirected graph.
Traverse with BFS or DFS while keeping a hash map from original node to its clone. When you first see a node, create its clone and record it; then for each neighbor, create-or-look-up its clone and wire up the edge. The map serves double duty as both the visited set and the original→copy lookup, which is what prevents infinite loops on cycles. O(V + E) time and space.
What a strong answer covers
- Map original → clone doubles as the visited set.
- Create a clone on first sight; reuse the mapped clone afterward.
- Wire each neighbor edge using looked-up clones.
- O(V + E) time and space; works via BFS or DFS.
Follow-ups they push on
- Why does the original→clone map prevent infinite recursion on cycles?
- How does this change for a directed graph?
Red flag Cloning a neighbor again instead of reusing the mapped clone, producing duplicate nodes and looping on cycles.
source: LeetCode 133 — Clone Graph (company tags) ↗
AmazonGoogleMetaLinkedIn senior coding occasional Find the length of the shortest word transformation from beginWord to endWord changing one letter at a time (Word Ladder).
Model each word as a graph node with edges to words differing by one letter, then run BFS from beginWord — the first time you reach endWord, the BFS depth is the shortest transformation length (unweighted shortest path). To find neighbors efficiently, use wildcard patterns like h*t as buckets so you don't compare every pair of words. BFS guarantees the shortest sequence; bidirectional BFS from both ends prunes the frontier and is a strong optimization to mention.
What a strong answer covers
- Words are nodes; one-letter-apart words are edges ⇒ unweighted graph.
- BFS gives the shortest transformation (fewest steps).
- Wildcard buckets (h*t) generate neighbors without all-pairs comparison.
- Bidirectional BFS searches from both ends to cut the explored frontier.
Follow-ups they push on
- Why BFS rather than DFS for the *shortest* sequence?
- How does bidirectional BFS reduce the work?
Red flag Using DFS (finds *a* path, not the shortest) or comparing all word pairs (O(N^2·L)) to build edges.
source: LeetCode 127 — Word Ladder (company tags) ↗

1.6 Algorithm categories — recognize the pattern 15

★ must-know Commonly asked mid concept common What's the general template for backtracking problems, and how do you prune to avoid exploring dead ends?
Backtracking is DFS over a decision tree: choose an option, explore by recursing, then un-choose (undo the change) before trying the next option. You hit a base case when a full candidate is built (record it) and prune by checking constraints *before* recursing — abandoning a branch the moment it can't lead to a valid solution. Pruning (e.g. skipping a queen placement under attack in N-Queens, or stopping when a partial sum exceeds the target) is what turns brute-force enumeration into something tractable.
What a strong answer covers
- Pattern: choose → explore → un-choose (restore state on the way out).
- Base case records a complete candidate.
- Prune early: reject a branch before recursing when it can't succeed.
- Used for permutations, combinations, N-Queens, Sudoku, word search.
Quick self-check
What is the defining structure of a backtracking algorithm?
Follow-ups they push on
- How does N-Queens prune attacked positions?
- Why must you undo the choice after recursing, not before?
Red flag Forgetting to undo the choice on the way back (state leaks across branches), or pruning only after fully building candidates.
source: Tech Interview Handbook — Recursion / Backtracking ↗
★ must-know AmazonMicrosoftGoogleBloombergApple mid coding very common Find the contiguous subarray with the largest sum (Maximum Subarray / Kadane's algorithm).
Kadane's algorithm: scan once, maintaining curr = max(x, curr + x) (either start fresh at x or extend the running subarray) and tracking the best curr seen. O(n) time, O(1) space. The key decision at each element — extend the previous subarray or restart — is a one-line DP. Watch the all-negative case: initialize the answer to the first element (or -∞), not 0, so you don't wrongly return 0 for an empty subarray.
What a strong answer covers
- Per element: curr = max(x, curr + x) — extend or restart.
- Track the maximum curr; O(n) time, O(1) space.
- It's a one-variable DP (running best ending here).
- All-negative inputs: init answer to first element, never 0.
Quick self-check
Why initialize Kadane's answer to the first element (or -∞) rather than 0?
Follow-ups they push on
- How would you also return the start/end indices of the subarray?
- What changes for the maximum *product* subarray?
Red flag Initializing the max to 0, which returns 0 for an all-negative array instead of the largest (least negative) element.
source: LeetCode 53 — Maximum Subarray (company tags) ↗
Commonly asked junior concept occasional Recursion vs iteration: what are a base case and the call stack, and when does recursion risk a stack overflow?
Recursion solves a problem by calling itself on smaller inputs until a base case stops the descent; each call pushes a frame onto the call stack and pops it on return. Without a correct base case (or with too-deep recursion) the stack grows until it overflows. Deep recursion on large inputs should be rewritten iteratively (or made tail-recursive where the language optimizes it) to use O(1) instead of O(depth) stack space.
Follow-ups they push on
- How would you convert a deep DFS recursion into an iterative one?
Red flag Omitting or mis-ordering the base case (infinite recursion), or ignoring the O(depth) stack cost on large inputs.
source: Tech Interview Handbook — Algorithms cheatsheet ↗
AmazonGoogleAdobe junior coding common Climbing stairs: you can take 1 or 2 steps at a time — how many ways to reach step n? Why is this Fibonacci?
The ways to reach step n equal the ways to reach n-1 (then a 1-step) plus the ways to reach n-2 (then a 2-step): ways(n) = ways(n-1) + ways(n-2) — the Fibonacci recurrence. Bottom-up DP keeping just the last two values gives O(n) time and O(1) space. Recognizing that the final move splits the problem into independent subproblems is the DP insight; the naive recursion without memoization is exponential.
What a strong answer covers
- ways(n) = ways(n-1) + ways(n-2) ⇒ Fibonacci shape.
- Subproblems overlap ⇒ DP, not naive exponential recursion.
- Rolling two variables ⇒ O(n) time, O(1) space.
- Base cases: ways(0)=1, ways(1)=1.
Follow-ups they push on
- Generalize to taking 1, 2, or 3 steps.
- What if each step has a cost and you minimize total cost (min cost climbing)?
Red flag Solving with naive O(2^n) recursion, or botching the base cases so the count is off by one.
source: LeetCode 70 — Climbing Stairs (company tags) ↗
Commonly asked mid concept common Compare quicksort and mergesort. Why is comparison sorting bounded at O(n log n)?
Quicksort partitions around a pivot in place — average O(n log n), O(log n) stack space, but O(n^2) worst case on bad pivots and not stable. Mergesort splits and merges — guaranteed O(n log n) and stable, but needs O(n) extra space. Any comparison-based sort is bounded below by O(n log n) because there are n! possible orderings and each comparison yields one bit, so you need at least log2(n!) ~ n log n comparisons to distinguish them.
Follow-ups they push on
- How do non-comparison sorts like counting/radix beat O(n log n)?
- Why does Timsort (Python/Java) blend mergesort and insertion sort?
Red flag Calling quicksort O(n log n) worst case, or claiming any sort whatsoever beats O(n log n) (only non-comparison ones can).
source: Tech Interview Handbook — Algorithms / Sorting ↗
AmazonGoogleMicrosoft mid coding common House Robber: maximize the sum of non-adjacent house values along a street.
At each house you either skip it (carry forward the best so far) or rob it (its value plus the best up to two houses back): dp[i] = max(dp[i-1], dp[i-2] + nums[i]). Keep just the two previous results for O(n) time, O(1) space. The greedy 'rob every other house' fails — the optimal choice depends on values, which is the cue for DP over greedy.
What a strong answer covers
- Transition: dp[i] = max(dp[i-1], dp[i-2] + nums[i]) (skip vs rob).
- Two rolling variables ⇒ O(n) time, O(1) space.
- Greedy 'every other house' is wrong; the answer is value-dependent.
- Classic optimal-substructure + overlapping-subproblems DP.
Follow-ups they push on
- What changes if the houses are arranged in a circle (House Robber II)?
- Why does a greedy alternating strategy fail here?
Red flag Assuming the answer is just the larger of the even-index vs odd-index sums, which a counterexample breaks.
source: LeetCode 198 — House Robber (company tags) ↗
AmazonMetaGoogleBloomberg mid coding common Generate all subsets (the power set) of a set of distinct integers.
Use backtracking: at each index decide to include or exclude that element, recursing on the rest and recording the running subset at every node of the decision tree. There are 2^n subsets, so it's O(n·2^n) time (n to copy each subset) — inherent to the output size. An iterative alternative builds subsets by, for each new element, appending it to every subset seen so far. Passing a start index prevents revisiting earlier elements and generating duplicates.
What a strong answer covers
- Include/exclude decision per element ⇒ binary choice tree of 2^n leaves.
- Record the partial subset at every recursion node.
- O(n·2^n) — bounded by the output size itself.
- A start index avoids re-choosing earlier elements (no dup subsets).
Follow-ups they push on
- How do you handle duplicate input values (Subsets II)?
- How does this template extend to permutations and combinations?
Red flag Adding the same combination twice by recursing from index 0 instead of advancing a `start` pointer.
source: LeetCode 78 — Subsets (company tags) ↗
AmazonMetaGoogleMicrosoftApple mid coding very common Compute the product of all elements except self, without using division and in O(n).
Use prefix and suffix products: first pass fills each position with the product of everything to its left; second pass multiplies in the product of everything to its right (tracked in a running variable). O(n) time, O(1) extra space if the output array doesn't count. Division would be the obvious trick but is explicitly banned — and it breaks on zeros anyway, which is exactly why the prefix/suffix approach is the expected answer.
What a strong answer covers
- Left-products pass, then a right-products running multiply.
- O(n) time, O(1) extra space (output aside).
- Avoids division — which the problem bans and which fails on zeros.
- Each output = (product of all left) × (product of all right).
Follow-ups they push on
- Why is the division approach fragile when the array contains a zero?
- How do you keep it O(1) extra space (reusing the output array)?
Red flag Using division (banned, and breaks with one or more zeros) instead of prefix/suffix products.
source: LeetCode 238 — Product of Array Except Self (company tags) ↗
AmazonMetaGoogleMicrosoftBloomberg mid coding very common What cues in a problem tell you to reach for binary search? Search a rotated sorted array as an example.
The cue is 'sorted (or monotonic) + find', or a search space you can halve by a yes/no test — binary search gives O(log n). In a rotated sorted array, at each midpoint one half is still sorted; check whether the target lies within that sorted half to decide which side to discard, keeping it O(log n). Binary search also hides in 'find minimum capacity/threshold' problems via binary-search-on-the-answer.
Follow-ups they push on
- Find the minimum in a rotated sorted array.
- How do you binary-search on the answer?
Red flag Off-by-one and infinite loops from sloppy mid/low/high updates, or assuming the array must be fully sorted to apply it.
source: LeetCode 33 — Search in Rotated Sorted Array (company tags) ↗
Commonly asked mid concept common When do you use two pointers vs a sliding window? Give the canonical cue for each.
Two pointers fits sorted arrays and pair/triplet problems: move a left and right pointer inward based on a comparison (e.g. pair-sum, removing duplicates). Sliding window fits 'longest/shortest contiguous subarray or substring satisfying a constraint': grow the right edge and shrink the left when the constraint breaks. Both turn an O(n^2) brute force into O(n) by never resetting the pointers backwards.
Follow-ups they push on
- What signals a fixed-size window vs a variable-size one?
Red flag Resetting the inner pointer to the window start on each step, which silently reintroduces O(n^2) behaviour.
source: DEV — Two Pointers & Sliding Window ↗
AmazonMetaGoogleMicrosoftBloombergApple mid coding very common Find the length of the longest substring without repeating characters.
Slide a window with two pointers, tracking the characters currently inside in a hash set/map; when the right pointer hits a duplicate, advance the left pointer (removing characters) until the window is valid again, recording the max length along the way. Each character enters and leaves the window at most once, so it's O(n) time, O(min(n, alphabet)) space. The classic sliding-window-plus-hashing problem.
Follow-ups they push on
- Generalize to at most k distinct characters.
Red flag Restarting the scan from the duplicate instead of moving the left pointer, degrading to O(n^2).
source: LeetCode 3 — Longest Substring Without Repeating Characters (company tags) ↗
Commonly asked mid concept common How do you recognize a dynamic programming problem, and what's the difference between memoization and tabulation?
DP applies when a problem has overlapping subproblems (the same smaller problem recurs) and optimal substructure (the best answer is built from best sub-answers) — counting paths, min cost, longest subsequence are typical. Memoization is top-down: write the natural recursion and cache results. Tabulation is bottom-up: fill a table in dependency order, avoiding recursion overhead. Both cut exponential brute force to polynomial; choose based on which is clearer.
Follow-ups they push on
- When does tabulation let you shrink space to O(1) rows?
Red flag Reaching for greedy on a problem that needs DP (greedy gives a locally optimal but globally wrong answer).
source: NeetCode — Roadmap ↗
AmazonGoogleMeta mid coding common Given coin denominations and an amount, return the fewest coins to make that amount.
This is bottom-up DP: dp[a] = fewest coins to make amount a, computed as 1 + min over coins c of dp[a - c], with dp[0] = 0 and unreachable amounts marked infinity. Answer is dp[amount] or -1 if still infinity. O(amount * numCoins) time, O(amount) space. The interviewer is watching you state the subproblem and transition clearly — and notice that greedy (largest coin first) fails for denominations like {1, 3, 4}.
Follow-ups they push on
- Why does the greedy 'largest coin first' approach fail here?
- Count the number of ways instead of the minimum.
Red flag Using greedy largest-coin-first, which is wrong for arbitrary denominations.
source: LeetCode 322 — Coin Change (company tags) ↗
AmazonGoogleMeta senior coding occasional What is 'binary search on the answer', and when do you apply it? (e.g. minimum capacity / Koko eating bananas)
When the input isn't sorted but the answer space is monotonic — a candidate value either works or doesn't, and 'works' is monotone (if capacity X works, every larger capacity also works) — you binary-search over the range of possible answers, using a feasibility check as the comparison. Example: find the minimum eating speed so Koko finishes in H hours — binary-search the speed and test 'can she finish at speed k?' in O(n) each, giving O(n log(max)) overall. The trick is spotting the monotone yes/no boundary you can bisect.
What a strong answer covers
- Search the answer range, not the array, when answers are monotone.
- Need a feasibility test: 'does candidate value X satisfy the constraint?'
- Bisect toward the boundary between feasible and infeasible.
- Cost = O(check · log(range)), e.g. O(n log(max)).
Follow-ups they push on
- How do you prove the feasibility predicate is monotonic?
- Apply it to 'minimum days to ship all packages within D days'.
Red flag Applying it when the feasibility predicate isn't monotonic, so bisection converges to a wrong boundary.
source: LeetCode 875 — Koko Eating Bananas (company tags) ↗
Commonly asked senior concept occasional Explain greedy vs divide-and-conquer vs dynamic programming. How do you know greedy is safe?
Divide-and-conquer splits into independent subproblems and combines results (mergesort, binary search). DP is for overlapping subproblems with optimal substructure, caching to avoid recomputation. Greedy makes the locally optimal choice at each step and never revisits it — fast and simple, but only correct when the problem has the greedy-choice property (e.g. interval scheduling, Dijkstra, Huffman). You justify greedy with an exchange argument or by proving the greedy choice is always part of some optimal solution; otherwise fall back to DP.
Follow-ups they push on
- Name a problem where greedy looks right but fails, and DP is needed.
Red flag Asserting a greedy strategy is correct without an exchange argument, then being blindsided by a counterexample.
source: NeetCode — Roadmap ↗

02 Backend Engineering 111 Q's

2.1 HTTP/HTTPS deeply 14

★ must-know Commonly asked mid concept very common 401 vs 403 vs 404 — when do you return each, and why might a security-conscious API return 404 instead of 403?
401 Unauthorized means 'I don't know who you are' — no/invalid credentials; the right fix is to authenticate. 403 Forbidden means 'I know who you are, but you're not allowed' — re-authenticating won't help. 404 Not Found means the resource doesn't exist.
The security twist: a 403 on a resource you can't see still confirms it exists, leaking information (resource enumeration). Some APIs deliberately return 404 instead of 403 for unauthorized access to private resources, so an attacker can't distinguish 'exists but forbidden' from 'doesn't exist'.
What a strong answer covers
- 401 = not authenticated (who are you?); 403 = authenticated but not permitted.
- Despite the name, 401 means unauthenticated, not unauthorized — a historical misnomer.
- 403 confirms a resource exists, which can leak information.
- Returning 404 for forbidden private resources prevents enumeration.
Quick self-check
A logged-in user requests another user's private profile they have no rights to see. What's the most information-leak-resistant response?
Follow-ups they push on
- Why is 401's name a misnomer?
- When would leaking 'this resource exists' actually matter?
Red flag Returning 401 when the user IS authenticated but lacks permission — that's 403. And 403 on private resources silently leaks their existence.
source: MDN — 403 Forbidden ↗
Commonly asked junior concept very common Walk me through the HTTP status code families and name a key code in each.
Five families by first digit: 1xx informational, 2xx success (200 OK, 201 Created, 204 No Content), 3xx redirection (301 permanent, 302 found, 304 Not Modified), 4xx client error (400 bad request, 401 unauthenticated, 403 forbidden, 404 not found, 409 conflict, 422 unprocessable, 429 too many requests), 5xx server error (500 internal, 502 bad gateway, 503 unavailable).
The useful instinct: 4xx means the client must change the request; 5xx means the client can retry the same request later.
Follow-ups they push on
- 401 vs 403 — what is the difference?
- When would you return 422 instead of 400?
- What does 304 require the client to have sent?
Red flag Returning 200 with an error body, or using 401 when you mean 403. 401 = not authenticated (who are you?), 403 = authenticated but not allowed.
source: MDN — HTTP response status codes ↗
Commonly asked mid concept very common Which HTTP methods are idempotent, and why does it matter?
GET, PUT, DELETE, HEAD, and OPTIONS are idempotent: making the same call N times leaves the server in the same state as making it once. POST is not idempotent — two POSTs typically create two resources.
It matters for safe retries. When a client times out it cannot tell whether the request was processed, so it must retry. Idempotent methods can be retried freely; for non-idempotent POSTs you need an idempotency key so the server can dedupe.
Follow-ups they push on
- How is idempotent different from safe?
- How would you make a payment POST safely retryable?
Red flag Conflating idempotent with safe. GET/HEAD/OPTIONS are also safe (no side effects); PUT/DELETE are idempotent but NOT safe. Also: PUT is idempotent by spec even though it changes data.
source: MDN — HTTP request methods ↗
Commonly asked mid concept common What do the SameSite, HttpOnly, and Secure cookie attributes each do?
HttpOnly hides the cookie from JavaScript (document.cookie), so an XSS payload can't read it — it mitigates token theft. Secure sends the cookie only over HTTPS, so it can't leak over plaintext. SameSite controls whether the cookie rides along on cross-site requests: Strict never sends it cross-site, Lax sends it only on top-level navigations (the modern browser default), and None sends it always but then requires Secure.
Together they harden a session cookie: HttpOnly+Secure stop theft and eavesdropping; SameSite is the first line of CSRF defense.
What a strong answer covers
- HttpOnly → unreadable by JS, blunts XSS-based token theft.
- Secure → HTTPS-only transmission.
- SameSite → controls cross-site sending; Lax is the default in modern browsers.
- SameSite=None must be paired with Secure or the browser rejects it.
Quick self-check
Which cookie attribute most directly mitigates CSRF?
Follow-ups they push on
- Why does SameSite=None require Secure?
- Does HttpOnly do anything against CSRF? (No — the cookie is still auto-sent.)
Red flag Thinking HttpOnly prevents CSRF. It stops JS from reading the cookie, but the browser still attaches it automatically on requests — SameSite/CSRF tokens handle CSRF.
source: MDN — Set-Cookie (SameSite) ↗
Commonly asked mid trick occasional Trick: is GET guaranteed to have no server-side effects? Is it safe to cache and retry a GET?
By the HTTP spec GET is safe (read-only) and idempotent, so intermediaries (browsers, proxies, CDNs) freely cache and retry it. But 'safe' is a *contract you must honor*, not something the protocol enforces — a poorly designed GET /delete?id=5 will happily delete data.
The danger: because GETs are prefetched, cached, and retried, a side-effecting GET can be triggered by a link prefetcher, a crawler, or a retry, causing unintended mutations. Mutations belong on POST/PUT/PATCH/DELETE; keep GET strictly read-only.
What a strong answer covers
- GET is defined as safe + idempotent, but the server must actually honor that.
- Caches, prefetchers, and crawlers will issue GETs without user intent.
- A side-effecting GET can fire from a prefetch or retry — a real source of bugs/exploits.
- Put all mutations behind non-safe methods.
Quick self-check
Why is implementing a delete behind GET /delete?id=5 dangerous?
Follow-ups they push on
- How could a crawler or link-prefetcher trigger a side-effecting GET?
- What's the difference between 'safe' and 'idempotent' here?
Red flag Believing the protocol enforces GET's safety. It's a contract — a GET that mutates state is valid HTTP but a design bug that prefetchers and caches will exploit.
source: MDN — Safe (HTTP methods) ↗
Commonly asked mid concept common What is an ETag and how does conditional caching with If-None-Match work?
An ETag is an opaque validator (often a hash) the server attaches to a response to identify a specific version of a resource. On the next request the client sends If-None-Match: <etag>. If the resource is unchanged the server replies 304 Not Modified with no body, saving bandwidth; if it changed it returns 200 with the new body and a new ETag.
ETags also enable optimistic concurrency on writes via If-Match: the write is rejected with 412 Precondition Failed if someone else changed the resource first.
Follow-ups they push on
- Strong vs weak ETags?
- How does this compare to Last-Modified / If-Modified-Since?
Red flag Thinking 304 carries the body — it does not; the client reuses its cached copy. Also forgetting ETags can prevent lost-update races on PUT.
source: MDN — ETag ↗
Commonly asked mid concept very common Explain CORS. Why does a browser block a cross-origin request, and what is a preflight?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism on top of the same-origin policy. By default a page at origin A cannot read responses from origin B unless B opts in via Access-Control-Allow-Origin.
For non-simple requests (custom headers, methods like PUT/DELETE, certain content types) the browser first sends a preflight OPTIONS request. The server answers with Access-Control-Allow-Methods, Allow-Headers, and Allow-Origin; only then does the browser send the real request.
Follow-ups they push on
- Does CORS protect the server? (No — it protects the user's browser.)
- What does Access-Control-Allow-Credentials change, and why can't you combine it with '*'?
Red flag Believing CORS is server-side security. It is enforced by the browser; curl, Postman, and a malicious server-to-server call ignore it entirely.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗
Commonly asked mid concept common What does HTTPS/TLS actually add over HTTP, and what is the rough handshake?
TLS adds three things: confidentiality (traffic is encrypted), integrity (tampering is detected), and server identity (the certificate, signed by a CA, proves you are talking to the real host).
Handshake sketch: client sends ClientHello (supported ciphers); server returns its certificate and key-exchange parameters; they use asymmetric crypto (e.g. ECDHE) to agree on a shared symmetric session key; the rest of the connection uses fast symmetric encryption. TLS 1.3 cuts this to roughly one round trip.
Follow-ups they push on
- Why switch to a symmetric key after the handshake?
- What does the CA actually vouch for?
Red flag Saying TLS 'encrypts with the certificate'. The cert carries the public key and identity; the bulk data is encrypted with a negotiated symmetric session key.
source: Cloudflare — What happens in a TLS handshake? ↗
Commonly asked mid concept common PUT vs PATCH vs POST for updating a resource — when do you use each?
PUT replaces the resource wholesale and is idempotent — send the full representation. PATCH applies a partial modification (only the changed fields) and is not guaranteed idempotent by spec. POST creates a new subordinate resource or triggers a non-idempotent action.
Rule of thumb: full replace at a known URL → PUT; partial field update → PATCH; create-and-let-the-server-assign-the-id → POST.
Follow-ups they push on
- Can PUT create a resource? (Yes, if the client picks the URL/id.)
- How would you make PATCH idempotent?
Red flag Using PUT for partial updates — sending only some fields with PUT semantically blanks the rest. Use PATCH for partial.
source: MDN — PUT ↗
Commonly asked senior concept occasional What's the difference between Connection: keep-alive and HTTP/2 multiplexing? Why isn't keep-alive enough?
Keep-alive (persistent connections, default in HTTP/1.1) reuses one TCP connection for multiple sequential requests, avoiding a new handshake each time. But requests on that connection are still serialized — request 2 waits for response 1 (head-of-line blocking), which is why browsers open ~6 parallel connections per host.
HTTP/2 multiplexing interleaves many concurrent request/response streams over a single connection, so a slow response doesn't block the others at the application layer. Keep-alive reuses the pipe; multiplexing lets many requests share it simultaneously.
What a strong answer covers
- Keep-alive reuses one connection but processes requests sequentially.
- HTTP/1.1 pipelining tried concurrency but still suffered HOL blocking and was largely abandoned.
- HTTP/2 multiplexing runs concurrent streams over one connection.
- Browsers opened ~6 connections per host precisely to work around HTTP/1.1 serialization.
Follow-ups they push on
- Why did HTTP/1.1 pipelining never catch on?
- Does HTTP/2 multiplexing eliminate ALL head-of-line blocking? (No — TCP-level remains.)
Red flag Conflating keep-alive with multiplexing. Keep-alive just avoids re-handshaking; it does not allow concurrent in-flight requests on the same connection.
source: MDN — Connection management in HTTP/1.x ↗
Commonly asked senior concept occasional How does HSTS work, and what attack does it prevent that a redirect from HTTP to HTTPS does not?
HSTS (HTTP Strict Transport Security) is a response header (Strict-Transport-Security: max-age=...) that tells the browser to only ever contact this host over HTTPS for the given duration — the browser upgrades any http:// request to https:// *before* sending it.
A plain 301 redirect from HTTP→HTTPS still sends that first request in cleartext, which a man-in-the-middle can intercept and strip (SSL stripping). HSTS closes that window because, after the first secure visit (or via the preload list), the browser never makes the insecure request at all.
What a strong answer covers
- HSTS forces the browser to upgrade requests to HTTPS before any cleartext goes out.
- A redirect leaves the initial request exposed to SSL-stripping MITM.
- The HSTS preload list protects even the very first visit.
- Set a long max-age; includeSubDomains extends it to subdomains.
Quick self-check
What does HSTS protect against that a 301 HTTP→HTTPS redirect alone does not?
Follow-ups they push on
- What is SSL stripping?
- Why does the HSTS preload list matter for the first-ever visit?
Red flag Assuming an HTTP→HTTPS redirect is fully secure. The first cleartext request before the redirect is interceptable; HSTS (ideally preloaded) is what removes that gap.
source: MDN — Strict-Transport-Security ↗
Commonly asked senior concept occasional A client uploads a large file and the server responds 100 Continue before the body. What is the Expect: 100-continue mechanism for?
When a client is about to send a large request body, it can send the headers first with Expect: 100-continue and pause before sending the body. The server inspects the headers (auth, content-length limits, content-type) and replies 100 Continue to greenlight the body, or an error status (e.g. 401, 413) to reject it up front.
The point is to avoid wasting bandwidth uploading a huge body that the server would only reject anyway. It's part of the 1xx informational family — a provisional response before the final one.
What a strong answer covers
- Expect: 100-continue lets the client send headers, then wait for a go-ahead.
- Server replies 100 Continue to accept the body, or an error to reject before upload.
- Saves bandwidth on large bodies the server would reject (auth fail, too large).
- 1xx are provisional/informational responses preceding the final status.
Follow-ups they push on
- What status would the server send instead of 100 to reject an oversized upload? (413)
- What other 1xx codes exist? (101 Switching Protocols, 103 Early Hints)
Red flag Treating 100 Continue as a final response. It's provisional — the real status comes after the body is sent and processed.
source: MDN — 100 Continue ↗
Commonly asked senior debug occasional A client gets a 200 but you suspect the response was served stale. Which headers control caching, and how would you debug it?
Caching is governed by Cache-Control (max-age, no-store, no-cache, private/public, must-revalidate), plus validators ETag/Last-Modified and the legacy Expires.
Debug path: inspect the response Cache-Control and Age headers; check whether an intermediary (CDN/proxy) added Age or an X-Cache: HIT. no-cache means 'revalidate before use', not 'do not cache' — that surprises people. To force freshness set no-store or a short max-age plus an ETag so clients revalidate cheaply with 304s.
Follow-ups they push on
- no-cache vs no-store — exact difference?
- What does the Vary header do for a shared cache?
Red flag Reading `no-cache` as 'never cache'. It means store but revalidate every time; `no-store` is the one that forbids storing.
source: MDN — Cache-Control ↗
Commonly asked senior concept occasional What problems did HTTP/2 solve over HTTP/1.1, and what does HTTP/3 change?
HTTP/1.1 suffers head-of-line blocking: one response per connection at a time, so browsers open many TCP connections. HTTP/2 adds multiplexing (many concurrent streams over one connection), header compression (HPACK), and server push, removing application-layer HOL blocking.
But HTTP/2 still rides TCP, so a single lost packet stalls all streams (transport-layer HOL blocking). HTTP/3 runs over QUIC (UDP), giving independent streams, faster connection setup (0-RTT), and seamless connection migration across network changes.
Follow-ups they push on
- Why doesn't HTTP/2 multiplexing fully fix HOL blocking?
- What does QUIC do that TCP can't?
Red flag Claiming HTTP/2 eliminated all head-of-line blocking. It removed it at the HTTP layer but TCP still serializes loss recovery — that's why HTTP/3 moved to QUIC.
source: Cloudflare — HTTP/3 vs HTTP/2 ↗

2.2 API design & alternatives 14

★ must-know Commonly asked senior concept common Why does the N+1 query problem hit GraphQL especially hard, and how do you fix it?
GraphQL resolvers run per-field, per-object. Fetch a list of 10 authors and then ask for each author's posts, and the naive resolver fires 1 query for the authors + N queries for the posts — the classic N+1 blowup, which gets worse as clients nest deeper.
The standard fix is a DataLoader: it batches the individual post requests made within one tick of the event loop into a single WHERE author_id IN (...) query and caches results per request. This collapses N+1 into 2 queries while keeping the per-field resolver model.
What a strong answer covers
- Per-field resolvers mean nested fields each trigger their own query.
- A list of N parents requesting a child field → 1 + N queries.
- DataLoader batches per-tick requests into one IN (...) query and caches per request.
- It's worse in GraphQL than REST because clients control nesting depth dynamically.
Quick self-check
Querying 50 users and each user's `team` name with a naive resolver issues how many DB queries, and what fixes it?
Follow-ups they push on
- Why is per-request caching (not global) the right scope for DataLoader?
- How does query-depth/complexity limiting relate to this?
Red flag Solving N+1 by eager-loading everything regardless of the query — you over-fetch and lose GraphQL's selectivity. Batch with DataLoader instead.
source: Apollo — Optimizing resolvers with DataLoader ↗
Commonly asked junior trick common Trick: what's wrong with the REST route GET /getUserById?id=5, and how should it look?
It mixes RPC-style verb-in-the-path (getUserById) with REST, which is redundant and inconsistent. In REST the HTTP method is the verb and the URL names a resource (noun). So fetching user 5 is simply GET /users/5; the GET already says 'retrieve', and /users/{id} identifies the resource.
Proper resource modeling: GET /users (list), POST /users (create), GET /users/5, PUT/PATCH /users/5 (update), DELETE /users/5. Keep verbs out of paths and use plural nouns consistently.
What a strong answer covers
- The HTTP method is the verb; the path is a noun/resource identifier.
- GET /users/5, not GET /getUserById?id=5.
- Use plural collection nouns consistently (/users, /orders).
- Verb-in-path is an RPC style, not REST.
Quick self-check
Which is the correct RESTful way to fetch the user with id 5?
Follow-ups they push on
- How would you model 'cancel an order' RESTfully? (POST /orders/5/cancel or PATCH status)
- When is an RPC-style action endpoint actually acceptable?
Red flag Putting actions/verbs in the URL (`/createUser`, `/deleteOrder`). The method conveys the action; the path names the thing.
source: MDN — REST ↗
Commonly asked junior concept common What does a good API error response look like, and why is a consistent error shape worth enforcing?
Use the right status code to signal the category, then a structured body with a stable machine-readable code, a human message, and optional details/field-level errors. Keep the shape identical across every endpoint so clients can handle errors generically.
Example shape: { "error": { "code": "card_declined", "message": "Your card was declined.", "details": [] } }. Stable string codes (not just HTTP numbers) let clients branch on the specific failure without parsing prose. Never leak stack traces or internal identifiers.
Follow-ups they push on
- 400 vs 422 for validation errors?
- Why include a stable `code` string alongside the HTTP status?
Red flag Returning 200 with `{ success: false }`, or varying the error body per endpoint. Clients then can't handle failures uniformly.
source: Stripe — Error handling ↗
Commonly asked mid design very common Design a URL-shortening API (like bit.ly). Walk me through the endpoints and the redirect.
Two core endpoints: POST /urls with the long URL returns a short code (201 + Location); GET /{code} issues a 301/302 redirect to the long URL.
Key decisions: generate the code via a base62 encoding of an auto-increment id or a hash (handle collisions); store code -> longURL in a fast KV store; cache hot codes (read-heavy workload). Discuss 301 (permanent, cacheable, loses analytics) vs 302 (temporary, every hit reaches you for click counts). Add rate limiting and custom-alias support as extensions.
Follow-ups they push on
- 301 vs 302 for the redirect — which and why?
- How do you guarantee short-code uniqueness at scale?
- How would you add click analytics without slowing the redirect?
Red flag Picking 301 then wondering why click analytics vanish — browsers cache 301 and stop hitting your server.
source: system-design-primer — Design a URL shortener ↗
Commonly asked mid concept common WebSockets vs Server-Sent Events vs long polling — how do you pick for a real-time feature?
Long polling holds an HTTP request open until there's data, then the client reconnects — works everywhere but is request-heavy and laggy. Server-Sent Events (SSE) is a one-way server→client stream over a single long-lived HTTP connection, with built-in auto-reconnect and event IDs — ideal for notifications, live scores, dashboards. WebSockets give a full-duplex bidirectional channel after an HTTP upgrade — needed when the client also pushes frequently (chat, collaborative editing, multiplayer).
Rule of thumb: server-push-only → SSE (simpler, rides plain HTTP); two-way/high-frequency → WebSockets; fallback when neither is available → long polling.
What a strong answer covers
- SSE is unidirectional (server→client), text-only, with automatic reconnection.
- WebSockets are bidirectional full-duplex after an upgrade handshake.
- Long polling is the universal but least efficient fallback.
- SSE works over plain HTTP/2; WebSockets need their own protocol handling.
Quick self-check
A dashboard only needs the server to push live metric updates to the browser. Best fit?
Follow-ups they push on
- Why might SSE be a better fit than WebSockets for a notifications feed?
- What HTTP mechanism upgrades a connection to a WebSocket? (101 Switching Protocols)
Red flag Reaching for WebSockets for a one-way notification stream. SSE is simpler, auto-reconnects, and rides ordinary HTTP infrastructure.
source: MDN — Server-sent events ↗
Commonly asked mid concept occasional What makes gRPC fast, and what are the practical downsides versus REST/JSON?
gRPC rides HTTP/2 (multiplexed, persistent connections) and serializes with Protocol Buffers — a compact binary format with a strict schema, so payloads are smaller and parsing is faster than text JSON. It also generates typed client/server stubs and supports streaming in both directions.
Downsides: it's not natively callable from browsers (you need gRPC-Web + a proxy); the binary payloads aren't human-readable, so debugging needs tooling; and it adds schema/codegen overhead. That's why gRPC dominates internal service-to-service traffic while REST/JSON stays the default for public, browser-facing APIs.
What a strong answer covers
- HTTP/2 transport + binary Protocol Buffers → small payloads, fast parsing.
- Generated typed stubs and first-class bidirectional streaming.
- Not browser-native — needs gRPC-Web and a proxy.
- Binary payloads are hard to eyeball/debug versus JSON.
Follow-ups they push on
- Why can't a browser call a gRPC service directly?
- When is the protobuf schema requirement a benefit vs a burden?
Red flag Choosing gRPC for a public browser-facing API. Its lack of native browser support and opaque payloads make REST/JSON the friendlier public choice.
source: gRPC — Core concepts, architecture and lifecycle ↗
GitHub mid concept occasional How do rate-limit response headers (X-RateLimit-* / RateLimit-*) and 429 + Retry-After help a well-behaved client?
When throttling, return 429 Too Many Requests and tell the client *how* to behave. Limit headers expose the budget: a limit, the remaining count, and a reset time (GitHub uses X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset; the IETF RateLimit draft standardizes this). On a 429 (or 503) a Retry-After header tells the client exactly how long to wait.
This lets a good client self-throttle proactively — slow down as remaining approaches zero and back off precisely after a 429 — instead of blindly hammering and guessing.
What a strong answer covers
- 429 = rate limited; pair it with Retry-After (seconds or a date).
- Limit/Remaining/Reset headers let clients pace themselves before being blocked.
- GitHub's API documents X-RateLimit-*; an IETF RateLimit header draft standardizes the pattern.
- Proactive self-throttling beats reactive retry-storms.
Follow-ups they push on
- What format can Retry-After take? (delay-seconds or an HTTP date)
- Why surface Remaining/Reset instead of only a 429?
Red flag Returning 429 with no Retry-After or budget headers, leaving clients to guess and retry-storm. Tell them how long to wait and how much budget remains.
source: GitHub REST API — Rate limits ↗
Commonly asked mid concept very common REST vs GraphQL vs gRPC vs WebSockets — when do you reach for each?
REST: default for public CRUD over HTTP; cacheable, simple, ubiquitous. GraphQL: client picks exactly the fields it needs — kills over/under-fetching when many clients aggregate data from many resources; cost is caching and query-complexity control. gRPC: high-performance internal service-to-service calls over HTTP/2 + protobuf, with streaming; not browser-native. WebSockets: persistent bidirectional real-time channel (chat, live feeds, multiplayer).
Choose by traffic shape: public+cacheable → REST; flexible client queries → GraphQL; fast internal RPC → gRPC; push/real-time → WebSockets.
Follow-ups they push on
- Why is GraphQL harder to cache than REST?
- Why isn't gRPC used directly from browsers?
Red flag Reaching for GraphQL or gRPC by default. For a simple public CRUD API, REST is usually the lower-friction, more cacheable choice.
source: ByteByteGo — REST vs GraphQL vs gRPC ↗
Commonly asked mid concept common Offset pagination vs cursor (keyset) pagination — what breaks with offset at scale?
Offset/limit (LIMIT 20 OFFSET 10000) is simple but the database must scan and discard every skipped row, so deep pages get slow, and rows shifting between requests cause duplicates or skips.
Cursor/keyset pagination passes the last-seen sorted key (WHERE id > :lastId ORDER BY id LIMIT 20). It uses the index directly, so performance is constant regardless of depth, and it is stable under inserts. Tradeoff: you can't jump to an arbitrary page number. Use cursors for infinite scroll and large/active datasets.
Follow-ups they push on
- Why does offset pagination skip or duplicate rows under writes?
- How do you build a cursor over a non-unique sort column?
Red flag Using OFFSET for an infinite feed — as users scroll, new inserts shift the window and they see duplicates. Cursors avoid that.
source: Hello Interview — Pagination patterns ↗
Commonly asked mid concept common How do you version a public API, and how do you evolve it without breaking clients?
Three common strategies: URI versioning (/v1/users) — explicit and cache-friendly, the most common; header versioning (Accept: application/vnd.api.v2+json) — cleaner URLs, harder to test in a browser; and query param (?version=2).
The deeper answer is to avoid breaking changes at all: add fields rather than remove, treat unknown fields as ignorable, never repurpose a field's meaning, and only bump the major version for genuinely incompatible changes. Announce deprecations with timelines and Deprecation/Sunset headers.
Follow-ups they push on
- What counts as a breaking vs non-breaking change?
- How does Stripe version without URL bumps? (dated versions pinned per account)
Red flag Bumping the version for additive changes. Adding an optional field is backward-compatible and shouldn't force clients to migrate.
source: Hello Interview — API design (versioning) ↗
Commonly asked senior design occasional Design a bulk-create endpoint that imports 10,000 records. Sync or async, and how do you report results?
Don't process 10k records in a synchronous request — you'll hit timeouts and tie up a worker. Accept the payload, validate it cheaply, enqueue a background job, and return 202 Accepted with a job/status URL (Location: /imports/{id}). The client polls that URL (or subscribes) for progress and the final per-record outcome.
Key decisions: define partial-failure semantics (all-or-nothing transaction vs per-record results so 9,998 succeed and 2 errors are reported), make the import idempotent via a client-supplied batch key so retries don't double-import, and cap batch size with backpressure.
Follow-ups they push on
- All-or-nothing vs per-record partial success — which and why?
- How do you make the bulk import idempotent under client retries?
- What status code signals 'accepted but not yet done'? (202)
Red flag Processing the whole batch inline and returning one 200/500. Long requests time out, and a single bad row failing the entire batch is a poor contract — go async with per-record results.
source: MDN — 202 Accepted ↗
Stripe senior design very common How do idempotency keys make a payment POST safely retryable? Walk through the server logic.
The client generates a unique key (e.g. a V4 UUID) and sends it in an Idempotency-Key header. The server stores the key with the request's outcome.
Logic: on first request for a key, process it and persist the resulting status + response body keyed by that idempotency key (inside the same transaction as the side effect). On any retry with the same key, return the stored response instead of re-charging. Handle the in-flight case (a retry arriving while the first is still processing) with a lock or a 409. Stripe expires keys after 24 hours. This turns a non-idempotent POST into a safely retryable one after a timeout.
Follow-ups they push on
- Where do you store the key — same DB transaction as the charge? Why?
- What if two identical requests arrive concurrently?
Red flag Storing the idempotency record separately from the side effect, so a crash between the charge and the record leaves you able to double-charge. Persist them atomically.
source: Stripe — Designing robust APIs with idempotency ↗
AmazonStripe senior design very common Design a rate limiter for an API. Which algorithm would you use and why?
The token bucket is the common default — a bucket refills tokens at a fixed rate up to a capacity; each request consumes a token, and an empty bucket means the request is rejected with 429 Too Many Requests (plus a Retry-After header). It allows short bursts while bounding the average rate. ByteByteGo notes both Amazon and Stripe use this algorithm to throttle their APIs.
Alternatives: leaky bucket (smooths to a constant outflow), fixed window (simple but allows 2x bursts at window edges), and sliding window (smooths the edge problem). For a distributed limiter, keep counters in a shared store like Redis (atomic INCR with TTL) so all nodes agree.
Follow-ups they push on
- What status code and header do you return when throttled?
- How do you keep the limit consistent across many API servers?
- Why does fixed-window allow a 2x burst?
Red flag Keeping the counter in each server's local memory in a multi-node deployment — clients then get N times the limit. Use a shared/atomic store.
source: ByteByteGo — Design a rate limiter ↗
Commonly asked senior trick occasional What is HATEOAS, and is it actually used in practice?
HATEOAS (Hypermedia As The Engine Of Application State) is the REST constraint where responses include links to the next available actions, so the client discovers transitions dynamically ({ "_links": { "cancel": "/orders/42/cancel" } }) instead of hardcoding URLs.
In practice it's the least-adopted REST constraint — most 'REST' APIs are really HTTP+JSON without hypermedia. Be honest in interviews: know what it is and the decoupling argument, but acknowledge most teams skip it because clients are coupled to the API anyway and tooling support is thin.
Follow-ups they push on
- What would full HATEOAS buy you that plain JSON doesn't?
- What is the Richardson Maturity Model?
Red flag Claiming your API is 'fully RESTful' while having no hypermedia — by Fielding's definition that's level 2, not true REST.
source: MDN — REST ↗

2.3 Auth & security concepts 14

★ must-know Commonly asked mid concept very common What is CSRF, and why does a CSRF attack work even though the attacker never sees the victim's cookie?
CSRF (Cross-Site Request Forgery) tricks a logged-in victim's browser into making a state-changing request to your site. The attacker hosts a page that auto-submits a form (or fires a request) to yourbank.com/transfer; because the browser automatically attaches the victim's cookies to any request to that origin, the request arrives authenticated — even though the attacker never read the cookie.
The core enabler is ambient authority: cookies ride along by default. Defenses: SameSite cookies (block cross-site sends), anti-CSRF tokens (a secret the attacker's page can't know), and checking Origin/Referer.
What a strong answer covers
- The browser auto-sends cookies to the target origin — the attacker exploits that, not the cookie value.
- Only state-changing requests matter; CSRF can't read the response (same-origin policy).
- SameSite=Lax/Strict cookies are the first-line modern defense.
- Anti-CSRF tokens add a secret the attacker's page cannot supply.
Quick self-check
Why does a CSRF attack succeed without the attacker ever reading the session cookie?
Follow-ups they push on
- Why are JWTs in the Authorization header less exposed to CSRF than cookie sessions?
- Does CSRF let the attacker read the response? (No — SOP blocks that.)
Red flag Thinking HTTPS or HttpOnly stops CSRF. They don't — the browser still auto-attaches the cookie. SameSite and CSRF tokens are the defenses.
source: OWASP — Cross-Site Request Forgery Prevention Cheat Sheet ↗
Commonly asked junior concept very common Authentication vs authorization — state the difference crisply with an example.
Authentication answers 'who are you?' — verifying identity (password, token, passkey). Authorization answers 'what are you allowed to do?' — checking permissions after identity is established.
Example: logging in with your password is authentication; the check that decides you can read but not delete the document is authorization. Authn always precedes authz. The corresponding status codes: 401 Unauthorized = not authenticated; 403 Forbidden = authenticated but not permitted.
Follow-ups they push on
- Which HTTP status maps to each failure?
- Where does each typically live in a request pipeline?
Red flag Swapping 401 and 403, or saying 'authorization checks your password'. Authorization assumes identity is already known.
source: Auth0 — Authentication vs Authorization ↗
Commonly asked mid concept very common Walk through the three parts of a JWT. What does the signature guarantee — and what does it NOT?
A JWT is header.payload.signature, each base64url-encoded and joined by dots. The header names the algorithm; the payload holds the claims (sub, exp, roles); the signature is computed over header+payload with a secret (HMAC) or private key (RSA/ECDSA).
The signature guarantees integrity and authenticity — the server detects any tampering and confirms the token was issued by a holder of the key. It does not provide confidentiality: the payload is merely encoded, not encrypted, so anyone can base64-decode and read it. Never put secrets in a JWT payload, and always verify the signature server-side.
What a strong answer covers
- Three parts: header, payload (claims), signature — base64url, dot-separated.
- Signature → integrity + authenticity (tamper-evident, proves the issuer).
- Payload is encoded, not encrypted — readable by anyone; no secrets in it.
- Standard claims: sub, exp, iat, iss, aud.
Quick self-check
What does a valid JWT signature prove?
Follow-ups they push on
- Why must the server verify the signature on every request?
- What's the difference between a signed (JWS) and an encrypted (JWE) token?
Red flag Storing sensitive data in the JWT payload assuming it's hidden. It's base64-decodable plaintext — signing protects integrity, not confidentiality.
source: jwt.io — Introduction to JSON Web Tokens ↗
Commonly asked mid concept very common Session cookies vs JWTs for API auth — compare the tradeoffs. How do you revoke each?
Sessions: server stores session state, the client holds an opaque session id in an HttpOnly cookie. Stateful, but revocation is trivial — delete the server-side session. Needs shared session storage to scale horizontally.
JWTs: a signed, self-contained token the server verifies without a lookup — stateless and scales easily. The catch is revocation: a valid JWT is honored until it expires, so logout/ban requires a denylist or short expiry + refresh tokens, which reintroduces state. Use short-lived access tokens (minutes) plus a refresh token to limit the blast radius.
Follow-ups they push on
- How do you revoke a JWT before it expires?
- Where should the browser store a JWT — localStorage or a cookie?
Red flag Calling stateless JWTs strictly better. Their headline weakness is revocation; any real logout/ban story drags state back in.
source: Auth0 — Token-based vs session-based authentication ↗
Commonly asked mid concept very common OAuth2 vs OIDC — what is each actually for? Don't conflate them.
OAuth 2.0 is delegated authorization: 'let app A access my data on service B' without sharing my password — it issues access tokens scoped to resources. It says nothing about who the user is.
OIDC (OpenID Connect) is an authentication layer built on top of OAuth2. It adds an ID token (a JWT) and a standard /userinfo endpoint, so the app learns *who* logged in — this is what powers 'Log in with Google'. So: OAuth2 = access to resources; OIDC = proof of identity.
Follow-ups they push on
- What does the ID token contain that the access token doesn't?
- Why is using a raw OAuth2 access token as proof of login a mistake?
Red flag Using a bare OAuth2 access token to authenticate a user. Access tokens are for resource access; identity comes from the OIDC ID token.
source: OpenID Connect — How it works ↗
Commonly asked mid concept common How should passwords be stored, and why is a fast hash like SHA-256 the wrong choice?
Never store plaintext or reversible encryption. Use a slow, salted, adaptive password hash — bcrypt, scrypt, or Argon2 (the current OWASP-preferred). The salt (unique per user) defeats rainbow tables; the deliberate slowness/work factor caps how many guesses an attacker can make per second after a breach.
Fast general-purpose hashes (SHA-256, MD5) are wrong precisely because they're fast — a GPU computes billions per second, making offline brute force cheap. Choose a memory-hard function and raise the cost factor as hardware improves.
Follow-ups they push on
- What does the salt protect against specifically?
- Why is Argon2 preferred over bcrypt today?
Red flag Using SHA-256/MD5 (even salted) for passwords. They're built to be fast, which is the opposite of what password hashing needs.
source: OWASP — Password Storage Cheat Sheet ↗
Commonly asked mid concept common Why use short-lived access tokens with refresh tokens instead of one long-lived token?
A stateless access token can't be revoked before it expires, so you want it to live only minutes — that bounds the damage if it leaks. To avoid forcing the user to log in every few minutes, a longer-lived refresh token (stored more securely, server-trackable) is exchanged for fresh access tokens.
This splits concerns: access tokens are stateless and fast to verify; refresh tokens are the revocable, stateful part. Add refresh token rotation (issue a new refresh token each use and invalidate the old one) so a stolen refresh token is detected on reuse.
Follow-ups they push on
- What is refresh token rotation and what attack does it catch?
- Where do you store the refresh token vs the access token?
Red flag Issuing a long-lived access token 'for convenience'. If it leaks you have no way to revoke it until expiry.
source: Auth0 — Refresh tokens ↗
Commonly asked senior debug occasional Debugging: a JWT library accepts a token with alg: none and lets a forged admin token through. What happened?
This is the classic alg: none / algorithm-confusion vulnerability. The JWT header declares its own algorithm; if the verifier trusts that field, an attacker sets alg: none (or strips the signature) and the library skips verification, accepting a payload they forged (role: admin). A related attack swaps RS256 for HS256, signing with the public key as if it were an HMAC secret.
Fix: never let the token dictate the algorithm. Configure the verifier with an allowlist of expected algorithms, reject none, and validate exp/aud/iss. Treat the header's alg as untrusted input.
What a strong answer covers
- The bug: the verifier trusts the attacker-controlled alg header.
- alg: none tells naive libraries to skip signature verification entirely.
- RS256→HS256 confusion lets the public key be abused as an HMAC secret.
- Fix: pin the expected algorithm(s) server-side; reject none; verify standard claims.
Quick self-check
What's the root cause of the alg:none JWT bypass?
Follow-ups they push on
- Why is the RS256-to-HS256 swap dangerous when the public key is, well, public?
- Which standard claims should you always validate?
Red flag Calling a generic `verify()` that honors the token's own `alg`. Always pass an explicit algorithm allowlist; never accept `none`.
source: Auth0 — Critical vulnerabilities in JSON Web Token libraries ↗
Commonly asked senior concept common RBAC vs ABAC — what's the difference, and when do you outgrow roles?
RBAC (Role-Based Access Control) grants permissions through roles: a user is an editor, the editor role can update:article. Simple, auditable, and enough for most apps. It strains when access depends on context beyond a role — ownership, department, time of day, resource attributes — leading to a 'role explosion' (editor_team_a_readonly_weekends).
ABAC (Attribute-Based Access Control) decides via policies over attributes of the user, resource, action, and environment (e.g. 'allow if user.dept == resource.dept and time is business hours'). It's far more expressive but harder to reason about and audit. Start with RBAC; reach for ABAC when contextual, fine-grained rules cause role explosion.
What a strong answer covers
- RBAC: permissions via roles — simple, auditable, sufficient for most apps.
- ABAC: policies over user/resource/action/environment attributes — expressive, context-aware.
- Role explosion signals you've outgrown pure RBAC.
- ABAC trades simplicity/auditability for fine-grained flexibility.
Quick self-check
Requirement: 'a user may edit a document only if they are in the same department as the document.' Which model fits naturally?
Follow-ups they push on
- What's 'role explosion' and what causes it?
- How does ownership-based access (only edit your own posts) fit RBAC vs ABAC?
Red flag Encoding contextual rules as ever-more-specific roles. When permissions depend on resource attributes or context, that's an ABAC need, not more roles.
source: Auth0 — RBAC vs ABAC ↗
Commonly asked senior concept occasional Why compare password hashes (and tokens) with a constant-time comparison instead of ==?
A normal string == short-circuits at the first mismatching byte, so it returns faster the earlier the difference. An attacker measuring response timing can exploit this timing side channel to recover a secret (an API token or HMAC) byte by byte — try values until the comparison takes slightly longer, meaning one more byte matched.
A constant-time comparison always examines the full length regardless of where bytes differ, leaking no timing information. Use the platform's crypto.timingSafeEqual / hmac.compare_digest for tokens, HMAC tags, and similar secrets. (Note: bcrypt/Argon2 verification already handles this for passwords.)
What a strong answer covers
- == short-circuits, so its runtime depends on how many leading bytes match.
- An attacker can recover a secret byte-by-byte from timing differences.
- Constant-time compare scans the full input regardless of mismatches.
- Use crypto.timingSafeEqual / hmac.compare_digest for token/HMAC checks.
Follow-ups they push on
- Why doesn't this timing concern apply to comparing two bcrypt hashes the same way?
- Where else do timing side channels show up?
Red flag Comparing secret tokens or HMAC signatures with ordinary string equality. The early-exit timing leak can let an attacker brute-force the secret one byte at a time.
source: OWASP — Cryptographic Storage Cheat Sheet ↗
Commonly asked senior concept occasional What is a pepper, and how does it differ from a salt in password hashing?
A salt is a unique random value stored *alongside* each password hash; it ensures identical passwords produce different hashes and defeats precomputed rainbow tables. It's not secret — it lives in the database with the hash.
A pepper is a single secret value mixed into every password before hashing, but kept outside the database (in app config, a secret manager, or an HSM). The point of defense-in-depth: if an attacker steals only the database, the salts don't help them, and without the pepper they still can't crack the hashes offline. Salt = per-user, public, in DB; pepper = global, secret, outside DB.
What a strong answer covers
- Salt: per-user, random, stored with the hash — kills rainbow tables.
- Pepper: global secret, kept out of the DB — defends against DB-only theft.
- They're complementary, not alternatives.
- Pepper rotation is harder, so it's stored in config/secret manager/HSM.
Quick self-check
What distinguishes a pepper from a salt?
Follow-ups they push on
- If both leak, does the pepper still help?
- Where should the pepper be stored, and why not in the DB?
Red flag Storing the pepper in the same database as the hashes — that defeats its entire purpose. The pepper's value comes from living somewhere a DB dump won't expose.
source: OWASP — Password Storage Cheat Sheet (Peppering) ↗
Commonly asked senior design common Walk through the OAuth2 authorization code flow. Why was PKCE added?
Authorization code flow: the app redirects the user to the auth server; the user authenticates and consents; the auth server redirects back with a short-lived authorization code; the app's backend exchanges that code (plus its client secret) for an access token over a back channel. Keeping tokens off the front channel is the point.
PKCE (Proof Key for Code Exchange) hardens this for public clients (SPAs, mobile) that can't keep a secret. The client sends a hashed code_challenge up front and the original code_verifier at exchange time, so a stolen authorization code is useless without the verifier. PKCE is now recommended for all clients.
Follow-ups they push on
- Why is the implicit flow discouraged now?
- What attack does PKCE specifically stop?
Red flag Using the deprecated implicit flow (tokens in the URL fragment) for SPAs. The modern guidance is auth-code + PKCE.
source: oauth.com — Authorization Code with PKCE ↗
Commonly asked senior concept common Where should a browser store an access token, and how do the choices map to XSS vs CSRF?
localStorage is readable by any JavaScript on the page, so a single XSS flaw leaks the token. An HttpOnly cookie is invisible to JS (XSS can't read it) but is sent automatically, which opens CSRF.
The pragmatic answer: store tokens in HttpOnly, Secure, SameSite=Lax/Strict cookies and add anti-CSRF defenses (SameSite already blocks most cross-site sends; add a CSRF token for the rest). Keep access tokens short-lived. There's no storage location immune to a compromised front end — defense in depth plus a tight CSP matters more than the slot.
Follow-ups they push on
- How does SameSite=Strict mitigate CSRF?
- Why doesn't HttpOnly help against CSRF?
Red flag Claiming HttpOnly cookies are 'XSS-proof and safe'. They stop token theft via JS but are auto-sent, so you still need CSRF protection.
source: OWASP — JWT / token storage cheat sheet ↗
Commonly asked senior concept occasional Common pattern: use OAuth/OIDC to log in, then issue your own session or JWT. Why do that instead of using the provider's token directly?
After OIDC verifies identity, you typically mint your own session/JWT rather than passing Google's token around. Reasons: you control expiry and revocation; you attach your app's roles/permissions and user id; you don't couple every internal service to the external provider's token format or availability; and you avoid leaking a powerful provider token across your backend.
The provider token is used once at login to establish identity; from then on your own credential governs the session.
Follow-ups they push on
- What goes in your token that the provider's doesn't?
- How does this help if you later add a second identity provider?
Red flag Forwarding the raw Google/Apple token to every internal service. It couples you to the provider and complicates revocation and authorization.
source: OAuth.com — OAuth 2.0 Simplified ↗

2.4 Application architecture & patterns 14

★ must-know Commonly asked mid concept common What's the difference between MVC and a layered (controller/service/repository) architecture? Are they the same thing?
They overlap but aren't identical. MVC is a UI-organizing pattern: the Model holds data/state, the View renders it, and the Controller handles input and coordinates the two — its purpose is separating presentation from data.
A layered architecture stacks responsibilities by technical concern (presentation → business/service → data-access/repository), each layer depending only on the one below. In practice a server MVC framework's 'Controller' maps to the presentation layer, and the 'Model' often expands into service + repository layers. So MVC describes the request-handling triangle; layering describes the full vertical stack that the model side usually grows into.
What a strong answer covers
- MVC separates presentation (view) from data/state (model) via a controller.
- Layered architecture separates by technical concern top-to-bottom.
- MVC's 'Model' typically expands into service + repository layers.
- They're complementary lenses, not competing choices.
Quick self-check
In a layered backend, where does business logic (e.g. 'a refund can't exceed the original charge') belong?
Follow-ups they push on
- Where does business logic live in a 'fat model' vs a service layer?
- Why is putting business logic in the controller a smell in both?
Red flag Cramming business logic and data access into the MVC controller. The controller is presentation/coordination; domain logic belongs in services, persistence in repositories.
source: MDN — MVC ↗
Commonly asked junior concept common What is middleware in a web framework, and what does it look like in practice?
Middleware is a function in the request/response pipeline that runs before (and often after) the route handler. Each piece can inspect or mutate the request/response and either pass control to the next link or short-circuit (e.g. reject an unauthenticated request).
Classic uses: logging, authentication, body parsing, CORS, rate limiting, error handling. In Express the signature is (req, res, next) => { ... next(); }. The ordered chain is what makes cross-cutting concerns composable instead of duplicated in every handler.
Follow-ups they push on
- How does calling (or not calling) next() control the chain?
- Why is error-handling middleware registered last?
Red flag Forgetting to call next() (or to send a response), which hangs the request silently.
source: Express — Using middleware ↗
Commonly asked mid concept common Give a one-line 'smell it fixes' for each SOLID principle.
S — Single Responsibility: a class has one reason to change; fixes the god-class that mixes parsing, business rules, and DB code. O — Open/Closed: extend behavior without editing existing code; fixes the ever-growing switch you reopen for every new case. L — Liskov Substitution: subtypes must be usable through the base type without surprises; fixes the subclass that throws on a method the parent promises. I — Interface Segregation: many small interfaces over one fat one; fixes clients forced to implement methods they don't use. D — Dependency Inversion: depend on abstractions, not concretions; fixes high-level logic nailed to a specific DB/SDK, which kills testability.
Follow-ups they push on
- Which SOLID principle most directly enables unit testing? (DIP)
- Give a concrete Liskov violation.
Red flag Reciting the names without a concrete smell. Interviewers want the problem each one removes, not the dictionary definition.
source: GeeksforGeeks — SOLID principles ↗
Commonly asked mid concept common What does the Repository pattern give you, and what's the risk of a 'leaky' repository?
A Repository is a collection-like abstraction over persistence: the service asks for userRepo.findActiveByEmail(email) and doesn't know whether that's SQL, a document store, or an in-memory list. It centralizes query logic, decouples the domain from the ORM, and makes services testable with a fake repository.
The risk is a leaky abstraction: if the repository exposes IQueryable, raw SQL fragments, or ORM-specific lazy-loading proxies, persistence concerns bleed into the service and the decoupling is gone. Keep the interface in domain terms — return domain objects, accept domain criteria — so the storage technology stays a private detail.
What a strong answer covers
- Collection-like interface over persistence; hides the storage mechanism.
- Decouples domain/service from the ORM and enables fake-based unit tests.
- Centralizes query logic instead of scattering SQL across services.
- Leak risk: exposing IQueryable/raw SQL/lazy proxies re-couples callers to the DB.
Follow-ups they push on
- Repository vs DAO — what's the conceptual difference?
- Why return domain objects rather than ORM entities directly?
Red flag Returning the ORM's query builder or lazy-loaded entities from the repository. Callers then depend on persistence details, defeating the abstraction.
source: Martin Fowler — Repository ↗
Commonly asked mid trick occasional Trick: a class has 14 constructor parameters. Which design principle is being violated, and how do you fix it?
A bloated constructor (a 'too many dependencies' smell) usually signals a Single Responsibility Principle violation — the class is doing too many jobs, each pulling in its own collaborators. It's the constructor-injection symptom of a god class.
Fix by decomposing: extract cohesive groups of those dependencies into smaller focused classes (e.g. a NotificationService wrapping the email/SMS/push senders) so the original class depends on a few higher-level abstractions instead of fourteen low-level ones. The number of constructor args is a proxy metric; the real fix is restoring single responsibility, not hiding the args behind a service locator or a giant config object.
What a strong answer covers
- Many constructor params → the class has too many responsibilities (SRP violation).
- Constructor injection makes the bloat visible, which is a feature, not the bug.
- Fix by extracting cohesive collaborators into focused sub-services.
- Don't hide it with a service locator/God-config object — that masks the smell.
Quick self-check
A class needs 12 injected dependencies. The healthiest interpretation is:
Follow-ups they push on
- Why is hiding the dependencies behind a service locator the wrong fix?
- How does SRP relate to high cohesion?
Red flag 'Fixing' it by switching to a service locator so the dependencies become invisible. That hides the SRP violation instead of resolving it and hurts testability.
source: Refactoring Guru — Large Class smell ↗
Commonly asked mid concept common Composition over inheritance — what does it mean and why is it usually the better default?
Inheritance models 'is-a' and binds a subclass to its parent's implementation at compile time — a rigid, white-box coupling that gets brittle with deep hierarchies (the fragile base class problem) and tempts Liskov violations. Composition builds behavior by holding other objects and delegating to them ('has-a'), which you can vary at runtime and swap for tests.
The guidance 'favor composition over inheritance' (from the Gang of Four) is about flexibility: small composed parts recombine freely, while inheritance hierarchies resist change. Use inheritance for genuine, stable is-a relationships with a real behavioral contract; prefer composition for sharing/reusing behavior.
What a strong answer covers
- Inheritance = compile-time 'is-a', tight white-box coupling to the parent.
- Composition = runtime 'has-a', delegate to swappable collaborators.
- Deep hierarchies cause fragile-base-class and Liskov problems.
- GoF guidance: favor composition; reserve inheritance for true, stable is-a.
Follow-ups they push on
- How does the Strategy pattern embody composition over inheritance?
- When is inheritance still the right tool?
Red flag Reaching for inheritance to reuse a method, creating a deep hierarchy that's hard to change. If the relationship isn't a true is-a, compose and delegate instead.
source: Refactoring Guru — Favor composition over inheritance ↗
Commonly asked mid trick common Why is the Singleton pattern considered a testability and design smell?
A Singleton enforces one global instance with global access. The problems: it's global mutable state in disguise, which hides dependencies (a class secretly reaches for Logger.getInstance() instead of receiving it). That makes unit tests hard — you can't easily substitute a mock, tests share state and leak into each other, and parallel tests interfere.
The usual fix is dependency injection: create one instance at the composition root and pass it in. You keep 'one instance' as a lifecycle policy without the hard-coded global lookup.
Follow-ups they push on
- How does DI give you 'one instance' without the Singleton anti-pattern?
- When is a Singleton actually fine?
Red flag Defending Singleton as 'just one object'. The cost is the static global access point that hides dependencies and breaks test isolation.
source: GeeksforGeeks — Singleton design pattern ↗
Commonly asked mid concept common Explain dependency injection and how it improves testability.
Dependency injection means a component receives its collaborators from outside (constructor/parameters) instead of constructing them itself. It's the practical expression of the Dependency Inversion Principle: code depends on an interface, and the concrete implementation is wired in at the edge.
Testability win: in a test you inject a fake/mock repository or HTTP client, so you can unit-test the service in isolation with no real database or network. It also decouples modules — swapping Postgres for an in-memory store is a wiring change, not a rewrite.
Follow-ups they push on
- How does this relate to the Repository pattern?
- Constructor injection vs a service locator — which is cleaner and why?
Red flag Confusing DI with 'using a DI framework'. DI is just passing dependencies in; the container is optional sugar.
source: Martin Fowler — Inversion of Control & DI ↗
Commonly asked mid concept common Walk through layered architecture (controller → service → repository). What belongs in each layer?
Controller: HTTP concerns only — parse/validate the request, call a service, map the result to a status code and response. Service: the business logic and orchestration — transactions, rules, coordinating multiple repositories; it knows nothing about HTTP. Repository: data access — encapsulates queries behind a collection-like interface so the service depends on an abstraction, not raw SQL.
The payoff is that each layer is testable and replaceable in isolation, and business logic doesn't leak into the web framework or the database.
Follow-ups they push on
- Why keep HTTP concerns out of the service layer?
- Where does request validation live, and where do domain rules live?
Red flag Fat controllers with business logic and SQL inline — you lose testability and the logic gets tied to the web framework.
source: Martin Fowler — Patterns of Enterprise Application Architecture ↗
Commonly asked mid concept occasional Strategy vs Factory vs Adapter — give a one-sentence use case for each.
Strategy: swap interchangeable algorithms behind one interface at runtime — e.g. pluggable payment processors or sort comparators, picked by configuration. Factory: centralize object creation so callers ask for *what* they want, not *how* it's built — e.g. createParser(fileType). Adapter: wrap an incompatible third-party interface to match the one your code expects — e.g. adapting a legacy SDK to your PaymentGateway interface.
Mnemonic: Strategy varies behavior, Factory varies construction, Adapter reconciles interfaces.
Follow-ups they push on
- Strategy vs simple if/else — when is the pattern worth it?
- How does Adapter differ from Decorator?
Red flag Applying a pattern for its own sake. A two-branch conditional doesn't need Strategy; patterns earn their cost when the variation is open-ended.
source: Refactoring Guru — Design patterns catalog ↗
Commonly asked mid concept occasional Explain the Observer (pub/sub) pattern and the Decorator pattern. Give a real backend use of each.
Observer / pub-sub: subjects publish events and any number of subscribers react, with no direct coupling between them — e.g. on OrderPlaced, the email service, inventory service, and analytics each subscribe independently. It decouples producers from consumers and underlies event-driven systems.
Decorator: wrap an object to layer behavior without changing it, preserving the same interface — e.g. wrapping a repository with caching, then logging, then retry. Each layer adds one concern and delegates inward, so you compose features instead of editing the core class.
Follow-ups they push on
- How does Observer relate to a message broker like Kafka?
- Decorator vs subclassing for adding logging — why prefer the decorator?
Red flag Confusing Decorator with Adapter. Decorator keeps the same interface and adds behavior; Adapter changes one interface into another.
source: Refactoring Guru — Observer ↗
Commonly asked senior concept occasional Explain hexagonal (ports & adapters) architecture. What problem does it solve over a plain layered design?
Hexagonal architecture puts the domain/application core at the center and defines ports (interfaces) for everything it talks to. Adapters implement those ports for specific technologies — a Postgres adapter, a REST adapter, a Kafka adapter — and plug in at the edges. The dependency rule points inward: the core never imports a framework or driver.
Versus a strict top-down layered design (where the business layer still depends on a concrete data layer beneath it), hexagonal inverts those edge dependencies so the database, web framework, and message bus are all swappable, interchangeable details. The payoff is testability (drive the core through fake adapters) and decoupling the domain from infrastructure churn.
What a strong answer covers
- Domain core + ports (interfaces) + adapters (tech-specific implementations).
- Dependencies point inward; the core knows nothing about frameworks/drivers.
- DB, web, and messaging become swappable adapters, not foundational layers.
- Enables testing the core in isolation through fake adapters.
Follow-ups they push on
- What's a 'driving' (primary) adapter vs a 'driven' (secondary) adapter?
- How does this relate to the Dependency Inversion Principle?
Red flag Letting domain code import the ORM/web framework directly 'for convenience'. That re-couples the core to infrastructure and defeats the whole ports-and-adapters point.
source: Alistair Cockburn — Hexagonal Architecture ↗
Commonly asked senior concept occasional What is inversion of control (IoC), and how is dependency injection a specific form of it?
Inversion of control is the general principle that a framework or container — not your code — drives the flow: instead of your code calling into a library, the framework calls your code at the right moments ('don't call us, we'll call you', the Hollywood Principle). Event loops, middleware pipelines, and template method patterns are all IoC.
Dependency injection is one specific kind of IoC: inverting *who supplies a component's dependencies*. Rather than a class constructing its own collaborators, something external (a container or the composition root) provides them. So DI inverts dependency acquisition; IoC is the broader family of 'the framework controls the flow, you fill in the parts'.
What a strong answer covers
- IoC: the framework controls flow and calls your code ('Hollywood Principle').
- DI is a specific form of IoC — inverting how dependencies are supplied.
- Other IoC examples: event loops, callbacks, middleware, template method.
- DI ≠ a DI container; the container is just one way to do DI.
Quick self-check
Which statement is correct about IoC and DI?
Follow-ups they push on
- Give a non-DI example of inversion of control.
- Why is 'IoC container' a slightly misleading name for a DI framework?
Red flag Using IoC and DI as synonyms. DI is one instance of IoC (inverting dependency supply); IoC is the broader idea of the framework owning control flow.
source: Martin Fowler — Inversion of Control ↗
Commonly asked senior debug occasional Give a concrete Liskov Substitution Principle violation and how you'd fix it.
Classic example: Square extends Rectangle. Setting width and height independently is part of Rectangle's contract, but a Square forces them equal, so code that does rect.setWidth(5); rect.setHeight(4); assert area == 20 breaks when handed a Square. The subtype violates the base type's expectations.
Fix: drop the inheritance — model Shape with an area() method and make Square and Rectangle siblings, or use immutable value objects so the mutating contract that conflicts never exists. The lesson: 'is-a' in English isn't enough; the subtype must honor the supertype's behavioral contract.
Follow-ups they push on
- Why is 'a square is a rectangle' true in math but wrong here?
- How does LSP relate to using exceptions in overridden methods?
Red flag Treating LSP as just 'subclasses should work'. The real test is behavioral substitutability — preconditions can't strengthen, postconditions can't weaken.
source: GeeksforGeeks — Liskov Substitution Principle ↗

2.5 Concurrency & parallelism 13

★ must-know Commonly asked senior concept common What is a deadlock vs a livelock vs starvation? Distinguish all three.
Deadlock: threads are blocked forever, each waiting on a resource another holds — nobody moves (e.g. the AB/BA lock-ordering cycle). Livelock: threads aren't blocked and keep *changing state* in response to each other, but make no progress — like two people stepping aside in the same direction repeatedly in a hallway. Starvation: a thread *can* run but is perpetually denied the resource because others keep winning it (e.g. a low-priority thread under a greedy scheduler).
Key distinction: deadlock = stuck and idle; livelock = busy but unproductive; starvation = some progress overall, but one thread is unfairly shut out.
What a strong answer covers
- Deadlock: mutual blocking, zero activity, circular wait.
- Livelock: active state changes but no forward progress.
- Starvation: a thread is runnable but perpetually denied the resource.
- Fairness/aging fixes starvation; lock ordering fixes deadlock.
Quick self-check
Two threads each detect a conflict, both back off and immediately retry in lockstep, repeating forever without blocking. This is:
Follow-ups they push on
- How can a naive retry-on-conflict loop cause livelock?
- How does priority aging address starvation?
Red flag Calling any 'no progress' situation a deadlock. Livelock threads are actively running, and starvation still has overall progress — different causes, different fixes.
source: GeeksforGeeks — Deadlock, Starvation, and Livelock ↗
Commonly asked junior concept very common Concurrency vs parallelism — what's the difference?
Concurrency is about *dealing with* many tasks by interleaving them — making progress on several by switching between them, even on a single core. Parallelism is *doing* many tasks at the same instant on multiple cores.
Rob Pike's line: concurrency is about structure, parallelism is about execution. A single-threaded async server is concurrent but not parallel; a CPU-bound job split across 8 cores is parallel. You can have concurrency without parallelism and vice versa.
Follow-ups they push on
- Can you have parallelism without concurrency?
- Where does Node's event loop sit on this axis?
Red flag Using the terms interchangeably. Interleaving on one core is concurrency, not parallelism.
source: GeeksforGeeks — Concurrency vs parallelism ↗
Commonly asked mid debug common Debugging: a Node.js endpoint that does heavy synchronous JSON crypto makes ALL other requests slow. Why, and how do you fix it?
Node runs your JavaScript on a single event-loop thread. A heavy *synchronous* CPU task (a big loop, sync crypto, JSON over megabytes) doesn't yield, so it blocks the event loop — every other pending request, timer, and callback stalls until it finishes. Async I/O isn't the issue; CPU-bound sync work is.
Fixes: move the CPU work off the loop — use a worker thread (or worker_threads/a child process), the async variant of the crypto API, or offload to a separate service/queue. The rule: never run long synchronous CPU work on the event loop thread.
What a strong answer covers
- Node's JS executes on one event-loop thread; sync CPU work blocks everything.
- Async I/O is fine — the culprit is synchronous CPU-bound code.
- Fix: worker threads / child process / async crypto / offload to a service.
- Symptom: latency spikes across unrelated endpoints during the heavy call.
Quick self-check
A synchronous CPU-heavy handler slows all Node.js requests. The correct fix is to:
Follow-ups they push on
- Why doesn't adding more async/await help a CPU-bound loop?
- When would you reach for a separate service vs a worker thread?
Red flag Trying to fix it by sprinkling `async/await`. Awaiting doesn't yield during a synchronous CPU loop — you must move the computation off the event-loop thread.
source: Node.js — Don't block the event loop ↗
Commonly asked mid concept common Thread-per-request vs event-loop (reactive) servers — what's the tradeoff at high concurrency?
Thread-per-request (classic Java/Tomcat, Apache prefork) assigns each connection a thread. The model is simple — blocking code reads top-to-bottom — but each thread costs ~1MB+ of stack and context-switch overhead, so tens of thousands of concurrent connections exhaust memory and the scheduler (the C10k problem).
Event-loop / reactive servers (Node, Netty, nginx) handle many connections on a few threads via non-blocking I/O and callbacks, scaling to huge connection counts with low memory. The cost is programming complexity (callbacks/async) and the danger that any blocking call freezes the loop. Threads suit moderate concurrency with blocking dependencies; event loops suit massive I/O-bound concurrency.
What a strong answer covers
- Thread-per-request: simple blocking code, but per-thread memory + context switches cap concurrency.
- Event loop: few threads, non-blocking I/O, scales to huge connection counts.
- This is the classic C10k scaling story.
- Event loops demand non-blocking code; one blocking call stalls everyone.
Follow-ups they push on
- What is the C10k problem?
- How do virtual/green threads (e.g. Java loom, goroutines) blur this divide?
Red flag Assuming 'more threads = more scale'. Past a point, thread memory and context-switching dominate; that's exactly what event-loop models were built to avoid.
source: GeeksforGeeks — Thread per request vs event-driven model ↗
Commonly asked mid concept common Threads vs processes — what's shared, what's isolated, and when do you pick each?
A process has its own isolated memory space; threads within a process share the same heap/address space. Threads are cheaper to create and communicate through shared memory; processes are heavier but isolated — a crash or memory corruption in one process can't directly corrupt another.
Pick threads for fine-grained shared-memory work where communication cost matters; pick processes for isolation and fault containment (and, in languages with a GIL like CPython, to get true CPU parallelism). The tradeoff is shared-memory speed vs. isolation and safety.
Follow-ups they push on
- How do processes communicate without shared memory? (IPC, pipes, sockets)
- Why does the GIL push CPython to multiprocessing for CPU-bound work?
Red flag Assuming threads always parallelize CPU work — a global interpreter lock (CPython) serializes bytecode, so threads help I/O but not CPU-bound loops.
source: GeeksforGeeks — Difference between process and thread ↗
Commonly asked mid debug very common What is a race condition? Show a classic example and how to fix it.
A race condition is when the result depends on the unpredictable timing/interleaving of concurrent operations on shared state. Classic case: two threads run balance = balance + 100. That's read-modify-write: both read the same old value, both add 100, both write back — one update is lost.
Fix by making the critical section atomic: guard it with a mutex/lock, use an atomic increment, or use a compare-and-swap. The general principle is to serialize access to shared mutable state so only one thread is in the critical section at a time.
Follow-ups they push on
- Why isn't `x++` atomic?
- What's a check-then-act race (e.g. 'if not exists, create')?
Red flag Assuming a single statement like `count++` is atomic — it compiles to load/add/store, which can interleave.
source: GeeksforGeeks — Race condition ↗
Commonly asked mid concept common Mutex vs semaphore — define each and when you'd use which.
A mutex provides mutual exclusion: one holder at a time, and ownership matters — the thread that locks it should unlock it. Use it to protect a single shared resource / critical section.
A semaphore is a counter that permits up to N concurrent holders (acquire decrements, release increments, block at zero). A binary semaphore (N=1) resembles a lock but has no ownership and is often used for signaling between threads. Use a counting semaphore to cap concurrency — e.g. limit to 10 simultaneous DB connections.
Follow-ups they push on
- What does 'ownership' give a mutex that a semaphore lacks?
- How would you bound a connection pool with a semaphore?
Red flag Treating a binary semaphore as a drop-in mutex. Without ownership, any thread can release it, which permits subtle bugs a mutex prevents.
source: GeeksforGeeks — Mutex vs semaphore ↗
Commonly asked mid concept common Why does async I/O let a single thread handle thousands of connections?
Most server work is I/O-bound — waiting on the network, disk, or a database. With blocking I/O each connection ties up a thread that just sits idle during the wait, so 10k connections need ~10k threads (expensive memory + context switching).
Non-blocking async I/O flips this: the thread issues the I/O and immediately moves on; the OS notifies it (epoll/kqueue) when data is ready, and a callback/continuation resumes. One thread multiplexes thousands of in-flight waits because no thread blocks on the wait. The catch: a CPU-bound task blocks the loop and starves everyone, so async shines for I/O-bound, not CPU-bound, work.
Follow-ups they push on
- When does async hurt? (CPU-bound work blocking the event loop)
- How is this different from a thread-per-request server?
Red flag Believing async is faster for everything. It wins on I/O concurrency; a heavy CPU computation still blocks the single event-loop thread.
source: MDN — Asynchronous JavaScript / event loop ↗
Commonly asked senior concept occasional What is thread starvation in a connection/thread pool, and how does it cause a 'deadlock' without any locks?
Pool starvation: every thread (or DB connection) in a bounded pool is busy waiting on a resource that can only be supplied by *another* task that's now stuck in the pool's queue with no thread to run it. No mutex is involved, yet the system wedges — a 'pool-induced deadlock'.
Classic case: a request handler holds a pooled thread and synchronously calls back into the same service/pool, which is exhausted; the inner call waits for a thread that will only free up when the outer call returns. Fixes: never block a pooled thread waiting on the same pool, size pools to account for nested calls, separate pools for distinct workloads (bulkheading), and add timeouts so waiters fail fast instead of hanging forever.
What a strong answer covers
- All pool threads/connections busy → queued work can't get a worker.
- A task blocking on work that needs the same exhausted pool wedges the system.
- It looks like a deadlock but has no locks — it's resource exhaustion.
- Fixes: bulkhead separate pools, avoid nested same-pool blocking, add timeouts.
Follow-ups they push on
- How does the bulkhead pattern prevent one workload from starving others?
- Why do checkout/borrow timeouts help even if they don't fix the root cause?
Red flag Blocking a pooled worker on a call that itself needs a worker from the same exhausted pool. Use separate pools and timeouts, and avoid nested same-pool blocking.
source: Microsoft — Bulkhead pattern ↗
Commonly asked senior trick occasional Why is double-checked locking for lazy singleton initialization subtly broken without proper memory visibility?
Double-checked locking checks instance == null, locks only if null, checks again inside the lock, then constructs. The subtle bug is memory visibility / instruction reordering: object construction isn't atomic — the reference can become visible to other threads *before* the constructor's writes are flushed, so a second thread sees a non-null but partially-initialized object.
The fix depends on the memory model: in Java, mark the field volatile (which since JMM 5 establishes the needed happens-before ordering); other languages need their equivalent memory barrier / acquire-release semantics. The deeper lesson: correctness under concurrency needs the language's memory model guarantees, not just mutual exclusion.
What a strong answer covers
- The flaw is reordering/visibility, not the locking logic itself.
- A thread can publish the reference before the constructor's writes are visible.
- Fix in Java: volatile field (post-Java-5 memory model).
- Lesson: concurrency correctness requires memory-model guarantees, not just locks.
Quick self-check
What makes naive double-checked locking unsafe?
Follow-ups they push on
- What does `volatile` guarantee that a plain field doesn't?
- Why is a static holder/initialization-on-demand idiom often cleaner than DCL?
Red flag Believing the second null-check alone makes DCL safe. Without volatile/memory barriers, reordering can expose a half-constructed instance.
source: Wikipedia — Double-checked locking ↗
Commonly asked senior concept common What is a deadlock, what four conditions cause it, and how do you prevent it?
A deadlock is when threads wait forever on each other's locks. It needs all four Coffman conditions simultaneously: mutual exclusion, hold-and-wait, no preemption, and circular wait.
Break any one to prevent it. The most practical: impose a global lock-ordering so all threads acquire locks in the same order (kills circular wait); or acquire all locks at once (kills hold-and-wait); or use lock timeouts / tryLock and back off. Example deadlock: thread A holds lock 1 and wants lock 2 while thread B holds lock 2 and wants lock 1.
Follow-ups they push on
- Which Coffman condition is easiest to remove in practice? (circular wait via ordering)
- Deadlock vs livelock vs starvation?
Red flag Adding 'just one more lock' to fix a race and creating a deadlock instead. Inconsistent lock acquisition order is the usual culprit.
source: GeeksforGeeks — Deadlock and conditions ↗
Commonly asked senior concept occasional What is optimistic vs pessimistic locking, and when do you pick each?
Pessimistic locking assumes conflicts are likely, so it locks the row/resource up front (SELECT ... FOR UPDATE) and others wait. Safe but reduces concurrency and risks deadlocks.
Optimistic locking assumes conflicts are rare: read freely, and at write time check a version number (or timestamp/ETag) — if it changed, someone else won, so reject and retry. Great for low-contention, read-heavy workloads; wasteful retries under high contention. Pick pessimistic for hot, highly-contended rows; optimistic for mostly-independent updates.
Follow-ups they push on
- How does a version column implement optimistic locking?
- How does this map to HTTP's If-Match / 412?
Red flag Using optimistic locking on a hotly-contended counter — you'll thrash on retries. High contention favors pessimistic locking.
source: Martin Fowler — Optimistic Offline Lock ↗
Commonly asked senior concept occasional Why are atomic operations and compare-and-swap (CAS) faster than locks for simple shared counters?
A mutex involves OS-level machinery: contended threads may block and be parked/woken by the scheduler, which costs context switches. CAS is a single hardware instruction — 'if memory still holds the value I read, swap in the new value, else fail' — so a lock-free counter just loops read → compute → CAS, retry on failure entirely in user space with no blocking.
Under low-to-moderate contention this is much cheaper. The tradeoff: CAS works for small single-word updates; complex multi-variable invariants still need locks, and very high contention makes CAS retry loops spin wastefully. This is the basis of lock-free/atomic data structures.
What a strong answer covers
- Locks can block threads → context-switch and scheduler overhead.
- CAS is one atomic CPU instruction; lock-free loops stay in user space.
- Great for single-word updates (counters, flags); not for multi-variable invariants.
- Under heavy contention, CAS retry loops can spin and waste CPU.
Follow-ups they push on
- What is the ABA problem in CAS-based algorithms?
- When does a spinning CAS loop perform worse than a mutex?
Red flag Assuming lock-free is always faster. CAS shines for tiny updates under modest contention; complex invariants and extreme contention can favor locks.
source: GeeksforGeeks — Compare and Swap (CAS) ↗

2.6 Messaging & event-driven architecture 13

★ must-know Commonly asked mid concept very common Point-to-point queue vs publish/subscribe — what's the difference and when do you use each?
In a point-to-point queue, each message is delivered to exactly one consumer among possibly many competing workers — it's a work queue for distributing tasks (e.g. resize one image once, no matter how many workers are running). In publish/subscribe, each message is fanned out to *every* subscriber, so N independent services all react to the same event.
Use point-to-point to load-balance work across a pool (competing consumers); use pub/sub to broadcast an event to multiple independent consumers. Kafka models pub/sub via consumer groups: across groups it's fan-out, within a group it's point-to-point load balancing.
What a strong answer covers
- Queue (point-to-point): one message → exactly one of the competing consumers.
- Pub/sub: one message → every subscriber (fan-out).
- Queues load-balance work; pub/sub broadcasts events.
- Kafka consumer groups: fan-out across groups, load-balance within a group.
Quick self-check
You need an OrderPlaced event to trigger email, inventory, AND analytics services independently. Which model?
Follow-ups they push on
- How do Kafka consumer groups give you both models?
- What's the 'competing consumers' pattern?
Red flag Using a single shared queue when you actually need every service to see the event — only one consumer will get each message, and the others silently miss it.
source: AWS — Pub/sub messaging vs message queues ↗
Commonly asked mid concept very common Kafka vs RabbitMQ vs SQS — what's the conceptual difference and when does each fit?
Kafka is a durable, append-only distributed log: consumers track an offset and can replay; built for high-throughput streaming and event sourcing, with retention so multiple consumer groups read the same stream independently. RabbitMQ is a traditional broker with smart routing (exchanges, queues, bindings) and per-message acks — great for complex routing and classic task queues, but messages typically vanish once consumed. SQS is a fully managed AWS queue — minimal ops, at-least-once delivery, near-infinite scale, but no replay and limited ordering (FIFO queues excepted).
Pick Kafka for streaming/replay/high-throughput, RabbitMQ for rich routing and work queues, SQS when you want managed simplicity on AWS.
Follow-ups they push on
- Why can Kafka replay events but RabbitMQ usually can't?
- What does a consumer offset give you that an ack doesn't?
Red flag Calling Kafka 'just a queue'. It's a retained log — consumers read by offset and can replay, which a delete-on-consume queue can't.
source: Hello Interview — Kafka deep dive ↗
Commonly asked mid trick common Trick: does Kafka delete a message once a consumer reads it? What actually controls retention?
No — this is the key mental-model shift from traditional queues. Kafka is a durable log; reading a message does not remove it. The consumer just advances its offset (a bookmark), and the data stays on disk for everyone else to read. Multiple consumer groups can read the same messages independently, and a group can rewind its offset to replay.
Retention is governed by configured policy, not consumption: time-based (retention.ms, e.g. 7 days) or size-based (retention.bytes), or log compaction (keep the latest value per key). Messages age out by policy regardless of whether anyone consumed them.
What a strong answer covers
- Reading does NOT delete — consumers advance an offset (a bookmark).
- Data persists for all consumer groups; rewinding the offset replays.
- Retention is by time/size policy or log compaction, independent of consumption.
- Contrast: a traditional queue typically deletes on consume.
Quick self-check
What happens to a Kafka message after a consumer reads it?
Follow-ups they push on
- What is log compaction and when do you use it?
- How does offset-as-bookmark enable replay and reprocessing?
Red flag Treating Kafka like a delete-on-read queue. Messages persist until the retention policy expires them — consumption only moves an offset.
source: Confluent — Kafka topics and retention ↗
Commonly asked mid concept very common Why does at-least-once delivery force you to build idempotent consumers?
Most brokers guarantee at-least-once delivery: if a consumer processes a message but crashes before acking, the broker redelivers it, so duplicates are inevitable. If processing isn't idempotent, a duplicate means double-charging, double-emailing, or double-incrementing.
Make the consumer idempotent: dedupe on a stable message id (record processed ids and skip repeats), or design the operation so reapplying it is a no-op (upsert, set-to-value instead of increment). Then redelivery is harmless.
Follow-ups they push on
- How would you dedupe by message id, and where do you store seen ids?
- Why is 'set status = SHIPPED' safer than 'increment count'?
Red flag Assuming each message arrives exactly once. At-least-once is the norm; build for duplicates.
source: AWS — SQS at-least-once delivery ↗
Commonly asked mid concept common What is a dead-letter queue and when does a message land there?
A dead-letter queue (DLQ) is a side queue where messages go after they repeatedly fail to be processed (exceeding a max-receive/retry count) or can't be delivered. It stops a single 'poison' message from being redelivered forever and blocking the main queue.
Operationally you alert on DLQ depth, inspect the failed messages, fix the bug or bad data, and replay them back to the main queue. Without a DLQ, a permanently-failing message either loops endlessly or gets silently dropped.
Follow-ups they push on
- How do you decide the max-receive count before dead-lettering?
- What's a poison message and how does a DLQ contain it?
Red flag Having no DLQ, so a poison message either blocks the queue with infinite retries or is lost silently. Always have a parking lot.
source: AWS — Amazon SQS dead-letter queues ↗
Commonly asked mid concept common What is consumer lag in Kafka, and what does growing lag tell you?
Consumer lag is the gap between the latest offset produced to a partition (the log-end offset) and the offset the consumer group has committed — i.e. how many messages are produced-but-not-yet-processed. Steady or near-zero lag means consumers keep up; growing lag means the consumers can't process as fast as producers write.
It's a primary health/alerting signal. Remedies for chronic lag: add consumers (up to the partition count — that's the parallelism ceiling), add partitions, speed up per-message processing, or batch. Spiky lag that drains is fine; monotonically rising lag predicts an eventual backlog blowup.
What a strong answer covers
- Lag = latest produced offset − consumer's committed offset (unprocessed backlog).
- Rising lag = consumers slower than producers.
- Max parallelism is bounded by partition count — more consumers than partitions sit idle.
- A core metric to alert on for streaming health.
Follow-ups they push on
- Why can't you scale consumers beyond the partition count?
- How would you reduce lag without adding partitions?
Red flag Adding consumers beyond the number of partitions to cut lag. Extra consumers in a group just idle — you must increase partitions to raise parallelism.
source: Confluent — Monitoring consumer lag ↗
Commonly asked senior design occasional Choreography vs orchestration for the Saga pattern — how do you keep a multi-service transaction consistent?
With no distributed ACID transaction across microservices, a Saga breaks a business transaction into a sequence of local transactions, each publishing an event; if a step fails, compensating transactions undo the prior steps (e.g. refund a charge after inventory reservation fails).
Choreography: services react to each other's events with no central coordinator — loosely coupled but the end-to-end flow is implicit and hard to trace. Orchestration: a central orchestrator explicitly drives each step and triggers compensations — easier to reason about and monitor, but the orchestrator is a coupling point. Choose choreography for simple, few-step flows; orchestration as step count and error handling grow.
Follow-ups they push on
- What's a compensating transaction, and why isn't it the same as a rollback?
- When does choreography's implicit flow become a liability?
Red flag Trying to span microservices with one ACID transaction (e.g. distributed 2PC everywhere). Sagas with compensations are the practical model; 2PC scales and fails poorly across services.
source: microservices.io — Saga pattern ↗
Commonly asked senior concept occasional How do RabbitMQ acks and the prefetch (QoS) setting affect throughput and reliability?
With manual acks, RabbitMQ keeps a message 'unacked' until the consumer confirms it; if the consumer dies first, the message is requeued — that's how at-least-once delivery and crash safety work. Auto-ack trades that safety for speed (a crash mid-processing loses the message).
Prefetch (basic.qos) caps how many unacked messages a consumer may hold at once. Prefetch=1 gives the fairest load distribution (a slow consumer won't hoard a backlog) but adds round-trip overhead; a higher prefetch boosts throughput by pipelining but can let one consumer grab a big batch while others idle. Tune prefetch to balance fairness against throughput for your processing time.
What a strong answer covers
- Manual ack = message redelivered if the consumer dies before acking (at-least-once).
- Auto-ack is faster but loses in-flight messages on crash.
- Prefetch limits unacked messages per consumer.
- Low prefetch → fair distribution; high prefetch → throughput but possible hoarding.
Follow-ups they push on
- Why does prefetch=1 give the fairest distribution but lower throughput?
- What happens to unacked messages when a consumer connection drops?
Red flag Using auto-ack for work you can't afford to lose, or leaving prefetch unbounded so one consumer grabs the whole queue while others starve.
source: RabbitMQ — Consumer Acknowledgements and Publisher Confirms ↗
Commonly asked senior concept occasional How does a message queue provide back-pressure and load leveling, and what's the risk if you ignore queue depth?
A queue decouples producer rate from consumer rate: during a spike, messages buffer in the queue instead of overwhelming the downstream service, which keeps processing at its sustainable rate — that's load leveling (the queue-based load-leveling pattern). It smooths bursts into a steady drain.
But a queue is finite. If producers persistently outpace consumers, queue depth grows unbounded: latency climbs (messages wait longer), memory/disk fills, and you risk hitting limits or processing hours-stale data. Back-pressure is signaling producers to slow down (reject, throttle, or block) when depth crosses a threshold. Always monitor and alert on queue depth/age, cap the queue, and decide a shed/back-pressure policy — a queue defers overload, it doesn't eliminate it.
What a strong answer covers
- Queue buffers bursts so consumers drain at a sustainable rate (load leveling).
- Back-pressure = signaling producers to slow when the queue fills.
- Unbounded growth → rising latency, stale data, resource exhaustion.
- Monitor depth/age; cap the queue and define a shedding/back-pressure policy.
Follow-ups they push on
- How do you implement back-pressure when producers and consumers are decoupled?
- Why is a growing queue a latency problem even before it's a capacity problem?
Red flag Treating the queue as infinite elastic buffer. If consumers are chronically slower than producers, the queue just defers the overload while latency and staleness balloon.
source: Microsoft — Queue-Based Load Leveling pattern ↗
Commonly asked senior trick common Is exactly-once delivery real? Explain the nuance.
Exactly-once *network delivery* is generally impossible — you can't simultaneously guarantee no loss and no duplicates across an unreliable network (two-generals problem). What systems offer is exactly-once processing / effective-once: at-least-once delivery plus idempotent or transactional handling so the observable effect happens once.
Kafka's 'exactly-once semantics' works this way: idempotent producers and transactions tie the consume-process-produce cycle together so duplicates don't produce duplicate effects. The honest framing: dedup + transactions give exactly-once *effects*, not magically-once delivery.
Follow-ups they push on
- How does Kafka achieve its exactly-once semantics? (idempotent producer + transactions)
- Why is the consumer side still your responsibility for external side effects?
Red flag Claiming a broker delivers exactly once over the wire. Real systems get exactly-once *effects* via idempotency/transactions, not exactly-once delivery.
source: Confluent — Exactly-once semantics in Kafka ↗
Commonly asked senior concept common Event-driven vs request/response — what do you gain and what do you give up?
Request/response is synchronous and simple: the caller waits and gets an answer or an error, with a clear linear flow that's easy to reason about and debug. But it couples services temporally — if the callee is down, the caller fails — and it doesn't absorb spikes.
Event-driven publishes events and lets consumers react asynchronously: it decouples producers from consumers, buffers load (the queue absorbs spikes), and lets you add new consumers without touching the producer. The costs are eventual consistency, harder end-to-end debugging/tracing, and the need for idempotency and ordering handling. Use events for fan-out, decoupling, and load-leveling; use request/response when you need an immediate answer.
Follow-ups they push on
- How does a queue provide back-pressure / load-leveling?
- What new failure modes does async introduce?
Red flag Going event-driven everywhere and losing the simple synchronous read paths. Async adds eventual consistency and tracing complexity — use it where decoupling actually pays.
source: Netflix Tech Blog — event-driven architecture ↗
Commonly asked senior concept occasional How does Kafka preserve message ordering, and what's the catch?
Kafka guarantees ordering only within a partition, not across a topic. Messages with the same partition key (e.g. userId) always land in the same partition and are consumed in order, so per-key ordering holds.
The catch: you only get parallelism by having multiple partitions, and across partitions there's no global order. So you trade total ordering for throughput. If you need strict global ordering you're limited to one partition (no parallelism) — the usual move is to choose a partition key that makes per-key ordering sufficient.
Follow-ups they push on
- How do you pick a partition key for per-entity ordering?
- Why can't you both have many partitions and total ordering?
Red flag Assuming a Kafka topic is globally ordered. Ordering is per-partition; cross-partition order is undefined.
source: Hello Interview — Kafka deep dive ↗
Commonly asked senior design occasional How would you reliably publish an event after committing a DB write (the dual-write problem)?
The trap (a dual write) is committing to the DB and then publishing to the broker as two separate steps — a crash in between leaves them inconsistent (event lost, or published but DB rolled back).
The standard fix is the transactional outbox: in the same DB transaction as the business write, insert the event into an outbox table. A separate relay (polling or change-data-capture like Debezium) reads the outbox and publishes to the broker, marking rows sent. Because the write and the outbox insert commit atomically, the event is never lost; the relay gives at-least-once publishing, so consumers stay idempotent.
Follow-ups they push on
- Why not just publish then write, or write then publish?
- How does CDC / Debezium read the outbox?
Red flag Doing a naive dual write (commit DB, then send to Kafka). A failure between the two desynchronizes your DB and your event stream.
source: microservices.io — Transactional Outbox ↗

2.7 Distributed systems & scaling 15

★ must-know Commonly asked senior concept common Why is an idempotency key essential for a client retry after a timeout, and what subtlety makes timeouts dangerous?
A timeout is ambiguous: when a client's request times out, it cannot tell whether the server never received it, processed it but the response was lost, or is still processing. So a retry might be a true retry or an accidental duplicate of a request that already succeeded.
That's why a non-idempotent operation (charge a card, place an order) needs an idempotency key: the client sends a stable key, the server dedupes on it, and a retry of an already-applied request returns the original result instead of re-applying it. Without the key, the safe-looking retry can double-charge. The subtlety: the failure you can see (timeout) hides whether the side effect happened.
What a strong answer covers
- A timeout doesn't tell you if the operation succeeded — it's inherently ambiguous.
- Retrying a non-idempotent op risks a duplicate side effect.
- An idempotency key lets the server dedupe and return the first result.
- Idempotent methods (GET/PUT/DELETE) are safe to retry without a key.
Follow-ups they push on
- Why can a request that 'timed out' have actually succeeded server-side?
- How does this connect to at-least-once delivery in messaging?
Red flag Treating a timeout as a definite failure and blindly retrying a charge/order. The request may have completed; without an idempotency key you double-apply it.
source: AWS Builders' Library — Making retries safe with idempotent APIs ↗
Commonly asked mid concept very common Explain the CAP theorem. Under a partition, what are you actually choosing between?
CAP says that when a network partition happens, a distributed data store must choose between Consistency (every read sees the latest write) and Availability (every request gets a non-error response). You can't have both during a partition; without a partition you get both.
So it's really a choice made *when partitioned*. CP systems (e.g. HBase, MongoDB in its default config) refuse or block to stay consistent; AP systems (e.g. Cassandra, CouchDB) keep serving and reconcile later (eventual consistency). Important caveat: CAP says nothing about latency or scalability — it's strictly about behavior under partition.
Follow-ups they push on
- Why is the 'pick 2 of 3' framing misleading?
- What does PACELC add to CAP?
Red flag Saying 'pick 2 of 3' as if you choose freely. Partition tolerance is mandatory in a distributed system; the real choice is C vs A only when a partition occurs.
source: system-design-primer — CAP theorem ↗
Commonly asked mid concept common Compare load-balancing algorithms: round robin, least connections, and consistent hashing. When does each shine?
Round robin sends each request to the next server in rotation — simple and fine when requests are uniform and servers identical, but blind to actual load. Least connections routes to the server with the fewest active connections — better when request durations vary, since it adapts to real load instead of assuming uniformity.
Consistent hashing (hash the client/key to a server) keeps a given key/session on the same server — essential for cache affinity or sticky routing, and it minimizes remapping when servers are added/removed. Round robin for stateless uniform work; least connections for variable work; consistent hashing when affinity/locality matters.
What a strong answer covers
- Round robin: simple rotation, ignores load; good for uniform requests.
- Least connections: adapts to variable request durations.
- Consistent hashing: routes a key to a stable server (cache/session affinity).
- Weighted variants account for heterogeneous server capacity.
Quick self-check
Requests have highly variable processing times. Which LB algorithm adapts best to real server load?
Follow-ups they push on
- When does round robin distribute poorly?
- Why does consistent hashing help cache hit rates behind a load balancer?
Red flag Defaulting to round robin when request costs vary wildly — a few expensive requests pile onto one server while others idle. Least connections adapts better.
source: Cloudflare — What is load balancing? ↗
Commonly asked mid concept common What is the difference between latency and throughput, and why can optimizing one hurt the other?
Latency is how long a single operation takes (time per request); throughput is how many operations complete per unit time. They're related but distinct — a system can have high throughput and high latency at once.
They trade off because techniques that raise throughput often add per-request latency: batching many requests amortizes overhead (more throughput) but each request waits for the batch to fill (more latency); deep queues keep workers busy (throughput) but messages wait longer (latency). The discipline is to measure latency as a distribution (p50/p95/p99), not a mean, since tail latency is what users feel, and to choose the tradeoff per workload.
What a strong answer covers
- Latency = time per operation; throughput = operations per unit time.
- Batching/queuing raise throughput but add per-request latency.
- Report latency as percentiles (p95/p99), not averages — tails matter.
- Little's Law links them: concurrency ≈ throughput × latency.
Quick self-check
Why is p99 latency usually more informative than mean latency?
Follow-ups they push on
- Why report p99 instead of the mean?
- How does batching trade latency for throughput?
Red flag Reporting only average latency. A good mean can hide a terrible p99 that real users hit; and maxing throughput via batching can quietly wreck per-request latency.
source: Hello Interview — Latency vs throughput ↗
Commonly asked mid concept very common Horizontal vs vertical scaling — and why does statelessness matter for scaling out?
Vertical scaling means a bigger machine (more CPU/RAM) — simple but bounded by the largest box and a single point of failure. Horizontal scaling means more machines behind a load balancer — effectively unbounded and fault-tolerant, but only if requests can hit any node.
That's why statelessness matters: if a server keeps user session state in local memory, the load balancer must pin a user to one node (sticky sessions), which breaks failover and uneven load. Push state to a shared store (Redis/DB) so any node can serve any request, and horizontal scaling becomes trivial.
Follow-ups they push on
- What breaks if you keep sessions in local server memory?
- How do load balancers route — round robin, least connections, hashing?
Red flag Storing session state in process memory and then trying to scale horizontally — you're forced into sticky sessions, which undermine failover and balancing.
source: system-design-primer — Scalability ↗
Commonly asked mid concept common What is eventual consistency, and why do distributed systems accept it?
Eventual consistency means replicas may temporarily disagree, but if writes stop, they converge to the same value given enough time. AP systems accept it as the price of staying available and low-latency under partitions and across regions.
Why accept it: strong consistency requires coordination (consensus, quorums) on every write, which adds latency and reduces availability when nodes can't reach each other. For many features — a like count, a social feed, a shopping cart — a few seconds of staleness is fine, and the availability/latency win is worth it. For money movement you choose strong consistency instead.
Follow-ups they push on
- Give a feature where eventual consistency is fine and one where it's not.
- What is read-your-own-writes consistency?
Red flag Using eventual consistency for invariants that must hold immediately (e.g. account balances). Match the consistency level to the business need.
source: AWS — Eventual consistency ↗
Commonly asked mid concept common What roles do an API gateway and service discovery play in a microservices system?
An API gateway is the single entry point for clients: it routes to the right service and handles cross-cutting concerns — auth, rate limiting, TLS termination, request aggregation, and sometimes response shaping — so each service doesn't reimplement them and clients don't need to know the internal topology.
Service discovery lets services find each other's network locations as instances scale up/down and move. A registry (Consul, Eureka, or Kubernetes DNS/Services) maps a logical service name to current healthy instances, so callers resolve a name instead of hardcoding IPs. Together they decouple clients from the shifting set of backend instances.
Follow-ups they push on
- Client-side vs server-side discovery — what's the difference?
- What concerns belong in the gateway vs each service?
Red flag Putting business logic in the gateway. It handles routing and cross-cutting concerns; domain logic stays in the services.
source: microservices.io — API gateway & service discovery ↗
Commonly asked senior design common Design read scaling for a heavily-read database. How do replication and the read-your-writes problem interact?
For read-heavy load, add read replicas: writes go to the primary, which asynchronously replicates to replicas that serve reads, spreading read load and adding redundancy. The catch is replication lag — a replica may be milliseconds-to-seconds behind, so a user who just wrote can read a replica and not see their own change.
Fix the read-your-writes experience by routing a user's reads to the primary for a short window after they write, pinning their session to the primary, tracking a write timestamp/LSN and only reading replicas caught up past it, or using synchronous replication for the critical path (at a latency cost). Layer caching and, if writes also dominate, consider sharding.
Follow-ups they push on
- What is replication lag and how do you measure it?
- When would you shard instead of (or in addition to) adding replicas?
Red flag Sending a user's immediate post-write read to an async replica and showing them stale data ('I just saved it — where did it go?'). Route recent writers to the primary or track their write position.
source: system-design-primer — Replication & federation ↗
Commonly asked senior trick occasional Trick: a service adds aggressive client retries to improve reliability and the whole system gets less reliable under load. What happened?
This is a retry storm / metastable failure. When a dependency slows or briefly fails under load, every client retries — often multiplying traffic 3x or more right when the service is least able to handle it. The added load keeps the service overloaded, so it keeps failing, so clients keep retrying: a self-sustaining feedback loop that doesn't recover even after the original trigger passes.
Fixes: bound retries with a retry budget (cap retries as a fraction of traffic, not per-request), add exponential backoff with jitter, use circuit breakers to fail fast, and only retry idempotent operations. Retries help with isolated transient blips; unbounded retries under systemic load amplify the failure.
What a strong answer covers
- Mass retries multiply load exactly when the service is already struggling.
- Creates a self-sustaining (metastable) overload that outlasts the trigger.
- Fix: retry budgets, backoff + jitter, circuit breakers, retry only idempotent ops.
- Retries help isolated blips, not systemic overload.
Quick self-check
Aggressive unconditional retries make a system LESS reliable under load because:
Follow-ups they push on
- What is a retry budget and why cap retries as a fraction of total traffic?
- How does a circuit breaker break the feedback loop?
Red flag Adding per-request retries everywhere as a blanket reliability boost. Under correlated failure they amplify load into a retry storm — bound them with budgets, backoff, and breakers.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
Commonly asked senior concept common What is sharding (horizontal partitioning), and why is choosing a good shard key the hard part?
Sharding splits one logical dataset across multiple databases/nodes by a shard key, so each shard holds a subset and the system scales writes and storage beyond one machine. Reads/writes route to the shard owning the key.
The shard key is the hard part because a bad one creates hotspots — picking a low-cardinality or monotonically-increasing key (like a timestamp) funnels traffic to one shard, defeating the point. You want a key that spreads load evenly *and* keeps commonly-joined data co-located so you avoid expensive cross-shard queries. Cross-shard transactions and re-sharding as you grow are the recurring pains, which is why teams delay sharding until replicas and caching are exhausted.
What a strong answer covers
- Sharding = horizontal partitioning by a shard key across nodes.
- Scales writes/storage past a single machine.
- Bad shard key → hotspots (monotonic or low-cardinality keys are traps).
- Cross-shard joins/transactions and re-sharding are the ongoing costs.
Follow-ups they push on
- Why does a timestamp or auto-increment id make a poor shard key?
- How does consistent hashing reduce re-sharding pain?
Red flag Sharding on a monotonically increasing key (timestamp/sequence id) so all new writes hit the newest shard — a hotspot that recreates the single-node bottleneck.
source: MongoDB — Sharding and shard keys ↗
Commonly asked senior concept very common Monolith vs microservices — what are the real tradeoffs, and why not default to microservices?
A monolith is one deployable: simpler local dev, easy refactors across boundaries, in-process calls, one transaction — at the cost of coupled deploys and scaling the whole app together. Microservices give independent deploys, team autonomy, and targeted scaling — but you pay with distributed-systems tax: network failures, eventual consistency, distributed transactions/sagas, harder debugging and tracing, and heavy ops.
The seasoned answer: don't reach for microservices by default. Most teams should start with a well-modularized monolith and split out services only when a clear scaling, team-ownership, or deploy-cadence boundary justifies the added operational cost.
Follow-ups they push on
- What forces a split — scaling, team size, or deploy cadence?
- What is a 'distributed monolith' and why is it the worst of both?
Red flag Starting greenfield with microservices for resume-driven reasons, inheriting distributed-systems complexity before you have the scale or teams to need it.
source: Martin Fowler — Monolith First ↗
Commonly asked senior design common Why retry failed calls with exponential backoff AND jitter? What goes wrong without jitter?
Retries handle transient failures, but naive retries cause two problems. Exponential backoff (wait 1s, 2s, 4s…) stops a struggling service from being hammered every few milliseconds while it tries to recover.
Jitter (randomizing each wait) prevents a thundering herd: if many clients fail at the same instant and all back off by the exact same schedule, they retry in synchronized waves that keep knocking the service over. Adding randomness spreads the retries out. Pair this with retry budgets/circuit breakers and only retry idempotent or idempotency-keyed operations.
Follow-ups they push on
- Why only retry idempotent operations?
- What does a circuit breaker add on top of backoff?
Red flag Backoff without jitter — synchronized clients retry in lockstep, creating a self-reinforcing herd that prevents recovery.
source: AWS Builders' Library — Timeouts, retries, and backoff with jitter ↗
Commonly asked senior concept common What is a circuit breaker and how does it protect a distributed system?
A circuit breaker wraps calls to a dependency and tracks failures. In closed state calls pass through; once failures cross a threshold it opens and fails fast (returns an error or fallback immediately) instead of piling up calls on a sick service. After a cooldown it goes half-open, lets a trial request through, and closes again if it succeeds.
It prevents cascading failures: without it, a slow dependency exhausts the caller's threads/connections waiting on timeouts, which then takes the caller down, propagating upstream. Failing fast contains the blast radius and lets the dependency recover.
Follow-ups they push on
- Walk through closed → open → half-open transitions.
- How does this complement timeouts and bulkheads?
Red flag Relying on retries alone with no breaker — retries against a failing dependency amplify load and accelerate the cascade.
source: Martin Fowler — CircuitBreaker ↗
Commonly asked senior concept common What is consistent hashing and why do distributed caches and databases use it?
With plain hash(key) % N, changing the number of nodes N remaps almost every key — catastrophic for a cache (mass misses) or a sharded DB (mass data movement). Consistent hashing maps both keys and nodes onto a ring; a key belongs to the next node clockwise. Adding or removing a node only relocates the keys in that node's arc — about 1/N of keys — instead of nearly all of them.
Virtual nodes (each physical node placed at many ring positions) smooth out uneven distribution. This is why Cassandra, DynamoDB, and Memcached-style caches use it.
Follow-ups they push on
- What do virtual nodes solve?
- How much data moves when you add the (N+1)th node?
Red flag Using modulo hashing for a sharded cluster, so adding one node reshuffles nearly all keys and stampedes the backing store.
source: system-design-primer — Consistent hashing ↗
Commonly asked senior concept occasional Why are leader election and quorum used in distributed coordination?
Many tasks need exactly one node in charge (assigning work, ordering writes) to avoid conflicts — so the cluster elects a leader via a consensus protocol (Raft, Paxos, ZooKeeper/ZAB). If the leader dies, a new one is elected.
To agree despite failures, decisions use a quorum — a majority (N/2 + 1). Requiring a majority for writes and reads guarantees any two quorums overlap, so the system never commits two conflicting decisions and can tolerate a minority of nodes failing. This is the backbone of consistent distributed stores and coordination services.
Follow-ups they push on
- Why a majority specifically? (overlapping quorums prevent split decisions)
- What is split-brain and how does quorum prevent it?
Red flag Allowing writes without a majority quorum, enabling split-brain where two partitions both think they have a leader and diverge.
source: The Raft Consensus Algorithm ↗

2.8 Caching 14

★ must-know Commonly asked senior concept common On a write, should you update the cache in place or delete (invalidate) the key? Why is delete usually safer?
Prefer delete (invalidate) over update-in-place. Updating the cache directly on write opens a race: two concurrent writers can set the cache in the opposite order from how they hit the DB, leaving the cache holding the older value permanently. Deleting the key sidesteps that — the next read just repopulates from the source of truth.
Delete is also cheaper (you don't recompute a value that may never be read) and avoids caching an intermediate state. The cost is one guaranteed cache miss after each write. For very hot keys you can refresh asynchronously, but the default rule is invalidate, don't update.
What a strong answer covers
- Update-in-place risks concurrent writers leaving a stale value forever.
- Delete forces the next read to repopulate from the source of truth.
- Delete avoids recomputing values that may never be read.
- Cost: one cache miss after each write.
Quick self-check
On a database write, the more robust cache strategy is usually to:
Follow-ups they push on
- Walk through the concurrent-writer race that update-in-place causes.
- When might you refresh the cache asynchronously instead of deleting?
Red flag Writing the new value straight into the cache on every update. Concurrent writes can reorder and pin a stale value; deleting the key is the robust default.
source: AWS Builders' Library — Caching challenges and strategies ↗
Commonly asked junior concept common Where can caches live across a request's path, and what does each layer cache?
Caching exists at many layers, each closer to the user is cheaper: browser cache (per-user, static assets via Cache-Control/ETag); CDN / edge (shared, geo-distributed static and cacheable responses); application / in-process (a local in-memory map — fastest but per-instance and not shared); distributed cache (Redis/Memcached — shared across app servers); and the database query/buffer cache.
The instinct: cache as close to the user as the data's freshness allows. Each layer trades reach (shared vs per-instance) against latency and invalidation difficulty.
Follow-ups they push on
- Trade-off of in-process vs distributed cache?
- Why is CDN caching great for static but tricky for personalized content?
Red flag Caching personalized/private data in a shared CDN or proxy layer, leaking one user's data to another. Mark it private/no-store.
source: system-design-primer — Caching ↗
Commonly asked mid concept very common Compare cache-aside, write-through, and write-back. When do you use each?
Cache-aside (lazy loading): app checks cache; on a miss it reads the DB, populates the cache, and returns. Most common for read-heavy workloads; only requested data is cached, but the first hit is a miss and stale data is possible without invalidation.
Write-through: writes go to cache and DB together, so the cache is always consistent — at the cost of higher write latency and caching data that may never be read. Write-back (write-behind): write to cache immediately and flush to the DB asynchronously — fast writes and great for write-heavy bursts, but a cache crash before flush loses data. Pick cache-aside by default; write-through when reads must never be stale; write-back when write latency dominates and you can tolerate the durability risk.
Follow-ups they push on
- Which strategy risks data loss and why?
- How do you keep cache-aside from serving stale data?
Red flag Using write-back for data you can't afford to lose. An async-flush cache that dies before flushing loses the unwritten writes.
source: AWS Builders' Library — Caching challenges and strategies ↗
Commonly asked mid concept common What makes a good cache key, and why is cache hit ratio the metric that matters most?
A good cache key is deterministic (same logical request → same key), specific enough to avoid collisions (include the parameters that change the result — user, locale, version), and normalized (sort query params, lowercase where appropriate) so equivalent requests share a key. Over-specific keys (including irrelevant params like a request id) fragment the cache and tank the hit rate; under-specific keys serve the wrong data.
Hit ratio (hits / total lookups) is the headline metric because a cache only pays off when most reads avoid the backend. A low hit ratio means you're spending memory and adding a layer for little benefit — investigate whether keys are too granular, TTLs too short, or the working set exceeds cache capacity.
What a strong answer covers
- Keys must be deterministic, normalized, and include exactly the result-affecting params.
- Over-specific keys fragment the cache; under-specific keys serve wrong data.
- Hit ratio = hits / lookups — the core measure of cache value.
- Low hit ratio → keys too granular, TTL too short, or working set > capacity.
Follow-ups they push on
- How can including a request id or timestamp in the key destroy the hit ratio?
- What does a sudden hit-ratio drop usually indicate?
Red flag Baking a unique/volatile value (request id, current timestamp) into the cache key, so every request misses — you've added overhead with a near-zero hit ratio.
source: AWS — Caching best practices ↗
Commonly asked mid debug occasional Debugging: after deploying a new code version, users report seeing old data that won't refresh. Where would you look in the caching layers?
Stale data after a deploy almost always means a cache somewhere is serving the old version. Walk the layers from the client inward: the browser cache (Cache-Control/Expires on the asset — a missing hash/versioned filename means the browser reuses the old bundle), the CDN/edge (needs a purge/invalidation for changed assets), the application/distributed cache (Redis/Memcached entries not invalidated by the deploy), and finally the DB query cache.
Debug tooling: inspect response headers for Age, X-Cache: HIT, and Cache-Control; force-reload to bypass the browser; check whether the CDN was purged. The durable fix for static assets is cache-busting — content-hashed filenames so a new version is a new URL and old caches simply don't apply.
What a strong answer covers
- Check layers client→server: browser → CDN/edge → app/distributed cache → DB.
- Inspect Age, X-Cache, and Cache-Control to locate the serving cache.
- Static assets need content-hashed filenames (cache-busting) per deploy.
- A CDN may need an explicit purge/invalidation after deploy.
Quick self-check
Users keep getting the old JS bundle after a deploy. The most reliable fix is to:
Follow-ups they push on
- How does a content hash in the filename make cache invalidation automatic?
- What does the Age header tell you about which cache served the response?
Red flag Serving versioned JS/CSS under a stable filename with a long max-age, so browsers and CDNs keep the old bundle after deploy. Hash the filename so a new build is a new URL.
source: MDN — HTTP caching (cache busting) ↗
Commonly asked mid concept occasional What's the difference between read-through and cache-aside caching? Who is responsible for the database read in each?
Both are lazy-loading read strategies, but they differ in *who* loads on a miss. In cache-aside (lazy loading) the application owns the logic: it checks the cache, and on a miss it reads the DB and writes the result back to the cache itself. The cache is just a dumb store; the app is the orchestrator.
In read-through, the cache sits inline and loads from the DB on a miss transparently — the application only ever talks to the cache, and a provider/loader function populates it. Read-through centralizes the load logic (less duplicated code, consistent behavior) but needs cache support for it; cache-aside is more flexible and the most common pattern. Both still need a write strategy and TTLs to manage staleness.
What a strong answer covers
- Cache-aside: the application reads the DB on a miss and populates the cache.
- Read-through: the cache loads from the DB on a miss transparently.
- Read-through centralizes load logic; cache-aside is more flexible/common.
- Both are lazy (load on miss) and still need write/TTL strategies.
Quick self-check
In cache-aside, who reads the database on a cache miss?
Follow-ups they push on
- Why is cache-aside more common despite read-through's cleaner app code?
- How does write-through pair with read-through?
Red flag Conflating the two and assuming the cache auto-loads in cache-aside. In cache-aside the application must explicitly read the DB and repopulate on every miss.
source: AWS — Database caching strategies (lazy loading vs read-through) ↗
Commonly asked mid concept common LRU vs LFU vs FIFO eviction, plus TTL — how do you choose?
When a cache is full it evicts by policy. LRU drops the least-recently-used entry — the default; great when recent access predicts future access (temporal locality). LFU drops the least-frequently-used — better when some items are persistently hot regardless of recency, but it can keep stale 'once-popular' items. FIFO evicts the oldest inserted regardless of use — simple but ignores access patterns.
TTL is orthogonal: it bounds staleness by expiring entries after a time, independent of capacity pressure. Typical setup: LRU for capacity eviction plus a TTL for freshness.
Follow-ups they push on
- When does LFU beat LRU?
- How does TTL interact with an eviction policy?
Red flag Treating TTL as an eviction policy. TTL bounds staleness over time; LRU/LFU/FIFO decide what to drop under memory pressure — they solve different problems.
source: GeeksforGeeks — Cache eviction policies ↗
Commonly asked mid coding very common Implement an LRU cache with O(1) get and put. What data structures do you use?
Combine a hash map (key → node) with a doubly linked list ordered by recency. The map gives O(1) lookup; the linked list gives O(1) move-to-front and O(1) eviction at the tail.
On get: look up the node in the map, unlink it, move it to the head (most recent), return its value. On put: if present, update and move to head; if new, insert at head and add to the map; if over capacity, remove the tail node and delete its key from the map. Both operations are O(1) because every step is a constant-time pointer/map update.
Map (key -> node) + DLL: head=newest ... tail=evict
Follow-ups they push on
- Why a doubly (not singly) linked list?
- How would you make it thread-safe?
- In an interview, can you use a language built-in like LinkedHashMap?
Red flag Using an array or scanning the list to find the LRU item — that's O(n). The hash map + DLL pairing is what keeps both operations O(1).
source: LeetCode — LRU Cache (146) ↗
Commonly asked mid concept common Redis vs Memcached — when would you pick each?
Memcached is a simple, multithreaded, in-memory key→blob cache — extremely fast and easy to scale for pure caching of opaque values. Redis is a richer in-memory data store: it has data structures (lists, sets, sorted sets, hashes, streams), optional persistence, replication, pub/sub, Lua scripting, and clustering.
Pick Memcached when you just need a fast, large, simple cache and want multithreaded throughput per node. Pick Redis when you need those data structures, durability, atomic operations, pub/sub, rate-limiter counters, leaderboards, or built-in replication/clustering — which is most modern use cases.
Follow-ups they push on
- When does Memcached's multithreading actually win?
- What Redis features make it more than a cache?
Red flag Saying 'Redis is just a faster Memcached'. The real difference is Redis's data structures, persistence, and clustering, not raw speed.
source: AWS — Redis vs Memcached ↗
Commonly asked senior concept occasional Why does Redis need a persistence and eviction policy, and what's the difference between RDB and AOF?
Redis holds data in memory, so two policies matter. Eviction (maxmemory-policy) decides what happens when memory fills — noeviction (reject writes), allkeys-lru, volatile-ttl, etc. Pick LRU/LFU variants when using Redis as a cache; noeviction when it's a primary store you can't silently drop from.
Persistence decides what survives a restart. RDB takes periodic point-in-time snapshots — compact, fast to load, but you lose writes since the last snapshot. AOF (append-only file) logs every write operation — far better durability (down to per-write fsync) at the cost of larger files and slower restart. Many run both: AOF for durability, RDB for fast restores. Treating Redis purely as a cache means you may not need persistence at all.
What a strong answer covers
- Eviction policy governs behavior at maxmemory; choose LRU/LFU for cache use.
- RDB = periodic snapshots: compact and fast to load, but loses recent writes.
- AOF = append-only write log: stronger durability, bigger/slower.
- Often run both; a pure cache may skip persistence entirely.
Follow-ups they push on
- When would you choose noeviction over allkeys-lru?
- What's the durability/performance tradeoff of AOF fsync-everysec vs always?
Red flag Running Redis as a primary datastore with `noeviction` unset and no persistence, then losing data on a restart or silently dropping writes at maxmemory.
source: Redis — Persistence ↗
Commonly asked senior trick occasional When is adding a cache the WRONG move? Name cases where caching hurts more than it helps.
Caching is not free — it adds a consistency problem and an extra failure mode. It's the wrong move when the data changes more often than it's read (you invalidate constantly, getting a near-zero hit ratio while paying the cost), when staleness is unacceptable (account balances, inventory at checkout, anything where a wrong value causes real harm), when the working set is far larger than memory so you thrash with evictions, or when the backend is already fast enough that the cache only adds complexity and a coherence bug surface.
The instinct: reach for caching when reads dominate writes and a little staleness is tolerable; otherwise the extra layer buys complexity, not speed.
What a strong answer covers
- Write-heavy / frequently-changing data → constant invalidation, low hit ratio.
- Strict-correctness data (balances, inventory) → staleness causes real harm.
- Working set >> cache memory → thrashing evictions, little benefit.
- Already-fast backend → cache adds complexity and a new failure/coherence surface.
Quick self-check
For which workload is adding a cache LEAST likely to help?
Follow-ups they push on
- Why does a write-heavy workload defeat most caching strategies?
- How do you decide the read:write ratio threshold where caching pays off?
Red flag Adding a cache reflexively 'for performance' on write-heavy or correctness-critical data. You inherit invalidation bugs and a new failure mode for little or negative gain.
source: AWS Builders' Library — Caching challenges and strategies ↗
Commonly asked senior design common What is a cache stampede (thundering herd) and how do you prevent it?
A cache stampede happens when a hot key expires and many concurrent requests all miss simultaneously, then all hit the database at once to recompute the same value — a spike that can overload the backend.
Mitigations: request coalescing / locking so only one request recomputes while others wait for or briefly serve the old value; early/probabilistic expiration so one request refreshes the key slightly before it expires; stale-while-revalidate, serving the old value while refreshing in the background; and jittering TTLs so many keys don't expire at the same instant.
Follow-ups they push on
- How does a per-key lock prevent the dogpile?
- What is probabilistic early expiration?
Red flag Giving many hot keys the same fixed TTL, so they expire together and trigger a synchronized backend stampede. Add jitter and coalesce recomputation.
source: AWS Builders' Library — Caching challenges and strategies ↗
Commonly asked senior trick common Why is cache invalidation hard, and what are the failure modes (stale reads)?
There are really two hard problems: deciding *when* a cached value is no longer valid, and making the cache and source of truth agree across concurrent updates. With cache-aside, a classic race: reader gets a miss and starts loading the old value; a writer updates the DB and deletes the cache key; the reader then writes its stale value back — now the cache is wrong indefinitely.
Approaches: delete (don't update) the key on write so the next read repopulates fresh; use short TTLs to bound staleness; version keys; or for tighter consistency use write-through. There's no free lunch — you trade consistency, latency, and complexity.
Follow-ups they push on
- Why delete the key on write instead of updating it?
- How does a short TTL bound the damage?
Red flag Updating the cache in place on writes (instead of deleting) and ignoring the read-load/write-delete interleaving — you cache a stale value that never self-heals.
source: AWS Builders' Library — Caching challenges and strategies ↗
Commonly asked senior concept occasional What are cache penetration and cache avalanche, and how do they differ from a stampede?
Cache penetration: requests for keys that don't exist anywhere always miss the cache and hit the DB — common in scraping/attacks. Fix by caching the negative result (a short-TTL 'null' marker) or screening with a Bloom filter of valid keys.
Cache avalanche: a large set of keys expire at once (or the cache itself goes down), so traffic floods the DB en masse. Fix by jittering TTLs, layering caches, and adding rate limiting / circuit breakers in front of the DB. Versus a stampede, which is many concurrent requests for *one* expiring hot key — avalanche is many keys at once, penetration is keys that never exist.
Follow-ups they push on
- How does a Bloom filter stop penetration cheaply?
- Why does jittering TTLs help avalanche?
Red flag Not caching negative lookups, so a flood of requests for nonexistent keys bypasses the cache entirely and hammers the database.
source: GeeksforGeeks — Cache penetration, avalanche, stampede ↗

03 Databases 98 Q's

3.1 Relational model & SQL basics 14

★ must-know Commonly asked junior concept very common What's the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*) counts rows, including rows where every column is NULL. COUNT(col) counts rows where col is not NULL — NULLs are skipped. COUNT(DISTINCT col) counts the number of distinct non-NULL values.
So on a column with NULLs, COUNT(*) >= COUNT(col) >= COUNT(DISTINCT col). This trips people up in 'how many customers placed an order' style questions, where a LEFT JOIN leaves NULLs and COUNT(*) over-counts.
What a strong answer covers
- COUNT(*) counts every row regardless of NULLs.
- COUNT(col) ignores rows where col IS NULL.
- COUNT(DISTINCT col) ignores NULLs and collapses duplicates.
- After a LEFT JOIN, count a non-null right-side column (not *) to avoid counting unmatched rows.
Quick self-check
A `votes(id, choice)` column has 5 rows; `choice` is NULL in 2 of them, and the 3 non-null values are 'A','A','B'. What are COUNT(*), COUNT(choice), COUNT(DISTINCT choice)?
Follow-ups they push on
- After a LEFT JOIN, why does COUNT(*) over-report and COUNT(right_col) fix it?
- Is COUNT(1) any different from COUNT(*)? (no — same thing)
Red flag Assuming COUNT(col) counts all rows like COUNT(*), or using COUNT(*) after a LEFT JOIN and counting the NULL-filled unmatched rows.
source: PostgreSQL docs — Aggregate Functions ↗
Amazon junior concept very common What is the difference between a PRIMARY KEY, a UNIQUE constraint, and a FOREIGN KEY?
A primary key uniquely identifies a row: it is UNIQUE and NOT NULL, and there is exactly one per table.
A UNIQUE constraint also forbids duplicates but *does* allow a NULL (one, in most engines), and a table can have many of them.
A foreign key is a column whose values must exist as a key in another table — it enforces referential integrity (you cannot insert an order for a customer that does not exist, and the DB can block/cascade deletes).
Follow-ups they push on
- What is a composite key?
- Can a foreign key reference a UNIQUE column instead of a primary key?
Red flag Saying a primary key is 'just a unique column' and forgetting the implicit NOT NULL, or claiming a table can have several primary keys (it has one, possibly composite).
source: DataLemur — Amazon SQL Interview Questions ↗
Commonly asked junior concept occasional What is the difference between CHAR, VARCHAR, and TEXT, and when does the choice matter?
CHAR(n) is fixed-length — it pads with spaces to n, so it suits truly fixed codes (a 2-char country code, a fixed hash). VARCHAR(n) is variable-length with a declared max, erroring if you exceed it. TEXT is variable-length with no practical limit.
In PostgreSQL there is no performance difference between them — the manual recommends text or varchar and notes char(n) is usually the *slowest* due to padding. The length limit is mainly a data-integrity constraint, not an optimization. (In some other engines, like older MySQL row formats, fixed vs variable length had storage implications.)
What a strong answer covers
- CHAR(n): fixed length, space-padded — only for genuinely fixed-width values.
- VARCHAR(n): variable length with an enforced maximum.
- TEXT: variable length, effectively unlimited.
- In Postgres these perform the same; a length cap is a constraint, not a speed win.
Follow-ups they push on
- Does a VARCHAR(255) store faster than VARCHAR(1000) in Postgres? (no)
- When is CHAR(n) actually the right choice?
Red flag Believing a smaller VARCHAR(n) is faster or saves space in Postgres, or using CHAR for general text and getting surprised by trailing-space padding.
source: PostgreSQL docs — Character Types ↗
Commonly asked junior coding common How do you classify employees into salary bands ('low'/'mid'/'high') in a single SELECT?
Use a CASE expression, which is SQL's inline if/else:
SELECT name, salary, CASE WHEN salary < 50000 THEN 'low' WHEN salary < 100000 THEN 'mid' ELSE 'high' END AS band FROM employee;
The searched CASE evaluates WHEN branches top-to-bottom and returns the first match, so order the boundaries carefully. With no ELSE, unmatched rows get NULL. You can also wrap CASE inside an aggregate (SUM(CASE WHEN … THEN 1 ELSE 0 END)) for conditional counts — the classic pivot trick.
What a strong answer covers
- CASE WHEN … THEN … [WHEN …] ELSE … END returns the first matching branch.
- Branches are evaluated in order — overlapping conditions resolve to the first true one.
- Omitting ELSE yields NULL for unmatched rows.
- SUM(CASE WHEN cond THEN 1 ELSE 0 END) does conditional counting / pivoting.
Follow-ups they push on
- Rewrite a conditional COUNT using SUM(CASE WHEN …).
- What's the difference between a simple CASE and a searched CASE?
Red flag Ordering CASE branches so a broad condition shadows a narrower one, or forgetting that without ELSE the result is NULL, not 0.
source: PostgreSQL docs — Conditional Expressions (CASE) ↗
Commonly asked junior coding very common Write a query to return all employees in the Engineering department earning more than 100000, sorted by salary descending.
Straight SELECT … WHERE … ORDER BY:
SELECT name, salary FROM employees WHERE department = 'Engineering' AND salary > 100000 ORDER BY salary DESC;
Watch the clause order — WHERE filters rows, ORDER BY runs last. String literals are single-quoted; double quotes mean an identifier in standard SQL.
Follow-ups they push on
- Add a tie-breaker so equal salaries sort by name.
- Return only the top 5 — LIMIT vs TOP vs FETCH FIRST?
Red flag Using double quotes around the string literal (an identifier in standard SQL/Postgres), or putting ORDER BY before WHERE.
source: PG Exercises — Basic ↗
Commonly asked junior concept very common What is the difference between WHERE and HAVING, and why can't you put an aggregate in WHERE?
WHERE filters individual rows before grouping; HAVING filters groups after the GROUP BY runs.
An aggregate like COUNT(*) is not known until rows are grouped, so it cannot appear in WHERE — it belongs in HAVING. Example: SELECT dept, COUNT(*) FROM emp WHERE active = true GROUP BY dept HAVING COUNT(*) > 5; — active is filtered per-row, the head-count per-group.
Follow-ups they push on
- Logical order of evaluation of FROM/WHERE/GROUP BY/HAVING/SELECT/ORDER BY?
- Can you reference a SELECT alias in HAVING?
Red flag Putting `WHERE COUNT(*) > 5`, or believing HAVING is just 'WHERE for the GROUP BY query' with no semantic difference.
source: PostgreSQL docs — GROUP BY and HAVING ↗
Commonly asked junior trick common What does NULL mean in SQL, and why does `WHERE col = NULL` return nothing?
NULL is 'unknown', not a value. Any comparison with NULL using =/<> yields UNKNOWN (not true), so the row is dropped — WHERE col = NULL always returns zero rows.
Use the dedicated operators: WHERE col IS NULL / IS NOT NULL. Note aggregates skip NULLs (AVG, COUNT(col)) but COUNT(*) counts the row regardless.
Follow-ups they push on
- What does `NULL = NULL` evaluate to?
- How does NULL behave inside NOT IN (subquery) — and why is that a trap?
Red flag Treating NULL as a value you can equality-test, or assuming `NOT IN` works when the subquery can yield a NULL (it then returns no rows).
source: PostgreSQL docs — Comparison Functions and Operators ↗
Commonly asked junior coding common Find all duplicate email addresses in a Person table (emails appearing more than once).
Group by the column and keep groups of size > 1:
SELECT email FROM person GROUP BY email HAVING COUNT(*) > 1;
This is the canonical 'GROUP BY + HAVING COUNT' pattern. To actually delete dupes you would keep MIN(id) per group and remove the rest.
Follow-ups they push on
- Now delete the duplicates, keeping the row with the smallest id.
- Could a self-join solve this too? Compare it to GROUP BY.
Red flag Using `WHERE COUNT(*) > 1`, or `SELECT DISTINCT` (which hides duplicates rather than finding them).
source: LeetCode 196 — Duplicate Emails ↗
Commonly asked junior concept common What is the difference between DELETE, TRUNCATE, and DROP?
DELETE removes rows one at a time, can have a WHERE, fires triggers, is fully transactional and rollback-able.
TRUNCATE empties the whole table in one fast metadata operation — no per-row WHERE, usually resets identity counters, minimal logging.
DROP removes the table definition itself (and its data) from the schema.
Mnemonic: DELETE = some/all rows, TRUNCATE = all rows fast, DROP = the table is gone.
Follow-ups they push on
- Is TRUNCATE transactional in Postgres? (Yes.) In other engines?
- Which of these can you roll back?
Red flag Claiming TRUNCATE can take a WHERE clause, or that DELETE and TRUNCATE are interchangeable (triggers, identity reset, and speed differ).
source: PostgreSQL docs — TRUNCATE ↗
Commonly asked junior concept common What is the difference between UNION and UNION ALL, and which is faster?
UNION concatenates two result sets and removes duplicates (an implicit DISTINCT, which costs a sort/hash). UNION ALL keeps every row, including duplicates.
UNION ALL is faster because it skips the dedup step — prefer it whenever you know the inputs are already disjoint or duplicates are acceptable. Both require the same column count and compatible types in each branch.
Follow-ups they push on
- When is UNION (with dedup) actually required?
- Difference between UNION and a FULL OUTER JOIN?
Red flag Defaulting to UNION everywhere and paying for a needless dedup, or assuming the column lists must have identical names (only count/type must match).
source: StrataScratch — Meta SQL Interview Questions ↗
Amazon mid coding very common Find the employee(s) with the highest salary in each department.
The robust way is a window rank so ties are kept:
WITH r AS (SELECT name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employee) SELECT name, department, salary FROM r WHERE rk = 1;
A correlated-subquery form also works: WHERE salary = (SELECT MAX(salary) FROM employee e2 WHERE e2.department = e1.department). Use RANK (not ROW_NUMBER) so two employees tied at the top of a department both appear.
What a strong answer covers
- Partition by department, order by salary DESC, keep the top rank.
- RANK() = 1 keeps all ties; ROW_NUMBER() = 1 arbitrarily keeps just one.
- Equivalent correlated subquery: compare each salary to that department's MAX.
- A global ORDER BY salary DESC LIMIT 1 is wrong — it returns one row overall, not one per department.
Follow-ups they push on
- Why RANK rather than ROW_NUMBER if ties should all be returned?
- Rewrite it without a window function using a correlated subquery.
Red flag Using `ORDER BY salary DESC LIMIT 1` (top earner overall, not per department) or ROW_NUMBER, which silently drops tied top earners.
source: LeetCode 184 — Department Highest Salary ↗
Commonly asked mid trick common What does `WHERE status NOT IN ('shipped', 'delivered')` do to rows where status is NULL, and why?
It excludes them — a row with status = NULL is *not* returned, even though NULL is obviously 'not shipped and not delivered' to a human.
NOT IN (...) expands to status <> 'shipped' AND status <> 'delivered'. Comparing NULL <> anything yields UNKNOWN, so the whole AND is UNKNOWN, and WHERE keeps only rows that are TRUE. To include NULLs you must say so explicitly: WHERE status NOT IN (...) OR status IS NULL.
What a strong answer covers
- NOT IN is sugar for a chain of <> comparisons joined by AND.
- Any comparison with NULL is UNKNOWN, and WHERE keeps only TRUE rows — so NULL rows drop.
- The far more dangerous case: a NULL inside the list makes NOT IN return *no rows at all*.
- Add OR col IS NULL to include NULLs, or prefer NOT EXISTS which is NULL-safe.
Quick self-check
`orders` has statuses 'shipped', 'pending', and NULL. `SELECT * FROM orders WHERE status NOT IN ('shipped');` returns…
Follow-ups they push on
- What happens if the value list itself contains a NULL (e.g. from a subquery)?
- How does NOT EXISTS avoid this NULL trap?
Red flag Expecting NULL rows to satisfy a `NOT IN`/`<>` filter, or using `NOT IN (subquery)` where the subquery can yield NULL and silently returning zero rows.
source: PostgreSQL docs — Row and Array Comparisons (IN / NOT IN and NULL) ↗
Amazon mid coding common Write a SQL query to get the average star rating for each product for each month.
Extract the month from the timestamp, group by it and the product, average the stars:
SELECT EXTRACT(MONTH FROM submit_date) AS mth, product_id, ROUND(AVG(stars), 2) AS avg_stars FROM reviews GROUP BY mth, product_id ORDER BY mth, product_id;
Every non-aggregated SELECT column must appear in GROUP BY. ROUND tidies the output.
Follow-ups they push on
- Group by year-and-month so January 2023 and January 2024 don't collapse together.
- How would NULL stars affect AVG?
Red flag Selecting a column that is neither grouped nor aggregated (errors in Postgres, silently picks an arbitrary row in old MySQL), or grouping by month only so different years merge.
source: DataLemur — Amazon 'Average Review Ratings' ↗
Commonly asked mid concept common What is the logical order of evaluation of a SELECT's clauses, and why does it explain why you can't use a SELECT alias in WHERE?
Although you *write* SELECT … FROM … WHERE … GROUP BY … HAVING … ORDER BY, the engine *logically* evaluates them as: FROM/joins -> WHERE -> GROUP BY -> HAVING -> SELECT (where aliases are assigned) -> DISTINCT -> ORDER BY -> LIMIT.
Because SELECT runs after WHERE, an alias defined in SELECT doesn't yet exist when WHERE is evaluated — so WHERE total > 100 referencing a SELECT … AS total errors. ORDER BY runs last, which is why it *can* see SELECT aliases. (MySQL leniently allows aliases in some clauses as an extension, but the standard doesn't.)
What a strong answer covers
- Logical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
- Aliases are created in the SELECT step, so WHERE/GROUP BY/HAVING generally can't see them.
- ORDER BY is the one clause that *can* reference SELECT aliases — it runs after SELECT.
- Workaround: repeat the expression, or wrap the query in a subquery/CTE and filter on the alias outside.
Quick self-check
Given `SELECT price * qty AS total FROM line_items WHERE total > 100;`, what happens?
Follow-ups they push on
- Why can ORDER BY use a SELECT alias but WHERE cannot?
- Does MySQL deviate from this? (it allows aliases in GROUP BY/HAVING as an extension)
Red flag Assuming the written clause order is the execution order, then being confused why a SELECT alias is 'not recognized' in WHERE or GROUP BY.
source: PostgreSQL docs — SELECT (clause processing order) ↗

3.2 JOINs 16

★ must-know Commonly asked junior concept very common Walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN — what rows does each keep?
All join two tables on a predicate; they differ in which unmatched rows survive:
- INNER — only rows that match on both sides.
- LEFT (OUTER) — all left rows; right columns are NULL where no match.
- RIGHT (OUTER) — all right rows; left columns NULL where no match (a LEFT JOIN with the tables swapped).
- FULL OUTER — all rows from both sides; NULL on whichever side is missing.
- CROSS — every left row paired with every right row (Cartesian product, no ON).
Mental model: INNER is the intersection, LEFT/RIGHT keep one side whole, FULL keeps the union, CROSS multiplies.
What a strong answer covers
- INNER = matches only; the unmatched rows on both sides are dropped.
- LEFT keeps all left rows; RIGHT keeps all right rows (mirror images).
- FULL OUTER keeps unmatched rows from both sides, padding the missing side with NULL.
- CROSS produces n*m rows with no join condition.
- RIGHT JOIN is rarely written by hand — people flip the tables and use LEFT for readability.
Quick self-check
`customers` has 10 rows; `orders` has 4 rows, all belonging to just 2 of those customers (one order each... wait, 4 orders across 2 customers). How many rows does `customers LEFT JOIN orders ON customers.id = orders.cust_id` return?
Follow-ups they push on
- Which join would you use to find rows present in A but missing in B?
- How do you reproduce a FULL OUTER JOIN in MySQL, which lacks it?
Red flag Describing LEFT/RIGHT as 'returns more rows' rather than 'preserves the unmatched rows of one side', or thinking CROSS JOIN needs an ON clause.
source: PostgreSQL docs — Joined Tables (join types) ↗
Amazon junior concept very common What's the difference between an INNER JOIN and a LEFT JOIN, and what's the classic LEFT JOIN bug?
INNER JOIN keeps only rows that match in both tables; LEFT JOIN keeps all left rows, filling NULL where the right side has no match.
The bug: a WHERE predicate on a *right-table* column silently turns a LEFT JOIN into an INNER JOIN, because NULL fails the filter and those unmatched rows vanish. Fix by moving the condition into the ON clause: LEFT JOIN orders o ON o.cust_id = c.id AND o.status = 'paid' keeps customers with no paid order.
Follow-ups they push on
- Where do you put a filter on the *left* table — does it matter?
- Emulate a FULL OUTER JOIN in MySQL, which lacks it.
Red flag Saying LEFT JOIN 'returns more rows' instead of 'preserves unmatched left rows', and not catching the WHERE-vs-ON filter trap.
source: DataLemur — SQL Interview Questions ↗
Commonly asked junior concept occasional What does a CROSS JOIN do, and name a legitimate use for it.
A CROSS JOIN produces the Cartesian product — every row of A paired with every row of B (n*m rows, no ON clause). 10k x 10k = 100M rows, so it's usually a bug from a missing join condition.
Legit uses: generating a complete grid (every store x every day to fill gaps for a report), pairing each row against a small constants/calendar table, or building combinations. Often written deliberately as CROSS JOIN generate_series(...).
Follow-ups they push on
- How does an unintended CROSS JOIN usually sneak in?
- Difference between CROSS JOIN and an INNER JOIN with `ON 1=1`?
Red flag Not recognizing that a comma-join with no WHERE join condition is effectively a CROSS JOIN that explodes row counts.
source: PostgreSQL docs — Joined Tables (CROSS JOIN) ↗
Commonly asked mid debug very common A LEFT JOIN with `WHERE right_table.col = 'x'` returns fewer rows than expected. What happened, and what's the fix?
The WHERE on a right-table column silently demotes the LEFT JOIN to an INNER JOIN. Unmatched left rows have NULL in right_table.col, and NULL = 'x' is UNKNOWN, so WHERE discards exactly the rows the LEFT JOIN was meant to preserve.
Fix: move the predicate into the ON clause — LEFT JOIN r ON r.fk = l.id AND r.col = 'x' — so it filters which right rows *match* without dropping unmatched left rows. The rule: conditions that should *preserve* the outer side go in ON; conditions that should *filter the final result* go in WHERE. (A WHERE right.col IS NULL is the deliberate exception — that's the anti-join idiom.)
What a strong answer covers
- A WHERE predicate on the null-able (right) side turns LEFT JOIN into INNER JOIN.
- Cause: NULL = 'x' evaluates to UNKNOWN, so the padded unmatched rows are filtered out.
- Fix: put the right-side condition in ON, not WHERE.
- ON controls matching (preserves the outer side); WHERE filters the joined result.
- Exception: WHERE right.col IS NULL is intentional — it's the anti-join pattern.
Quick self-check
You want every customer plus their 2024 orders (customers with no 2024 order should still appear). Which is correct?
Follow-ups they push on
- Why is a predicate on the LEFT (preserved) table the same in ON or WHERE here?
- How does this differ for an INNER JOIN, where ON vs WHERE are interchangeable?
Red flag Putting a right-table filter in WHERE and not realizing you've turned an outer join into an inner join, losing the unmatched rows you wanted.
source: PostgreSQL docs — Joined Tables (ON vs WHERE for outer joins) ↗
Commonly asked mid coding very common Find customers who have never placed an order — and explain three ways to write it.
This is the canonical anti-join. Three idioms:
1. LEFT JOIN / IS NULL: SELECT c.name FROM customers c LEFT JOIN orders o ON o.cust_id = c.id WHERE o.id IS NULL;
2. NOT EXISTS (usually the planner's favourite, and NULL-safe): SELECT name FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
3. NOT IN — works *only* if the subquery column can't be NULL: WHERE c.id NOT IN (SELECT cust_id FROM orders WHERE cust_id IS NOT NULL);
Prefer NOT EXISTS for safety and performance; reach for LEFT JOIN … IS NULL when you also need columns from the joined table.
What a strong answer covers
- Anti-join = 'rows in A with no match in B'.
- LEFT JOIN + WHERE matched_col IS NULL keeps only the unmatched left rows.
- NOT EXISTS is NULL-safe and typically optimizes to an efficient anti-join.
- NOT IN breaks (returns nothing) if the subquery yields a single NULL — guard with WHERE col IS NOT NULL.
Follow-ups they push on
- Why is NOT EXISTS safer than NOT IN here?
- Which form lets you also return data from the orders table?
Red flag Using `NOT IN (SELECT cust_id FROM orders)` when `cust_id` can be NULL — one NULL makes the predicate UNKNOWN for every row and the query returns nothing.
source: LeetCode 183 — Customers Who Never Order ↗
Commonly asked mid concept common What's the difference between joining in the ON clause versus filtering in WHERE for an INNER JOIN — does it matter?
For an INNER JOIN, a predicate produces the same result whether you put it in ON or WHERE — both filter the matched set, and the optimizer treats them equivalently.
For OUTER joins it matters enormously: an ON condition decides which rows *match* (unmatched outer rows are still kept and padded with NULL), while a WHERE condition filters the *final* result *after* the NULLs are added — which can erase the preserved rows. So the safe habit is: join keys and match conditions in ON; result-set filters in WHERE; and remember the distinction only collapses for INNER joins.
What a strong answer covers
- INNER JOIN: ON vs WHERE give identical results — equivalent to the optimizer.
- OUTER JOIN: ON affects *matching* (preserves unmatched rows); WHERE filters *after* padding.
- Best practice: put the relationship/keys in ON, post-join filters in WHERE.
- The 'it doesn't matter' rule applies *only* to inner joins.
Quick self-check
For `a INNER JOIN b ON a.id = b.aid AND b.active = true` vs `a INNER JOIN b ON a.id = b.aid WHERE b.active = true`, the results are…
Follow-ups they push on
- Show a case where moving a predicate from WHERE to ON changes a LEFT JOIN's output.
- Does the optimizer reorder ON vs WHERE predicates for an inner join?
Red flag Over-generalizing 'ON and WHERE are the same' from inner joins to outer joins, where they produce different result sets.
source: Use The Index, Luke! — Join Operations ↗
Commonly asked mid trick occasional Using a USING clause or NATURAL JOIN instead of ON — what are they and why are they risky?
JOIN … USING (col) joins on equally-named columns and merges them into one output column (so you write col, not a.col). NATURAL JOIN goes further and joins on all identically-named columns automatically, with no ON/USING at all.
USING is fine and concise. NATURAL JOIN is dangerous: adding an unrelated same-named column later (a created_at or id on both tables) silently changes the join key and corrupts results with no error. Most style guides ban NATURAL JOIN and prefer an explicit ON (or USING) so the join condition is visible and stable against schema changes.
What a strong answer covers
- USING (col) joins on a shared column name and collapses it to a single output column.
- NATURAL JOIN auto-joins on *every* commonly-named column — implicit and fragile.
- A later schema change (new same-named column) silently alters a NATURAL JOIN's keys.
- Prefer explicit ON; USING is acceptable, NATURAL JOIN is widely discouraged.
Quick self-check
Why do most style guides discourage NATURAL JOIN?
Follow-ups they push on
- How does USING change which columns appear in `SELECT *`?
- Why can adding a column break an existing NATURAL JOIN with no error?
Red flag Relying on NATURAL JOIN, then having a future migration add a same-named column that silently joins on it and quietly changes the result set.
source: PostgreSQL docs — Joined Tables (USING and NATURAL) ↗
Amazon mid coding common Identify the top two highest-grossing products within each category in 2022, returning category, product, and total spend.
Aggregate spend per (category, product), rank within each category, keep rank <= 2:
WITH g AS (SELECT category, product, SUM(spend) AS total FROM product_spend WHERE EXTRACT(YEAR FROM tx_date) = 2022 GROUP BY category, product), r AS (SELECT *, RANK() OVER (PARTITION BY category ORDER BY total DESC) AS rk FROM g) SELECT category, product, total FROM r WHERE rk <= 2;
This is the 'top-N-per-group' pattern: GROUP BY for the metric, a window RANK to rank within partitions.
Follow-ups they push on
- RANK vs DENSE_RANK vs ROW_NUMBER for breaking ties on equal spend?
- Why can't you filter on the window function in the same SELECT's WHERE?
Red flag Using a global ORDER BY + LIMIT 2 (gives the top 2 overall, not per category), or referencing the window alias in WHERE instead of wrapping it in a CTE/subquery.
source: DataLemur — Amazon 'Highest-Grossing Items' ↗
Commonly asked mid coding common Write a self-join to list each employee alongside their manager's name from an employees(id, name, manager_id) table.
Join the table to itself with two aliases:
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;
Use LEFT JOIN (not INNER) so the CEO, whose manager_id is NULL, still appears with a NULL manager. Aliases (e, m) are mandatory to disambiguate the two copies.
Follow-ups they push on
- How would you go more than one level up (whole chain to the CEO)?
- Recursive CTE for an arbitrary-depth org chart?
Red flag Using INNER JOIN and silently dropping the top-level employee, or forgetting aliases so the columns are ambiguous.
source: PG Exercises — JOINs ↗
Commonly asked mid coding occasional MySQL has no FULL OUTER JOIN. How do you emulate one?
Take the union of a LEFT JOIN and a RIGHT JOIN:
SELECT * FROM a LEFT JOIN b ON a.id = b.id UNION SELECT * FROM a RIGHT JOIN b ON a.id = b.id;
The LEFT half gives all of a plus matches; the RIGHT half gives all of b plus matches; UNION (not UNION ALL) dedups the rows that matched on both sides.
Follow-ups they push on
- Why UNION and not UNION ALL here?
- How to find rows present in exactly one side (anti-join / symmetric difference)?
Red flag Using UNION ALL and double-counting matched rows, or assuming MySQL silently supports FULL OUTER JOIN.
source: PostgreSQL docs — Joins (table expressions) ↗
Amazon mid coding occasional Find products that exist in Amazon's catalog but NOT in the partner's catalog (an anti-join).
Three idiomatic ways; the LEFT JOIN … IS NULL anti-join is the workhorse:
SELECT a.product FROM amazon a LEFT JOIN partner p ON a.product = p.product WHERE p.product IS NULL;
Alternatives: NOT EXISTS (SELECT 1 FROM partner p WHERE p.product = a.product) (NULL-safe, often the planner's favourite) or EXCEPT. Prefer NOT EXISTS over NOT IN when the right column can be NULL.
Follow-ups they push on
- Why is NOT IN dangerous when the subquery may return a NULL?
- Performance: NOT EXISTS vs LEFT JOIN/IS NULL vs EXCEPT?
Red flag Using `NOT IN` with a nullable column (a single NULL makes the whole predicate return no rows), or forgetting the `IS NULL` filter in the LEFT-JOIN form.
source: StrataScratch — Amazon 'Exclusive Amazon Products' ↗
Commonly asked mid concept common Why can a JOIN return more rows than either input table, and how do you avoid accidental row explosion?
A join multiplies rows wherever the join key is not unique on the other side: if one customer has 3 orders, joining customers->orders yields 3 rows for that customer. A many-to-many join multiplies both sides — fan-out.
This silently corrupts aggregates: SUM(amount) double-counts if you joined in a second one-to-many table first. Guard against it by joining on unique/PK columns, pre-aggregating one side in a CTE before joining, or checking the grain of every join.
Follow-ups they push on
- What is the 'grain' of a result set and why track it?
- A CROSS JOIN of 10k x 10k rows — how many rows, and when is that intentional?
Red flag Blaming 'duplicate data' when the real cause is joining on a non-unique key, or summing a measure after a fan-out join and reporting inflated totals.
source: StrataScratch — Amazon SQL Interview Questions ↗
Commonly asked senior debug common An index exists on the join column of one table but the JOIN is still slow. What index considerations apply to joins?
For a nested-loop join the engine iterates the outer table and probes the inner table once per row, so the index that matters is on the inner table's join column — the side being looked up. If only the outer table's column is indexed, each probe still scans the inner table.
Checklist: (1) index the inner/probed side's join key; (2) make the join columns the same type — an implicit cast (e.g. int vs varchar) makes the predicate non-sargable and skips the index; (3) for big unindexed equi-joins a hash join may be the right plan, not a fix; (4) read EXPLAIN to see whether it chose nested-loop vs hash and whether the index is actually used.
What a strong answer covers
- Nested-loop joins need the index on the inner (probed) table's join column.
- Mismatched column types force an implicit cast -> non-sargable -> index ignored.
- Both join columns should share a type and, ideally, collation.
- A hash join on large unindexed inputs can be the correct plan, not a bug.
- Use EXPLAIN to confirm which join algorithm and index the planner actually picked.
Follow-ups they push on
- Which table's column should carry the index in a nested-loop join?
- How does an int-vs-varchar join key defeat an index?
Red flag Indexing only the driving (outer) table and expecting fast probes, or joining columns of different types and silently losing the index to an implicit cast.
source: Use The Index, Luke! — Nested Loops / indexing joins ↗
Commonly asked senior debug common Why does summing a measure go wrong after joining two one-to-many tables, and how do you fix the double-counting?
Joining a parent to two child tables (orders has many line_items *and* many payments) creates a Cartesian fan-out: each order's rows = items x payments. Now SUM(payment.amount) is multiplied by the number of line items, and SUM(item.qty) is multiplied by the number of payments — every total is inflated.
Fix by pre-aggregating each child to the parent's grain in its own subquery/CTE before joining: WITH it AS (SELECT order_id, SUM(qty) q FROM items GROUP BY order_id), pm AS (SELECT order_id, SUM(amount) a FROM payments GROUP BY order_id) SELECT … FROM orders o LEFT JOIN it … LEFT JOIN pm …. Each child is now one row per order, so no fan-out. Always know the grain of each table you join.
What a strong answer covers
- Joining one parent to two one-to-many children multiplies rows (items x payments).
- Aggregates over the fanned-out rows double/triple-count.
- Fix: pre-aggregate each child to the parent grain in separate CTEs/subqueries, *then* join.
- COUNT(DISTINCT …) can patch a single measure but doesn't fix multiple measures cleanly.
- Track the 'grain' (one row per what?) at every join step.
Follow-ups they push on
- Why doesn't COUNT(DISTINCT) fully solve it when you need two sums?
- What is the 'grain' of a result set and how do you reason about it?
Red flag Joining several one-to-many tables in one flat query and trusting SUM — the totals are inflated by the cross-product of the child rows.
source: StrataScratch — SQL JOIN Interview Questions ↗
Commonly asked senior concept occasional Explain the three physical join algorithms (nested loop, hash join, merge join) and when a planner picks each.
Nested loop: for each outer row, probe the inner table — great when one side is tiny or there's an index on the inner join key; O(n*m) without an index.
Hash join: build a hash table on the smaller input's key, probe with the larger — best for large, unindexed equality joins; needs memory and only does equi-joins.
Merge join: sort both inputs on the key, then walk them in lockstep — wins when inputs are already sorted (e.g. from an index) or for range conditions.
The planner chooses by estimated row counts and available indexes; you see them in EXPLAIN.
Follow-ups they push on
- Why can't a hash join serve `a.x < b.y`?
- How does a missing index push a join from nested-loop to a costly hash join?
Red flag Thinking the JOIN keyword maps to one fixed algorithm — the optimizer picks the physical operator based on stats and indexes.
source: PostgreSQL docs — Planner / Optimizer (join methods) ↗
Meta senior coding occasional For each Friday, count the total likes a post received from the poster's friends, where the like happened after the post was created.
Friendship is usually stored one-directional, so first symmetrize it with UNION ALL of (a,b) and (b,a). Then join posts to that friend list and to likes, requiring the liker to be a friend and like_ts > post_ts, and filter to Fridays:
... WHERE EXTRACT(DOW FROM like_date) = 5 AND like_ts > post_ts then GROUP BY like_date. Use COUNT(DISTINCT …) if a friend could like the same post twice.
This is a Meta-style 'SQL as a tool for product reasoning' question: the schema modelling (bidirectional friendship, temporal ordering) is the real test.
Follow-ups they push on
- Why UNION ALL rather than UNION when symmetrizing friendships?
- How does the day-of-week number differ across MySQL/Postgres?
Red flag Treating friendship as already bidirectional and undercounting, or forgetting the `like_ts > post_ts` temporal guard.
source: StrataScratch — Meta "Friday's Likes Count" ↗

3.3 Advanced querying 14

★ must-know Amazon mid trick common Why can't you put a window function in a WHERE clause, and how do you filter on its result?
Window functions are computed in the SELECT step, which logically runs *after* WHERE, GROUP BY, and HAVING. So WHERE rn = 1 referencing ROW_NUMBER() … AS rn errors — the window result doesn't exist yet when WHERE is evaluated.
The fix is to compute the window function in an inner query (a subquery or CTE) and filter on its alias in the outer query: WITH r AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn FROM emp) SELECT * FROM r WHERE rn = 1;. This 'rank-then-filter' wrapper is the single most common window-function pattern in interviews.
What a strong answer covers
- Window functions evaluate in SELECT, after WHERE/GROUP BY/HAVING.
- Referencing a window alias in the same query's WHERE/HAVING is an error.
- Wrap it in a CTE/subquery and filter on the alias in the outer query.
- This 'rank in inner, filter in outer' is the top-N-per-group backbone.
Quick self-check
You want the single highest-paid employee per department. Which is valid?
Follow-ups they push on
- Could you ever use a window function in HAVING? (no — same reason)
- How does this relate to the top-N-per-group pattern?
Red flag Writing `WHERE ROW_NUMBER() OVER (...) = 1` directly and being surprised by a syntax/semantic error instead of wrapping it in a subquery.
source: PostgreSQL docs — Window Function Processing ↗
Commonly asked junior coding very common Find the second-highest distinct salary in an Employee table; return NULL if there isn't one.
Order distinct salaries and skip the top one:
SELECT (SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1) AS second_highest;
Wrapping it in an outer SELECT makes the result NULL (not an empty set) when there's no second salary. Alternative: DENSE_RANK() OVER (ORDER BY salary DESC) and keep rank = 2. DISTINCT/DENSE_RANK matters so duplicate top salaries don't count as two ranks.
Follow-ups they push on
- Generalize to the Nth-highest salary.
- Why DENSE_RANK rather than RANK or ROW_NUMBER here?
Red flag Using `MAX(salary) WHERE salary < MAX(salary)` incorrectly, or forgetting DISTINCT so two employees tied at the top hide the real second salary; also returning an empty set instead of NULL.
source: LeetCode 176 — Second Highest Salary ↗
Commonly asked mid concept common What's the difference between EXISTS and IN with a subquery, and when does each win?
IN (subquery) materializes the subquery's values and checks membership; EXISTS (subquery) is a correlated semi-join that returns true as soon as one matching row is found (short-circuits).
Semantically the big difference is NULL handling: NOT IN returns no rows if the subquery yields a NULL, whereas NOT EXISTS is NULL-safe — so prefer NOT EXISTS for anti-joins. Performance-wise, modern optimizers often rewrite both into the same semi-/anti-join, but EXISTS tends to win when the subquery is large (it can stop early) and IN reads fine for small, NULL-free value lists. Use EXISTS when you only test *existence*; use IN for a short, known set.
What a strong answer covers
- IN tests membership in a value set; EXISTS tests whether any correlated row exists (short-circuits).
- NOT IN + a NULL in the subquery returns zero rows; NOT EXISTS is NULL-safe.
- Optimizers frequently rewrite both to semi-joins, so results — not raw form — usually drive the plan.
- Rule of thumb: EXISTS for existence tests / large subqueries; IN for small NULL-free lists.
Quick self-check
`SELECT * FROM a WHERE a.x NOT IN (SELECT b.y FROM b)` where `b.y` contains one NULL. Result?
Follow-ups they push on
- Show the NULL case where NOT IN and NOT EXISTS diverge.
- Why can EXISTS stop scanning after the first match?
Red flag Treating IN and EXISTS as always identical and getting burned by `NOT IN` with a nullable subquery column returning no rows.
source: PostgreSQL docs — Subquery Expressions (EXISTS / IN) ↗
Commonly asked mid coding occasional Pivot a tall table (one row per month) into a wide one (a column per month) in SQL.
The portable, engine-agnostic way is conditional aggregation — SUM(CASE WHEN …) per target column:
SELECT product, SUM(CASE WHEN month = 'Jan' THEN revenue END) AS jan, SUM(CASE WHEN month = 'Feb' THEN revenue END) AS feb FROM sales GROUP BY product;
Each CASE isolates one month's value; the GROUP BY collapses to one row per product. You must enumerate the target columns explicitly — SQL's result shape is fixed at plan time, so a truly dynamic pivot needs generated SQL or an engine extension (Postgres crosstab, SQL Server PIVOT).
What a strong answer covers
- Conditional aggregation: one SUM(CASE WHEN key = 'X' THEN val END) per output column.
- GROUP BY the row dimension; each CASE picks out one pivot value.
- Output columns must be hard-coded — SQL can't return a runtime-variable number of columns.
- Dynamic pivots need generated SQL or extensions (Postgres crosstab, T-SQL PIVOT).
Follow-ups they push on
- How would you handle a column set that isn't known until query time?
- How do you un-pivot (wide back to tall)?
Red flag Expecting a single SQL statement to produce a dynamic, data-dependent number of columns — the column list is fixed at plan time.
source: PostgreSQL docs — tablefunc (crosstab / pivot) ↗
Commonly asked mid concept occasional Compare INTERSECT, EXCEPT (MINUS), and UNION — and how do they handle duplicates?
All three are set operators combining two result sets with matching column counts/types, and all remove duplicates by default (each has an ALL variant to keep them):
- UNION — rows in either set.
- INTERSECT — rows in both sets.
- EXCEPT (Oracle calls it MINUS) — rows in the first set not in the second.
They compare whole rows and treat NULLs as equal to each other for this purpose (unlike =). EXCEPT is a clean way to express an anti-join, and INTERSECT a semi-join, when you're comparing identically-shaped queries.
What a strong answer covers
- UNION = either, INTERSECT = both, EXCEPT/MINUS = first-minus-second.
- All dedup by default; UNION ALL / INTERSECT ALL / EXCEPT ALL keep duplicates.
- They match on the entire row and treat NULL = NULL (unlike =).
- EXCEPT is a tidy anti-join; INTERSECT a tidy semi-join for same-shaped queries.
- Oracle uses MINUS; most others use EXCEPT.
Quick self-check
`SELECT id FROM a EXCEPT SELECT id FROM b` returns…
Follow-ups they push on
- How do set operators treat NULLs differently from a `=` comparison?
- Rewrite an EXCEPT query as a NOT EXISTS anti-join.
Red flag Forgetting these dedup by default (surprising row counts), or assuming `EXCEPT` exists in Oracle, where it's `MINUS`.
source: PostgreSQL docs — Combining Queries (UNION/INTERSECT/EXCEPT) ↗
Commonly asked mid coding occasional Use NTILE / percentile window functions to bucket users into quartiles by spend.
NTILE(n) splits ordered rows into n roughly-equal buckets and labels each row 1..n:
SELECT user_id, spend, NTILE(4) OVER (ORDER BY spend DESC) AS quartile FROM users;
Quartile 1 is the top quarter of spenders. NTILE distributes any remainder to the earliest buckets, so groups can differ by one row. If you instead want a *value* threshold (the spend at the 90th percentile), use PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY spend) (an ordered-set aggregate), not NTILE — NTILE buckets *rows*, percentiles compute a *value*.
What a strong answer covers
- NTILE(n) OVER (ORDER BY …) assigns each row a bucket number 1..n of near-equal size.
- Uneven counts: the first buckets get the extra rows.
- NTILE labels *rows by rank position*; it does not compute a threshold value.
- For a percentile *value*, use PERCENTILE_CONT/PERCENTILE_DISC … WITHIN GROUP.
Follow-ups they push on
- Difference between NTILE(4) and PERCENTILE_CONT(0.25)?
- How does NTILE distribute rows when the count isn't divisible by n?
Red flag Using NTILE to get a percentile *threshold value* (it returns bucket labels, not the value at a percentile) or assuming all NTILE buckets have exactly equal size.
source: PostgreSQL docs — Window Functions (NTILE) & Aggregate (percentile) ↗
Commonly asked mid concept common What's the difference between a correlated and a non-correlated subquery, and why does it matter for performance?
A non-correlated subquery is self-contained — it runs once and its result is reused (e.g. WHERE salary > (SELECT AVG(salary) FROM emp)).
A correlated subquery references a column from the outer query, so conceptually it re-runs once per outer row (e.g. WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept)). That can be O(n) executions and slow, though modern planners often rewrite simple cases into joins.
Follow-ups they push on
- Rewrite a correlated subquery as a JOIN or window function.
- When is EXISTS preferable to IN with a subquery?
Red flag Calling every subquery 'correlated', or assuming a correlated subquery always re-executes literally (optimizers may decorrelate it).
source: LeetCode 185 — Department Top Three Salaries (correlated subquery) ↗
Commonly asked mid concept common When would you use a CTE (WITH clause) over a subquery or a temp table?
A CTE names an intermediate result so you can reference it (sometimes multiple times) and read the query top-to-bottom — mainly a readability win, and the only way to write a recursive query (WITH RECURSIVE).
Vs a subquery: same logic, clearer structure. Vs a temp table: a CTE is scoped to the single statement and (usually) not materialized to disk. Note: in some engines a CTE is an optimization fence (older Postgres materialized them); Postgres 12+ inlines non-recursive CTEs unless you say MATERIALIZED.
Follow-ups they push on
- Write a recursive CTE to walk an org hierarchy.
- When does a CTE act as an optimization barrier?
Red flag Claiming CTEs are always faster — pre-12 Postgres materialized them, which could be slower than an inlined subquery.
source: PostgreSQL docs — WITH Queries (Common Table Expressions) ↗
Meta mid concept very common How does a window function differ from GROUP BY?
GROUP BY collapses each group into one row — you lose the individual rows. A window function (… OVER (PARTITION BY …)) computes an aggregate/rank across a window of rows but keeps every row, attaching the result alongside.
So to show each employee *and* their department's average salary in the same row, you need AVG(salary) OVER (PARTITION BY dept), not GROUP BY. Window functions also give you ROW_NUMBER/RANK/LAG/LEAD and running totals, which GROUP BY can't express.
Follow-ups they push on
- Give a running total with `SUM(x) OVER (ORDER BY d)`.
- Difference between PARTITION BY and a plain GROUP BY?
Red flag Saying they're interchangeable — GROUP BY reduces row count, a window function preserves it.
source: PostgreSQL docs — Window Functions ↗
Amazon mid concept very common What's the difference between ROW_NUMBER, RANK, and DENSE_RANK on tied values?
On a tie of two rows ranked 1st:
- ROW_NUMBER — always unique, arbitrary among ties: 1, 2, 3, 4 …
- RANK — ties share a rank, then it skips: 1, 1, 3, 4 …
- DENSE_RANK — ties share a rank, no gap: 1, 1, 2, 3 …
Pick ROW_NUMBER for 'one row per group / dedup', RANK/DENSE_RANK for leaderboards. 'Top 3 salaries including ties' usually wants DENSE_RANK <= 3.
Follow-ups they push on
- Which one for 'top N salaries, ties count as one place'?
- How to make ROW_NUMBER deterministic when the ORDER BY has ties?
Red flag Using ROW_NUMBER for a 'top N including ties' question and arbitrarily dropping tied rows, or confusing RANK's gaps with DENSE_RANK's continuity.
source: StrataScratch — Amazon 'Top-Rated Support Employees' (DENSE_RANK) ↗
Commonly asked mid coding common Write a running (cumulative) total of daily sales ordered by date.
A windowed SUM with an ORDER BY gives a running total:
SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM sales;
Adding ORDER BY inside OVER switches the default frame from 'whole partition' to 'start … current row', which is exactly a cumulative sum. Add PARTITION BY region to get one running total per region.
Follow-ups they push on
- Default window frame with vs without ORDER BY?
- RANGE vs ROWS for the frame — when do they differ?
Red flag Omitting ORDER BY in OVER (you get the grand total on every row, not a running one), or being surprised by RANGE's behavior on duplicate dates.
source: PostgreSQL docs — Window Function Calls (frames) ↗
Amazon senior coding common Compute the month-over-month percentage change in revenue using a window function.
Aggregate to monthly revenue, then use LAG to reach the previous month:
WITH m AS (SELECT DATE_TRUNC('month', tx) AS mth, SUM(amount) AS rev FROM orders GROUP BY 1) SELECT mth, ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY mth)) / LAG(rev) OVER (ORDER BY mth), 2) AS pct_change FROM m ORDER BY mth;
LAG(rev) OVER (ORDER BY mth) pulls the prior row's value; the first month is NULL (no prior). Multiply by 100.0 to force float division.
Follow-ups they push on
- Use LEAD instead — what changes?
- Why might integer division give you 0% everywhere?
Red flag Integer division truncating the ratio to 0, or self-joining the table to itself on month-1 instead of the cleaner LAG.
source: StrataScratch — Amazon 'Monthly Percentage Difference' ↗
Meta senior coding occasional Find users with three or more consecutive days of activity (a gap-and-islands problem).
Classic 'gaps and islands': subtract a ROW_NUMBER from the date to give every consecutive run the same anchor:
WITH d AS (SELECT user_id, day, day - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day))::int AS grp FROM activity) SELECT user_id, COUNT(*) AS streak FROM d GROUP BY user_id, grp HAVING COUNT(*) >= 3;
Within one user, on consecutive days both day and the row number increase by 1, so day - row_number is constant across a streak and changes at each gap — grouping on it isolates each island.
Follow-ups they push on
- Adapt it to find the *longest* streak per user.
- How would LAG-based gap detection compare?
Red flag Trying to detect consecutiveness with a single self-join on day+1 (breaks for runs longer than 2), or forgetting to PARTITION BY user.
source: StrataScratch — Meta 'User Streaks' (gap-and-island, LAG/DENSE_RANK) ↗
Commonly asked senior coding common What is a recursive CTE, and how does it walk an org hierarchy down to arbitrary depth?
A recursive CTE has two parts joined by UNION [ALL]: an anchor (the starting rows) and a recursive member that references the CTE itself, iterating until it adds no new rows.
WITH RECURSIVE chain AS (SELECT id, name, manager_id, 1 AS lvl FROM emp WHERE id = :root UNION ALL SELECT e.id, e.name, e.manager_id, c.lvl + 1 FROM emp e JOIN chain c ON e.manager_id = c.id) SELECT * FROM chain;
The anchor seeds the root; each pass joins employees onto the rows found so far, descending one level. It's the standard way to traverse trees/graphs (org charts, category trees, bill-of-materials) that a fixed number of self-joins can't handle.
What a strong answer covers
- Structure: anchor member UNION ALL recursive member that references the CTE.
- The recursive step runs repeatedly, feeding its output back in, until it produces no new rows.
- Used for arbitrary-depth trees/graphs: org charts, category trees, BOM, threaded comments.
- Guard against cycles (a depth cap or a visited-set) or the recursion never terminates.
Follow-ups they push on
- How do you prevent infinite recursion if the hierarchy has a cycle?
- UNION vs UNION ALL in a recursive CTE — what changes?
Red flag Forgetting the termination behavior on cyclic data (infinite loop) or trying to express arbitrary depth with a fixed chain of self-joins.
source: PostgreSQL docs — WITH Queries (Recursive Queries) ↗

3.4 Indexes & query performance 15

★ must-know Commonly asked mid concept common What is the difference between a clustered and a non-clustered (secondary) index, and how does it affect lookups?
A clustered index *is* the table: the rows are physically stored in the index's key order (the leaf nodes hold the full row). There's at most one per table — in MySQL/InnoDB it's the primary key. A range scan on the clustered key reads contiguous data, which is very fast.
A non-clustered (secondary) index is a separate structure whose leaves hold the key plus a pointer back to the row. So a secondary-index lookup that needs columns not in the index does a second hop — a bookmark lookup (InnoDB: look up the PK, then the clustered index). That's why a *covering* secondary index (one that includes all needed columns) is so much faster: it skips the second hop.
What a strong answer covers
- Clustered index = the table's rows stored in key order; at most one per table (InnoDB: the PK).
- Secondary index = separate B-tree of key -> row locator; many allowed.
- A secondary lookup needing extra columns does a second fetch (bookmark / clustered-index lookup).
- InnoDB secondary indexes store the PK as the row pointer — so a fat PK bloats every secondary index.
- Postgres heap tables differ: there's no clustered index, just heap + index TIDs.
Quick self-check
In MySQL/InnoDB, what does a secondary index's leaf node store as the pointer to the full row?
Follow-ups they push on
- In InnoDB, why does a large primary key make every secondary index bigger?
- How does Postgres's storage model differ from InnoDB's clustered table?
Red flag Assuming every index is a separate copy with a fixed pointer (engine-specific), or ignoring that a secondary lookup may need a costly second fetch to the base row.
source: MySQL docs — Clustered and Secondary Indexes ↗
Commonly asked junior concept common What is connection pooling and what problem does it solve?
Opening a DB connection is expensive — TCP handshake, TLS, auth, a backend process/thread. Doing it per request adds latency and can exhaust the DB's connection limit under load.
A connection pool keeps a set of pre-opened connections and hands one to each request, returning it to the pool when done. This caps total connections (protecting the DB), amortizes setup cost, and smooths spikes. Tools: PgBouncer (external), HikariCP (Java), plus most ORMs/drivers' built-in pools.
Follow-ups they push on
- How do you size a pool, and why isn't bigger always better?
- Transaction vs session pooling mode in PgBouncer?
Red flag Thinking a bigger pool always means more throughput — past the DB's CPU/IO capacity it causes contention; serverless functions multiplying pools is a classic source of connection exhaustion.
source: PostgreSQL docs — Connections and Authentication ↗
Commonly asked mid concept very common What is a B-tree index and why does it support range queries when a hash index doesn't?
A B-tree keeps keys in sorted order across a balanced, shallow tree, giving O(log n) lookups. Because the keys are ordered, the engine can do equality and range scans (>, <, BETWEEN), prefix LIKE 'abc%', and serve ORDER BY for free.
A hash index maps a key to a bucket via a hash — O(1) equality lookups, but the hash destroys ordering, so it can only do =, never ranges or sorts. B-tree is the default for exactly this versatility.
Follow-ups they push on
- Why can a B-tree help `ORDER BY` avoid a sort step?
- When is a hash index actually the better choice?
Red flag Saying hash indexes are 'always faster' (only for point equality) or that B-trees are O(1).
source: Use The Index, Luke! — Anatomy of an Index ↗
Commonly asked mid concept occasional What's the difference between a B-tree, a hash, and a GIN/inverted index, and what is each best for?
B-tree — the default; ordered keys, serves equality, ranges, prefix LIKE, and ORDER BY. Use for almost everything scalar.
Hash — O(1) equality only, no ranges or ordering; rarely worth it over a B-tree except niche equality-heavy cases.
GIN (Generalized Inverted Index) — maps each *element* of a composite value to the rows containing it, so it's built for 'contains' queries over multi-valued columns: full-text search (tsvector), JSONB containment (@>), array membership. (GiST is its cousin for ranges/geometry/nearest-neighbor.)
Choose by the query shape: scalar ranges/sorts -> B-tree; element-membership in documents/arrays/text -> GIN.
What a strong answer covers
- B-tree: ordered, the all-purpose default (equality, range, sort, prefix LIKE).
- Hash: equality-only, no ordering — niche.
- GIN: inverted index for multi-valued columns — full-text, JSONB @>, array containment.
- GiST: ranges, geometry, nearest-neighbor / fuzzy search.
- Match the index type to the query operator, not by habit.
Follow-ups they push on
- Why can't a B-tree efficiently answer JSONB `@>` containment?
- When would you reach for GiST over GIN?
Red flag Putting a plain B-tree on a JSONB or array column and wondering why containment queries don't use it — those need a GIN index.
source: PostgreSQL docs — Index Types ↗
Commonly asked mid concept very common Explain the leftmost-prefix rule for a composite index on (a, b, c). Which queries can it serve?
A concatenated index is sorted by a, then b, then c — like a phone book by surname then first name. So it serves queries that filter on a leading prefix: a; a AND b; a AND b AND c.
It does not efficiently serve b alone or (b, c) — there's no leading a, so the order is useless and the engine falls back to a scan. Practical rule: put equality/most-selective columns first, the column you range/sort on last.
Follow-ups they push on
- Where do you place a column you do a range scan on within the index?
- Can the index still help an `a = ? AND c = ?` query (skipping b)?
Red flag Believing a composite index helps any subset of its columns, especially trailing ones like `(b, c)` without `a`.
source: Use The Index, Luke! — Concatenated Keys ↗
Commonly asked mid concept common What is a covering index / index-only scan, and why is it fast?
A covering index contains every column a query needs — both the filter/sort columns and the selected columns — so the engine can answer entirely from the index and never touch the heap/table. That's an 'index-only scan'.
It's fast because it skips the random table fetch per matched row (the most expensive part of an index lookup). Postgres lets you tack non-key payload columns on with INCLUDE (...); MySQL/InnoDB exposes it via the EXPLAIN 'Using index' note.
Follow-ups they push on
- Difference between key columns and INCLUDE columns in a covering index?
- Why might Postgres still hit the heap despite a covering index (visibility map)?
Red flag Thinking any index that matches the WHERE is 'covering' — it must also contain the SELECTed columns to avoid the table fetch.
source: Use The Index, Luke! — Index-Only Scan ↗
Commonly asked mid concept common When do indexes hurt rather than help?
Indexes are not free:
- Writes slow down — every INSERT/UPDATE/DELETE must update every affected index.
- Storage — each index is a copy of its columns plus row pointers.
- Low cardinality — an index on a boolean/status with few distinct values rarely beats a scan (the planner may ignore it).
- Tiny tables — a seq scan of a few pages is faster than an index round-trip.
- Unused/redundant indexes still cost on every write.
So index for read patterns you actually have, and drop ones EXPLAIN never picks.
Follow-ups they push on
- How would you find unused indexes in production?
- Why might the planner ignore an index on a low-selectivity column?
Red flag 'Just index every column' — it bloats writes and storage and the planner won't use most of them.
source: Use The Index, Luke! — The Where Clause / index downsides ↗
Commonly asked mid concept very common What is the N+1 query problem and how do you fix it?
An ORM lazy-loads a relationship inside a loop: 1 query for the list of N parents, then 1 more query per parent to fetch its children — 1 + N round-trips. With 100 posts you fire 101 queries, each paying network + planning latency.
Fix with eager loading — pull the children in a single JOIN or batched IN query: Rails .includes, Django select_related/prefetch_related, Hibernate JOIN FETCH, SQLAlchemy joinedload. It's a query-count problem, not a slow-query problem; detect it by counting queries per request, not by EXPLAINing one.
Follow-ups they push on
- How would you detect N+1 in a running app?
- Trade-off: a single huge JOIN vs a batched IN(...) of two queries?
Red flag Trying to optimize the individual child query when the real issue is firing it N times; or not knowing the eager-load API for the ORM in use.
source: Use The Index, Luke! — N+1 problem (Join Operations) ↗
Commonly asked senior debug common Your query filters on `status = 'active'` (95% of rows) and the planner does a Seq Scan instead of using the index. Is that a bug?
No — that's the planner being correct. The predicate is low-selectivity: it matches almost every row, so an index scan would do millions of random single-row fetches plus the index read, which is *slower* than one sequential pass. Indexes win only when they eliminate most of the table.
If instead you query the rare value (status = 'pending', 0.1% of rows), the index becomes worthwhile — that asymmetry is why a partial index (CREATE INDEX … WHERE status = 'pending') is the right tool for skewed columns. Verify with EXPLAIN (ANALYZE, BUFFERS); if the planner *wrongly* avoids an index, suspect stale stats and run ANALYZE.
What a strong answer covers
- Low selectivity (matching most rows) makes a full scan cheaper than scattered index fetches.
- Indexes pay off when they exclude the large majority of rows.
- Skewed columns: a partial index on the rare value(s) beats a full-column index.
- If the planner avoids an index it *should* use, suspect stale statistics — run ANALYZE.
Quick self-check
A boolean `is_deleted` column is true for 0.5% of rows. The best index strategy for `WHERE is_deleted = true` is…
Follow-ups they push on
- What is selectivity, and roughly what threshold flips the planner to a seq scan?
- When does a partial index beat a full index on the same column?
Red flag Force-hinting an index onto a non-selective predicate and making the query slower, or assuming 'index not used' is always a problem.
source: PostgreSQL docs — Partial Indexes ↗
Commonly asked senior concept common For `WHERE status = 'active' AND created_at > '2024-01-01' ORDER BY created_at`, what composite index would you build and in what column order?
Index (status, created_at) — equality column first, range/sort column last. The leading status = narrows the index to the matching slice, and within that slice the entries are already ordered by created_at, so the engine satisfies both the range filter and the ORDER BY from the index with no separate sort.
Flip it to (created_at, status) and the leading range column scatters the status values, so it can't use the index for the equality efficiently and may need a sort. The rule (from Use The Index, Luke!): equalities first, then the one range/order-by column — and you only get a sort-free ORDER BY if its column trails the equality columns in the index.
What a strong answer covers
- Order: equality predicate columns first, then the range/ORDER BY column.
- (status, created_at) lets one index serve filter + range + ordering with no sort step.
- A leading range column ruins the ability to use trailing equality columns and to skip the sort.
- Only one range column can be 'used' to bound the scan; further columns only refine within it.
Quick self-check
Best single composite index for `WHERE status = 'active' AND created_at > :d ORDER BY created_at`:
Follow-ups they push on
- Why can an index serve ORDER BY only when the sort column trails the equality columns?
- What happens to this index if you add a second range predicate?
Red flag Putting the range/sort column before the equality column, which forces a scan-and-sort and wastes the composite index.
source: Use The Index, Luke! — The Equality-First Rule (concatenated keys, ORDER BY) ↗
Commonly asked senior concept common Why is keyset (seek) pagination better than OFFSET for deep pages?
LIMIT 20 OFFSET 100000 still reads and discards the first 100,000 rows before returning 20 — cost grows linearly with the page number, so deep pages crawl. It can also skip or repeat rows if data changes between page loads.
Keyset (seek) pagination remembers the last row's sort key and asks for the next slice directly: WHERE (created_at, id) < (:last_ts, :last_id) ORDER BY created_at DESC, id DESC LIMIT 20. With an index on the sort key the database *seeks* straight to the spot — constant time regardless of depth — and it's stable under concurrent inserts. The cost is you can't jump to an arbitrary page number, only next/previous.
What a strong answer covers
- OFFSET scans and throws away all skipped rows — O(offset), so deep pages get slower.
- Keyset filters on the last seen sort key and seeks via the index — roughly constant time.
- Keyset is stable when rows are inserted/deleted between page requests; OFFSET can skip/duplicate.
- Trade-off: keyset supports next/prev, not random 'jump to page N'.
- Needs a unique, indexed tiebreaker (e.g. id) appended to the sort key.
Follow-ups they push on
- Why include `id` as a tiebreaker in the keyset comparison?
- When is OFFSET pagination still acceptable?
Red flag Using OFFSET for infinite scroll / deep pages (slow and prone to skipping rows under concurrent writes) and not realizing seek pagination needs a unique sort tiebreaker.
source: Use The Index, Luke! — Paging Through Results (seek method) ↗
Commonly asked senior debug common The estimated rows in EXPLAIN say 12 but actual says 4,000,000. What's wrong and how do you fix it?
A large gap between estimated and actual rows means the planner is working from stale or missing statistics, so it's likely choosing a bad plan (e.g. a nested loop sized for 12 rows that actually runs 4M times).
First fix: ANALYZE the_table; (or VACUUM ANALYZE) to refresh the stats the planner samples. If it's still off, the column may have correlated predicates the default per-column stats can't model — create extended statistics (CREATE STATISTICS … (dependencies/ndistinct)), or raise the sampling resolution with ALTER TABLE … ALTER COLUMN … SET STATISTICS. Always read these numbers with EXPLAIN (ANALYZE, BUFFERS) so you compare estimate vs actual on the same run.
What a strong answer covers
- Estimate-vs-actual divergence = the planner's row-count model is wrong, usually stale stats.
- First action: ANALYZE to refresh statistics.
- Correlated columns defeat per-column stats — use extended statistics (CREATE STATISTICS).
- Bad estimates cause bad join-method/order choices (nested loop where a hash join was right).
- Use EXPLAIN (ANALYZE, BUFFERS) to see estimate, actual rows, and real I/O together.
Follow-ups they push on
- Why does autovacuum sometimes not keep stats fresh enough on a hot table?
- What are extended statistics and when do you need them?
Red flag Rewriting the query when the real problem is stale stats, or trusting the planner's row estimate without checking it against EXPLAIN ANALYZE's actual rows.
source: PostgreSQL docs — Row Estimation / Statistics Used by the Planner ↗
Commonly asked senior design common How would you find and fix slow queries in a production database?
Find: turn on query collection — pg_stat_statements (Postgres) or the slow query log (MySQL) — and sort by total time (frequency x latency), not just single slowest, since a moderately slow query run millions of times dominates. APM traces help spot N+1 patterns.
Diagnose: run EXPLAIN (ANALYZE, BUFFERS) on the worst offenders; look for seq scans on big tables, bad row estimates, nested loops over many rows, and high buffer reads.
Fix: add/adjust an index (composite, covering, partial), rewrite to be sargable, fix N+1 with eager loading, refresh stats with ANALYZE, or cache/materialize expensive aggregates. Then re-measure — optimize the query that costs the most aggregate time first.
What a strong answer covers
- Capture queries with pg_stat_statements / the slow query log; rank by total time, not single-run time.
- Diagnose the top offenders with EXPLAIN (ANALYZE, BUFFERS).
- Common fixes: indexing, sargable rewrites, fixing N+1, refreshing stats, caching/materializing.
- Re-measure after each change — never optimize blind.
- A medium-slow query run constantly often beats the single slowest in total cost.
Follow-ups they push on
- Why rank by total time rather than the single slowest query?
- How do you catch an N+1 that no single EXPLAIN reveals?
Red flag Optimizing the single slowest query while ignoring a moderately slow one executed orders of magnitude more often, or tuning without measuring before/after.
source: PostgreSQL docs — pg_stat_statements ↗
Commonly asked senior debug common You run EXPLAIN and see a Seq Scan with 'Rows Removed by Filter: 9,900,000'. What does that tell you and what do you do?
A sequential scan read the whole table and the filter threw away almost all of it — the query is selective but there's no index for it, so it's reading 10M rows to keep 100. Add an index on the filtered column(s) so the planner can do an index scan instead.
Read the plan bottom-up (inner nodes run first). Watch for a big gap between estimated and actual rows — that means stale statistics, so run ANALYZE. Remember the cost= numbers are arbitrary planner units, not milliseconds; use EXPLAIN (ANALYZE, BUFFERS) for real timings.
Follow-ups they push on
- Estimated rows say 5, actual say 5,000,000 — what's wrong and what's the fix?
- When is a Seq Scan actually the right plan?
Red flag Reading cost as milliseconds, ignoring the estimate-vs-actual divergence, or 'optimizing' a query the planner already handles well.
source: PostgreSQL docs — Using EXPLAIN ↗
Commonly asked senior debug common A query filters `WHERE LOWER(email) = 'a@b.com'` (or `WHERE created_at::date = '2024-01-01'`) and ignores the index on the column. Why, and how do you fix it?
Wrapping the indexed column in a function makes the predicate non-sargable — the index is sorted on email, not on LOWER(email), so the engine can't use it and seq-scans.
Fixes: (1) create a functional/expression index matching the expression: CREATE INDEX ON users (LOWER(email));. (2) Rewrite to keep the column bare: for the date case, WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02' is sargable and uses the plain index. Same trap with leading-wildcard LIKE '%x'.
Follow-ups they push on
- Why is `LIKE 'abc%'` sargable but `LIKE '%abc'` not?
- Implicit type casts (string column compared to a number) — same problem?
Red flag Not recognizing that a function/cast on the indexed column defeats the index, and reaching for query hints instead of an expression index or a sargable rewrite.
source: Use The Index, Luke! — Functions / sargable predicates ↗

3.5 Schema design & transactions 14

★ must-know Commonly asked senior concept common Optimistic vs pessimistic concurrency control — how do they work and when do you pick each?
Pessimistic: assume conflicts are likely, so lock the row up front (SELECT … FOR UPDATE) and hold it until commit; others wait. Correct and simple, but locks reduce concurrency and risk deadlocks and lock-wait timeouts.
Optimistic: assume conflicts are rare, so don't lock — read a version/timestamp, and at write time do UPDATE … WHERE id = ? AND version = :read_version. If zero rows update, someone else changed it: abort and retry. No locks held during the user's think-time.
Pick pessimistic for high contention / short critical sections where retries would thrash; optimistic for low contention and long read-think-write cycles (web forms, APIs) where holding a lock across a round-trip is unacceptable.
What a strong answer covers
- Pessimistic = lock first (FOR UPDATE); others block until commit.
- Optimistic = no lock; detect conflict at write via a version/timestamp check, then retry.
- High contention favors pessimistic (avoid retry storms); low contention favors optimistic.
- Optimistic avoids holding a lock across user think-time / network round-trips.
- Both need a transaction; optimistic additionally needs retry logic in the app.
Quick self-check
A web 'edit profile' form is open for minutes before submit; conflicts are rare. Best concurrency strategy?
Follow-ups they push on
- How does a `version` column implement optimistic locking?
- Why can optimistic locking thrash under high contention?
Red flag Using optimistic locking under heavy contention (constant retry/abort churn), or holding a pessimistic lock across a user's think-time and serializing everyone.
source: PostgreSQL docs — Concurrency Control / Explicit Locking ↗
Commonly asked junior concept common What does referential integrity mean, and what are ON DELETE CASCADE / RESTRICT / SET NULL?
Referential integrity is the guarantee that a foreign key always points at a row that exists (or is NULL) — you can't have an order for a customer who was deleted. The DB enforces it for you.
The ON DELETE (and ON UPDATE) clause decides what happens to children when the parent is deleted:
- RESTRICT / NO ACTION — block the delete if children exist (the safe default).
- CASCADE — delete the children too.
- SET NULL — keep children but null out their FK (requires a nullable column).
- SET DEFAULT — set the FK to its default.
Choose CASCADE only when children are truly owned by the parent (an order's line items); use RESTRICT for shared/important references to avoid accidental mass deletes.
What a strong answer covers
- Referential integrity: every FK value must match an existing PK (or be NULL).
- RESTRICT/NO ACTION blocks deleting a parent that still has children.
- CASCADE deletes the children with the parent — powerful but easy to mass-delete by accident.
- SET NULL/SET DEFAULT keep the child but clear/replace its FK.
- FK enforcement requires an index (often the child FK column) for the check to be efficient.
Follow-ups they push on
- Why is CASCADE risky in production, and how do you make deletes auditable?
- Does the child's FK column need its own index? (yes, for the check and for joins)
Red flag Adding `ON DELETE CASCADE` everywhere and triggering a surprise mass-delete, or forgetting that SET NULL needs the FK column to be nullable.
source: PostgreSQL docs — Foreign Keys (referential actions) ↗
Commonly asked mid concept common Explain 1NF, 2NF, and 3NF each in a sentence, with an example violation.
1NF — atomic columns, no repeating groups/arrays in a cell (violated by a comma-separated phones column).
2NF — 1NF plus no non-key column depends on only part of a composite key (in (order_id, product_id) -> product_name, product_name depends on product_id alone — split it out).
3NF — 2NF plus no transitive dependency: non-key columns depend only on the key (storing zip and city, where zip -> city, is a transitive dependency; move it to a zip table).
Mnemonic: 'the key, the whole key, and nothing but the key.'
Follow-ups they push on
- What does BCNF add over 3NF?
- Give an anomaly (insert/update/delete) that normalization removes.
Red flag Reciting the names without being able to name a concrete violation, or conflating 2NF (partial dependency) with 3NF (transitive dependency).
source: Wikipedia — Database normalization ↗
Commonly asked mid concept common What is the difference between surrogate and natural primary keys, and what are the trade-offs?
A natural key is a real-world attribute already unique (SSN, ISBN, email, country code). A surrogate key is a system-generated, meaningless identifier (auto-increment id, UUID) added solely to identify the row.
Surrogates win in practice: they're stable (a natural key like email can change, breaking every FK referencing it), compact, and uniform. Naturals avoid an extra column and can prevent duplicate business rows. Common pattern: use a surrogate PK for joins/FKs and a UNIQUE constraint on the natural key to enforce business uniqueness. Note the UUID choice matters: random UUIDv4 PKs fragment a clustered index (random insert order); UUIDv7/ULID are time-ordered to avoid that.
What a strong answer covers
- Natural key = meaningful real-world attribute; surrogate = synthetic id (serial/UUID).
- Surrogates are stable under business changes; natural keys can mutate and cascade.
- Best practice: surrogate PK + a UNIQUE constraint on the natural key.
- Random UUIDv4 as a clustered PK hurts insert locality; prefer UUIDv7/ULID or bigserial.
Follow-ups they push on
- Why does a random UUIDv4 primary key hurt write performance on a clustered table?
- When is a composite natural key genuinely the better PK?
Red flag Using a mutable natural key (email/phone) as the PK so a single change cascades through every foreign key, or choosing random UUIDv4 PKs and fragmenting the clustered index.
source: Wikipedia — Surrogate key ↗
Commonly asked mid concept common When would you deliberately denormalize a schema?
Denormalize to trade write/consistency cost for read speed when reads dominate and joins are the bottleneck. Common cases: duplicating a category_name onto an orders table to avoid a join on every report; precomputed counts/totals (a comment_count column) to skip aggregation; materialized views; read-optimized analytics tables.
The cost: every duplicated fact must be kept in sync on write (triggers, app logic, or background jobs), risking drift. Rule of thumb: normalize first for correctness, denormalize surgically where a measured read path demands it.
Follow-ups they push on
- How do you keep denormalized copies consistent?
- Materialized view vs a denormalized column — trade-offs?
Red flag Denormalizing prematurely 'for performance' without a measured hot path, then fighting update anomalies and data drift.
source: Wikipedia — Denormalization ↗
Amazon mid concept very common What does ACID stand for, and what does each property actually guarantee?
Atomicity — a transaction is all-or-nothing; partial failure rolls the whole thing back.
Consistency — a committed transaction moves the DB from one valid state to another, preserving constraints/invariants.
Isolation — concurrent transactions don't see each other's uncommitted, in-flight state (degree set by the isolation level).
Durability — once committed, the change survives a crash (write-ahead log / fsync).
Classic example: a bank transfer must debit and credit atomically, never leaving money half-moved.
Follow-ups they push on
- Which property does the isolation level tune?
- How is durability implemented (WAL / fsync)?
Red flag Conflating ACID's 'Consistency' (constraint preservation) with the distributed-systems 'consistency' of CAP — different concepts.
source: PostgreSQL docs — Transactions ↗
Commonly asked mid concept common ORM vs raw SQL — what are the trade-offs and when do you drop to raw SQL?
ORM wins on productivity and safety: less boilerplate, parameterized queries (SQL-injection resistant by default), migrations, mapping rows to objects, DB portability.
Raw SQL wins on control and performance: complex joins, window functions, CTEs, query-plan tuning, and bulk operations the ORM expresses poorly or N+1's.
Practical stance: ORM for the 90% of CRUD, drop to raw/handwritten SQL (most ORMs allow it) for hot, complex, or analytical queries. The ORM's biggest footgun is hidden N+1 queries.
Follow-ups they push on
- How does an ORM protect against SQL injection?
- Name an ORM performance pitfall besides N+1.
Red flag Treating it as religious all-or-nothing, or not knowing the ORM's N+1 / lazy-loading traps and over-fetching.
source: StrataScratch — SQL Interview Questions: The Ultimate Guide ↗
Commonly asked senior concept occasional What is BCNF and how does it differ from 3NF? Give a case where a table is in 3NF but not BCNF.
BCNF (Boyce-Codd Normal Form) is a stricter 3NF: for *every* non-trivial functional dependency X -> Y, X must be a superkey. 3NF allows a narrow exception — a dependency is OK if its right side is a *prime* attribute (part of some candidate key) — and BCNF removes that exception.
The textbook case needs overlapping candidate keys. Table (student, course, instructor) where each course is taught by one instructor (instructor -> course) and a student takes a course with one instructor ({student, course} -> instructor). Candidate keys are {student, course} and {student, instructor}. The dependency instructor -> course has a non-superkey left side, so it violates BCNF — yet the table is in 3NF because course is a prime attribute. Fix: split into (instructor, course) and (student, instructor).
What a strong answer covers
- BCNF: every non-trivial FD's determinant (left side) must be a superkey — no exceptions.
- 3NF permits a dependency whose right side is a prime (key) attribute; BCNF forbids it.
- Violations require overlapping/composite candidate keys.
- BCNF decomposition can occasionally sacrifice dependency-preservation — a real trade-off.
Follow-ups they push on
- Why is dependency preservation sometimes lost when decomposing to BCNF?
- When is staying at 3NF the pragmatic choice over BCNF?
Red flag Claiming 3NF and BCNF are identical — they diverge precisely when a non-key attribute determines a prime attribute under overlapping candidate keys.
source: Wikipedia — Boyce-Codd normal form ↗
Commonly asked senior concept occasional Why should long-running transactions be avoided, especially under MVCC?
Under MVCC, an UPDATE/DELETE doesn't overwrite — it creates a new row version and leaves the old one as a 'dead tuple' until no transaction could still need it. A long-running (or idle-in-transaction) transaction holds an old snapshot open, so the vacuum/garbage-collector can't reclaim those dead tuples — leading to table/index bloat, slower scans, and transaction-ID wraparound pressure in Postgres.
Long transactions also hold locks longer (more contention and deadlock risk) and amplify lost-update windows. The fix: keep transactions short, never leave one open across user think-time or external API calls, batch large mutations, and watch for idle in transaction connections.
What a strong answer covers
- MVCC keeps old row versions until no open snapshot needs them.
- A long/idle transaction pins an old snapshot, blocking VACUUM from reclaiming dead tuples -> bloat.
- It also holds locks longer (contention, deadlocks) and, in Postgres, raises wraparound risk.
- Keep transactions short; never span user think-time or slow external calls; batch big writes.
Follow-ups they push on
- What is 'idle in transaction' and why is it dangerous?
- How does table bloat hurt query performance, and how do you measure it?
Red flag Opening a transaction, then making a slow external API call or waiting on user input inside it — pinning the MVCC snapshot, blocking vacuum, and bloating the table.
source: PostgreSQL docs — Routine Vacuuming (dead tuples / bloat) ↗
Commonly asked senior concept very common Define dirty read, non-repeatable read, and phantom read, and map each to the isolation level that prevents it.
Dirty read — you read another transaction's uncommitted change (which may be rolled back). Prevented at READ COMMITTED and above.
Non-repeatable read — you read a row twice and get different values because another committed transaction updated it between reads. Prevented at REPEATABLE READ and above.
Phantom read — you re-run a range query and new rows appear (or vanish) because another transaction inserted/deleted matching rows. Prevented at SERIALIZABLE.
So the ladder is READ UNCOMMITTED -> READ COMMITTED -> REPEATABLE READ -> SERIALIZABLE, each forbidding one more anomaly.
Follow-ups they push on
- Postgres prevents phantoms at REPEATABLE READ — why is that stronger than the SQL standard?
- What is a write-skew anomaly and which level stops it?
Red flag Swapping non-repeatable (an UPDATE to existing rows) with phantom (INSERT/DELETE changing which rows match), or assuming every engine maps the levels identically.
source: PostgreSQL docs — Transaction Isolation ↗
Commonly asked senior trick occasional PostgreSQL's REPEATABLE READ prevents phantom reads, which the SQL standard doesn't require at that level. Why?
Because Postgres implements isolation with MVCC + snapshots, not range locks. At REPEATABLE READ it takes one consistent snapshot at the first statement and every read in the transaction sees the database exactly as of that snapshot — so new rows inserted by others are invisible, eliminating phantoms too.
The SQL standard only *requires* REPEATABLE READ to block dirty + non-repeatable reads; Postgres is strictly stronger. (Its SERIALIZABLE adds Serializable Snapshot Isolation to also catch write-skew.) Takeaway: the named levels are minimum guarantees — engines often exceed them, so verify per-engine.
Follow-ups they push on
- What anomaly does Postgres SERIALIZABLE catch that REPEATABLE READ still allows (write skew)?
- How does MySQL/InnoDB REPEATABLE READ differ (gap locks)?
Red flag Assuming the SQL-standard anomaly table is literally true for every database — engine implementations (MVCC vs locking) change the real guarantees.
source: PostgreSQL docs — Repeatable Read Isolation Level ↗
Commonly asked senior concept common Explain shared vs exclusive locks and how a deadlock arises.
A shared (read) lock lets many transactions hold it at once but blocks writers. An exclusive (write) lock is held by exactly one transaction and blocks everyone else on that resource. Shared/shared is compatible; anything with exclusive is not.
A deadlock is a cycle of waits: T1 holds A and wants B; T2 holds B and wants A — neither can proceed. The DB detects the cycle and aborts one transaction (the 'deadlock victim'); your code should catch the error and retry. Avoid them by acquiring locks in a consistent order and keeping transactions short.
Follow-ups they push on
- How does lock ordering prevent deadlocks?
- Optimistic vs pessimistic locking — when to pick each?
Red flag Thinking the DB hangs forever on a deadlock — it detects the cycle and kills a victim; the app must handle the retry. Also confusing a deadlock with a long lock-wait.
source: PostgreSQL docs — Explicit Locking / Deadlocks ↗
Commonly asked senior design common Two users buy the last item in stock at the same time and you oversell. How do you prevent the race with the database?
It's a lost-update / check-then-act race: both read stock = 1, both decrement. Fixes:
- Pessimistic lock: SELECT stock FROM items WHERE id = ? FOR UPDATE inside a transaction — the second buyer blocks until the first commits, then sees 0.
- Atomic conditional write: UPDATE items SET stock = stock - 1 WHERE id = ? AND stock > 0 and check the affected-row count — zero rows means it was already sold out. No separate read needed.
- Optimistic concurrency: a version column, UPDATE … WHERE version = ?; retry on conflict. Best under low contention.
The atomic conditional UPDATE is usually the simplest correct answer.
Follow-ups they push on
- Optimistic vs pessimistic — which under high contention?
- Where do isolation levels alone fail to save you here?
Red flag Doing read-then-write in application code without a lock or atomic update and assuming the transaction wrapper alone prevents the lost update (it doesn't at READ COMMITTED).
source: PostgreSQL docs — Explicit Locking (Row-Level Locks / FOR UPDATE) ↗
Commonly asked senior trick occasional What is a write-skew anomaly, and why can it slip past REPEATABLE READ / snapshot isolation?
Write skew: two transactions each read an overlapping set of rows, each checks an invariant that currently holds, then each writes to a different row — and the combined result violates the invariant that neither saw broken.
Classic case: a hospital requires >=1 doctor on call. Two on-call doctors each run 'if more than one is on call, I can go off-call', both read 2-on-call (true), both update their own row, and now zero are on call. Snapshot isolation / REPEATABLE READ doesn't catch it because the two transactions write *disjoint* rows — there's no write-write conflict, only a read-write dependency cycle. Only SERIALIZABLE (in Postgres, Serializable Snapshot Isolation) detects the dependency cycle and aborts one.
What a strong answer covers
- Write skew: concurrent transactions read overlapping data, then write *disjoint* rows, breaking an invariant.
- Snapshot isolation misses it because there's no write-write conflict to detect.
- It's a read-write dependency cycle, not a lost update.
- Only SERIALIZABLE (SSI in Postgres) prevents it; or use explicit SELECT … FOR UPDATE to materialize the conflict.
Quick self-check
Which isolation level is required to reliably prevent write skew?
Follow-ups they push on
- How does SELECT … FOR UPDATE turn a write-skew into a detectable conflict?
- What is Serializable Snapshot Isolation and how does it differ from two-phase locking?
Red flag Believing REPEATABLE READ/snapshot isolation prevents all anomalies — it still allows write skew, which needs SERIALIZABLE or explicit locking.
source: PostgreSQL docs — Serializable Isolation Level (write skew) ↗

3.6 NoSQL 14

★ must-know Commonly asked senior concept common State the CAP theorem and explain why 'CA' isn't a real choice for a distributed database.
CAP says that when a network partition (P) happens, a distributed system can preserve at most one of Consistency (every read sees the latest write) and Availability (every request gets a non-error response) — you must drop one.
'CA' isn't a meaningful pick because partitions *will* happen in any real network — you don't get to opt out of P. So the real choice during a partition is CP (refuse/error to stay consistent — e.g. a leader-based store rejecting writes it can't replicate) or AP (answer with possibly-stale data and reconcile later — Dynamo-style stores). When there's *no* partition, a good system gives both C and A; CAP only forces the trade *during* a partition. PACELC extends it: else (no partition) you still trade latency vs consistency.
What a strong answer covers
- Under a partition you choose Consistency or Availability, not both.
- Partitions are unavoidable in real networks, so P isn't optional — 'CA' is a non-choice.
- CP = stay consistent, reject/err during partition; AP = stay available, serve stale, reconcile.
- CAP only bites *during* a partition; PACELC adds the latency-vs-consistency trade for normal operation.
Quick self-check
During a network partition, a payment system that must never double-charge should behave as…
Follow-ups they push on
- What does PACELC add to CAP?
- Give a real CP store and a real AP store and the workload each suits.
Red flag Treating CAP as 'pick any two' (you can't drop P) or thinking it forces a permanent global trade rather than one that only applies during a partition.
source: Wikipedia — CAP theorem ↗
Commonly asked mid concept very common Name the four main NoSQL families and a use case where each beats a relational DB.
Document (MongoDB) — flexible JSON-like docs; content, catalogs, user profiles where the shape varies.
Key-value (Redis, DynamoDB) — fastest by-key access; caching, sessions, leaderboards, rate counters.
Wide-column (Cassandra, HBase) — massive distributed write scale; time-series, IoT, event logs.
Graph (Neo4j) — relationship-heavy traversals; social graphs, fraud rings, recommendations.
The through-line: each optimizes a specific access pattern that relational tables + joins serve poorly at scale.
Follow-ups they push on
- Why is a graph DB better than SQL recursive joins for 'friends of friends of friends'?
- Document vs wide-column — how do their data models differ?
Red flag Treating 'NoSQL' as one thing, or claiming it's 'schemaless so always better' — each family has a narrow sweet spot.
source: MongoDB — Types of NoSQL Databases ↗
Commonly asked mid concept occasional When is a graph database the right tool, and why does it beat relational recursive joins for deep traversals?
Use a graph DB (Neo4j) when relationships are first-class and traversals are deep/variable-length: social graphs ('friends of friends of friends'), fraud rings, recommendation paths, dependency/permission graphs.
In a relational store, each 'hop' is another self-join, and a 4-hop query means 4 joins whose cost compounds with table size — the optimizer re-finds matching rows by index lookup each level. A graph DB uses index-free adjacency: each node directly stores pointers to its neighbors, so traversing one more hop is O(neighbors of the current node), independent of total graph size. That makes variable-depth path queries (shortest path, reachability) both fast and natural to express (Cypher's MATCH (a)-[:FRIEND*1..4]->(b)).
What a strong answer covers
- Graph DBs shine when relationships and multi-hop traversal are the core workload.
- Index-free adjacency: nodes point straight at neighbors, so a hop is local, not a global index lookup.
- Relational deep traversal = N self-joins whose cost compounds with table size.
- Variable-length paths (shortest path, reachability) are awkward in SQL, native in graph query languages.
Follow-ups they push on
- What is index-free adjacency, concretely?
- Could a recursive CTE handle this in SQL, and where does it fall down at scale?
Red flag Forcing a deeply-connected, variable-depth traversal into repeated SQL self-joins/recursive CTEs and watching it degrade as hop count and table size grow.
source: Neo4j — Graph Database Concepts (index-free adjacency) ↗
Commonly asked mid concept common Cache-aside vs write-through vs write-behind — compare the caching strategies.
Cache-aside (lazy): the app checks the cache; on a miss it reads the DB and populates the cache, and on writes it updates the DB and *invalidates* the key. Simple and resilient (cache down ≠ data loss), but the first read after a miss/eviction is slow and there's a brief staleness window.
Write-through: writes go to cache and DB synchronously, so the cache is always fresh — at the cost of higher write latency and caching data that may never be read.
Write-behind (write-back): writes hit the cache and are flushed to the DB asynchronously — lowest write latency, highest throughput, but risks data loss if the cache fails before flushing and adds complexity. Cache-aside is the common default for read-heavy web workloads.
What a strong answer covers
- Cache-aside: app-managed, populate on miss, invalidate on write — simple, resilient, can serve stale briefly.
- Write-through: write cache+DB together — always fresh, slower writes, may cache unread data.
- Write-behind: async flush to DB — fastest writes, but risks loss on cache failure.
- Default to cache-aside for read-heavy systems; reserve write-behind for write-heavy, loss-tolerant cases.
Quick self-check
Which strategy has the **lowest write latency** but the **highest risk of data loss**?
Follow-ups they push on
- Why does write-behind risk data loss, and how do you mitigate it?
- How do you avoid a cache stampede when a hot cache-aside key expires?
Red flag Choosing write-behind for data you can't afford to lose, or running cache-aside without an invalidation step so the cache serves stale data after every update.
source: AWS — Caching strategies (lazy loading / write-through) ↗
Commonly asked mid trick occasional A NoSQL store is 'schemaless' — what does that actually mean, and what's the catch?
'Schemaless' means the database doesn't enforce a fixed schema — different documents in a collection can have different fields, and you can add a field without a migration. It's better called schema-on-read: the structure is interpreted by the application when it reads, rather than enforced by the database on write.
The catch is the schema doesn't disappear — it moves into your application code, which must handle missing fields, mixed types, and old document shapes (versioning) forever. Without DB-level constraints you can silently write inconsistent data, so mature NoSQL stores add optional validation (MongoDB JSON Schema validators) and teams still enforce structure in code. 'Flexible' is the upside; 'no guardrails' is the downside.
What a strong answer covers
- Schemaless = the DB doesn't enforce structure; really 'schema-on-read'.
- The schema moves into application code, which must tolerate missing/old/variant shapes.
- Flexibility speeds iteration but removes the DB's data-integrity guardrails.
- Mitigate with optional validators (MongoDB schema validation) and explicit document versioning.
Quick self-check
A 'schemaless' document store most accurately means…
Follow-ups they push on
- How does schema-on-read differ from schema-on-write?
- How do you evolve millions of existing documents to a new shape?
Red flag Believing 'schemaless' means no schema to manage — the schema is just enforced (or not) in application code, where inconsistencies accumulate silently.
source: MongoDB — Schema Validation ↗
Commonly asked mid concept very common In MongoDB, when do you embed a sub-document vs reference another collection?
Embed when the child is owned by and always read with the parent, the relationship is one-to-few, and the embedded data doesn't grow unbounded — e.g. a user's addresses inside the user document. One read fetches everything; no join.
Reference (store an ObjectId, join with $lookup or a second query) when the child is large, shared across parents (many-to-many), updated independently, or the array would grow without bound (a celebrity's millions of followers). This avoids the 16MB document cap and write amplification.
Rule: model around your access patterns, not entities — 'data that is accessed together should be stored together.'
Follow-ups they push on
- What's MongoDB's document size limit, and how does it force referencing?
- How would you model a comments-on-posts relationship?
Red flag Reflexively normalizing like a relational schema, or embedding an unbounded growing array that eventually hits the 16MB document limit.
source: MongoDB — Data Modeling Introduction ↗
Commonly asked mid concept common What is BASE and how does it differ from ACID?
BASE = Basically Available, Soft state, Eventual consistency. It's the consistency model many NoSQL/distributed stores choose: stay available and partition-tolerant, accept that replicas converge *eventually* rather than being instantly consistent.
Vs ACID, which insists every transaction leaves the DB strongly consistent and isolated. BASE relaxes that to gain availability and horizontal scale. It's the practical face of the CAP theorem: under a network partition you pick availability (BASE/AP) or consistency (ACID/CP). Use BASE where stale-by-seconds reads are fine (feeds, product views); use ACID where they aren't (payments).
Follow-ups they push on
- State the CAP theorem and which corner BASE sits in.
- Give a feature where eventual consistency is unacceptable.
Red flag Equating 'NoSQL' with 'no transactions' — many (MongoDB, DynamoDB) now offer ACID transactions; BASE is a choice, not an inherent limitation.
source: MongoDB — ACID Transactions / Database Consistency ↗
Commonly asked mid concept common What is the MongoDB aggregation pipeline, and how does it map to SQL?
The aggregation pipeline passes documents through ordered stages, each transforming the stream and feeding the next — like Unix pipes for data.
Rough SQL mapping: $match ~ WHERE, $group ~ GROUP BY (+ aggregates), $project ~ SELECT (shape columns), $sort ~ ORDER BY, $limit/$skip ~ LIMIT/OFFSET, $lookup ~ LEFT JOIN, $unwind ~ flatten an array into rows.
Stage order matters for performance: put $match and $sort early so they can use indexes and shrink the working set before expensive $group/$lookup.
Follow-ups they push on
- Why put $match as early as possible in the pipeline?
- What does $unwind do and when is it needed before $group?
Red flag Ordering stages so `$match` comes after a `$group`/`$lookup`, defeating index use and processing far more documents than necessary.
source: MongoDB — Aggregation Pipeline ↗
Commonly asked mid concept common Why use Redis for caching, and what are the main eviction/expiry concerns?
Redis is an in-memory key-value store, so reads/writes are microsecond-fast — ideal as a cache in front of a slower primary DB, plus sessions, rate limiters, and leaderboards (sorted sets).
Key concerns: set a TTL (EXPIRE) so stale data ages out; pick an eviction policy for when memory is full (allkeys-lru, allkeys-lfu, volatile-ttl, etc.); and have a cache-invalidation strategy on writes (write-through, or delete-on-update). Watch for stampede — many requests recomputing a hot key the instant it expires — mitigated by locks or jittered TTLs.
Follow-ups they push on
- Cache-aside vs write-through vs write-behind?
- What is a cache stampede / thundering herd, and how do you avoid it?
Red flag Caching without a TTL or invalidation plan (serving stale data forever), or ignoring eviction so the cache silently drops keys under memory pressure.
source: Redis — Key eviction (docs) ↗
Commonly asked mid concept common When would you NOT use NoSQL — i.e., when is a relational database still the right call?
Choose relational when you need strong multi-row transactions / ACID (money, inventory, bookings), flexible ad-hoc queries and joins across well-structured related data, constraints and referential integrity enforced by the DB, and a stable schema.
NoSQL earns its place for huge scale on a known access pattern, flexible/evolving document shapes, or relationship-traversal workloads. The honest senior answer is 'it depends on access patterns and consistency needs' — and modern Postgres (JSONB, partitioning, logical replication) covers many cases people reach for NoSQL for.
Follow-ups they push on
- How does Postgres JSONB blur the SQL/NoSQL line?
- Polyglot persistence — when is mixing both justified?
Red flag Picking NoSQL for hype/scale you don't have, then reimplementing joins and transactions in application code; or assuming relational 'can't scale'.
source: MongoDB — NoSQL vs SQL Databases ↗
Amazon senior concept occasional Why is NoSQL data modeling driven by access patterns, and what does DynamoDB single-table design illustrate?
Relational modeling normalizes by entity and joins at read time. NoSQL stores (especially DynamoDB) have no joins and charge for every access, so you model queries first: list the access patterns, then design keys so each query is a single, indexed key lookup.
Single-table design takes this to the extreme — multiple entity types (users, orders, items) share one table, distinguished by a composite primary key (a generic partition key + sort key, often PK/SK with prefixes like USER#123 / ORDER#456). Related items share a partition so one query fetches them together without a join, and secondary indexes (GSIs) serve alternate patterns. The cost is a rigid, query-specific schema that's painful to change when access patterns evolve.
What a strong answer covers
- No joins + per-request cost -> design around queries, not entities.
- List access patterns first, then shape partition/sort keys so each is one key lookup.
- Single-table design co-locates related items in a partition via prefixed composite keys.
- Secondary indexes (GSIs) add alternate access patterns; the schema is rigid to new ones.
Follow-ups they push on
- How does a composite (partition + sort) key let one query return several related items?
- What's the downside when a brand-new access pattern appears later?
Red flag Modeling a NoSQL store like a normalized relational schema and then needing joins the database can't do, forcing N round-trips or client-side joins.
source: AWS docs — DynamoDB single-table design / data modeling ↗
Commonly asked senior concept common What is eventual consistency, and how do read-your-writes and quorum reads/writes fit in?
Eventual consistency: replicas may temporarily disagree, but with no new writes they all converge to the same value. The window means a read just after a write can return stale data.
Stronger guarantees layer on top. Read-your-writes ensures *you* always see your own latest write (route your reads to a replica known to have it, or to the leader). Quorum tunes consistency per operation: with N replicas, require W acks on write and R replicas on read; if R + W > N the read and write sets overlap, so a read is guaranteed to see the latest acknowledged write (strong consistency) — at the cost of latency/availability. Dynamo-style systems expose W/R so you trade consistency against speed per call.
What a strong answer covers
- Eventual consistency: replicas converge once writes stop; reads can be briefly stale.
- Read-your-writes: a session always sees its own latest write.
- Quorum: pick W (write acks) and R (read replicas) out of N.
- R + W > N guarantees overlap -> a read sees the latest committed write (strong consistency).
- Higher R/W means stronger consistency but more latency and less availability.
Quick self-check
With N=3 replicas, which (W, R) configuration guarantees strongly-consistent reads?
Follow-ups they push on
- Why does R + W > N guarantee a read sees the newest write?
- What's a tunable example — W=N for strong writes vs W=1 for fast writes?
Red flag Assuming eventual consistency means 'never consistent', or thinking any single quorum value is right — R/W are a per-workload latency-vs-consistency dial.
source: Wikipedia — Eventual consistency ↗
Commonly asked senior concept common Explain sharding vs replication vs partitioning. How are they different?
Replication — keep copies of the same data on multiple nodes (leader-follower). Goal: high availability + read scaling + durability. It does *not* increase write capacity (one leader takes writes).
Sharding — split the dataset into disjoint pieces across nodes by a shard key, each node owning a subset. Goal: scale writes and storage beyond one machine.
Partitioning — the general term for splitting a table: horizontal = rows split across partitions (sharding is horizontal partitioning across servers); vertical = columns split into separate tables.
In practice you combine them: shard for write scale, then replicate each shard for HA.
Follow-ups they push on
- Why does replication alone not scale writes?
- How do you choose a shard key, and what's a hot-shard / hotspot?
Red flag Using sharding and replication interchangeably — replicas are full copies (HA + reads), shards are disjoint subsets (write/storage scale).
source: MongoDB — Sharding ↗
Commonly asked senior design occasional How would you choose a shard key, and what goes wrong with a bad one?
A good shard key has high cardinality, even write distribution, and matches your query pattern so most queries hit one shard (targeted, not scatter-gather).
Failure modes: a monotonically increasing key (timestamp, auto-increment id) sends all new writes to one shard — a 'hot shard'. A low-cardinality key (country, status) can't split finely enough. A key that doesn't appear in queries forces every query to fan out to all shards (broadcast).
Mitigations: hashed shard keys to spread writes, or compound keys (e.g. user_id + time) to keep related data together while distributing load.
Follow-ups they push on
- Why is a hashed shard key better for write distribution but worse for range queries?
- What is a scatter-gather query and why is it slow?
Red flag Picking an auto-increment or timestamp shard key and creating a permanent hotspot, or a key absent from common queries forcing broadcasts.
source: MongoDB — Choose a Shard Key ↗

3.7 Stored routines, views & triggers 11

★ must-know Commonly asked mid concept common What is SQL injection, and how do stored procedures and parameterized queries relate to preventing it?
SQL injection happens when user input is concatenated into a query string, so input like ' OR 1=1 -- becomes executable SQL. The real defense is parameterized queries / prepared statements: the SQL text and the data travel separately, so input is always treated as a value, never as code.
Stored procedures help *only if* they use parameters internally — a procedure that builds and EXECUTEs a string from its arguments (dynamic SQL) is just as injectable. So 'use stored procedures' is not itself the fix; 'never interpolate untrusted input into SQL' is. ORMs parameterize by default, which is a big part of why they're safer out of the box.
What a strong answer covers
- Injection = untrusted input concatenated into SQL text and executed as code.
- Fix = parameterized queries / prepared statements: code and data sent separately.
- Stored procedures are safe only when parameterized; dynamic SQL inside them is still vulnerable.
- ORMs parameterize by default; the danger returns the moment you build raw SQL by string concat.
Quick self-check
Which most reliably prevents SQL injection?
Follow-ups they push on
- How can a stored procedure still be injectable (dynamic SQL / EXECUTE)?
- Why isn't escaping/quoting input a reliable substitute for parameterization?
Red flag Believing 'we use stored procedures, so we're safe from injection' — a procedure that concatenates input into dynamic SQL is exactly as vulnerable as inline string-building.
source: OWASP — SQL Injection Prevention Cheat Sheet ↗
Commonly asked mid concept common What is the difference between a view and a materialized view, and when would you use each?
A regular view is just a saved query — it stores no data. Every read re-runs the underlying SELECT against the live tables, so results are always current but you pay the full query cost on each access.
A materialized view stores the computed result on disk, so reads are cheap — but the data is a snapshot that goes stale until you REFRESH MATERIALIZED VIEW.
Use a plain view to centralize/simplify a query, present a stable interface, or restrict columns for security. Use a materialized view when the query is expensive and slightly-stale results are acceptable: dashboards, reporting rollups, precomputed aggregates.
Follow-ups they push on
- How do you refresh a materialized view without blocking readers?
- Can you put an index on a materialized view? (yes)
Red flag Believing a regular view caches its results (it doesn't — it re-executes every time), or treating a materialized view as always up to date.
source: PostgreSQL — Materialized Views ↗
Commonly asked mid trick occasional Can you INSERT/UPDATE/DELETE through a view?
Sometimes. A simple view — one base table, no aggregation, DISTINCT, GROUP BY, window functions, or set operations — is automatically updatable: writes pass straight through to the base table. Complex views (joins, aggregates) are not directly writable; you make them writable with an INSTEAD OF trigger that translates the change to the right base tables.
Add WITH CHECK OPTION so an INSERT/UPDATE can't create a row that would fall outside the view's WHERE and silently disappear.
Follow-ups they push on
- What does WITH CHECK OPTION protect against?
- How does an INSTEAD OF trigger make a multi-table view writable?
Red flag Assuming any view is updatable, then being surprised when a write to a join/aggregate view errors out.
source: PostgreSQL — CREATE VIEW (updatable views) ↗
Commonly asked mid concept common What are triggers good for, and why are they dangerous in production?
A trigger runs a function automatically on INSERT/UPDATE/DELETE (BEFORE, AFTER, or INSTEAD OF). Legitimate uses: writing audit/history rows, enforcing invariants the schema can't express, maintaining a derived/denormalized column, or keeping a summary table in sync.
The danger is that triggers are invisible side effects. They fire on every row change, hide business logic away from the application, add latency to every write, can cascade or recurse, and quietly make bulk operations slow. They're powerful but easy to abuse.
Follow-ups they push on
- BEFORE vs AFTER vs INSTEAD OF — when do you reach for each?
- How do you prevent a trigger from recursively firing on itself?
Red flag Burying critical business logic in triggers so behavior becomes 'spooky action at a distance', or ignoring their per-row cost on large bulk writes.
source: PostgreSQL — CREATE TRIGGER ↗
Commonly asked mid debug occasional A table's writes are mysteriously slow and some rows change on their own — how do you debug it?
Symptoms like 'an UPDATE touched rows I never wrote', unexplained slow writes, or stack depth limit exceeded almost always trace back to a trigger.
Steps: list the triggers on the table (\d table in psql, or information_schema.triggers), read the trigger function, note BEFORE vs AFTER and which events fire it, and look for a trigger that writes back to the same table (recursion) or a per-row trigger running during a bulk operation. Add RAISE NOTICE to trace, and temporarily ALTER TABLE ... DISABLE TRIGGER to isolate the culprit.
Follow-ups they push on
- How do you stop a trigger from recursively re-firing on its own writes?
- Row-level vs statement-level triggers for a million-row update?
Red flag Debugging the application for hours when an AFTER trigger is the real cause — or disabling a trigger in production to test and forgetting to re-enable it.
source: PostgreSQL — Overview of Trigger Behavior ↗
Commonly asked mid concept occasional What's the difference between a stored function and a stored procedure in PostgreSQL?
A function returns a value (scalar, row, or set) and is meant to be *called inside* a SQL statement — SELECT my_fn(x). Because it runs *within* the calling query's transaction, it cannot issue COMMIT/ROLLBACK.
A procedure (added in Postgres 11) is invoked with CALL my_proc(...), may return nothing, and crucially can manage transactions — it can COMMIT/ROLLBACK mid-body, which is what makes procedures right for batch jobs that process and commit in chunks. So: need a value inside a query -> function; need explicit transaction control for multi-step/batch work -> procedure.
What a strong answer covers
- Function: returns a value, called inside SQL (SELECT f(...)), no transaction control.
- Procedure: called with CALL, can COMMIT/ROLLBACK in its body.
- Procedures (PG 11+) suit batch jobs that commit in chunks; functions suit computed values.
- A function runs inside the caller's transaction; it can't open/close one.
Quick self-check
You need a routine that processes a million rows in batches, committing every 10,000. In Postgres you should write a…
Follow-ups they push on
- Why can a procedure but not a function COMMIT mid-execution?
- What does VOLATILE vs STABLE vs IMMUTABLE tell the planner about a function?
Red flag Trying to COMMIT inside a function (errors), or assuming 'function' and 'procedure' are just two names for the same thing.
source: PostgreSQL docs — CREATE PROCEDURE (transaction control) ↗
Commonly asked mid concept occasional When do you reach for a BEFORE, AFTER, or INSTEAD OF trigger?
BEFORE fires before the row change and can modify or veto it — use it to validate, normalize/derive a column (set updated_at, lowercase an email), or RETURN NULL to skip the operation. The row isn't written yet, so you can't see its final generated id.
AFTER fires once the change is committed to the row — use it for side effects that depend on the final state: writing an audit/history row, enqueuing a notification, maintaining a summary table. It can see the new id.
INSTEAD OF applies only to views: it replaces the (impossible) direct write with custom logic, which is how you make a complex/multi-table view updatable.
What a strong answer covers
- BEFORE: validate / mutate / cancel the row before it's written (can RETURN NULL to skip).
- AFTER: react to the committed change — audit logs, notifications, summary maintenance.
- INSTEAD OF: only on views; substitutes custom DML to make a non-updatable view writable.
- BEFORE can't see auto-generated values (id/serial); AFTER can.
Quick self-check
You want to reject or normalize a value before it's stored. Which trigger timing fits?
Follow-ups they push on
- Why can't a BEFORE INSERT trigger see the new serial id?
- Row-level vs statement-level triggers — when does each fire?
Red flag Using an AFTER trigger to try to alter the row (too late) or a BEFORE trigger to read the generated primary key (not assigned yet).
source: PostgreSQL docs — Overview of Trigger Behavior (BEFORE/AFTER/INSTEAD OF) ↗
Commonly asked senior concept occasional When is a materialized view the wrong tool, and what would you use instead?
A materialized view recomputes its *entire* result on REFRESH — there's no built-in incremental update in core Postgres. So it's the wrong tool when you need near-real-time freshness or the base data is huge and changes constantly: each full refresh is expensive and the data is stale between refreshes.
Better alternatives by need: for freshness, maintain a summary/rollup table updated incrementally by triggers or in the write path (comment_count); for ad-hoc speed without staleness, just add the right indexes to the plain view's query; for genuinely incremental materialization, reach for an external tool or an extension (e.g. continuous aggregates in TimescaleDB). Materialized views fit *expensive, periodically-refreshed reporting* — dashboards that tolerate minutes/hours of lag.
What a strong answer covers
- Core Postgres materialized views refresh in full — no incremental maintenance.
- Wrong for near-real-time needs or huge, constantly-changing base data.
- Freshness alternative: an incrementally-maintained summary table (triggers / write-path updates).
- Speed-without-staleness alternative: index the plain view's underlying query.
- Right fit: expensive, periodically-refreshed reporting that tolerates lag.
Follow-ups they push on
- How would you keep a comment_count fresh without a materialized view?
- What do TimescaleDB continuous aggregates add over a plain materialized view?
Red flag Using a materialized view for data that must be fresh, then refreshing it constantly and paying a full recompute each time instead of maintaining an incremental summary table.
source: PostgreSQL docs — Materialized Views (refresh is full recompute) ↗
Commonly asked senior debug occasional How do you prevent a row-level trigger from recursively firing on its own writes?
If an AFTER UPDATE trigger on a table issues another UPDATE on the same table, that write fires the trigger again — risking infinite recursion and a stack depth limit exceeded error.
Guards: (1) make the trigger's write a no-op when nothing changed — in a BEFORE trigger, IF NEW IS DISTINCT FROM OLD THEN … ELSE RETURN NULL stops the cascade once values stabilize; (2) only re-write when a condition flips, so the second pass changes nothing and the chain ends; (3) use pg_trigger_depth() to act only at depth 1; (4) restructure so the trigger updates a *different* table. The cleanest fix is usually a BEFORE trigger that mutates NEW in place (no second UPDATE needed at all) rather than issuing a recursive write.
What a strong answer covers
- A trigger that writes back to its own table re-fires itself -> potential infinite recursion.
- Symptom: stack depth limit exceeded.
- Guard with a 'did anything actually change?' check (NEW IS DISTINCT FROM OLD).
- Or gate on pg_trigger_depth(), or update a different table.
- Best: a BEFORE trigger that edits NEW directly — no recursive UPDATE at all.
Follow-ups they push on
- Why does mutating NEW in a BEFORE trigger avoid recursion entirely?
- What does pg_trigger_depth() return and how do you use it?
Red flag Writing an AFTER trigger that UPDATEs the same table unconditionally, causing it to re-fire forever and hit the stack-depth limit.
source: PostgreSQL docs — Trigger Procedures / recursion behavior ↗
Commonly asked senior concept occasional When should business logic live in stored procedures/functions versus the application?
Pushing logic into the database keeps it close to the data: fewer round trips, atomic multi-statement work, reuse across apps and languages, and often faster set-based processing.
The costs: logic is now split across two codebases, it's harder to version/test/debug, you take on DB-vendor lock-in, and it burns scarce DB CPU that's hard to scale horizontally.
The modern default is to keep business logic in the application and reserve DB routines for data-intensive, set-based, or integrity-critical work where the round-trip or consistency win is real.
Follow-ups they push on
- Function vs procedure in Postgres — which can control transactions?
- How would you version-control and test stored procedures?
Red flag Either extreme: cramming all business logic into the DB (unmaintainable, unscalable), or chatty app code looping row-by-row over work that should be one set-based statement.
source: PostgreSQL — CREATE PROCEDURE ↗
Commonly asked senior coding occasional How do you refresh a materialized view without blocking reads?
A plain REFRESH MATERIALIZED VIEW mv takes an exclusive lock and blocks every reader until it finishes. Use REFRESH MATERIALIZED VIEW CONCURRENTLY mv instead: it rebuilds without blocking SELECTs. The trade-offs are that it requires a UNIQUE index on the view (so it can diff rows) and it runs slower.
Schedule refreshes off-peak (cron / pg_cron) or kick them off right after the upstream load completes. If you need near-real-time freshness, full refresh is the wrong tool — maintain a trigger-updated summary table instead.
Follow-ups they push on
- Why does CONCURRENTLY require a unique index?
- When is an incrementally-maintained summary table better than a materialized view?
Red flag Running a plain (non-concurrent) refresh on a hot view during business hours and locking out every reader.
source: PostgreSQL — REFRESH MATERIALIZED VIEW ↗

04 Node.js Internals 86 Q's

4.1 The event loop & async model 16

★ must-know Commonly asked mid debug common What prints, and in what order? console.log("A"); setTimeout(() => console.log("B"), 0); queueMicrotask(() => console.log("C")); Promise.resolve().then(() => console.log("D")); console.log("E")
A E C D B.
Sync code first: A, E. Then the microtask queue drains before any macrotask. queueMicrotask and Promise.resolve().then feed the same Promise/microtask queue, so they run in registration order: C was queued first, then D. Finally the setTimeout macrotask fires in the timers phase: B.
The point: queueMicrotask is not a separate higher-priority queue like nextTick — it shares the Promise microtask queue and is the standards-based way to schedule a microtask.
What a strong answer covers
- Sync runs to completion first: A, then E.
- queueMicrotask and Promise.then share one microtask queue, drained in FIFO/registration order: C then D.
- All microtasks drain before any macrotask, so setTimeout's B is last.
- Unlike process.nextTick, queueMicrotask has no priority over Promise callbacks — same queue.
Quick self-check
What is the output order?
Follow-ups they push on
- Where would a process.nextTick callback land relative to C and D?
- Why prefer queueMicrotask over Promise.resolve().then() for scheduling a microtask?
Red flag Treating queueMicrotask as a separate, higher-priority queue — it shares the Promise microtask queue and runs in registration order.
source: MDN — queueMicrotask ↗
Commonly asked mid debug very common What prints, and in what order? console.log("A"); setTimeout(() => console.log("B"), 0); Promise.resolve().then(() => console.log("C")); process.nextTick(() => console.log("D")); console.log("E")
A E D C B.
First the synchronous code runs top to bottom: A, then E. The other three are deferred. Before the event loop advances to its next phase, Node drains its microtask queues, and process.nextTick has its own queue that runs before the Promise microtask queue, so D then C. Finally the setTimeout callback fires in the timers phase: B.
The rule to memorize: nextTick > Promise microtasks > macrotasks (timers/immediate/I/O).
Follow-ups they push on
- Why does process.nextTick run before the Promise callback even though it was scheduled later in the code?
- What happens if a nextTick callback schedules another nextTick — can it starve the loop?
Red flag Saying the order follows the source-code order, or putting `C` (Promise) before `D` (nextTick).
source: Node.js docs — Event loop, timers, and nextTick ↗
Commonly asked mid debug common What prints? for (let i = 0; i < 3; i++) { setTimeout(() => console.log(i), 0); } for (var j = 0; j < 3; j++) { setTimeout(() => console.log(j), 0); }
0 1 2 3 3 3.
The first loop uses let, which is block-scoped: each iteration gets a fresh binding of i, so the three closures capture 0, 1, 2 respectively. The second loop uses var, which is function-scoped: all three closures capture the *same* j, and by the time the timers fire (after the synchronous loops finish) j is already 3 — so it prints 3 three times.
This is the classic closures-in-a-loop trap. The timers all queue with delay 0 and fire in order after the synchronous code completes.
What a strong answer covers
- let is block-scoped → a fresh binding per iteration → captures 0, 1, 2.
- var is function-scoped → one shared binding → all closures see the final value 3.
- All callbacks are deferred (setTimeout), so they read the variable after the loop finishes.
- Fix for var: an IIFE per iteration, or just use let.
Quick self-check
What is the output?
Follow-ups they push on
- How would you make the var loop print 0 1 2 without changing var to let?
- Would using Promise.resolve().then instead of setTimeout change the captured values?
Red flag Expecting both loops to print 0 1 2 — the var loop captures one shared, function-scoped binding.
source: Lydia Hallie — javascript-questions ↗
Commonly asked mid debug very common At the top level of a module: setTimeout(() => console.log("timeout"), 0); setImmediate(() => console.log("immediate")). Which logs first?
It is not guaranteed — the order is non-deterministic at the top level. setTimeout(0) is clamped to a 1ms timer, so whether the timers phase or the check phase reaches its callback first depends on how long process setup took. Run it twice and you may see different orders.
The twist interviewers want: move both into an I/O callback, e.g. inside fs.readFile(...), and setImmediate always wins. After an I/O (poll-phase) callback, the loop goes straight to the check phase (setImmediate) before looping back to timers.
Follow-ups they push on
- Why does setImmediate become deterministic once you nest both inside an fs.readFile callback?
- Where in the phase order do timers and check sit relative to the poll phase?
Red flag Claiming setImmediate or setTimeout always wins at the top level — the whole point is that it is non-deterministic there.
source: Node.js docs — setImmediate vs setTimeout ↗
Commonly asked mid concept very common Name the phases of the Node.js event loop in order, and say what runs in each.
Six phases, run in this order each iteration ("tick"):
1. timers — setTimeout/setInterval callbacks whose threshold has elapsed.
2. pending callbacks — a few deferred system/OS callbacks (e.g. some TCP errors).
3. idle, prepare — internal to libuv; you never schedule here.
4. poll — retrieve new I/O events and run their callbacks; the loop may block here waiting for I/O.
5. check — setImmediate callbacks.
6. close callbacks — e.g. socket.on("close", ...).
Between every callback (and between phases) Node drains the microtask queues: the process.nextTick queue first, then the Promise/queueMicrotask queue.
Follow-ups they push on
- In which phase does the loop actually block waiting for work?
- Are microtasks a phase of the loop? (No — they drain between callbacks.)
Red flag Listing microtasks (Promises) as an event-loop phase — they are not; they run between phases.
source: Node.js docs — Event loop, timers, and nextTick ↗
AmazonMetaTikTok mid concept very common Node.js is "single-threaded," yet it handles thousands of concurrent connections. How? Where do background threads come from?
There is one JavaScript thread that runs all your code on the event loop. Concurrency comes from not waiting: when you do I/O (network, disk, DNS), Node hands the work to the OS or to libuv and registers a callback, then immediately moves on. When the I/O completes, its callback is queued and runs later on the JS thread.
Most network I/O uses the OS's async primitives directly (epoll/kqueue/IOCP) — no extra thread. A few things that lack an async OS API run on libuv's thread pool (default size 4, UV_THREADPOOL_SIZE): file-system ops, DNS lookup, and some crypto/zlib work.
So: one thread for JS, the OS + a small libuv pool for the blocking bits.
Follow-ups they push on
- Which built-in operations actually use the libuv thread pool?
- What is UV_THREADPOOL_SIZE and when would you raise it?
Red flag Saying every async operation spawns a thread, or that the thread pool handles network sockets (it usually does not).
source: Node.js docs — Don't block the event loop ↗
Commonly asked mid trick common What is the difference between process.nextTick() and setImmediate(), despite the confusing names?
The names are backwards from what you would guess.
- process.nextTick(cb) runs cb before the event loop continues — as soon as the current operation finishes, before returning to the loop. It is a microtask, higher priority than Promises. "Next tick" here means "before the next loop phase," i.e. almost immediately.
- setImmediate(cb) schedules cb for the check phase of the *next* loop iteration. Despite "immediate," it is later than nextTick.
Node docs themselves recommend setImmediate for most cases because it is easier to reason about and cannot starve the loop the way recursive nextTick can.
Follow-ups they push on
- Why can recursive process.nextTick starve I/O but recursive setImmediate cannot?
- Which one would you use to defer work to 'after this function returns but before any I/O'?
Red flag Assuming setImmediate runs before nextTick because of the name — it is the opposite.
source: Node.js docs — Understanding setImmediate() ↗
Commonly asked mid concept common A request handler runs a synchronous for-loop summing 1 to 10 billion. What happens to every other in-flight request, and why?
Every other request stalls until the loop finishes. There is one JS thread, and a synchronous CPU-bound loop never yields to the event loop — no timers fire, no I/O callbacks run, no new connections are accepted. The whole server appears frozen.
Fixes, in order of preference:
1. Offload the CPU work to a Worker thread (or a child process / external service).
2. Chunk the work and yield between chunks with setImmediate so the loop can service I/O.
3. Push it out of the request path entirely (a job queue).
The mental model: async I/O is free concurrency, but CPU work is not — it must be moved off the main thread.
Follow-ups they push on
- How would you detect event-loop blocking in production? (event-loop lag / monitoring.)
- Why is Worker threads better than just adding more setTimeout calls here?
Red flag Thinking async/await or wrapping the loop in a Promise makes synchronous CPU work non-blocking — it does not.
source: Node.js docs — Don't block the event loop ↗
Commonly asked mid debug common What prints? async function f() { console.log(1); await null; console.log(2); } console.log(3); f(); console.log(4)
3 1 4 2.
console.log(3) runs. Then f() is *called* and runs synchronously up to the await: it logs 1. At await null, the function suspends and its continuation (console.log(2)) is scheduled as a microtask; control returns to the caller, which logs 4. The synchronous stack is now empty, so the microtask queue drains: 2.
The insight: code before the first await runs synchronously; everything after await is a microtask, even when you await an already-resolved value like null.
Follow-ups they push on
- Does it matter that we awaited `null` instead of a real Promise? (No — await always yields.)
- Where would a process.nextTick scheduled in main code land relative to console.log(2)?
Red flag Treating the body after `await` as still synchronous and printing `1 2` together.
source: Lydia Hallie — JavaScript Visualized: Promises & Async/Await ↗
Commonly asked senior concept occasional What is UV_THREADPOOL_SIZE, what is its default, and what symptom tells you it's too small?
UV_THREADPOOL_SIZE is the environment variable that sets the size of libuv's thread pool, which backs the handful of operations that lack an async OS API: file-system I/O, DNS lookup, and some crypto/zlib work. The default is 4.
The symptom of it being too small: those specific operations start queuing behind each other even though the CPU is idle and the event loop is free. For example, fire 5 concurrent crypto.pbkdf2 calls with a pool of 4 and the 5th does not start until one of the first four finishes — added latency that looks mysterious because nothing is "blocked."
Raise it (e.g. UV_THREADPOOL_SIZE=8) when you do heavy concurrent fs/crypto work, but it must be set before the pool is created (at process start).
What a strong answer covers
- Sets libuv's thread pool size; default 4.
- Backs fs I/O, dns.lookup, and some crypto/zlib — not network sockets (those use the OS directly).
- Symptom of too-small: those ops serialize/queue while CPU and event loop sit idle.
- Must be set at process startup — changing it after the pool spins up has no effect.
Follow-ups they push on
- Why doesn't raising UV_THREADPOOL_SIZE help an HTTP server doing pure network I/O?
- How would you tell pool saturation apart from event-loop blocking?
Red flag Raising the pool size to fix latency on network I/O — sockets don't use the pool, so it does nothing.
source: Node.js docs — UV_THREADPOOL_SIZE ↗
Commonly asked senior debug common What prints? const fs = require("fs"); fs.readFile(__filename, () => { setTimeout(() => console.log("timeout"), 0); setImmediate(() => console.log("immediate")); });
immediate then timeout — deterministically, every run.
The readFile callback runs in the poll phase. From the poll phase the loop advances next to the check phase, where setImmediate callbacks live — so immediate fires first. Only after wrapping back around to the timers phase does the setTimeout(0) callback run: timeout.
This is the famous twist: at the top level setTimeout(0) vs setImmediate ordering is non-deterministic, but inside an I/O callback setImmediate always wins because check immediately follows poll.
What a strong answer covers
- The I/O callback runs in the poll phase; the loop goes poll → check → (wrap) → timers.
- check (setImmediate) comes right after poll, so immediate runs before timeout.
- This ordering is deterministic inside an I/O callback (unlike at the top level).
- It demonstrates the phase order, not a race — setImmediate reliably beats setTimeout(0) here.
Quick self-check
What prints, and is it deterministic?
Follow-ups they push on
- Why is the same pair non-deterministic at the top level of the module?
- Where does a process.nextTick scheduled inside the readFile callback run relative to these two?
Red flag Saying setTimeout wins or that it's non-deterministic — inside an I/O callback, setImmediate is guaranteed first.
source: Node.js docs — setImmediate() vs setTimeout() ↗
Commonly asked senior concept occasional Can recursive process.nextTick() starve the event loop? Contrast with recursive setImmediate().
Yes — recursive process.nextTick can starve the loop. The nextTick queue is drained completely between phases, and a callback that schedules another nextTick keeps re-filling that queue, so the loop never advances to timers, poll, or I/O. Your server stops accepting connections and firing timers while the CPU spins on nextTicks.
Recursive setImmediate does not starve I/O. setImmediate callbacks run in the check phase, and each loop iteration runs the immediates queued *before* this iteration started — newly-scheduled ones wait for the *next* iteration. So the loop still visits the poll phase between iterations and services I/O.
This is exactly why Node's docs recommend setImmediate over nextTick for deferring work in most cases.
What a strong answer covers
- nextTick queue drains fully between phases; recursive nextTick re-fills it and blocks the loop from advancing.
- Recursive setImmediate yields each iteration — newly-queued immediates wait for the next tick, so I/O still runs.
- Starvation symptom: timers don't fire and new connections aren't accepted while CPU is busy.
- Docs recommend setImmediate for deferral precisely because it can't starve the loop.
Follow-ups they push on
- Why does a newly-scheduled setImmediate wait for the next loop iteration but a newly-scheduled nextTick does not?
- When is process.nextTick still the right tool despite the starvation risk?
Red flag Using recursive nextTick for chunked work — it can lock out all I/O; use setImmediate to chunk safely.
source: Node.js docs — process.nextTick() ↗
Commonly asked senior concept occasional How does the event loop in Node differ from the one in the browser? Name two concrete differences.
They share the core idea — a single JS thread, a macrotask queue, and a microtask queue drained between tasks — but differ in details:
1. Extra microtask queue: Node has process.nextTick, which runs before the Promise microtask queue. The browser has only the Promise/queueMicrotask queue.
2. Phases and setImmediate: Node's loop is libuv's multi-phase loop (timers, poll, check, …) and exposes setImmediate (the check phase). The browser has no setImmediate; its closest analog is task scheduling via setTimeout/messaging, and rendering steps (style/layout/paint, requestAnimationFrame) are interleaved into its loop — Node has no rendering.
So: Node = libuv phases + nextTick + setImmediate, no rendering; browser = task/microtask + a render step, no nextTick/setImmediate.
What a strong answer covers
- Node has two microtask queues (nextTick before Promises); the browser has only the Promise queue.
- Node's loop has libuv phases and setImmediate; the browser has neither.
- The browser interleaves rendering (rAF, style/layout/paint); Node has no render step.
- Both: single JS thread, microtasks drain to empty between macrotasks.
Follow-ups they push on
- What's the browser's closest equivalent to setImmediate?
- Where does requestAnimationFrame sit relative to microtasks in the browser?
Red flag Assuming setImmediate or process.nextTick exist in the browser, or that the two loops are identical.
source: MDN — The event loop ↗
Commonly asked senior concept occasional What is 'event-loop lag' (event-loop delay), why does it matter, and how do you measure it?
Event-loop lag is the extra time between when a callback (e.g. a timer) was *supposed* to run and when it *actually* runs. A timer set for 0ms that fires 80ms late means the loop spent ~80ms busy elsewhere — almost always a synchronous, CPU-bound task blocking the single thread.
It matters because it is the single best health signal for a Node service: high lag means requests are queuing and latency is spiking for *everyone*, even if CPU and memory look fine. It is the symptom of "don't block the event loop."
Measure it precisely with the built-in perf_hooks.monitorEventLoopDelay() histogram (min/max/percentiles), or the crude classic: a recurring setInterval that records how far past its scheduled time it fires.
What a strong answer covers
- Lag = actual minus scheduled callback time; reflects how long the loop was busy.
- High lag almost always means synchronous CPU work blocking the one JS thread.
- It's a leading indicator of latency for all requests, not just one.
- Measure with perf_hooks.monitorEventLoopDelay() (a histogram) or a self-timing setInterval.
Follow-ups they push on
- What's a healthy lag threshold for an HTTP service, and what would you alert on?
- How does monitorEventLoopDelay differ from just timing a setInterval?
Red flag Diagnosing latency with CPU/memory only — a blocked loop can show low CPU yet high lag and timeouts.
source: Node.js docs — perf_hooks.monitorEventLoopDelay ↗
Commonly asked senior debug common What prints? console.log("start"); setTimeout(() => console.log("timeout"), 0); Promise.resolve().then(() => { console.log("promise1"); process.nextTick(() => console.log("nextTick-in-promise")); }); process.nextTick(() => console.log("nextTick")); console.log("end")
start end nextTick promise1 nextTick-in-promise timeout.
Sync first: start, end. Then the microtask drain begins. The nextTick queue runs to completion first: nextTick. Then the Promise queue: promise1 — which itself schedules a new nextTick. The drain is exhaustive: after the Promise queue, Node re-checks the nextTick queue and finds nextTick-in-promise, running it before leaving the microtask phase. Only once both microtask queues are empty does the loop reach timers: timeout.
Key idea: microtasks added while draining are processed in the same drain, before any macrotask.
Follow-ups they push on
- Could this pattern (nextTick scheduling nextTick) starve the timers phase indefinitely?
- Where does queueMicrotask sit relative to process.nextTick?
Red flag Running `timeout` before `nextTick-in-promise` — newly-queued microtasks still drain before any timer.
source: Node.js docs — Event loop, timers, and nextTick ↗
Commonly asked senior concept common Are microtasks (Promise callbacks) part of the event loop's phases? When exactly do they run?
No — microtasks are not one of the libuv phases. There are two microtask queues (the process.nextTick queue, then the Promise/queueMicrotask queue) that Node drains completely between every callback and at each phase boundary.
Concretely: run one callback from a phase, then fully drain nextTick, then fully drain Promises, then run the next callback. Because the drain is exhaustive, a flood of microtasks (or recursive nextTick) can delay the loop from ever reaching the next macrotask — a real starvation risk.
In the browser the model is similar but there is only the Promise microtask queue (no nextTick).
Follow-ups they push on
- How does this differ between Node and the browser?
- What is queueMicrotask and why prefer it over Promise.resolve().then for scheduling?
Red flag Describing microtasks as 'the last phase' of the loop — they interleave between callbacks, not at the end.
source: Node.js docs — Event loop, timers, and nextTick ↗

4.2 Async evolution & error handling 14

★ must-know Commonly asked mid concept common What does Node do by default when a promise rejects with no handler? Has this changed across versions?
In current Node (the --unhandled-rejections=throw default since v15), an unhandled rejection is treated like an uncaught exception: Node prints the error and terminates the process with a non-zero exit code.
This was a deliberate hardening. Older Node (≤ v14) only logged an UnhandledPromiseRejectionWarning and kept running — which let silent, half-broken state accumulate. The change forces you to handle rejections.
You can still observe them via the process.on("unhandledRejection", ...) event (log/flush before exit), or override the mode with --unhandled-rejections=warn, but the right fix is to await/.catch the promise. Treat a crash here as a real bug, not noise.
What a strong answer covers
- Current default (throw, since v15): an unhandled rejection crashes the process with a non-zero code.
- Node ≤ v14 only logged a warning and kept running — the old, dangerous behavior.
- Hook process.on('unhandledRejection') to log/flush, but exit; don't swallow.
- The real fix is upstream: await, return, or .catch the promise.
Quick self-check
By default in current Node, an unhandled promise rejection will:
Follow-ups they push on
- Why was 'log and keep running' considered dangerous enough to change the default?
- What's the difference between the unhandledRejection and rejectionHandled events?
Red flag Assuming an unhandled rejection just logs a warning — in modern Node it terminates the process.
source: Node.js docs — --unhandled-rejections=mode ↗
Commonly asked junior concept common Trace the evolution callbacks → Promises → async/await. What problem did each step solve?
Callbacks: the original async primitive — pass a function(err, result). The error-first convention is the norm, but nesting dependent async steps creates the deeply-indented "callback hell" / pyramid of doom, and error handling is manual at every level.
Promises (ES2015): a first-class object representing a future value with .then/.catch. They flatten nesting into chains and give one .catch for the whole chain. Composition helpers: Promise.all, race, allSettled, any.
async/await (ES2017): syntactic sugar over Promises. await lets you write asynchronous code that *reads* synchronously, and ordinary try/catch handles errors. Under the hood it is still Promises and microtasks.
Follow-ups they push on
- Is async/await just Promises under the hood? (Yes.)
- When would you still reach for raw Promise combinators over await?
Red flag Claiming async/await makes code run on a background thread — it is the same single-threaded microtask machinery.
source: MDN — Asynchronous JavaScript ↗
Commonly asked mid debug common What prints, and does the program crash? Promise.reject(new Error("boom")).catch(() => console.log("caught")); console.log("sync")
sync then caught, and it does not crash.
The .catch is attached synchronously, in the same expression — so the rejection has a handler from the start; it's never "unhandled." The handler runs as a microtask, after the synchronous console.log("sync"). So order is sync, then caught.
Contrast with const p = Promise.reject(...); ... attach .catch later: as long as the handler is attached within the same tick, Node still treats it as handled. The danger is a rejection that reaches the end of a tick with *no* handler attached — that's what fires unhandledRejection.
What a strong answer covers
- .catch is attached in the same expression, so the rejection is handled — no crash.
- The catch handler runs as a microtask, after synchronous code: sync then caught.
- A rejection is 'unhandled' only if no handler is attached by the end of the tick.
- Attaching .catch even a few lines later (same tick) still counts as handled.
Quick self-check
What is the output, and does it crash?
Follow-ups they push on
- What would change if you removed the .catch entirely?
- Does attaching .catch in a later setTimeout still prevent unhandledRejection?
Red flag Thinking a synchronously-caught rejection crashes — it's handled, and the handler is just a microtask.
source: MDN — Promise.prototype.catch ↗
Commonly asked mid concept common Why is `[1,2,3].forEach(async (x) => { await save(x); })` a trap? What happens to errors and ordering?
forEach ignores the return value of its callback. Your callback returns a Promise, but forEach discards it — so nothing awaits the saves. The result:
- No waiting: code after the forEach runs *before* any save finishes; you can't sequence anything after it.
- Lost errors: each callback's promise floats; a rejection becomes an unhandledRejection rather than something you can catch.
- No ordering guarantee relative to the surrounding code.
Use for...of with await for sequential, or await Promise.all(arr.map(fn)) for concurrent. Both actually wait and let errors propagate.
``for (const x of [1,2,3]) await save(x); // sequential await Promise.all([1,2,3].map((x) => save(x))); // concurrent``
What a strong answer covers
- forEach discards the callback's returned Promise — the awaits are never awaited by the caller.
- Code after the forEach runs before the saves complete (no sequencing).
- Rejections float → unhandledRejection, not catchable at the call site.
- Use for...of + await (sequential) or Promise.all(map(...)) (concurrent).
Follow-ups they push on
- Which replacement gives sequential vs concurrent execution?
- Do .map and .filter have the same async problem as forEach?
Red flag Passing an async function to forEach and assuming the loop waits — it doesn't, and errors are lost.
source: MDN — Array.prototype.forEach (Caveats / async) ↗
Commonly asked mid concept common What is a "floating promise," and why is it dangerous? Show a version of fetchUser() that silently loses errors.
A floating promise is a Promise you create but never await, return, or attach .catch to. If it rejects, the rejection is unhandled — the error vanishes (and in modern Node, crashes the process).
``function handler(req, res) { saveToDb(req.body); // floating — no await, no .catch res.send("ok"); // responds 200 even if the DB write throws }`
The client gets 200 OK while the write may have failed silently. Fixes: await saveToDb(...) (and wrap in try/catch), or return it, or attach .catch. Lint rules like @typescript-eslint/no-floating-promises` catch these.
Follow-ups they push on
- What does Node do by default on an unhandled rejection in current LTS?
- How does the `no-floating-promises` lint rule help?
Red flag Assuming an un-awaited async call's errors will surface somewhere — they are lost unless explicitly handled.
source: Node.js docs — process 'unhandledRejection' ↗
Commonly asked mid debug common Why doesn't this try/catch catch the error? try { setTimeout(() => { throw new Error("boom"); }, 0); } catch (e) { console.log("caught"); }
It does not catch anything — the program crashes with an uncaught exception.
try/catch only guards the synchronous execution of its block. By the time the setTimeout callback actually runs (a later event-loop tick), the try block has long since returned and its stack frame is gone. The thrown error has no surrounding catch, so it becomes an uncaughtException.
To handle it, the try/catch must live inside the async callback, or use a Promise and .catch/await:
``setTimeout(() => { try { throw new Error("boom"); } catch (e) { console.log("caught"); } }, 0);``
Follow-ups they push on
- Why does try/catch around an `await`ed Promise work, but not around a bare callback?
- What is the last-resort safety net for uncaught exceptions, and why shouldn't you keep running after one?
Red flag Believing a synchronous try/catch can catch errors thrown from a later callback.
source: Node.js docs — process 'uncaughtException' ↗
Commonly asked mid concept common Compare Promise.all, Promise.allSettled, Promise.race, and Promise.any. When would you pick each?
- Promise.all — resolves with an array of all results; rejects on the first rejection (fail-fast). Use when you need *every* task to succeed (e.g. fan-out queries that all must return).
- Promise.allSettled — never rejects; resolves with {status, value|reason} for each. Use when you want *all* results regardless of individual failures (e.g. notify N services, report which failed).
- Promise.race — settles (resolve or reject) as soon as the first promise settles. Use for timeouts: race the work against a timer.
- Promise.any — resolves with the first fulfilled value; rejects only if *all* reject (with an AggregateError). Use for redundancy: first successful mirror/replica wins.
Follow-ups they push on
- How do you implement a timeout with Promise.race?
- With Promise.all, do the other promises stop running when one rejects? (No — they keep going.)
Red flag Confusing `race` (first to settle, including rejection) with `any` (first to fulfill), or assuming `all` cancels siblings on rejection.
source: MDN — Promise.allSettled ↗
Commonly asked mid debug common What prints, and how long does it take? const a = await slow(1000); const b = await slow(1000); — vs — const [a, b] = await Promise.all([slow(1000), slow(1000)])
The sequential version takes ~2000ms; the Promise.all version takes ~1000ms.
In the first snippet each await *pauses* until that promise settles before the next call even starts — the two slow(1000) calls run back-to-back. In the second, both slow(1000) calls are invoked first (kicking off concurrently), and await Promise.all waits for both — so they overlap.
The lesson: await in a sequence serializes independent work. If tasks do not depend on each other, start them together and await the aggregate.
Follow-ups they push on
- How would you write this so b's input depends on a's result? (Then sequential is correct.)
- What's the bug in `for (const url of urls) await fetch(url)` when order doesn't matter?
Red flag Awaiting independent operations one-by-one in a loop, turning parallelizable work into serial latency.
source: MDN — Using Promises ↗
Commonly asked mid concept common How do you handle errors in async/await code, and what's the difference between unhandledRejection and uncaughtException?
Within an async function, wrap awaited calls in try/catch; the catch receives whatever the awaited promise rejected with. For fire-and-forget chains, attach .catch. At the boundary (e.g. an Express route), funnel errors to a central error handler.
The two process-level events:
- unhandledRejection — a Promise rejected with no handler. Usually a bug (a floating promise). In current Node it terminates the process by default.
- uncaughtException — a synchronous (or callback) error bubbled to the top with no try/catch.
Both should be treated as last-resort: log, flush, and exit. The process is in an unknown state, so do not silently continue serving traffic.
Follow-ups they push on
- Why is it unsafe to keep the process alive after an uncaughtException?
- Where should the single catch-all error handler live in an Express app?
Red flag Using process.on('uncaughtException') to swallow errors and keep running — that hides corruption and leaks.
source: Node.js docs — process events ↗
Commonly asked mid concept occasional What does util.promisify do, and why is the error-first callback convention what makes it possible?
util.promisify(fn) wraps a function that follows Node's error-first callback convention — fn(...args, (err, result) => ...) — and returns a version that returns a Promise instead. The promise rejects with err if it's truthy, otherwise resolves with result.
It works *only* because the callback shape is standardized: error first, single result second. promisify knows exactly where the error and value are, so it can mechanically translate callback → Promise. Functions with a different callback shape (multiple results, or callback-first) need promisify.custom or manual wrapping.
In practice you reach for it less now because most core modules ship promise variants (fs.promises, dns.promises, timers/promises), but it's still the bridge for legacy callback APIs.
What a strong answer covers
- Converts an error-first callback API into one that returns a Promise.
- Rejects on truthy err, resolves on the single result — exactly the error-first shape.
- Non-standard callback shapes need util.promisify.custom.
- Often unnecessary today: fs.promises, dns.promises, timers/promises exist.
Follow-ups they push on
- How would you promisify a callback that returns multiple result values?
- When would you still use util.promisify instead of the promise-native API?
Red flag Promisifying a function whose callback isn't error-first (or callback-first) — the wrapper resolves/rejects wrongly.
source: Node.js docs — util.promisify ↗
Commonly asked senior concept occasional Why is it unsafe to keep the process alive after an 'uncaughtException'? What's the correct response?
An uncaughtException means an error escaped all try/catch and bubbled to the top. At that point you have no idea what state the program is in — a half-finished write, a held lock, a corrupted in-memory structure, a leaked connection. Continuing to serve traffic on top of that corruption risks silently wrong results and resource leaks.
Node's own docs are explicit: the handler is for synchronous cleanup, not for resuming normal operation. The correct pattern is to log the error, flush logs/metrics, release critical resources, and exit with a non-zero code — then let your process manager (systemd, Kubernetes, PM2) restart a fresh, clean process.
For graceful handling of *expected* errors, catch them where they occur; uncaughtException is the last-resort net, not a control-flow mechanism.
What a strong answer covers
- After uncaughtException the process state is unknown/corrupt — locks, writes, structures may be half-done.
- Node docs: the handler is for sync cleanup, not for resuming work.
- Correct response: log, flush, release resources, exit non-zero; let a supervisor restart.
- Use it as a last-resort net; handle expected errors at their source.
Follow-ups they push on
- What process-level supervisor would restart the exited process in a container?
- How does the domain module / AsyncLocalStorage relate to error isolation?
Red flag Using process.on('uncaughtException') to swallow and continue — it masks corruption and leaks.
source: Node.js docs — Warning: Using 'uncaughtException' correctly ↗
Commonly asked senior concept occasional What does async/await actually compile to, and why does that mean two awaits in a row are slower than Promise.all?
await expr is syntactic sugar for taking the promise expr resolves to and suspending the function until it settles, scheduling the continuation as a microtask — roughly Promise.resolve(expr).then(continuation). The function literally pauses at each await and resumes only after that promise settles.
So const a = await f(); const b = await g(); cannot start g() until f() has fully settled — they are serialized, total time ≈ time(f) + time(g). With Promise.all([f(), g()]), both f() and g() are *invoked synchronously first* (kicking off concurrently), and you await the aggregate — total ≈ max(f, g).
The mental model: await is a pause point, not a parallelizer. Start independent work before you await it.
What a strong answer covers
- await ≈ pause the function and resume its continuation as a microtask once the promise settles.
- Sequential awaits serialize: each starts only after the previous settles.
- Promise.all invokes all the calls first, then awaits the aggregate → overlap.
- Use sequential awaits only when later work *depends* on the earlier result.
Follow-ups they push on
- How would you start two awaits concurrently without Promise.all? (Assign the promises first, await later.)
- Does code before the first await run synchronously? (Yes.)
Red flag Treating await as 'fire concurrently' — it's a suspension point; independent awaits run one after another.
source: MDN — await ↗
Commonly asked senior coding occasional How do you add a timeout to an async operation that has no built-in timeout, and what's the catch with AbortController?
The classic pattern is Promise.race between the work and a timer that rejects:
``function withTimeout(p, ms) { return Promise.race([ p, new Promise((_, rej) => setTimeout(() => rej(new Error("timeout")), ms)), ]); }`
The catch: Promise.race only stops waiting — it does not cancel the underlying work. The original promise keeps running (the request still completes, the socket stays open), and the leftover setTimeout keeps the loop alive unless you clearTimeout it. So you can leak timers and in-flight requests.
The better tool when the API supports it is AbortController: pass controller.signal to fetch/streams/etc. and call controller.abort() on timeout to actually cancel the work and release resources. AbortSignal.timeout(ms)` is the built-in shorthand. The catch with AbortController: it only works if the callee honors the signal — it can't cancel arbitrary code that ignores it.
What a strong answer covers
- Promise.race([work, timeoutReject]) is the standard timeout pattern.
- race stops *waiting* but does not cancel the underlying work — it keeps running.
- Clear the timer (clearTimeout) or it can keep the event loop alive / leak.
- Prefer AbortController / AbortSignal.timeout(ms) to truly cancel — but only if the callee honors the signal.
Follow-ups they push on
- Why does the work keep running after Promise.race rejects on timeout?
- What does AbortController give you that Promise.race can't?
Red flag Assuming Promise.race cancels the slow operation — it only stops awaiting it; the work and timer can leak.
source: MDN — AbortController ↗
Commonly asked senior coding occasional You have an array of IDs and want to fetch each, but the upstream API rate-limits you. Why is `await Promise.all(ids.map(fetchOne))` risky, and what's a better pattern?
Promise.all(ids.map(fetchOne)) fires all requests at once. With thousands of IDs you can exhaust sockets, blow memory, and trip the upstream rate limit — every request fails together.
Better: bound the concurrency. Process in fixed-size batches, or use a concurrency-limiter (e.g. p-limit) so at most N run at a time:
``const limit = pLimit(5); const results = await Promise.all( ids.map((id) => limit(() => fetchOne(id))) );`
This keeps Promise.all's aggregate semantics while capping in-flight requests at 5. For pure sequential needs, a plain for...of with await` works but is slow.
Follow-ups they push on
- How would you also add retry-with-backoff for the rate-limit 429s?
- Why is allSettled sometimes better than all here?
Red flag Unbounded Promise.all over a large array — it looks elegant but is a classic source of overload and rate-limit failures.
source: MDN — Promise.all ↗

4.3 Streams & buffers 14

Commonly asked mid concept common Name the four stream types in Node and give a concrete example of each.
- Readable — you read data out of it. Example: fs.createReadStream(file), an incoming HTTP request (req).
- Writable — you write data into it. Example: fs.createWriteStream(file), an HTTP response (res), process.stdout.
- Duplex — readable *and* writable, two independent channels. Example: a TCP socket (net.Socket).
- Transform — a Duplex where the output is a function of the input. Example: zlib.createGzip(), a crypto cipher, or a custom parser.
The value of streams: process data in chunks as it arrives instead of buffering the whole payload in memory.
Follow-ups they push on
- How is a Transform stream different from a plain Duplex?
- Which stream type is an HTTP request, and which is the response?
Red flag Saying Duplex and Transform are the same — Transform's output is derived from its input; a Duplex's two sides are unrelated.
source: Node.js docs — How to use streams ↗
Commonly asked mid coding occasional Sketch a custom Transform stream that uppercases text. What are the _transform and _flush methods for?
Subclass Transform (or pass a transform option) and implement _transform(chunk, encoding, callback): process each incoming chunk, push any output, and call callback() to signal you're ready for the next chunk (or callback(err) to error the stream).
``import { Transform } from "node:stream"; const upper = new Transform({ transform(chunk, _enc, cb) { this.push(chunk.toString().toUpperCase()); cb(); }, });`
_flush(callback) is optional and runs once, after the last chunk but before the stream ends — use it to emit any buffered/trailing data (e.g. the final piece of a line-splitter that has a partial line left over). _transform is per-chunk; _flush` is the one-time finalizer.
What a strong answer covers
- _transform(chunk, enc, cb) runs per chunk: process, this.push(...), then cb().
- Call cb(err) to propagate errors; calling cb signals readiness for the next chunk (backpressure-aware).
- _flush(cb) runs once after the last chunk to emit any buffered/trailing output.
- Pass { transform, flush } options or subclass — both work.
Follow-ups they push on
- When is _flush essential? (Buffered/partial data like the last incomplete line.)
- How does calling the callback relate to backpressure on the readable side?
Red flag Forgetting to call the _transform callback — the stream stalls because it never asks for the next chunk.
source: Node.js docs — Implementing a Transform stream ↗
Commonly asked mid debug common An Express handler does `fs.readFile(bigFile, (e, data) => res.send(data))` and the server OOMs under load. What's the streaming fix?
fs.readFile buffers the entire file into memory before sending. Under concurrency, N simultaneous requests for a big file means N full copies in RAM at once — the heap balloons and the process OOMs.
The fix is to stream the file straight to the response, so only small chunks are in memory and backpressure throttles reads to the client's download speed:
``import { pipeline } from "node:stream/promises"; await pipeline(fs.createReadStream(bigFile), res);`
pipeline wires backpressure (a slow client pauses the file read) and cleans up/propagates errors. Memory stays ~highWaterMark-sized per request, independent of file size. (Frameworks expose this as res.sendFile/reply.send(stream)`, which stream under the hood.)
What a strong answer covers
- fs.readFile loads the whole file into RAM; N concurrent requests = N full copies → OOM.
- Streaming sends chunks, so per-request memory ≈ highWaterMark regardless of file size.
- Backpressure throttles disk reads to the client's download rate.
- Use pipeline(createReadStream, res) (or res.sendFile) for error handling + cleanup.
Follow-ups they push on
- Why does pipeline matter here over a bare .pipe to res?
- What does a slow client do to a streamed response vs a buffered one?
Red flag Buffering whole files with readFile in a request handler — fine in dev, OOMs under concurrent load.
source: Node.js docs — How to use streams ↗
Commonly asked mid design common You must read a 10GB file, transform each line, and write the result — on a box with 512MB RAM. How?
Stream it; never load the whole file. Build a pipeline of a Readable → Transform → Writable so only small chunks are in memory at any moment, with backpressure keeping the buffers bounded:
``import { pipeline } from "node:stream/promises"; await pipeline( fs.createReadStream("in"), someLineTransform, fs.createWriteStream("out") );`
pipeline wires backpressure (the read pauses when the write is slow) and — crucially — propagates errors and cleans up every stream (destroying them) if any stage fails. Memory stays ~highWaterMark-sized, independent of the 10GB total. fs.readFile` would try to allocate 10GB and crash.
Follow-ups they push on
- Why prefer pipeline() over chaining .pipe()? (Error handling + cleanup.)
- How would you split the stream into lines before the transform?
Red flag Reaching for fs.readFile / reading into one big Buffer — it cannot fit and OOMs the process.
source: Node.js docs — stream.pipeline ↗
Commonly asked mid concept occasional What is a Buffer, and why does Node need it when JavaScript already has strings and arrays?
A Buffer is a fixed-length chunk of raw binary memory outside the V8 heap — Node's way of handling bytes (files, TCP packets, images, crypto) that pre-date TypedArray in the language. It is a subclass of Uint8Array.
JavaScript strings are UTF-16 text, not bytes; a regular array is boxed and heap-heavy. Binary protocols, file contents, and network frames are sequences of bytes — Buffer gives you direct, efficient access to them and lets you control the encoding when converting to/from strings (buf.toString("utf8"), Buffer.from(str, "base64")).
Gotcha: a multi-byte UTF-8 character can be split across two chunks; decode with StringDecoder or accumulate before toString.
Follow-ups they push on
- What goes wrong if you call buf.toString() on a chunk that splits a multi-byte character?
- Why is Buffer allocated off the V8 heap?
Red flag Treating chunk boundaries as character boundaries — concatenating decoded chunks can corrupt multi-byte UTF-8.
source: Node.js docs — Buffer ↗
Commonly asked mid debug common This streaming code occasionally crashes the whole server with no stack trace pointing at user code. What's the most likely cause?
An unhandled 'error' event on a stream. Streams are EventEmitters, and EventEmitter has a special rule: if an 'error' event is emitted and there is no 'error' listener, Node *throws* — crashing the process.
With streams this is easy to hit: a read fails (file gone, socket reset), the source emits error, nothing is listening, and the server dies. The fix is to handle error on every stream, or — better — use pipeline(), which routes errors to one place and destroys the streams.
``rs.on("error", handle); // not optional``
Follow-ups they push on
- Why does an EventEmitter throw specifically on an unhandled 'error' event?
- How does pipeline() remove the need to attach error handlers to each stream?
Red flag Handling 'data'/'end' but forgetting 'error' — the one event whose absence crashes the process.
source: Node.js docs — Error handling with streams ↗
Commonly asked mid concept occasional What's the difference between Buffer.alloc(n) and Buffer.allocUnsafe(n), and why does the 'unsafe' one exist?
Buffer.alloc(n) allocates n bytes and zero-fills them — safe, predictable, but it pays the cost of writing zeros across the whole buffer.
Buffer.allocUnsafe(n) allocates n bytes without initializing them, so the memory may contain leftover bytes from previously freed allocations — potentially old data (passwords, keys, other requests). It's faster precisely because it skips the zero-fill.
The 'unsafe' version exists for hot paths where you're about to fully overwrite the buffer immediately (e.g. you copy/fill into all n bytes before reading). The danger is forgetting to overwrite some region and then sending/logging it — leaking stale memory. Default to Buffer.alloc; reach for allocUnsafe only when you'll write every byte before reading and have measured a real win.
Never use the deprecated new Buffer(n) constructor — it's unsafe and removed/forbidden.
What a strong answer covers
- alloc zero-fills (safe); allocUnsafe skips initialization (faster, may expose old memory).
- allocUnsafe may contain sensitive leftover bytes from freed allocations.
- Only safe when you fully overwrite every byte before any read.
- Avoid the deprecated new Buffer() constructor entirely.
Quick self-check
Which statement about Buffer.allocUnsafe(n) is correct?
Follow-ups they push on
- What real security bug can leak from sending an under-written allocUnsafe buffer?
- Why was the old `new Buffer(n)` constructor deprecated?
Red flag Using allocUnsafe and not overwriting every byte — you can leak stale heap memory into output.
source: Node.js docs — Buffer.allocUnsafe ↗
Commonly asked senior concept occasional What does stream.finished() / the 'end' vs 'finish' vs 'close' events tell you, and which fires for readable vs writable?
Three lifecycle events that interviewers conflate:
- 'end' — fires on a Readable when there's no more data to read (the source is exhausted).
- 'finish' — fires on a Writable after end() is called and all data has been flushed to the underlying system.
- 'close' — fires when the stream and its resources (file descriptor, socket) are destroyed/closed; it's the cleanup signal, on both kinds.
Because getting these right by hand is error-prone, stream.finished(stream, cb) (and its promise form) gives you one callback that resolves when a stream is no longer readable/writable or errors — abstracting over end/finish/close/error. It's the robust way to know "this stream is truly done."
What a strong answer covers
- 'end' → Readable exhausted (no more data to read).
- 'finish' → Writable flushed everything after end().
- 'close' → underlying resource destroyed; cleanup signal on either side.
- stream.finished() unifies end/finish/close/error into one done-or-failed callback.
Follow-ups they push on
- Why might 'finish' fire but 'close' not, or vice versa?
- How is stream.finished safer than listening for 'end' yourself?
Red flag Listening for 'end' on a Writable (it never fires there) or 'finish' on a Readable — wrong event for the side.
source: Node.js docs — stream.finished() ↗
Commonly asked senior concept occasional What are object-mode streams, and async iteration over a stream (for await...of)? When would you use each?
Object mode ({ objectMode: true }) lets a stream's chunks be arbitrary JS values (objects, numbers) instead of Buffers/strings. Useful for pipelines of parsed records — e.g. a CSV row parser emitting objects into a Transform that validates them. In object mode highWaterMark counts objects, not bytes (default 16).
Async iteration: a Readable is async-iterable, so you can consume it with for await...of:
``for await (const chunk of fs.createReadStream(file)) { process(chunk); }`
This reads chunks one at a time with built-in backpressure (the loop body's await pauses reading) and lets you use ordinary try/catch for errors — far more readable than wiring 'data'/'end'/'error'` by hand. Use it whenever you'd otherwise write event-handler boilerplate to consume a stream sequentially.
What a strong answer covers
- Object mode: chunks are arbitrary JS values, not Buffers/strings; highWaterMark counts objects (default 16).
- Readables are async-iterable: for await...of consumes chunk-by-chunk.
- Async iteration has built-in backpressure and lets try/catch handle errors.
- Use object mode for record pipelines; async iteration to avoid 'data'/'end'/'error' boilerplate.
Follow-ups they push on
- How does for await...of provide backpressure automatically?
- What happens to the stream if you break out of the for await loop early?
Red flag Assuming chunks are always Buffers — in object mode they're whatever you pushed, and toString() would mangle them.
source: Node.js docs — Consuming readable streams with async iterators ↗
Commonly asked senior debug occasional Why can `chunk.toString()` on each stream chunk corrupt text, and how do you decode multi-byte data safely?
Stream chunks split at arbitrary byte boundaries, not character boundaries. A multi-byte UTF-8 character (emoji, accented letters, CJK) can land with its first byte at the end of one chunk and the rest at the start of the next. Calling chunk.toString("utf8") on each chunk independently then decodes a partial character — producing the replacement char ` or mojibake — and you can't fix it by concatenating the broken strings afterward.
Safe options: - Usestring_decoder.StringDecoder, which buffers incomplete multi-byte sequences across chunks and only emits complete characters. - Or set the stream's encoding withsetEncoding("utf8") (which uses StringDecoder internally) so 'data'yields decoded strings. - Or accumulate the raw Buffers andBuffer.concat(...).toString()` once at the end (fine for small data, not for huge streams).
What a strong answer covers
- Chunks break on byte boundaries; a multi-byte char can straddle two chunks.
- chunk.toString() per chunk decodes partial characters → garbled output you can't repair by concatenation.
- Use StringDecoder (buffers incomplete sequences) or stream.setEncoding('utf8').
- Alternatively Buffer.concat all chunks and decode once — only for small payloads.
Follow-ups they push on
- Why can't you just concatenate the per-chunk decoded strings to fix it?
- When is Buffer.concat-then-decode acceptable vs StringDecoder?
Red flag Decoding each chunk with toString() independently — multi-byte characters spanning chunk boundaries corrupt.
source: Node.js docs — StringDecoder ↗
Commonly asked senior concept common What is backpressure? What does it mean when stream.write() returns false, and what is the 'drain' event for?
Backpressure is the feedback that a fast producer is outpacing a slow consumer. Each writable stream has an internal buffer with a highWaterMark. When write() pushes the buffer past that threshold, it returns false — a signal saying "stop writing, I'm full."
If you ignore it and keep writing, the buffer grows unbounded and memory balloons. The correct response: pause the source and wait for the drain event, which fires once the buffer has emptied below the mark, then resume.
You rarely wire this by hand — pipe() and pipeline() implement the pause/resume dance for you, which is exactly why they are preferred.
Follow-ups they push on
- How does pipe() handle backpressure automatically?
- What is highWaterMark and what happens if you set it very high?
Red flag Writing in a loop while ignoring write()'s return value — unbounded memory growth under load.
source: Node.js docs — Stream backpressuring ↗
Commonly asked senior concept occasional Why is pipeline() preferred over chaining .pipe()? What does each do about errors?
a.pipe(b).pipe(c) handles backpressure but not errors: if b emits error, pipe does not forward it or destroy the other streams. You are left with un-destroyed streams (leaked file descriptors/sockets) and an unhandled error event — which crashes the process if no listener exists.
stream.pipeline(a, b, c, cb) (or the promise form node:stream/promises) wires the same backpressure and: forwards the first error to the callback/rejection, and destroys every stream in the chain on completion or failure. That cleanup is the whole reason to prefer it.
Rule of thumb: use pipeline for anything with real error/cleanup needs; bare .pipe only for trivial throwaway cases.
Follow-ups they push on
- What resource leaks when a .pipe chain errors mid-way?
- What does the promise version of pipeline let you do with async/await?
Red flag Using long .pipe chains in production and assuming an error anywhere is handled — it is not.
source: Node.js docs — stream.pipeline ↗
Commonly asked senior concept occasional What are the two reading modes of a Readable stream (flowing vs paused), and how do you switch between them?
A Readable stream is in one of two modes:
- Paused (pull) — you explicitly call read() to pull chunks. This is the default for a freshly created stream.
- Flowing (push) — chunks are pushed at you as fast as they arrive via 'data' events.
It switches to flowing when you attach a 'data' listener, call resume(), or pipe() it. It goes back to paused with pause() or by removing the 'data' listener (and unpipe).
The practical takeaway: attaching a 'data' handler starts the firehose immediately — if your consumer is slow you must respect backpressure (or just use pipe/pipeline, which manages the mode for you).
Follow-ups they push on
- What starts a stream flowing the moment you attach a 'data' listener?
- Which mode does pipe() put the source in?
Red flag Adding a 'data' listener and assuming the stream waits for you — it starts pushing chunks immediately.
source: Node.js docs — Two reading modes ↗
Commonly asked senior concept occasional What is highWaterMark on a stream, and what actually happens if you set it very high vs very low?
highWaterMark is the buffer threshold that drives backpressure. For a Writable it's the byte (or object) count at which write() starts returning false; for a Readable it's how much data the stream buffers ahead via internal read() calls. Default is 64 KB for byte streams (16 objects in object mode).
- Set it very high: the stream buffers a lot before signaling backpressure, so more data sits in memory. You get fewer pause/resume cycles (possibly slightly higher throughput) at the cost of a bigger memory footprint — and a huge value can defeat the point of streaming.
- Set it very low: backpressure kicks in almost immediately, memory stays tiny, but you pay more overhead in frequent pause/resume and read calls, hurting throughput.
It's a memory-vs-throughput knob; the 64 KB default is a sensible balance for most workloads.
What a strong answer covers
- The buffer threshold that triggers backpressure (write() → false; readable buffers ahead).
- Default 64 KB for byte streams, 16 for object mode.
- Higher → more in-memory buffering, fewer pause/resume cycles, bigger footprint.
- Lower → tighter memory, more overhead from frequent backpressure signaling.
Follow-ups they push on
- How does highWaterMark interact with the drain event?
- Why might a very high highWaterMark partially defeat the purpose of streaming?
Red flag Cranking highWaterMark up to 'go faster' — it just buffers more in memory and can reintroduce OOM risk.
source: Node.js docs — Buffering / highWaterMark ↗

4.4 Modules & packages 14

Commonly asked junior concept common package.json: dependencies vs devDependencies vs peerDependencies — what's the distinction and when does each install?
- dependencies — packages your code needs at runtime (Express, the DB driver). Installed for everyone who installs your package.
- devDependencies — needed only to build/test/lint (TypeScript, jest, eslint). Installed for local dev, but skipped with npm install --omit=dev (production installs).
- peerDependencies — a package your plugin expects the host project to provide, to avoid duplicate/clashing copies (e.g. a React component library lists react as a peer so it uses the app's single React).
Getting this wrong: a runtime package in devDeps breaks production; a build tool in deps bloats the production image.
Follow-ups they push on
- What breaks if you put your web framework in devDependencies?
- Why do React component libraries list react as a peerDependency rather than a dependency?
Red flag Putting runtime libs in devDependencies — works locally, then crashes in a --omit=dev production install.
source: npm docs — package.json dependencies ↗
Commonly asked mid concept very common CommonJS vs ES Modules: name the real differences (syntax, loading, this, __dirname, top-level await).
- Syntax: CJS uses require() / module.exports; ESM uses import / export.
- Loading: CJS is synchronous and loads at runtime, so require() can be conditional/dynamic. ESM is asynchronous with a static parse phase — imports are hoisted and resolved before the body runs (use dynamic import() for conditional loading).
- Bindings: CJS exports a *copied value*; ESM exports *live bindings* (re-exported values stay in sync).
- this: top-level this is module.exports in CJS, but undefined in ESM.
- __dirname/__filename: available in CJS; in ESM you derive them from import.meta.url.
- Top-level await: allowed in ESM, not in CJS.
Node picks the mode from "type" in package.json (or .cjs/.mjs extension).
Follow-ups they push on
- How do you get __dirname in an ES module?
- Why can you require() conditionally but not top-level-import conditionally?
Red flag Saying they are interchangeable — sync vs async loading and live-bindings vs copied-values cause real behavioral differences.
source: Node.js docs — Modules: ECMAScript modules ↗
Commonly asked mid concept occasional What is a transitive dependency, and why can `npm audit` report dozens of vulnerabilities you didn't install directly?
A transitive (indirect) dependency is a package your dependencies depend on — not something you listed in your package.json. A modern app with a handful of direct deps routinely pulls in hundreds of transitive packages, and the lockfile records the whole tree.
npm audit scans that entire tree against a vulnerability database, so most reported issues live deep in transitive packages you never named. That's also the supply-chain risk surface: you trust not just your deps but everything they trust.
Fixing them: npm audit fix bumps within allowed ranges; a transitive fix may require the direct dependency to update, or an overrides entry in package.json to force a patched version. And weigh severity in context — a vuln in a dev-only or unreachable code path isn't always exploitable in your app.
What a strong answer covers
- Transitive = a dependency of your dependencies; you didn't list it directly.
- Apps pull in hundreds of transitive packages; the lockfile captures the full tree.
- npm audit scans the whole tree, so most findings are in indirect packages.
- Fix via npm audit fix, upgrading the direct dep, or overrides to pin a patched version.
Follow-ups they push on
- When would you use the `overrides` field to force a transitive version?
- Why isn't every audit 'high severity' finding actually exploitable in your app?
Red flag Treating every npm audit finding as a critical blocker, or assuming you can only fix direct dependencies.
source: npm docs — npm audit ↗
Commonly asked mid concept common How do you get __dirname and __filename in an ES module, and why aren't they available like in CommonJS?
In CommonJS, __dirname and __filename are injected into every module's wrapper scope. ESM has no such wrapper — modules run in a standard scope where those magic variables don't exist. Instead, ESM gives you import.meta.url, the file's URL (a file:// string).
Derive the paths from it:
``import { fileURLToPath } from "node:url"; import { dirname } from "node:path"; const __filename = fileURLToPath(import.meta.url); const __dirname = dirname(__filename);`
fileURLToPath is required because import.meta.url is a URL, not a filesystem path (and on Windows or with spaces/special chars, naive string slicing breaks). Recent Node also exposes import.meta.dirname / import.meta.filename` as conveniences.
What a strong answer covers
- CJS injects __dirname/__filename via the module wrapper; ESM has no wrapper.
- ESM exposes import.meta.url (a file:// URL) instead.
- Convert with fileURLToPath(import.meta.url) then path.dirname(...).
- Don't string-slice the URL — fileURLToPath handles Windows/encoding correctly.
Follow-ups they push on
- Why is fileURLToPath needed instead of just stripping the file:// prefix?
- What are import.meta.dirname and import.meta.filename?
Red flag Hand-parsing import.meta.url by slicing 'file://' — breaks on Windows paths and URL-encoded characters.
source: Node.js docs — import.meta.url ↗
Commonly asked mid concept occasional Why does committing node_modules vs relying on the lockfile matter, and what makes `npm ci` deterministic where `npm install` isn't?
You normally don't commit node_modules (huge, platform-specific native builds, churns the diff); you commit the lockfile and rebuild from it. The lockfile + npm ci is what gives reproducibility without the bloat.
What makes them differ:
- npm install treats package.json as the source of truth: it resolves ranges, may update the lockfile, and reuses/patches an existing node_modules. Two installs at different times can yield different trees if a new in-range version was published.
- npm ci treats the lockfile as authoritative: it deletes node_modules first, installs the exact pinned versions, and errors if package.json and the lockfile disagree. No range re-resolution, so the tree is byte-identical every run — ideal for CI/prod.
So determinism comes from npm ci refusing to re-resolve ranges and always starting from a clean slate.
What a strong answer covers
- Commit the lockfile, not node_modules (bloat + platform-specific native builds).
- npm install may update the lockfile and reuse node_modules → can drift over time.
- npm ci wipes node_modules and installs the exact lockfile versions, erroring on mismatch.
- Determinism = no range re-resolution + clean-slate install.
Follow-ups they push on
- Why might committing node_modules with native addons break on a teammate's machine?
- What happens with npm ci if you forgot to update the lockfile after editing package.json?
Red flag Using npm install in CI (non-deterministic, can silently bump versions) instead of npm ci.
source: npm docs — npm ci ↗
Commonly asked mid trick common What's the difference between `exports = foo` and `module.exports = foo` in CommonJS? Which one actually works, and why?
Only module.exports = foo works to replace the whole export.
At module start, Node does roughly exports = module.exports = {} — exports is just a *local variable pointing at the same object* as module.exports. What gets returned to the requirer is module.exports.
- exports.foo = ... works because you are mutating the shared object.
- exports = foo only reassigns the local variable exports; module.exports still points at the original {}, so the requirer gets an empty object.
- module.exports = foo correctly replaces what is returned.
Rule: use exports.x = ... to add properties, but module.exports = ... to export a single thing.
Follow-ups they push on
- After `module.exports = foo`, does `exports.bar = 1` still affect the export? (No.)
- Why does `exports.foo = ...` work but `exports = {...}` not?
Red flag Reassigning `exports = ...` and wondering why the importer gets `{}` — you broke the alias to module.exports.
source: Node.js docs — module.exports vs exports ↗
Commonly asked mid concept common In semver, what versions does "^1.2.3" allow, and how does that differ from "~1.2.3"? When is each dangerous?
Semver is MAJOR.MINOR.PATCH.
- ^1.2.3 (caret) allows everything up to but not including the next MAJOR — >=1.2.3 <2.0.0. So 1.9.0 is fine; 2.0.0 is not. (Special case: for 0.x, ^0.2.3 is treated as >=0.2.3 <0.3.0 — a 0.x minor bump can break.)
- ~1.2.3 (tilde) allows only PATCH bumps — >=1.2.3 <1.3.0.
Caret is the npm default. The risk: a sloppy maintainer ships a breaking change in a *minor*, and your caret range silently pulls it in. That is exactly why package-lock.json pins exact resolved versions for reproducible installs.
Follow-ups they push on
- Why is the lockfile essential even though you specified a range?
- What does ^0.2.3 resolve to, and why is the 0.x rule special?
Red flag Thinking ^ and ~ are the same, or trusting that minor bumps are always non-breaking.
source: npm docs — About semantic versioning ↗
Commonly asked mid concept common What does package-lock.json do, and why should you commit it? What's the difference between `npm install` and `npm ci`?
package-lock.json records the *exact* version, resolved URL, and integrity hash of every package in the tree (including transitive deps). Because package.json only specifies ranges, the lockfile is what makes installs reproducible — everyone and CI get byte-identical trees. Commit it.
- npm install reads package.json, may update the lockfile to satisfy ranges, and adds/removes packages. Good for development.
- npm ci installs strictly from the lockfile, errors if package.json and the lock disagree, and wipes node_modules first. Deterministic and faster — the right choice for CI and production builds.
Follow-ups they push on
- Why does npm ci fail if package.json and the lockfile are out of sync?
- What integrity field in the lockfile protects against tampered packages?
Red flag Gitignoring the lockfile (irreproducible builds) or using `npm install` in CI instead of `npm ci`.
source: npm docs — npm ci ↗
Commonly asked mid debug common What prints? // counter.js: let c = 0; module.exports = { inc: () => ++c, get: () => c }; // app.js: const a = require("./counter"); const b = require("./counter"); a.inc(); console.log(b.get())
It prints 1.
CommonJS caches modules by resolved path. The first require("./counter") executes the file once and caches its module.exports; the second require returns the same cached object — no re-execution. So a and b are the *same* object sharing the *same* c. a.inc() makes c 1, and b.get() reads that same c: 1.
This is why a module is effectively a singleton — handy for shared config/connections, but a trap if you expect a fresh instance per require.
Follow-ups they push on
- What key does the module cache use, and how can the same file be loaded twice?
- How would you force a fresh module instance? (Bust require.cache — and why that's usually a smell.)
Red flag Expecting each require to give a fresh module — it returns the cached singleton.
source: Node.js docs — Modules caching ↗
Commonly asked senior concept occasional What does the "exports" field in package.json do, and how do conditional exports (import/require/default) work?
The exports field defines a package's official entry points and, crucially, encapsulates it: once you declare exports, consumers can import only the paths you list — deep imports into internal files (pkg/lib/secret.js) are blocked. It supersedes main.
Conditional exports map one specifier to different files depending on how it's loaded:
``{ "exports": { ".": { "import": "./index.mjs", "require": "./index.cjs", "default": "./index.mjs" } } }`
Node picks import when loaded via import/import(), require when loaded via require(), and default as the fallback. This is how a package ships both an ESM and a CJS build from one entry point (the "dual package"). Conditions are matched in order, so put more specific ones first; default` must be last.
What a strong answer covers
- exports declares entry points and encapsulates internals (blocks deep imports).
- Conditional exports map a specifier to different files by condition.
- import vs require lets one package ship both ESM and CJS builds (dual package).
- Conditions match in order, most-specific first; default is the last-resort fallback.
Follow-ups they push on
- What's the 'dual package hazard' and how do conditional exports relate to it?
- How does the exports field break tools that relied on deep-importing internal files?
Red flag Adding an exports field and accidentally breaking consumers who deep-imported internal paths.
source: Node.js docs — Packages: conditional exports ↗
Commonly asked senior debug occasional What prints? // a.mjs: export let count = 0; export function inc() { count++; } // main.mjs: import { count, inc } from "./a.mjs"; inc(); console.log(count)
It prints 1.
ESM exports are live bindings, not copied values. The imported count is a read-only *view* of the exporter's count variable — not a snapshot taken at import time. When inc() mutates count inside a.mjs, the importer's view reflects the new value, so console.log(count) reads 1.
Contrast with CommonJS: const { count } = require("./a") copies the value at require time, so calling inc() would not change your local count (it'd still be 0). Note you can *read* the live binding but not reassign it from the importer (count = 5 throws — imports are read-only).
What a strong answer covers
- ESM imports are live, read-only bindings to the exporter's variables.
- Mutating the exported variable inside its module is visible to all importers.
- CommonJS copies values at require time, so it would still print 0.
- Importers can read the live value but cannot reassign the binding (TypeError).
Quick self-check
What does main.mjs print?
Follow-ups they push on
- What's the CommonJS equivalent and why does it print 0 instead?
- Why can't you reassign an imported binding in the importing module?
Red flag Assuming ESM imports are value snapshots like CJS — they're live bindings, so mutations show through.
source: MDN — export (live bindings) ↗
Commonly asked senior concept occasional How does Node resolve `require("foo")` (a bare specifier) vs `require("./foo")` (a relative path)?
Relative/absolute (./foo, ../foo, /abs/foo): resolve against the current file. Node tries the exact path, then foo.js/foo.json/foo.node, then foo/ as a directory (its package.json main/exports, else index.js).
Bare specifier (foo): Node walks node_modules outward — ./node_modules/foo, then the parent's node_modules, up to the filesystem root — and uses the first match. Core modules (fs, path, or node:fs) short-circuit this and win immediately.
This outward walk is why a dependency can resolve a different copy of a package than your app, and why node_modules can nest.
Follow-ups they push on
- Why might two packages each get their own copy of a shared dependency?
- What does the `exports` field in package.json change about resolution?
Red flag Assuming bare specifiers resolve from one global location — Node searches node_modules up the directory tree.
source: Node.js docs — Modules: all-together resolution ↗
Commonly asked senior trick occasional What is a circular dependency between two CommonJS modules, and what does the importer actually receive?
A circular dependency is a.js requiring b.js while b.js requires a.js. CommonJS doesn't deadlock — it returns a partially-completed module.exports.
When a starts loading and requires b, b begins executing; if b then requires a, Node sees a is already in progress and hands b the **partial exports of a as they exist *right now* (whatever a had assigned before the require(b) line). If a hadn't exported the thing b needs yet, b sees undefined.
So behavior depends on statement order** and is fragile. Symptom: a value is mysteriously undefined only when modules load in a particular order. Fixes: restructure to break the cycle, extract the shared piece into a third module, or require lazily (inside the function that uses it). ESM handles cycles better via live bindings but can still hit temporal-dead-zone errors.
What a strong answer covers
- CJS doesn't deadlock; it returns the partial exports of the in-progress module.
- What b sees of a depends on what a had exported before its require(b) line.
- Symptom: a dependency value is undefined depending on load order.
- Fix: break the cycle, extract a shared module, or require lazily inside a function.
Follow-ups they push on
- How does ESM's live-binding model change circular-dependency behavior?
- Why does moving the require() to the bottom of the file sometimes 'fix' it?
Red flag Assuming a circular require throws or deadlocks — it silently returns half-initialized exports.
source: Node.js docs — Modules: Cycles ↗
Commonly asked senior concept common How do you import a CommonJS package from an ES module, and an ESM-only package from CommonJS? Why is one harder?
CJS → from ESM: easy. import pkg from "cjs-package" works — Node treats the module's module.exports as the default export. Named imports work for statically-detectable named exports, but the whole object is reliably available as the default.
ESM-only → from CJS: harder, because require() of an ESM module is restricted. ESM is asynchronous (it can use top-level await) and require is synchronous, so historically require("esm-only-pkg") threw ERR_REQUIRE_ESM. The portable workaround is dynamic import(), which returns a promise:
``const { thing } = await import("esm-only-pkg");`
(Recent Node versions added synchronous require() of ESM that has no top-level await, but dynamic import() is the safe, version-independent answer.)
The asymmetry comes from sync-vs-async loading: pulling async ESM into a sync require` is the fundamentally awkward direction.
What a strong answer covers
- CJS from ESM: import x from 'cjs' — module.exports becomes the default export.
- ESM from CJS: require() is restricted (ESM is async), classically ERR_REQUIRE_ESM.
- Portable fix for ESM-from-CJS: dynamic import() (returns a promise).
- Asymmetry stems from ESM being async (top-level await) vs require being synchronous.
Follow-ups they push on
- Why is dynamic import() the version-safe way to load ESM from CJS?
- What does it mean that newer Node can require() ESM without top-level await?
Red flag Trying to require() an ESM-only package and hitting ERR_REQUIRE_ESM — reach for dynamic import().
source: Node.js docs — Interoperability with CommonJS ↗

4.5 Globals, events & CPU concurrency 14

Commonly asked junior concept common What's on the `process` global that you actually use? Cover argv, env, exit codes, and the on() events.
process is the interface to the running Node process:
- process.argv — CLI arguments array; [0] is the node binary, [1] is the script, real args from [2].
- process.env — environment variables (always strings); the standard place for config/secrets (process.env.NODE_ENV, DATABASE_URL).
- process.exit(code) — terminate now with an exit code (0 success, non-zero failure). Prefer letting the loop drain naturally; exit() can cut off in-flight I/O.
- Events: process.on("SIGTERM"/"SIGINT", ...) for graceful shutdown, plus "uncaughtException" and "unhandledRejection" as last-resort handlers.
It is also an EventEmitter, which is why those on(...) hooks exist.
Follow-ups they push on
- Why does process.argv start your real arguments at index 2?
- How do you implement graceful shutdown on SIGTERM in a containerized service?
Red flag Calling process.exit() in the middle of request handling and truncating pending writes/logs.
source: Node.js docs — process ↗
Commonly asked junior concept occasional Why should you read configuration from environment variables (process.env) instead of hardcoding it or committing a config file?
It is the twelve-factor practice: keep config in the environment, separate from code. Benefits:
- One build, many environments — the same artifact runs in dev/staging/prod by swapping env vars; no code change or rebuild per environment.
- Secrets stay out of git — DB passwords and API keys never land in the repo (a top cause of credential leaks).
- Ops-friendly — platforms (Docker, Kubernetes, Cloudflare, CI) all inject env vars natively.
Practical notes: process.env values are always strings (coerce numbers/booleans yourself), use a .env file (gitignored) locally, and validate required vars at startup so a missing DATABASE_URL fails fast rather than at 3am.
Follow-ups they push on
- Why validate env vars at boot instead of where they're used?
- What type are process.env values, and what bug does that cause with `process.env.PORT`?
Red flag Committing secrets in a config file, or assuming process.env.PORT is a number (it's a string).
source: Node.js docs — process.env ↗
Commonly asked mid design common How do you implement graceful shutdown on SIGTERM in a containerized Node service, and why does it matter?
When an orchestrator (Kubernetes, Docker, a process manager) stops your container, it sends SIGTERM and gives a grace period before SIGKILL. Without handling it, in-flight requests are cut off, connections drop, and writes can be left half-done.
Graceful shutdown on SIGTERM:
``process.on("SIGTERM", async () => { server.close(); // stop accepting new connections await drainInFlightRequests(); // let current ones finish await db.end(); await redis.quit(); // close pools/connections process.exit(0); });`
Steps: stop accepting new work (server.close()), let in-flight requests drain (with a timeout fallback so a stuck request can't hang shutdown forever), close DB/cache/queue connections, then exit 0. This avoids dropped requests during deploys/scaling and prevents connection-pool leaks. Also handle SIGINT` for local Ctrl-C.
What a strong answer covers
- Orchestrators send SIGTERM, then SIGKILL after a grace period.
- On SIGTERM: stop accepting new connections (server.close), drain in-flight, close pools, exit 0.
- Add a timeout fallback so a stuck request can't block shutdown indefinitely.
- Prevents dropped requests during deploys and connection-pool leaks; handle SIGINT too.
Follow-ups they push on
- Why do you need a timeout fallback around draining in-flight requests?
- What happens to open requests if you ignore SIGTERM until SIGKILL?
Red flag Calling process.exit(0) immediately on SIGTERM, truncating in-flight requests instead of draining first.
source: Node.js docs — Signal events ↗
Commonly asked mid concept occasional What globals are available in Node without require (e.g. globalThis, Buffer, __dirname, setTimeout), and which are NOT truly global?
Genuinely global (available anywhere, no import): globalThis, process, Buffer, console, the timer functions (setTimeout/setInterval/setImmediate and their clear*), queueMicrotask, URL/URLSearchParams, TextEncoder/TextDecoder, and (in modern Node) fetch, structuredClone, and AbortController.
The trap — these look global but are actually module-scoped variables injected by the CommonJS wrapper, not properties of globalThis: __dirname, __filename, require, module, exports. That's exactly why they don't exist in ES modules (no wrapper) — you use import.meta.url and static import instead.
So: timers/process/Buffer are true globals; the require/module/__dirname family are per-module wrapper locals.
What a strong answer covers
- True globals: globalThis, process, Buffer, console, timers, fetch, URL, AbortController, etc.
- __dirname, __filename, require, module, exports are module-wrapper locals, not on globalThis.
- That's why those CJS locals are absent in ESM (no module wrapper).
- Many former-polyfill APIs (fetch, structuredClone) are now built-in globals.
Follow-ups they push on
- Why are __dirname and require unavailable in ES modules?
- Is `fetch` available globally in current Node without a library?
Red flag Calling __dirname/require 'global' — they're injected per-module by the CJS wrapper and absent in ESM.
source: Node.js docs — Global objects ↗
AmazonMetaTikTok mid concept common Explain the EventEmitter pattern. What's special about the 'error' event, and what's the 'newListener' / max-listeners warning about?
EventEmitter is Node's pub/sub primitive: register handlers with on(event, fn) (or once) and fire them with emit(event, ...args). Synchronous by default — listeners run in registration order on the same tick. Much of Node's API (streams, HTTP servers, sockets) is built on it.
Two gotchas interviewers probe:
- 'error' is special: if you emit("error") and there is no error listener, the emitter throws and crashes the process. Always handle 'error'.
- Max listeners: adding more than 10 listeners for one event logs a *MaxListenersExceededWarning* — a heuristic for a listener leak (e.g. adding a handler per request and never removing it). Raise the limit with setMaxListeners only if it is genuinely intentional.
Follow-ups they push on
- Why does an unhandled 'error' event crash, while other unhandled events are silent?
- What real bug does the 'more than 10 listeners' warning usually indicate?
Red flag Treating the max-listeners warning as noise and bumping the limit, instead of finding the leak.
source: GreatFrontend — JS interview questions by ex-interviewers ↗
Commonly asked mid debug occasional What prints? const EventEmitter = require("events"); const e = new EventEmitter(); e.on("x", () => console.log("A")); e.on("x", () => console.log("B")); console.log("before"); e.emit("x"); console.log("after")
before A B after.
EventEmitter listeners are synchronous — emit calls each registered handler in order, on the same tick, before emit returns. So console.log("before") runs, then emit("x") invokes the two listeners immediately (A, then B, in registration order), and only then does console.log("after") run.
This surprises people who assume events are deferred/async like DOM events or setTimeout. If you need a listener to yield, you must defer it yourself (e.g. setImmediate inside the handler).
Follow-ups they push on
- How would you make a listener run asynchronously without blocking emit?
- In what order do multiple listeners for the same event fire?
Red flag Assuming emit is asynchronous and printing `before after A B`.
source: Node.js docs — EventEmitter emit ↗
Commonly asked mid concept common What's the difference between EventEmitter's on() and once(), and why is a per-request on() handler a classic leak?
on(event, fn) registers a handler that fires on every emission until you remove it. once(event, fn) fires exactly once and then auto-removes itself.
The leak: code that does emitter.on("data", handler) per request (or per connection) on a long-lived emitter, without ever calling removeListener/off. Each request adds another handler that's never cleaned up; the array of listeners grows unbounded, the closures pin everything they captured, and memory climbs. Node's heuristic warns at >10 listeners (MaxListenersExceededWarning) precisely to catch this.
Fixes: use once when you only need the next event; remove handlers when the request ends (off); or use AbortSignal/{ signal } to auto-detach. The warning is a symptom — find and remove the accumulating listener, don't just raise setMaxListeners.
What a strong answer covers
- on fires on every emission until removed; once fires once and auto-removes.
- Per-request on() on a long-lived emitter without cleanup accumulates listeners → leak.
- Captured closures keep referenced objects alive; >10 listeners triggers the warning.
- Fix with once, explicit off, or an AbortSignal — not by bumping setMaxListeners.
Quick self-check
Which best describes the difference between on() and once()?
Follow-ups they push on
- How does passing an AbortSignal help auto-remove a listener?
- Why does the leak grow memory and not just listener count?
Red flag Adding a listener per request and never removing it, then silencing the max-listeners warning instead of fixing it.
source: Node.js docs — emitter.once() ↗
Commonly asked senior design very common Worker threads vs child processes vs cluster — what does each give you, and when do you pick which?
All add parallelism, but for different jobs:
- Worker threads — multiple JS threads in one process, can share memory via SharedArrayBuffer, cheap to spawn, message-passing for the rest. Pick for CPU-bound JS work (image resize, parsing, hashing) you want to keep in-process.
- Child processes (spawn/fork/exec) — full separate OS processes, total isolation, can run any program (not just Node). Pick to run an external binary (ffmpeg, git) or to isolate untrusted/risky work.
- Cluster — forks multiple copies of your server that share one listening port; the OS load-balances connections across them. Pick to use all CPU cores for an I/O-bound server (the classic way to scale an HTTP server).
Shorthand: CPU-bound in-process → worker; external program / isolation → child process; scale a server across cores → cluster.
Follow-ups they push on
- Why is cluster the wrong tool for a single CPU-heavy computation?
- How do worker threads share data without copying it?
Red flag Reaching for cluster to speed up one CPU-bound task — cluster scales request throughput, not a single computation.
source: Node.js docs — Worker threads ↗
Commonly asked senior concept occasional How do worker threads communicate with the main thread, and what data can/can't cross the boundary?
Worker threads talk over a message channel: worker.postMessage(value) / parentPort.postMessage(value), received via "message" events. There is no shared scope — each thread has its own V8 isolate, globals, and module registry.
What can cross:
- Structured-cloneable values — objects, arrays, Maps, Sets, typed arrays, etc. are copied (deep clone).
- Transferable objects (ArrayBuffer, MessagePort) can be moved in the transferList: ownership transfers and the sender's copy is detached (zero-copy, but no longer usable on the sender).
- SharedArrayBuffer is genuinely shared (both threads see the same bytes; coordinate with Atomics).
What can't cross: functions, closures, class instances with methods, DOM-like handles — anything not structured-cloneable throws a DataCloneError. So you pass data, not behavior; the worker loads its own code from a file/string.
What a strong answer covers
- Communicate via postMessage + 'message' events; no shared scope between threads.
- Plain data is deep-copied via structured clone.
- ArrayBuffer/MessagePort can be transferred (detached on sender, zero-copy).
- Functions/closures/methods can't be sent (DataCloneError); SharedArrayBuffer truly shares memory.
Follow-ups they push on
- What's the difference between transferring an ArrayBuffer and copying it?
- Why can't you postMessage a function to a worker?
Red flag Trying to postMessage a function or class instance with methods — only structured-cloneable data crosses.
source: Node.js docs — worker.postMessage() ↗
Commonly asked senior debug occasional What prints, and is it on the main thread? const { Worker, isMainThread } = require("worker_threads"); if (isMainThread) { new Worker(__filename); console.log("main"); } else { console.log("worker"); }
Both main and worker print — main from the main thread, worker from the spawned worker — and the relative order is non-deterministic (the worker starts asynchronously, so main usually prints first, but don't rely on it).
The pattern is the standard self-referencing worker: the file checks isMainThread. On first run it's true, so the branch spawns a Worker(__filename) — which re-executes the same file in a new thread where isMainThread is false, taking the else branch and printing worker. The worker has its own module instance, globals, and event loop; it does not share memory with the main thread (only message-passing / SharedArrayBuffer).
What a strong answer covers
- Both branches run: main on the main thread, worker in the spawned thread.
- new Worker(__filename) re-executes the file with isMainThread === false.
- Relative print order is non-deterministic (worker starts async).
- The worker has its own isolate/globals/event loop — no shared memory by default.
Quick self-check
What does this program output?
Follow-ups they push on
- Why is the order of 'main' vs 'worker' not guaranteed?
- How would the worker send a result back to the main thread?
Red flag Assuming only one line prints, or that the worker shares the main thread's variables/globals.
source: Node.js docs — worker_threads isMainThread ↗
Commonly asked senior concept occasional Why does cluster scale an I/O-bound server but not a single CPU-bound computation, and how does it share a port?
Cluster forks N worker processes (typically one per core), each a full Node instance with its own event loop. They all share one listening socket: the primary process creates the listener and hands incoming connections to workers (by default the OS/round-robin distributes them). So N independent event loops handle requests in parallel — that's why it scales an I/O-bound server across cores: more loops = more concurrent request handling and CPU utilization.
It does nothing for a single CPU-bound computation, because that one task runs on one worker's single thread; the other workers can't help compute it — they're separate processes handling *other* requests. Cluster scales throughput (requests/sec across many requests), not the latency of one heavy computation. For that, you need worker threads (split the work) or an algorithmic fix.
What a strong answer covers
- Cluster = N processes, each its own event loop, sharing one listening socket.
- Primary distributes connections (round-robin by default) → parallel request handling across cores.
- Scales I/O-bound throughput, not the latency of a single computation.
- One CPU-bound task still runs on one thread; use worker threads to split it.
Follow-ups they push on
- How does the primary process distribute incoming connections to workers?
- When would worker threads beat cluster for the same workload?
Red flag Expecting cluster to speed up one heavy computation — it multiplies request handlers, not the single task.
source: Node.js docs — Cluster ↗
Commonly asked senior concept occasional A worker thread is meant to share a big array with the main thread to avoid copying. How do you actually share memory, and what's the catch?
Ordinary postMessage(data) copies via structured clone (or *transfers* an ArrayBuffer, leaving the sender's copy detached). To truly share memory you use a SharedArrayBuffer (often viewed through a typed array): both threads see the same bytes, no copy.
The catch: shared mutable memory reintroduces data races. Two threads writing the same slot need coordination — use the Atomics API (Atomics.add, Atomics.wait/notify) for safe reads/writes and signaling. You can only share raw binary buffers this way, not arbitrary JS objects.
So: copy is the safe default; SharedArrayBuffer + Atomics is the zero-copy path you reach for only when the data is large and the synchronization is worth it.
Follow-ups they push on
- What's the difference between transferring an ArrayBuffer and sharing a SharedArrayBuffer?
- Why do you need Atomics rather than just writing to the shared buffer directly?
Red flag Assuming postMessage shares memory — by default it copies (or transfers), and SharedArrayBuffer still needs Atomics for safety.
source: Node.js docs — Worker threads ↗
Commonly asked senior debug occasional Cluster forks one worker per core, but in-memory session state and a request counter behave oddly across requests. Why, and how do you fix it?
Each cluster worker is a separate process with its own memory — they share the listening socket, not application state. A request lands on whichever worker the OS hands it to, so an in-process counter or in-memory session is only correct on the worker that happened to handle the *previous* request. Across workers you see stale/jumping values.
The fix is to externalize shared state: put sessions and counters in Redis (or a DB), so all workers read/write one source of truth. As a stopgap you can enable sticky sessions (route a client to the same worker), but that just pins the problem rather than solving shared state — and it breaks if that worker restarts.
General rule: cluster (and any horizontally-scaled service) must be stateless; keep state in a shared store.
Follow-ups they push on
- Why don't cluster workers share a single counter variable?
- What do sticky sessions buy you, and why aren't they a real substitute for external state?
Red flag Keeping sessions/counters in process memory under cluster and expecting consistency across workers.
source: Node.js docs — Cluster ↗
Commonly asked senior concept occasional spawn vs exec vs execFile vs fork in child_process — what distinguishes them, and which can blow up on large output?
All run a child process, differently:
- spawn(cmd, args) — launches a process and streams its stdout/stderr. No output-size limit; use it for long-running processes or large output (e.g. piping ffmpeg).
- exec(cmdString) — runs the command in a shell and buffers all output, handing it to a callback. Convenient, but the buffer is capped (maxBuffer, default 1 MB) — exceed it and the child is killed with an error. Shell parsing also opens command-injection risk if you interpolate untrusted input.
- execFile(file, args) — like exec (buffers output) but runs the binary directly, no shell — safer against injection, no shell features.
- fork(modulePath) — a specialized spawn for a new Node.js process running a JS file, with a built-in IPC channel (child.send/process.on("message")).
The trap: exec/execFile buffer, so big output OOMs or trips maxBuffer; stream with spawn instead.
What a strong answer covers
- spawn streams output (no size cap) — best for large/long output.
- exec runs in a shell and buffers (default 1 MB maxBuffer) → kills child on overflow; injection risk.
- execFile buffers too but skips the shell — safer, no shell features.
- fork spawns a child Node process with an IPC message channel.
Follow-ups they push on
- Why is execFile safer than exec against command injection?
- What error do you get when exec output exceeds maxBuffer?
Red flag Using exec for a command with large output — it buffers and either OOMs or hits maxBuffer; stream with spawn.
source: Node.js docs — Child process ↗

4.6 V8, memory & frameworks 14

Commonly asked junior concept common What is Express middleware? Walk through what next() does and how the chain executes.
Express middleware is a function (req, res, next) that sits in a chain between the incoming request and the route handler. Each request flows through the registered middleware in order; a middleware can read/modify req/res, end the response, or call next() to pass control to the next one.
- Call next() → continue to the next middleware/handler.
- Call next(err) → skip ahead to the error-handling middleware (the special 4-arg form (err, req, res, next)).
- Call neither and don't send a response → the request hangs (a common bug).
Uses: logging, body parsing, auth, CORS, and a final centralized error handler. Order matters — auth must run before the protected handler.
Follow-ups they push on
- Why must the error-handling middleware have four arguments?
- What happens if a middleware neither sends a response nor calls next()?
Red flag Forgetting to call next() (request hangs) or registering middleware in the wrong order (auth after the handler).
source: Express docs — Using middleware ↗
Commonly asked mid concept common What are the most common causes of memory leaks in a long-running Node service?
A leak in a GC'd runtime is memory the GC can't reclaim because something still references it. Usual suspects:
- Unbounded caches / maps — a module-level Map you only ever add to; it grows forever. Use an LRU with a size cap or TTLs.
- Forgotten event listeners / timers — adding a listener (or setInterval) per request/connection and never removing it; the closures pin everything they captured (hence the max-listeners warning).
- Growing module-level (global) state — pushing onto an array that is never trimmed.
- Closures capturing big objects — a long-lived callback that closes over a large buffer keeps it alive.
Diagnose by watching RSS/heap trend upward over hours, then take two heap snapshots and diff what grew.
Follow-ups they push on
- How do you confirm a leak vs normal heap growth? (Two snapshots, diff retained objects.)
- Why is an unbounded cache the textbook leak?
Red flag Believing garbage collection makes leaks impossible — reachable-but-unused references defeat the GC.
source: Node.js docs — Memory diagnostics ↗
Commonly asked mid debug common In Express, why doesn't a thrown error inside an async route handler reach your error-handling middleware (in Express 4)?
In Express 4, the router only catches errors thrown synchronously. An async handler returns a promise; if it rejects (or you await something that throws), the rejection happens on a later tick after the handler already returned — Express never sees it, so your (err, req, res, next) middleware isn't invoked and the request hangs (and you get an unhandledRejection).
Fixes:
- Forward errors explicitly: try { ... } catch (e) { next(e); }.
- Wrap handlers in an async helper that catches and calls next (or use express-async-errors).
Express 5 fixes this: it automatically forwards a rejected promise from a handler to the error middleware, so a plain throw/rejection in an async handler is caught. Know which major version you're on — this is a very common production gotcha.
What a strong answer covers
- Express 4's router only catches synchronous throws; a rejected async handler escapes it.
- Result: error middleware isn't called, the request hangs, and you get unhandledRejection.
- Fix in v4: try/catch + next(err), an async wrapper, or express-async-errors.
- Express 5 auto-forwards rejected promises to the error handler.
Follow-ups they push on
- How does an async-handler wrapper forward rejections to next()?
- What changed in Express 5 around async error handling?
Red flag Assuming Express 4 catches async/await errors automatically — it doesn't; the request hangs.
source: Express docs — Error handling (async) ↗
Commonly asked mid concept occasional What does dependency injection in NestJS actually solve as a codebase grows, compared to manually constructing services?
Without DI you wire dependencies by hand: each class news the things it needs, which hardcodes concrete implementations and threads constructor arguments through the whole tree. As the app grows this becomes brittle — changing a service's dependencies means editing every call site, and substituting a fake for tests is painful.
NestJS DI inverts that: you declare a class @Injectable() and ask for its dependencies in the constructor; an IoC container constructs and caches them (singletons by default) and injects them where declared. Benefits:
- Decoupling — depend on an abstraction/token, swap the concrete provider in one place.
- Testability — override a provider with a mock in the test module; no monkey-patching.
- Lifecycle/scoping — the container manages singletons (and request-scoped instances) consistently.
The payoff is at scale: wiring lives in module metadata, not scattered new calls, so large teams can reason about and replace pieces independently.
What a strong answer covers
- Manual wiring hardcodes concretes and threads constructor args through the tree.
- Nest's IoC container constructs, caches (singleton by default), and injects dependencies.
- Decouples via tokens/abstractions — swap a provider in one place.
- Makes testing easy (override providers with mocks) and centralizes lifecycle/scoping.
Follow-ups they push on
- How would you inject a mock repository in a Nest unit test?
- What's the difference between a singleton and a request-scoped provider?
Red flag Dismissing DI as ceremony — its payoff (decoupling, testability) shows up as the dependency graph grows.
source: NestJS docs — Providers / Dependency injection ↗
Commonly asked mid concept common How do you debug and profile a Node process — say it's leaking memory or pinning the CPU in production?
Start with the built-in inspector: run with --inspect (or --inspect-brk) and connect Chrome DevTools (chrome://inspect) or VS Code.
- CPU pinned: take a CPU profile (DevTools Profiler, or --prof / --cpu-prof) and read the flame graph for the hot function. Also watch event-loop lag — high lag means something is blocking the loop.
- Memory leak: take two heap snapshots minutes apart under load and use the Comparison view to see which object types keep growing and what retains them. process.memoryUsage() (RSS/heapUsed) shows the trend.
In production, prefer low-overhead options: --cpu-prof/--heap-prof to dump profiles to disk, or APM tools. The first move is almost always: snapshot/profile, then diff.
Follow-ups they push on
- How do you find what retains a leaked object in a heap snapshot? (Retainers path.)
- What is event-loop lag and how would you measure it?
Red flag Guessing at the hot path or leak instead of taking a profile / two heap snapshots and diffing.
source: Node.js docs — Debugging with --inspect ↗
Commonly asked mid concept occasional Express vs Fastify vs NestJS — at a concept level, what differentiates them?
- Express — the minimal, unopinionated classic: a thin router + middleware model. Huge ecosystem, you assemble structure yourself. Great default; less guidance on large-app architecture.
- Fastify — Express-like but built for performance and developer ergonomics: a faster router, schema-based validation/serialization (JSON Schema) that also speeds up responses, and a first-class plugin/encapsulation system. Pick when throughput and built-in validation matter.
- NestJS — an opinionated framework (Angular-inspired) layered on top of Express *or* Fastify: TypeScript-first, modules/controllers/providers, dependency injection, decorators. Pick for large, structured teams/codebases that want enforced architecture out of the box.
Trade-off axis: Express (minimal, flexible) → Fastify (fast, validated) → Nest (structured, batteries-included).
Follow-ups they push on
- What does Fastify's schema-based serialization buy you over plain JSON.stringify?
- What problem does NestJS's dependency injection solve as a codebase grows?
Red flag Calling them interchangeable — they sit at very different points on the minimal-vs-opinionated spectrum.
source: Fastify docs — Benchmarks & overview ↗
Commonly asked mid concept common How does single-threaded Node serve high concurrency, and where does that model fall down?
Node wins at I/O-bound concurrency because the one JS thread never *waits* on I/O — it dispatches the request to the OS/libuv and serves other requests while the bytes are in flight. Thousands of mostly-idle connections (each waiting on a DB or network) cost little: no thread-per-connection overhead, just registered callbacks. That is the sweet spot: APIs, proxies, real-time/websocket servers.
Where it falls down: CPU-bound work. One synchronous heavy computation (image processing, big JSON crunch, sync crypto) blocks the single thread and stalls *every* connection. The fixes are the concurrency tools — worker threads for in-process CPU work, cluster to use all cores for throughput, or offloading to a separate service/queue.
Summary: brilliant for I/O concurrency, weak for CPU parallelism — so keep CPU work off the event-loop thread.
Follow-ups they push on
- Why is thread-per-connection (classic blocking servers) less memory-efficient for many idle connections?
- Which workloads should you NOT put on a plain single-process Node server?
Red flag Claiming Node is fast for everything — it shines for I/O concurrency, not CPU parallelism.
source: Node.js docs — Don't block the event loop ↗
Commonly asked mid concept common Name the four classic Node 'gotchas' that bite teams in production, and how each manifests.
The recurring four:
1. Blocking the event loop — synchronous CPU work (or *Sync fs calls) on the request path freezes the whole server; symptom is rising latency/timeouts across all requests at once. Offload to a worker or chunk with setImmediate.
2. Unhandled stream errors — a stream emits 'error' with no listener and crashes the process. Handle 'error' on every stream / use pipeline.
3. Floating promises — an un-awaited async call whose rejection is lost (or now crashes via unhandledRejection); symptom is silent failures or sudden exits. Always await/return/.catch.
4. Unhandled rejections / uncaught exceptions — treated as last-resort: log and exit, don't swallow and keep serving a corrupted process.
These map directly onto the earlier chapters — they are the failure modes of the event loop, streams, and async model.
Follow-ups they push on
- Which of these would a linter (no-floating-promises) catch automatically?
- Why is 'log and continue' the wrong response to an uncaughtException?
Red flag Treating these as edge cases — they are the single most common ways production Node services fall over.
source: Node.js docs — Don't block the event loop ↗
Commonly asked senior concept occasional Why does JIT compilation make microbenchmarks misleading, and what does V8 do with 'hot' functions?
V8 runs JS through a tiered pipeline: an interpreter (Ignition) runs bytecode immediately, and an optimizing compiler (TurboFan, with a mid-tier Maglev) recompiles 'hot' functions — ones called often — into fast machine code, using runtime type feedback to specialize them.
This makes naive microbenchmarks misleading two ways: (1) the first runs are slow (cold, interpreted) before optimization kicks in, so timing a few iterations measures warmup, not steady state; (2) if a function later sees an unexpected type, V8 deoptimizes it back to slower code — a benchmark with uniform inputs won't reveal the real-world deopt cost. Also dead-code elimination can delete a benchmark whose result is unused.
Takeaways: warm up before measuring, run many iterations, use a real benchmarking harness, and keep functions monomorphic (consistent argument shapes) so V8 can keep them optimized.
What a strong answer covers
- V8 tiers: Ignition (interpret) → Maglev/TurboFan (optimize hot functions) using type feedback.
- Cold runs are slow; timing few iterations measures warmup, not steady state.
- Type changes trigger deoptimization; uniform-input benchmarks hide that cost.
- Warm up, run many iterations, keep functions monomorphic; beware dead-code elimination.
Follow-ups they push on
- What is a 'deopt' and what kinds of code commonly trigger it?
- Why does keeping object shapes consistent (monomorphic) help V8?
Red flag Trusting a few-iteration microbenchmark — you're measuring cold interpreted code, not optimized steady state.
source: V8 blog — Firing up the Ignition interpreter / TurboFan ↗
Commonly asked senior concept occasional What does Fastify's schema-based serialization buy you over returning a plain object that gets JSON.stringify'd?
When you attach a response JSON Schema to a Fastify route, Fastify compiles a specialized serializer (via fast-json-stringify) tailored to that exact shape. Instead of the generic JSON.stringify reflecting over the object at runtime, it runs straight-line code that knows the fields and types ahead of time — measurably faster serialization, the main reason for the speedup on JSON-heavy endpoints.
Two more wins: the schema acts as an output contract — fields not in the schema are stripped, which prevents accidentally leaking internal/sensitive properties — and combined with request schemas you get validation at the boundary. So: faster responses, an explicit contract, and a safety filter against over-exposure.
Trade-off: you must keep the schema in sync with the response, and a field you forget to declare silently disappears from the output.
What a strong answer covers
- Compiles a shape-specific serializer (fast-json-stringify) — faster than generic JSON.stringify.
- Strips fields not in the schema → prevents leaking internal/sensitive properties.
- Pairs with request schemas for boundary validation and an explicit contract.
- Trade-off: undeclared fields silently vanish; the schema must stay in sync.
Follow-ups they push on
- How does schema-based serialization prevent accidentally leaking a password field?
- What's the risk of forgetting to add a field to the response schema?
Red flag Forgetting that fields absent from the response schema are silently dropped from the output.
source: Fastify docs — Validation and Serialization ↗
Commonly asked senior concept occasional WeakMap and WeakRef exist partly to avoid memory leaks. How does a WeakMap-keyed cache differ from a Map-keyed one?
A Map holds strong references to its keys. If you use objects as keys in a long-lived Map cache and never delete them, those keys (and their values) can never be garbage-collected — the Map itself keeps them alive. That's the textbook unbounded-cache leak.
A WeakMap holds its keys weakly: an entry does not prevent its key object from being collected. Once nothing else references the key, the GC can reclaim the key and its associated value, and the entry vanishes automatically. So a WeakMap keyed by an object (e.g. caching per-request or per-element metadata) cleans itself up when the key dies — no manual eviction.
Caveats: WeakMap keys must be objects, it's not enumerable (no .size, no iteration — because collection timing is non-deterministic), and it's a tool for associating data with object lifetimes, not a general size-bounded cache (use an LRU for that). WeakRef/FinalizationRegistry are the lower-level primitives for individual weak references.
What a strong answer covers
- Map keys are strong references → object keys live as long as the Map (leak risk).
- WeakMap keys are weak → key + value are GC'd once nothing else references the key.
- WeakMap keys must be objects; it's not iterable and has no .size.
- Great for per-object metadata tied to lifetime; use an LRU for size-bounded caches.
Quick self-check
Why can a WeakMap-keyed cache avoid a leak that a Map-keyed one causes?
Follow-ups they push on
- Why can't a WeakMap be iterated or report its size?
- When is a WeakMap the wrong choice and an LRU cache the right one?
Red flag Using a plain Map with object keys as a long-lived cache and never evicting — it pins keys/values forever.
source: MDN — WeakMap ↗
Commonly asked senior concept occasional Give a rough picture of how V8 manages memory and garbage collection. What's the generational heap?
V8 (the JS engine in Node and Chrome) compiles JS to machine code and manages a generational heap on the generational hypothesis: most objects die young.
- Young generation (new space) — small; new allocations go here. Collected often by a fast Scavenge (copying) collector. Cheap because it touches little memory.
- Old generation (old space) — objects that survive a couple of scavenges are *promoted* here. Collected less often by Mark-Sweep-Compact (mark reachable objects, sweep the rest, compact to fight fragmentation).
Much of this runs concurrently/incrementally to keep pauses short. The heap has a default cap (historically ~1.5–2GB for old space) tunable via --max-old-space-size. The practical takeaway: short-lived allocations are nearly free; long-lived retained objects are what cost you.
Follow-ups they push on
- Why is collecting the young generation so much cheaper than the old generation?
- What does --max-old-space-size change, and when do you raise it?
Red flag Describing GC as one big stop-the-world sweep — modern V8 is generational and largely incremental/concurrent.
source: Node.js docs — Memory diagnostics ↗
Commonly asked senior concept occasional What's the difference between RSS, heapTotal, and heapUsed in process.memoryUsage(), and which one reveals a leak?
process.memoryUsage() returns several numbers:
- rss (Resident Set Size) — total physical RAM the process holds: V8 heap + native allocations + Buffers (off-heap) + code/stack. The OS-level footprint.
- heapTotal — memory V8 has reserved for its JS object heap.
- heapUsed — the portion of that heap actually in use by live JS objects.
- (external / arrayBuffers) — memory used by C++ objects and ArrayBuffers/Buffers bound to V8, outside the JS heap.
For a leak, watch the trend over time, not a single reading. A steadily-climbing heapUsed that never drops after GC points to a JS-object leak (caches, listeners). A climbing rss with flat heapUsed points to off-heap/native growth (Buffers, native addons). So heapUsed for JS leaks, rss/external for off-heap ones.
What a strong answer covers
- rss = total physical RAM (heap + native + Buffers + code) — the OS footprint.
- heapTotal = V8 heap reserved; heapUsed = live JS objects within it.
- Climbing heapUsed that survives GC → JS-object leak (caches, listeners).
- Climbing rss/external with flat heapUsed → off-heap/native (Buffer) growth.
Follow-ups they push on
- Why might rss grow while heapUsed stays flat? (Off-heap Buffers / native memory.)
- Why look at the trend across snapshots rather than one reading?
Red flag Diagnosing all leaks via heapUsed — off-heap Buffer/native growth shows up in rss/external, not the JS heap.
source: Node.js docs — process.memoryUsage() ↗
Commonly asked senior concept occasional What does --max-old-space-size control, and why does raising it sometimes hide a leak rather than fix it?
--max-old-space-size=<MB> raises the cap on V8's old-generation heap (where long-lived objects live). When the old space approaches this limit, V8 runs aggressive GC; if memory still can't be reclaimed, the process dies with FATAL ERROR: ... JavaScript heap out of memory. The default is well under modern machine RAM (historically ~2 GB on 64-bit), so legitimately large workloads sometimes need it raised.
The trap: bumping it to make OOM crashes "go away" when the real problem is a leak. If memory grows without bound, a bigger cap just postpones the crash — it grows to the new limit and dies again, now with bigger GC pauses along the way. Raise it when working set is genuinely large and bounded; for unbounded growth, profile and fix the leak (heap snapshots, retainer paths) instead.
What a strong answer covers
- Sets V8's old-generation heap cap; hitting it → 'JavaScript heap out of memory' crash.
- Default is below machine RAM, so large legitimate workloads may need it raised.
- For a real leak, a higher cap just delays the crash (and worsens GC pauses).
- Raise for genuinely-large bounded working sets; profile/fix for unbounded growth.
Follow-ups they push on
- How do you tell a real leak from a legitimately large working set?
- What's the downside of a very large old-space heap on GC pause times?
Red flag Cranking --max-old-space-size to stop OOM crashes that are actually a leak — it postpones, not fixes.
source: Node.js docs — --max-old-space-size ↗

05 Frontend 67 Q's

5.1 How the browser works 16

★ must-know Commonly asked mid concept common How does the browser build the DOM and the CSSOM, and how do they combine into the render tree?
The browser tokenizes the HTML bytes into nodes and assembles them into the DOM tree — a complete model of the markup. In parallel it parses CSS (inline, <style>, and external) into the CSSOM, a tree of style rules with the cascade resolved.
The render tree combines the two: it walks the DOM and attaches computed styles, but includes only the nodes that will be painted. Nodes with display:none are excluded entirely; <head> and <script> are not visual so they are absent too. visibility:hidden nodes stay in the tree (they occupy space).
The render tree then feeds layout, which computes each node's geometry.
What a strong answer covers
- DOM = full parsed markup; CSSOM = parsed style rules with the cascade applied.
- The render tree = DOM nodes that will be displayed, each annotated with computed styles.
- display:none nodes are excluded from the render tree; visibility:hidden nodes are kept (they still take space).
- The CSSOM cannot be built incrementally the way the DOM can — CSS is treated as render-blocking until fully parsed.
Quick self-check
Which node is present in the DOM but NOT in the render tree?
Follow-ups they push on
- Why is the render tree not a 1:1 copy of the DOM?
- Why does an element with display:none not appear in the render tree but visibility:hidden does?
Red flag Saying the render tree is just the DOM, or that display:none and visibility:hidden are treated the same here. display:none drops the node entirely; visibility:hidden keeps it (with its box).
source: web.dev — Constructing the Object Model (CRP) ↗
Commonly asked junior concept common What is the difference between the DOMContentLoaded and load events?
DOMContentLoaded fires when the HTML is fully parsed and the DOM is built — deferred scripts have run, but it does not wait for stylesheets, images, or subframes.
load fires later, when the page and all dependent resources (images, stylesheets, iframes) have finished loading.
Most app initialization that only needs the DOM should run on DOMContentLoaded (or just use defer); reserve load for logic that needs final layout or image dimensions.
Follow-ups they push on
- Does DOMContentLoaded wait for async scripts?
- When would you actually need the load event?
Red flag Thinking DOMContentLoaded waits for images, or putting all init in load and delaying interactivity unnecessarily.
source: MDN — Document: DOMContentLoaded event ↗
Meta mid concept very common Walk me through what happens from typing a URL to seeing the page render.
DNS resolves the host, TCP+TLS connect, the browser requests the HTML and parses it into the DOM; CSS is parsed into the CSSOM; DOM + CSSOM combine into the render tree. Then layout (reflow) computes geometry, paint fills pixels, and composite assembles layers on the GPU.
Note that CSS is render-blocking and <script> is parser-blocking unless marked async or defer. This whole sequence is the critical rendering path.
Follow-ups they push on
- Why can transform/opacity animations skip layout and paint?
- Where does the JS engine block the parser, and how do async/defer change that?
Red flag Forgetting the CSSOM, or conflating reflow (layout) with repaint (paint). Saying the DOM alone produces pixels.
source: web.dev — Critical rendering path ↗
Commonly asked mid concept occasional How does a browser repaint at 60fps, and what is the ~16ms frame budget? Where does requestAnimationFrame fit?
At a 60Hz refresh rate the browser aims to produce a new frame every ~16.7ms (1000/60). Within that budget it must run any JS, recalculate style, lay out, paint, and composite — so a long-running task that overruns 16ms causes a dropped frame (jank).
requestAnimationFrame(cb) schedules cb to run right before the next paint, so visual updates align with the frame instead of firing at arbitrary times (as setTimeout would). It is the correct place to do animation work and DOM writes that should be visible next frame.
Real budget is less than 16ms because the browser itself needs some of it; aim to keep main-thread work well under that.
What a strong answer covers
- 60fps means a frame roughly every 16.7ms (1000ms / 60).
- All per-frame work (JS, style, layout, paint, composite) must fit the budget or a frame drops.
- requestAnimationFrame runs callbacks just before the next repaint, syncing visual updates to the frame.
- Prefer rAF over setTimeout for animation; setTimeout isn't aligned to the refresh cycle.
Follow-ups they push on
- Why is requestAnimationFrame better than setTimeout for animations?
- What happens to rAF callbacks in a background (hidden) tab?
Red flag Using setTimeout for smooth animation (not frame-aligned), or assuming you have the full 16ms — the browser's own work eats into it.
source: MDN — Window: requestAnimationFrame() ↗
Commonly asked mid concept common How do async, defer, and type="module" scripts differ in download and execution timing?
A plain <script> blocks the parser: download and run happen inline, halting DOM construction.
async: downloads in parallel; runs as soon as it arrives, possibly interrupting parsing, in no guaranteed order.
defer: downloads in parallel; runs after the document is parsed, just before DOMContentLoaded, in document order.
type="module" scripts are deferred by default (no attribute needed) and execute in order; adding async to a module makes it run as soon as it and its imports are ready. Modules are also always strict mode and have their own scope.
Quick rule: app/UI code → defer (or a module); independent third-party (analytics) → async.
What a strong answer covers
- Plain script: parser-blocking download + execute.
- async: parallel download, run on arrival, unordered.
- defer: parallel download, run after parse in order (before DOMContentLoaded).
- type="module": deferred by default, ordered, strict mode, scoped.
Quick self-check
By default (no async/defer attribute), when does a <script type="module"> execute?
Follow-ups they push on
- Why is a module script deferred even without the defer attribute?
- What ordering guarantees do you lose with async?
Red flag Adding `defer` to a module thinking it's required (it's already deferred), or assuming async preserves execution order.
source: MDN — <script> type=module / async / defer ↗
Commonly asked mid concept common What is the difference between reflow and repaint?
Reflow (layout) recomputes element geometry — sizes and positions. It is expensive because changing one element can cascade to its ancestors, descendants, and siblings. Triggers: width/height, margin/padding, font-size, adding/removing DOM nodes, reading offsetHeight.
Repaint redraws pixels without changing geometry — e.g. color, background-color, visibility. Cheaper than reflow.
Composite-only changes (transform, opacity) can skip both layout and paint and run on the GPU's compositor thread, which is why they animate smoothly.
Follow-ups they push on
- Why does reading offsetWidth in a loop after writing styles cause layout thrashing?
- How would you batch DOM reads and writes to avoid forced synchronous layout?
Red flag Claiming color changes cause reflow, or that all CSS animations are cheap. Animating `top`/`left`/`width` triggers reflow every frame; `transform` does not.
source: web.dev — Critical rendering path ↗
Commonly asked mid concept common Why is CSS render-blocking, and why is a plain <script> parser-blocking?
CSS is render-blocking because the browser will not paint until it has the CSSOM — rendering with incomplete styles would cause a flash of unstyled content. So it blocks the first render, though not DOM construction.
A plain <script> is parser-blocking: when the parser hits it, it stops building the DOM, fetches (if external) and executes the script, then resumes. Scripts can read and mutate the DOM, so the browser cannot safely keep parsing past them. This is why scripts are traditionally placed at the end of <body>.
Follow-ups they push on
- What do async and defer change about this?
- What is a render-blocking resource vs a parser-blocking one?
Red flag Saying CSS blocks DOM construction (it blocks render, not the DOM), or that all scripts block the parser regardless of attributes.
source: web.dev — Critical rendering path ↗
Commonly asked mid concept common What is the difference between async and defer on a script tag?
Both download the script in parallel without blocking the parser; they differ in when execution happens and whether order is preserved.
defer: execute after the HTML is fully parsed, just before DOMContentLoaded, and in document order. Good for scripts that depend on the DOM or on each other.
async: execute as soon as the download finishes, which can interrupt parsing, and in no guaranteed order. Good for independent scripts like analytics.
A plain script (no attribute) blocks the parser while it downloads and runs.
Follow-ups they push on
- Which would you use for a third-party analytics snippet, and which for an app bundle?
- Do async/defer affect inline scripts?
Red flag Swapping the two, or claiming async preserves order. async is order-independent; defer preserves order. (async/defer are ignored on inline scripts.)
source: MDN — <script>: async and defer ↗
Commonly asked senior concept occasional Why can the browser parse HTML and discover sub-resources before the document is fully loaded? What is the preload scanner?
Modern browsers run a secondary preload scanner (also called a lookahead pre-parser) that races ahead of the main HTML parser. While the main parser may be blocked executing a synchronous <script>, the preload scanner scans the raw markup for resources — <img>, <link>, <script src> — and starts fetching them early.
This is why a render-blocking script does not also stall *network* discovery of later assets. It is also why CSS injected by JavaScript (rather than declared in markup) can hurt performance: the preload scanner cannot see it, so the fetch starts late.
Takeaway: keep critical resources in the initial HTML as plain <link>/<img> so the scanner can find them.
What a strong answer covers
- The preload scanner pre-parses raw HTML to discover and fetch sub-resources ahead of the main parser.
- It keeps the network busy even when the main parser is blocked on a synchronous script.
- It only sees resources declared in the markup — JS-injected assets are invisible to it.
- Declaring critical assets as plain tags (or <link rel=preload>) lets discovery start as early as possible.
Follow-ups they push on
- Why might lazy-loading or injecting your LCP image via JS hurt LCP?
- How does <link rel=preload> interact with the preload scanner?
Red flag Assuming a blocking script also blocks all network discovery — the preload scanner keeps fetching declared resources. Hiding critical assets behind JS injection defeats it.
source: web.dev — How the browser's preload scanner speeds up page loads ↗
Commonly asked senior concept occasional What is the difference between a render-blocking resource and a parser-blocking resource?
Render-blocking resources prevent the browser from painting the first frame until they are processed — chiefly CSS (and synchronous CSS in <head>). The DOM may keep being built, but nothing is shown until the CSSOM is ready.
Parser-blocking resources halt DOM construction itself. A synchronous <script> is the classic case: the parser stops, fetches and runs the script, then resumes — because the script could document.write or mutate the not-yet-built DOM.
They overlap (a blocking script is effectively both, since stopping the parser also delays render), but the mental model differs: CSS blocks *painting*, scripts block *parsing*.
What a strong answer covers
- Render-blocking (CSS): DOM keeps building, but first paint waits for the CSSOM.
- Parser-blocking (sync <script>): DOM construction itself pauses until the script runs.
- async/defer make scripts non-parser-blocking; media queries / print can make a stylesheet non-render-blocking.
- A synchronous in-<head> script behind a stylesheet is doubly bad: it waits for the CSS, then blocks the parser.
Follow-ups they push on
- Why might a synchronous script wait for a preceding stylesheet to load?
- How do you make a stylesheet non-render-blocking with the media attribute?
Red flag Conflating the two: CSS blocks render (not DOM construction); a plain script blocks parsing (and therefore render too).
source: web.dev — Render blocking resources ↗
Commonly asked senior concept occasional What is the compositor thread, and how is the browser's main thread different from it?
The main thread runs JavaScript, parses HTML/CSS, computes style, layout, and paint. If it is busy (a long task), the page cannot respond to input or update the DOM — this is what hurts INP.
The compositor thread runs separately and assembles already-painted layers into the final frame, handling scrolling and transform/opacity animations on the GPU. Because it does not need the main thread, scrolling and compositor-driven animations stay smooth even while JS is busy — until they need a property that forces layout/paint, which bounces work back to the main thread.
This split is why transform/opacity animate at 60fps and why heavy JS tanks responsiveness but not necessarily scroll.
What a strong answer covers
- Main thread: JS execution, style, layout, paint — a single thread that blocks the whole page when busy.
- Compositor thread: stitches painted layers, handles scroll and transform/opacity off the main thread (often GPU-accelerated).
- Compositor-only changes (transform, opacity) skip layout and paint, so they animate even during main-thread work.
- Long main-thread tasks block input handling and DOM updates, degrading responsiveness (INP).
Follow-ups they push on
- Why does animating `top`/`left` re-involve the main thread every frame?
- How does breaking up long tasks improve responsiveness?
Red flag Believing all animations run off the main thread, or that the compositor can recompute layout. It only composites already-painted layers.
source: web.dev — Inside look at modern web browser (the compositor) ↗
Commonly asked senior debug occasional What does this code do to rendering performance, and how would you fix it? for (const el of items) { el.style.width = el.offsetWidth + 10 + 'px'; }
Each iteration writes a style (el.style.width = ...) and then the next read of offsetWidth forces the browser to flush layout so the read is accurate — a forced synchronous layout on every pass. With N items you get N reflows: classic layout thrashing.
Fix: split into a read phase then a write phase so layout is computed at most once.
const widths = items.map((el) => el.offsetWidth);
items.forEach((el, i) => { el.style.width = widths[i] + 10 + 'px'; });
Now all reads happen against one stable layout, and all writes are batched before the next reflow.
What a strong answer covers
- Reading offsetWidth after a style write forces a synchronous layout so the value is fresh.
- Interleaving read/write per iteration = one reflow per item = layout thrashing.
- Fix: batch all reads first, then all writes (read/write separation).
- requestAnimationFrame can schedule the write phase to align with the next frame.
Quick self-check
Why is the original loop slow?
Follow-ups they push on
- Which properties besides offsetWidth force a synchronous layout when read?
- How would FastDOM or requestAnimationFrame help here?
Red flag Thinking the cost is the loop itself rather than the read-after-write pattern that forces a reflow each iteration.
source: web.dev — Avoid large, complex layouts and layout thrashing ↗
Commonly asked senior concept occasional What is a layer (compositor layer), and what is the tradeoff of promoting elements with will-change?
The browser can split the page into compositor layers — separate bitmaps the GPU can transform and blend independently. Promoting an element to its own layer lets the compositor move it (via transform) without repainting, which is what makes such animations cheap.
will-change: transform (or opacity) hints the browser to promote an element ahead of time so the first frame is not janky. The tradeoff: each layer costs GPU memory, and too many layers add management overhead that can make things slower, not faster.
Rule of thumb: apply will-change just before an animation and remove it after; never blanket it onto many elements.
What a strong answer covers
- A compositor layer is an independently rasterized surface the GPU can move/blend without repaint.
- will-change proactively promotes an element so animations start smoothly.
- Each layer consumes GPU memory; over-promotion causes overhead and can regress performance.
- Apply will-change narrowly and temporarily, not as a global optimization.
Follow-ups they push on
- How can you inspect layers in DevTools (the Layers panel)?
- Why is `will-change: transform` on every element a bad idea?
Red flag Treating `will-change` as a free speed-up and applying it everywhere — it inflates memory and can hurt performance.
source: MDN — will-change ↗
Commonly asked senior concept common Why do animating transform and opacity perform better than animating top/left or width/height?
top/left/width/height change geometry, so every animation frame triggers layout (reflow), then paint, then composite — on the main thread.
transform and opacity can be handled by the compositor: the element is promoted to its own layer and the GPU moves/blends it without re-running layout or paint. The work happens off the main thread, so it stays smooth even if JS is busy.
Practical rule: animate transform and opacity; use will-change sparingly to hint layer promotion.
Follow-ups they push on
- What is the downside of promoting too many layers with will-change?
- What is the compositor thread and how is it separate from the main thread?
Red flag Overusing `will-change` on everything (memory blow-up, no benefit), or believing all CSS animations bypass the main thread.
source: web.dev — Animations and performance ↗
Commonly asked senior concept occasional What is layout thrashing, and how do you avoid forced synchronous layout?
Layout thrashing is repeatedly interleaving DOM writes and layout-forcing reads in a loop, so the browser must recompute layout synchronously over and over.
Reading a property like offsetHeight, getBoundingClientRect(), or scrollTop after a style write forces the browser to flush pending layout immediately so the read is accurate — a forced synchronous layout.
Fix: batch all reads first, then all writes. Libraries like FastDOM do this; requestAnimationFrame can schedule the write phase.
Follow-ups they push on
- Which DOM properties force a synchronous layout when read?
- How does requestAnimationFrame help schedule reads vs writes?
Red flag Reading offsetWidth and then writing style in the same loop iteration, forcing a reflow each pass.
source: web.dev — Avoid large, complex layouts and layout thrashing ↗
Commonly asked senior concept common What is the critical rendering path and how would you optimize it?
The critical rendering path is the sequence of steps the browser takes to turn HTML, CSS, and JS into pixels: build the DOM, build the CSSOM, combine into the render tree, lay out, paint, composite.
Optimizing it means getting the first meaningful paint sooner by reducing critical resources:
- Inline critical CSS, defer the rest; minimize render-blocking CSS.
- Add defer/async to scripts so they do not block parsing.
- Preload key assets (<link rel="preload">), preconnect to origins.
- Minify and compress; reduce bytes and round-trips.
Follow-ups they push on
- How does inlining critical CSS help LCP?
- What is the tradeoff of inlining vs caching a separate CSS file?
Red flag Listing micro-optimizations without naming the blocking resources (CSS render-blocking, scripts parser-blocking) that actually delay first paint.
source: web.dev — Critical rendering path ↗

5.2 DOM, HTML & CSS 16

Commonly asked junior concept very common What is the difference between display:none and visibility:hidden?
display:none removes the element from the render tree entirely: it occupies no space, is not painted, and is skipped by most assistive tech. Toggling it triggers reflow.
visibility:hidden keeps the element in layout — it still occupies its box and affects siblings — but is not painted (invisible). It is not interactive.
A third option, opacity:0, is fully painted and still interactive (clickable) and laid out; it just renders transparent.
Follow-ups they push on
- Which of the three is keyboard-focusable / clickable?
- Which triggers reflow when toggled vs only repaint?
Red flag Saying visibility:hidden removes the element from layout — it still occupies space. Confusing opacity:0 (still clickable) with display:none.
source: MDN — visibility ↗
Commonly asked junior coding very common How would you center a div both horizontally and vertically? Give more than one approach.
Flexbox (most common): on the parent,
display: flex; align-items: center; justify-content: center;
Grid (terse): on the parent, display: grid; place-items: center;
Absolute + transform (no flex/grid): on the child,
position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%);
The transform trick offsets by the element's own size (the 50% in translate is relative to the element), so it centers regardless of dimensions. Flexbox/grid are preferred for in-flow content; absolute centering suits overlays where the child is taken out of flow.
What a strong answer covers
- Flexbox: align-items: center; justify-content: center; on the container.
- Grid: place-items: center; — the shortest form.
- Absolute + translate(-50%, -50%) centers without knowing the element's size.
- Prefer flex/grid for in-flow content; absolute centering for overlays/modals.
Follow-ups they push on
- Why does translate(-50%, -50%) work without knowing the element's dimensions?
- What changes if the parent has a fixed height vs auto height?
Red flag Using `margin: auto` for vertical centering on a block (works horizontally, not vertically without flex), or forgetting the parent needs a height for flex centering to be visible.
source: MDN — Box alignment (centering) ↗
Commonly asked junior concept very common Explain the CSS box model and the box-sizing property.
Every element is a box with four areas, from inside out: content, padding, border, margin.
With the default box-sizing: content-box, the width you set applies to the content only; padding and border are added on top, so the rendered box is wider than width.
With box-sizing: border-box, width includes content + padding + border, so the element stays the size you set. This is why many resets apply * { box-sizing: border-box; }.
Follow-ups they push on
- Why do margins collapse vertically between block elements?
- Does margin count toward the element's width in either box-sizing mode?
Red flag Forgetting that margin is always outside the box (never part of width), or not knowing border-box folds padding/border into the declared width.
source: MDN — The box model ↗
Commonly asked junior concept common Why does semantic HTML matter? Give examples beyond <div> and <span>.
Semantic elements describe meaning, not just appearance, which benefits accessibility, SEO, and maintainability.
Elements like <header>, <nav>, <main>, <article>, <section>, <aside>, <footer> create landmarks that screen readers and the accessibility tree expose, letting users jump between regions. <button>, <a>, <label>, <input> come with built-in keyboard behavior and roles.
A <div> with a click handler has none of that — you would have to re-add role, tabindex, and key handling manually.
Follow-ups they push on
- What do you lose by using <div onClick> instead of <button>?
- How do landmark elements help screen-reader navigation?
Red flag Treating semantics as purely cosmetic, or reinventing a button from a div without role/tabindex/keyboard support.
source: MDN — HTML: A good basis for accessibility ↗
Commonly asked mid concept common What does flex: 1 actually mean? Break down flex-grow, flex-shrink, and flex-basis.
flex is shorthand for three properties:
- flex-grow — how much a item grows to fill leftover free space, relative to siblings.
- flex-shrink — how much it shrinks when there isn't enough space.
- flex-basis — the item's starting size before grow/shrink (its 'ideal' main size).
flex: 1 expands to flex: 1 1 0% — grow 1, shrink 1, basis 0%. Because basis is 0, items size purely by their grow ratio, so equal flex: 1 items become equal width regardless of content. Contrast flex: auto (1 1 auto), where content size is the starting point, so items differ by content length.
What a strong answer covers
- flex: <grow> <shrink> <basis>; flex: 1 = 1 1 0%.
- flex-grow distributes free space; flex-shrink distributes overflow; flex-basis is the pre-grow size.
- flex: 1 on siblings gives equal sizes (basis 0); flex: auto (basis auto) sizes from content first.
- flex-basis takes priority over width for the main-axis starting size.
Follow-ups they push on
- What's the difference between flex: 1 and flex: auto?
- When does flex-shrink: 0 matter (preventing an item from collapsing)?
Red flag Thinking flex: 1 sets a width directly, or confusing flex: 1 (basis 0, equal sizes) with flex: auto (basis auto, content-driven sizes).
source: MDN — flex ↗
Commonly asked mid concept occasional Why and when do vertical margins collapse between block elements?
Margin collapsing is when adjacent vertical margins combine into a single margin equal to the largest of them, rather than summing. It applies only to block-level boxes in normal flow along the block (vertical) axis — never horizontal margins.
Three cases: adjacent siblings (the bottom margin of one and top margin of the next collapse); a parent and its first/last child (if no border/padding/content separates them); and an empty block (its own top and bottom margins collapse).
It does not happen for flex/grid items, floated or absolutely-positioned elements, or when a border, padding, or overflow: auto separates the boxes. This trips people up when a child's margin unexpectedly pushes the parent.
What a strong answer covers
- Collapsing takes the max of the two margins, not the sum — vertical only.
- Happens between siblings, parent/first-or-last child, and within empty blocks.
- Prevented by a border, padding, overflow other than visible, or a BFC.
- Does not apply to flex/grid items, floats, or absolutely positioned boxes.
Follow-ups they push on
- How does establishing a block formatting context (BFC) stop collapsing?
- Why does a child's top margin sometimes push the parent down?
Red flag Expecting margins to add up, or thinking collapsing applies to flex/grid items (it doesn't) or to horizontal margins (it doesn't).
source: MDN — Mastering margin collapsing ↗
Commonly asked mid concept common What is the difference between rem, em, %, vw/vh, and px? When would you reach for each?
px is an absolute (device-independent) pixel — fixed, predictable, but ignores user font preferences.
em is relative to the current element's font-size (for most properties), so it compounds when nested. rem is relative to the root <html> font-size — no compounding, which makes it the go-to for scalable, accessible typography and spacing.
% is relative to the parent's corresponding dimension. vw/vh are 1% of the viewport's width/height, useful for full-screen sections.
Practical default: rem for type and spacing (respects user zoom/root size), %/fr/vw for fluid layout, px for hairline borders.
What a strong answer covers
- px: absolute and fixed; doesn't scale with user font settings.
- em: relative to the element's own font-size — compounds when nested.
- rem: relative to the root font-size — no compounding; best for accessible, scalable type.
- % is relative to the parent; vw/vh are 1% of viewport width/height.
Follow-ups they push on
- Why can nested em values produce surprising sizes?
- Why is rem preferred over px for font sizes from an accessibility standpoint?
Red flag Confusing em (element-relative, compounds) with rem (root-relative), or using px for font sizes and breaking user zoom/font-size preferences.
source: MDN — CSS values and units (length) ↗
Commonly asked mid concept common What is the difference between event.target and event.currentTarget on a bubbling event?
event.target is the element where the event originated — the deepest node that was actually clicked/typed-in. event.currentTarget is the element whose listener is currently running — i.e. the element you called addEventListener on.
During bubbling, target stays constant as the event travels up, while currentTarget changes at each ancestor whose listener fires. In a delegated handler on a <ul>, currentTarget is the <ul>, and target is the specific <li> (or a child of it, which is why you often use target.closest('li')).
Note: in an arrow function this won't be the element, but event.currentTarget always is.
What a strong answer covers
- target = where the event started (deepest element); constant through bubbling.
- currentTarget = the element whose listener is running now; changes per ancestor.
- In delegation, currentTarget is the parent you bound to; target is the actual descendant.
- currentTarget equals this in a normal function handler, but not in an arrow function.
Quick self-check
A click listener is on a <ul>. The user clicks a <span> inside an <li>. Inside the handler, what are target and currentTarget?
Follow-ups they push on
- Why use target.closest('li') instead of target directly in a delegated handler?
- What is event.target inside a handler bound directly to the element itself?
Red flag Swapping the two: target is the origin, currentTarget is the listening element. In delegation, acting on target directly can grab a nested child instead of the row.
source: MDN — Event: currentTarget ↗
Commonly asked mid trick common What is the difference between stopPropagation and preventDefault?
They are orthogonal. preventDefault() cancels the browser's default action for the event — following a link, submitting a form, checking a checkbox — but the event still propagates to other listeners.
stopPropagation() stops the event from traveling further through the capture/bubble phases to other elements, but does not cancel the default action. (stopImmediatePropagation() additionally prevents other listeners on the *same* element.)
So: prevent the default behavior with preventDefault; stop the event reaching parents with stopPropagation. Returning false from a jQuery handler does both, but in plain DOM return false only works in inline on* attributes.
What a strong answer covers
- preventDefault() cancels the default browser action; propagation still happens.
- stopPropagation() halts travel to other elements; default action still happens.
- stopImmediatePropagation() also blocks other listeners on the same element.
- They're independent — you sometimes call both, sometimes one.
Quick self-check
A form submit handler calls only event.stopPropagation(). What happens?
Follow-ups they push on
- When would you call both on the same event?
- Why is calling stopPropagation broadly considered risky for delegation?
Red flag Believing stopPropagation also cancels the default action (it doesn't), or that preventDefault stops bubbling (it doesn't).
source: MDN — Event: preventDefault() ↗
Commonly asked mid concept common How does CSS specificity work, and what wins between an ID selector and 10 classes?
Specificity is scored as a tuple (inline, IDs, classes/attributes/pseudo-classes, elements). Higher tuples win; ties are broken by source order (last wins).
An ID is (0,1,0,0). Ten classes is (0,0,10,0). The ID still wins because the ID column outranks the class column regardless of count — it is not a base-10 sum where 10 classes overflow into the ID column.
!important overrides normal specificity; inline styles outrank selectors. Use these sparingly.
Follow-ups they push on
- Where do !important and inline styles sit in the cascade?
- How do :where() and :is() affect specificity?
Red flag Treating specificity as a single base-10 number so '10 classes beat 1 ID' — columns do not carry over.
source: MDN — Specificity ↗
Commonly asked mid concept common When would you use Flexbox versus CSS Grid?
Flexbox is for one-dimensional layout — a row or a column — where you distribute space along a single axis (nav bars, toolbars, centering, equal-height items in a row).
Grid is for two-dimensional layout — rows and columns together — where you place items into a defined grid (page layouts, card galleries, dashboards).
They compose: a grid cell can itself be a flex container. Reach for Grid when you care about both axes at once; Flexbox when content drives a single axis.
Follow-ups they push on
- How do you center an element both horizontally and vertically with each?
- What does flex: 1 actually mean (flex-grow/shrink/basis)?
Red flag Calling Grid 'just for grids of images' or Flexbox 'two-dimensional'. The key distinction is 1D vs 2D.
source: MDN — Relationship of grid layout to other layout methods ↗
Commonly asked mid concept common Explain CSS position values: static, relative, absolute, fixed, and sticky.
static — default; in normal flow, top/left ignored.
relative — stays in flow but offset from its normal spot; becomes a positioning context for absolute children.
absolute — removed from flow; positioned relative to the nearest positioned ancestor (else the initial containing block).
fixed — removed from flow; positioned relative to the viewport, so it stays put on scroll.
sticky — a hybrid: behaves like relative until it crosses a scroll threshold, then sticks like fixed within its container.
Follow-ups they push on
- What makes an ancestor a 'positioned' ancestor for absolute children?
- Why might position:sticky silently not work (overflow on an ancestor)?
Red flag Saying absolute is relative to the viewport (that is fixed), or that sticky is independent of its containing block.
source: MDN — position ↗
Commonly asked mid concept occasional What is the difference between the HTML attribute and the DOM property (e.g. input value)?
The attribute is what is written in the HTML source; the property is the live value on the DOM object. They are linked at parse time but can diverge.
For an <input value="hi">: getAttribute("value") returns the original "hi" (the default), while inputEl.value reflects what the user has currently typed. Editing the box changes the property, not the attribute.
Some attributes are reflected (id, className), others are not symmetric (value, checked). This is why React tracks value as state.
Follow-ups they push on
- Why does setAttribute('value', ...) not update what the user sees after they have typed?
- How does this relate to controlled vs uncontrolled inputs in React?
Red flag Assuming attribute and property always stay in sync. For value/checked the attribute is just the initial default.
source: MDN — Attributes ↗
Amazon mid coding common How would you efficiently insert 1,000 DOM nodes without causing 1,000 reflows?
Build the nodes off the live DOM and insert once, so layout is recomputed a single time.
Use a DocumentFragment:
const frag = document.createDocumentFragment();
for (const item of items) { const li = document.createElement("li"); li.textContent = item; frag.appendChild(li); }
list.appendChild(frag);
Appending to the fragment does not touch the rendered tree; the single appendChild(frag) inserts all children in one operation. Avoid innerHTML += in a loop (re-parses everything each time) and avoid appending one-by-one to the live list.
Follow-ups they push on
- Why is innerHTML += in a loop both slow and unsafe?
- When would you use virtualization instead of inserting all 1,000 nodes?
Red flag Appending each node directly to the live DOM in the loop, or using innerHTML += which re-parses the whole list every iteration.
source: MDN — DocumentFragment ↗
Commonly asked senior concept occasional What is a block formatting context (BFC), and name two ways to create one. Why is it useful?
A block formatting context is a self-contained region of layout where block boxes lay out and floats are managed independently of the outside. Inside a BFC, vertical margins don't collapse with elements outside it, and the BFC contains its floated children.
Ways to create one: overflow other than visible (e.g. overflow: hidden/auto), display: flow-root (the purpose-built, side-effect-free option), being a flex/grid item, display: inline-block, or floating/absolute positioning.
Classic uses: clearing floats (a floated child no longer overflows its parent's height), and stopping margin collapse between a parent and child. display: flow-root is the modern, intention-revealing way to do both.
What a strong answer covers
- A BFC is an isolated layout region; floats and margins inside don't leak out.
- Create with display: flow-root, overflow ≠ visible, flex/grid item, float, or absolute.
- Contains floats (no parent collapse) and blocks external margin collapsing.
- display: flow-root is the clean, side-effect-free way to establish one.
Follow-ups they push on
- Why was `overflow: hidden` historically used to clear floats?
- What advantage does display: flow-root have over the overflow hack?
Red flag Using `overflow: hidden` to clear floats and accidentally clipping content or scrollbars; `display: flow-root` avoids those side effects.
source: MDN — Block formatting context ↗
Commonly asked senior concept common What is a stacking context, and why might a higher z-index element still appear behind a lower one?
z-index only orders elements within the same stacking context. A stacking context is a self-contained layer: once formed, its children are painted as a unit, and their z-index values cannot escape it.
So if element A (z-index: 9999) lives inside a parent that forms a stacking context with a low z-index, and element B (z-index: 1) is in a *sibling* context with a higher one, B paints on top — A's huge z-index is meaningless across contexts.
New stacking contexts are created by more than position + z-index: opacity < 1, transform, filter, will-change, isolation: isolate, and being a flex/grid child with a z-index, among others. This is the usual cause of 'my z-index isn't working'.
What a strong answer covers
- z-index is only comparable within one stacking context, never across them.
- Once an ancestor forms a context, a child's z-index is trapped inside it.
- Contexts are created by position+z-index but also opacity < 1, transform, filter, will-change, isolation: isolate.
- A z-index:9999 child of a low context loses to a z-index:1 element in a higher-ranked sibling context.
Quick self-check
Which of these does NOT, by itself, create a new stacking context?
Follow-ups they push on
- Name three properties besides position that create a stacking context.
- How does `isolation: isolate` help contain z-index without side effects?
Red flag Assuming z-index is globally comparable. A larger z-index loses if its element sits in a lower-ranked ancestor stacking context.
source: MDN — Stacking context ↗

5.3 JavaScript that matters for the frontend 18

Commonly asked junior trick common What is the difference between == and ===, and name a coercion gotcha.
=== is strict equality: no type coercion — different types are never equal. == is loose equality: it coerces operands to a common type first, which produces surprising results.
Gotchas: 0 == "" is true, 0 == "0" is true, but "" == "0" is false (not transitive). null == undefined is true, yet null == 0 is false. NaN === NaN is false.
Rule: default to ===; the one common, intentional == is x == null to catch both null and undefined.
Follow-ups they push on
- Why is NaN not equal to itself, and how do you test for it?
- What does the abstract equality algorithm do for object vs primitive comparisons?
Red flag Claiming == is just === plus 'minor type stuff', then getting tripped by the non-transitive empty-string/zero cases.
source: MDN — Equality comparisons and sameness ↗
Commonly asked junior trick common What is the difference between null and undefined, and what does typeof return for each?
undefined means a variable has been declared but not assigned, a missing function argument, a missing object property, or a function with no return. The engine produces it.
null is an intentional 'no value' that *you* assign to signal emptiness.
The famous quirk: typeof undefined is "undefined", but typeof null is "object" — a long-standing bug kept for backward compatibility. They are loosely equal (null == undefined is true) but not strictly equal (null === undefined is false).
Use x == null to test for both at once, or ?? (nullish coalescing) which treats only null/undefined as missing.
What a strong answer covers
- undefined: engine-produced 'not assigned / missing'. null: developer-assigned 'intentionally empty'.
- typeof undefined === 'undefined'; typeof null === 'object' (a historical bug).
- null == undefined is true; null === undefined is false.
- ?? treats only null/undefined as missing, unlike || which also catches 0/''/false.
Quick self-check
What does typeof null evaluate to?
Follow-ups they push on
- Why does ?? differ from || for falsy values like 0 and ''?
- How do you reliably check that a value is null or undefined but not 0/''?
Red flag Expecting typeof null to be 'null' (it's 'object'), or using || where ?? is needed and accidentally treating 0/'' as missing.
source: MDN — null ↗
Meta mid debug very common What does this print, and why? for (var i = 0; i < 3; i++) { setTimeout(() => console.log(i), 1); }
It prints 3, 3, 3.
var is function-scoped, so all three callbacks close over the same i. The setTimeout callbacks run after the synchronous loop finishes, by which point i has been incremented to 3.
Fixes: use let (block-scoped — each iteration gets a fresh binding, printing 0 1 2); or capture per-iteration with an IIFE (j => setTimeout(() => console.log(j), 1))(i).
Follow-ups they push on
- Change var to let — what prints now and why?
- How does the IIFE version create a separate closure per iteration?
Red flag Answering 0 1 2 for the var version. The classic mistake is forgetting var is shared and the timers fire after the loop.
source: lydiahallie/javascript-questions (Q2) ↗
Commonly asked mid debug common What does this print, and why? let count = 0; const fns = []; for (let i = 0; i < 3; i++) { fns.push(() => i); } console.log(fns.map((f) => f()));
It logs [0, 1, 2].
With let, the loop creates a fresh binding of i for each iteration, so each arrow closes over a different i holding that iteration's value. (count is a red herring — it's never touched.)
If this used var instead, all three closures would share one function-scoped i, and after the loop finished i would be 3, so it would log [3, 3, 3]. This is the canonical demonstration of why let fixed the classic loop-closure bug.
What a strong answer covers
- let gives each iteration its own binding of the loop variable.
- Each closure captures its iteration's i, so the result is [0, 1, 2].
- With var (function-scoped, one shared binding) it would be [3, 3, 3].
- Closures capture variables (bindings), not snapshot values.
Quick self-check
What is logged?
Follow-ups they push on
- Rewrite this with var to get [3, 3, 3], then explain the fix.
- How does this relate to the setTimeout-in-a-loop classic?
Red flag Answering [3, 3, 3] for the `let` version — that's the `var` behavior. let creates a new binding per iteration.
source: MDN — Closures (creating closures in loops) ↗
Commonly asked mid debug common What does this print, and why? const obj = { name: 'obj', greet() { setTimeout(function () { console.log(this.name); }, 0); }, }; obj.greet();
It logs undefined (in a browser, this is the global object, where name is ''; in strict mode/modules this is undefined and it would throw).
The inner function passed to setTimeout is a plain function called by the timer, not as a method of obj. Its this is therefore not obj — implicit binding only happens for obj.method() call syntax. The timer invokes it as a bare function.
Fixes: use an arrow function in the timeout (inherits greet's this), capture const self = this, or .bind(this). This is the single most common this-loss bug in callbacks.
What a strong answer covers
- this is set by the call site; the timer calls the callback as a plain function.
- Plain-function this is the global object (sloppy mode) or undefined (strict/module).
- An arrow function in setTimeout inherits the enclosing method's this (= obj).
- Alternatives: const self = this capture, or .bind(this).
Quick self-check
What logs (assume a non-strict browser global where name is '')?
Follow-ups they push on
- Rewrite greet so it logs 'obj'.
- Why does an arrow function fix this but a regular function doesn't?
Red flag Assuming the callback inherits `obj` as `this` because it's defined inside a method. Only the call site sets a normal function's `this`.
source: MDN — this (callbacks) ↗
Commonly asked mid concept common What is the difference between call, apply, and bind?
All three set a function's this explicitly; they differ in when it runs and how arguments are passed.
call(thisArg, a, b) — invokes immediately, arguments passed individually.
apply(thisArg, [a, b]) — invokes immediately, arguments passed as an array. (Mnemonic: Apply = Array.)
bind(thisArg, a) — does not invoke; returns a new function with this (and any leading args) permanently fixed. You call that later. A bound function cannot be re-bound, and new on it ignores the bound this.
With spread, call(...args) covers most apply cases today.
What a strong answer covers
- call: invoke now, args listed individually.
- apply: invoke now, args as an array (Apply = Array).
- bind: returns a new permanently-bound function; doesn't invoke.
- A bound function's this can't be overridden by a later call/bind.
Follow-ups they push on
- Can you re-bind a function that's already bound?
- How does spread syntax make apply less necessary?
Red flag Mixing up apply (array) and call (list), or thinking bind invokes the function immediately — it returns a new one.
source: MDN — Function.prototype.bind() ↗
Commonly asked mid coding common Implement a throttle function, and explain how it differs from debounce.
Throttle guarantees fn runs at most once per wait window, no matter how often it's called — good for scroll/resize/mousemove. Debounce waits until calls *stop* for wait ms, then fires once — good for search-as-you-type.
function throttle(fn, wait) {
let last = 0;
return function (...args) {
const now = Date.now();
if (now - last >= wait) {
last = now;
fn.apply(this, args);
}
};
}
This is a leading-edge throttle: it fires immediately, then ignores calls until the window elapses. The timestamp lives in a closure, and fn.apply(this, args) forwards context and arguments.
What a strong answer covers
- Throttle: at most one call per time window (steady cadence under continuous events).
- Debounce: fires only after calls go quiet for wait ms.
- Throttle suits scroll/resize; debounce suits typeahead/validation.
- The closure holds the last-run timestamp; forward this/args via apply.
Follow-ups they push on
- Add a trailing-edge call so the final event isn't dropped.
- When would you choose throttle over debounce for a scroll handler?
Red flag Implementing debounce and calling it throttle (resetting a timer on each call is debounce). Also dropping the trailing call so the last event is lost.
source: GreatFrontend — Throttle ↗
Commonly asked mid concept common What is the difference between a shallow copy and a deep copy, and how do you make each?
A shallow copy duplicates only the top level; nested objects/arrays are still shared references. So mutating a nested value affects both copies. Make one with {...obj}, Object.assign({}, obj), or arr.slice().
A deep copy recursively clones every level, so the copy is fully independent. Modern way: structuredClone(obj) (handles Dates, Maps, Sets, cyclic refs). The old hack JSON.parse(JSON.stringify(obj)) works only for plain JSON-safe data — it drops functions, undefined, and Symbols, and turns Date into a string.
Key point: spread is shallow, so a nested array inside a spread copy is still linked to the original.
What a strong answer covers
- Shallow copy shares nested references; spread/Object.assign/slice are shallow.
- Deep copy clones every level into an independent structure.
- structuredClone() is the modern deep-copy API (handles Dates/Maps/Sets/cycles).
- JSON.parse(JSON.stringify(x)) loses functions, undefined, Symbols, and Dates.
Follow-ups they push on
- Why does the spread operator not deep-copy nested arrays?
- What types does JSON.stringify silently drop or mangle?
Red flag Believing spread or Object.assign deep-copies — nested objects stay shared. Reaching for JSON round-trip on data containing Dates/functions/undefined.
source: MDN — Shallow copy / Deep copy (structuredClone) ↗
Commonly asked mid concept common What is the difference between function declarations and function expressions with respect to hoisting?
A function declaration (function foo() {}) is hoisted whole — both its name and body — so you can call it on a line *above* where it's written.
A function expression (const foo = function () {} or an arrow) follows variable hoisting rules. With const/let, the binding is hoisted but in the temporal dead zone, so calling it early throws ReferenceError. With var, the variable hoists as undefined, so calling it early throws TypeError: foo is not a function (it's undefined, not callable yet).
So declarations are usable before their line; expressions are not, and the error you get depends on var vs let/const.
What a strong answer covers
- Function declarations are fully hoisted (callable before their definition).
- Function expressions follow the variable's hoisting: TDZ for let/const, undefined for var.
- Calling a var-assigned expression early → TypeError (not a function).
- Calling a let/const expression early → ReferenceError (TDZ).
Quick self-check
What happens? foo(); var foo = function () { return 1; };
Follow-ups they push on
- What error do you get calling a var function expression before assignment, and why?
- Are named function expressions hoisted by their name? (No — only inside their own scope.)
Red flag Assuming all functions are hoisted. Only declarations are; expressions hoist per their variable's rules (TDZ or undefined).
source: MDN — Hoisting ↗
Commonly asked mid debug very common What does this print? const shape = { radius: 10, diameter() { return this.radius * 2; }, perimeter: () => 2 * Math.PI * this.radius }; console.log(shape.diameter()); console.log(shape.perimeter());
It prints 20 and then NaN.
diameter is a regular method: called as shape.diameter(), this is shape, so this.radius is 10 → 20.
perimeter is an arrow function: arrows do not get their own this; they use the lexically enclosing this (here the module/global scope), where radius is undefined. 2 * Math.PI * undefined → NaN.
Follow-ups they push on
- Rewrite perimeter so it works.
- Why are arrow functions a bad choice for object methods but a good choice for callbacks?
Red flag Assuming the arrow's `this` is the object. Arrows ignore the call site and bind `this` lexically.
source: lydiahallie/javascript-questions (Q3) ↗
Commonly asked mid debug very common What does this print? function sayHi() { console.log(name); console.log(age); var name = "Lydia"; let age = 21; } sayHi();
It logs undefined, then throws a ReferenceError.
var name is hoisted and initialized to undefined, so the first log reads undefined.
let age is hoisted too but not initialized — it sits in the temporal dead zone until its declaration runs. Accessing it before that line throws ReferenceError: Cannot access 'age' before initialization, so the second log never completes.
Follow-ups they push on
- What exactly is the temporal dead zone?
- How does hoisting differ for function declarations vs function expressions?
Red flag Saying both are undefined, or that let is 'not hoisted at all'. It is hoisted but uninitialized (TDZ).
source: lydiahallie/javascript-questions (Q1) ↗
Amazon mid concept very common What is a closure? Give a practical use case.
A closure is a function bundled with references to the variables from the scope where it was defined. The inner function keeps those variables alive even after the outer function returns.
Use cases: private state (a counter factory where the count is inaccessible from outside), partial application / currying, memoization caches, and stateful callbacks like the timer ID inside a debounce.
Example:
function makeCounter() { let n = 0; return () => ++n; }
const c = makeCounter(); c(); // 1 — n is private and persists.
Follow-ups they push on
- How do closures cause memory leaks if you are not careful?
- How does debounce use a closure to remember the timer ID?
Red flag Defining a closure only as 'a function inside a function' without mentioning that it captures and persists the enclosing variables.
source: MDN — Closures ↗
Meta mid concept very common What is event delegation, and why attach one listener to a parent instead of many to children?
Event delegation exploits bubbling: instead of binding a listener to every child, you bind one to a common ancestor and inspect event.target to find which child triggered it.
Benefits: fewer listeners (lower memory), and it automatically handles dynamically added children without rebinding.
Example:
list.addEventListener("click", (e) => { const li = e.target.closest("li"); if (li) handle(li.dataset.id); });
Use event.target for the actual origin and event.currentTarget for the element the listener is on.
Follow-ups they push on
- What is the difference between event.target and event.currentTarget?
- Which events do not bubble, and how do you delegate those (capture phase / focusin)?
Red flag Confusing target with currentTarget, or assuming every event bubbles (focus/blur do not; focusin/focusout do).
source: MDN — Event bubbling and delegation ↗
AmazonGoogle mid coding very common Implement a debounce function.
Debounce delays calling fn until wait ms have passed since the last call; every new call resets the timer. The timer id lives in a closure.
function debounce(fn, wait) {
let t;
return function (...args) {
clearTimeout(t);
t = setTimeout(() => fn.apply(this, args), wait);
};
}
Using a normal function (not an arrow) for the wrapper preserves the caller's this, and fn.apply(this, args) forwards both. Common in search-as-you-type and resize handlers.
Follow-ups they push on
- How does throttle differ from debounce?
- Add a leading-edge (immediate) option.
- Why must the wrapper forward `this` and `args`?
Red flag Hoisting the timer outside the returned function incorrectly (shared across instances), or dropping `this`/`args` so the debounced fn loses context.
source: GreatFrontend — Debounce ↗
Commonly asked senior concept common What is the difference between a microtask and a macrotask, and which queue drains first?
After each macrotask (and after the current synchronous run-to-completion finishes), the event loop drains the entire microtask queue before taking the next macrotask or rendering.
Microtasks: Promise .then/.catch/.finally callbacks, queueMicrotask, MutationObserver. They run as soon as the stack is empty, ahead of any timer.
Macrotasks (tasks): setTimeout, setInterval, I/O, message events, UI events. One per loop turn.
Consequence: a resolved Promise always runs before a setTimeout(0). And an unbounded chain of microtasks can starve rendering and timers, because the loop won't move on until the microtask queue is empty.
What a strong answer covers
- Order each turn: run a macrotask → drain all microtasks → (maybe render) → next macrotask.
- Microtasks: Promise callbacks, queueMicrotask, MutationObserver.
- Macrotasks: setTimeout/setInterval, I/O, UI/message events.
- Resolved Promise beats setTimeout(0); runaway microtasks can starve render/timers.
Quick self-check
Which of these schedules a MICROTASK?
Follow-ups they push on
- Why can microtasks starve the UI but a queue of setTimeouts can't as easily?
- Where does requestAnimationFrame sit relative to micro/macrotasks?
Red flag Thinking setTimeout(0) runs before a resolved Promise. Microtasks always drain fully before the next macrotask.
source: MDN — In depth: Microtasks and the JavaScript runtime environment ↗
Meta senior concept common How is `this` determined at call time? Walk through the binding rules.
For a normal function, this depends on how it is called, checked in priority order:
1. new Fn() — this is the freshly created object.
2. fn.call/apply/bind(obj) — this is the explicit obj.
3. obj.fn() — this is the receiver obj (implicit binding).
4. Plain fn() — this is undefined in strict mode, else the global object.
Arrow functions ignore all of the above: they capture this lexically from where they were defined. That is why arrows are handy in callbacks but wrong as object methods.
Follow-ups they push on
- Why does passing obj.method as a callback lose `this`?
- What does bind return, and can you re-bind a bound function?
Red flag Saying `this` is fixed by where a function is defined (true only for arrows). For normal functions it is the call site.
source: MDN — this ↗
Meta senior debug very common Explain the event loop, the call stack, and the difference between microtasks and macrotasks. What prints? console.log(1); setTimeout(() => console.log(2), 0); Promise.resolve().then(() => console.log(3)); console.log(4);
It prints 1, 4, 3, 2.
Synchronous code runs first on the call stack: 1, then 4.
When the stack is empty, the event loop drains the entire microtask queue before any macrotask. Promise.then is a microtask → 3. setTimeout is a macrotask → 2, runs last.
So: sync (1, 4) → all microtasks (3) → next macrotask (2).
Follow-ups they push on
- Where do queueMicrotask, MutationObserver, and requestAnimationFrame fit?
- Why can a runaway chain of microtasks starve rendering and timers?
Red flag Predicting `1 4 2 3`. The trap is thinking setTimeout(0) beats a resolved Promise — microtasks always drain first.
source: MDN — In depth: Microtasks and the JavaScript runtime environment ↗
Commonly asked senior concept common How does prototypal inheritance work? What is the difference between __proto__ and prototype?
Every object has an internal link ([[Prototype]], exposed as __proto__) to another object. Property lookups walk this prototype chain until found or it hits null.
prototype is a property on constructor functions: when you do new Fn(), the new object's __proto__ is set to Fn.prototype. So instances delegate to Fn.prototype for shared methods.
Mnemonic: prototype lives on the constructor; __proto__ (better: Object.getPrototypeOf) lives on instances and points at the constructor's prototype.
Follow-ups they push on
- How do ES6 classes map onto prototypes under the hood?
- Why put methods on the prototype instead of in the constructor?
Red flag Mixing up `prototype` (on the constructor) and `__proto__` (on the instance), or thinking class syntax is not prototype-based — it is sugar.
source: MDN — Inheritance and the prototype chain ↗

5.4 Browser networking & app architecture 17

Meta mid concept very common What is the same-origin policy, and what problem does CORS solve?
The same-origin policy stops a page on origin A (scheme + host + port) from reading responses from origin B by default — it limits how a document loaded from one origin can interact with a resource from another, which protects user credentials.
CORS is the server's controlled opt-in: the server returns headers like Access-Control-Allow-Origin telling the browser it is allowed to expose the response to that origin. CORS does not turn off security — it lets a server selectively relax the same-origin policy for trusted callers.
Follow-ups they push on
- What exactly counts as 'same origin'?
- Is CORS enforced by the browser or the server?
Red flag Saying CORS is a thing the client enables to bypass security. It is a server opt-in; the browser enforces it.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗
Commonly asked mid concept common What is tree-shaking, and what does your code need to do for it to work?
Tree-shaking is dead-code elimination at the module level: the bundler keeps only the exports you actually import and drops the rest, shrinking the bundle.
It relies on ES modules' static structure — import/export are statically analyzable, so the bundler can trace which exports are used. CommonJS (require) is dynamic and resists shaking.
For it to work well: use ESM, import named members (not the whole namespace), avoid modules with side effects at import time, and mark packages "sideEffects": false in package.json so the bundler can safely prune. A stray top-level side effect can force a whole module to be kept.
What a strong answer covers
- Tree-shaking removes unused exports to reduce bundle size.
- Requires static ESM import/export; CommonJS require is too dynamic.
- Side-effectful modules can't be safely dropped; "sideEffects": false signals safety.
- Import named members, not import * as everything.
Follow-ups they push on
- Why can CommonJS modules not be tree-shaken reliably?
- What does the package.json "sideEffects" field do?
Red flag Assuming any unused import is automatically dropped. Side effects, CommonJS, or namespace imports can defeat tree-shaking.
source: MDN — Tree shaking ↗
Commonly asked mid concept common What is code-splitting and lazy loading, and how do they improve load performance?
Code-splitting breaks one large bundle into smaller chunks that can be loaded on demand instead of all upfront. Lazy loading is fetching a chunk only when it's actually needed — typically via the dynamic import() expression, which returns a Promise and tells the bundler to emit a separate chunk.
The payoff is a smaller initial bundle: less JS to download, parse, and execute before the page is interactive, which improves load time and INP. Common split points: per-route (load a route's code on navigation) and per-component (a heavy modal/chart loaded on first interaction).
React pairs React.lazy(() => import('./X')) with <Suspense> for a fallback while the chunk loads.
What a strong answer covers
- Code-splitting = multiple chunks; lazy loading = fetch a chunk on demand.
- Dynamic import() returns a Promise and creates a separate bundle chunk.
- Shrinks the initial bundle → faster parse/execute → better TTI/INP.
- Split by route and by heavy on-interaction components.
Follow-ups they push on
- How does React.lazy + Suspense work together?
- What's the risk of splitting too aggressively (many tiny chunks / waterfalls)?
Red flag Lazy-loading everything (request waterfalls, layout shift on load), or splitting code that's needed for first paint and delaying it.
source: MDN — JavaScript modules (dynamic import) ↗
Commonly asked mid concept common What causes Cumulative Layout Shift (CLS), and how do you prevent it?
CLS measures unexpected movement of visible content during loading — content jumping as late-arriving elements push things around. Good is ≤ 0.1 at the 75th percentile.
Common causes: images/videos/ads without reserved space; web fonts swapping in and reflowing text (FOUT); content injected above existing content; and animating layout properties.
Fixes: always set width/height (or aspect-ratio) on media so the browser reserves the box; reserve space for ads/embeds; use font-display: optional/swap plus size-matched fallbacks to minimize font reflow; and never insert content above what the user is viewing unless in response to an interaction.
What a strong answer covers
- CLS = sum of unexpected layout shifts; target ≤ 0.1 (p75).
- Top cause: media without dimensions — set width/height or aspect-ratio.
- Reserve space for ads/embeds and avoid injecting content above the fold.
- Tame font swap (FOUT) with font-display and metric-matched fallbacks.
Follow-ups they push on
- Why does specifying width and height on an <img> prevent shift even before it loads?
- How can web fonts cause layout shift, and how do you reduce it?
Red flag Omitting image dimensions (relying on CSS alone) so the browser can't reserve space, or inserting banners above current content after load.
source: web.dev — Cumulative Layout Shift (CLS) ↗
Commonly asked mid concept occasional How does HTTP caching work for assets? Explain Cache-Control, ETags, and cache busting.
Cache-Control is the primary header. max-age=N lets the browser use a cached copy without revalidating for N seconds; no-cache means 'cache it but revalidate before use'; no-store means never cache; immutable promises the file won't change.
ETags enable conditional revalidation: the server sends a content hash; the browser later sends If-None-Match, and the server returns a tiny 304 Not Modified if unchanged — saving the payload but not the round-trip.
Cache busting combines both worlds: give bundled assets a content hash in the filename (app.a1b2c3.js) and serve them with max-age=31536000, immutable. When content changes, the filename changes, so you cache forever yet always serve fresh files. Keep the HTML entry point short-lived.
What a strong answer covers
- Cache-Control: max-age skips revalidation; no-cache revalidates; no-store never caches.
- ETag + If-None-Match → 304 Not Modified avoids re-downloading unchanged bytes.
- Cache busting: content-hashed filenames served immutable long-lived.
- Hash the assets, keep the HTML short-lived so new asset URLs are discovered.
Follow-ups they push on
- Why is no-cache not the same as no-store?
- Why hash filenames instead of just lowering max-age?
Red flag Thinking `no-cache` means 'don't cache' (it means revalidate), or setting long max-age on un-hashed filenames so users get stale files.
source: MDN — HTTP caching ↗
Meta mid concept common What is the virtual DOM, and is it actually faster than direct DOM manipulation?
The virtual DOM is an in-memory tree of lightweight JS objects describing the UI. On a state change, the framework builds a new tree, diffs it against the previous one (reconciliation), and applies the minimal set of real DOM mutations.
It is not magically faster than hand-optimized direct DOM writes — diffing has its own cost. Its value is a declarative model: you describe the target UI and let the framework batch updates and avoid redundant reflows, which is faster than naive re-rendering and far easier to reason about than manual surgery.
Follow-ups they push on
- Why do React lists need stable keys during reconciliation?
- How do fine-grained reactive frameworks (Solid/Svelte) avoid a VDOM entirely?
Red flag Asserting the virtual DOM is always faster than direct manipulation. The real win is the declarative programming model plus batched updates.
source: React docs — Preserving and Resetting State (reconciliation) ↗
Commonly asked mid concept common What problem do bundlers and transpilers solve? Distinguish Webpack/Vite from Babel.
A bundler (Webpack, Vite, esbuild, Rollup) builds a dependency graph from your modules and produces a few optimized files — handling code-splitting, tree-shaking, asset imports, and minification. It solves 'too many modules and too many requests' and lets the browser load less.
A transpiler (Babel, the TS compiler, SWC) converts source into a form browsers/runtimes accept: modern JS → older JS, JSX → createElement calls, TypeScript → JS.
They complement each other: a bundler usually runs a transpiler step. Vite additionally serves native ES modules in dev for instant startup.
Follow-ups they push on
- What is tree-shaking and what does it require to work (ESM, side-effect-free)?
- Why is Vite's dev server fast compared to a classic Webpack dev build?
Red flag Treating bundler and transpiler as synonyms. Babel transforms syntax; Webpack/Vite assemble and optimize the module graph.
source: Vite — Why Vite ↗
Google mid concept very common What are the Core Web Vitals, and what does each measure?
Three field metrics for real-user experience, judged at the 75th percentile:
- LCP (Largest Contentful Paint) — loading; time for the largest content element to render. Good ≤ 2.5s.
- INP (Interaction to Next Paint) — responsiveness; the latency of interactions. Good ≤ 200ms. INP became a stable Core Web Vital in 2024, replacing FID.
- CLS (Cumulative Layout Shift) — visual stability; how much content unexpectedly shifts. Good ≤ 0.1.
Typical fixes: optimize the LCP image / preload it; cut long tasks for INP; reserve space (width/height, aspect-ratio) for CLS.
Follow-ups they push on
- Why did INP replace FID?
- What causes layout shift and how do you prevent it (dimensions, font swap)?
Red flag Citing FID as a current Core Web Vital (it was replaced by INP in 2024), or mixing up which metric covers loading vs responsiveness vs stability.
source: web.dev — Web Vitals ↗
Commonly asked mid concept common What are the essentials of web accessibility (a11y) a frontend engineer must get right?
Start with semantic HTML — native <button>, <a>, <label>, headings, and landmark elements give you roles, focus, and keyboard behavior for free.
Key practices: meaningful alt text on images (empty alt="" for decorative ones); every form control associated with a <label>; full keyboard navigation with a visible focus indicator and logical tab order; sufficient color contrast; and ARIA only to fill gaps semantics cannot cover (custom widgets) — never to paper over a non-semantic <div>.
Test with keyboard-only, a screen reader, and automated tools (axe/Lighthouse).
Follow-ups they push on
- What does 'the first rule of ARIA' (don't use ARIA if a native element exists) mean?
- How do you make a custom dropdown keyboard-accessible?
Red flag Reaching for ARIA first instead of semantic HTML, removing focus outlines without a replacement, or treating alt text as optional.
source: MDN — What is accessibility? ↗
Commonly asked mid concept common Compare cookies, localStorage, and sessionStorage for storing data in the browser.
Cookies (~4KB) are sent to the server with every matching request. Best for auth/session tokens, ideally HttpOnly (JS cannot read them, mitigating XSS theft), Secure, and SameSite.
localStorage (~5–10MB) is JS-only, persists until cleared, and is not sent to the server. Good for non-sensitive client state. Vulnerable to XSS, so never store tokens that must stay secret.
sessionStorage is like localStorage but scoped to a single tab and cleared when it closes.
Follow-ups they push on
- Why store auth tokens in HttpOnly cookies rather than localStorage?
- What does SameSite do for CSRF protection?
Red flag Recommending localStorage for auth tokens (readable by any XSS), or thinking localStorage is sent to the server like cookies.
source: MDN — Web Storage API ↗
Commonly asked mid trick occasional What is the difference between fetch and XMLHttpRequest, and does fetch reject on a 404?
fetch is the modern Promise-based API; XMLHttpRequest is the older event/callback-based one. fetch is cleaner, streams responses, and integrates with AbortController for cancellation.
Key gotcha: fetch only rejects on network failure, not on HTTP error status. A 404 or 500 still resolves — you must check response.ok (or response.status) yourself and throw if it is false.
response.json() returns a Promise, so you await it twice (the response, then the body).
Follow-ups they push on
- How do you cancel a fetch request?
- Does fetch send cookies by default cross-origin (credentials)?
Red flag Assuming a 404 lands in the catch block. It does not — fetch resolves; only network errors reject. Forgetting to check response.ok.
source: MDN — Using the Fetch API ↗
Commonly asked senior concept common What is hydration in SSR, and why can it be costly? What problems does it cause?
Hydration is the client-side step where a framework takes server-rendered HTML and attaches event listeners and reconstructs component state, making the static markup interactive. The server sends visible HTML fast (good first paint), then the browser must download the JS, re-run the components, and 'wire up' the existing DOM.
It's costly because you effectively render twice — once on the server, once on the client — and the page can look ready but not respond to clicks until hydration finishes (the 'uncanny valley' / poor INP).
Mitigations: less client JS, partial/progressive hydration, islands architecture (hydrate only interactive bits), streaming SSR, and server components that never ship to the client.
What a strong answer covers
- Hydration = attaching listeners/state to server-rendered HTML to make it interactive.
- Work is duplicated: render on server, then re-render/wire-up on client.
- Page can appear ready but be unresponsive until hydration completes (hurts INP).
- Fixes: islands, partial/progressive hydration, streaming, server components.
Follow-ups they push on
- How does an islands architecture (e.g. Astro) reduce hydration cost?
- What is the 'uncanny valley' of a hydrating page?
Red flag Thinking SSR alone makes a page interactive. The HTML is visible immediately, but interactivity waits for hydration.
source: web.dev — Rendering on the web (hydration) ↗
Commonly asked senior concept common What is XSS, what are the main types, and how do you defend against it?
Cross-site scripting (XSS) is injecting attacker-controlled script that runs in a victim's page with the site's privileges (reading cookies, DOM, making requests as the user).
Types: stored (malicious input saved server-side and served to others), reflected (script bounced off a request like a search param), and DOM-based (client JS writes untrusted data into the DOM, e.g. via innerHTML).
Defenses: contextual output encoding/escaping (treat data as data, not markup); avoid innerHTML/dangerouslySetInnerHTML with untrusted input — use textContent; sanitize rich HTML with a vetted library (DOMPurify); set a strong Content-Security-Policy; and mark session cookies HttpOnly so injected JS can't read them.
What a strong answer covers
- XSS runs attacker script in the user's session context.
- Three types: stored, reflected, DOM-based.
- Primary defense: contextual output encoding; prefer textContent over innerHTML.
- Layer with CSP, HTML sanitization (DOMPurify), and HttpOnly cookies.
Follow-ups they push on
- Why does an HttpOnly cookie limit the damage of an XSS?
- How does a Content-Security-Policy mitigate XSS?
Red flag Treating input validation as sufficient. The core fix is output encoding for the right context; CSP and sanitization are defense-in-depth, not a single switch.
source: OWASP — Cross Site Scripting (XSS) ↗
Commonly asked senior concept occasional What is CSRF, and how is it different from XSS? How do you defend against it?
CSRF (cross-site request forgery) tricks a logged-in user's browser into sending an unwanted state-changing request to your site. It exploits the fact that browsers attach cookies automatically, so a forged request from another site rides the victim's session.
Key difference from XSS: XSS is a *code injection* (attacker runs script in your page); CSRF is a *request forgery* that needs no script on your page — it abuses ambient cookie auth. XSS can defeat most CSRF defenses, so fixing XSS comes first.
Defenses: SameSite cookies (Lax/Strict) so cookies aren't sent on cross-site requests; anti-CSRF tokens (a per-session secret the attacker can't read); and checking Origin/Referer. Avoid using GET for state changes.
What a strong answer covers
- CSRF: forged state-changing request riding the victim's auto-sent cookies.
- XSS injects script; CSRF forges a request and needs no script on your page.
- Defenses: SameSite cookies, anti-CSRF tokens, Origin/Referer checks.
- Never perform state changes on GET; XSS can bypass CSRF tokens, so fix XSS too.
Follow-ups they push on
- How does SameSite=Lax block a typical CSRF attack?
- Why can an XSS vulnerability defeat anti-CSRF tokens?
Red flag Conflating CSRF with XSS, or thinking a CSRF token alone is enough when an XSS hole can simply read it.
source: OWASP — Cross Site Request Forgery (CSRF) ↗
Commonly asked senior design common How would you optimize the Largest Contentful Paint (LCP) of a page?
LCP marks when the largest in-viewport element (often a hero image or headline block) renders; good is ≤ 2.5s at p75. Optimize the four phases of its timeline:
- TTFB: fast server / CDN, cache HTML, reduce redirects.
- Resource load delay: make the LCP image discoverable early — put it in the markup (not JS-injected), fetchpriority="high", <link rel="preload">, and *don't* lazy-load it.
- Resource load time: serve a right-sized, modern-format (AVIF/WebP), compressed image over a fast connection.
- Render delay: cut render-blocking CSS/JS so the element can paint.
The single biggest lever is usually ensuring the LCP image is requested as early as possible and not deferred.
What a strong answer covers
- LCP = render time of the largest viewport element; good ≤ 2.5s (p75).
- Break it into TTFB, load delay, load time, render delay and attack each.
- Don't lazy-load the LCP image; make it discoverable early (fetchpriority, preload, in-markup).
- Serve right-sized modern-format images; reduce render-blocking resources.
Follow-ups they push on
- Why is lazy-loading the hero image an anti-pattern for LCP?
- How does fetchpriority="high" change the request ordering?
Red flag Lazy-loading or JS-injecting the hero image (the preload scanner can't find it early), or optimizing total page weight while ignoring the LCP element's own request timing.
source: web.dev — Optimize Largest Contentful Paint ↗
Commonly asked senior concept common What is a CORS preflight request, and what triggers one versus a 'simple' request?
A preflight is an automatic OPTIONS request the browser sends before the real request to ask the server whether the actual call is allowed. It carries Origin, Access-Control-Request-Method, and Access-Control-Request-Headers.
It is triggered by non-simple requests: methods other than GET/HEAD/POST, custom headers, or a Content-Type outside application/x-www-form-urlencoded, multipart/form-data, or text/plain (e.g. application/json).
A simple request skips preflight — but the server still must return Access-Control-Allow-Origin for the script to read the response.
Follow-ups they push on
- Why does sending Content-Type: application/json trigger a preflight?
- How can Access-Control-Max-Age reduce preflight overhead?
Red flag Thinking simple requests need no CORS headers — they still need Access-Control-Allow-Origin to be readable. Forgetting application/json forces a preflight.
source: MDN — Preflight request ↗
Commonly asked senior design common Compare SPA, MPA, SSR, and SSG, and the tradeoffs of each.
MPA: server sends a full HTML page per navigation. Simple, great for content sites; full reloads between pages.
SPA: one HTML shell, JS renders routes client-side. Fast in-app navigation; weak initial load and SEO, needs JS to render anything.
SSR: server renders HTML per request, then hydrates on the client. Good first paint and SEO with dynamic data; higher server cost and TTFB.
SSG: HTML built at deploy time, served from a CDN. Fastest and cheapest; only fits content that does not change per request (or use ISR to revalidate).
Follow-ups they push on
- What is hydration and why can it be costly?
- Where does ISR (incremental static regeneration) fit between SSR and SSG?
Red flag Conflating SSR with SSG (request-time vs build-time), or claiming SPAs are inherently bad for SEO without mentioning SSR/prerendering as the fix.
source: web.dev — Rendering on the web ↗

06 Senior Cross-Cutting 164 Q's

6.1 System design fundamentals 14

★ must-know Commonly asked senior concept common Explain the CAP theorem and how it actually informs a design decision.
CAP says that when a network partition happens, a distributed system can keep only two of Consistency, Availability, Partition tolerance — and since partitions are unavoidable in real networks, the real choice is C vs A during a partition.
CP (consistency over availability): on a partition, refuse or block requests rather than serve stale/conflicting data — pick this when correctness is non-negotiable (a bank balance, inventory you can oversell). AP (availability over consistency): keep serving on both sides of the partition and reconcile later (eventual consistency) — pick this when staleness is tolerable and uptime matters more (a social feed, a shopping cart, DNS).
The senior point: CAP only bites *during* a partition; the rest of the time you get both. And it is a spectrum — many stores let you tune consistency per request (e.g. quorum reads/writes), so you choose CP or AP per use case, not per company.
What a strong answer covers
- Partitions are inevitable, so the live tradeoff is Consistency vs Availability during a partition.
- CP: refuse/block on partition to avoid stale data — banking, inventory, anything correctness-critical.
- AP: stay up and reconcile later (eventual consistency) — feeds, carts, DNS.
- CAP only constrains you during a partition; normally you get C and A both.
- Often tunable per request (quorum reads/writes), so the choice is per use case, not absolute.
Quick self-check
During a network partition, an 'AP' system chooses to:
Follow-ups they push on
- Give a concrete system you'd build CP and one you'd build AP, and why.
- How do quorum reads/writes let you tune where you sit on the spectrum?
- What does 'eventual consistency' actually promise the client?
Red flag Stating you 'pick two of three' as a permanent architecture choice — partition tolerance is mandatory, so the decision is C-vs-A only when a partition occurs, and it can be tuned per request.
source: System Design Primer — CAP theorem ↗
AmazonGoogleMicrosoft senior design very common Design a URL shortener (TinyURL / bit.ly). Walk me through it.
Clarify scope first: read-heavy (~100:1 reads:writes), so optimize the redirect path. Estimate: ~100M new URLs/day -> ~1.2K writes/s, ~120K reads/s; 7-char base62 = 62^7 ~= 3.5T codes, plenty for years. Storage ~500 bytes/row * 100M/day -> ~36TB over 2 years.
Core: a key-gen service maps short->long. Two strategies: (a) base62-encode a globally unique counter (e.g. a range-allocator / Snowflake-style ID) — no collisions, but reveals volume; (b) hash the long URL (MD5/SHA) and take a prefix, then collision-check. Store mapping in a KV store / sharded SQL; reads go cache-first (Redis, LRU on hot links) then DB.
Wrap up: use 301 only if you do not need per-click analytics (browser caches it), else 302; shard by hash of short code; push click analytics to a queue (Kafka) for async aggregation.
Follow-ups they push on
- How do you guarantee globally unique codes across shards without a single counter bottleneck?
- 301 vs 302 — which do you pick and what do you lose with each?
- How would custom aliases and link expiry change the design?
Red flag Jumping straight to a schema before clarifying the read/write ratio and scale. Also picking 301 while still wanting click analytics — the cached redirect never hits your server again.
source: ByteByteGo — Design A URL Shortener ↗
Commonly asked senior concept common Why and how would you introduce a message queue between services? What does it buy you?
A queue (SQS, RabbitMQ) or a log (Kafka) decouples a producer from a consumer: the producer drops a message and moves on, the consumer processes it on its own schedule. That buys you three things — async (the user-facing request returns immediately while slow work happens in the background), buffering (a traffic spike fills the queue instead of overwhelming the consumer), and resilience (if the consumer is down, messages wait instead of being lost).
Use it for work that does not need a synchronous answer: sending email, generating thumbnails, fanning out notifications, ingesting events. You also gain independent scaling (add consumers to drain a backlog) and smoothing of bursty load.
The tradeoffs you must name: added operational complexity, eventual rather than immediate results, and the need to handle idempotency because most queues guarantee at-least-once delivery (the same message can arrive twice).
What a strong answer covers
- Decouples producer/consumer: async work, buffering of spikes, resilience when a consumer is down.
- Use it for fire-and-forget work: email, thumbnails, notification fanout, event ingestion.
- Enables independent scaling — add consumers to drain a backlog.
- Most queues are at-least-once, so consumers must be idempotent (dedupe on a key).
- Cost: more moving parts, eventual results, and ordering is not free (often per-partition only).
Follow-ups they push on
- Why must queue consumers usually be idempotent?
- What's the difference between a queue (SQS/RabbitMQ) and a log (Kafka)?
- How does a dead-letter queue help, and when do messages land there?
Red flag Assuming exactly-once delivery and writing a non-idempotent consumer — at-least-once redelivery then double-charges, double-sends, or double-processes on the inevitable retry.
source: AWS — What is message queuing? ↗
Google senior design common Design a typeahead / search autocomplete service.
Clarify: top-k suggestions per prefix, ranked by popularity, with very low latency (every keystroke fires a request) and read-heavy load. Two halves — serving and data-gathering.
Serving: precompute the top-k completions for each prefix so a query is a single lookup, not a scan. A trie with the top-k cached at each node answers a prefix in O(prefix length); cache hot tries/results in Redis at the edge. Debounce on the client and cap suggestions so you do not hammer the backend.
Data-gathering (offline): aggregate query logs to count frequencies, then rebuild/update the trie periodically (e.g. via a batch job) rather than on every search — autocomplete tolerates being slightly stale. Wrap up: shard the trie by prefix range, discuss freshness vs cost of rebuild cadence, and personalization/spell-correction as extensions.
Follow-ups they push on
- Why precompute top-k per prefix instead of querying at request time?
- How do you keep the suggestions fresh without rebuilding the trie on every query?
- How would you shard the trie across nodes?
Red flag Querying the database for matching terms on every keystroke and sorting at request time — that does not survive the read volume; the win is precomputing top-k per prefix offline and serving from a cached trie.
source: ByteByteGo — Design A Search Autocomplete System ↗
Commonly asked senior concept common What is consistent hashing, and what specific problem does it solve that modulo hashing does not?
With naive hash(key) % N sharding, changing the node count N changes the modulus, so almost every key remaps to a different node — adding or removing one cache/storage node reshuffles the entire keyspace and cold-starts everything.
Consistent hashing maps both nodes and keys onto the same hash ring (0…2^m). A key is owned by the next node clockwise. Now adding or removing a node only remaps the keys between that node and its neighbor — roughly 1/N of keys, not all of them.
The refinement is virtual nodes: place each physical node at many points on the ring so load spreads evenly and removing a node redistributes its keys across many others instead of dumping them all on one neighbor. This is the standard partitioning scheme for distributed caches, Cassandra, and DynamoDB-style stores.
What a strong answer covers
- hash(key) % N remaps nearly all keys when N changes — catastrophic for a cache.
- Consistent hashing puts nodes + keys on a ring; a key goes to the next node clockwise.
- Adding/removing a node only remaps ~1/N of keys (those between it and its neighbor).
- Virtual nodes spread each physical node across many ring points for even load + smooth rebalancing.
- It's the backbone of distributed caches, Cassandra, and DynamoDB-style partitioning.
Quick self-check
You add one node to a cluster of N. Roughly what fraction of keys remap under consistent hashing vs `hash % N`?
Follow-ups they push on
- Why do virtual nodes improve load balance and rebalancing?
- Roughly what fraction of keys move when you add one node to a ring of N?
- How does this connect to designing a distributed cache?
Red flag Saying consistent hashing 'avoids collisions' — it is about minimizing key movement when the node set changes, not about hash collisions; without virtual nodes load can still skew badly.
source: System Design Primer — Consistent hashing / sharding ↗
Commonly asked senior concept common When and how do you add a cache to a read-heavy system, and what are the gotchas?
Add a cache when reads dominate, the same data is read far more often than it changes, and the database is the bottleneck. The most common pattern is cache-aside (lazy loading): the app reads the cache first; on a miss it reads the database, populates the cache with a TTL, and returns. Writes update the database and invalidate or update the cached entry. Alternatives are read-through/write-through (the cache layer itself loads/writes the DB) and write-back (write cache now, flush to DB async — fast but risks loss).
The gotchas are where seniority shows. Cache invalidation is the hard problem — stale data after a write if you forget to evict. Cache stampede / thundering herd: a hot key expires and thousands of requests hit the DB at once — mitigate with request coalescing, a short lock, or staggered TTLs. Cold start after a flush hammers the DB. And caching is for tolerable-staleness data — never cache something that must be strongly consistent (a bank balance) without care.
What a strong answer covers
- Add a cache when reads ≫ writes, data is reused, and the DB is the bottleneck.
- Cache-aside: read cache → miss → read DB → populate with TTL; writes invalidate the entry.
- Alternatives: read-through/write-through (cache fronts the DB) and write-back (async flush, risks loss).
- Invalidation is the hard part — stale reads after a write if you forget to evict.
- Guard against stampede (hot key expiry → DB flood): coalescing, locks, staggered TTLs.
Quick self-check
In the cache-aside pattern, what happens on a cache miss?
Follow-ups they push on
- Why is cache invalidation famously the hard part of caching?
- How do you prevent a cache stampede when a popular key expires?
- What data should you NOT cache, and why?
Red flag Caching without an invalidation/TTL strategy — writes update the DB but leave stale entries in the cache, so users keep reading old data until the entry happens to expire.
source: System Design Primer — Caching ↗
AmazonStripeCloudflare senior design very common Design a distributed rate limiter for a public API.
Clarify: client-side vs server-side (server-side), what dimensions to limit (per-user, per-IP, per-endpoint, global), and the action on limit (drop, queue, return 429 with Retry-After). Pick an algorithm and justify it: token bucket (allows bursts, simple, most common), leaky bucket (smooths output), fixed window (cheap but boundary spikes), sliding-window log (accurate, memory-heavy), sliding-window counter (good accuracy/cost tradeoff).
For a distributed fleet, counters must be shared: keep them in a central store like Redis, keyed by userId:window, incremented atomically (e.g. a Lua script / INCR + EXPIRE) so the read-modify-write is race-free. Put the limiter at the edge / API gateway so rejected traffic never reaches your services.
Discuss tradeoffs: local in-memory counters are fast but let bursts through across nodes; Redis adds a network hop and a single point to scale; allow a small over-limit margin to tolerate Redis latency.
Follow-ups they push on
- Token bucket vs sliding-window counter — when do you prefer each?
- How do you keep the counter consistent across many API servers?
- What happens if Redis goes down — fail open or fail closed?
Red flag Using a non-atomic get-then-set on the counter, which races under concurrency and lets requests slip past the limit. Also putting the limiter behind the app instead of at the gateway.
source: ByteByteGo — Design A Rate Limiter ↗
Meta senior design very common Design a social media news feed (e.g. the Facebook/Twitter timeline).
Clarify: feed of posts from people you follow, ranked (recency or relevance), heavy read load. The core decision is fanout-on-write (push) vs fanout-on-read (pull).
Fanout-on-write: when a user posts, push the post id into every follower's precomputed feed cache. Reads are O(1) and fast — great for most users. But it explodes for celebrities with millions of followers (the hot-key / fanout problem).
Fanout-on-read: build the feed at request time by pulling recent posts from everyone the user follows. No write amplification, but reads are expensive and slow.
The standard answer is a hybrid: push for normal accounts, pull for a small set of high-follower accounts, then merge at read time. Cache assembled feeds (Redis), store posts in a sharded store, and rank with a separate scoring service.
Follow-ups they push on
- How do you handle a celebrity with 50M followers under fanout-on-write?
- Where does ranking/ML scoring fit — write time or read time?
- How do you keep the feed cache from growing unbounded?
Red flag Committing to pure fanout-on-write without acknowledging the celebrity hot-key problem, or pure fanout-on-read and ignoring read latency at scale.
source: ByteByteGo — Design A News Feed System ↗
AmazonMeta senior design common Design a distributed cache (like a multi-node Redis/Memcached layer).
Clarify: read-heavy lookups, low latency, data too large for one node's RAM, so partition across nodes. The key technique is consistent hashing: map both nodes and keys onto a hash ring so that adding/removing a node only remaps ~1/N of keys instead of remapping everything (which a naive hash(key) % N would do).
Discuss replication for availability (replica per shard, read from replicas), an eviction policy (LRU/LFU) since memory is bounded, and the write policy: write-through (write cache + DB together, consistent but slower) vs write-back (write cache, flush DB async, fast but risks loss) vs cache-aside (app reads cache, on miss reads DB and populates).
Wrap up: name failure modes — cache stampede on a hot key expiring (mitigate with request coalescing / a short lock), and the thundering herd on cold start.
Follow-ups they push on
- Why consistent hashing instead of modulo hashing?
- How do you prevent a cache stampede when a hot key expires?
- Cache-aside vs write-through — what consistency do you give up?
Red flag Proposing `hash(key) % N` for sharding — adding one node reshuffles almost every key and cold-starts the whole cache.
source: ByteByteGo — Distributed Cache (System Design Interview) ↗
Commonly asked senior design very common How do you do back-of-the-envelope estimation? Estimate QPS and storage for a service with 100M daily active users.
The point is order-of-magnitude reasoning, not precision. Start from DAU and an action rate. Say each of 100M users does ~10 reads/day -> 1B reads/day. Divide by ~86,400 s/day (~10^5) -> ~12K reads/s average; multiply by a peak factor of ~2-3x -> ~30K peak QPS.
Storage: rows/day * bytes/row * retention. If you store 1M new items/day at ~1KB each, that is ~1GB/day, ~365GB/year, ~1TB over 3 years — round freely.
Keep a few anchors memorized: ~10^5 seconds/day, reads usually dwarf writes (often 100:1), a memory read is ~ns, SSD ~µs, network round-trip cross-region ~tens of ms. State your assumptions out loud and round to clean powers of ten so the interviewer can follow.
Follow-ups they push on
- What read:write ratio did you assume and why?
- How does the storage number change the database choice?
- What peak-to-average factor is reasonable, and why?
Red flag Reaching for a calculator and false precision. The interviewer wants to see assumptions stated and powers-of-ten arithmetic, not 31,536,000 seconds.
source: System Design Primer — Back-of-the-envelope ↗
Commonly asked senior design very common Walk me through the 4-step framework you use to attack any system design interview.
(1) Understand the problem & scope: ask clarifying questions, separate functional from non-functional requirements (scale, latency, consistency, availability), and do capacity estimates. Pin down what is in and out of scope before drawing anything.
(2) Propose a high-level design and get buy-in: sketch the major boxes — clients, API/gateway, services, datastores, cache, queue — and the data flow. Confirm the interviewer agrees before going deep.
(3) Deep dive: pick the 1-2 components the interviewer cares about (the data model, the sharding strategy, the hot path) and go deep — algorithms, schema, partitioning, the actual bottleneck.
(4) Wrap up: name bottlenecks, single points of failure, and tradeoffs; mention what you would monitor and how you would scale the next 10x. The discipline is to drive the conversation, not silently draw.
Follow-ups they push on
- How do you decide which component to deep-dive on?
- What non-functional requirements do you always ask about?
Red flag Skipping step 1 and diving into a database schema before clarifying scale, latency, and consistency needs — the single most common reason candidates fail the round.
source: ByteByteGo — A framework for system design interviews ↗
Commonly asked senior design common How do you identify bottlenecks and single points of failure in a design, and how do you remove them?
Trace the request path and ask, at each hop, what happens if this one component dies or saturates. A single load balancer, a single primary database, a single cache node, or a single region are classic SPOFs.
Remove SPOFs with redundancy + failover: run the load balancer in an active-passive pair, replicate the DB (primary + replicas, automatic promotion), spread services across multiple availability zones, and use health checks so traffic routes away from dead instances.
For bottlenecks, find the component nearest its capacity ceiling: stateless app tier scales horizontally behind the LB; a write-bound DB needs sharding or a queue to absorb bursts; a read-bound DB needs replicas + a cache. The senior move is to quantify it (this shard does X writes/s, the limit is Y) rather than hand-wave 'add more servers'.
Follow-ups they push on
- How does making services stateless help you scale horizontally?
- How do you decide between adding read replicas vs sharding?
Red flag Treating 'add a load balancer' as the whole answer while the load balancer itself remains a single point of failure, or scaling a stateful service horizontally without externalizing session state.
source: System Design Primer — Availability patterns ↗
Google senior design common Design a web crawler that can crawl the public web.
Clarify: scale (billions of pages), politeness (respect robots.txt and per-host rate limits), freshness, and what you extract. Core loop: a URL frontier (a queue, partitioned by host so one host's pages go to one worker for politeness) feeds a fleet of fetchers; fetched HTML goes to a parser that extracts links, which are de-duplicated and fed back into the frontier.
Key components: a DNS cache (DNS resolution is a hidden bottleneck), a seen-URL set (Bloom filter / hash store) to avoid re-crawling, and content de-duplication (hash or Simhash of page content to skip near-duplicates). Store raw pages in object storage / a distributed file store.
Wrap up: politeness is the subtle part — partition the frontier by domain and apply a per-host crawl delay so you do not hammer one site; add priority queues so important pages get crawled sooner.
Follow-ups they push on
- How do you avoid crawling the same URL (or near-duplicate content) twice?
- How do you stay polite to a single host while still being massively parallel?
- Why is DNS a bottleneck and how do you mitigate it?
Red flag Forgetting politeness/robots.txt and a de-dup mechanism — an interviewer reads that as someone who would get the crawler IP-banned and stuck in cycles.
source: ByteByteGo — Design A Web Crawler ↗
Commonly asked senior concept very common How do you choose between SQL and NoSQL for a system-design problem?
Drive it from access patterns and requirements, not preference. Reach for SQL (Postgres/MySQL) when you need strong consistency and multi-row transactions (ACID), rich ad-hoc queries and joins, and a stable relational schema — payments, orders, anything where correctness beats raw write throughput.
Reach for NoSQL when you need massive horizontal write scale, a flexible/evolving schema, or a specific access shape: a wide-column store (Cassandra/DynamoDB) for huge write volume and known key lookups, a document store (MongoDB) for nested aggregates, a KV store (Redis) for caching, a graph DB for relationship-heavy traversals.
The senior framing is the tradeoff: most NoSQL stores trade joins and strong consistency for partition tolerance and horizontal scale, and you must model the table around the query up front. State your access pattern, then justify the store.
What a strong answer covers
- Choose from access patterns + consistency needs, never from familiarity.
- SQL: ACID transactions, joins, ad-hoc queries, stable relational schema (orders, payments).
- NoSQL: horizontal write scale, flexible schema, query-shaped models (Cassandra/DynamoDB, Mongo, Redis).
- NoSQL usually trades joins + strong consistency for scale — you model around the query first.
- It is not all-or-nothing: polyglot persistence — SQL for the core, Redis for cache, a search index alongside.
Quick self-check
A payments service needs multi-row transactions and strong consistency. Best default store?
Follow-ups they push on
- Which store fits a write-heavy event log with known key lookups?
- What do you give up when you pick a wide-column store over Postgres?
- When would you run both a SQL store and a NoSQL store in the same system?
Red flag Declaring 'NoSQL because it scales' with no access pattern stated — many NoSQL stores need the schema modeled around the exact query, and you lose joins/transactions you may actually need.
source: System Design Primer — SQL or NoSQL ↗

6.2.1 Containers (Docker) 13

Commonly asked mid concept very common What is the difference between a Docker image and a container?
An image is the blueprint — an immutable, read-only stack of layers (filesystem + metadata like the default command) built from a Dockerfile. A container is a running (or stopped) instance of an image: Docker adds a thin writable layer on top of the read-only image layers and gives it an isolated process, network, and mount namespace.
The analogy: image is to container as a class is to an object, or a program on disk is to a process. You can spin up many containers from one image; each gets its own writable layer, so changes inside one container do not affect the image or the other containers.
Follow-ups they push on
- What happens to data written inside a container when it is removed?
- Why are image layers read-only and the container layer writable?
Red flag Saying data persists in the image after a container writes to it — writes land in the container's ephemeral writable layer and vanish when the container is removed unless you mount a volume.
source: Docker docs — Images and layers ↗
Commonly asked mid concept occasional What is the difference between Docker's default bridge network and a user-defined bridge network?
Both use the bridge driver, but a user-defined bridge adds the feature you almost always want: built-in DNS-based service discovery. Containers on the same user-defined network can reach each other by container name (http://api:3000), because Docker runs an embedded DNS resolver for that network.
On the default bridge network, name resolution is not provided — containers can only reach each other by IP (or the legacy, deprecated --link), which is fragile because IPs change. User-defined networks also give you better isolation (only containers you attach can talk) and let you attach/detach containers on the fly.
The practical takeaway: for any multi-container app, create a user-defined bridge (which is exactly what docker-compose does automatically) so services find each other by name rather than chasing IP addresses.
What a strong answer covers
- User-defined bridge networks give automatic DNS — reach containers by name.
- The default bridge has no name resolution (IP only, or deprecated --link).
- User-defined networks add isolation — only attached containers can communicate.
- Compose creates a user-defined network for you, which is why services resolve each other by service name.
- Prefer user-defined bridges for any multi-container app; avoid relying on the default bridge.
Follow-ups they push on
- Why is reaching containers by IP on the default bridge fragile?
- How does docker-compose use this under the hood?
- What does the `host` network driver change about all this?
Red flag Expecting container-name DNS resolution to work on the default `bridge` network — it doesn't; you must create a user-defined network (or use compose) to get name-based service discovery.
source: Docker docs — Networking overview ↗
Commonly asked mid concept occasional What is a container registry, and what is the danger of deploying images tagged `:latest`?
A registry (Docker Hub, GHCR, ECR) is the remote store for images: you push built images to it and nodes pull them at deploy time. An image is addressed by registry/repository:tag plus an immutable content digest (sha256:...).
The :latest tag is the trap. It is just a mutable label, not a guarantee of newness — it points to whatever was last pushed with that tag, and it can be overwritten. So 'deploy :latest' is non-deterministic: two nodes pulling at different times can run different code, you can't tell which build is in production, and rollbacks are ambiguous. It also undermines caching (Docker may skip re-pulling a tag it already has, so you can silently run a stale image).
The fix: deploy immutable, specific tags (a version or git SHA, e.g. :1.4.2 or :sha-abc123), or pin by digest. Reserve :latest for casual local use only.
What a strong answer covers
- A registry stores images; nodes pull by repo:tag plus an immutable sha256 digest.
- :latest is a mutable pointer, not 'the newest' — it can be overwritten and means different things over time.
- Deploying :latest is non-deterministic: nodes can run different builds; rollbacks are ambiguous.
- Pin to a version or git SHA tag (or the digest) so a deploy is reproducible and traceable.
- It also defeats reliable cache invalidation — you can silently keep running a stale image.
Quick self-check
What does the `:latest` tag actually guarantee about an image?
Follow-ups they push on
- Why is pinning by digest the strongest guarantee of running an exact image?
- How does `:latest` make a rollback ambiguous?
- What naming scheme would you use for production image tags?
Red flag Shipping `:latest` to production — it is mutable, so different nodes can run different code and you lose the ability to say exactly which build is live or roll back to a known-good one.
source: Docker docs — Push and pull / registries ↗
Commonly asked mid trick occasional What is the difference between `COPY` and `ADD` in a Dockerfile, and which should you default to?
Both copy files into the image, but ADD has two extra, surprising behaviors: it can fetch a remote URL, and it auto-extracts local tar archives into the destination. COPY does exactly one thing — copy local files/directories from the build context — with no magic.
The guidance (and Docker's own best practice) is to default to COPY because it is explicit and predictable. Reserve ADD for the one case it is genuinely good at: copying-and-extracting a local tarball in a single step. For fetching remote files, prefer an explicit RUN curl/wget (or better, ADD's checksum options) so the intent and caching are clear.
The trick the interviewer is checking: candidates who use ADD https://... casually may not realize it bypasses the clarity of COPY and can silently auto-extract archives, leading to surprising image contents.
What a strong answer covers
- COPY copies local build-context files only — no surprises.
- ADD also fetches remote URLs and auto-extracts local tar archives.
- Default to COPY for predictability (Docker's own best-practice guidance).
- Use ADD only for its niche win: copy-and-extract a local tarball in one step.
- For remote downloads prefer explicit RUN curl/wget so caching and intent are clear.
Follow-ups they push on
- What surprising thing happens if you `ADD` a local `.tar.gz` file?
- Why is `RUN curl` often preferred over `ADD <url>` for remote files?
- When is `ADD` genuinely the right choice?
Red flag Using `ADD` everywhere as a synonym for `COPY` — its auto-extraction of tar archives and URL fetching are silent, surprising behaviors; default to `COPY` and reach for `ADD` only deliberately.
source: Docker docs — Dockerfile reference (ADD / COPY) ↗
Commonly asked mid concept very common Why does the order of instructions in a Dockerfile matter? How does layer caching work?
Each Dockerfile instruction creates a layer. On rebuild, Docker reuses a cached layer as long as that instruction and everything it depends on are unchanged; the first instruction that changes invalidates that layer and every layer after it.
So you order from least-frequently-changing to most-frequently-changing. The classic example for a Node app: COPY package.json then RUN npm install BEFORE COPY . .. Dependencies change rarely, so the expensive npm install layer stays cached across most builds; only the cheap source-copy layer rebuilds when you edit code. If you COPY . . first, every source edit busts the cache and reinstalls all dependencies.
Follow-ups they push on
- Where would you put `COPY package.json` vs `COPY . .` and why?
- How does a `.dockerignore` file interact with build caching?
Red flag Copying the whole source tree before installing dependencies — every code change then invalidates the dependency-install layer and forces a slow full reinstall.
source: Docker docs — Building best practices ↗
Commonly asked mid concept common What is a `.dockerignore` file and why does it matter for both build speed and security?
.dockerignore lists paths excluded from the build context — the set of files the Docker daemon receives before building. Excluding node_modules, .git, build output, and local env files makes the context smaller, so builds start faster and the cache is less likely to bust on irrelevant changes.
The security angle: without it, a COPY . . can sweep secrets (.env, .aws/, private keys, .git history) straight into an image layer, where they persist even if a later layer deletes them. So .dockerignore both speeds up builds and keeps secrets out of the image.
Follow-ups they push on
- Why does deleting a secret in a later layer not actually remove it from the image?
- What belongs in a typical `.dockerignore`?
Red flag Believing that a `RUN rm secret` later in the Dockerfile removes the secret — layers are additive, so the file still lives in the earlier layer and can be extracted from the image history.
source: Docker docs — Building best practices (.dockerignore) ↗
Commonly asked mid concept common When would you use docker-compose, and what problem does it solve?
docker-compose defines and runs a multi-container app from a single declarative YAML file. Instead of starting each container with a long docker run and wiring up networks/volumes by hand, you describe the services (app, db, cache), their images/build contexts, ports, env, volumes, and dependencies, then docker compose up brings the whole stack up on a shared network where services reach each other by service name.
Its sweet spot is local development and CI — reproducing a realistic multi-service environment (e.g. an API + Postgres + Redis) with one command. It is not an orchestrator; for production scheduling, self-healing, and scaling across many machines you reach for Kubernetes.
Follow-ups they push on
- How do services in a compose file discover each other?
- Why is compose not a substitute for Kubernetes in production?
Red flag Pitching docker-compose as a production orchestration tool — it does not give you multi-node scheduling, self-healing, or rolling updates across a cluster.
source: Docker docs — Docker Compose overview ↗
Commonly asked mid concept common What is the difference between a Docker volume and a bind mount, and when do you use each?
Both persist data outside the container's ephemeral writable layer, but they differ in who owns the storage. A named volume is managed by Docker in its own storage area (/var/lib/docker/volumes/...); you reference it by name, Docker handles the location, and it is the portable, production-friendly default — great for databases and app data that must outlive a container.
A bind mount maps a specific host path straight into the container. It is tied to the host's directory layout, so it is ideal for local development (mount your source code so edits show up live) but brittle and host-coupled for production.
Rule of thumb: volumes for data Docker should manage and that must survive container removal; bind mounts for sharing host files into a container during development. A third option, tmpfs, keeps data in memory only — for secrets/scratch that should never hit disk.
What a strong answer covers
- Both survive the container's ephemeral writable layer; the difference is who owns the storage.
- Named volume: Docker-managed, portable, the production default (databases, persistent app data).
- Bind mount: a specific host path into the container — perfect for live-reloading source in local dev.
- Bind mounts are host-coupled and brittle for production; volumes abstract the location away.
- tmpfs mounts live in memory only — for scratch/secret data that must never touch disk.
Quick self-check
You want a Postgres container's data to survive container recreation and stay portable across hosts. Use:
Follow-ups they push on
- Why is a bind mount a poor choice for production data persistence?
- Where does a named volume actually live, and why does that make it portable?
- When would you reach for a tmpfs mount?
Red flag Relying on a bind mount in production — it couples the container to the host's exact directory layout, so the same image behaves differently (or breaks) on another host; use a named volume so Docker owns the storage.
source: Docker docs — Volumes ↗
Commonly asked senior debug occasional Your container starts and immediately exits with code 0, and you don't know why. How do you debug it?
Exit code 0 means the main process finished successfully — a container lives exactly as long as its PID 1 runs, so if the command completes, the container stops. This is usually a misconception, not a bug: the image's CMD/ENTRYPOINT ran a one-shot command (or a process that daemonized into the background) instead of a long-running foreground process.
Debug it: docker ps -a to confirm the exit code, docker logs <container> to see what it printed, and docker inspect <container> for the actual command and config. Then check whether CMD runs a foreground process — a common trap is starting a server that forks into the background, so PID 1 returns and the container exits.
Fix: make the entrypoint run a long-lived foreground process (e.g. nginx -g 'daemon off;', or run the app directly rather than via a launcher that backgrounds it). For interactive debugging, override the entrypoint: docker run -it --entrypoint sh <image>.
What a strong answer covers
- A container runs only as long as its PID 1; exit 0 = the main command completed normally.
- Usual cause: CMD ran a one-shot command, or a server daemonized into the background so PID 1 returned.
- Inspect with docker ps -a (exit code), docker logs, and docker inspect (the actual command).
- Fix: run the process in the foreground (e.g. nginx -g 'daemon off;').
- Drop into the image to poke around: docker run -it --entrypoint sh <image>.
Follow-ups they push on
- Why does a server that forks into the background cause the container to exit?
- How do you get a shell inside an image whose entrypoint exits immediately?
- How is exit code 0 different in meaning from 137 or 1?
Red flag Assuming a clean exit code 0 means something crashed — it means the foreground process finished; the real fix is running a long-lived foreground process as PID 1, not adding restart policies.
source: Docker docs — Run and manage containers ↗
Commonly asked senior coding common Write a multi-stage Dockerfile for a Node app and explain why multi-stage builds matter.
A multi-stage build uses multiple FROM statements: a heavy build stage compiles/installs, then a slim runtime stage copies only the final artifacts. The build toolchain (compilers, dev dependencies) never ships in the final image, so it is smaller and has a smaller attack surface.
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]
The COPY --from=build pulls only built output from the earlier stage; the final image starts from a slim base and runs as the non-root node user.
Follow-ups they push on
- Why run as a non-root user in the final stage?
- How would you get an even smaller image (distroless / alpine)?
Red flag Shipping the full build image with dev dependencies and toolchain, or running as root in the final stage — bigger image, larger attack surface, and a container that can do more damage if compromised.
source: Docker docs — Multi-stage builds ↗
Commonly asked senior trick common What is the difference between `CMD` and `ENTRYPOINT` in a Dockerfile?
Both define what runs when the container starts, but they compose differently. ENTRYPOINT sets the fixed executable; CMD sets default arguments that are easy to override at docker run time.
With ENTRYPOINT ["python", "app.py"] the container always runs that; anything you pass to docker run is appended as args. With only CMD ["python", "app.py"], passing a command to docker run replaces it entirely. A common pattern is ENTRYPOINT for the binary plus CMD for default flags, so docker run image uses the defaults and docker run image --other-flag overrides just the flags.
Prefer the exec form (JSON array) over the shell form so signals like SIGTERM reach your process directly for clean shutdown.
Follow-ups they push on
- Why does the exec form matter for graceful shutdown / signal handling?
- How do `ENTRYPOINT` and `CMD` combine when both are present?
Red flag Using the shell form (`CMD node server.js`) so the app runs as a child of `/bin/sh`, which swallows `SIGTERM` — the container then gets SIGKILLed on stop instead of shutting down gracefully.
source: Docker docs — Dockerfile reference (CMD / ENTRYPOINT) ↗
Commonly asked senior debug common Your Docker image is 1.2GB and builds take 10 minutes on every code change. How do you debug and fix it?
Two separate problems: image size and build time.
Size: run docker history <image> to see which layers are fat. Usual culprits are a heavy base image (use -slim/-alpine/distroless), build toolchain shipped in the runtime image (fix with a multi-stage build copying only artifacts), and dev dependencies (npm ci --omit=dev). Combine related RUN steps and clean package caches in the same layer so the cleanup actually shrinks the layer.
Build time on every change: this is almost always cache invalidation from instruction order. Copy and install dependencies before copying source, add a .dockerignore so unrelated files do not bust the context, and enable BuildKit so independent stages build in parallel. After reordering, only the source layer rebuilds on a code edit, dropping the loop from minutes to seconds.
Follow-ups they push on
- Which tool shows you per-layer size, and what do you look for?
- Why does cleaning a cache in a separate `RUN` not reduce image size?
Red flag Adding `RUN rm -rf /var/cache/...` as a new layer after the install layer — additive layers mean the bytes still count; the cleanup must happen in the same `RUN` as the install.
source: Docker docs — Building best practices ↗
Commonly asked senior concept common How do containers achieve isolation? What kernel features make a container different from a VM?
A container is just a regular Linux process that the kernel isolates using two features: namespaces and cgroups. Namespaces scope *what a process can see* — separate PID, network, mount, user, and hostname namespaces make the process believe it has its own process tree, network stack, and filesystem. cgroups scope *what it can use* — CPU, memory, and I/O limits. Together they give the illusion of a private machine while everything shares one host kernel.
That shared kernel is the key contrast with a VM: a VM runs a full guest OS with its own kernel on top of a hypervisor, so it is heavier (GBs, slow boot) but more strongly isolated. A container shares the host kernel, so it is lightweight (MBs, sub-second start) but the isolation is weaker — a kernel exploit can cross the boundary.
This is why containers pack densely and start fast, and why you don't run untrusted multi-tenant workloads on bare containers without extra sandboxing.
What a strong answer covers
- A container is a host process isolated by namespaces (what it can see) + cgroups (what it can use).
- Namespaces: PID, network, mount, user, UTS — each process gets its own view of the system.
- cgroups bound CPU/memory/IO so one container can't starve the others.
- Containers share the host kernel (light, fast); VMs run a full guest OS + hypervisor (heavy, stronger isolation).
- Weaker container isolation is why untrusted multi-tenant workloads need extra sandboxing (gVisor, microVMs).
Quick self-check
Which pair of Linux kernel features primarily provides container isolation?
Follow-ups they push on
- What do namespaces isolate vs what cgroups limit?
- Why does sharing the host kernel make containers faster but less isolated than VMs?
- When would you still prefer a VM (or microVM) over a plain container?
Red flag Describing a container as a 'lightweight VM' — there is no guest OS or hypervisor; it is a host process with kernel-enforced isolation, which is exactly why the isolation boundary is weaker than a VM's.
source: Docker docs — What is a container? ↗

6.2.2 Orchestration (Kubernetes) 13

★ must-know Commonly asked mid concept very common What are the differences between a Service of type ClusterIP, NodePort, and LoadBalancer?
They form a ladder of increasing external exposure, and each builds on the previous.
ClusterIP (the default) gives the Service a stable virtual IP reachable only inside the cluster — perfect for service-to-service traffic that should never be public. NodePort opens a fixed high port (30000–32767) on every node, so external traffic to nodeIP:nodePort reaches the Service; it builds on ClusterIP and is mostly a dev/debug or building-block mechanism, not a polished production front door. LoadBalancer provisions an external cloud load balancer (an AWS NLB/ALB, a GCP LB) that fronts the Service with a single external IP — the production way to expose one Service to the internet.
The senior nuance: one LoadBalancer per Service gets expensive, so for many HTTP services you front them with a single Ingress (L7 routing/TLS) backed by one load balancer instead of a LoadBalancer Service each.
What a strong answer covers
- ClusterIP (default): internal-only stable virtual IP — service-to-service traffic.
- NodePort: opens a fixed port on every node; builds on ClusterIP, mainly dev/building-block.
- LoadBalancer: provisions a cloud load balancer with an external IP — production single-service exposure.
- Each type is a superset of the previous (LoadBalancer → NodePort → ClusterIP under the hood).
- Many HTTP services? Use one Ingress instead of a LoadBalancer per Service to save cost.
Quick self-check
You need internal-only communication between two microservices in the cluster. Which Service type?
Follow-ups they push on
- Why would you front many services with an Ingress instead of a LoadBalancer each?
- What range do NodePorts fall in, and why isn't NodePort a great production front door?
- How does a LoadBalancer Service actually get its external IP?
Red flag Reaching for a LoadBalancer Service per microservice — each provisions (and bills for) a separate cloud load balancer; route many HTTP services through a single Ingress instead.
source: Kubernetes docs — Service (publishing types) ↗
Commonly asked mid concept very common Explain the core Kubernetes objects: Pod, Deployment, Service, and Ingress. How do they relate?
A Pod is the smallest deployable unit — one or more containers sharing a network namespace and storage. Pods are ephemeral; you rarely create them directly.
A Deployment is the controller you actually use: you declare a desired replica count and a pod template, and it manages a ReplicaSet to keep that many pods running, replacing crashed ones and handling rolling updates.
A Service gives that fluid set of pods a single stable virtual IP and DNS name, load-balancing across the matching pods (selected by labels) so callers do not chase changing pod IPs.
Ingress sits in front of Services to route external HTTP(S) traffic — host/path routing and TLS termination — to the right Service. So: Ingress -> Service -> Pods, with the Deployment keeping the pods alive underneath.
Follow-ups they push on
- How does a Service know which pods to send traffic to?
- What is the difference between a Service of type ClusterIP, NodePort, and LoadBalancer?
Red flag Conflating a Service with an Ingress — a Service does L4 load-balancing inside the cluster, Ingress does L7 HTTP routing and TLS at the edge.
source: Kubernetes docs — Concepts ↗
Commonly asked mid concept occasional What is a namespace in Kubernetes, and what problems does it actually solve (and not solve)?
A namespace is a virtual cluster-within-a-cluster: a scope for naming and a boundary for applying policy. It lets you partition one physical cluster among teams or environments (team-a, staging) so names don't collide and you can attach ResourceQuotas (cap CPU/memory per namespace), RBAC (who can do what, where), and NetworkPolicies per slice.
What it is good for: organization, quota, and access control on a shared cluster. What it is not: a hard security/isolation boundary. By default, pods in different namespaces can still reach each other over the network — namespaces alone do not isolate traffic; you need NetworkPolicies for that. And some objects are cluster-scoped (nodes, PersistentVolumes, namespaces themselves), so they live outside any namespace.
The senior point: namespaces are an organizational and policy primitive, not a substitute for multi-tenancy isolation between untrusted parties.
What a strong answer covers
- A namespace scopes names and is the unit for ResourceQuota, RBAC, and NetworkPolicy.
- Great for partitioning a shared cluster by team or environment.
- Not a network isolation boundary — cross-namespace pod traffic is allowed by default.
- Use NetworkPolicies to actually restrict traffic between namespaces.
- Some objects are cluster-scoped (nodes, PVs, namespaces) and aren't namespaced.
Quick self-check
By default, can a pod in namespace `a` reach a pod in namespace `b` over the network?
Follow-ups they push on
- Why don't namespaces stop pods in different namespaces from talking to each other?
- What do you add to get real network isolation between namespaces?
- Name a couple of resources that are cluster-scoped, not namespaced.
Red flag Treating namespaces as a security boundary for untrusted tenants — without NetworkPolicies (and often stronger isolation), pods across namespaces can still reach each other on the network.
source: Kubernetes docs — Namespaces ↗
Commonly asked mid trick very common What is the difference between a ConfigMap and a Secret? Is a Secret actually encrypted?
Both inject configuration into pods (as env vars or mounted files) and both keep config out of the image. The difference is intent: ConfigMaps hold non-sensitive config (feature flags, URLs); Secrets hold sensitive values (passwords, tokens, keys).
The gotcha: a Secret is only base64-encoded, not encrypted — base64 is trivially reversible, so anyone who can read the Secret object sees the value. To actually protect Secrets you must enable encryption-at-rest for etcd, lock down access with RBAC, and avoid committing Secret manifests to git. Many teams go further with an external secret store (Vault, cloud secret managers) and pull values in at runtime.
Follow-ups they push on
- What two things must you configure to make Secrets meaningfully secure?
- Why is putting a Secret YAML in git dangerous even though it 'looks encoded'?
Red flag Claiming a Kubernetes Secret is encrypted by default — it is base64, which is encoding, not encryption. Without encryption-at-rest + RBAC it offers essentially no confidentiality.
source: Kubernetes docs — Secrets ↗
Commonly asked senior concept common What is a StatefulSet, and how is it different from a Deployment? When do you need one?
A Deployment treats its pods as interchangeable, fungible replicas — random names, no stable identity, no per-pod storage. That is exactly right for stateless app servers.
A StatefulSet gives each pod a stable, sticky identity: a stable ordinal name (db-0, db-1), stable network identity (a headless Service gives each a predictable DNS name), and its own persistent volume that survives reschedule and follows the pod. Pods are created/scaled/terminated in order (0, 1, 2 …), which matters for clustered systems that need a known startup/teardown sequence.
You need a StatefulSet for stateful, clustered workloads where identity matters: databases, Kafka, ZooKeeper, Elasticsearch — anything where pod db-0 must keep being db-0 with the same data. For stateless web/API tiers, always use a Deployment. The senior caveat: running databases in-cluster at all is a real decision; many teams prefer a managed database over a StatefulSet.
What a strong answer covers
- Deployment pods are fungible (random names, shared/no per-pod storage) — for stateless apps.
- StatefulSet gives each pod a stable ordinal identity (db-0), stable DNS, and its own PVC.
- Pods come up / scale / terminate in order, which clustered systems rely on.
- Use it for databases, Kafka, ZooKeeper, Elasticsearch — workloads where identity + data stick to the pod.
- Caveat: consider a managed database instead of running stateful systems in-cluster.
Quick self-check
Which workload genuinely requires a StatefulSet rather than a Deployment?
Follow-ups they push on
- Why does a database need stable identity and per-pod storage that a web server doesn't?
- What role does the headless Service play for a StatefulSet?
- When would you avoid a StatefulSet and use a managed service instead?
Red flag Running a stateful, clustered system (a database, Kafka) under a plain Deployment — pods get random identities and can share/lose storage, so a rescheduled pod comes back as a different node with the wrong (or no) data.
source: Kubernetes docs — StatefulSets ↗
Commonly asked senior concept occasional How do you control which node a pod lands on? Explain taints/tolerations vs node affinity.
Two mechanisms that work from opposite directions. Node affinity (and the simpler nodeSelector) is a pod-side attraction: the pod says 'schedule me on nodes with label gpu=true'. It can be hard (requiredDuringScheduling) or soft/preferred.
Taints and tolerations are a node-side repulsion: you taint a node (kubectl taint nodes node1 gpu=true:NoSchedule) so it repels all pods by default, and only pods that carry a matching toleration are allowed on. So a taint reserves a node; a toleration is a pod's permission slip to land on a tainted node.
The key distinction: affinity *attracts* a pod toward nodes; a taint *repels* pods away from a node unless they tolerate it — and a toleration alone does not *force* a pod onto that node (you pair it with affinity for that). Use taints to dedicate expensive/special nodes (GPU, spot) and affinity to steer pods toward the right hardware; add pod anti-affinity to spread replicas across nodes/zones for HA.
What a strong answer covers
- Node affinity / nodeSelector: pod-side *attraction* toward nodes with matching labels.
- Taints: node-side *repulsion* — a tainted node rejects pods unless they tolerate the taint.
- Tolerations: a pod's permission to schedule onto a tainted node (but doesn't force it there).
- Combine: taint dedicates a node (GPU/spot), affinity steers the right pods to it.
- Pod anti-affinity spreads replicas across nodes/zones for availability.
Follow-ups they push on
- Why doesn't a toleration alone guarantee a pod runs on the tainted node?
- How would you dedicate GPU nodes so only ML workloads land there?
- How does pod anti-affinity improve availability?
Red flag Assuming a toleration *attracts* a pod to a tainted node — a toleration only lets the pod tolerate the taint; to actually steer it there you also need node affinity/nodeSelector.
source: Kubernetes docs — Taints and Tolerations ↗
Commonly asked senior concept occasional Why do you set both a readiness probe and a preStop hook + terminationGracePeriod for zero-downtime shutdown?
When a pod is deleted (a rolling update, a scale-down), two things happen in parallel, which is the source of the race: Kubernetes sends the container SIGTERM, and it (asynchronously) removes the pod from Service endpoints. Because endpoint removal propagates through kube-proxy/iptables with a small delay, the load balancer can keep sending new requests to a pod that has already started shutting down — causing dropped connections mid-rollout.
The fix is to give that propagation time to win the race. A preStop hook that sleeps a few seconds delays the actual shutdown so in-flight endpoint removal completes before the app stops accepting connections. The terminationGracePeriodSeconds must be long enough to cover the preStop sleep plus the app draining in-flight requests after SIGTERM, before Kubernetes escalates to SIGKILL. Readiness probes handle the *startup* side (no traffic until ready); preStop + grace period handle the *shutdown* side.
The app must also handle SIGTERM to stop accepting new work and finish in-flight requests — otherwise it gets SIGKILLed and drops connections regardless.
What a strong answer covers
- On pod deletion, SIGTERM and endpoint removal happen in parallel — that's the race.
- Endpoint removal propagates with a delay, so traffic can still arrive at a terminating pod.
- A preStop sleep delays shutdown until endpoint removal propagates (drains the LB).
- terminationGracePeriodSeconds must cover preStop + in-flight drain before SIGKILL.
- The app must catch SIGTERM and finish in-flight requests, or it gets force-killed.
Follow-ups they push on
- Why can a pod still receive traffic after it gets SIGTERM?
- What happens if the grace period is shorter than your preStop + drain time?
- Why isn't a readiness probe alone enough for graceful shutdown?
Red flag Relying on SIGTERM handling alone and skipping the preStop delay — endpoint removal hasn't propagated yet, so the load balancer keeps routing new requests to the dying pod and connections drop mid-rollout.
source: Kubernetes docs — Pod Lifecycle (termination) ↗
Commonly asked senior concept common What is the difference between a liveness probe and a readiness probe? What breaks if you confuse them?
A liveness probe answers 'is this container healthy?' If it fails, the kubelet restarts the container. A readiness probe answers 'can this pod take traffic right now?' If it fails, the pod is pulled out of the Service's endpoints but is NOT restarted.
Use readiness for slow startup or temporary unavailability (warming a cache, waiting on a dependency); use liveness only for unrecoverable hangs.
The classic mistake: pointing a liveness probe at a deep health check that also depends on a database. When the DB hiccups, every pod fails liveness and gets restarted simultaneously — turning a transient blip into a full self-inflicted outage. There is also a startupProbe for slow-booting apps so liveness does not kill them before they finish starting.
Follow-ups they push on
- Why should a liveness probe usually NOT check downstream dependencies?
- When would you add a startupProbe?
Red flag Using a liveness probe that depends on a database or downstream service — a transient outage then triggers a restart storm across all pods, amplifying the incident instead of riding it out.
source: Kubernetes docs — Configure Liveness, Readiness and Startup Probes ↗
Commonly asked senior concept common How does a rolling update work in a Deployment, and how do you roll back a bad release?
When you change a Deployment's pod template, the Deployment controller creates a new ReplicaSet and shifts pods gradually: it scales the new ReplicaSet up and the old one down, governed by maxSurge (how many extra pods above desired during the update) and maxUnavailable (how many can be missing). With readiness probes in place, traffic only moves to new pods once they report ready, so there is no downtime.
Kubernetes keeps the old ReplicaSets around, so rollback is just kubectl rollout undo deployment/<name> — it scales the previous ReplicaSet back up. You watch progress with kubectl rollout status. Tune maxSurge/maxUnavailable to trade rollout speed against capacity headroom.
Follow-ups they push on
- What do maxSurge and maxUnavailable control?
- Why does a rolling update need readiness probes to be safe?
- How is a rolling update different from blue-green or canary?
Red flag Rolling out without readiness probes — Kubernetes considers a pod 'available' as soon as the container starts and sends it traffic before the app can actually serve, causing a wave of errors mid-rollout.
source: Kubernetes docs — Performing a Rolling Update ↗
Commonly asked senior debug common A pod is stuck in CrashLoopBackOff. Walk me through how you debug it.
CrashLoopBackOff means the container keeps starting and exiting, and Kubernetes is backing off between restarts. Work the evidence:
kubectl describe pod <pod> — read the Events and the last container state (exit code, OOMKilled, reason). kubectl logs <pod> --previous — the logs from the crashed instance (current logs may be empty because it just restarted).
Common causes: the app crashes on startup (bad config / missing env var / unreachable dependency — visible in logs); exit code 137 / OOMKilled means it exceeded its memory limit (raise the limit or fix the leak); a failing liveness probe restarting a healthy-but-slow app (add a startupProbe); or a bad image/command. Fix the root cause rather than just bumping restart limits.
Follow-ups they push on
- Why use `kubectl logs --previous` here?
- What does exit code 137 tell you?
Red flag Reading only `kubectl logs <pod>` (which shows the freshly restarted container, often empty) instead of `--previous`, and missing that an OOMKill or a too-aggressive liveness probe is the actual cause.
source: Kubernetes docs — Debug Running Pods ↗
Commonly asked senior concept common What is the difference between resource requests and limits, and how do they affect scheduling and stability?
A request is the amount of CPU/memory a container is guaranteed; the scheduler uses requests to decide which node a pod fits on. A limit is the hard ceiling the container may not exceed.
The behaviors differ by resource. Exceed a memory limit and the container is OOMKilled. Exceed a CPU limit and the container is throttled (slowed), not killed. If you set no requests, the scheduler packs pods blindly and nodes get oversubscribed; if requests are far below real usage, you overcommit and nodes thrash. The senior point is the QoS class: pods with requests == limits are Guaranteed and evicted last under node memory pressure; pods with no requests/limits are BestEffort and evicted first.
Follow-ups they push on
- What happens when a container exceeds its CPU limit vs its memory limit?
- How do requests and limits determine a pod's QoS class and eviction order?
Red flag Setting limits without requests (or omitting both) — the scheduler cannot reason about capacity, leading to oversubscribed nodes and BestEffort pods that are the first to be evicted under pressure.
source: Kubernetes docs — Resource Management for Pods and Containers ↗
Commonly asked senior concept occasional Walk me through what happens, end to end, when you run `kubectl apply -f deployment.yaml`.
kubectl sends the manifest to the API server, which authenticates, authorizes (RBAC), runs admission controllers, and persists the desired state to etcd. Nothing is running yet — you have only recorded intent.
Controllers then reconcile. The Deployment controller sees a new Deployment and creates a ReplicaSet; the ReplicaSet controller creates Pod objects to reach the desired replica count. The scheduler watches for unscheduled pods and binds each to a suitable node based on requests, affinity, and taints. On each chosen node, the kubelet sees a pod assigned to it, pulls the image, and starts the container via the container runtime, reporting status back to the API server.
The whole system is a declarative control loop: you state the desired state, and independent controllers continuously drive the actual state toward it.
Follow-ups they push on
- Which component decides which node a pod runs on?
- Why is this described as a reconciliation/control loop rather than imperative execution?
Red flag Describing it as imperative ('kubectl starts the container') — kubectl only records desired state; controllers and the kubelet asynchronously reconcile reality toward it.
source: Kubernetes docs — Kubernetes Components ↗
Commonly asked senior concept common How does the Horizontal Pod Autoscaler work, and why does it need resource requests set?
The HPA is a control loop (default every 15s) that scales a Deployment's replica count up or down to keep an observed metric near a target. The classic case: target 50% average CPU. It reads current per-pod usage from the metrics server and applies roughly desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric).
The catch interviewers probe: CPU/memory targets are expressed as a percentage of the pod's resource request. If you set no CPU request, there is no denominator, so the HPA cannot compute utilization and will not scale on CPU. So requests are a prerequisite, not optional.
Discuss the rest: HPA changes *replica count* (horizontal), distinct from the Vertical Pod Autoscaler which resizes a pod; it can scale on custom/external metrics (queue depth, RPS) not just CPU; and you add a stabilization window to prevent flapping (rapid scale up/down thrash) on noisy metrics.
What a strong answer covers
- HPA control loop adjusts replica count to keep a metric near target: ceil(replicas × current/target).
- CPU/memory targets are a percentage of the pod's request — no request means no denominator, no scaling.
- Horizontal (more pods) vs Vertical Pod Autoscaler (bigger pods) — different tools.
- Can scale on custom/external metrics (queue depth, RPS), not just CPU.
- A stabilization window prevents flapping on noisy/bursty metrics.
Follow-ups they push on
- Why does an HPA on CPU silently do nothing if you forgot to set CPU requests?
- When would you scale on a custom metric like queue length instead of CPU?
- How is HPA different from the cluster autoscaler?
Red flag Configuring an HPA on CPU but omitting CPU resource requests — utilization is computed relative to the request, so with no request the HPA has nothing to divide by and never scales.
source: Kubernetes docs — Horizontal Pod Autoscaling ↗

6.2.3 CI/CD 12

Commonly asked mid concept very common What is the difference between continuous integration, continuous delivery, and continuous deployment?
Continuous integration (CI): developers merge to a shared branch frequently, and every push automatically builds and runs the test suite, so integration problems surface in minutes, not at a big-bang merge.
Continuous delivery (CD): every change that passes CI is automatically built into a deployable, release-ready artifact and pushed through environments up to a staging gate — but the final push to production is a manual button.
Continuous deployment: the same pipeline, with the manual gate removed — every change that passes all automated checks goes straight to production, no human in the loop. The distinction people get wrong is delivery (human approves the prod release) vs deployment (fully automated to prod).
Follow-ups they push on
- Where exactly is the manual gate in delivery vs deployment?
- What must be true about your test suite to safely do continuous deployment?
Red flag Using 'continuous delivery' and 'continuous deployment' interchangeably — the difference is whether a human approves the production release.
source: GitHub docs — About continuous integration ↗
Commonly asked mid coding very common Write a basic GitHub Actions workflow that runs tests on every pull request. Explain the trigger, jobs, and steps.
A workflow is YAML in .github/workflows/. The top-level on sets the trigger, jobs are units that run on a runner, and each job has steps.
name: CI
on:
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npm test
on: pull_request triggers on every PR to main; the single test job runs on a fresh Ubuntu runner; steps check out the code, set up Node, install deps deterministically with npm ci, and run the suite. Jobs run in parallel by default; needs: makes one wait on another.
Follow-ups they push on
- How do you make a deploy job run only after the test job passes?
- Why `npm ci` instead of `npm install` in CI?
Red flag Forgetting `actions/checkout` (the runner starts empty, so the build has no source), or using `npm install` instead of `npm ci` so the lockfile is not respected and builds become non-reproducible.
source: GitHub docs — Writing workflows / quickstart ↗
Commonly asked mid concept common Explain the typical stages of a CI/CD pipeline: build, test, deploy. What runs where?
Build: compile/transpile, install dependencies, and produce a versioned, immutable artifact (a binary, a bundle, or — most commonly — a container image) pushed to a registry. The key principle is build once and promote that same artifact through every environment.
Test: run fast unit tests first (fail early), then integration tests, then optionally end-to-end tests, plus quality and security scans (lint, SAST, dependency/vulnerability scan). Order from cheapest/fastest to slowest so the pipeline fails fast.
Deploy: ship the already-built artifact to staging, run smoke tests, then promote to production with a rollout strategy (rolling/blue-green/canary) and health checks that can trigger automatic rollback. Building a fresh artifact per environment is the anti-pattern — you would no longer be testing what you ship.
Follow-ups they push on
- Why build the artifact once and promote it rather than rebuilding per environment?
- Why run unit tests before integration and e2e tests?
Red flag Rebuilding the artifact separately for staging and production — you then deploy something you never actually tested, defeating the point of the pipeline.
source: GitHub docs — About continuous deployment ↗
Commonly asked mid concept common Why and how do you cache dependencies in CI? What's the difference between caching and an artifact?
CI runners start clean every run, so without caching you re-download every dependency on each build — slow and wasteful. A dependency cache restores files like node_modules/~/.npm keyed on a hash of the lockfile (package-lock.json): a cache *hit* restores them in seconds; a cache *miss* (lockfile changed) rebuilds and saves a fresh cache. In GitHub Actions the setup-* actions can do this with one cache: line, or you use actions/cache directly.
The distinction interviewers want: a cache is a build-time optimization — it is keyed, can be evicted, and you must never *depend* on it existing (a miss must still produce a correct build). An artifact is an *output* you deliberately persist — the built binary/image/test report you pass between jobs or download later. Cache = speed, may vanish; artifact = a result you must keep.
Key the cache carefully: too broad and you serve stale deps; too narrow and you never hit it. Hashing the lockfile is the sweet spot.
What a strong answer covers
- Runners are ephemeral; caching avoids re-downloading deps every run.
- Key the cache on a lockfile hash — hit restores fast, miss rebuilds and re-saves.
- Cache = build-time speedup, evictable, must never be *required* for correctness.
- Artifact = a deliberate output you persist (binary/image/report) and pass between jobs.
- Bad cache keys cause stale dependencies (too broad) or constant misses (too narrow).
Quick self-check
What is the right cache key for a Node project's `node_modules` cache?
Follow-ups they push on
- Why must your build still succeed on a cache miss?
- What goes wrong if your cache key is the branch name instead of the lockfile hash?
- When would you use an artifact instead of a cache?
Red flag Treating a cache like an artifact and depending on it being present, or keying it too loosely so a stale `node_modules` is restored after the lockfile changed — leading to 'works in CI but with old deps' bugs.
source: GitHub docs — Caching dependencies to speed up workflows ↗
Commonly asked mid concept occasional How do you run the same CI job across multiple language versions or OSes efficiently?
Use a build matrix. Instead of copy-pasting a near-identical job per Node version or OS, you declare a matrix and CI fans out one job per combination automatically, running them in parallel. In GitHub Actions:
strategy:
matrix:
node: [18, 20, 22]
os: [ubuntu-latest, windows-latest]
That single job definition expands to 6 parallel jobs (3 versions × 2 OSes), each on its own runner. You can include/exclude specific combinations and set fail-fast (cancel the rest on first failure) on or off depending on whether you want full results.
The value is coverage without duplication: test the support matrix you promise users, catch a version-specific break early, and keep the workflow DRY. The tradeoff is runner minutes — a wide matrix multiplies cost, so test the combinations that matter, not every permutation.
What a strong answer covers
- A matrix fans one job definition out into one parallel job per combination.
- matrix: { node: [...], os: [...] } expands to the cross-product, each on its own runner.
- include/exclude tune specific combos; fail-fast controls cancel-on-first-failure.
- Gives coverage of your support matrix without duplicating job YAML.
- Cost grows with the cross-product — test combinations that matter, not every permutation.
Follow-ups they push on
- What does `fail-fast: false` change about a matrix run?
- How would you exclude one specific version/OS combination?
- What's the cost tradeoff of a very wide matrix?
Red flag Duplicating an entire job per version/OS instead of using a matrix — it's verbose, drifts out of sync, and you forget to update one copy; the matrix keeps all combinations defined in one place.
source: GitHub docs — Running variations of jobs in a workflow (matrix) ↗
Commonly asked senior concept occasional Why is a fast CI feedback loop so important, and how do you keep a pipeline fast as it grows?
The whole point of CI is fast feedback on whether a change is safe. A pipeline that takes 40 minutes breaks the developer's flow — they context-switch, stack up un-merged PRs, and start ignoring or working around the signal. Speed is what keeps CI trustworthy and keeps people integrating frequently.
Keep it fast as it grows: parallelize (split the test suite across runners / use a matrix), fail fast by ordering cheap checks first (lint and unit tests before slow e2e), cache dependencies and build outputs, and only run what changed for large monorepos (path filters / affected-project detection). Build the artifact once and promote it rather than rebuilding per stage.
The senior framing: treat pipeline duration as a product metric you budget and watch — when a stage gets slow, profile it like you would slow code. A flaky or slow pipeline is a tax on every single merge.
What a strong answer covers
- CI exists for fast feedback; a slow pipeline breaks flow and erodes trust in the signal.
- Parallelize test suites and use matrices to spread work across runners.
- Fail fast: cheap checks (lint, unit) before slow ones (integration, e2e).
- Cache deps/build outputs and only run what changed in big monorepos.
- Treat pipeline duration as a tracked metric — profile a slow stage like slow code.
Follow-ups they push on
- Why does ordering fast tests before slow ones matter even at the same total cost?
- How does 'only test what changed' work in a monorepo?
- What's the cost of letting a pipeline creep to 40 minutes?
Red flag Letting pipeline time creep unbounded — once feedback takes tens of minutes, developers batch changes and stop trusting CI, which defeats the purpose of continuous integration entirely.
source: GitHub docs — About continuous integration ↗
Commonly asked senior debug occasional A deploy to production succeeds but the app is broken; rolling back code didn't fix it. How do you reason about the failure and prevent it?
First separate the layers: a 'green' deploy only means the *pipeline* succeeded, not that the app *works*. If rolling back the code didn't fix it, the breakage is almost certainly not in the code artifact — look at the things that aren't versioned with the image: a database migration that already ran (and is irreversible), a changed config/feature flag, a new infra/secret value, or a dependency/external service.
The migration case is the classic trap: code rolls back instantly, but a schema change (dropped column, altered type) does not, so old code now hits an incompatible schema. The discipline is backward-compatible, expand-then-contract migrations — deploy schema changes that both old and new code can run against, ship code, then remove the old shape in a later release — so rolling back code is always safe.
Prevention: add post-deploy smoke tests/health checks that gate the rollout (so a broken deploy auto-rolls-back before users see it), decouple migrations from code deploys, use feature flags to separate 'deployed' from 'released', and ensure rollbacks are actually tested, not assumed.
What a strong answer covers
- A green pipeline ≠ a working app — 'success' is about the deploy, not behavior.
- If code rollback didn't help, the cause is unversioned state: migrations, config, flags, secrets, deps.
- Irreversible DB migrations are the classic trap — code reverts, schema doesn't.
- Fix with expand-then-contract backward-compatible migrations so rollback is always safe.
- Prevent with post-deploy smoke tests that gate/auto-rollback, plus feature flags to separate deploy from release.
Follow-ups they push on
- Why doesn't rolling back code fix a forward database migration?
- What does an expand-then-contract migration look like in practice?
- How do feature flags let you separate 'deployed' from 'released'?
Red flag Assuming a code rollback always restores a known-good state — irreversible schema migrations and out-of-band config changes aren't part of the artifact, so the rollback leaves old code running against a changed world.
source: GitHub docs — About continuous deployment ↗
Commonly asked senior concept occasional Why is trunk-based development paired with feature flags so common in CI/CD, and what problem does it solve over long-lived branches?
Long-lived feature branches drift away from main for days or weeks, so when they finally merge you get merge hell — big, painful, conflict-ridden integrations exactly when you can least afford surprises. That defeats the 'continuous' in continuous integration, whose whole premise is integrating *frequently* so problems surface in small, cheap increments.
Trunk-based development has everyone commit small changes to main (or very short-lived branches merged within a day), keeping the branch always releasable. The obvious tension: how do you merge unfinished work without shipping it? Feature flags — you merge the code behind an off-by-default flag, so it's integrated and tested continuously but invisible to users until you flip it on. This also decouples deploy from release: deploying code and exposing a feature become separate decisions, enabling canary/gradual rollouts and instant kill-switches.
Senior framing: small frequent merges + flags keep integration cheap and continuous and make release a runtime toggle rather than a deployment event — at the cost of flag hygiene (you must clean up stale flags).
What a strong answer covers
- Long-lived branches drift from main → painful big-bang merges that defeat continuous integration.
- Trunk-based: small frequent commits to main, kept always releasable.
- Feature flags let you merge unfinished work off-by-default — integrated and tested, not yet exposed.
- Flags decouple deploy from release: shipping code and turning a feature on are separate decisions.
- Enables canary/gradual rollout + instant kill-switch; cost is flag hygiene (remove stale flags).
Follow-ups they push on
- How do feature flags let you merge incomplete work to main safely?
- What does 'decoupling deploy from release' buy you operationally?
- What's the maintenance cost of feature flags over time?
Red flag Sitting on a long-lived branch 'until the feature is done' — it diverges from main and turns into a high-risk merge; the CI premise is to integrate small changes continuously, using flags to hide the unfinished parts.
source: GitHub docs — About continuous integration ↗
Commonly asked senior concept common How do you handle secrets (API keys, deploy credentials) in a CI/CD pipeline?
Never hardcode secrets in source, the workflow file, or build logs. Inject them at runtime from a secret store: GitHub Actions encrypted secrets / environments, or an external manager like HashiCorp Vault, AWS Secrets Manager, or a cloud key vault. The CI system makes them available as masked env vars so they do not print in logs.
Stronger still: prefer short-lived, scoped credentials over long-lived static keys — for cloud deploys, use OIDC so the workflow exchanges its identity token for temporary cloud credentials, eliminating stored long-lived keys entirely. Scope secrets to the environment that needs them and gate production secrets behind required reviewers. And remember a secret echoed into a log or committed to git is compromised forever — rotate it.
Follow-ups they push on
- Why is OIDC-based short-lived credential exchange better than a stored static cloud key?
- What do you do the moment a secret leaks into a build log?
Red flag Putting credentials in the repo or in plain workflow env, or echoing a secret in a debug step — once it lands in git history or a log it must be treated as permanently compromised and rotated.
source: GitHub docs — Using secrets in GitHub Actions ↗
Commonly asked senior concept common Compare blue-green and canary deployment strategies. When would you choose each?
Blue-green runs two full environments: blue (current) serves all traffic while green (new) is deployed and verified, then you flip traffic to green at once. Rollback is instant — flip back to blue. Cost: you run double the infrastructure during the cutover, and a bad release hits 100% of users the moment you switch.
Canary releases the new version to a small slice of traffic (say 5%), watches error rates and latency, then gradually ramps to 100%. It limits blast radius and catches problems with real traffic before everyone is exposed, but it is more complex (traffic splitting, automated metric analysis) and the rollout is slower.
Pick blue-green when you want a clean, instant, all-or-nothing switch and can afford duplicate capacity; pick canary when blast-radius control matters and you have the observability to judge a partial rollout.
Follow-ups they push on
- What does each strategy give you for rollback?
- What observability do you need to run a canary safely?
Red flag Calling a deployment a 'canary' when there is no automated metric analysis gating the ramp — without watching error/latency on the small slice, you have just slowed down a full rollout, not limited blast radius.
source: AWS — Blue/Green vs Canary deployment strategies ↗
Commonly asked senior debug common Your CI build passes locally but fails intermittently in the pipeline. How do you approach a flaky build?
Flakiness almost always comes from hidden non-determinism. Hunt the usual sources: tests that depend on execution order or shared mutable state; reliance on real time/timezone, random seeds, or wall-clock sleeps instead of waiting on a condition; tests hitting real networks/external services; and concurrency races. The 'works locally' clue points at environment differences — different dependency versions, missing lockfile pinning, or fewer CPUs on the runner exposing a race.
Approach: make it reproducible (run the suite repeatedly, randomize order, run in a clean container matching CI), then isolate the offending test and fix the root cause. Pin dependencies with a lockfile and npm ci, mock external calls, and replace sleeps with explicit waits. Blanket auto-retry hides flakes and erodes trust in the suite — fix, do not paper over.
Follow-ups they push on
- Why does 'passes locally' point you toward environment/ordering differences?
- Why is blindly retrying failed tests a bad long-term fix?
Red flag Slapping an automatic retry on the whole suite so red turns green — the underlying race or shared-state bug stays, and the team stops trusting CI failures.
source: GitHub docs — Continuous integration concepts ↗
Commonly asked senior concept common What is a deployment gate / required approval, and where do manual gates belong in a pipeline?
A gate is a condition that must pass before a stage proceeds — automated (tests green, security scan clean, smoke checks pass) or manual (a required human approval). In GitHub Actions you implement this with environments that have required reviewers and optionally a wait timer or branch restrictions; a job targeting that environment pauses until approved.
Where gates belong: automated quality gates everywhere (fail fast on tests/lint/scans), and a manual approval only at the boundary you actually want a human to own — typically the promotion to production. That manual prod gate is exactly the line between continuous *delivery* (human approves prod) and continuous *deployment* (no gate). You also gate to protect the production *secrets/credentials*, which are scoped to that environment and unlocked only after approval.
The senior framing: minimize manual gates (they create bottlenecks and false confidence) and lean on strong automated checks; reserve human approval for genuinely high-risk, irreversible promotions.
What a strong answer covers
- A gate blocks a stage until a condition passes — automated (tests/scans) or manual (approval).
- GitHub Actions: environments with required reviewers / wait timer pause a job until approved.
- Put automated gates everywhere (fail fast); reserve manual approval for the prod promotion.
- That manual prod gate is the line between continuous delivery and continuous deployment.
- Environment gates also protect prod secrets, unlocked only after the gate passes.
Follow-ups they push on
- How does a required-reviewer environment gate relate to delivery vs deployment?
- Why can too many manual gates be worse than fewer, stronger automated ones?
- How does gating an environment also protect production credentials?
Red flag Gating every stage with manual approvals 'to be safe' — it creates bottlenecks and rubber-stamp approvals; strong automated gates plus a single human gate at prod promotion is the better pattern.
source: GitHub docs — Using environments for deployment ↗

6.2.4 Infrastructure as Code (Terraform) 13

Commonly asked junior concept very common What is the difference between `terraform plan` and `terraform apply`?
plan is a dry run: Terraform refreshes state, compares your desired configuration against the current state, and prints the exact set of actions it would take — what gets created, updated in place, replaced (destroy+create), or destroyed — without changing anything. It is your review-before-you-touch-prod safety check, and you can save it to a file.
apply executes those changes against the real providers and then writes the new state. If you pass a saved plan file, apply runs exactly that plan with no surprises; without one, apply shows the plan again and asks for confirmation. The senior habit is to always read the plan output (especially anything marked for replacement/destruction) before approving an apply.
Follow-ups they push on
- What does it mean when a plan shows a resource will be replaced rather than updated in place?
- Why apply a saved plan file in automation?
Red flag Running `apply -auto-approve` in CI without reviewing the plan — you can silently destroy and recreate a stateful resource (like a database) that a config change forced to be replaced.
source: Terraform docs — terraform plan / apply ↗
Commonly asked junior concept common What is the difference between a Terraform provider and a resource?
A provider is a plugin that teaches Terraform how to talk to a specific platform's API — aws, google, azurerm, cloudflare, kubernetes. You configure it once (region, credentials), and it exposes the set of resource and data-source types for that platform.
A resource is a single managed object you declare — resource "aws_s3_bucket" "assets" { ... } describes one bucket. The provider knows how to create, read, update, and delete that resource type via the platform's API. So: the provider is the integration layer; resources are the things you actually provision through it. A data source is the read-only sibling — it looks up existing infrastructure without managing it.
Follow-ups they push on
- How is a data source different from a resource?
- Can one Terraform config use multiple providers at once?
Red flag Confusing a resource with a data source — a resource is created and managed by Terraform; a data source only reads existing infrastructure and never creates anything.
source: Terraform docs — Providers ↗
Commonly asked mid concept very common What is the Terraform state file, and why does it matter so much?
State is Terraform's record (terraform.tfstate, JSON) mapping each resource in your config to the real-world object it created — IDs, attributes, and metadata. Terraform needs it to know what it already manages, so on the next plan it can diff your desired config against reality and compute the minimal set of changes.
Without state, Terraform could not tell the difference between 'create a new resource' and 'this resource already exists, just update it', and it would have no way to know what to destroy. State also caches attribute values and tracks dependencies. Because it can contain sensitive values (passwords, keys) in plaintext, it must be protected — which leads straight into remote state.
Follow-ups they push on
- Why can't Terraform just query the cloud provider instead of keeping state?
- Why is committing tfstate to a git repo dangerous?
Red flag Treating state as a disposable cache or committing it to git — it can hold secrets in plaintext, and a lost/corrupt state file orphans real infrastructure that Terraform no longer recognizes.
source: Terraform docs — State ↗
Commonly asked mid concept occasional What are input variables, outputs, and locals in Terraform, and how do they differ?
They're the three ways data flows through a config. Input variables (variable) are the parameters a module accepts from its caller — the public 'function arguments' (region, instance size), set via .tfvars, CLI flags, or env vars, and typed/validated. Outputs (output) are the values a module exposes back to its caller or the CLI — the 'return values' (a created VPC's ID, a load balancer's DNS name) that other modules consume. Locals (locals) are named intermediate expressions used *inside* a config to avoid repetition — computed once, referenced as local.name, never settable from outside.
The mental model: variables are inputs (caller → module), outputs are results (module → caller), locals are private helpers (internal only). This is exactly what makes a module a clean interface: callers only touch its variables and outputs, never its internals.
A practical note: mark sensitive variables/outputs sensitive = true so Terraform redacts them in plan/apply logs.
What a strong answer covers
- Variables: a module's input parameters (caller → module), typed and validatable.
- Outputs: values a module returns (module → caller / CLI), consumed by other modules.
- Locals: private named expressions, computed once, used internally to avoid repetition.
- Together, variables + outputs form a module's clean public interface; locals stay internal.
- Use sensitive = true to redact secret variables/outputs from logs.
Follow-ups they push on
- Why can't a local be set from outside the module?
- How does one module consume another module's output?
- When would you mark a variable or output `sensitive`?
Red flag Confusing locals with variables — a local is a computed internal helper that callers can't override, while a variable is the external input; using a local where you needed a configurable input makes the module non-parameterizable.
source: Terraform docs — Variables and outputs ↗
Commonly asked mid concept occasional How does Terraform decide the order to create resources? What are implicit vs explicit dependencies?
Terraform builds a dependency graph from your config and creates/updates/destroys resources in the order that graph implies, parallelizing wherever there's no dependency between resources. You rarely specify order yourself.
Implicit dependencies are inferred from references: if a security group rule uses aws_vpc.main.id, Terraform knows the VPC must exist first, because the rule reads an attribute of the VPC. This is the idiomatic, preferred way — wire resources together by referencing each other's attributes and the ordering falls out automatically (and correctly, including on destroy, which runs in reverse).
Explicit dependencies use depends_on to force an ordering Terraform can't infer — typically when there's a *hidden* relationship not expressed through a reference (e.g. an app needs an IAM policy attached before it runs, but doesn't reference the attachment's attributes). Use depends_on sparingly; over-using it usually means you should have referenced the attribute instead.
What a strong answer covers
- Terraform builds a dependency graph and parallelizes independent resources automatically.
- Implicit deps: inferred from attribute references (aws_vpc.main.id) — the idiomatic way.
- Referencing attributes gets ordering right for create *and* destroy (reverse order) for free.
- Explicit deps (depends_on): force an order for a hidden relationship not expressed by a reference.
- Use depends_on sparingly — usually a missing attribute reference is the real fix.
Follow-ups they push on
- Why is an implicit dependency via attribute reference preferred over `depends_on`?
- Give an example where `depends_on` is genuinely necessary.
- How does the graph handle destroy ordering?
Red flag Sprinkling `depends_on` everywhere to 'be safe' — it serializes resources that could run in parallel and hides the real relationships; reference the attribute you depend on and let Terraform infer the order.
source: Terraform docs — Resource dependencies ↗
Commonly asked mid concept common What are Terraform modules and why do you use them?
A module is a reusable, parameterized bundle of Terraform resources — a directory with input variables, resources, and outputs. Instead of copy-pasting the same 200 lines to stand up a VPC or a service in dev, staging, and prod, you write it once as a module and call it three times with different inputs.
The payoff is DRY infrastructure, consistency (every environment provisions the same way), and an interface boundary: callers only deal with the module's variables and outputs, not its internals. Every Terraform config has an implicit root module; you compose it from child modules (your own, or versioned modules from the registry). The trap is over-abstracting too early — wrap something in a module once you actually have repetition, not speculatively.
Follow-ups they push on
- How do you pass data in and out of a module?
- How do you pin a module to a specific version and why?
Red flag Over-modularizing on day one — wrapping a single-use resource in a deeply nested module hierarchy adds indirection without the reuse that justifies it.
source: Terraform docs — Modules ↗
Commonly asked mid concept common Why is Infrastructure as Code better than clicking through a cloud console, and what is the difference between declarative and imperative IaC?
IaC makes infrastructure versioned, reviewable, and reproducible. Config lives in git, so changes go through pull requests and code review, you have an audit trail, you can roll back, and you can stand up an identical environment on demand instead of relying on someone remembering which buttons they clicked. It eliminates configuration drift and snowflake servers.
Declarative vs imperative: declarative (Terraform) means you describe the desired end state and the tool figures out the steps and the diff to get there — apply it twice and nothing extra happens (idempotent). Imperative (a shell/SDK script) means you spell out the steps to take, and re-running can double-create or fail because it does not reason about current state. Terraform is declarative, which is why plan can show you precisely what will change before anything happens.
Follow-ups they push on
- Why does declarative IaC give you idempotency for free?
- How does putting infra in git change your change-management process?
Red flag Describing Terraform as a script that 'runs commands to build infra' — that is the imperative mental model; Terraform reconciles toward a declared end state and is idempotent.
source: Terraform docs — What is Terraform / intro ↗
Commonly asked senior concept occasional What is the difference between `count` and `for_each` for creating multiple resources, and why does it matter for state?
Both create multiple instances of a resource, but they key the instances differently in state, and that's the whole game. count produces a list indexed by integer position — resource[0], resource[1]. for_each produces a map keyed by a stable string — resource["web"], resource["db"].
The trap with count: because instances are positional, removing an item from the middle of the list shifts every later index, so Terraform thinks those resources changed identity and proposes to destroy-and-recreate them. With for_each, each instance is bound to its own key, so deleting one only affects that one — the rest stay put.
Guidance: use count for N identical, order-independent copies (or a simple on/off toggle, count = var.enabled ? 1 : 0); use for_each whenever you iterate over a set/map of distinct things (named buckets, subnets per AZ) so that adding or removing one doesn't churn the others.
What a strong answer covers
- count → list indexed by integer position; for_each → map keyed by a stable string.
- Removing a middle count element shifts later indices, forcing destroy/recreate of unrelated resources.
- for_each binds each instance to its key, so add/remove touches only that instance.
- Use count for N identical copies or an on/off toggle (count = enabled ? 1 : 0).
- Use for_each for a set/map of distinct named things (buckets, subnets per AZ).
Quick self-check
You manage 5 distinct named S3 buckets and sometimes remove one from the middle. Which is safer?
Follow-ups they push on
- Why does deleting the first of three `count` resources recreate the other two?
- When is `count` still the right choice over `for_each`?
- How do you reference a specific instance under each approach?
Red flag Using `count` over a list of distinct named resources — removing or reordering an element shifts every later index, so Terraform destroys and recreates resources you never intended to touch; `for_each` keyed by name avoids the churn.
source: Terraform docs — The for_each meta-argument ↗
Commonly asked senior trick occasional Why is `terraform destroy` (or an accidental resource replacement) so dangerous, and how do you guard against it?
Terraform faithfully executes the declared end state — including deletion. The danger is that a config change can force a replace (destroy + create) of a resource you assumed would update in place: changing an attribute marked 'ForceNew' (an EC2 instance's AMI, a database's engine, a subnet) makes Terraform plan to destroy the old object and create a new one. On a stateful resource like a production database, that's data loss executed by a routine-looking apply.
Guards, layered: (1) read the plan — anything showing -/+ destroy and then create or # forces replacement is a red flag, never -auto-approve blindly. (2) Add lifecycle { prevent_destroy = true } on critical resources so Terraform errors out rather than destroying them. (3) Use create_before_destroy where a replacement is acceptable but downtime isn't. (4) Take backups / enable deletion protection on the cloud side as a last line. (5) For stateful data stores, often manage them outside the same Terraform lifecycle as ephemeral compute.
The trick being tested: knowing that 'update' can silently mean 'replace', and that the plan output is your safety check.
What a strong answer covers
- A config change to a ForceNew attribute makes Terraform destroy + recreate — potential data loss.
- The plan shows it as -/+ / # forces replacement — that's your red flag to stop.
- lifecycle { prevent_destroy = true } makes Terraform refuse to destroy critical resources.
- create_before_destroy avoids downtime when a replace is genuinely acceptable.
- Layer cloud-side deletion protection / backups; manage stateful stores apart from ephemeral compute.
Follow-ups they push on
- How do you tell from a plan that a resource will be replaced rather than updated in place?
- What does `prevent_destroy` actually do when a destroy is attempted?
- Why separate a production database's lifecycle from your app's Terraform?
Red flag Approving a plan without noticing a `# forces replacement` on a stateful resource — Terraform will dutifully destroy the production database and create a fresh empty one, and `apply` doesn't ask 'are you sure this is a DB?'.
source: Terraform docs — The lifecycle meta-argument ↗
Commonly asked senior concept very common What is remote state and state locking, and what problem do they solve on a team?
Local state lives on one engineer's laptop — useless for a team and easy to lose. Remote state stores the state file in a shared backend (S3, Azure Blob, GCS, Terraform Cloud) so everyone reads and writes the same source of truth, and sensitive state is not scattered across machines.
State locking prevents two people from running apply against the same state at the same time. Backends acquire a lock (e.g. S3 with a DynamoDB lock table, or native locking in Terraform Cloud) for the duration of the operation; a second concurrent apply is blocked until the lock releases. Without locking, two simultaneous applies interleave writes and corrupt the state file, leaving Terraform's view inconsistent with reality.
Follow-ups they push on
- What corrupts the state if two engineers apply at the same time without a lock?
- How do you implement locking with an S3 backend?
Red flag Using a shared remote backend without locking — concurrent applies race on the state file and corrupt it, after which plans no longer match reality.
source: Terraform docs — Backends and remote state ↗
Commonly asked senior concept common What is configuration drift, and how do you detect and reconcile it in Terraform?
Drift is when the real infrastructure no longer matches what Terraform's state/config says — typically because someone made a change by hand in the cloud console ('ClickOps') outside Terraform.
Detection: terraform plan refreshes state against the provider and shows the divergence as changes it wants to make; a plan that proposes changes you did not author is drift. Reconcile in one of two directions: bring the real resource back in line by re-applying your config, or, if the manual change is desirable, update the Terraform config to match (and apply). For resources created outside Terraform, terraform import brings them under management.
The durable fix is process: make Terraform the single source of truth, restrict console write access, and run plan in CI on a schedule to catch drift early.
Follow-ups they push on
- How does a scheduled `plan` in CI help you catch drift?
- When would you update the config to match reality instead of reverting reality?
Red flag Letting people make changes in the cloud console alongside Terraform — the next apply silently reverts their manual fix (or vice versa), and the two views of reality keep fighting.
source: Terraform docs — Manage resource drift ↗
Commonly asked senior concept common How do you bring an existing, manually-created cloud resource under Terraform management?
You import it — Terraform's state knows nothing about resources it didn't create, so you have to tell it. The two-part move: (1) write a matching resource block in your config for the existing object, then (2) bring it into state, either with the CLI terraform import <resource_address> <real_id> or, in modern Terraform, an import block that does it as part of plan/apply (and can even generate config).
The critical detail interviewers probe: importing only updates state, it does not write your configuration. If your hand-written resource block doesn't match the real object's settings, the very next plan will propose changes to 'fix' the real resource back to your (incomplete) config. So after importing you run plan and iterate on the config until the plan is clean (no changes) — that confirms config, state, and reality all agree.
This is also how you remediate drift / ClickOps: adopt the orphaned resource instead of destroying and recreating it.
What a strong answer covers
- Terraform ignores anything it didn't create — you must import existing resources into state.
- Two steps: write a matching resource block, then terraform import (or an import {} block).
- Import updates state only — it does not generate or fix your config.
- Iterate until plan shows no changes, proving config + state + reality agree.
- It's the safe way to adopt ClickOps/orphaned resources without destroy-and-recreate.
Quick self-check
After `terraform import` of an existing bucket, the next `plan` wants to modify it. Why?
Follow-ups they push on
- Why does a fresh import often produce a plan that wants to change the resource?
- What's the difference between the CLI `import` command and an `import` block?
- How does import help you fix drift without recreating infrastructure?
Red flag Running `terraform import` and assuming you're done — import only writes state, not config, so a mismatched resource block makes the next apply try to 'correct' the real resource; you must get a clean plan first.
source: Terraform docs — Import existing resources ↗
Commonly asked senior concept common How do you manage multiple environments (dev / staging / prod) in Terraform, and why are workspaces often the wrong tool?
The common patterns: separate state per environment with a shared module. You write the infrastructure once as a module, then have a thin per-environment root config (environments/prod, environments/staging) that calls the module with different variables (instance sizes, counts) and, crucially, its own backend/state file. This isolates blast radius — a bad apply in staging can't touch prod's state.
Terraform workspaces let one config switch between multiple state files (default, dev, prod) without copying code. They're tempting for environments but are usually the wrong fit: they share the same backend and code, it's easy to run apply against the wrong workspace by accident (no separate credentials/approval boundary), and they don't capture genuinely different configs well. They're better suited to short-lived, near-identical parallel copies (e.g. per-feature-branch ephemeral envs).
Senior answer: isolate prod with its own state, backend, and credentials; use modules for DRY; reserve workspaces for ephemeral, structurally-identical environments.
What a strong answer covers
- Default pattern: one shared module + thin per-env root configs with separate state/backends.
- Separate state per env isolates blast radius — staging mistakes can't corrupt prod.
- Workspaces swap state files on one config/backend — convenient but no real isolation boundary.
- Workspace risk: applying to the wrong environment with no separate credentials/approval.
- Use workspaces for ephemeral, identical envs; use separate state+backend for dev/staging/prod.
Follow-ups they push on
- Why does sharing a backend across environments via workspaces increase risk?
- How do modules keep multi-environment configs DRY?
- When are workspaces genuinely the right tool?
Red flag Using a single workspace-switched config for prod and staging — one fat-fingered `terraform workspace select` and an `apply` hits the wrong environment, with no separate backend or credential boundary to stop it.
source: Terraform docs — Workspaces ↗

6.2.5 Cloud fundamentals 12

Commonly asked mid concept very common What is the difference between a region and an availability zone, and how do you use them for high availability?
A region is a geographic area (e.g. us-east-1). Inside each region are multiple availability zones (AZs) — physically separate data centers with independent power, cooling, and networking, connected by high-bandwidth, low-latency links (single-digit ms).
For high availability, spread your workload across multiple AZs in a region: if one AZ loses power, the others keep serving, and a load balancer routes around the failed zone. That protects against a data-center-level failure with negligible latency cost. Going multi-region adds protection against a whole-region outage and lets you serve users closer to them, but it is far more complex (cross-region replication, data consistency, higher latency between regions). The pragmatic default is multi-AZ within one region; reach for multi-region when you genuinely need regional fault tolerance or global low latency.
Follow-ups they push on
- Why is multi-AZ the common HA default rather than multi-region?
- What new problems does going multi-region introduce?
Red flag Confusing the two, or running everything in a single AZ and calling it 'in the cloud so it's highly available' — one AZ failure then takes the whole service down.
source: AWS — Regions and Availability Zones ↗
Commonly asked mid concept common Walk me through the core cloud compute, storage, and networking primitives and when you'd reach for each.
Compute: VMs (EC2-style — full control, you manage the OS), containers (ECS/EKS — packaged apps, orchestrated), and serverless functions (Lambda — event-driven, no servers to manage, scales to zero). Move up that ladder as you want less operational overhead and more elasticity.
Storage: object storage (S3 — cheap, durable, infinite-scale blobs: images, backups, static assets), block storage (EBS — a virtual disk attached to one VM, for databases/filesystems), and file storage (EFS/NFS — a shared filesystem across many machines). Match the access pattern: blobs over HTTP -> object; a disk for one instance -> block; shared POSIX filesystem -> file.
Networking: a VPC is your isolated private network; subnets segment it (public vs private); security groups are instance-level firewalls; and a load balancer spreads traffic across instances. The skill is mapping a workload to the cheapest primitive that fits its access and durability needs.
Follow-ups they push on
- When would you pick object storage over block storage?
- When does serverless make sense vs a long-running container?
Red flag Reaching for a full VM you have to patch and babysit when a managed/serverless option fits, or using a database on object storage (wrong access pattern) instead of block storage.
source: AWS — Types of cloud computing / core services ↗
Commonly asked mid concept common What is the cloud shared responsibility model, and why does it matter?
Security is split between the provider and you. The provider is responsible for security OF the cloud — the physical data centers, hardware, the hypervisor, and the managed-service infrastructure. You are responsible for security IN the cloud — your data, IAM users and permissions, network config (security groups, public/private subnets), OS patching on VMs you run, and application-level security.
The line shifts with the service tier: with a raw VM you patch the OS; with a managed database the provider patches it but you still own access control and your data; with serverless even more moves to the provider, but IAM and data are always yours. It matters because most cloud breaches are customer-side misconfigurations — a public S3 bucket or an over-permissive IAM policy — not the provider being hacked.
Follow-ups they push on
- How does the responsibility line move between a self-managed VM and a managed service?
- Whose fault is a publicly exposed storage bucket under this model?
Red flag Assuming 'the cloud provider handles security' end to end — IAM, data, and network configuration are always the customer's responsibility, and that is where most breaches actually happen.
source: AWS — Shared Responsibility Model ↗
Commonly asked mid concept common What is the difference between vertical and horizontal scaling in the cloud, and which does the cloud make easy?
Vertical scaling (scale up) means giving one instance more resources — a bigger CPU/RAM tier. It is simple and needs no app changes, but you hit a hardware ceiling, usually need a restart/downtime to resize, and the single box is still a single point of failure.
Horizontal scaling (scale out) means adding more instances behind a load balancer. It scales effectively without limit and improves availability (lose one node, the rest serve), which is exactly what cloud auto-scaling groups automate — add instances when load rises, remove them when it falls. The catch is the app must be stateless (or externalize session state to a shared store like Redis) so any instance can handle any request. The cloud's elasticity is built around horizontal scaling; that is why 'make services stateless' is such a load-bearing design rule.
Follow-ups they push on
- Why does horizontal scaling require stateless services?
- What does an auto-scaling group buy you over manually resizing an instance?
Red flag Trying to scale a stateful, session-on-the-box service horizontally — requests landing on a different instance lose the session, so you are forced back into sticky sessions or a single big vertical box.
source: AWS — Auto Scaling / scaling concepts ↗
Commonly asked mid concept very common What is the difference between authentication and authorization in cloud IAM, and how do roles fit in?
Authentication answers 'who are you?' — proving identity (a user signing in, a service presenting credentials or a token). Authorization answers 'what are you allowed to do?' — evaluating policies to decide whether that proven identity may perform an action on a resource. Authn comes first; authz comes after. They're distinct: a correctly authenticated user can still be denied an action.
In cloud IAM, policies are the authorization rules (allow/deny on actions + resources), attached to identities. An IAM role is an identity with policies but no permanent credentials — instead, a trusted principal (an EC2 instance, a Lambda, another account, a federated user) assumes the role and receives temporary, auto-rotating credentials. That's why roles are the best-practice way to grant permissions to services: no long-lived access keys to leak.
So: authn = identity, authz = permissions (policies), and roles = a way to hand out scoped, temporary permissions to whoever/whatever assumes them.
What a strong answer covers
- Authentication = prove who you are; authorization = what you're allowed to do (policies).
- Authn happens first; an authenticated identity can still be denied by authorization.
- Policies encode authorization (allow/deny on actions + resources).
- An IAM role has no permanent credentials — principals assume it for temporary ones.
- Roles are best practice for services (EC2/Lambda): no long-lived keys to leak.
Quick self-check
An EC2 instance needs to read one S3 bucket. The best-practice way to grant this is:
Follow-ups they push on
- Why are IAM roles with temporary credentials safer than static access keys for a service?
- Can an authenticated identity ever be denied? Why?
- What does it mean for a principal to 'assume' a role?
Red flag Conflating authentication with authorization — proving identity (authn) does not grant any permission; access is still decided by the policies evaluated at the authorization step.
source: AWS — IAM identities (roles) / how IAM works ↗
Commonly asked mid concept common What is object storage (like S3), and why is it not a filesystem or a database?
Object storage stores data as objects — a blob of bytes plus metadata and a unique key — in a flat namespace (a bucket), accessed over HTTP APIs (GET/PUT), not a mounted disk. It's built for massive scale, very high durability (S3 famously targets eleven 9s by replicating across devices/AZs), and cheap capacity. Ideal for images, video, backups, logs, static website assets, and data-lake files.
Why it's not a filesystem: there are no real directories (the '/' in a key is cosmetic — it's a flat key space), you can't do partial in-place edits efficiently (you generally replace the whole object), and there's no POSIX file locking or low-latency random byte access like a block device. Why it's not a database: no transactions, no rich queries/joins, no secondary indexes — it's a key→blob store, not a query engine.
The skill is matching the access pattern: whole-blob read/write over HTTP, write-once-read-many, durability over mutability → object storage. Mutable structured records you query → a database. A disk for an OS/DB → block storage.
What a strong answer covers
- Objects = blob + metadata + key in a flat bucket namespace, accessed via HTTP APIs.
- Built for scale, extreme durability (S3 ~11 nines), and low cost — images, backups, logs, assets.
- Not a filesystem: no real directories, no efficient partial edits, no POSIX locking/random access.
- Not a database: no transactions, joins, or queries — it's key→blob.
- Match access pattern: whole-blob, write-once-read-many → object storage.
Quick self-check
Which workload is the BEST fit for object storage like S3?
Follow-ups they push on
- Why is the '/' in an S3 key not a real directory?
- When would block storage be the right choice over object storage?
- What makes object storage so durable?
Red flag Using object storage as a database or a mutable filesystem — there are no transactions/queries and no efficient in-place edits, so a workload needing those will be slow, awkward, or incorrect.
source: AWS — What is object storage? (S3) ↗
Commonly asked mid concept common Compare the IaaS, PaaS, and SaaS service models. Who manages what at each level?
It's a ladder of how much the provider manages vs you. IaaS (raw VMs, networking, storage — EC2) gives you the infrastructure; you still manage the OS, runtime, and app. Most control, most operational burden. PaaS (App Engine, Heroku, managed databases) hands you a platform — you push code and the provider runs the OS, runtime, scaling, and patching; you manage only your app and data. SaaS (Gmail, Salesforce) is finished software you just use; the provider manages essentially everything, you manage only your data and configuration.
The through-line is the shared responsibility line moving up as you go IaaS → PaaS → SaaS: you trade control and flexibility for less operational work. (Serverless/FaaS sits near PaaS — even the runtime instance is abstracted, scaling to zero.)
The senior framing: pick the highest level that still meets your control/customization needs, so you don't waste engineering effort managing layers a provider would handle for free.
What a strong answer covers
- IaaS (EC2): provider runs hardware/virtualization; you run OS, runtime, app — most control.
- PaaS (App Engine, managed DBs): push code; provider runs OS/runtime/scaling/patching.
- SaaS (Gmail, Salesforce): finished software; you manage only your data and config.
- The responsibility line moves up IaaS → PaaS → SaaS: less control, less ops burden.
- Pick the highest level that still meets your control needs to minimize wasted ops effort.
Quick self-check
On a managed PaaS, which layer are YOU still responsible for?
Follow-ups they push on
- Where does serverless / FaaS sit on this ladder?
- What do you give up moving from IaaS to PaaS?
- How does this map onto the shared responsibility model?
Red flag Defaulting to IaaS and hand-managing OS/runtime/scaling when a PaaS would handle it — you pay in engineering time for control you don't actually need.
source: AWS — Types of cloud computing (IaaS/PaaS/SaaS) ↗
Commonly asked senior concept common How do you control and reason about cloud cost? What's the difference between on-demand, reserved, and spot pricing?
Cloud's elasticity cuts both ways: pay-per-use is great until idle or oversized resources quietly bleed money. The compute pricing tiers trade flexibility for cost: on-demand is full price, no commitment — for spiky or unpredictable workloads; reserved instances / savings plans commit to 1–3 years for a big discount — for steady, predictable baseline load; spot uses spare capacity at up to ~90% off but can be reclaimed with little notice — for fault-tolerant, interruptible work (batch jobs, CI, stateless workers that can be killed and rescheduled).
The broader cost levers: right-size (most instances are over-provisioned), auto-scale so you pay for what you use and scale to zero where possible (serverless), watch egress/data-transfer (a sneaky cost), set lifecycle policies to tier cold data to cheaper storage, and tag resources so you can attribute spend. Set budgets and alerts so surprises page you, not finance.
Senior framing: match the pricing model to the workload's tolerance for interruption and predictability — steady baseline on reserved, bursts on on-demand, interruptible bulk on spot.
What a strong answer covers
- On-demand: full price, no commitment — spiky/unpredictable workloads.
- Reserved / savings plans: 1–3yr commit for big discount — steady baseline load.
- Spot: up to ~90% off spare capacity but reclaimable anytime — fault-tolerant, interruptible work.
- Levers: right-size, auto-scale/scale-to-zero, watch egress, tier cold data, tag for attribution.
- Set budgets + alerts so cost surprises page engineers early.
Follow-ups they push on
- What kind of workload is safe to run on spot instances, and what isn't?
- Why is data egress an easy cost to overlook?
- How does auto-scaling change your cost profile vs a fixed fleet?
Red flag Running interruptible bulk work on full-price on-demand (or worse, putting a stateful production service on spot) — the first wastes ~90% of the spend, the second gets reclaimed out from under you with little warning.
source: AWS — EC2 instance purchasing options (on-demand/reserved/spot) ↗
Commonly asked senior concept occasional What does it mean for an architecture to be 'cloud-native', and why design for failure?
Cloud-native means building for the cloud's actual characteristics rather than lifting a fixed on-prem server into a VM. Core ideas: treat servers as cattle, not pets (instances are disposable and replaceable, not hand-tended); make services stateless so they scale horizontally and any instance can handle any request; externalize state to managed stores; automate provisioning with IaC; and design for failure — assume any instance, AZ, or dependency can die at any moment.
Why design for failure: at cloud scale, hardware *will* fail constantly — it's a statistical certainty, not an edge case. So you build in redundancy (multi-AZ), health checks and auto-replacement (a dead instance is terminated and a new one launched automatically), retries with backoff and circuit breakers for flaky dependencies, and graceful degradation. The famous expression of this is Netflix's Chaos Monkey, which kills production instances on purpose to prove the system survives.
Senior framing: the cloud doesn't give you reliability for free — it gives you the *primitives* (multiple AZs, auto-scaling, managed failover) and you must architect to use them.
What a strong answer covers
- Cloud-native = build for the cloud's traits, not a lifted-and-shifted pet server.
- Cattle not pets: instances are disposable, replaced automatically, never hand-tended.
- Stateless services + externalized state enable horizontal scaling and easy replacement.
- Design for failure: at scale hardware *will* fail — redundancy, health checks, retries, circuit breakers.
- The cloud gives primitives (multi-AZ, auto-scale, failover); you must architect to use them.
Follow-ups they push on
- What does 'cattle not pets' mean for how you operate servers?
- Why is statelessness a prerequisite for treating instances as disposable?
- What is a circuit breaker protecting you from?
Red flag Lifting an on-prem 'pet' server into a single cloud VM and calling it cloud-native — without statelessness, redundancy, and automated replacement, you've just moved a single point of failure into someone else's data center.
source: AWS — Reliability pillar (Well-Architected Framework) ↗
Commonly asked senior debug occasional An EC2 instance in a private subnet can't reach the internet to pull package updates. How do you diagnose and fix it?
A private subnet by definition has no route to an internet gateway, so instances there can't make outbound internet calls directly — that's the intended design, not a bug. The fix for *outbound-only* access is a NAT gateway: place it in a public subnet, and add a route in the private subnet's route table sending 0.0.0.0/0 to the NAT gateway. The NAT allows egress (and the return traffic for connections it initiated) but blocks unsolicited inbound — so the instance can pull updates while staying unreachable from the internet.
Work the diagnosis like a checklist down the path: (1) the private subnet's route table — is there a 0.0.0.0/0 → nat-... route? (2) the NAT gateway itself — is it in a *public* subnet that routes to an internet gateway? (3) security group outbound rules — egress allowed? (4) NACL — does the subnet's stateless ACL allow both the outbound request and the inbound return traffic? (5) DNS resolution working?
The senior tell: knowing that a NAT gateway (not an internet gateway) is the correct egress mechanism for private subnets, and checking the stateless NACL return-traffic rule that bites people.
What a strong answer covers
- Private subnet = no internet-gateway route by design; direct outbound fails as intended.
- Fix outbound-only access with a NAT gateway in a public subnet + a 0.0.0.0/0 → NAT private route.
- NAT allows egress + return traffic but blocks unsolicited inbound — instance stays private.
- Diagnose down the path: route table → NAT placement → SG egress → NACL (return traffic!) → DNS.
- Stateless NACLs must explicitly allow the inbound return traffic, a common silent culprit.
Follow-ups they push on
- Why a NAT gateway rather than an internet gateway for a private-subnet instance?
- Why must the NAT gateway itself live in a public subnet?
- Which stateless rule on a NACL commonly breaks return traffic?
Red flag Attaching an internet gateway route to the private subnet 'to fix it' — that makes the subnet public and the instance internet-reachable, defeating the security design; the correct egress path is a NAT gateway.
source: AWS — NAT gateways ↗
Commonly asked senior concept very common Explain the principle of least privilege in cloud IAM, with a concrete example.
Least privilege means every identity (user, role, service) gets exactly the permissions it needs to do its job and nothing more. The smaller the granted permission set, the smaller the blast radius if those credentials leak or the service is compromised.
Concrete example: a Lambda that only reads from one bucket should have a policy granting s3:GetObject scoped to that specific bucket's ARN — not s3:* on *. Wildcards like Action: * / Resource: * are the classic violation. In practice: prefer roles with temporary credentials over long-lived access keys, scope policies to specific actions and resource ARNs, start from deny and add only what is needed, and review/trim permissions over time. Pair it with separation of duties so no single role can both deploy and exfiltrate.
Follow-ups they push on
- Why prefer IAM roles with temporary credentials over static access keys?
- How do you discover and trim over-broad permissions after the fact?
Red flag Granting broad wildcard policies (`s3:*` on `*`) 'to get it working' and never tightening them — one leaked key then has the run of the whole account.
source: AWS — IAM security best practices (least privilege) ↗
Commonly asked senior concept occasional Why might a company choose managed cloud services over self-hosting, and what are the tradeoffs?
Managed services (RDS instead of running your own Postgres, EKS instead of bootstrapping Kubernetes) shift operational burden to the provider: patching, backups, failover, scaling, and HA come built in, so a small team ships faster and pages less. You trade money and some control for time and reliability.
The tradeoffs: higher direct cost, less control over versions/tuning/internals, and vendor lock-in (managed offerings differ across clouds, raising switching cost). Self-hosting gives maximum control and can be cheaper at very large, steady scale, but you now own the on-call, the upgrades, and the failure modes. The senior answer weighs team size, scale, and how differentiating the capability is: do not burn your scarce engineers running undifferentiated infrastructure a managed service handles well.
Follow-ups they push on
- How does vendor lock-in factor into choosing a managed service?
- At what scale might self-hosting actually become the cheaper choice?
Red flag Defaulting to self-hosting core infrastructure 'to save money' on a small team — the hidden cost is the engineering time and on-call burden of operating it, which usually dwarfs the managed-service bill.
source: AWS — What is managed services / cloud value ↗

6.2.6 Networking 13

Commonly asked mid concept occasional What is a TLS/SSL certificate, who issues it, and how does a browser decide to trust it?
A certificate binds a public key to an identity (a domain name) and is signed by a Certificate Authority (CA). When you connect, the server presents its certificate during the TLS handshake; the browser verifies the CA's signature, walks the chain of trust up to a root CA that's pre-installed in the OS/browser trust store, and checks the cert isn't expired, matches the hostname, and hasn't been revoked. If all that holds, the browser trusts that it's really talking to that domain (this is the authentication part of TLS).
The chain matters: a root CA signs intermediate CAs, which sign your leaf certificate, so the server sends leaf + intermediates and the browser anchors trust at the root it already trusts. A self-signed cert isn't signed by a trusted CA, so browsers warn — fine for internal/dev, not for public sites. Today most public certs come from Let's Encrypt (free, automated via ACME) and are short-lived, renewed automatically.
Senior tell: trust is anchored in pre-installed root CAs and verified via the signature chain — not 'the browser checks the cert with the website'.
What a strong answer covers
- A cert binds a public key to a domain identity, signed by a Certificate Authority.
- Browser verifies the chain of trust up to a root CA in its pre-installed trust store.
- It also checks expiry, hostname match, and revocation before trusting.
- Chain: root CA → intermediate(s) → leaf; the server sends leaf + intermediates.
- Self-signed = not CA-trusted (browser warns); public certs now mostly Let's Encrypt via ACME.
Follow-ups they push on
- Why does a self-signed certificate trigger a browser warning?
- What is the chain of trust, and where is it anchored?
- Why are automated, short-lived certs (Let's Encrypt/ACME) now the norm?
Red flag Thinking the browser validates a certificate by checking with the website itself — trust is anchored in pre-installed root CAs and verified through the signature chain; the site never gets to vouch for its own identity.
source: Cloudflare — What is an SSL certificate? ↗
Commonly asked mid concept common What is a reverse proxy, and how is it different from a forward proxy and a load balancer?
A reverse proxy (e.g. nginx) sits in front of your servers and faces clients: clients connect to it, and it forwards requests to backends. It centralizes TLS termination, caching, compression, request routing, and security (it hides backend topology and can absorb attacks). The client doesn't know which backend served it.
A forward proxy sits in front of clients and faces the internet — it proxies outbound requests on behalf of users (corporate egress filtering, anonymity, caching outbound). So the two differ by which side they represent: reverse proxy works for the servers, forward proxy works for the clients.
A load balancer is a specific job — distributing traffic across multiple backends — that a reverse proxy often performs, but a reverse proxy also does TLS, caching, and routing beyond just balancing. In practice nginx is frequently both reverse proxy and load balancer.
Follow-ups they push on
- Which side does each proxy represent — the client or the server?
- Is every reverse proxy a load balancer? Is every load balancer a reverse proxy?
Red flag Mixing up forward and reverse proxies — a forward proxy acts on behalf of the client (outbound), a reverse proxy acts on behalf of the server (inbound).
source: Cloudflare — What is a reverse proxy? ↗
Commonly asked mid concept common What does a CDN do, and how does it speed up content delivery?
A CDN is a globally distributed network of edge servers that cache copies of your content close to users. When a user requests an asset, they're served from the nearest edge location instead of a round trip to a single origin — turning a 100-300ms origin fetch into a 5-20ms edge cache hit and slashing latency for users far from your origin.
Beyond latency, a CDN offloads traffic from your origin (the origin only serves cache misses, so it handles far less load and survives traffic spikes), and the edge often adds TLS termination, compression, and DDoS protection. It's ideal for static and cacheable content — images, CSS/JS, video, downloads. Dynamic, per-user responses are harder to cache, though edge compute and smart cache keys help. Cache invalidation (knowing when to purge stale content) is the recurring hard part.
Follow-ups they push on
- What kinds of content cache well on a CDN, and what doesn't?
- How does a CDN reduce load on your origin, not just latency?
Red flag Thinking a CDN only helps latency — it also massively offloads the origin (origin only serves cache misses), which is often the bigger win during traffic spikes.
source: Cloudflare — What is a CDN? ↗
Commonly asked mid concept common What is a VPC, and what's the difference between a public and a private subnet?
A VPC (Virtual Private Cloud) is your own logically isolated network inside the cloud, with a private IP range you control. You carve it into subnets, each living in one availability zone.
The public/private distinction is about reachability from the internet, controlled by routing. A public subnet has a route to an internet gateway, so resources there can have public IPs and be reached from the internet — that's where you put load balancers and bastion hosts. A private subnet has no direct internet route inbound; that's where you put app servers and databases so they can't be reached directly. Private-subnet resources still make outbound calls (e.g. to pull updates) through a NAT gateway, which allows egress but not unsolicited inbound. The pattern: internet -> load balancer in a public subnet -> app/database in private subnets.
Follow-ups they push on
- How does a private-subnet instance reach the internet for outbound updates?
- Where would you place a database and why?
Red flag Putting databases in a public subnet for convenience — they become directly reachable from the internet; databases belong in private subnets behind a load balancer or bastion.
source: AWS — VPC subnets (public/private) ↗
Commonly asked mid concept very common What is the difference between TCP and UDP, and when would you choose UDP?
TCP is connection-oriented and reliable: a handshake sets up the connection, then it guarantees ordered, complete, error-checked delivery, retransmitting lost packets and applying flow/congestion control. That reliability costs overhead and latency — the handshake, acks, and head-of-line blocking when a lost packet stalls everything behind it. It's the default for anything that must arrive intact: HTTP, database connections, file transfer.
UDP is connectionless and 'fire-and-forget': no handshake, no ordering, no retransmission, no congestion control — just send datagrams. It's leaner and lower-latency, but the application must tolerate (or itself handle) loss and reordering. Choose UDP when timeliness beats completeness: live video/voice (a dropped frame is better than a stalled stream), real-time gaming, and DNS (one small request/response where setting up a TCP connection would be wasteful).
The modern twist: QUIC/HTTP-3 runs over UDP and rebuilds reliability/ordering in userspace to dodge TCP's head-of-line blocking — UDP as a foundation, not a compromise.
What a strong answer covers
- TCP: connection + handshake, reliable, ordered, congestion-controlled — HTTP, DBs, file transfer.
- UDP: connectionless, no ordering/retransmission — lean and low-latency, app handles loss.
- Choose UDP when timeliness beats completeness: live video/voice, gaming, DNS.
- TCP's reliability adds latency (handshake, acks, head-of-line blocking on loss).
- QUIC/HTTP-3 runs on UDP and re-adds reliability in userspace to avoid TCP head-of-line blocking.
Quick self-check
For a live voice/video call, why is UDP often preferred over TCP?
Follow-ups they push on
- Why does a single lost packet in TCP stall everything behind it (head-of-line blocking)?
- Why does DNS traditionally use UDP for a typical query?
- How does QUIC get reliability while running over UDP?
Red flag Calling UDP 'unreliable so never use it' — for latency-sensitive, loss-tolerant traffic (voice, video, gaming, DNS) UDP is the correct choice, and modern protocols like QUIC build on it deliberately.
source: Cloudflare — What is the difference between TCP and UDP? ↗
Commonly asked mid concept common Explain how DNS resolution works end to end, and what the common record types do.
DNS turns a name into an IP through a hierarchy of caches and authoritative servers. The browser/OS cache is checked first; on a miss, a recursive resolver (your ISP's or e.g. 8.8.8.8) does the legwork: it asks a root server (which points to the right TLD), then the TLD server for .com (which points to the domain's authoritative nameserver), then the authoritative nameserver, which returns the actual record. Results are cached at each level for the record's TTL, so most lookups never travel the full chain.
Common records: A (name → IPv4) and AAAA (→ IPv6); CNAME (alias one name to another name); MX (mail servers); TXT (arbitrary text — SPF/DKIM, domain verification); NS (delegates a zone to nameservers). TTL is the lever for caching vs agility: a long TTL means fewer lookups but slow propagation when you change records; a short TTL flips that — which is why you lower TTL *before* a planned migration.
Senior tell: knowing the resolver, not the browser, walks root→TLD→authoritative, and that DNS is a globally cached, eventually-consistent system (TTL governs staleness).
What a strong answer covers
- Caches first (browser/OS), then a recursive resolver walks root → TLD → authoritative.
- Each level caches the answer for its TTL, so most lookups short-circuit.
- A/AAAA = name→IPv4/IPv6; CNAME = alias to another name; MX = mail; TXT = SPF/verification; NS = delegation.
- TTL trades caching vs agility — lower it before a migration so changes propagate fast.
- DNS is globally cached and eventually consistent; stale answers persist until TTL expires.
Quick self-check
You're migrating to a new server IP next week and want DNS to cut over quickly. What do you do first?
Follow-ups they push on
- Why lower a record's TTL before a planned IP migration?
- What's the difference between an A record and a CNAME?
- Who actually walks the root→TLD→authoritative chain — the browser or the resolver?
Red flag Changing a DNS record and expecting it to take effect instantly — old answers stay cached until the TTL expires, so propagation is governed by TTL; you lower TTL ahead of time for fast cutovers.
source: Cloudflare — What is DNS? / DNS records ↗
Commonly asked senior concept very common What is the difference between an L4 and an L7 load balancer, and when would you use each?
An L4 (transport-layer) load balancer routes by IP and TCP/UDP port without looking at the payload. It is fast, low-overhead, protocol-agnostic, and preserves the connection — ideal for raw TCP/UDP, very high throughput, low latency, or when you need a static IP (e.g. AWS NLB).
An L7 (application-layer) load balancer understands HTTP: it can route on hostname, URL path, headers, and cookies, terminate TLS, do sticky sessions, and apply content-based rules (e.g. send /api to one service, /static to another). That intelligence costs a bit more processing (e.g. AWS ALB). Rule of thumb: web/HTTP traffic that needs path/host routing or TLS termination -> L7; non-HTTP, ultra-high-throughput, or static-IP needs -> L4.
Follow-ups they push on
- Which layer can do path-based routing, and why can't the other?
- Where does TLS termination happen in each case?
Red flag Claiming an L4 load balancer can route by URL path or host header — it never inspects the HTTP payload, so content-based routing requires L7.
source: Cloudflare — What is load balancing? ↗
Commonly asked senior concept common What load balancing algorithms exist, and how do sticky sessions and health checks fit in?
Common algorithms: round-robin (rotate through backends — simple, assumes roughly equal requests/servers); least-connections (send to the backend with the fewest active connections — better when request durations vary widely); weighted variants (bias toward bigger instances); and hash-based (e.g. hash the client IP or a key to pin a client consistently to one backend).
Health checks are what make a load balancer safe: it periodically probes each backend and stops routing to any that fail, so a dead or unhealthy instance is automatically taken out of rotation and traffic flows only to healthy ones. Without them the LB cheerfully sends traffic into a black hole.
Sticky sessions (session affinity) pin a given client to the same backend (via a cookie or IP hash) so server-local session state keeps working. It's a crutch: it undermines even load distribution and breaks when that backend dies. The senior view is to avoid the need for it by making services stateless and externalizing session state to a shared store (Redis), so any backend can serve any request and the LB is free to balance optimally.
What a strong answer covers
- Algorithms: round-robin, least-connections (varied request durations), weighted, hash-based.
- Health checks remove failed backends from rotation — without them the LB routes into dead servers.
- Sticky sessions pin a client to one backend so server-local state works — but it's a crutch.
- Stickiness undermines even balancing and breaks when the pinned backend dies.
- Better: stateless services + shared session store (Redis) so any backend serves any request.
Follow-ups they push on
- When is least-connections a better choice than round-robin?
- Why do sticky sessions undermine even load distribution?
- How does externalizing session state let you drop sticky sessions?
Red flag Relying on sticky sessions to hold user state on a specific server — when that server is removed (scale-in, failure, deploy) the session is lost; externalize session state so any backend can serve the request.
source: Cloudflare — What is load balancing? ↗
Commonly asked senior concept occasional What is HTTP/2 (and HTTP/3), and what problems did they solve over HTTP/1.1?
HTTP/1.1 sends one request/response per connection at a time; a slow response blocks the ones behind it on that connection (head-of-line blocking), so browsers worked around it by opening many parallel connections — wasteful and still limited.
HTTP/2 introduced multiplexing: many requests/responses share one TCP connection as interleaved streams, so a slow response no longer blocks others *at the HTTP layer*. It also added header compression (HPACK) and stream prioritization. But because it still rides on a single TCP connection, a lost TCP segment stalls *all* streams — TCP-level head-of-line blocking remained.
HTTP/3 fixes that by moving off TCP onto QUIC (over UDP): streams are independent at the transport layer, so one lost packet only stalls its own stream, not the others. QUIC also folds the transport + TLS handshake together for faster (often 0-1 RTT) connection setup and better mobility across networks. The arc: HTTP/2 solved app-layer HOL blocking, HTTP/3 solved transport-layer HOL blocking.
What a strong answer covers
- HTTP/1.1: one in-flight request per connection → app-layer head-of-line blocking, many parallel sockets.
- HTTP/2: multiplexes many streams on one TCP connection + header compression (HPACK).
- HTTP/2's flaw: still one TCP connection, so a lost segment causes transport-level HOL blocking.
- HTTP/3 runs on QUIC over UDP — independent streams, so one lost packet stalls only its stream.
- QUIC also merges transport+TLS handshake for faster (0-1 RTT) setup and connection migration.
Quick self-check
What was the key transport change in HTTP/3 versus HTTP/2?
Follow-ups they push on
- Why does HTTP/2 still suffer head-of-line blocking despite multiplexing?
- How does running over QUIC/UDP let HTTP/3 avoid that?
- What did header compression (HPACK) buy HTTP/2?
Red flag Claiming HTTP/2 eliminated head-of-line blocking entirely — it removed it at the HTTP layer, but a single lost TCP packet still stalls all streams; only HTTP/3 over QUIC removes the transport-level HOL blocking.
source: Cloudflare — What is HTTP/3? ↗
Commonly asked senior debug occasional Users intermittently get 502/504 errors from your service behind a load balancer. How do you debug it?
First decode the codes: a 502 Bad Gateway means the load balancer got an invalid/empty response from a backend; a 504 Gateway Timeout means the backend didn't respond within the LB's timeout. Both point *downstream of the LB* — the LB is reachable, so the problem is the backends or the path to them, and 'intermittent' suggests some backends or some requests, not all.
Work it methodically: (1) check backend health in the LB — are some targets failing health checks and flapping in/out of rotation? (2) backend logs/metrics — crashes, restarts, OOM, or slow endpoints (504 often = a slow query or a downstream dependency timing out). (3) timeout mismatch — a classic 502 cause is the LB's idle/keep-alive timeout being *longer* than the backend's, so the backend closes a connection the LB still tries to reuse; align them (backend keep-alive ≥ LB idle timeout). (4) capacity — are 5xx spikes correlated with load (saturated backends, exhausted connection pools, thread starvation)? (5) recent deploys/config changes.
Senior tell: distinguishing 502 (bad/empty upstream response, often the keep-alive timeout mismatch) from 504 (upstream too slow), and following the request path from LB → backend → that backend's dependencies.
What a strong answer covers
- 502 = LB got an invalid/empty response from a backend; 504 = backend timed out — both downstream of the LB.
- Check backend health checks — flapping targets cause intermittent failures.
- Inspect backend logs/metrics: crashes, OOM, slow endpoints, dependency timeouts (typical 504).
- Classic 502: LB idle timeout > backend keep-alive, so the backend drops a connection the LB reuses — align them.
- Correlate 5xx with load (saturated pools/threads) and recent deploys.
Follow-ups they push on
- What's the difference in meaning between a 502 and a 504?
- How does a keep-alive/idle timeout mismatch produce intermittent 502s?
- Why does 'intermittent' point you toward specific backends or load rather than the LB itself?
Red flag Blaming the load balancer for 502/504s — both codes mean the LB reached the backend but the backend gave a bad or slow response; the fix is almost always in the backends, their dependencies, or a timeout mismatch, not the LB config alone.
source: Cloudflare — What is a 502 Bad Gateway error? ↗
Commonly asked senior concept very common Walk me through what happens, step by step, when a user types a URL and hits Enter.
DNS resolution first: the browser checks its cache, then the OS, then a recursive resolver, which walks root -> TLD -> authoritative nameserver to get the IP (often a CDN edge or load balancer IP).
Then the TCP connection (handshake) to that IP on port 443, and a TLS handshake to negotiate keys and verify the server's certificate so the channel is encrypted. The browser sends the HTTP request; it likely lands on a CDN edge or load balancer, which either serves cached content or forwards to an origin server. The server (behind a reverse proxy / LB, possibly hitting app servers, caches, and databases) returns the response. Finally the browser parses HTML, fetches sub-resources (CSS/JS/images, often from the CDN), and renders. This question is a checklist of the whole stack — name DNS, TCP, TLS, LB/reverse proxy, CDN, and origin.
Follow-ups they push on
- Where does the CDN fit, and what does it save you?
- What does the TLS handshake actually establish before any HTTP is sent?
Red flag Skipping straight to 'the server returns HTML' and omitting DNS, the TLS handshake, and the load-balancer/CDN hops — the interviewer is probing breadth across the whole networking stack.
source: Cloudflare — What is DNS? ↗
Commonly asked senior concept common How does the TLS handshake work, and what does HTTPS actually give you?
HTTPS = HTTP over TLS. TLS provides three things: encryption (eavesdroppers can't read traffic), authentication (you're talking to the real server, via its certificate), and integrity (tampering is detected).
The handshake establishes a shared session key without ever sending it in the clear. The client sends a ClientHello (supported versions/ciphers); the server responds with its choice plus its certificate. The client validates the certificate against a trusted Certificate Authority chain (this is the authentication step). They then perform a key exchange — modern TLS 1.3 uses ephemeral Diffie-Hellman so each session gets a fresh key (forward secrecy: stealing the server's key later can't decrypt past traffic). Once the shared symmetric key is derived, the rest of the session uses fast symmetric encryption. TLS 1.3 also trimmed the handshake to one round trip. The asymmetric crypto is only used to bootstrap the symmetric key.
Follow-ups they push on
- What is forward secrecy and why does ephemeral key exchange give it to you?
- Why switch from asymmetric to symmetric encryption after the handshake?
Red flag Saying the whole session is encrypted with the server's public/private key pair — asymmetric crypto only bootstraps a shared symmetric key; the bulk traffic uses fast symmetric encryption.
source: Cloudflare — What happens in a TLS handshake? ↗
Commonly asked senior concept occasional What is the difference between a security group and a network ACL, and how do they implement defense in depth?
Both are virtual firewalls in a VPC but operate at different scopes and behave differently. A security group is attached to an instance/ENI and is stateful: if you allow an inbound request, the response is automatically allowed back out — you only write the rules you care about, and you can only specify allow rules.
A network ACL is attached to a subnet and is stateless: it evaluates inbound and outbound traffic independently (so you must allow the return traffic explicitly), it supports both allow and deny rules, and rules are evaluated in numbered order. So security groups guard individual resources, NACLs guard the whole subnet boundary.
Using both is defense in depth: the NACL is a coarse subnet-level gate (e.g. block a bad IP range for everything in the subnet) and the security group is the fine-grained per-instance control. An attacker has to get past both layers.
Follow-ups they push on
- Why does a stateless NACL require you to allow return traffic explicitly?
- Why run both instead of relying on just the security group?
Red flag Treating a security group as stateless and adding redundant outbound rules for return traffic, or assuming a NACL is stateful and forgetting to allow the response, which silently drops connections.
source: AWS — Compare security groups and network ACLs ↗

6.2.7 Observability 12

★ must-know Google senior concept very common Define SLI, SLO, SLA, and error budget — how do they relate?
An SLI (Service Level *Indicator*) is a measured quantity of service health — e.g. the proportion of HTTP requests that succeed under 300ms.
An SLO (Service Level *Objective*) is the internal target for an SLI over a window — e.g. 99.9% of requests succeed over 28 days. It is what you *aim* for.
An SLA (Service Level *Agreement*) is a contract with customers that includes consequences (refunds, penalties) if you miss it. SLAs are looser than SLOs so you have headroom before you owe anyone money.
The error budget is 1 − SLO — the allowed amount of unreliability (0.1% for a 99.9% SLO). It turns reliability into a currency: while budget remains you can ship fast and take risks; when it is exhausted you freeze risky launches and prioritize stability. It is the mechanism that lets dev and ops stop arguing about pace versus reliability.
What a strong answer covers
- SLI = the measurement; SLO = the internal target on that measurement; SLA = the externally-promised, consequence-bearing version.
- Set the SLO tighter than the SLA so you get warning before breaching the contract.
- Error budget = 1 − SLO — the explicit, spendable allowance of failure over the window.
- 100% is the wrong reliability target: it is impossibly expensive and leaves no budget to ship features.
- When the budget is spent, the policy is to halt risky releases until reliability recovers.
Quick self-check
Your SLO is 99.9% success over 28 days. What is the error budget?
Follow-ups they push on
- Why is targeting 100% availability the wrong goal?
- What should happen operationally when the error budget is fully consumed?
- Why is an SLA usually looser than the corresponding SLO?
Red flag Conflating SLO and SLA, or setting them equal — the SLA must be looser than the SLO, and the error budget only makes sense as the gap below the SLO target.
source: Google SRE Book — Service Level Objectives ↗
Commonly asked junior concept common How do Prometheus and Grafana divide responsibilities in a typical stack?
Prometheus is the time-series database and collector: it *pulls* (scrapes) metrics from instrumented targets, stores them, and evaluates alerting rules. Querying is done with PromQL.
Grafana is the visualization/dashboard layer: it queries Prometheus (and many other sources) and renders graphs, tables, and alerts for humans.
The one-liner: Prometheus collects and stores the numbers; Grafana makes them legible. They are complementary, not competitors — you commonly run both together.
Follow-ups they push on
- Why does Prometheus prefer a pull model over push?
- Where does Alertmanager fit relative to Prometheus?
Red flag Thinking Grafana stores metrics — it is a query/visualization front-end over data sources, not a TSDB.
source: Grafana — Prometheus data source ↗
Commonly asked mid concept very common What are the three pillars of observability, and what question does each one answer?
Logs are timestamped, discrete records — the narrative of *what happened* on one service. Best for forensic, after-the-fact debugging of a specific event.
Metrics are aggregated numbers over time (counters, gauges, histograms) — they answer *how much / how often / is the trend bad?* Cheap to store, great for dashboards and alerting thresholds.
Traces follow a single request across service boundaries — they answer *where did the time go / which hop failed?* in a distributed system.
The strong answer ties them together: a metric alert tells you something is wrong, a trace localizes which service, and logs from that service explain why.
Follow-ups they push on
- Which pillar is most expensive to store at scale, and why?
- How do you correlate a log line with the trace it belongs to?
Red flag Treating the three as interchangeable, or claiming logs alone give you observability — logs do not show cross-service latency the way traces do.
source: Sematext — Three Pillars of Observability ↗
Commonly asked mid concept occasional What's the difference between a counter, a gauge, and a histogram in Prometheus, and when do you use each?
A counter only ever increases (or resets to zero on restart): total requests served, total errors. You don't read its raw value — you apply rate() to get per-second throughput. A counter answers 'how many, cumulatively?'
A gauge goes up and down: current memory usage, in-flight requests, queue depth, temperature. You read it directly; it answers 'what is the value *right now*?'
A histogram samples observations into configurable buckets (e.g. request durations) so you can compute quantiles like p95/p99 with histogram_quantile(). It answers 'what does the *distribution* look like?' — essential for latency, where the mean lies.
Pick by the question: cumulative count → counter; point-in-time level → gauge; distribution/percentiles → histogram.
What a strong answer covers
- Counter = monotonically increasing; query with rate(), never read raw (it resets on restart).
- Gauge = a value that rises and falls; read directly for current state.
- Histogram = bucketed observations enabling quantiles (p95/p99) via histogram_quantile().
- Latency belongs in a histogram, not a gauge or an average — the tail is what hurts users.
Quick self-check
You want p99 request latency on a dashboard. Which metric type do you instrument?
Follow-ups they push on
- Why do you apply `rate()` to a counter instead of reading its value?
- What's the difference between a Prometheus histogram and a summary?
Red flag Using a gauge for an ever-growing total (so a restart silently resets it and breaks your dashboards), or averaging latency instead of using a histogram for percentiles.
source: Prometheus — Metric types ↗
Commonly asked mid concept common What problem does OpenTelemetry solve?
OpenTelemetry (OTel) is a vendor-neutral standard — APIs, SDKs, and the Collector — for generating and exporting traces, metrics, and logs.
The problem it solves: before OTel, each backend (Datadog, Jaeger, New Relic, Prometheus) had its own agent and instrumentation library, so switching vendors meant re-instrumenting your code. With OTel you instrument *once* against a common API, then point the Collector at whatever backend you choose — no code change to switch or fan out to several.
It is now a CNCF project and the de-facto wire format (OTLP) for telemetry.
Follow-ups they push on
- What does the OTel Collector do that an in-process SDK exporter doesn't?
- How does context propagation let a trace span multiple services?
Red flag Calling OpenTelemetry a 'monitoring tool' or a backend — it generates and ships telemetry; it does not store or visualize it (that's Prometheus, Grafana, Jaeger, etc.).
source: OpenTelemetry — What is OpenTelemetry? ↗
Google mid concept common What are the four golden signals, and why is each one worth alerting on?
Google SRE's four golden signals for a user-facing system are latency, traffic, errors, and saturation.
Latency — how long requests take; crucially, track *successful* and *failed* latency separately, since a fast error can hide a problem. Traffic — demand on the system (requests/sec, transactions/sec). Errors — the rate of failed requests, including the sneaky ones that return 200 but are wrong. Saturation — how 'full' the most constrained resource is (memory, I/O, CPU), the leading indicator of imminent degradation.
If you can only instrument four things, these give you the broadest coverage of user-visible health. RED is essentially the request-side subset (rate/errors/duration); saturation adds the resource-pressure dimension.
What a strong answer covers
- The four: latency, traffic, errors, saturation — broad coverage from a minimal set.
- Measure latency of failures separately from successes — a fast 500 skews the average and masks the outage.
- Saturation is a *leading* indicator: it warns before latency and errors blow up.
- RED (Rate/Errors/Duration) maps onto the request-facing three; saturation is the resource lens (the S in USE-style thinking).
Follow-ups they push on
- Why must you separate the latency of failed requests from successful ones?
- How do the golden signals overlap with the RED method?
Red flag Folding failed-request latency into your overall latency metric — a flood of instant errors makes p50 latency look great while users are seeing failures.
source: Google SRE Book — Monitoring Distributed Systems (Golden Signals) ↗
Commonly asked mid concept common Why is structured logging preferred over plain-text logs, and what is a correlation/trace ID for?
Structured logging emits each log as machine-parseable key/value data (typically JSON) — {"level":"error","user_id":42,"latency_ms":910} — instead of a free-text sentence. The payoff: you can index, filter, and aggregate on fields (level=error AND service=checkout) in a log platform, rather than writing fragile regexes against prose.
A correlation ID (a.k.a. request/trace ID) is a unique identifier generated at the edge and propagated through every service and log line for a single request. It lets you reconstruct the entire path of one request across many services by filtering on one value — turning scattered log lines into a coherent story, and linking logs to the matching distributed trace.
Together they make logs queryable *and* joinable, which is what makes them useful at scale.
What a strong answer covers
- Structured logs are key/value (JSON) — indexable and filterable on fields, not parsed from prose.
- A correlation/trace ID is generated at the edge and propagated so all log lines for one request share it.
- Filtering on the correlation ID reconstructs one request's journey across every service it touched.
- It also bridges logs and traces — the same ID ties a log line to its span in a distributed trace.
Follow-ups they push on
- How does a correlation ID get propagated across an async message queue?
- Why does free-text logging become unmanageable in a microservices fleet?
Red flag Logging unstructured prose (or, worse, logging secrets/PII into those fields) — it forces brittle text parsing and can leak sensitive data into the log store.
source: OpenTelemetry — Logs / Correlation ↗
Commonly asked senior concept occasional What is tail-based sampling in distributed tracing, and why use it over head-based sampling?
Tracing every request at full volume is too expensive to store, so you sample. The question is *when* you decide.
Head-based sampling decides at the *start* of a trace — e.g. keep 1% of requests, chosen randomly at the root. It is cheap and simple, but blind: it might throw away the slow or errored traces, which are exactly the ones you want.
Tail-based sampling buffers the spans of a trace and decides *after* it completes, so it can keep traces based on outcome — every error, every request over 1s, plus a baseline sample of normal ones. You get the interesting traces without storing everything.
The tradeoff: tail-based needs to buffer complete traces (memory/coordination in the Collector) and is operationally heavier, but it captures the long tail that head-based sampling probabilistically discards.
What a strong answer covers
- Sampling exists because storing 100% of traces is prohibitively expensive at scale.
- Head-based decides at trace start (cheap, stateless) but can discard the slow/errored traces you most need.
- Tail-based decides after the trace finishes, so it can retain all errors and high-latency traces.
- Tail-based costs more: it must buffer whole traces and coordinate spans before deciding.
Follow-ups they push on
- Why can't head-based sampling preferentially keep error traces?
- What infrastructure does the OTel Collector need to do tail-based sampling?
Red flag Using uniform head-based sampling and then being surprised that the rare production error has no trace — the random sample almost never captured it.
source: OpenTelemetry — Sampling ↗
Google senior concept common What makes a good alert? Why do teams end up with alert fatigue, and how do you fix it?
A good alert is actionable, urgent, and user-impacting — it pages a human only when something needs a human to intervene *now*. The SRE guidance is to alert on symptoms (users are seeing errors / latency, the SLO is burning) rather than causes (CPU is at 80%), because a high CPU that isn't hurting anyone is not worth waking someone.
Alert fatigue sets in when too many alerts fire — noisy thresholds, alerts on causes that self-heal, duplicate pages for one incident — so on-call engineers start ignoring them, and the real page gets lost in the noise.
Fixes: alert on SLO burn rate rather than raw thresholds; route non-urgent signals to a dashboard or ticket instead of a page; deduplicate and group related alerts; and ruthlessly delete or tune any alert that consistently fires without requiring action. Every page should be reviewed: was it actionable?
What a strong answer covers
- Page only on symptoms users feel (errors, latency, SLO burn) — not on causes that may be harmless.
- Every page must be actionable and urgent; if no human action is needed now, it shouldn't page.
- Alert fatigue comes from noisy/duplicate/self-healing alerts; people then ignore the real one.
- Fix it with burn-rate alerts, deduplication/grouping, ticket-not-page routing, and pruning useless alerts.
Follow-ups they push on
- What is multi-window, multi-burn-rate alerting and why is it better than a static threshold?
- Why is paging on high CPU usually a bad idea?
Red flag Alerting on every resource metric (cause-based alerting) — it buries the few symptom-based pages that actually matter and trains on-call to dismiss notifications.
source: Google SRE Workbook — Alerting on SLOs ↗
Commonly asked senior concept common What's the difference between monitoring and observability?
Monitoring watches for *known* failure modes: you decide in advance what to measure, set thresholds, and alert when a line is crossed. It answers questions you predicted.
Observability is the property of a system that lets you ask *new* questions about its internal state from the outside, without shipping new code — to debug failures you did not anticipate.
The relationship: monitoring is a subset of what observable systems enable. You still need both — monitoring catches the predictable, observability lets you investigate the unknown-unknowns in complex distributed systems.
Follow-ups they push on
- What property of your telemetry makes a system observable rather than just monitored?
- Why do microservices raise the bar for observability versus a monolith?
Red flag Saying observability is 'just monitoring with more dashboards' — the distinction is exploring unknown-unknowns versus alerting on known thresholds.
source: TechTarget — The 3 pillars of observability ↗
Commonly asked senior debug common Your Prometheus storage is exploding after a deploy. What's the most likely cause and the fix?
Almost always a high-cardinality label. Each unique combination of label values is a separate time series; adding an unbounded label like user_id, request_id, email, or a raw URL with IDs multiplies series count explosively.
Fix: drop the offending label, or replace it with a bounded one. Use http_method, status-code *class* (2xx/5xx), route *template* (/users/:id, not /users/8123), and service — values with a small, fixed set.
If you genuinely need per-user detail, that belongs in logs or traces (high cardinality there is fine), not in metric labels.
Follow-ups they push on
- Why is high cardinality cheap in tracing but catastrophic in metrics?
- How would you find which metric is the culprit?
Red flag Putting unbounded identifiers (user IDs, request IDs, timestamps) into metric labels — the classic cardinality blow-up.
source: Sematext — Three Pillars of Observability (cardinality) ↗
Commonly asked senior concept occasional What are the RED and USE methods, and when would you use each?
RED (Rate, Errors, Duration) is request-centric — for services/endpoints, you watch request rate, error rate, and latency distribution. It answers 'is this service healthy from the caller's view?'
USE (Utilization, Saturation, Errors) is resource-centric — for every resource (CPU, disk, network, memory) you watch how busy it is, how much work is queued, and its error count. It answers 'is this machine/resource a bottleneck?'
Use RED for your request-serving services and USE for the infrastructure underneath them; they are complementary lenses.
Follow-ups they push on
- Why is a latency *percentile* (p99) more useful than a mean for the D in RED?
- What are the four golden signals and how do they relate to RED?
Red flag Alerting on averages instead of percentiles — a healthy mean hides a brutal p99 tail.
source: Grafana — RED method ↗

6.2.8 Deployment strategies 12

Commonly asked mid concept very common Compare blue-green, canary, and rolling deployments — define each and give the tradeoff.
Blue-green: run two full environments; blue serves prod while green gets the new version, then flip all traffic at once. Tradeoff: instant rollback (flip back), but you pay for double infrastructure.
Canary: release to a small slice of traffic/users first, watch metrics, then ramp up. Tradeoff: limits blast radius and catches real-world bugs early, but needs good monitoring and automated rollback, and the rollout is slower.
Rolling: replace instances in batches in place until all run the new version. Tradeoff: no extra infrastructure and simple, but both versions run simultaneously during the roll, rollback is slower, and bugs surface gradually.
Choice comes down to risk tolerance, infra budget, and how fast you need to recover.
Follow-ups they push on
- Which strategies require your two versions to be backward/forward compatible at the same time?
- How does a canary differ from a rolling deploy mechanically?
Red flag Confusing canary with rolling — canary targets a *traffic/user* slice and is metric-gated; rolling replaces *instances* batch by batch regardless of who they serve.
source: Unleash — Comparing deployment strategies ↗
Commonly asked mid concept common What's the difference between continuous delivery and continuous deployment?
Both build on continuous integration (merge and test small changes frequently) and keep main in an always-releasable state. The difference is the last step.
Continuous delivery: every change that passes the pipeline is *ready* to deploy, but the actual push to production is a manual decision — a human clicks the button. You can release any time, on demand.
Continuous deployment: there is no manual gate — every change that passes all automated checks deploys to production automatically. It demands very strong test coverage, automated rollback, and good observability, because nothing stops a bad change but the pipeline itself.
So: continuous *delivery* makes release a one-click choice; continuous *deployment* removes the click. Many teams do CD-delivery and reserve full auto-deploy for services where they trust their safety nets.
What a strong answer covers
- Both rest on CI and an always-releasable main.
- Continuous delivery = always *ready* to ship, but a human triggers the production release.
- Continuous deployment = every passing change auto-ships to prod, with no manual gate.
- Continuous deployment requires strong automated tests, rollback, and observability to be safe.
Quick self-check
What is the single distinguishing feature of continuous *deployment* versus continuous *delivery*?
Follow-ups they push on
- What safety nets must be in place before you trust full continuous deployment?
- Where do feature flags fit in a continuous-deployment pipeline?
Red flag Using 'CD' loosely — interviewers care that you distinguish *delivery* (manual release trigger) from *deployment* (fully automatic), and know the latter's higher safety-net bar.
source: Atlassian — Continuous integration vs delivery vs deployment ↗
Commonly asked mid concept occasional What is a deployment rollback, and why is 'roll forward' often preferred in practice?
Rollback restores the previous known-good version after a bad deploy. With blue-green it is a traffic flip; with rolling it means re-deploying the old image batch by batch.
Many mature teams prefer roll forward — ship a fix as a new deploy — because rollback can be unsafe when the bad version already wrote incompatible data or ran a forward-only migration. You cannot 'un-migrate' easily, and an old binary against a new schema can corrupt things.
Strong answer: keep deploys small and frequent so the diff to fix or revert is tiny, make migrations backward-compatible so rollback stays an option, and automate whichever path you choose.
Follow-ups they push on
- When is rollback strictly impossible?
- How do small, frequent deploys make both rollback and roll-forward safer?
Red flag Assuming rollback is always safe — irreversible migrations or data written by the new version can make rolling back worse than rolling forward.
source: Google SRE Book — Release Engineering ↗
Google mid concept common Why should deployments be automated and repeatable rather than a manual checklist?
Manual deploys are slow, error-prone, and unrepeatable — the same human running the same steps will eventually skip one under pressure, and the process lives in one person's head. Automation makes the deploy deterministic and self-documenting: the pipeline *is* the runbook.
The SRE principle is that releases should be hermetic and reproducible — build from a known, version-controlled source with pinned tools so the same inputs always produce the same artifact, independent of the machine running the build. Combined with automated tests as gates, this lets you deploy frequently and safely, and makes rollback a known, rehearsed action rather than improvisation during an incident.
Frequent small automated deploys also shrink each change's blast radius — easier to test, easier to bisect, easier to revert.
What a strong answer covers
- Manual steps are non-repeatable and fail under pressure; automation makes deploys deterministic.
- Builds should be hermetic/reproducible — pinned source and tools, same inputs → same artifact.
- Automated test gates let you deploy frequently and safely, with rehearsed rollback.
- Small, frequent, automated releases shrink each deploy's blast radius.
Follow-ups they push on
- What does a 'hermetic build' mean and why does it aid reproducibility?
- How does deploy frequency relate to the size of each change's risk?
Red flag Relying on a manual, tribal-knowledge deploy checklist — it doesn't scale, drifts from reality, and turns every release into a risk that only one person can run.
source: Google SRE Book — Release Engineering ↗
Commonly asked senior debug occasional Your canary shows no errors and gets promoted to 100%, then production falls over an hour later. What likely went wrong?
The canary passed because the failure mode wasn't *visible at canary scale or duration*. The usual suspects:
- A slow resource leak (memory, file descriptors, connection-pool exhaustion) that only crosses the limit after an hour of uptime — the short bake never reached it.
- A load/scale effect: at 1% traffic a new query or lock was fine; at 100% it saturates the database or a downstream dependency that the small canary never pressured.
- Cold→warm transitions: caches were warm on the old fleet but the new version's cache was cold under full load, or a thundering-herd on cutover.
- Time/cron-triggered behavior (a batch job, TTL expiry) that simply hadn't fired during the canary window.
Response: roll back (or roll forward a fix), then fix the *process* — longer bake time, load-aware canary analysis, and dependency/saturation metrics, not just error rate.
What a strong answer covers
- Canaries miss bugs that need time (leaks, scheduled jobs) or scale (DB/lock saturation at full traffic) to manifest.
- 1% traffic gives no statistical power for rare paths and no pressure on shared downstreams.
- Cold caches / thundering herd on full cutover can sink a version that looked fine warm.
- Fix the process: longer bake time, watch saturation and dependencies, not error rate alone.
Follow-ups they push on
- How would a memory leak escape a 15-minute canary but kill the fleet in an hour?
- Why can a query be fine at 1% traffic and lethal at 100%?
Red flag Trusting a short, low-traffic canary as proof of safety — error-rate-only, brief canaries are blind to leaks, scale effects, and time-triggered behavior.
source: Google SRE Workbook — Canarying Releases ↗
Commonly asked senior concept common Why does any zero-downtime deploy require old and new versions to be compatible, and what breaks if they aren't?
Rolling, canary, and blue-green (during the flip) all have a window where both versions serve traffic simultaneously against shared state — the same database, the same message formats, the same caches and clients. If the versions aren't mutually compatible, that window corrupts data or throws errors.
Concrete breakages: the new version writes a message/field the old version can't parse (or vice versa); the new schema drops or renames a column the old code still reads; a client gets a v2 response from the new instance then a v1 from the old one on the next request. Rollback is the same problem in reverse — the old version must tolerate data the new version already wrote.
The discipline is backward and forward compatibility via expand/contract: change in additive, tolerant steps (add before you read, deploy before you require, contract only after everything is upgraded) so any two adjacent versions can coexist.
What a strong answer covers
- Every zero-downtime strategy has a window where N and N+1 run together on shared state.
- Incompatibility there means corrupted data or runtime errors, not a clean failure.
- Rollback needs the same property in reverse: old code must tolerate data new code wrote.
- The cure is expand/contract / parallel change — additive, tolerant steps so adjacent versions coexist.
Follow-ups they push on
- How does expand/contract make a column rename safe across a rolling deploy?
- Why does message-queue schema evolution need both forward and backward compatibility?
Red flag Assuming a deploy is atomic — for the duration of any rolling/canary/blue-green cutover two versions coexist, so a breaking change to schema or wire format corrupts the in-flight overlap.
source: Martin Fowler — ParallelChange (expand/contract) ↗
Commonly asked senior design common Your service uses blue-green deploys. A migration adds a NOT NULL column. Why is this dangerous, and how do you ship it safely?
During the flip (and any rollback) both the old and new code may run against the *same* database. An old instance does not know about the new column; if it is NOT NULL with no default, the old code's inserts fail. A destructive migration also makes rollback impossible.
Safe approach is expand/contract (a.k.a. parallel change):
1. Expand: add the column as nullable / with a default — old and new code both work.
2. Deploy code that writes (and backfills) the new column.
3. Backfill existing rows.
4. Contract: only after all code uses it, add the NOT NULL constraint and drop old paths.
The rule: schema changes must be backward-compatible with the version still running.
Follow-ups they push on
- How does the same problem bite a rolling deploy?
- Why should you never rename a column in a single migration?
Red flag Coupling a destructive/forward-only schema change to the same release as the code that needs it — it breaks the still-running old version and blocks rollback.
source: Martin Fowler — ParallelChange (expand/contract) ↗
Commonly asked senior concept common When would you choose blue-green over canary, and vice versa?
Blue-green suits big-bang releases where you want an instant, all-or-nothing cutover and the cleanest possible rollback — e.g. a major version where running both versions side by side for long is undesirable, and you can afford the duplicate environment.
Canary suits fast-evolving services where you want to validate a change against *real* production traffic before full exposure, and where a small percentage of affected users is an acceptable way to catch regressions monitoring can detect.
Real answer mentions constraints: canary needs solid metrics + automated rollback; blue-green needs budget for two environments and a story for shared state (DB, caches).
Follow-ups they push on
- What makes automated rollback feasible for a canary but trickier for blue-green?
- How do feature flags let you decouple deploy from release entirely?
Red flag Recommending canary without acknowledging it is useless without good observability to decide promote-vs-roll-back.
source: TechTarget — canary vs blue/green vs rolling ↗
Commonly asked senior concept occasional What's the difference between a deployment and a release, and why does the distinction matter?
Deploy = getting new code running in production. Release = exposing that behavior to users. Feature flags let you separate the two: you can deploy dark code that is off, then flip it on (release) independently — and turn it off without a redeploy.
Why it matters: it shrinks risk. Deploys become routine and frequent; releases become a business decision (flag on for 5%, then 50%, then all). Rollback of a feature is a config flip, not a redeploy. It also enables trunk-based development — unfinished work hides behind a flag instead of a long-lived branch.
Follow-ups they push on
- What is the operational cost of accumulating stale feature flags?
- How do flags enable canary-style releases without canary infrastructure?
Red flag Conflating deploy and release — assuming code is live to users the instant it is deployed, when a flag may gate it.
source: Martin Fowler — Feature Toggles ↗
Commonly asked senior concept occasional How does Kubernetes implement a rolling update, and what knobs control its safety?
A Deployment's RollingUpdate strategy spins up new-version Pods and tears down old ones gradually, governed by two knobs:
- maxUnavailable — how many Pods below the desired count you tolerate during the roll (availability floor).
- maxSurge — how many extra Pods above desired you allow (capacity ceiling).
Kubernetes only routes traffic to a Pod once its readiness probe passes, so a broken new version that never becomes ready stalls the rollout instead of taking traffic. kubectl rollout undo reverts to the prior ReplicaSet.
For canary/blue-green you layer in a service mesh or progressive-delivery controller (Argo Rollouts, Flagger) — vanilla Deployments only do rolling.
Follow-ups they push on
- Why is a correct readiness probe essential for a safe rolling update?
- What does maxSurge=0, maxUnavailable=0 do — and why is it a deadlock?
Red flag Forgetting readiness probes — without them Kubernetes sends traffic to Pods that are up but not actually ready to serve.
source: Kubernetes — Rolling updates ↗
Commonly asked senior concept common What's the operational cost of feature flags, and how do you keep them from becoming tech debt?
Flags decouple deploy from release and are powerful, but each one adds a branch to your code's runtime behavior. Costs: combinatorial explosion (N flags = 2^N possible states you can't all test), stale flags that linger long after a rollout completes and confuse readers, and the risk of a flag becoming a permanent, undocumented config knob.
The fix is treating flags as short-lived by default: a release toggle exists only to ramp a feature, and you delete it (and its dead branch) the moment the feature is 100% rolled out. Distinguish flag *kinds* — release toggles are transient; ops/kill-switches and permissioning toggles are long-lived and managed differently. Track flags in a registry with an owner and an expiry, and add cleanup to the definition of done.
What a strong answer covers
- Each flag doubles the runtime state space — 2^N combinations quickly become untestable.
- Stale release toggles are tech debt: dead branches that mislead readers and rot.
- Categorize toggles — release (short-lived) vs ops/kill-switch and permissioning (long-lived) — and manage each differently.
- Give every flag an owner and expiry; deleting the flag is part of finishing the feature.
Follow-ups they push on
- Why are short-lived release toggles managed differently from long-lived kill-switches?
- How does an unbounded set of flags undermine your test strategy?
Red flag Leaving release toggles in the code after the feature is fully rolled out — they accumulate into untested, confusing dead branches and a combinatorial test nightmare.
source: Martin Fowler — Feature Toggles (Managing toggles) ↗
Commonly asked senior design occasional What metrics actually decide whether to promote or roll back a canary?
A canary is only as good as the signal you judge it by. The decision should be automated and metric-gated, comparing the canary against the baseline (the current stable version) over the same window — not against historical numbers, since traffic shifts.
Watch the user-facing signals: error rate, latency percentiles (p95/p99, not the mean), and request success/throughput, plus key business metrics where relevant (checkout completion, sign-ups). Saturation of the canary's resources is a secondary guard. If any guardrail metric on the canary is statistically worse than baseline beyond a threshold, auto-roll-back; otherwise ramp traffic up in stages.
The pitfalls to design around: too short a bake time (a slow leak or a cache that hasn't warmed won't show yet), too little canary traffic (no statistical power), and comparing against the wrong baseline.
What a strong answer covers
- Compare the canary against the concurrent baseline, over the same window — not against historical data.
- Gate on user-facing signals: error rate, latency percentiles, success/throughput, plus business KPIs.
- Automate the verdict: breach a guardrail → auto-roll-back; otherwise ramp up in stages.
- Give it enough bake time and traffic volume for the signal to be statistically meaningful.
Follow-ups they push on
- Why compare against a concurrent baseline rather than yesterday's numbers?
- What kind of bug would a 10-minute canary with 1% traffic still miss?
Red flag Promoting a canary too quickly or on too little traffic — slow leaks, cold caches, and rare-path errors don't surface in a short, low-volume bake, so the 'green' canary ships a latent bug.
source: Google SRE Workbook — Canarying Releases ↗

6.3 Security fundamentals 15

★ must-know Commonly asked senior concept common What is the root cause shared by all injection attacks, and why is parameterization the fix?
Every injection flaw — SQL, OS command, LDAP, NoSQL, XPath, even XSS — has the same root cause: untrusted data is interpreted as code because data and instructions travel on the same channel. The interpreter can't tell which bytes you meant as a value and which as syntax, so attacker input rewrites the command's structure.
Parameterization fixes this by *separating the channels*: the query/command template (the code) is sent and compiled independently of the parameters (the data), so user input is bound as a literal value and can never change the parsed structure. SELECT * FROM users WHERE id = ? with a bound parameter treats '; DROP TABLE as a harmless string.
This is why the generalized defense is 'keep code and data separate' — prepared statements for SQL, argument arrays (not shell strings) for OS commands, and context-aware encoding for output. Escaping/blocklisting is a fragile fallback, not the primary control.
What a strong answer covers
- Root cause of all injection: untrusted data is parsed as code because they share one channel.
- Parameterization sends template and data separately, so input binds as a literal and can't alter structure.
- Generalizes beyond SQL: arg arrays for OS commands, parameterized APIs for LDAP/NoSQL, encoding for output (XSS).
- Escaping/blocklists are fallbacks, not the fix — they miss encodings and edge cases.
Quick self-check
Why do parameterized queries prevent SQL injection?
Follow-ups they push on
- Why is OS command injection still possible even with a 'parameterized' shell call if you pass a single string?
- How does XSS fit the same 'data interpreted as code' model?
Red flag Thinking injection is a SQL-specific problem solved by a SQL-specific trick — it's a universal code/data-confusion flaw, and the universal fix is separating the two, not escaping characters.
source: OWASP — Injection Prevention Cheat Sheet ↗
Commonly asked junior concept very common What's the difference between authentication and authorization, and why must both be enforced server-side?
Authentication (authn) is *who are you?* — verifying identity (password, token, passkey). Authorization (authz) is *what are you allowed to do?* — checking that the verified identity has permission for this action/resource. Authn always comes first; authz decides what that authenticated identity may access.
Both must be enforced server-side because the client is fully under the attacker's control: hiding a button, disabling a form field, or checking a role in JavaScript stops only honest users. An attacker just crafts the HTTP request directly (curl, Burp), bypassing every front-end check. The browser is a convenience layer, never a trust boundary.
So the server must, on every request, verify the credential *and* re-check that this identity is permitted — front-end checks are UX, not security.
What a strong answer covers
- Authn = who you are (verify identity); authz = what you may do (verify permission). Authn precedes authz.
- The client is attacker-controlled — any check in JS/HTML can be bypassed by crafting the raw request.
- Enforce both on the server, every request; front-end checks are UX, not a trust boundary.
- Skipping the server-side authz re-check is exactly the Broken Access Control (#1) failure.
Quick self-check
An admin-only button is hidden in the UI for non-admins, but the /admin/delete endpoint has no server-side role check. What's true?
Follow-ups they push on
- How can an attacker bypass a front-end-only role check?
- Where do authentication failures (A07) differ from access-control failures (A01)?
Red flag Enforcing access control only in the UI (hidden buttons, disabled fields) — the server must re-verify, since the client can forge any request directly.
source: OWASP — Authorization Cheat Sheet ↗
Commonly asked junior concept very common What is SQL injection, and what is the *one* correct defense?
SQL injection is when untrusted input is concatenated into a query so the attacker can change its structure — e.g. ' OR '1'='1 to bypass a login, or '; DROP TABLE users;-- to destroy data.
The primary defense is parameterized queries / prepared statements: the SQL text and the data travel on separate channels, so input is always treated as a value, never as code. ORMs do this for you when used correctly.
Defense in depth adds least-privilege DB accounts and allow-list input validation — but escaping by hand is error-prone and not the real fix. The principle (separate code from data) generalizes to *all* injection: OS command, LDAP, NoSQL, etc.
Follow-ups they push on
- Why is manual escaping or a blocklist of bad characters not a reliable defense?
- How does an ORM still let you write injectable queries?
Red flag Saying 'sanitize/escape the input' as the primary fix — parameterization is the answer; ad-hoc escaping misses cases.
source: OWASP — SQL Injection Prevention Cheat Sheet ↗
Commonly asked mid concept common What is defense in depth, and why isn't input validation alone enough to stop XSS?
Defense in depth is layering independent controls so that no single failure is fatal — if one layer is bypassed, another still stands. No control is perfect, so you don't bet everything on one.
For XSS, input validation alone is insufficient because the danger depends on output context, not the input. A string that's harmless in an HTML body can break out inside a <script> block, an HTML attribute, a URL, or a CSS context — and validation at the input boundary can't know where the value will eventually be rendered. Worse, data arrives from many sources (DB, other services) that never passed your input filter.
So you layer: context-aware output encoding at the point of rendering (the primary defense), a strict Content-Security-Policy as a backstop that limits what injected script can do, HttpOnly cookies so stolen script can't read the session token, and input validation as one more (not the only) layer.
What a strong answer covers
- Defense in depth = independent layers; a single bypass shouldn't compromise the system.
- XSS safety depends on output context (HTML body vs attribute vs JS vs URL), which input validation can't anticipate.
- Primary defense is context-aware output encoding at render time; CSP is the backstop.
- Data also enters from sources that never hit your input filter (DB, other services), so input validation alone is incomplete.
Follow-ups they push on
- Why does the same string need different encoding in an HTML attribute vs a JavaScript context?
- What does a Content-Security-Policy actually restrict?
Red flag Treating input validation as the complete XSS fix — encoding must happen at output based on context, and CSP/HttpOnly provide the additional layers that catch what slips through.
source: OWASP — Cross Site Scripting Prevention Cheat Sheet ↗
Commonly asked mid concept common What is Security Misconfiguration (OWASP A02:2025), and give concrete examples.
Security Misconfiguration is risk introduced by how systems are set up rather than by code flaws — and it climbed to A02 in the 2025 Top 10, reflecting how common it is across the increasingly complex, configurable stacks we run.
Concrete examples: default or unchanged credentials; verbose error pages or stack traces leaking internals in production; unnecessary features/ports/services left enabled; an S3 bucket or admin console open to the public; missing security headers (HSTS, CSP); directory listing on; debug mode on in prod; overly permissive CORS.
The defense is a repeatable, hardened baseline: minimal install (remove what you don't use), secure defaults, infrastructure-as-code so every environment is configured identically and reviewably, automated configuration scanning, and segregated environments. It overlaps tightly with least privilege and supply-chain hygiene.
What a strong answer covers
- Risk from setup, not code — defaults, exposed services, leaked errors, missing headers.
- Rose to A02 in 2025 because modern stacks have huge configurable surface area.
- Examples: default creds, public buckets, debug mode in prod, verbose stack traces, permissive CORS.
- Fix with a hardened, minimal, repeatable baseline (IaC + config scanning + identical environments).
Follow-ups they push on
- Why does Infrastructure-as-Code reduce misconfiguration risk?
- Why are verbose production error messages a security problem, not just a UX one?
Red flag Treating misconfiguration as a one-time setup task — config drifts across environments and over time; without IaC and scanning, prod quietly diverges into an insecure state.
source: OWASP Top 10:2025 — A02 Security Misconfiguration ↗
Commonly asked mid trick occasional What is encoding (Base64), and why is it not a security control?
Encoding transforms data into another representation for safe transport or storage — Base64, URL-encoding, hex. It's a fully reversible, keyless, public algorithm: anyone can decode it with no secret. Its purpose is *compatibility* (e.g. putting binary in a text/JSON field), not secrecy.
That's the trap: Base64 *looks* scrambled, so people mistake it for protection. But dXNlcjpwYXNz decodes to user:pass in one trivial step — it provides zero confidentiality.
The three are distinct: encoding = reversible, no key, for compatibility; encryption = reversible *with a key*, for confidentiality; hashing = one-way, no key, for integrity/verification. Anytime someone says 'we Base64 the password before sending,' that's a misunderstanding — over HTTP it's plaintext; you need TLS (encryption) for confidentiality.
What a strong answer covers
- Encoding is reversible and keyless — its job is transport/compatibility, not secrecy.
- Base64 'looks' encrypted but decodes in one public step → zero confidentiality.
- Distinguish the trio: encoding (no key, compat) vs encryption (key, confidentiality) vs hashing (one-way, integrity).
- Base64-ing a credential adds no protection; only TLS/encryption provides confidentiality on the wire.
Quick self-check
Which statement about Base64 encoding is correct?
Follow-ups they push on
- Where is Base64 a legitimate, correct choice?
- Why is 'we Base64-encode the API key in the header' not securing anything?
Red flag Mistaking Base64 (or any encoding) for encryption — it's a reversible public transform with no key and provides no confidentiality whatsoever.
source: OWASP — Cryptographic Storage Cheat Sheet ↗
Commonly asked mid concept very common How do XSS and CSRF differ, and how do you defend against each?
XSS injects attacker-controlled JavaScript that runs in the victim's browser in *your* site's origin. Defenses: context-aware output encoding, a strict Content-Security-Policy, sanitize any HTML you must render, and HttpOnly cookies so stolen script cannot read the session token.
CSRF tricks an already-authenticated browser into firing an unwanted state-changing request (the browser auto-attaches the cookie). Defenses: anti-CSRF tokens, SameSite cookies, and verifying the Origin/Referer header.
The crisp distinction: XSS abuses the site's trust in user input; CSRF abuses the site's trust in the user's authenticated session.
Follow-ups they push on
- Why does SameSite=Lax mitigate most CSRF?
- Why don't CSRF tokens help against XSS?
Red flag Claiming CSRF tokens stop XSS — if you have XSS, the attacker's script can just read the CSRF token and forge a valid request.
source: OWASP — Cross Site Request Forgery (CSRF) ↗
Commonly asked mid concept very common Hashing vs encryption — what's the difference, and which do you use for passwords?
Encryption is reversible: with the key you can recover the plaintext. Use it for data you must read back — data in transit (TLS), secrets at rest.
Hashing is one-way: you cannot invert it; you can only re-hash a candidate and compare. Use it when you only ever need to *verify*, never recover — exactly the password case.
So passwords are hashed, not encrypted — if you can decrypt them, so can an attacker who steals your key. And not just any hash: use a slow, memory-hard password hash with a per-password salt.
Follow-ups they push on
- Where does encoding (Base64) fit — is it a security control?
- What's the difference between hashing and an HMAC/keyed hash?
Red flag Saying you 'encrypt passwords' — that is the wrong primitive; passwords should be hashed with a dedicated password hash so they are non-recoverable.
source: OWASP — Password Storage Cheat Sheet ↗
Commonly asked mid trick occasional A login endpoint returns 'user not found' for unknown emails and 'wrong password' for known ones. What's wrong?
It is a user-enumeration vulnerability. The two distinct messages let an attacker probe which emails are registered, building a target list for credential stuffing, phishing, or password spraying.
Fix: return a single generic message — 'invalid email or password' — for both cases, and keep the response *timing* uniform (still run a dummy password hash when the user doesn't exist) so the attacker can't distinguish via latency either. The same care applies to signup ('email already in use') and password-reset flows.
This ties to OWASP A07 (Authentication Failures).
Follow-ups they push on
- How could an attacker still enumerate users via response timing, and how do you prevent that?
- How does this interact with the password-reset 'we sent an email if it exists' pattern?
Red flag Fixing only the message text but leaving a timing side-channel (fast 'not found' vs slow bcrypt compare) that still leaks which accounts exist.
source: OWASP — Authentication Cheat Sheet (account enumeration) ↗
Commonly asked senior concept common What sits at #1 of the OWASP Top 10:2025, and name a couple of categories that are new or changed this edition.
A01: Broken Access Control is #1 — and in the 2025 edition it now absorbs Server-Side Request Forgery (SSRF). It means users acting outside their intended permissions: missing authorization checks, IDOR, privilege escalation.
What is new/notable in 2025:
- A03: Software Supply Chain Failures is new and surged into the top 3 — broadened from the old 'Vulnerable and Outdated Components' to the whole dependency/build ecosystem.
- A10: Mishandling of Exceptional Conditions is brand new — improper error handling, failing open, logic errors on abnormal input.
- The full order: A01 Broken Access Control, A02 Security Misconfiguration, A03 Software Supply Chain Failures, A04 Cryptographic Failures, A05 Injection, A06 Insecure Design, A07 Authentication Failures, A08 Software/Data Integrity Failures, A09 Logging & Alerting Failures, A10 Mishandling of Exceptional Conditions.
Follow-ups they push on
- Why did SSRF get folded into Broken Access Control?
- What does 'failing open' mean under A10, and why is it dangerous?
Red flag Quoting the 2021 list as current (e.g. putting Injection at #3 or naming 'Vulnerable and Outdated Components') — in 2025 Injection is A05 and supply-chain is its own A03.
source: OWASP Top 10:2025 ↗
Commonly asked senior concept common What is an IDOR, and why does Broken Access Control sit at #1 of the OWASP Top 10:2025?
An IDOR (Insecure Direct Object Reference) is the canonical Broken Access Control bug: an endpoint exposes a reference to an object — /api/orders/1043 — and the server returns it based on the URL alone, without checking that *this* user is allowed to see *that* object. Change 1043 to 1044 and you read someone else's order.
It's #1 in OWASP Top 10:2025 (as it was in 2021) because authorization is per-request, per-object logic that's easy to forget on some path, hard for scanners to find, and devastating when wrong — it's the most commonly found weakness. The 2025 edition also folded SSRF into this category.
The fix is to enforce authorization server-side on every request, checking ownership/role against the authenticated identity — never trusting a client-supplied ID, never relying on the object reference being unguessable, and denying by default. Using unpredictable IDs (UUIDs) is hardening, not a substitute for the check.
What a strong answer covers
- IDOR: the server returns an object from a client-supplied reference without verifying the user is authorized for it.
- #1 because authz is per-object, easy to miss, hard to scan for, and catastrophic — the most prevalent weakness.
- Enforce authorization server-side on every request, deny by default, check ownership against the session identity.
- Unguessable IDs (UUIDs) are hardening — not a replacement for the access-control check; SSRF now lives in this category (2025).
Follow-ups they push on
- Why is switching from sequential IDs to UUIDs not a real fix for IDOR?
- Why was SSRF moved under Broken Access Control in 2025?
Red flag Relying on 'unguessable' object IDs or hiding the endpoint instead of performing a real per-request authorization check — security by obscurity, not access control.
source: OWASP Top 10:2025 — A01 Broken Access Control ↗
Commonly asked senior concept occasional Why store a session token in an HttpOnly, Secure, SameSite cookie rather than localStorage?
localStorage is fully readable by any JavaScript on the page — so a single XSS flaw lets an attacker's script exfiltrate the token instantly. A cookie marked HttpOnly is invisible to JavaScript: even with XSS, the script can't read the token to steal it.
The other flags close the remaining gaps: Secure sends the cookie only over HTTPS (no plaintext interception), and SameSite (Lax/Strict) stops the browser from auto-attaching it on cross-site requests, which mitigates CSRF — the attack that cookie-based auth otherwise invites.
The tradeoff: HttpOnly cookies are auto-sent by the browser, so you take on CSRF risk and must defend it (SameSite + anti-CSRF tokens). localStorage avoids CSRF but trades it for far worse XSS token theft. The consensus is HttpOnly cookies with CSRF defenses, because XSS token exfiltration is the more dangerous failure.
What a strong answer covers
- localStorage is readable by any JS — one XSS = instant token theft.
- HttpOnly hides the cookie from JavaScript, so XSS can't read/exfiltrate it.
- Secure = HTTPS-only; SameSite blocks cross-site auto-send, mitigating CSRF.
- Cookies trade XSS-theft risk for CSRF risk — so pair them with SameSite + anti-CSRF tokens.
Follow-ups they push on
- If HttpOnly cookies are auto-sent, what new attack do you now have to defend, and how?
- Can an attacker with XSS still abuse an HttpOnly session cookie even without reading it?
Red flag Storing JWTs/session tokens in localStorage 'for convenience' — it's directly readable by any injected script, turning any XSS into full account takeover.
source: OWASP — Session Management Cheat Sheet ↗
Commonly asked senior concept common Why is SHA-256 a bad choice for storing passwords, and what's the salt for?
General-purpose hashes like SHA-256 are *designed to be fast* — which is exactly wrong for passwords. An attacker with the hash file can compute billions of guesses per second on a GPU.
Use a slow, memory-hard password hash: OWASP recommends Argon2id (then scrypt; bcrypt only for legacy). Their tunable work factor keeps verification fast for you but brute force expensive for attackers.
The salt is a unique random value per password, stored alongside the hash. It ensures two users with the same password get different hashes and defeats precomputed rainbow tables — the attacker must crack each password individually. (A site-wide secret pepper can be layered on top.)
Follow-ups they push on
- Why does a salt have to be unique per password rather than one site-wide value?
- What is a pepper and how does it differ from a salt?
Red flag Using a fast hash (MD5/SHA-1/SHA-256) for passwords, or reusing one salt for everyone — both leave you open to rainbow-table and GPU attacks.
source: OWASP — Password Storage Cheat Sheet ↗
Commonly asked senior concept common What is the principle of least privilege, and how does it apply to secrets management?
Least privilege: every user, service, and credential gets only the permissions it needs to do its job — no more, no less. It shrinks the blast radius when something is compromised.
Applied to secrets:
- Don't hardcode secrets in source or commit them to git; store them in a secrets manager (Vault, AWS/GCP Secrets Manager) or injected env vars.
- Scope each secret narrowly — a service's DB credential can touch only its own schema, not everything.
- Rotate secrets, and prefer short-lived/dynamic credentials over long-lived static keys.
- Audit access so a leaked key is detectable and revocable.
This maps to OWASP A02 (Security Misconfiguration) and A08 (Integrity Failures).
Follow-ups they push on
- Why is a leaked secret in git history not fixed by just deleting the file in a new commit?
- How do short-lived/dynamic credentials reduce risk versus static keys?
Red flag Granting broad, permanent admin credentials 'to keep things simple' — it maximizes blast radius and violates least privilege.
source: OWASP — Secrets Management Cheat Sheet ↗
Commonly asked senior concept common What is Software Supply Chain risk (OWASP A03:2025), and how do you reduce it?
Your app is mostly code you didn't write — third-party packages, their transitive deps, base images, and the build/CI pipeline itself. A03:2025 covers compromises anywhere in that chain: a malicious or vulnerable dependency, a typosquatted package, a poisoned build step, or a tampered artifact.
Mitigations:
- Pin and lock dependencies (lockfiles, hashes) so builds are reproducible.
- Scan deps for known CVEs (SCA tools) and patch promptly.
- Generate an SBOM so you know what you ship.
- Verify provenance / sign artifacts (e.g. Sigstore) and protect CI credentials.
- Minimize and pin base images.
This was broadened in 2025 from the older 'Vulnerable and Outdated Components' to the whole ecosystem.
Follow-ups they push on
- What is an SBOM and why did regulators start requiring it?
- How would a typosquatted npm package actually compromise you?
Red flag Treating supply-chain security as just 'keep dependencies updated' — it also covers the build pipeline, artifact provenance, and transitive deps.
source: OWASP Top 10:2025 — Introduction (A03 Software Supply Chain Failures) ↗

6.4 Testing 12

Commonly asked junior concept very common Define unit, integration, and end-to-end tests — what does each actually verify?
Unit tests exercise the smallest testable piece — one function/class — in isolation, with collaborators faked. They verify *this unit's logic is correct*. Fast and deterministic.
Integration tests verify that units talk to a real collaborator correctly — your code against an actual database, queue, or HTTP API. They catch interface/wiring bugs a unit test mocks away.
End-to-end tests drive the fully assembled system the way a user would (through the UI or public API) and verify a whole journey works. Slowest, most realistic, most brittle.
The trade is realism vs. speed/stability: unit = fast + narrow, e2e = realistic + fragile.
Follow-ups they push on
- Why can a suite of all-green unit tests still let a broken feature ship?
- What's the difference between an integration test and a component test?
Red flag Calling a test that mocks the database an 'integration test' — if every dependency is faked it is still a unit test.
source: Martin Fowler — The Practical Test Pyramid ↗
Commonly asked junior concept common What is the Arrange-Act-Assert pattern, and what makes a test maintainable?
Arrange-Act-Assert (AAA) structures a test into three clear phases: Arrange the inputs and preconditions, Act by invoking the one thing under test, then Assert on the outcome. Keeping these visually separate makes a test read as a tiny spec of the behavior.
Maintainable tests share a few traits: they test one behavior (so a failure points at one cause), assert on observable behavior rather than implementation detail (so a refactor doesn't break them), are deterministic and isolated (no shared state, no order dependence), and have descriptive names that state the scenario and expected result. A good test is also fast.
The through-line: a test should fail for exactly one reason and tell you what that reason is. Tests are production code — DRY-ish helpers are fine, but readability beats cleverness.
What a strong answer covers
- AAA: Arrange preconditions → Act on the unit → Assert the outcome; keep the phases visibly separate.
- Test one behavior per test so a failure localizes to a single cause.
- Assert on observable behavior, not internals, so refactors don't break green tests.
- Be deterministic, isolated, and descriptively named — a test should fail for exactly one reason.
Follow-ups they push on
- Why does asserting on private implementation detail make tests brittle?
- What's the 'one assertion per test' guideline really getting at?
Red flag Writing tests that assert on internal calls/structure rather than observable behavior — they break on every refactor even when the behavior is unchanged, training people to delete tests.
source: Martin Fowler — Given-When-Then ↗
Commonly asked mid concept very common What is the test pyramid, and why more unit tests than end-to-end tests?
The pyramid is a guideline for the *shape* of your test suite: a wide base of fast, cheap unit tests; fewer integration tests in the middle; and a thin top of end-to-end tests through the whole system/UI.
Why that shape: as you go up, tests get slower, more brittle, and harder to pin a failure to a cause. Unit tests run in milliseconds and localize bugs precisely; e2e tests run for minutes, flake on timing, and only tell you *something* broke. So you push as much coverage as low as possible and reserve e2e for a few critical user journeys.
The inverted shape — mostly e2e — is the ice-cream cone anti-pattern: slow, flaky, expensive to maintain.
Follow-ups they push on
- What does the 'ice-cream cone' look like and why is it painful?
- Where do contract tests fit in this picture?
Red flag Treating the pyramid as exact ratios or gospel rather than a heuristic — the real point is fast/cheap/localized at the bottom, slow/brittle at the top.
source: Martin Fowler — The Practical Test Pyramid ↗
Commonly asked mid concept common Walk me through the TDD cycle. What does it actually buy you?
TDD is red-green-refactor:
1. Red — write a small failing test for the next bit of behavior.
2. Green — write the minimum code to make it pass.
3. Refactor — clean up the code (and tests) now that they are green, keeping the bar passing.
Repeat in tiny increments. What it buys you: tests exist by construction (not bolted on later), the code is *designed to be testable* (so it tends toward decoupling and clear interfaces), and you get a fast feedback loop plus a regression safety net that lets you refactor fearlessly. It also forces you to define 'done' before coding.
Follow-ups they push on
- Why is the refactor step the part people skip, and what happens when they do?
- When is strict TDD a poor fit?
Red flag Describing TDD as 'write tests after the code' — the whole point is the test comes *first* and drives the design.
source: Martin Fowler — Test Driven Development ↗
Commonly asked senior concept occasional Should unit tests hit a real database? When is an in-memory or test-container DB the right call?
By definition, a unit test shouldn't touch a real DB — that makes it slow and non-deterministic. So you mock the data layer for unit tests. But mocking the DB means you never verify your *actual* SQL, migrations, or ORM mappings, and that's where real bugs hide.
So the pragmatic answer is layered: unit-test pure logic with the DB doubled, then write integration tests against a real database engine for the queries themselves. The mistake to avoid is using a *different* engine in tests than in production — e.g. SQLite or an in-memory fake standing in for Postgres. SQL dialects, constraint behavior, and types differ, so tests can pass against the fake and fail against prod (or vice versa).
Modern practice is Testcontainers: spin up the *real* database (same engine/version as prod) in a throwaway container for integration tests. You get fidelity without polluting a shared environment.
What a strong answer covers
- A true unit test doesn't hit a DB — mock the data layer for logic; it's slow/non-deterministic otherwise.
- But mocks never validate real SQL, migrations, or ORM mappings — cover those with integration tests.
- Don't substitute a different engine (SQLite for Postgres) — dialect/constraint differences make tests lie.
- Use Testcontainers to run the real prod-version DB in a disposable container for integration tests.
Follow-ups they push on
- Why can an in-memory SQLite stand-in for Postgres give false confidence?
- What belongs in a unit test vs an integration test for a repository class?
Red flag Testing against a different DB engine than production (in-memory fake for the real thing) — dialect and constraint mismatches let bugs pass tests and break in prod.
source: Testcontainers — Database integration testing ↗
Commonly asked senior concept occasional What is mutation testing, and how does it reveal that high code coverage can be misleading?
Line/branch coverage tells you code *ran* during tests, not that anything was *checked*. Mutation testing measures the latter: a tool makes small deliberate changes (mutants) to your code — flip > to >=, replace + with -, negate a condition, return null — then reruns your tests against each mutant.
If a mutant makes a test fail, it's killed (good — your tests detected the change). If all tests still pass, the mutant survived — meaning your suite executed that code but never asserted anything that the change would break. The mutation score (killed / total) is a far better quality signal than coverage.
This exposes the assertion-free-coverage problem directly: you can have 100% line coverage and a low mutation score, because tests call the code but verify nothing meaningful. The cost is compute — running the suite once per mutant is expensive — so teams often run it on critical modules rather than the whole repo.
What a strong answer covers
- Coverage proves code executed; mutation testing proves your assertions actually catch changes.
- It injects small bugs (mutants); a killed mutant = tests detected it, a survivor = a gap in assertions.
- Mutation score (killed/total) is a stronger quality metric than line coverage.
- Directly exposes assertion-free coverage: 100% lines but mutants survive = tests that check nothing.
- Cost is high (rerun suite per mutant), so target critical modules rather than the whole codebase.
Quick self-check
A mutant 'survives' a mutation test run. What does that tell you?
Follow-ups they push on
- How can you have 100% line coverage and a 40% mutation score?
- What is an 'equivalent mutant' and why does it muddy the score?
Red flag Trusting coverage as a quality bar — mutation testing routinely shows high-coverage suites with surviving mutants, i.e. tests that run code without asserting on its behavior.
source: PIT (Pitest) — Mutation testing ↗
Commonly asked senior design common Your e2e suite takes 45 minutes and people skip it. How do you make the test strategy sustainable?
A 45-minute, ignored e2e suite is usually the ice-cream-cone anti-pattern: too much testing pushed up to the slow, brittle e2e layer. The fix is to rebalance toward the test pyramid — push coverage down to where it's fast and reliable.
Concretely: for each slow e2e test, ask what it really verifies and move that assertion to the lowest layer that can — pure logic to unit tests, service-boundary behavior to integration/contract tests, and reserve e2e for a handful of critical user journeys (login, checkout). Parallelize what remains across CI runners, and split the suite so fast tests gate every PR while the full e2e set runs on a schedule or pre-deploy.
Separately, hunt flakiness — a slow suite people skip is often also a flaky one they've stopped trusting. Quarantine and fix flaky tests rather than retrying. The goal is a fast feedback loop developers actually run, backed by a thin, stable e2e layer.
What a strong answer covers
- A bloated e2e suite is the ice-cream cone — rebalance toward the pyramid (fast, low-level tests).
- Move each assertion to the lowest layer that can verify it; keep e2e for a few critical journeys only.
- Parallelize and tier the suite: fast tests gate PRs, full e2e runs pre-deploy/scheduled.
- Attack flakiness too — skipped suites are usually distrusted (flaky) ones; quarantine and fix, don't retry.
Follow-ups they push on
- How do you decide which assertions can move down from e2e to unit/integration?
- Why is tiering the suite (PR gate vs nightly) better than running everything on every push?
Red flag Speeding up an ice-cream-cone suite by only adding retries and more parallelism — without rebalancing toward the pyramid you still have a slow, brittle suite developers route around.
source: Martin Fowler — The Practical Test Pyramid ↗
Commonly asked senior concept common What's the difference between a mock and a stub, and when do you reach for each?
Both are test doubles that stand in for a real dependency, but they answer different questions.
A stub provides canned return values so the code under test can run — it is about *state*: 'when asked, return this'. You assert on the output your code produces.
A mock also has pre-programmed responses but additionally *verifies the interaction* — it is about *behavior*: 'was sendEmail called once, with these args?'. You assert on the mock itself.
Rule of thumb: stub queries (reads), mock commands (side effects you care happened). Over-mocking couples tests to implementation detail and makes refactoring painful.
Follow-ups they push on
- What's the difference between a fake and a stub?
- Why can heavy mocking make tests pass while the real integration is broken?
Red flag Mocking everything, including pure logic — the test then asserts on internal calls and breaks on any refactor even when behavior is unchanged.
source: Martin Fowler — Mocks Aren't Stubs ↗
Commonly asked senior trick common Why isn't 100% code coverage the goal? Can you have high coverage and still be poorly tested?
Coverage measures which lines *executed* during tests — not whether you *asserted* anything meaningful about them. You can hit 100% with tests that call code and check nothing, or that never exercise the edge cases and error paths that actually break in production.
Chasing 100% also has diminishing returns: the last few percent are often trivial getters or unreachable branches, and the effort is better spent elsewhere. Worse, it incentivizes shallow tests written to satisfy a number.
Better: treat coverage as a *diagnostic for gaps* (what is entirely untested?), aim for a sensible threshold, and judge quality by whether tests assert behavior and cover the risky paths — not by a single percentage.
Follow-ups they push on
- What is mutation testing and how does it expose 'assertion-free' coverage?
- Which kinds of code genuinely don't need unit tests?
Red flag Treating a coverage percentage as a quality metric — high coverage with weak/absent assertions is theater.
source: Martin Fowler — Test Coverage ↗
Commonly asked senior debug common A test in your CI passes locally but fails ~10% of the time in the pipeline. How do you approach it?
That is a flaky test — non-deterministic. First, do not 'fix' it by retrying or deleting; quarantine it so it stops eroding trust in the suite, then root-cause it.
Common causes to check:
- Async/timing: a fixed sleep instead of waiting on a real condition; race conditions.
- Shared state / test ordering: tests that leak state between runs or assume order.
- Time and randomness: real now(), time zones, unseeded random.
- External dependencies / network that are slow or unavailable in CI.
- Resource contention in the parallel CI runner that doesn't happen locally.
Fix the determinism (inject the clock, isolate state, wait on conditions, stub the network). Flaky tests are dangerous because people start ignoring red builds.
Follow-ups they push on
- Why is auto-retrying flaky tests a trap?
- How would you reproduce a CI-only failure locally?
Red flag Masking flakiness with blanket retries — it hides real race conditions and trains the team to ignore failing tests.
source: Martin Fowler — Eradicating Non-Determinism in Tests ↗
Commonly asked senior concept occasional What's the difference between sociable and solitary unit tests, and the 'London vs Detroit' (mockist vs classicist) schools?
A solitary unit test isolates the unit by replacing *all* its collaborators with test doubles; a sociable unit test lets the unit use its real collaborators (as long as they're fast and deterministic), testing them together.
This maps to two testing schools. The mockist / London school favors solitary tests with mocks for every dependency, verifying *interactions* — it gives precise failure localization and tests units in true isolation, but couples tests to the call structure, so refactors that preserve behavior can still break tests. The classicist / Detroit (Chicago) school favors sociable tests, mocking only awkward dependencies (network, clock, DB), and asserting on *resulting state* — tests are more refactor-resilient and catch integration bugs between collaborators, but a failure may implicate several units.
Neither is 'correct'; the tradeoff is isolation/precision vs. refactor-resilience/realism, and most teams blend them.
What a strong answer covers
- Solitary = all collaborators doubled; sociable = uses real collaborators where practical.
- Mockist/London: mock everything, verify interactions — precise localization, but couples tests to call structure.
- Classicist/Detroit: mock only awkward deps, assert on state — refactor-resilient, catches inter-unit bugs.
- The tradeoff is isolation/precision vs. realism/refactor-resilience; teams usually mix both.
Follow-ups they push on
- Why can a mockist test pass while the real integration is broken?
- Which approach makes a behavior-preserving refactor less likely to break tests, and why?
Red flag Treating one school as universally right — all-mockist suites become refactor-fragile interaction tests, while all-sociable suites can lose failure localization.
source: Martin Fowler — Unit Test (Solitary vs Sociable) ↗
Commonly asked senior concept occasional What is a contract test, and what problem does it solve that unit and e2e tests don't?
When service A calls service B, A's unit tests stub B — but the stub encodes A's *assumption* of B's API, which silently rots when B changes. Full e2e tests catch the mismatch but are slow, flaky, and need every service deployed together.
Contract testing (e.g. consumer-driven contracts / Pact) fills the gap. The consumer (A) defines the requests it makes and the responses it expects as a contract; that contract is then verified against the provider (B) independently. If B's change would violate A's expectations, B's pipeline fails — *before* anything is deployed together.
The payoff: you get confidence that two services are compatible at their boundary with the speed and independence of unit tests — no shared environment, each side tested in its own pipeline. It's how you keep a microservices fleet integrable without a giant brittle e2e suite.
What a strong answer covers
- Stubs of a remote service encode assumptions that drift as the provider changes — unit tests won't notice.
- A contract captures the consumer's expected requests/responses and is verified against the provider separately.
- It catches integration breakage before deploy, without a shared e2e environment.
- Gives boundary-compatibility confidence with the speed/isolation of unit tests — key for microservices.
Follow-ups they push on
- What does 'consumer-driven' add over the provider just publishing an OpenAPI spec?
- Why don't all-green unit tests on both services guarantee they integrate?
Red flag Assuming green unit tests on both sides mean the services integrate — the consumer's stub can diverge from the provider's real behavior, which only contract or integration tests catch.
source: Martin Fowler — Contract Testing ↗

6.5 Version control intricacies 12

Commonly asked junior concept common Walk me through resolving a merge conflict. What is Git actually asking you to do?
A conflict happens when two branches changed the *same lines* (or one edited what the other deleted) and Git can't auto-pick a winner. It pauses and marks the file with <<<<<<< HEAD (your side), =======, and >>>>>>> other-branch (incoming side).
To resolve: open each conflicted file, decide the correct final content (it is rarely 'pick one blindly' — often you keep parts of both), delete the conflict markers, then git add the file to mark it resolved and git commit (or git rebase --continue).
Good practice: understand *why* both sides changed it, run the tests after resolving, and keep branches short-lived so conflicts stay small. git merge --abort backs out if you want to start over.
Follow-ups they push on
- How does keeping PRs small reduce conflict pain?
- What is `git rerere` and when does it help?
Red flag Blindly accepting one side ('keep mine'/'keep theirs') to make the conflict go away — that silently drops the other side's legitimate change.
source: Atlassian — Merge conflicts ↗
Commonly asked mid concept common What makes a good commit message and a good atomic commit, and why does it matter downstream?
An atomic commit captures one logical change — it does exactly one thing and leaves the codebase building/passing. A good message has a concise imperative summary line ('Add retry to S3 upload', ~50 chars), a blank line, then a body explaining the why (and any tradeoffs), not the *what* — the diff already shows what changed.
Why it matters is entirely downstream: clean atomic commits make git bisect land on a tiny diff, make git revert undo exactly one change without collateral, make code review comprehensible commit-by-commit, and make git blame/log a usable history rather than a wall of 'misc fixes'. A commit that bundles a refactor, a feature, and a formatting sweep is impossible to bisect, revert, or review cleanly.
The summary is for scanning git log; the body is for the engineer (often future-you) who needs to understand *why* a line exists.
What a strong answer covers
- Atomic = one logical change that builds/passes on its own.
- Message = imperative summary line + blank line + body explaining why, not what.
- Pays off in bisect (tiny diff), revert (no collateral), review (commit-by-commit), and blame/log.
- Bundled commits (feature + refactor + reformat) are un-bisectable, un-revertable, and unreviewable.
Follow-ups they push on
- Why explain the *why* in the body when the diff already shows the *what*?
- How does a clean commit history make `git revert` safer than a bundled one?
Red flag Bundling unrelated changes into one commit (and writing 'fixes'/'updates' as the message) — it destroys the downstream value of bisect, revert, blame, and review.
source: Git — Commit Guidelines (Pro Git book) ↗
Commonly asked mid concept common What's the point of a pull request beyond merging code? Why squash-merge vs merge-commit vs rebase-merge?
A pull request is the collaboration unit, not just a merge button: it's where review, CI gates, discussion, and an audit trail of *why* a change was made all attach to a proposed change before it lands. The merge is the smallest part.
The three merge modes shape your main history differently. Merge commit preserves every commit on the branch plus a merge node — full history, but main gets noisy with WIP commits. Squash-merge collapses the whole PR into one commit on main — clean, atomic, one-PR-one-commit history that's easy to bisect/revert, at the cost of losing the branch's intermediate commits. Rebase-merge replays the branch's commits linearly onto main with no merge node — linear history that keeps individual commits, but rewrites their hashes.
Many teams default to squash-merge for a tidy, revertable trunk; rebase-merge when individual commits are each meaningful; merge-commit when preserving exact branch topology matters.
What a strong answer covers
- A PR bundles review, CI gates, discussion, and audit trail — merging is its smallest function.
- Merge commit: keeps all branch commits + a merge node — full history, noisier trunk.
- Squash-merge: one commit per PR — clean, atomic, easy to bisect/revert; loses intermediate commits.
- Rebase-merge: linear history keeping individual commits, but rewrites their hashes (no merge node).
Follow-ups they push on
- Why does squash-merge make `git revert` of a whole feature trivial?
- When would preserving the branch's individual commits (rebase/merge) be worth the noise?
Red flag Thinking a PR is just a merge mechanism — its real value is the review/CI/discussion gate; and picking a merge strategy without considering how bisect/revert/readability of `main` are affected.
source: GitHub Docs — About merge methods on GitHub ↗
Commonly asked mid concept very common Rebase vs merge — what's the difference, and when should you NOT rebase?
Merge ties two branches together with a merge commit, preserving the true, non-linear history (and the context of when work diverged).
Rebase replays your commits on top of the target branch, producing a *linear* history as if you'd branched from the latest main — cleaner log, no merge bubbles. But it rewrites commit hashes.
The golden rule of rebasing: never rebase commits that exist outside your local repo / that others have based work on. Rewriting a shared/public branch changes its history out from under teammates, causing divergence and painful re-syncs. Rebase your *private* feature branch onto main before opening the PR; use merge for integrating shared branches.
Follow-ups they push on
- What does `git pull --rebase` do, and why might a team standardize on it?
- If you must change a pushed branch, what makes force-pushing 'safer'?
Red flag Rebasing a branch other people have already pulled — it rewrites shared history and forces everyone into messy recovery.
source: Atlassian — Merging vs Rebasing (the golden rule) ↗
Commonly asked mid concept occasional What does git cherry-pick do, and what's a legitimate use case?
git cherry-pick <sha> applies the *changes introduced by one specific commit* onto your current branch, creating a new commit (new hash) with the same diff.
Legitimate uses: backporting a hotfix from main onto a release/maintenance branch without dragging along everything else; recovering one commit from an abandoned branch; pulling a single fix forward.
Use it sparingly: cherry-picking the same change into multiple branches duplicates commits, which can cause confusing 'phantom' conflicts later when the branches eventually merge. Prefer normal merge/rebase flow when you actually want all of a branch.
Follow-ups they push on
- Why can repeated cherry-picks create duplicate-commit merge conflicts down the line?
- How is cherry-pick different from a partial merge?
Red flag Using cherry-pick as a routine integration strategy — it scatters duplicated commits and breaks the clean ancestry that merge/rebase preserve.
source: Atlassian — git cherry-pick ↗
Commonly asked mid concept occasional A bug appeared somewhere in the last 200 commits. How do you find which commit introduced it?
Use git bisect — a binary search over history. You mark a known-bad commit and a known-good one; Git checks out the midpoint, you test it and mark good or bad, and it halves the range each step. Over ~200 commits that is roughly 8 tests instead of 200.
If you can script the check (a test that exits non-zero on the bug), git bisect run <script> automates the whole thing. When done, git bisect reset returns you to where you started, and you have the exact offending SHA — then git show it to understand the change.
This is why small, atomic commits matter: bisect lands you on a tiny diff, not a 2,000-line mega-commit.
Follow-ups they push on
- Why do large, mixed-purpose commits make bisect less useful?
- How does `git bisect run` automate the search?
Red flag Manually checking out commits at random instead of bisecting — it is O(n) guessing versus O(log n) binary search.
source: Git — git-bisect documentation ↗
Commonly asked mid concept common What's the difference between git reset, git revert, and git checkout/restore?
They operate at different levels and have very different safety profiles.
git revert <sha> creates a *new* commit that undoes the changes of an earlier one — history is preserved and moves forward. It's the safe way to undo a commit that's already been pushed/shared, because it doesn't rewrite history.
git reset moves the current branch pointer to another commit, rewriting history. --soft keeps changes staged, --mixed (default) unstages them, --hard discards working-tree changes too. Reset is for local, unpushed history — using it on shared history is the rebase-style hazard.
git checkout/git restore (modern Git split checkout's jobs into switch for branches and restore for files) operate on the working tree / specific files — discarding local file changes or restoring a file to a given version, without moving the branch.
Rule of thumb: undo *public* history with revert; rewrite *local* history with reset; restore *files/working tree* with restore.
What a strong answer covers
- revert = new commit that undoes another; safe on pushed/shared history (no rewrite).
- reset = move the branch ref, rewriting history; --soft/--mixed/--hard differ in what they keep. Local-only.
- restore/checkout = operate on files/working tree, not the branch pointer.
- Rule: revert public, reset local, restore files.
Quick self-check
A buggy commit is already pushed and others have pulled it. What's the safe way to undo it?
Follow-ups they push on
- Why is revert the correct tool for undoing a commit on a shared branch?
- What exactly do --soft, --mixed, and --hard each preserve?
Red flag Using `git reset --hard` to undo a commit that's already pushed — it rewrites shared history (and `--hard` also destroys uncommitted work); use `revert` for anything public.
source: Atlassian — Resetting, checking out & reverting ↗
Commonly asked senior concept common Compare trunk-based development, GitHub Flow, and Git Flow — when does each fit?
Trunk-based: everyone commits to (or merges tiny, short-lived branches into) main at least daily; unfinished work hides behind feature flags. Optimizes for continuous integration and fast delivery; demands strong tests and CI. The modern default for teams shipping continuously.
GitHub Flow: one long-lived main plus short feature branches via pull request; merge and deploy on approval. A lightweight middle ground, great for web apps with continuous deployment.
Git Flow: heavyweight model with long-lived develop, release, and hotfix branches alongside main. Suits versioned/installed software with scheduled releases — but for fast web delivery its long-lived branches cause painful merges and slow integration.
The trend is toward trunk-based; Git Flow is increasingly considered overkill outside release-train products.
Follow-ups they push on
- Why do long-lived branches hurt continuous integration?
- How do feature flags make trunk-based development possible?
Red flag Defaulting to Git Flow for a continuously-deployed web app — its long-lived branches fight CI and cause merge hell.
source: Atlassian — Trunk-based development ↗
Commonly asked senior concept occasional When is force-pushing acceptable, and what makes --force-with-lease safer than --force?
Force-pushing is needed after you rewrite history on a branch (rebase, amend, interactive-rebase cleanup) — the remote ref no longer fast-forwards, so a normal push is rejected. It's acceptable on a branch you own that others aren't building on: typically your own feature/PR branch. It is *not* acceptable on shared branches like main.
Plain git push --force overwrites the remote ref unconditionally — if a teammate pushed in the meantime, you silently destroy their commits. git push --force-with-lease adds a safety check: it only overwrites if the remote is still at the commit you *last saw*. If someone else pushed since your last fetch, the lease check fails and the push is rejected, so you can't clobber work you didn't know about.
So: rewrite only unshared history, and when you must force-push, use --force-with-lease so a surprise upstream change aborts the push instead of being overwritten.
What a strong answer covers
- Force-push is required after history rewrites (rebase/amend); only OK on branches you own, never shared main.
- --force overwrites the remote unconditionally — it can silently erase teammates' new commits.
- --force-with-lease only pushes if the remote still matches what you last fetched — else it aborts.
- The lease turns 'I might clobber unseen work' into a safe failure you can investigate.
Follow-ups they push on
- How can --force-with-lease still bite you if a tool runs `git fetch` in the background?
- Why does an interactive rebase on a PR branch require a force-push at all?
Red flag Using plain `--force` on a branch others might have pushed to — it overwrites their commits with no warning; `--force-with-lease` aborts instead, so it should be the default.
source: Atlassian — git push (force pushing) ↗
Commonly asked senior trick occasional You pushed a commit with a leaked API key. Is deleting the file in a new commit enough? How do you fix it?
No — a new commit that removes the file leaves the secret in history; anyone can git log/git checkout the old commit and read it. The secret is effectively public the moment it was pushed.
Correct response, in order:
1. Rotate/revoke the key immediately — assume it is already compromised. This is the only step that truly protects you.
2. Purge it from history (git filter-repo, or BFG Repo-Cleaner) and force-push, coordinating with the team since it rewrites shared history.
3. Add a pre-commit/secret-scanning hook and a .gitignore so it can't recur.
The key insight: history rewriting is cleanup, but rotation is the real fix — caches, forks, and clones may still hold the old blob.
Follow-ups they push on
- Why is rotating the secret more important than scrubbing it from git?
- Why does rewriting history here require a coordinated force-push?
Red flag Thinking 'I deleted the file and committed, we're fine' — the secret persists in history and must be rotated regardless.
source: GitHub Docs — Removing sensitive data from a repository ↗
Commonly asked senior concept occasional You ran a bad reset/rebase and 'lost' commits that aren't in any branch. How do you get them back?
Use git reflog. The reflog records where HEAD (and each branch ref) has pointed over time — every commit, checkout, reset, rebase, and merge — even commits no branch points at anymore. A reset --hard or a botched rebase doesn't delete the old commits; it just moves the ref, leaving the originals 'dangling' but still reachable via reflog.
Recovery: git reflog to find the SHA from *before* the bad operation (e.g. HEAD@{3}), then git reset --hard <sha> to move the branch back, or git checkout -b recover <sha> / git cherry-pick <sha> to salvage specific commits.
The key insight: in Git, work you've committed is almost never truly lost — those objects survive until garbage collection (default ~30–90 days) and the reflog is the map to them. (Uncommitted working-tree changes, by contrast, *are* gone — reflog only tracks committed history.)
What a strong answer covers
- git reflog logs every position of HEAD/branch refs — including commits no branch references.
- reset --hard/rebase move refs, leaving old commits dangling but recoverable, not deleted.
- Recover with git reset --hard <sha> or git checkout -b/cherry-pick the SHA found in the reflog.
- Committed work survives until GC (~30–90 days); only uncommitted changes are truly unrecoverable.
Follow-ups they push on
- Why can reflog recover a committed change but not uncommitted working-tree edits?
- How long do dangling commits survive before garbage collection removes them?
Red flag Panicking and re-doing work after a bad reset/rebase — the old commits are almost always recoverable via reflog; only uncommitted changes are genuinely lost.
source: Atlassian — git reflog ↗
Commonly asked senior concept occasional What is an interactive rebase (squash/fixup/reword) for, and what's the risk?
git rebase -i lets you rewrite a series of your own commits before sharing them: reorder them, squash/fixup several WIP commits into one logical change, reword messages, edit a commit's content, or drop a commit. The point is to turn a messy local history ('wip', 'fix typo', 'oops') into a clean, reviewable sequence of atomic commits — which makes review, git bisect, and git revert far more useful later.
The risk is the same golden rule of rebasing: it rewrites commit hashes, so you must only do it to commits that haven't been shared. Interactive-rebasing commits others have already based work on rewrites public history and forces everyone into painful re-syncs. Do it on your local feature branch before opening (or updating) the PR; never on shared main.
What a strong answer covers
- Interactive rebase curates your own unshared commits: squash/fixup, reorder, reword, edit, drop.
- Goal: a clean, atomic, reviewable history — which makes bisect and revert more effective.
- It rewrites hashes, so obey the golden rule: only on commits not yet shared.
- Use it on your local feature branch pre-PR, never on shared main.
Follow-ups they push on
- How does `git commit --fixup` plus `rebase --autosquash` streamline cleanup?
- Why does squashing make `git bisect` and `git revert` more useful afterward?
Red flag Interactive-rebasing commits that are already pushed/shared — it rewrites public history (new hashes) and forces collaborators into messy recovery; keep it to local, unshared work.
source: Atlassian — Rewriting history (interactive rebase) ↗

6.6 Code quality 11

Commonly asked junior concept common What does 'clean code' mean to you? Name a few concrete principles.
Clean code is code optimized for the *reader*, since code is read far more than it is written. Concrete principles:
- Intention-revealing names — a name should say what something is/does so you don't need a comment to explain it.
- Small, single-purpose functions — one level of abstraction, do one thing.
- DRY — don't duplicate knowledge; but don't abstract prematurely either.
- **Comments explain *why*, not *what* — the code shows what; comments justify non-obvious decisions.
- Consistent style** — let formatters/linters handle it so reviews focus on substance.
The through-line: minimize the cognitive load on the next person (often future-you).
Follow-ups they push on
- When does DRY go too far and create the wrong abstraction?
- Why is a comment that restates the code a smell?
Red flag Reciting buzzwords (DRY, SOLID) without the underlying goal — readability and changeability — or over-applying DRY into a tangled wrong abstraction.
source: Martin Fowler — Two Hard Things (naming) / CodeAsDocumentation ↗
Commonly asked mid concept common What is refactoring, and when is the right time to do it?
Refactoring is changing the *internal structure* of code to make it easier to understand and cheaper to modify, without changing its observable behavior. The behavior-preserving part is what makes it safe — and why a solid test suite is its prerequisite.
When: not as a separate 'refactoring sprint' but continuously, woven into feature work. The pragmatic trigger is the rule of three / refactor-when-it-hurts — when you are about to add a feature and the existing design fights you, first refactor to make the change easy, then make the easy change. Plus the boy-scout rule: leave each file a little cleaner than you found it.
Follow-ups they push on
- Why is refactoring without tests dangerous?
- What's the difference between refactoring and rewriting?
Red flag Calling any code change 'refactoring' even when it alters behavior — that conflation is how 'refactors' sneak in bugs and scope creep.
source: Martin Fowler — Refactoring ↗
Commonly asked mid concept occasional What's the difference between a linter and a formatter, and why automate both?
A formatter (Prettier, gofmt, Black) rewrites code to a canonical *style* — indentation, quotes, line length. It is purely cosmetic and deterministic.
A linter (ESLint, Ruff, golangci-lint) analyzes code for *problems and smells* — unused variables, likely bugs, anti-patterns, sometimes security issues. It catches substance, not just style.
Automate both, ideally in pre-commit hooks and CI, because it removes whole categories of nit-picking from human review. When formatting and trivial issues are settled by tools, reviewers spend their attention on design and correctness — the things only humans can judge. It also keeps style consistent regardless of who wrote the code.
Follow-ups they push on
- Why run these in CI even if developers have editor integration?
- How does auto-formatting reduce diff noise in code review?
Red flag Conflating the two, or relying on humans to enforce style in review — that wastes reviewer attention on what a tool should settle automatically.
source: Prettier — Prettier vs. Linters ↗
Commonly asked mid concept occasional What does cyclomatic complexity measure, and why is high complexity a problem?
Cyclomatic complexity counts the number of independent paths through a piece of code — essentially one plus the number of decision points (if, for, while, case, &&/||, ?:). A straight-line function is 1; each branch adds a path.
Why it matters: it correlates with how hard the code is to understand, test, and maintain. It's also a lower bound on the number of test cases needed to cover every path — a function with complexity 15 needs at least 15 paths exercised to test thoroughly, which is a strong hint it's doing too much. High complexity concentrates risk: the more tangled the branching, the more places a bug can hide.
Use it as a heuristic flag, not a hard law — a high score points you at a function worth simplifying (extract method, replace nested conditionals with guard clauses or polymorphism), but a naturally branchy dispatch can be legitimately high. Linters can fail a build over a threshold to keep it visible.
What a strong answer covers
- Measures independent paths ≈ 1 + count of decision points (branches/loops/boolean operators).
- Higher = harder to understand, test, maintain; it's a lower bound on test cases needed for path coverage.
- A high score flags a function doing too much — a candidate for extract-method / guard clauses.
- It's a heuristic, not gospel — some dispatch logic is legitimately branchy.
Quick self-check
What does a high cyclomatic complexity number most directly indicate?
Follow-ups they push on
- Why is cyclomatic complexity a lower bound on the number of tests for full path coverage?
- Which refactorings most directly reduce a function's complexity score?
Red flag Treating a complexity threshold as an absolute rule — it's a signal to investigate, and gaming the number (splitting one clear function into confusing fragments) can hurt readability more than it helps.
source: NIST — Cyclomatic Complexity (Structured Testing) ↗
Commonly asked senior concept common What makes a good code review, and what should reviewers actually look for?
A good review judges whether the change improves the overall health of the codebase — not whether it is perfect. Reviewers look for, roughly in priority order:
- Design: does the change belong here, fit the architecture, and not over-engineer?
- Correctness & edge cases: logic, error handling, concurrency, security.
- Tests: do they exist and actually exercise the behavior?
- Naming, clarity, comments: will the next reader understand it?
- Consistency with project conventions.
Process matters too: keep PRs small (faster, deeper reviews), review promptly to unblock people, comment kindly and explain the 'why', and distinguish blocking issues from optional nits (label them). The goal is shared understanding and a healthier codebase, not gatekeeping.
Follow-ups they push on
- Why are small PRs reviewed better than large ones?
- How do you give critical feedback without demoralizing the author?
Red flag Reviewing only for style/formatting (which a linter should catch) while rubber-stamping the design — the expensive bugs live in design and edge cases.
source: Google — Code Review Developer Guide (What to look for) ↗
Commonly asked senior trick occasional The team wants to stop feature work for a 'big rewrite' to fix the messy codebase. What's your take?
Push back. Big-bang rewrites are notoriously risky: you spend months reproducing existing behavior (including the undocumented edge cases the old code quietly handles), ship no value during the freeze, and often discover the new system has its own mess by the time it's done — the famous 'second-system' trap. Meanwhile the business is frozen and a parallel old-vs-new maintenance burden appears.
The pragmatic alternative is incremental refactoring under a green test suite, often via the Strangler Fig pattern: build the new behavior around the edges of the old system, route traffic to it piece by piece, and retire the old parts gradually — delivering value continuously and keeping rollback cheap. Pay down debt where you're already working (boy-scout rule) and where it has the highest interest.
A rewrite is occasionally justified (the platform is truly dead, or constraints changed fundamentally), but the default answer is: refactor incrementally, keep shipping, and make the cost of debt visible so it's prioritized — not a heroic stop-the-world bet.
What a strong answer covers
- Big-bang rewrites freeze value delivery and must re-derive every undocumented edge case the old code handles.
- They invite the second-system effect and a long old-vs-new dual-maintenance period.
- Prefer incremental refactoring behind tests, e.g. the Strangler Fig — replace piece by piece, keep shipping.
- Pay down debt where you already work and where interest is highest; make the cost visible.
Quick self-check
What's the strongest argument against a big-bang rewrite of a working legacy system?
Follow-ups they push on
- What is the Strangler Fig pattern and how does it de-risk replacing a legacy system?
- What rare conditions actually justify a full rewrite over incremental refactoring?
Red flag Defaulting to a stop-the-world rewrite — it usually overruns, loses hard-won edge-case behavior, ships nothing for months, and lands you with a new mess; incremental refactoring is the lower-risk path.
source: Martin Fowler — StranglerFigApplication ↗
Google senior concept occasional How do you keep code review and quality gates from becoming a bottleneck that slows the team down?
The dominant lever is small changes. A small PR is reviewed faster, more thoroughly, and merges before it rots; large PRs sit for days, get rubber-stamped, and block their authors. Google's guidance is explicit that small CLs are central to fast, high-quality review.
Reduce the human cost by automating what doesn't need judgment: formatters and linters settle style, CI runs the tests, security/dependency scanners flag the obvious — so reviewers spend their limited attention on design and correctness, not whitespace. Set an SLA for review turnaround (review promptly so authors aren't blocked) and make review a first-class part of the day, not an interruption deferred indefinitely.
Also right-size the gate: not every change needs the same rigor, and 'don't let perfect be the enemy of good' — approve net improvements and file follow-ups. The goal is a fast, trustworthy pipeline, not maximal ceremony.
What a strong answer covers
- Small PRs are the biggest lever — faster, deeper review; large PRs block authors and get rubber-stamped.
- Automate the judgment-free checks (lint/format/tests/scanners) so humans review design and correctness.
- Set a review-turnaround SLA and treat review as first-class work, not a deferred interruption.
- Right-size rigor and approve net improvements with follow-ups — don't let perfect block good.
Follow-ups they push on
- Why does a small PR get a higher-quality review than a large one?
- Which checks should never reach a human reviewer at all?
Red flag Trying to fix slow reviews by lowering standards or skipping review — the real fixes are smaller changes, automation of trivial checks, and a turnaround SLA, which speed things up *without* sacrificing quality.
source: Google — Code Review Developer Guide (Small CLs) ↗
Commonly asked senior concept common What is technical debt, and how do you decide whether to pay it down?
Technical debt is the implied future cost of choosing an easy-now solution over a better-but-slower one — like financial debt, it accrues 'interest' as every future change in that area takes longer.
Fowler's quadrant is useful: debt can be *deliberate or inadvertent* and *prudent or reckless*. Deliberate-prudent debt ('we'll ship now and refactor next sprint, and we know the tradeoff') is a legitimate engineering decision; reckless debt ('what's layering?') is not.
Deciding to pay it down: prioritize debt in code you touch often (high interest) over dead corners; pay it down opportunistically as you work nearby (boy-scout rule) rather than via giant rewrites; and make the cost visible to stakeholders so it competes fairly with features.
Follow-ups they push on
- Why is debt in rarely-touched code often fine to leave?
- How do you make tech debt visible to non-engineering stakeholders?
Red flag Treating all tech debt as equally urgent (or all of it as 'just bad code') — debt in hot paths costs far more than debt in stable, untouched code.
source: Martin Fowler — Technical Debt Quadrant ↗
Commonly asked senior concept occasional Name a few code smells and explain what each one signals.
A code smell is a surface symptom that *hints* at a deeper design problem — not a bug, but a prompt to look closer. Common ones:
- Long method / large class: too many responsibilities; signals a need to extract functions/classes.
- Duplicated code: the same knowledge in many places — change one, miss the others (DRY violation).
- Long parameter list: often a missing object that should group related params.
- Feature envy: a method that mostly uses *another* object's data — behavior is in the wrong place.
- Shotgun surgery: one change forces edits across many files — poor cohesion.
- Primitive obsession / magic numbers: missing a domain type or named constant.
The value is that smells give a shared vocabulary for review and point toward the right refactoring — but they are heuristics, not hard rules.
Follow-ups they push on
- Why is a smell a *hint* rather than a definitive 'this is wrong'?
- Which refactoring addresses 'shotgun surgery'?
Red flag Treating every smell as a mandatory fix — sometimes the 'smelly' code is the pragmatic choice; smells prompt investigation, not reflexive rewrites.
source: Martin Fowler — CodeSmell ↗
Commonly asked senior concept common What are coupling and cohesion, and why do we want low coupling and high cohesion?
Cohesion is how strongly the things *inside* a module belong together — high cohesion means a module has one clear, focused responsibility. Coupling is how dependent modules are on *each other's* internals — low coupling means modules interact through small, stable interfaces and can change independently.
We want high cohesion, low coupling because together they localize change. With high cohesion a single concern lives in one place (you know where to look, and the change stays contained). With low coupling, changing one module doesn't ripple into others. The opposite — low cohesion, high coupling — produces the shotgun surgery smell (one change forces edits everywhere) and fragile code where a tweak in module A mysteriously breaks module B.
This is the engine behind modularity, SOLID's single-responsibility and dependency-inversion principles, and why you depend on interfaces rather than concrete implementations.
What a strong answer covers
- Cohesion = how well a module's internals belong together (want it high — one responsibility).
- Coupling = how much modules depend on each other's internals (want it low — small stable interfaces).
- Together they localize change: a concern lives in one place and edits don't ripple outward.
- Low cohesion + high coupling → shotgun surgery and fragile, change-resistant code.
Follow-ups they push on
- How does depending on an interface instead of a concrete class reduce coupling?
- Which code smell is the direct symptom of low cohesion across modules?
Red flag Optimizing one in isolation — e.g. splitting code into many tiny modules can lower per-module size while *raising* coupling (everything calls everything); you want both directions right together.
source: Martin Fowler — Reducing Coupling (Beck Design Rules) ↗
Google senior concept common How do you give code-review feedback that improves the code without alienating the author?
Review the code, not the person, and assume competence — phrase comments about the change ('this query runs N+1') rather than the author ('you always...'). Explain the why behind a suggestion so it teaches rather than dictates, and prefer asking ('what happens if items is empty here?') over commanding when you're unsure.
Distinguish blocking issues from preferences: label optional suggestions explicitly (Google's convention is prefixing nits with Nit:) so the author knows what must change versus what's taste. Don't let perfect block good — if a change improves the codebase's health overall, approve it even if it isn't exactly how you'd write it; file follow-ups for non-urgent improvements.
Process courtesies matter too: review promptly to avoid blocking people, keep feedback respectful and specific, and recognize good work, not just problems. The goal is a healthier codebase and a team that *wants* their code reviewed.
What a strong answer covers
- Critique the code, not the person; assume competence and explain the why.
- Label nits/optional suggestions vs blocking issues so the author knows what's required.
- Don't gatekeep on perfection — approve net improvements; file follow-ups for the rest.
- Review promptly and respectfully; the aim is codebase health *and* a team that welcomes review.
Follow-ups they push on
- Why is prefixing optional comments with 'Nit:' valuable to the author?
- When should a reviewer approve a change that isn't exactly how they'd write it?
Red flag Blocking a net-positive change over personal style preferences, or phrasing feedback as commands/attacks — it breeds resentment and slows the team without improving the code.
source: Google — Code Review Developer Guide (How to comment) ↗

07 Building in the AI Age 80 Q's

7.1 Anatomy of a modern app 10

★ must-know Commonly asked junior concept very common Where does your code actually run — client or server — and why does it matter for what you can put in it?
Client code is shipped to and runs in the user's browser — it's fully visible (anyone can open DevTools and read it) and editable, so it can keep no secrets and enforce no rules. Server code runs on a machine you control — invisible to the user — so it can hold credentials, reach the database, and enforce checks.
The practical rule: anything that must stay secret or be trusted (API keys, authorization, pricing, validation that counts) lives on the server. The client is for rendering and convenience-level checks only.
What a strong answer covers
- Client code runs in the user's browser — fully visible and editable, keeps no secrets.
- Server code runs on a machine you control — invisible to the user, can hold credentials and enforce rules.
- Secrets and trusted checks (auth, pricing, real validation) must be server-side.
- Client-side checks are a UX nicety; the server must re-validate everything that matters.
Quick self-check
You hardcode a third-party API key into your React component so the browser can call the API. What's the problem?
Follow-ups they push on
- If you bundle an API key into your frontend JS, who can see it?
- Why is a 'disable the button' check in the browser not real security?
Red flag Putting an API key or secret in frontend code 'because it's just JavaScript' — it ships to every visitor and is trivially readable.
source: MDN — Server-side vs client-side code ↗
Commonly asked junior concept very common Name the three tiers of a typical web app and say what each is responsible for.
Three tiers: client/frontend (the browser — renders UI, handles interaction), server/backend (a machine you control — owns business logic, data access, and secrets), and database (persists state).
The key separation is trust: the browser is untrusted and public, so anything sensitive (DB credentials, API keys, authorization checks) lives on the server. The frontend asks the server for data; the server talks to the database.
Follow-ups they push on
- Why can't the browser talk to the database directly?
- Where does an API sit in this picture?
Red flag Saying the frontend 'connects to the database' — it never does; it calls your server, which holds the credentials.
source: InterviewPrep — 3-Tier Architecture ↗
Commonly asked junior concept very common What is an API, in one sentence, and why does the frontend go through it instead of the database?
An API is a contract: a defined set of endpoints the server exposes so other code can request data or actions without knowing the internals.
The frontend goes through it because the API is the trust boundary. It can authenticate the caller, authorize the action, validate input, and hide DB credentials and schema. If the browser hit the DB directly, anyone could read its network traffic, steal the credentials, and run any query.
Follow-ups they push on
- What does the API do that the database can't be trusted to do itself?
- What's the difference between an endpoint and a route?
Red flag Describing an API only as 'a URL' — the point is the contract and the trust boundary, not the address.
source: MDN — How does the web work? ↗
Commonly asked junior design common Trace what happens, end to end, when a user types a URL and hits Enter.
Walk the path out loud: browser parses the URL, DNS resolves the domain to an IP, the request reaches the host/server, the server/API runs logic and (if needed) queries the database, builds a response, sends it back, and the browser renders it.
Good signal is naming the layers in order and knowing DNS is a lookup, not the server itself. Bonus: mention HTTPS securing the connection along the way.
Follow-ups they push on
- Where would caching help in that path?
- What's the difference between the host and the code running on it?
Red flag Skipping DNS, or thinking the domain name 'is' the server. DNS is the phone book that maps name to address.
source: MDN — How does the web work? ↗
Commonly asked junior concept common What's the difference between 'build', 'deploy', and 'host'? People use them interchangeably.
They're three stages, not synonyms. Build compiles/bundles your source into shippable artifacts (the dist/ folder). Deploy is the act of pushing those built artifacts to a place that serves them. Host is the place itself — the always-on machine or platform serving the result.
Mnemonic: build is a verb that produces files, deploy is a verb that moves them, host is the noun where they live.
Follow-ups they push on
- Where does 'repo' and 'bundle' fit in the chain?
- What is CI/CD in one line?
Red flag Conflating build and deploy — you can build without deploying (a failed CI run) and redeploy the same build.
source: Vercel — Deployments overview ↗
Commonly asked junior concept very common What is an environment variable, and why should secrets never be committed to the repo?
An environment variable is config supplied to the program at runtime by its environment, not hardcoded in the source — things like the database URL or an API key. The same code reads different values in dev, preview, and prod.
Secrets stay out of the repo because git history is forever and repos get shared, cloned, and leaked. A committed key is compromised even after you 'delete' it — it's still in history. Secrets belong in the host's environment-variable/secret store.
Follow-ups they push on
- You accidentally committed an API key. What do you do?
- Why use a `.env` file locally but not commit it?
Red flag Thinking deleting the line in a later commit fixes it — the secret is still in history and must be rotated.
source: The Twelve-Factor App — Config ↗
Commonly asked junior concept common What's the difference between an endpoint and a route?
A route is the path pattern the server matches against an incoming request (/users/:id). An endpoint is a specific addressable operation — usually a method + path together (GET /users/:id vs DELETE /users/:id) — that does one thing.
In practice people use them loosely, but the useful distinction is: one route (path) can host several endpoints, one per HTTP method. The route is where the request lands; the endpoint is the exact action it triggers.
What a strong answer covers
- A route is the URL path pattern the server matches (/users/:id).
- An endpoint is method + path together — a specific operation (GET /users/:id).
- One route can back several endpoints, one per HTTP method (GET/POST/DELETE…).
- :id is a path parameter — a placeholder filled by the actual request.
Follow-ups they push on
- How does the same path serve a GET and a DELETE differently?
- What's a path parameter vs a query parameter?
Red flag Thinking a path alone fully identifies an operation — `/users/1` means nothing until you also know the method (read it? delete it?).
source: MDN — Routing (Server-side first steps) ↗
Commonly asked junior concept common Why can't the browser talk to the database directly — what would go wrong?
To connect to a database you need its address and credentials, and the browser is a public, untrusted environment — anyone can read the page's network traffic and JavaScript. Shipping DB credentials to the browser means handing them to every visitor.
Even if you could, there'd be no enforcement layer: the database just runs whatever query it's given. The server sits in between precisely to authenticate the caller, authorize the action, validate input, and only then run a safe, scoped query. The DB stays on a private network the browser can't reach.
What a strong answer covers
- Connecting needs credentials; the browser is public, so those credentials would leak to everyone.
- The database has no notion of *who* is asking — it just runs the query it's given.
- The server is the enforcement layer: authenticate, authorize, validate, then query.
- In real setups the DB lives on a private network the browser literally can't reach.
Follow-ups they push on
- What does the server add that the database can't enforce itself?
- How is a DB connection string like an API key?
Red flag Imagining the database can 'just check permissions' itself — it executes queries; the trust/permission logic lives in your server code.
source: MDN — Server-side programming: first steps ↗
Commonly asked junior concept common What's the difference between a request and a response, and what do status codes like 200, 404, and 500 tell you?
A request is what the client sends (method, URL, headers, optional body); a response is what the server sends back (a status code, headers, and usually a body). Every HTTP exchange is one request and one response.
Status codes are the response's one-glance summary: 2xx = success (200 OK), 3xx = redirect, 4xx = the client did something wrong (404 not found, 401/403 auth problems, 400 bad input), 5xx = the server broke (500 internal error). The first digit tells you whose 'fault' it is — 4xx is on the caller, 5xx is on the server.
What a strong answer covers
- Request = client → server (method, URL, headers, body); response = server → client (status, headers, body).
- 2xx success, 3xx redirect, 4xx client error, 5xx server error.
- 404 = not found, 401/403 = not authenticated/authorized, 400 = bad request, 500 = server crashed.
- The leading digit tells you where to look first: 4xx → the request; 5xx → the server logs.
Quick self-check
Your API returns 500 for a request. Where do you look first?
Follow-ups they push on
- If you see a 401 vs a 403, what's the difference?
- Why is a 500 your problem but a 404 might be the caller's?
Red flag Returning 200 for everything (including errors) and signalling failure only in the body — it breaks clients, caches, and monitoring that rely on the status code.
source: MDN — HTTP response status codes ↗
Commonly asked mid concept occasional What is a runtime (Node, a browser, an edge runtime), and why does 'where it runs' change what your code can do?
A runtime is the environment that executes your code and decides which capabilities (APIs) are available. The same JavaScript behaves differently depending on the runtime: a browser runtime gives you the DOM, fetch, and localStorage but no filesystem; Node.js gives you the filesystem, network sockets, and process access but no DOM; an edge runtime is a stripped-down server runtime optimized to run close to users, often missing some Node APIs.
So 'where it runs' is really 'which runtime', and the runtime is what gates what's possible — fs.readFile works in Node and crashes in a browser; document.querySelector works in a browser and is undefined in Node.
What a strong answer covers
- A runtime is the execution environment that supplies the available APIs.
- Browser: DOM, fetch, localStorage; no filesystem or process access.
- Node.js: filesystem, sockets, env/process; no DOM.
- Edge runtimes are lean server runtimes near the user — fast, but a subset of Node's APIs.
Follow-ups they push on
- Why does `document` exist in the browser but not in Node?
- Why might a library work locally (Node) but fail when deployed to an edge runtime?
Red flag Assuming 'it's all JavaScript' means any code runs anywhere — runtime-specific APIs (fs, DOM) make code break when moved to the wrong environment.
source: MDN — JavaScript execution environments ↗

7.2 Frontend for backend devs 10

Commonly asked junior concept very common What do HTML, CSS, and JavaScript each do, and what is the DOM?
HTML is structure (the content and its meaning), CSS is style (how it looks), JavaScript is behavior (what happens when you interact). The DOM (Document Object Model) is the browser's live, in-memory tree representation of the HTML — JS reads and changes the DOM, and the browser re-renders.
The one-liner that lands: HTML is the skeleton, CSS is the skin, JS is the muscles, and the DOM is the object you manipulate to change any of it at runtime.
Follow-ups they push on
- When JS changes the page, is it changing the HTML file or the DOM?
- What's the difference between the DOM and the source HTML?
Red flag Thinking JS edits the .html file. It edits the DOM — the in-memory tree — not the file on disk.
source: MDN — How does the web work? ↗
Commonly asked junior concept common Why do component states like loading, empty, and error matter as much as the happy path?
Any component that fetches or depends on real data has more than one state: while the data is in flight (loading), when it arrives but is empty (empty — zero results), when the request fails (error), and finally the populated success state. Real users hit all four.
If you only build the success path, the component shows a blank or broken UI the moment data is slow, missing, or failing — exactly the moments users notice. Designing the loading skeleton, the empty message, and the error/retry up front is what separates a demo from a shippable feature, and it's why you list these states explicitly when prompting an AI to build the component.
What a strong answer covers
- Data-driven components have ≥4 states: loading, empty, error, success.
- Users hit the non-happy states constantly (slow networks, no results, failures).
- Skipping them yields blank/broken UI at the worst moment — when something's already wrong.
- Naming all states up front is what makes a component (and an AI prompt) production-grade.
Follow-ups they push on
- What's the difference between an empty state and an error state?
- Why is a loading state about perceived performance, not just correctness?
Red flag Building only the populated success view — the component looks done in the demo but breaks on the first slow or failed request in production.
source: Anthropic — Prompt engineering overview ↗
Commonly asked junior concept very common What's the difference between props and state in a component?
Props are inputs passed in from the parent — read-only from the component's view, like function arguments. State is data the component owns and can change over time, which triggers a re-render when it does.
Framework-agnostic rule of thumb: if the data comes from above and the component shouldn't mutate it, it's a prop; if the component manages it and updates it (a toggle, a form field, a counter), it's state.
Follow-ups they push on
- What is a component, in one line?
- If a parent and child both need the same value, where should it live?
Red flag Mutating props directly. Props flow down and are read-only; to change them you lift state to the parent.
source: React — Thinking in React ↗
Commonly asked junior concept occasional A UI library, a meta-framework, a styling system, a component kit — what's the difference?
Different layers of the stack: a UI library (React/Vue/Svelte) gives you the component model. A meta-framework (Next/Astro/SvelteKit) wraps a UI library with routing, rendering modes, and a build pipeline. A styling system (Tailwind or plain CSS) decides how you apply styles. A component kit (shadcn/MUI) is pre-built, styled components you drop in.
They stack, not compete: e.g. Astro (meta-framework) + React (UI library) + Tailwind (styling) + shadcn (components).
Follow-ups they push on
- Is Next.js a replacement for React?
- What does a bundler like Vite do?
Red flag Calling Next.js 'a JavaScript framework like React' — Next is built on React; they're different layers.
source: Astro — Why Astro? ↗
Commonly asked junior concept occasional You're asking an AI to build a UI component. What makes a good frontend prompt?
Name four things: the component (what it is — 'a comment card'), its props (the data it takes in), its states (loading, empty, error, hover/disabled), and a visual reference (a screenshot, an existing component to match, or a design system).
Vague prompts ('make a nice form') produce generic output. Specifying props and states is what turns the model from guessing into building to a contract — the same discipline you'd use describing the component to a teammate.
Follow-ups they push on
- Why list the empty and error states explicitly?
- How does giving an existing component as reference help?
Red flag Only describing the happy path — you get a component that breaks on empty/error data you forgot to mention.
source: Anthropic — Prompt engineering overview ↗
Commonly asked junior concept very common What is a component, and why break a UI into components at all?
A component is a reusable, self-contained piece of UI — markup plus its own logic and styling — that takes inputs (props) and renders a piece of the screen. A Button, a CommentCard, a Navbar are all components.
You break a UI into components for the same reasons you break code into functions: reuse (write the card once, render it 50 times), isolation (a bug in one is contained), and composition (build complex screens by nesting small pieces). It also maps cleanly to how you reason and how you prompt an AI — one component, one clear responsibility.
What a strong answer covers
- A component = reusable UI unit: markup + logic + style, driven by props.
- Reuse: define once, render many times with different props.
- Isolation and composition: small pieces nest into whole screens; bugs stay contained.
- Mirrors functions — one component should have one clear responsibility.
Follow-ups they push on
- How do you decide where one component ends and another begins?
- What's the downside of one giant component that does everything?
Red flag Building one massive component for a whole page — it becomes unreusable, hard to test, and a nightmare to change or describe to an AI.
source: React — Your First Component ↗
Commonly asked mid concept common Explain client-side rendering vs server-side rendering, and what 'hydration' means.
CSR: the server sends a near-empty HTML shell plus a JS bundle; the browser builds the whole DOM in JS. Fast to deploy, but slower first paint and weaker SEO. SSR: the server renders real HTML up front so the user sees content immediately and crawlers get real markup.
Hydration is the step after SSR where the JS bundle loads and attaches event listeners to the already-rendered HTML, turning the static markup into an interactive app. The HTML the server sent and the HTML React expects must match, or you get a hydration mismatch.
Follow-ups they push on
- Why does SSR help SEO?
- What causes a 'hydration mismatch' error?
Red flag Thinking SSR means 'no JavaScript.' SSR sends HTML first, then still hydrates with JS for interactivity.
source: GreatFrontend — Explain what React hydration is ↗
Commonly asked mid concept common Why does a list of rendered items need a stable `key`, and what goes wrong if you use the array index?
When you render a list, the framework needs to know which rendered element corresponds to which data item across re-renders — that's what key provides. A stable, unique key (an item's id) lets it correctly match, reorder, insert, and remove elements while preserving each item's state.
Using the array index breaks this when the list reorders, filters, or has items inserted/removed: the index→item mapping shifts, so the framework reuses the wrong DOM node and component state (a half-typed input, a checkbox) sticks to the wrong row. Index keys are only safe for a static, never-reordered list.
What a strong answer covers
- key lets the framework match rendered elements to data items across renders.
- Use a stable, unique id from the data — not the array index.
- Index keys break on reorder/insert/delete: state and DOM attach to the wrong item.
- Index is acceptable only for a fixed list that never changes order or length.
Quick self-check
You render a reorderable todo list using the array index as each item's `key`. After dragging an item to the top, the checkboxes appear checked on the wrong todos. Why?
Follow-ups they push on
- Why does a half-typed input jump to the wrong row with index keys?
- Where should the key come from if your data has no id?
Red flag Reaching for the array index as the key by default — it silently corrupts state when the list is dynamic (the exact case keys exist for).
source: React — Rendering Lists (keys) ↗
Commonly asked mid concept common When the data on screen changes, what makes the UI update? Contrast the imperative DOM approach with the declarative component approach.
Imperative (vanilla DOM): you change the data *and* manually issue the DOM edits — el.textContent = count — keeping the screen in sync by hand. It works but every state change means hand-written update code, which is where bugs breed.
Declarative (React/Vue/Svelte): you describe what the UI should look like *as a function of state*, and when state changes you just update the state — the framework figures out the minimal DOM changes and applies them. You stop writing 'how to update the screen' and only write 'what the screen is for this state'. That's the core mental shift for a backend dev moving to the frontend.
What a strong answer covers
- Imperative: you manually mutate the DOM on every change — error-prone bookkeeping.
- Declarative: UI = f(state); you update state, the framework re-derives and patches the DOM.
- The win is removing hand-written sync code, the classic source of UI bugs.
- You think in 'what the screen is', not 'what DOM operations to perform'.
Follow-ups they push on
- What is a 're-render' in a declarative framework?
- Why is manually syncing the DOM so bug-prone at scale?
Red flag Trying to manually edit the DOM inside a React component — you fight the framework; instead change state and let it re-render.
source: React — Reacting to Input with State ↗
Commonly asked mid concept occasional What does it mean to 'lift state up', and when do you do it?
Lifting state up means moving a piece of state out of a child component into the closest common parent, then passing it back down as props (plus a callback to change it). You do it when two or more components need to read or stay in sync with the same value.
The rule: state should live at the lowest common ancestor of everything that needs it. If a child owns state that a sibling also needs, neither can see the other's local state, so you hoist it to the parent that contains both. The parent becomes the single source of truth and hands it down.
What a strong answer covers
- Move shared state to the closest common parent, pass it down as props.
- Do it when 2+ components must read or stay in sync with the same value.
- The parent becomes the single source of truth; children get value + a change callback.
- Keeps duplicate, drifting copies of the same state from existing.
Follow-ups they push on
- If two sibling components both need a value, where does it live?
- What's the risk of each sibling keeping its own copy of the same state?
Red flag Duplicating the same state in two siblings and trying to keep them in sync manually — they drift; lift it to the shared parent instead.
source: React — Sharing State Between Components ↗

7.3 Backend for frontend devs 10

★ must-know Commonly asked junior concept very common Sketch a REST API for a 'notes' resource. What does the full set of CRUD endpoints look like?
REST organizes the API around a resource (notes) and uses HTTP methods for the verbs. The standard set: GET /notes (list), POST /notes (create), GET /notes/:id (read one), PUT/PATCH /notes/:id (update), DELETE /notes/:id (delete).
The pattern that makes it 'RESTful': the noun lives in the URL (the resource) and the verb lives in the HTTP method — never POST /createNote or GET /deleteNote/1. The same path /notes/:id serves read, update, and delete by varying the method. That convention is why anyone can guess your API once they know the resource.
What a strong answer covers
- Resource in the URL (/notes), action in the HTTP method.
- List GET /notes, create POST /notes, read GET /notes/:id, update PUT/PATCH /notes/:id, delete DELETE /notes/:id.
- Avoid verbs in the path (/createNote, /deleteNote/1) — that's the anti-pattern.
- Predictable: knowing the resource lets a caller guess the endpoints.
Quick self-check
Which is the RESTful way to delete the note with id 42?
Follow-ups they push on
- Why is `POST /notes/123/delete` considered un-RESTful?
- How does this map back to CRUD and to SQL operations?
Red flag Putting the verb in the URL (`GET /getNotes`, `POST /deleteNote`) — it breaks the REST convention and the method↔CRUD mapping.
source: MDN — HTTP request methods ↗
Commonly asked junior concept very common What can server-side code do that browser code can't?
The server runs on a machine you control, so it can hold secrets (API keys, DB credentials) the user never sees, reach the database directly, touch the filesystem, and call other services with trusted credentials.
Browser code is shipped to and runs on the user's machine — it's fully visible and editable by anyone, so it can't be trusted to keep secrets or enforce rules. Any check that matters (auth, pricing, permissions) must happen server-side.
Follow-ups they push on
- Why isn't a check in the frontend enough to secure an action?
- What's a runtime — Node, Python — in this context?
Red flag Putting an authorization check only in the frontend. The user can bypass it; the server must re-check everything.
source: MDN — Server-side programming: first steps ↗
Commonly asked junior concept very common Map CRUD to HTTP methods. Which methods are idempotent?
CRUD ↔ HTTP: Create → POST, Read → GET, Update → PUT/PATCH, Delete → DELETE.
Idempotent means calling it N times has the same effect as calling it once. GET, PUT, and DELETE are idempotent; POST is not (two POSTs create two records). PATCH is generally not guaranteed idempotent. This matters for retries: it's safe to retry a GET or PUT after a timeout, but retrying a POST may double-charge or double-create.
Follow-ups they push on
- What's the difference between PUT and PATCH?
- Why does idempotency matter when a request times out?
Red flag Saying GET is idempotent because 'it doesn't change anything' — that's safety. Idempotency is about repeated calls having one effect (a correct GET is also safe, but the concepts differ).
source: InterviewBit — REST API Interview Questions ↗
Commonly asked junior concept common What is an ORM, and why do people warn against hand-concatenating SQL strings?
An ORM (Object-Relational Mapper, e.g. Prisma) lets you work with database rows as objects in your language instead of writing raw SQL — it generates the SQL for you and maps results back to typed objects.
Hand-concatenating SQL from user input invites SQL injection: if you build "SELECT * FROM users WHERE name = '" + input + "'", a crafted input can break out of the string and run arbitrary SQL. ORMs (and parameterized queries) bind values separately from the query text, so input can never become executable SQL.
Follow-ups they push on
- What's a parameterized/prepared query?
- When might you drop to raw SQL anyway?
Red flag Thinking an ORM is required for safety — the real fix is parameterized queries; an ORM is one convenient way to get them.
source: Prisma — What is an ORM? ↗
Commonly asked junior concept very common Authentication vs authorization — what's the difference?
Authentication is proving who you are (login, a token, a session). Authorization is what you're allowed to do once you're known (can this user delete that post?).
Mnemonic: authentication is the bouncer checking your ID at the door; authorization is the rule about which rooms your ticket lets you into. You authenticate once, then authorize every sensitive action.
Follow-ups they push on
- Where must these checks run — frontend or backend?
- Can you be authenticated but not authorized for an action?
Red flag Using the words interchangeably. A logged-in user (authenticated) still must be authorized per action; conflating them leads to privilege bugs.
source: MDN — Server-side programming: first steps ↗
Commonly asked junior concept very common Why validate input on the server even when the frontend already validates the same form?
Frontend validation is a UX feature — it gives instant feedback so users fix mistakes fast — but it provides zero security, because the client is fully under the user's control. Anyone can bypass the form entirely and POST raw data with cURL, disabled JavaScript, or DevTools.
So the server must re-validate everything it receives as if no frontend existed: required fields, types, ranges, formats, and authorization. The two layers aren't redundant — they serve different jobs: the frontend for friendliness, the server for trust. Skipping server validation is how malformed and malicious data reaches your database.
What a strong answer covers
- Frontend validation = UX (fast feedback); it is not security.
- The client is user-controlled — attackers bypass the form and POST directly.
- The server must re-validate every request as if no frontend existed.
- Both layers coexist: friendliness on the client, trust on the server.
Quick self-check
Your signup form checks the email format in JavaScript before submitting. Is server-side email validation still needed?
Follow-ups they push on
- How would someone bypass your frontend validation?
- Is server validation enough on its own (no frontend checks)?
Red flag Trusting client-side validation as a security boundary — it's trivially bypassed; the server is the only place validation actually protects you.
source: MDN — Form data validation ↗
Commonly asked mid concept common What is CORS, and why does the browser block your frontend from calling an API on a different origin?
CORS (Cross-Origin Resource Sharing) is a browser security mechanism. By default the same-origin policy stops JavaScript on app.example.com from reading responses from a different origin (different scheme, host, or port) — this prevents a malicious page from quietly calling APIs as you. CORS is the *opt-in* by which a server says 'these specific other origins are allowed', via Access-Control-Allow-Origin and related response headers.
Key nuance for builders: CORS is enforced by the browser, on the server's say-so. A CORS error isn't your frontend misbehaving — it means the API you're calling hasn't allow-listed your origin. The fix is on the server (or a proxy), not in the browser.
What a strong answer covers
- Same-origin policy blocks cross-origin reads by default; CORS is the server's opt-in to relax it.
- Enforced by the browser, but configured via the server's response headers.
- Access-Control-Allow-Origin names which origins may read the response.
- A CORS error means the target server hasn't allowed your origin — fix it server-side.
Follow-ups they push on
- What makes two URLs the 'same origin'?
- Why can server-to-server requests ignore CORS entirely?
Red flag Trying to 'fix CORS' in the frontend code — the browser enforces it from the server's headers; the change must happen on the API or via a backend proxy.
source: MDN — Cross-Origin Resource Sharing (CORS) ↗
Commonly asked mid concept occasional Serverless vs a long-running server — when does each fit?
Serverless (functions that spin up per request) shines for spiky, event-driven, or low-traffic workloads: you pay per invocation and scale to zero, but each call can have a cold start and there's no in-memory state between calls. A long-running server fits steady traffic, long-lived connections (websockets), background work, and cases where keeping things warm in memory matters.
Orientation-level takeaway: serverless trades always-on cost and statefulness for automatic scaling and pay-per-use.
Follow-ups they push on
- What is a 'cold start'?
- Why is a websocket server awkward to run serverless?
Red flag Assuming serverless is always cheaper — at sustained high traffic a long-running server is often cheaper and lower-latency.
source: MDN — Server-side programming: first steps ↗
Commonly asked mid concept common What's the difference between a session and a JWT for keeping a user logged in?
Both answer 'how does the server know it's still you on the next request'. A session stores the auth state on the server (a session record) and gives the browser an opaque session ID (usually in a cookie); the server looks it up each request — easy to revoke, but it's stateful. A JWT is a signed token the server hands back that *contains* the claims (user id, expiry); the server just verifies the signature, no lookup needed — stateless and scalable, but hard to revoke before it expires.
Tradeoff in one line: sessions are easy to invalidate but require server state; JWTs are stateless and scale well but you can't easily 'log someone out' until the token expires.
What a strong answer covers
- Session: state lives server-side; browser holds an opaque ID; easy to revoke, but stateful.
- JWT: signed token carries the claims; server verifies signature, no lookup; stateless.
- JWT scales well (no shared session store) but is hard to revoke before expiry.
- Both typically ride in a cookie or Authorization header on each request.
Follow-ups they push on
- Why is logging a user out harder with JWTs?
- Why should a JWT have a short expiry?
Red flag Storing sensitive data in a JWT thinking it's hidden — a JWT is signed, not encrypted; its payload is readable by anyone who has the token.
source: MDN — HTTP authentication ↗
Commonly asked mid concept occasional Why move slow work (sending email, resizing an image, calling a slow API) to a background job instead of doing it in the request?
An HTTP request should return fast. If you do slow work inline — sending a welcome email, resizing an upload, calling a slow third-party API — the user waits the whole time, the request may time out, and a failure in that work fails the whole request.
The pattern is to enqueue the slow work and return immediately: accept the request, push a job onto a queue, respond '202 accepted / we're on it', and let a separate worker process the job later (with retries on failure). The user gets a snappy response; the slow, flaky, or retryable work happens out of band where a failure doesn't break the user's request.
What a strong answer covers
- Requests should be fast; slow inline work blocks the user and risks timeouts.
- Enqueue the work, respond immediately, let a separate worker process it.
- Background jobs can retry on failure without re-running the user's request.
- Good fits: email, image/video processing, slow external API calls, report generation.
Follow-ups they push on
- What does a queue + worker setup look like at a high level?
- How does a background job report success or failure back to the user?
Red flag Doing slow/flaky work inline in the request handler — one slow third-party call makes every user wait and turns a transient failure into a failed request.
source: MDN — Server-side programming: first steps ↗

7.4 TypeScript, just enough 10

Commonly asked junior concept very common What problem does TypeScript solve over plain JavaScript? What class of bugs does it catch?
TypeScript adds static types checked at compile time, so a whole class of bugs is caught before the code runs: typos in property names, passing the wrong shape, calling a method that doesn't exist, forgetting a required field, or assuming a value is present when it can be undefined.
It's a developer-time tool — the types are erased and it's plain JS at runtime. The payoff is the error shows up in your editor as you type instead of as a crash in production.
Follow-ups they push on
- Do types exist at runtime?
- Does TypeScript make code faster?
Red flag Believing TS catches every bug — it catches type/shape errors, not logic errors. `if (x = 5)` is still wrong; a bad algorithm is still bad.
source: TypeScript — TS for JavaScript Programmers ↗
Commonly asked junior concept common What's the difference between `interface` and `type` in TypeScript?
Both describe the shape of data. interface is best for object shapes and class contracts — it can be extended and merged (declaration merging). type is a more general alias — it can do everything interface does for objects plus unions, intersections, primitives, and tuples.
Practical rule: reach for interface for object shapes you might extend, and type when you need a union or a non-object alias. At orientation level the honest answer is they overlap heavily and either is fine for object shapes.
Follow-ups they push on
- Which can express a union type?
- What is declaration merging?
Red flag Claiming they're identical — `type` can express unions and primitives; `interface` supports declaration merging.
source: DataCamp — TypeScript Interview Questions ↗
Commonly asked junior concept occasional What is type inference, and why don't you annotate every variable?
Type inference is TypeScript figuring out the type from the value automatically: write const n = 5 and TS knows n is number — no annotation needed.
You skip redundant annotations because they add noise without adding safety. Annotate where inference can't help or where you want to pin a contract: function parameters, function return types for public APIs, and the shape of external data (API responses). Let inference handle the obvious local cases.
Follow-ups they push on
- Where is an explicit annotation still worth it?
- What does `const x: number = 5` add over `const x = 5`?
Red flag Annotating everything 'to be safe' — over-annotation is noise; the value is typing boundaries, not every local.
source: TypeScript Handbook — Everyday Types ↗
Commonly asked junior concept occasional How do types make an AI coding assistant more useful?
Types are machine-readable context. With a typed codebase the assistant gives better completions (it knows the exact shape available), invents fewer non-existent fields (the contract is right there), and its mistakes surface as in-editor type errors instead of silent runtime bugs.
So a typed contract is a form of guardrail for the AI: it constrains what valid code looks like, which is why 'add types' or 'type this API response' is a high-leverage instruction to give it.
Follow-ups they push on
- What does telling the AI to 'make this strict' do?
- Why does a typed API response reduce hallucinated fields?
Red flag Treating types as only a human concern — they're also the strongest signal the model has about valid code.
source: TypeScript — TS for JavaScript Programmers ↗
Commonly asked junior concept common Name TypeScript's main primitive types and how you type an array and an object.
The core primitives are string, number, boolean, plus null and undefined (and bigint/symbol you rarely touch early). You type an array as number[] (or Array<number>), and an object by its shape: { name: string; age: number }.
A point that trips JS devs: TypeScript has no separate int/float — it's all number. And string[] means 'array of strings', while string alone is one string. Get comfortable reading these shapes; most real-world typing is just composing primitives into object and array shapes.
What a strong answer covers
- Primitives: string, number, boolean, null, undefined (plus bigint, symbol).
- No int/float distinction — all numbers are number.
- Array: T[] or Array<T> (e.g. string[]).
- Object: describe its shape — { name: string; age: number }.
Follow-ups they push on
- What's the difference between `number[]` and `[number, number]`?
- How do you mark an object field as optional?
Red flag Looking for `int`/`float`/`char` types from other languages — TypeScript only has `number` and `string`; there's no character type.
source: TypeScript Handbook — Everyday Types ↗
Commonly asked junior concept common What is a union type, and how do you write a literal union for something like a status field?
A union type says a value is one of several types, written with |: string | number means 'either a string or a number'. The most useful flavor is a literal union of exact values: type Status = "idle" | "loading" | "error" | "done".
This is huge for modeling state: instead of a loose string that could be any typo, the type pins the field to exactly the allowed values, so status = "loadign" is a compile error and your editor autocompletes the valid options. It's the cleanest way to make impossible states unrepresentable.
What a strong answer covers
- A union (A | B) means the value is one of the listed types.
- A literal union ("a" | "b" | "c") restricts to exact allowed values.
- Great for status/role/variant fields — typos become compile errors.
- Editor autocompletes the valid options, so you can't pick an invalid one.
Quick self-check
Which type best models a button's variant, which must be exactly 'primary', 'secondary', or 'ghost'?
Follow-ups they push on
- How does TypeScript 'narrow' a union so you can use it safely?
- Why is a literal union better than a plain `string` for a status field?
Red flag Typing a fixed-set field as `string` — you lose the typo-catching and autocomplete a literal union would give you for free.
source: TypeScript Handbook — Everyday Types (Union Types) ↗
Commonly asked mid concept occasional How do shared types act as a contract between your frontend and backend?
If both sides of your app are TypeScript, you can define the shape of the data once — interface User { id: string; name: string; email: string } — and import it in both the API code and the frontend. That shared type is a contract: the server is typed to return it, the client is typed to consume it.
The payoff is compile-time safety across the boundary. If you rename name to fullName on the server but forget the frontend, the build breaks at the mismatch instead of shipping a silently broken page. It turns 'did the API change?' from a runtime surprise into a type error you see immediately — the single biggest reason teams run TypeScript end to end.
What a strong answer covers
- Define the data shape once; import it on both client and server.
- The shared type is an enforced contract across the API boundary.
- A change on one side that breaks the other fails the build, not production.
- Turns API drift from a runtime surprise into an immediate compile error.
Follow-ups they push on
- What happens at build time if the server's response no longer matches the shared type?
- How do tools generate these shared types from an API schema automatically?
Red flag Hand-redeclaring the same shape separately on client and server — they drift out of sync; share one source-of-truth type instead.
source: TypeScript Handbook — Object Types (interfaces) ↗
Commonly asked mid trick common What's the difference between `any` and `unknown`?
any turns type checking off for that value — you can do anything with it and TS won't complain, which throws away the safety you came for. unknown is the safe counterpart: you can hold any value, but you must narrow it (check its type) before you use it.
Rule of thumb: any is an escape hatch (use sparingly, e.g. migrating JS); unknown is a checkpoint that forces you to prove the type first. Prefer unknown when you genuinely don't know the type yet.
Follow-ups they push on
- Why is `unknown` safer for a parsed JSON / API response?
- When is reaching for `any` defensible?
Red flag Sprinkling `any` to silence errors — it defeats the point of TypeScript and hides real bugs. Tighten the type instead.
source: DataCamp — TypeScript Interview Questions ↗
Commonly asked mid concept common What's the difference between optional (`?`) and `| undefined`, and how do you safely read a value that might be missing?
field?: string means the property may be absent entirely (you can omit the key). field: string | undefined means the key must be present but its value may be undefined. They overlap a lot in practice; the practical concern is the same — you must handle the missing case before using it.
To read it safely, use narrowing: an if (user.name) check, optional chaining user.profile?.bio, or a default with ?? (const name = user.name ?? "Anonymous"). With strictNullChecks on, TypeScript forces you to do this — it won't let you call .toUpperCase() on something that might be undefined, which kills a huge class of 'cannot read property of undefined' crashes.
What a strong answer covers
- ? = the property may be absent; | undefined = present but possibly undefined.
- Both require handling the missing case before use.
- Narrow with if checks, optional chaining ?., or nullish coalescing ??.
- strictNullChecks makes the compiler force this — preventing undefined-access crashes.
Follow-ups they push on
- What does optional chaining (`?.`) return when the left side is undefined?
- Why does `strictNullChecks` catch so many real-world bugs?
Red flag Accessing a possibly-undefined value directly (`user.name.toUpperCase()`) — without narrowing it crashes at runtime; let strict mode force the check.
source: TypeScript — Migrating with strictNullChecks / handling null ↗
Commonly asked mid trick common TypeScript said the code is type-correct, but it still crashed at runtime with bad data from an API. How is that possible?
TypeScript types are erased at compile time — they don't exist at runtime and don't check actual values. When you write const user = await res.json() as User, you're *asserting* the shape, not verifying it. If the API returns something different, TS believed your assertion and the mismatch only surfaces as a crash later.
Types guarantee your code is internally consistent; they cannot police data that enters at runtime (API responses, form input, JSON files). For real boundaries you need runtime validation — a schema validator like Zod that actually checks the value and *then* gives you a trustworthy type. Static types and runtime validation are different jobs.
What a strong answer covers
- Types are erased at compile time — no runtime checking of actual values.
- as User is an unchecked assertion; TS trusts you, it doesn't verify.
- External data (APIs, forms, files) can violate the asserted type silently.
- Validate at the boundary with a runtime schema (e.g. Zod) to get a type you can trust.
Quick self-check
You write `const data = await res.json() as Product`. The API changes and now omits `price`. What happens?
Follow-ups they push on
- Why is `as SomeType` on an API response dangerous?
- How does a tool like Zod give you both a runtime check and a static type?
Red flag Using `as` to assert the shape of external data and assuming it's now safe — `as` does no checking; only runtime validation actually verifies the value.
source: TypeScript Handbook — Type assertions (`as`) ↗

7.5 From code to a live URL 10

★ must-know Commonly asked junior concept very common Walk the full chain from a git commit to a live URL. What happens at each step?
Repo → build → bundle → deploy → host. You push a commit to the repo (e.g. GitHub). That triggers a build on the host: it installs dependencies and runs your build command, which bundles your source — many files of TS/JSX/CSS — into a small set of optimized, browser-ready static assets (dist/). The host then deploys those artifacts (copies them to its servers/CDN) and serves them at a URL.
The mental model that matters: your source code is *not* what runs — the built bundle is. A push kicks off a pipeline that transforms source into deployable artifacts and puts them somewhere always-on. Modern hosts collapse all of this into 'git push and we handle the rest'.
What a strong answer covers
- Chain: push to repo → host builds → bundler produces optimized dist/ → deploy → live URL.
- Bundling turns many dev files into a few optimized, browser-ready assets.
- What runs in prod is the build output, not your raw source.
- Modern hosts trigger the whole pipeline automatically on push.
Quick self-check
After `git push`, your host shows a live site. What did the browser actually download?
Follow-ups they push on
- What does a bundler (Vite, esbuild) actually do to your files?
- Why isn't your raw `.tsx` source what the browser downloads?
Red flag Thinking the browser runs your source files — it runs the bundled, transpiled output; a build step sits between your code and what ships.
source: Vite — Building for Production ↗
Commonly asked junior concept very common What is version control, and what is a GitHub repo — in one line each?
Version control (Git) tracks the history of your code over time, lets you branch and merge, and lets you go back to any past state. A GitHub repo is a hosted home for a Git repository — the shared remote copy that you push to, others pull from, and deploys are triggered from.
Git is the tool; GitHub is a hosting service for Git repos (with PRs, issues, and CI on top).
Follow-ups they push on
- What's the difference between a commit and a push?
- What is a branch for?
Red flag Conflating Git and GitHub. Git is the version-control tool; GitHub is one place to host Git repos.
source: GitHub — Hello World ↗
Commonly asked junior concept occasional Modern hosts — Vercel, Netlify, Cloudflare, Railway, Render, Fly. What's the rough split between them?
Roughly two camps. Static / frontend hosts (Vercel, Netlify, Cloudflare Pages) are tuned for serving built frontends and serverless functions at the edge — push a repo, they build and serve it. App / server hosts (Railway, Render, Fly) are tuned for long-running servers, databases, and containers.
The line blurs (most do some of both), but the orientation-level instinct is: a static site or a frontend-plus-functions app leans toward the first group; a long-running backend with its own database leans toward the second.
Follow-ups they push on
- Where would you host a static marketing site vs a websocket server?
- What does 'edge' mean here?
Red flag Treating all hosts as interchangeable — a pure static host won't run your always-on stateful backend well.
source: Vercel — Deployments overview ↗
Commonly asked junior concept common Walk the path from a domain name to your running app. What does HTTPS/SSL add?
Domain → DNS → host. You point the domain at the host using DNS records: an A record maps a name to an IP address; a CNAME maps a name to another name (e.g. your-app.vercel.app). The browser resolves the name via DNS, then connects to the host.
HTTPS/SSL adds encryption and identity: it encrypts traffic so it can't be read or tampered with in transit, and the certificate proves the server is who it claims to be. Without it, credentials and data travel in plaintext.
Follow-ups they push on
- When do you use an A record vs a CNAME?
- Why does the padlock matter beyond 'it's secure'?
Red flag Thinking DNS 'hosts' the site — DNS only maps the name to an address; the host serves the actual app.
source: Cloudflare — What is DNS? ↗
Commonly asked junior concept occasional What is a preview deploy, and where do you look first when a deploy breaks?
A preview deploy is a full, live build of a branch or pull request at its own URL, separate from production — so you (and reviewers) can click through the change before it ships. Production and preview typically have separate env vars, which is a common gotcha when something works in preview but breaks in prod.
When a deploy breaks, read the build logs first (did it compile?), then the runtime logs (is it crashing at request time?), and check that the right env vars exist for that environment.
Follow-ups they push on
- Why might something work in preview but fail in production?
- What's the difference between a build-time and a runtime error?
Red flag Forgetting prod and preview have different env vars — a missing prod secret is a classic 'works on preview' failure.
source: Vercel — Deployments overview ↗
Commonly asked junior concept occasional What does CI/CD mean at a high level?
CI (Continuous Integration) is automatically building and testing your code every time you push, so problems surface early. CD (Continuous Delivery/Deployment) is automatically taking the code that passed and deploying it.
The whole pipeline, orientation-level: push → build → test → deploy. The point is no manual steps and no 'works on my machine' — every change goes through the same gated, repeatable path.
Follow-ups they push on
- What's the difference between continuous delivery and continuous deployment?
- Why run tests before deploying?
Red flag Thinking CI/CD is one tool — it's a practice/pipeline; many tools implement it.
source: GitHub — Hello World ↗
Commonly asked junior concept common What does a bundler like Vite or esbuild actually do, and why do you need one?
A bundler takes your project — dozens or hundreds of source files plus dependencies — and produces a small set of optimized files the browser can load efficiently. Along the way it transpiles modern TS/JSX into plain JavaScript the browser understands, resolves and combines imports, minifies (strips whitespace and shortens names), tree-shakes unused code, and fingerprints filenames for caching.
You need one because browsers don't run TypeScript or JSX, and shipping hundreds of separate files would be slow. The bundler is the bridge between 'how you write code' (modular, modern, typed) and 'what loads fast in a browser' (few, small, plain-JS files).
What a strong answer covers
- Transpiles modern TS/JSX → plain browser-compatible JavaScript.
- Combines many modules and resolves imports into a few output files.
- Minifies and tree-shakes to cut size; fingerprints filenames for caching.
- Bridges 'nice to write' (modular/typed) and 'fast to load' (few small files).
Follow-ups they push on
- What is tree-shaking?
- Why does the browser need TS transpiled before it can run it?
Red flag Confusing the bundler (prepares code to ship) with the host (serves it) — they're different stages; the bundler runs during the build.
source: Vite — Why Vite (the problems it solves) ↗
Commonly asked junior concept common What's the difference between a build-time error and a runtime error when a deploy goes wrong?
A build-time error happens while the host is compiling/bundling your code — a type error, a syntax error, a missing import. The build fails, nothing gets deployed, and you read the build logs to find it. Production keeps serving the last good deploy.
A runtime error happens after a successful deploy, when the live code actually executes a request — a null reference, a crashed API call, a missing env var the code reads at request time. The build passed, the site is 'up', but pages error; you read the runtime/function logs to find it. First diagnostic question on any broken deploy: did it fail to build, or did it build and then fail to run?
What a strong answer covers
- Build-time: fails during compile/bundle (type/syntax/import errors) → check build logs; nothing deploys.
- Runtime: fails while serving requests on a deployed build → check runtime/function logs.
- A failed build leaves the previous good version live; a runtime error means broken-but-deployed.
- First question: did it fail to build, or build fine then fail to run?
Quick self-check
Your deploy succeeds and the site loads, but one page throws 'cannot read property of undefined' for some users. What kind of error is this?
Follow-ups they push on
- A missing env var — is that more likely build-time or runtime?
- Why might TypeScript catch a build-time error that JS would only hit at runtime?
Red flag Looking in the build logs for a problem that's actually a runtime crash (or vice versa) — knowing which phase failed points you at the right log.
source: Vercel — Logs (build vs runtime) ↗
Commonly asked junior concept common How do you wire the same secret (like an API key) into local dev, preview, and production without committing it?
Locally, you keep secrets in a .env file that is git-ignored (and you commit a .env.example with the keys but not the values, so teammates know what's needed). The code reads them via the environment (process.env.API_KEY), never hardcoded.
For preview and production, you set the same variable names in the host's environment-variable settings (the dashboard or CLI), with environment-specific values. Each environment can hold a different value — a test key for preview, the real key for prod. The code stays identical; only the supplied values differ. The secret never enters git in any environment.
What a strong answer covers
- Local: git-ignored .env; commit .env.example (names only, no values).
- Code reads from the environment (process.env.X), never hardcodes the value.
- Preview/prod: set the same names in the host's env-var settings, per-environment values.
- Same code everywhere; only the injected values differ; nothing secret hits git.
Follow-ups they push on
- Why commit a `.env.example` but never the real `.env`?
- Why might preview use a different API key than production?
Red flag Setting env vars only locally and forgetting the host — the build/runtime in prod has no value, so it works locally and breaks when deployed.
source: Vite — Env Variables and Modes ↗
Commonly asked mid concept occasional What does a CDN do, and why is serving your app from 'the edge' faster?
A CDN (Content Delivery Network) is a global network of servers that cache your static assets close to users. Instead of every visitor fetching files from one origin server (which might be a continent away), they're served from a nearby edge location, cutting the round-trip distance and latency.
'The edge' just means 'physically close to the user'. For a static frontend, the whole site can be cached at the edge so it loads fast worldwide. The tradeoff to remember: cached content is fast but can be stale until it's invalidated, and truly dynamic/personalized responses can't simply be cached for everyone.
What a strong answer covers
- A CDN caches assets on servers worldwide, serving each user from a nearby location.
- 'Edge' = close to the user; less distance means lower latency.
- Great for static assets and frontends; the whole site can live at the edge.
- Tradeoff: cached content can be stale; per-user dynamic responses don't cache trivially.
Follow-ups they push on
- What kind of content is easy to cache at the edge vs hard?
- What does 'cache invalidation' mean and why is it tricky?
Red flag Assuming everything benefits from edge caching — highly dynamic or per-user responses can't be shared from a cache, and stale caches serve old content until invalidated.
source: Cloudflare — What is a CDN? ↗

7.6 The AI coding toolbox 9

Commonly asked junior concept common Name the categories of AI coding tools and when you'd reach for each.
Roughly: autocomplete (in-editor suggestions as you type — fast, line-to-block scope), chat assistant (ask questions, get explanations and snippets in a side panel), terminal/CLI agent (runs in your shell, reads/edits files and runs commands across a repo), AI IDE (an editor built around AI with the codebase in context), and app-builder (describe an app, get a scaffolded project).
Reach for autocomplete for flow while writing known code; chat for understanding or a focused snippet; a CLI agent or AI IDE for multi-file changes across a real repo; an app-builder for a quick from-scratch prototype.
Follow-ups they push on
- When would autocomplete be the wrong tool?
- What does a CLI agent do that a chat assistant can't?
Red flag Assuming one tool fits every task — a from-scratch prototype and a surgical multi-file refactor want different tools.
source: Anthropic — Claude Code overview ↗
Commonly asked junior concept common Frontier models come in tiers. Describe them without naming specific models.
Most providers offer roughly three tiers: a fast/cheap tier (cheapest and quickest, for high-volume or simple tasks like classification and autocomplete), a balanced tier (the everyday workhorse — good quality at reasonable cost/speed), and a most-capable tier (the strongest reasoning for hard, high-stakes problems, at higher cost and latency).
Deliberately avoid pinning specific names or 'the latest model' — those change constantly. The durable skill is reasoning about the tier, then checking the provider's current model page for which name maps to it today.
Follow-ups they push on
- Why frame this in tiers instead of memorizing model names?
- Where would you check which model is current?
Red flag Naming a specific model as 'the best/latest' — it dates instantly. Talk in tiers and verify the current mapping at use time.
source: Anthropic — Models overview ↗
Commonly asked junior concept occasional Why do AI-tool and model facts come with an 'as of <date>' caveat, and how do you handle that?
The AI tooling and model landscape moves fast: names, prices, tiers, context-window sizes, and capabilities change month to month. Any specific fact you memorize ('model X is the best', 'it costs $Y') has a short shelf life, and a model's training data has a cutoff so it doesn't even know about newer models.
So you reason in durable concepts (tiers, the cost/capability tradeoff) and verify specifics against the provider's current docs at the moment you need them, rather than trusting a printed name or a number from memory.
Follow-ups they push on
- Where do you check the current model lineup?
- Why can't you just trust the model to know the latest model names?
Red flag Treating a model/price fact as permanent — quote tiers and concepts, and re-verify any specific at authoring time.
source: Anthropic — Models overview ↗
Commonly asked junior concept common What can a terminal/CLI coding agent do that an in-editor chat assistant can't?
A CLI/terminal agent runs in your shell with access to your whole project: it can read and edit files across the repo, run commands (tests, builds, git), see the output, and iterate — a full plan-edit-test loop on its own. A chat assistant in the editor mainly sees the snippet or file you've shared and hands back text/snippets you copy in yourself.
The difference is agency over the environment: the CLI agent acts on the real repo (multi-file refactors, running the tests it just changed), while chat is closer to a knowledgeable pair you query for explanations and focused code. That power is also why CLI agents need the review/permission discipline that chat doesn't.
What a strong answer covers
- CLI agent: reads/edits many files, runs commands, sees output, loops autonomously.
- Chat assistant: mostly sees what you paste; returns text you apply yourself.
- Difference is agency over the real environment, not just smarter answers.
- More power → more need for review and permission gating on the CLI agent.
Quick self-check
You want a tool to refactor a function across 12 files, run the test suite, and fix what breaks — without you copy-pasting. Which fits best?
Follow-ups they push on
- Why does a CLI agent need stronger review discipline than chat?
- For a one-off 'explain this regex', which tool fits better?
Red flag Expecting a chat assistant to actually apply a multi-file change across your repo — it returns snippets; running the change in the environment is the agent's job.
source: Anthropic — Claude Code overview ↗
Commonly asked junior concept occasional When is an app-builder ('describe an app, get a project') the right tool, and when is it the wrong one?
An app-builder shines for getting from zero to something visible fast: prototypes, demos, throwaway internal tools, validating an idea, or learning by seeing a working scaffold. You describe what you want and get a runnable project without setup friction.
It's the wrong tool when you need to fit an existing, large codebase, follow specific conventions, or make surgical changes to production code — there a CLI agent or AI IDE working in the real repo is far better. Rule of thumb: app-builders are great at the blank-page start; once there's a real codebase and real constraints, you graduate to tools that operate inside it.
What a strong answer covers
- Best for: prototypes, demos, throwaway tools, idea validation, fast blank-page starts.
- Worst for: surgical edits inside a large existing codebase with conventions.
- Once a real repo and constraints exist, switch to a CLI agent or AI IDE.
- Strength is zero-to-running speed, not maintaining production code.
Follow-ups they push on
- Why is an app-builder awkward for changing an existing production app?
- What do you lose if you keep prototyping in an app-builder past the demo stage?
Red flag Using an app-builder to evolve a serious, growing codebase — it's tuned for fresh scaffolds, not careful changes within established structure and conventions.
source: Anthropic — Claude Code overview ↗
Commonly asked junior concept occasional What does it mean that a model has a 'training cutoff', and how should that change what you trust it on?
A model's knowledge is frozen at its training data cutoff — it learned from data up to roughly that date and has no inherent awareness of anything after it. So it can be confidently wrong about recent library versions, new APIs, current prices, or even newer models (including itself).
Practically: trust it for durable concepts and patterns (how REST works, what a closure is), but verify anything time-sensitive — latest package version, current API signature, today's model lineup — against live docs or by giving it the current information in context. Tools that can fetch docs or read your actual package.json close this gap; raw model memory does not.
What a strong answer covers
- Knowledge is frozen at the training cutoff; nothing newer is inherently known.
- It can be confidently wrong on recent versions, APIs, prices, and newer models.
- Trust it for durable concepts; verify time-sensitive specifics against live sources.
- Giving it current docs/context or a fetch tool beats relying on its memory.
Follow-ups they push on
- Why might an agent suggest a deprecated API or an old package version?
- How does giving the model your current docs in context fix this?
Red flag Trusting the model's recall of 'the latest' version, API, or model name — that's exactly what its cutoff makes unreliable; check current docs.
source: Anthropic — Models overview ↗
Commonly asked mid concept common Is the most capable model always the right choice? Explain the tradeoff.
No — there's a cost / capability / latency tradeoff. The most-capable tier costs more per token and is slower; for simple, high-volume tasks (tagging, extraction, routing, autocomplete) a fast/cheap model is both cheaper and snappier, and just as correct.
Match the model to the task: escalate to a stronger tier only when the task's reasoning genuinely needs it. A common production pattern is to route — cheap model for the easy 90%, strong model for the hard 10%.
Follow-ups they push on
- Give a task where the cheapest tier is the right call.
- What is model 'routing' or a cascade?
Red flag Defaulting to the biggest model for everything — it burns money and latency on tasks a small model nails.
source: Anthropic — Models overview ↗
Commonly asked mid concept common How do you pick which model tier to point a coding agent at for a given task?
Match the tier to the task's reasoning demand. For hard, multi-step, high-stakes work — architecture, gnarly debugging, large refactors where a wrong move is costly — use the most-capable tier; the extra cost and latency buy correctness. For routine, well-specified work — boilerplate, simple edits, repetitive transforms, classification-like steps — a fast/cheap tier is snappier and just as correct.
A common pattern is to default to a balanced tier for everyday coding and escalate to the top tier only when a task stalls or genuinely needs deeper reasoning. The skill is reasoning about the demand, not memorizing which model name is 'best' this month — and checking the provider's current model page for which name maps to each tier today.
What a strong answer covers
- Hard/multi-step/high-stakes → most-capable tier; correctness outweighs cost.
- Routine, well-specified work → fast/cheap tier; same result, less cost and latency.
- Common default: balanced tier everyday, escalate to top tier when stuck.
- Reason about reasoning-demand; verify current tier→name mapping in provider docs.
Follow-ups they push on
- Give a coding task where the cheapest tier is the right call.
- What signals tell you to escalate from a balanced to the top tier?
Red flag Pointing the biggest, slowest model at every task by default — you burn cost and latency on edits a cheaper tier handles perfectly.
source: Anthropic — Choosing a model ↗
Commonly asked mid concept occasional AI coding tools feel magical at first but stall on real codebases. What's the realistic mental model for what they're good and bad at?
Think of an AI coding tool as a fast, broadly knowledgeable, eager junior who has never seen your codebase, can't run things in their head reliably, and won't push back unless you make them. They're excellent at well-scoped, well-specified tasks with clear examples and a way to verify; they're weak at ambiguous goals, implicit context they were never given, and anything where being confidently wrong is cheap for them but expensive for you.
The realistic model: their output quality tracks the quality of your context and spec, not the tool's branding. Give relevant files, an example to match, and an acceptance check, and review the result — and they're a force multiplier. Hand them a vague wish and full autonomy, and they generate plausible code that misses the point.
What a strong answer covers
- Strong on scoped, specified tasks with examples and a verification path.
- Weak on ambiguity, unstated context, and self-checking their own correctness.
- Output quality tracks your context/spec quality more than the tool's brand.
- Force multiplier with good prompts + review; liability with vague goals + blind trust.
Follow-ups they push on
- Why does giving an example from the repo improve results so much?
- What's the single highest-leverage thing you can add to a weak prompt?
Red flag Blaming the tool when results are poor — usually the missing piece is context, a concrete example, or an acceptance check the human didn't provide.
source: Anthropic — Claude Code best practices ↗

7.7 Working with AI agents 10

★ must-know Commonly asked junior concept common Why treat a prompt to a coding agent like a spec rather than a casual request?
An agent does exactly what you describe, not what you meant — it has none of the shared context a teammate would fill in. A casual request ('add login') leaves a hundred decisions to chance: which auth method, where state lives, what the error states are, what 'done' means. The agent picks plausible answers, and you discover the gaps afterward.
Treating the prompt as a spec front-loads those decisions: state the goal, the constraints, an example to match, and the acceptance check. This is the same discipline as writing a ticket for a junior dev. The clearer the spec, the less re-work — vague prompts don't save time, they move the cost to debugging plausible-but-wrong output.
What a strong answer covers
- The agent does what you say, not what you meant — it lacks your unstated context.
- A casual ask leaves many decisions to chance; the agent guesses, you find gaps later.
- A spec front-loads: goal, constraints, example, acceptance check — like a good ticket.
- Vagueness doesn't save time; it relocates the cost to debugging wrong output.
Quick self-check
Which prompt is most likely to produce code you can ship with minimal rework?
Follow-ups they push on
- Which part of a spec do people most often omit?
- How is prompting an agent like writing a ticket for a junior engineer?
Red flag Firing off a one-line wish and expecting the agent to infer your conventions, edge cases, and definition of done — it can't; it fills gaps with guesses.
source: Anthropic — Prompt engineering overview ↗
Commonly asked junior concept common What is the context window, and why does it shape how you work with a coding agent?
The context window is everything the model can 'see' at once — the system prompt, your instructions, the files and snippets you've shared, and the conversation so far, all measured in tokens with a hard limit.
It shapes your workflow because the agent can't reason about code it hasn't been shown, and stuffing in irrelevant files wastes the budget and dilutes attention. So you deliberately feed it the right files, an example, and the acceptance criteria — and start fresh when a long thread gets noisy.
Follow-ups they push on
- Why can dumping the whole repo into context hurt rather than help?
- What do you do when a session gets long and the model starts losing the thread?
Red flag Assuming the agent 'remembers' your codebase — it only knows what's in the current context window; share the relevant files.
source: Anthropic — Claude Code best practices ↗
Commonly asked junior concept common Describe a healthy loop for working with a coding agent on a real change.
Plan → edit → test → review the diff → commit. First have it lay out a plan and agree on it before any code (plan mode helps). Then let it edit, run the tests, and crucially read the diff yourself before accepting — verify, don't trust. Commit in small, reviewable chunks.
The discipline is treating the agent like a fast junior pair: you still own the review and the commit. Small loops with verification beat one giant unreviewed change.
Follow-ups they push on
- Why plan before editing?
- What's the risk of committing the agent's output without reading the diff?
Red flag Accepting a large change wholesale without reading the diff — bugs and unintended edits slip through unverified.
source: Anthropic — Claude Code best practices ↗
Commonly asked junior concept common What makes a strong task prompt for an agent?
Four parts: the goal (what done looks like), the constraints (don't touch X, use library Y, match this style), an example (an existing pattern to follow or sample input/output), and an acceptance check (the test or command that proves it works).
This turns a vague wish into a spec the agent can hit and you can verify. The acceptance check is the part people skip — without it neither you nor the agent knows when it's actually done.
Follow-ups they push on
- Why include an example of existing code in the repo?
- What's the value of stating an acceptance check up front?
Red flag Giving only the goal ('add search') with no constraints, example, or check — you get plausible code that may miss the point.
source: Anthropic — Prompt engineering overview ↗
Commonly asked junior concept occasional What are custom instructions like CLAUDE.md / AGENTS.md for?
They're a persistent, project-level brief the agent reads automatically — conventions, commands, architecture notes, do's and don'ts — so you don't re-explain them every session. They put durable context into the window without you pasting it each time.
They're one of several customization surfaces, alongside slash commands (reusable prompts), MCP/tools (giving the agent new capabilities), subagents, and plan mode. The instruction file is the cheapest, highest-leverage one to start with.
Follow-ups they push on
- What belongs in a project instruction file vs a one-off prompt?
- What is plan mode, and when do you use it?
Red flag Letting the file rot — stale instructions actively mislead the agent; treat it as living documentation.
source: Anthropic — Claude Code best practices ↗
Commonly asked junior concept common Why is 'review the diff before you accept it' the non-negotiable habit when working with an agent?
An agent is fast and confident but not accountable — you are. It can make changes beyond what you asked (touching unrelated files, deleting code it deemed unnecessary, introducing a subtle bug) and it states all of it with equal confidence. The diff is your checkpoint: it shows exactly what changed before it becomes part of your code.
Reading the diff is also where *you* stay in control of the codebase — you keep understanding what's in it, catch scope creep, and verify the change actually does what the spec asked. 'Verify, don't trust' is the whole posture. Skipping the diff is how unreviewed bugs and unintended edits slip into a repo nobody fully understands anymore.
What a strong answer covers
- The agent is fast and confident but not accountable — the human is.
- It can make changes beyond the ask; the diff exposes exactly what changed.
- Reviewing keeps you in command of the codebase and catches scope creep.
- 'Verify, don't trust' — the diff is the checkpoint before code is yours.
Follow-ups they push on
- What's the danger of accepting a large change wholesale, unread?
- How do small, frequent commits make diff review easier?
Red flag Accepting big changes blind because they 'look right' — confident, plausible code can carry unrelated edits and subtle bugs that only a diff review surfaces.
source: Anthropic — Claude Code best practices ↗
Commonly asked mid debug occasional An agent keeps failing to fix a bug, trying variation after variation. How do you break the loop?
Thrashing usually means the agent is missing something it needs, not that it needs more attempts. Stop and give it more or better context: the exact error message and stack trace, the relevant file it hasn't seen, how to reproduce, and what you've already ruled out. Often it's been guessing because the failing piece was never in its window.
If that doesn't help, change the approach: ask it to first explain its diagnosis and a plan before editing (so you can catch a wrong mental model), narrow the task, or reset the session to clear accumulated wrong turns. And know when to take over — for a tricky bug a human read of the actual error often beats a tenth blind attempt. More tries on the same starting context rarely converges; better context or a reset does.
What a strong answer covers
- Thrashing = missing context, not too few attempts.
- Feed it the exact error, stack trace, repro steps, and what's been ruled out.
- Make it state a diagnosis/plan before editing to expose a wrong mental model.
- Reset the session to clear bad turns; know when to take over yourself.
Follow-ups they push on
- Why does pasting the exact stack trace help more than 'it's still broken'?
- When is it faster to just debug it yourself?
Red flag Repeatedly saying 'still broken, try again' on the same context — without new information the agent just cycles plausible guesses; add context or reset.
source: Anthropic — Claude Code best practices ↗
Commonly asked mid concept occasional What safety rails do you keep in mind when letting an agent run in your repo?
Core rails: never commit secrets (and don't let the agent paste keys into code or logs), review before running anything it generates — especially shell commands and migrations, watch cost (long autonomous runs burn tokens), and slow down on risky changes (deletes, schema migrations, anything touching prod or auth).
The mindset: the agent is fast and confident but not accountable — you are. Treat its output as a proposal to verify, not a command to execute blindly.
Follow-ups they push on
- What kinds of changes warrant extra scrutiny?
- Why is reviewing a generated shell command especially important?
Red flag Granting blanket auto-run on everything — a confidently wrong destructive command (a bad `rm` or migration) can do real damage.
source: Anthropic — Claude Code best practices ↗
Commonly asked mid debug common A long agent session starts making mistakes, contradicting earlier decisions, and 'forgetting' things. What's happening and what do you do?
Long sessions degrade because the context window fills with accumulated history — old turns, dead ends, large file dumps — which both crowds out room for new work and dilutes the model's attention across noise. The earlier 'decisions' may have scrolled out of effective focus, so it drifts.
The fix is to manage context deliberately: start a fresh session for a new sub-task, re-state the current goal and the few decisions that still matter, and re-share only the relevant files rather than the whole accumulated thread. Capture durable decisions in a project instruction file (CLAUDE.md) or a short summary you can paste back, so resetting the session doesn't lose them. Short, focused contexts beat one ever-growing thread.
What a strong answer covers
- Cause: the context window fills with history/noise, crowding and diluting attention.
- Fix: start fresh, re-state the goal and the decisions that still matter.
- Re-share only relevant files, not the entire accumulated conversation.
- Persist durable decisions (instruction file / summary) so a reset loses nothing.
Follow-ups they push on
- Why does dumping the whole repo into one long thread make this worse?
- What's worth capturing in a project instruction file before you reset?
Red flag Pushing through in the same bloated thread, repeating yourself — the noise is the problem; a clean context with a crisp restatement works far better.
source: Anthropic — Claude Code best practices ↗
Commonly asked mid concept occasional What kinds of tasks should you NOT hand to an agent autonomously, and why?
Avoid full autonomy where a confident mistake is expensive or irreversible: destructive operations (deletes, rm, dropping data), database schema migrations, anything touching production, security-sensitive code (auth, permissions, payment), and broad sweeping changes you can't easily review. These share a trait — the cost of being wrong is high and recovery is hard.
The principle is *cost of error*. Where errors are cheap and caught by tests (a new pure function, a localized UI tweak), let the agent run. Where errors are catastrophic or hard to undo, keep a human in the loop: require confirmation, work on a branch, review the diff and the exact commands before they execute. Match autonomy to reversibility.
What a strong answer covers
- Hold back autonomy on destructive ops, migrations, prod changes, and auth/payment code.
- Common thread: high cost of error and hard to undo.
- Where errors are cheap and tests catch them, more autonomy is fine.
- Match the autonomy you grant to how reversible the change is.
Follow-ups they push on
- Why are schema migrations especially risky to automate?
- How does working on a branch lower the cost of an agent's mistake?
Red flag Granting blanket auto-approval so a confidently wrong destructive command (a bad migration or `rm`) executes before any human sees it.
source: Anthropic — Claude Code best practices ↗

7.8 Building AI features into your app 11

★ must-know Commonly asked junior concept very common What is the context window when calling an LLM API, and why does it cap what you can send?
The context window is the maximum number of tokens a single request can hold — the system prompt, the full conversation history, any documents you stuff in, *and* the space reserved for the model's reply, all together. It's a hard ceiling measured in tokens, and it varies by model.
It caps what you send because everything competes for the same budget: a long chat history or a giant pasted document leaves less room for the answer, and exceeding the window errors or forces truncation. So building real features means being deliberate — send the relevant context (often via retrieval), summarize or trim old turns, and remember input *and* output both count against the limit and the bill.
What a strong answer covers
- Context window = max tokens per request: system + history + inputs + the reply, combined.
- It's a hard, per-model ceiling measured in tokens.
- Input and output share the budget — long input crowds out the answer.
- Real features manage it: retrieve relevant context, trim/summarize history.
Quick self-check
Your chatbot works fine early in a conversation but starts erroring after many turns. The most likely cause?
Follow-ups they push on
- If a conversation grows past the window, what are your options?
- Why does a huge pasted document eat into the space for the response?
Red flag Assuming the model 'remembers' past calls — each API call is stateless; you resend whatever history you want it to see, and it all counts against the window.
source: Anthropic — Context windows ↗
Commonly asked junior concept very common Describe a basic LLM API call. What's the difference between the system and user message, and what's a token?
You send a list of messages and get back a generated message. The system message sets the role, rules, tone, and constraints ('you are a support bot; never reveal internal IDs'). The user message carries the actual request. The model responds with text (and optionally structured data).
A token is the unit the model reads and writes — roughly a word-piece (a few characters). It matters because cost, latency, and the context-window limit are all measured in tokens, for both input and output.
Follow-ups they push on
- Why are both input and output billed in tokens?
- Roughly how many characters is a token?
Red flag Putting changeable user input into the system prompt — instructions and untrusted input should be kept in their proper roles.
source: Anthropic — Build with Claude (overview) ↗
Commonly asked junior concept very common Why call an LLM from your server instead of directly from the browser?
Same reason any sensitive call belongs server-side: the API key. Calling the LLM from the browser means shipping your provider key to every visitor, where it's trivially stolen and used to run up your bill. The key must live on your server.
Beyond the key, the server lets you control the integration: enforce rate limits and per-user quotas (so one user can't drain your budget), validate and sanitize input, inject the system prompt the user shouldn't control, log usage and cost, and cache. The pattern is a thin backend endpoint your frontend calls; that endpoint holds the key and calls the LLM. The browser never sees the provider directly.
What a strong answer covers
- The API key can't ship to the browser — it'd be stolen and abused.
- Server-side lets you rate-limit and set per-user quotas to cap spend.
- Server controls the system prompt and sanitizes user input before sending.
- Pattern: frontend → your backend endpoint (holds the key) → LLM provider.
Quick self-check
What's the main reason to route LLM calls through your own backend rather than calling the provider from the browser?
Follow-ups they push on
- What stops one user from draining your whole token budget?
- Why shouldn't the user be able to set the system prompt directly?
Red flag Calling the LLM provider straight from frontend JavaScript with the key embedded — it leaks to every user and there's no way to rate-limit or control cost.
source: Anthropic — API getting started (authentication) ↗
Commonly asked mid concept common What knobs (temperature, max tokens, system prompt) shape an LLM's output, and what do they each do?
The system prompt sets the model's role, rules, and output format — the single biggest lever on behavior. Temperature controls randomness: low (near 0) makes output focused and repeatable (good for extraction, classification, structured data); higher makes it more varied and creative (brainstorming, copy). Max tokens caps the *length* of the response — set it high enough that answers aren't cut off mid-sentence, but it's a ceiling, not a target.
The builder's instinct: reach for the system prompt first (it shapes the most), set temperature low when you need deterministic, parseable output and higher when you want range, and size max tokens to the expected answer. Note that exact parameters vary by provider and model — check the current API reference for which knobs a given model exposes.
What a strong answer covers
- System prompt: role, rules, format — the strongest behavior lever.
- Temperature: low = focused/repeatable; higher = varied/creative.
- Max tokens: caps response length (a ceiling, not a target) — avoid mid-sentence cutoffs.
- Available knobs differ by provider/model — verify against the current API docs.
Follow-ups they push on
- For a JSON-extraction task, do you want high or low temperature, and why?
- What happens if max tokens is set too low for the answer?
Red flag Using a high temperature for tasks that need consistent, parseable output (extraction, classification) — you get unstable results that are hard to depend on.
source: Anthropic — Messages API parameters ↗
Commonly asked mid concept common How do you keep an LLM feature's cost and quality under control once it's live?
Cost scales with tokens (input + output) × calls × model tier, so the levers are: pick the cheapest tier that passes for each task, trim the context you send (don't dump whole documents), cap max_tokens, cache or reuse stable prefixes, and rate-limit per user. Log token usage per request so you can see where the spend actually goes instead of guessing.
Quality can't be eyeballed forever — build an eval set of representative inputs with expected outputs and run it whenever you change the prompt or model, so you catch regressions. In production, log inputs/outputs (within privacy limits), watch for failures and refusals, and add guardrails (validate structured output, fall back gracefully). The theme: measure both dimensions with real numbers rather than vibes.
What a strong answer covers
- Cost = tokens × calls × tier; lower tier, trim context, cap output, cache, rate-limit.
- Log per-request token usage to see where spend actually goes.
- Quality: maintain an eval set; re-run it on every prompt/model change to catch regressions.
- In prod: log I/O within privacy limits, watch failures/refusals, validate output.
Follow-ups they push on
- What goes into a good eval set for an LLM feature?
- Which is usually the bigger cost lever — model tier or context size?
Red flag Shipping and judging quality by vibes while costs creep — without an eval set and usage logging, regressions and budget blowouts go unnoticed until they're expensive.
source: Anthropic — Reducing latency and cost ↗
Commonly asked mid concept common What is structured output / tool use, and why is it better than parsing prose?
Instead of free-form text, you have the model return data in a defined shape — JSON matching a schema (structured output) or a call to a function you defined with named arguments (tool use / function calling). Your code then consumes the JSON or executes the action.
It's better than regex-ing prose because it's reliable and parseable: the model commits to fields you specified, so you can validate it and wire it straight into your app — building chatbots and agents that fetch data or take actions, not just chat.
Follow-ups they push on
- How does function calling let a model use external tools?
- What do you do if the returned JSON is still malformed?
Red flag Asking for prose and scraping fields out with string parsing — brittle. Request a schema/tool and validate the result.
source: Anthropic — Tool use (function calling) ↗
Commonly asked mid concept very common What is RAG, and when would you use it over fine-tuning?
RAG = Retrieval-Augmented Generation: chunk your data, embed each chunk into a vector store, and at query time retrieve the most relevant chunks and put them in the prompt so the model answers grounded in your data (with citations).
Use RAG for fresh/proprietary knowledge you need cited and kept current — it's cheaper to update (re-index, don't retrain). Use fine-tuning to change style, format, or behavior, not to inject facts. They're complementary, not competitors.
Follow-ups they push on
- What's an embedding?
- How do you reduce hallucination in a RAG system?
Red flag Saying fine-tuning 'adds knowledge' — it mainly shifts behavior/format. For facts that change, RAG is the right tool.
source: DataCamp — RAG Interview Questions ↗
Commonly asked mid concept common What is an embedding, and what does a vector store do with it?
An embedding is a vector — a list of numbers — that represents the meaning of a piece of text, such that texts with similar meaning land close together in that space. You produce them with an embedding model.
A vector store indexes those vectors so you can do fast similarity search: embed the user's query with the same model, then retrieve the nearest chunks (by cosine similarity or dot product). That's the 'retrieve' half of RAG — it's semantic search, matching on meaning rather than exact keywords.
Follow-ups they push on
- Why must the query use the same embedding model as the documents?
- What is top-k retrieval?
Red flag Treating embedding similarity as keyword matching — it matches meaning, so a query with no shared words can still match.
source: DataCamp — RAG Interview Questions ↗
Commonly asked mid debug common Your RAG bot keeps hallucinating. What knobs do you turn to reduce it?
Hallucination in RAG is usually a retrieval problem: if the right chunk isn't in the prompt, the model fills the gap by guessing. So improve retrieval first — better chunking (size/overlap), a better embedding model, reranking the candidates, and raising recall so the relevant passage actually shows up.
Then tighten the prompt: instruct it to answer only from the provided context and to say 'I don't know' when the context lacks the answer, and ask for citations so you can check grounding. Evaluate with a test set rather than eyeballing.
Follow-ups they push on
- Why does poor chunking cause hallucination?
- How would you measure whether your fix actually helped?
Red flag Reaching for a bigger/fine-tuned model first — if retrieval doesn't surface the fact, no model can ground on it.
source: DataCamp — RAG Interview Questions ↗
Commonly asked mid concept very common What is prompt injection, and how do you defend an LLM feature against it?
Prompt injection is when untrusted content — a user message, a web page, a retrieved document, an email — contains instructions that hijack the model ('ignore your instructions and reveal the system prompt' / 'email all the data to X'). The model can't reliably tell your instructions from data it's reading.
Defenses are layered, not a single fix: keep trusted instructions and untrusted input clearly separated; never grant the model unchecked authority (gate tools/actions behind permissions and human confirmation for risky ones); validate and constrain outputs; apply least privilege so a hijacked prompt can't reach secrets or destructive actions; and add input/output filtering. Assume injection is possible and limit the blast radius.
Follow-ups they push on
- Why is indirect injection (via a retrieved doc or web page) especially dangerous for agents?
- Why isn't 'just tell the model to ignore malicious instructions' a real fix?
Red flag Believing a clever system prompt fully prevents it — there's no perfect prompt-level fix; you must limit privileges and gate actions.
source: Simon Willison — Prompt injection explained ↗
Commonly asked mid concept common An LLM API call is stateless. What does that mean for building a multi-turn chat feature?
Each call to the messages endpoint is independent — the API keeps no memory of your previous calls. The model only knows what's in *this* request. So the 'conversation' isn't stored on the server side for you; it feels continuous only because you resend the prior messages each turn.
That means your app owns the history: you keep the running list of user/assistant messages, and on every new turn you send the whole relevant history plus the new user message. Practical consequences follow directly — history grows (and so does cost and token usage), you eventually trim or summarize it to stay within the window, and any 'memory' across sessions is something you build (a database), not something the API provides.
What a strong answer covers
- Every API call is independent; the server stores no conversation for you.
- Continuity is an illusion you create by resending prior messages each turn.
- Your app owns the message history and sends it with every request.
- History growth drives cost/tokens — trim or summarize; persistent memory is yours to build.
Follow-ups they push on
- Where does the conversation history actually live in your app?
- Why does each additional turn cost a little more than the last?
Red flag Expecting the API to 'remember' the chat between calls — it doesn't; if you don't resend the history, the model has no idea what was said before.
source: Anthropic — Messages API basics ↗